Claude Token Counter, now with model comparisons
Posted by twapi 1 day ago
Comments
Comment by vfalbor 1 day ago
Comment by akdor1154 1 day ago
Comment by vfalbor 23 hours ago
Comment by svnt 17 hours ago
Comment by vfalbor 17 hours ago
Comment by touristtam 19 hours ago
Comment by vfalbor 17 hours ago
Comment by jmalicki 17 hours ago
We found a lot of gain by having ranking features based on Pinyin to detect typos/misspellings due to homophones (and similar sounding words). I was investigating stroke decomposition to try to be able to detect near homographs, but wasn't able to find any good libraries at the time.
I could imagine the homophone issue is especially relevant for spoken input to LLMs. LLMs are good enough that they're usually right, so it's probably less of an issue; in English I can make crazy typos and everything just works. I'm curious how well that works for Chinese, since I suspect it's a far harder problem due to the lack of subword tokens.
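The ranking feature described above could be sketched roughly like this (a toy, not the actual system: the tiny pinyin table is hand-written for the example, and a real implementation would use a full dictionary such as the pypinyin package):

```python
# Toy sketch of homophone-aware typo detection: two characters that differ
# but share a pinyin transcription are likely a homophone substitution.
# The PINYIN table below is a hand-made illustration, not real coverage.
PINYIN = {
    "他": "ta1", "她": "ta1", "它": "ta1",   # all pronounced "tā"
    "在": "zai4", "再": "zai4",             # both pronounced "zài"
    "是": "shi4", "事": "shi4",
}

def homophone_mismatches(query: str, candidate: str) -> int:
    """Count positions where the characters differ but sound the same."""
    count = 0
    for q, c in zip(query, candidate):
        if q != c and q in PINYIN and PINYIN.get(q) == PINYIN.get(c):
            count += 1
    return count

print(homophone_mismatches("他在", "她再"))  # -> 2 (both chars are homophone swaps)
```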
Comment by DANmode 13 hours ago
Comment by kouteiheika 1 day ago
Interesting. Unfortunately Anthropic doesn't actually share their tokenizer, but my educated guess is that they might have made the tokenizer more semantically aware to make the model perform better. What do I mean by that? Let me give you an example. (This isn't necessarily what they did exactly; just illustrating the idea.)
Let's take the gpt-oss-120b tokenizer as an example. Here's how a few pieces of text tokenize (I use "|" here to separate tokens):
Kill -> [70074]
Killed -> [192794]
kill -> [25752]
k|illed -> [74, 7905]
<space>kill -> [15874]
<space>killed -> [17372]
You have 3 different tokens which encode the same word (Kill, kill, <space>kill) depending on its capitalization and whether there's a space before it, you have separate tokens for the past tense, etc. This is not necessarily an ideal way of encoding text, because the model must learn by brute force that these tokens are, indeed, related. Now, imagine if you encoded these like this:
<capitalize>|kill
<capitalize>|kill|ed
kill|
kill|ed
<space>|kill
<space>|kill|ed
Notice that this makes much more sense now - the model only has to learn what "<capitalize>" is, what "kill" is, what "<space>" is, and what "ed" (the past tense suffix) is, and it can compose those together. The downside is that it increases the token usage. So I wouldn't be surprised if this is what they did. Or, my guess number two: they removed the tokenizer altogether, replaced it with a small trained model (something like the Byte Latent Transformer), and simply "emulate" the token counts.
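The normalization idea above can be sketched in a few lines (purely illustrative, not Anthropic's actual scheme; suffix splits like "kill|ed" are left to a learned tokenizer running afterwards):

```python
import re

# Sketch: lowercase capitalized words behind a <capitalize> marker and make
# the leading space an explicit <space> token, so "kill", "Kill" and
# " kill" all share one underlying "kill" unit.
def pre_tokenize(text: str) -> list[str]:
    out = []
    for match in re.finditer(r"( ?)(\w+)|\S", text):
        space, word = match.group(1), match.group(2)
        if word is None:          # punctuation and other symbols pass through
            out.append(match.group(0))
            continue
        if space:
            out.append("<space>")
        if word[0].isupper():
            out.append("<capitalize>")
            word = word.lower()
        out.append(word)
    return out

print(pre_tokenize("Killed kill"))
# -> ['<capitalize>', 'killed', '<space>', 'kill']
```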
Comment by ipieter 1 day ago
Comment by sigmoid10 1 day ago
Comment by wongarsu 1 day ago
That's what all these attempts boil down to. They don't presume to be able to find a more space-efficient encoding by hand; they assume that the optimization goal for the tokenizer was wrong and that they can do better by adding some extra rules. And this isn't entirely without precedent: most tokenizers have a couple of "forced" tokens that were not organically discovered. Changing how digits are grouped in the tokenizer is another place where wins have been shown.
This is where projects like nanochat are really valuable for quickly and (relatively) cheaply trying out various tweaks
Comment by sigmoid10 1 day ago
Except that is exactly what research has shown. Besides, the tokenizer's training goal is literally just to encode text efficiently with fewer tokens by increasing the vocabulary, which directly benefits the attention mechanism if you look at the dimensions of the matrices involved. The biggest issues so far have stemmed from variance between the tokenizer's and the LLM's training sets [1], and from the fact that when writing, people primarily work with character-based text rather than word-part-based text (even though that gets muddy when you look at what is actually happening in the brain).
[1] https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldm...
Comment by amelius 23 hours ago
Comment by sigmoid10 19 hours ago
Comment by amelius 18 hours ago
Comment by rrr_oh_man 1 day ago
Comment by dannyw 1 day ago
Also, think about how a LLM would handle different languages.
Comment by fooker 1 day ago
See embedding models.
> they removed the tokenizer altogether
This is an active research topic, no real solution in sight yet.
Comment by friendzis 1 day ago
I just ran an experiment: I took a word and asked models (ChatGPT, Gemini, and Claude) to explode it into parts. The caveat is that it could be either root + suffix + ending or root + ending. None of them recognized this duality; each picked one possible interpretation.
Any such approach to tokenizing assumes context free (-ish) grammar, which is just not the case with natural languages. "I saw her duck" (and other famous examples) is not uniquely tokenizable without a broader context, so either the tokenizer has to be a model itself or the model has to collapse the meaning space.
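The duality in the experiment above is easy to demonstrate mechanically. A toy segmenter over an invented morpheme lexicon (the sets below are made up for the example) finds both readings of the same surface word, and nothing in the string itself picks the winner:

```python
# Toy morpheme sets, invented for illustration only.
ROOTS = {"walk", "walker"}
SUFFIXES = {"er"}
ENDINGS = {"s"}

def segmentations(word: str) -> list[tuple[str, ...]]:
    """Enumerate all root+suffix+ending and root+ending readings of a word."""
    results = []
    for i in range(1, len(word)):
        for j in range(i + 1, len(word) + 1):
            root, mid, end = word[:i], word[i:j], word[j:]
            if root in ROOTS and mid in SUFFIXES and end in ENDINGS:
                results.append((root, mid, end))   # root + suffix + ending
            if root in ROOTS and mid in ENDINGS and j == len(word):
                results.append((root, mid))        # root + ending
    return results

print(segmentations("walkers"))
# -> [('walk', 'er', 's'), ('walker', 's')]
```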
Comment by orbital-decay 1 day ago
>None of them realized this duality and have taken one possible interpretation.
I suspect this happens due to mode collapse and has nothing to do with the tokenization. Try this with a base model.
Comment by frb 1 day ago
It’s based on semantic primitives (Wierzbicka NSM) and emoji (the fun idea that got me interested in this in the first place).
So far I’ve tested 6 iterations and it trains and responds well with a 10k vocab, but the grammar came out rougher. Working on 8th iteration, mainly to improve the grammar and language. Turns out the smaller vocab couldn’t be maintained and all improvements get us back in the ballpark of the 32k vocab size. Further testing is still outstanding for this week.
Comment by anonymoushn 1 day ago
Comment by nl 1 day ago
Case sensitive language models have been a thing since way before neural language models. I was using them with boosted tree models at least ten years ago, and even my Java NLP tool did this twenty years ago (damn!). There is no novelty there of course - I based that on PG's "A Plan for Spam".
See for example CountVectorizer: https://scikit-learn.org/stable/modules/generated/sklearn.fe...
The bitter lesson says that you are much better off just adding more data and learning the tokenizer and it will be better.
It's not impossible that the new Opus tokenizer is based on something learnt during Mythos pre-training (maybe it is *the* learned Mythos tokenizer?), and it seems likely that the Mythos pre-training run is the most data ever trained on.
Putting an inductive bias in your tokenizer seems just a terrible idea.
Comment by yorwba 1 day ago
This is similar to what the TokenMonster tokenizer does: https://github.com/alasdairforsythe/tokenmonster
Comment by kouteiheika 1 day ago
So how would you explain the increase in token usage, considering the fact that conventionally tokenizers are trained to minimize the token usage within a given vocabulary budget?
> Putting an inductive bias in your tokenizer seems just a terrible idea.
You're already effectively doing this by the sheer fact of using a BPE tokenizer, and especially with modern BPE-based LLM tokenizers[1]. I agree trying to bake this manually in a tokenizer is most likely not a good idea, but I could see a world where you could build a better tokenizer training algorithm which would be able to better take the natural morphology of the underlying text into account.
[1] Example from Qwen3.6 tokenizer:
"pretokenizers": [
{
"type": "Split",
"pattern": {
"Regex": "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?[\\p{L}\\p{M}]+|\\p{N}| ?[^\\s\\p{L}\\p{M}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
},
"behavior": "Isolated",
"invert": false
}
]
},
Comment by nl 1 day ago
Just modeling whitespace as its own token would seem to explain the increase.
> Qwen3.6 tokenizer: "pretokenizer"
That's the pre-tokenizer, not the tokenizer. That is mostly a performance optimization that lets the memory requirements for the BPE tokenizer be a lot less.
> I could see a world where you could build a better tokenizer training algorithm which would be able to better take the natural morphology of the underlying text into account.
The reason everyone went to BPE was because it was so dramatically better than morphology based tokenizers. See the BPE paper: https://arxiv.org/abs/1508.07909
BPE already learns morphology because it sees the raw bytes.
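The merge procedure from the BPE paper fits in a few lines; note that a frequent suffix like "ed" does emerge as a unit, but only when the corpus statistics happen to favour it (toy corpus below is made up):

```python
from collections import Counter

# Minimal BPE sketch (after Sennrich et al., arXiv:1508.07909): repeatedly
# merge the most frequent adjacent symbol pair across the corpus.
def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    words = [list(w) for w in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        for w in words:                 # apply the winning merge everywhere
            i = 0
            while i < len(w) - 1:
                if (w[i], w[i + 1]) == best:
                    w[i:i + 2] = [w[i] + w[i + 1]]
                else:
                    i += 1
    return merges

corpus = ["killed", "filled", "called", "kill", "fill"]
print(train_bpe(corpus, 3))
# -> [('l', 'l'), ('i', 'll'), ('e', 'd')]  -- "ed" emerges as a unit
```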
Comment by kouteiheika 1 day ago
Yes, it's an extra tokenizer which runs before the learned tokenizer and injects an inductive bias into it.
> That is mostly a performance optimization that lets the memory requirements for the BPE tokenizer be a lot less.
While it does indeed speed up training of the tokenizer, no, it isn't mostly just a performance optimization? It injects a clear cut inductive bias into the tokenizer (split by words, split by punctuation, don't merge words and numbers, etc. -- is that not an inductive bias?), and for some languages (e.g. Asian languages which don't use spaces) the "it's just for performance" argument doesn't make as much sense because there it has no spaces to split on, so the chunks of text are much longer (although it does still split on punctuation, etc.).
Can we not agree that the absolutist position of "Putting an inductive bias in your tokenizer seems just a terrible idea." (as in - any inductive bias) is not actually true, especially since people are actually doing it?
Note, I'm not actually arguing that hand-crafted morphological tokenizers are better. (Which is the straw man many people seem to be replying to.) I'm just arguing that it should be feasible to train your tokenizer in a more morphologically aware way, because BPE doesn't do that.
> The reason everyone went to BPE was because it was so dramatically better than morphology based tokenizers. [..] BPE already learns morphology because it sees the raw bytes.
The reason everyone went to BPE is because of the bitter lesson (and because you don't have to hardcode your whole vocabulary, i.e. no UNK tokens), and not because it's particularly good at learning the morphology of the actual text. It's trivial to show countless examples where it fails to do so.
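The kind of hand-written split rules under discussion can be seen in miniature with a simplified (ASCII-only) version of the quoted Qwen pre-tokenizer pattern; the real one uses Unicode \p{} classes, which need the third-party regex module, but even this toy version bakes in clear rules: contractions come off whole, words stay together, digits are split individually, punctuation is isolated:

```python
import re

# Simplified GPT-style pre-tokenizer split. This is an illustrative
# approximation, not the actual Qwen pattern.
PATTERN = re.compile(
    r"'(?:s|t|re|ve|m|ll|d)"    # common English contractions as units
    r"| ?[A-Za-z]+"             # words, optionally with their leading space
    r"|[0-9]"                   # digits one at a time
    r"| ?[^\sA-Za-z0-9]+"       # punctuation runs
    r"|\s+"                     # remaining whitespace
)

def pre_split(text: str) -> list[str]:
    return PATTERN.findall(text)

print(pre_split("It's 42 degrees!"))
# -> ['It', "'s", ' ', '4', '2', ' degrees', '!']
```

Every one of those branches is a human decision about what the learned merges are allowed to cross, which is the inductive-bias point being made above.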
Comment by ctoth 19 hours ago
Comment by aliljet 1 day ago
Comment by KronisLV 1 day ago
I'm still with them because the model is good, but yes, I'm noticing my limits burning up somewhat faster on the 100 USD tier; I bet the 20 USD tier is even more useless.
I wouldn't call it a rugpull, since it seems like there might be good technical reasons for the change, but at the same time we won't know for sure unless they COMMUNICATE that to us. I feel like what's missing is a technical blog post that tells us more about the change and the tokenizer, although I fear that this won't happen due to wanting to keep "trade secrets" or whatever (the unfortunate consequence of which is making the community feel like it's being rugpulled).
Comment by arw0n 10 minutes ago
Comment by rimliu 1 day ago
Comment by adezxc 22 hours ago
Comment by Grp1 1 day ago
The pattern feels deliberate — make the $20 tier just uncomfortable enough that power users upgrade, without officially announcing the reduction. If it continues, $20 buys you a demo and $100 buys you actual work.
Comment by londons_explore 1 day ago
See for example the price difference between taking a taxi and taking the bus, or between hiring a real lawyer vs. your friend at the bar who will give his uninformed opinion for a beer.
Comment by teaearlgraycold 1 day ago
Comment by DeathArrow 1 day ago
You'd be better off using Qwen 3.6 Plus through the Alibaba coding plan.
Comment by SoMomentary 23 hours ago
This is the very reason I've heard I shouldn't use Alibaba!
Comment by lifis 1 day ago
1. Anthropic has not published anything about why they made the change and how exactly they changed it
2. Nobody has reverse engineered it. It seems easy to do so using the free token counting APIs (the Google Vertex AI token count endpoint seems to support 2000 req/min = ~3million req/day, seems enough to reverse engineer it)
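A starting point would be diffing counts for crafted strings against a count-tokens endpoint. Anthropic's own endpoint does exist; the model name below is a placeholder taken from this thread, so substitute whichever model you want to measure:

```python
import json
import os
import urllib.request

# Sketch of probing Anthropic's count-tokens endpoint. Only the request
# builder is exercised here; the network call needs ANTHROPIC_API_KEY set.
API_URL = "https://api.anthropic.com/v1/messages/count_tokens"

def build_count_request(text: str, model: str = "claude-opus-4-7") -> dict:
    return {"model": model, "messages": [{"role": "user", "content": text}]}

def count_tokens(text: str) -> int:
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_count_request(text)).encode(),
        headers={
            "x-api-key": os.environ["ANTHROPIC_API_KEY"],
            "anthropic-version": "2023-06-01",
            "content-type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["input_tokens"]  # response carries the count

payload = build_count_request("hello world")
print(payload["messages"][0]["content"])
```

Comparing counts for, say, "kill" vs " kill" vs "Killed" across many probes is how you would begin mapping out the vocabulary.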
Comment by hk__2 1 day ago
What are you waiting for? ;)
Comment by Esophagus4 22 hours ago
What I’m reading so far seems to be:
-selective use of models based on task complexity
-encoding large repos into more digestible and relevant data structures to reduce constant reingesting
-ask Claude to limit output to X tokens (as output tokens are more expensive)
-reduce flailing by giving plenty of input context
-use Headroom and RTK
-disable unused MCP, move stuff from CLAUDE.md to skills
But I’d love to learn if anyone has any good tips, links, or tools as I’m getting rate limited twice a day now.
Comment by slawr1805 22 hours ago
Comment by Esophagus4 21 hours ago
Comment by ebcode 21 hours ago
Comment by Esophagus4 21 hours ago
I was looking at tree-sitter myself for this task.
Comment by ebcode 20 hours ago
Comment by o10449366 22 hours ago
Comment by Esophagus4 22 hours ago
;)
I just asked it to do a security analysis in a subagent of an unmaintained browser extension and then go fix vulnerabilities it found so I can use it without worrying.
Comment by o10449366 17 hours ago
Comment by Esophagus4 13 hours ago
Comment by vikgmdev 1 hour ago
Comment by RITESH1985 1 day ago
Comment by hyperpape 20 hours ago
For a given input, how many tokens will be used for an answer, and how high quality will that answer be?
Measuring the tokenizer is just one input into the cost-benefit tradeoff.
Comment by hugodan 1 day ago
Comment by Majromax 22 hours ago
This also applies at the sub-task level. If both models need to read three files to figure out which one implements the function they need to modify, then the token tax is paid for all three files even though "not the right file" is presumably an easy conclusion to draw.
This is also related to the challenge of optimizing subagents. Presumably the outer, higher-capacity model can perform better with everything in its context (up to limits), but dispatching a less-capable subagent for a problem might be cheaper overall. Anthropic has a 5:1 cost on input tokens between Opus and Haiku, but Google has 8:1 (Gemini Pro : Flash Lite) and OpenAI has 12:1 (GPT 4.2 : 4.2 nano).
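A back-of-envelope version of that tradeoff, using the quoted 5:1 Opus:Haiku input-token price ratio (absolute prices and token counts below are made up for illustration):

```python
# Relative input-token cost, per the 5:1 ratio mentioned above.
OPUS_PER_TOK, HAIKU_PER_TOK = 5.0, 1.0

def dispatch_cost(context_tokens: int, offloaded_tokens: int) -> float:
    """Relative cost when a cheaper subagent reads part of the context."""
    kept_by_opus = context_tokens - offloaded_tokens
    return (kept_by_opus * OPUS_PER_TOK + offloaded_tokens * HAIKU_PER_TOK) / 1e6

everything_in_opus = 120_000 * OPUS_PER_TOK / 1e6
with_subagent = dispatch_cost(120_000, 90_000)   # three files offloaded
print(everything_in_opus, with_subagent)         # -> 0.6 0.24
```

The larger the provider's price ratio (8:1 or 12:1 above), the more aggressive the dispatching can afford to be, ignoring the quality cost of the subagent's summarized report.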
Comment by anentropic 1 day ago
Comment by great_psy 1 day ago
Is there a quality increase from this change or is it a money grab ?
Comment by Aurornis 1 day ago
Comparisons are still ongoing but I have already seen some that suggest that Opus 4.7 might on average arrive at the answer with fewer tokens spent, even with the additional tokenizer overhead.
So, no, not a money grab.
Comment by ChadNauseam 1 day ago
Comment by abrookewood 1 day ago
Comment by Jtarii 23 hours ago
Comment by dandaka 22 hours ago
Comment by simianwords 1 day ago
Comment by sumeno 21 hours ago
Comment by svnt 17 hours ago
Comment by msp26 1 day ago
I don't think that's their primary motive for doing this but it is a side effect.
Comment by Symmetry 22 hours ago
Comment by nl 1 day ago
Comment by onchainintel 1 day ago
Comment by ilioscio 19 hours ago
Comment by mudkipdev 1 day ago
Comment by kouteiheika 1 day ago
Note that they're the only provider which doesn't make their tokenizer available offline as a library (i.e. the only provider whose tokenizer is secret).
Comment by stingraycharles 1 day ago
The fact that it's impossible to get the actual thinking tokens anymore, and we have to make do with a rewritten summary, is extremely off-putting. I understand that it's necessary for users, but when writing agentic applications yourself, it's super annoying not to have the agent's actual reasoning available to understand failure modes.
Comment by aftbit 20 hours ago
Comment by weird-eye-issue 1 day ago
Comment by tethys 3 hours ago
Error: {"type":"error","error":{"type":"invalid_request_error","message":"Your credit balance is too low to access the Anthropic API. Please go to Plans & Billing to upgrade or purchase credits."},"request_id":"req_011CaGaBf6uTHfbmdZ39nx1Z"}
Comment by weird-eye-issue 1 hour ago
Comment by simonw 1 day ago
Comment by tomglynch 1 day ago
Comment by simonw 1 day ago
Comment by tpowell 1 day ago
Comment by sergiopreira 1 day ago
Comment by potter098 1 day ago
Comment by cubefox 1 day ago
Comment by chattermate 1 day ago
Comment by jug 21 hours ago
Comment by yogigan 1 day ago
Comment by alvis 21 hours ago