Claude Token Counter, now with model comparisons
Posted by twapi 1 day ago
Comments
Comment by vfalbor 1 day ago
Comment by akdor1154 1 day ago
Comment by vfalbor 23 hours ago
Comment by svnt 17 hours ago
Comment by vfalbor 17 hours ago
Comment by touristtam 19 hours ago
Comment by vfalbor 17 hours ago
Comment by jmalicki 17 hours ago
We found a lot of gain by having ranking features based on Pinyin to detect typos/misspellings due to homophones (and similar sounding words). I was investigating stroke decomposition to try to be able to detect near homographs, but wasn't able to find any good libraries at the time.
I could imagine the homophone issue is especially relevant for spoken input to LLMs. LLMs are good enough that they're usually right, so it's probably less of an issue; in English I can make crazy typos and everything just works. I'm curious how well that works for Chinese, since I suspect it's a far harder problem due to the lack of subword tokens.
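The ranking feature described above could be sketched roughly like this (a toy, not the actual system: the tiny pinyin table is hand-written for the example, and a real implementation would use a full dictionary such as the pypinyin package):

```python
# Toy sketch of homophone-aware typo detection: two characters that differ
# but share a pinyin transcription are likely a homophone substitution.
# The PINYIN table below is a hand-made illustration, not real coverage.
PINYIN = {
    "他": "ta1", "她": "ta1", "它": "ta1",   # all pronounced "tā"
    "在": "zai4", "再": "zai4",             # both pronounced "zài"
    "是": "shi4", "事": "shi4",
}

def homophone_mismatches(query: str, candidate: str) -> int:
    """Count positions where the characters differ but sound the same."""
    count = 0
    for q, c in zip(query, candidate):
        if q != c and q in PINYIN and PINYIN.get(q) == PINYIN.get(c):
            count += 1
    return count

print(homophone_mismatches("他在", "她再"))  # -> 2 (both chars are homophone swaps)
```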
Comment by DANmode 13 hours ago
Comment by kouteiheika 1 day ago
Interesting. Unfortunately Anthropic doesn't actually share their tokenizer, but my educated guess is that they might have made the tokenizer more semantically aware to make the model perform better. What do I mean by that? Let me give you an example. (This isn't necessarily what they did exactly; just illustrating the idea.)
Let's take the gpt-oss-120b tokenizer as an example. Here's how a few pieces of text tokenize (I use "|" here to separate tokens):
Kill -> [70074]
Killed -> [192794]
kill -> [25752]
k|illed -> [74, 7905]
<space>kill -> [15874]
<space>killed -> [17372]
You have 3 different tokens which encode the same word (Kill, kill, <space>kill) depending on its capitalization and whether there's a space before it, you have separate tokens for the past tense, etc. This is not necessarily an ideal way of encoding text, because the model must learn by brute force that these tokens are, indeed, related. Now, imagine if you encoded these like this:
<capitalize>|kill
<capitalize>|kill|ed
kill|
kill|ed
<space>|kill
<space>|kill|ed
Notice that this makes much more sense now - the model only has to learn what "<capitalize>" is, what "kill" is, what "<space>" is, and what "ed" (the past tense suffix) is, and it can compose those together. The downside is that it increases the token usage. So I wouldn't be surprised if this is what they did. Or, my guess number two: they removed the tokenizer altogether, replaced it with a small trained model (something like the Byte Latent Transformer), and simply "emulate" the token counts.
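The normalization idea above can be sketched in a few lines (purely illustrative, not Anthropic's actual scheme; suffix splits like "kill|ed" are left to a learned tokenizer running afterwards):

```python
import re

# Sketch: lowercase capitalized words behind a <capitalize> marker and make
# the leading space an explicit <space> token, so "kill", "Kill" and
# " kill" all share one underlying "kill" unit.
def pre_tokenize(text: str) -> list[str]:
    out = []
    for match in re.finditer(r"( ?)(\w+)|\S", text):
        space, word = match.group(1), match.group(2)
        if word is None:          # punctuation and other symbols pass through
            out.append(match.group(0))
            continue
        if space:
            out.append("<space>")
        if word[0].isupper():
            out.append("<capitalize>")
            word = word.lower()
        out.append(word)
    return out

print(pre_tokenize("Killed kill"))
# -> ['<capitalize>', 'killed', '<space>', 'kill']
```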
Comment by ipieter 1 day ago
Comment by sigmoid10 1 day ago
Comment by wongarsu 1 day ago
That's what all these attempts boil down to. They don't presume to be able to find a more space-efficient encoding by hand; they assume that the optimization goal for the tokenizer was wrong and that they can do better by adding some extra rules. And this isn't entirely without precedent: most tokenizers have a couple of "forced" tokens that were not organically discovered. Changing how digits are grouped in the tokenizer is another place where wins have been shown.
This is where projects like nanochat are really valuable for quickly and (relatively) cheaply trying out various tweaks
Comment by sigmoid10 1 day ago
Except that is exactly what research has shown. Besides, the tokenizer's training goal is literally just to encode text efficiently with fewer tokens by increasing the vocabulary, which directly benefits the attention mechanism if you look at the dimensions of the matrices involved. The biggest issues so far have stemmed from variance between the tokenizer's and the LLM's training sets [1], and from the fact that when writing, people primarily work with character-based text rather than word-part-based text (even though that gets muddy when you look at what is actually happening in the brain).
[1] https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldm...
Comment by amelius 23 hours ago
Comment by sigmoid10 19 hours ago
Comment by amelius 18 hours ago
Comment by rrr_oh_man 1 day ago
Comment by dannyw 1 day ago
Also, think about how a LLM would handle different languages.
Comment by fooker 1 day ago
See embedding models.
> they removed the tokenizer altogether
This is an active research topic, no real solution in sight yet.
Comment by friendzis 1 day ago
I just ran an experiment: I took a word and asked models (ChatGPT, Gemini, and Claude) to explode it into parts. The caveat is that it could be either root + suffix + ending or root + ending. None of them recognized this duality; each picked one possible interpretation.
Any such approach to tokenizing assumes context free (-ish) grammar, which is just not the case with natural languages. "I saw her duck" (and other famous examples) is not uniquely tokenizable without a broader context, so either the tokenizer has to be a model itself or the model has to collapse the meaning space.
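The duality in the experiment above is easy to demonstrate mechanically. A toy segmenter over an invented morpheme lexicon (the sets below are made up for the example) finds both readings of the same surface word, and nothing in the string itself picks the winner:

```python
# Toy morpheme sets, invented for illustration only.
ROOTS = {"walk", "walker"}
SUFFIXES = {"er"}
ENDINGS = {"s"}

def segmentations(word: str) -> list[tuple[str, ...]]:
    """Enumerate all root+suffix+ending and root+ending readings of a word."""
    results = []
    for i in range(1, len(word)):
        for j in range(i + 1, len(word) + 1):
            root, mid, end = word[:i], word[i:j], word[j:]
            if root in ROOTS and mid in SUFFIXES and end in ENDINGS:
                results.append((root, mid, end))   # root + suffix + ending
            if root in ROOTS and mid in ENDINGS and j == len(word):
                results.append((root, mid))        # root + ending
    return results

print(segmentations("walkers"))
# -> [('walk', 'er', 's'), ('walker', 's')]
```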
Comment by orbital-decay 1 day ago
>None of them realized this duality and have taken one possible interpretation.
I suspect this happens due to mode collapse and has nothing to do with the tokenization. Try this with a base model.
Comment by frb 1 day ago
It’s based on semantic primitives (Wierzbicka NSM) and emoji (the fun idea that got me interested in this in the first place).
So far I’ve tested 6 iterations and it trains and responds well with a 10k vocab, but the grammar came out rougher. Working on 8th iteration, mainly to improve the grammar and language. Turns out the smaller vocab couldn’t be maintained and all improvements get us back in the ballpark of the 32k vocab size. Further testing is still outstanding for this week.
Comment by anonymoushn 1 day ago
Comment by nl 1 day ago
Case sensitive language models have been a thing since way before neural language models. I was using them with boosted tree models at least ten years ago, and even my Java NLP tool did this twenty years ago (damn!). There is no novelty there of course - I based that on PG's "A Plan for Spam".
See for example CountVectorizer: https://scikit-learn.org/stable/modules/generated/sklearn.fe...
The bitter lesson says that you are much better off just adding more data and learning the tokenizer and it will be better.
It's not impossible that the new Opus tokenizer is based on something learnt during Mythos pre-training (maybe it is *the* learned Mythos tokenizer?), and it seems likely that the Mythos pre-training run is the most data ever trained on.
Putting an inductive bias in your tokenizer seems just a terrible idea.
Comment by yorwba 1 day ago
This is similar to what the TokenMonster tokenizer does: https://github.com/alasdairforsythe/tokenmonster
Comment by kouteiheika 1 day ago
So how would you explain the increase in token usage, considering the fact that conventionally tokenizers are trained to minimize the token usage within a given vocabulary budget?
> Putting an inductive bias in your tokenizer seems just a terrible idea.
You're already effectively doing this by the sheer fact of using a BPE tokenizer, and especially with modern BPE-based LLM tokenizers[1]. I agree trying to bake this manually in a tokenizer is most likely not a good idea, but I could see a world where you could build a better tokenizer training algorithm which would be able to better take the natural morphology of the underlying text into account.
[1] Example from Qwen3.6 tokenizer:
"pretokenizers": [
{
"type": "Split",
"pattern": {
"Regex": "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?[\\p{L}\\p{M}]+|\\p{N}| ?[^\\s\\p{L}\\p{M}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
},
"behavior": "Isolated",
"invert": false
}
]
},
Comment by nl 1 day ago
Just modeling whitespace as its own token would seem to explain the increase.
> Qwen3.6 tokenizer: "pretokenizer"
That's the pre-tokenizer, not the tokenizer. That is mostly a performance optimization that lets the memory requirements for the BPE tokenizer be a lot less.
> I could see a world where you could build a better tokenizer training algorithm which would be able to better take the natural morphology of the underlying text into account.
The reason everyone went to BPE was because it was so dramatically better than morphology based tokenizers. See the BPE paper: https://arxiv.org/abs/1508.07909
BPE already learns morphology because it sees the raw bytes.
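The merge procedure from the BPE paper fits in a few lines; note that a frequent suffix like "ed" does emerge as a unit, but only when the corpus statistics happen to favour it (toy corpus below is made up):

```python
from collections import Counter

# Minimal BPE sketch (after Sennrich et al., arXiv:1508.07909): repeatedly
# merge the most frequent adjacent symbol pair across the corpus.
def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    words = [list(w) for w in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        for w in words:                 # apply the winning merge everywhere
            i = 0
            while i < len(w) - 1:
                if (w[i], w[i + 1]) == best:
                    w[i:i + 2] = [w[i] + w[i + 1]]
                else:
                    i += 1
    return merges

corpus = ["killed", "filled", "called", "kill", "fill"]
print(train_bpe(corpus, 3))
# -> [('l', 'l'), ('i', 'll'), ('e', 'd')]  -- "ed" emerges as a unit
```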
Comment by kouteiheika 1 day ago
Yes, it's an extra tokenizer which runs before the learned tokenizer and injects an inductive bias into it.
> That is mostly a performance optimization that lets the memory requirements for the BPE tokenizer be a lot less.
While it does indeed speed up training of the tokenizer, no, it isn't mostly just a performance optimization? It injects a clear cut inductive bias into the tokenizer (split by words, split by punctuation, don't merge words and numbers, etc. -- is that not an inductive bias?), and for some languages (e.g. Asian languages which don't use spaces) the "it's just for performance" argument doesn't make as much sense because there it has no spaces to split on, so the chunks of text are much longer (although it does still split on punctuation, etc.).
Can we not agree that the absolutist position of "Putting an inductive bias in your tokenizer seems just a terrible idea." (as in - any inductive bias) is not actually true, especially since people are actually doing it?
Note, I'm not actually arguing that hand-crafted morphological tokenizers are better. (Which is the straw man many people seem to be replying to.) I'm just arguing that it should be feasible to train your tokenizer in a more morphologically aware way, because BPE doesn't do that.
> The reason everyone went to BPE was because it was so dramatically better than morphology based tokenizers. [..] BPE already learns morphology because it sees the raw bytes.
The reason everyone went to BPE is because of the bitter lesson (and because you don't have to hardcode your whole vocabulary, i.e. no UNK tokens), and not because it's particularly good at learning the morphology of the actual text. It's trivial to show countless examples where it fails to do so.
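The kind of hand-written split rules under discussion can be seen in miniature with a simplified (ASCII-only) version of the quoted Qwen pre-tokenizer pattern; the real one uses Unicode \p{} classes, which need the third-party regex module, but even this toy version bakes in clear rules: contractions come off whole, words stay together, digits are split individually, punctuation is isolated:

```python
import re

# Simplified GPT-style pre-tokenizer split. This is an illustrative
# approximation, not the actual Qwen pattern.
PATTERN = re.compile(
    r"'(?:s|t|re|ve|m|ll|d)"    # common English contractions as units
    r"| ?[A-Za-z]+"             # words, optionally with their leading space
    r"|[0-9]"                   # digits one at a time
    r"| ?[^\sA-Za-z0-9]+"       # punctuation runs
    r"|\s+"                     # remaining whitespace
)

def pre_split(text: str) -> list[str]:
    return PATTERN.findall(text)

print(pre_split("It's 42 degrees!"))
# -> ['It', "'s", ' ', '4', '2', ' degrees', '!']
```

Every one of those branches is a human decision about what the learned merges are allowed to cross, which is the inductive-bias point being made above.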
Comment by ctoth 19 hours ago
Comment by aliljet 1 day ago
Comment by KronisLV 1 day ago
I'm still with them because the model is good, but yes, I'm noticing my limits burning up somewhat faster on the 100 USD tier; I bet the 20 USD tier is even more useless.
I wouldn't call it a rugpull, since it seems like there might be good technical reasons for the change, but at the same time we won't know for sure unless they COMMUNICATE that to us. I feel like what's missing is a technical blog post that tells us more about the change and the tokenizer, although I fear that this won't happen due to wanting to keep "trade secrets" or whatever (the unfortunate consequence of which is making the community feel like it's being rugpulled).
Comment by arw0n 10 minutes ago
Comment by rimliu 1 day ago
Comment by adezxc 22 hours ago
Comment by Grp1 1 day ago
The pattern feels deliberate — make the $20 tier just uncomfortable enough that power users upgrade, without officially announcing the reduction. If it continues, $20 buys you a demo and $100 buys you actual work.
Comment by londons_explore 1 day ago
See for example the price difference between taking a taxi and taking the bus, or between hiring a real lawyer vs. your friend at the bar who will give his uninformed opinion for a beer.
Comment by teaearlgraycold 1 day ago
Comment by DeathArrow 1 day ago
You'd be better off using Qwen 3.6 Plus through the Alibaba coding plan.
Comment by SoMomentary 23 hours ago
This is the very reason I've heard I shouldn't use Alibaba!
Comment by lifis 1 day ago
1. Anthropic has not published anything about why they made the change and how exactly they changed it
2. Nobody has reverse engineered it. It seems easy to do so using the free token counting APIs (the Google Vertex AI token count endpoint seems to support 2000 req/min = ~3million req/day, seems enough to reverse engineer it)
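A starting point would be diffing counts for crafted strings against a count-tokens endpoint. Anthropic's own endpoint does exist; the model name below is a placeholder taken from this thread, so substitute whichever model you want to measure:

```python
import json
import os
import urllib.request

# Sketch of probing Anthropic's count-tokens endpoint. Only the request
# builder is exercised here; the network call needs ANTHROPIC_API_KEY set.
API_URL = "https://api.anthropic.com/v1/messages/count_tokens"

def build_count_request(text: str, model: str = "claude-opus-4-7") -> dict:
    return {"model": model, "messages": [{"role": "user", "content": text}]}

def count_tokens(text: str) -> int:
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_count_request(text)).encode(),
        headers={
            "x-api-key": os.environ["ANTHROPIC_API_KEY"],
            "anthropic-version": "2023-06-01",
            "content-type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["input_tokens"]  # response carries the count

payload = build_count_request("hello world")
print(payload["messages"][0]["content"])
```

Comparing counts for, say, "kill" vs " kill" vs "Killed" across many probes is how you would begin mapping out the vocabulary.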
Comment by hk__2 1 day ago
What are you waiting for? ;)
Comment by Esophagus4 22 hours ago
What I’m reading so far seems to be:
-selective use of models based on task complexity
-encoding large repos into more digestible and relevant data structures to reduce constant reingesting
-ask Claude to limit output to X tokens (as output tokens are more expensive)
-reduce flailing by giving plenty of input context
-use Headroom and RTK
-disable unused MCP, move stuff from CLAUDE.md to skills
But I’d love to learn if anyone has any good tips, links, or tools as I’m getting rate limited twice a day now.
Comment by slawr1805 22 hours ago
Comment by Esophagus4 21 hours ago
Comment by ebcode 21 hours ago
Comment by Esophagus4 21 hours ago
I was looking at tree-sitter myself for this task.
Comment by ebcode 20 hours ago
Comment by o10449366 22 hours ago
Comment by Esophagus4 22 hours ago
;)
I just asked it to do a security analysis in a subagent of an unmaintained browser extension and then go fix vulnerabilities it found so I can use it without worrying.
Comment by o10449366 17 hours ago
Comment by Esophagus4 13 hours ago
Comment by vikgmdev 1 hour ago
Comment by RITESH1985 1 day ago
Comment by hyperpape 20 hours ago
For a given input, how many tokens will be used for an answer, and how high quality will that answer be?
Measuring the tokenizer is just one input into the cost-benefit tradeoff.
Comment by hugodan 1 day ago
Comment by Majromax 22 hours ago
This also applies at the sub-task level. If both models need to read three files to figure out which one implements the function they need to modify, then the token tax is paid for all three files even though "not the right file" is presumably an easy conclusion to draw.
This is also related to the challenge of optimizing subagents. Presumably the outer, higher-capacity model can perform better with everything in its context (up to limits), but dispatching a less-capable subagent for a problem might be cheaper overall. Anthropic has a 5:1 cost on input tokens between Opus and Haiku, but Google has 8:1 (Gemini Pro : Flash Lite) and OpenAI has 12:1 (GPT 4.2 : 4.2 nano).
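A back-of-envelope version of that tradeoff, using the quoted 5:1 Opus:Haiku input-token price ratio (absolute prices and token counts below are made up for illustration):

```python
# Relative input-token cost, per the 5:1 ratio mentioned above.
OPUS_PER_TOK, HAIKU_PER_TOK = 5.0, 1.0

def dispatch_cost(context_tokens: int, offloaded_tokens: int) -> float:
    """Relative cost when a cheaper subagent reads part of the context."""
    kept_by_opus = context_tokens - offloaded_tokens
    return (kept_by_opus * OPUS_PER_TOK + offloaded_tokens * HAIKU_PER_TOK) / 1e6

everything_in_opus = 120_000 * OPUS_PER_TOK / 1e6
with_subagent = dispatch_cost(120_000, 90_000)   # three files offloaded
print(everything_in_opus, with_subagent)         # -> 0.6 0.24
```

The larger the provider's price ratio (8:1 or 12:1 above), the more aggressive the dispatching can afford to be, ignoring the quality cost of the subagent's summarized report.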
Comment by anentropic 1 day ago
Comment by great_psy 1 day ago
Is there a quality increase from this change or is it a money grab ?
Comment by Aurornis 1 day ago
Comparisons are still ongoing but I have already seen some that suggest that Opus 4.7 might on average arrive at the answer with fewer tokens spent, even with the additional tokenizer overhead.
So, no, not a money grab.
Comment by ChadNauseam 1 day ago
Comment by abrookewood 1 day ago
Comment by Jtarii 23 hours ago
Comment by dandaka 22 hours ago
Comment by simianwords 1 day ago
Comment by sumeno 21 hours ago
Comment by svnt 17 hours ago
Comment by msp26 1 day ago
I don't think that's their primary motive for doing this but it is a side effect.
Comment by Symmetry 22 hours ago
Comment by nl 1 day ago
Comment by onchainintel 1 day ago
Comment by ilioscio 19 hours ago
Comment by mudkipdev 1 day ago
Comment by kouteiheika 1 day ago
Note that they're the only provider which doesn't make their tokenizer available offline as a library (i.e. the only provider whose tokenizer is secret).
Comment by stingraycharles 1 day ago
The fact that it's impossible to get the actual thinking tokens anymore, and we have to make do with a rewritten summary, is extremely off-putting. I understand that it's necessary for users, but when writing agentic applications yourself, it's super annoying not to have the agent's actual reasoning available to understand failure modes.
Comment by aftbit 20 hours ago
Comment by weird-eye-issue 1 day ago
Comment by tethys 3 hours ago
Error: {"type":"error","error":{"type":"invalid_request_error","message":"Your credit balance is too low to access the Anthropic API. Please go to Plans & Billing to upgrade or purchase credits."},"request_id":"req_011CaGaBf6uTHfbmdZ39nx1Z"}
Comment by weird-eye-issue 1 hour ago
Comment by simonw 1 day ago
Comment by tomglynch 1 day ago
Comment by simonw 1 day ago
Comment by tpowell 1 day ago
Comment by sergiopreira 1 day ago
Comment by potter098 1 day ago
Comment by cubefox 1 day ago
Comment by chattermate 1 day ago
Comment by jug 21 hours ago
Comment by yogigan 1 day ago
Comment by alvis 21 hours ago