Prompt caching for cheaper LLM tokens
Posted by samwho 3 days ago
Comments
Comment by est 1 day ago
Comment by samwho 23 hours ago
Comment by yomismoaqui 18 hours ago
EDIT: You have some minor typos in the post (psuedocode)
Comment by Havoc 22 hours ago
I was looking at modifying outgoing requests via a proxy and wondering whether that harms caching. Common coding tools presumably share a prompt across all their installs, so a universal cache would save a lot.
Comment by moebrowne 22 hours ago
> Prompt caches are not shared between organizations. Only members of the same organization can access caches of identical prompts.
https://platform.openai.com/docs/guides/prompt-caching#frequ...
Comment by maxloh 16 hours ago
With the cache limited to the same organization, the chances of it actually being reused would be extremely low.
Comment by qeternity 12 hours ago
On the API side, imagine you are doing document processing and have a 50k-token instruction prompt that you reuse for every document.
It’s extremely viable and used all the time.
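Back-of-envelope, with made-up numbers (real prices and cache discounts vary by provider and model):

    # Purely illustrative pricing; swap in your provider's real rates.
    PRICE_PER_MTOK = 2.50   # $ per 1M uncached input tokens (assumed)
    CACHED_DISCOUNT = 0.5   # cached input billed at 50% of that (assumed)

    prefix_tokens, doc_tokens = 50_000, 2_000
    uncached = (prefix_tokens + doc_tokens) / 1e6 * PRICE_PER_MTOK
    cached = (prefix_tokens * CACHED_DISCOUNT + doc_tokens) / 1e6 * PRICE_PER_MTOK
    print(f"per document: ${uncached:.4f} cold vs ${cached:.4f} with a cache hit")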
Comment by jonhohle 12 hours ago
Comment by qeternity 11 hours ago
It took a while for companies to start metering it and charging accordingly.
Also, companies invested in hierarchical caches that allow longer-term and cross-cluster caching.
Comment by IanCal 16 hours ago
Comment by babelfish 16 hours ago
Comment by samwho 22 hours ago
Comment by weird-eye-issue 20 hours ago
With OpenAI at least you can specify the cache key and they even have this in the docs:
Use the prompt_cache_key parameter consistently across requests that share common prefixes. Select a granularity that keeps each unique prefix-prompt_cache_key combination below 15 requests per minute to avoid cache overflow.
Comment by psadri 16 hours ago
Comment by weird-eye-issue 5 hours ago
Let's say you have a chatbot with hundreds of active users. Their requests could get routed to different machines, which would mean the implicit caching wouldn't work.
If you set the cache key to a user id, then each user's chat would be more likely to get cached on subsequent requests.
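A minimal sketch of that, assuming the OpenAI Python SDK (the model name and surrounding helper are made up; only prompt_cache_key comes from the docs quoted above):

    # Sketch only: key the cache per user so a user's follow-up requests are
    # more likely to land on the same cached prefix.
    from openai import OpenAI

    client = OpenAI()

    def chat(user_id: str, system_prompt: str, history: list[dict]) -> str:
        resp = client.chat.completions.create(
            model="gpt-4.1-mini",  # placeholder model name
            messages=[{"role": "system", "content": system_prompt}, *history],
            prompt_cache_key=f"user-{user_id}",
        )
        return resp.choices[0].message.content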
Comment by samwho 21 hours ago
Comment by gwern 14 hours ago
Comment by reitzensteinm 12 hours ago
But that's only going to work if the cache looks like: "h", "hu", "hun", ..., "hunter2"
If just "hunter2" is in the cache, you won't get any signal until you stumble on exactly that password. And that's before getting into the block size granularity of the caches discussed elsewhere in this thread.
That's not to say timing attacks aren't possible. I haven't looked at Claude Code's prompt generation, but there's no intrinsic reason why you couldn't do things like figure out what open source code and research papers your competitors are loading into context.
Sharing caches between orgs would be an incredible misstep.
Comment by jgeralnik 11 hours ago
Comment by reitzensteinm 10 hours ago
This won't be the case in any non-toy implementation, as it would be unnecessary and slow.
Comment by jgeralnik 4 hours ago
Comment by IanCal 13 hours ago
Comment by jgeralnik 11 hours ago
Comment by gunalx 21 hours ago
Comment by weird-eye-issue 20 hours ago
Comment by samwho 21 hours ago
Comment by sroussey 10 hours ago
A local model running alone on your machine will 100% return the exact same thing every time, and the internal state will be exactly the same, so you can checkpoint or cache that state to avoid rerunning up to that point.
But… conditions can be different, and batching requests tends to affect other items in flight. I believe Thinking Machines had an article about how to make a request deterministic again without performance going to complete crap.
I tend to think of things this way (completely not what happens, though): what if you were to cache based on a tensor as the key? To generate a reasonably sized key, what is an acceptable loss of precision to retrieve the same cache, knowing that there is inherent jitter in the numbers of the tensor?
And then the ever so slight leak of information. But also multiplied since there are internal kv caches for tokens and blah blah blah.
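For the local-model case, the checkpointing idea looks roughly like this (a minimal sketch assuming HuggingFace transformers and a small causal LM, not anything a hosted provider actually runs):

    # Run the shared prefix once, keep its KV cache, then feed only the new
    # tokens plus that cache so the prefix is never recomputed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "gpt2"  # stand-in for whatever local model you run
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name).eval()

    prefix_ids = tok("You are a helpful assistant. ", return_tensors="pt").input_ids
    with torch.no_grad():
        prefix_out = model(prefix_ids, use_cache=True)  # "checkpoint" of the prefix
    kv_cache = prefix_out.past_key_values  # deep-copy this if you reuse it across requests

    suffix_ids = tok("Summarise the following document:", return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(suffix_ids, past_key_values=kv_cache, use_cache=True)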
Comment by dustfinger 18 hours ago
Comment by dustfinger 15 hours ago
- Product logic / decision rules, such as: when to refund, how to triage tickets
- Internal taxonomies, schemas, or tool interfaces
- Safety and policy guardrails (which adversaries could try to route around)
- Brand voice, strategy, or proprietary workflows
That is just off the top of my head.
Comment by duggan 22 hours ago
Even just moving it to the bottom helped move a lot of our usage into cache.
Probably went from something like 30-50% cached tokens to 50-70%.
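For anyone curious, the change is basically this shape (names here are made up, it's just the idea):

    # Keep the big, stable instructions at the top and push anything
    # per-request (dates, user details, the document itself) to the bottom,
    # so the shared prefix stays byte-identical and cacheable.
    STABLE_INSTRUCTIONS = open("instructions.txt").read()  # identical every request

    def build_messages(document: str, today: str) -> list[dict]:
        return [
            {"role": "system", "content": STABLE_INSTRUCTIONS},               # cacheable prefix
            {"role": "user", "content": f"Today is {today}.\n\n{document}"},  # volatile suffix
        ]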
Comment by willvarfar 22 hours ago
So if I were running a provider I would be caching popular prefixes for questions across all users. There must be so many questions that start 'what is' or 'who was' etc?
Also, can subsequences in the prompt be cached and reused? Or is it only prefixes? I mean, can you cache popular phrases that might appear in the middle of the prompt and reuse that somehow rather than needing to iterate through them token by token? E.g. must be lots of times that "and then tell me what" appears in the middle of a prompt?
Comment by GeneralMayhem 22 hours ago
My favorite not-super-accurate mental model of what's going on with attention is that the model is sort of compressing the whole preceding context into each token. So the word "tell" would include a representation not just of the concept of telling, but also of what it is that's supposed to be told. That's explicitly what you don't want to cache.
> So if I were running a provider I would be caching popular prefixes for questions across all users
Unless you're injecting user context before the question. You can have a pre-baked cache with the base system prompt, but not beyond that. Imagine that the prompt always starts with "SYSTEM: You are ChatGPT, a helpful assistant. The time is 6:51 ET on December 19, 2025. The user's name is John Smith. USER: Hi, I was wondering..." You can't cache the "Hi, I was wondering" part because it comes after a high-entropy component (the timestamp and user name).
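A toy way to see it (not any provider's real implementation): a prefix cache can only reuse work up to the first token that differs, so a high-entropy chunk near the top invalidates everything after it.

    def cacheable_prefix_len(cached_tokens: list[int], new_tokens: list[int]) -> int:
        """How much of a previous request's KV cache a new request could reuse."""
        n = 0
        for a, b in zip(cached_tokens, new_tokens):
            if a != b:
                break
            n += 1
        return n

    # With the timestamp near the top, the two token streams diverge there,
    # and the identical "Hi, I was wondering..." after it can't be reused.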
Comment by samwho 22 hours ago
There’s been some research into how to cache chunks in the middle, but I don’t think any of the providers are doing it yet because it needs the prompt to be structured in a very specific way.
Comment by moebrowne 22 hours ago
> Caching is available for prompts containing 1024 tokens or more.
No mention of caching being in blocks of 1024 tokens thereafter.
Comment by IanCal 13 hours ago
Comment by WillAdams 21 hours ago
It's a pain having to tell Copilot "Open in pages mode" each time it's launched, and then after processing a batch of files run into:
https://old.reddit.com/r/Copilot/comments/1po2cuf/daily_limi...
Comment by holbrad 21 hours ago
https://t3.chat/share/j2tnfwwful https://t3.chat/share/k1xhgisrw1
Comment by dangoodmanUT 19 hours ago
Comment by toobulkeh 19 hours ago
ngrok.ai
Comment by who-shot-jr 15 hours ago
Comment by samwho 15 hours ago
These are all built with React and CSS animations (or the Web Animations API where I needed it). I'm not very good at React so the code is a real mess. Two of the components also use three.js for the 3D bits.
For the stuff on my personal site, which simonw graciously linked to in another reply, you can see all the code behind my work at https://github.com/samwho/visualisations
Comment by simonw 15 hours ago
Comment by samwho 15 hours ago
Comment by aitchnyu 1 day ago
Comment by samwho 1 day ago
The product has grown a lot since the mid 2010s. Still got free localhost tunnelling, but we also have a whole bunch of production-grade API gateway tooling and, as of recently, AI gateway stuff too.
Comment by tomhow 22 hours ago
[see https://news.ycombinator.com/item?id=45988611 for explanation]
Comment by walterbell 18 hours ago
How was the term "rug" chosen, e.g. in the historical context of newspaper folds?
Comment by coderintherye 1 day ago
I'd note, when I gave the input/output screenshot to ChatGPT 5.2 it failed on it (with lots of colorful chain of thought), though Gemini got it right away.
Comment by samwho 1 day ago
Comment by simedw 3 days ago
I recently had some trouble converting a HF transformer I trained with PyTorch to Core ML. I just couldn’t get the KV cache to work, which made it unusably slow after 50 tokens…
Comment by samwho 3 days ago
Yes, I recently wrote https://github.com/samwho/llmwalk and had a similar experience with cache vs no cache. It’s so impactful.
Comment by mrgaro 1 day ago
Comment by samwho 1 day ago
I’m really glad you liked it, and seriously the resources I link at the end are fantastic.
Comment by ThePyCoder 1 day ago
Comment by samwho 1 day ago
Comment by wesammikhail 1 day ago
Great work. Learned a lot!
Comment by samwho 1 day ago
Comment by wesammikhail 21 hours ago
Comment by stingraycharles 23 hours ago
Where do people get the idea from that temperature affects caching in any way? Temperature is about next token prediction / output, not input.
Comment by semi-extrinsic 23 hours ago
Comment by wesammikhail 21 hours ago
It's a semantics issue where the word "caching" is overloaded depending on context. For people who are not familiar with the inner workings of LLMs, this can cause understandable confusion.
Comment by NooneAtAll3 23 hours ago
Comment by Youden 2 days ago
Comment by samwho 23 hours ago
Comment by bkor 18 hours ago