Speculative KV coding: losslessly compressing KV cache by up to ~4×
Posted by kkm 5 days ago
Comments
Comment by zozbot234 3 days ago
Comment by saagarjha 2 days ago
Comment by zozbot234 2 days ago
Comment by zozbot234 2 days ago
Comment by somnial 1 day ago
Comment by hypfer 3 days ago
A tiny deterministic model predicts the K/V cache, the prediction is compared with the actual values, and only the delta is stored in VRAM. Reading it back runs the same predictor again and applies the delta, so you recover the exact original value while storing only the delta.
And this works because you're never looking at the whole K/V cache but always just a slice, so you only need a memory buffer the size of that slice.
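A minimal sketch of the predict-then-store-the-delta idea described above, not the paper's actual method. It assumes integer (already quantized) values so the round trip is exact, and `last_value_predictor` is a hypothetical stand-in for the tiny deterministic model. No entropy coder is included; the point is only that the residuals are small when the predictor is good, and that decoding reproduces the input exactly.

```python
from typing import Callable, List

# A predictor maps the values reconstructed so far to a guess for the next one.
Predictor = Callable[[List[int]], int]


def last_value_predictor(history: List[int]) -> int:
    # Hypothetical stand-in for the tiny deterministic model:
    # guess that the next value equals the previous one.
    return history[-1] if history else 0


def encode(values: List[int], predict: Predictor) -> List[int]:
    """Store only the difference between each value and its prediction."""
    residuals: List[int] = []
    history: List[int] = []
    for v in values:
        residuals.append(v - predict(history))
        history.append(v)
    return residuals


def decode(residuals: List[int], predict: Predictor) -> List[int]:
    """Re-run the same predictor and add the stored residual back."""
    history: List[int] = []
    for r in residuals:
        history.append(predict(history) + r)
    return history


values = [100, 102, 101, 105, 107]
residuals = encode(values, last_value_predictor)
restored = decode(residuals, last_value_predictor)
```

The scheme is lossless only because encoder and decoder run the exact same deterministic predictor on the exact same history; any nondeterminism in the predictor would break the round trip.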
___
If this works out and I've understood correctly, that _I think_ would mean that a 24GB RTX 4090 could fit 256k q8 context next to Qwen3.6-27B at IQ4_NL.
Or, alternatively, something like 208k context (matching Claude API limits of 200k in some plans) with a slightly larger quant like UD-Q4_K_XL.
That would be massive. Especially since the thing has so much compute to spare.
Though, all depending on the size of that predictor model I guess?
Comment by oceanplexian 2 days ago
KV can be trivially stored in RAM or even on a spinning disk and retrieved on the order of milliseconds. See LMCache for vLLM, for example. In fact it’s so easy that it kinda shocks me when Claude Code sits and recomputes my entire KV on a new session after a couple of hours; I guess Anthropic infra is not as optimized as it would seem.
Think about the problem from first principles:
Storing a few GB per user at scale isn’t that hard and was solved years ago. Let’s say I have 20 chat sessions open and each session persists for a day or two; this seems negligible to me as a systems design problem.
Comment by xlayn 2 days ago
I guess for a model with 300B parameters or more and a couple million users, with the price of storage increasing as part of RAMageddon, this is also not viable...
Comment by zozbot234 2 days ago
Comment by lostmsu 2 days ago
Comment by zozbot234 2 days ago
Comment by lostmsu 1 day ago
Comment by zozbot234 1 day ago
Comment by lostmsu 1 day ago
Comment by zozbot234 1 day ago
Comment by lostmsu 1 day ago
Comment by zozbot234 1 day ago
Comment by lostmsu 1 day ago
Comment by fc417fc802 1 day ago
Comment by lostmsu 1 day ago
Batching could allow you to generate 10 tokens for 10 different conversations at a time, but it also means that you need to load different experts for different tokens, so it does not help as much as it does for dense models.
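A rough back-of-the-envelope for the point above, under a loudly stated assumption: every token routes to `top_k` experts chosen uniformly and independently, which the rest of this thread disputes for a single user's session. The function name and the example sizes (64 experts, top-8) are hypothetical and not taken from any particular model.

```python
def expected_distinct_experts(num_experts: int, top_k: int, batch_tokens: int) -> float:
    """Expected number of distinct experts touched per MoE layer by a batch.

    Assumes uniform, independent routing (an assumption, not a measurement):
    a given expert is missed by one token with probability 1 - top_k/num_experts,
    and by all tokens in the batch with that probability raised to batch_tokens.
    """
    p_unused = (1 - top_k / num_experts) ** batch_tokens
    return num_experts * (1 - p_unused)


single = expected_distinct_experts(num_experts=64, top_k=8, batch_tokens=1)
batched = expected_distinct_experts(num_experts=64, top_k=8, batch_tokens=10)
```

Under this assumption a batch of 10 tokens touches most of the experts in a layer rather than 8, which is why batching buys less for MoE than for dense models. If routing within a session is biased toward a few experts, the real number would be lower than this estimate.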
Comment by fc417fc802 1 day ago
On top of that (as previously pointed out by zoz) for a single user running a single overarching task the choice of experts is expected to be highly biased.
Comment by lostmsu 1 day ago
Why? Why do you think that's the case? Part of the training is balancing load between experts.
> so in aggregate there's a speedup.
Yes. 2x, over a theoretical baseline of under 1 tok/s.
Comment by fc417fc802 1 day ago
That is a fair point. That expectation may have been misplaced on my part. I'm not sufficiently familiar with the details of MoE training.
> The slowdown is because the model does not fit the xRAM so experts will have to be read from SSD on every forward pass.
> 20+ tokens per second of 27B before any batching.
Does the model fit in RAM or not? What is your justification for your stated expectation that the unbatched model will perform 20x faster than the aggregate tps (note, not the single stream tps) of the batched model?
My expectation is that if the unbatched model is 20 tps and batching provides a 2x speedup then each individual stream will be slower but the aggregate throughput should rise to 40 tps. What do you believe me to be missing here?
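The arithmetic behind that expectation, as a small sketch. The 20 tps and 2x figures are the ones quoted in this exchange; the batch size of 10 is borrowed from the earlier "10 different conversations" example, and the function name is made up for illustration.

```python
def batched_throughput(single_stream_tps: float, batching_speedup: float, batch_size: int):
    """Return (aggregate tps, per-stream tps) for a batched run.

    Treats the speedup as a multiplier on aggregate throughput, which is
    then shared evenly across the streams in the batch.
    """
    aggregate = single_stream_tps * batching_speedup
    per_stream = aggregate / batch_size
    return aggregate, per_stream


aggregate_tps, per_stream_tps = batched_throughput(20.0, 2.0, 10)
```

With these inputs the aggregate is 40 tps while each of the 10 streams sees 4 tps, so individual streams slow down even though total throughput doubles.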
Comment by lostmsu 17 hours ago
Comment by zozbot234 23 hours ago
The training balances expert choice across the entire scope of the model. Experiments have consistently shown that within a given session or topic (taken in a broad sense) expert choice is biased in a way that's likely to make caching useful and reuse across a user-specific batch realistic.
Comment by killerstorm 2 days ago
Comment by 5kg 2 days ago
Comment by btown 2 days ago
While this might seem to be true for casual users, I recall that one of the reasons for Anthropic's recent change to retain the KV cache for only an hour or so was that many users just have one massive ongoing session that they continue with multiple unrelated queries (as one would in a single-thread "group chat"). And this is hard to distinguish from someone who wants that context for their seemingly unrelated query to apply tone etc.
So in practice, there are many casual users who are typing their Google-esque searches against a 100k+ token context window - and it's at that point where things balloon into 300GB+ KV caches to maintain.
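For a sense of scale, here is the usual KV cache size estimate for a standard transformer: two tensors (K and V) per layer, each holding `kv_heads * head_dim` elements per token. The model dimensions below are hypothetical, chosen only to show how strongly the result depends on architecture; they do not describe any specific production model.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to hold K and V for `tokens` positions across all layers."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # 2 = K and V
    return per_token * tokens


# Hypothetical model with grouped-query attention (8 KV heads), fp16 cache.
gqa_bytes = kv_cache_bytes(layers=64, kv_heads=8, head_dim=128, tokens=100_000)

# Hypothetical larger model with full multi-head attention (96 KV heads), fp16 cache.
mha_bytes = kv_cache_bytes(layers=96, kv_heads=96, head_dim=128, tokens=100_000)
```

At 100k tokens the first configuration comes to roughly 26 GB and the second to roughly 470 GB, so whether a long session lands in the hundreds of gigabytes depends mostly on the number of KV heads and layers.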
I wouldn't be surprised if we see new UXs around subsidized plans starting to encourage resetting the context window more often.
Comment by zozbot234 2 days ago
Comment by nicman23 2 days ago
Comment by jbellis 2 days ago
Comment by ssivark 2 days ago
The cache can be backed by hardware/lookup, or by a cheap computation. The line between functions and data is really blurry.
Comment by mycall 2 days ago
Comment by 0-_-0 3 days ago
Comment by wongarsu 2 days ago
For serving a 1T model with 16 concurrent requests this could make a lot of sense. For an 8B model with a single request, far less so.
Comment by 0-_-0 2 days ago
Comment by syllogistic 2 days ago
Comment by mirekrusin 3 days ago
Comment by saagarjha 2 days ago
Comment by mirekrusin 2 days ago
MoE is more hardcoded and predetermined; speculation is much more dynamic and malleable after training.
This paper actually proposes, as future work, aligning the architecture to aid speculation.
Comment by doctorpangloss 2 days ago
Comment by monster_truck 2 days ago
Comment by boutell 2 days ago
Comment by liuliu 2 days ago
Comment by zzzoom 2 days ago
Comment by haeseong 2 days ago
Comment by porridgeraisin 3 days ago