Speculative KV coding: losslessly compressing KV cache by up to ~4×

Posted by kkm 5 days ago

Comments

Comment by zozbot234 3 days ago

The problem with this approach is that even recomputing a "draft" of the KV cache is still quadratic in context length. Maybe you can get some constant savings by always recomputing the earliest tokens, but it's not a good tradeoff as context sizes grow.

Comment by saagarjha 2 days ago

Sure, but any classical attention mechanism is quadratic in context length.

Comment by zozbot234 2 days ago

But text generation is quadratic after the KV cache optimization. If every decode step now has to recompute KV cache including its latest and most expensive tokens (even with a quick, "draft" model) that's even worse.

Comment by zozbot234 2 days ago

BTW, I forgot to mention that you can make this work in a way, but only if your model architecture generalizes the context and attention mechanism such that it's no longer a pure sequence. So you could have a large amount of distinct "early" token sequences, with each being self-contained and not depending on any other tokens, e.g. your source code files might be such. Then later parts of the context would of course depend on all of those files as usual. This makes prefill for the earlier context both reusable and cheaply recomputable throughout, at the cost of losing some dependencies that would've been previously accounted for: your model becomes faster and more efficient, but perhaps not quite as smart.
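One way to picture the "self-contained early sequences" idea is as an attention mask. The sketch below is only an illustration under assumptions: the function name `segmented_causal_mask` and the toy segment sizes are made up, and no real model or library is being described. Early tokens attend causally within their own segment only, while the later "tail" tokens attend to everything before them.

```python
from typing import List


def segmented_causal_mask(segment_lengths: List[int], tail_length: int) -> List[List[bool]]:
    """Return mask[i][j], True when token i may attend to token j (hypothetical helper)."""
    # Record which early segment each token belongs to; tail tokens get -1.
    segment_of: List[int] = []
    for seg_id, length in enumerate(segment_lengths):
        segment_of.extend([seg_id] * length)
    segment_of.extend([-1] * tail_length)

    n = len(segment_of)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # causal: only positions up to and including i
            if segment_of[i] == -1:
                mask[i][j] = True  # tail tokens see all earlier tokens
            else:
                # early tokens see only earlier tokens of their own segment
                mask[i][j] = segment_of[i] == segment_of[j]
    return mask


# Two independent 2-token "files" followed by a 1-token tail.
mask = segmented_causal_mask([2, 2], tail_length=1)
```

Because no early segment depends on another, each segment's K/V entries could be prefilled, cached, or recomputed on their own, which is the reuse property the comment is after.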

Comment by somnial 1 day ago

True, but there's no reason the predictor model couldn't use linear attention (e.g. Mamba, GDN, etc.) to predict KV caches.

Comment by hypfer 3 days ago

TL;DR (and please correct me if I got it wrong):

A tiny deterministic model predicts the K/V cache, the prediction is compared with the real values, and only the delta is stored in VRAM. Decoding runs the same predictor again and applies the delta, which gives back the exact original values while only the delta was ever stored.

And this works because you're never looking at the whole k/v cache but always just a slice. So you just need a memory buffer of the size of the slice
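The predict-then-delta round trip in this summary can be sketched in a few lines. This is a toy under assumptions, not the article's implementation: `toy_predictor` stands in for the small deterministic model, and the cache values are made-up integers so that the round trip is exactly lossless.

```python
from typing import Callable, List

Predictor = Callable[[int], int]


def encode(actual: List[int], predict: Predictor) -> List[int]:
    """Keep only the difference between the real value and the prediction."""
    return [value - predict(pos) for pos, value in enumerate(actual)]


def decode(residuals: List[int], predict: Predictor) -> List[int]:
    """Re-run the same deterministic predictor and add the stored difference back."""
    return [predict(pos) + r for pos, r in enumerate(residuals)]


def toy_predictor(pos: int) -> int:
    # Hypothetical stand-in for the small model: a fixed linear guess.
    return 100 + 3 * pos


actual = [100, 104, 106, 108, 112, 116]   # made-up quantized cache values
residuals = encode(actual, toy_predictor)  # small numbers when the guess is good
restored = decode(residuals, toy_predictor)
```

The scheme only works if the predictor is bit-for-bit deterministic on both sides. Any savings come from the residuals being cheaper to store than the raw values.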

___

If this works out and I've understood correctly, then _I think_ that would mean a 24GB RTX 4090 could fit 256k q8 context next to Qwen3.6-27B at IQ4_NL.

Or, alternatively, something like 208k context (matching claude api limits of 200k in some plans) with a slightly larger quant like UD-Q4_K_XL.

That would be massive. Especially since the thing has so much compute to spare.

Though, all depending on the size of that predictor model I guess?
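For anyone wanting to redo this kind of estimate, the usual back-of-the-envelope formula for KV cache size is 2 (K and V) × layers × KV heads × head dimension × tokens × bytes per value. The sketch below uses placeholder architecture numbers that are NOT the real configuration of any model named in this thread. It also ignores quantization scale overhead and the size of the predictor itself.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_value: int) -> int:
    """Size of a plain KV cache: one K and one V vector per layer, head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


GIB = 2 ** 30

# Placeholder values for illustration only (assumptions, not a real model config).
raw = kv_cache_bytes(n_layers=16, n_kv_heads=4, head_dim=256,
                     n_tokens=256 * 1024, bytes_per_value=1)
compressed = raw // 4  # the headline "up to ~4x" taken at face value

print(f"raw: {raw / GIB:.1f} GiB, at 4x: {compressed / GIB:.1f} GiB")
```

Plugging in a model's actual layer and head counts shows how much context fits beside the weights, with and without the claimed ratio.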

Comment by oceanplexian 2 days ago

A lot of this is over my head but why would you do compression when GPU time is the most expensive thing in the world right now?

KV can be trivially stored in RAM or even on a spinning disk and retrieved on the order of milliseconds. See LMCache for vLLM for example. In fact it’s so easy it kinda shocks me when Claude Code will sit and recompute my entire KV on a new session after a couple of hours, I guess Anthropic infra is not as optimized as it would seem.

Think about the problem from first principles:

Storing a few GB per user at scale isn’t that hard and was solved years ago. Let’s say I have 20 chat sessions open and the session persists for a day or two, this seems negligible to me as a systems design problem.

Comment by xlayn 2 days ago

I created a patch for llama.cpp that stores the KV cache and the checkpoints on disk instead of deleting them... there is this bug in llama.cpp if you have more than one chat going at once... it causes the KV cache to be lost when you switch between chats... And I can tell you, using Qwen3.6-27B, after one day of use you can have 120-200GB of chats on disk. And yes, it's way, way faster: even if you read it from a spinning disk, it's still faster than recomputing the whole thing...

I guess for a model with 300B or more parameters and a couple million users, with the price of storage increasing as part of RAMageddon, this is also not viable...

Comment by zozbot234 2 days ago

Qwen 27B maxes out at a 16GB context. A nice thing about DeepSeek V4, especially Flash, is that its context size stays tiny even at 1M tokens! Which in turn opens up wide batching on common consumer platforms.

Comment by lostmsu 2 days ago

DeepSeek V4 Flash is 160GB while Qwen 27B is about 27GB. You can't even run DS Flash on consumer platforms, let alone batch it.

Comment by zozbot234 2 days ago

These are the sizes of model weights, not the KV cache. The former are a sparse (for MoE models) read workload that can be streamed from SSD.

Comment by lostmsu 1 day ago

You can't batch MoE

Comment by zozbot234 1 day ago

You need wider batches to get effective reuse of experts in any given layer, but you absolutely can. DeepSeek V4 has tiny KV caches that make this quite feasible. When targeting consumer platforms that only have a limited amount of compute headroom to begin with, the approach is quite reasonable.

Comment by lostmsu 1 day ago

Sounds like you're talking out of your butt instead of doing the math.

Comment by zozbot234 1 day ago

What do you mean by doing the math? If you repeatedly sample n_active experts out of n_total, why wouldn't you expect to get some meaningful probability of reuse/overlap once your batch grows past size 5 or so (for the sparsest MoE models in common use)? And you only need enough reuse to fill the compute headroom which is quite small on consumer platforms (we won't have huge TOPS numbers for the typical integrated GPU in Strix Halo or even the upcoming RTX Spark). Plus if you're a single user running multiple streams in parallel the choice of experts will be highly biased leading to more reuse.
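The overlap argument can be turned into a rough number. The sketch below assumes every token in the batch picks `n_active` of `n_total` experts uniformly and independently in a given layer. That is an assumption made only to get a baseline: real routers are not uniform, and how biased they are is exactly what this subthread disagrees about. The function names and the 256/8 configuration are illustrative, not taken from any specific model.

```python
def expected_distinct_experts(n_total: int, n_active: int, batch: int) -> float:
    """Expected number of different experts touched by `batch` tokens in one layer."""
    # A given expert is skipped by one token with probability 1 - n_active / n_total.
    p_unused = (1 - n_active / n_total) ** batch
    return n_total * (1 - p_unused)


def reuse_factor(n_total: int, n_active: int, batch: int) -> float:
    """Expert activations served per expert actually loaded (1.0 means no reuse)."""
    return batch * n_active / expected_distinct_experts(n_total, n_active, batch)


# Hypothetical sparse MoE layer: 8 active experts out of 256.
for batch in (1, 8, 64):
    print(batch, round(reuse_factor(256, 8, batch), 2))
```

Under uniform routing, reuse grows slowly with batch size. Biased routing within a session would push the factor higher, so measured routing statistics would be needed to settle the question.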

Comment by lostmsu 1 day ago

Yeah, that's what talking out of your butt is literally. "theoretical", no ballparks, ignorant assumptions about expert reuse.

Comment by zozbot234 1 day ago

There have been some very rough experiments with batching on Apple Silicon (and that's not a highly suitable platform since the compute/thermals bottleneck hits sooner than elsewhere) that seem to be broadly consistent with what I argued, showing as much as 2x total decode throughput with an 8-wide batch. That's substantial in this context.

Comment by lostmsu 1 day ago

Assuming you magically use all 128GiB of xRAM you need to read ~32GiB per token in batched mode. On a good SSD that would be 1/3 tokens per second. Cool, 2x that you can do 2/3 tokens per second. Let's assume you are lucky and can actually do 6/7 tokens per second. That's still an extremely far cry from 20+ tokens per second of 27B before any batching.
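The estimate above is a pure bandwidth bound: forward passes per second cannot exceed read bandwidth divided by bytes read per pass. A minimal sketch follows. The SSD bandwidth is not stated in the comment, so the value below is simply the one its 1/3 figure implies (about 10.7 GiB/s), which is an assumption.

```python
GIB = 2 ** 30


def steps_per_second(bytes_read_per_step: float, read_bandwidth: float) -> float:
    """Upper bound on forward passes per second when reads are the bottleneck."""
    return read_bandwidth / bytes_read_per_step


# Assumption: bandwidth back-derived from "32GiB per token" giving 1/3 tokens per second.
ssd_bandwidth = 32 * GIB / 3

single_stream = steps_per_second(32 * GIB, ssd_bandwidth)
aggregate_with_2x_batching = 2 * single_stream
```

Changing `bytes_read_per_step` is where the two sides of this argument differ: more expert reuse across a batch means fewer bytes streamed per generated token.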

Comment by fc417fc802 1 day ago

I don't understand where your numbers are coming from. Why is there a 20x (40x?) slowdown of tps after batching?

Comment by lostmsu 1 day ago

Before batching. The slowdown is because the model does not fit the xRAM so experts will have to be read from SSD on every forward pass. That's why it is impractically slow.

Batching could allow you to generate 10 tokens for 10 different conversations at a time, but it also means that you need to load different experts for different tokens, so it does not help as much as it does for dense models.

Comment by fc417fc802 1 day ago

But IIUC the point is that each expert gets used for more than just the one token. So yes, the tps of a given thread takes a hit because now you're sometimes going to schedule in unrelated experts and it will have to pause. But overall you're utilizing the hardware much more efficiently and so in aggregate there's a speedup.

On top of that (as previously pointed out by zoz) for a single user running a single overarching task the choice of experts is expected to be highly biased.

Comment by lostmsu 1 day ago

> the choice of experts is expected to be highly biased

Why? Why do you think that's the case? Part of the training is balancing load between experts.

> so in aggregate there's a speedup.

Yes. 2x. Over theoretical under 1 tok/s

Comment by fc417fc802 1 day ago

> Why do you think that's the case? Part of the training is balancing load between experts.

That is a fair point. That expectation may have been misplaced on my part. I'm not sufficiently familiar with the details of MoE training.

> The slowdown is because the model does not fit the xRAM so experts will have to be read from SSD on every forward pass.

> 20+ tokens per second of 27B before any batching.

Does the model fit in RAM or not? What is your justification for your stated expectation that the unbatched model will perform 20x faster than the aggregate tps (note, not the single stream tps) of the batched model?

My expectation is that if the unbatched model is 20 tps and batching provides a 2x speedup then each individual stream will be slower but the aggregate throughput should rise to 40 tps. What do you believe me to be missing here?

Comment by lostmsu 17 hours ago

27B does, the op was talking about consumer use of DS v4 Flash, that's 160GB.

Comment by zozbot234 23 hours ago

> Why? Why do you think that's the case? Part of the training is balancing load between experts.

The training balances expert choice across the entire scope of the model. Experiments have consistently shown that within a given session or topic (taken in a broad sense) expert choice is biased in a way that's likely to make caching useful and reuse across a user-specific batch realistic.

Comment by killerstorm 2 days ago

While prefill is bottlenecked by GPU compute time, decode might be bottlenecked by GPU memory bandwidth, as you basically need to go through entire KV cache for each new token. So compression can make it faster - you will use more GPU compute but less memory bandwidth for attention calculation
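The trade described here has a simple break-even condition: compression helps decode only if the extra compute costs less time than the memory reads it avoids. The sketch below assumes decompression is not overlapped with the reads, and every number in it is hypothetical.

```python
def kv_read_seconds(kv_bytes: float, mem_bandwidth: float,
                    compression_ratio: float = 1.0) -> float:
    """Time to stream the (possibly compressed) KV cache once for one decode step."""
    return kv_bytes / compression_ratio / mem_bandwidth


def compression_pays_off(kv_bytes: float, mem_bandwidth: float,
                         compression_ratio: float, decompress_seconds: float) -> bool:
    """True when the read time saved exceeds the added decompression time."""
    saved = (kv_read_seconds(kv_bytes, mem_bandwidth)
             - kv_read_seconds(kv_bytes, mem_bandwidth, compression_ratio))
    return decompress_seconds < saved


# Hypothetical: 8 GB of KV, 1 TB/s memory bandwidth, 4x compression.
cheap = compression_pays_off(8e9, 1e12, 4.0, decompress_seconds=0.001)
costly = compression_pays_off(8e9, 1e12, 4.0, decompress_seconds=0.010)
```

With these made-up numbers, a 1 ms decompression step is a win and a 10 ms one is a loss, so the outcome depends entirely on how cheap the predictor is per decode step.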

Comment by 5kg 2 days ago

Host to device bandwidth (RAM to VRAM) is about 128GB/s for PCIe Gen 6 (x16, per direction). VRAM to GPU bandwidth is about 1.8TB/s for GDDR7 (5090), and 8TB/s for HBM3e (B200). So it can be faster to recompute the KV cache than to offload it and load it back.
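To put those figures side by side, here is a small sketch that reads them as bytes per second (an assumption about the intended units) and applies them to a hypothetical 16 GB KV cache. It ignores latency, protocol overhead, and overlap with compute.

```python
GB = 1e9


def transfer_seconds(n_bytes: float, bytes_per_second: float) -> float:
    """Idealized time to move n_bytes over a link at its peak bandwidth."""
    return n_bytes / bytes_per_second


kv_bytes = 16 * GB  # hypothetical cache size

over_pcie = transfer_seconds(kv_bytes, 128 * GB)    # host RAM to VRAM
from_gddr7 = transfer_seconds(kv_bytes, 1800 * GB)  # VRAM to GPU, 5090-class
from_hbm3e = transfer_seconds(kv_bytes, 8000 * GB)  # VRAM to GPU, B200-class

# Recomputation beats offloading only if it finishes in less than `over_pcie` seconds.
pcie_penalty = over_pcie / from_gddr7
```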

Comment by btown 2 days ago

> a few GB per user at scale

While this might seem to be true for casual users, I recall that one of the reasons for Anthropic's recent changes for only retaining KV cache for an hour or so, was that many users just have one massive ongoing session that they continue on with multiple unrelated queries (as one would in a single-thread "group chat"). And this is hard to distinguish from someone who wants that context for their seemingly-unrelated query to apply tone etc.

So in practice, there are many casual users who are typing their Google-esque searches against a 100k+ token context window - and it's at that point where things balloon into 300GB+ KV caches to maintain.

I wouldn't be surprised if we see new UXs around subsidized plans starting to encourage resetting the context window more often.

Comment by zozbot234 2 days ago

300GB of context for a single session is huge though. Modern local models max out at a whole lot less than that.

Comment by nicman23 2 days ago

edge

Comment by jbellis 2 days ago

Because you need kv proportional to context length during inference of a single token to avoid quadratic recomputation. So compressing the kv lets you handle longer contexts in the same amount of vram.
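The quadratic-versus-linear point can be checked by counting how many per-token K/V computations a generation run performs. This is a counting sketch only (the function names are made up): it ignores the attention dot products themselves, which still grow with context length either way.

```python
def kv_computations_without_cache(n_tokens: int) -> int:
    """Each step recomputes K/V for every token so far: 1 + 2 + ... + n."""
    return sum(step for step in range(1, n_tokens + 1))


def kv_computations_with_cache(n_tokens: int) -> int:
    """Each step computes K/V for the newest token only and reuses the rest."""
    return n_tokens


for n in (4, 1000):
    print(n, kv_computations_without_cache(n), kv_computations_with_cache(n))
```

The cached version trades that recomputation for memory that grows with context length, which is the memory a compression scheme would target.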

Comment by ssivark 2 days ago

Note that any cache (eg LRU-eviction) is just a specific speculative model for future usage :-)

The cache can be backed by hardware/lookup, or by a cheap computation. The line between functions and data is really blurry.
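A small illustration of that blurry line, using Python's standard `functools.lru_cache`: the same call is answered from stored data when the eviction policy guessed right about future usage, and by recomputation when it guessed wrong. The function `square` and the call pattern are arbitrary examples.

```python
from functools import lru_cache
from typing import List

computed: List[int] = []  # records every time the function body actually runs


@lru_cache(maxsize=2)
def square(x: int) -> int:
    computed.append(x)
    return x * x


# 1 and 2 are computed; the second 1 is a hit; 3 evicts 2 (least recently used);
# asking for 2 again is therefore a miss and gets recomputed.
results = [square(v) for v in (1, 2, 1, 3, 2)]
info = square.cache_info()
```

Callers get the same answers either way. Only the cost changes depending on whether the "speculation" about reuse was right.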

Comment by mycall 2 days ago

Would you say it is homoiconic, similar to LISP where the syntax of the language is the AST; so, data can become code (Macros) and code can be data (the S-Expression)?

Comment by 0-_-0 3 days ago

You can use the original model to compress the kv cache and get ∞x compression, since the prediction is perfect. The cost is time, and I don't see how this could be worth it.

Comment by wongarsu 2 days ago

The tradeoff gets better the bigger your primary model, and probably with bigger batch sizes. The KV cache can consume a lot of expensive VRAM, and the VRAM and compute costs of the predictor model become a small fraction of the cost of the primary model

For serving a 1T model with 16 concurrent requests this could make a lot of sense. For an 8B model with a single request, far less so.
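A rough way to express that tradeoff: the VRAM saved scales with the number of concurrent requests, while the predictor's weights are a fixed cost. The sketch below assumes one predictor shared by all requests and ignores decompression buffers. All sizes are invented for illustration and are not measurements of any model.

```python
GB = 1e9


def net_vram_saved(kv_bytes_per_request: float, n_requests: int,
                   compression_ratio: float, predictor_bytes: float) -> float:
    """KV bytes freed across all requests minus the predictor's own footprint."""
    freed = n_requests * kv_bytes_per_request * (1 - 1 / compression_ratio)
    return freed - predictor_bytes


# Hypothetical large deployment: big per-request caches, many requests.
large = net_vram_saved(40 * GB, n_requests=16, compression_ratio=4.0,
                       predictor_bytes=20 * GB)

# Hypothetical small local setup: one request, predictor as big as the cache.
small = net_vram_saved(2 * GB, n_requests=1, compression_ratio=4.0,
                       predictor_bytes=2 * GB)
```

With these made-up inputs the large case comes out strongly positive and the small case negative, which is the shape of the argument above. Compute cost is a separate question that this sketch does not cover.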

Comment by 0-_-0 2 days ago

This can't be used to save VRAM in practice. To generate a new token with the primary model, you first need to decompress the cache, which involves regenerating the whole sequence from scratch. I.e. generate 1 million tokens with the small model to generate 1 with the large.

Comment by syllogistic 2 days ago

How do these results compare with the engram based approach from deepseek?

Comment by mirekrusin 3 days ago

If the “speculative” approach works so well in different contexts, why not make it first class and use it everywhere, possibly recursively?

Comment by saagarjha 2 days ago

Speculation is only worth it if you can profit from it. Not every context allows this or has a similar idea of what can be speculated.

Comment by mirekrusin 2 days ago

It works very well on dense models, imho a great alternative to MoE. As verification is cheaper than generation, it could be a fundamental, first-class primitive, and you could maybe even recurse on it, do live distillation during inference, etc.

MoE is more hardcoded, pre determined, speculation is much more dynamic, malleable after training.

This paper actually proposes direction of aligning architecture to aid speculation as future work.

Comment by doctorpangloss 2 days ago

Multi-token prediction is a good enhancement to training. It isn't necessarily useful for inference. Other speculative decoding like EAGLE is. It is specific to the technology and the authors of these things write about it.

Comment by monster_truck 2 days ago

There is no compression taking place here.

Comment by boutell 2 days ago

Isn't that nitpicking? It's a smaller representation of the data, if you have a certain appetite for decompression time. It could conceivably be worth it. I think it would make a great level 2 cache for older chats.

Comment by liuliu 2 days ago

It is a “research note”. It might not pan out, and you might say it doesn’t deserve the attention on the internet. But it did suggest something that resembles compression, just with no experiment done for that.

Comment by zzzoom 2 days ago

Isn't the delta fed to an arithmetic coder?
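If the deltas are entropy coded, as this question supposes (the thread itself does not confirm which coder is used), then what matters is how skewed their distribution is. An ideal arithmetic coder approaches the empirical Shannon entropy in bits per symbol. The sketch below compares that bound for made-up raw values and for their residuals against a made-up prediction.

```python
from collections import Counter
from math import log2
from typing import List, Sequence


def entropy_bits(symbols: Sequence[int]) -> float:
    """Empirical Shannon entropy in bits per symbol (ideal coder lower bound)."""
    n = len(symbols)
    counts = Counter(symbols)
    return -sum(c / n * log2(c / n) for c in counts.values())


raw: List[int] = [10, 11, 13, 14, 16, 17, 19, 20]        # made-up cache values
predicted: List[int] = [10, 10, 13, 13, 16, 16, 19, 19]  # made-up predictor output
residuals = [r - p for r, p in zip(raw, predicted)]

raw_bits = entropy_bits(raw)             # eight distinct values
residual_bits = entropy_bits(residuals)  # only two distinct values
```

In this toy case the residuals need a third of the bits per symbol that the raw values do. A better predictor concentrates the residuals further, and that concentration is where the compression comes from.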

Comment by porridgeraisin 3 days ago

I have yet to do a "deep dive" into the results, but what a well-written article. An LLM could _never_ write so crisply.