High-Fidelity KV Cache Summarization Using Entropy and Low-Rank Reconstruction
Posted by jchandra 2 days ago
Comments
Comment by jchandra 2 days ago
Most methods (Top-K, sliding window) prune tokens. This works on average, but fails selectively — a few tokens cause large errors when removed.
I tried reframing the problem as approximating the attention function: Attn(Q, K, V)
Prototype:

- entropy → identify weak tokens
- OLS → reconstruct their contribution
- SVD → compress them
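For anyone who wants the three stages more concretely, here's a rough NumPy sketch of the pipeline. Everything here is illustrative rather than the actual prototype: the function name, the entropy-contribution scoring, and the keep ratio are all my own simplifications.

```python
import numpy as np

def summarize_kv(K, V, Q, keep_ratio=0.5, rank=4):
    """Sketch of entropy -> OLS -> SVD KV-cache summarization.

    K, V: (n, d) cached keys/values; Q: (m, d) recent queries.
    Returns kept indices, an OLS map folding weak keys onto kept
    ones, and a low-rank approximation of the weak tokens' values.
    """
    # 1. Entropy: score each cached token by its contribution
    #    -p*log(p) to the entropy of the average attention it gets
    #    from recent queries; low contributors are "weak".
    scores = Q @ K.T / np.sqrt(K.shape[1])            # (m, n)
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    p = attn.mean(axis=0)                             # avg mass per token
    ent = -(p * np.log(p + 1e-12))
    n_keep = max(1, int(keep_ratio * K.shape[0]))
    order = np.argsort(-ent)
    keep, weak = order[:n_keep], order[n_keep:]

    # 2. OLS: express each weak key as a least-squares combination
    #    of the kept keys, so its attention contribution can be
    #    approximately reconstructed from the kept set.
    W, *_ = np.linalg.lstsq(K[keep].T, K[weak].T, rcond=None)

    # 3. SVD: compress the weak tokens' values to rank `rank`.
    U, S, Vt = np.linalg.svd(V[weak], full_matrices=False)
    r = min(rank, len(S))
    V_weak_lr = (U[:, :r] * S[:r]) @ Vt[:r]

    return keep, W, V_weak_lr
```

The real system would apply this per head and periodically (to amortize the SVD cost mentioned further down), but the shape of the computation is the same.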
Early results show lower error than Top-K at low memory, sometimes even lower memory overall.
This is still a small research prototype, would appreciate feedback or pointers to related work.
Comment by cowartc 9 hours ago
Comment by jchandra 7 hours ago
Comment by bee_rider 10 hours ago
It would be sort of surprising if an SVD-based opportunity was missed (since it is such a familiar tool). But, your entropy and least-squares ideas are necessary to set that up, so I guess it makes sense that you’d find some new territory here.
Comment by jchandra 7 hours ago
On downsides: definitely a few. The biggest one is latency - SVD is fairly heavy, so even though it’s amortized (runs periodically, not per token), it still adds noticeable overhead. It’s also more complex than simple pruning, and I haven’t validated how well this holds on real downstream tasks yet.
This is very much a research prototype right now, more about exploring a different tradeoff space than something ready for production.
Comment by jbellis 7 hours ago
The commentary says that Top-K "degrades rapidly at low ratios", but the same can be seen for HAE (Entropy + OLS).
Comment by jchandra 6 hours ago
That said, the gains are modest right now; this is still a research prototype exploring the tradeoff, and there's clearly more work to be done.
Comment by bee_rider 4 hours ago
Comment by rishabhaiover 7 hours ago
Comment by jchandra 6 hours ago
Comment by vivahir215 2 days ago
Comment by scythmic_waves 8 hours ago
> The primary trade-off observed is the increased calculation time for OLS and SVD steps. Consequently, the next phase of this work involves implementing these operations within custom Triton kernels to amortize latency. By viewing the cache through the lens of reconstruction fidelity rather than just memory capacity, we can develop more sustainable architectures for long-context inference.
Reading between the lines, the increase in latency was so significant that they didn't want to include it before they had a chance to try and optimize the problem away first.
Still interesting research. Hope they get good results!
Comment by jchandra 7 hours ago
Yeah, the latency hit is definitely real. That said, most of what I’ve run so far is CPU-bound, which likely exaggerates it quite a bit, so I didn’t want to draw strong conclusions from that.
Would need proper GPU implementations to really understand where it lands.
Comment by jchandra 2 days ago
That said, it’s still heavier than Top-K. I haven’t benchmarked end-to-end latency yet; this is mainly exploring the accuracy vs memory tradeoff.
Comment by thw20 6 hours ago
Maybe the idea of a query calibration matrix Rxx is of interest to the author!
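For other readers: my reading of "Rxx" here is the second-moment matrix of recent queries, used so that reconstruction error is weighted toward directions queries actually visit rather than measured isotropically. A minimal sketch of how that would plug into the OLS step (the variable names, shapes, and regularizer are my own assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.standard_normal((100, 64))        # recent queries, (m, d)

# Query calibration matrix: second moment of the queries.
Rxx = Q.T @ Q / Q.shape[0]                # (d, d), symmetric PSD

# Cholesky factor of (regularized) Rxx gives a whitening transform;
# measuring error in this metric penalizes mistakes along directions
# that queries frequently occupy.
L = np.linalg.cholesky(Rxx + 1e-6 * np.eye(Q.shape[1]))

# Plain OLS solves  min ||K_keep.T @ W - K_weak.T||; the calibrated
# version solves    min ||L.T @ (K_keep.T @ W - K_weak.T)||.
K_keep = rng.standard_normal((32, 64))    # kept keys
K_weak = rng.standard_normal((8, 64))     # pruned ("weak") keys
W, *_ = np.linalg.lstsq(L.T @ K_keep.T, L.T @ K_weak.T, rcond=None)
```

The appeal is that tokens are then only "weak" relative to the query distribution the model actually sees, not in an absolute geometric sense.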
Comment by jchandra 6 hours ago
Comment by aesthesia 7 hours ago
Comment by jchandra 7 hours ago
I’ve started trying this out with actual models, but I’m currently running things CPU-bound, so it’s pretty slow. Would ideally want to try this properly on GPU, but that gets expensive quickly.
So yeah, still very much a research prototype — but validating this on real models/data is definitely the next step.