Do transformers need three projections? Systematic study of QKV variants

Posted by Anon84 5 days ago

Comments

Comment by amluto 5 days ago

Hint for authors: when discussing linear algebra (or really most other kinds of math), follow normal conventions. In this case, the convention would be that - (the minus sign) means subtraction. It does not mean "and also", especially when you sandwich it between two variables that represent matrices.

I read the paper with much head scratching all the way through sections 1 and 2 and part of 3 before I figured out that, no, really, the description "Q-K=V" does not mean "Q minus K equals V" (the head scratching was because a bunch of their descriptions and symmetry comments really make little sense if you think "Q minus K equals V"). If you want to say that "K equals V", please spell it "K=V" :)

I am curious whether it makes any sense at all to enforce a more general linear constraint on the query, key and value attention matrices along the lines of Q-K=V.

It is an entertaining paper. I admit I'm surprised that K=V appears to work as well as it does -- it seems like it's almost enforcing a sort of model where the query is a guess as to what the value is and the attention head returns a (softmaxed) value that is closest to the query's guess. Maybe it works because the sequences are short and the dimension is high and there's plenty of room for interesting results to fit in the merged key/value space.
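To make the "query is a guess at the value" picture concrete, here is a minimal sketch of single-query attention where each row acts as both key and value. The function name `shared_kv_attention` and the toy numbers are made up for illustration; this is not the paper's implementation.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def shared_kv_attention(query, kv_rows):
    """Single-query attention where each row serves as both key and value."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, row) / scale for row in kv_rows])
    dim = len(kv_rows[0])
    output = [sum(w * row[j] for w, row in zip(weights, kv_rows)) for j in range(dim)]
    return weights, output

# Toy data (hypothetical): three shared key/value rows and one query.
kv_rows = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]]
query = [3.0, 0.5]
weights, output = shared_kv_attention(query, kv_rows)
print(weights, output)
```

With these numbers almost all of the weight lands on the first row, the one best aligned with the query, so the output is close to that row.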

Comment by kanbankaren 5 days ago

It confused me too.

An n-tuple notation would have been more readable and mathematically accurate, like (Q=K, V), (Q, K=V), and (Q=K=V).

Comment by amemi 5 days ago

> Maybe it works because the sequences are short and the dimension is high and there's plenty of room for interesting results to fit in the merged key/value space.

In fact, on the second-to-last page of the paper, they discuss this very problem. There is a clear correlation between performance and increasing sequence length for the Q-K=V model. While limited to a small n=3 sample of lengths (512, 1024, 2048), the degradation decreases from 5.4% to 2.2% as context is increased, suggesting that short sequences are unlikely to be the reason K=V performs acceptably.

Comment by xiaoyu2006 5 days ago

Yeah the weird notation confused me too. Their own Limitations section also says their experiments are too small. I am quite curious how it will play out at larger scale, but unironically I cannot afford the hardware lol.

Comment by joshuamoyers 4 days ago

I think the primary reason it works is the difference between K and Q, which is not all that obvious: it allows the model to have an asymmetric relationship between tokens, so one token can attend to another without the reverse being true. It seems to me that if you just have a single shared matrix, you're representing a symmetric relationship, which might degrade the quality of reasoning over a set of tokens, but is also probably possible.
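The symmetry point can be checked on the pre-softmax scores. A small sketch with made-up 2x2 projection matrices (`w_q`, `w_k` and the token matrix `x` are hypothetical): with separate projections the score matrix is generally not symmetric, while tying Q and K to the same projection forces it to be.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def scores(x, w_q, w_k):
    """Pre-softmax attention scores: (x W_q)(x W_k)^T."""
    return matmul(matmul(x, w_q), transpose(matmul(x, w_k)))

def is_symmetric(m, tol=1e-9):
    n = len(m)
    return all(abs(m[i][j] - m[j][i]) <= tol for i in range(n) for j in range(n))

# Toy inputs (hypothetical): three tokens with two features each.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_q = [[1.0, 2.0], [0.0, 1.0]]
w_k = [[0.0, 1.0], [3.0, 1.0]]

separate = scores(x, w_q, w_k)  # separate Q and K projections
tied = scores(x, w_q, w_q)      # Q and K share one projection

print(is_symmetric(separate), is_symmetric(tied))
```

Note that this only concerns the raw scores. Softmax normalizes each row separately (and a causal mask removes half the entries), so the final attention weights need not be symmetric even when the scores are.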

Comment by joshuamoyers 4 days ago

It seems to be similar to the class of optimizations associated with linear or state space attention. What models often do once they figure out an optimization like this is create a ratio between full-resolution blocks and blocks that have the optimization implemented.

Comment by ssivark 5 days ago

Would it have killed them to use a comma instead?!

Comment by sfink 5 days ago

Wha? Why didn't they use Q=K=V for that?

Comment by simsla 4 days ago

The notation is supposed to mean: you have a matrix Q, and also a shared K=V matrix.

I agree with GP that it's super confusing to use the minus sign as a delimiter between formulas. The tuple notation suggested elsewhere would be way clearer.

Comment by semiinfinitely 5 days ago

It's not a math paper

Comment by volemo 5 days ago

Does it not being an English philology paper mean they are free to spell “fish” as “ghoti”?

Comment by srean 5 days ago

Definitely an applied maths paper given that it has been published under CS/ML and been accepted at ICML.

Comment by semiinfinitely 4 days ago

It's not even applied math

Comment by canjobear 5 days ago

It’s not typeset in math mode so you can’t expect the hyphen to correspond to minus.

Comment by conformist 5 days ago

By this logic a lot of applied maths papers become “does not compile” :D

Comment by Sharlin 4 days ago

Cannot tell whether sarcasm or not.

Comment by Lerc 5 days ago

I can see why the QKV gets used but I can't help but think that there's got to be a better mechanism for turning a pair of vectors into a new vector and a significance field.

Geometrically I imagine the process of attention like picking up a bunch of vectors and spinning and squishing them in many-D until you can find a crack where you can see all the way through, then leveraging that crack to separate what you want.

I doubt that's strictly accurate, but it might be close enough that it makes me think that if you were doing that with a bunch of bananas, it would be much easier to find the way through if you could also bend the bunch so they were all straight.

It's always the trade off of a smart complex operation against an absolute crapload of dumb ones.

Comment by ActorNightly 3 days ago

>I can't help but think that there's got to be a better mechanism

There is.

Transformers are basically autoencoders on the decode step - they take a compressed set of information and expand it into three matrices which then get combined back into one matrix.

You can unroll the entire self attention step into fully connected layers, just with a lot of zeros for things that don't get multiplied together.

So it stands to reason that there is probably an optimal form of weights that does the same thing as current transformers.

Comment by WithinReason 5 days ago

> I can't help but think that there's got to be a better mechanism

What matters is not how good it is in isolation, but how well it scales to giant datasets and supercomputers. So far attention scales the best. It's the most "brute force"-able mechanism.

Comment by logicchains 5 days ago

>It's always the trade off of a smart complex operation against an absolute crapload of dumb ones.

You can't make attention more specialized without making it less general, which makes LLMs worse as a universal approximator.

Comment by nullpoint420 5 days ago

It kinda reminds me of general relativity and gravity bending space-time. I'm sure I sound nuts right now, but the model fits in my head.

Comment by v9v 5 days ago

Somewhat relevant is a blog-post that likens attention to kernel smoothing: https://bactra.org/notebooks/nn-attention-and-transformers.h... (as discussed before in https://news.ycombinator.com/item?id=38756888)

Comment by xiaoyu2006 5 days ago

It will be great and amusing if it actually turns out that we have been making transformers overly complex. The code repo is missing tho...

Comment by ares623 5 days ago

Gets the juices flowing though..

Comment by foldl2022 5 days ago

Gemma-4 E2B/E4B models reuse the K-V cache from other layers, which does things in a "transposed" way: instead of reusing Q/K/V matrices within a single layer, they reuse across different layers.

Comment by in-silico 5 days ago

These types of ablation studies are always good. However, I'm not sure how generalizable the language model findings here are.

Their 1.2B model was trained on only 10B tokens, which is less than half of the chinchilla compute optimal number. Modern overtrained 1B LLMs are trained on the order of 10T tokens (1000x more).
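The numbers above can be sanity-checked with a few lines of arithmetic. This sketch assumes the commonly quoted rule of thumb of roughly 20 training tokens per parameter for Chinchilla-optimal training; that constant is an approximation brought in here, not a figure from the paper.

```python
# Assumption: ~20 tokens per parameter as a rough Chinchilla-optimal ratio.
TOKENS_PER_PARAM = 20

params = 1.2e9              # model size mentioned above
trained_tokens = 10e9       # tokens the model was trained on
overtrained_tokens = 10e12  # order of magnitude cited for modern 1B-class LLMs

chinchilla_tokens = TOKENS_PER_PARAM * params
fraction_of_optimal = trained_tokens / chinchilla_tokens
overtraining_ratio = overtrained_tokens / trained_tokens

print(f"compute-optimal tokens (rule of thumb): {chinchilla_tokens:.2e}")
print(f"fraction of that actually used: {fraction_of_optimal:.2f}")
print(f"modern token budget vs. this run: {overtraining_ratio:.0f}x")
```

Under that assumption the compute-optimal budget is about 24B tokens, so 10B is a bit over 40% of it, consistent with "less than half".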

This is important because, from my own experience, simplifications and alternatives to standard attention can look fine in the under-trained regime but lag after over-training. This happens because attention has very little out-of-the-gate inductive bias, so it takes a lot of training for the expressiveness to really shine through.

I can't fault the authors since longer training runs cost money, but it warrants pointing out.

I'm also disappointed that they didn't report reasoning benchmark results for the Q=K-V case, since that is by far the most theoretically interesting case (in my eyes).

Comment by janalsncm 5 days ago

It’s a data point. I could imagine in a hardware constrained setting we might not care about training on enormous token counts, and on smaller devices it’s great if we can simplify the architecture.

I agree that this isn’t proof that it scales to trillions of tokens, but this does show a scaled up experiment would be worth a shot.

Comment by Philpax 5 days ago

The Chinchilla scaling laws give you a minimum for the number of tokens you should be using for a given size: if you can't meet what they suggest for that size, you should shrink the size, as, otherwise, the capacity of the model is going to waste.

I do agree that it is a datapoint, but GP's point is that this model was undertrained, so it's hard to draw the same conclusions from it that we would from other research.

Comment by ACCount37 5 days ago

I wonder if some of those synthetics that specifically burn in attention inductive bias could help there - i.e. by getting attention to converge faster than it normally would?

Comment by semessier 5 days ago

V being collinear is obvious, the question is/was also which additional orthogonal projections such as camera position for vision would improve the transformer.

Comment by hollosi 5 days ago

I would not be surprised if it turned out the exact attention mechanism does not really matter, similarly to the sigmoid, ReLU, GELU movement, only the speed of calculation - and QKV is pretty good at that on the GPUs.

Comment by nbardy 5 days ago

This has been my thought for a long time. I think all that matters from attention is that there is crosswise comparison going on.

You need some amount of parallel compute and some amount of global comparison.

And the rest is basically a way to add parameters and scale.

(This is in theory, in practice you can get a lot of small % stability and efficiency improvements that really compound in algorithmic details of model architecture)

Comment by soVeryTired 4 days ago

Can anyone explain to me why Q and K are both needed? They only ever appear as a pair, so why can’t you just define a matrix A = QK and learn that directly?
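For the weight matrices themselves, the identity behind this question can be checked numerically. A minimal sketch with made-up toy matrices; the sizes d_model=1024 and d_head=64 in the parameter count are hypothetical, not values from the paper. The merged matrix A = W_q W_k^T gives the same score as the two separate projections, but it has rank at most d_head and, stored as a dense matrix, more parameters.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

# Toy projections (hypothetical): d_model = 4, d_head = 2.
w_q = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, -1.0]]
w_k = [[0.5, 1.0], [1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
x = [[1.0, 2.0, 0.0, 1.0]]  # one query-side token
y = [[0.0, 1.0, 1.0, 2.0]]  # one key-side token

# Score via two separate projections: (x W_q) . (y W_k)
factored = matmul(matmul(x, w_q), transpose(matmul(y, w_k)))[0][0]

# Score via a single merged matrix A = W_q W_k^T (d_model x d_model).
merged_a = matmul(w_q, transpose(w_k))
merged = matmul(matmul(x, merged_a), transpose(y))[0][0]

def projection_params(d_model, d_head):
    """Parameter counts for the factored and the dense merged form."""
    return {"factored": 2 * d_model * d_head, "merged": d_model * d_model}

sizes = projection_params(1024, 64)
print(factored, merged, sizes)
```

So learning A directly is possible in principle; the factored form is a low-rank parameterization of it that is cheaper whenever 2 * d_head is smaller than d_model.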

Comment by mattalex 4 days ago

Because the size of the attention matrix depends on the number of tokens (this is what makes attention N^2). If you don't care about having a flexible number of input tokens (e.g. in image processing) you can learn a fixed routing matrix. This is known as an MLP mixer https://arxiv.org/pdf/2105.01601 : you have one layer that processes each token in isolation ("vertical MLP") but ignores the inter-token connections, followed by a layer that combines between tokens ("horizontal MLP") that treats the internals of every token identically.

Comment by pseudo-usama 5 days ago

It's interesting to see people are still experimenting with the core concepts of transformers

Comment by 7e 5 days ago

More evidence that the original Transformer authors didn't really know what they were doing, but they did have access to more cheap compute than anyone else.

Comment by spindump8930 4 days ago

Can you share the specific part of this work that demonstrates better scaling than original transformers? Also note that many of the changes to that architecture, that have been proven in their use at actual scale, were brought about by members of the original team. Most notably Noam Shazeer.

Comment by jephs 5 days ago

I'm terribly sorry, but scaling curves or GTFO. Any random pile of linear algebra works fine-ish at small scales. Very few random piles of linear algebra push the Pareto envelope at large scales.

Comment by ketchup32613 5 days ago

Do you want to see scaling curves wrt data and param size? I agree that 1.2B and 10B tokens is not representative, but what scale of parameters and dataset sizes would be convincing?

Comment by zxexz 5 days ago

Not to sound facetious, but perhaps enough runs at different param/token sizings to define a curve?

Comment by WithinReason 5 days ago

Not everyone can afford millions to publish a paper.

Comment by spindump8930 4 days ago

That's why you do several small and medium scale tests, fit a curve, and ideally show that the trend persists at several scales. Not a single large or medium run - see the other comments down thread for example sizes.
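As a rough sketch of what "fit a curve" can look like: a least-squares fit of a power law loss = a * size^(-b) in log-log space. The loss values below are synthetic, generated from made-up constants purely to show the mechanics; they are not results from the paper or from any real run. The four sizes are the ones suggested elsewhere in this thread.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - b * log(size); returns (a, b)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

sizes = [150e6, 300e6, 600e6, 1.2e9]

# Synthetic data: TRUE_A and TRUE_B are made-up constants, not measurements.
TRUE_A, TRUE_B = 50.0, 0.1
synthetic_losses = [TRUE_A * s ** -TRUE_B for s in sizes]

a, b = fit_power_law(sizes, synthetic_losses)
print(f"fitted a={a:.3f}, b={b:.3f}")
```

With real runs you would fit one curve per architecture variant and compare exponents, to see whether the gap between them grows or shrinks with scale.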

Comment by Der_Einzige 4 days ago

This exact mentality is cancer for peer review/the industry. We all know who you are if you are using 1000+ TPUs, and yes you do get a "buff" to your peer review scores because people know where you work.

Fuck your scaling curves. More research labs need to #yolo and try stuff that doesn't have good scaling behavior proven yet. State Space models have continued to take forever to proliferate despite being objectively good because only the god dang Chinese understand that you actually need to #yolo sometimes like making some of your layer state space layers in Hunyuan-T1.

Comment by jephs 4 days ago

Scaling curves don't need to be drawn at particularly enormous parameter counts to be useful! If you can do a 300M and 1.2B run (like the authors do here), then you can do 150M, 300M, 600M, and 1.2B runs with only 50% more resources, and get a much better sense for whether effects seem to amplify or diminish as scale increases.
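The "50% more resources" figure can be reproduced with back-of-the-envelope arithmetic. This sketch assumes training cost is proportional to parameters times tokens, a common approximation that is an assumption here, and compares two cases: every run sees the same number of tokens, or tokens scale with model size.

```python
def relative_cost(param_counts, tokens_scale_with_params):
    """Total training cost in arbitrary units, assuming cost ~ params * tokens."""
    # Fixed tokens: cost ~ params. Tokens proportional to params: cost ~ params^2.
    exponent = 2 if tokens_scale_with_params else 1
    return sum(p ** exponent for p in param_counts)

baseline = [300e6, 1.2e9]             # the two runs in the paper
ladder = [150e6, 300e6, 600e6, 1.2e9]  # the four-run ladder suggested above

extra_fixed_tokens = (relative_cost(ladder, False)
                      / relative_cost(baseline, False) - 1)
extra_scaled_tokens = (relative_cost(ladder, True)
                       / relative_cost(baseline, True) - 1)

print(f"extra cost, same tokens per run: {extra_fixed_tokens:.0%}")
print(f"extra cost, tokens scale with size: {extra_scaled_tokens:.0%}")
```

With the same token budget per run the ladder costs 50% more than the two original runs; if tokens scale with model size, the two added runs are cheaper still, at about 25% extra.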

Comment by spindump8930 4 days ago

Exactly. Good peer reviewers understand that you can also move down on the scaling curve, not just up. Also laughable to try a "yolo" run without validating a scaling ladder/curve.

Comment by Hugsun 5 days ago

Is there anything of value in this project?

It sounds interesting at a glance, but it seems to be AI slop. So it's hard to tell if there are any interesting discoveries there, or just some worthless results described with performatively advanced language.

Comment by dnnddidiej 5 days ago

No one got fired for choosing QKV I guess