Do transformers need three projections? Systematic study of QKV variants
Posted by Anon84 5 days ago
Comments
Comment by amluto 5 days ago
I read the paper with much head scratching all the way through sections 1 and 2 and part of 3 before I figured out that, no, really, the description "Q-K=V" does not mean "Q minus K equals V" (the head scratching was because a bunch of their descriptions and symmetry comments really make little sense if you think "Q minus K equals V"). If you want to say that "K equals V", please spell it "K=V" :)
I am curious whether it makes any sense at all to enforce a more general linear constraint on the query, key and value attention matrices along the line of Q-K=V.
It is an entertaining paper. I admit I'm surprised that K=V appears to work as well as it does -- it seems like it's almost enforcing a sort of model where the query is a guess as to what the value is and the attention head returns a (softmaxed) value that is closest to the query's guess. Maybe it works because the sequences are short and the dimension is high and there's plenty of room for interesting results to fit in the merged key/value space.
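To make the "query is a guess at the value" reading concrete, here is a minimal NumPy sketch of a single attention head in which keys and values share one projection. It is an illustration only, not the paper's implementation: the names `tied_kv_attention`, `w_q`, and `w_kv` are hypothetical, the shapes are arbitrary, and there is no causal mask or multi-head split.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def tied_kv_attention(x, w_q, w_kv):
    """Single-head attention where keys and values share one projection (K=V).

    Hypothetical helper for illustration; no causal mask is applied.
    """
    q = x @ w_q                 # queries, shape (seq, d_head)
    kv = x @ w_kv               # one projection used as both K and V
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    # Each output row is a softmax-weighted average of the rows of kv,
    # weighted toward the rows that best match the query's "guess".
    return weights @ kv, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 8   # arbitrary sizes for the sketch
x = rng.normal(size=(seq_len, d_model))
w_q = rng.normal(size=(d_model, d_head))
w_kv = rng.normal(size=(d_model, d_head))
out, weights = tied_kv_attention(x, w_q, w_kv)
```

Because the output is a convex combination of the same vectors the query is scored against, the head can only return points inside the hull of the key/value vectors, which is the picture described above.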
Comment by kanbankaren 5 days ago
An n-tuple notation would have been more readable and mathematically accurate, like (Q=K, V), (Q, K=V), and (Q=K=V).
Comment by amemi 5 days ago
In fact, on the second-to-last page of the paper, they discuss this very problem. There is a clear correlation between performance and increasing sequence length for the Q-K=V model. While limited to a tight n=3 sample across lengths of 512, 1024, and 2048, the degradation decreases from 5.4% to 2.2% as context is increased, suggesting that shorter sequences are unlikely to be the reason K=V performs acceptably.
Comment by xiaoyu2006 5 days ago
Comment by joshuamoyers 4 days ago
Comment by joshuamoyers 4 days ago
Comment by ssivark 5 days ago
Comment by sfink 5 days ago
Comment by simsla 4 days ago
I agree with GP that it's super confusing to use the minus sign as a delimiter between formulas. The tuple notation suggested elsewhere would be way clearer.
Comment by semiinfinitely 5 days ago
Comment by volemo 5 days ago
Comment by srean 5 days ago
Comment by semiinfinitely 4 days ago
Comment by canjobear 5 days ago
Comment by conformist 5 days ago
Comment by Sharlin 4 days ago
Comment by Lerc 5 days ago
Geometrically, I imagine the process of attention like picking up a bunch of vectors and spinning and squishing them in many-D until you can find a crack where you can see all the way through, then leveraging that crack to separate what you want.
I doubt that's strictly accurate, but it might be close enough that it makes me think that if you were doing that with a bunch of bananas, it would be much easier to find the way through if you could also bend the bunch so they were all straight.
It's always the trade-off of a smart complex operation against an absolute crapload of dumb ones.
Comment by ActorNightly 3 days ago
There is.
Transformers are basically autoencoders on the decode step - they take a compressed set of information and expand it into three matrices, which then get combined back into one matrix.
You can unroll the entire self attention step into fully connected layers, just with a lot of zeros for things that don't get multiplied together.
So it stands to reason that there is probably an optimal form of weights that does the same thing as current transformers.
Comment by WithinReason 5 days ago
What matters is not how good it is in isolation, but how well it scales to giant datasets and supercomputers. So far attention scales the best. It's the most "brute force"-able mechanism.
Comment by logicchains 5 days ago
You can't make attention more specialized without making it less general, which makes LLMs worse as universal approximators.
Comment by nullpoint420 5 days ago
Comment by v9v 5 days ago
Comment by xiaoyu2006 5 days ago
Comment by ares623 5 days ago
Comment by foldl2022 5 days ago
Comment by in-silico 5 days ago
Their 1.2B model was trained on only 10B tokens, which is less than half of the Chinchilla compute-optimal number. Modern overtrained 1B LLMs are trained on the order of 10T tokens (1000x more).
This is important because, from my own experience, simplifications and alternatives to standard attention can look fine in the under-trained regime but lag after over-training. This happens because attention has very little out-of-the-gate inductive bias, so it takes a lot of training for the expressiveness to really shine through.
I can't fault the authors since longer training runs cost money, but it warrants pointing out.
I'm also disappointed that they didn't report reasoning benchmark results for the Q=K-V case, since that is by far the most theoretically interesting case (in my eyes).
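The token-budget arithmetic in this comment can be checked in a few lines. This is a rough sketch, assuming the commonly quoted rule of thumb of about 20 training tokens per parameter for Chinchilla-style compute-optimal training; that ratio is an assumption brought in here, not a figure from the thread or the paper.

```python
# Assumed rule of thumb: ~20 training tokens per parameter (Chinchilla-style).
TOKENS_PER_PARAM = 20

params = 1.2e9             # model size quoted in the comment
trained_tokens = 10e9      # tokens the model was trained on
overtrained_tokens = 10e12 # order of magnitude quoted for modern 1B models

optimal_tokens = TOKENS_PER_PARAM * params            # about 24B tokens
fraction_of_optimal = trained_tokens / optimal_tokens # below one half
overtraining_factor = overtrained_tokens / trained_tokens

print(f"compute-optimal budget: {optimal_tokens:.3g} tokens")
print(f"fraction actually used: {fraction_of_optimal:.2f}")
print(f"modern budget is {overtraining_factor:.0f}x larger")
```

Under that assumption the 10B-token run covers a bit over 40% of the compute-optimal budget, consistent with "less than half", and the 10T figure is three orders of magnitude beyond it.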
Comment by janalsncm 5 days ago
I agree that this isn’t proof that it scales to trillions of tokens, but this does show a scaled up experiment would be worth a shot.
Comment by Philpax 5 days ago
I do agree that it is a datapoint, but GP's point is that this model was undertrained, so it's hard to draw the same conclusions from it that we would from other research.
Comment by ACCount37 5 days ago
Comment by semessier 5 days ago
Comment by hollosi 5 days ago
Comment by nbardy 5 days ago
You need some amount of parallel compute and some amount of global comparison.
And the rest is basically ways to add parameters and scale.
(This is in theory; in practice you can get a lot of small-percentage stability and efficiency improvements from algorithmic details of the model architecture, and those really compound.)
Comment by soVeryTired 4 days ago
Comment by mattalex 4 days ago
Comment by pseudo-usama 5 days ago
Comment by 7e 5 days ago
Comment by spindump8930 4 days ago
Comment by jephs 5 days ago
Comment by ketchup32613 5 days ago
Comment by zxexz 5 days ago
Comment by WithinReason 5 days ago
Comment by spindump8930 4 days ago
Comment by Der_Einzige 4 days ago
Fuck your scaling curves. More research labs need to #yolo and try stuff that doesn't have good scaling behavior proven yet. State space models have continued to take forever to proliferate despite being objectively good, because only the god dang Chinese understand that you actually need to #yolo sometimes, like making some of your layers state space layers in Hunyuan-T1.
Comment by jephs 4 days ago
Comment by spindump8930 4 days ago
Comment by xuzhenpeng 5 days ago
Comment by afford-ai 5 days ago
Comment by DuduZhvania 4 days ago
Comment by brianjmingus 5 days ago
Comment by Hugsun 5 days ago
It sounds interesting at a glance, but it seems to be AI slop. So it's hard to tell if there are any interesting discoveries there, or just some worthless results described with performatively advanced language.
Comment by dnnddidiej 5 days ago