Gemma 4 QAT models: Optimizing compression for mobile and laptop efficiency
Posted by theanonymousone 4 days ago
Comments
Comment by simonw 4 days ago
uvx litert-lm run \
--from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
gemma-4-E2B-it.litertlm \
--backend=gpu \
--prompt="Generate an SVG of a pelican riding a bicycle"
The first time you run that, it downloads 3.2GB to ~/.cache/huggingface/hub/models--litert-community--gemma-4-E2B-it-litert-lm
It can handle audio and image input too, which is pretty cool for a 3.2GB model. For images:
uvx litert-lm run \
--from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
gemma-4-E2B-it.litertlm \
--backend=gpu --vision-backend gpu \
--attachment image.jpg --prompt describe
And for audio:
uvx litert-lm run \
--from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
gemma-4-E2B-it.litertlm \
--backend=gpu --audio-backend cpu \
--attachment audio.wav --prompt transcribe
(The pelican is rubbish, but it's only a 3.2GB file so the fact it even outputs valid SVG is impressive to me: https://gist.github.com/simonw/94b318afde4b1ce5ff67d4b5d0362... )
Comment by reactordev 4 days ago
Comment by viccis 4 days ago
Comment by taffydavid 4 days ago
But seriously, wouldn't predictive text on a 90s cell phone pass this test?
Comment by reactordev 3 days ago
It’s harder now because of emojis and draw-to-type, as well as pen input. We didn’t have these things 14 years ago, when “I’ll be right back” could be expanded from “I’ll b ri ba”
Comment by yalok 4 days ago
Comment by reactordev 3 days ago
Comment by ranguna 3 days ago
Comment by simonw 4 days ago
Comment by reactordev 4 days ago
https://huggingface.co/google/gemma-4-E2B-it-qat-mobile-ct
But they could be cooking up a smaller one because the model card lists the Q_4 quants as being bigger than the mobile or text-only so I think we’ll need to wait for the Q_2_Distilled_Mobile_Textformer version. Still, just amazing work.
Comment by madduci 4 days ago
Comment by reactordev 3 days ago
Comment by rcarmo 4 days ago
Comment by __mharrison__ 4 days ago
Comment by NamlchakKhandro 4 days ago
It's slow and the PKG resolution is way too flat.
Comment by qwertox 4 days ago
Comment by satvikpendem 4 days ago
Personally, I'm using the 2B model for web search and structured JSON output via Unsloth Studio and its API. It works very well for that, even with the model embedded on phones.
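For anyone wiring a small model into a pipeline like this, a minimal sketch of checking a structured JSON reply before trusting it might look like the following. It does not call Unsloth Studio (the thread does not show its API); the schema in `REQUIRED_FIELDS`, the function name and both reply strings are hypothetical, made up purely for illustration.

```python
import json

# Hypothetical schema: field name -> expected Python type
REQUIRED_FIELDS = {"query": str, "results": list}

def parse_structured_reply(reply_text):
    """Return the parsed object if it is valid JSON matching the schema, else None."""
    try:
        data = json.loads(reply_text)
    except json.JSONDecodeError:
        return None                      # model wrapped the JSON in prose, or broke the syntax
    if not isinstance(data, dict):
        return None
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            return None                  # missing field or wrong type
    return data

# Made-up replies standing in for model output
good = parse_structured_reply('{"query": "gemma 4 qat", "results": []}')
bad = parse_structured_reply('Sure! Here is the JSON: {"query": 1}')
```

Returning `None` instead of raising keeps the retry logic in the caller simple: a small model that occasionally adds chatter around the JSON can just be asked again.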
Comment by llmoorator 4 days ago
meaning Google quantized the model to 4 bit and stored the result in BF16 format for compatibility and convenience to downstream packers.
Like storing small 8 bit numbers in full 32 bit integers.
So it's not close to 100% of unquantized BF16.
I'm curious if anybody can explain why Google's released 4 bit QAT Q4_0 is not exactly 100% of the BF16 QAT Q4_0. It seems like it should be just bit twiddling, with no further quantization needed to convert between these two packings. Unsloth talks about "lattice alignment" being an issue.
That being said, I hate it that smol model makers, like Google, Qwen, ... only show the BF16 benchmarks when they release new models, knowing that what people really run are 4-8 bit quantizations, so it's really hard to understand how much you lose when you run 4 bit vs 6 bit...
Comment by coder543 4 days ago
You also misunderstand what is happening. Google did not do that. Google further trained the original model with an objective of minimizing error when quantized to 4-bit. The BF16 QAT is not an upscaled 4-bit model. When quantized to 4-bit, it should lose less accuracy than a typical 16-bit model loses when quantized to 4-bit, but the loss is not zero, because it is not based on a 4-bit model.
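A rough sketch of the distinction, under loud assumptions: the quantizer below is a simplified symmetric 4-bit scheme with one scale per block, not llama.cpp's exact Q4_0, and the weights are random numbers rather than anything from Gemma. The function names are hypothetical. It shows that weights which do not already sit on a 4-bit grid pick up error when quantized, whereas values that already came out of the quantizer pass through a second time essentially unchanged.

```python
import random

def fake_quantize(block, bits=4):
    """Quantize a block to signed integers with one shared scale, then dequantize."""
    qmax = 2 ** (bits - 1) - 1           # 7 for 4-bit
    qmin = -(2 ** (bits - 1))            # -8 for 4-bit
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return list(block)               # all-zero block: nothing to scale
    scale = max_abs / qmax
    return [max(qmin, min(qmax, round(x / scale))) * scale for x in block]

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(32)]   # stand-in for one block of 16-bit weights

once = fake_quantize(weights)    # 16-bit-style weights -> 4-bit grid
twice = fake_quantize(once)      # values already on the grid -> 4-bit grid again

first_pass_error = mean_abs_error(weights, once)    # non-zero: information is lost
second_pass_error = mean_abs_error(once, twice)     # ~0 up to float rounding
```

If the BF16 QAT checkpoint were an upscaled 4-bit model, quantizing it would look like the second pass. Because it is a normally trained 16-bit model that has merely been tuned to tolerate quantization, it looks like the first pass, only with a smaller error than an untuned model would show.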
The Gemma 3 QAT report was a bit clearer:
https://developers.googleblog.com/en/gemma-3-quantized-aware...
"Instead of just quantizing the model after it's fully trained, QAT incorporates the quantization process during training. QAT simulates low-precision operations during training to allow quantization with less degradation afterwards for smaller, faster models while maintaining accuracy. Diving deeper, we applied QAT on ~5,000 steps using probabilities from the non-quantized checkpoint as targets. We reduce the perplexity drop by 54% (using llama.cpp perplexity evaluation) when quantizing down to Q4_0."
The BF16 is just trained to be more resistant to simulated quantization, which helps when it is actually quantized. Google is not doing post-training on the 4-bit model directly.
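To make the "54%" figure in the quote concrete, here is a minimal sketch of one plausible reading of it: perplexity is measured for the full-precision model, for a plainly quantized Q4_0, and for the QAT Q4_0, and the reduction is how much of the quantization gap QAT closes. The blog post does not spell out its exact formula, so `drop_reduction` is an assumption, and the three perplexity values are invented purely to illustrate the arithmetic.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def drop_reduction(ppl_full, ppl_naive_q4, ppl_qat_q4):
    """Fraction of the quantization-induced perplexity gap that QAT removes."""
    naive_gap = ppl_naive_q4 - ppl_full   # degradation without QAT
    qat_gap = ppl_qat_q4 - ppl_full       # degradation with QAT
    return 1.0 - qat_gap / naive_gap

# Invented numbers, not measurements
reduction = drop_reduction(ppl_full=10.0, ppl_naive_q4=11.0, ppl_qat_q4=10.46)

# Sanity check of the perplexity helper: a model that gives every token p = 0.5
coin_flip_ppl = perplexity([math.log(0.5)] * 4)
```

Under this reading, a 54% reduction still leaves 46% of the original gap, which is consistent with the point that the loss is smaller but not zero.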
Comment by 3abiton 4 days ago
Comment by dofm 3 days ago
I am a good WP developer so I kept prodding it and it kept insisting, and it explained with clarity. Turns out it is right and I was wrong, as I would have found out if I'd written the code myself.
I've been using this particular test for days, experimenting in ways to generate and prompt code. The 4-bit quantisation of the pre-QAT model does not catch this error. And nor can the Qwen 3.6 sparse model, which confidently blazed past it and never mentioned it.
(FWIW neither did plain ChatGPT; maybe Codex would)
Anecdotal, but there you go. I am somewhat weirded out by it.
Comment by ComputerGuru 4 days ago
Comment by coder543 4 days ago
Comment by satvikpendem 4 days ago
Comment by mft_ 4 days ago
Comment by ComputerGuru 3 days ago
Comment by scosman 4 days ago
Comment by slopinthebag 4 days ago
Comment by overfeed 4 days ago
Comment by jhatax 4 days ago
No knowledge, just speculation.
Comment by illusive4080 4 days ago
Comment by itake 3 days ago
Comment by trollbridge 4 days ago
Comment by robgough 3 days ago
Comment by jbarrow 4 days ago
Gemma 12B, multitoken prediction, and official quants released. Feels like Google is putting real effort into this string of releases, and I'm very excited to see that!
Comment by minimaxir 4 days ago
It's good that this post lists the expected VRAM usage for the models, with Q4_0 Gemma 4 12B being 6.7GB, which will indeed fit Google's claims of fitting within 16GB comfortably, although it confirms that only the quantized version will do so.
Relatedly, in Google's newly released Edge Gallery for macOS, Gemma 4 12B is explicitly listed as unsupported due to not enough RAM even on a 16GB machine, but given the expected VRAM usage here the Q4_0 variant definitely should fit and Google should fix that.
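As a back-of-the-envelope check on that figure: in llama.cpp's Q4_0 layout each block of 32 weights is stored as one fp16 scale plus 32 four-bit values, i.e. 18 bytes per block or 4.5 bits per weight. The sketch below assumes exactly 12 billion parameters all stored as Q4_0, which is a simplification (real files usually keep some tensors at higher precision, and the runtime also needs room for the KV cache and activations). The function name is hypothetical.

```python
Q4_0_BLOCK_WEIGHTS = 32       # weights per block
Q4_0_BLOCK_BYTES = 2 + 16     # one fp16 scale + 32 four-bit values packed into 16 bytes

def q4_0_weight_gb(n_params):
    """Approximate weight storage in GB (10^9 bytes) if every parameter were Q4_0."""
    blocks = n_params / Q4_0_BLOCK_WEIGHTS
    return blocks * Q4_0_BLOCK_BYTES / 1e9

bits_per_weight = Q4_0_BLOCK_BYTES * 8 / Q4_0_BLOCK_WEIGHTS   # 4.5
size_12b = q4_0_weight_gb(12e9)                                # 6.75
```

That lands close to the 6.7GB quoted above, so the listed number looks like weights only; actual headroom on a 16GB machine depends on context length and whatever else is resident.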
Comment by Aurornis 4 days ago
The Q4_0 is a quantization aware training checkpoint. It's not a simple quantization of the original Gemma 4 12B.
Comment by netdur 4 days ago
Comment by refulgentis 4 days ago
- Gemma 4 2B/4B/27BE3B/31B
- Gemma 4 2B/4B/27BE3B/31B x "assistant" / MTP drafter models (i.e. multitoken prediction)
- Gemma 4 12B (2 days ago? 1?)
- Gemma 4 QAT 2B/4B/12B/27BE3B/31B x "assistant" models (i.e. multitoken prediction)
It probably sounds silly and really whiny in the abstract. It just causes a ton of work / confusion downstream that feels unnecessary.
Extremely glad for the output, not glad to have to chase it.
ex. llama.cpp currently supports the originals but not the MTP predictors but there is a patch for the MTP predictors but not for the small MoE models and I think it supports the 12B but maybe not media for it yet and now we have these too and the blog says there's GGUFs (llama.cpp models) but there isn't in any of the 12? repos I clicked through. and ~every consumer-facing local LLM app is built on llama.cpp or a fork of it.
Also if anyone at Google is taking feedback over to b/ or product, pleaseeee stop the "E"2B "E"4B thing, unless it's actually taking up less RAM on Android during CPU inference. I can't tell if I need to treat the 4B like an 8B (i.e. beyond most consumer hardware without a GPU) or a 4B (i.e. will run on most consumer hardware since 2021)
EDIT: And, yes, the QAT 12B x mmproj does not work with llama.cpp. I'm glad there's people who have the luxury of not having to, well, actually use these and treat me as whining :) I'll need to schedule another 4-8 hours of work for the 4th time, no fun!
Comment by ddarolfi 4 days ago
Comment by sumedh 4 days ago
This is exactly why Google has 10 messenger Apps.
Comment by nolist_policy 4 days ago
Comment by refulgentis 4 days ago
And you're absolutely right to point out they aren't products - I hoped that was clear - when you're building a product with them, you end up having to do the same build loop 4 times, in this instance :)
Comment by overfeed 4 days ago
Comment by ddarolfi 4 days ago
Comment by satvikpendem 4 days ago
Comment by RandyOrion 4 days ago
Gemma family (gen 1 to gen 4) consistently shows an extreme range of activations, i.e., 600000, essentially forcing people to use bf16 kv cache and accept a short context window, e.g., 31b, iq4_xs quantization, 100k context window on 32gb memory. Or, people use q8 kv cache, 200k context window, and accept a large performance penalty.
In contrast, for qwen 3.5 family, the largest activation is below 2000, making q8 or even lower-precision kv cache essentially free. Together with linear attention, which doesn't require kv cache, full 262k context window can be easily reached.
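A small sketch of why the activation range matters for a quantized cache, assuming plain symmetric absmax quantization where one scale is shared by a group of values. This is a simplification: block-wise formats only share a scale within a small block, so the effect below is the worst case where a small value sits next to an outlier. The 600000 and 2000 magnitudes are the ones quoted above; the function names and the `small_value` are made up.

```python
def absmax_step(max_abs, bits=8):
    """Spacing between representable values under symmetric absmax quantization."""
    return max_abs / (2 ** (bits - 1) - 1)

def quantize_dequantize(x, max_abs, bits=8):
    """Round x to the nearest representable value for the given range."""
    step = absmax_step(max_abs, bits)
    return round(x / step) * step

step_wide = absmax_step(600_000)     # about 4724 per step
step_narrow = absmax_step(2_000)     # about 15.7 per step

small_value = 100.0                  # an ordinary-sized activation
wide_result = quantize_dequantize(small_value, 600_000)    # collapses to 0.0
narrow_result = quantize_dequantize(small_value, 2_000)    # stays near 100
```

With the wide range, anything smaller than roughly half a step is rounded away entirely, which is the kind of loss that pushes people back to a bf16 cache.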
Qat training with w4a16 target, while improving performance on inference with low-precision weights, doesn't solve the kv cache problem at all.
In the end, a qat is a qat, and there are unseen efforts behind qat checkpoints. Thank you gemma team for releasing qat checkpoints.
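The context-window trade-off described above follows from simple arithmetic: a standard attention KV cache stores one key and one value vector per layer, per KV head, per token. The sketch below uses a hypothetical model shape (it is not Gemma's or Qwen's real configuration) and ignores any architecture-specific savings. The Q8_0 figure assumes llama.cpp's layout of 32 int8 values plus one fp16 scale, i.e. 34 bytes per 32 elements.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem):
    """Approximate KV cache size in GiB for a plain (non-windowed) attention stack."""
    elems = 2 * n_layers * n_kv_heads * head_dim * context_len   # 2 = keys and values
    return elems * bytes_per_elem / 2 ** 30

BF16_BYTES = 2.0
Q8_0_BYTES = 34 / 32    # 32 int8 values + one fp16 scale per block

# Hypothetical shape, chosen only to make the numbers concrete
shape = dict(n_layers=48, n_kv_heads=8, head_dim=128)

bf16_100k = kv_cache_gib(context_len=100_000, bytes_per_elem=BF16_BYTES, **shape)
q8_200k = kv_cache_gib(context_len=200_000, bytes_per_elem=Q8_0_BYTES, **shape)
ratio = q8_200k / bf16_100k   # 1.0625: double the context for about the same memory
```

Whatever the real shape is, the ratio is the same: an 8-bit cache roughly doubles the context that fits in a given budget, which is why it hurts when activation outliers make that cache lossy.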
Comment by RandyOrion 4 days ago
Together with deliberate bf16-related hardware degradation on consumer-level nvidia gpus, i.e., gtx 10, rtx 20, 30, 40, 50 series, things get sour really quickly.
Comment by Catloafdev 4 days ago
Comment by taffydavid 4 days ago
Will advancements like this ultimately reduce the carbon footprint of AI?
Comment by goldenarm 4 days ago
Also Google DeepMind has a six month embargo on strategic papers, so I bet the juiciest quantization tech isn't public yet.
Comment by WhiteDawn 4 days ago
Comment by pfheatwole 4 days ago
- Safetensors: https://huggingface.co/google/gemma-4-26B-A4B-it-qat-q4_0-un...
- GGUF: https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/tree/...
Note the README in the Unsloth list of files: llama.cpp is working on a PR to support the gemma4 drafters: https://github.com/ggml-org/llama.cpp/pull/23398. Also note the PR submitter didn't experience much speedup with 26B (seems typical that MoE models don't generally benefit from MTP).
Comment by dist-epoch 4 days ago
https://huggingface.co/google/gemma-4-26B-A4B-it-qat-q4_0-un...
Comment by dofm 4 days ago
(Pardon my ignorance; this stuff moves so fast)
Comment by thangalin 4 days ago
https://point.free/blog/gemma-4-on-a-2016-xeon/
Xeon, but could be useful for MTP on Mac.
Comment by dofm 2 days ago
Comment by dofm 4 days ago
I do have the Qwen 3.6 (35B) MTP implementation running (in LM Studio; it doesn't need a separate drafter), along with non-MTP Gemma 4 26B, and I can see that Unsloth Studio can run the new QAT, but I can't see how you can run the assistant/drafter. Yet.
It's just a constantly changing landscape. Don't get me wrong, it's fascinating and for various reasons I am pleased I can keep up even slightly, but eeeehhh :-)
Comment by int_19h 4 days ago
Comment by dofm 4 days ago
1) Gemma 4 MTP is too fresh for off-the-shelf software to use anyway
2) "you can convert them yourself" which is fine, obvs
Comment by superkuh 4 days ago
Comment by llminthefor 4 days ago
Comment by superkuh 1 day ago
Comment by netdur 4 days ago
The E4B model doesn’t fit on my phone TPU, so it swaps to RAM. The QAT version means more accuracy, good!
Comment by ComputerGuru 4 days ago
Comment by linuxhansl 2 days ago
My optimal local setup now is gemma4-qat and Q8_0 K/V cache quantization with 256k context windows. And that runs fine with 12GB VRAM and another 10GB in RAM.
Previously I tried with gemma4:26b-a4b-it-q4_K_M and qwen3.6:35b-a3b-q4_K_M, and they both would tie themselves into knots (especially qwen3.6 can take forever with incessant "but wait..." thinking loops.) More often than not, they would not finish the task.
It seems true these 4b QAT models are as precise as Q8_0 quantization (which is supposedly indistinguishable from bf16).
I am really excited about the prospect of local LLM inference.
Comment by jack_pp 4 days ago
Comment by dofm 2 days ago
https://huggingface.co/RachidAR/gemma-4-26B-A4B-it-qat-assistant-q4_0-gguf
you can now do this:
./llama-server \
-hf google/gemma-4-26B-A4B-it-qat-q4_0-gguf \
--spec-draft-hf RachidAR/gemma-4-26B-A4B-it-qat-assistant-q4_0-gguf:Q4_0 \
--spec-type draft-mtp \
--spec-draft-n-max 3
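For intuition about what a draft length of 3 can buy, here is a sketch of the usual idealised model of speculative decoding, not llama.cpp's implementation: each drafted token is assumed to be accepted independently with the same probability, and the target model always contributes one token per verification pass. The function name and the acceptance rates are hypothetical; real acceptance depends on the prompt and the drafter.

```python
def expected_tokens_per_pass(accept_rate, draft_len):
    """Expected tokens emitted per target-model pass under i.i.d. acceptance.

    Sums accept_rate**0 + ... + accept_rate**draft_len (a geometric series).
    """
    if accept_rate >= 1.0:
        return draft_len + 1.0           # every draft accepted, plus the target's own token
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)

for rate in (0.5, 0.7, 0.9):
    print(rate, round(expected_tokens_per_pass(rate, draft_len=3), 3))
```

This counts tokens per pass only. The drafter's own compute and the cost of verifying several tokens at once are ignored, so the wall-clock speedup is always lower than this number and can disappear when the drafter is slow relative to the target.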
Really quick on an M1 Max, definitely seeing a speedup. Though interestingly the oMLX performance with the MLX community MTP models is much worse. I am not really sure why (the parallel model overhead not being worth the speculative gain, I suppose, but I do not understand this stuff anywhere near as much as I would like).
Comment by somewhatrandom9 4 days ago
Comment by dist-epoch 4 days ago
Comment by Havoc 4 days ago
Comment by int_19h 4 days ago
Comment by Havoc 4 days ago
Comment by girvo 4 days ago
31b-it-assistant is what enables MTP
Comment by make3 3 days ago
Comment by cr3cr3 4 days ago
Comment by razighter777 4 days ago
Comment by arjun-mavonic 3 days ago
Comment by nazgul17 4 days ago
Comment by zkmon 4 days ago
Comment by SubiculumCode 4 days ago
Comment by nicman23 4 days ago
Comment by redox99 4 days ago
Besides, there's no good agent on Android. Having a model that can't run web searches and browse websites is limited in use, particularly small models that really need to be grounded on search results to be factual, because they can't memorize enough.
Edit: I'd like to know what kind of usage the people that seem to disagree and downvoted this are having.
Comment by ilaksh 4 days ago
But also maybe a few Qwen 3.6 or Qwen 3.5 variants can fit and can handle some simple tasks.
Comment by redox99 4 days ago
When I tried E2B and E4B with Google Edge Gallery, and added a web search skill from the skill list, E2B would fail (get stuck in a loop), E4B would need a very specific instruction, "weather in [city name]" would not call the web search tool, I'd need "web search weather in [city name]". And the result was completely hallucinated and impossible. It claimed 14c and feels like 4c (which is impossible), and 10% humidity (which is almost impossible in this city)
Asking wikipedia level history questions (without any tool use), the results were awful as well.
Comment by satvikpendem 4 days ago
This is running on a server though, not sure how well it'd work on a phone, I should try that. I used AI Edge Gallery on Android and it doesn't seem too good at the web search tool but maybe the web search tool itself, being a community made tool, is pretty bad, because tool calling via Unsloth Studio seems to work just fine with the exact same Gemma models on desktop/server vs the phone.
Comment by redox99 4 days ago
I'm sure you can get some out of it if you babysit it with an optimized prompt, harness, etc and you can tolerate some failures. But when I try to run the ChatGPT prompts from my history, even if I pick the easier ones, it's hopeless.
I'd like to have a local agent on the phone with wikipedia level knowledge. But you probably need more like 30B params.
Comment by satvikpendem 4 days ago
Comment by redox99 4 days ago
I just tested "List the 5 most recent Argentina vice presidents" on E4B and it literally got all 5 wrong
Comment by satvikpendem 4 days ago
Try this on Unsloth Studio, they seem to have fixed Gemma tool calling.
Comment by redox99 4 days ago
Comment by Melatonic 4 days ago
Comment by redox99 4 days ago
-----------
As of my last update, here are the five most recent individuals to have served as Vice President of Argentina:
Sergio Massa (Served as Vice President from 2019 to 2023)
Martín Lousteau (Served as Vice President from 2015 to 2019)
Cristina Fernández de Kirchner (Served as Vice President from 2007 to 2015)
Néstor Kirchner (Served as Vice President from 2003 to 2007)
Eduardo Duhalde (Served as Vice President from 1999 to 2003)
Note on the list: The term "most recent" can be interpreted in two ways:
Most recent to have served: This list follows that interpretation, showing the last five people who held the office.
Most recent current officeholders: If you are asking for the current Vice President, that position is currently held by Juan Manuel Moreno (who was appointed in 2024).
If you are looking for the current Vice President, please let me know!
Comment by Kylejeong21 4 days ago
Comment by refulgentis 4 days ago
Comment by minimaxir 4 days ago
Comment by refulgentis 4 days ago
Comment by comparedge 4 days ago
Comment by Pixel-Labs 4 days ago
Comment by spacebacon 4 days ago
Comment by steno132 4 days ago
I see absolutely no benefit to me as an end user for a local model which is going to take up more of my CPU and memory and slow down my machine. I almost always have Internet and if I don't then not having access to an AI model is the least of my concerns.
Comment by adam_arthur 4 days ago
I don't think many realize that most LLM embedded automation, pipelines, products will soon be able to run extremely cheaply on models < 100B parameters.
Frontier models will be used for coding/creation use cases, yes. But for all the pseudo-deterministic, pipeline, analysis style things there will be no practical benefit to running frontier models, only additional cost.
Gemma 4 26B outperforms most 100-200B models that I've tested for reasoning and structured output.
Gemma 4 12B can consistently select where to click on browser images given a minimal prompt, and do so very quickly.
Comment by dofm 4 days ago
Comment by steno132 4 days ago
If you're building an automation as a company you definitely won't want to take on the long term maintenance overhead of running your own models for some automation project.
Comment by adam_arthur 4 days ago
Your claim is effectively that companies don't care about operational/cloud costs. Even pre-LLM, companies regularly assessed and tried to pare down cloud spend.
Comment by sowbug 3 days ago
Comment by mikeocool 4 days ago
All 3 years?
Comment by steno132 4 days ago
Comment by victorbjorklund 3 days ago
Comment by Zambyte 4 days ago
Comment by steno132 4 days ago
I'd rather not have intensive compute needed shifted onto my personal machine which I want to use for something else.
Comment by satvikpendem 4 days ago
Comment by steno132 4 days ago
Comment by Zambyte 4 days ago
Comment by satvikpendem 4 days ago
As the sibling says this is why people want smaller but still performant models.
Comment by Zambyte 4 days ago
Comment by user2722 4 days ago
Comment by steno132 4 days ago
It would be selfish and unethical not to in my view. And ultimately the data is just being used in order to improve the models and benefit us, not for anything nefarious.
Comment by NicuCalcea 4 days ago
Comment by mannanj 4 days ago
The obsession is for leaving hostile and abusive entities, the corporations or the people who fund them that have a horrible track record in regards to ethicality, rights and respect & human dignity.
Comment by steno132 4 days ago
It's like using Gmail and expecting them not to train their AI models on your data - how can you expect that when they're giving you a secure, reliable, highly functional email client completely for free?
The digital economy only works if everyone pays their fair share. If you don't want to give your data then you are really harming everyone by slowing down AI development for everyone else.
Comment by klardotsh 4 days ago
If I pay you for a service, what implicit right should you have to then continue to profit in perpetuity by storing the data I paid you to process?
If LLMs were free your Gmail analogy might hold up. They aren’t, and so it doesn’t.
AI development can continue with the data folks opt into, or with the data AI companies incessantly scrape with reckless disregard for polite system loads. AI development does not require retaining all user inputs forever.
Comment by mannanj 4 days ago
My disinterest is in sharing my intellectual IP. Most people up to now, have never shared this much of their intellectual IP with a company. Name one product through human history before that got this much data and insight into human thinking and now can use your most intimate conversations, ideas and needs for non-training purposes?
You can't even opt out of that! At least for the training data you can opt-out.
Comment by satvikpendem 4 days ago
Comment by mannanj 3 days ago
"real" property or not. You agree that we have some right to our own outputs, right? Is that not dignity, to say "I want my outputs protected".
Seems like you think that your ideas should be free, as you called it information. How about you back that up with action... please send me all your most intimate, valuable ideas. Oh no, you don't feel comfortable? Then why are you sharing it with companies?
Comment by satvikpendem 3 days ago
Comment by mannanj 2 days ago
For example, I think ideas are incredibly valuable as we've discussed. I think theft of them is a norm and then disempowered opinions are conditioned upon the people to make stealing their output easier.
I also share my ideas with companies such as chatting with LLMs but I talk about it because I'm unhappy with it and think I should not forget (and neither should you) that ultimately a valuable asset is being handed to them free. No, actually, I'm paying them for that. And I think that shouldn't be forgotten. When a better alternative is available, or I'm ready, I'm out.
Comment by mannanj 4 days ago
Comment by satvikpendem 4 days ago
Comment by mannanj 3 days ago