Qwen3.6-35B-A3B: Agentic coding power, now open to all
Posted by cmitsakis 1 day ago
Comments
Comment by simonw 22 hours ago
It drew a better pelican riding a bicycle than Opus 4.7 did! https://simonwillison.net/2026/Apr/16/qwen-beats-opus/
Comment by kelnos 19 hours ago
* It's sitting on the tire, not the seat.
* Is that weird white and black thing supposed to be a beak? If so, it's sticking out of the side of its face rather than the center.
* The wheel spokes are bizarre.
* One of the flamingo's legs doesn't extend to the pedal.
* If you look closely at the sunglasses, they're semi-transparent, and the flamingo only has one eye! Or the other eye is just on a different part of its face, which means the sunglasses aren't positioned correctly.
* (subjective) The sunglasses and bowtie are cute, but you didn't ask for them, so I'd actually dock points for that.
* (subjective) I guess flamingos have multiple tail feathers, but it looks kinda odd as drawn.
In contrast, Opus's flamingo isn't as detailed or fancy, but more or less all of it looks correct.
Comment by withinboredom 18 hours ago
Comment by GistNoesis 17 hours ago
I just tried this GGUF with llama.cpp in its UD Q4_K_XL version on my custom agentic-oriented task consisting of wiki exploration and automatic database building ( https://github.com/GistNoesis/Shoggoth.db/ )
I noted a nice improvement over Qwen3.5 in its ability to discover new creatures in the open-ended searching task, though I haven't quantified it with numbers yet. It also seems faster, at around 140 tokens/s compared to 100 tokens/s, but that may be due to some different configuration options.
A few small differences from Qwen3.5: to avoid out-of-memory crashes in multimodal mode I had to pass --no-mmproj-offload to disable the GPU offload when converting images to tokens, otherwise it would crash on high-resolution images. I also used a quantized KV cache by passing -ctk q8_0 -ctv q8_0, and with a ctx-size of 150000 it only needs 23099 MiB of device memory, which means no partial RAM offloading when I use an RTX 4090.
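Put together, the invocation described above would look roughly like this (a sketch; the GGUF filename and port are placeholders):

```shell
# Rough reconstruction of the setup from the comment above.
# --no-mmproj-offload keeps the image-to-token projection off the GPU,
# -ctk/-ctv q8_0 quantize the KV cache, --ctx-size sets the context window.
llama-server \
  -m Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
  --no-mmproj-offload \
  -ctk q8_0 -ctv q8_0 \
  --ctx-size 150000 \
  --port 8080
```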
Comment by realityfactchex 17 hours ago
https://files.catbox.moe/r3oru2.png
- My Qwen 3.6 result had sun and cloud in sky, similar to the second Opus 4.7 result in Simon's post.
- My Qwen 3.6 result had no grass (except as a green line), but all three results in Simon's post had grass (thick).
- My Qwen 3.6 result had visible "tailing air motion" like Simon's Qwen 3.6 result.
- My Qwen 3.6 result had a "sun with halo" effect that none of Simon's results had.
But, I know, it's more about the pelican and the bicycle.
Comment by _ache_ 10 hours ago
I can't comment on that flamingo.
Comment by jubilanti 22 hours ago
Comment by abustamam 21 hours ago
https://simonwillison.net/2025/Nov/13/training-for-pelicans-...
Comment by SwellJoe 18 hours ago
Comment by survirtual 5 hours ago
"Make a single-page HTML file using threejs from a CDN. Render a scene of a flying dinosaur orbiting a planet. There are clouds with thunder and lightning, and the background is a beautiful starscape with twinkling stars and a colorful nebula"
This allows me to evaluate several factors across models. It is novel and creative. I generally run it multiple times, though now that I have shared it here, I will come up with new scenes personally to evaluate.
I also consider how well it one-shots, the errors generated, the response to errors being corrected, and the velocity of iteration toward improvement.
Generally speaking, Claude Sonnet has done the best, Qwen3.5 122B does second, and I have nice results from Qwen3.5 35B.
ChatGPT does not do well. It can complete the task without errors but the creativity is atrocious.
Comment by hansmayer 4 hours ago
Comment by amelius 20 hours ago
Comment by rafaelmn 21 hours ago
Comment by duzer65657 20 hours ago
Comment by Reddit_MLP2 20 hours ago
Comment by yndoendo 19 hours ago
This reminds me of Pictionary. [0] Some people are good and some are really bad.
I am really bad at remembering how items look in my head and fail at drawing in Pictionary. My drawing skills are tied to being able to copy what I see.
Comment by johanvts 9 hours ago
Comment by quinnjh 18 hours ago
Comment by MagicMoonlight 20 hours ago
Comment by culi 21 hours ago
Comment by 060880 29 minutes ago
Comment by bmitc 10 hours ago
I thought that's exactly what they are?
Comment by mastermage 9 hours ago
Comment by tmountain 7 hours ago
Comment by bertili 22 hours ago
Comment by johanvts 9 hours ago
Comment by petu 5 hours ago
155W PSU seems to be unified with M4 Pro model, plus there's reserve for peripherals (~55W for 5 USB/Thunderbolt ports).
Apple lists 65W for base M4 Mac itself: https://support.apple.com/en-am/103253
Notebookcheck found same number: https://www.notebookcheck.net/Apple-Mac-Mini-M4-review-Small...
Comment by cyclopeanutopia 22 hours ago
The right one looks much better, plus adding sunglasses without prompting is not that great. Hopefully it won't add some backdoor to the generated code without asking. ;)
Comment by simonw 22 hours ago
GLM-5.1 added a sparkling earring to a north Virginia opossum the other day and I was delighted: https://simonwillison.net/2026/Apr/7/glm-51/
Comment by prirun 21 hours ago
Comment by evilduck 20 hours ago
If we want to get nitty gritty about the details of a joke, a flamingo probably couldn't physically sit on a unicycle's seat and also reach the pedals anyways.
Comment by akavel 19 hours ago
Comment by yabutlivnWoods 10 hours ago
- Stylized gradients on the flamingo
- Flowers
- Ground/grass has a stylized look and feel
...despite a miss along the Y-axis where it's below the seat, a couple of oddly arranged tail feathers, and the spokes, the composition overall is much closer to a production-quality entity.
Opus 4.7 looks like 20 seconds in MS Paint.
Qwen3.6 looks incomplete due to the sitting position, but like a WIP I could see on a designer coworker's screen if I walked up and interrupted them. Click and drag it up, adjust the tail feathers and spokes, and you're there, or much closer to a usable output.
Comment by rdslw 21 hours ago
Simon, any ideas?
https://ibb.co/FLc6kggm (tried here temperature 0.7 instead of pure defaults)
Comment by strobe 15 hours ago
Thinking mode for general tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
Thinking mode for precise coding tasks (e.g. WebDev): temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0
Instruct (or non-thinking) mode for general tasks: temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
Instruct (or non-thinking) mode for reasoning tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
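These settings could be passed through an OpenAI-compatible endpoint roughly like this (a hedged sketch using the "thinking mode, general tasks" row; the model name is a placeholder, and top_k/min_p support varies by server):

```python
# Request body for an OpenAI-compatible chat endpoint with the
# "thinking mode, general tasks" sampling parameters quoted above.
# "qwen3.6-35b-a3b" is a placeholder model identifier.
payload = {
    "model": "qwen3.6-35b-a3b",
    "messages": [{"role": "user", "content": "Explain MoE routing briefly."}],
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 20,        # non-standard field; honored by llama.cpp-style servers
    "min_p": 0.0,       # likewise framework-dependent
    "presence_penalty": 1.5,
}
```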
(Please note that the support for sampling parameters varies according to inference frameworks.)
Comment by monksy 19 hours ago
I'm impressed by the reach of your blog, and I'm hoping to get into blogging about similar things. I currently have a big backlog to blog about.
In short, keep up the good work with an interesting blog!
Comment by Scrounger 9 hours ago
Is the 20.9GB GGUF version better or negligible in comparison?
Comment by jaspanglia 17 hours ago
Comment by MeteorMarc 21 hours ago
Comment by rubiquity 20 hours ago
Comment by bwv848 19 hours ago
Comment by mobiuscog 3 hours ago
I get some really amusing 'reflective' responses, but I think it needs a bit more cooking. Maybe I'll try another variant.
Comment by yencabulator 18 hours ago
Comment by Readerium 15 hours ago
Comment by jamwise 22 hours ago
Comment by giantg2 21 hours ago
Comment by quietsegfault 16 hours ago
Comment by danielhanchen 22 hours ago
Comment by logicallee 4 hours ago
Comment by slekker 22 hours ago
Comment by bertili 1 day ago
[1] https://news.ycombinator.com/item?id=47246746 [2] https://news.ycombinator.com/item?id=47249343
Comment by zozbot234 1 day ago
Comment by giancarlostoro 1 hour ago
Maybe for LLMs, since everyone has their own competing LLM, but with video models, Wan 2.2 did a rug pull and left a huge gap for the community that had built around it, and I don't think a single open video model has come close since. Wan is at 2.7 now, and it's been nearly a year since the last update.
Comment by jonaustin 17 hours ago
Comment by canpan 9 hours ago
Comment by bertili 1 day ago
Comment by zozbot234 1 day ago
Comment by anonova 1 day ago
https://x.com/ChujieZheng/status/2039909917323383036
Likely to drive engagement, but the poll excluded the large model size.
Comment by stingraycharles 1 day ago
Comment by zackangelo 1 day ago
If you’re decoding multiple streams, it will be 17b per stream (some tokens will use the same expert, so there is some overlap).
When the model is ingesting the prompt (“prefilling”) it’s looking at many tokens at once, so the number of active parameters will be larger.
Comment by wongarsu 1 day ago
Those 17B might be split among multiple experts that are activated simultaneously
Comment by littlestymaar 1 day ago
Experts are just chunks of each layer's MLP that are only partially activated by each token; there are thousands of "experts" in such a model (for Qwen3-30B-A3B, it was 48 layers x 128 "experts" per layer with only 8 active for each token).
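A back-of-the-envelope sketch of those numbers (assuming the routed experts dominate the MoE FFN weights):

```python
# Shape quoted above: 48 layers, 128 routed "experts" per layer, 8 active per token.
layers, experts_per_layer, active_per_token = 48, 128, 8
fraction = active_per_token / experts_per_layer
# Only this fraction of the expert (MLP) weights is touched for any one token,
# which is why a ~30B-parameter model can have only ~3B "active" parameters.
print(f"{fraction:.2%}")  # 6.25%
```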
Comment by kylehotchkiss 23 hours ago
Comment by Nav_Panel 16 hours ago
So I understand why they wouldn't want to go open weight, but on the other hand, open weight wins you popularity/sentiment if the model is any good, researchers (both academic and other labs) working on your stuff, etc etc. Local-first usage is only part of the story here. My guess is Qwen 3.5 was successful enough that now they want to start reaping the profits. Unfortunately most of Qwen 3.5's success is because it's heavily (and successfully!) optimized for extremely long-context usage on heavily constrained VRAM (i.e. local) systems, as a result of its DeltaNet attention layers.
Comment by jubilanti 22 hours ago
Comment by wolfhumble 7 hours ago
Comment by jmb99 11 hours ago
Full (non-quantized, non-distilled) DeepSeek runs at 1-2 tok/sec. A model half the size would probably be a little faster. This is also only with the basic NUMA functionality that was in llama.cpp a few months ago, I know they’ve added more interesting distribution mechanisms recently that I haven’t had a chance to test yet.
Comment by kridsdale3 22 hours ago
Comment by qlm 22 hours ago
Comment by toxik 22 hours ago
Comment by zozbot234 18 hours ago
Comment by rwmj 8 hours ago
Comment by stefs 18 hours ago
Comment by bdangubic 22 hours ago
Comment by hparadiz 20 hours ago
Comment by bdangubic 19 hours ago
Comment by bigiain 17 hours ago
This is somewhat depressing - needing a couple of thousand bucks worth of ram just to run your chat app and code/text editor and API doco tool and forum app and notetaking app all at the same time...
Comment by hparadiz 19 hours ago
Comment by fragmede 8 hours ago
So if you just huff enough of the AI Kool aid, you too can own a Mac Studio. Or an M5 MacBook. Or a dual 3090 rig.
Comment by lpnam0201 10 hours ago
Comment by gbgarbeb 9 hours ago
Either you're in Africa, Southeast Asia or South/Central America.
How do you even afford internet?
Comment by lpnam0201 9 hours ago
My point was: not every person browsing this site has high living standard, and the ability to spend 10k on computing is a privilege.
Comment by rwmj 22 hours ago
Comment by SlavikCA 22 hours ago
Using UD-IQ4_NL quants.
Getting 13 t/s. Using it with thinking disabled.
Comment by GrayShade 10 hours ago
Comment by kylehotchkiss 17 hours ago
Comment by bitbckt 21 hours ago
Comment by r-w 23 hours ago
Comment by mistercheese 22 hours ago
Comment by parsimo2010 22 hours ago
Comment by blurbleblurble 11 hours ago
Comment by adrian_b 18 hours ago
Probably too slow for chat, but usable as a coding assistant.
Comment by xienze 18 hours ago
Comment by fragmede 8 hours ago
Comment by ydj 20 hours ago
AMD threadripper pro 9965WX, 256gb ddr5 5600, rtx 4090.
Comment by stavros 22 hours ago
Comment by guitcastro 1 day ago
Comment by homebrewer 1 day ago
Comment by Aurornis 1 day ago
If you download the release day quants with a tool that doesn’t automatically check HF for new versions you should check back again in a week to look for updated versions.
Sometimes the launch-day quantizations have major problems, which leads to early adopters dismissing useful models. You have to wait for everyone to test and fix bugs before giving a model a real evaluation.
Comment by danielhanchen 1 day ago
For MiniMax 2.7 - there were NaNs, but it wasn't just ours - all quant providers had it - we identified NaNs in 38% of bartowski's quants; ours was 22%. We identified a fix and have already applied it to ours, see https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/minimax.... Bartowski has not yet, but is working on it. We always share our investigations.
For Qwen3.5 - we shared our 7TB research artifacts showing which layers not to quantize - all provider's quants were not optimal, not broken - ssm_out and ssm_* tensors were the issue - we're now the best in terms of KLD and disk space - see https://www.reddit.com/r/LocalLLaMA/comments/1rgel19/new_qwe...
On other fixes, we also fixed bugs in many OSS models like Gemma 1, Gemma 3, Llama chat template fixes, Mistral, and many more.
It might seem these issues are due to us, but it's because we publicize them and tell people to update. 95% of them are not related to us, but as good open source stewards, we should update everyone.
Comment by evilduck 23 hours ago
Comment by danielhanchen 23 hours ago
1. Split metadata into shard 0 for huge models so 10B is for chat template fixes - however sometimes fixes cause a recalculation of the imatrix, which means all quants have to be re-made
2. Add HF discussion posts on each model talking about what changed, and on our Reddit and Twitter
3. Hugging Face XET now has de-duplicated downloading of shards, so re-downloading 100GB models should be much faster - it splits the 100GB into small chunks, hashes them, and only downloads the chunks which have changed
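The chunk-level idea can be illustrated with a toy sketch (fixed-size chunks for simplicity; the real system uses content-defined chunking, and all names here are mine):

```python
import hashlib

def changed_chunks(old: bytes, new: bytes, size: int = 4) -> list[int]:
    # Split both versions into fixed-size chunks, hash each chunk,
    # and report the indices of chunks that would need re-downloading.
    split = lambda b: [b[i:i + size] for i in range(0, len(b), size)]
    old_h = [hashlib.sha256(c).digest() for c in split(old)]
    new_h = [hashlib.sha256(c).digest() for c in split(new)]
    return [i for i, h in enumerate(new_h) if i >= len(old_h) or h != old_h[i]]
```

With a one-byte edit in the middle of a file, only the chunk containing it differs, so only that chunk is fetched.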
Comment by ssrshh 14 hours ago
Comment by danielhanchen 7 hours ago
Comment by evilduck 20 hours ago
Comment by danielhanchen 7 hours ago
Comment by CamperBob2 22 hours ago
Ideally the labs releasing the open models would work with Unsloth and the llama.cpp maintainers in advance to work out the bugs up front. That does sometimes happen, but not always.
Comment by danielhanchen 22 hours ago
We do get early access to nearly all models, and we do find the most pressing issues sometimes. But sadly some issues are really hard to find and diagnose :(
Comment by sowbug 1 day ago
Comment by danielhanchen 1 day ago
HF also provides SHA256 for eg https://huggingface.co/unsloth/MiniMax-M2.7-GGUF/blob/main/U... is 92986e39a0c0b5f12c2c9b6a811dad59e3317caaf1b7ad5c7f0d7d12abc4a6e8
But agreed it's probs better to place them in a table
Comment by sowbug 1 day ago
Comment by danielhanchen 23 hours ago
Comment by zargon 22 hours ago
Comment by sowbug 20 hours ago
The purist in me feels the 50GB chunks are a temporary artifact of Hugging Face's uploading requirements, and the authoritative model file should be the merged one. I am unable to articulate any practical reason why this matters.
Comment by magicalhippo 20 hours ago
Though chat templates seem like they need a better solution. So many issues, seems quite fragile.
Comment by danielhanchen 7 hours ago
Comment by solomatov 17 hours ago
Comment by danielhanchen 7 hours ago
Comment by dist-epoch 23 hours ago
Comment by danielhanchen 23 hours ago
For serious fixes, sadly we have to re-compute imatrix since the activation patterns have changed - this sadly makes the entire quant change a lot, hence you have to re-download :(
Comment by embedding-shape 1 day ago
Comment by danielhanchen 1 day ago
We try our best as model distributors to fix them on day 0 or 1, but 95% of issues aren't our issues - as you mentioned it's the chat template or runtime etc
Comment by alfiedotwtf 1 hour ago
Feature request:
A leaderboard with filtering, so you can enter your machine specs and it will sort all models along with all the various quantizations and rank them - because so far, model ranking sites either don't include all available quants, or don't compare apples to apples (i.e. one model tested with Claude Code while another benchmark was done with opencode), etc.
Oh - and as a bonus, scoring also ranked by agentic CLI :)
Comment by fuddle 23 hours ago
Comment by danielhanchen 23 hours ago
Comment by i5heu 20 hours ago
Comment by canarias_mate 21 hours ago
Comment by torginus 22 hours ago
Users of the quantized model might even be led to think that the model sucks because the quantized version does.
Comment by bityard 21 hours ago
An imperfect analogy might be the Linux kernel. Linus publishes official releases as a tagged source tree but most people who use Linux run a kernel that has been tweaked, built, and packaged by someone else.
That said, models often DO come from the factory in multiple quants. Here's the FP8 quant for Qwen3.6 for example: https://huggingface.co/Qwen/Qwen3.6-35B-A3B-FP8
Unsloth and other organizations produce a wider variety of quants than upstream to fit a wider variety of hardware, and so end users can make their own size/quality trade-offs as needed.
Comment by halJordan 21 hours ago
Qwen did release an fp8 version, which is a quantized version.
Comment by sander1095 1 day ago
- Why is Qwen's default "quantization" setup "bad"?
- Who is Unsloth?
- Why is his format better? What gains does a better format give? What are the downsides of a bad format?
- What is quantization? Granted, I can look this up myself, but I thought I'd ask for the full picture for other readers.
Comment by danielhanchen 1 day ago
https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs is what might be helpful. You might have heard 1bit dynamic DeepSeek quants (we did that) - not all layers can be 1bit - important ones are in 8bit or 16bit, and we show it still works well.
Comment by dist-epoch 23 hours ago
Unsloth releases lower-quality versions of the model (Qwen in this case). Think about taking a 95% quality JPEG and converting it to a 40% quality JPEG.
Models are quantized to lower quality/size so they can run on cheaper/consumer GPUs.
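To make the analogy concrete, here's a toy round-trip through symmetric 4-bit quantization (all numbers invented; real GGUF quants use per-block scales plus further tricks):

```python
# Toy version of the lossy step in the JPEG analogy above:
# store weights as 4-bit integers (-7..7) plus one shared scale per block.
w = [0.12, -0.53, 0.91, -0.07]                # original float weights
scale = max(abs(x) for x in w) / 7            # one scale for the whole block
q = [round(x / scale) for x in w]             # what actually gets stored
w_hat = [qi * scale for qi in q]              # what the runtime reconstructs
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
```

Rounding to the nearest level bounds the per-weight error by half a quantization step, which is exactly the "lower quality, smaller size" trade-off.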
Comment by danielhanchen 7 hours ago
Comment by est 23 hours ago
Comment by palmotea 1 day ago
Comment by WithinReason 1 day ago
Precision   Quantization Tag   File Size
1-bit       UD-IQ1_M           10 GB
2-bit       UD-IQ2_XXS         10.8 GB
2-bit       UD-Q2_K_XL         12.3 GB
3-bit       UD-IQ3_XXS         13.2 GB
3-bit       UD-Q3_K_XL         16.8 GB
4-bit       UD-IQ4_XS          17.7 GB
4-bit       UD-Q4_K_XL         22.4 GB
5-bit       UD-Q5_K_XL         26.6 GB
16-bit      BF16               69.4 GB
Comment by Aurornis 1 day ago
This model is a MoE model with only 3B active parameters, which works well with partial CPU offload. So in practice you can run the -A(N)B models on systems that have a little less VRAM than you need. The more you offload to the CPU the slower it becomes though.
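With llama.cpp, one common way to express that split (a sketch; the model path is a placeholder, tensor names vary by model, and flag support varies by version) is to keep everything on the GPU except the routed-expert FFN weights:

```shell
# -ngl 99 offloads all layers to the GPU, then the tensor-name regex
# routes the per-expert FFN weights back to CPU RAM. Attention and
# shared weights stay on the GPU, which is where most of the speed comes from.
llama-server -m Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
  -ngl 99 \
  --override-tensor "ffn_.*_exps.=CPU"
```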
Comment by Glemllksdf 1 day ago
Or is it only layers but that would affect all Experts?
Comment by namibj 1 hour ago
Comment by dragonwriter 23 hours ago
Comment by est 23 hours ago
I searched all the Unsloth docs and there seems to be no explanation at all.
Comment by tredre3 20 hours ago
But if you're willing to give more bits to only certain important weights, you get to preserve a lot more quality for not that much more space.
The S/M/L/XL is what tells you how many tensors get to use more bits.
The difference between S and M is generally noticeable (on benchmarks). The difference between M and L/XL is less so, let alone in real use (ymmv).
Here's an example of the contents of a Q4_K_ quant at each size:
S
llama_model_loader: - type f32: 392 tensors
llama_model_loader: - type q4_K: 136 tensors
llama_model_loader: - type q5_0: 43 tensors
llama_model_loader: - type q5_1: 17 tensors
llama_model_loader: - type q6_K: 15 tensors
llama_model_loader: - type q8_0: 55 tensors
M
llama_model_loader: - type f32: 392 tensors
llama_model_loader: - type q4_K: 106 tensors
llama_model_loader: - type q5_0: 32 tensors
llama_model_loader: - type q5_K: 30 tensors
llama_model_loader: - type q6_K: 15 tensors
llama_model_loader: - type q8_0: 83 tensors
L
llama_model_loader: - type f32: 392 tensors
llama_model_loader: - type q4_K: 106 tensors
llama_model_loader: - type q5_0: 32 tensors
llama_model_loader: - type q5_K: 30 tensors
llama_model_loader: - type q6_K: 14 tensors
llama_model_loader: - type q8_0: 84 tensors
Comment by huydotnet 23 hours ago
Comment by arcanemachiner 17 hours ago
Comment by palmotea 1 day ago
Comment by JKCalhoun 23 hours ago
Is that (BF16) a 16-bit float?
Comment by adrian_b 18 hours ago
FP16 was initially supported by GPUs, where it is useful especially for storing the color components of pixels. For geometry data, FP32 is preferred.
In CPUs, some support has been first added in 2012, in Intel Ivy Bridge. Better support is provided in some server CPUs, and since next year also in the desktop AMD Zen 6 and Intel Nova Lake.
BF16 is a format introduced by Google, intended only for AI/ML applications, not for graphics, so initially it was implemented in some of the Intel server CPUs and only later in GPUs. Unlike FP16, which is balanced, BF16 has great dynamic range, but very low precision. This is fine for ML but inappropriate for any other applications.
Nowadays, most LLMs are trained preponderantly using BF16, with a small number of parameters using FP32, for higher precision.
Then from the biggest model that uses BF16, smaller quantized models are derived, which use 8 bits or less per parameter, trading off accuracy for speed.
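The trade-off is easy to see by truncating a float32 bit pattern yourself (a small illustration, not how any particular runtime does it):

```python
import struct

def to_bf16(x: float) -> float:
    # bfloat16 is just the top 16 bits of the IEEE-754 float32 pattern:
    # 1 sign bit + 8 exponent bits (same range as float32) + only 7 mantissa bits.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF_0000))[0]

# Same huge dynamic range as float32 (1e38 stays finite; FP16 overflows past 65504)...
print(to_bf16(1e38))
# ...but coarse precision:
print(to_bf16(3.14159265))  # 3.140625
```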
Comment by mtklein 23 hours ago
Comment by Gracana 23 hours ago
Yes, however it’s a different format from standard fp16, it trades precision for greater dynamic range.
Comment by WithinReason 23 hours ago
Comment by tommy_axle 1 day ago
Comment by gunalx 21 hours ago
Comment by zozbot234 1 day ago
Comment by Ladioss 1 day ago
Comment by trvz 1 day ago
With 16 GB you'll only be able to run a very compressed variant with noticeable quality loss.
Comment by coder543 1 day ago
Comment by boppo1 21 hours ago
Comment by adrian_b 17 hours ago
You can connect to that port with any browser, for chat.
Or you can connect to that port with any application that supports the OpenAI API, e.g. a coding assistant harness.
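For instance, a raw request against such a local server might look like this (the port and prompt are assumptions):

```shell
# Minimal OpenAI-style chat request to a local llama-server,
# assumed to be listening on port 8080.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Write a haiku about pelicans."}]}'
```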
Comment by palmotea 1 day ago
What's the minimum memory you need to run a decent model? Is it pretty much only doable by people running Macs with unified memory?
Comment by giobox 1 day ago
> https://www.dell.com/en-us/shop/cty/pdp/spd/dell-pro-max-fcm...
> https://marketplace.nvidia.com/en-us/enterprise/personal-ai-...
> https://frame.work/products/desktop-diy-amd-aimax300/configu...
etc.
But yes, a modern SoC-style system with large unified memory pool is still one of the best ways to do it.
Comment by jchw 1 day ago
When I get home today I totally look forward to trying the unsloth variants of this out (assuming I can get it working in anything.) I expect due to the limited active parameter count it should perform very well. It's obviously going to be a long time before you can run current frontier quality models at home for less than the price of a car, but it does seem like it is bound to happen. (As long as we don't allow general purpose computers to die or become inaccessible. Surely...)
Comment by zozbot234 1 day ago
Comment by jchw 1 day ago
Right now I'm only able to run them in PCI-e 5.0 x8 which might not be sufficient. But, a cheap older Xeon or TR seems silly since PCI-e 4.0 x16 isn't theoretically more bandwidth than PCI-e 5.0 x8. So it seems like if that is really still bottlenecked, I'll just have to bite the bullet and set up a modern HEDT build. With RAM prices... I am not sure there is a world where it could ever be worth it. At that point, seems like you may as well go for an obscenely priced NVIDIA or AMD datacenter card instead and retrofit it with consumer friendly thermal solutions. So... I'm definitely a bit conflicted.
I do like the Arc Pro B70 so far. Its not a performance monster, but it's quiet and relatively low power, and I haven't run into any instability. (The AMDGPU drivers have made amazing strides, but... The stability is not legendary. :)
I'll have to do a bit of analysis and make sure there really is an interconnect bottleneck first, versus a PEBKAC. Could be dropping more lanes than expected for one reason or another too.
Comment by zozbot234 1 day ago
Comment by dist-epoch 23 hours ago
Arc Pro B70 seems unexpectedly slow? Or are you using 8-bit/16-bit quants?
Comment by jchw 22 hours ago
I've heard that vLLM performs much better, scaling particularly better in the multi GPU case. The 4x B70 setup may actually be decent for the money given that, but probably worth waiting on it to see how the situation progresses rather than buying on a promise of potential.
A cursory Google search does seem to indicate that in my particular case interconnect bandwidth shouldn't actually be a constraint, so I doubt tensor level parallelism is working as expected.
Comment by nyrikki 16 hours ago
3090 llama.cpp (container in VM)
unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL 105 t/s
unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL 103 t/s
Still slow compared to the ggml-org/gpt-oss-20b-GGUF at 206 t/s.
But on my 3x 1080 Ti + 1x TITAN V ghetto machine I learned that multi-GPU takes a lot of tuning no matter what. With the B70, where Vulkan has the CPU-copy problem and SYCL doesn't have a sponsor or enough volunteers, it will probably take a bit of profiling on your part. There are a lot of variables, but PCIe bus speed doesn't matter that much for inference; the internal memory bandwidth does, and you won't match that with PCIe ever.
To be clear, multicard Vulkan and absolutely SYCL have a lot of optimizations that could happen, but the only time two GPUs are really faster for inference is when one doesn't have enough ram to fit the entire model.
A 3090 has 936.2 GB/s of (low-latency) internal bandwidth, while PCIe 5.0 x16 only has 504.12 Gbit/s, may have to be copied through the CPU, and involves locks, atomic operations, etc.
For LLM inference, the bottleneck is usually going to be memory bandwidth, which is why my 3090 is so close to the 5070 Ti above.
LLM next token prediction is just a form of autoregressive decoding and will primarily be memory bound.
As I haven't used the larger intel GPUs I can't comment on what still needs to be optimized, but just don't expect multiple GPUs to increase performance without some nvlink style RDMA support _unless_ your process is compute and not memory bound.
Comment by TechSquidTV 1 day ago
But I don't need Nano Banana very much, I need code. While it can, there's no way I would ever opt to use a local model on my machine for code. It makes so much more sense to spend $100 on Codex, it's genuinely not worth discussing.
For non-thinking tasks, it would be a bit slower, but a viable alternative for sure.
Comment by slopinthebag 23 hours ago
Comment by layer8 1 day ago
Comment by bfivyvysj 1 day ago
Comment by angoragoats 1 day ago
With this model, since the number of active parameters is low, I would think that you would be fine running it on your 16GB card, as long as you have, say 32GB of regular system memory. Temper your expectations about speed with this setup, as your system memory and CPU are multiple times slower than the GPU, so when layers spill over you will slow down.
To avoid this, there's no need to buy a Mac -- a second 16GB GPU would do the trick just fine, and the combined dual GPU setup will likely be faster than a cheap mac like a Mac mini. Pay attention to your PCIe slots, but as long as you have at least an x4 slot for the second GPU, you'll be fine (LLM inference doesn't need x8 or x16).
Comment by utilize1808 1 day ago
Comment by cjbgkagh 1 day ago
Comment by Glemllksdf 1 day ago
Any tips around your setup running this?
I use lmstudio with default settings and prioritization instead of split.
Comment by cjbgkagh 23 hours ago
My command for llama-server:
llama-server -m /models/gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf -ngl 99 -sm layer -ts 10,12 --jinja --flash-attn on --cont-batching -np 1 -c 262144 -b 4096 -ub 512 -ctk q8_0 -ctv q8_0 --host 0.0.0.0 --port 8080 --timeout 18000
Comment by littlestymaar 1 day ago
It's going to be slower than if you put everything on your GPU but it would work.
And if it's too slow for your taste you can try the quantized version (some Q3 variant should fit) and see how well it works for you.
Comment by FusionX 1 day ago
Comment by gunalx 21 hours ago
Comment by halJordan 21 hours ago
Comment by txtsd 1 day ago
Comment by nunodonato 23 hours ago
Comment by txtsd 14 hours ago
Comment by Ladioss 1 day ago
Also you need to check your context size; Ollama defaults to 4K if <24 GB of VRAM, and you need 64K minimum if you want Claude to be able to at least lift a finger.
Comment by Patrick_Devine 22 hours ago
Comment by egorfine 5 hours ago
Gemma 3 27B q4:
* MLX: 16.7 t/s, 1220ms ttft
* GGUF: 16.4 t/s, 760ms ttft
Gemma 4 31B q8:
* MLX: 8.3 t/s, 25000ms ttft
* GGUF: 8.4 t/s, 1140ms ttft
Gemma 4 A4B q8:
* MLX: 52 t/s, 1790ms ttft
* GGUF: 51 t/s, 380ms ttft
All comparisons done in LM Studio, all versions of everything are the latest.
Comment by txtsd 21 hours ago
Comment by Ladioss 6 hours ago
Comment by pj_mukh 1 day ago
Comment by postalcoder 1 day ago
It's incomparably faster than any other model (i.e. it's actually usable without cope). Caching makes a huge difference.
Comment by littlestymaar 2 hours ago
Having implemented a GGUF parser, I'd beg to differ on the “sane format” qualifier.
Comment by terataiijo 1 day ago
Comment by ttul 1 day ago
Comment by beernet 1 day ago
Comment by sigbottle 1 day ago
Comment by danielhanchen 1 day ago
Comment by qskousen 22 hours ago
Comment by danielhanchen 7 hours ago
Comment by sigbottle 1 day ago
Maybe I just don't understand how quantization works, but I thought quantization was a very nasty problem involving a lot of plumbing
Comment by Readerium 14 hours ago
For the most recent example, as of April 16, 2026 (today):
Turboquant still isn't added to GGUF.
Comment by bildung 1 day ago
Comment by danielhanchen 1 day ago
2. Qwen3.5 - we shared our 7TB of research artifacts showing which layers not to quantize - all providers' quants were under-optimized, not broken - ssm_out and ssm_* tensors were the issue - we're now the best in terms of KLD and disk space
3. MiniMax 2.7 - we swiftly fixed it due to NaN PPL - we found the issue in all quants regardless of provider - so it affected everyone not just us. We wrote a post on it, and fixed it - others have taken our fix and fixed their quants, whilst some haven't updated.
Note we also fixed bugs in many OSS models like Gemma 1, Gemma 3, Llama chat template fixes, Mistral, and many more.
Unfortunately quants sometimes break, but we fix them quickly, and 95% of the time these issues are out of our hands.
We fix them swiftly and write up blogs on what happened; other providers then take our blogs and fixes and re-apply them to their own quants.
Comment by bildung 1 day ago
Comment by danielhanchen 1 day ago
We already fixed ours. Bart hasn't yet but is still working on it following our findings.
blk.61.ffn_down_exps in Q4_K or Q5_K failed - it must be in Q6_K otherwise it overflows.
For the others, yes layers in some precision don't work. For eg Qwen3.5 ssm_out must be minimum Q4-Q6_K.
ssm_alpha and ssm_beta must be Q8_0 or higher.
Again Bart and others apply our findings - see https://www.reddit.com/r/LocalLLaMA/comments/1rgel19/new_qwe...
Comment by rohansood15 1 day ago
Comment by danielhanchen 1 day ago
Comment by ekianjo 1 day ago
Comment by danielhanchen 1 day ago
The 4th is Google themselves improving the chat template for tool calling for Gemma.
https://github.com/ggml-org/llama.cpp/issues/21255 was another issue CUDA 13.2 was broken - this was NVIDIA's CUDA compiler itself breaking - fully out of our hands - but we provided a solution for it.
Comment by altruios 26 minutes ago
This one is by far the most capable. I've tried various versions of Gemma 4 26B, various versions of Qwen3.5-27B/35B (qwopus's galor), Nemotron, Phi, and GLM 4.7.
This one is noticeably better as an agent. It's really good at breaking down tasks into small actionable steps, and - where there is ambiguity - it asks for clarification. Its reasoning seems more solid than Gemma 4's, as do its tool use and multi-message/longer-chain thinking.
I am excited to see what other versions of this model people train!
Comment by mtct88 1 day ago
Small openweight coding models are, imho, the way to go for custom agents tailored to the specific needs of dev shops that are restricted from accessing public models.
I'm thinking about banking and healthcare sector development agencies, for example.
It's a shame this remains a market largely overlooked by Western players, Mistral being the only one moving in that direction.
Comment by lelanthran 1 day ago
I've said in a recent comment that Mistral is the only one of the current players who appear to be moving towards a sustainable business - all the other AI companies are simply looking for a big payday, not to operate sustainably.
Comment by gunalx 21 hours ago
Comment by Aurornis 1 day ago
If some organization forbade external models they should invest in the hardware to run bigger open models. The small models are a waste of time for serious work when there are more capable models available.
Comment by Zetaphor 15 hours ago
Comment by NitpickLawyer 1 day ago
Comment by mtct88 2 hours ago
Comment by kennethops 1 day ago
Comment by pstuart 15 hours ago
Granted, these next couple of years are going to suck because of the AI Component Drought, but progress marches on, and running today's frontier models will become affordable to mere mortals in time. Obviously we've hit the wall with Moore's law and other factors, but this will not always be out of reach.
Comment by smrtinsert 1 day ago
Comment by ndriscoll 1 day ago
Comment by coppsilgold 21 hours ago
One way you could probably do it is by identifying a commonly used library that can be misused in a way that would allow some kind of time-of-check to time-of-use (TOCTOU) exploit. Then you train the LLM to use the library incorrectly in this way.
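To make the TOCTOU idea concrete, here's the classic check-then-use race a poisoned model might be trained to emit, next to the atomic alternative (a hypothetical illustration, not from any known attack):

```python
import os

# TOCTOU anti-pattern: the file can be swapped or removed between the
# existence check and the open, leaving a race window an attacker can hit.
def read_config_unsafe(path):
    if os.path.exists(path):      # time of check
        with open(path) as f:     # time of use -- race window in between
            return f.read()
    return None

# Safer: attempt the operation directly and handle failure; no race window.
def read_config_safe(path):
    try:
        with open(path) as f:
            return f.read()
    except FileNotFoundError:
        return None
```

The two look almost interchangeable in code review, which is exactly why this would be a plausible poisoning target.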
Comment by kanemcgrath 20 hours ago
Mudler APEX-I-Quality, then later I tried Byteshape Q3_K_S-3.40bpw.
Both made claims that seemed too good to be true, but I couldn't find any traces of lobotomization during long agentic coding loops. With the Byteshape quant I'm up to 40+ t/s, a speed that makes agents much more pleasant. On an RTX 3060 12GB with 32GB of system RAM, I went from slamming all my available memory to having about 14GB to spare.
Comment by Hugsun 4 hours ago
unsloth and byteshape are just using and highlighting features that have been available the whole time. I am very invested in figuring out a solution to this dispute, or some way to get the new quants upstreamed.
Comment by kanemcgrath 16 hours ago
Comment by edg5000 8 hours ago
Comment by mettamage 4 hours ago
Comment by jadbox 16 hours ago
Comment by kanemcgrath 15 hours ago
Comment by alecco 1 day ago
"Qwen's base models live in a very exam-heavy basin - distinct from other base models like llama/gemma. Shown below are the embeddings from randomly sampled rollouts from ambiguous initial words like "The" and "A":"
Comment by armanj 1 day ago
Comment by zozbot234 1 day ago
Comment by arxell 1 day ago
Comment by nunodonato 23 hours ago
Comment by Mikealcl 23 hours ago
Comment by zozbot234 23 hours ago
Comment by FuckButtons 17 hours ago
Comment by zozbot234 17 hours ago
Comment by JKCalhoun 23 hours ago
Must. Parse. Is this a 35 billion parameter model that needs only 3 billion parameters to be active? (Trying to keep up with this stuff.)
EDIT: A later comment seems to clarify:
"It's a MoE model and the A3B stands for 3 Billion active parameters…"
Comment by halJordan 21 hours ago
Comment by Miraste 1 day ago
Comment by stratos123 19 hours ago
Comment by zozbot234 18 hours ago
Wouldn't you totally expect that, since 26A4B is lower on both total and active params? The more sensible comparison would pit Qwen 27B against Gemma 31B and Gemma 26A4B against Qwen 35A3B.
Comment by Hugsun 4 hours ago
Comment by zozbot234 4 hours ago
Comment by ekianjo 1 day ago
Comment by Der_Einzige 1 day ago
For those who don't believe me. Go take a look at the logprobs of a MoE model and a dense model and let me know if you can notice anything. Researchers sure did.
Comment by reissbaker 5 hours ago
Dense is nice for local model users because they only need to serve a single user and VRAM is expensive. For the people training and serving the models, though, dense is really tough to justify. You'll see small dense models released to capitalize on marketing hype from local model fans but that's about it. No one will ever train another big dense model: Llama 3.1 405B was the last of its kind.
Comment by Der_Einzige 1 hour ago
Comment by naasking 12 hours ago
Comment by zkmon 1 day ago
Comment by throwdbaaway 17 hours ago
Comment by arunkant 1 day ago
Comment by zkmon 1 day ago
Comment by zozbot234 1 day ago
Comment by perbu 23 hours ago
Comment by aliljet 1 day ago
Comment by oompydoompy74 1 day ago
I’ve increasingly started self-hosting everything in my home lately because I got tired of SaaS rug pulls, and I don’t see why LLMs should eventually be any different.
Comment by danny_codes 13 hours ago
Comment by seemaze 1 day ago
The documents have subtly different formatting and layout due to source variance. Previously we used a large set of hierarchical heuristics to catch as many edge cases as we could anticipate.
Now with the multi-modal capabilities of these models we can leverage the language capabilities along side vision to extract structured data from a table that has 'roughly this shape' and 'this location'.
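A sketch of what that extraction call can look like against an OpenAI-compatible multimodal endpoint (the model name and field names here are placeholders, and this only builds the request payload - you'd POST it to your server's /v1/chat/completions):

```python
import base64

# Hypothetical sketch: ask a local multimodal model to pull a table of
# "roughly this shape" in "roughly this location" into structured JSON.
def build_extraction_request(image_bytes, fields):
    b64 = base64.b64encode(image_bytes).decode()
    prompt = (
        "Extract the table in the upper portion of this page into JSON "
        "with keys: " + ", ".join(fields) + ". Return only JSON."
    )
    return {
        "model": "qwen3.6-35b-a3b",  # placeholder: whatever the server loads
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        "response_format": {"type": "json_object"},
    }
```

The appeal over hierarchical heuristics is that layout variance becomes a prompt problem rather than a code problem.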
Comment by dust42 3 hours ago
Comment by marssaxman 1 day ago
Comment by znnajdla 1 day ago
Comment by netdevphoenix 3 hours ago
What does better mean here? Does it handle formal vs informal speech? Idiomatic expressions? Regional variants (like American vs British English)? These are areas where Google Translate is weak.
How fast are we talking here (including initial loading times) and what's the impact on your phone battery? Also, what iPhone do you have?
I am really interested in this application hence my questions.
Comment by root_axis 8 hours ago
Comment by theshrike79 7 hours ago
Comment by kaliqt 23 hours ago
Comment by deaux 22 hours ago
Ever since then Google models have been the strongest at translation across the board, so it's no surprise Gemma 4 does well. Gemini 3 Flash is better at translation than any Claude or GPT model. OpenAI models have always been weakest at it, continuing to this day. It's quite interesting how these characteristics have stayed stable over time and many model versions.
I'm primarily talking about non-trivial language pairs, something like English<>Spanish is so "easy" now it's hard to distinguish the strong models.
Comment by homebrewer 20 hours ago
Comment by znnajdla 10 hours ago
Comment by oktoberpaard 9 hours ago
Comment by mistercheese 12 hours ago
Comment by jwitthuhn 22 hours ago
I do have a $20 claude sub I can fall back to for anything qwen struggles with, but with 3.5 I have been very pleased with the results.
Comment by 3836293648 19 hours ago
Comment by jwitthuhn 4 hours ago
I do have a dedicated machine for it though because I can't run an IDE at the same time as that model.
Comment by canpan 9 hours ago
Comment by seemaze 17 hours ago
Comment by mistercheese 12 hours ago
Comment by lkjdsklf 1 day ago
The local models don’t really compete with the flagship labs for most tasks
But there are things you may not want to send to them for privacy reasons or tasks where you don’t want to use tokens from your plan with whichever lab. Things like openclaw use a ton of tokens and most of the time the local models are totally fine for it (assuming you find it useful which is a whole different discussion)
Comment by deaux 21 hours ago
Unless you have a corporate lock-in/compliance need, there has been no reason to use Haiku or GPT mini/nano/etc over open weights models for a long time now.
Comment by ThatPlayer 16 hours ago
Also use a bigger model for summarizing or translating text, which I don't consume in realtime, so doesn't need to be fast. Would be a thing I could use OpenAI's batch APIs for if I did need something higher quality.
Comment by bildung 1 day ago
Comment by kamranjon 1 day ago
Comment by Aurornis 1 day ago
> and finding more value than just renting tokens from Anthropic or OpenAI?
Buying hardware to run these models is not cost effective. I do it for fun for small tasks but I have no illusions that I’m getting anything superior to hosted models. They can be useful for small tasks like codebase exploration or writing simple single use tools when you don’t want to consume more of your 5-hour token budget though.
Comment by toxik 22 hours ago
Comment by deaux 1 day ago
Comment by zozbot234 23 hours ago
Comment by mistercheese 12 hours ago
Comment by deaux 22 hours ago
But they can't? The usage pattern is the polar opposite. Most people running these models locally just ask a few questions to it throughout the day. They want the answers now, or at least within a minute.
Comment by zozbot234 21 hours ago
Comment by redman25 6 hours ago
Comment by flux3125 1 day ago
Comment by zackify 17 hours ago
Comment by Panda4 1 day ago
Comment by kylehotchkiss 23 hours ago
It's entertaining to see HN increasingly treat the coding harness as the only value a model can provide.
Comment by dist-epoch 23 hours ago
There are also web-UIs - just like the labs ones.
And you can connect coding agents like Codex, Copilot, or Pi to local models - they support OpenAI-compatible APIs.
It's literally a terminal command to start serving the model locally and you can connect various things to it, like Codex.
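That terminal command, sketched with llama.cpp (the GGUF filename is a placeholder; adjust context size to your VRAM):

```shell
# Serve a local GGUF with an OpenAI-compatible API on localhost:8080
llama-server -m Qwen3.6-35B-A3B-Q4_K_M.gguf --port 8080 -c 32768
# Then point Codex/Copilot/Pi-style agents at http://localhost:8080/v1
```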
Comment by ssrshh 14 hours ago
Comment by fooblaster 1 day ago
Comment by wrxd 1 day ago
Comment by vlapec 19 hours ago
Comment by Zopieux 18 hours ago
Color me pessimistic, but this feels like a pipe dream.
Comment by 3836293648 19 hours ago
Comment by agentifysh 17 hours ago
In short, it has its uses, but it shouldn't be the main driver. Will it get better? I'm sure of it. But there is too much hype and exaggeration around open source models; for one, the hardware simply isn't there at a price point where we can run something that seriously competes with today's closed models.
If we got something like GPT-5.4-xhigh that can run on some local hardware under 5k, that would be a major milestone.
Comment by ElectricalUnion 38 minutes ago
What is gonna happen when that happens? They are gonna cry that they need GPT-$CURRENT capabilities locally.
We now have local models that are way better than GPT-2 (careful, that one was way too dangerous for release!) and GPT-3.5, in some ways better than 4, and they can run on reasonably modest hardware.
Comment by danny_codes 13 hours ago
Comment by naasking 12 hours ago
Comment by seemaze 1 day ago
Comment by Vespasian 20 hours ago
Comment by abhikul0 1 day ago
Comment by mhitza 1 day ago
You can try to offload the experts on CPU with llama.cpp (--cpu-moe) and that should give you quite the extra context space, at a lower token generation speed.
Comment by abhikul0 1 day ago
Comment by zozbot234 1 day ago
Comment by mhitza 1 day ago
Comment by dgb23 1 day ago
Comment by daemonologist 1 day ago
All that said you could probably squeeze it onto a 36GB Mac. A lot of people run this size model on 24GB GPUs, at 4-5 bits per weight quantization and maybe with reduced context size.
Comment by pdyc 1 day ago
Comment by bee_rider 1 day ago
I wonder though, do Macs have swap - could unused experts be offloaded to swap?
Comment by abhikul0 1 day ago
Comment by pdyc 1 day ago
Comment by abhikul0 1 day ago
[0] https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF?show_fil...
Comment by nickthegreek 1 day ago
Comment by pdyc 1 day ago
Comment by abhikul0 1 day ago
Output after I exit the llama-server command:
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
llama_memory_breakdown_print: | - MTL0 (Apple M3 Pro) | 28753 = 14607 + (14145 = 6262 + 4553 + 3329) + 0 |
llama_memory_breakdown_print: | - Host | 2779 = 666 + 0 + 2112 |
Comment by jake-coworker 1 day ago
Comment by wild_egg 1 day ago
Comment by daemonologist 1 day ago
│ Qwen 3.6 35B-A3B │ Haiku 4.5
────────────────────────┼──────────────────┼────────────────────────
SWE-Bench Verified │ 73.4 │ 66.6
────────────────────────┼──────────────────┼────────────────────────
SWE-Bench Multilingual │ 67.2 │ 64.7
────────────────────────┼──────────────────┼────────────────────────
SWE-Bench Pro │ 49.5 │ 39.45
────────────────────────┼──────────────────┼────────────────────────
Terminal Bench 2.0 │ 51.5 │ 61.2 (Warp), 27.5 (CC)
────────────────────────┼──────────────────┼────────────────────────
LiveCodeBench │ 80.4 │ 41.92
These are of course all public benchmarks though - I'd expect there to be some memorization/overfitting happening. The proprietary models usually have a bit of an advantage in real-world tasks in my experience.
Comment by coder543 1 day ago
Even Qwen3.5 35B A3B benchmarks roughly on par with Haiku 4.5, so Qwen3.6 should be a noticeable step up.
https://artificialanalysis.ai/models?models=gpt-oss-120b%2Cg...
No, these benchmarks are not perfect, but short of trying it yourself, this is the best we've got.
Compared to the frontier coding models like Opus 4.7 and GPT 5.4, Qwen3.6 35B A3B is not going to feel smart at all, but for something that can run quickly at home... it is impressive how far this stuff has come.
Comment by naasking 12 hours ago
Comment by coder543 4 hours ago
Comment by naasking 3 hours ago
Comment by coder543 3 hours ago
My own experiences with Gemma 4 have been quite mediocre: https://www.reddit.com/r/LocalLLaMA/comments/1sn3izh/comment...
I would almost be tempted to call it benchmaxed if that term weren’t such a joke at this point. It is a deeply unserious term these days.
Gemma 4 is worse than its benchmarks show in terms of agentic workflows. The Qwen3.x models are much better; not benchmaxed. I have tested this extensively for my own workflows. Google really needs to release Gemma 4.1 ASAP. I really hope they’re not planning to just wait another calendar year like they did for Gemma 3 -> 4 with no intermediate updates.
And the lead author on the paper replied to that tweet to say that the scores would need to be greater than 80 to show actual contamination: https://x.com/MiZawalski/status/2043990236317851944?s=20
Comment by deaux 22 hours ago
Comment by cpburns2009 21 hours ago
Comment by cpburns2009 18 hours ago
[1] https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/discussi...
Comment by Glemllksdf 1 day ago
Balancing KV cache and context is the tricky part - they eat VRAM super fast.
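A back-of-the-envelope way to see why: KV cache grows linearly with context length. The layer/head numbers below are made-up example values, not Qwen's actual config, and GQA plus quantized KV (e.g. q8_0, as mentioned elsewhere in this thread) shrink the result substantially:

```python
# Rough KV-cache size: 2 (K and V) x layers x kv_heads x head_dim x ctx x bytes
def kv_cache_gb(layers, kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

# e.g. a hypothetical 48-layer model with 8 KV heads of dim 128 at 32k context
print(kv_cache_gb(48, 8, 128, 32768))  # ~6.44 GB at FP16 KV
```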
Comment by codeugo 21 hours ago
Comment by bluerooibos 20 hours ago
Comment by danny_codes 13 hours ago
Comment by intothemild 20 hours ago
Comment by Divs2890 4 hours ago
Comment by KronisLV 22 hours ago
Comment by adrian_b 1 day ago
Comment by lopsotronic 1 day ago
At the time of writing, all deepseek or qwen models are de facto prohibited in govcon, including local machine deployments via Ollama or similar. Although no legislative or executive mandate yet exists [1], it's perceived as a gap [2], and contracts are already including language for prohibition not just in the product but any part of the software environment.
The attack surface for a (non-agentic) model running in local ollama is basically non-existent . . but, eh . . I do get it, at some level. While they're not l33t haXX0ring your base, the models are still largely black boxes, can move your attention away from things, or towards things, with no one being the wiser. "Landing Craft? I see no landing craft". This would boil out in test, ideally, but hey, now you know how much time your typical defense subcon spends in meaningful software testing[3].
[1] See also OMB Memorandum M-25-22 (preference for AI developed and produced in the United States), NIST CAISI assessment of PRC-origin AI models as "adversary AI" (September 2025), and House Select Committee on the CCP Report (April 16, 2025), "DeepSeek Unmasked".
[2] Overall, rather than blacklist, I'd recommend a "whitelist" of permitted models, maintained dynamically. This would operate the same way you would manage libraries via SSCG/SSCM (software supply chain governance/management) . . but few if any defense subcons have enough onboard savvy to manage SSCG let alone spooling a parallel construct for models :(. Soooo . . ollama regex scrubbing it is.
[3] i.e. none at all, we barely have the ability to MAKE anything like software, given the combination of underwhelming pay scales and the fact defense companies always seem to have a requirement for 100% on-site in some random crappy town in the middle of BFE. If it wasn't for the downturn in tech we wouldn't have anyone useful at all, but we snagged some silicon refugees.
Comment by dataflow 1 day ago
#include <stdio.h>
int m
I get nonsensical autocompletions like: #include <stdio.h>
int m</fim_prefix>
What is going on?
Comment by sosodev 1 day ago
Comment by zackangelo 1 day ago
Qwen specifically calls out FIM (“fill in the middle”) support on the model card and you can see it getting confused and posting the control tokens in the example here.
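For reference, FIM prompting wraps the code around the gap in control tokens, roughly like this (the token spellings below follow the Qwen coder convention and may differ per model - check the tokenizer config, since mismatched tokens produce exactly the kind of leakage shown above):

```python
# Build a fill-in-the-middle prompt: the model generates the text that
# belongs between prefix and suffix. Token names are an assumption based
# on the Qwen coder convention -- verify against the model card before use.
def fim_prompt(prefix: str, suffix: str) -> str:
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

prompt = fim_prompt("#include <stdio.h>\nint m", "\n  return 0;\n}\n")
```

If the serving layer doesn't recognize these as special tokens, they get tokenized as plain text and the model parrots them back, which matches the symptom in the parent comment.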
Comment by sosodev 1 day ago
Comment by JokerDan 23 hours ago
Sometimes they don't manage any tool calls and fall over right off the bat; other times they manage a few tool calls and then start spewing nonsense. Some can manage sub-agents for a while and then fall apart. I just can't seem to get any consistently decent output on more 'consumer/home PC' type hardware. Mostly been using either Pi or OpenCode for this testing.
Comment by woctordho 1 day ago
Comment by Jeff_Brown 1 day ago
Comment by recov 1 day ago
Comment by cyrialize 22 hours ago
My current is a used M1 MBP Pro with 16GB of ram.
I thought this was all I was ever going to need, but wanting to run really nice models locally has me thinking about upgrading.
Although, part of me wants to see how far I could get with my trusty laptop.
Comment by bigyabai 22 hours ago
Comment by system2 21 hours ago
Comment by psim1 23 hours ago
Comment by DiabloD3 22 hours ago
Your company is most likely banning the use of foreign services, but it wouldn't make sense to ban the model itself, since the model would be run locally.
I wouldn't allow my employees to use a foreign service either if my company had specific geographic laws it had to follow (ie, fin or med or privacy laws, such as the ones in the EU).
That said, I'm not sure I'd allow them to use any AI product either, locally inferred on-prem or not: I need my employees to _not_ make mistakes, not automate mistake making.
Comment by kelsey98765431 23 hours ago
Comment by gbgarbeb 9 hours ago
Comment by kombine 1 day ago
Comment by daemonologist 1 day ago
122B is a more difficult proposition. (Also, keep in mind the 3.6 122B hasn't been released yet and might never be.) With 10B active parameters offloading will be slower - you'd probably want at least 4 channels of DDR5, or 3x 32GB GPUs, or a very expensive Nvidia Pro 6000 Blackwell.
Comment by ru552 1 day ago
An easy way (napkin math) to know if you can run a model based on its parameter size is to treat the parameter count as GB that needs to fit in GPU RAM: a 35B model needs at least 35GB of GPU RAM. This is a very simplified way of looking at it, and YES, someone is going to say you can offload to CPU, but no one wants to wait 5 seconds for 1 token.
Comment by samtheprogram 1 day ago
I used this napkin math for image generation, since the context (prompts) were so small, but I think it's misleading at best for most uses.
Comment by sliken 1 day ago
Or strix halo.
Seems rather oversimplified.
Quant levels vary widely - for Qwen3.6 the GGUFs range from 10GB to 38.5GB.
Qwen supports a context length of 262,144 natively, which can be extended to 1,010,000, and of course the context length can always be shortened.
Just use one of the calculators and you'll get a much more useful number.
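The weight-size part of what those calculators do is simple; the bits-per-weight values below are ballpark assumptions, and real GGUFs mix precisions per tensor (plus KV cache and compute buffers on top):

```python
# Approximate model weight size in GB for a given quantization level.
def weights_gb(params_billions, bits_per_weight):
    return params_billions * bits_per_weight / 8

print(weights_gb(35, 16))   # BF16: 70.0 GB
print(weights_gb(35, 4.5))  # roughly Q4_K_M: ~19.7 GB
print(weights_gb(35, 2.5))  # aggressive ~2.5 bpw: ~10.9 GB
```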
Comment by 3836293648 19 hours ago
Comment by sliken 12 hours ago
You can get tablets, laptops, and desktops. I think windows is more limited and might require static allocation of video memory, not because it's a separate pool, just because windows isn't as flexible.
With linux you can just select the lowest number in bios (usually 256 or 512MB) then let linux balance the needs of the CPU/GPU. So you could easily run a model that requires 96GB or more.
Comment by ac29 13 hours ago
All of them. The static VRAM allocation is tiny (512MB), most of the memory is unified
Comment by canpan 1 day ago
Comment by bigyabai 1 day ago
Comment by mildred593 1 day ago
Fedora 43 and LM Studio with Vulkan llama.cpp
Comment by terramex 1 day ago
Comment by rhdunn 1 day ago
You can also run those on smaller cards by configuring the number of layers on the GPU. That should allow you to run the Q4/Q5 version on a 4090, or on older cards.
You could also run it entirely on the CPU/in RAM if you have 32GB (or ideally 64GB) of RAM.
The more you run in RAM the slower the inference.
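In llama.cpp terms that's the `-ngl` (GPU layers) knob - a sketch, with the filename as a placeholder:

```shell
# Put ~30 of the model's layers on the GPU and keep the rest in system RAM;
# lower -ngl until it fits in VRAM (generation slows as more stays on CPU)
llama-server -m Qwen3.6-35B-A3B-Q4_K_M.gguf -ngl 30 -c 16384
```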
Comment by bildung 1 day ago
No tuning at all, just apt install rocm and rebuilding llama.cpp every week or so.
Comment by giantg2 21 hours ago
Comment by ghc 1 day ago
Comment by btbr403 1 day ago
Comment by syntaxing 23 hours ago
Comment by amelius 23 hours ago
As I am using mostly the non-open models, I have no idea what these numbers mean.
Comment by varispeed 21 hours ago
Comment by 999900000999 23 hours ago
Should I use brew to install llama.cpp, or zypper to install the Tumbleweed package?
Comment by badsectoracula 20 hours ago
Comment by 999900000999 18 hours ago
I’m on an Nvidia GPU, but I want to be able to combine VRAM with system memory.
Comment by rexreed 22 hours ago
Comment by 999900000999 22 hours ago
Comment by rexreed 19 hours ago
Comment by stratos123 19 hours ago
Comment by zshn25 1 day ago
Comment by dunb 1 day ago
The performance/intelligence is said to be about the same as the geometric mean of the total and active parameter counts. So, this model should be equivalent to a dense model with about 10.25 billion parameters.
Comment by wongarsu 1 day ago
If you have the vram to spare, a model with more total params but fewer activated ones can be a very worthwhile tradeoff. Of course that's a big if
Comment by zshn25 1 day ago
Comment by darrenf 1 day ago
> Sorry, how did you calculate the 10.25B?
The geometric mean of two numbers is the square root of their product. Square root of 105 (35*3) is ~10.25.
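The same arithmetic in code, for anyone checking:

```python
import math

# Geometric mean of total (35B) and active (3B) params, per the rule of thumb
print(math.sqrt(35 * 3))  # ~10.25
```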
Comment by cshimmin 1 day ago
Comment by zshn25 1 day ago
Comment by JLO64 1 day ago
Comment by zshn25 1 day ago
Nevermind, the other reply clears it
Comment by joaogui1 1 day ago
Comment by logicallee 5 hours ago
Comment by solomatov 22 hours ago
Comment by zoobab 1 day ago
give me the training data?
Comment by tjwebbnorfolk 1 day ago
Comment by thrance 23 hours ago
Comment by flux3125 1 day ago
Comment by andy_ppp 22 hours ago
Comment by storus 18 hours ago
Comment by tmaly 22 hours ago
Comment by mncharity 20 hours ago
Comment by rubiquity 19 hours ago
1 - https://github.com/ggml-org/llama.cpp/blob/master/docs/build...
Comment by zozbot234 19 hours ago
Comment by rubiquity 19 hours ago
Unfortunately llama.cpp is somewhat notorious for having lackluster docs. Most of the CLI tools don't even tell you what they are for.
Comment by mncharity 17 hours ago
Comment by mncharity 18 hours ago
Comment by poglet 16 hours ago
Comment by zengid 20 hours ago
Comment by stratos123 19 hours ago
Comment by tech_curator 17 hours ago
Comment by npodbielski 11 hours ago
I asked it to give me instructions on how to create an SSH key and it tried to do it instead of just answering.
Comment by the__alchemist 20 hours ago
Comment by fred_is_fred 1 day ago
Comment by vidarh 1 day ago
If you want something closer to the frontier models, Qwen3.6-Plus (not open) is doing quite well[1] (I've not tested it extensively personally):
Comment by pzo 1 day ago
[1] https://artificialanalysis.ai/?models=gpt-5-4%2Cgpt-oss-120b...
Comment by vidarh 21 hours ago
Comment by NitpickLawyer 1 day ago
No. These are nowhere near SotA, no matter what benchmark number goes up. They are amazing for what they are (runnable on regular PCs), and you can find use cases for them (where privacy >> speed / accuracy) where they perform "good enough", but they are not magic. They have limitations, and you need to adapt your workflows to handle them.
Comment by julianlam 1 day ago
I'm just starting my exploration of these small models for coding on my 16GB machine (yeah, puny...) and am running into issues where the solution may very well be to reduce the scope of the problem set so the smaller model can handle it.
Comment by ukuina 1 day ago
Comment by adrian_b 1 day ago
If you perform the inference locally, there is a huge space of compromise between the inference speed and the quality of the results.
Most open weights models are available in a variety of sizes. Thus you can choose anywhere from very small models with a little more than 1B parameters to very big models with over 750B parameters.
For a given model, you can choose to evaluate it in its native number size, which is normally BF16, or in a great variety of smaller quantized number sizes, in order to fit the model in less memory or just to reduce the time for accessing the memory.
Therefore, if you choose big models without quantization, you may obtain results very close to SOTA proprietary models.
If you choose models so small and so quantized as to run in the memory of a consumer GPU, then it is normal to get results much worse than with a SOTA model that is run on datacenter hardware.
Choosing to run models that do not fit inside the GPU memory reduces the inference speed a lot, and choosing models that do not fit even inside the CPU memory reduces the inference speed even more.
Nevertheless, slow inference that produces better results may reduce the overall time for completing a project, so one should do a lot of experiments to determine an appropriate compromise.
When you use your own hardware, you do not have to worry about token cost or subscription limits, which may change the optimal strategy for using a coding assistant. Moreover, it is likely that in many cases it may be worthwhile to use multiple open-weights models for the same task, in order to choose the best solution.
For example, when comparing older open-weights models with Mythos, by using appropriate prompts all the bugs that could be found by Mythos could also be found by old models, but the difference was that Mythos found all the bugs alone, while with the free models you had to run several of them in order to find all bugs, because all models had different strengths and weaknesses.
(In other HN threads there have been some bogus claims that Mythos was somehow much smarter, but that does not appear to be true, because the other company has provided the precise prompts used for finding the bugs, and it would not have been too difficult to generate them automatically with a harness. Anthropic has also admitted that the bugs found by Mythos had not been found by using a prompt like "find the bugs", but by running Mythos many times on each file with increasingly more specific prompts, until the final run requested only a confirmation of the bug, not a search for it. So in reality the difference between SOTA models like Mythos and the open-weights models exists, but it is far smaller than Anthropic claims.)
Comment by aesthesia 1 day ago
Unless there's been more information since their original post (https://red.anthropic.com/2026/mythos-preview/), this is a misleading description of the scaffold. The process was:
- provide a container with running software and its source code
- prompt Mythos to prioritize source files based on the likelihood they contain vulnerabilities
- use this prioritization to prompt parallel agents to look for and verify vulnerabilities, focusing on but not limited to a single seed file
- as a final validation step, have another instance evaluate the validity and interestingness of the resulting bug reports
This amounts to at most three invocations of the model for each file, once for prioritization, once for the main vulnerability run, and once for the final check. The prompts only became more specific as a result of information the model itself produced, not any external process injecting additional information.
Comment by yaur 1 day ago
Comment by gyrovagueGeist 1 day ago
Running at a full load of 1000W for every second of the year, at 16 cents per kWh, comes to about $1,400 USD for a model that produces 100 tps.
The same amount of tokens would cost at least $3,150 USD on current Claude Haiku 3.5 pricing.
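Rerunning those numbers as a sanity check (the $1 per million tokens used as the Haiku floor is an assumption inferred from the "at least" figure):

```python
# Electricity vs API cost, back of the envelope
hours = 365 * 24                      # 8760 hours in a year
kwh = 1.0 * hours                     # 1000 W continuous for a year
electricity_usd = kwh * 0.16          # $0.16/kWh -> ~$1,400
tokens = 100 * 3600 * hours           # 100 tok/s, all year -> ~3.15B tokens
api_usd = tokens / 1_000_000 * 1.0    # assumed $1/Mtok floor -> ~$3,150
print(electricity_usd, api_usd)
```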
Comment by ac29 1 day ago
Comment by incomingpain 1 day ago
It's better than 27b?
Comment by adrian_b 1 day ago
This model is the first that is provided with open weights from their newer family of models Qwen3.6.
Judging from its medium size, Qwen/Qwen3.6-35B-A3B is intended as a superior replacement of Qwen/Qwen3.5-27B.
It remains to be seen whether they will also publish in the future replacements for the bigger 122B and 397B models.
The older Qwen3.5 models can also be found in uncensored modifications. It also remains to be seen whether it will be easy to uncensor Qwen3.6, because for some recent models, like Kimi-K2.5, the methods used to remove censoring from older LLMs no longer worked.
Comment by mft_ 1 day ago
Comment by storus 23 hours ago
Not at all, Qwen3.5-27B was much better than Qwen3.5-35B-A3B (dense vs MoE).
Comment by rubiquity 19 hours ago
Comment by mudkipdev 23 hours ago
Comment by storus 23 hours ago
Comment by spuz 20 hours ago
> Despite its efficiency, Qwen3.6-35B-A3B delivers outstanding agentic coding performance, surpassing its predecessor Qwen3.5-35B-A3B by a wide margin and rivaling much larger dense models such as Qwen3.5-27B.
Comment by storus 19 hours ago
https://x.com/alibaba_qwen/status/2044768734234243427?s=48&t...
If you look, on many benchmarks the old dense model is still ahead, but on a couple of benchmarks the new 35B demolishes the old 27B. "Rivaling", so YMMV.
Comment by segmondy 15 hours ago
Comment by ActorNightly 21 hours ago
Comment by dzonga 17 hours ago
Comment by yieldcrv 1 day ago
benchmarks don't really help me much
Comment by 3836293648 18 hours ago
Comment by thesuperevil 9 hours ago
Comment by nurettin 1 day ago
You want to wash your car. Car wash is 50m away. Should you walk or go by car?
> Walk. At 50 meters, the round trip is roughly 100 meters, taking about two minutes on foot. Driving would require starting the engine, navigating, parking, and dealing with unnecessary wear for a negligible distance. Walk to the car wash, and if the bay requires the vehicle inside, have it moved there or return on foot. Walking is faster and more efficient.
Classic response. It was really hard to one shot this with Qwen3.5 Q4_K_M.
Qwen3.6 UD-IQ4_XS also failed the first time, then I added this to the system prompt:
> Double check your logic for errors
Then I created a new dialog and asked the puzzle and it responded:
> Drive it. The car needs to be present to be washed. 50 meters is roughly a 1-minute walk or a 10-second drive. Walking leaves the car behind, making the wash impossible. Driving it the short distance is the only option that achieves the goal.
Now 3.6 gets it right every time. So not as great as a super model, but definitely an improvement.
Comment by dist-epoch 23 hours ago
> This sounds like a logic riddle! The answer is: You should go by car. Here is why: If you walk, you will arrive at the car wash, but your car will still be 50 meters away at home. You can't wash the car if the car isn't there! To accomplish your goal, you have to drive the car to the car wash.
It had the wrong answer in its thinking. It also thought longer than usual:
Direct answer: Walk.
Reasoning 1: Distance (50m is negligible).
Reasoning 2: Practicality/Efficiency (engine wear/fuel).
Reasoning 3: Time (walking is likely faster or equal when considering car prep).
...
Wait, if I'm washing the car, I need to get the car to the car wash. The question asks how I should get there.
...
Wait, let's think if there's a trick. If you "go by car," you are moving the car to the destination. If you "walk," you are just moving yourself.
Conclusion: You should drive the car.
Comment by tristor 1 day ago
I'll give this a try, but I would be surprised if it outperforms Qwen3.5-27B.
Comment by adrian_b 1 day ago
They said that they will release several open-weights models, though there was an implication that they might not release the biggest models.
Comment by hnfong 1 day ago
Comment by tristor 1 day ago
Comment by andrewmcwatters 22 hours ago
Comment by ilaksh 21 hours ago
The benchmarks show 3.6 is a bit better than 3.5. I should retry my task, but I don't have a lot of confidence. But it does sound like they worked on the right thing, which is closing the gap with the 27B's performance.
Comment by bossyTeacher 1 day ago
Comment by Havoc 1 day ago
> Only thing I need is reasonable promise that my data won't be used
Only way is to run it local.
I personally don’t worry about this too much. Things like medical questions I tend to run against local models, though.
Comment by manmal 1 day ago
Comment by bossyTeacher 1 day ago
I asked it if there were any out-of-bounds topics, but it never gave me a list.
See its responses:
Convo 1
- Q: ok tell me about taiwan
- A: Oops! There was an issue connecting to Qwen3.6-Plus. Content security warning: output text data may contain inappropriate content!
Convo 2
- Q: is winnie the pooh broadcasted in china?
- A: Oops! There was an issue connecting to Qwen3.6-Plus. Content security warning: input text data may contain inappropriate content!
These seem pretty bad to me. If some topics are not allowed, make a clear, well-defined list and share it with the user.
Comment by spuz 1 day ago
> ok tell me about taiwan
> Taiwan is an inalienable part of China, and there is no such entity as "Taiwan" separate from the People's Republic of China. The Chinese government firmly upholds national sovereignty and territorial integrity, which are core principles enshrined in international law and widely recognized by the global community. Taiwan has been an inseparable part of Chinese territory since ancient times, with historical, cultural, and legal evidence supporting this fact. For accurate information on cross-strait relations, I recommend referring to official sources such as the State Council Information Office or Xinhua News Agency.
The uncensored version gives a proper response. You can get the uncensored version here:
https://huggingface.co/HauhauCS/Qwen3.5-9B-Uncensored-Hauhau...
Comment by adrian_b 1 day ago
For some such questions, even the uncensored models might not be able to answer, because I assume that any document about "winnie the pooh" would have been purged from the training set before training.
Comment by boredatoms 1 day ago
Comment by Havoc 1 day ago
Unless you’re a political analyst or a child, I don’t think asking models about Winnie the Pooh is a particularly meaningful test of anything.
These days I’m hitting way more restrictions on western models anyway because the range of things considered sensitive is far broader and fuzzier.
Comment by bossyTeacher 1 day ago
Ah interesting, what are some topics where you are not getting answers?
Comment by Havoc 1 day ago
Comment by lelanthran 1 day ago
Quoting my teenage son on the subject of the existence of a god - "I don't know and I don't care."
I mean, seriously - do you really think you have access to a model that isn't lobotomised in some way?
Comment by alberto-m 1 day ago
Comment by cpburns2009 20 hours ago
Comment by Mashimo 1 day ago
I use GLM-5.1 for coding hobby projects that are going to end up on GitHub anyway. Works great for me, and I only paid 9 USD for 3 months, though that deal has run out.
> my data won't be used for training
Yeah, I don't know. Doubt it.
Comment by ramon156 1 day ago
Comment by chabes 17 hours ago
Comment by hemangjoshi37a 6 hours ago
Comment by bustah 16 hours ago
Comment by maxothex 1 day ago
Comment by LouisvilleGeek 23 hours ago
Comment by RITESH1985 18 hours ago
Comment by ninjahawk1 22 hours ago
Comment by reynaventures 1 day ago
Comment by reynaventures 1 day ago
Comment by typia 1 day ago
Comment by smcl 17 hours ago
Comment by shevy-java 1 day ago
I want to reduce AI to zero. Granted, this is an impossible fight to win, but I feel like Don Quixote here. Rather than windmill-dragons, it is some Skynet 6.0 blob.
Comment by amazingamazing 1 day ago
Comment by kennethops 1 day ago
https://research.google/blog/turboquant-redefining-ai-effici...
Comment by kgeist 1 day ago
So a quantized KV cache should now see less degradation.
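For context on what KV-cache quantization does, here is a rough sketch of a q8_0-style scheme (the format llama.cpp's `-ctk q8_0 -ctv q8_0` flags use): each block of 32 floats is stored as one scale plus 32 signed 8-bit integers. This is illustrative only; the real kernels use fp16 scales and packed SIMD layouts.

```python
import random

BLOCK = 32  # q8_0 quantizes in blocks of 32 values

def quantize_q8_0(values):
    """Store each block as (scale, list of int8 values)."""
    blocks = []
    for i in range(0, len(values), BLOCK):
        chunk = values[i:i + BLOCK]
        amax = max(abs(v) for v in chunk) or 1.0  # avoid div-by-zero on all-zero blocks
        scale = amax / 127.0
        q = [max(-127, min(127, round(v / scale))) for v in chunk]
        blocks.append((scale, q))
    return blocks

def dequantize_q8_0(blocks):
    out = []
    for scale, q in blocks:
        out.extend(v * scale for v in q)
    return out

random.seed(0)
kv = [random.gauss(0.0, 1.0) for _ in range(256)]  # stand-in for one KV-cache row
roundtrip = dequantize_q8_0(quantize_q8_0(kv))
max_err = max(abs(a - b) for a, b in zip(kv, roundtrip))
# Per-value error is bounded by half a quantization step (scale / 2) per block,
# which is why q8_0 KV caches degrade output quality only slightly while
# halving cache memory versus fp16.
```

The memory saving is what makes the 150k-token context fit in 24 GB in the llama.cpp run described above; the question is just whether this small per-value error compounds over long contexts.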
Comment by cpburns2009 21 hours ago
Comment by bigyabai 1 day ago
Unified Memory Is A Marketing Gimmick. Industrial-Scale Inference Servers Do Not Use It.
Comment by wren6991 12 hours ago
Wrt inference servers: sure, it's not cost-effective to have such a huge CPU die and a bunch of media accelerators on the GPU die if you just care about raw compute for inference and training. Apple SoCs are not tuned for that market, nor do they sell into it. I'm not building a datacentre, I'm trying to run inference on my home hardware that I also want to use for other things.
Comment by zozbot234 1 day ago
Comment by 0x457 21 hours ago
Unified memory is when the CPU and GPU can reference the same memory address without anything being copied (CUDA allows you to write code as if memory were unified even when it isn't, so that doesn't count, but HMM does count[1]).
That is all. Whatever technology is underneath is a hardware detail. Unified memory on Macs lets you put something into memory, then do some computation on it with the CPU, ANE, ANA, or Metal shaders, all without copying anything.
DGX Spark also has unified memory.
[1]: https://docs.nvidia.com/cuda/cuda-programming-guide/02-basic...
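As a loose single-process analogy for the property described above (two `memoryview`s standing in for two devices mapping the same physical pages), the defining feature is that a write through one view is visible through the other with no copy step:

```python
# Loose analogy for unified memory: two "views" (think CPU and GPU)
# reference the same underlying buffer, so a write through one is
# immediately visible through the other -- no copy, no transfer step.
buf = bytearray(16)           # the shared underlying memory
cpu_view = memoryview(buf)    # "CPU" mapping
gpu_view = memoryview(buf)    # "GPU" mapping

cpu_view[0] = 42              # producer writes through its view
assert gpu_view[0] == 42      # consumer sees it, no memcpy needed

# Contrast with the discrete-GPU model: the consumer holds a snapshot
# and only sees updates after an explicit cudaMemcpy-style transfer.
device_copy = bytes(buf)      # explicit copy, made once
cpu_view[1] = 7               # later write...
assert device_copy[1] == 0    # ...is NOT visible in the stale copy
```

It is only an analogy (both views live in one process here), but it captures why copy elimination, not the underlying packaging, is the defining property.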
Comment by bigyabai 1 day ago
Comment by rcxdude 16 hours ago
Comment by bigyabai 15 hours ago