Qwen3.6-Max-Preview: Smarter, Sharper, Still Evolving
Posted by mfiguiere 19 hours ago
Comments
Comment by alex7o 17 hours ago
Comment by mikenew 10 hours ago
Some people seem to agree and some don't, but I think that indicates it's just down to your specific domain and usage patterns, rather than the SOTA models being objectively better like they clearly used to be.
Comment by operatingthetan 10 hours ago
Comment by fwipsy 9 hours ago
Comment by easygenes 5 hours ago
The best tests are your own custom, personal-task-relevant standardized tests (which the best models can't saturate — aim for less than a 70% pass rate even in the best case).
All this is to say that most people are not doing the latter and their vibes are heavily confounded to the point of being mostly meaningless.
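A minimal sketch of what such a personal eval harness could look like (Python; `run_model`, the cases, and the checkers are all hypothetical stand-ins, not a real API):

```python
# Hypothetical sketch of a "personal standardized test": a fixed set of
# task/checker pairs you rerun against every model. `run_model`, the
# cases, and the checkers below are toy stand-ins.
def pass_rate(run_model, cases):
    passed = sum(1 for prompt, check in cases if check(run_model(prompt)))
    return passed / len(cases)

# Real cases would come from your own domain tasks, hard enough that the
# best models land well under a 70% pass rate.
cases = [
    ("Return the word 'four'", lambda out: "four" in out.lower()),
    ("Name a prime below 10", lambda out: any(p in out for p in "2357")),
]
fake_model = lambda prompt: "Four" if "four" in prompt.lower() else "7"
print(pass_rate(fake_model, cases))  # 1.0
```

The point is that the cases stay fixed across model releases, so the pass rate is comparable over time, unlike vibes.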
Comment by operatingthetan 9 hours ago
You are right that this is not exactly subjectivity, but I think for most people it feels like it. We don't have good benchmarks (imo), we read a lot about other people's experiences, and we have our own. I think certain models are going to be objectively better at certain tasks, it's just our ability to know which currently is impaired.
Comment by hamdingers 8 hours ago
People judge models on their outputs, but how you like to prompt has a tremendous impact on those outputs and explains why people have wildly different experiences with the same model.
Comment by ulfw 3 hours ago
One model can replace another at any given moment in time.
It's NOT a winner-takes-all industry
and hence none of the lofty valuations make sense.
The AI bubble burst will be epic and make us all poorer. Yay
Comment by mettamage 1 hour ago
Will try it out. Thanks for sharing!
Comment by LoganDark 9 hours ago
Comment by deaux 7 hours ago
If this was the case then Anthropic would be in a very bad spot.
It's not, which is why people got so mad about being forced to use it rather than better third party harnesses.
Pi is better than CC as a harness in almost every respect.
Comment by enochthered 6 hours ago
Comment by bizzletk 5 hours ago
Comment by deaux 2 hours ago
- It still lacks support for industry standards such as AGENTS.md
- Extremely limited customization
- Lots of bugs including often making it impossible to view pre-compaction messages inside Claude Code.
- Obvious one: can't easily switch between Claude and non-Claude models
- Resource usage
More than anything, I haven't found a single thing that Pi does worse. All of it is just straight up better or the same.
Comment by Mashimo 3 hours ago
Comment by bink-lynch 8 hours ago
Comment by zackify 6 hours ago
Really liking pi and glm 5.1!
Comment by jadbox 8 hours ago
Comment by bink-lynch 2 hours ago
Comment by zackify 6 hours ago
Comment by jxmesth 15 hours ago
Comment by embedding-shape 15 hours ago
What agent harness did you use? Usually, "write_file", "shell_exec" or similar are two of the first tools you add to an agent harness, after read_file/list_files. If it doesn't have those tools, I'm not sure you could even call it an agent harness in the first place.
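For reference, a minimal sketch of those core tools in Python (generic names as mentioned above; real harnesses differ in argument schemas, sandboxing, and error handling):

```python
import os
import subprocess

# Minimal sketch of the four core harness tools. Real harnesses wrap
# these with JSON schemas, path restrictions, and output truncation.
def list_files(path="."):
    return sorted(os.listdir(path))

def read_file(path):
    with open(path) as f:
        return f.read()

def write_file(path, content):
    with open(path, "w") as f:
        f.write(content)
    return f"wrote {len(content)} bytes to {path}"

def shell_exec(cmd, timeout=30):
    result = subprocess.run(cmd, shell=True, capture_output=True,
                            text=True, timeout=timeout)
    return result.stdout + result.stderr

# The harness advertises these to the model and dispatches its tool calls.
TOOLS = {"list_files": list_files, "read_file": read_file,
         "write_file": write_file, "shell_exec": shell_exec}
```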
Comment by jxmesth 15 hours ago
Comment by embedding-shape 14 hours ago
Comment by jxmesth 4 hours ago
Comment by noduerme 13 hours ago
Comment by dymk 13 hours ago
Comment by ycui1986 9 hours ago
Comment by dymk 8 hours ago
Comment by koen_hendriks 11 hours ago
Comment by enneff 10 hours ago
Comment by andai 10 hours ago
Comment by dymk 8 hours ago
Comment by arcanemachiner 6 hours ago
Comment by chillfox 8 hours ago
Comment by ecocentrik 15 hours ago
Comment by jxmesth 15 hours ago
Comment by ecocentrik 14 hours ago
Comment by ycui1986 6 hours ago
Comment by sscaryterry 14 hours ago
Comment by NobleLie 10 hours ago
Comment by ycui1986 9 hours ago
Comment by jwitthuhn 15 hours ago
Comment by estimator7292 13 hours ago
Comment by blurbleblurble 4 hours ago
Comment by chillfox 54 minutes ago
So I am curious, how do people get these lazy outputs?
Is it by having one of those custom system prompts that basically tells the model to be disrespectful?
Or is it free tier?
Cheap plans?
Comment by enraged_camel 35 minutes ago
Comment by szundi 35 minutes ago
Comment by Moosdijk 15 hours ago
Every time I try to build something with it, the output is worse than other models I use (Gemini, Claude), it takes longer to reach an answer and plenty of times it gets stuck in a loop.
Comment by pkulak 15 hours ago
The big kicker for GLM for me is I can use it in Pi, or whatever harness I like. Even if it was _slightly_ below Opus, and even though it's slower, I prefer it. Maybe Mythos will change everything, but who knows.
Comment by tasuki 14 hours ago
Yes, but... isn't the same true for Opus and all the other models too?
Comment by slopinthebag 13 hours ago
So you're either paying $1000's for Opus in Pi, or $30/month for GLM in Pi. If the results are mostly equivalent that's an easy choice for most of us.
Comment by tasuki 13 hours ago
Comment by RussianCow 13 hours ago
Comment by bink-lynch 8 hours ago
Comment by girvo 12 hours ago
Comment by probst 12 hours ago
It also compresses the context at around 100k tokens.
In case anyone is interested: https://github.com/sebastian/pi-extensions/tree/main/.pi/ext...
Comment by chillfox 50 minutes ago
And yes, sonnet/opus is better and what I use daily. But I wouldn’t be that upset if I had to drop down to GLM.
Comment by Mashimo 15 hours ago
You have to keep it below ~100,000 tokens, or else it gets funny in the head.
I only use it for hobby projects though. I paid 3 EUR per month, but that plan is no longer available :( Not sure what I will choose at the end of the month. Maybe OpenCode Go.
Comment by Mashimo 1 hour ago
Evening CET experience for me is super smooth.
Comment by gck1 11 hours ago
That would leave almost no tokens for actual work
Comment by Akira1364 15 hours ago
Comment by spaceman_2020 13 hours ago
4.7 is better, but it's also wildly expensive
Comment by slopinthebag 14 hours ago
Comment by ternaryoperator 16 hours ago
Comment by ezekiel68 14 hours ago
I was very impressed.
Comment by gck1 11 hours ago
Every time codex reviews claude written rust, I can't explain it, but it almost feels like codex wants to scream at whoever wrote it.
Comment by lambda 8 hours ago
Comment by justincormack 14 hours ago
Comment by sirnicolaz 14 hours ago
Comment by cornedor 16 hours ago
Comment by dev_l1x_be 15 hours ago
Comment by cornedor 14 hours ago
Comment by odie5533 13 hours ago
Comment by mkhalil 2 hours ago
FAANGs love to give away money to get people addicted to their platforms, and even they, the richest companies in the world, are throttling or reducing Opus usage for paying members, because even the money we pay them doesn't cover it.
Meanwhile, these are usable on local deployments! (and that's with the limited allowance our AI overlords afford us when it comes to choices for graphics cards too!)
Comment by FlyingSnake 16 hours ago
Comment by bensyverson 16 hours ago
Comment by nothinkjustai 16 hours ago
Comment by girvo 12 hours ago
Comment by FlyingSnake 16 hours ago
You’re absolutely right!
Jokes apart, I did notice GLM doing these back and forth loops.
Comment by tonyarkles 15 hours ago
Comment by Lerc 14 hours ago
Comment by tonyarkles 13 hours ago
Also, thanks for pointing me at that specific paper; I spend a lot more of my life closer to classical control theory than ML theory, so it's always neat to see the intersection of them. My unsubstantiated hypothesis is that controls & ML are going to start getting looked at more holistically, and not in the way I normally see it (which is "why worry about classical control theory, just solve the problem with RL"). Control theory is largely about steering dynamic systems along stable trajectories through state space... which is largely what iterative "fill in the next word" LLM models are doing. The intersection, I hope, will be interesting and add significant efficiency.
Comment by nothinkjustai 16 hours ago
Comment by complexworld 4 hours ago
Comment by dev_l1x_be 15 hours ago
Comment by solomatov 14 hours ago
Could you please share more about this
Comment by alex7o 12 hours ago
One is for local opencode coding and config of stuff, the other is for agent-browser use, and for both it did better (than Opus 4.6) for the thing I was testing at the time. The problem with Opus, at the moment I tried it, was overthinking and sometimes moving itself in the wrong direction (not that Qwen doesn't overthink sometimes). However, sometimes less is more - maybe turning thinking down on Opus would have helped me. Some people said it is better to turn it off entirely when you start to implement code, as it already knows what it needs to do and doesn't need more distraction.
Another example is my ghostty config: I learned from Qwen that it has theme support - Opus would always just make the theme in the main file.
Comment by OtomotO 17 hours ago
As so many things these days: It's a cult.
I've used Claude for many months now. Since February I see a stark decline in the work I do with it.
I've also tried to use it for GPU programming, where it absolutely sucks, with Sonnet, Opus 4.5 and 4.6
But if you share that sentiment, it's always a "You're just holding it wrong" or "The next model will surely solve this"
For me it's just a tool, so I shrug.
Comment by balls187 16 hours ago
I find myself repeating the following pattern: I use an AI model to assist me with work, and after some time, I notice the quality doesn't justify the time investment. I decide to try a similar task with another provider. I try a few more tests, then decide to switch over for full time work, and it feels like it's awesome and doing a good job. A few months later, it feels like the model got worse.
Comment by runarberg 16 hours ago
1. The models are purposefully nerfed, before the release of the next model, similar to how Apple allegedly nerfed their older phones when the next model was out.
2. You are relying more and more on the models and are using your talent less and less. What you are observing is the ratio of your vs. the model’s work leaning more and more to the model’s. When a new model is released, it produces better-quality code than before, so the work improves with it, but your talent keeps deteriorating at a constant rate.
Comment by ehnto 16 hours ago
As you note, the developer's input is still driving the model quite a bit so if the developer is contributing less and less as they trust more, the results would get worse.
Comment by tonyarkles 15 hours ago
One other failure mode that I've seen in my own work while I've been learning: the things that you put into AGENTS.md/CLAUDE.md/local "memories" can improve performance or degrade performance, depending on the instructions. And unless you're actively quantitatively reviewing and considering when performance is improving or degrading, you probably won't pick up that two sentences that you added to CLAUDE.md two weeks ago are why things seem to have suddenly gotten worse.
> similar to how you can expect more over time from a junior you are delegating to and training
That's the really interesting bit. Both Claude and Codex have learned some of my preferences by me explicitly saying things like "Do not use emojis to indicate task completion in our plan files, stick to ASCII text only". But when you accidentally "teach" them something that has a negative impact on performance, they're not very likely to push back, unlike a junior engineer who will either ignore your dumb instruction or hopefully bring it up.
> As you note, the developer's input is still driving the model quite a bit so if the developer is contributing less and less as they trust more, the results would get worse.
That is definitely a thing too. There have been a few times that I have "let my guard down" so to speak and haven't deeply considered the implications of every commit. Usually this hasn't been a big deal, but there have been a few really ugly architectural decisions that have made it through the gate and had to get cleaned up later. It's largely complacency, like you point out, as well as burnout trying to keep up with reviewing and really contemplating/grokking the large volume of code output that's possible with these tools.
Comment by svnt 16 hours ago
Comment by runarberg 16 hours ago
Personally I can see the case for both interpretations being true at the same time, and maybe that is precisely why I conflated them so eagerly in my initial post.
Comment by rescbr 13 hours ago
I’d prefer providers to simply deprecate stuff faster, but then that would break other people’s existing workflows.
Comment by flux3125 15 hours ago
Comment by e12e 16 hours ago
Newer (past two years?) models have improved "in detail" - or as pragmatic tools - but they still don't deserve the anthropomorphism we subject them to because they appear to communicate like us (and therefore appear to think and reason, like us).
But the "holes" are painted over in contemporary models - via training, system prompts and various clever (useful!) techniques.
But I think this leads us to have great difficulty spotting the weak spots in a new, or slightly different model - but as we get to know each particular tool - each model - we get better at spotting the holes on that model.
Maybe it's poorly chosen variable names. A tendency to write plausible-looking, plausibly named e2e tests that turn out to not quite test what they appear to test at first glance. Maybe there's missing locking of resources, or missing use of transactions, in sequential code that appears sound - but ends up storing invalid data when one or several steps fail...
In happy cases, current LLMs function like well-intentioned junior coders enthusiastically delivering features and fixing bugs.
But in the other cases, they are like pathologically lying sociopaths telling you anything you want to hear, just so you keep paying them money.
When you catch them lying, it feels a bit like a betrayal. But the parrot is just tapping the bell, so you'll keep feeding it peanuts.
Comment by taurath 17 hours ago
In the same way, it’s hard to see how people who say they’re struggling are actually using it.
There’s truth somewhere in between “it’s the answer to everything” and “skill issue”. We know it’s overhyped. We know that it’s still useful to some extent, in many domains.
Comment by balls187 16 hours ago
We're also seeing that the people up top are using this to cull the herd.
Comment by psychoslave 17 hours ago
At some point there is a need to have faith in some stable-enough ground to be able to walk on.
Comment by Wolfbeta 15 hours ago
Comment by ecshafer 17 hours ago
Comment by smallmancontrov 15 hours ago
Under normal circumstances I'd consider this a nit and decline to pick it, but the number of evangelists out there arguing the equivalent of "cure your alcohol addiction with crystal meth!" is too damn high.
Comment by bensyverson 16 hours ago
Comment by ecshafer 16 hours ago
Comment by bensyverson 14 hours ago
I'd encourage you to check it out for yourself. It's certainly possible to be a dogmatic Buddhist, but one of the foundational beliefs of Buddhism is that the type of dogmatic attachment you're describing is avoidable. It's not easy, but that's why you meditate.
Comment by tauroid 15 hours ago
Comment by svnt 16 hours ago
Comment by bensyverson 16 hours ago
Comment by svnt 15 hours ago
Comment by bensyverson 14 hours ago
East, West, Religion, Practice… From a Zen perspective, you're just troubling your mind with binaries and conflict.
Comment by svnt 13 hours ago
The binaries still functionally exist. I see a lot of value in reflective practices. At the same time it seems unlikely to me that the point of existing is to not trouble your mind.
Comment by bensyverson 12 hours ago
If Buddhism can be said to have a goal, it is to reduce suffering (including your own), so troubling your own mind is indeed something it can help with. The point of existence would be something interesting to meditate on. If you discover it, let us all know!
Comment by svnt 5 hours ago
Dogma, like the binaries, still functionally exists, whatever the narrative. If you can’t admit that, that might also be something interesting to meditate on.
Say you have eliminated all suffering. How many versions of that world exist? How many of them are true, beautiful, and good? See how, in order to evaluate the success or failure of Buddhism, we have to move beyond “eliminate suffering” to a higher value standard?
Comment by OtomotO 16 hours ago
Comment by taneq 16 hours ago
Comment by slopinthebag 8 hours ago
I think in every domain, the better you are the less useful you find AI.
Comment by redsocksfan45 15 hours ago
Comment by seanw265 14 hours ago
Qwen appears to be much more expensive:
- Qwen: $1.3 in / $7.8 out
- Kimi: $0.95 in / $4 out
--
The announcement posts only share two overlapping benchmark results. Qwen appears to score slightly lower on SWE-Bench Pro and Terminal-Bench 2.0.
Qwen:
- Terminal-Bench 2.0: 65.4
- SWE-Bench Pro: 57.3
Kimi:
- Terminal-Bench 2.0: 66.8
- SWE-Bench Pro: 58.6
--
Different models have different strong suits, and benchmarks don't cover everything. But from a numbers perspective, Kimi looks much more appealing.
Comment by archon810 8 hours ago
Comment by mchusma 12 hours ago
Comment by ninjahawk1 18 hours ago
Comment by culi 17 hours ago
Comment by robot_jesus 17 hours ago
What about Gemma and Llama and gpt-oss, not to mention lots of smaller/specialized models from Nvidia and others?
I would never argue that China isn't ahead in the open weights game, of course, but it's not like it's "all" American models by any stretch.
Comment by InkCanon 3 hours ago
Comment by walthamstow 16 hours ago
Comment by 1dom 1 hour ago
I'm annoyed at myself, because I thought/hoped/praised Chinese AI when they were opening up as Llama was closing, but Qwen looks to be running the same playbook here as Llama/Meta, Gemma/Google and OpenAI/gpt-oss.
Comment by embedding-shape 17 hours ago
Most*.
OpenAI, contrary to popular belief, actually used to believe in open research and (more or less) open models. GPT-1 and GPT-2 were both model+code releases (although GPT-2 was a "staged" release); GPT-3 ended up API-only.
Comment by culi 17 hours ago
Also, the Chinese models aren't following the typical American SaaS playbook, which relies on free/cheap proprietary software for early growth. They are not just publishing their weights but also their code, and often even publishing papers in open-access journals to explicitly highlight what methods and advancements were made to accomplish their results.
Comment by jfoster 5 hours ago
Well, Musk v OpenAI kicks off in one week from now with the objective of forcing them back to their roots. A jury will be deciding whether a nonprofit accepting $50m - $100m of donations and then discarding their mission for an IPO is OK or not. Should be interesting.
Comment by zozbot234 17 hours ago
Comment by tasuki 14 hours ago
Comment by taneq 16 hours ago
Comment by zozbot234 17 hours ago
Comment by magicalhippo 14 hours ago
It's a good model though, would be nice with a refresh.
Comment by visarga 18 hours ago
Comment by qalmakka 17 hours ago
Comment by zozbot234 17 hours ago
Comment by qalmakka 17 hours ago
Comment by zozbot234 17 hours ago
Comment by twoodfin 17 hours ago
Today, lots of integer compute happens on local devices for some purposes, and in the cloud for others.
Same is already true for matmul, lots of FLOPS being spent locally on photo and video processing, speech to text, …
No obvious reason you wouldn’t want to specialize LLM tasks similarly, especially as long-running agents increasingly take over from chatbots as the dominant interaction architecture.
Comment by lelanthran 4 hours ago
Right now, certainly. Things change. What was a datacenter rack yesterday could be a laptop tomorrow.
Comment by BobbyJo 17 hours ago
Now, given they can't satisfy current volume, they are forced to settle for just having crazy margins.
Comment by qalmakka 17 hours ago
Comment by try-working 10 hours ago
Comment by BobbyJo 15 hours ago
Comment by ycui1986 8 hours ago
No, nVidia and AMD are not the only ones benefiting.
Comment by zozbot234 18 hours ago
Comment by ninjahawk1 18 hours ago
Comment by elorant 17 hours ago
Comment by Barrin92 6 hours ago
no it isn't. That's the kind of thing people say who've never worked in the Chinese software ecosystem. It's how the Chinese internet has worked for 20+ years. The Chinese market is so large and competition is so rabid that every company basically throws as much free stuff at consumers as they can to gain users. Entrepreneurs don't think about "grand strategic moves at the national level" while they flip through their copies of the Art of War and Confucius lol
Comment by elorant 27 minutes ago
Comment by stingraycharles 7 hours ago
Comment by try-working 10 hours ago
Comment by baq 17 hours ago
Comment by Zavora 17 hours ago
Comment by CamperBob2 17 hours ago
The idea that every new foundation model needs to be pretrained from scratch, using warehouses of GPUs to crunch the same 50 terabytes of data from the same original dumps of Common Crawl and various Russian pirate sites, is hard to justify on an intuitive basis. I think the hard work has already been done. We just don't know how to leverage it properly yet.
Comment by thesz 17 hours ago
Comment by altruios 16 hours ago
Comment by CamperBob2 16 hours ago
To me, that suggests that transformer pretraining creates some underlying structure or geometry that hasn't yet been fully appreciated, and that may be more reusable than people think.
Ultimately, I also doubt that the model weights are going to turn out to be all that important. Not compared to the toolchains as a whole.
Comment by thesz 14 hours ago
Tokenization breaks up collocations and creates new ones that are not always present in the original text as it was. Most probably, the first byte pair found by a simple byte-pair-encoding algorithm on enwik9 will be two spaces next to each other. Is this a true collocation? BPE thinks so. Humans may disagree.
What does concern me here is that it is very hard to ablate tokenization artifacts.
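A toy illustration of the double-space claim (Python; the small sample below stands in for a real corpus like enwik9):

```python
from collections import Counter

# On indented text, the most frequent byte pair — the first one a BPE
# learner would merge — is two adjacent spaces: a layout artifact, not
# a linguistic collocation.
text = (
    "def f(x):\n"
    "    if x > 0:\n"
    "        return x\n"
    "    return -x\n"
)
pairs = Counter(zip(text, text[1:]))
most_common_pair, count = pairs.most_common(1)[0]
print(most_common_pair)  # (' ', ' ')
```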
Comment by dTal 16 hours ago
[0] https://news.ycombinator.com/item?id=47431671 https://news.ycombinator.com/item?id=47322887
Comment by thesz 13 hours ago
What if you still have to obtain the best result possible for a given coefficient/tokenization budget?
I think that my comment expresses the general case, while yours provides some exceptions.
Comment by dTal 1 hour ago
>What if you need to reduce number of layers
Delete some.
> and/or width of hidden layers?
Randomly drop x% of parameters. No doubt there are better methods that entail distillation but this works.
> would the process of "layers to add" selection be considered training?
Er, no?
> What if you still have to obtain the best result possible for given coefficient/tokenization budget?
We don't know how to get "the best result possible", or even how to define such a thing. We only know how to throw compute at an existing network to get a "better" network, with diminishing returns. Re-using existing weights lowers the amount of compute you need to get to level X.
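A naive sketch of the "randomly drop x% of parameters" idea mentioned above (Python; real pruning would use magnitude-based criteria or distillation, as the comment notes):

```python
import random

# Blunt width reduction: zero out a random fraction of weights. This is
# the naive version; better methods drop the smallest-magnitude weights
# or distill into a smaller network.
def random_drop(weights, frac, seed=0):
    rng = random.Random(seed)
    return [0.0 if rng.random() < frac else w for w in weights]

weights = [0.5, -1.2, 0.03, 2.1, -0.7, 0.9, -0.05, 1.4]
pruned = random_drop(weights, frac=0.5)
```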
Comment by andriy_koval 14 hours ago
Comment by pduggishetti 17 hours ago
I believe the US is building this off the cost difference with other countries, using companies like Scale, Outlier, etc., while China has the internal population to do this
Comment by testbjjl 18 hours ago
Comment by Rohansi 17 hours ago
Comment by WarmWash 17 hours ago
People think that Chinese AI labs are just super cool bros that love sharing for free.
They don't understand it's just a state-sponsored venture meant to further entrench China in global supply chains and logistics. China's VCs are Chinese banks and a sprinkle of "private" money. Private in quotes because technically it still belongs to the state anyway.
China doesn't have companies and government like the US. It just has government, and a thin veil of "company" that readily fools westerners.
Comment by subw00f 17 hours ago
Comment by culi 17 hours ago
That's very different from the American SaaS model, which relies on free but proprietary software for early growth
Comment by zozbot234 17 hours ago
Comment by WarmWash 17 hours ago
If you forever stand at the entrance eating the free samples, that's fine, they don't care. Other people are going through the door and you are still consuming what they feed you. Doesn't mean it's going to be bad or evil, but they are staking their territory of control.
Comment by zozbot234 17 hours ago
Comment by devilsdata 11 hours ago
Comment by jillesvangurp 17 hours ago
As for what comes next, it's probably going to be a bit of a race for who can do the most useful and valuable things the cheapest. If OpenAI and Anthropic don't make it, the technology will survive them. If they do, they'll be competing on quality and cost.
As for state sponsorship, a lot of things are state sponsored. Including in the US. Silicon Valley has a rich history that is rooted in massive government funding programs. There's a great documentary out there, "The Secret History of Silicon Valley", on this. Not to mention all the "cheap" gas that is currently powering data centers of course comes on the back of a long history of public funding being channeled into the oil and gas industry.
Comment by WarmWash 17 hours ago
You can make any comparison you want if you use adjectives rather than values. I can say that cars use a massive amount of water (all those radiators!) to try and downplay agricultural water usage. But it's blatantly disingenuous.
SV is overwhelmingly private (actual constitutional private) money. To the point that you should disregard people saying otherwise, just like you would the people saying cars use massive amounts of water.
Comment by OtomotO 17 hours ago
Contrary: How will the closed, proprietary models from Anthropic, "Open"AI and Co. lead us all to freedom? Freedom of what exactly? Freedom of my money?
At some point this "anti-communism" bullshit propaganda has to stop. And that moment was decades ago!
Comment by Zetaphor 16 hours ago
Comment by grttsww 17 hours ago
I still prefer that over US total dominance.
Let them fight it out.
Comment by joquarky 16 hours ago
But the events of the past decade or so have clearly demonstrated that there are no "good" actors.
I personally couldn't care less who wins in the China vs US AI competition, both sides have a long list of pros and cons.
Comment by spwa4 17 hours ago
Then decide ...
Comment by joquarky 16 hours ago
Or maybe families of African descent.
Or maybe families of Japanese Americans who lived in the US during WWII.
Or maybe people of Latin descent living in the US today.
Comment by jazz9k 16 hours ago
You really don't see the difference?
Comment by well_ackshually 15 hours ago
I'm perfectly happy to let the chinese get a piece of the pie and fight the US, no matter how bad they are right now.
Comment by grttsww 6 hours ago
Comment by darkwater 17 hours ago
Comment by ai_fry_ur_brain 16 hours ago
It would be a great day for humanity if people stopped glazing text autocomplete as revolutionary.
Comment by 0xbadcafebee 18 hours ago
Comment by Aurornis 18 hours ago
I find even the SOTA models to be far away from trustworthy for anything beyond throwaway tasks. Supervising a less-than-SOTA model to save $10 to $100 per month is not attractive to me in the least.
I have been experimenting with self hosted models for smaller throwaway tasks a lot. It’s fun, but I’m not going to waste my time with it for the real work.
Comment by zozbot234 18 hours ago
Comment by senordevnyc 15 hours ago
Comment by jatins 4 hours ago
Doesn't justify 10x the cost in that case imo
Comment by dandaka 17 hours ago
Comment by cyanydeez 17 hours ago
Comment by dnnddidiej 12 hours ago
Comment by 0xbadcafebee 9 hours ago
Buying the most expensive circular saw doesn't get you the best woodworking, but it is the most expensive woodworking.
Comment by itake 8 hours ago
Comment by 0xbadcafebee 2 hours ago
https://medium.com/@adambaitch/the-model-vs-the-harness-whic... | https://aakashgupta.medium.com/2025-was-agents-2026-is-agent... | https://x.com/Hxlfed14/status/2028116431876116660 | https://www.langchain.com/blog/the-anatomy-of-an-agent-harne...
(I don't think anecdotes are useful in these comparisons, but I'll throw mine in anyway: I use GPT-5.4, GPT-5.3-Codex, Gemini-3-Pro, Opus, Sonnet, at work every week. I then switch to GLM-5.1, K2-Thinking. Other than how chatty they get, and how they handle planning, I get the same results. Sometimes they're great, sometimes I spent an hour trying to coax them towards the solution I want. The more time I spend describing the problem and solution and feeding them data, the better the results, regardless of model. The biggest problem I run into lately is every website in the world is blocking WebFetch so I have to manually download docs, which sucks. And for 90% of my coding and system work, I see no difference between M2.5 and SOTA models, because there's only so much better you can get at writing a simple script or function or navigating a shell. This is why Anthropic themselves have always told people to use Sonnet to orchestrate complex work, and Haiku for subagents. But of course they want you to pay for Opus, because they want your money.)
Comment by slopinthebag 13 hours ago
Also not everyone wants to use Claude Code, so if they're paying API pricing it's more likely thousands of dollars a month. If you can get the same results by spending a fraction of that, why wouldn't you?
Comment by esperent 7 hours ago
That was the breaking point, I cancelled my subscription.
As it happens I had a low coding workload over the past two weeks, so I've been noodling around in Pi, mostly with the Gemini Flash API. I like it - I even agree it's a much better harness than CC. However, the lock-in is real. Even without switching models, which each have their own quirks, I expect my work speed to drop drastically for at least a week or two even if I were focused on it fully. But after the learning period I think Pi will be faster. The danger, of course, is that CC is fairly on rails, while with Pi you could end up spending all your time tinkering with the harness.
Comment by gck1 11 hours ago
You can't do any serious work on it without rationing your work and kneecapping your workflows, to the point where you design workflows around Anthropic usage-limit voodoo rather than around what actually works.
Without this, I run into WEEKLY usage limits on the $200 plan by day 3, working on a single codebase, one feature at a time.
Comment by slopinthebag 9 hours ago
Comment by AnonymousPlanet 17 hours ago
Comment by flatline 16 hours ago
Comment by hedora 13 hours ago
On a related note, I really need to try some local models (probably starting with qwen), since, at least in 2026, the Chinese models are way better at protecting democracy and free speech than the US models.
Comment by AnonymousPlanet 16 hours ago
What if they learned that half of the American small and medium sized companies would have started pouring all their business information into such a service?
Comment by dnnddidiej 12 hours ago
Comment by chatmasta 16 hours ago
Comment by 0xbadcafebee 9 hours ago
Comment by chatmasta 9 hours ago
Comment by tgrowazay 15 hours ago
Comment by xutopia 15 hours ago
Comment by 0xbadcafebee 9 hours ago
Comment by fnetisma 17 hours ago
Comment by jjice 19 hours ago
Comment by SwellJoe 17 hours ago
But, if for some reason everything stopped at Opus 4.5 level and we never got a better model (and 4.6/4.7 are better, if only marginally so and mostly expanding the kind of work it can do rather than making it better at making web apps), we could still do a lot of real work real fast with Opus 4.5, and software development would never go back to everyone handwriting most of the code.
A model as good as Opus 4.5 (or slightly better according to the mostly easily gamed benchmarks) at a 10th the price is probably a worthwhile proposition for a lot of people. $100 a month, or more, to get Opus 4.7 is well worth it for a western developer...the time the lower-end models waste is far more expensive than the cost of using the most expensive models. For the foreseeable future, I'll keep paying a premium for the models that waste less of my time and produce better results with less prodding.
But, also, it's wild how fast things move. Open models you can run on relatively modest hardware are competitive with frontier models of two years ago. I mean, you can run Qwen 3.6 MoE 35B A3B or the larger Gemma 4 models on normal hardware, like a beefy Macbook or a Strix Halo or any recentish 24GB/32GB GPU...not much more expensive than the average developer laptop of pre-AI times. And they can write code. They can write decent prose (Qwen is maybe better at code; Gemma definitely has better prose), they can use tools, and they have a big enough context window for real work. They aren't as good as Opus 4.5, yet.
Anyway, I use several models at this point, for security and code reviews, even if Claude Code with Opus is still obviously the best option for most software development tasks. I'll give Qwen a try, too. I like their small models, which punch well above their weight, I'll probably like the big one, too.
Comment by Someone1234 18 hours ago
Even many people on a Claude subscription aren't choosing, or aren't able to choose, Opus 4.7 because of those cost/usage pressures. They often use Sonnet or an older Opus because of the value vs. quality curve.
Comment by dd8601fn 18 hours ago
Comment by seplite 18 hours ago
Comment by CamperBob2 18 hours ago
Comment by jpfromlondon 17 hours ago
Comment by paprikanotfound 16 hours ago
Comment by jpfromlondon 1 hour ago
Comment by elAhmo 17 hours ago
Comment by wahnfrieden 18 hours ago
Comment by oidar 18 hours ago
Comment by vidarh 18 hours ago
If even cheaper models start reaching that level (GLM 5.1 is also close enough that I'm using it a lot), that's a big deal, and a totally valid reason to compare against Opus 4.5.
Comment by jasonjmcghee 17 hours ago
For me, Opus 4.5 and 4.6 feel so different compared to sonnet.
Maybe I'm lazy or something but sonnet is much worse in my experience at inferring intent correctly if I've left any ambiguity.
That effect is super compounding.
Comment by hirako2000 18 hours ago
In any case, a benchmark provided by the vendor is always biased: they will pick the frameworks where their model fares well and omit the others.
Independent benchmarks are the go-to.
Comment by culi 17 hours ago
Comment by alex_young 18 hours ago
Comment by cute_boi 17 hours ago
Comment by bluegatty 18 hours ago
Comment by jdw64 16 hours ago
While Qwen advertises large context windows, in practice the effectiveness of long-context usage seems to depend heavily on its context caching behavior. According to the official documentation, Qwen provides both implicit and explicit context caching, but these come with constraints such as short TTL (around a few minutes), prefix-based matching, and minimum token thresholds.
Because of these constraints, especially in workflows like coding agents where context grows over time, cache reuse may not scale as effectively as expected. As a result, even though the per-token price looks low, the effective cost in long sessions can feel higher due to reduced cache hit rates and repeated computation.
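To make the point concrete, here's a toy cost model of a growing agent session where each turn re-sends the accumulated context. The per-token prices, turn sizes, and hit rates below are made-up illustrative numbers, not Qwen's actual pricing:

```python
# Sketch: how prefix-cache hit rate changes the effective cost of a session
# whose context grows every turn. All numbers are hypothetical.

def session_cost(turn_inputs, price_per_mtok, cached_price_per_mtok, hit_rate):
    """Total input cost when each turn re-sends the accumulated context.

    turn_inputs: new tokens added per turn (context is cumulative).
    hit_rate: fraction of the repeated prefix served from cache.
    """
    total = 0.0
    context = 0
    for new_tokens in turn_inputs:
        # Only the previously-seen prefix can hit the cache; new tokens can't.
        cached = context * hit_rate
        uncached = context - cached + new_tokens
        total += (cached * cached_price_per_mtok + uncached * price_per_mtok) / 1e6
        context += new_tokens
    return total

turns = [4000] * 20  # 20 turns, 4k new tokens each
no_cache = session_cost(turns, price_per_mtok=0.8, cached_price_per_mtok=0.2, hit_rate=0.0)
good_cache = session_cost(turns, price_per_mtok=0.8, cached_price_per_mtok=0.2, hit_rate=0.9)
print(f"0% cache hits: ${no_cache:.3f}, 90% cache hits: ${good_cache:.3f}")
```

With a short TTL, any pause between turns can drop the hit rate toward zero, which is exactly where the quadratic growth of re-sent context starts to hurt.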
That said, in certain areas such as security-related tasks, I’ve personally had cases where Qwen performed better than Opus.
In my personal experience, Qwen tends to perform much better than Opus on shorter units like individual methods or functions. However, when looking at the overall coding experience, I found it works better as a function-level generator rather than as an autonomous, end-to-end coding assistant like Claude.
Comment by ezekiel68 14 hours ago
Anthropic's "Best Practices" doc[0] for Claude Code states, "A clean session with a better prompt almost always outperforms a long session with accumulated corrections."
Comment by hedora 13 hours ago
Comment by greyskull 9 hours ago
Does anyone have a similar experience of having thoroughly used CC/Codex/whatever and also have an analogous self-hosted setup that they're somewhat happy with? I'm struggling a bit.
I have 32GB of DDR5 (seems inadequate nowadays), an AMD 7800X3D, and an RTX 4090. I'm using Windows but I have WSL enabled.
I tried a few combinations of ollama, docker desktop model runner, pi-coding-agent and opencode; and for models, I think I tried a few variants each of Gemma 4, Qwen, GLM-5.1. My "baseline" RAM usage was so high from the handful of regular applications that IIRC it wasn't enough to use the best models; e.g., I couldn't run Gemma4-31B.
Things work okay in a Windows-only setup, though the agent struggled to get file paths correct. I did have some success running pi/opencode in WSL and running ollama and the model via docker desktop.
In terms of actual performance, it was painfully slow compared to the throughput I'm used to from CC, and the tooling didn't feel as good as the CC harness. Admittedly I didn't spend long enough actually using it after fiddling with the setup for so long; it was at least a fun experiment.
Comment by ihowlatthemoon 5 hours ago
I use llama-server that comes with llama.cpp instead of using ollama. Here are the exact settings I use.
llama-server -ngl 99 -c 192072 -fa on --cache-type-k q4_0 --cache-type-v q4_0 --host 0.0.0.0 --sleep-idle-seconds 300 -m Qwen3.5-27B-Q4_K_M.gguf
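For anyone wiring this up: llama-server exposes an OpenAI-compatible chat endpoint at its host/port, so any OpenAI-style client works. A minimal stdlib-only sketch, assuming the default port 8080 from the command above (when a single model is loaded, the "model" field can be any string):

```python
import json
import urllib.request

# Default llama-server endpoint; adjust host/port to match your flags.
URL = "http://localhost:8080/v1/chat/completions"

def build_request(prompt, max_tokens=256):
    # Standard OpenAI-style chat completions payload.
    payload = {
        "model": "local",  # ignored when only one model is served
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

def ask(prompt):
    # Requires a running llama-server; don't call this without one.
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# ask("Write a haiku about KV caches.") would return the completion text
# once the server is up.
```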
Comment by greyskull 4 hours ago
How did you land on that model? Hard to tell if I should be a) going to 3.5, b) going to fewer parameters, c) going to a different quantization/variant.
I didn't consider those other flags either, cool.
Are you having good luck with any particular harnesses or other tooling?
Comment by martinald 9 hours ago
You can keep the KV cache on the GPU, which means it's pretty damn fast, and you should be able to hold a reasonable context window size on your GPU.
I've had really impressive results locally with this.
I'd strongly recommend cloning llama.cpp locally btw (in wsl2) and asking a frontier model in eg Claude code to set it up for you and tweak it. In my experience the apps that sit on top of llama.cpp don't expose all the options and flags and one wrong flag can mean terrible performance (eg context windows not being cached). If you compile it from source with a coding agent it can look up the actual code when things go wrong.
You should be able to get at least 20-40 tok/s on that machine on Gemma 4, which is very usable, and probably faster on Qwen 3.6 since it's only 3B active params.
Comment by Ey7NFZ3P0nzAe 3 hours ago
Working well so far.
Comment by greyskull 8 hours ago
Aside: what is your tooling setup? Which harness you're using (if any), what's running the inference and where, what runs in WSL vs Windows, etc.
I struggle to even ask the right questions about the workflow and environment.
Comment by madtowneast 9 hours ago
Comment by greyskull 8 hours ago
Comment by daemonologist 8 hours ago
Grab a recent win-vulkan-x64 build of llama.cpp here: https://github.com/ggml-org/llama.cpp/releases - llama.cpp is the engine used by Ollama and common wisdom is to just use it directly. You can try CUDA as well for a speedup but in my experience Vulkan is most likely to "just work" and is not too far behind in speed.
For best quality, download the biggest version of Qwen 3.5 27B you can fit on your 4090 while still leaving room for context and overhead: https://huggingface.co/unsloth/Qwen3.5-27B-GGUF - I would try the UD-Q5_K_XL but you might have to drop down to Q5_K_S. For best speed, you could use Qwen 3.6 35B-A3B (bigger model but fewer parameters are active per token): https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF - probably the UD-Q4_K_S for this one.
Now you need to make sure the whole model is fitting in VRAM on the 4090 - if anything gets offloaded to system memory it's going to slow way down. You'll want to read the docs here: https://github.com/ggml-org/llama.cpp/tree/master/tools/serv... (and probably random github issues and posts on r/localllama as well), but to get started:
llama-server -m /path/to/above/model/here.gguf --no-mmap --fit on --fit-ctx 20000 --parallel 1
This will spit out a whole bunch of info; for now, look just above the dotted line for "load_tensors: offloading n/n layers to GPU". If fewer than 100% of the layers are on the GPU, inference is going to be slower and you probably want to drop down to a smaller version of the model. The "dense" 27B will be slowed more by this than the "mixture-of-experts" 35B-A3B, which has to move fewer weights per token from memory to the GPU.
Go to the printed link (localhost:8080 by default) and check that the model seems to be working normally in the default chat interface. Then, you're going to want more context space than 20k tokens, so look at your available VRAM (I think the regular Windows task manager resource monitor will show this) and incrementally increase the fit-ctx target until it's almost full. 100k context is enough for basic coding, but more like 200k would be better; Qwen's max native context length is 262,144. If you want to push this to the limit you can use `--fit-target <amount of memory in MB>` to reduce the free-VRAM target below the default 1024, though this may slow down the rest of your system.
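If you'd rather sanity-check context sizing before trial-and-error, a back-of-envelope KV-cache estimator helps. The layer/head numbers in the example call are placeholders, not any real model's config (the server prints the actual values from the gguf metadata at startup):

```python
# Rough KV-cache size estimate: K and V each store
# n_layers * n_kv_heads * head_dim values per token.

def kv_cache_gib(context_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # bytes_per_elem: 2 for fp16 cache; roughly 0.5-1 for q4/q8 cache types.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return context_tokens * per_token / 1024**3

# e.g. a hypothetical 48-layer model with 8 KV heads of dim 128, fp16 cache:
print(f"{kv_cache_gib(100_000, 48, 8, 128):.1f} GiB at 100k context")
```

The takeaway is that an fp16 cache at long context can rival the weights themselves in VRAM, which is why quantized cache types (like the q4_0 flags mentioned elsewhere in this thread) matter so much for long sessions.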
Finally, start hooking up coding harnesses (llama-server is providing an OpenAI-compatible API at localhost:8080/v1/ with no password/token). Opencode seems to work pretty reliably, although there's been some controversy about telemetry and such. Zed has a nice GUI but Qwen sometimes has trouble with its tools. Frankly I haven't found an open harness I'm really happy with.
Comment by greyskull 6 hours ago
> nothing you can run locally, on that machine anyways, is going to compare with Opus
Definitely not expecting that. Just wanted to find a setup that individuals were content with using a coding harness and a model that is usable locally.
What does your setup look like? Model, harness, etc.
Comment by unethical_ban 4 hours ago
Comment by fr3on 13 hours ago
Comment by sva_ 13 hours ago
Comment by trvz 19 hours ago
I knew of all the 3.5’s and the one 3.6, but only now heard about the Plus.
Comment by Alifatisk 18 hours ago
Comment by wg0 17 hours ago
1. Keeping models closed source.
2. Jacking up pricing. A lot. Sometimes up to 100% increase.
Comment by embedding-shape 17 hours ago
Comment by aerhardt 15 hours ago
Comment by esperent 7 hours ago
That's going to be the path for every new company from every country, I assume. They are not releasing open models out of the goodness of their hearts. They are for-profit companies, they don't have hearts, they just have balance sheets.
Comment by halJordan 10 hours ago
Comment by nicce 15 hours ago
How is that different from American?
Comment by Tepix 16 hours ago
Oh wait, it doesn't apply to those…
Comment by Kerrick 15 hours ago
Comment by slopinthebag 14 hours ago
Comment by dingocat 15 hours ago
Comment by OtomotO 17 hours ago
Comment by rc_kas 16 hours ago
Comment by sunaookami 14 hours ago
Comment by cnlwsu 16 hours ago
Comment by cute_boi 17 hours ago
Comment by gpm 16 hours ago
Comment by bigyabai 15 hours ago
Comment by throwaway613746 15 hours ago
Comment by ai_fry_ur_brain 16 hours ago
If you overuse LLMs or get excited about them at all, you're ngmi and a complete idiot.
Comment by johnnyApplePRNG 3 hours ago
Comment by atilimcetin 18 hours ago
And I use Claude, Gemini, GLM, and Qwen to double-check my math and my code, and to get practical information for making my path tracer more efficient. Claude and Gemini failed me more than a couple of times with wrong, misleading, and unnecessary information, while Qwen has consistently given me proper, practical, correct information. I've almost stopped using Claude and Gemini so as not to waste my time anymore.
Claude Code may shine at developing web applications, backends, and simple games, but it's definitely not for me. And this is the story of my specific use case.
Comment by wg0 17 hours ago
In my own experience, even with a web app of medium scale (think an Odoo-style ERP), they are next to useless at understanding and modeling the domain correctly, even with very detailed written specs fed in (a whole directory with index.md, sub-sections, and more detailed sections/chapters in separate markdown files with pointers in index.md). And I'm not talking about open-weight models here; I'm talking SOTA Claude Opus 4.6, Gemini 3.1 Pro, etc.
But that narrative isn't popular. I see parallels here with the crypto and NFT era. That was surely the future, and at least my firm pays me in crypto, whereas NFTs are used for rewarding bonuses.
Comment by wg0 17 hours ago
Comment by esperent 7 hours ago
Yearly breaking changes, but it's impossible to know which version any example code you find relates to (except that if you're on the latest version, it's definitely not yours); a closed and locked-down forum (after several months of being a paying customer, I couldn't even post a reply, let alone ask a question); a weird split between open and closed; a weird OWL frontend framework that seems to be a bad clone of an old React version; etc. Painful all around. I would call this kind of codebase pre-LLM slop, accreted over many years of bad engineering decisions.
Comment by amarcheschi 17 hours ago
otoh, we spotted a wrong formula regarding learning rate on wikipedia and it is now correct :) without gemini and just our intuition of "mhh this formula doesn't seem right", that definitely inflated our ego
Comment by muyuu 16 hours ago
it puts a massive backstop at the margins they can possibly extract from users
Comment by zozbot234 18 hours ago
Comment by atilimcetin 18 hours ago
Comment by jasonjmcghee 17 hours ago
Comment by hedora 13 hours ago
Comment by jansan 17 hours ago
Comment by zozbot234 17 hours ago
Comment by Alifatisk 17 hours ago
This is not my experience at all, Qwen3.6-Plus spits out multiple paragraphs of text for the prompts I give. It wasn't like this before. Now I have to explicitly tell it not to yap so much and keep it short, concise and direct.
Comment by djyde 12 hours ago
Comment by djyde 11 hours ago
Comment by freely0085 9 hours ago
Comment by piotraleksander 2 hours ago
Comment by Oras 18 hours ago
Comment by ac29 18 hours ago
Comment by cmrdporcupine 18 hours ago
https://deepinfra.com/zai-org/GLM-5.1
Looks like fp4 quantization now though? Last week was showing fp8. Hm..
Comment by wolttam 18 hours ago
I also regularly experience Deepinfra slow to an absolute crawl - I've actually gotten more consistent performance from Z.ai.
I really liked Deepinfra but something doesn't seem right over there at the moment.
Comment by cmrdporcupine 17 hours ago
It's frankly a bummer that there's not seemingly a better serving option for GLM 5.1 than z.AI, who seems to have reliability and cost issues.
Comment by coder68 18 hours ago
Comment by Oras 18 hours ago
CC has limited capacity for Opus, but it's fairly good for Sonnet. With Codex, I've never had issues hitting my limits, and I'm only a Pro user.
Comment by kardianos 18 hours ago
Comment by vidarh 18 hours ago
It's not crushing Opus 4.5 in real-life use for me, but it's close enough to be near-interchangeable with Sonnet for a lot of tasks, though some of the "savings" are eaten up by it seemingly using more tokens for tasks of similar complexity (I don't have enough data yet, but I've pushed ~500m tokens through it so far).
Comment by pros 18 hours ago
Comment by bensyverson 16 hours ago
Comment by Alifatisk 18 hours ago
They have difficulty supplying their users with capacity, but in an email they pointed out that they are aware of it. During peak hours, I experience degraded performance. But I am on their lowest tier subscription, so I understand if my demand is not prioritized during those hours.
Comment by culi 16 hours ago
https://arena.ai/leaderboard/text?viewBy=plot&license=open-s...
Comment by c0n5pir4cy 18 hours ago
I did give it one more complex task, and I was quite impressed by the result. I had a local setup with Tiltdev, K3S, and a pnpm monorepo that was failing to run the web application dev server; GLM correctly figured out that it was a container image build cache issue after inspecting the containers etc., and corrected the Tiltfile and build setup.
Comment by cleaning 18 hours ago
Comment by Oras 18 hours ago
For more complicated stuff, like queries or data comparison, Codex seems always behind for me.
Comment by throwaw12 18 hours ago
Comment by edwinjm 18 hours ago
Comment by throwaw12 18 hours ago
Comment by zozbot234 18 hours ago
Comment by throwaw12 17 hours ago
OpenAI, on the other hand, has separate models optimized for coding (GPT-x-codex); Anthropic doesn't have this distinction.
Comment by pixel_popping 16 hours ago
Comment by __blockcipher__ 18 hours ago
Comment by esafak 18 hours ago
Comment by XCSme 16 hours ago
Comment by chatmasta 16 hours ago
Comment by zozbot234 16 hours ago
Comment by digimantis 8 hours ago
Comment by marsulta 16 hours ago
Comment by o10449366 15 hours ago
Comment by alx-ppv 14 hours ago
Comment by fragmede 4 hours ago
Comment by Aeroi 13 hours ago
Comment by xmly 15 hours ago
Comment by DeathArrow 17 hours ago
They brag about Qwen but don't let people use it.
Comment by alanmercer 5 hours ago
Comment by EthanFrostHI 6 hours ago
Comment by JLO64 18 hours ago
Comment by bauratynov 3 hours ago
Comment by mockbolt 15 hours ago
Comment by souravroyetl 16 hours ago
Comment by dakolli 15 hours ago