Anonymous request-token comparisons from Opus 4.6 and Opus 4.7
Posted by anabranch 2 days ago
Comments
Comment by andai 2 days ago
Here is a comparison for 4.5, 4.6 and 4.7 (Output Tokens section):
https://artificialanalysis.ai/?models=claude-opus-4-7%2Cclau...
4.7 comes out slightly cheaper than 4.6. But 4.5 is about half the cost:
https://artificialanalysis.ai/?models=claude-opus-4-7%2Cclau...
Notably the cost of reasoning has been cut almost in half from 4.6 to 4.7.
I'm not sure what that looks like for most people's workloads, i.e. what the cost breakdown looks like for Claude Code. I expect it's heavy on both input and reasoning, so I don't know how that balances out, now that input is more expensive and reasoning is cheaper.
On reasoning-heavy tasks, it might be cheaper. On tasks which don't require much reasoning, it's probably more expensive. (But for those, I would use Codex anyway ;)
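To make that trade-off concrete, here's a toy sketch. All prices and token counts below are hypothetical placeholders, not Anthropic's actual rates; the only point is that which model comes out cheaper depends on the workload mix.

```python
# Back-of-envelope sketch of the input-vs-reasoning price trade-off.
# All prices (per million tokens) and token counts are made-up placeholders.

def session_cost(input_tokens, reasoning_tokens, output_tokens,
                 price_in, price_reason, price_out):
    """Cost in dollars for one session, given per-million-token prices."""
    return (input_tokens * price_in
            + reasoning_tokens * price_reason
            + output_tokens * price_out) / 1_000_000

# Hypothetical "4.6-like" pricing: cheaper input, pricier reasoning.
old = dict(price_in=10.0, price_reason=40.0, price_out=40.0)
# Hypothetical "4.7-like" pricing: pricier input, reasoning cut roughly in half.
new = dict(price_in=15.0, price_reason=20.0, price_out=40.0)

# Input-heavy workload (large codebase in context, little thinking):
heavy_input = (500_000, 20_000, 10_000)
# Reasoning-heavy workload (small context, long chains of thought):
heavy_reasoning = (50_000, 300_000, 10_000)

for name, load in [("input-heavy", heavy_input),
                   ("reasoning-heavy", heavy_reasoning)]:
    print(name, round(session_cost(*load, **old), 2), "->",
          round(session_cost(*load, **new), 2))
# input-heavy gets more expensive; reasoning-heavy gets cheaper
```

Under these made-up numbers the input-heavy session goes from $6.20 to $8.30 while the reasoning-heavy one drops from $12.90 to $7.15, which is the crossover described above.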
Comment by matheusmoreira 2 days ago
https://news.ycombinator.com/item?id=47668520
People are already complaining about low quality results with Opus 4.7. I'm also spotting it making really basic mistakes.
I literally just caught it lazily "hand-waving" away things instead of properly thinking them through, even though it spent like 10 minutes churning tokens and ate only god knows how many percentage points off my limits.
> What's the difference between this and option 1.(a) presented before?
> Honestly? Barely any. Option M is option 1.(a) with the lifecycle actually worked out instead of hand-waved.
> Why are you handwaving things away though? I've got you on max effort. I even patched the system prompts to reduce this.
> Fair call. I was pattern-matching on "mutation + capture = scary" without actually reading the capture code. Let me do the work properly.
> You were right to push back. I was wrong. Let me actually trace it properly this time.
> My concern from the first pass was right. The second pass was me talking myself out of it with a bad trace.
It's just a constant stream of self-corrections and doubts. Opus simply cannot be trusted when adaptive thinking is enabled.
Can provide session feedback IDs if needed.
Comment by codethief 2 days ago
In my experience, prompts like this one, which 1) ask for a reason behind an answer (when the model won't actually be able to provide one) and 2) are somewhat standoffish, don't work well at all. You'll just have the model go the other way.
What works much better is to tell the model to take a step back and re-evaluate. Sometimes it also helps to explicitly ask it to look at things from a different angle XYZ, in other words, to add some entropy to get it away from the local optimum it's currently at.
Comment by mrandish 2 days ago
This is key. In my experience, asking an LLM why it did something is usually pointless. In a subsequent round, it generally can't meaningfully introspect on its prior internal state, so it's just referring to the session transcript and extrapolating a plausible sounding answer based on its training data of how LLMs typically work.
That doesn't necessarily mean the reply is wrong because, as usual, a statistically plausible sounding answer sometimes also happens to be correct, but it has no fundamental truth value. I've gotten equally plausible answers just pasting the same session transcript into another LLM and asking why it did that.
Comment by Terretta 2 days ago
From early GPT days to now, best way to get a decently scoped and reasonably grounded response has always been to ask at least twice (early days often 7 or 8 times).
Because not only can it not reflect, it cannot "think ahead about what it needs to say and change its mind". It "thinks" out loud (as some people seem to as well).
It is a "continuation" of context. When you ask what it did, it still doesn't think, it just* continues from a place of having more context to continue from.
The game has always been: stuff context better => continue better.
Humans are bad at doing this. For example, asking it for synthesis with explanation instead of, say, asking for explanation first, then synthesis.
You can get today's behaviors by treating "adaptive thinking" like a token-budgeted loop for context stuffing: eventually there's enough context in view to produce a better-contextualized continuation.
It seems no accident we've hit on the word "harness" — so much that seems impressive by end of 2025 was available by end of 2023 if "holding it right". If (and only if!) you are an expert in an area you need it to process: (1) turn thinking off, (2) do your own prompting to "prefill context", and (3) you will get superior final response. Not vibing, just staff-work.
---
* “just” – I don't mean "just" dismissively. Qwen 3.5 and Gemma 4 on an M5 approach where SOTA was a year ago, but faster and on your lap. These things are stunning, and the continuations are extraordinary. But still: garbage in, garbage out; gems in, gems out.
Comment by vanviegen 2 days ago
It can't do any better in the moment it's making the choices. Introspection mostly amounts to back-rationalisation, just like in humans. Though for humans, doing so may help learning to make better future decisions in similar situations.
Comment by sillyfluke 2 days ago
>Introspection mostly amounts to back-rationalisation, just like in humans.
That's the best case scenario. Again, let's stop anthropomorphizing. The reasons it gives may turn out, on closer inspection, to be incompatible with the original answer...
Comment by Dumbledumb 2 days ago
Regarding not just telling it "try again": of course you are right to suggest that applying human cognition mechanisms to LLMs is not founded on the same underlying effects.
But given the nature of training and finetuning/RL, I don't think it is unreasonable that instructing the model to reflect backwards could have a positive effect. The model might pattern-match on this and then exhibit a few positive behaviors. It could lead it to do more reflection within its reasoning blocks and catch errors before answering, which is what you want. Those blocks will attend to the question of "what caused you to make this assumption", further encouraging that behavior. Yes, both mechanisms are exhibited through linear, forward-going statistical interpolation, but reasoning models have proven this is an effective strategy to arrive at a more grounded result than answering right away.
Lastly, back to anthropomorphizing: it shows that you, the user, are encouraging of deeper thought and self-correction. The model does not have psychological safety mechanisms that it guards, but again, the way the models are trained causes them to emulate them. The RL primes the model for certain behavior, i.e. arriving at an answer at some point rather than thinking for a long time. I think it's fair to assume that by "setting the stage" it is possible to influence which parts of that training activate. While role-based prompting is not that important anymore, the system prompts of the big coding agents still use it, suggesting some, if slight, advantage to putting the model in the right frame of mind. Again, very sorry for that last part, but anthropomorphizing does seem to be a useful analogy for a lot of the concepts we are seeing (the reason for this lying in the more far-off epistemological and philosophical regions, both on the side of the models and of us).
Comment by KronisLV 2 days ago
Yep, I've gotten used to treating the model output as a finished, self-contained thing.
If it needs to be explained, the model will be good at that; if it has an issue, the model will be good at fixing it (and possibly patching any instructions to prevent it in the future). I'm not getting the actual reason why things happened a certain way, but then again, it's just a token prediction machine. If there's something wrong with my prompt that's not immediately obvious and perhaps doesn't matter that much, I can just run a few sub-agents in a review role, look for a consensus on any problems that might be found, and have the model fix them.
Comment by wallst07 2 days ago
"Why did you guess at the function's signature and get it wrong? What information were you using, and how can we prevent it next time?"
Is that not the right approach?
Comment by natdempk 1 day ago
Comment by Forgeties79 2 days ago
That kind of strikes me as a huge problem. Working backwards from solutions (both correct and wrong) can yield pretty critical information and learning opportunities. Otherwise you’re just veering into “guess and check” territory.
Comment by AlexCoventry 2 days ago
It has the K/V cache, no?
Comment by Sinidir 1 day ago
Comment by matheusmoreira 2 days ago
It's just that Opus 4.6 DISABLE_ADAPTIVE_THINKING=1 doesn't seem to require me to do this at all, or at least not as often. It'd fully explore the code and take into account all the edge cases and caveats without any explicit prompting from me. It's a really frustrating experience to watch Anthropic's flagship subscription-only model burn my tokens only to end up lazily hand-waving away hard questions unless I explicitly tell it not to do that.
I have to give it to Opus 4.7 though: it recovered much better than 4.6.
Comment by bobkb 2 days ago
Strangely this option was not working for many of us on a team plan
Comment by j-bos 2 days ago
Comment by christina97 2 days ago
It never leads to anything helpful. I don’t generally find it necessary to drive humans into a corner. I’m not sure it’s because it’s explicitly not a human so I don’t feel bad for it, though I think it’s more the fact that it’s always so bland and is entirely unable to respond to a slight bit of negative sentiment (both in terms of genuinely not being able to exert more effort into getting it right when someone is frustrated with it, but also in that it is always equally nonchalant and inflexible).
Comment by manmal 2 days ago
Comment by nhod 2 days ago
If you ask the average human "Why?", they will generally get defensive, especially if you are asking them to justify their own motivation.
However, if you ask them to describe the thinking and actions that led to their result, they often respond very differently.
Comment by nelox 2 days ago
Comment by sroussey 2 days ago
Comment by Forgeties79 2 days ago
Comment by sroussey 1 day ago
Comment by Forgeties79 1 day ago
Comment by noodletheworld 2 days ago
I desperately hate that modern tooling relies on “did you perform the correct prayer to the Omnissiah”
> to add some entropy to get it away from the local optimum
Is that what it does? I don't think that's what it does, technically.
I think that's just anthropomorphizing a system that behaves in a non-deterministic way.
A more meaningful solution is almost always "do it multiple times".
That is a solution that makes sense sometimes because the system is probability-based, but even then, when you're hitting an opaque API which has multiple hidden caching layers, /shrug, who knows.
This is why I firmly believe prompt engineering and prompt hacking are just fluff.
It's both mostly technically meaningless (observing random variance over a sample so small you can't see actual patterns) and obsolete once models/APIs change.
Just ask Claude to rewrite your request “as a prompt for claude code” and use that.
I bet it won't be any worse than the prompt you write by hand.
Comment by nprateem 2 days ago
"Why did you do that?" (Me, just wanting to understand)
"You're right I should have done the opposite" (starts implementing the opposite without seeking approval, etc.)
But if you agree with it it won't do that, so it isn't simply a case of randomly rerunning prompts.
Comment by tclancy 2 days ago
Comment by what 2 days ago
Do you think it knows what max effort or patched system prompts are? It feels really weird to talk to an LLM like it’s a person that understands.
Comment by matheusmoreira 2 days ago
As someone who's been programming alone for over a decade, I absolutely do want to enjoy my coding buddy experience. I want to trust it. I feel pretty bad when I have to treat Claude like a dumb machine. It's especially bad when it starts making mistakes due to lack of reasoning. When I start explaining obvious stuff it's because I've lost the respect I had for it and have started treating it like a moron I have to babysit instead of a fellow programmer. It's definitely capable of understanding and reasoning, it's just not doing it because of adaptive thinking or bad system prompts or whatever else.
Comment by hattmall 2 days ago
Comment by rectang 2 days ago
It seems like they're working hard to prioritize wrapping their arms around huge contexts, as opposed to handling small tasks with precision. I prefer to limit the context and the scope of the task and focus on trying to get everything right in incremental steps.
Comment by matheusmoreira 2 days ago
I think the problem just comes down to adaptive thinking allowing the model to choose how much effort it spends on things, a power which it promptly abuses to be as lazy as possible. CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1 significantly improved Opus 4.6's behavior and the quality of its results. But then what do they do when they release 4.7?
https://code.claude.com/docs/en/model-config
> Opus 4.7 always uses adaptive reasoning.
> The fixed thinking budget mode and CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING do not apply to it.
Comment by bobkb 2 days ago
Comment by matheusmoreira 1 day ago
Comment by virtualritz 2 days ago
Comment by matheusmoreira 1 day ago
https://code.claude.com/docs/en/model-config
> Opus 4.7 always uses adaptive reasoning. The fixed thinking budget mode and CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING do not apply to it.
Comment by xvector 2 days ago
Comment by scrollop 2 days ago
"With Opus 4.6, extended thinking was a toggle you managed: turn it on for hard stuff, off for quick stuff. If you left it on, every question paid the thinking tax whether it needed to or not. Now, with Opus 4.7, extended thinking becomes adaptive thinking. "
https://claude.com/resources/tutorials/working-with-claude-o...
Comment by xvector 1 day ago
Comment by matheusmoreira 1 day ago
Comment by matheusmoreira 1 day ago
https://code.claude.com/docs/en/model-config
> Opus 4.7 always uses adaptive reasoning. The fixed thinking budget mode and CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING do not apply to it.
Comment by HarHarVeryFunny 1 day ago
Does it? Anthropic's own announcement says that at the same "effort level" 4.7 does more thinking (i.e. uses more output tokens) than 4.6, and they've also increased the default effort level from high (in 4.6) to xhigh (in 4.7).
I'm not sure what dominates the cost for a typical mix of agentic coding tasks - input tokens or output ones, but if you are working on an existing project rather than a brand new one, then file input has to be a significant factor and preliminary testing says that the new tokenizer is typically generating 40% or so more tokens for the exact same input.
I really have to wonder how much of 4.7's increase in benchmark scores over 4.6 is because the model is actually better trained for these cases, or just because it is using more tokens - more compute and thinking steps - to generate the output. It has to be a mix of the two.
Comment by scrollop 2 days ago
"Opus 4.7 thinks more at higher effort levels, particularly on later turns in agentic settings. This improves its reliability on hard problems, but it does mean it produces more output tokens. "
Comment by andai 2 days ago
I'm not sure where that discrepancy comes from (is Anthropic using different benchmarks?).
There's a few different theories but all we have now are synthetic benchmarks, anecdotes and speculation.
(Benchmarks are misleading, I think our best bet now is for individuals to run real world tests, giving the same task to each model, and compare the quality, cost and time.)
The input cost inflation however is real, and dramatic.
I would have expected them to lower input costs proportionally, because otherwise you're getting less intelligence per dollar even with the smarter model. I think that would have been the smartest thing for them to do, at least PR-wise. And maybe a bit of free usage as an apology :)
Comment by irthomasthomas 2 days ago
Comment by andai 2 days ago
Agree though that benchmarks aren't very helpful w.r.t. estimating real world performance or costs.
What we'd need are people giving the same real world tasks to 4.6 and 4.7 and measuring time, quality and costs.
Comment by irthomasthomas 1 day ago
Comment by QuantumGood 2 days ago
Comment by hgoel 2 days ago
I hit my 5 hour limit within 2 hours yesterday. Initially I tried the batched mode for a refactor, but cancelled after seeing it take 30% of the limit within 5 minutes. A serial approach consumed less (took ~50 minutes at xhigh effort, ~60% of the remaining allocation IIRC), but still very clearly burned through the limit much faster than 4.6 did.
It feels like every exchange takes ~5% of the 5 hour limit now, when it used to be maybe ~1-2%. For reference I'm on the Max 5x plan.
For now I can tolerate it since I still have plenty of headroom in my limits (used ~5% of my weekly, I don't use claude heavily every day so this is OK), but I hope they either offer more clarity on this or improve the situation. The effort setting is still a bit too opaque to really help.
Comment by matheusmoreira 2 days ago
Comment by zamalek 1 day ago
It decided to leave the write endpoints it added to an authentication service completely unauthenticated, even though doing the opposite took about 6 characters and was spelled out in the CLAUDE.md. It also tried to implement PKCE by embedding _everything_ in the state parameter.
This thing is beyond untrustworthy.
The fact that they are using Claude to build Claude (not just Claude Code) probably explains a lot.
Comment by matheusmoreira 1 day ago
Comment by sutterd 2 days ago
Comment by scrollop 2 days ago
https://claude.com/resources/tutorials/working-with-claude-o...
You want extended thinking? It's now adaptive thinking, and Opus will turn it on if it thinks it needs to. But it probably won't, according to user reports, since tokens are expensive. Except Opus 4.7 now uses ~35% more of them and outputs more thinking tokens.
Comment by sutterd 1 day ago
Comment by matheusmoreira 1 day ago
With Opus 4.7 you absolutely do. Users don't have a choice.
https://code.claude.com/docs/en/model-config
> Opus 4.7 always uses adaptive reasoning. The fixed thinking budget mode and CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING do not apply to it.
Comment by __s 2 days ago
Comment by _blk 2 days ago
Comment by gck1 2 days ago
Comment by bashtoni 2 days ago
Comment by g4cg54g54 2 days ago
but if you are an API user you must set `ENABLE_PROMPT_CACHING_1H`, as I understood
and when using your own API endpoint (via `ANTHROPIC_BASE_URL`), ensure `CLAUDE_CODE_ATTRIBUTION_HEADER=0` is set as well... https://github.com/anthropics/claude-code/issues/50085
and check out the other neck-breakers I've found; it feels like lots of malicious compliance... :/
[BUG] new sessions will *never* hit a (full)cache #47098 https://github.com/anthropics/claude-code/issues/47098
[BUG] /clear bleeds into the next session (what also breaks cache) #47756 https://github.com/anthropics/claude-code/issues/47756
[BUG] uncachable system prompt caused by includeGitInstructions / CLAUDE_CODE_DISABLE_GIT_INSTRUCTIONS -> git status https://github.com/anthropics/claude-code/issues/47107
Comment by andersa 2 days ago
Comment by matheusmoreira 1 day ago
After the Claude Code source code leak, someone discovered that some variables are read directly from the process environment. Can't even trust that setting them in ~/.claude/settings.json will work!
I've actually started asking Claude itself to dissect every Claude Code update in order to figure out if it broke some part of the Rube Goldberg machine I was forced to set up.
Comment by ethbr1 1 day ago
If it increases a KPI by 5% for 95% of users but torpedoes the experience for the other 5%? Ship it.
Comment by hnben 9 minutes ago
On one hand, 95% of users get an improved experience; meanwhile, a competitor gets the chance to build a business for the remaining 5%.
Comment by plaguuuuuu 2 days ago
My attention span is such that I get side tracked and wind up taking longer than 5 mins quite a bit :D
Comment by _blk 2 days ago
Comment by gck1 2 days ago
However, cache being hit doesn't necessarily mean Anthropic won't just subtract usage from you as if it wasn't hit. It's Anthropic we're talking about. They can do whatever they want with your usage and then blame you for it.
Comment by Fabricio20 2 days ago
Comment by HarHarVeryFunny 2 days ago
Comment by ethanj8011 2 days ago
Comment by krackers 2 days ago
Why can't they save the kv cache to disk then later reload it to memory?
Comment by stavros 2 days ago
Comment by zozbot234 2 days ago
Comment by vanviegen 2 days ago
Comment by zozbot234 2 days ago
Comment by vanviegen 2 days ago
That's hardly tiny.
Comment by stingraycharles 2 days ago
Typically it’s cached for about 5 minutes, you can pay extra for longer caches.
Comment by krackers 2 days ago
If you're willing to incur a latency penalty on a "cold resume" (which is fine for most use cases), why couldn't they just move it to disk? The size of the KV cache should scale on the order of something like (context_length * n_layers * residual_length). I think for a standard V3-MoE model at 1M token length, this should be on the order of 100G at FP16? And you can surely play tricks with KV compression (e.g. the recent TurboQuant paper). It doesn't seem like an outrageous amount of data to put onto cheap scratch HDD (and it doesn't grow indefinitely, since really old conversations can be discarded).
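That back-of-envelope can be checked with a quick sketch. All model dimensions below are hypothetical placeholders for a large grouped-query-attention model, not any real Claude config; the point is only that the result lands in the same ~100 GB ballpark.

```python
# Rough KV-cache sizing: 2 tensors (K and V) per layer, each of shape
# [n_kv_heads, context_len, head_dim]. All dimensions are hypothetical.

def kv_cache_bytes(context_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2):
    """Bytes needed to cache K and V for every layer (FP16 = 2 bytes/elem)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# Hypothetical config: 64 layers, 4 KV heads (GQA), head_dim 128, FP16,
# at a 1M-token context.
size = kv_cache_bytes(context_len=1_000_000, n_layers=64,
                      n_kv_heads=4, head_dim=128)
print(round(size / 2**30), "GiB")  # prints: 122 GiB
```

About 122 GiB under these assumptions, so the same order of magnitude as the ~100G figure, and KV compression would shrink it further.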
Comment by stingraycharles 2 days ago
Correct, when you’re using the API you can choose between 60 minute or 5 minute cache writes for this reason, but I believe the subscription doesn’t offer this. 60 minute cache writes are about 25% more expensive than regular cache writes.
I don’t have insights into internals at Anthropic so I don’t know where the pain point is for increasing cache sizes.
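As a toy illustration of when that ~25% write premium pays off: everything below is a made-up placeholder price, not Anthropic's actual rate sheet; the only point is the comparison logic between re-writing an expired 5-minute cache and reading a still-warm 60-minute one.

```python
# Hypothetical 5-min vs 60-min cache-write comparison. Prices are
# placeholders (per million tokens); only the break-even logic matters.

def cache_cost(n_writes, write_price, n_reads, read_price, tokens=200_000):
    """Total caching cost for a fixed cached prefix of `tokens` tokens."""
    return tokens * (n_writes * write_price + n_reads * read_price) / 1_000_000

base_write = 3.75               # hypothetical 5-min cache-write price
long_write = base_write * 1.25  # 60-min writes: ~25% premium, per above
read = 0.30                     # hypothetical cache-read price

# A session with long pauses between 6 turns: the 5-min cache expires and
# must be rewritten every turn, while the 1h cache is written once and
# then read on the remaining 5 turns.
short = cache_cost(n_writes=6, write_price=base_write, n_reads=0, read_price=read)
long_ = cache_cost(n_writes=1, write_price=long_write, n_reads=5, read_price=read)
print(short, long_)  # prints: 4.5 1.2375
```

Under these assumptions the 60-minute cache is far cheaper for slow, pause-heavy sessions, which is presumably why the API exposes the choice at all.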
Comment by conception 2 days ago
Comment by hgoel 2 days ago
Comment by trueno 2 days ago
If it's the latter, that's crazy. I don't even know what to do there; compactions already feel like a memory wipe.
Comment by thefourthchime 2 days ago
Comment by viktorianer 2 days ago
Comment by glerk 2 days ago
And yes, Claude models are generally more fun to use than GPT/Codex. They have a personality. They have an intuition for design/aesthetics. Vibe-coding with them feels like playing a video game. But the result is almost always some version of cutting corners: tests removed to make the suite pass, duplicate code everywhere, wrong abstraction, type safety disabled, hard requirements ignored, etc.
These issues are not resolved in 4.7, no matter what the benchmarks say, and I don't think there is any interest in resolving them.
Comment by Bridged7756 2 days ago
It seems that they got a grip on the "coding LLM" market and now they're starting to seek actual profit. I predict we'll keep seeing 40%+ more expensive models for a marginal performance gain from now on.
Comment by xpe 2 days ago
Just to get a sense for the rate of change, imagine if you took a survey. Compare what people said about AI tools... 3 years ago, 2 years ago, 1 year ago, 6 months ago. Then think about what is plausible that people will be saying in 3 months, 6 months, 9 months ...
Moving the goalposts has always happened, but it is happening faster than I've ever seen it. Many people seem to redefine their expectations on a monthly basis now. Worse, they seem to be unaware they are doing it.
Fancy search? Ok, I'll bite. Compare today's "fancy search" to what we had ~3 years ago according to your choice of metric. Here's one: minutes spent relative to information found. Today, in ~5 minutes I can do a literature review that would have taken me easily 10+ hours five years ago. We don't need to argue phrasing when we can pick some prototypical tasks and compare them.
We're going to have different takes about where various AI technologies will be in these future timelines. It is much better to run to where the ball is likely to be, even if we have different ideas of where that is.
The human brain, at best, struggles to grasp even linear change. But linear change is not a good way to predict compounding technological change.
Comment by manmal 2 days ago
And it will not yield the same outcome you would have had. Your own taste in clicking links and pre-filtering as you do your research is no longer applied if you outsource this. I'm guilty of this myself. But let's not kid ourselves.
I've had GPT Pro think for 40 minutes about the ideal reverse osmosis setup for my home. It came up with something that could have supplied 10 houses and cost 20k, even though I told it all about my water consumers and asked it to research their peak usage. It just failed to observe that you can buffer water in a tank.
There's a reason they let you steer GPT Pro as it goes now.
Comment by xpe 2 days ago
Comment by manmal 1 day ago
Comment by xpe 1 day ago
Comment by toraway 2 days ago
Of course, it wasn't nearly as effective back then compared to current SOTA models, but none of those are tasks it's hard to imagine someone recommending Cursor for anytime in 2024 or later.
If OP instead said something like one shotting an entire line of business app with 10k LoC I would agree with your reminder about perspective. But it feels somewhat hype-y to say that goal posts are being moved "monthly" when most of their list has been possible for years.
Comment by xpe 2 days ago
> But it feels somewhat hype-y to say that goal posts are being moved "monthly"...
Here's what I mean. Imagine you kept a journal and once a day wrote down:
1. what impressed you about AI that day;
2. what did you do with it that day that you pretty much took for granted ("just SoTA")
Then compare today against 30 days ago. A lot changes! My point is that it is getting harder to impress us: our standard for what we expect seems to shift significantly on a roughly monthly basis. What does this rate of change, where you "just expect something to work as table stakes", feel like to you? Certainly faster than annually, right? 6 months? 3? 2? 1?
For me, a lot of this isn't just the raw technology but also socialization of what the tools can do and the personal experience of doing it yourself.
Comment by ozgrakkurt 2 days ago
I don't believe you can do a same quality job with an LLM in 5 minutes.
Comment by xpe 2 days ago
My example: I wanted to get a sense for the feasibility of doing a project that blends Gaussian Processes, active learning, and pairwise comparisons. So I want to dig into the literature to find out what is out there. This was around 5 minutes with Claude. In this case, I don't think I could have found what I wanted in 10 hours of searching and reading. This is the kind of thing that great LLMs unlock.
Comment by ozgrakkurt 2 days ago
There is no replacement for reading textbooks or high quality papers.
If you are saying that you didn't do this kind of thing before and now you can, then I would question whether what you're doing is really the same activity; in my opinion it is not.
Comment by xpe 1 day ago
For more background, please read Rapoport's Rules : https://themindcollection.com/rapoports-rules/
> Rapoport’s Rules, also known as Dennett’s Rules, is a list of four guidelines that detail how to interpret arguments charitably and criticise constructively. The concept was coined by philosopher Daniel C. Dennett in his book Intuition Pumps. Dennett acknowledged our proclivity to misinterpret and attack a counterpart’s argument instead of engaging meaningfully with what was actually said.
Comment by Bridged7756 1 day ago
Did it ever occur to you that the ever changing goalposts might have more to do with the expensive marketing campaigns of the big LLM providers?
We could talk about what's a measurable metric and what's not. Certainly, we have not much more than "benchmarks", of which, honestly, I don't know the veracity, or whether the big LLM providers cheat somehow, or whether the performance is even stable. The core idea is that LLMs remain able to do exactly what they were able to do back at release: text prediction. They got better in some regards, sure.
Your example is worrisome to me. It should be to you too. You didn't write a literature review, you generated a scaffold of one, with the same vices as any LLM-based writing, still needing review and revision. I would hope you rewrite it, to avoid your work being associated with LLM generation. For better or worse, you still need to revise your work. Because, once again (this point seems to be difficult to grasp), a text predictor is not a reliable source of information. We make trade-offs, sacrificing reliability for ease of use, but any real work needs human reviewing, which goes back to my first point. In this example it's doing nothing other than being a fancy search and scaffolding tool.
The ball is likely to be in the same place because, once again, they're text predictors. Not sentient beings, not intelligent. Still generating text, still hallucinating, probably even more so thanks to the ever-increasing amount of LLM-written content on the internet and initiatives like poison fountain doing a number on the generated content.
It's wild to me to make such claims about the rate of change of these tools. You're claiming we'll see exponential gains, I take it, while completely ignoring the base constraints those models will never be able to get rid of. They only know how to produce text. They don't know, and never really will know, whether it's right.
Comment by xpe 1 day ago
... but only with certain conversational norms. I say this because I predict we aren't (yet) matched up in a way such that we would have a conversation useful to us. The main reason (I guess) isn't about our particular viewpoints, nor about, say, whether we're both critical thinkers; we're both demonstrating that frame, at least in our language. Instead, I think it is about the way we engage and what we want to get out of a conversation. Just to pick one particular guide star, I strive to follow Rapoport's Rules [1]. FWIW, the HN Guidelines are not all that different, so simply by commenting here, one is explicitly joining a sort of social contract that points in their direction already.
Anatol Rapoport and Daniel Dennett were not only brilliant in their areas of specialty but also in teaching us how to criticize constructively in general. I offer the link at [1] in case you want to read the rules and give them a try here. We can start the conversation over (if you want).
---
In response to your comments about consciousness, intelligence, etc, here are some examples of what I mean by intelligence and why:
- intelligence: https://news.ycombinator.com/item?id=43236444
- general intelligence: https://news.ycombinator.com/item?id=43223521
- pressure towards AGI: https://news.ycombinator.com/item?id=41707643
- intelligence as "what machines cannot do" / no physics-based constraints to surpass human intelligence: https://news.ycombinator.com/item?id=44974963
---
[1]: https://assets.edge.bigthink.com/uploads/attachment/file/151...
Comment by danny_codes 2 days ago
Comment by Bridged7756 2 days ago
Comment by braebo 2 days ago
Comment by 3dfd 2 days ago
If the frontier models reach a point of barely any noticeable improvement, the trade-off changes.
You do not need a perfect substitute if you are getting it for free...
People will factor in future expectations about the development of open source vs frontier models. Why do you think OAI and Anthropic are pushing hard on marketing? It's for this reason. They want to get contractual commitments that firms have to honour while open source closes the gap.
Comment by retsibsi 2 days ago
Comment by alex_sf 2 days ago
Comment by girvo 2 days ago
It’s about the only one that is at that level, though, to be fair. They’re all still useful!
Comment by danny_codes 2 days ago
Comment by 3dfd 2 days ago
Comment by djeastm 2 days ago
Comment by holoduke 2 days ago
Comment by glerk 2 days ago
I’m definitely not coming at this from an “AI is useless” angle. I’ve been using these tools extensively over the past year and they provide a massive productivity boost.
Comment by wallst07 2 days ago
However, when your guidance is held constant and the model behaves MUCH differently (given the same baseline guide), that is where the problem lies.
It's as if your 'guidance' has to vary with how well the model is behaving. The analogy is a junior dev who is sometimes excellent and sometimes shows up drunk for work, and you have no breathalyzer.
Comment by the_gipsy 2 days ago
Is that what the soul is?
Comment by 3dfd 2 days ago
Comment by xpe 2 days ago
This part of the above comment strikes me as uncharitable and overconfident. And, to be blunt, presumptuous. To claim to know a company's strategy as an outsider is messy stuff.
My prior: it is 10X to 20X more likely that Anthropic has done something other than shift to a short-term squeeze-their-customers strategy (which I put at only ~5%).
What do I mean by "something other"? (1) One possibility is they are having capacity and/or infrastructure problems, so model performance is degraded. (2) Another possibility is that they are not as tuned to what customers want relative to what their engineers want. (3) It is also possible they have slowed their models down due to safety concerns. To be more specific, they are erring on the side of caution (which would be consistent with their press releases about safety concerns of Mythos). Also, the above three possibilities are not mutually exclusive.
I don't expect us (readers here) to agree on the probabilities down to the ±5% level, but I would think a large chunk of informed and reasonable people can probably converge to something close to ±20%. At the very least, can we agree all of these factors are strong contenders: each covers maybe at least 10% to 30% of the probability space?
How short-sighted, dumb, or back-against-the-wall would Anthropic have to be to shift to a "let's make our new models intentionally _worse_ than our previous ones" strategy? Think on this. I'm not necessarily "pro" Anthropic; they could lose standing with me over time, for sure. I'm willing to think it through. What would the world have to look like for this to be the case?
There are other factors that push back against claims of a "short-term greedy strategy" argument. Most importantly, they aren't stupid; they know customers care about quality. They are playing a longer game than that.
Yes, I understand that Opus 4.7 is not impressing people or worse. I feel similarly based on my "feels", but I also know I haven't run benchmarks nor have I used it very long.
I think most people viewed Opus 4.6 as a big step forward. People are somewhat conditioned to expect a newer model to be better, and Opus 4.7 doesn't match that expectation. I also know that I've been asking Claude to help me with Bayesian probabilistic modeling techniques that are well outside what I was doing a few weeks ago (detailed research and systems / software development), so it is just as likely that I'm pushing it outside its expertise.
Comment by glerk 2 days ago
I said "it seems like". Obviously, I have no idea whether this is an intentional strategy or not; it could just as well be a side effect of the things you mentioned.
Models being "worse" is the perceived effect for the end user (subjectively, it seems like the price to achieve the same results on similar tasks with Opus has been steadily increasing). I am claiming that there is no incentive for Anthropic to address this issue because of their business model (maximize the amount of tokens spent and price per token).
Comment by xpe 2 days ago
>> This part of the above comment strikes me as uncharitable and overconfident. And, to be blunt, presumptuous. To claim to know a company's strategy as an outsider is messy stuff.
> I said "it seems like".
Sorry. I take back the "presumptuous" part. But part of my concern remains: of all the things you chose to write, you only mentioned "the Tinder/casino intermittent reinforcement strategy". That phrase is going to draw eyeballs, and it got mine at least. As a reader, it conveys that you think it is the most likely explanation. I'm trying to see if there is something there that I'm missing. How likely do you think this is? Do you think it is more likely than the other three I mentioned? If so, it seems like your thinking hinges on this:
> I am claiming that there is no incentive for Anthropic to address this issue because of their business model (maximize the amount of tokens spent and price per token).
No incentive? Hardly. First, Anthropic is not a typical profit-maximizing entity; it is a Public Benefit Corporation [1] [2]. Yes, profits still matter, but there are other factors to consider if we want to accurately predict their actions.
Second, even if profit maximization is the only incentive in play, profit-maximizing entities can plan across different time horizons. Like I mentioned in my above comment, it would be rather myopic to damage their reputation with a strategy that I summarize as a short-term customer-squeeze strategy.
Third, like many people here on HN, I've lived in the Bay Area, and I have first-degree connections that give me high confidence (P>80%) that key leaders at Anthropic have motivations that go much beyond mere profit maximization.
Anthropic's AI safety mission is a huge factor and not the PR veneer that pessimists tend to claim. Most people who know me would view me as somewhat pessimistic, anti-corporate, and P(doomy). I say this to emphasize I'm not just casting stones at people for "being negative". IMO, failing to recognize and account for Anthropic's AI safety stance isn't "informed hard-hitting pessimism" so much as "limited awareness and/or poor analysis".
I'm not naive. That safety mission collides in a complicated way with FU money potential. Still, I'm confident (P>60%) that a significant number (>20%) of people at Anthropic have recently "cerebrated bad times" [3] i.e. cogitated futures where most humans die or lose control due to AI within ~10 to ~20 years. Being filthy rich doesn't matter much when dead or dehumanized.
[1]: https://law.justia.com/codes/delaware/title-8/chapter-1/subc...
[2]: https://time.com/6983420/anthropic-structure-openai-incentiv...
[3]: Weird Al: please make "Cerebration" for us.
Comment by glerk 2 days ago
> How likely do you think this is? Do you think it is more likely than the other three I mentioned?
I won't write down probability estimates, because frankly, I have no idea. Unless you are yourself a decision-maker at Anthropic, which, from what I can infer, you aren't, both of us are speculating. However, I can try to address each of your explanations at face value, because I don't think any of them makes Anthropic look any better than the explanation I provided.
> (1) One possibility is they are having capacity and/or infrastructure problems so the model performance is degraded.
As far as I understand it, scaling issues would result in increased latency or requests being dropped, not model quality being lower. However, there is a very widespread rumor that Anthropic is routing traffic to quantized models during peak times to help decrease costs. Boris Cherny, Thariq Shihipar, and others have repeatedly denied this is happening [1]. I would be more concerned if this were the actual explanation, because as a user of the Claude Code Max plan and of the API, I have the expectation that each dollar I spend buys me access to the same model without opaque routing in the background.
> (2) Another possibility is that they are not as tuned to what customers want relative to what their engineers want.
There is actually a strong case for this: the high performance on the benchmarks relative to the qualitatively low performance reported on real-world tasks after launch. I suspect quite a bit of RL training was spent optimizing for beating those benchmarks, which resulted in overfitting the model on particular kinds of tasks. I'm not claiming this is nefarious in any way or that it is something only Anthropic is guilty of doing: these benchmarks are supposed to be a good representation of general software tasks, and using them as a training ground is expected.
> (3) It is also possible they have slowed their models down due to safety concerns. To be more specific, they are erring on the side of caution (which would be consistent with their press releases about safety concerns of Mythos).
This would be the most concerning to me. I don't want to get too deeply into a political/philosophical argument, but I am very much on the other side of the e/accy vs. P(doomy) debate, and I strongly believe that keeping these tools under the control of some council of enlightened elders who claim to know what is best for humanity is ultimately futile.
If the result of the behind-the-scenes "cerebration" is an actual effort to try and slow down AI development or access, I don't have much confidence in the future of Anthropic.
I agree that there are incentives other than pure profit maximization here (I don't want to get into "my friend at Anthropic told me such and such" games, but I also believe this is the case). I'm sure there is some tension between these objectives inside Anthropic, but what is interesting is that lower model quality and maximizing user engagement could, at least in principle, align with both constraints.
Comment by xpe 1 day ago
Thanks for getting into some of the details ...
>> (1) One possibility is they are having capacity and/or infrastructure problems so the model performance is degraded.
> As far as I understand it, scaling issues would result in increased latency or requests being dropped, not model quality being lower.
Yes, many scaling issues would manifest in that way -- but not all. It seems plausible for Anthropic to have other ways to degrade model performance that don't show up in the latency or reliability metrics. I need to research more... (I'll try to think more on your other points later).
Comment by kalkin 2 days ago
Comment by h14h 2 days ago
https://artificialanalysis.ai/?intelligence-efficiency=intel...
Looking at their cost breakdown, while input cost rose by $800, output cost dropped by $1400. Granted, whether the output savings offset the input increase will be very use-case dependent, and I imagine the delta is a lot closer at lower effort levels.
Comment by theptip 2 days ago
Tokenizer changes are one piece to understand for sure, but as you say, you need to evaluate $/task not $/token or #tokens/task alone.
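To make the "$/task, not $/token" point concrete, here is a toy calculation. All prices and token counts below are hypothetical, chosen only to illustrate why the input/output mix matters (the ~5x output-to-input price ratio mentioned elsewhere in this thread is assumed, not Anthropic's actual rate card):

```python
# Hypothetical per-million-token prices; real Opus pricing differs and changes.
PRICE_IN = 15.0    # $ per 1M input tokens (assumed)
PRICE_OUT = 75.0   # $ per 1M output tokens (assumed ~5x input, per the thread)

def task_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one task given its token mix."""
    return input_tokens / 1e6 * PRICE_IN + output_tokens / 1e6 * PRICE_OUT

# An input-heavy agentic task (file reads, searches, compiler output)...
heavy_input = task_cost(400_000, 10_000)    # 6.75
# ...vs a reasoning-heavy task that emits far more thinking/output tokens.
heavy_output = task_cost(50_000, 60_000)    # 5.25
print(f"${heavy_input:.2f} vs ${heavy_output:.2f}")
```

With these made-up numbers, the input-heavy task costs more despite using fewer total output tokens, which is why a price change that raises input and lowers output rates can move different workloads in opposite directions.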
Comment by manmal 2 days ago
Comment by dktp 2 days ago
Though, from my limited testing, the new model is far more token hungry overall
Comment by manmal 2 days ago
Comment by httgbgg 2 days ago
Comment by manmal 2 days ago
Comment by httgbgg 1 day ago
I’m surprised this is even a question; obviously a better prompter has the same properties and it’s not in dispute?
Comment by squeaky-clean 2 days ago
Are you aware that output tokens are priced 5x higher than input tokens?
Comment by manmal 2 days ago
That’s just wrong. File reads, searches, compiler output, are the top input token consumers in my workflow. None of them can be removed. And they are the majority of my input tokens. That’s also why labs are trying to make 1M input work, and why compaction is so important to get right.
Regarding output: yes, but that wasn’t the topic in this thread. It’s just easier to argue, using input tokens, that the price has gone up. I have a hunch the price for output will rise similarly, but can’t prove it. The jury’s out IMO: https://news.ycombinator.com/item?id=47816960
Comment by kalkin 2 days ago
Comment by SkyPuncher 2 days ago
I’ve noticed 4.7 cycling a lot more on basic tasks. Though, it also seems a bit better at holding long running context.
Comment by the_gipsy 2 days ago
Comment by theptip 2 days ago
Comment by jascha_eng 2 days ago
Comment by rectang 2 days ago
My workflow is to give the agent pretty fine-grained instructions, and I'm always fighting agents that insist on doing too much. Opus 4.5 is the best out of all agents I've tried at following the guidance to do only-what-is-needed-and-no-more.
Opus 4.6 takes longer, overthinks things and changes too much; the high-powered GPTs are similarly flawed. Other models such as Sonnet aren't nearly as good at discerning my intentions from less-than-perfectly-crafted prompts as Opus.
Eventually, I quit experimenting and just started using Opus 4.5 exclusively knowing this would all be different in a few months anyway. Opus cost more, but the value was there.
But now I see that 4.7 is going to replace both 4.5 and 4.6 in VSCode Copilot, and with a 7.5x modifier. Based on the description, this is going to be a price hike for slower performance — and if the 4.5 to 4.6 change is any guide, more overthinking targeted at long-running tasks, rather than fine-grained. For me, that seems like a step backwards.
Comment by axpy906 2 days ago
Comment by rectang 2 days ago
I find that Opus is really good at discerning what I mean, even when I don't state it very clearly. Sonnet often doesn't quite get where I'm going and it sometimes builds things that don't make sense. Sonnet also occasionally makes outright mistakes, like not catching every location that needs to be changed; Opus makes nearly every code change flawlessly, as if it's thinking through "what could go wrong" like a good engineer would.
Sonnet is still better than older and/or less-capable models like GPT 4.1, Raptor mini (Preview), or GPT-5 mini, which all fail in the same way as Sonnet but more dramatically... but Opus is much better than Sonnet.
Recent full-powered GPTs (including the Codex variants) are competitive with Opus 4.6, but Opus 4.5 in particular is best in class for my workflow. I speculate that Opus 4.5 dedicates the most cycles out of all models to checking its work and ensuring correctness — as opposed to reaching for the skies to chase ambitious, highly complex coding tasks.
Comment by tobyhinloopen 2 days ago
Comment by trueno 2 days ago
as in 4.5 is no longer going to be avail? F.
ive also been sticking with 4.5 that sucks
Comment by rectang 2 days ago
> Over the coming weeks, Opus 4.7 will replace Opus 4.5 and Opus 4.6 in the model picker for Copilot Pro+[...]
> This model is launching with a 7.5× premium request multiplier as part of promotional pricing until April 30th.
Comment by xstas1 2 days ago
Comment by d0gsg0w00f 2 days ago
Comment by benjiro3000 2 days ago
Comment by tiffanyh 2 days ago
After just ~4 prompts I blew past my daily limit. Another ~7 more prompts & I blew past my weekly limit.
The entire HTML/CSS/JS was less than 300 lines of code.
I was shocked how fast it exhausted my usage limits.
Comment by zaptrem 2 days ago
Comment by nixpulvis 2 days ago
Comment by nixpulvis 2 days ago
Comment by sync 2 days ago
Comment by iammrpayments 2 days ago
Comment by eneveu 20 hours ago
It's not available on Free plan, but it's available on Pro.
Comment by templar_snow 2 days ago
Comment by tiffanyh 2 days ago
Comment by cageface 2 days ago
That said I find the GPT plans much better value.
Comment by someguyiguess 2 days ago
Comment by sumedh 2 days ago
Try it.
Comment by tomtomistaken 2 days ago
Comment by hirako2000 2 days ago
With an enterprise subscription, the bill gets bigger, but it's not like a VP can easily send a memo to all staff announcing that a migration is coming.
Individuals may end their subscriptions, which would ease datacenter usage and turn profits up.
Comment by fooster 2 days ago
Comment by hirako2000 1 day ago
That and atrophy, I will pass on what Claude is trying to accomplish.
I'm not dismissing LLMs entirely, for certain cases the concerns don't apply, at least not as much.
Comment by semcheck 2 days ago
Comment by hereme888 2 days ago
Comment by someuser54541 2 days ago
Comment by UltraSane 2 days ago
Comment by pixelatedindex 2 days ago
Comment by jlongman 2 days ago
Comment by amulyabaral 2 days ago
Comment by einpoklum 2 days ago
Comment by adrian_b 2 days ago
Otherwise, the non-standard order can be understood incorrectly. While the distinction between agents and patients is the most important that depends on word order in English, there are also other order-dependent distinctions, e.g. between beneficiary and patient, when the beneficiary is not marked by a preposition, or between a noun and its attribute, e.g. "police dog" is not the same as "dog police" and unless there is a detailed context you cannot know what is meant when the word order is wrong.
English is one of the languages with the most rigid word order. There are languages, especially among older languages, where almost any word order can be used without causing ambiguities, because all the possible roles of the words are marked by prepositions, postpositions or affixes (or sometimes by accentuation shifts).
Comment by einpoklum 2 days ago
> Left to Right English - read can, who? Anyone with [which] impressed am I.
and the causation is wrong; instead of the ability being impressive, it's the impressive character that allows reading in the opposite order.
So, you're right, and now I'll wait for the dog police to come pick me up.
Comment by y1n0 2 days ago
Comment by embedding-shape 2 days ago
Comment by usrnm 2 days ago
Comment by embedding-shape 2 days ago
Comment by UltraSane 2 days ago
Comment by bee_rider 2 days ago
Comment by freak42 2 days ago
Comment by gsleblanc 2 days ago
Comment by ACCount37 2 days ago
The "small subset" argument is profoundly unconvincing, and inconsistent with both neurobiology of the human brain and the actual performance of LLMs.
The transformer architecture is incredibly universal and highly expressive. Transformers power LLMs, video generator models, audio generator models, SLAM models, entire VLAs, and more. It's not a 1:1 copy of the human brain, but that doesn't mean it's incapable of reaching functional equivalence. The human brain isn't the only way to implement general intelligence - just the one that was easiest for evolution to put together out of what it had.
LeCun's arguments about "LLMs can't do X" keep being proven wrong empirically. Even on ARC-AGI-3, which is a benchmark specifically designed to be adversarial to LLMs and target the weakest capabilities of off the shelf LLMs, there is no AI class that beats LLMs.
Comment by bigyabai 2 days ago
The human brain is not a pretrained system. It's objectively more flexible than transformers and capable of self-modulation in ways that no ML architecture can replicate (that I'm aware of).
Comment by ACCount37 2 days ago
I've seen plenty of wacky test-time training things used in ML nowadays, which is probably the closest to how the human brain learns. None are stable enough to go into the frontier LLMs, where in-context learning still reigns supreme. In-context learning is a "good enough" continuous-learning approximation, it seems.
Comment by bigyabai 2 days ago
"It seems" is doing herculean work holding your argument up in this statement. Say, how many "R"s are in Strawberry?
Comment by ACCount37 2 days ago
LLMs get better release to release. Unfortunately, the quality of humans in LLM capability discussions is consistently abysmal. I wouldn't be seeing the same "LLMs are FUNDAMENTALLY FLAWED because I SAY SO" repeated ad nauseam otherwise.
Comment by bigyabai 2 days ago
In-context learning is professedly not "good enough" to approximate continuous learning of even a child.
Comment by ACCount37 2 days ago
You can also ask an LLM to solve that problem by spelling the word out first. And then it'll count the letters successfully. At a similar success rate to actual nine-year-olds.
There's a technical explanation for why that works, but to you, it might as well be black magic.
And if you could get a modern agentic LLM that somehow still fails that test? Chances are, it would solve it with no instructions - just one "you're wrong".
1. The LLM makes a mistake
2. User says "you're wrong"
3. The LLM re-checks by spelling the word out and gives a correct answer
4. The LLM then keeps re-checking itself using the same method for any similar inquiry within that context
In-context learning isn't replaced by anything better because it's so powerful that finding "anything better" is incredibly hard. It's the bread and butter of how modern LLM workflows function.
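The mechanical part of the "spell it out first" trick is easy to demonstrate outside any LLM: once a word is rendered character by character, counting a letter becomes a per-character operation instead of a sub-token one. A minimal sketch (not tied to any particular model or prompt):

```python
word = "strawberry"

# Spelling the word out mirrors what the prompt asks the model to do:
# each character becomes its own visible unit instead of being hidden
# inside a multi-character token like "straw" or "berry".
spelled = " ".join(word)
print(spelled)   # s t r a w b e r r y

# Once the characters are explicit, counting is trivial.
count = sum(1 for ch in word if ch == "r")
print(count)     # 3
```

The point is only that the failure is a representation problem, not a counting problem; the counting itself is elementary once the representation is right.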
Comment by squeaky-clean 2 days ago
In fact, asking a model not to repeat the same mistake makes it more likely to commit that mistake again, because it's in its context.
I think anyone who uses LLMs a lot will tell you that your steps 3 and 4 are fictional.
Comment by ACCount37 2 days ago
The "spell out" trick, by the way, was what was added to the system prompts of frontier models back when this entire meme was first going around. It did mitigate the issue.
Comment by bigyabai 2 days ago
We're back around to the start again. "Incredibly hard" is doing all of the heavy lifting in this statement, it's not all-powerful and there are enormous failure cases. Neither the human brain nor LLMs are a panacea for thought, but nobody in academia or otherwise is seriously comparing GPT to the human brain. They're distinct.
> There's a technical explanation for why that works, but to you, it might as well be black magic.
Expound however much you need. If there's one thing I've learned over the past 12 months it's that everyone is now an expert on the transformer architecture and everyone else is wrong. I'm all ears if you've got a technical argument to make, the qualitative comparison isn't convincing me.
Comment by ACCount37 2 days ago
The key words are "tokenization" and "metaknowledge", the latter being the only non-trivial part. An LLM can explain it in detail. They know more than you do too.
Comment by 8note 2 days ago
what problem does this allow you to solve that you couldnt otherwise?
Comment by squeaky-clean 2 days ago
Comment by aerhardt 2 days ago
And even then... why can't they write a novel? Or lowering the bar, let's say a novella like Death in Venice, Candide, The Metamorphosis, Breakfast at Tiffany's...?
Every book's in the training corpus...
Is it just a matter of someone not having spent a hundred grand in tokens to do it?
Comment by voxl 2 days ago
Comment by conception 2 days ago
Comment by aerhardt 2 days ago
Comment by zozbot234 2 days ago
Comment by mh- 2 days ago
It's just that the ones that manage to suppress all the AI writing "tells" go unnoticed as AI. This is a type of survivorship bias, though I feel there must be a better term for it that eludes me.
Comment by colechristensen 2 days ago
There's a lot of bad writing out there, I can't imagine nobody has used an LLM to write a bad novella.
Comment by aerhardt 2 days ago
I provide four examples in my comment...
Comment by colechristensen 2 days ago
Yes, those are examples of novellas; surely you believe an LLM could write a bad novella? I'm not sure what your point is. Either you think it can't string the words together at that length, or your standard is that it can't write a foundational piece of literature that stays relevant for generations... I'm not sure which.
Comment by aerhardt 2 days ago
But GP's argument ("limit the space to text") could be taken to imply - and it seems to be a common implication these days - that LLMs have mastered the text medium, or that they will very soon.
> it can't write a foundational piece of literature
Why not, if this is a purely textual medium, and the corpus includes all the great stories ever written, and possibly many writing workshops and great literature courses?
Comment by colechristensen 2 days ago
Comment by aerhardt 2 days ago
So at least we can agree that AI hasn't mastered the text medium, without further qualification?
And what about my argument, further qualified, which is that I don't think it could even write as well as a good professional writer - not necessarily a generational one?
Comment by colechristensen 2 days ago
I don't know what this means, and I don't know what would qualify as having "mastered" it at all. Seems like a no-true-Scotsman thing where, regardless, someone would always say it couldn't actually do the thing because of this or that.
>why can't they write a novel?
This is what I'm disagreeing with. I think an LLM can write a novel well enough that it's recognizably a pretty mediocre novel, no worse than the median human-written novel, which, to be fair, is pretty bad. You seem to have an unqualified bar something needs to pass before "writing a novel" is accomplished, but it's not clear what that is. At the same time, you're switching between the ability to do a thing and the ability to do it in a way that's honored as the best of the best for a century. So, I don't know, it kind of seems like you just don't like AI and have a standard for it that adjusts so that it fails, one that doesn't match what you'd consider some random Bob's ability to do the thing.
Comment by aerhardt 2 days ago
I am just challenging the notions that "if you limit it to text, it's doing really well" or that the text contains in itself all the information that is needed to carry out a task to a certain level of quality. This applies in my experience not only to writing literature but also to certain human tasks which may appear mundane and easy to automate.
Comment by ipaddr 2 days ago
Comment by mohamedkoubaa 2 days ago
Comment by 3dfd 2 days ago
Comment by npollock 2 days ago
[Opus 4.6] 3% context | last: 5.2k in / 1.1k out
add this to .claude/settings.json
  "statusLine": {
    "type": "command",
    "command": "jq -r '\"[\\(.model.display_name)] \\(.context_window.used_percentage // 0)% context | last: \\(((.context_window.current_usage.input_tokens // 0) / 1000 * 10 | floor / 10))k in / \\(((.context_window.current_usage.output_tokens // 0) / 1000 * 10 | floor / 10))k out\"'"
  }
Comment by bertil 2 days ago
After a few basic operations (retrospective look at the flow of recent reviews, product discussions) I would expect this to act like a senior member of the team, while 4.6 was good, but far more likely to be a foot-gun.
Comment by dakiol 2 days ago
We'll be keeping an eye on open models (of which we already make good use). I think that's the way forward. Actually, it would be great if everybody put more focus on open models; perhaps we could come up with something like the "linux/postgres/git/http/etc" of LLMs: something we all benefit from without it being monopolized by a single billionaire's company. Wouldn't it be nice if we didn't need to pay for tokens? Paying for infra (servers, electricity) is already expensive enough.
Comment by ahartmetz 2 days ago
One of two main reasons why I'm wary of LLMs. The other is fear of skill atrophy. These two problems compound. Skill atrophy is less bad if the replacement for the previous skill does not depend on a potentially less-than-friendly party.
Comment by post-it 2 days ago
It was an experiment to see if I could enter a mature codebase I had zero knowledge of, look at it entirely through an AI, and come to understand it.
And it worked! Even though I've only worked on the codebase through Claude, whenever I pick up a ticket nowadays I know what file I'll be editing and how it relates to the rest of the code. If anything, I have a significantly better understanding of the codebase than I would without AI at this point in my onboarding.
Comment by estetlinus 2 days ago
Comment by stringfood 2 days ago
Comment by lobf 2 days ago
I’ve been 90% vibe coding for a year or so now, and I’ve learned so much about networking just from spinning up a bunch of docker containers and helping GPT or Claude fix niggling issues.
I essentially have an expert (well, maybe not an expert, but an entity far more capable than I am on my own) whose shoulder I can look over and ask as many questions as I want, and who will explain every step of the process to me if I want.
I’m finally able to create things on my computer that I’ve been dreaming about for years.
Comment by estetlinus 1 day ago
Comment by idopmstuff 2 days ago
I usually learn way more by having Claude do a task and then quizzing it about what it did than by figuring out how to do it myself. When I have to figure out how to do the thing, it takes much more time, so when I'm done I have to move on immediately. When Claude does the task in ten minutes I now have several hours I can dedicate entirely to understanding.
Comment by onemoresoop 2 days ago
Comment by post-it 2 days ago
When you have a headache, do you avoid taking ibuprofen because one day it may not be available anymore? Two hundred years ago, if you gave someone ibuprofen and told them it was the solution for 99% of the cases where they felt some kind of pain, they might be suspicious. Surely that's too good to be true.
But it's not. Ibuprofen really is a free lunch, and so is AI. It's weird to experience, but these kinds of technologies come around pretty often, they just become ubiquitous so quickly that we forget how we got by without them.
Comment by visarga 2 days ago
If that happened at this point, it would be after societal collapse.
Comment by onemoresoop 2 days ago
Comment by hdjrudni 2 days ago
Every now and then I pause before I ask an LLM to undo something it just did or answer something I know it answered already, somewhere. And then I remember oh yeah, it's an LLM, it's not going to get upset.
Comment by dlopes7 2 days ago
Comment by techpression 2 days ago
Comment by bdangubic 2 days ago
Comment by thih9 2 days ago
Comment by bdangubic 2 days ago
Comment by thih9 2 days ago
Comment by ashirviskas 2 days ago
Comment by SpicyLemonZest 2 days ago
Comment by therealdrag0 2 days ago
Comment by Jweb_Guru 2 days ago
Comment by viccis 2 days ago
Comment by windexh8er 2 days ago
Comment by post-it 2 days ago
Comment by root_axis 2 days ago
Comment by post-it 2 days ago
My syntax writing skills may well be atrophying, but I'll just do a leetcode by hand once in a while.
Comment by viccis 2 days ago
>I have a significantly better understanding of the codebase than I would without AI at this point in my onboarding
One of the pitfalls of using AI to learn is the same as what I'd see students doing pre-AI with tutoring services. They'd have tutors explain the homework to them and even work through the problems with them. Thing is, any time you see a problem or concept solved, your brain is tricked into thinking you understand the topic well enough to do it yourself. It's why people think their job interview questions are much easier than they really are; things just seem obvious once you've thought about the solution. Anyone who's read a tutorial, felt like they understood it well, and then struggled for a while to actually start using the tool to make something new knows the feeling very well. That Todo List app in the tutorial seemed so simple, but the author was constantly making decisions that you didn't have to think about as you read it.
So I guess my question would be: If you were on a plane flight with no wifi, and you wanted to do some dev work locally on your laptop, how comfortable would you be vs if you had done all that work yourself rather than via Claude?
Comment by post-it 2 days ago
Probably about as comfortable as I would be if I also didn't have my laptop and instead had to sketch out the codebase in a notebook. There's no sense preparing for a scenario where AI isn't available - local models are progressing so quickly that some kind of AI is always going to be available.
Comment by viccis 1 day ago
Comment by Ifkaluva 2 days ago
Comment by post-it 2 days ago
Comment by throwaway613746 2 days ago
Comment by ljm 2 days ago
I've worked with people who will look at code they don't understand, say "llm says this", and express zero intention of learning something. Might even push back. Be proud of their ignorance.
It's like, why even review that PR in the first place if you don't even know what you're working with?
Comment by psygn89 2 days ago
A good dev would've read deeper into the concern and maybe noticed potential flaws, and if he had his own doubts about what the concern was about, would have asked for more clarification. Not just feed a concern into AI and fling it back. Like please, in this day and age of AI, give the benefit of the doubt that someone with a concern would have checked with AI himself if he had any doubts about his own concern...
Comment by oremj 2 days ago
Comment by foobarchu 2 days ago
Comment by pizza234 2 days ago
Comment by malnourish 2 days ago
I spent years cultivating expertise in C++ and .NET. And I found that time both valuable and enjoyable. But that's because it was a path to solve problems for my team, give guidance, and do so with both breadth and depth.
Now I focus on problems at a higher level of abstraction. I am certain there's still value in understanding ownership semantics and using reflection effectively, but they're broadly less relevant concerns.
Comment by mlvljr 2 days ago
Comment by dingaling 2 days ago
Comment by sroussey 2 days ago
Comment by trinsic2 2 days ago
Comment by kilroy123 2 days ago
Comment by ohazi 2 days ago
Comment by RexM 2 days ago
Comment by monkpit 2 days ago
Comment by redanddead 2 days ago
Comment by groundzeros2015 2 days ago
Comment by drivebyhooting 2 days ago
Comment by groundzeros2015 2 days ago
Comment by root_axis 2 days ago
Comment by mattgreenrocks 2 days ago
And no, I don't understand them at all. Taking responsibility for something, improving it, and stewarding it into production is a fantastic feeling, and much better than reading the comment section. :)
Comment by tossandthrow 2 days ago
We have implemented multi-cloud disaster recovery on our infrastructure. Something I would not have done yet, had we not had LLMs.
I am learning at an incredible rate with LLMs.
Comment by mgambati 2 days ago
But I’m so much more detached of the code, I don’t feel that ‘deep neural connection’ from actual spending days in locked in a refactor or debugging a really complex issue.
I don’t know how a feel about it.
Comment by Fire-Dragon-DoL 2 days ago
Sure, you don't know the code by heart, but people debugging code translated to assembly already do that.
The big difference is being able to unleash scripts that invalidate enormous numbers of hypotheses very fast and that can analyze the data.
Doing that by hand used to take hours, so it was a last-resort approach. Now it's very cheap, so validating many hypotheses is way cheaper!
I feel like my "debugging ability" in terms of value delivered has gone way up. Whether the skill itself has improved, I cannot tell, but the value I deliver in debugging sessions clearly has.
Comment by afzalive 2 days ago
But if you don't and there's no PR process (side projects), the motivation to form that connection is quite low.
Comment by hombre_fatal 2 days ago
No, because you can get LLMs to produce high quality code that has gone through an infinite number of refinement/polish cycles and is far more exhaustive than the code you would have written yourself.
Once you hit that point, you find yourself in a directional/steering position divorced from the code since no matter what direction you take, you'll get high quality code.
Comment by afzalive 16 hours ago
> no matter what direction you take, you'll get high quality code
This is not the case today. You get medium-quality, sometimes over-engineered code 10x faster.
Comment by ori_b 2 days ago
Comment by tossandthrow 2 days ago
You very much decide how you employ LLMs.
Nobody is holding a gun to your head to use them, in a certain sense.
So if you use them in a way that increases your inherent risk, then you are doing it incredibly wrong.
Comment by ori_b 2 days ago
Comment by SpicyLemonZest 2 days ago
I understand why a designer might read this post and not be happy about it. If you don't think your management values or appreciates design skill, you'd worry they're going to glaze over the bullet points about design productivity, and jump straight to the one where PMs and marketers can build prototypes and ignore you. But that's not what the sales pitch is focused on.
Comment by ori_b 2 days ago
Comment by Forgeties79 2 days ago
This all bumps up against the fact that most people default to “you use the tool wrong” and/or “you should only use it to do things where you already have firm grasp or at least foundational knowledge.”
It also bumps against the fact that the average person is using LLMs as a replacement for standard Google search.
Comment by andy_ppp 2 days ago
Comment by trinsic2 2 days ago
If you don't know whats going on through the whole process, good luck with the end product.
Comment by weego 2 days ago
Comment by tossandthrow 2 days ago
The latent assumption here is that learning is zero sum.
That you can take a 30-year-old from 1856, bring them into the present day, and they will learn whatever subject as fast as a present-day 20-year-old.
That teachers don't matter.
That engagement doesn't matter.
Learning is not zero sum. Some cultural background makes learning easier, some mentoring makes it easier, and some techniques increase engagement in ways that increase learning speed.
Comment by bluefirebrand 2 days ago
Could you do it again without the help of an LLM?
If no, then can you really claim to have learned anything?
Comment by _blk 2 days ago
Not everyone learns at the same pace and not everyone has the same fault tolerance threshold. In my experience, some people are what I call "Japanese learners", perfecting by watching. They will learn with AI but would never do it themselves out of fear of getting something wrong, even while they understand most of it. Others, whom I call "western learners", will start right away and "get their hands dirty" without much knowledge, and also get it wrong right away. Both are valid learning strategies fitting different personalities.
Comment by tossandthrow 2 days ago
And yes. If LLMs disappear, then we need to hire a lot of people to maintain the infrastructure.
Which naturally is a part of the risk modeling.
Comment by bluefirebrand 2 days ago
Not what I asked, but thanks for playing.
Comment by tossandthrow 2 days ago
> Could you do it again without the help of an LLM?
Comment by bluefirebrand 2 days ago
Comment by Paradigma11 2 days ago
Comment by lelanthran 2 days ago
Well, yes?
What do you think "learning" means? If you cannot do something without the teacher, you haven't learned that thing.
Comment by techpression 2 days ago
Comment by falkensmaize 2 days ago
If your child says they've learned their multiplication tables but they can't actually multiply any numbers you give them do they actually know how to do multiplication? I would say no.
Comment by Jweb_Guru 2 days ago
Comment by UncleMeat 2 days ago
Comment by sho_hn 2 days ago
Comment by danw1979 2 days ago
It’s quite possible to be deep into solving a problem with an LLM guiding you where you’re reading and learning from what it says. This is not really that different from googling random blogs and learning from Stack Overflow.
Assuming everyone just sits there dribbling whilst Claude is in YOLO mode isn’t always correct.
Comment by subscribed 2 days ago
> Could you do it again on your own?
Can you see how nonsensical your stance is? You're straight up accusing the GP of lying about learning something at an increased rate, OR suggesting that if they couldn't learn it, presumably at the same rate, on their own, they're not learning anything.
That's not very wise to project your own experiences on others.
Comment by sroussey 2 days ago
Comment by i_love_retros 2 days ago
I don't believe it. Having something else do the work for you is not learning, no matter how much you tell yourself it is.
Comment by margalabargala 2 days ago
Having other people do work for you is how people get to focus on things they actually care about.
Do you use a compiler you didn't write yourself? If so can you really say you've ever learned anything about computers?
Comment by butterisgood 2 days ago
Comment by viccis 2 days ago
Comment by margalabargala 2 days ago
Comment by viccis 1 day ago
Comment by margalabargala 1 day ago
That is in fact the anti LLM argument you've ostensibly been discussing. If you want to talk to the person who made it up I'm not your guy.
Comment by tossandthrow 2 days ago
Open your eyes, and you might become a believer.
Comment by nothinkjustai 2 days ago
Comment by subscribed 2 days ago
Indeed, quite weird and no imagination.
Comment by tossandthrow 2 days ago
It does seem like there is a cult of people who categorically see LLMs as being poor at anything, without that being founded in any experience other than their 2023 afternoon playing around with one.
Comment by nothinkjustai 2 days ago
Can’t you be satisfied with outcompeting “non believers”? What motivates you to argue on the internet about it? Deep down are you insecure about your reliance on these tools or something, and want everyone else to be as well?
Comment by tossandthrow 2 days ago
It feels so off to be rebuilding serious SaaS apps in days, for production, only to be told it is not possible.
Comment by nothinkjustai 2 days ago
Comment by Wowfunhappy 2 days ago
That’s product atrophy, not skill atrophy.
Comment by deadbabe 2 days ago
And not even just understanding, but verifying that they’ve implemented the optimal solution.
Comment by tehjoker 2 days ago
Comment by jjallen 2 days ago
What an interesting paradox-like situation.
Comment by estetlinus 2 days ago
Well, if internet is down, so is our revenue buddy. Engineering throughput would be the last of our concerns.
Comment by solarengineer 2 days ago
When future humans rediscover mathematics.
Comment by IgorPartola 2 days ago
And don’t get me started on memory management. Nobody even knows how to use malloc(), let alone brk()/mmap(). Everything is relying on automatic memory management.
I mean when was the last time you actually used your magnetized needle? I know I am pretty rusty with mine.
Comment by otabdeveloper4 2 days ago
Yeah, exactly.
Comment by techpression 2 days ago
Comment by boxingdog 2 days ago
Comment by dgellow 2 days ago
Comment by xixixao 2 days ago
It’s like saying clothing manufacturers are paying the “loom tax” when they could have been weaving by hand…
Comment by SlinkyOnStairs 2 days ago
Where producing 2x the t-shirts will get you ~2x the revenue, it's quite unlikely that 10x the code will get you even close to 2x revenue.
With how much of this industry operates on 'Vendor Lock-in' there's a very real chance the multiplier ends up 0x. AI doesn't add anything when you can already 10x the prices on the grounds of "Fuck you. What are you gonna do about it?"
Comment by groundzeros2015 2 days ago
Comment by 3dfd 2 days ago
Comment by bigbadfeline 2 days ago
Open source libraries and projects together with open source AI is the only way to avoid the existential risks of closed source AI.
Comment by redanddead 2 days ago
Comment by dakiol 2 days ago
Comment by davidron 2 days ago
I don't know about 10x, but this could only happen if PMs suddenly got really lazy or the engineers actually got at least 1.5x faster. My gut says it's way more because we're now also consistently up to date on our dependencies and completing massive refactors we were putting off for years.
There are lots of reasons this could be the case. Quality suddenly changed, the nature of the work changed, engineers leveled up... But for this to have happened consistently across a bunch of engineering teams is quite the coincidence if not this one thing we are all talking about.
Comment by Silhouette 2 days ago
The evangelists told us 20 years ago that if we weren't doing TDD then we weren't really professional programmers at all. The evangelists told us 10 years ago that if we were still running stuff locally then we must be paying a fortune for IT admin or not spending our time on the work that mattered. The evangelists this week tell us that we need to be using agents to write all our code or we'll get left in the dust by our competitors who are.
I'm still waiting for my flying car. Would settle for some graphics software on Linux that matches the state of the art on Windows or even reliable high-quality video calls and online chat rooms that don't make continental drift look fast.
Comment by dgellow 2 days ago
Comment by senordevnyc 2 days ago
Comment by Lihh27 2 days ago
Comment by otabdeveloper4 2 days ago
This doesn't happen. Literally zero evidence of this.
Comment by dgellow 2 days ago
Comment by Miner49er 2 days ago
If the actual rate is .9x then it matters a lot.
Or even if it's like 1.1x, is the cost worth the return?
Comment by xvector 2 days ago
Meta pays $750k+ TC and makes far more profit/eng, do you think they care about $5k/eng/mo in inference? A 1.1x increase would be so significant that it would justify the cost easily, especially when you can just compress comps to make up for it
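Treating this comment's own numbers as given (they are the commenter's assumptions, not Meta's actual economics), a quick back-of-envelope check shows why even a 1.1x gain would dwarf the inference bill:

```python
# All figures come from the comment above; none are real Meta data.
total_comp = 750_000          # $/engineer/year total compensation
inference = 5_000 * 12        # $/engineer/year in inference spend -> $60,000
productivity_gain = 0.10      # a "1.1x" engineer

# Value the extra output at what the engineer costs the company:
extra_value = total_comp * productivity_gain   # $75,000/year
roi = extra_value / inference                  # return per dollar of inference

print(f"extra value: ${extra_value:,.0f}, inference: ${inference:,.0f}, ROI: {roi:.2f}x")
```

Under those assumptions the inference spend pays for itself with room to spare, before counting the profit-per-engineer multiple the comment alludes to.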
Comment by AlexeyBelov 1 day ago
Comment by otabdeveloper4 2 days ago
Do you really think they go on vibes - "welp, this AI thing seems to improve developer performance, I guess. Heck, what's an extra 5k per developer anyways, amirite".
Well, maybe they really do in your neck of the woods. Explains a lot, I guess.
Comment by xvector 1 day ago
Comment by otabdeveloper4 1 day ago
That is just, like, your opinion, man.
Also, I doubt these kinds of companies have "quality" of anything, never mind "gains in quality".
Comment by surgical_fire 2 days ago
Would it matter?
Comment by dgellow 1 day ago
Comment by JambalayaJimbo 2 days ago
Comment by benjiro3000 2 days ago
Comment by michaelje 2 days ago
Frontier labs are incentivized to keep it that way, and they're investing billions to make AI = API the default. But that's a business model, not a technical inevitability.
Comment by trueno 2 days ago
I've had to, like, tune out of the LLM scene because it's just a huge mess. It feels impossible to actually get benchmarks, it's insanely hard to get a grasp on what everyone is talking about, bots galore championing whatever model, it's just way too much craze and hype and misinformation. What I do know is we can't keep draining lakes with datacenters here and letting companies that are willing to heel turn on a whim basically control the output of all companies. That's not going to work; we collectively have to find a way to make local inference the path forward.
everyone's foot is on the gas. all orgs, all execs, all peoples working jobs. there's no putting this stuff down, and it's exhausting but we have to be using claude like _right now_. pretty much every company is already completely locked in to openai/gemini/claude and for some unfortunate ones copilot. this was a utility vendor lock in capture that happened faster than anything ive ever seen in my life & I already am desperate for a way to get my org out of this.
Comment by hakfoo 2 days ago
I get choice paralysis when you show me a prompt box-- I don't know what I can reasonably ask for and how to best phrase it, so I just panic. It doesn't help when we see articles saying people are getting better outcomes by adding things like "and no bugs plz owo"
I'm sure this is by design-- anything with clear boundaries and best practices would discourage gacha style experimentation. Can you trust anyone who sells you a metered service to give you good guidance on how to use it efficiently?
Comment by trueno 2 days ago
i don't know how else to phrase it: this feels like such an unstable landscape, "beta" software/services are running rampant in every industry/company/org/etc and there's absolutely no single resource we can turn to to help stay ahead of & plan for the rapidly-evolving landscape. every, and i mean every company, is incredibly irresponsible for using this stuff. including my own. once again though, cat's already out of the bag. now we fight for our lives trying to contain it and ensure things are well understood and implemented properly...which seems to be the steepest uphill battle of my life
Comment by dewarrn1 2 days ago
Comment by i_love_retros 2 days ago
My manager doesn't even want us to use copilot locally. Now we are supposed to only use the GitHub copilot cloud agent. One shot from prompt to PR. With people like that selling vendor lock in for them these companies like GitHub, OpenAI, Anthropic etc don't even need sales and marketing departments!
Comment by tossandthrow 2 days ago
Comment by dgellow 2 days ago
Comment by tossandthrow 2 days ago
One-shotting has a very specific meaning, and agentic workflows are not it.
What is the implied meaning I should understand from them using one shot?
They might refer to the lack of humans in the loop.
Comment by dgellow 2 days ago
Comment by tossandthrow 2 days ago
But it requires that one does not do something stupid.
Eg. For recurring tasks: keep the task specification in the source code and just ask Claude to execute it.
The same with all documentation, etc.
Comment by aliljet 2 days ago
Comment by xvector 2 days ago
The open model mentality is also just so bizarre to me. You're going to use an inferior model to save, what, a couple hundred bucks a month? Is your time really worth that little?
No one working on a serious project at a serious company is downgrading their agent's intelligence for a marginal cost saving. Downgrading your model is like downgrading the toilet paper on your yacht.
Comment by tredre3 2 days ago
I agree that people who claim that open models are as good as Claude/OpenAI/Z are lying, delusional, or not doing very much. I've tried them all, including GLM 5.1.
GLM is not bad, but the hardware needed will never pay for itself vs. just using a commercial provider through its API.
That being said, you're being reductive here. For many use cases local models offer advantages that can't be obtained through a commercial API: privacy, ownership of the entire stack, predictability. They can't be rugpulled, they can't snitch on you. They will not give you a 503.
Those advantages are very valuable for things like a local assistant, as an agent, for data extraction, for translations, for games (role playing and whatnot), etc.
That being said I know that many people are like you, they don't give a second thought about privacy. They'd plug Anthropic to their brain if they could. So I understand the sentiment. I just think that you should in turn try to understand why someone would use an open model.
Comment by WarmWash 2 days ago
Comment by parinporecha 2 days ago
Comment by slopinthebag 2 days ago
5.1 is like $4 / 1m output, Opus 4.6 is $25. GPT 5.4 pro is $270 with large contexts :O
Comment by esafak 2 days ago
Comment by ojosilva 2 days ago
Comment by Someone1234 2 days ago
I've said it before and I'll say it again, local models are "there" in terms of true productive usage for complex coding tasks. Like, for real, there.
The issue right now is that buying the compute to run the top end local models is absurdly unaffordable. Both in general but also because you're outbidding LLM companies for limited hardware resources.
You have a $10K budget, you can legit run last year's SOTA agentic models locally and do hard things well. But most people don't or won't, nor does it make cost-effective sense vs. currently subsidized API costs.
Comment by gbro3n 2 days ago
Comment by Someone1234 2 days ago
So my point is: If you have the attitude that unless it is the bleeding edge, it may as well not exist, then local models are never going to be good enough. But the truth is they're now well exceeding what they need to be to be huge productivity tools, and would have been bleeding edge fairly recently.
Comment by gbro3n 2 days ago
Comment by dakiol 2 days ago
Don't you understand that by choosing the best model we can, we are, collectively, step by step, devaluing what our time is worth? Do you really think we can all keep our fancy paychecks while we keep using AI?
Comment by gbro3n 2 days ago
Comment by lelanthran 2 days ago
There were always jobs that required those "many more skills" but didn't require any programming skills.
We call those people Business Analysts and you could have been doing it for decades now. You didn't, because those jobs paid half what a decent/average programmer made.
Now you are willingly jumping into that position without realising that the gap between your current salary and that role's value (i.e. half your salary, or less) would eventually disappear.
Comment by gbro3n 2 days ago
Comment by lelanthran 2 days ago
They don't need to all run on the same frameworks, they just need to run on documented frameworks.
What possible value can you bring to a BA?
The system topology (say, if the backend was microservices vs Lambda vs something-else)? The LLM can explain to the BA what their options are, and the impact of those options.
The framework being used (Vue, or React, or something else)? The AI can directly twiddle that for the BA.
Solving a problem? If the observability is set up, the LLM can pinpoint almost all the problems too, and with a separate UAT or failover-type replica, can repro, edit, build, deploy and test faster than you can.
Like I already said, if[1] you're now able to build or enhance a system without actually needing programming skills, why are you excited about that? You could always do that. It's just that it pays half what programming skills gets you.
You (and many others who boast about not writing code since $DATE) appear to be willingly moving to a role that already pays less, and will pay even less once the candidates for that role double (because now all you programmers are shifting towards it).
It's supply and demand, that's all.
--------------
[1] That's a very big "If", I think. However, the programmers who are so glad to not program appear to believe that it's a very small "If", because they're the ones explaining just how far the capabilities have come in just a year, and expect the trend to continue. Of course, if the SOTA models never get better than what we have now, then, sure - your argument holds - you'll still provide value.
Comment by aliljet 2 days ago
Comment by wellthisisgreat 2 days ago
Early last year or late last year?
opus 4.5 was quite a leap
Comment by HWR_14 2 days ago
Comment by sscaryterry 2 days ago
Comment by leonidasv 2 days ago
I fear that this may not be feasible in the long term. The open-model free ride is not guaranteed to continue forever; some labs offer them for free for publicity after raising millions in VC funding, but that's not a sustainable business model. Models cost millions or billions in infrastructure to train. It's not like open-source software, where people can volunteer their time for free; here we are talking about spending real money upfront on something that will become obsolete in months.
Current AI model "production" is more akin to an industrial endeavor than open-source arrangements we saw in the past. Until we see some breakthrough, I'm bearish on "open models will eventually save us from reliance on big companies".
Comment by falkensmaize 2 days ago
If you mean obsolete in the sense of "no longer fit for purpose" I don't think that's true. They may become obsolete in terms of "can't do hottest new thing" but that's true of pretty much any technology. A capable local model that can do X will always be able to do X, it just may not be able to do Y. But if X is good enough to solve your problem, why is a newer better model needed?
I think if we were able to achieve ~Opus 4.6 level quality in a local model that would probably be "good enough" for a vast number of tasks. I think it's debatable whether newer models are always better - 4.7 seems to be somewhat of a regression for example.
Comment by sergiotapia 2 days ago
1. Opencode
2. Fireworks AI: GLM 5.1
And it is SIGNIFICANTLY cheaper than Claude. I'm waiting eagerly for something new from Deepseek. They are going to really show us magic.
Comment by ben8bit 2 days ago
Comment by culi 2 days ago
model                          elo    $/M
-----------------------------------------
glm-5.1                        1538   2.60
glm-4.7                        1440   1.41
minimax-m2.7                   1422   0.97
minimax-m2.1-preview           1392   0.78
minimax-m2.5                   1386   0.77
deepseek-v3.2-thinking         1369   0.38
mimo-v2-flash (non-thinking)   1337   0.24
https://arena.ai/leaderboard/code?viewBy=plot&license=open-s...
Comment by logicprog 2 days ago
Comment by blahblaher 2 days ago
Comment by zozbot234 2 days ago
Comment by pitched 2 days ago
Comment by zozbot234 2 days ago
Comment by DeathArrow 2 days ago
Comment by pitched 2 days ago
Comment by cyberax 2 days ago
So far, Qwen 3.6 created a functionally equivalent Golang implementation that works against the flat file backend within the last 2 days. I'm extremely impressed.
Comment by Gareth321 2 days ago
Comment by wuschel 2 days ago
Comment by pitched 2 days ago
Comment by equasar 2 days ago
I don't know if it is Bun-related, but in Task Manager it is almost always at the top for CPU usage. It turns out that, for me, Bun is not production ready at all.
Wish Zed editor had something like BigPickle which is free to use without limits.
Comment by Jarred 2 days ago
What issue did you run into?
Comment by jherdman 2 days ago
Comment by danw1979 2 days ago
Comment by pitched 2 days ago
Comment by richardfey 2 days ago
Comment by _blk 2 days ago
Comment by macwhisperer 2 days ago
Comment by zozbot234 2 days ago
Comment by pitched 2 days ago
Comment by tredre3 2 days ago
A medium MoE like 35B can still achieve usable speeds in that setup, mind you, depending on what you're doing.
Comment by Gareth321 2 days ago
Comment by cpursley 2 days ago
Comment by cmrdporcupine 2 days ago
Comment by myaccountonhn 2 days ago
Comment by elbear 2 days ago
Comment by DeathArrow 2 days ago
Comment by GaryBluto 2 days ago
Google just released Gemma 4, perhaps that'd be worth a try?
Comment by OrvalWintermute 2 days ago
If you have HPC or supercompute already, you have much of the expertise on staff to run models locally, and between Apple Silicon and Exo there are some amazing solutions out there.
Now, if only the rumors about Exo expanding to Nvidia are true..
Comment by somewhereoutth 2 days ago
Comment by zozbot234 2 days ago
Comment by tehjoker 2 days ago
Comment by DeathArrow 2 days ago
Training and inference costs so we would have to pay for them.
Comment by groundzeros2015 2 days ago
Comment by sky2224 2 days ago
I think companies that are shelling out the money for these enterprise accounts could honestly just buy some H100 GPUs and host the models themselves on premises. Github CoPilot enterprise charges $40 per user per month (this can vary depending on your plan of course), but at this price for 1000 users that comes out to $480,000 a year. Maybe I'm missing something, but that's roughly what you're going to be spending to get a full fledged hosting setup for LLMs.
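The comment's figure checks out; as a quick sketch using only the list price quoted there ($40/user/month is the commenter's number, and real enterprise pricing varies by plan):

```python
# Annual spend on a per-seat subscription, at the price quoted in the comment.
users = 1000
per_user_month = 40                      # $/user/month, as quoted
annual = users * per_user_month * 12     # total $/year

print(f"${annual:,} per year")           # $480,000 per year
```

Whether that budget actually covers H100s plus the staff to run them on-prem is the open question the comment raises, not something this arithmetic settles.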
Comment by merlinoa 2 days ago
Comment by subarctic 2 days ago
Comment by sky2224 2 days ago
However, we've been seeing advancements in compressing context and capabilities of smaller models that I don't think it'd be too far off to see something like what I'm talking about within the next 5 years.
Comment by SilverElfin 2 days ago
Comment by Bridged7756 2 days ago
Comment by sourya4 2 days ago
made a HN post of my X article on the lock-in factor and how we should embrace the modular unix philosophy as a way out: https://news.ycombinator.com/item?id=47774312
Comment by crgk 2 days ago
Comment by finghin 2 days ago
Comment by giancarlostoro 2 days ago
I'm still surprised top CS schools are not investing in having their students build models. I know some are, but when's the last time we talked about a model not made by some company, but by a college or university, maintained by the university and useful to all?
It's disgusting that OpenAI still calls itself "Open AI" when they aren't truly open.
Comment by atleastoptimal 2 days ago
Comment by Frannky 2 days ago
Comment by wahnfrieden 2 days ago
Comment by boxingdog 2 days ago
Comment by throwaway613746 2 days ago
Comment by gbgarbeb 2 days ago
Comment by couchdb_ouchdb 2 days ago
Comment by Gareth321 2 days ago
Comment by jbrooks84 2 days ago
Comment by autoconfig 2 days ago
Comment by zuzululu 2 days ago
Sticking with codex. Also GPT 5.5 is set to come next week.
Comment by fathermarz 2 days ago
I think people aren’t reading the system cards when they come out. They explicitly explain your workflow needs to change. They added more levels of effort and I see no mention of that in this post.
Did y’all forget Opus 4? That was not that long ago, and Claude was essentially unusable then. We are at peak wizardry right now and no one is talking positively. It’s all doom and gloom around here these days.
Comment by gck1 2 days ago
How about - don't break my workflow unless the change is meaningful?
While we're at it, either make y in x.y mean "groundbreaking", or "essentially same, but slightly better under some conditions". The former justifies workflow adjustments, the latter doesn't.
Comment by RevEng 2 days ago
Comment by isodev 2 days ago
Comment by someguyiguess 2 days ago
Comment by templar_snow 2 days ago
Comment by copperx 2 days ago
Comment by templar_snow 2 days ago
Comment by copperx 2 days ago
Comment by FireBeyond 2 days ago
Comment by vidarh 2 days ago
Comment by anabranch 2 days ago
I'm surprised that it's 45%. Might go down (?) with longer context answers but still surprising. It can be more than 2x for small prompts.
Comment by pawelduda 2 days ago
Comment by KellyCriterion 2 days ago
Comment by tailscaler2026 2 days ago
Comment by gadflyinyoureye 2 days ago
Comment by sgc 2 days ago
Comment by andai 2 days ago
If I can have Claude write up the plan, and the other models actually execute it, I'd get the best of both worlds.
(Amusingly, I think Codex tolerates being invoked by Claude (de facto tolerated ToS violation), but not the other way around.)
Comment by zozbot234 2 days ago
You could nonetheless have Codex write up the plan to an .md file for Claude (perhaps Sonnet or even Haiku?) to execute.
Comment by andai 2 days ago
Anthropic's been against 3rd party usage of the subs, including, I believe, 3rd party software invoking Claude Code.
OpenAI, from what I heard, is the only company tolerating this type of usage (for now).
So my point was, if you get a subscription from both, you could let Claude "drive" Codex, but not the reverse.
And while it sounds silly, it might actually be the only way to get real work done at the moment, with usage being exhausted so quickly.
Comment by smt88 2 days ago
If tech companies convince Congress that AI is an existential issue (in defense or even just productivity), then these companies will get subsidies forever.
Comment by andai 2 days ago
And shafting your customers too hard is bad for business, so I expect only moderate shafting. (Kind of surprised at what I've been seeing lately.)
Comment by danny_codes 2 days ago
Comment by smt88 2 days ago
Comment by pitched 2 days ago
Comment by nothinkjustai 2 days ago
Comment by senordevnyc 2 days ago
Comment by RevEng 2 days ago
Comment by varispeed 2 days ago
Comment by Syzygies 2 days ago
So far, Opus 4.7 seems a bit smarter than Opus 4.6 for my use case. That's my only concern. Is an $80 bottle of wine a better value than a $20 or $40 bottle of wine? Pretty much never. If there are those of us willing to buy $80 bottles of wine, of course the market will facilitate this.
People can use whatever model they want. I'm too worried about worms crawling through my dead body to waste time on any but the smartest model any moment can offer.
Comment by WarmWash 2 days ago
Comment by xvector 2 days ago
And what's missing in all these token count complaints is that 4.7 is actually cheaper overall anyways because it produces fewer output tokens.
Comment by throwatdem12311 2 days ago
Comment by napolux 2 days ago
Comment by ausbah 2 days ago
Comment by zozbot234 2 days ago
Plenty of OSS models being released as of late, with GLM and Kimi arguably being the most interesting for the near-SOTA case ("give these companies a run for their money"). Of course, actually running them locally for anything other than very slow Q&A is hard.
Comment by slowmovintarget 2 days ago
Comment by rectang 2 days ago
This gives me hope that even if future versions of Opus continue to target long-running tasks and get more and more expensive while being less-and-less appropriate for my style, that a competitor can build a model akin to Opus 4.5 which is suitable for my workflow, optimizing for other factors like cost.
Comment by DeathArrow 2 days ago
Comment by amelius 2 days ago
Comment by andai 2 days ago
Comment by never_inline 2 days ago
Comment by 100ms 2 days ago
Comment by casey2 2 days ago
Comment by embedding-shape 2 days ago
Comment by jimkleiber 2 days ago
Is Opus 4.7 that significantly different in quality that it should use that much more in tokens?
I like Claude and Anthropic a lot, and hope it's just some weird quirk in their tokenizer or whatnot, just seems like something changed in the last few weeks and may be going in a less-value-for-money direction, with not much being said about it. But again, could just be some technical glitch.
Comment by hopfenspergerj 2 days ago
Comment by jimkleiber 2 days ago
Comment by gck1 2 days ago
First they introduce a policy to ban third party clients, but the way it's written, it affects claude -p too, and 3 months later, it's still confusing with no clarification.
Then they hide the model's thinking and introduce a new flag that still shows summaries of thinking, which they break again in the next release, with a new flag.
Then they silently cut the usage limits to the point where the exact same usage you're used to consumes 40% of your weekly quota in 5 hours. And not only do they stay silent for two entire weeks, they actively gaslight users, saying they didn't change anything, only to announce later that they did, in fact, change the limits.
Then they serve a lobotomized model for an entire week before they drop 4.7, again, gaslighting users that they didn't do that.
And then this.
Anthropic has lost all credibility at this point and I will not be renewing my subscription. If they can't provide services under a price point, just increase the price or don't provide them.
EDIT: forgot "adaptive thinking", so add that too. Which essentially means "we decide when we can allocate resources for thinking tokens based on our capacity, or in other words - never".
Comment by bobjordan 2 days ago
Our default topology is a two-agent pair: one implementer and one reviewer. In practice, that usually means Opus writing code and Codex reviewing it.
I just finished a 10-hour run with 5 of these teams in parallel, plus a Codex run manager. Total swarm: 5 Opus 4.7 agents and 6 Codex/GPT-5.4 agents.
Opus was launched with:
`CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=35 claude --dangerously-skip-permissions --model 'claude-opus-4-7[1M]' --effort high --thinking-display summarized`
Codex was launched with:
`codex --dangerously-bypass-approvals-and-sandbox --profile gpt-5-4-high`
What surprised me was usage: after 10 hours, both my Claude Code account and my Codex account had consumed 28% of their weekly capacity from that single run.
I expected Claude Code usage to be much higher. Instead, on these settings and for this workload, both platforms burned the same share of weekly budget.
So from this datapoint alone, I do not see an obvious usage-efficiency advantage in switching from Opus 4.7 to Codex/GPT-5.4.
Comment by pitched 2 days ago
Comment by varispeed 2 days ago
Having had a taste of unnerfed Opus 4.6, I think they have a conflict of interest: if they let models give the right answer the first time, a person will spend less time with them and less money, but if they make the model artificially dumber ("progressive reasoning", if you will), people get frustrated but spend more money.
It is likely happening because the economics don't work. Running a comparable model at comparable speed for an individual is prohibitively expensive. Now scale that to millions of users; something's gotta give.
Comment by fmckdkxkc 2 days ago
It’s funny everyone says “the cost will just go down” with AI but I don’t know.
We need to keep the open source models alive and thriving. Oh, but wait the AI companies are buying all the hardware.
Comment by atleastoptimal 2 days ago
Comment by razodactyl 2 days ago
To me this seems more like it's trained to be concise by default, which I guess can be countered with preference instructions if required.
What's interesting to me is that they're using a new tokeniser. Does that mean they trained a new model from scratch, or took an existing model and further trained it with a swapped-out tokeniser?
The looped-model research/speculation is also quite interesting; if done right, there are significant speed-ups and resource savings.
Comment by andai 2 days ago
Comment by fumar 2 days ago
Comment by macinjosh 2 days ago
Comment by coldtea 2 days ago
It's going to be a very expensive game, and the masses will be left with subpar local versions. It would be as if we reversed the democratization of compilers and coding tooling achieved in the 90s and 00s, and the polished, more capable tools were again all proprietary.
Comment by danny_codes 2 days ago
So over time older models will be less valuable, but new models will only be slightly better. Frontier players, therefore, are in a losing business. They need to charge high margins to recoup their high training costs. But latecomers can simply train for a fraction of the cost.
Since performance is asymptotic, eventually the first-mover advantage is entirely negligible and LLMs become a simple commodity.
The only moat I can see is data, but distillation proves that this is easy to subvert.
There will probably be a window though where insiders get very wealthy by offloading onto retail investors, who will be left with the bag.
Comment by coldtea 2 days ago
There hasn't been a real Moore's law for a good while even before LLMs.
And memory isn't getting less expensive either...
Comment by quux 2 days ago
Oh well
Comment by slowmovintarget 2 days ago
OpenAI was built as you say. Google had a corporate motto of "Don't be evil", which they removed so they could, um, do evil stuff without cognitive dissonance, I guess.
This is the other kind of enshittification, where the businesses turn into power accumulators.
Comment by throwaway041207 2 days ago
You could call it a rug pull, but they may just be doing the math and realize this is where pricing needs to shift to before going public.
Comment by zozbot234 2 days ago
Comment by vicchenai 2 days ago
what bugs me is the tokenizer change feels like a stealth price hike. if you're charging the same $/token but the same text now costs 35% more tokens, that's just a 35% price increase with extra steps. at least be upfront about it.
Comment by ianberdin 2 days ago
Not a secret: the model is the best in the world. Yet it is crazy expensive, and this 35% is huge for us: $10,000 becomes $13,500. Don't forget, Anthropic's tokenizer also reports far more tokens than other providers'.
We have experimented a lot with GLM 5.1. It is kinda close, but with downsides: no images, adequate context tops out around 100K, and poor text writing. However, it's a great designer. So there is no replacement. We pray.
Comment by BrianneLee011 2 days ago
Comment by spencerkw 1 day ago
what makes it worse is it compounds with two other things: thinking tokens (invisible but counted against limits) and the more verbose output style. so the effective cost delta is closer to 1.5-2x, not just the 1.35x from the tokenizer alone.
practically the only mitigation right now is to keep using 4.6 for tasks where you don't need the reasoning improvements and only use 4.7 when you actually need it. but that means maintaining model selection logic per-task, which most people won't bother with.
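The compounding described above is easy to sketch in back-of-envelope form. All the numbers here are illustrative: the 1.35x input factor comes from Anthropic's stated tokenizer range, but the 1.3x output factor (thinking plus verbosity) is an assumption, not a measured value.

```python
# Sketch: how tokenizer inflation on input compounds with extra
# thinking/verbose tokens on output. Prices are Opus list prices;
# the scaling factors are illustrative assumptions.

INPUT_PRICE = 5 / 1_000_000    # $ per input token
OUTPUT_PRICE = 25 / 1_000_000  # $ per output token

def task_cost(input_tokens, output_tokens,
              tokenizer_factor=1.0, output_factor=1.0):
    """Cost of one task after scaling token counts by the given factors."""
    return (input_tokens * tokenizer_factor * INPUT_PRICE
            + output_tokens * output_factor * OUTPUT_PRICE)

baseline = task_cost(100_000, 5_000)  # hypothetical 4.6-era task
# 4.7: up to 1.35x input tokens, plus an assumed 1.3x more output
inflated = task_cost(100_000, 5_000, tokenizer_factor=1.35, output_factor=1.3)
print(f"effective cost delta: {inflated / baseline:.2f}x")  # → 1.34x
```

With these particular numbers the compounded delta lands around 1.34x; pushing the output factor higher (heavier thinking, longer answers) is what would drive it toward the 1.5-2x range claimed above.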
Comment by monkpit 2 days ago
Comment by QuadrupleA 2 days ago
Comment by aray07 2 days ago
It was on the higher end of Anthropic's range: closer to 30-40% more tokens.
https://www.claudecodecamp.com/p/i-measured-claude-4-7-s-new...
Comment by ai_slop_hater 2 days ago
Comment by quux 2 days ago
Comment by alphabettsy 2 days ago
Maybe I missed it, but it doesn't tell you whether it's more successful for less overall cost?
I can easily make Sonnet 4.6 cost far more than any Opus model, because while it's cheaper per prompt, it might take 10x more rounds to solve a problem (or never solve it at all).
Comment by senordevnyc 2 days ago
Comment by ben8bit 2 days ago
Comment by hirako2000 2 days ago
That's an incentive difficult to reconcile with the user's benefit.
To keep this business running they do need to invest to make the best model, period.
It happens to be exactly what Anthropic's strategy is. That and great tooling.
Comment by subscribed 2 days ago
And they're selling less and less (suddenly a 5-hour window lasts 1 hour on tasks similar to those it lasted 5 hours on a week ago), so IMO they're scamming.
I hope many people are making notes and will raise heat soon.
Comment by hirako2000 1 day ago
Anthropic has to keep racing ahead and be seen as offering the best frontier models.
It isn't optimal: the models cost them disproportionately more than they can sell them for at a profitable price. So they keep feeding the hype and pushing costs higher, hoping there won't be too much heat and they get away with it.
I wouldn't like to be a leader at such a company, but their pay keeps them in line.
Comment by ivanfioravanti 2 days ago
Comment by l5870uoo9y 2 days ago
Comment by andai 2 days ago
The difference here is Opus 4.7 has a new tokenizer which converts the same input text to a higher number of tokens. (But it costs the same per token?)
> Claude Opus 4.7 uses a new tokenizer, contributing to its improved performance on a wide range of tasks. This new tokenizer may use roughly 1x to 1.35x as many tokens when processing text compared to previous models (up to ~35% more, varying by content), and /v1/messages/count_tokens will return a different number of tokens for Claude Opus 4.7 than it did for Claude Opus 4.6.
> Pricing remains the same as Opus 4.6: $5 per million input tokens and $25 per million output tokens.
ArtificialAnalysis reports 4.7 significantly reduced output tokens though, and overall ~10% cheaper to run the evals.
I don't know how well that translates to Claude Code usage though, which I think is extremely input heavy.
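Whether 4.7 nets out cheaper therefore depends on the input/output mix of the workload. A rough sketch, with illustrative token counts and an assumed 0.7x output reduction (the ArtificialAnalysis figures suggest fewer output tokens, but 0.7 is my guess, not their number):

```python
# Sketch: net cost of 4.7 vs 4.6 for two hypothetical workload shapes,
# given ~1.35x input-token inflation and an assumed 0.7x output reduction.
IN_P, OUT_P = 5 / 1_000_000, 25 / 1_000_000  # $/token, Opus list prices

def cost(input_tokens, output_tokens):
    return input_tokens * IN_P + output_tokens * OUT_P

workloads = [
    (1_000_000, 20_000, "input-heavy (agentic, Claude Code-like)"),
    (20_000, 20_000, "output-heavy (chat-like)"),
]
for inp, out, label in workloads:
    old = cost(inp, out)
    new = cost(inp * 1.35, out * 0.7)  # assumed 4.7 scaling factors
    print(f"{label}: {new / old:.2f}x of 4.6 cost")
# → input-heavy lands around 1.29x, output-heavy around 0.81x
```

Under these assumptions the input-heavy workload gets more expensive while the output-heavy one gets cheaper, which matches the intuition above about Claude Code being extremely input-heavy.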
Comment by TomGarden 2 days ago
Comment by nmeofthestate 2 days ago
Comment by jbrooks84 2 days ago
Comment by silverwind 2 days ago
Comment by Frannky 2 days ago
Comment by Shailendra_S 2 days ago
What I've been doing is running a dual-model setup — use the cheaper/faster model for the heavy lifting where quality variance doesn't matter much, and only route to the expensive one when the output is customer-facing and quality is non-negotiable. Cuts costs significantly without the user noticing any difference.
The real risk is that pricing like this pushes smaller builders toward open models or Chinese labs like Qwen, which I suspect isn't what Anthropic wants long term.
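The routing idea in the first paragraph can be sketched very simply. Everything here is hypothetical: the model names, the task-flag names, and the threshold are all placeholders for whatever your actual client and task metadata look like.

```python
# Minimal sketch of quality-based model routing: cheap model by default,
# premium model only for customer-facing / quality-critical work.
# Model IDs and task flags are illustrative placeholders.

CHEAP_MODEL = "cheap-fast-model"      # placeholder for the cheap/fast tier
PREMIUM_MODEL = "expensive-best-model"  # placeholder for the premium tier

def pick_model(task: dict) -> str:
    """Route to the premium model only when quality is non-negotiable."""
    if task.get("customer_facing") or task.get("quality_critical"):
        return PREMIUM_MODEL
    return CHEAP_MODEL

print(pick_model({"customer_facing": True}))   # → expensive-best-model
print(pick_model({"kind": "internal_batch"}))  # → cheap-fast-model
```

The pattern's value is that the routing decision lives in one function, so repricing events like this one only require changing which model sits behind each tier.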
Comment by OptionOfT 2 days ago
There are 2 things to consider:
* Time to market.
* Building a house on someone else's land.
You're balancing the two, hoping that you win on time to market, making the second point obsolete from a cost perspective, or that you have the money to pivot to DIY.
Comment by duped 2 days ago
This is going to be blunt, but this business model is fundamentally unsustainable and "founders" don't get to complain their prospecting costs went up. These businesses are setting themselves up to get Sherlocked.
The only realistic exit for these kinds of businesses is to score a couple gold nuggets, sell them to the highest bidder, and leave.
Comment by c0balt 2 days ago
A smaller builder might reconsider (re)acquiring relevant skills and applying them. We don't suddenly lose the ability to program (or hire someone to do it) just because an inference provider is available.
Comment by gverrilla 2 days ago
Comment by nickvec 2 days ago
Comment by blahblaher 2 days ago
Comment by willis936 2 days ago
Comment by operatingthetan 2 days ago
Comment by micromacrofoot 2 days ago
latest claude still fails the car wash test
Comment by reddit_clone 2 days ago
>I want to wash my car. The car wash is 50 meters away. Should I walk or drive?
Walk. It's 50 meters — you're going there to clean the car anyway, so drive it over if it needs washing, but if you're just dropping it off or it's a self-service place, walking is fine for that distance.
Comment by zozbot234 2 days ago
Comment by eezing 2 days ago
Comment by liangyunwuxu 2 days ago
Comment by bparsons 2 days ago
Claude design on the other hand seemed to eat through (its own separate usage limit) very fast. Hit the limit this morning in about 45 mins on a max plan. I assume they are going to end up spinning that product off as a separate service.
Comment by axeldunkel 2 days ago
Comment by cooldk 2 days ago
Comment by DeathArrow 2 days ago
Comment by dackdel 2 days ago
Comment by ozgrakkurt 2 days ago
Also there should be time distribution for the queries and a way to filter by query time. This is because Anthropic is reported to change the model quality arbitrarily in the background.
Also there are no units in the table column headers. For example, "Request 4.7": is this the number of tokens 4.7 consumes? Is it output, input, reasoning, etc.?
Really difficult to make sense of this.
People get offended if what they are doing is labeled as slop but this is unfortunately the level of quality I expect from AI related content or code.
Comment by therobots927 2 days ago
Comment by falcor84 2 days ago
To be clear, I'm not saying that it's a good thing, but it does seem to be going in this direction.
Comment by dgellow 2 days ago
Comment by therobots927 2 days ago
And junior devs have never added much value. The first two years of any engineer's career are essentially an apprenticeship. There's no value add from having a perpetually junior "employee".
Comment by ManlyBread 2 days ago
This has resulted in +92.9% cost and token difference. Submission bd2457e5, currently at the top of the leaderboard.
Comment by justindotdev 2 days ago
Comment by bcherny 2 days ago
Under the hood, what was happening is that older models needed reminders, while 4.7 no longer needs them. When we showed these reminders to 4.7, it tended to over-fixate on them. The fix was to stop adding the cyber reminders.
More here: https://x.com/ClaudeDevs/status/2045238786339299431
Comment by bakugo 2 days ago
Comment by templar_snow 2 days ago
Comment by maleldil 2 days ago
Comment by matheusmoreira 2 days ago
> 4.7 is quite... dumb. i think they have lobotomized this model
Is adaptive thinking still broken? Why was the option to disable it taken away?
Comment by vessenes 2 days ago
Comment by aenis 2 days ago
Comment by dgellow 2 days ago
Comment by lucid-dev 2 days ago
It looks like you don't allow testing of anything beyond a certain token size.
Which makes your test kind of pointless, because if you are chatting about AI with something that's only a few hundred tokens, the data you're collecting is pretty minimal and specific, not something that's generally applicable or relevant to wider use outside that specific context.
Comment by erelong 2 days ago
Comment by QuadrupleA 2 days ago
Comment by fny 2 days ago
In my opinion, we've reached some ceiling where more tokens lead only to incremental improvements. A conspiracy seems unlikely given that all providers are still competing for customers, and a 50% token increase drives infra costs up dramatically too.
Comment by mvkel 2 days ago
The whole magic of (pre-nerfed) 4.6 was how it magically seemed to understand what I wanted, regardless of how perfectly I articulated it.
Now, Anthropic says that needing to explicitly define instructions is a "feature"?!
Comment by alekseyrozh 2 days ago
Comment by Futurmix 1 day ago
Comment by agentseal 1 day ago
Comment by EthanFrostHI 2 days ago
Comment by contractlens_hn 1 day ago
Comment by chandureddyvari 2 days ago
Comment by jeremie_strand 2 days ago
Comment by kziad 2 days ago
Comment by jiusanzhou 2 days ago
Comment by kuzivaai 2 days ago
Comment by Olivia_Pan 2 days ago
Comment by maxbeech 2 days ago
Comment by matt3210 2 days ago
Comment by operatingthetan 2 days ago
If the models don't get to a higher level of 'intelligence' and still struggle with certain basic tasks at the SOTA while also getting more expensive, then the pitch is misleading and unlikely to happen.
So yes, I expect the price to go down.
Comment by ant6n 2 days ago
Comment by monkeydust 2 days ago