Less human AI agents, please
Posted by nialse 2 hours ago
Comments
Comment by gregates 1 hour ago
I ask my coding agent to do some tedious, extremely well-specified refactor, such as (to give a concrete real-life example) changing a commonly used fn to take a locale parameter, because it will soon need to be locale-aware. I am very clear: we are not actually changing any behavior, just the fn signature. In fact, at all call sites, I want it to specify a default locale, because we haven't actually localized anything yet!
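A minimal sketch of the kind of change being described, assuming Rust (the function name, `Locale` type, and default constant are all illustrative, not from the comment): the fn gains a locale parameter it ignores for now, and every call site mechanically passes a default, so behavior is unchanged.

```rust
// Hypothetical example of the refactor: add a locale parameter to a
// commonly used fn without changing behavior.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Locale(&'static str);

const DEFAULT_LOCALE: Locale = Locale("en-US");

// Before: fn format_price(cents: u64) -> String
// After: the signature takes a locale, but ignores it for now.
fn format_price(cents: u64, _locale: Locale) -> String {
    format!("${}.{:02}", cents / 100, cents % 100)
}

fn main() {
    // Every call site is updated mechanically to pass the default,
    // so output is identical to the pre-refactor version.
    let s = format_price(1234, DEFAULT_LOCALE);
    assert_eq!(s, "$12.34");
    println!("{s}");
}
```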
Said agent, I know, will spend many minutes (and tokens) finding all the call sites, and then I will still have to either confirm each update or yolo and trust the compiler, the tests, and the agent's ability to deal with their failures. I am ok with this, because while I could do this just fine with vim and my LSP, the LLM agent can do it in about the same amount of time, maybe even a little less, and it's a very straightforward change that's tedious for me, and I'd rather think about or do anything else and just check in occasionally to approve a change.
But my f'ing agent is all like, "I found 67 call sites. This is a pretty substantial change. Maybe we should just commit the signature change with a TODO to update all the call sites, what do you think?"
And in that moment I guess I know why some people say having an LLM is like having a junior engineer who never learns anything.
Comment by felipeerias 59 minutes ago
> That's a behavior narrowing I introduced for simplicity. It isn't covered by the failing tests, so you wouldn't have noticed — but strictly speaking, [functionality] was working before and now isn't.
I know that an LLM cannot understand its own internal state nor accurately explain its own decisions. And yet, I am still unsettled by that "you wouldn't have noticed".
Comment by zingar 1 hour ago
I'm fascinated that so many folks report this; I've literally never seen it in daily CC use. I can only guess that habitually starting a new session and getting it to write a plan document before acting ("make a file listing all call sites"; "look at refactoring.md and implement") makes it clear when it's time for exploration vs. when it's time for action (i.e. when exploring and not acting would be failing).
Comment by bandrami 1 hour ago
Comment by cadamsdotcom 52 minutes ago
You’ll be amazed how good the script is.
My agent renamed 20 classes and 12 tables across over 250 files. From prompt to auditing the script to dry run to apply, total wall-clock time was 7 minutes.
Took a day to review but it was all perfect!
Comment by comrade1234 45 minutes ago
Comment by gregates 31 minutes ago
Comment by prymitive 1 hour ago
Comment by chrisjj 1 hour ago
Reminds us of the most important button the "AI" has over the similarly bad human employee:
'X'
Until, of course, we pass responsibility for that button to an "AI".
Comment by nialse 43 minutes ago
Comment by grebc 1 hour ago
Comment by gregates 1 hour ago
You would think.
It would be one thing if it was like, ok, we'll temporarily commit the signature change, do some related thing, then come back and fix all the call sites, and squash before merging. But that is not the proposal. The plan it proposes is literally to make what it has identified as the minimal change, which obviously breaks the build, and call it a day, presuming that either I or a future session will do the obvious next step it is trying to beg off.
Comment by chillfox 1 hour ago
I have never seen those “minimal change” issues when using Zed, but have seen them in Claude Code and Aider. I've been using Sonnet/Opus with high thinking via the API in all the agents I have tested/used.
Comment by solumunus 1 hour ago
Comment by gregates 53 minutes ago
It's true that I could accept the plan and hope that it will later realize it can't commit a change that doesn't compile. I might even have some reason to think that's true, such as your stop hook, or a "memory" it wrote down after I previously told it, in all caps, to never ever commit a change that doesn't compile. But that doesn't change the badness of the plan.
Which is especially notable because I already told it the correct plan! It just tried to change the plan out of "laziness", I guess? Or maybe if you're enough of an LLM booster you can just say I didn't use exactly the right natural language specification of my original plan.
Comment by solumunus 1 hour ago
Comment by anuramat 1 hour ago
Comment by hausrat 1 hour ago
Comment by anuramat 1 hour ago
in any case, how is this specific to transformers?
Comment by chrisjj 1 hour ago
It shouldn't. It should just do what it is told.
Comment by js8 1 hour ago
I believe how "neurotypical" (for lack of a better word) you want the model to be is a design choice. (But I also believe model traits such as sycophancy, some hallucinations, or moral transgressions can be a side effect of training it to be subservient. It is similar with humans: they tend to do these things when they are forced to perform.)
Comment by nialse 1 hour ago
EDIT: It's specifically GPT-5.4 High in the Codex harness.
Comment by anuramat 54 minutes ago
Claude, on the other hand, was exactly as described in the article.
Comment by zingar 1 hour ago
Comment by raincole 2 hours ago
Comment by pjc50 1 hour ago
You can do a mental or physical search-and-replace on all references to the LLM as "it" if you like, but that doesn't change the interaction.
Comment by zingar 1 hour ago
Comment by plastic041 1 hour ago
"Ignoring" instructions is not a human thing. It's a bad-LLM thing. Or just an LLM thing.
Comment by zingar 1 hour ago
Maybe this is more anthropomorphising, but I think this pushing back is exactly the result the LLMs are giving; we're just expecting a bit too much of them in terms of follow-up, like: "ok, I double checked and I really am being paid to do things the hard way".
Comment by nialse 1 hour ago
Comment by fennecbutt 1 hour ago
Comment by lexicality 1 hour ago
Comment by anuramat 50 minutes ago
Comment by mentalgear 2 hours ago
Comment by NitpickLawyer 1 hour ago
Trying to limit / disallow something seems to hurt the overall accuracy of models. And it makes sense if you think about it. Most of our long-horizon content is in the form of novels and longer works. If you try to clamp the machine to machine-speak, you'll lose all those learnings. Hero starts with a problem, hero works the problem, hero reaches an impasse, hero makes a choice, hero gets the princess. That arc can be (and probably is) useful.
Comment by vachanmn123 2 hours ago
Comment by bob1029 1 hour ago
"ChatGPT wrapper" is no longer a pejorative reference in my lexicon. How you expose the model to your specific problem space is everything. The code should look trivial because it is. That's what makes it so goddamn compelling.
Comment by noobermin 1 hour ago
Once again, one of the things I blame this moment for is that people essentially think they can stop thinking about code because the theft matrices seem magical. What we still need is better tools, not replacements for human junior engineers.
Comment by aryehof 1 hour ago
Comment by DeathArrow 1 hour ago
They drift toward their training data. If thousands of humans solved a thing in a particular way, it's natural that AI does it too, because that is what it knows.
Comment by jansan 1 hour ago
Comment by nialse 57 minutes ago
Comment by chrisjj 1 hour ago
More of that please. Perhaps on a check box, "[x] Less bullsh*t".
Comment by DeathArrow 33 minutes ago
>Less human AI agents, please.
Agents aren't humans. The choices they make depend on their training data. Most people using AI for coding know that AI will sometimes not respect rules, and that the longer the task, the more the AI will drift from instructions.
There are ways to work around this: using smaller contexts, feeding it smaller tasks, using a good harness, using tests, etc.
But at the end of the day, AI agents will shine only if they are asked to do what they know best. And if you want to extract the maximum benefit from AI coding agents, you have to keep that in mind.
When using AI agents for C# LOB apps, they mostly one-shot everything. Same for JS frontends. When using AI to write some web backends in Go, the results were still good. But when I tried asking for a simple CLI tool in Zig, it pretty much struggled: it made lots of errors, the errors were hard to resolve, and it was hard to fix the code so the tests would pass. Had I chosen Python, JS, C, C#, or Java, the agent would have finished 20x faster.
So, if you keep in mind what the agent was trained on, if you use a good harness, if you have good tests, if you divide the work into small and independent tasks, and if the current task is not something very new and special, you are golden.
Comment by lokthedev 1 hour ago
Comment by incognito124 2 hours ago
Comment by zingar 1 hour ago
Comment by nialse 1 hour ago
Comment by zingar 1 hour ago
In fairness to coding agents, most coding is not this exactly specified, and the right answer is very frequently to find an easier path that the person asking might not have thought about; sometimes even in direct contradiction of specific points listed. Human requirements are usually much fuzzier. It's unusual for the person asking to have such a clear, definite requirement that they've actually thought through.
Comment by fennecbutt 1 hour ago
Just as a human would use a task-list app or a notepad to keep track of which tasks need to be done, so can a model.
You can even have a mechanism for it to look at each task with a "clear head" (empty context) with the ability to "remember" previous task execution (via embedding the reasoning/output) in case parts were useful.
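The mechanism described above can be sketched roughly as follows, assuming Rust; `run_task` is a hypothetical stand-in for a model call, and the task names are made up. Each task runs with an empty transcript, receiving only condensed summaries of earlier results rather than the full accumulated context.

```rust
// Hypothetical sketch: each task gets a "clear head" (no raw history),
// plus compact summaries of prior task outputs as its only memory.
fn run_task(task: &str, prior_summaries: &[String]) -> String {
    // Stand-in for an LLM invocation: it sees only the summaries,
    // never the full transcript of earlier work.
    format!("done: {task} (given {} prior summaries)", prior_summaries.len())
}

fn main() {
    let tasks = ["rename structs", "update call sites", "fix tests"];
    let mut summaries: Vec<String> = Vec::new();
    for task in tasks {
        let result = run_task(task, &summaries);
        // Keep only a condensed record of the result, not the whole output.
        summaries.push(result.clone());
        println!("{result}");
    }
    assert_eq!(summaries.len(), 3);
}
```

The point of the design is that stale context never carries forward: only what was deliberately summarized is visible to the next task.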
Comment by zingar 1 hour ago
Comment by nialse 1 hour ago
In long editing sessions with multiple iterations, the context can accumulate stale information, and that actively hurts model performance. Compaction is one way to deal with that. It strips out material that should be re-read from disk instead of being carried forward.
A concrete example is iterative file editing with Codex. I rewrite parts of a file so they actually work and match the project’s style. Then Codex changes the code back to the version still sitting in its context. It does not stop to consider that, if an external edit was made, that edit is probably important.
Comment by zingar 58 minutes ago
Long context as a disadvantage is pretty well discussed, and agent-native compaction has been inferior to having the agent intentionally build the documentation that I want it to use. So far this has been my LLM-coding superpower. There are also a few products whose entire purpose is to provide structure that overcomes compaction's shortcomings.
When Geoff Huntley said that Claude Code's "Ralph loop" didn't meet his standards ("this aint it"), the major bone of contention, as far as I can see, was that it ran subagents in a loop inside Claude Code with native compaction, as opposed to a completely empty context.
I do see hints that improving compaction is a major area of work for agent-makers. I'm not certain where my advantage goes at that point.
Comment by nialse 1 hour ago