AGENTS.md outperforms skills in our agent evals

Posted by maximedupre 13 hours ago

187 points | 84 comments

Comments

Comment by tottenhm 5 hours ago

> In 56% of eval cases, the skill was never invoked. The agent had access to the documentation but didn't use it.

The agent passes the Turing test...

Comment by cainxinth 1 hour ago

Even AI doesn’t RTFM

Comment by pylotlight 58 minutes ago

It learnt from the best

Comment by w10-1 3 hours ago

The key finding is that "compression" of doc pointers works.

It's barely readable to humans, but directly and efficiently relevant to LLMs (direct reference -> referent, without language verbiage).

This suggests some (compressed) index format that is always loaded into context will replace heuristics around agents.md/claude.md/skills.md.

So I would bet this year we get some normalization of both the indexes and the referenced documentation (esp. matching terms).

Possibly also a side issue: APIs could repurpose their test suites as validation to compare LLM performance on code tasks.

LLMs create huge adoption waves. Libraries/APIs will have to learn to surf them or be limited to usage by humans.

Comment by ai-christianson 2 hours ago

They say compressed... but isn't this just "minified"?

Comment by ethmarks 1 hour ago

Minification is still a form of compression, it just leaves the file more readable than more powerful compression methods (such as ZIP archives).

Comment by throwaway314155 57 minutes ago

I'd say minification/summarization is more like a lossy, semantic compression. This is only relevant to LLMs and doesn't really fit more classical notions of compression. Minification would definitely be a clearer term, even if compression _technically_ makes sense.

Comment by jgbuddy 4 hours ago

Am I missing something here?

Obviously directly including context in something like a system prompt will put it in context 100% of the time. You could just as easily take all of an agent's skills, feed it to the agent (in a system prompt, or similar) and it will follow the instructions more reliably.

However, at a certain point you have to use skills, because including everything in the context every time is wasteful, or not possible. This is the same reason Anthropic is doing advanced tool use (ref: https://www.anthropic.com/engineering/advanced-tool-use): there's not enough context to straight up include everything.

It's all a context/price trade-off. Obviously, if you have the context budget, just include what you can directly (in this case, compressed into an AGENTS.md).

Comment by jstummbillig 4 hours ago

> Obviously directly including context in something like a system prompt will put it in context 100% of the time.

How do you suppose skills get announced to the model? It's all in the context in some way. The interesting part here is: Just (relatively naively) compressing stuff in the AGENTS.md seems to work better than however skills are implemented.

Comment by cortesoft 3 hours ago

Isn't the difference that a skill means you just have to add the script name and explanation to the context instead of the entire script plus the explanation?

Comment by verdverm 3 hours ago

I like to think about it this way: you want to put some high-level, table-of-contents, SparkNotes-like stuff in the system prompt. This helps warm up the right pathways. You also need to inform it that there are more things it may need, depending on "context", through filesystem traversal or search tools; the difference is unimportant, other than that most things outside of coding typically don't do filesystem things the same way.

Comment by imiric 1 hour ago

The amount of discussion and "novel" text formats that accomplish the same thing since 2022 is insane. Nobody knows how to extract the most value out of this tech, yet everyone talks like they do. If these aren't signs of a bubble, I don't know what is.

Comment by sevg 3 hours ago

You could put the name and explanation in CLAUDE.md/AGENTS.md, plus the path to the rest of the skill that Claude can read if needed.

That seems roughly equivalent to my unenlightened mind!
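To illustrate, a hypothetical entry (names and paths made up, not the article's actual format) could be as simple as:

```markdown
## Skills index
- upgrade-next: how we upgrade Next.js majors. Read docs/skills/upgrade-next.md before touching next.config.js.
- db-migrations: how we run schema migrations. Read docs/skills/db-migrations.md whenever the schema changes.
```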

Comment by _the_inflator 2 hours ago

I agree with you.

I think Vercel mixes skills and context configuration up. So the whole evaluation is totally misleading because it tests for two completely different use cases.

To sum it up: Vercel should use both files, AGENTS.md in combination with skills. The two serve totally different purposes.

Comment by observationist 4 hours ago

This is one of the reasons the RLM methodology works so well. You have access to as much information as you want in the overall environment, but only the things relevant to the task at hand get put into context for the current task, and it shows up there 100% of the time, as opposed to lossy "memory" compaction and summarization techniques, or probabilistic agent skills implementations.

Having an agent manage its own context ends up being extraordinarily useful, on par with the leap from non-reasoning to reasoning chats. There are still issues with memory and integration, and other LLM weaknesses, but agents are probably going to get extremely useful this year.

Comment by judahmeek 2 hours ago

> only the things relevant to the task at hand get put into context for the current task

And how do you guarantee that said relevant things actually get put into the context?

OP is about the same problem: relevant skills being ignored.

Comment by verdverm 3 hours ago

You aren't wrong, you really want a bit of both.

1. You absolutely want to force certain context in, no questions or non-determinism asked (index and sparknotes). This can be done conditionally, but still rule-based on the files accessed and other "context".

2. You want to keep it clean and only provide useful context as necessary (skills, search, MCP; and really an explore/query/compress mechanism around all of this, Ralph Wiggum is one example).

Comment by TeeWEE 11 minutes ago

Indeed, it seems like Vercel completely missed the point about agents.

In Claude Code you can invoke an agent when you want as a developer and it copies the file content as context in the prompt.

Comment by teknopaul 3 hours ago

My reading was that copying the doc's ToC in markdown + links was significantly more effective than giving it a link to the ToC and instructions to read it.

Which makes sense.

And there are some numbers to back that up.

Comment by orlandohohmeier 4 hours ago

I’ve been using symlinked agent files for about a year as a hacky workaround before skils became a thing load additional “context” for different tasks, and it might actually address the issue you’re talking about. Honestly, it’s worked so well for me that I haven’t really felt the need to change it.

Comment by mbm 3 hours ago

What sort of files do you generally symlink in?

Comment by deaux 1 hour ago

You're right, the results are completely as expected.

The article also doesn't mention that they don't know how the compressed index affects output quality. That's always a concern with this kind of compression. Skills are just another, different kind of compression, one with a much higher compression rate that is presumably less likely to negatively influence quality. The cost is that it doesn't always get invoked.

Comment by denolfe 1 hour ago

The PreSession hook from obra/superpowers injects this, along with more logic to keep the model from rationalizing its way out of using skills:

> If you think there is even a 1% chance a skill might apply to what you are doing, you ABSOLUTELY MUST invoke the skill. IF A SKILL APPLIES TO YOUR TASK, YOU DO NOT HAVE A CHOICE. YOU MUST USE IT.

While this may result in overzealous activation of skills, I've found that if I have a related skill, I _want_ it to be used. It has worked well for me.

Comment by stingraycharles 53 minutes ago

I always say "invoke your <x> skill to do X. Then invoke your <y> skill to do Y."

Works pretty well.

Comment by holocen 13 minutes ago

Prompted and built a bit of an extension of skills.sh with https://passivecontext.dev. It basically just takes the skill and creates that "compressed" index. You still have to install the skill and all that, but it might give others a bit of a shortcut to experiment with.

Comment by chr15m 3 hours ago

I'm not sure if this is widely known but you can do a lot better even than AGENTS.md.

Create a folder called .context and symlink anything in there that is relevant to the project. For example READMEs and important docs from dependencies you're using. Then configure your tool to always read .context into context, just like it does for AGENTS.md.

This ensures the LLM has all the information it needs right in context from the get-go. Much better performance, cheaper, and fewer mistakes.
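For example, here's a minimal Node sketch of the idea (the paths are just placeholders; adapt to whatever is salient for your project):

```ts
// Sketch: symlink a few salient docs into .context so the agent tool,
// configured to always read .context, sees them every session.
import { existsSync, mkdirSync, symlinkSync } from "node:fs";
import { resolve } from "node:path";

const docs = [
  "node_modules/next/README.md", // placeholder: a dependency's README
  "docs/architecture.md",        // placeholder: an important project doc
];

mkdirSync(".context", { recursive: true });
for (const doc of docs) {
  // flatten the path into a filename so everything lands directly in .context
  const link = resolve(".context", doc.replace(/[\\/]/g, "__"));
  if (!existsSync(link)) symlinkSync(resolve(doc), link);
}
```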

Comment by gbnwl 2 hours ago

Cheaper? Loading every bit of documentation into context every time, regardless of whether it’s relevant to the task the agent is working on? How? I’d much rather call out the location of relevant docs in Claude.md or Agents.md and tell the agent to read them only when needed.

Comment by chr15m 42 minutes ago

As they point out in the article, that approach is fragile.

Cheaper because it has the right context from the start instead of faffing about trying to find it, which uses tokens and ironically bloats context.

It doesn't have to be every bit of documentation, but putting the most salient bits in context makes LLMs perform much more efficiently and accurately in my experience. You can also use the trick of asking an LLM to extract the most useful parts from the documentation into a file, which you then re-use across projects.

https://github.com/chr15m/ai-context

Comment by TeeWEE 9 minutes ago

This is quite a bad idea. You need to control the size and quality of your context by giving it one file that is optimized.

You don’t want to be burning tokens and large files will give diminishing returns as is mentioned in the Claude Code blog.

Comment by d3m0t3p 3 hours ago

Yeah, but the goal is not to bloat the context space. Here you "waste" context by providing non-useful information. What they did instead is put an index of the documentation into the context, and then the LLM can fetch the documentation. This is the same idea as skills, but it apparently works better without the agentic part of skills. Furthermore, instead of having a nice index pointing to the docs, they compressed it.

Comment by chr15m 30 minutes ago

The minification is a great idea. Will try this.

Their approach is still agentic in the sense that the LLM must make a tool call to load the particular doc in. The most efficient approach would be to know ahead of time which parts of the doc will be needed, and then give the LLM a compressed version of those docs specifically. That doesn't require an agentic tool call.

Of course, it's a tradeoff.

Comment by bmitc 2 hours ago

What does it mean to waste context?

Comment by therealpygon 2 hours ago

Context size quite literally degrades attention performance on non-needle-in-a-haystack lookups in almost every model, to varying degrees. So, to answer the question, the "waste" is making the model dumber unnecessarily in an attempt to make it smarter.

Comment by bagels 2 hours ago

The context window is finite. You can easily fill it with documentation and have no room left for the code and question you want to work on. It also means more tokens sent with every request, increasing cost if you're paying by the token.

Comment by thorum 4 hours ago

The article presents AGENTS.md as something distinct from Skills, but it is actually a simplified instance of the same concept. Their AGENTS.md approach tells the AI where to find instructions for performing a task. That’s a Skill.

I expect the benefit is from better Skill design, specifically, minimizing the number of steps and decisions between the AI’s starting state and the correct information. Fewer transitions -> fewer chances for error to compound.

Comment by verdverm 3 hours ago

Yea, I am now separating them based on

1. Those I force into the system prompt using rules based systems and "context"

2. Those I let the agent lookup or discover

I also limit what gets into message parts, moving some of the larger token consumers to the system prompt so they only show up once, most notably read/write_file.

Comment by BenoitEssiambre 4 hours ago

Wouldn't this have been more readable with a \n newline instead of a pipe character as a separator? It wouldn't have made the prompt any longer.

Comment by jryan49 5 hours ago

Something I always wonder with each blog post comparing different types of prompt engineering: did they run it once, or multiple times? LLMs are not consistent on the same task. I imagine they realize this, of course, but I never get enough detail on the testing methodology.

Comment by only-one1701 5 hours ago

This drives me absolutely crazy. Non-falsifiable and non-deterministic results. All of this stuff is (at best) anecdotes and vibes being presented as science and engineering.

Comment by bluGill 5 hours ago

That is my experience. Sometimes the LLM gives good results, sometimes it does something stupid. You tell it what to do, and like a stubborn 5-year-old it ignores you; even after it tries its way and fails, it will do what you tell it for a while and then go back to the thing that doesn't work.

Comment by CuriouslyC 3 hours ago

I always make a habit of doing a lot of duplicate runs when I benchmark for this reason. Joke's on me, in the time I spent doing 1 benchmark with real confidence intervals and getting no traction on my post, I could have done 10 shitty benchmarks or 1 shitty benchmark and 9x more blogspam. Perverse incentives rule us all.

Comment by minimal_action 53 minutes ago

It's very interesting, but presenting success rates without any measure of the error, or at least inline details about the number of iterations, is unprofessional. Especially for small differences, or when you find the "same" performance.

Comment by verdverm 3 hours ago

This largely mirrors my experience building my custom agent

1. Start from the Claude Code extracted instructions; they have many things like this in there. Their knowledge sharing in docs and blog posts on this aspect is bar none.

2. Use AGENTS.md as a table of contents and sparknotes, put them everywhere, load them automatically

3. Have topical markdown files / skills

4. Make great tools. This is still hard for me to explain; there's a lot of overlap with MCP and skills, and conceptually they are the same to me.

5. Iterate, experiment, do weird things, and have fun!

I changed read/write_file to put contents in the state and present them in the system prompt, same for the AGENTS.md. Now working on evals to show how much better this is, because anecdotally, it kicks ass.

Comment by onnimonni 43 minutes ago

Does anyone know whether their eval tests are open source, and where I could find them? Seems useful for iterating on Claude Code behaviour.

Comment by jascha_eng 1 hour ago

This does not normalize for tokens used. If their skill description were as large as the docs index and contained all the reasons the LLM might want to use the skill, it would likely perform much better than just one sentence as well.

Comment by thevinter 3 hours ago

I'm a bit confused by their claims. Or maybe I'm misunderstanding how skills should work. But from what I know (and the small experience I had with them), skills are meant to be specifications for niche and well-defined areas of work (e.g. building the project, running custom pipelines, etc.).

If your goal is to always give a permanent knowledge base to your agent that's exactly what AGENTS.md is for...

Comment by meatcar 3 hours ago

What if, instead of needing to run a codemod to cache per-lib docs locally, documentation could be distributed alongside a given lib, as a dev dependency, version-locked, and accessible locally as plaintext? All docs could be linked in node_modules/.docs (like binaries are in .bin). It would be a sort of collection of manuals.

What a wonderful world that would be.
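In the meantime, a rough approximation (purely a sketch, nothing standard) would be a postinstall script that links whatever READMEs packages already ship into node_modules/.docs:

```ts
// Sketch: collect each installed package's README into node_modules/.docs/<pkg>.md
// so docs are local, plaintext, and version-locked to what's actually installed.
import { existsSync, mkdirSync, readdirSync, symlinkSync } from "node:fs";
import { join, resolve } from "node:path";

const nm = "node_modules";
const out = join(nm, ".docs");
mkdirSync(out, { recursive: true });

for (const entry of readdirSync(nm, { withFileTypes: true })) {
  // skip .bin, .docs, etc.; scoped packages are also ignored in this simple sketch
  if (!entry.isDirectory() || entry.name.startsWith(".")) continue;
  const readme = join(nm, entry.name, "README.md");
  const link = join(out, `${entry.name}.md`);
  if (existsSync(readme) && !existsSync(link)) symlinkSync(resolve(readme), link);
}
```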

Comment by tobyjsullivan 3 hours ago

Sounds a bit like man pages. I think you’re onto something.

Comment by gpm 1 hour ago

Compressing information in AGENTS.md makes a ton of sense, but why are they measuring their context in bytes and not tokens!?

Comment by AndyNemmity 1 hour ago

My experience agrees with this.

Which is why I use a skill that is a command, that routes requests to agents and skills.

Comment by tanishqkanc 53 minutes ago

This is only gonna be an issue until the next-gen models, where the labs will aggressively post-train the models to proactively call skills.

Comment by songodongo 3 hours ago

> When it needs specific information, it reads the relevant file from the .next-docs/ directory.

I guess you need to make sure your file paths are self-explanatory and fairly unique, otherwise the agent might bring extra documentation into the context trying to find which file had what it needed?

Comment by pietz 5 hours ago

Isn't it obvious that an agent will do better if it internalizes the knowledge on something instead of having the option to request it?

Skills are new. Models haven't been trained on them yet. Give it 2 months.

Comment by WA 5 hours ago

Not so obvious, because the model still needs to look up the required doc. The article glosses over this detail a little, unfortunately. The model needs to decide when to use a skill, but doesn't it also need to decide when to look up documentation instead of relying on pretraining data?

Comment by velcrovan 4 hours ago

Removing the skill does remove a level of indirection.

It's a difference of "choose whether or not to make use of a skill that would THEN attempt to find what you need in the docs" vs. "here's a list of everything in the docs that you might need."

Comment by sothatsit 5 hours ago

I believe the skills would contain the documentation. It would have been nice for them to give more information on the granularity of the skills they created though.

Comment by newzino 4 hours ago

The compressed agents.md approach is interesting, but the comparison misses a key variable: what happens when the agent needs to do something outside the scope of its instructions?

With explicit skills, you can add new capabilities modularly - drop in a new skill file and the agent can use it. With a compressed blob, every extension requires regenerating the entire instruction set, which creates a versioning problem.

The real question is about failure modes. A skill-based system fails gracefully when a skill is missing - the agent knows it can't do X. A compressed system might hallucinate capabilities it doesn't actually have because the boundary between "things I can do" and "things I can't" is implicit in the training rather than explicit in the architecture.

Both approaches optimize for different things. Compressed optimizes for coherent behavior within a narrow scope. Skills optimize for extensibility and explicit capability boundaries. The right choice depends on whether you're building a specialist or a platform.

Comment by jstummbillig 4 hours ago

Why could you not have a combination of both?

Comment by verdverm 3 hours ago

You can and should, it works better than either alone

Comment by smcleod 5 hours ago

Sounds like they've been using skills incorrectly if they're finding their agents don't invoke the skills. I have Claude Code agents calling my skills frequently, almost every session. You need to make sure your skill descriptions are well defined and describe when to use them and that your tasks / goals clearly set out requirements that align with the available skills.

Comment by joebates 1 hour ago

It's still not always reliable.

I have a skill in a project named "determine-feature-directory" with a short description explaining that it is meant to determine the feature directory of a current branch. The initial prompt I provide will tell it to determine the feature directory and do other work. Claude will even state "I need to determine the feature directory..."

Then, about 5-10% of the time, it will not use the skill. It does use the skill most of the time, but the low failure rate is frustrating because it makes it tough to tell whether or not a prompt change actually improved anything. Of course I could be doing something wrong, but it does work most of the time. I miss deterministic bugs.

Recently, I stopped Claude after it skipped using a skill and just said "Aren't you forgetting something?". It then remembered to use the skill. I found that amusing.

Comment by velcrovan 4 hours ago

I think if you read it, their agents did invoke the skills, and they did find ways to increase the agents' use of skills quite a bit. But the new approach works 100% of the time as opposed to 79% of the time, which is a big deal. Skills might be working OK for you at that 79% level and for your particular codebase/tool set; that doesn't negate anything they've written here.

Comment by keeganpoppen 2 hours ago

I don't know why, but this just feels like the most shallow "I compare LLMs based on the specs" kind of analysis you can get… it has extreme "we couldn't get the LLM to intuit what we wanted to do, so we assumed that it was a problem with the LLM, and we overengineered a way to make better prompts completely by accident" energy…

Comment by ares623 5 hours ago

2 months later: "Anthropic introduces 'Claude Instincts'"

Comment by rao-v 5 hours ago

In a month or three we’ll have the sensible approach, which is smaller cheaper fast models optimized for looking at a query and identifying which skills / context to provide in full to the main model.

It’s really silly to waste big model tokens on throat clearing steps

Comment by Calavar 5 hours ago

I thought most of the major AI programming tools were already doing this. Isn't this what subagents are in Claude code?

Comment by rao-v 38 minutes ago

Sub-agents are typically one of the major models but with a specific and limited context + prompt. I’m talking about a small fast model focused on purely curating the skills / MCPs / files to provide to the main model before it kicks off.

Basically use a small model up front to efficiently trigger the big model. Sub agents are at best small models deployed by the bigger model (still largely manually triggered in most workflows today)

Comment by MillionOClock 4 hours ago

I don't know about Claude Code, but in GitHub Copilot, as far as I can tell, the subagents are always just the same model as the main one you are using. They also need to be started manually by the main agent in many cases, whereas maybe the parent comment was referring to calling them more deterministically?

Comment by sheepscreek 4 hours ago

It seems their tests rely on Claude alone. It’s not safe to assume that Codex or Gemini will behave the same way as Claude. I use all three and each has its own idiosyncrasies.

Comment by verdverm 3 hours ago

I've done very similar things with my custom agent that uses Gemini and have gotten very similar results. Working on the evals to back that claim up

Comment by EnPissant 5 hours ago

This is confusing.

TFA says they added an index to Agents.md that told the agent where to find all documentation and that was a big improvement.

The part I don't understand is that this is exactly how I thought skills work. The short descriptions are given to the model up-front and then it can request the full documentation as it wants. With skills this is called "Progressive disclosure".

Maybe they used more effective short descriptions in the AGENTS.md than they did in their skills?

Comment by NitpickLawyer 5 hours ago

The reported tables also don't match the screenshots. And their baselines and tests are too close to tell apart (judging by the screenshots, not the tables): 29/33 baseline, 31/33 skills, 32/33 skills + "use skill" prompt, 33/33 AGENTS.md.

Comment by sally_glance 4 hours ago

I also thought this is how skills work, but in practice I experienced similar issues. The agents I'm using (Gemini CLI, Opencode, Claude) all seem to have trouble activating skills on their own unless explicitly prompted. Yeah, probably this will be fixed over the next couple of generations but right now dumping the documentation index right into the agent prompt or AGENTS.md works much better for me. Maybe it's similar to structured output or tool calls which also only started working well after providers specifically trained their models for them.

Comment by hahahahhaah 1 hour ago

Next.js sure makes a good benchmark for AI capability (and for clarity... this is not a compliment).

Comment by sothatsit 5 hours ago

This seems like an issue that will be fixed in newer model releases that are better trained to use skills.

Comment by meeech 3 hours ago

Question: does anyone recognize that eval UI, or is it something they made in-house?

Comment by heliumtera 4 hours ago

You are telling me that a markdown file saying:

*You are the Super Duper Database Master Administrator of the Galaxy*

does not improve the model's ability to reason about databases?

Comment by ChrisArchitect 4 hours ago

Title is: AGENTS.md outperforms skills in our agent evals

Comment by CjHuber 4 hours ago

That feels like a stupid article. Well, of course, if you have one single thing you want to optimize, putting it into AGENTS.md is better. But the advantage of skills is exactly that you don't cram them all into the AGENTS file. Let's say you had 3 different elaborate things you want the agent to do. Good luck putting them all in your AGENTS.md and later hoping that the agent remembers any of it. After all, the key advantage of skills is that they get loaded at the end of the context when needed.

Comment by smrtinsert 2 hours ago

Are people running into mismatched code vs. project versions a lot? I've worked on Python and Java codebases with Claude Code and have yet to run into a version mismatch issue. I think maybe once it got confused about the API available in Python, but it fixed it by itself. From other blog posts similar to this it would seem to be a widespread problem, but I have yet to see it as a big problem in my day job or personal projects.

Comment by thom 5 hours ago

You need the model to interpret documentation as policy you care about (in which case it will pay attention) rather than as something it can look up if it doesn’t know something (which it will never admit). It helps to really internalise the personality of LLMs as wildly overconfident but utterly obsequious.

Comment by delduca 4 hours ago

Ah nice… vercel is vibecoded

Comment by heliumtera 3 hours ago

Web people opted into React, dude. That says a lot.

They used Prisma to handle their database interactions. They preached tRPC and screamed TYPE SAFETY!!!

You really think these guys will ever again touch the keyboard to program? They despise programming.

Comment by dca88 29 minutes ago

This. I read this article and it pains me to see the amount of manpower put into doing anything but actually getting work done.
