AGENTS.md outperforms skills in our agent evals
Posted by maximedupre 13 hours ago
Comments
Comment by tottenhm 5 hours ago
The agent passes the Turing test...
Comment by cainxinth 1 hour ago
Comment by pylotlight 58 minutes ago
Comment by w10-1 3 hours ago
It's barely readable to humans, but directly and efficiently relevant to LLMs (direct reference -> referent, without language verbiage).
This suggests some (compressed) index format that is always loaded into context will replace heuristics around agents.md/claude.md/skills.md.
So I would bet this year we get some normalization of both the indexes and the referenced documentation (esp. matching terms).
Possibly also a side issue: APIs could repurpose their test suites as validation to compare LLM performance on code tasks.
LLMs create huge adoption waves. Libraries/APIs will have to learn to surf them or be limited to usage by humans.
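The index format described above might look something like this: bare `term -> referent` pairs always loaded into context, with no connecting prose. This is a hypothetical sketch, not any standardized format (the paths and terms are made up):

```markdown
<!-- AGENTS.md: compressed index (hypothetical) -->
auth: docs/auth.md
rate-limits: docs/api/limits.md
db-migrations: docs/db/migrations.md
run-tests: tests/README.md
```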
Comment by ai-christianson 2 hours ago
Comment by ethmarks 1 hour ago
Comment by throwaway314155 57 minutes ago
Comment by jgbuddy 4 hours ago
Obviously directly including context in something like a system prompt will put it in context 100% of the time. You could just as easily take all of an agent's skills, feed it to the agent (in a system prompt, or similar) and it will follow the instructions more reliably.
However, at a certain point you have to use skills, because including it in the context every time is wasteful, or not possible. This is the same reason Anthropic is doing advanced tool use (ref: https://www.anthropic.com/engineering/advanced-tool-use): there's not enough context to straight up include everything.
It's all a context/price trade-off: obviously, if you have the context budget, just include what you can directly (in this case, compressed into an AGENTS.md).
Comment by jstummbillig 4 hours ago
How do you suppose skills get announced to the model? It's all in the context in some way. The interesting part here is: Just (relatively naively) compressing stuff in the AGENTS.md seems to work better than however skills are implemented.
Comment by cortesoft 3 hours ago
Comment by verdverm 3 hours ago
Comment by imiric 1 hour ago
Comment by sevg 3 hours ago
That seems roughly equivalent to my unenlightened mind!
Comment by _the_inflator 2 hours ago
I think Vercel mixes up skills and context configuration. So the whole evaluation is totally misleading, because it tests two completely different use cases.
To sum it up: Vercel should use both files, AGENTS.md in combination with skills. The two serve totally different purposes.
Comment by observationist 4 hours ago
Having an agent manage its own context ends up being extraordinarily useful, on par with the leap from non-reasoning to reasoning chats. There are still issues with memory and integration, and other LLM weaknesses, but agents are probably going to get extremely useful this year.
Comment by judahmeek 2 hours ago
And how do you guarantee that said relevant things actually get put into the context?
OP is about the same problem: relevant skills being ignored.
Comment by verdverm 3 hours ago
1. You absolutely want to force certain context in, no questions or non-determinism asked (index and sparknotes). This can be done conditionally, but still rule-based on the files accessed and other "context"
2. You want to keep it clean and only provide useful context as necessary (skills, search, MCP; and really an explore/query/compress mechanism around all of this, ralph wiggum is one example)
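The two buckets above could be sketched like this (a minimal illustration; the rules, globs, and doc paths are all made up for the example):

```python
from fnmatch import fnmatch

# Bucket 1: deterministic, rule-based injection. These docs are forced
# into context whenever a touched file matches the glob.
FORCED_RULES = {
    "*.sql": ["docs/db-sparknotes.md"],
    "src/api/*": ["docs/api-index.md"],
}

# Index and sparknotes are loaded unconditionally.
ALWAYS = ["AGENTS.md"]

def forced_context(touched_files):
    """Return the docs to force into context for this turn."""
    docs = list(ALWAYS)
    for pattern, refs in FORCED_RULES.items():
        if any(fnmatch(f, pattern) for f in touched_files):
            docs.extend(d for d in refs if d not in docs)
    return docs

# Bucket 2 (skills, search, MCP) stays agent-driven and is deliberately
# not modeled here: the agent decides whether to pull that context in.
```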
Comment by TeeWEE 11 minutes ago
In Claude Code you can invoke an agent when you want as a developer and it copies the file content as context in the prompt.
Comment by teknopaul 3 hours ago
Which makes sense.
& some numbers that prove that.
Comment by orlandohohmeier 4 hours ago
Comment by mbm 3 hours ago
Comment by deaux 1 hour ago
The article also doesn't mention how the compression affects output quality; that's always a concern with this kind of compression. Skills are just another, different kind of compression: one with a much higher compression rate and presumably less likely to negatively influence quality. The cost is that it doesn't always get invoked.
Comment by denolfe 1 hour ago
> If you think there is even a 1% chance a skill might apply to what you are doing, you ABSOLUTELY MUST invoke the skill. IF A SKILL APPLIES TO YOUR TASK, YOU DO NOT HAVE A CHOICE. YOU MUST USE IT.
While this may result in overzealous activation of skills, I've found that if I have a related skill, I _want_ it to be used. It has worked well for me.
Comment by stingraycharles 53 minutes ago
works pretty well
Comment by holocen 13 minutes ago
Comment by chr15m 3 hours ago
Create a folder called .context and symlink anything in there that is relevant to the project. For example READMEs and important docs from dependencies you're using. Then configure your tool to always read .context into context, just like it does for AGENTS.md.
This ensures the LLM has all the information it needs right in context from the get-go. Much better performance, cheaper, and fewer mistakes.
Comment by gbnwl 2 hours ago
Comment by chr15m 42 minutes ago
Cheaper because it has the right context from the start instead of faffing about trying to find it, which uses tokens and ironically bloats context.
It doesn't have to be every bit of documentation, but putting the most salient bits in context makes LLMs perform much more efficiently and accurately in my experience. You can also use the trick of asking an LLM to extract the most useful parts from the documentation into a file, which you then re-use across projects.
Comment by TeeWEE 9 minutes ago
You don’t want to be burning tokens and large files will give diminishing returns as is mentioned in the Claude Code blog.
Comment by d3m0t3p 3 hours ago
Comment by chr15m 30 minutes ago
Their approach is still agentic in the sense that the LLM must make a tool call to load the particular doc in. The most efficient approach would be to know ahead of time which parts of the docs will be needed, and then give the LLM a compressed version of those docs specifically. That doesn't require an agentic tool call.
Of course, it's a tradeoff.
Comment by bmitc 2 hours ago
Comment by therealpygon 2 hours ago
Comment by bagels 2 hours ago
Comment by thorum 4 hours ago
I expect the benefit is from better Skill design, specifically, minimizing the number of steps and decisions between the AI’s starting state and the correct information. Fewer transitions -> fewer chances for error to compound.
Comment by verdverm 3 hours ago
1. Those I force into the system prompt using rules based systems and "context"
2. Those I let the agent lookup or discover
I also limit what gets into message parts, moving some of the larger token consumers to the system prompt so they only show once, most notably read/write_file
Comment by BenoitEssiambre 4 hours ago
Comment by jryan49 5 hours ago
Comment by only-one1701 5 hours ago
Comment by bluGill 5 hours ago
Comment by CuriouslyC 3 hours ago
Comment by minimal_action 53 minutes ago
Comment by verdverm 3 hours ago
1. Start from the Claude Code extracted instructions; they have many things like this in there. Their knowledge sharing in docs and blog posts on this aspect is bar none
2. Use AGENTS.md as a table of contents and sparknotes, put them everywhere, load them automatically
3. Have topical markdown files / skills
4. Make great tools, this is still opaque in my mind to explain, lots of overlap with MCP and skills, conceptually they are the same to me
5. Iterate, experiment, do weird things, and have fun!
I changed read/write_file to put contents in the state and present them in the system prompt, same for the AGENTS.md. Now working on evals to show how much better this is, because anecdotally, it kicks ass
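The state-based file handling described above could be sketched roughly like this (my own illustrative reconstruction, not the commenter's actual code): each file appears once in the system prompt, keyed by path, instead of repeating as tool results in the message history.

```python
class FileState:
    """Track file contents in state so each file is rendered once
    in the system prompt rather than duplicated per read_file call."""

    def __init__(self):
        self.files = {}  # path -> latest contents

    def record_read(self, path: str, contents: str) -> None:
        # Re-reading a file overwrites the old copy instead of
        # appending another copy to the conversation.
        self.files[path] = contents

    def system_prompt(self, base: str = "You are a coding agent.") -> str:
        sections = [base]
        for path, body in self.files.items():
            sections.append(f"## {path}\n{body}")
        return "\n\n".join(sections)
```

The design point is deduplication: n reads of the same file cost one copy in context, not n.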
Comment by onnimonni 43 minutes ago
Comment by jascha_eng 1 hour ago
Comment by thevinter 3 hours ago
If your goal is to always give a permanent knowledge base to your agent that's exactly what AGENTS.md is for...
Comment by meatcar 3 hours ago
What a wonderful world that would be.
Comment by tobyjsullivan 3 hours ago
Comment by gpm 1 hour ago
Comment by AndyNemmity 1 hour ago
Which is why I use a skill that is a command, that routes requests to agents and skills.
Comment by tanishqkanc 53 minutes ago
Comment by songodongo 3 hours ago
I guess you need to make sure your file paths are self-explanatory and fairly unique, otherwise the agent might bring extra documentation into the context trying to find which file had what it needed?
Comment by pietz 5 hours ago
Skills are new. Models haven't been trained on them yet. Give it 2 months.
Comment by WA 5 hours ago
Comment by velcrovan 4 hours ago
It's a difference of "choose whether or not to make use of a skill that would THEN attempt to find what you need in the docs" vs. "here's a list of everything in the docs that you might need."
Comment by sothatsit 5 hours ago
Comment by newzino 4 hours ago
With explicit skills, you can add new capabilities modularly - drop in a new skill file and the agent can use it. With a compressed blob, every extension requires regenerating the entire instruction set, which creates a versioning problem.
The real question is about failure modes. A skill-based system fails gracefully when a skill is missing - the agent knows it can't do X. A compressed system might hallucinate capabilities it doesn't actually have because the boundary between "things I can do" and "things I can't" is implicit in the training rather than explicit in the architecture.
Both approaches optimize for different things. Compressed optimizes for coherent behavior within a narrow scope. Skills optimize for extensibility and explicit capability boundaries. The right choice depends on whether you're building a specialist or a platform.
Comment by jstummbillig 4 hours ago
Comment by verdverm 3 hours ago
Comment by smcleod 5 hours ago
Comment by joebates 1 hour ago
I have a skill in a project named "determine-feature-directory" with a short description explaining that it is meant to determine the feature directory of a current branch. The initial prompt I provide will tell it to determine the feature directory and do other work. Claude will even state "I need to determine the feature directory..."
Then, about 5-10% of the time, it will not use the skill. It does use the skill most of the time, but the low failure rate is frustrating because it makes it tough to tell whether or not a prompt change actually improved anything. Of course I could be doing something wrong, but it does work most of the time. I miss deterministic bugs.
Recently, I stopped Claude after it skipped using a skill and just said "Aren't you forgetting something?". It then remembered to use the skill. I found that amusing.
Comment by velcrovan 4 hours ago
Comment by keeganpoppen 2 hours ago
Comment by ares623 5 hours ago
Comment by rao-v 5 hours ago
It’s really silly to waste big model tokens on throat clearing steps
Comment by Calavar 5 hours ago
Comment by rao-v 38 minutes ago
Basically use a small model up front to efficiently trigger the big model. Sub agents are at best small models deployed by the bigger model (still largely manually triggered in most workflows today)
Comment by MillionOClock 4 hours ago
Comment by sheepscreek 4 hours ago
Comment by verdverm 3 hours ago
Comment by EnPissant 5 hours ago
TFA says they added an index to AGENTS.md that told the agent where to find all documentation, and that was a big improvement.
The part I don't understand is that this is exactly how I thought skills work. The short descriptions are given to the model up-front and then it can request the full documentation as it wants. With skills this is called "Progressive disclosure".
Maybe they used more effective short descriptions in the AGENTS.md than they did in their skills?
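Progressive disclosure as described: only the YAML frontmatter's `description` is surfaced to the model up front, and the body loads when the skill is invoked. A hypothetical SKILL.md (all names and steps invented for illustration):

```markdown
---
name: db-migrations
description: Create and run database migrations for this repo.
---

# DB migrations

1. Edit the schema file.
2. Run the migration command with a descriptive name.
3. Verify the generated migration before committing.
```

If the model never decides the description matches the task, the body is never loaded, which is exactly the failure mode being discussed.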
Comment by NitpickLawyer 5 hours ago
Comment by sally_glance 4 hours ago
Comment by hahahahhaah 1 hour ago
Comment by sothatsit 5 hours ago
Comment by meeech 3 hours ago
Comment by heliumtera 4 hours ago
*You are the Super Duper Database Master Administrator of the Galaxy*
does not improve the model's ability to reason about databases?
Comment by ChrisArchitect 4 hours ago
Comment by CjHuber 4 hours ago
Comment by smrtinsert 2 hours ago
Comment by thom 5 hours ago
Comment by delduca 4 hours ago
Comment by heliumtera 3 hours ago
they used prisma to handle their database interactions. they preached tRPC and screamed TYPE SAFETY!!!
you really think these guys will ever again touch the keyboard to program? they despise programming.
Comment by dca88 29 minutes ago