My Agent Skill for Test-Driven Development
Posted by laxmena 5 days ago
Comments
Comment by simonw 4 days ago
(I've been getting solid results recently from simply telling Claude Code and Codex "Test with uv run pytest, use red/green TDD".)
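For anyone unfamiliar with the term, here is a minimal sketch of one red/green cycle. The `slugify` function and the file name are made up for illustration; they are not from the comment above.

```python
# test_slugify.py (hypothetical file); run with: uv run pytest
import re


def slugify(title: str) -> str:
    """Green step: the smallest implementation that makes the test pass."""
    # Lowercase, collapse runs of non-alphanumerics into one hyphen,
    # then trim hyphens left at either end.
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def test_slugify_replaces_spaces_and_punctuation() -> None:
    # Red step: this test is written first and fails with a NameError
    # until slugify exists.
    assert slugify("Hello, World!") == "hello-world"
```

The point of the "red" run is to confirm that the test can fail at all before the implementation is written.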
Comment by __mharrison__ 4 days ago
# Python Tooling
- Use `uv` to manage Python environments and dependencies.
- Use `uv run` to execute Python scripts and commands.
- Use `pytest` for testing your code.
- Use the `hypothesis` library for property-based testing when you have complex input spaces or need to test edge cases.
- Don't edit `pyproject.toml` directly. Instead, use `uv add` and `uv add --dev` to manage dependencies.
- Use ruff, ty, prek, wily for code quality and linting.
- Don't use excessive casting. If you find yourself needing to cast types frequently, consider refactoring your code to use more appropriate types. Casting should only be done in boundary layers where you are interfacing with external systems.
- Run appropriate tooling after making changes to your code to ensure it meets quality standards.
- When you come across a bug or regression, think hard about writing a test and also how to create code that will prevent this from happening again in the future.
- When creating a command line interface, add `--verbose` flag that provides logging output useful for debugging issues.
- Before creating code, brainstorm 5 different approaches to solve the problem and sort them by their probable effectiveness. Then, choose the best approach and implement it.
- Use Test Driven Development (TDD) for all code you write. Write tests before writing the implementation code.
- Collect pytest fixtures in a `conftest.py` file to avoid duplication
- Prefer testing real code where possible. Use doubles and `monkeypatch` when absolutely necessary. Try to avoid mocking as much as possible.
- Favor pytest's `monkeypatch` over `mock`.
- When a test fails, run the last failed test first using `uv run pytest --last-failed`
- Use numpy-style docstrings for all functions and classes you create.
- Include doctests in the docstrings of your functions to provide examples
- Use type hints for all function parameters and return types.
- Use logging to provide insight into failures. Don't use print for debugging. Don't use logging to hide stack traces.
Comment by 0123456789ABCDE 4 days ago
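As a rough illustration of a few items from the Python Tooling list above (type hints, numpy-style docstrings with a doctest, and `monkeypatch` instead of a mock), here is a small sketch. The function name and the `EXAMPLE_API_URL` variable are assumptions invented for the example.

```python
import os
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # only needed for the type hint, not at runtime
    import pytest


def get_api_url() -> str:
    """Return the API base URL from the environment.

    Returns
    -------
    str
        The value of ``EXAMPLE_API_URL`` (a hypothetical variable),
        or a local default when it is unset.

    Examples
    --------
    >>> isinstance(get_api_url(), str)
    True
    """
    return os.environ.get("EXAMPLE_API_URL", "http://localhost:8000")


def test_reads_url_from_environment(monkeypatch: "pytest.MonkeyPatch") -> None:
    # monkeypatch is a built-in pytest fixture; it undoes the change
    # automatically when the test finishes.
    monkeypatch.setenv("EXAMPLE_API_URL", "https://api.example.test")
    assert get_api_url() == "https://api.example.test"


def test_falls_back_to_default(monkeypatch: "pytest.MonkeyPatch") -> None:
    monkeypatch.delenv("EXAMPLE_API_URL", raising=False)
    assert get_api_url() == "http://localhost:8000"
```

With this layout, `uv run pytest --doctest-modules` should collect both the doctest and the two test functions.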
Comment by porphyra 4 days ago
As a personal anecdote, I find that a lot of big prompts and skills use up context window budget and in many cases agents will eagerly try to use a skill even if it isn't super relevant or necessary for the current task. So when I have too many skills I have to spend a bunch of time toggling the checkboxes to figure out which ones are needed for the task at hand before starting...
Comment by Royce-CMR 4 days ago
I've run into the same issue and I still end up manually curtailing what's exposed to the model, limiting to the task at hand, but I like the idea of another (smaller I hope) model doing 70% of the clipping instead, automagically.
Comment by bmitc 2 days ago
How? Using the agent SDK or Claude Code? If the latter, it'd be nice if they figured that out. There's a huge amount of quality of life things missing from Claude Code. It's a pretty raw frontend to the backend models. And either Claude Code or the backend models get convinced they don't need skills they've been asked to read or even built-in capabilities like reading PDFs.
Comment by oefrha 3 days ago
You know what, I checked Opus 4.8's instructions to a review subagent the other day and it literally opened with
> You are a senior infrastructure/security engineer doing a thorough, adversarial code review...
I didn't say anything like that myself.
Comment by mathgeek 3 days ago
Comment by tclancy 3 days ago
Comment by jasonswett 4 days ago
Comment by nextaccountic 3 days ago
Comment by disgruntledphd2 4 days ago
Comment by galsapir 4 days ago
Comment by chrisweekly 4 days ago
Comment by 0123456789ABCDE 4 days ago
Comment by krupan 3 days ago
This kind of wisdom used to be found in blog posts, or in the heads of more senior developers, but it was never written out as concisely as in these skill files. It's kinda funny that billions of dollars had to be spent creating a machine that's a rough human analog needing guidance to get us to produce these documents.
Comment by jasonswett 3 days ago
Comment by turlockmike 3 days ago
You don't need elaborate prompts, just a few lines.
"All code must have corresponding tests written ahead of time to prove the code meets the specification" is sufficient for most use cases. Prose can help nudge it more if it isn't adhering consistently.
Comment by gruez 3 days ago
Comment by vikramkr 3 days ago
Comment by vikramkr 3 days ago
Comment by Nizoss 3 days ago
This setup works great especially when you work with multiple agents or sessions in parallel and don’t want to be babysitting TDD. You just know that no TDD shortcuts or violations will be made and can focus on the solution instead. Agents are good at internally justifying shortcuts and lowering what’s good enough as the session goes. You can notice this when you ask them to review their own work compared to when asking a new session to review the changes. The difference is stark.
What’s interesting about the TDD instructions I dogfooded for this is how much is implicit in deciding which operations count as TDD violations. For example, earlier versions of the instructions had the validation agent block multi-step refactors because it had no guarantee that further changes would follow. It would also block changes in which a definition was removed while it was still being called. The reasoning was that the code would no longer build and therefore would not satisfy “refactoring is allowed under green”. Improving the wording and clarifying the process helped eliminate these unwanted false blocks.
If you want to give this approach a try, you’ll find it here. I’m the author and I’m happy to answer any further questions: https://github.com/nizos/probity
Comment by ArtRichards 3 days ago
I'm interested in others doing something similar :) I included a docs cli tool in pypi to manage this context:
Comment by fowlie 4 days ago
Comment by dchuk 4 days ago
Comment by Rohunyyy 4 days ago
Comment by kirtivr 4 days ago
Comment by zuzululu 4 days ago
After trying out TDD, I find the waterfall approach is better, especially when you have a multi-agent setup. Also, I found that in some cases the tests were just superficial hallucinations that never actually tested the components written, or there was some context corruption that ultimately triggered a false positive and kicked off a completely unintentional refactoring.
Comment by __mharrison__ 4 days ago
Crazy times here in the development world. I'm always curious to watch others' best practices.
Comment by dools 4 days ago
Almost all the breakages after a big refactor are stale assertions, but every time I catch a couple of critical problems that make the entire exercise very worth it.
The whole dev process is so fast compared to writing software manually that I find it absurd that I wouldn’t invest heavily in automated tests.
Comment by __mharrison__ 4 days ago
Comment by rsalus 4 days ago
TLDR; it found test-writing volume only weakly correlates with success and that encoding test-writing principles did not move resolution rates but _did_ materially change cost. Encouraging tests cost +19.8% output tokens for 0% gain; discouraging them saved 33–49% input tokens for ≤2.6pp accuracy loss. Separately, imposing the TDD procedure specifically seems like it can backfire: it actually _increased_ regressions from 6.08% to 9.94%.
IMO, where tests clearly help is primarily as an "oracle" applied after generation. It gives the models a signal that enables them to verify and self-correct if necessary.
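A toy sketch of the "oracle applied after generation" idea: candidates are produced first, and a fixed set of checks decides which one is accepted. Everything here is a hypothetical stand-in. In a real agent loop the candidates would come from the model, the oracle would be the project's test suite, and the failure messages would be fed back as the correction signal.

```python
from typing import Callable, Optional

Candidate = Callable[[int, int], int]


def oracle(candidate: Candidate) -> list[str]:
    """Return failure messages; an empty list means the candidate is accepted."""
    cases = [((2, 3), 5), ((-1, 1), 0), ((0, 0), 0)]
    failures = []
    for args, expected in cases:
        got = candidate(*args)
        if got != expected:
            failures.append(f"add{args} returned {got}, expected {expected}")
    return failures


def first_accepted(candidates: list[Candidate]) -> Optional[Candidate]:
    """Return the first candidate that passes the oracle, or None."""
    for candidate in candidates:
        failures = oracle(candidate)
        if not failures:
            return candidate
        # A real loop would hand `failures` back to the model here
        # instead of just moving on to the next candidate.
    return None
```

The oracle never changes between attempts, which is what separates it from tests the agent writes and edits alongside the code.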
Comment by zuzululu 4 days ago
> Overall, these findings suggest that agent-written tests often behave more like a habitual software-development routine than a dependable source of validation in this setting. More agent-written tests do not mean more solves; what they more reliably change is the process footprint—API calls, token usage, and interaction patterns. Improving the value of testing for code agents may therefore require better oracles and more actionable validation signals, rather than simply inducing agents to write more tests.
> IMO, where tests clearly help is primarily as an "oracle" applied after generation
Bingo. I'm not against writing tests; it's that the returns are better when it's used as verification feedback and as an "oracle", exactly as you put it.
Comment by girvo 4 days ago
That, and even the absolute SOTA models still suck at writing tests.
Which shouldn't be surprising: humans suck at it too most of the time...
Comment by zuzululu 4 days ago
Comment by esperent 3 days ago
> This raises a central question: do such tests meaningfully improve issue resolution, or do they mainly mimic a familiar software-development practice while consuming interaction budget?
This is an important question but it's not the one I'm most interested in when requiring agents to follow TDD. My goal is to lock in behavior because it was happening way too frequently that an agent would successfully fix the issue at hand, but break something else that it wasn't supposed to touch.
The tests add another layer and it's why I always separate out red and green worker subagents. The green worker might get trigger happy and go beyond scope/break something but it's not allowed to fudge the tests so I'll know and can clean up and revert.
It's also why I'm not too bothered about perfect red green TDD. I can add the tests later if needed.
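One way the "green worker may not fudge the tests" rule could be checked mechanically is to fingerprint the test files before the green step and compare afterwards. This is only a sketch under assumptions: the function names are made up, and the `test_*.py` naming pattern is a guess at the project layout, not something described above.

```python
import hashlib
from pathlib import Path


def snapshot_tests(root: Path) -> dict[str, str]:
    """Map each test file under root to a SHA-256 digest of its contents."""
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("test_*.py"))
    }


def changed_tests(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """List test files that were added, removed, or edited between snapshots."""
    return sorted(
        name
        for name in before.keys() | after.keys()
        if before.get(name) != after.get(name)
    )
```

Take a snapshot once the red worker finishes and another once the green worker finishes. A non-empty `changed_tests` result is the cue to clean up and revert.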
Comment by rsalus 3 days ago
I've been finding that enforcing integrations and behavior structurally (e.g., through codegen/schemagen, e2e tests, etc.) is more reliable than simply instructing the models to write tests. Oftentimes these tests are pretty low quality anyway, and they result in their own form of tech debt.
Comment by esperent 3 days ago
Comment by necovek 4 days ago
In general — just like with humans — I find "just add more tests" to be counter-productive.
Tests make sense in a testable architecture: TDD can encourage one to be implicitly used, but it is a design, architectural choice that should be made explicit (lean to functional code; use direct, explicit dependency injection; ensure test stubs are just variants of the real implementation and fully tested using the same test as the real one...). LLMs should be prompted with this guidance instead for proper value estimation.
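To make the parenthetical concrete, here is a hedged sketch of explicit dependency injection where the test stub is just another implementation of the same interface, checked by the same contract as the real one. `Store`, `JsonFileStore`, `InMemoryStore` and `greet` are invented names for illustration only.

```python
import json
from pathlib import Path
from typing import Optional, Protocol


class Store(Protocol):
    def get(self, key: str) -> Optional[str]: ...
    def put(self, key: str, value: str) -> None: ...


class JsonFileStore:
    """'Real' implementation: persists to a JSON file."""

    def __init__(self, path: Path) -> None:
        self._path = path

    def _load(self) -> dict[str, str]:
        if not self._path.exists():
            return {}
        return json.loads(self._path.read_text())

    def get(self, key: str) -> Optional[str]:
        return self._load().get(key)

    def put(self, key: str, value: str) -> None:
        data = self._load()
        data[key] = value
        self._path.write_text(json.dumps(data))


class InMemoryStore:
    """Test stub: same interface, no I/O."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def put(self, key: str, value: str) -> None:
        self._data[key] = value


def greet(user_id: str, store: Store) -> str:
    """Business logic receives its dependency explicitly."""
    name = store.get(user_id)
    return f"Hello, {name}!" if name is not None else "Hello, stranger!"


def check_store_contract(store: Store) -> None:
    """Shared contract: run against the real store and the stub alike."""
    assert store.get("missing") is None
    store.put("u1", "Ada")
    assert store.get("u1") == "Ada"
    store.put("u1", "Grace")  # overwriting must replace the old value
    assert store.get("u1") == "Grace"
```

In a pytest suite the contract would typically be a test that takes a parametrized fixture yielding each implementation, so the stub cannot silently drift from the real store.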
Comment by dnautics 3 days ago
tdd has been invaluable for this project (almost entirely llm written, but i review it) https://github.com/ityonemo/clr
Comment by rsalus 3 days ago
Comment by pramodbiligiri 4 days ago
I've noticed that LLMs tend to generate multiple testcases in one shot (which is not how humans usually go about TDD), and also they don't start with Integration Tests, unless instructed to do so.
Comment by dnautics 4 days ago
how!!??
you write a test, which is one extra function. and maybe a paragraph or so per feature ("i made a RED test"... "i made it GREEN"), everything else is the same between normal development and TDD. this is chump change compared to the rest of development, including thinking tokens
Comment by manmal 4 days ago
And the code will be good.
Comment by rsalus 4 days ago
Comment by bfeynman 4 days ago
Comment by emigre 4 days ago
Comment by mpweiher 4 days ago
Code that is easy to test tends to be well-structured.
Code that is badly structured tends to be hard to test.
TDD is not a QA methodology, it is a design methodology. It also tends to help quality out a lot, but that's a secondary effect.
Comment by rsalus 3 days ago
Comment by manmal 4 days ago
Comment by jzig 4 days ago
Comment by reg_dunlop 4 days ago
I have to push back on the idea that token costs balloon when using TDD within the context of a strong framework such as Jason has laid out here.
If the feature is repurposed/removed/refactored....I'd argue the specification wasn't well thought out prior to burning into tokens.
We're so eager to do a lot of the wrong things quickly, when it may serve us better to do a more precise thing slowly.
Comment by zuzululu 4 days ago
Comment by reg_dunlop 3 days ago
I fail to see the argument you're making...
Features aren't made in a vacuum. If specs are made/written....with the information available now...then it's better than not writing specs.
Writing specs with incomplete information is better than not writing specs with incomplete information.
Comment by SubiculumCode 4 days ago
Comment by jasonswett 4 days ago
Comment by homieg33 4 days ago
Comment by tarrant300 4 days ago
Comment by bmitc 2 days ago
What are "fallbacks routines"?
Comment by victorbjorklund 3 days ago
Comment by enraged_camel 4 days ago
All of this burns more tokens of course, but probably way less than coming back to the code later to fix bugs. It is also slower, but in the long run saves time.
Comment by yaodub 4 days ago
Comment by enraged_camel 3 days ago
Comment by yaodub 3 days ago
Comment by dluxem 4 days ago
If this is encoded in a skill, that skill essentially has to be loaded for everything your LLM is doing. This is probably one of the few areas where direct instructions via AGENTS.md is best, and I don't believe it requires much direction here to force the issue.
But I think the OP is just trying to have their agent work in a very specific way -- that is fine too.
> 5. Show me the test and ask for approval before continuing
Comment by jasonswett 4 days ago
Comment by zuzululu 4 days ago
But everybody is free to choose how they work and it may be required in ways that we can't know about.
Comment by realty_geek 4 days ago
The latest one is with "Uncle Bob Martin" who has some interesting takes on coding with AI from .... can I say an oldie?
Comment by ElijahLynn 3 days ago
https://open.spotify.com/episode/2UooZQNEpjXurZYBasds73?si=1...
Comment by jasonswett 4 days ago
Comment by jvuygbbkuurx 4 days ago
Comment by bisonbear 4 days ago
Comment by csbartus 4 days ago
In my version of this workflow I do specify myself, then let the LLM do the rest.
This way 1.) I'm 100% sure the understanding/spec is good 2.) It's translated into an executable format so the implementation can be verified 3.) The implementation has maximum code coverage tests which steers the AI to produce code which follows standards, fits into the existing codebase, and it's very easy to refactor.
So far, this is the one and only advantage of using LLMs in my SWE practice. They glue together (human written) specs with code, with confidence, in no time.
Comment by nullc 4 days ago
Comment by servercobra 4 days ago
Comment by jasonswett 4 days ago
Comment by __mharrison__ 4 days ago
Even more so when coding with agents. I think it is probably the biggest lever to keep AI in guardrails.
(It's also why I wrote my latest book, Effective Testing, because I routinely find that my clients are very poor at testing.)
Comment by necovek 4 days ago
However, since we are talking about effectiveness, applying a lot of these principles might lead to a non-maintainable codebase — for humans and LLMs alike.
When any change causes 500 tests to break, or it causes nothing to break (see monkey-patching and/or mocking), you've gotten to a point where your testing approach is ineffective.
Most start applying principles of just enough tests and testable architectures too late, yet I believe they are fundamental.
Do you cover these in your book?
Comment by __mharrison__ 3 days ago
Wrt mocking. I'm not a huge fan. Again, look at my AGENTS.md. I prefer monkeypatch as a last resort option. Luckily, if you use TDD, you rarely have to use mocking. If you don't use TDD...
Comment by revlsas 3 days ago
Just work with Codex to fill the gaps, and then get it to one shot the implementation
Do review afterwards if needed
All these md files will be increasingly useless as models improve
Comment by mercutio2 3 days ago
But surely you aren’t suggesting literally every software project is composed of one-shot-able building blocks, or that the building blocks never require modifications to previous one-shots?
Comment by Ampersander 4 days ago
Comment by dev_hugepages 4 days ago
Comment by Ampersander 4 days ago
Comment by deepnotes 4 days ago
Comment by keenseller709 4 days ago
Comment by eddysir 4 days ago
Comment by EvanXue 4 days ago
Comment by tokenfaucet 4 days ago
Comment by Koyukoyu 4 days ago
Comment by behnamoh 4 days ago
Comment by jw1224 4 days ago
Skills are literally just Markdown documents that get loaded into context when the /skill-name is invoked.
Comment by dominotw 4 days ago
they are being sold as more powerful than they are. Like llms are intelligent blank slates that can be customized with mere markdown files.
Comment by calebkaiser 4 days ago
Taken to the extreme, the attitude that there is some special incantation that will unlock all capabilities is silly, and a lot of the "prompt engineering" discourse is similarly kind of dumb, but in-context learning is clearly a real thing.
Comment by dominotw 4 days ago
you are treating skills like a sure thing
Comment by krupan 3 days ago
Comment by Zetaphor 4 days ago
Comment by coffeeaddict1 4 days ago
Comment by pramodbiligiri 4 days ago
Comment by john_strinlai 4 days ago
Comment by internet101010 4 days ago
Comment by beezlewax 4 days ago
Comment by wyre 4 days ago
Comment by theptip 4 days ago
Comment by yieldcrv 4 days ago
They do nothing to keep an AI on track in comparison to the aspects that simulate a product manager
And the AI will just correct the test when it fails, as opposed to correcting the code, because the code didn't miss anything; the specification changed.
My protip: just write tickets or have the AI write those too. that and the commits and the PRs will function as the AI’s memory better than any client side markdown file masquerading as a soul
Comment by kgdiem 4 days ago
In another project without my rules I’ve noticed I have to tell it to set up data for playwright tests instead of skipping if none exists.
Comment by bob1029 4 days ago
I am currently observing AI authored tests creating a massive sense of complacency because a human no longer owns responsibility for the test suite. It's too easy to reject ownership by way of the various agent prompting schemes. I find myself enjoying the idea of it too, primarily because adding tests to even the most trivial functionality is mandatory due to the TDD policy.
Developing good tests is like an artform. Total coverage is a terrible objective. Correctness does not compose upward. It's a game of chasing ghosts if you think you can build a perfectly clean system bottom up and then magically meet the customer at the top. They're gonna kick your jenga tower over on day one.
Comment by cbcjcyv5 4 days ago
I mostly agree though, I've seen a lot of vapid assertions in my day job recently.
I should note I'm specifically not doing tdd with AI.
Comment by steno132 4 days ago
The token cost and tech debt introduced by tests is just not worth it. There's usually no bugs and if there are, you can fix them quickly if and when it's needed.
Comment by Ginop 4 days ago
Testing was and still is very important, as LLMs can still miss important points in business logic or other edge cases. I would argue that tests have become as important as code, if not more.
Comment by buster 2 days ago
Comment by esafak 4 days ago