I wrote JustHTML using coding agents
Posted by simonw 1 day ago
Comments
Comment by minusf 22 hours ago
Isn't this more like a port of `html5ever` from Rust to Python using an LLM, as opposed to creating something "new" based on the test suite alone?
If so, wouldn't that distinction be rather important?
Comment by EmilStenstrom 20 hours ago
The first iteration of the project created a library from scratch, from the tests all the way to 100% test coverage. So even without the second iteration, it's still possible to create something new.
In an attempt to speed it up, I (with a coding agent) rewrote it based on html5ever's code structure. It's far from a clean port, because html5ever is heavily optimized Rust code, parts of which can't be ported to Python (Rust macros). And it still took a lot of iteration and rerunning tests to get it anywhere.
I'm not pushing any agenda here, you're free to take what you want from it!
Comment by simonw 19 hours ago
It looks to me like this is the last commit before the rewrite: https://github.com/EmilStenstrom/justhtml/tree/989b70818874d...
The commit after that is https://github.com/EmilStenstrom/justhtml/commit/7bab3d2 "radical: replace legacy TurboHTML tree/handler stack with new tokenizer + treebuilder scaffold"
It also adds this document called html5ever_port_plan.md: https://github.com/EmilStenstrom/justhtml/blob/7bab3d22c0da0...
Here's the Codex CLI transcript I used to figure this out: https://gistpreview.github.io/?53202706d137c82dce87d729263df...
Comment by minusf 20 hours ago
You also mention that the current "optimised" version is "good enough" for everyday use (I use `bs4` for working with HTML). Was the first iteration also usable in that way? Did you look at `html5ever` because the LLM hit a wall trying to speed it up?
Comment by EmilStenstrom 19 hours ago
As for bs4, if you don't change the default, you get the stdlib html.parser, which doesn't implement HTML5 and only works for valid HTML.
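A minimal sketch of what that means in practice, using only the stdlib `html.parser` module (no bs4 involved). The `EventRecorder` class and `record_events` helper are hypothetical names made up for this illustration. The input omits the closing `</p>` tags. An HTML5 parser would implicitly close the first paragraph when it sees the second `<p>`, while `html.parser` just reports the tags it finds in the source and leaves any repair to the caller.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: record the raw events html.parser emits."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def record_events(markup):
    """Feed markup to the stdlib parser and return the list of events."""
    recorder = EventRecorder()
    recorder.feed(markup)
    recorder.close()  # flush any buffered text
    return recorder.events


# Two paragraphs with the end tags omitted.
events = record_events("<p>one<p>two")
print(events)
# [('start', 'p'), ('data', 'one'), ('start', 'p'), ('data', 'two')]
# No ('end', 'p') event is synthesised for the first paragraph.
```

Anything built on top of these events has to decide for itself how to nest the two paragraphs, which is the part the HTML5 tree construction rules specify.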
Comment by simonw 1 day ago
Emil Stenström wrote it with a variety of coding agent tools over the course of a couple of months. It's a really interesting case study in using coding agents to take on a very challenging project, taking advantage of their ability to iterate against existing tests.
I wrote a bit more about it here: https://simonwillison.net/2025/Dec/14/justhtml/
Comment by EmilStenstrom 23 hours ago
Comment by msephton 14 hours ago
Comment by EmilStenstrom 11 hours ago
Comment by gabrielsroka 19 hours ago
I cloned the repo and ran `wc -l` on the src directory and got closer to 9,500. Am I missing something?
Edit: maybe you meant just the parser
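For anyone who wants to repeat the count per subdirectory or per file pattern, here is a rough Python equivalent of `wc -l`. It is a sketch only: the `count_lines` name and the `*.py` default are assumptions for this example, not the exact command used above. Like `wc -l`, it counts newline characters rather than logical lines of code.

```python
from pathlib import Path


def count_lines(root, pattern="*.py"):
    """Sum the newline characters in files under root that match pattern."""
    total = 0
    for path in sorted(Path(root).rglob(pattern)):
        if path.is_file():
            # wc -l counts b"\n" bytes, so a final line without a
            # trailing newline is not counted here either.
            total += path.read_bytes().count(b"\n")
    return total
```

Calling `count_lines("src")` from the repository root would give a total for the whole source tree, and pointing it at a single subdirectory would narrow the count to that part of the code.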
Comment by furyofantares 14 hours ago
Comment by EmilStenstrom 11 hours ago
Comment by furyofantares 9 hours ago
Comment by vivzkestrel 15 hours ago
Comment by EmilStenstrom 11 hours ago