Tokenomics: Quantifying Where Tokens Are Used in Agentic Software Engineering
Posted by Anon84 3 days ago
Comments
Comment by monkeydust 2 days ago
You give it a problem, then refine it: a fast, cheaper model asks you questions, and your answers produce a better input prompt. You then choose an MA strategy: for example, break the problem up into sections and have a final judge conclude, or go multi-turn, where agents debate and a judge summarises the debate.
The best approach is what I call 'all angles', where all these strategies run in parallel and a final meta-judge synthesises the response. The most useful part, which I recently added, is a view to see the variance in each strategy.
Been using this for life stuff - housing search, schools, family challenges!
Perhaps I should make a video of it in action. If people in the HN community are interested, let me know.
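For anyone wondering what the 'all angles' flow might look like in code, here is a rough sketch, not the actual tool. Everything in it is an assumption made for illustration: the `Model` callable stands in for whatever LLM client you use, and `sectioned`, `debate` and `all_angles` are hypothetical names. The refinement step and the variance view are left out.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Hypothetical stand-in for an LLM client: prompt text in, response text out.
Model = Callable[[str], str]
Strategy = Callable[[str, Model], str]


def sectioned(problem: str, model: Model) -> str:
    """Break the problem into sections, answer each, then let a judge conclude."""
    sections = [s.strip() for s in problem.split(".") if s.strip()]
    answers = [model(f"Answer this part: {s}") for s in sections]
    return model("Judge these partial answers:\n" + "\n".join(answers))


def debate(problem: str, model: Model, turns: int = 2) -> str:
    """Two agents argue for a few turns, then a judge summarises the debate."""
    transcript = []
    for turn in range(turns):
        for side in ("pro", "con"):
            history = "\n".join(transcript)
            transcript.append(model(f"[{side}, turn {turn}] {problem}\n{history}"))
    return model("Summarise this debate:\n" + "\n".join(transcript))


def all_angles(problem: str, model: Model, strategies: Dict[str, Strategy]) -> Dict[str, str]:
    """Run every strategy in parallel, then have a meta-judge synthesise the results."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, problem, model) for name, fn in strategies.items()}
        results = {name: future.result() for name, future in futures.items()}
    combined = "\n".join(f"{name}: {text}" for name, text in results.items())
    results["meta"] = model("Synthesise these answers:\n" + combined)
    return results
```

Keeping the per-strategy outputs next to the meta-judge answer is what would let you compare how much the strategies disagree.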
Comment by monkeydust 2 days ago
Comment by monkeydust 1 day ago
Comment by ethanwillis 2 days ago
Comment by chrisss395 2 days ago
Comment by Folcon 2 days ago
Comment by monkeydust 2 days ago
Comment by whattheheckheck 2 days ago
Comment by uxhacker 2 days ago
Comment by monkeydust 2 days ago
Comment by Cherub0774 2 days ago
I don't think a specific harness is even necessary to get a boost from 'Refine'. Even a simple custom agent is portable enough... it's easy enough to take the existing 'Plan' agent definition present in VS Code and tweak it to be 'Refine' instead.
Comment by SOLAR_FIELDS 2 days ago
Comment by flowbarai 2 days ago
Comment by saberience 1 day ago
I.e. you cannot end up with a more intelligent output by adding dumber models (that is: dumber than the most intelligent model used).
It's generally best to refine your prompt and send it to (at most) the two smartest frontier models available, and then have the smartest model review the output from the second smartest.
Comment by senectus1 3 days ago
I was in a meeting reviewing a potential new product. It was going well until they showed us that they had added AI to it (of course they had). It was pretty obviously just shoehorned in, and one part of that obviousness was a column that showed how many tokens each query took.
I asked who is paying for the tokens; they said it's included in the license. I said, so is there a budget or is it all you can eat? They said good question, they didn't know and would get back to me. I said the reason I asked was that just one query there had a 250k token burn on it, and it was a fairly simple query about one device.
Then one of the execs on their side was heard saying out loud, "Why are we even showing this to the customers?"
It gave us quite a chuckle. But lesson learned... the cost of adding AI to anything isn't really being accounted for, let alone the true cost of actually running the AI.
All things AI are going to get more expensive, even if you don't want the AI aspect.
Comment by prymitive 2 days ago
Comment by sedatk 3 days ago
Such drastic changes tell me that the pricing of tokens is arbitrary, and that the AI business is running out of money fast.
Comment by lucaspiller 3 days ago
Taking SpaceX as an example, they have increased prices across all their consumer products over the past six months. But they definitely aren't short on money with Alphabet and Anthropic combined paying them over $2 billion per month.
Microsoft/GitHub lost out here as they were just repackaging other people's products.
Comment by lefra 3 days ago
Comment by jurgenburgen 2 days ago
Rumors are worth squat when they’re most likely put in motion by the people with a vested interest in this industry.
Let’s talk about profits when there’s real data from the IPO documentation.
Comment by NitpickLawyer 2 days ago
You can make some educated guesses and find some limits on inference cost by looking at 3rd party providers on platforms like openrouter. You can get a median cost/tok for a given model size. Then make some educated guesses on SotA model sizes, and you can get an estimate of the pure cost of serving a model. Error bars and all that, of course. But still a range, with some limits.
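A back-of-the-envelope version of that estimate could look like the sketch below. It assumes price scales roughly linearly with parameter count, which is a simplification, and the function name and inputs are made up for illustration. You would fill in real third-party prices and your own size guesses.

```python
from statistics import median
from typing import Dict, List, Tuple


def serving_cost_range(
    prices_by_size: Dict[float, List[float]],
    size_guess_low: float,
    size_guess_high: float,
) -> Tuple[float, float]:
    """Estimate a $/Mtok range for a model of unknown size.

    prices_by_size maps model size (billions of parameters) to the prices
    ($ per million tokens) that third-party providers charge for open models
    of that size. Assumption: cost grows linearly with parameter count.
    """
    # Median provider price per size, normalised to $ per Mtok per billion params.
    per_billion = [median(prices) / size for size, prices in prices_by_size.items()]
    unit_cost = median(per_billion)
    # Scale by the low and high guesses for the SotA model size.
    return unit_cost * size_guess_low, unit_cost * size_guess_high
```

The result is only as good as the size guesses, which is where most of the error bars come from.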
Comment by shimman 2 days ago
Comment by NitpickLawyer 2 days ago
Comment by whattheheckheck 2 days ago
Comment by TSiege 2 days ago
Comment by phyzix5761 2 days ago
Comment by altmanaltman 2 days ago
Also, I mean, prices in general for all things are based on underlying factors; that doesn't make them arbitrary (i.e. GitHub executives using a random number generator for token pricing would be arbitrary).
Comment by sedatk 1 day ago
Comment by bob1029 2 days ago
I'm seeing a ratio of around 10:1 in my usage. A vast majority of the tokens consumed are on the input side. The agent will often read a million tokens just to patch one line of code.
I think if you are seeing something closer to 1:1 or more on the output side, there is either a problem with the agent or the codebase is new / empty.
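To put the ratio in money terms, here is a tiny sketch. The per-million-token prices are placeholders, not any provider's real rates, and `token_cost` is a name made up for this example.

```python
from typing import Tuple


def token_cost(
    input_tokens: int,
    output_tokens: int,
    usd_per_m_input: float,
    usd_per_m_output: float,
) -> Tuple[float, float]:
    """Return (input_cost, output_cost) in USD for one agent run."""
    input_cost = input_tokens / 1_000_000 * usd_per_m_input
    output_cost = output_tokens / 1_000_000 * usd_per_m_output
    return input_cost, output_cost


# A 10:1 run with placeholder prices of $3/Mtok in and $15/Mtok out.
input_cost, output_cost = token_cost(1_000_000, 100_000, 3.0, 15.0)
print(f"input ${input_cost:.2f}, output ${output_cost:.2f}")
```

Even when output tokens are priced several times higher, a 10:1 ratio can leave the input side as the larger part of the bill.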
Comment by kolinko 2 days ago
A million tokens (not cached) sounds like a lot.
Comment by bob1029 2 days ago
I still don't understand how caching helps me very much. I must be misunderstanding it, because I thought the user's prompt (which is the biggest variable) necessarily sits prior to all of these token-intensive tool calls. How can we cache the reading of the codebase if the prefix is always moving?
Comment by Phemist 2 days ago
A new instruction from the user is appended at the end if it is given in the same conversation. It thus only influences the cacheability of the original agent prompt, not of subsequent tool calls.
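A toy simulation may make the append-only point concrete. It treats each word as a token and assumes the cache matches the longest prefix shared with the previous request. Real providers cache in blocks, with expiry, so take the numbers as illustrative only. All names here are made up.

```python
from typing import List, Tuple


def shared_prefix_len(previous: List[str], current: List[str]) -> int:
    """Number of leading tokens that are identical in both prompts."""
    n = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        n += 1
    return n


def simulate(requests: List[List[str]]) -> List[Tuple[int, int]]:
    """For each request, return (total_tokens, tokens_served_from_cache)."""
    stats = []
    previous: List[str] = []
    for prompt in requests:
        stats.append((len(prompt), shared_prefix_len(previous, prompt)))
        previous = prompt
    return stats


system = ["sys"] * 100               # agent/system prompt
first_instruction = ["u1"] * 20      # user's first message
file_reads = ["file"] * 1000         # tool calls that read the codebase
second_instruction = ["u2"] * 15     # follow-up in the same conversation

turn1 = system + first_instruction
turn2 = turn1 + file_reads           # history is resent with the tool output appended
turn3 = turn2 + second_instruction   # the new instruction lands at the end

print(simulate([turn1, turn2, turn3]))
```

Because each request resends the earlier history unchanged and only appends to it, the file reads from turn 2 are already part of the cached prefix by turn 3. A fresh conversation with a different first instruction would share only the system prompt.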
Comment by uxhacker 2 days ago
Has AI forgotten about high-level design? Surely all it needs to know is what the methods, objects or functions in the code base actually do, plus the actual code it is meant to be fixing?
I wonder if half the issue is that LLMs try to change too much?
Comment by willtemperley 2 days ago
Comment by frumiousirc 2 days ago
But, does every prompt need the entire codebase?
Comment by amazingamazing 2 days ago
Comment by zozbot234 2 days ago
Comment by sakuraiben 3 days ago
Comment by drivebyhooting 3 days ago
Comment by gib444 3 days ago
Their interests are often not your interests. In this case they want you to spend unnecessary money on useless work (let's stop the euphemism of "tokens" btw).
Comment by simianwords 2 days ago
Comment by gib444 2 days ago
/s
Comment by esperent 3 days ago
If you want a different kind of dynamic testing besides unit tests, have you tried writing it in as a requirement during the planning/PRD phase?
Comment by make3 3 days ago
Comment by SubiculumCode 3 days ago
[1] https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=Stee...
Comment by gmerc 3 days ago
Comment by emsign 2 days ago
I wonder what hyperscaled compute farms and models will be good for at that running cost when most AI needs can be fulfilled by on-prem and on-device hardware and models. Probably the only customers left will be big governments. So in the end the taxpayer has to pay for those billions of investment by the AI cartel.
Comment by zozbot234 2 days ago
Also, smaller models can obviously be used but a smaller model will be a lot weaker in real-world knowledge and this tends to limit their smarts in a way that can't be compensated by more thinking.
Comment by emsign 3 days ago
Comment by scotty79 2 days ago
Comment by drivebyhooting 3 days ago
Maybe soon companies will look at how engineers can optimize the token efficiency of AI.
Comment by Retric 3 days ago
Comment by ares623 3 days ago
Comment by jpatt 3 days ago
Now that we have pretty decent open source models, anyone can create a new business to supply more tokens. Sure there’s short term scarcity: energy, GPUs, cooling, but this is a scale up problem. More token demand = more data center build = more energy plant build. This downward pressure will also keep frontier private model prices in check.
Differentiation seems to be happening at the harness level, whereby we can expect token spend to be a metric to compete on and drive down for the customer (at least hoping tools in the application space don’t continue token based billing as their primary revenue stream).
These are not short term hyper growth forces, but a fundamental alignment of incentives.
Comment by avianlyric 2 days ago
But we’re seeing lots of open weight models that are either pretty close to SotA or, more importantly, perfectly capable of doing all the low-level, token-intensive grunt work when writing code. Pair them with SotA models for long-horizon task management, and you’ve got a very cost-effective system.
The frontier labs have put little effort into cost-efficient inference because they don’t need to, but folks like DeepSeek clearly have, and have achieved some impressive cost improvements. Given that DeepSeek’s models give you 70% of the capabilities for 30% of the cost, expect people to start moving lots of workloads to providers that offer cheap inference for open models, and huge competition to appear to provide that cheap inference. It’s truly commodity LLM inference.
In turn, expect more companies to focus on building inference-efficient models, because anyone who can build a model that provides 70% of SotA capabilities for 10% of the token cost immediately eats up huge amounts of the available inference market.
Another factor in all this is that it’s becoming increasingly clear that building custom agents/workflows for LLMs to operate in is required to get the best out of these models. That means people are implicitly building the infra needed to use multiple model types and evaluate workflow performance end-to-end. Which in turn means they have everything they need to plug in future, cheaper inference providers and quickly evaluate whether they can change their model provider.
Comment by mobelkh 3 days ago
I don't think a lot of people know this, but a cluster of GPUs can serve multiple clients without much of a drop in performance. I.e. in the worst case you band together with 6-16 people to run a 2-3 H100 server hosting DeepSeek V4 Flash, or 4-6 to run Pro, and you get the same performance as if you ran it alone. This means a lot of companies can afford to throw 50-100k into their own LLM server cluster.
We're at a price point where, if you push it further, people will move. There's no real vendor lock-in: your agent config, skills, MCP servers etc. are all reusable with other models and harnesses, so unless you get all providers to collude on a price hike, you risk an exodus of customers.
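A quick way to sanity-check the shared-server idea is a break-even calculation like the sketch below. The 100k hardware figure and the 16 users come from the ranges above. The $200 per user per month of avoided API spend is purely a placeholder, and power, hosting and maintenance go into `monthly_opex`, which defaults to zero here. The function name is made up.

```python
def break_even_months(
    hardware_cost: float,
    users: int,
    monthly_spend_per_user: float,
    monthly_opex: float = 0.0,
) -> float:
    """Months until a shared server costs less than paying per token.

    monthly_spend_per_user is what each user would otherwise pay a provider.
    Returns infinity if running costs eat all of the savings.
    """
    monthly_savings = users * monthly_spend_per_user - monthly_opex
    if monthly_savings <= 0:
        return float("inf")
    return hardware_cost / monthly_savings


# 16 people sharing a 100k server, each otherwise spending a placeholder $200/month.
print(break_even_months(100_000, 16, 200))  # 31.25
```

The estimate is very sensitive to how much each user would really have spent on tokens, so it is worth plugging in your own bills.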
Comment by fc417fc802 3 days ago
In the other direction models continue to grow larger, new customers continue to arrive, and existing customers continue to find ever more creative ways to burn large quantities of tokens as the prices fall.
I doubt anyone can say with certainty where the equilibrium will be 1 or 5 years from now largely because (among many other things) it's impossible to predict how much of the current economy AI will end up eating. In general though the third party providers of open weights models are probably the most reliable data source available since they have little to no incentive to subsidize usage.
Comment by Retric 3 days ago
To bet against that, you need to assume exponentially more expensive models every year.
Comment by oersted 2 days ago
Comment by dnlosx 3 days ago
Comment by deadbabe 2 days ago
Comment by scotty79 2 days ago
Comment by zcw100 2 days ago
Comment by satvikpendem 3 days ago
Comment by mariusor 2 days ago
Comment by NewJazz 3 days ago
Comment by alchemism 2 days ago
You're welcome! =)
Comment by dkersten 2 days ago
So what? Terms are reused in different contexts all the time. And most people have moved on from cryptocurrencies anyway, so there’s little chance it’ll confuse anyone.
Comment by becomevocal 2 days ago
Comment by zozbot234 2 days ago
Comment by jwnin 2 days ago
Comment by becomevocal 2 days ago
Comment by friendlygeorge 2 days ago
Comment by jazzen 1 day ago
Comment by winphoto 3 days ago
Comment by knightops_dev 2 days ago
Comment by jlcases 2 days ago
Comment by eddysir 2 days ago
Comment by JoaoBerne 2 days ago
Comment by samdonovan 2 days ago
Comment by baarse 3 days ago
Comment by andrewvu0203 3 days ago
Comment by bonigv 3 days ago
Comment by Waffle2180 3 days ago