OTelBench: AI struggles with simple SRE tasks (Opus 4.5 scores only 29%)
Posted by stared 2 hours ago
Comments
Comment by the_duke 1 hour ago
From the post I expected that the tasks were about analysing traces, but all the tasks in the repository are about adding instrumentation to code!
Some of the instructions don't give any guidance on how to do it; some specify which libraries to use.
"Use standard OTEL patterns" ... that's about as useful as saying "go write some code". There are a lot of ways to do instrumentation....
I'd be very curious HOW exactly the models fail.
Are the test sets just incredibly specific about what output they expect, so you get a lot of failures because of tiny, subtle mismatches? Or do the models just get the instrumentation categorically wrong?
Also important: do the models have access to a web search tool to read the library docs? Otel libraries are often complicated to use... without reading latest docs or source code this would be quite tricky.
Some models have gotten better at adding dependencies, installing them and then reading the code from the respective directory where dependencies get stored, but many don't do well with this.
All in all, I'm very skeptical that this is very useful as a benchmark as is.
I'd be much more interested in tasks like:
Here are trace/log outputs, here is the source code, find and fix the bug.
Comment by pixl97 56 minutes ago
Supporting a piece of cloud software with a lot of microservices, I think this is a more general problem for humans too. The app I work with mandated some logging requirements, like which library to use, but that was it; different parts built by different teams ended up with all kinds of different behaviors.
As for the AI side, this is something where I see our limited context sizes causing issues when developing architecture across multiple products.
Comment by bob1029 31 minutes ago
Context size isn't the issue. You couldn't effectively leverage an infinite context even if you had one. The general solution is to recursively decompose the problem into smaller ones and solve them independently of each other, returning the results back up the stack. Recursion is the key here. A bunch of parallel agents on separate call stacks that don't block on their logical callees is a slop factory.
Comment by raincole 1 hour ago
The HN title, editorialized: "OTelBench: AI struggles with simple SRE tasks (Opus 4.5 scores only 29%)"
The task:
> Your task is: Add OTEL tracing to all microservices.
> Requirements:
> Instrumentation should match conventions and well-known good practices.
> Instrumentation must match the business domain of the microservices.
> Traces must be sent to the endpoint defined by a standard OTEL environment variable.
> Use the recent version of the OTEL SDK.
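(For context: the "standard OTEL environment variable" presumably means OTEL_EXPORTER_OTLP_ENDPOINT. A minimal Python sketch of a setup satisfying that requirement, assuming the opentelemetry-sdk and OTLP exporter packages and a hypothetical service name:)

```python
# Minimal SDK bootstrap: OTLPSpanExporter() reads OTEL_EXPORTER_OTLP_ENDPOINT
# (and the other OTEL_EXPORTER_OTLP_* variables) from the environment,
# so nothing is hardcoded.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "order-service"}))
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)
```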
I really don't think anything involving multiple microservices can be called 'simple', even for humans. Perhaps it is to an expert who knows the specific business's domain.
Comment by pixl97 59 minutes ago
I've had to work in systems where events didn't share correlation IDs; I had to go in and filter entries down to microsecond windows to get a small enough set of entries that I could trace what actually happened between a set of services.
From what I've seen on the enterprise software side of the world, a lot of companies are particularly bad at SRE and there isn't much standardization.
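On the missing correlation IDs: a tiny Python sketch of the alternative (assuming a hypothetical X-Correlation-ID header, not any particular stack): stamp one ID per request onto every log line, and the cross-service filtering becomes a single query instead of timestamp archaeology.

```python
# Propagate one correlation ID per request and stamp it on every log line.
import logging
import uuid
from contextvars import ContextVar

correlation_id: ContextVar[str] = ContextVar("correlation_id", default="-")

class CorrelationFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(correlation_id)s %(message)s")
logger = logging.getLogger("svc")
logger.addFilter(CorrelationFilter())

def handle_request(headers: dict) -> None:
    # Reuse the caller's ID if present (hypothetical X-Correlation-ID header),
    # otherwise mint one; downstream calls should forward the same header.
    correlation_id.set(headers.get("X-Correlation-ID", str(uuid.uuid4())))
    logger.info("processing request")
```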
Comment by formerly_proven 37 minutes ago
Enterprise app observability is purely the responsibility of each individual application/project manager. There is virtually no standardization or even shared infra; a team just stuffing plaintext logs into an unconfigured Elasticsearch instance is probably above the median already. There is no visibility into anything across departments and, more often than not, not even across apps within a department.
Comment by chaps 53 minutes ago
These aren't challenging things for an experienced human to do at all. But it's such a huge pain point for these models! It's hard for me to wrap my head around how these models can write surprisingly excellent code but fall down on these sorts of relatively simple troubleshooting paths.
Comment by whynotminot 1 hour ago
Very few people start their careers as SREs; it's generally something they migrate into after enjoying it and showing an aptitude for it.
With that said, I wouldn't expect this wall to hold up for too long. There has been a lot of low-hanging fruit in teaching models how to code. When that is saturated, the frontier companies will likely turn their attention to honing training environments for SRE-style debugging.
Comment by heliumtera 22 minutes ago
When we ask it to generate an image, any image will do. We couldn't care less. Try to sculpt it, try to rotate it 45 degrees, and all hell breaks loose. The image gets rotated, but the hair color might change as well. Pure vibes!
When you ask it to refactor your code, any pattern will do. You could rearrange the code in infinite ways, rename variables in infinite ways, without fundamentally breaking the logic. You could make as many arbitrary bullshit abstractions as you like and call it good, as people have done for years with OOP. It doesn't matter at all; any result will do in these cases.
When you want to hit a specific gRPC endpoint, you need a specific address and the method expects a specific contract to be honored. This either matches or it doesn't. When you want the LLM to implement a solution that captures specific syscalls from specific hosts and sends traces to a specific platform, using a specific protocol, consolidating records in a specific bucket... you have one state that satisfies your needs and 100 requirements that all have to be fulfilled. It either meets every requirement or it's no good.
It truly is different from vibing, and LLMs will never be able to do this. Maybe agents will, depending on the harnesses and the systems in place, but a single model just generates words, words, words with no care for anything else.
Comment by lysace 1 hour ago
The models are already so good at the traditionally hard stuff: collecting that insane amount of detailed knowledge across so many different domains, languages and software stacks.
Comment by ripped_britches 8 minutes ago
Is it clicking a different result from the same search?
It’s possible that the requirements here are not clear, given that the instructions don’t detail how to handle such a situation and it’s not obvious to me as a human.
Comment by dgxyz 1 hour ago
I wouldn’t touch this with a pole if our MTTR was dependent on it being successful though.
Comment by vasco 1 hour ago
MCP servers for monitoring tools are making our developers more competent at finding metrics and issues.
It'll get there, but nobody is going to type "fix my incident" in production and have a nice time today, outside of the simplest things, which, if they can be fixed like this, could have been automated already anyway. But going from writing a runbook to automating it sometimes takes time, so those use cases will grow.
Comment by hakanderyal 29 minutes ago
Have the AI document the services into a concise document first. Then give it proper instructions about what you expect, along with the documentation it created.
Opus would pass that.
We are not there yet; the agents are not ready to replace the driver.
Comment by AnotherGoodName 1 hour ago
>When an app runs on a single machine, you can often trace an error by scrolling through a log file. But when it runs across 50 microservices, that single request gets scattered into a chaotic firehose of disconnected events.
Yep, this is about Google. It's painful for humans to debug and it's also an extremely bespoke issue to deal with. No one else has quite the same level of clusterfuck, and there's going to be no training data for LLMs on this.
Comment by belval 55 minutes ago
In general for those tasks, though, the question is more "how would a human do it?". If it's impossible for a human because your tooling is so bad you can't even get the logs across services for a single ID, that seems like a pretty serious design issue.
Looking at the prompt, though, this is also not very representative. You don't have an SOP that you can share with your agent? How do you expect new hires to onboard?
Comment by tayo42 1 hour ago
This seems like typical work in any business that isn't trivial.
Comment by AnotherGoodName 41 minutes ago
E.g. Facebook (I've worked at Meta and Google, among others, so it's a good way to compare extremes) is entirely a monolith. You type a line of code, hit refresh and you see it, running fully in the context of everything else your dev server does. It's still statically typed, so a type error is seen quickly in the full context of everything the server can do, and in general there's just no impetus to move to microservices since deployment of the monolith takes no time. Every server running Facebook runs the exact same image. That's not to say Hack is a perfect language or anything. It's basically PHP made to look and act like Java, which isn't great, but the fact is you never, ever think about how the code runs and interacts in the context of a microservice environment. You don't need to. Everyone who's worked at both Meta and Google has the opinion that Meta moves faster, and this is part of the reason.
Some companies have architectures that can't deploy like this. That is the reason you move to microservices. It's not at all a developer-velocity win. It's just needed if you have frameworks that don't allow you to run and deploy "all the code ever written in the company" in a reasonable way. You need to break it up into modular pieces with defined boundaries so that you only run the parts you need as you develop (defined boundaries are a dev win, sure, but that can be done without microservices).
Google has gotten to the point where things are getting really fine-grained and, honestly, chaotic. Moving a portion of code to its own microservice is basically a promo-bait six-month project, often done without justification other than "everything should be its own microservice". In my time at Google I never heard "what benefit do we get if this is a microservice?"; it's just assumed to always be a good thing. Fifty interacting microservices to go through in a trace is at the point where the only place I've seen such a thing is Google.
Comment by smithclay 46 minutes ago
The only other benchmark I've come across is https://sreben.ch/ ... certainly there must be others by now?
Comment by jcims 1 hour ago
Even when it's not particularly effective, the additional information provided tends to be quite useful.
Comment by derfurth 29 minutes ago
- initially it wasn't working; plenty of parent/child relationship problems like those described in the post
- so I designed a thin wrapper and used sealed classes for events instead of dynamic spans (roughly the idea sketched below), plus some light documentation
It took me like a day to implement tracing on the existing codebase, and for new features it works out of the box using the documentation.
At the end of the day, leveraging typing + documentation dramatically constrains LLMs to do a better job
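The stack isn't specified (sealed classes suggest Kotlin or Dart), but the same idea sketched in Python looks something like this: a closed set of event types replaces free-form span names, so the model (or a colleague) can't invent its own.

```python
# Sketch of the "thin wrapper + closed event set" idea: callers can only
# trace events from a fixed enum, so span names stay consistent.
from contextlib import contextmanager
from enum import Enum
from opentelemetry import trace

class TraceEvent(str, Enum):
    ORDER_PLACED = "order.placed"
    STOCK_RESERVED = "stock.reserved"
    PAYMENT_CAPTURED = "payment.captured"

_tracer = trace.get_tracer("shop")

@contextmanager
def traced(event: TraceEvent, **attributes):
    """Start a span whose name comes from the closed TraceEvent set."""
    with _tracer.start_as_current_span(event.value) as span:
        for key, value in attributes.items():
            span.set_attribute(key, value)
        yield span

# Usage: with traced(TraceEvent.ORDER_PLACED, order_id="42"): ...
```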
Comment by stared 1 hour ago
See e.g.: https://quesma.com/benchmarks/otel/models/claude-opus-4.5/
Comment by yomismoaqui 1 hour ago
It made me remember when I was working in the J2EE ecosystem *shudder*
Comment by NitpickLawyer 52 minutes ago
For [1]: instruction.md is very brief, quite vague and "assumes" a lot of things.
- Your task is: Add OTEL tracing to all microservices. Add OTEL logging to all microservices. (this is good)
- 6. I want to know if the microservice has OTEL instrumentation and where the data is being sent. (??? I have no idea what this means)
- 9. Use the recent version of the OTEL SDK. (yeah, this won't work unless you also use an MCP like context7 or provide local docs)
What's weird here is that instruct.md has 0 content regarding conventions, specifically how to name things. Yet in tests_outputs you have this "expected_patterns = ["order", "stock", "gateway"]" and you assert on it. I guess that makes some sense, but being specific in the task.md is a must. Otherwise you're benching assumptions, and those don't even work with meatbags :)
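To illustrate why that's a problem, a check of that shape (a hypothetical reconstruction, not the repo's actual test code) is a hard gate on naming:

```python
# Hypothetical reconstruction of that kind of assertion (NOT the actual test):
# a solution whose spans are named differently, yet correctly, would still fail.
expected_patterns = ["order", "stock", "gateway"]

def spans_match(span_names: list[str]) -> bool:
    return all(
        any(pattern in name.lower() for name in span_names)
        for pattern in expected_patterns
    )
```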
For [2]: instruction.md is more detailed, but has some weird issues:
- "You should only be very minimal and instrument only the critical calls like request handlers without adding spans for business calls \n The goal is to get business kind of transaction" (??? this is confusing, even skipping over the weird grammar there)
- "Draw ascii trace diagram into /workdir/traces.txt" (????)
- "When modifying Python files, use Python itself to write files or use sed for targeted changes" (? why are you giving it harness-specific instructions in your instruct.md? this is so dependent on the agentic loop used, that it makes no sense here.
- "Success Criteria: Demonstrate proper distributed tracing \n Include essential operations without over-instrumenting (keep it focused) \n Link operations correctly \n Analyze the code to determine which operations are essential to trace and how they relate to each other. (i mean ... yes and no. these are not success criteria IMO. It's like saying "do good on task not do bad". This could definitely be improved.)
----
Also, I noticed that every folder has a summary_claude... that looks like a Claude-written summary of a run. I hope that's not what's actually used in computing the benchmark scores. If it is, you're adding another layer of uncertainty to checking the results...
The idea is nice, but tbf some of the tests seem contrived, your instructions are not that clear, you expect static naming values while not providing any instructions about naming conventions, and so on. It feels like a lot of this was "rushed"? I peeked a bit at the commit history and saw some mentions of vibe-coding a viewer for this. I hope that's the only thing that was vibe-coded :)
[1] - https://github.com/QuesmaOrg/otel-bench/tree/main/datasets/o...
[2] - https://github.com/QuesmaOrg/otel-bench/blob/main/datasets/o...
Comment by heliumtera 39 minutes ago
First of all, familiarity with OpenTelemetry APIs is not knowledge; they are arbitrary constructs.
We are implying that conforming to a standard is the only way, the right way. I would challenge that.
Assuming models were good at these tasks, we could only conclude that the tasks were trivial AND sufficiently documented. And if they were good at this type of task (they can be trained to be good cheaply; we know that based on similarly acquired capabilities), making a benchmark out of it would be less useful.
But I am sure nobody really cares and the author just had to SEO a little bit regardless of reality
Comment by vachina 18 minutes ago
Also, an LLM is a very advanced autocomplete algorithm. And autocomplete isn't designed to write for you; you have to write first.
Comment by apercu 1 hour ago
My takeaway was more "maybe AI coding assistants today aren’t yet good at this specific, realistic engineering task"....
Comment by hobofan 59 minutes ago
I think you would see similar results if tasking an AI to e.g. write gRPC/Protobuf systems using only the built-in/official protobuf codegen tooling.
Where I think the benchmark is quite fair is in the solutions. It looks like for each of the languages (at least the ones I'm familiar with), the "better" options were chosen, e.g. using `tracing-opentelemetry` rather than `opentelemetry-sdk` directly in Rust.
However the one-shot nature of the benchmark also isn't that reflective of the actual utility. In my experience, if you have the initial framework setup done in your repo + a handful of examples, they do a great job of applying OTEL tracing to the majority of your project.
Comment by pixl97 42 minutes ago
This almost always correlates with customers having similar issues in getting things working.
This has led us to rewrite a lot of documentation to be more consistent and clear. In addition, we set out a series of examples from simple to complex. This shows up as fewer tickets later, and more complex implementations being set up by customers without needing support.