A2UI: A Protocol for Agent-Driven Interfaces
Posted by makeramen 10 hours ago
Comments
Comment by codethief 8 hours ago
Sounds like agents are suddenly able to do what developers have failed at for decades: Writing platform-independent UIs. Maybe this works for simple use cases but beyond that I'm skeptical.
Comment by observationist 36 minutes ago
It's about accomplishing a task, not making a bot accomplish a task using the same tools and embodiment context as a human. There's no upside unless the bot actually has a humanoid embodiment, and even then, using a CLI and service API is going to be preferable to doing things through a UI in nearly every case, except where you want to limit the bot to human-ish capabilities (as with gaming) or you want to deceive monitors into thinking a human is operating.
It's going to be infinitely easier to put a JSON get/push wrapper around existing APIs or automation interfaces than to universalize some sort of GUI interaction, because LLMs don't have the realtime memory you need to adapt to all the edge cases on the fly. It's incredibly difficult for humans too: hundreds of billions of dollars have been spent trying to make software universally accessible and dumbed down for users, and it still ends up being either stupidly limited or fractally complex in the tail. No developer can ever account for all the possible ways users interact with a feature in any moderately complex piece of software.
Just use existing automation patterns. This is one case where if an AI picks up this capability alongside other advances, then awesome, but any sort of middleware is going to be a huge hack that immediately gets obsoleted by frontier models as a matter of course.
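To make the first point concrete, the kind of JSON wrapper described above really is just a thin shim. A rough sketch in TypeScript (the endpoint, tool name, and types are all made up for illustration):

  // Hypothetical: expose an existing REST endpoint as a JSON "tool" an
  // agent can call directly, instead of driving a GUI to the same effect.
  type CreateInvoiceArgs = { customerId: string; amountCents: number };

  async function createInvoice(args: CreateInvoiceArgs): Promise<unknown> {
    const res = await fetch("https://api.example.com/invoices", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(args),
    });
    if (!res.ok) throw new Error(`API error: ${res.status}`);
    return res.json(); // plain JSON back to the model; no UI involved
  }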
Comment by kridsdale3 2 hours ago
A2UI is a superset, expanding into more element types. If we're going to have the origin of all our data streams be string-output generators, this seems like an ok way to go.
I've joined an effort inside Google to work in this exact space. What we're doing has no plan to become open source, but other groups are working on stuff like A2UI and we collaborate with them.
My career before this was nearly 20 years of native platform UI programming, and things like Flutter, React Native, etc. have always really annoyed me. But I've come around this year to accepting that, as long as LLMs on servers are where the applications of the future live, we need a client-OS-agnostic framework like this.
Comment by awei 3 hours ago
Some examples from the documentation:

  {
    "id": "settings-tabs",
    "component": {
      "Tabs": {
        "tabItems": [
          {"title": {"literalString": "General"}, "child": "general-settings"},
          {"title": {"literalString": "Privacy"}, "child": "privacy-settings"},
          {"title": {"literalString": "Advanced"}, "child": "advanced-settings"}
        ]
      }
    }
  }

  {
    "id": "email-input",
    "component": {
      "TextField": {
        "label": {"literalString": "Email Address"},
        "text": {"path": "/user/email"},
        "textFieldType": "shortText"
      }
    }
  }
Comment by epec254 3 hours ago
Most HTML is actually HTML+CSS+JS - IMO, accepting this is a code injection attack waiting to happen. By abstracting to JSON, a client can safely render UI without this concern.
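The safety claim rests on the client instantiating only components it already trusts, so no agent-provided markup ever reaches the DOM. A minimal sketch of that pattern (the component names and prop shapes here are illustrative, not the actual spec):

  // Map vetted component names to safe factories; reject everything else.
  type ComponentSpec = { id: string; component: Record<string, any> };

  const catalog: Record<string, (props: any) => HTMLElement> = {
    Text: (props) => {
      const el = document.createElement("p");
      el.textContent = props.text?.literalString ?? ""; // textContent, never innerHTML
      return el;
    },
    TextField: (props) => {
      const input = document.createElement("input");
      input.type = "text";
      input.setAttribute("aria-label", props.label?.literalString ?? "");
      return input;
    },
  };

  function render(spec: ComponentSpec): HTMLElement {
    const entry = Object.entries(spec.component)[0];
    if (!entry) throw new Error("Empty component spec");
    const [name, props] = entry;
    const factory = catalog[name];
    if (!factory) throw new Error(`Unvetted component: ${name}`); // fail closed
    return factory(props);
  }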
Comment by epec254 3 hours ago
One challenge is that you likely do want JS to process/capture the data - for example, taking the data from a form and turning it into JSON to send back to the agent.
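Though that JS can be client-owned rather than agent-supplied; the protocol only needs to carry the resulting JSON back. A sketch (sendToAgent and the event name are hypothetical):

  declare function sendToAgent(message: unknown): void; // hypothetical transport back to the agent

  // Client-owned capture: serialize form inputs to JSON. No code from the
  // agent runs; the agent only ever sees the resulting data payload.
  function captureForm(form: HTMLFormElement): Record<string, string> {
    const payload: Record<string, string> = {};
    new FormData(form).forEach((value, key) => {
      payload[key] = String(value);
    });
    return payload;
  }

  document.querySelector("form")?.addEventListener("submit", (e) => {
    e.preventDefault();
    sendToAgent({ event: "formSubmitted", data: captureForm(e.currentTarget as HTMLFormElement) });
  });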
Comment by mbossie 8 hours ago
How many more variants are we going to introduce to solve the same problem? Sounds like a lot of wasted man-hours to me.
Comment by pscanf 8 hours ago
I completely agree, though I'm personally sitting out all of these protocols/frameworks/libraries. In six months' time half of them will have been abandoned, and the other half will have morphed into something very different and incompatible.
For the time being, I just build things from scratch, which, as others have noted¹, is actually not that difficult, gives you an understanding of what goes on under the hood, and doesn't tie you to someone else's innovation pace (whether it's faster or slower).
Comment by kridsdale3 2 hours ago
The same happened with GPUs in the 90s. When Jensen formed Nvidia there were 70 other companies selling graphics cards you could put in a PCI slot. Now there are two.
Comment by mystifyingpoi 7 hours ago
Sounds like a lot of people got paid because of it. That's a win for them. It wasn't their decision; it was the company's decision to take part in the race. Most likely there will be more than one winner anyway.
Comment by kridsdale3 2 hours ago
Like you mentioned, it's a good time to be employed.
Comment by pedrozieg 7 hours ago
The genuinely interesting bit here is the security boundary: agents can only speak in terms of a vetted component catalog, and the client owns execution. If you get that right, you can swap the agent for a rules engine or a human operator and keep the same protocol. My guess is the spec that wins won’t be the one with the coolest demos, but the one boring enough that a product team can live with it for 5-10 years.
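Concretely, the swap works if every producer emits the same message type and the client never knows which one it's talking to. A sketch of that shape (the names are mine, not the spec's):

  // Any producer that emits the same vetted messages can drive the client.
  type UiMessage = { id: string; component: Record<string, unknown> };

  interface UiProducer {
    next(userEvent: unknown): Promise<UiMessage[]>;
  }

  class LlmProducer implements UiProducer {
    async next(userEvent: unknown): Promise<UiMessage[]> {
      // ...call the model, then validate its output against the catalog...
      return [];
    }
  }

  class RulesProducer implements UiProducer {
    async next(userEvent: unknown): Promise<UiMessage[]> {
      // ...a deterministic state machine emitting the same components...
      return [];
    }
  }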
Comment by turnsout 3 hours ago
The vision here is that you can chat with Gemini, and it can generate an app on the fly to solve your problem. For the visualized landscaping app, it could just connect to landscapers via their Google Business Profile.
As an app developer, I'm actually not even against this. The amount of human effort that goes into creating and maintaining thousands of duplicative apps is wasteful.
Comment by verdverm 35 minutes ago
How many times are users going to spin GPUs to create the same app?
Comment by verdverm 25 minutes ago
> 1. Establish SSE connection
> ... user event
> 7. send updates over origin SSE connection
So the client is required to maintain an SSE capable connection for the entire chat session? What if my network drops or I switch to another agent?
Seems an onerous requirement to maintain a connection for the lifetime of a session, which can span days (as some people have told us they have done with agents).
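For what it's worth, stock SSE already has reconnection semantics: the browser retries automatically and sends a Last-Event-ID header, so a dropped network need not end the session, as long as the server keeps state per session rather than per connection. Whether A2UI requires that is the open question. A sketch of the client side (the URL and render function are made up):

  declare function applyUiUpdate(update: unknown): void; // hypothetical render step

  const source = new EventSource("https://agent.example.com/session/123/events");

  source.onmessage = (e) => {
    applyUiUpdate(JSON.parse(e.data));
  };

  source.onerror = () => {
    // The browser retries on its own and replays from Last-Event-ID if the
    // server sets event IDs; days-long sessions still need server-side
    // persistence keyed by session, not by connection.
  };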
Comment by ceuk 5 hours ago
Feels good to have been on the money, but I'm also glad I didn't start a project only to be harpooned by Google straight away.
Comment by qsort 9 hours ago
What scares me is that even without arbitrary code generation, there's the potential for hallucinations and prompt injection to hit hard if a solution like this isn't sandboxed properly. An automatically generated "confirm purchase" button like the one in the example shown is... probably not something I'd leave entirely unsupervised just yet.
Comment by jy14898 9 hours ago
However, I'm happy it's happening because you don't need an LLM to use the protocol.
Comment by _pdp_ 7 hours ago
It is simple and effective, and it feels more native to me than some rigid data structure designed for very specific use cases that may not fit well with your own problem.
Honestly, we should think of Emacs when working with LLMs and try to apply the same philosophy. I am not a fan of Emacs per se, but the parallels are there. Everything is a file and everything is text in a buffer. The text can be rendered in various ways depending on the consumer.
This is also the philosophy we use in our own product, and it works remarkably well for a diverse set of customers. I have not encountered anything that cannot be modelled this way. It is simple, effective, and it allows a great degree of flexibility when things are not going as well as planned. It works well with streaming too (streaming parsers are not so difficult to write for simple text structures, and we have been doing this for ages), and LLMs are trained very well to produce this type of output, versus anything custom that has not yet been seen or adopted by anyone.
Besides, given that LLMs are getting good at coding and the browser can render iframes in seamless mode, a better and more flexible approach would be to use HTML, CSS, and JavaScript instead of what Slack has been doing for ages with their Block Kit API, which we know is very rigid and frustrating to work with. I get why you might want data structures for UI in order to cover CLI tools as well, but at the end of the day browsers and CLIs are completely different things, and I do not believe you can meaningfully make it work for both unless you are also prepared to dumb it down and target only the lowest common denominator.
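(Strictly speaking, the seamless attribute never shipped broadly and was dropped from the HTML spec, but the sandbox attribute gets you the isolation that matters. A minimal sketch of rendering agent-generated HTML without letting it touch the parent page:)

  // Render untrusted, agent-generated HTML inside a locked-down iframe so
  // it cannot reach the parent page, its cookies, or its storage.
  function renderUntrustedHtml(html: string): HTMLIFrameElement {
    const frame = document.createElement("iframe");
    frame.setAttribute("sandbox", ""); // all restrictions on by default
    frame.srcdoc = html;
    frame.style.border = "none"; // approximates the old "seamless" look
    return frame;
  }

  document.body.appendChild(renderUntrustedHtml("<h1>Agent output</h1>"));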
Comment by lowsong 8 hours ago
Why the hell would anyone want this? Why on earth would you trust an LLM to output a UI? You're just asking for security bugs, UI impersonation attacks, terrible usability, and more. This is a nightmare.
Comment by DannyBee 6 hours ago
Freeform looks and acts like text, except for a set of things that someone vetted and made work.
If the interactive diagram or UI you click on now owns you, it doesn't matter if it was inside the chat window or outside the chat window.
Now, in this case, it's not arbitrary UI. But if you believe that the parsing/validation/rendering/two-way data binding/incremental composition (the spec requires that you be able to build up UI incrementally) of these components: https://a2ui.org/specification/v0.9-a2ui/#standard-component...
as transported/rendered/etc. by NxM combinations of implementations (there are 4 renderers and a bunch of transports right now), is not going to have security issues, I've got a bridge to sell you.
Here, I'll sell it to you in Gemini. Just click a few times on the "totally safe text box" for me before you sign your name.
My friend once called something a babydoggle - something you know will be a boondoggle, but is still in its small formative stages.
This feels like a babydoggle to me.
Comment by vidarh 4 hours ago
There is a vast difference in risk between me clicking a button provided by Claude in my Claude chat, on the basis of conversations I have had with Claude, and clicking a random button on a random website. Both can be malicious; one is substantially higher risk. Separately, linking a UI constructed this way up to an agent and letting third parties interact with it is much riskier to you than to them.
> If the interactive diagram or UI you click on now owns you, it doesn't matter if it was inside the chat window or outside the chat window.
In that scenario, the UI elements are irrelevant barring a buggy implementation (yes, I've read the rest; see below), as you can achieve the same things by just presenting the user with a basic link and telling them to press it.
> as transported/renderered/etc by NxM combinations of implementations (there are 4 renderers and a bunch of transports right now), is not going to have security issues, i've got a bridge to sell you.
I very much doubt we'll see many implementations that won't just use a web view for this, and I very much doubt these issues will even fall in the top 10 security issues people will run into with AI tooling. Sure, there will be bugs. You can use this argument against anything that requires changes to client software.
But if you're concerned about the security of clients, MCP and hooks are a far bigger rat's nest of things that are inherently risky due to the way they are designed.
Comment by mannanj 2 hours ago
Yes, yes, we claim the user doesn't know what they want. I think that's largely used as an excuse to avoid rethinking how things should meet the user's needs, and to keep the status quo where people are made to rely on systems and walled gardens. The point of this article is that UIs should work better for the user. What better way than to let them imagine the UI themselves (or even nudge them with example actions, buttons, and text to click that render specific views)! I've been wanting to build something where I just ask in English for the options I know I have, or otherwise play and hit edges to discover what's possible and what isn't.
Anyone else thinking along this direction or think I’m missing something obvious here?
Comment by alexgotoi 2 hours ago
The real question: do UIs even make sense for agents? Like the whole point of a UI is to expose functionality to humans with constraints (screens, mice, attention). Agents don't have those constraints. They can read JSON, call APIs directly, parse docs. Why are we building them middleware to click buttons?
I think this makes sense as a transition layer while we figure out what agent-native architecture looks like. But long-term it's probably training wheels.
Will include this in my https://hackernewsai.com/ newsletter.