LLM Year in Review
Posted by swyx 1 day ago
Comments
Comment by socketcluster 20 hours ago
The kind of code that Claude produces looks almost exactly like the code I would write myself. It's like it's reading my mind. This is a game changer because I can maintain the code that Claude produces.
With Claude Code, there are no surprises. I can pretty much guess what its code will look like 90% to 95% of the time but it writes it a lot faster than I could. This is an amazing innovation.
Gemini is quite impressive as well. Nano banana in particular is very useful for graphic design.
I haven't tried Gemini for coding yet, but TBH Claude Code does such a great job that if I could code any faster, I would get decision fatigue. I don't like rushing into architecture or UX decisions. I like to sit on certain decisions for a day or two before starting implementation. Once you start in a particular direction, it's hard to undo, and you may double down on the mistake due to the sunk cost fallacy. I try hard to avoid that.
Comment by Daniel_sk 11 hours ago
Comment by RealityVoid 4 hours ago
Comment by esafak 9 hours ago
Comment by yread 7 hours ago
Comment by dukeyukey 6 hours ago
Comment by yread 4 hours ago
Comment by andai 15 hours ago
(GLM etc. get surprisingly close with good prompting but... $0.60/day to not worry about that is a no brainer.)
Comment by spaceman_2020 11 hours ago
Comment by rolymath 8 hours ago
Comment by kakapo5672 6 hours ago
Comment by disease 6 hours ago
Comment by dgacmu 6 hours ago
Comment by IAmGraydon 8 hours ago
Comment by thefourthchime 6 hours ago
Someone sell me on Claude Code, I just don't get it.
Comment by senordevnyc 6 hours ago
Fundamentally, I don’t like having my agent and my IDE be split. Yes, I know there are CC plugins for IDEs, but you don’t get the same level of tight integration.
Comment by tarsinge 17 hours ago
Comment by afro88 14 hours ago
Though for more automated work, one thing you miss with Cursor is sub-agents. And then, to a lesser extent, skills (these are pretty easy to emulate in other tools). I'm sure it's only a matter of time though.
Comment by Ozzie_osman 12 hours ago
Comment by ramoz 11 hours ago
Cursor has an agent, but that's like whoever else tried to copy the Model T while Ford was developing it.
Comment by senordevnyc 8 hours ago
Comment by andai 15 hours ago
If you mostly have small codebases that fit in context, or make many small changes interactively, it's not really great for that (though it can handle it too). It'll just be spending most of its time poking around the codebase, when the whole thing should have just been loaded... (Too bad there's no small-repo mode. I made a startup hook that just dumps the whole directory into context, cat-style, but yeah, it should be a toggle.)
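Something like this, roughly: a hypothetical sketch of that hook as a standalone script, where the size budget and skip list are made up for illustration, not what I actually use:

    # Hypothetical stand-in for the "dump the repo into context" startup hook.
    import os
    import sys

    MAX_BYTES = 200_000  # rough budget so the dump fits a small context window
    SKIP_DIRS = {".git", "node_modules", "__pycache__"}

    def dump_repo(root: str = ".") -> None:
        used = 0
        for dirpath, dirnames, filenames in os.walk(root):
            # Prune noisy directories in place so os.walk skips them.
            dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                try:
                    with open(path, encoding="utf-8") as f:
                        text = f.read()
                except (UnicodeDecodeError, OSError):
                    continue  # skip binaries and unreadable files
                if used + len(text) > MAX_BYTES:
                    return  # stay under budget
                used += len(text)
                # Header per file so the model can tell them apart.
                print(f"\n===== {path} =====\n{text}")

    if __name__ == "__main__":
        dump_repo(sys.argv[1] if len(sys.argv) > 1 else ".")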
Comment by wahnfrieden 17 hours ago
Claude Code is overrated: many of its features and modalities exist to compensate for model shortcomings, and they are not as necessary when steering state-of-the-art models like GPT 5.2.
Comment by MrOrelliOReilly 15 hours ago
Comment by woadwarrior01 13 hours ago
> See: https://artificialanalysis.ai
The field moves fast. Per artificialanalysis, Opus 4.5 is currently behind GPT-5.2 (x-high) and Gemini 3 Pro. Even Google's cheaper Gemini 3 Flash model seems to be slightly ahead of Opus 4.5.
Comment by MrOrelliOReilly 10 hours ago
Comment by gessha 9 hours ago
Comment by dr_dshiv 11 hours ago
LM Arena shows Claude Opus 4.5 on top
Comment by HarHarVeryFunny 11 hours ago
In addition to whatever they are exposed to as part of pre-training, it'd be interesting to know what kind of coding tasks these models are being RL-trained for? Are things like web development and maybe Python/ML coding overemphasized, or are they also being trained on things like Linux/Windows/embedded development etc in different languages?
Comment by ramoz 11 hours ago
https://x.com/METR_Evals/status/2002203627377574113
> Even Google's cheaper Gemini 3 Flash model seems to be slightly ahead of Opus 4.5.
What an insane take for anybody who uses these models daily.
Comment by MrOrelliOReilly 10 hours ago
Comment by fzzzy 9 hours ago
Comment by wahnfrieden 9 hours ago
Comment by wahnfrieden 8 hours ago
It is also out of date as it does not include 5.2 Codex.
Per my point about steerability being compensated for by modalities and other harness features: Opus 4.5 scores 58% while GPT 5.2 scores 75% on the instruction-following benchmark in your link! Thanks for the hard evidence: GPT 5.2's score is about 30% higher than Opus 4.5's there (75 vs. 58). No wonder Claude Code needs those harness features so the user can manually rein in its instruction following.
Comment by ccmcarey 16 hours ago
Comment by wahnfrieden 8 hours ago
GPT 5.2 simply obeys an instruction to assemble a plan, which avoids the need to compensate for poor steerability by making the user manually manage modalities.
Opus has improved, though, so plan mode is less necessary than it was before, but it is still far behind state-of-the-art steerability.
Comment by augment_me 17 hours ago
> it's not just a website you go to like Google, it's a little spirit/ghost that "lives" on your computer
> it's not just about the image generation itself, it's about the joint capability coming from text generation
Three years ago I would have had no reaction to this, but now this sentence structure is ruined for me.
Comment by spaceman_2020 11 hours ago
But I had to change how I write because people started calling my writing “AI generated”
Comment by athrowaway3z 9 hours ago
Comment by vatsachak 8 hours ago
Comment by kakapo5672 6 hours ago
We're embarking on a ginormous planetary experiment here.
Comment by karpathy 8 hours ago
Jk jk, now that you pointed it out I can’t unsee it.
Comment by d-lisp 15 hours ago
> it's not just a website you go to like Google, it's a little spirit/ghost that "lives" on your computer
This type of sentence, I call rhetorical fat. Get rid of this fat and you obtain a boring sentence that repeats what has been said in the previous one.
Not all rhetorical fat is equal, and I must admit I find myself eye-rolling at the "little spirit" part more than at the fatness.
I understand the author wants to decorate things and emphasize key elements, and the hate I feel is only caused by the incompatible projection of my ideals to a text that doesn't belong to me.
> it's not just about the image generation itself, it's about the joint capability coming from text generation.
That's unjustified conceptual stress.
That could be a legitimate answer to a question ("No, no, it's not just about that, it's more about this"), but it's a text. Maybe the text wants you to be focused, maybe the text wants to hype you; this is the shape of the hype without the hype.
"I find image generation is cooler when paired with text generation."
Comment by killerstorm 12 hours ago
You might find this statement non-informative, but without the two parts there's no comparison. That's really the semantics of the statement Karpathy is trying to express.
The ChatGPT-ish "it's not just" is annoying because the first part is usually a strawman, something the reader considers trite. But that's not the case here.
Comment by d-lisp 11 hours ago
You're right! The strawman theory is based.
But I think there's more to it: I dislike the structure of these sentences, which I find a bit sensationalist for nothing. I don't know, maybe I am still grumpy.
Comment by killerstorm 4 hours ago
So it might be just a natural reaction to over-use of a particular pattern. This kind of stuff has been driving language evolution for millennia. Besides that, a pompous style is often used in 'copy' (slogans and ads), which is something most people don't like.
Comment by amelius 11 hours ago
Comment by flakiness 10 hours ago
After all, he's been an "influencer" for a long time, starting from the "software 2.0" essay.
Comment by matsemann 11 hours ago
Comment by yard2010 13 hours ago
Comment by another_twist 15 hours ago
Comment by andai 15 hours ago
I realized that's what bothered me. It's not "oh my god, they used ChatGPT." But "oh my god, they couldn't even be bothered to use Claude."
It'll still sound like AI, but 90% of the cringe is gone.
If you're going to use AI for writing, it's just basic decency to use the one that isn't going to make your audience fly into a fit of rage every ten seconds.
That being said, I feel very self-conscious using emdashes in the current decade ;)
Comment by sungho_ 6 hours ago
Comment by another_twist 6 hours ago
Comment by ionwake 14 hours ago
Comment by andai 11 hours ago
I mostly use them in Telegram because it auto-converts -- into an emdash. They are a pain to type everywhere else though!
Comment by dr_dshiv 11 hours ago
Comment by huevosabio 17 hours ago
Comment by nathias 15 hours ago
Comment by thoughtpeddler 1 day ago
Comment by karpathy 1 day ago
Comment by CamperBob2 21 hours ago
Comment by ramoz 11 hours ago
Comment by karpathy 19 hours ago
Comment by magicalhippo 1 day ago
Comment by simonw 23 hours ago
codex --oss -m gpt-oss:20b
Or 120b if you can fit the larger model.
Comment by AlexCoventry 23 hours ago
Comment by simonw 21 hours ago
I don't think gpt-oss:20b is strong enough to be honest, but 120b can do an OK job.
Nowhere NEAR as good as the big hosted models though.
Comment by ontouchstart 20 hours ago
Comment by AlexCoventry 19 hours ago
Comment by ramoz 21 hours ago
It runs on your computer because of its tooling. It can call Bash. It can literally do anything on the operating system and file system. That's what makes it different. You should think of it like a mech suit. The model is just the brain in a vat connected far away.
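To make the mech-suit shape concrete, here's a toy sketch of that loop; run_model is a made-up stand-in for the remote model, and only the shell execution part is real:

    # Toy agent loop: the "brain" emits tool calls, the local loop runs them.
    import subprocess

    def run_model(history):
        # Stand-in for a remote LLM call that decides the next action.
        if not any(msg.startswith("TOOL:") for msg in history):
            return {"tool": "bash", "command": "ls"}
        return {"done": True, "answer": "Listed the project files."}

    def agent_loop(task: str) -> str:
        history = [f"USER: {task}"]
        while True:
            action = run_model(history)
            if action.get("done"):
                return action["answer"]
            # The suit's "hands": execute the command on the local machine.
            result = subprocess.run(
                action["command"], shell=True, capture_output=True, text=True
            )
            history.append(f"TOOL: {result.stdout[:1000]}")

    print(agent_loop("What files are in this project?"))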
Comment by D-Machine 1 day ago
> I think OpenAI got this wrong because I think they focused their codex / agent efforts on cloud deployments in containers orchestrated from ChatGPT instead of localhost. [...] CC got this order of precedence correct and packaged it into a beautiful, minimal, compelling CLI form factor that changed what AI looks like - it's not just a website you go to like Google, it's a little spirit/ghost that "lives" on your computer. This is a new, distinct paradigm of interaction with an AI.
However, if so, this is definitely a distinction that needs to be made far more clearly.
Comment by realcul 22 hours ago
Comment by jkubicek 23 hours ago
You think every Electron app out there re-inventing application UX from scratch is bad, wait until LLMs are generating their own custom UX for every single action for every user for every device. What does command-W do in this app? It's literally impossible to predict, try it and see!
Comment by johnfn 17 hours ago
Comment by becquerel 16 hours ago
Comment by Aiisnotabubble 13 hours ago
It's the best UI ever.
It understands a lot of languages and abstract concepts.
It won't be necessary at all to let LLMs generate random UIs.
I'm not a native English speaker. I sometimes just throw in a German word and it just works.
Comment by tim333 12 hours ago
If you look at how humans actually communicate, I'd guess #1 is text/speech, #2 pictures.
Comment by starchild3001 1 day ago
I’m also sold on his take on "vibe coding" leading to ephemeral software; the idea of spinning up a custom, one-off tokenizer or app just to debug a single issue, and then deleting it, feels like a real shift.
Comment by HarHarVeryFunny 10 hours ago
I don't see these descriptions as very insightful.
The difference between general/animal intelligence and jagged/LLM intelligence is simply that humans/animals really ARE intelligent (the word was created to describe this human capability), while LLMs are just echoing narrow portions of the intelligent output of humans (those portions that are amenable to RLVR capture).
For an artificial intelligence to be intelligent in its own right, and therefore be generally intelligent, it would need, like an animal, to be embodied (even if only virtually), autonomous, predicting the outcomes of its own actions (not auto-regressively trained), learning incrementally and continually, built with innate traits like curiosity and boredom to put and keep itself in learning situations, etc.
Of course not all animals are generally intelligent - many (insects, fish, reptiles, many birds) just have narrow "hard coded" instinctual behaviors, but others like humans are generalists whom evolution has honed for adaptive lifetime learning and general intelligence.
Comment by naasking 4 minutes ago
But they aren't just echoing, that's the point. You really need to stop ignoring the extrapolation abilities in these domains. The point of the jagged analogy is that they match or exceed human intelligence in specific areas in a way that is not just parroting.
Comment by graemefawcett 21 hours ago
https://tech.lgbt/@graeme/115749759729642908
It's a stack based on finishing the job Jupyter started. Fences as functions, callable and composable.
Same shape as an MCP. No training required, just walk them through the patterns.
Literally, it's spatially organized. Turns out a woman named Mrs Curwen and I share some thoughts on pedagogy.
There does in fact exist a functor that maps 18th-century piano instruction to context engineering. We play with it.
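For the curious, a rough guess at what "fences as functions" could look like in practice; the python:name fence naming and the single-argument convention here are invented for illustration, not taken from the linked stack:

    # Named code fences in a Markdown doc become composable callables.
    import re

    FENCE = re.compile(r"```python:(\w+)\n(.*?)```", re.DOTALL)

    def load_fences(markdown: str) -> dict:
        """Compile each named fence into a function of one argument."""
        fences = {}
        for name, body in FENCE.findall(markdown):
            code = compile(body, f"<fence:{name}>", "exec")
            def fn(arg, _code=code):  # default arg pins this fence's code
                scope = {"arg": arg, "result": None}
                exec(_code, scope)  # the fence body sets `result`
                return scope["result"]
            fences[name] = fn
        return fences

    doc = '''
    ```python:shout
    result = arg.upper() + "!"
    ```
    ```python:reverse
    result = arg[::-1]
    ```
    '''

    fences = load_fences(doc)
    # Composable, as the comment suggests: reverse(shout(x))
    print(fences["reverse"](fences["shout"]("hello")))  # -> "!OLLEH"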
Comment by fourside 9 hours ago
We should keep in mind that our LLM use is currently subsidized. When the money dries up and we have to pay the real prices, I'll be interested to see if we can still consider whipping up one-time apps as basically free.
Comment by victorbuilds 1 day ago
Comment by mips_avatar 23 hours ago
Comment by HarHarVeryFunny 11 hours ago
Comment by gnerd00 21 hours ago
Comment by mips_avatar 20 hours ago
Comment by lysecret 13 hours ago
Comment by devalexwells 7 hours ago
I spent 5 minutes trying to find a way to unsubscribe and couldn't. Finally, I found it buried in the plan page as one of those low-contrast ellipses on the plan card.
Instead of unsubscribing me or taking me to a form, it opened a conversation with an AI chatbot with a preconfigured "unsubscribe" prompt. I have never felt more angry with a UI: I had to waste more time talking to a robot before it would render the unsubscribe button in the chat.
Why would we bring the most hated feature of automated phone calls to apps? As a frontend engineer I am horrified by these trends.
Comment by tim333 12 hours ago
Comment by gessha 9 hours ago
Comment by tim333 9 hours ago
Comment by gessha 8 hours ago
Comment by cheesecompiler 8 hours ago
> LLMs are emerging as a new kind of intelligence, simultaneously a lot smarter than I expected and a lot dumber than I expected
Isn't this concerning? How can we know which one we get? In the realm of code it's easier to tell when mistakes are being made.
> regular people benefit a lot more from LLMs compared to professionals, corporations and governments
We thought this would happen with things like AppleScript, VB, visual programming. But instead, AI is currently used as a smarter search engine. The issue is that's also the area where it hallucinates the most. What do you think is the solution?
Comment by sireat 5 hours ago
This would be a 100 kLOC legacy project written in C++, Python, and jQuery-era JavaScript circa 2010. The original devs have long left. I would rather avoid C++ as much as possible.
I've been a GitHub Copilot (in VS Code) user since June 2021 and still use it heavily, but the "more powerful IntelliSense" approach is limiting me on legacy projects.
Presumably I need to provide more context on larger projects.
I can get pretty far with just ChatGPT Plus and feeding it bits and pieces of the project. However, that seems like using the wrong tool.
Codex seems better for building things, but I'm not sure about grokking existing things.
Would Cursor be more suitable for just dumping in the whole project (all languages), basically 4 different sub-projects, and then selectively activating what to include in queries?
Comment by sandos 4 hours ago
Comment by TheAceOfHearts 1 day ago
Karpathy hints at one major capability unlock being UI generation, so instead of interacting with text the AI can present different interfaces depending on the kind of problem. That seems like a severely underexplored problem domain so far. Who are the key figures innovating in this space so far?
In the most recent Demis interview, he suggests that one of the key problems that must be solved is online / continuous learning.
Aside from that, another major issue is probably reducing hallucinations and increasing reliability. Ideally you should be able to deploy an LLM to work on a problem domain, and if it encounters an unexpected scenario it reaches out to you to figure out what to do. But for standard problems it should function reliably 100% of the time.
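As a minimal sketch of that escalate-when-unsure pattern; ask_model and its confidence score are hypothetical stand-ins, no real API is assumed:

    # Route low-confidence answers to a human instead of guessing.
    from dataclasses import dataclass

    @dataclass
    class Answer:
        text: str
        confidence: float  # model's self-reported certainty, 0.0-1.0

    def ask_model(task: str) -> Answer:
        # Stand-in for a real LLM call that also returns a confidence estimate.
        if "refund" in task:
            return Answer("Apply the standard refund policy.", 0.95)
        return Answer("Unclear situation.", 0.40)

    CONFIDENCE_FLOOR = 0.8  # assumed threshold; tune per deployment

    def handle(task: str) -> str:
        answer = ask_model(task)
        if answer.confidence < CONFIDENCE_FLOOR:
            # Unexpected scenario: hand off instead of acting unreliably.
            return f"ESCALATED to a human: {task!r}"
        return answer.text

    print(handle("customer requests refund"))   # handled automatically
    print(handle("customer sent a live goat"))  # escalated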
Comment by lukax 11 hours ago
Comment by delichon 1 day ago
The idea of jaggedicity seems useful to advancing epistemology. If we could identify the domains that have useful data that we fail to extract, we could fill those holes and eventually become a general intelligence ourselves. The task may be as hard as making a list of your blind spots. But now we have an alien intelligence with an outside perspective. While making AI less jagged it might return the favor.
If we keep inventing different kinds of intelligence the sum of the splats may eventually become well rounded.
Comment by visarga 18 hours ago
Comment by mvkel 1 day ago
What is he referring to here? Is nano banana not just an image gen model? Is it because it's an LLM-based one, and not diffusion?
Comment by simonw 23 hours ago
Give it an image of a maze, it can output that same image with the maze completed (maybe).
There's a fantastic article about that for image-to-video models here: https://video-zero-shot.github.io/
> We demonstrate that Veo 3 can zero-shot solve a broad variety of tasks it wasn't explicitly trained for: segmenting objects, detecting edges, editing images, understanding physical properties, recognizing object affordances, simulating tool use, and much more.
Comment by dragonwriter 1 day ago
NB (Gemini 2.5 Flash Image) isn't the first major-vendor LLM-based image gen model, after all; GPT Image 1 was first.
Comment by andai 15 hours ago
Whereas we just got the incremental progress with gpt-5 instead and it was very underwhelming. (Plus like 5 other issues at launch, but that's a separate story ;)
I'm not sure if o4-mini would have made a good default gpt though. (Most use is conversational and its language is very awkward.) So they could have just called it gpt-5 pro or something, and put it on the $20 tier. I don't know.
Comment by karpathy 8 hours ago
Comment by andai 15 hours ago
Comment by nkko 17 hours ago
Comment by dandelionv1bes 15 hours ago
Comment by swyx 1 day ago
Comment by CamperBob2 21 hours ago
Comment by alexgotoi 16 hours ago
Big media agencies that claim to use AI rely on strong creative teams who fine-tune prompts and spend weeks doing so. Even then, they don’t fully trust AI to slice long videos into shorter clips for social media.
Heavy administrative functions like HR or Finance still don’t get approval to expose any of their data to LLMs.
What I’m trying to say is that we are still in the early stages of LLM development, and as promising as this looks, it’s still far from delivering the real value that is often claimed.
Comment by gessha 9 hours ago
It took a long time to computerize businesses and it might take some time to adopt/adapt to LLMs.
Comment by bgwalter 1 day ago
Comment by zingar 1 day ago
Comment by diamond559 16 hours ago
Comment by augment_me 17 hours ago
Sometimes the point of the software is to make an app with 2 buttons for your mom, to make her grocery shopping easier.
Comment by simonw 23 hours ago
Comment by bgwalter 6 hours ago
Comment by distalx 14 hours ago
Comment by skybrian 10 hours ago
Similarly, we’re all talking to ghosts now, which aren’t real, and yet there is something there that we can talk about. There are obvious behavioral differences depending on what persona the LLM is generating text for.
I also like the hint of danger in “talking to ghosts.” It’s difficult to see how a rational adult could be in any danger from just talking, but I believe the news reports that some people who get too deep into it get “possessed.”
Comment by ngruhn 11 hours ago
Comment by dr_dshiv 10 hours ago
Comment by squidbeak 10 hours ago
Comment by metalman 9 hours ago
Comment by ausbah 23 hours ago