How LLMs work
Posted by 0xkato 6 days ago
Comments
Comment by malwrar 4 days ago
This is to say: the autoregressive decoder-only transformer llm architecture as pioneered by openai is wildly simple for how revolutionary its results are. I was reading about non-learned classical SLAM systems (uses video + handcrafted math to produce 3d mappings of physical spaces while also locating the camera in those spaces) at the time, and comparatively speaking I’d say the math is about as complicated as ONE of the components in those complex formulations. The only reason frontier LLMs need 6-figure computers to run is because the model designers made the middle bit in those models REALLY BIG, dimensionally speaking. They just took the steam engine, made a few gargantuan versions of it, and are selling them as the ultimate source of power.
This was openai’s entire breakthrough. Making this particular model architecture larger leads to emergent capabilities like being able to pick the best ending to a story/set of instructions or answer questions about broad factual knowledge. I’ve been meanwhile watching these AI companies attempt, successfully, to sell this capability as some sort of robot consciousness hand-crafted by supergeniuses. The fact that they are getting away with it is almost as shocking to me as the discovery itself.
Comment by ekunazanu 4 days ago
Basically, the bitter lesson: https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson...
Comment by williamstein 3 days ago
Comment by jochembrouwer 3 days ago
Comment by Vachyas 3 days ago
See: https://openai.com/index/chain-of-thought-monitoring/
Quote below:
Chain-of-thought (CoT) reasoning models “think” in natural language understandable by humans. Monitoring their “thinking” has allowed us to detect misbehavior such as subverting tests in coding tasks, deceiving users, or giving up when a problem is too hard.
We believe that CoT monitoring may be one of few tools we will have to oversee superhuman models of the future.
We have further found that directly optimizing the CoT to adhere to specific criteria (e.g. to not think about reward hacking) may boost performance in the short run; however, it does not eliminate all misbehavior and can cause a model to hide its intent. We hope future research will find ways to directly optimize CoTs without this drawback, but until then
We recommend against applying strong optimization pressure directly to the CoTs of frontier reasoning models, leaving CoTs unrestricted for monitoring.
We understand that leaving CoTs unrestricted may make them unfit to be shown to end-users, as they might violate some misuse policies. Still, if one wanted to show policy-compliant CoTs directly to users while avoiding putting strong supervision on them, one could use a separate model, such as a CoT summarizer or sanitizer, to accomplish that.
Comment by cavemandaveman 3 days ago
Comment by xnx 3 days ago
Comment by swyx 3 days ago
> Don’t be distracted by human knowledge, as AI has been historically.
> Instead focus on methods for creating knowledge that scale with computation, like search and learning.
so the lesson is: choose methods that scale with computation, not that blindly scaling up anything (data, params, people, whatever) works. choosing the right x axis and the right scaling laws consistently wins out in the long run, despite short term wins from other methods.
Comment by jfim 4 days ago
The secret sauce though is all the datasets, RL training, knowledge of what works from doing all kinds of ablation experiments, and a massive compute moat.
Comment by gobdovan 4 days ago
Comment by root-parent 3 days ago
Also would love to know if the same Legal team advised on Gemini...
Comment by miltonlost 3 days ago
Comment by someguyiguess 3 days ago
Comment by achrono 4 days ago
Comment by HarHarVeryFunny 3 days ago
There have been minor changes to the architecture over the years, but these are basically all efficiency tweaks such as various types of attention (some pioneered in the open by DeepSeek) that better scale to large context lengths, and the confusingly named "mixture of experts" architecture, but what's more notable really is how little the architecture has changed. The capability gains have been coming from better training and better data.
Comment by gobdovan 4 days ago
- V3 https://arxiv.org/abs/2412.19437
- V2 https://arxiv.org/abs/2405.04434
- R1 https://arxiv.org/abs/2501.12948 (RL applied to ML models was well-known beforehand, but they show it in the open, at scale, on big models)
Then, there's the incentive analysis. If you can see that these models empirically get better with scale, why would you swap the main architecture? Those events will be pretty rare. I'm not saying there's no one cooking a new architecture, just that it is a pretty rare event. And it would have to come from some researchers that would be happy to not publish their findings, which is not really what a sizable portion of elite researchers (obviously not all) are incentivized to do.
Of course, it's a bit of a verbal compression to claim simply 'scaled up'. They are recognisably scaled-up transformers, and most new models come with a few tricks, but we're at the point where those tricks usually are not an architectural rewrite. They are added to solve an explicit problem, like hallucination, not for big new capability gains.
Comment by swyx 3 days ago
cf. the hardware lottery: https://arxiv.org/abs/2009.06489
Comment by matusp 4 days ago
Comment by ai_slop_hater 4 days ago
Comment by otabdeveloper4 4 days ago
Comment by yababa_y 4 days ago
Comment by jmalicki 3 days ago
If you can make the existing model faster, you can then save your inference budget to then make your model bigger, which then makes it smarter.
A lot of how smart the models can be comes down to budget. If you can make your existing thing cheaper, you can instead make it bigger for the same price.
Comment by TheHalfDeafChef 3 days ago
(Not trying to flame bait or anything. I just wouldn’t call LLM as exhibiting intelligence. It is great at making connections based on probability but doesn’t have a semantic understanding of what it is doing)
Comment by stevenhuang 2 days ago
> doesn’t have a semantic understanding of what it is doing
I hope you realize this is an area of open, active research.
Comment by Chu4eeno 2 days ago
In general we (humans) need to be humble about the limitations of our knowledge about how we function, it's an insanely complicated problem.
Comment by jmalicki 1 day ago
We do.
Which is why we shouldn't be assuming we're more than just probability engines, or be assuming we have more consciousness than a neural network.
Comment by otabdeveloper4 3 days ago
There are diminishing returns, and at some point making a model bigger makes it dumber.
Comment by lobocinza 1 day ago
Comment by fizx 3 days ago
Comment by locknitpicker 3 days ago
ReAct loops and tool-calling are the critical development feature. They turn a model from something that generates text into something that can independently influence the world around them.
Without agent features, you have just a chatbot.
Comment by galaxyLogic 3 days ago
It is the combination of LLM and agent-harnesses that make it look really smart. Agent-harness is a programmatic device that lets us tap into the vast knowledge in the LLM.
It is probably true that many TV-commentators fail to appreciate this fact and therefore think LLMs are super-intelligent. No, it is the combination of LLM and the programmatic agent-harness that is the breakthrough.
An interesting thought is that the LLM could in theory code the agent-harness and start it running every time we interact with it. Currently the agent-harness is pretty static, I think. In theory it could be dynamically created for every task. Would that make it better? I don't know.
Comment by locknitpicker 2 days ago
Without ReAct and tool calling, all you have is a chatbot. That's useful, but it's just a chatbot.
ReAct loops and tool calling are what unblock high-value use cases. They enable systems to actually address free-form problem statements, gather data that is not part of their training set, inspect the current state of services, and trigger actions in external systems. This goes well beyond mere chatbots.
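For readers who have not seen one, a ReAct-style loop can be sketched in a few lines. This is only an illustration under stated assumptions: `stub_model`, the `get_status` tool, and the `Action: tool[argument]` text format are all hypothetical stand-ins, not any vendor's API. A real harness would call an actual model and parse its structured tool-call output.

```python
from typing import Callable

# Hypothetical tool registry: the name and behavior are illustrative only.
TOOLS: dict[str, Callable[[str], str]] = {
    "get_status": lambda service: f"{service}: healthy",
}


def stub_model(transcript: list[str]) -> str:
    """Stand-in for an LLM call: request one tool call, then answer."""
    if not any(line.startswith("Observation:") for line in transcript):
        return "Action: get_status[billing]"
    return "Final: billing is healthy"


def react_loop(question: str, model=stub_model, max_steps: int = 5) -> str:
    """Alternate model output (reason/act) with tool results (observe)."""
    transcript = [f"Question: {question}"]
    for _step in range(max_steps):
        reply = model(transcript)
        transcript.append(reply)
        if reply.startswith("Final:"):
            return reply[len("Final:"):].strip()
        # Parse "Action: tool[argument]" and run the named tool.
        name, _sep, rest = reply[len("Action:"):].strip().partition("[")
        observation = TOOLS[name](rest.rstrip("]"))
        # Feed the result back so the next model call can use it.
        transcript.append(f"Observation: {observation}")
    return "gave up"


answer = react_loop("Is the billing service up?")
```

The point of the sketch is the feedback edge: the observation is appended to the transcript, so the next model call sees data that was never in its training set.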
> It is the combination of LLM and agent-harnesses that make it look really smart.
It's really not about "smart". It's about autonomous systems, and being able to consume and analyze new data, and trigger actions in external systems.
Comment by Chu4eeno 2 days ago
And I remember talking about goal directed behavior (which what people are calling "agents" now don't seem to properly have) and autonomous operation decades ago in the intelligent agent course at uni, including react loops.
So no, the huge step with LLMs really was just that attention mechanism from that translation paper everyone forgot until Google brought its marketing to it, everything else is either just optimization/scaling, more money or old ideas suddenly relevant.
Comment by locknitpicker 1 day ago
I completely disagree. The rollout of agentic tools, and even support for agent mode in IDEs, is the whole value proposition of AI code assistant services.
Otherwise you'd just have a glorified search engine in a chat window.
> (...) it's a fairly obvious step once you get something that can operate iteratively and largely independent,
There's some confusion in your reply. ReAct loops are exactly what this "operate iteratively and largely independently" represents.
Comment by forestsitter 3 days ago
Comment by antirez 4 days ago
Comment by dist-epoch 3 days ago
You can go another step: an FFN can be simulated on a Turing machine, so it just exemplifies the incredible semantic power of the Turing machine model of computation (in fact you don't even need a Turing machine, since there is no looping in one forward pass).
In theory you can run a huge FFN on the tiniest Turing machine, in practice it's much better to run a Transformer on the latest NVIDIA hardware. Or as they say "quantity (performance) has a quality all its own"
Comment by musebox35 3 days ago
Comment by zbendefy 3 days ago
There is also the case for Markov chains being theoretically able to do these if tuned well. Or even SAT problem.
Comment by CGMthrowaway 3 days ago
Comment by galaxyLogic 3 days ago
Comment by Chu4eeno 2 days ago
Which is incidentally more or less the only thing I remember about Global Workspace Theory (attention facilitating consciousness in a way iirc).
Comment by slickytail 3 days ago
Comment by 10GBps 4 days ago
I have to wonder though. Is this all a human brain is? A similar thing to an LLM just scaled exponentially larger. I mean a brain is not just neurons with simple connections to each other. The neurons, axons, dendrites, <insert_unexplained_thing>, etc in a brain are all holding and processing information in different ways and doing it nearly 100% in parallel. That's a really big model.
The biological discoveries show how complex a biological brain actually is. Even the tiny brains in a bee or spider are able to solve puzzles and use tools. That's crazy.
Comment by ctolsen 4 days ago
Comment by rfv6723 4 days ago
If we look beyond written languages, which are late inventions of human civilization, oral languages are continuous and built from blocks, not words.
Chomskyan school misled the entire field of linguistics for decades by ignoring spoken languages.
Comment by uoaei 4 days ago
As this description is so overly abstract, an exercise for the reader is to try to work through an explanation of how, say, a river delta comes to "learn" about its environment by "reacting" to the influences at its borders, and how it "encodes" whatever it is that it learns in the substrate that it inhabits.
Comment by 0xbadcafebee 3 days ago
Example: a programming language's capability to produce complex software does not come from some inherent quality of language. It comes from binary. 0's and 1's, representing basic logic, and that being built on top of with an abstract "tool" called a language. If the binary logic didn't work, the language wouldn't do anything.
A dolphin can make sounds, and technically has a language, but they can't manipulate or recursively compound concepts (as far as we can tell) in order to create modified ideas. If they could, they probably would have come up with vastly more advanced fishing methods than the (admittedly novel) ones they have now.
Comment by rfv6723 3 days ago
Comment by zaphirplane 3 days ago
Comment by otabdeveloper4 4 days ago
No, it's not. There are many animals that have extremely complex and even learned behaviour that have literally zero neurons.
Clearly "neurons" is an oversimplified just-so story, not a scientific theory.
Comment by adammarples 4 days ago
Comment by formerly_proven 4 days ago
Comment by otabdeveloper4 3 days ago
Comment by redox99 4 days ago
So you're missing a lot of the building blocks that make LLMs. It's not a matter of just having the compute.
Comment by sirsinsalot 3 days ago
Like the best leaps in thinking, once it is made, it is immediately obvious and intuitive.
Comment by bonoboTP 2 days ago
Comment by redox99 3 days ago
Residual connections are so simple, so obvious and so vital. Yet nobody came up with them until 2015?
Comment by sirsinsalot 3 days ago
I think as time went on, and hardware got better, it seemed more reasonable to actually think about a viable implementation of what I think was a widespread intuition anyone in ML had that everything's context is everything.
It just seemed like a theoretical thing until hardware caught up. Maybe. Perhaps I'm applying a retrospective excuse to why it took so long.
Comment by redox99 3 days ago
I don't think it was intuitive to anyone back then, the vanishing gradient problem was a big deal since the dawn of NNs. I'm not sure what you mean by sheer computation, residuals allow you to have deep networks instead of shallow and wide ones. You can have equivalent parameter count.
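A worked toy example of the vanishing-gradient point, under a deliberately crude assumption: each layer is scalar with the same small local derivative `d`. By the chain rule, a plain stack multiplies the `d` values together, while a residual layer `y = x + f(x)` contributes `1 + d` instead. The numbers (`0.01`, depth 20) are illustrative, not taken from any paper.

```python
def chain_gradient(local_derivs, residual):
    """Gradient of the output w.r.t. the input through a stack of scalar layers.

    Plain layer    y = f(x):      contributes f'(x)
    Residual layer y = x + f(x):  contributes 1 + f'(x)
    """
    grad = 1.0
    for d in local_derivs:
        grad *= (1.0 + d) if residual else d
    return grad


depth = 20
local = [0.01] * depth  # assumed small per-layer derivative

plain = chain_gradient(local, residual=False)  # 0.01 ** 20, effectively zero
skip = chain_gradient(local, residual=True)    # 1.01 ** 20, stays of order 1

print(f"plain stack:    {plain:.3e}")
print(f"residual stack: {skip:.3f}")
```

The identity path keeps a usable gradient signal at any depth, which is why the same parameter count can be spent on a deep network rather than a shallow, wide one.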
Comment by bonoboTP 4 days ago
Comment by spacebacon 4 days ago
Comment by foxes 4 days ago
Comment by crossroadsguy 4 days ago
(If I can be honest, and I am not being disparaging about anything lest it might seem so, I am looking at it from a career breakthrough/move perspective rather than an intellectual pursuit.)
Comment by 2muchcoffeeman 3 days ago
If you want to be a researcher and come out with the next breakthrough, get ready to go back to school and learn some math.
If you just need to learn how to use it well and build things with it, then you probably just need to have a high level understanding.
Same as programming. I’d bet most programmers have no idea about the physics that makes computers work.
Comment by bluerooibos 3 days ago
What about improving the efficiency of token consumption, etc., basically opportunities for improving cost/performance?
I keep thinking there has to be a better way to share context with models than dumping entire gigantic skill files of raw text or otherwise into them - I'm betting there's a bunch of low-hanging fruit there.
Comment by coliveira 3 days ago
Comment by la_fayette 3 days ago
Comment by aplomb1026 3 days ago
Comment by sirsinsalot 3 days ago
Which sums up HN these days.
Comment by malwrar 3 days ago
I have no idea about careers at this point, I’m still doing fancy IT work as my day job and I look away from the future with dread. I also haven’t been looking for new roles on the open job market, so who knows, maybe there are multimillion pay packages for anyone who can articulate how attention works in an interview.
Comment by LatencyKills 3 days ago
https://www.amazon.com/Build-Large-Language-Model-Scratch/dp...
https://www.amazon.com/Build-DeepSeek-Scratch-Abhijit-Dandek...
Comment by tinktank 3 days ago
Comment by LatencyKills 2 days ago
I'm currently working on a robotics project that uses Nvidia's GR00T N1 model, and I was able to understand the research paper. [0]
Comment by tinktank 2 days ago
Comment by wuschel 4 days ago
Comment by sigmoid10 4 days ago
Comment by redox99 4 days ago
MoE was also pretty straightforward, just a bit surprising how well it worked (that you can get away with just 1/32 active parameters), but most researchers would have come up with it on their own probably.
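A minimal sketch of the routing idea, assuming 32 toy experts with top-1 selection so that one expert in 32 runs per token. The experts here are trivial scalar functions and the router logits are hard-coded; in a real model both are learned, and renormalizing the gates over only the selected experts is one common choice rather than the only one.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, router_logits, experts, k=1):
    """Run only the k highest-scoring experts and mix their outputs."""
    top = sorted(range(len(experts)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Gate weights are renormalized over the selected experts only.
    gates = softmax([router_logits[i] for i in top])
    output = sum(g * experts[i](x) for g, i in zip(gates, top))
    return output, top


# 32 toy experts; expert i just multiplies its input by i.
experts = [lambda x, scale=i: x * scale for i in range(32)]

# Hard-coded router scores that strongly prefer expert 7 for this token.
logits = [0.0] * 32
logits[7] = 5.0

y, used = moe_forward(2.0, logits, experts, k=1)
```

Only `experts[7]` is evaluated for this input; the other 31 cost nothing at inference time, which is where the savings come from.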
The true ground breaking papers are the first two you mentioned (transformers and gpt2), and InstructGPT was also very surprising that it worked so well.
Comment by sigmoid10 3 days ago
Comment by blackbear_ 4 days ago
Language Models are Few-Shot Learners https://arxiv.org/abs/2005.14165
I also enjoyed the papers for DeepSeek and GLM for an overview of all the tricks you need to make these things work
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models https://arxiv.org/abs/2512.02556
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models https://arxiv.org/abs/2508.06471
Comment by sharma-arjun 4 days ago
Comment by barrenko 4 days ago
Comment by root-parent 3 days ago
"Beating Nyquist with Compressed Sensing" - https://youtu.be/A8W1I3mtjp8
Comment by dominotw 3 days ago
how did you know what the steps were and that there was math involved? i am curious about your process and how you came up with exactly what to learn to unravel the mystery.
Comment by GardenLetter27 4 days ago
Comment by darksim905 4 days ago
Tangentially related: This part always seemed fuzzy to me, especially when dealing with data scientists and how they talk about how 'ML' looks at problems. I had this issue when working at a SIEM vendor where they kept going on about use case development having to be designed a certain way to catch things. It was all very frustrating.
Comment by cloche 3 days ago
Did you mean to link to the video? I would be interested.
Comment by sesm 3 days ago
Comment by coliveira 3 days ago
Comment by giardini 3 days ago
Comment by bluerooibos 3 days ago
Comment by LatencyKills 3 days ago
I don't think there is anything in a transformer I couldn't explain in the smallest detail now.
[0]: https://www.amazon.com/Build-Large-Language-Model-Scratch/dp...
[1]: https://www.amazon.com/Build-DeepSeek-Scratch-Abhijit-Dandek...
Comment by hackinthebochs 3 days ago
If you're up for it I would love to know how and why positional encodings work
Comment by LatencyKills 3 days ago
A vanilla self-attention layer treats its input as just a set of token vectors. Without positional info, swapping two tokens only swaps the corresponding outputs, so attention cannot tell one ordering from another. We can "fix" this problem by using positional encodings. Text that has meaning isn't just a set of characters; the location and order of those characters is what provides meaning.
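The order-blindness is easy to check numerically. The sketch below is a stripped-down, unmasked single-head attention with Q = K = V = input and no learned weights (a simplifying assumption; real layers have projection matrices, and a causal mask would itself leak some order information). Swapping two input tokens just swaps the matching output rows.

```python
import math


def self_attention(xs):
    """Unmasked single-head attention with Q = K = V = input."""
    dim = len(xs[0])
    out = []
    for q in xs:
        # Scaled dot-product score of this query against every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in xs]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        # Output is the softmax-weighted average of the value vectors.
        out.append([
            sum(w / total * v[j] for w, v in zip(weights, xs))
            for j in range(dim)
        ])
    return out


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
swapped = [tokens[1], tokens[0], tokens[2]]  # swap the first two tokens

a = self_attention(tokens)
b = self_attention(swapped)
# Up to floating-point rounding, b[0] matches a[1], b[1] matches a[0],
# and b[2] matches a[2]: the layer never saw the order.
```

Positional encodings break this symmetry by making each token's vector (or its query and key) depend on where it sits in the sequence.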
Comment by root-parent 3 days ago
Comment by malwrar 3 days ago
I picked it up from trying to teach myself that SLAM stuff. The papers are very short, but highly information dense and at the time there was no ChatGPT to help me. I got through them by just creeping my way through the math with a whiteboard, and something about drawing it out and having it there in my office made it all click. Trying to watch piecemeal lectures on YouTube or grind through foundational books like MVG just didn’t work for me, I used them instead as references for my drawings.
Same happened when I tried learning this GPT stuff. karpathy’s videos were out at the time, but I couldn’t really stay focused on them or connect the math with the code. Most other descriptions I could find were focused on getting you to use their inference library or harness. Assembling the picture together on my whiteboard by focusing on drawing out the block diagram continues to be my personal favorite method for deep understanding of complex systems.
Comment by Gmolomo 4 days ago
It doesn't have any impact?
Ah wait it does. Mh weird.
Why are you not creating a startup and getting rich?
Comment by sarjann 4 days ago
Comment by golergka 4 days ago
Comment by dist-epoch 3 days ago
Einstein special relativity is taught these days in high-schools. Doesn't mean it wasn't the very hard part at some point in time.
As they say, shoulders of giants.
Comment by pkoird 4 days ago
Comment by faurroar 4 days ago
Comment by jumploops 4 days ago
We still don’t really know why they work, we just know how to build them.
Comment by trollbridge 4 days ago
My next child took a completely different path to language, including skipping all the non-verbal imitations.
And then at some point, you just suddenly can two-way communicate with them when you couldn't before, and then after that, they can engage in reasoning.
Comment by jumploops 4 days ago
It’s interesting to me how similar attempting to understand LLMs is to neuroscience.
“When we turn this bit off, this other thing happens… if we change these weights the Eiffel Tower is now in Rome”
We’re basically just probing around and trying to reverse engineer an emergent system.
To your point, this system may be quite different from model to model (human to human) although some similarities likely occur.
The comment I was responding to tried to belittle the OP’s understanding of transformers, by mentioning that running an LLM at scale is much harder than the simple white board diagram.
My point was simply that we don’t know why they work, and all the extra optimizations aren’t the “thing” that makes it emergent.
Simply scaling the “GPT” is good enough to see it, so the OP’s awe should stand.
(On a side note, what other architectures can we scale to find similar emergent behavior?)
Comment by galaxyLogic 3 days ago
So what is it that we don't understand about why they work? The algorithm? We have the code. Why the specific algorithm makes such good predictions? I see it as a generalization of trying to predict who wins the Kentucky Derby.
Comment by trollbridge 3 days ago
Comment by ai_slop_hater 4 days ago
Comment by baq 4 days ago
Adults are expected to have their world models approximately correct in terms of physical environment so they won’t accidentally kill themselves by falling off a cliff; then there are the social norms which adults are expected to conform to so everyone is kinda predictable to everyone else so adults don’t kill each other too often over food or mates. Understanding of neither is expected from children.
Comment by Izkata 3 days ago
I think they're right that kids (at least in the US) are generally treated as less capable than they are, and it ends up slightly delaying their development.
Comment by ai_slop_hater 4 days ago
Comment by mejutoco 4 days ago
Comment by ai_slop_hater 4 days ago
Comment by skydhash 3 days ago
Comment by beezlewax 4 days ago
Comment by trollbridge 3 days ago
My son is very worried about black holes lately when he learned anything that goes into one can't get out. He's pretty concerned astronauts could get stuck in one some day. So I explained to him that Hawking radiation does actually mean you can eventually get out; it just takes some time.
I didn't think it pertinent to mention spaghettification, the fact anywhere near a black hole will be really hot, or that cosmic censorship means whatever Hawking-radiates from a black hole wouldn't be an astronaut anymore.
It was also fun to hear Hawking speak. He wanted to know if Hawking was a robot. I said no, but he has a robot talk for him. Not quite true, but close enough.
Comment by pmg101 4 days ago
Comment by ai_slop_hater 4 days ago
Comment by skydhash 3 days ago
Comment by slopinthebag 4 days ago
Comment by malwrar 3 days ago
Comment by otabdeveloper4 4 days ago
The "bitter lesson" is that fake-it-till-you-make-it is a valid way of doing knowledge work.
(Or not make it, then people will just claim you're holding the LLM wrong and it's not the AI's fault.)
Comment by throw310822 4 days ago
Statistically most likely in what context, given which preconditions? Because each prompt sequence is unique so the probability of any token following it is unknown.
Comment by skydhash 3 days ago
Comment by throw310822 3 days ago
Comment by skydhash 3 days ago
If you’re talking about matrix multiplication, I can use mathematical rules and axioms to formally prove that the multiplication is correct. For next token prediction, I can prove that the set of tokens is finite and that the next token is always part of that set.
But things like grammar correctness, or semantic consistency over a few sentences are not hardcoded rules in the model. They’re emergent properties, mostly due to the amount and quality of data available for training. Quantization is mostly about how much we can shed without losing a particular emergent property (like dithering or psychoacoustic audio compression).
Comment by perching_aix 3 days ago
You know it perfectly damn well that a typical person's idea of statistics is not some insanely high cardinality stateful prediction, but a "well a coin toss is a 50:50, and a lottery win is a 1:100000000". You also know it perfectly damn well that as a result, people will just think that all the sentences chatbots ever produced to them were then just somewhere in the massive training set, letter by letter. This insinuation is often even explicitly appealed to.
And that picture is outright false. It's a statistical process, yes, so saying that it does what it does by "just doing statistics" is gonna be a generally correct description, but that's not at all inquisitive to how exactly does it do it, nor is it the zinger you think it is. If you did the aforementioned, you'd just get milquetoast nonsense, like you can see in the countless Markov-chain primers. And while the models do have a lot of the training set lossily captured, they do also absolutely generalize (that's how they can do that lossy compression), and you can quite literally find representations of those generalizations in them, and also see them activate.
It's like summarizing how any program works by just saying "well it just manipulates ones and zeroes". Not very informative, is it? Or how programs are written by just programmers sitting in a cushy office, rhythmically pressing keys on a keyboard. Not a very fair or insightful description, which you'll know if you've done any amount of programming in your life on your own. Extends to all other white collar jobs too.
It's also not even true in the most literal sense: models can and absolutely do choose a less than maximally likely next token; that's what the various decoding parameters are for. "Maximally likely next token" further conveniently skips over how that likelihood is established in the first place, i.e. the literal point of the question, going in a cute little circle.
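To make the decoding-parameter point concrete, here is a small sketch of temperature sampling over three made-up logits. The logit values and the seed are arbitrary assumptions; real decoders layer further knobs such as top-k and top-p on top of this.

```python
import math
import random


def next_token_probs(logits, temperature=1.0):
    """Softmax over logits after dividing by the temperature."""
    scaled = [value / temperature for value in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample(logits, temperature, rng):
    """Draw one token index according to the tempered distribution."""
    probs = next_token_probs(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
rng = random.Random(0)

# Greedy decoding always takes the single most likely token.
greedy = max(range(len(logits)), key=lambda i: logits[i])

# Sampling at temperature 1.0 regularly picks the less likely tokens too.
draws = [sample(logits, 1.0, rng) for _ in range(1000)]

# A low temperature pushes almost all probability onto the top token.
sharp = next_token_probs(logits, temperature=0.1)
```

At temperature 1.0 the top token here has a probability of roughly 0.63, so a bit over a third of the draws are not the "most likely next token" at all.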
I'm so over this "stochastic parrot" bullshit.
Comment by stevenhuang 3 days ago
Comment by otabdeveloper4 3 days ago
If you don't believe me, download llama.cpp and see for yourself.
P.S. I write inference backends in C++ every day. The gall of people like you who figured out how to prompt Claude and think they're hot shit now is simply unbelievable.
Comment by stevenhuang 3 days ago
If you don't see why then you have exactly demonstrated my point in how practitioners like you simply lack the foundational understanding in philosophy, information theory, human consciousness, human cognition, neuroscience, necessary to bridge this conceptual gap.
(Rather, it is that we know so little of how consciousness or what intelligence even is, that we cannot possibly use first principles to preclude LLMs from possessing these qualities)
You don't understand the argument, so you keep repeating first order mechanistic observations that are irrelevant. If you don't want to understand the argument, don't be surprised when people refuse to engage with you, especially when it's evident to those more knowledgeable the position you hold is the ignorant one.
Comment by perching_aix 3 days ago
It's equivalent to professing how you just make apple pies from scratch, while your first step is to always reinvent the universe.
You're further magically blind to this operational fact being weaponized as a trope for furthering anti-ai sentiment (i.e. that it's a political dogwhistle at this point), and to thus you participating in that every time you repeat it?
* Ignoring the decoding caveat I already mentioned, along with the countless ways they're steered. There isn't jack that's likely about some of the responses they produce, and intentionally so. Including the whole chat partner act.
Comment by stevenhuang 3 days ago
Safe to say there's a cognitive block and until he tries to approach this topic in good faith he'll simply never understand. Lol.
Comment by perching_aix 3 days ago
I really don't know what's so interesting about auto-complete or next token prediction that it captures these people's attention so much. They're so blatantly not the salient quality to these products that is of interest to the common discourse, it's just baffling.
Comment by klempner 3 days ago
Comment by firemelt 4 days ago
Comment by lowken10 4 days ago
Comment by robwwilliams 3 days ago
Comment by miki123211 3 days ago
When you are writing an essay and realize midway through a sentence that what you've written doesn't make sense, you go back and edit. An LLM can't do that, the only thing it can do is keep on generating. Because training data typically contains full essays and not half-finished sentences which were then edited, LLMs have a strong preference for "saving face" and producing grammatically correct, internally coherent outputs. They will often do so even if the only way to write themselves out of the corner they wrote themselves into is to lie. To maintain internal coherence, they'll then repeat that lie for the rest of the response.
This is also why changing response structure used to affect LLM performance so dramatically. If you asked an LLM to solve a math problem and all-but-forced it to start with the answer, it would have had to calculate that answer before emitting any tokens, something which it very often wasn't able to do. If it was told to follow up the answer with an explanation, it would produce a plausible-sounding explanation to maintain coherence.
If, on the other hand, it was told to start by "thinking step by step", it would often be able to solve the first step, and then the next one given the results of the first, and so on, until it was able to reach the answer. Because the answer came last, it wasn't committing to anything, so had no reason to "save face" and lie.
This part of the problem is basically solved now with reasoning; reasoning is where all the step-by-step stuff happens, even if users aren't always able to see it. In the process of RLVR, models even train themselves into outputting phrases like "let me check my answer once again" in the chain-of-thought; those serve as their "life rafts" which they can use to both save face and change their answer.
Comment by swyx 3 days ago
instead of going left to right, even with a scratchpad, maybe you start with a rough shape of the big picture all at once, and then you iteratively resolve and things come into focus.
mercury (https://www.youtube.com/watch?v=2fDBeMu6xjk) seems to have made the most progress here, which is not saying a ton but is not nothing. i do think it is telling that of the big labs, only GDM has made any meaningful bet on text diffusion. you can bet your ass all of them have evaluated it for a source of alpha.
Comment by chris_money202 3 days ago
Comment by kzrdude 3 days ago
Comment by FergusArgyll 3 days ago
Why can't an llm tool call `delete(index: int)` or `replace(from_index: int, to_index: int, string: str)` and then it can go back and edit just the way we can?
We also first made the mistake and only afterwards noticed we actually want to change something
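As a sketch of what such tools might look like on the harness side, here is a hypothetical output buffer exposing the two calls named above. The class name, the word-level granularity, and the half-open `[from_index, to_index)` range are all assumptions made for illustration; no existing system is being described.

```python
class EditableOutput:
    """Hypothetical output buffer a model could edit through tool calls."""

    def __init__(self) -> None:
        self.tokens: list[str] = []

    def append(self, token: str) -> None:
        # Ordinary left-to-right generation.
        self.tokens.append(token)

    def delete(self, index: int) -> None:
        # Remove the token at `index`.
        del self.tokens[index]

    def replace(self, from_index: int, to_index: int, string: str) -> None:
        # Swap tokens in [from_index, to_index) for the words of `string`.
        self.tokens[from_index:to_index] = string.split()

    def text(self) -> str:
        return " ".join(self.tokens)


out = EditableOutput()
for tok in "the answer is 41".split():
    out.append(tok)

# The model notices the mistake afterwards and issues an edit call.
out.replace(3, 4, "42")
fixed = out.text()
```

Note that the edit call itself is still generated left to right; the buffer only changes what ends up being shown, not how the model produces its tokens.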
Comment by podocarp 2 days ago
Comment by helloplanets 4 days ago
> The intuition: instead of adding position info to each token’s vector, RoPE rotates the vector by an angle that depends on its position
You can't rotate the token's entire vector (or all three vectors, whatever is being implied is unclear). You rotate each token's Query and Key vectors only, so dot product can be used to tell how far apart the tokens are when comparing token 1's Query vector to token 2's Key vector.
Positional embedding should just be explained after explaining the Query, Key and Value vectors. When the article explains those only after that, the reader is building up on a wrong intuition and it gets confusing.
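To make the Query/Key point concrete, here is a toy sketch with 2-D vectors. The vectors, the `theta` value and the helper names are made up for illustration; real RoPE applies this kind of rotation to many pairs of dimensions, each with its own frequency.

```python
import math

def rotate(vec, position, theta=0.1):
    """Rotate a 2-D vector by an angle proportional to its position."""
    x, y = vec
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

query = (1.0, 2.0)   # made-up Query vector
key = (0.5, -1.0)    # made-up Key vector

# Same offset (3 positions apart) at two different absolute positions.
score_near = dot(rotate(query, 5), rotate(key, 2))
score_far = dot(rotate(query, 105), rotate(key, 102))
# A different offset (1 position apart) gives a different score.
score_other = dot(rotate(query, 5), rotate(key, 4))
```

The first two scores agree up to rounding because the dot product of two rotated vectors depends only on the difference of the rotation angles, which is why rotating Query and Key is enough to encode relative position.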
Comment by rhubarbtree 3 days ago
Makes me wonder if the whole thing is just slop.
Is the rest of the article correct?
Anyone suggest an alternative article?
Comment by aaroninsf 3 days ago
Comment by giardini 3 days ago
Comment by alecco 3 days ago
Comment by 10GBps 4 days ago
I've noticed the same thing is possible if you watch the output of a slow LLM. Eventually you start to see the machinery. input tokens = output tokens, it's math. I can't exactly predict the tokens generated but I can see how they are formed. It's a lot like chess. You can't see every possible move but the mechanism is understandable.
Comment by trollbridge 4 days ago
Comment by fragmede 4 days ago
I can only imagine what sort of visualizations are going on today inside of the AI labs.
Comment by helloplanets 4 days ago
Comment by Maledictus 4 days ago
Comment by barrenko 4 days ago
Comment by aabdi 4 days ago
it goes all over the place.
i'm not actually sure who your target audience is.
there's too many side tangents.
just like, structure it plz.
1. customer feels bad cuz they don't understand how llms work
2. provide high level abstracted explanation (don't dive into concepts yet)
3. provide breakdown guide of overall set of components.
4. walk through each component. don't side track. no need to explain, ROPE,GQA etc... it just distracts.
i.e. customers don't know how llms work, leading them to feel bad about their own intelligence.
at a high level llms take in words, do some math on them, and then produce words, one by one.
inside llms have these different components. we walk through them step by step.
1. tokenizer
2. embedding
3. attention
4. heads
5. ffn
6. sampling
## tokenizer
Comment by barrenko 4 days ago
Comment by LearnYouALisp 2 days ago
Comment by yukIttEft 4 days ago
But how does it learn this token-relationship?
All it has is many text samples, but still, nowhere it says how the tokens relate to each other, so where does this information come from?
Comment by HarHarVeryFunny 3 days ago
The model could just as well learn to predict next token from gibberish text as long as there were some statistical gibberish regularities to learn. However, if you train it on real meaningful text then the statistical regularities it needs to learn (and will, thanks to gradient descent, and the capable architecture) will be those reflecting "token relationships" - grammar, semantics, etc.
So, you can say the "token relationships" (incl word meanings) are reflected in the statistical regularities of the training data, and the model architecture and training algorithm are just very capable of learning those regularities whatever they may be.
You can consider it related to Word2Vec word embeddings, which are based on the idea that the meaning of words comes from how they are used, which to a first approximation can be implemented by considering the meaning of words to be defined by the words they appear next to(!), which is what the Word2Vec embedding training algorithm does, and famous examples such as "(king - man) + woman = queen" prove that this is in fact learning the meanings of words.
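As a rough illustration of "statistical regularities", here is about the simplest possible next-token predictor: a table of which token follows which. The tiny corpus is made up, and a transformer learns far richer structure than this, but the training signal is the same kind of thing.

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the cat ate the fish .".split()

# Count how often each token follows each other token.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(token):
    """Return the continuation seen most often after `token`."""
    return follows[token].most_common(1)[0][0]

prediction = predict_next("the")
```

Nothing in the corpus says that "cat" relates to "the"; the relationship falls out of the counts.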
Comment by inkysigma 4 days ago
Comment by dist-epoch 3 days ago
It's the same thing here, you randomly try various token-relationship values and the ones which are slightly better will be favoured.
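A toy sketch of that trial-and-keep-the-better idea, assuming a made-up target and a plain squared-error loss. Real LLM training uses gradient descent to pick the direction of each change rather than trying changes at random, so treat this only as an illustration of the selection pressure.

```python
import random

rng = random.Random(0)
target = [0.2, -0.5, 0.9]   # made-up "ideal" values, for illustration only

def loss(weights):
    # Squared distance from the target: lower is better.
    return sum((w - t) ** 2 for w, t in zip(weights, target))

weights = [0.0, 0.0, 0.0]
initial_loss = loss(weights)

for _ in range(2000):
    # Try a small random change and keep it only if it is no worse.
    candidate = [w + rng.gauss(0, 0.05) for w in weights]
    if loss(candidate) <= loss(weights):
        weights = candidate

final_loss = loss(weights)
```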
Comment by MagicMoonlight 3 days ago
Comment by andai 4 days ago
Comment by vocram 4 days ago
Comment by hiworld6543 3 days ago
Also, the author’s other public writings have similar errors/choices in style. When “consumer AI” writes or rewrites, it’s impossible for it to mimic one’s writing style so similarly. Literally impossible, because it can’t disregard everything else that it “knows” (voluminous training, guardrails, interface design, social boundaries) for it to 1. Disregard all that training after processing the user’s prompt 2. output a complete article in the user’s style 3. Turn back on its knowledge 4. then continue to function. That’s just not how consumer products work.
Comment by Ampersander 4 days ago
Comment by bspammer 4 days ago
Comment by lateral_cloud 4 days ago
Comment by Mali- 3 days ago
Saying an article is of inferior quality just because editing was done by my 13 year old nephew is like saying a book is lower quality just because it was printed rather than written by hand
Obviously, this doesn't work because the content is changing. Printing a piece vs writing it by hand doesn't change the content, only the medium of transmission.
Comment by possibleworlds 3 days ago
Comment by Laurel1234 4 days ago
I hope you do some introspection and start consciously recognizing that the human input and the clanker slop is just debasing it.
Comment by avazhi 3 days ago
Comment by janalsncm 4 days ago
There’s good AI writing and bad organic writing. But it’s easier to point out a few LLM-isms than to actually identify the problems with text.
Comment by blharr 3 days ago
Sure, but the LLM-isms in AI writing are mentally exhausting to see in every way at this point.
The whole point of reading, frankly, is to understand the voice of other people. When you pass that through a distorted filter that makes everyone sound the same... it's bad, lossy, frustrating communication.
It's also dishonest to publish something that is direct output without your own wording. Digital catfishing at best.
The only good AI writing is providing the prompt, because the question is way more interesting, and way more constructive to learning than the answer
Comment by janalsncm 3 days ago
Comment by wj 3 days ago
I think gp may want to know if a <person> has an interesting idea rather than <person + llm>.
Comment by janalsncm 1 day ago
In other words, since the idea generation component can be completely independent from the writing component, what you’re asking is not possible in practice.
Comment by AltruisticGapHN 4 days ago
I'm a developer but not very good at maths and I still don't understand any of it.
An LLM clearly has some "visual" capacity. You ask Gemini to build something with Canvas and it's able to reason about the shape of things. Like recently I wanted a checkbox that has a gradient flowing around the edge. It figured out it could use a radial gradient from the center of the checkbox, and overlay that with a small inner div so you only see the edge, which looks like the gradient is circling around the checkbox.
How is that "predicting the next word"?
Not saying AI is intelligent or conscious or anything like that, but the algorithm clearly is far more complex than "predicting words".
What I mean is that the LLM is able to represent things in space. That part I don't understand.
I also still don't understand the relationship between the chat-based LLM and the multi-modal stuff. I think I read somewhere that when an image is generated it is also tokens?
Comment by dev_hugepages 4 days ago
Comment by Borealid 4 days ago
At all times the LLM is, indeed, predicting the next token. Anything it does emerges from that.
It did not "figure anything out". It predicted that text describing the use of a radial gradient was likely to follow text describing your problem.
Comment by hackinthebochs 3 days ago
The point is that saying they're just "predicting the next token" is not at all explanatory nor providing insight. Saying the brain is just firing action potentials gives you no understanding about how the brain does what it does or what the space of its capabilities are. Similarly, predicting the next token tells you nothing about the capabilities of LLMs.
Comment by galaxyLogic 3 days ago
Then the next question becomes "HOW do they predict the next token? There are many ways that can be done, so why is this particular algorithm so GOOD?"
When people say "We don't understand how LLMs work", are they really saying we don't understand how this specific algorithm used to predict the next token works? No, because "we" do understand how all those algorithms work; there are many descriptions of them available.
So the question then really is "Why is the prediction this algorithm makes, so good, as compared to some other statistical algorithms?"
It's not about "Why does AI work so well?". It should be "Why does this particular XYZ algorithm work so well?"
Comment by podocarp 2 days ago
Comment by Borealid 3 days ago
The capability of the LLM is not to reason, it's to generate text that matches the patterns seen in the training corpus. It's possible that all you need to "reason" is plausible text generation. I'm not saying it's not. But nothing the LLM does fails to be explained by plausible-text-generation.
I contend that the best way to understand an LLM's capabilities is to understand the nature of the probability distribution that produced it. For instance, why does an "angry" prompt tend to produce more help than a "polite" one? Trying to explain that in terms of emotions or reasoning doesn't make sense, but it's readily possible to explain through the connections between text in the training corpus...
Comment by hackinthebochs 3 days ago
But we can simply note that this description applies to any machine learning algorithm. Yet LLMs are lightyears better than, say, Markov chains. What people are after is something that elucidates the features of LLMs that allow them to be so productive over what came before.
Comment by Borealid 2 days ago
In other words, a Markov chain and a Transformer model are exactly equivalent in power (there is NOTHING that can be done with one and not the other). The Transformer model is just better pretrained and a more efficient compression/generation.
Comment by hackinthebochs 2 days ago
Nonsense. Markov chains treat the past context as a single unit, an N-tuple with no internal structure. LLMs leverage the internal structure of the context which allows a large class of generalization that Markov chains necessarily miss.
Comment by Borealid 2 days ago
Both are a lookup table whose key is the entire context window and whose value is a probability distribution for what the next token should be.
You can say the choice of probability distribution in the value is "leveraging the internal structure of the context" or not, but the same tokens in two different orders are two different lookup keys and saying it's impossible to achieve some result with a Markov chain is factually incorrect.
https://arxiv.org/pdf/2410.02724 describes the equivalence formally.
Comment by hackinthebochs 2 days ago
>but the same tokens in two different orders are two different lookup keys
This is necessarily true for Markov chains and not necessarily true for Transformers. Transformers learn invariance over certain kinds of semantically irrelevant transformations. The Markov chain simply has to learn each input variant independently, resulting in an explosion of state space and data requirements compared to the functionally equivalent transformer. Expressive power matters.
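A back-of-the-envelope count of that state-space explosion. The vocabulary size and the (tiny) context length are assumptions picked for illustration, not figures from any particular model.

```python
import math

vocab_size = 50_000      # assumed vocabulary size
context_length = 8       # assumed (tiny) context window, in tokens

# A lookup table needs one row per distinct ordered context.
table_rows = vocab_size ** context_length

# Orderings alone: 8 distinct tokens can be arranged this many ways,
# and each arrangement is a separate key in the table.
orderings = math.factorial(context_length)
```

Even at 8 tokens the table has about 3.9 x 10^37 rows, and every row has to be learned independently.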
I really don't get people's love for saying X is "just" Y (it's just a Markov chain, it's just a Kernel method). It's a strange pathology to focus on the superficial similarity while downplaying the boost in expressive power from where the models diverge.
Comment by Borealid 21 hours ago
Do you have some concrete example of a transformer that cannot be represented as a mapping from inputs to probability distribution of outputs?
I say they're equivalent because it is possible to losslessly convert one to the other by wasting massive amounts of disk space and time.
As a second example proving the point, imagine you sampled a transformer's output for a certain context 85 trillion times, and put the output token frequencies in a table. Repeat for all possible inputs (of which there are a finite number). Then you built literally a hash map looking up the context and spitting out the distribution. That certainly is NOT a transformer any more (it's a hash map!!!), but the output approaches indistinguishability as the sample count increases - if the transformer is reasoning, so is the hash map built from it.
I'm not talking hot air here, they really are provably equivalent because a 1:1, onto mapping exists.
For the record, "X is more expressive than Y" means "there exists at least one thing that Y cannot represent and X can". Nothing to do with size or time.
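A small sketch of the sampling construction, with a stand-in for the transformer. `toy_model`, its 70/30 distribution and the single context are all made up; the point is only that the table's frequencies approach the model's distribution as the sample count grows.

```python
import random
from collections import Counter

rng = random.Random(0)

def toy_model(context):
    """Stand-in for a model: samples a next token for a given context."""
    if context == ("the", "sky", "is"):
        return rng.choices(["blue", "red"], weights=[0.7, 0.3])[0]
    return "<unk>"

def build_table(contexts, samples):
    # Sample the model repeatedly and store relative frequencies per context.
    table = {}
    for ctx in contexts:
        counts = Counter(toy_model(ctx) for _ in range(samples))
        table[ctx] = {tok: n / samples for tok, n in counts.items()}
    return table

table = build_table([("the", "sky", "is")], samples=20_000)
blue_freq = table[("the", "sky", "is")]["blue"]
```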
Comment by layla5alive 3 days ago
Comment by Borealid 3 days ago
If you train the LLM on a corpus that shows people saying the sky is red, you get an LLM that is predisposed to say the sky is red. This is true even if it's also trained on all of the science that explains how and why the sky is blue.
If it were to "figure out" or "reason", it would not have such a predisposition to emit "red" after "the sky is" just because that matches the reward during training.
In other words, the token prediction is important because it both explains the successes AND the failures of the LLM. If there were situations in which a bird could fail to fly, then how it tried to fly would also be crucial knowledge.
Comment by layla5alive 3 days ago
You're caught up on the mechanics of token processing (floating point matrix ALU math) and ignoring the context that p(next token) as a function being "computed" is doing so over a trillion parameters. You can poorly train a model, sure, but assuming you don't indoctrinate it too much, properties like cognition emerge - it learns to reason; why? Reasoning is more efficient and compact than memorizing answers.
Comment by Borealid 3 days ago
I'm not trying to argue a model cannot "reason" or have "cognition", whatever those things are. I'm only saying that it's absolutely the case that whatever those things are, they come from its mechanism of predicting one token at a time ad infinitum, and that throwing away a deep understanding in favor of a shallow one is foolish. Just because it might seem to be "reasoning" does not mean it IS doing so, and certainly giving the appearance of reasoning does not mean it is NOT a token predictor.
If I knew deeply how the human brain works I would use that understanding instead of saying things like "this person reasons" or "this person thinks".
In summary, I'm not "caught up in" anything - I'm just trying to point out that the original poster here is incorrect in saying that clearly LLMs aren't working through token prediction. They are, and all their behavior is 100% explained by token prediction. That's more than enough for interesting behavior!
Comment by stevenhuang 3 days ago
Comment by podocarp 2 days ago
Comment by qsera 3 days ago
Comment by antran22 4 days ago
Multi-modal models that can understand visual input do exist, but no such visual reasoning process happened in the example you mentioned. Not unless you have a visual feedback loop in the coding harness.
I'm not dismissing the capability of "predicting the next word" however. The vast amount of training data enables the extremely complex and useful behavior you just described.
Comment by 360MustangScope 3 days ago
For instance, I’ve written a few custom languages to learn how to write a VM and the lexer/parser/compiler/etc. The model had never seen these languages before, simply because I made them and it had never been trained on them, and the syntax I gave it is different from anything it had seen.
After giving it my documentation, it was able to write the language just like a language that it had been trained on. I’ve also seen this behavior at work where there are weird quirks to do things and definitely not standard and it can handle it.
Comment by qsera 3 days ago
But I think it will have difficulty in crossing paradigm boundaries, by simply using documentation.
Comment by skydhash 3 days ago
The exact syntax does not matter, only the grammar. If you give it the grammar, and then the keywords, it can find something that has similar grammar and then use your keywords.
Comment by YeGoblynQueenne 3 days ago
As a for instance, back in the day some academics wrote a paper that compared GPT 3.5 to a couple of inductive programming systems (including one of mine) on solving programming problems in a certain well-known esoteric language which I shall call "L". The task was to solve those programming problems one-shot. The authors asserted that the "L" problem sets were unlikely to be in 3.5's training set, but I found them without much search in a public github repo. I mean the entire dataset was right there. In this case the researchers are colleagues and friends and I know they weren't simply negligent or malicious, they just missed the fact that their "unlikely to be in the training set" data was on the web.
So I'd always assume that if an LLM can perform a task that's because it's seen examples of the task during its training.
Without forgetting that LLMs have this really shockingly powerful ability to interpolate between examples and they can improve their performance on say Task A by training on Task B, where A and B are different but similar.
e.g. they seem to get better at translating between language pairs of which they have few examples of parallel text by training on other pairs of languages for which they have more parallel text; they seem to learn something about language translation in general by training on more examples of translation. I haven't got a good reference on that handy but it's well-known (and of course over-hyped and exaggerated by tech CEOs).
So without wanting to diminish your work, I'd guess that your new language's syntax is different and novel but everything else about it is more ordinary and the similarities are such that an LLM can wing it and write you a lexer etc. After all, the whole point about parser generators and similar tools is that the task can be abstracted and separated from syntax in the first place.
In fact LLMs are very good at that sort of thing, filling in the blanks as it were. I'm old enough to remember the excitement about GPT 3.5 being able to form syntactically correct sentences with nonsensical words given to it.
For example, I just asked Chat [1]:
Hey chat. The gostak distims the doshes. What happens to the doshes?
And it promptly answered: The doshes get distimmed.
See, it even got the spelling right!
_________________
[1] https://chatgpt.com/c/6a242b65-e248-83ed-9a6e-f238a1e871b6
Comment by layla5alive 3 days ago
Comment by layla5alive 3 days ago
Comment by Marha01 3 days ago
Emergent properties of complex systems should not be diminished just because the underlying operating principle is simple.
Comment by layla5alive 3 days ago
All of life arises (maybe) from very simple subatomic particles, and at each stage you can repeat this refrain, complexity increasing as you stack.
Comment by podocarp 2 days ago
Comment by galaxyLogic 3 days ago
Comment by nchie 4 days ago
Comment by raincole 3 days ago
Why do you think this is mutually exclusive to "LLM predicts the next token"?
If you tell someone from the 19th century that bytes (just 0s and 1s!) can represent an opera, a song, or even a whole interactive experience, they might be really confused. But there is no reason they can't.
If you tell someone without a math background that the sums of smaller and smaller sine waves can represent pretty much anything in our universe, they might be really confused. But there is no reason they can't.
There is simply no reason that a next-token predictor can't generate a nice-looking checkbox.
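For the sine-wave point, a quick sketch using the standard Fourier series of a unit square wave (odd harmonics with amplitudes 1/n). The function name and the number of terms are arbitrary choices for illustration.

```python
import math

def square_wave_partial_sum(x, terms):
    """Sum the first `terms` odd sine harmonics of a unit square wave."""
    total = 0.0
    for k in range(terms):
        n = 2 * k + 1
        total += math.sin(n * x) / n
    return 4.0 / math.pi * total

x = math.pi / 2          # middle of the positive half-cycle, where the wave is 1
rough = square_wave_partial_sum(x, terms=1)
better = square_wave_partial_sum(x, terms=200)
```

One sine wave overshoots to about 1.27; two hundred of them land within a fraction of a percent of the flat value 1.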
Comment by layla5alive 3 days ago
And sure, it does, but the person you're replying to was trying to understand why it also seems to reason about the query to give an answer consistent with it, despite not being trained on that query or answer. Your answer seems to imply that it's just another slick complex encoding.
But the emergent property of trillions of digital neurons predicting the next token is that in the process of being trained to do so, they can also learn to reason.
At some scale, it is efficient to encode cognition which is capable of mimicking the cognition which generated the input tokens.
Comment by locallost 3 days ago
Comment by throw310822 4 days ago
Comment by otabdeveloper4 3 days ago
No, they generate grammatically coherent text. That is because human language grammars are fundamentally mathematical structures that can be approximated with matrix operations.
They don't generate meaningful text because they have no inherent knowledge of the world.
If you've used LLMs for any amount of time you've already noticed how often they get confused about numeric quantities - like confusing notions of "bigger than" and "less than" or being unable to count letters in words.
This is because any meaning in their output is only accidental.
Comment by qsera 3 days ago
It is imitating the text written by humans who can represent things in space.
Comment by YeGoblynQueenne 3 days ago
If I can do my best to answer, Gemini is a multi-modal system. That means it's trained not only on text but also still images, video and also sound. The training happens in parallel and the representation of each modality is usually different, so the image recognition part is not trained on text tokens but pixels, the video part (probably) on video frames etc. There is some kind of integrated training that goes on so that text can be generated that is correlated to an image and so on, but I don't know the specifics about Gemini in particular. This kind of thing is not exactly new either, you can find systems that captioned images before the rise of LLMs simply by training on examples of images coupled to their textual descriptions.
In that sense it's not entirely correct to call Gemini an "LLM" because it's not only a "language" (or, more precisely, text) model. But LLM I guess becomes a bit of a shorthand for everything based on, or combined with, an LLM.
Anyway that's what's going on: it's not just predicting the next word. It's also predicting the next image frame or the next set of pixels etc associated with the next word.
Comment by MagicMoonlight 3 days ago
It has read all of stackoverflow, so it has seen your kind of problem before. Try asking it something really unusual and it will shit the bed.
Comment by mjmsmith 3 days ago
Can stochastic parrots understand irony?
Comment by Ampersander 4 days ago
Comment by otabdeveloper4 3 days ago
int n_tokens = 0;
while (n_tokens < TOKENS_MAX) {
    /* pseudocode: context and position are assumed to be set up earlier */
    int next_token = decode(context, ++position);
    print(token_to_text(next_token));
    ++n_tokens;
}
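A runnable toy version of the same loop, for anyone who wants to poke at it. The `NEXT` table is a hypothetical stand-in for the model, and unlike the pseudocode this version appends each output to the context explicitly so the feedback is visible.

```python
TOKENS_MAX = 5

# Hypothetical stand-in for the model: maps the last token to the next one.
NEXT = {"<s>": "the", "the": "cat", "cat": "sat", "sat": "down", "down": "."}

def decode(context):
    """Return the next token given everything generated so far."""
    return NEXT.get(context[-1], ".")

context = ["<s>"]
n_tokens = 0
while n_tokens < TOKENS_MAX:
    next_token = decode(context)
    context.append(next_token)   # the output is fed back in as input
    n_tokens += 1

generated = " ".join(context[1:])
```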
If you don't believe me then just download llama.cpp and see for yourself.
Comment by layla5alive 3 days ago
Now, take that loop, and replace the implementation of decode(context, ++position) with one that passes the context to a human who is bored enough to play along, using a notebook to organize their thoughts and translate them to/from this encoding (you might write a helper function to do this for the human in the front-end of the new decode() impl, but the data flow in and out of decode() will remain the same):
decode(context, position)
{
    cached string_answer = ask_human_question_via_context(context);
    return decode_human_answer_to_tokens(cached string_answer, position);
}
Is the output you get not thinking anymore because it passed through this harness? Did the human's mind somehow get reduced to mere interpolation?
The human mind is still a human mind. Putting a simple harness in front of a mind does not affect its fundamental properties.
In an LLM, decode() is calling into a trillion parameter connectome.
Comment by otabdeveloper4 3 days ago
Deal with it.
Comment by layla5alive 3 days ago
Comment by podocarp 2 days ago
Comment by layla5alive 1 day ago
"If this is flight, it's really boring. We can't even build a mechanical sparrow that can lay eggs and catch flies. You're telling me we can't run sparrow.exe but we've created flight?"
We didn't build something that flies by flapping its wings until 2010. We'd been building functional airplanes for more than 100 years before we were able to build something that worked in a (more primitive, but) similar way to how a sparrow flies.
I'm sorry that modern machine intelligence is so boring to you.
It isn't boring to me, I'm fascinated both by the ways I'm still far more capable than trillion parameter LLMs, and also by the ways they are already far more capable than I am.
FWIW, while I am not bored by nascent machine intelligence, I am bored by predictable human reactions to it: greed, exploitation, hubris, etc.
Comment by otabdeveloper4 2 days ago
Comment by layla5alive 2 days ago
Comment by rramadass 3 days ago
Comment by enrich717 2 days ago
Comment by oceansky 3 days ago
But apparently, they either just emit a [UNK] token or translate the unrecognized character into raw UTF-8 bytes.
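A toy sketch of those two strategies, assuming a made-up three-character vocabulary. The `<0xNN>` spelling for byte tokens is just a notation chosen here, and real tokenizers work on subword pieces rather than single characters.

```python
VOCAB = {"h": 0, "i": 1, " ": 2}   # made-up character-level vocabulary
UNK = "[UNK]"

def encode_with_unk(text):
    # Strategy 1: anything outside the vocabulary becomes a single [UNK].
    return [ch if ch in VOCAB else UNK for ch in text]

def encode_with_byte_fallback(text):
    # Strategy 2: anything outside the vocabulary becomes its UTF-8 bytes,
    # written here as <0xNN> pseudo-tokens.
    out = []
    for ch in text:
        if ch in VOCAB:
            out.append(ch)
        else:
            out.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return out

sample = "hi \u00e9"   # "hi" followed by an accented e (U+00E9)
unk_tokens = encode_with_unk(sample)
byte_tokens = encode_with_byte_fallback(sample)
```

The byte fallback loses nothing (the original character can be rebuilt from the bytes), whereas `[UNK]` throws the character away.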
Comment by cubefox 4 days ago
Comment by lateral_cloud 4 days ago
Comment by alansaber 3 days ago
Comment by melvinroest 4 days ago
Comment by disgruntledphd2 4 days ago
Comment by whyage 3 days ago
Comment by lhd1 4 days ago
Comment by blackoil 4 days ago
Comment by dialsMavis 4 days ago
I imagine if resources were spent writing this text then one benefit of using it is not using more resources or the pollution caused from a chatbot.
Comment by zemo 4 days ago
> Researchers have found that some neurons inside the FFN are strongly associated with specific concepts or facts. One neuron might activate strongly on Eiffel-Tower-related text. Another on programming languages. Another on past-tense verbs.
People don't really write like this and they don't really talk like this (and no, people don't necessarily write exactly how they talk because they don't read exactly how they listen; the written word can be backtracked while the heard cannot, and speakers/writers know this, either consciously or unconsciously). A person would probably structure this more like:
> Researchers have found that some neurons inside the FFN are strongly associated with specific concepts or facts. For example, there could be one neuron that activates strongly on Eiffel-Tower-related text, another that activates strongly on programming languages, a third neuron activating on past-tense verbs, and so on.
Usually people wouldn't write "Another on programming languages." as a standalone sentence like that because the periods introduce an unnatural pause like they're giving a TED talk, unless of course they were punctuating that way for effect, but you'd essentially never communicate with that effect full time.
Comment by mattnewton 4 days ago
Comment by thin_carapace 4 days ago
Comment by Izkata 3 days ago
The one they're pointing out (the short punchy sentences) also applies to things like politicians and news articles. Blog posts are a bit odd.
* And here I mean those literal exact words. People are also extrapolating to similar patterns that use different or more words than "it's not" and "it's", but those flow better and aren't what I'm referring to here.
Comment by AgentMatt 4 days ago
Comment by timmytokyo 3 days ago
Comment by zemo 3 days ago
Comment by wizzwizz4 3 days ago
Comment by MagicMoonlight 3 days ago
Comment by rippeltippel 4 days ago
Comment by zenfoxai 3 days ago
Comment by brcmthrowaway 3 days ago
Comment by agumonkey 3 days ago
Comment by metaquestions 3 days ago
Comment by lionkor 4 days ago
Good article, but when sharing it I will have to preface "yes it's slop, but it's a good explanation".
Absolutely embarrassing that the author didn't catch that these LLM-isms are a (and here I'll use one) bad signal.
In fact, I would go so far as to say that publishing in this style stems from a lack of reading experience and writing experience, which does not bode well for someone pretending to be an expert. I gave this article to someone highly intelligent who doesn't know the first thing about how LLMs work internally, and she immediately called out that it reads like AI text.
Comment by Ampersander 4 days ago
Comment by janalsncm 4 days ago
From my read, it is fine. The brief history of LLMs is complicated since every single component has papers introducing enhancements. So it’s easy to ignore them or get bogged down with details.
The author appears to be a security researcher learning about LLMs for the purpose of defending against common attacks. So this piece is that person giving themselves a crash course on the topic. The fact that they cleaned up their notes with an LLM is frankly completely irrelevant.
Comment by rishbz 3 days ago
Comment by spaceisballer 3 days ago
Comment by Gareth321 3 days ago
I sense that statistics and benchmarks and research and statements from the world’s greatest academics won’t sway you, so maybe I’ll give you a personal anecdote. I have suffered from a condition my whole life called bile acid malabsorption. It caused chronic diarrhoea, pain, arthritis, dehydration, insomnia, and more. I spent decades searching for an answer. Dozens of different tests. Eventually doctors just said I was depressed and prescribed me antidepressants. They didn’t help. On the bad days I considered ending my life.
In desperation I turned to ChatGPT. Over months I described my symptoms, triggers, diet, timing, etc. We “sparred” with each other over assumptions and ideas. I gave it all my medical history. All the tests. Eventually it concluded that BAM was likely (plus another few options). So I pushed my doctor for a specialist referral. The specialist agreed to a scan based on the symptoms. It was confirmed. I’ve been taking some cheap medication each day now and it has changed my life.
I know others for whom ChatGPT has changed their lives in similar ways. Research shows LLMs are better than doctors already in many cases at diagnosis. They are improving at an exponential rate.
Comment by spaceisballer 3 days ago
Comment by layla5alive 3 days ago
When you got a reply illustrating how incorrect that claim from your first argument was, you shifted to focusing on the other argument (the one I actually happen to agree with - the cost to society of hitching increasing dependency on big tech will make the social media harms look like child's play).
I think your argument will be better received if you focus on the very valid concerns of societal harms, and acknowledge the ways LLMs are tremendously capable, without downplaying that.
I'm with the person you replied to in seeing how capable LLMs can be when you spar with them appropriately. They confabulate, but that's your job to catch as a sparring partner. But they do bring useful knowledge of thousands of PhDs into conversations - and even if you're among the most erudite humans on the planet, this is still an asset in intellectual search for truth on many topics.
Back to the genuine problems, and they are many: this power, concentrated in the hands of big tech, is a multiplier on the power already concentrated there, with many new capabilities - especially scary being the capabilities for subtly influencing and manipulating both individual and group behavior - for profit or otherwise - by the companies or their customers, or governments, or... The possibility space of harm and abuse is large.
On net, I think we should all be pushing to educate everyone around us on the pros, the nuances, the risks, and the big cons, and working to try to build a future of offline models rather than subscription service dependency.
Comment by spaceisballer 1 day ago
Comment by singpolyma3 4 days ago
Comment by inkysigma 4 days ago
https://arxiv.org/abs/2604.21691
There are of course empirical results and relatively weak theoretical results like the UAT, but I also don't think that answers your question fully, especially since it seems impossible to definitively answer questions that the industry seems to be betting on, like whether or not there is a lower bound to their error rate or whether hallucination as a problem can be solved. We have much stronger ideas of what linear regression is doing relative to what LLMs are doing.
Comment by skydhash 4 days ago
Comment by krackers 4 days ago
https://www.youtube.com/watch?v=5MdSE-N0bxs is remarkably prescient given that it was written before LLMs
Comment by soupspaces 4 days ago
Comment by sheeshkebab 4 days ago
Comment by qsera 3 days ago
Comment by mathisdev7 3 days ago
Comment by codeakki 4 days ago
Comment by whateveracct 4 days ago
Comment by sspoisk 11 hours ago
Comment by eddysir 3 days ago
Comment by transkey 4 days ago
Comment by rbrown46 4 days ago
Comment by mgc_blackbox 4 days ago
Comment by youareartree 3 days ago
Comment by runfuyngunasdlj 3 days ago
Comment by dotdev_prem 3 days ago
Comment by spacebacon 4 days ago
Comment by stalfie 4 days ago