Introspective Diffusion Language Models

Posted by zagwdt 7 days ago

Comments

Comment by thepasch 6 days ago

If I’m reading this right, this is pretty wild. They turned a Qwen autoregressor into a diffuser by using a bunch of really clever techniques, and they vastly outperform any “native diffuser,” actually being competitive with the base model they were trained from. The obvious upside here is the massive speedup in generation.

And then through a LoRA adapter, you can ground the diffuser on the base model’s distribution (essentially have it “compare” its proposals against what the base model would’ve generated), which effectively means: exact same byte-for-byte output for the same seed, just roughly twice as fast (which should improve even more for batched tasks).

I’m not an expert, more of a “practicing enthusiast,” so I might be missing something, but at first glance, this reads super exciting to me.

Comment by oliver236 6 days ago

I think your excitement is justified. The paper is claiming a serious bridge between AR quality and parallel decoding, and the lossless LoRA-assisted mode is the wildest part.

Comment by awestroke 6 days ago

I don't understand how you can compare against the base model output without generating with the base model, in which case what's the point?

Comment by radarsat1 6 days ago

Because the nature of transformers is that running a bunch of pregenerated tokens through them is a parallel operation, not autoregressive. That's how it works at training time, but speculative decoding uses it at inference time. So if you just want to check whether a set of known tokens is "likely" given the base model, you can run them all through and get probability distributions, no need to sample.

It's the same reason there's a difference in speed between "prompt processing" and "generation". The former is just taking the pre-generated prompt and building the KV cache, which is parallel, not autoregressive and therefore way faster.

Comment by qeternity 6 days ago

I haven't read TFA yet but a common technique is speculative decoding where a fast draft model will generate X tokens, which are then verified by the larger target model. The target model may accept some Y <= X tokens but the speedup comes from the fact that this can be done in parallel as a prefill operation due to the nature of transformers.

So let's say a draft model generates 5 tokens, all 5 of these can be verified in parallel with a single forward pass of the target model. The target model may only accept the first 4 tokens (or whatever) but as long as the 5 forward passes of the draft model + 1 prefill of the target model is faster than 4 forward passes of the target, you will have a speedup while maintaining the exact output distribution as the target.

Comment by nodja 6 days ago

Same reason why prompt processing is faster than text generation.

When you already know the tokens ahead of time you can calculate the probabilities of all tokens batched together, incurring significant bandwidth savings. This won't work if you're already compute bound so people with macs/etc. won't get as much benefits from this.

Comment by Majromax 6 days ago

Are Macs/etc compute bound with their 'it fits in unified memory' language models? Certainly by the time you're streaming weights from SSD you must be back in a bandwidth-bound regime.

Comment by dd8601fn 5 days ago

From what I understood, if we’re talking a single user on a mac (not batching) you’re rarely compute bound in the first place. More rows per pass is nearly free that way when cores were sitting idle anyway.

If that’s wrong I would certainly appreciate being corrected, though. But if it’s right, a 2.9x speed-up after rejected tokens, nearly for free, sounds amazing.

Comment by nodja 5 days ago

That will depend on the model, but they'll hit compute limits before a typical GPU in almost all cases. Macs will still benefit a speedup from this, just not one as big as the one reported.

Comment by Balinares 6 days ago

Isn't that exactly how draft models speed up inference, though? Validating a batch of tokens is significantly faster than generating them.

Comment by anentropic 6 days ago

presumably that happens at training time?

then once successfully trained you get faster inference from just the diffusion model

Comment by a1j9o94 6 days ago

You would only use the base model during training. This is a distillation technique

Comment by porridgeraisin 6 days ago

Eh. There is nothing diffusion about this. Nothing to do with denoising. This setup is still purely causal, making it quite a dishonest framing IMO. There is no more introspection here than what happens in MTP + SD setups.

Let me explain what is going on here. This is basically a form of multi-token prediction. And speculative decoding in inference. See my earlier post[1] to understand what that is. TL;DR, in multi-token prediction you train separate LM heads to predict the next as well as next to next token as well as... Upto chosen next kth token. Training multiple LM heads is expensive and can be unnecessary, so what people typically do is have a common base for all the k heads, explained further in [1]. These guys do another variant.

Here is what they do mechanically, given a sequence p consisting of five tokens PE([p1, p2, p3, p4, p5]). Where PE(.) adds relative position info to each token.

1. Create an augmented sequence PE([p1 MASK MASK MASK MASK]). Do a training pass on that, with the ground truth sequence p1..5. Here it is trained to, for example, to predict p3 given p1+pos=-2 MASK+pos=-1 MASK+pos=0, loosely notating.

2. Then separately[2], train it as usual on PE([p1 p2 p3 p4 p5]).

Step (1) teaches it to do multi-token prediction, essentially the single LM head will (very very loosely speaking) condition on the position `k` of the special MASK token and "route" it to the "implicit" k'th LM head.

Step (2) teaches it to be a usual LLM and predict the next token. No MASK tokens involved.

So far, you have trained a multi-token predictor.

Now during inference

You use this for speculative decoding. You generate 5 tokens ahead at once with MASK tokens. And then you run that sequence through the LLM again. This has the same benefits as usual speculative decoding, namely that you can do matrix-matrix multiplication as opposed to matrix-vector. The former is more memory-bandwidth efficient due to higher arithmetic intensity.

here is an example,

query = ["what", "is", "2+2"]) prompt = PE([...query, MASK*5]) you run output = LLM(prompt). Say output is ["what", "is", "2+2", "it", "is", "4"]. Note that the NN is trained to predict the kth next token when faced with positionally encoded MASK tokens. So you get all 5 in one go. To be precise, it learns to predict "4" given ["what", "is", "2+2", MASK, MASK]. Since it does not need the "it" and "is" explicitly, you can do it in parallel with generating the "it" and the "is". "is" is predicted given ["what", "is", "2+2", MASK], for example, and that also doesn't depend on the explicit "it" being there, and thus can also be done in parallel with generating "it", which is just normal generating the next token given the query. And then you use this as a draft in your speculative decoding setup.

Their claim is that using a multi-token predictor this way as a draft model works really well. To be clear, this is still causal, the reason diffusion models have hype is because they are capable of global refinement. This is not. In the same thread as [1], I explain how increasing the number of MASK tokens, i.e increasing `k`, i.e the number of tokens you predict at once in your multi-token prediction setup quickly leads to poor quality. This paper agrees with that. They try out k=2,3,4,8. They see a drop in quality at 8 itself. So finally, this is 4-token-prediction with self-speculative decoding(sans LayerSkip or such), removing seemingly no existing limitation of such setups. It is definitely an interesting way to train MTP though.

[1] https://news.ycombinator.com/item?id=45221692

[2] Note that it is computationally a single forward pass. Attention masks help you fuse steps 1 and 2 into a single operation. However, you still have 2 separate loss values.

Comment by Reubend 5 days ago

After trying to understand their method, I think you're right. Doesn't seem like anything that I would personally call "diffusion". Much closer to MTP + speculative decoding.

Then again, their results with it are great. It would be interesting to benchmark it against standard SD on a model that already uses MTP.

Comment by porridgeraisin 4 days ago

Yeah, I think it's a super neat way to do MTP. Conceptually much more pleasing and simple than existing methods. Especially since this way scaling `k` as models get better will be easier. Wish it had been presented as such.

Comment by radarsat1 5 days ago

This reminds me a lot of the tricks to turn BERT into a generative model. I guess the causal masking that keeps it to essentially be autoregressive is an important difference though. Kind of best of both worlds.

Comment by krackers 1 day ago

Masked language modeling has been compared loosely to text diffusion [1], so the paper's title claim may be loosely true in some sense even if it's misleading.

[1] https://nathan.rs/posts/roberta-diffusion/

Comment by 1 day ago

Comment by xiphias2 6 days ago

How does this compare to DFlash?

https://z-lab.ai/projects/dflash/

And DDTree?

https://liranringel.github.io/ddtree/

Comment by andsoitis 7 days ago

Is anyone here experimenting seriously with Diffusion for text generation? I’d love to learn about your experiences!

Comment by recsv-heredoc 7 days ago

https://www.inceptionlabs.ai/

This startup seems to have been at it a while.

From our look into it - amazing speed, but challenges remain around time-to-first-token user experience and overall answer quality.

Can absolutely see this working if we can get the speed and accuracy up to that “good enough” position for cheaper models - or non-user facing async work.

One other question I’ve had is wondering if it’s possible to actually set a huge amount of text to diffuse as the output - using a larger body to mechanically force greater levels of reasoning. I’m sure there’s some incredibly interesting research taking place in the big labs on this.

Comment by IanCal 6 days ago

The overall speed rather than TTFT might start to be more relevant as the caller moves from being a human to another model.

However quality is really important. I tried that site and clicked one of their examples, "create a javascript animation". Fast response, but while it starts like this

``` Below is a self‑contained HTML + CSS + JavaScript example that creates a simple, smooth animation: a colorful ball bounces around the browser window while leaving a fading trail behind it.

<!DOCTYPE html> <html lang="en"> <head> <meta charset="UTF-8"> <title>JavaScript Bounce Animation</title> <style> body, html { margin: 0; padding: 0;

```

the answer then degrades to

``` radius: BALL_RADIUS, color: BALL_COLOR, traivD O] // array of previous {x,y} positions }; ```

Then more things start creeping in

``` // 3⃣ Bounce off walls if (ball.G 0 ball.radius < 0 || ball.x + ball.radius > _7{nas.width) { ball.vx *= -1; ibSl.x = Math.max(ball.radius, Math.min(ball.x, canvbbF4idth - ball.radius)); } if

```

and the more it goes on the worse it gets

``` Ho7 J3 Works 0 Atep | Description | ```

and

``` • prwrZ8}E6on 5 jdF wVuJg Ar touc> 2ysteners ,2 Ppawn \?) balls w>SFu the 8b$] cliM#]9 ```

This is for the demo on the front page, so I expect this is a pretty good outcome compared to what else you might ask.

Comment by cataflutter 6 days ago

Weird; I clicked through out of curiosity and didn't get any corruption of the sort in the end result.

I also asked it some technical details about how diffusion LLMs could work and it provided grammatically-correct plausible answers in a very short time (I don't know the tech to say if it's correct or not).

Comment by RugnirViking 6 days ago

I got the exact same thing. But trying out another few prompts I couldn't get it to happen again. I wonder if its a bug with the cahcing/website? I can't imagine they actually run interference each time you use one of the sample prompts?

Comment by nl 6 days ago

Mercury 2 is better than that in my testing, but it does have trouble with tool calling.

Comment by girvo 6 days ago

It's being explored right now for speculative decoding in the local-LLM space, which I think is quite interesting as a use-case

https://www.emergentmind.com/topics/dflash-block-diffusion-f...

Comment by roger_ 6 days ago

DFlash immediately came to my mind.

There are several Mac implementations of it that show > 2x faster Qwen3.5 already.

Comment by moostee 7 days ago

I have. It requires a distinct intuition compared to a normal language model. Very well suited to certain problems.

Comment by andsoitis 6 days ago

Can you tell us more?

Comment by Topfi 6 days ago

I've found the latency and pricing make Mercury 2 extremely compelling for some UX experiments focused around automated note tagging/interlinking. Far more than the Gemini Flash Lite I used before, it made some interactions nearly frictionless, very close to how old school autocomplete/T9/autocorrect works in a manner that users don't even think about the processes behind it.

Sadly, it does not perform at the level of e.g. Haiku 3.5 for tool calling, despite their own benchmarks claiming parity with Haiku 4.5, but it does compete with Flash Lite there too.

Anything with very targeted output, sufficient existing input and that benefits from a seamless feeling lends itself to dLLMs. Could see a place in tab-complete too, though Cursors model seems to be sufficiently low latency already.

Comment by nl 6 days ago

If you like Mercury 2 you should try Xiaomi Mimo-v2-flash.

I have an agentic benchmark and it shows Mercury 2 at 19/25 in 58 seconds and Mimo v2 Flash at 22/25 in 109 seconds

https://sql-benchmark.nicklothian.com/?highlight=xiaomi_mimo... (flip to the Cost vs Performance tab to see speed more graphically too)

Comment by Topfi 6 days ago

Thanks for the recommendation and sharing your evals, will take a closer look at them. Yes, the Mimo models are very interesting, end-to-end pricing wise especially, though in my tool call runs, GLM 4.7 Flash did slightly better at roughly equal speed and full run cost. Is of course very task dependent and both are amazing options in the price range, but latency wise, nothing feels like Mercury 2 at the moment.

Comment by nl 6 days ago

Yeah the speed is super impressive.

https://chatjimmy.ai/ from Taalas seems down at the moment but if you really want speed.... 18,000 tps is something to experience

Comment by feznyng 6 days ago

Did you get a chance to evaluate coding performance?

Comment by Topfi 6 days ago

Yes, nothing to write home about. It's all relative of course, what stack, what goal, what approach on which models perform best, but for regular day-to-day coding, I do not find it usable given alternatives.

Kimi, Mimimax and GLM models provide far more robust coding assistance at sometimes no cost (financed via data sharing) or for very cheap. Output quality, tool calling reliability and task adherence tend to be far more reliable across all three over Mercury 2, so if you consider the time to get usable code including reviews, manual fixes, different prompting attempts, etc. end-to-end you'll be faster.

Only "coding" task I have found Mercury 2 to have a place for code generation is a browser desktop with simple generated applets. Think artefacts/canvas output but via a search field if the applet has been generated previously.

With other models, I need to hide the load behind a splash screen, but with Mercury 2 it is so fast that it can feel frictionless. The demo at this point is limited by the fact that venturing beyond a simple calculator or todo list, the output becomes unpredictable and I struggle to get Mercury 2 to rely on pre-made components, etc. to ensure consistent appearance and a11y.

Despite the benchmarks, cost and speed figure suggesting something different, I have had the best overall results with Haiku 4.5, simply because GPT-5.4-nano is still unwilling to play nice with my approach to UI components. I am currently experimenting with some routing, using different models for different complexity, then using loading spinners only for certain models, but even if that works reliably, any model that I cannot force to rely on UI components in a consistent manner isn't gonna work, so for the time being it'd just route between less expensive and more expensive Anthropic models.

Coding wise, one more exception can be in-line suggestions, though I have no way to fairly compare that because the tab models I know about (like Cursors) are not available via API, but Mercury 2 seems to perform solidly there, at least in Zed for a TS code base.

Basically, whether code or anything else, unless your task is truly latency dependent, I believe there are better options out there. If it is, Mercury 2 can enable some amazing things.

Comment by LoganDark 6 days ago

I've been playing with a Swift implementation of a diffusion language model (WeDLM), but performance is not yet acceptable and it still generates roughly from left-to-right like a language model (just within a sliding window rather than strictly token-by-token... but that doesn't matter when the sliding window is only like 16 tokens.)

Comment by simianwords 6 days ago

Can diffusion models have reasoning steps where they generate a block, introspect and then generate another until the output is satisfactory?

Comment by moeadham 6 days ago

Well, you can take the output of a first pass and pass it back through the model like AR “reasoning” models do at inference time.

Comment by simianwords 6 days ago

Yes and has this been tried?

Comment by Topfi 6 days ago

Yes, Mercury 2 is a reasoning model [0].

[0] https://docs.inceptionlabs.ai/get-started/models#mercury-2

Comment by mlmonkey 6 days ago

I'm no expert (just a monkey... ;) ), but isn't Diffusion supposed to generate ALL of the output at once? From their diagram, it looks like their I-LDM model seems to use previously generated context to generate the next tokens (or blocks).

Comment by sdenton4 6 days ago

Block auto regressive generation can give you big speedups.

Consider that outputting two tokens at a time will be a (2-epsilon)x speedup over running one token at a time. As your block size increases, you quickly get to fast enough that it doesn't matter sooooo much whether you're doing blocks or actual all-at-once generation. What matters, then, is there quality trade-off for moving to block-mode output. And here it sounds like they've minimized that trade-off.

Comment by RugnirViking 6 days ago

can it go back and use future blocks as context? Thats what i'm most interested in here - fixing line 2 because of a change/discovery we made in the process of writing line 122. I think that problem is a big part of the narrowsightedness of current coding models

Comment by mlmonkey 5 days ago

Exactly. The current (streaming) way means that once it makes a decision, it's stuck with it. For example, variable naming: once it names it something, it's stuck using that name in the future. Where as a human would just go back and change the name.

Maybe "thinking" will fix this aspect, but I see it as a serious shortcoming.

Comment by ilaksh 6 days ago

Does this mean I should switch to sglang? How hard is it to add the capability for these type of models to vLLM? Or does it already handle them?

Comment by zhangchen 6 days ago

[dead]

Comment by shepardrtc 5 days ago

Last year, there was a period of a week or two where I would see Gemini responses diffusing in. I don't know if they were experimenting with it, or if it was just an effect. It didn't last long, but it was interesting to see.

Comment by ramon156 6 days ago

> 2025-04-12: Initial code release with training and inference support.

> 2025-04-12: Released I-DLM-8B, I-DLM-32B, and I-DLM-8B-LoRA on HuggingFace.

Is this old already? Not saying that's a bad thing, since it seems very sophisticated. Just curious if there's an update

Comment by oersted 6 days ago

It's clearly a typo on the year, April 12 was two days ago, a quick check in HuggingFace shows that they were uploaded 5 days ago.

Comment by scotty79 6 days ago

So can you just use this and have a faster Qwen32b?

https://huggingface.co/yifanyu/I-DLM-32B/tree/main

Comment by 2001zhaozhao 6 days ago

I always thought some kind of block-based diffusion architecture would be the future of LLMs, especially some architecture that can dynamically alter its token generation rate as well as "reason and generate at the same time", and have an opportunity to correct tokens that it has just generated. Something like the equivalent of a short term "working memory" for humans. But I have no understanding of the math. Fingers crossed.

Comment by keyle 6 days ago

This looks great. Can we use it yet?

Comment by enesz 6 days ago

[dead]

Comment by akcd 6 days ago

[dead]

Comment by Openpic 6 days ago

3倍向上したとこのとですが、ボトルネックはMemory BandwidthからComputeに移行したの？それともMemory Bandwidthが支配的ですか？

Comment by salviati 6 days ago

This translates to

> I understand it improved by 3x, but has the bottleneck shifted from Memory Bandwidth to Compute? Or is Memory Bandwidth still dominant?

But why did you post your comment in Japanese? We have so many good options for automated translation nowadays!

でも、なぜ日本語でコメントを投稿したんですか？最近は自動翻訳の良い選択肢がたくさんあるのに！

Comment by flakiness 6 days ago

Native Japanese speaker here.

The original Japanese comment is clearly machine translated from another language to English. @Openpic is trolling.

I'd just downvote.

Comment by fumblebee 6 days ago

I'm not in on the joke, can someone ELI5

Comment by Tade0 6 days ago

Perhaps there is none.

I'm not a native English speaker and every now and then I see a comment in my mother tongue (downvoted to all hell of course). It's usually some kind of offhand remark.