Continuous Diffusion Rivals Discrete in Language Modeling

For the first time, continuous diffusion rivals discrete counterparts on standard language modeling benchmarks.

There is an old belief that continuous diffusion is born to be inferior to discrete diffusion in language modeling. Now, we are challenging it.

A brief history of diffusion language models

TL; DR: Performance of diffusion language models increases with autoregressive resemblance over years, with continuous diffusion being the earliest approach believed to be inferior.

The methodology development of diffusion language models (unconditional generation). Early works focused more on the exploration of continuous diffusion, while a wave of discrete diffusion emerged after 2023 because of its advanced performance. Recently, continuous diffusion has begun to attract renewed attention; however, there is still no evidence demonstrating its performance at scale.

Continuous diffusion has dominated the generation of images and video for years. A natural idea is to expand its domain to text generation, where early works made a lot of efforts during 2022-2023. However, the empirical result seems to only prove the inferiority of continuous diffusion in language modeling, especially under the scaled-up scenario. For example, Plaid, the only scaled-up continuous diffusion, requires 1B parameters to match the performance of autoregressive Transformer with ~100M parameters . Other continuous diffusion like Diffusion-LM even struggle to generate fluent sentences unconditionally.

The failure attempts of continuous diffusion language models nudged people to turn to discrete diffusion, which generates data from uniform states (random tokens) or absorbing (fully-masked) states. Not surprisingly, discrete diffusion was quickly proved to work better on discrete text data in 2024. By diffusing text from fully-masked states, SEDD for the first time achieved a perplexity (PPL) with a gap of <10 compared to autoregressive Transformer. Masked Diffusion Language Model (MDLM) furthermore narrowed this gap with a simplified loss similar to autoregressive. A recent milestone came from block diffusion (BD3-LM) , where an interpolation between MDLM and autogressive Transformer yield a PPL only ~3 larger than autoregressive’s.

The evolution of diffusion language models up to mid-2025 can thus be summarized as:

A clear trend emerges: the more likely a diffusion behaves like autoregressive, the closer its performance gets to autoregressive’s. This has become a tenet in the industrial practice of diffusion language models, where people scale up block diffusion with causal attention, absorbing states, and blocksizes of 4-32 - very similar to autoregressive Transformers with multi-token prediction.

However, while the performance gap between masked diffusion and autoregressive Transformer is being narrowed, we also witness the loss of diffusion’s characteristics, which limits two original potentials of diffusion:

1) Few-step generation. In the image & video domain, continuous diffusion maps different noise points to different data samples, making one-step generation possible through learning this many-to-many mapping. Masked diffusion, however, has its prior distribution collapsed into one single fully masked state. Because of this, few-step generation of masked diffusion suffers from parallel decoding dilemma , which mixes up different modes due to the variety of targets. As shown in the figure below, this phenomenon will degrade the sample quality, reflected by exploding generative perplexities under low numbers of function evaluations (NFEs) .

Masked diffusion may mix up different target modes in few-step generation.

2) Controllability. Image and video diffusion can be semantically controlled by manipulating continuous latents. For example, classifier-based and classifier-free guidance . By contrast, masked diffusion only supports token-level editing like clozing.

The strengths and trade-offs of masked diffusion raise an important question: if diffusion models must increasingly mimic autoregressive models to perform well, what defines diffusion as an independent family of language models?

Frontiers in diffusion language models have taken action on this question in 2025: They traced back to architectures with dispersed priors, trying to get some lost diffusion characteristics back. One well-known attempt is DUO , an improved version of uniform-state discrete diffusion. Although it underperforms masked diffusion under language modeling benchmarks like OpenWebText, DUO maintains better sample quality after distillation for few-step generation while adopting the guidance mechanism designed for uniform-state discrete diffusion. A more recent study showed that DUO beat both masked diffusion and autoregressive in a scaled-up setting on GSM8K, a widely-used math benchmark. This result challenges the trend of limitlessly making diffusion look like autoregressive.

In our recent work, we go backward a step further than DUO to try the canonical form of diffusion: continuous diffusion. We show that the previous underperformance of continuous diffusion language models is not intrinsic, but led by the underoptimization in their training recipes and evaluating protocols. With our optimization of 1) noise schedule, 2) self-conditioning, and 3) the NLL bound, continuous diffusion is able to rival discrete diffusion and even autoregressive under the setup and scale of GPT-2-small. Specifically, our method, LangFlow, beats autoregressive on 4 of 7 benchmarks of zero-shot PPL evaluation.

Next, we will explain how to achieve that.

Training Continuous Diffusion on Text

Continuous diffusion in the embedding space

TL; DR: Our model predicts clean logits from noisy text embeddings.

Discrete diffusion predicts logits from noisy discrete tokens. Our continuous diffusion predicts clean logits from noisy word embeddings, thus able to share the same network architecture with discrete diffusion.

We begin by reframing embedding-space diffusion, a classical framework of continuous diffusion language models. Let \(\mathcal{E} \in \mathbb{R}^{V \times D}\) be an embedding matrix over a vocabulary of size \(V\), mapping each text token \(x^{(i)}\) to a \(D\)-dimensional vector \(\mathbf{e}_{x^{(i)}}\). A sequence of tokens \(\mathbf{x} = (x^{(1)}, \ldots, x^{(L)})\) is thus embedded into a continuous matrix \(\mathbf{z} = (\mathbf{e}_{x^{(1)}}, \ldots, \mathbf{e}_{x^{(L)}}) \in \mathbb{R}^{L \times D}\), and we define the generative target of the flow as this embedding \(\mathbf{z}\).

Flow Matching (FM) constructs a velocity field \(\mathbf{u}_t(\mathbf{z}_t)\) that transports a Gaussian prior \(p_\text{prior} = \mathcal{N}(\mathbf{0}, \mathbf{I})\) to the data distribution \(p_\text{data}\) by solving an ODE \(d\mathbf{z}_t = \mathbf{u}_t(\mathbf{z}_t)\, dt\). In practice, we use an affine Gaussian probability path:

\[\mathbf{z}_t = \alpha_t \mathbf{z} + \sigma_t \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),\]

where \(\alpha_t\) grows from \(0 \to 1\) and \(\sigma_t\) decays from \(1 \to 0\) as \(t\) goes from \(0\) to \(1\).

Training Rather than directly regressing on the velocity field, we train a model \(\hat{\mathbf{x}}_\theta^{(i)}(\mathbf{z}_t, t)\) to predict the probability distribution over tokens at each position \(i\). This is principled via Bregman divergence : for any convex function \(f\), the Bregman divergence between a target \(\mathbf{p}\) and a predictor \(\mathbf{q}\) is

\[\mathcal{D}_f(\mathbf{p}, \mathbf{q}) = f(\mathbf{p}) - f(\mathbf{q}) - \nabla f(\mathbf{q}) \cdot (\mathbf{p} - \mathbf{q}).\]

Training by minimizing the expected Bregman divergence between the one-hot ground truth \(\mathbf{1}_{x^{(i)}}\) and the predicted probabilities \(\hat{\mathbf{x}}_\theta^{(i)}\) achieves posterior matching: the optimal predictor always recovers the true posterior \(p(x^{(i)} \mid \mathbf{z}_t)\), regardless of the specific choice of \(f\). Cross-entropy (CE) is the special case with \(f(\mathbf{p}) = \mathbf{p} \cdot \log \mathbf{p}\), where for one-hot \(\mathbf{p} = \mathbf{1}_{x^{(i)}}\) the divergence simply reduces to \(-\log \hat{x}_\theta^{(i, x^{(i)})}\). This yields the CE training objective:

\[\mathcal{L}_\text{CE}(\theta) = \int \lambda(t)\, \mathbb{E}\left[-\sum_{i=1}^L \log \hat{\mathbf{x}}_\theta^{(i, x^{(i)})}(\mathbf{z}_t, t)\right] dt.\]

Here, \(\hat{\mathbf{x}}_\theta\) is parameterized by a diffusion transformer (DiT) with bidirectional attention that takes noisy embedding \(z_t\) and time conditioning \(t\) as inputs and predicts token logits. The continuous denoiser \(\hat{\mathbf{z}}_\theta\) needed for ODE integration is then recovered analytically as the embedding-space expectation over the predicted token distribution:

\[\hat{\mathbf{z}}_\theta^{(i)}(\mathbf{z}_t, t) = \mathcal{E}^\top \hat{\mathbf{x}}_\theta^{(i)}(\mathbf{z}_t, t).\]

which we can use to obtain \(\mathbf{v}_\theta(\mathbf{z}_t, t)\) to progressively denoise \(z_t\).

The key insight of our training design is two-fold: First, optimizing the model by the cross-entropy loss avoids mode collapse. Different from latent diffusion for images, our embedding layer needs to be trained together with the model. Regressing the velocity directly can then converge trivially when embeddings for different tokens are the same, which leads to mode collapse. Second, under this design, our model is fed with (noisy) embeddings and predicts discrete token likelihoods, similar to discrete diffusion and autoregressive Transformer. This means that it can share the same network architecture with discrete diffusion , and we can therefore apply fair comparison among them.

Sampling With a trained velocity function $\mathbf{v}_\theta(\mathbf{z}_t, t)$, we sample text examples by a two-stage process: First, we generate clean text embedding $z_{\gamma_N}$ by progressively denoising a sampled Gaussian prior $z_{\gamma_0}$ as standard Flow Matching; Second, we apply the discrete token predictor $\hat{\mathbf{x}}_\theta(\mathbf{z}_t, t)$ (Denoiser in the figure, which is part of $\mathbf{v}_\theta(\mathbf{z}_t, t)$ as described in Training) to the clean text embedding $z_{\gamma_N}$ to obtain the final probability distribution $\hat{x}_\theta$ and sample the token $\hat{x}$ from this distribution via Argmax.

Noise scheduling

TL; DR: Good noise schedules for text focus on highly noisy districts, substantially different from those for images. Choosing such noise schedules correctly is the key of tuning continuous diffusion on text.

The problem with standard schedulers

Training continuous diffusion requires choosing how to schedule the noise level \(\sigma_t\) (and signal \(\alpha_t\)) across the diffusion time \(t \in [0, 1]\). For image models, popular design choices place training emphasis around intermediate noise levels, where the visual signal-to-noise ratio is moderate (around \([0.5, 2]\)). This heuristic works because the image loss profile fairly well spreads out across \(t\).

Text embeddings, however, tell a very different story. When training with a naive uniform \(t\) schedule, we observe that our cross-entropy loss is nearly zero for \(t \in [0.2, 1.0]\). This means that the model will add nothing to the generated data during this interval and that all meaningful information is generated in the first \(20\%\) of the diffusion path. This is intuitive because different text embeddings are far more isolated to each other compared to images. Standard image schedulers therefore waste more than half of their training and sampling steps on noise levels that carry no useful information.

We then notice that our loss changes dramatically mainly on highly noisy districts. Hence, instead of conditioning the network on \(t\), we introduce the logarithmic noise-to-signal ratio (logNSR)

\[\gamma_t = \log\!\left(\frac{\sigma_t^2}{\alpha_t^2}\right),\]

as our time conditioning, which strictly maps \(t \in [0, 1]\) to \(\gamma \in (+\infty, -\infty)\). Pure noise corresponds to \(\gamma \to +\infty\); perfectly recovered data to \(\gamma \to -\infty\). Restricting to variance-preserving (VP) paths where \(\alpha_\gamma^2 + \sigma_\gamma^2 = 1\), the noisy embedding at a given level is simply:

\[\mathbf{z}_\gamma = \sqrt{\sigma(-\gamma)}\, \mathbf{z} + \sqrt{\sigma(\gamma)}\, \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),\]

where \(\sigma(\cdot)\) is the sigmoid function. The main advantage of conditioning on \(\gamma\) is: when NSR doubles, the time conditioning only shifts by \(1\), which smoothens the curve of time conditioning when NSR dramatically increases. This allows us to allocate more resolution on highly noisy districts, which concentrates the main change of our training loss.

Matching the schedule to information gain

Plotting the per-noise-level CE loss \(\mathcal{L}_\gamma = \frac{1}{L}\sum_i \mathbb{E}[-\log q_\theta^{(i)}(x^{(i)} \mid \mathbf{z}_\gamma, \gamma)]\) over \(\gamma\) reveals something striking: the loss curve is remarkably stable across training stages (from 50k to 1M steps). This is not coincidence: By the standard information-theoretic decomposition,

\[\mathbb{E}_{x^{(i)} \sim p}\left[-\log q_\theta^{(i)}(x^{(i)})\right] = \mathrm{KL}(p \,\|\, q_\theta^{(i)}) + H(p),\]

the KL term quickly converges to zero as the model learns the posterior, leaving the loss dominated by the irreducible posterior entropy \(H(x^{(i)} \mid \mathbf{z}_\gamma)\). In other words, the loss profile is a fingerprint of the information content of each noise level — it is a property of the data, not the model. In other words, after the very early stage of our training, we have \(\mathcal{L}_\gamma\approx H_\gamma\).

This means that reversing the path of diffusion is actually a process of gaining information for the generated data progressively, which motivates a heuristic principle: to allocate training steps proportionally to the information gained per unit of \(\gamma\), i.e., proportionally to \(H'_\gamma = \frac{\partial}{\partial \gamma} H_\gamma\). Regions where the model learns more about tokens per noise level should receive both more training steps and more ODE sampling steps. This becomes our principle to sample \(\gamma\) during training and inference.

A Gumbel proposal for uniform-information scheduling

To apply our principle, we reuse the curve of loss-\(\gamma\) above (now we know it is exactly the curve of \(H_\gamma\)-\(\gamma\)). Comparing \(H'_\gamma\) from checkpoints at 50k, 100k, and 150k steps, we find that the derivative is strictly unimodal and positively skewed. Surprisingly, we discover that a Gumbel distribution provides an excellent match for the \(H_\gamma\)-\(\gamma\) distribution:

\[H_\gamma \propto H(+\infty) \cdot \exp\!\left(-\exp\!\left(-\frac{\gamma - P_\mu}{P_\beta}\right)\right)\]

Hence, we parameterize \(H_\gamma\)-\(\gamma\) by \(H'_\gamma \sim \mathrm{Gumbel}(\gamma; P_\mu, P_\beta)\). In our experiments, we set \(P_\mu\), \(P_\beta\), and \(H(+\infty)\) as trainable parameters to dynamically fit the ground truth \(H_\gamma\)-\(\gamma\) curve.

To summarize: During training, we sample \(\gamma\) from this Gumbel distribution and perturb data accordingly; During ODE sampling, we schedule the integration steps according to the same distribution — concentrating solver steps precisely where the per unit information gain is rich.

The impact is dramatic: ablations show this Gumbel proposal alone reduces Gen. PPL (↓, measured by GPT-2-Large) of LangFlow by a factor of 7× (from ~1000 to ~150). This design choice is the key milestone that allows continuous diffusion to match its sample quality to that of discrete diffusion.

Self-conditioning

TL; DR: Self-conditioning works differently on continuous and discrete diffusion. For discrete, it reduces generative perplexity at the cost of higher perplexity. For continuous, it significantly decreases both perplexity and generative perplexity.

Self-conditioning is an established trick in diffusion language models, applicable to both discrete and continuous variants. The mechanism originally targets compounding error during sampling: at each solver step, the denoiser \(\hat{\mathbf{z}}_\theta(\mathbf{z}_\gamma, \gamma)\) produces an estimate of the clean embedding \(\hat{\mathbf{z}}\), which is used to compute the velocity and the next noisy state \(\mathbf{z}_{\gamma+\Delta\gamma}\). Hence, any imprecision in \(\hat{\mathbf{z}}\) corrupts the input of the next solver step, whose denoising error then corrupts the step after that — errors cascade along the entire ODE trajectory. Self-conditioning breaks this cascade by feeding the previous step’s prediction \(\hat{\mathbf{z}}_\theta^{\text{prev}}\) back as an auxiliary input to the network, giving it an explicit semantic anchor at each step.

In our practice, self-conditioning is applied randomly with probability 0.25 during training that the model is fed by the prediction of itself for one round, so the model learns to exploit the prior prediction when available. In the sampling, self-conditioning is always enabled, initialized to zero at the first step and set as the previous prediction at the following-up steps.

Because self-conditioning is specifically designed to reduce step-wise error accumulation, it was widely assumed to have no effect on perplexity (PPL) — a metric computed over the full data distribution rather than along particular generation trajectories. For this reason, prior works have commonly disabled self-conditioning when comparing discrete and continuous diffusion by PPL. For example, in the paper of DUO, DUO achieves a PPL of 43.0 at 100k training steps on LM1B, against Plaid’s 89.9 (with self-conditioning disabled). This gap has then been taken as evidence of continuous diffusion’s fundamental inferiority.

However, we find that such evaluation protocol is unfair to continuous diffusion, because self-conditioning actually has a dramatically different effect on continuous diffusion compared to that on discrete diffusion. Turning on and off self-conditioning for MDLM (discrete) and LangFlow (our continuous), and measuring both their PPL and Gen. PPL reveals a fundamental asymmetry:

Model Self-Cond. Gen. PPL ↓ PPL ↓
MDLM âś— 103.9 31.0
MDLM âś“ 94.9 (-9.0) 32.7 (+1.7)
LangFlow âś— 154.2 49.0
LangFlow âś“ 81.5 (-72.7) 30.0 (-19.0)

For discrete diffusion (MDLM), self-conditioning improves Gen. PPL — confirming the expected reduction in sampling-time error accumulation — but slightly degrades PPL. This trade-off is well-known and explains why self-conditioning is disabled in PPL-focused comparisons of discrete diffusion. For continuous diffusion (LangFlow), the picture is entirely different: self-conditioning significantly improves both Gen. PPL and PPL, which is enough to close the gap between the PPL of continuous and that of discrete diffusion.

Evaluating continuous diffusion by an ODE-based bound of negative log-likelihood

TL; DR: The likelihood bound of Flow Matching (ODE) provides tight estimation for NLL in continuous diffusion language models.

Evaluating the likelihood (corresponding to PPL) of a language model is as important as generating good samples. For autoregressive models, exact log-likelihood is computed token by token. For discrete diffusion, a variational lower bound derived from the stochastic forward process (the negative evidence lower bound, NELBO) is the standard metric. But for continuous diffusion with ODE sampling, the SDE-derived bound is structurally mismatched — the ODE and SDE traverse different probability pathways over the same data.

LangFlow adopts a deterministic generative process via probability flow ODE. We therefore derive a likelihood bound tailored specifically to this ODE trajectory, following Flow Matching . For a sequence \(\mathbf{x} = (x^{(1)}, \ldots, x^{(L)})\) with embedding dimension \(D\), the log-likelihood of LangFlow satisfies the following evidence lower bound:

\[\log p(\mathbf{x}) \ge \mathbb{E}_{\mathbf{z}} \Bigg[ \frac{LD}{2} - \frac{\|\mathbf{z}_b\|^2}{2\sigma_b^2} + \sum_{i=1}^L \log \hat{\mathbf{x}}_{\theta}^{(i, x^{(i)})}(\mathbf{z}_a, a) + \int_a^b \frac{\alpha_\gamma}{2} \nabla \cdot \hat{\mathbf{z}}_\theta(\mathbf{z}_\gamma, \gamma)\, d\gamma \Bigg],\]

where \(\mathbf{z}_a \sim \mathcal{N}(\alpha_a \mathcal{E}^\top \mathbf{x},\, \sigma_a^2 \mathbf{I})\), \(\mathbf{z}_\gamma\) for \(a \le \gamma \le b\) is the reverse ODE trajectory given \(\mathbf{z}_a\), and \(\nabla \cdot \hat{\mathbf{z}}_\theta\) denotes the divergence of the denoiser with respect to \(\mathbf{z}_\gamma\).

There are four terms to unpack here. The first two account for the log-probability of the initial noisy state under the Gaussian prior. The third is the token-level cross-entropy at the end of the reverse trajectory — essentially measuring how well the model recovers the original tokens from near-clean embeddings. The fourth term captures the volume change induced by the ODE flow, analogous to the change-of-variables formula in normalizing flows; it accounts for how the deterministic flow compresses or expands probability mass along its trajectory.

This bound is appropriate in the sense that it is derived directly from the ODE dynamics rather than from a stochastic surrogate. It closes the evaluation gap between continuous and discrete diffusion language models, enabling apples-to-apples comparison of their data likelihoods.

Experiments: Continuous diffusion rivals discrete in language modeling

TL; DR: Putting our designs together helps continuous diffusion to match or outperform discrete diffusion on LM1B and OWT, measured by PPL and Gen PPL. Our model also achieves advanced zero-shot transfer among all diffusion models.

We evaluate LangFlow on two standard benchmarks: LM1B, the most widely-used sentence-level language modeling benchmark, and OpenWebText (OWT), the language corpus similar to the dataset used to train GPT-2.

Basically we have two experimental settings:

In the above experiments, all models are 130M-parameter diffusion transformers (DiT) trained for 1M steps, following the setup of existing literature. We report the upper bound of the perplexity (PPL) of diffusion, and the generative perplexity (Gen. PPL) measured by GPT-2 Large on 1024 samples. For LM1B, the sample length is 128, with NFE=128. For OWT, the sample length is 1024, with NFE=1024.

Language Modeling

On LM1B, LangFlow achieves a Gen. PPL of 81.5, outperforming the best discrete diffusion baseline (DUO at 97.6) by more than 15 points. In terms of dataset PPL, LangFlow (30.0) beats all discrete diffusion. On OWT, LangFlow (24.6) also yields perplexity only with a gap of 1.4 compared to MDLM (23.2). This is the first time that continuous diffusion rivals discrete diffusion in standard language modeling benchmarks.

Zero-shot transfer

Across 7 zero-shot benchmarks, LangFlow beats the autoregressive baseline on 4 out of 7 (PTB, Lambada, PubMed, Arxiv) and MDLM on 3 out of 7 (PTB, Wikitext, Lambada). Overall, LangFlow performs better in zero-shot transfer compared to existing continuous and uniform diffusion methods.

Discussion: Why continuous diffusion language models?

TL;DR: Diffusion should complement autoregressive models in controllability and multimodality rather than replace them. Continuous diffusion is well suited to this goal.

In previous sections, we showed that continuous diffusion can empirically rival discrete diffusion in language modeling. However, a substantial performance gap between diffusion and autoregressive methods remains. Experimental results at scale consistently indicate that diffusion does not outperform autoregressive models overall in language modeling. This brings us back to the question raised at the beginning: why should we pursue diffusion language models?

Sampling efficiency alone is not a sufficiently strong justification. While much of the literature presents it as a potential advantage, recent large-scale comparisons show that diffusion achieves higher tokens-per-second than autoregressive Transformers only at small batch sizes. This implies that diffusion primarily trades off aggregate serving throughput for per-request latency, rather than delivering consistent efficiency improvements across different system regimes. The result is intuitive: in autoregressive Transformers, each query to the KV cache produces one token; in diffusion models, only a subset of queries result in decoded tokens, while each forward pass predicts logits for all positions. Considering the difference in the attention mask, even a one-step diffusion will require more compute than the autoregressive model to decode the same number of tokens. As a result, diffusion can only be more efficient when compute is not the bottleneck, which places a clear upper bound on its efficiency advantage.

Given that autoregressive Transformers have become the gold standard for both effectiveness and efficiency in language modeling, a more practical goal for diffusion language models is to complement them rather than replace them. From this perspective, we can identify two key limitations of autoregressive models where diffusion may offer distinct advantages:

1) Controllability. Autoregressive models are inherently limited in controllability because they do not leverage latent variables during sampling. In contrast, diffusion models transform noisy latent variables into data samples through editable trajectories. This creates opportunities for controllable generation, such as guidance mechanisms. Diffusion models can therefore serve as controllable drafters in speculative decoding or provide a new degree for post-training—complementing the limited controllability and interpretability of current autoregressive-based systems.

2) Multimodality. The physical world consists of continuous signals—images, video, audio, and actions. Representing these signals as discrete tokens in autoregressive models is not a natural approach, particularly when multiple modalities interact. Because diffusion operates in continuous spaces, it is a promising foundation for multimodal paradigms grounded in the physical world, such as vision-centric world models.

Admittedly, without large-scale empirical validation, these hypotheses about the potential of diffusion language models remain unconfirmed. Nevertheless, we believe the future of language modeling lies not in a single unified model, but in a system of diverse models that each leverage their own strengths in a coordinated framework. From this perspective, the goal of developing diffusion language models should not be to chase performance by mimicking autoregressive methods at the cost of losing their defining characteristics. Instead, we should optimize diffusion while preserving what makes it unique, and then identify its role within the broader ecosystem of language models. In other words, we should no longer force diffusion into the autoregressive paradigm. Let diffusion be diffusion first.

This leads to our answer to why we work on continuous diffusion language models: they may not be the most performant variant, but they best preserve the defining properties of diffusion itself—and that is precisely why diffusion language models are worth pursuing.