Continuous Diffusion Scales Competitively with Discrete Diffusion for Language

Zhihan Yang Wei Guo Shuibai Zhang Subham Sekhar Sahoo Yongxin Chen Arash Vahdat Morteza Mardani John Thickstun

Abstract

Although diffusion has recently attracted considerable interest in the language modeling community, continuous diffusion has been perceived as less scalable than discrete approaches. To challenge this belief, we revisit Plaid, a likelihood-based continuous diffusion language model (DLM), and construct RePlaid by aligning Plaid's architecture with that of modern discrete DLMs. In this unified setting, we establish the first scaling law for continuous DLMs that rivals that of discrete DLMs: RePlaid exhibits a compute gap of only 20x relative to autoregressive models, outperforms Duo while using fewer parameters, and outperforms MDLM in the over-trained regime. We evaluate RePlaid against recent continuous DLMs: on OpenWebText, RePlaid achieves a new state-of-the-art perplexity (PPL) bound of 22.1 among continuous DLMs, along with superior generation quality. These results suggest that continuous diffusion, when trained via likelihood, is a highly competitive and scalable alternative to discrete DLMs. In addition, we provide theoretical insights into the advantage of likelihood-based training. We show that optimizing the noise schedule to minimize the variance of the ELBO (evidence lower bound) naturally leads to a cross-entropy (information loss) that is linear in time. This distributes denoising difficulty uniformly without any case-specific time reparameterization.

One-sentence Summary

The authors construct RePlaid, a likelihood-based continuous diffusion language model, by aligning Plaid’s architecture with modern discrete DLMs to challenge scalability assumptions, establishing the first scaling law for continuous diffusion that rivals discrete approaches with a compute gap of only 20x compared to autoregressive models and a new state-of-the-art perplexity bound of 22.1 on OpenWebText among continuous DLMs, suggesting likelihood-based training as a competitive and scalable alternative.

Key Contributions

  • This work constructs RePlaid by aligning the architecture of the likelihood-based continuous diffusion model Plaid with modern discrete diffusion language models. The resulting unified setting enables rigorous scalability comparisons with discrete approaches.
  • Experiments establish the first scaling law for continuous diffusion language models, revealing a compute gap of only 20x compared to autoregressive models. RePlaid outperforms Duo while using fewer parameters and surpasses MDLM in the over-trained regime.
  • Benchmarks on OpenWebText show that RePlaid achieves a new state-of-the-art perplexity bound of 22.1 among continuous diffusion language models. The paper also provides theoretical insights demonstrating that optimizing the noise schedule to minimize ELBO variance yields linear cross-entropy over time.

Introduction

Continuous diffusion language models offer unique advantages in controllability and sampling efficiency, yet they have historically underperformed discrete diffusion and autoregressive models in terms of scalability due to substantial compute overhead. To challenge this narrative, the authors introduce RePlaid, a modernized likelihood-based model that aligns continuous diffusion architectures with established discrete scaling protocols. Their unified benchmark reveals that continuous diffusion scales competitively by reducing the compute gap to only $20\times$ relative to autoregressive baselines while achieving state-of-the-art perplexity. Additionally, the work provides theoretical insights demonstrating that likelihood-based training naturally optimizes noise schedules and embedding geometries to improve performance.

Dataset

  • Composition and Sources: The authors utilize a reference corpus to precompute a dominant Universal POS tag for each GPT-2 subword token.
  • Processing and Alignment: They align subwords to spaCy word spans via character offsets to resolve the mismatch between GPT-2 BPE splits and whole-word POS definitions.
  • Usage: This data facilitates the analysis of embedding geometry conditioned on syntactic roles.

Method

The authors leverage a Variational Diffusion Model (VDM) framework adapted for text generation, referred to as Plaid. The methodology encompasses a robust data processing pipeline to construct valid input sequences, followed by a continuous diffusion process operating on low-dimensional token embeddings.

Input Representation and Data Processing

To ensure the input sequences align with the embedding space, the authors implement a specific pipeline for tokenization and part-of-speech (POS) alignment. Given a corpus of token IDs, the text is decoded and processed to recover character offsets for each subword. A POS tagger analyzes the decoded text at the word level, and these tags are inherited by the subwords based on character span overlap. This alignment ensures that syntactic information is preserved in the input representation.

The fast tokenizer then returns subwords with specific character spans, which may split a single word into multiple tokens. For instance, a verb like purring might be split into distinct subword units. The alignment procedure attributes all subwords belonging to a single word to the same POS tag.

To facilitate this alignment, a character-to-word map is allocated as an integer array. Each index in the array corresponds to a character position in the text and is filled with the index of the spaCy word covering that position. For each subword, the slice of this map corresponding to its span is analyzed, and the subword inherits the POS tag of the spaCy word whose index appears most often in that slice.
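The alignment described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names `build_char_to_word` and `tag_subwords`, the example sentence, the hand-written word spans and tags (standing in for spaCy output), and the subword spans (standing in for GPT-2 fast-tokenizer offsets) are all assumptions made for the example.

```python
from collections import Counter

def build_char_to_word(text_len, word_spans):
    """Map each character position to the index of the word covering it (-1 if none)."""
    char_to_word = [-1] * text_len
    for word_idx, (start, end) in enumerate(word_spans):
        for pos in range(start, end):
            char_to_word[pos] = word_idx
    return char_to_word

def tag_subwords(subword_spans, char_to_word, word_tags, default="X"):
    """Give each subword the POS tag of the word that covers most of its characters."""
    tags = []
    for start, end in subword_spans:
        covered = [w for w in char_to_word[start:end] if w >= 0]
        if not covered:
            # Subword made only of characters outside any word (e.g. whitespace).
            tags.append(default)
            continue
        majority_word, _ = Counter(covered).most_common(1)[0]
        tags.append(word_tags[majority_word])
    return tags

text = "The cat is purring"
# Hypothetical word-level output (character spans and Universal POS tags).
word_spans = [(0, 3), (4, 7), (8, 10), (11, 18)]
word_tags = ["DET", "NOUN", "AUX", "VERB"]
# Hypothetical subword spans: "The", " cat", " is", " pur", "ring".
subword_spans = [(0, 3), (3, 7), (7, 10), (10, 14), (14, 18)]

char_to_word = build_char_to_word(len(text), word_spans)
subword_tags = tag_subwords(subword_spans, char_to_word, word_tags)
print(subword_tags)
```

Both pieces of "purring" receive the same tag because every non-space character in their spans maps to the same word index; the leading space in " pur" is ignored by the majority vote.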

Model Architecture and Diffusion Process

Once the input sequences are prepared, Plaid identifies a length-$L$ sequence $\mathbf{x}$ with a matrix in $\{0, 1\}^{L \times V}$, where $V$ is the vocabulary size. The sequence is embedded into a continuous space via a learnable token-embedding matrix $\mathbf{E} \in \mathbb{R}^{V \times d_e}$, resulting in an embedded sequence $\mathbf{e} := \mathbf{x}\mathbf{E} \in \mathbb{R}^{L \times d_e}$. The authors use low-dimensional embeddings with $d_e = 16$ to reduce computational cost compared to high-dimensional one-hot injections.
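As a small illustration of the embedding step, the sketch below checks that multiplying a one-hot matrix by the embedding matrix is the same as looking up one row per token. The toy vocabulary size, the random initialization of `E`, and the token IDs are assumptions for the example; only the embedding width of 16 comes from the text.

```python
import random

def one_hot(token_ids, vocab_size):
    """Return an L x V matrix with a single 1.0 per row."""
    return [[1.0 if v == tok else 0.0 for v in range(vocab_size)] for tok in token_ids]

def matmul(a, b):
    """Plain-Python matrix product, enough for a toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

rng = random.Random(0)
V, d_e = 8, 16                      # toy vocabulary; d_e = 16 as in the text
E = [[rng.gauss(0.0, 1.0) for _ in range(d_e)] for _ in range(V)]

token_ids = [3, 1, 4, 1, 5]         # hypothetical sequence of length L = 5
x = one_hot(token_ids, V)           # x in {0,1}^{L x V}
e = matmul(x, E)                    # e = x E, shape L x d_e
e_lookup = [E[tok] for tok in token_ids]  # equivalent row lookup
```

In practice the lookup form is what an embedding layer computes; the matrix form is mainly convenient for writing the model's output, a distribution over the vocabulary, back into embedding space.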

The forward process $q$ applies Gaussian noising to the embedding $\mathbf{e}$:

$$q(\mathbf{z}_t \mid \mathbf{x}) = \mathcal{N}(\alpha_t \mathbf{e}, \sigma_t^2 \mathbf{I}), \quad t \in [0, 1],$$

where $\alpha_t$ and $\sigma_t$ are smooth scalar functions satisfying the variance-preserving constraint $\alpha_t^2 + \sigma_t^2 = 1$. The reverse process is parameterized by a time-conditioned denoising model $\mathbf{x}_\theta$ that outputs a categorical distribution over the vocabulary. The model predicts the clean embedding $\mathbf{e}_\theta(\mathbf{z}_t, t) := \mathbf{x}_\theta(\mathbf{z}_t, t)\mathbf{E}$.
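To make the forward process concrete, here is a minimal sampling sketch. The cosine schedule used below is an assumption chosen only because it satisfies the variance-preserving constraint; RePlaid learns its schedule, and the function names are hypothetical.

```python
import math
import random

def cosine_vp_schedule(t):
    """Illustrative variance-preserving schedule (an assumption, not the learned one)."""
    alpha_t = math.cos(0.5 * math.pi * t)
    sigma_t = math.sin(0.5 * math.pi * t)
    return alpha_t, sigma_t

def q_sample(e, t, rng):
    """Draw z_t ~ N(alpha_t * e, sigma_t^2 * I) for an L x d_e embedded sequence e."""
    alpha_t, sigma_t = cosine_vp_schedule(t)
    return [[alpha_t * v + sigma_t * rng.gauss(0.0, 1.0) for v in row] for row in e]

rng = random.Random(0)
e = [[0.5] * 16 for _ in range(4)]   # toy embedded sequence, L = 4, d_e = 16
z_zero = q_sample(e, 0.0, rng)       # t = 0: no noise, z_0 equals e
z_half = q_sample(e, 0.5, rng)       # t = 0.5: signal and noise mixed
z_one = q_sample(e, 1.0, rng)        # t = 1: (almost) pure Gaussian noise
```

Under this toy schedule the signal coefficient falls from 1 at $t=0$ to approximately 0 at $t=1$ while the total variance stays fixed, which is what the constraint $\alpha_t^2 + \sigma_t^2 = 1$ expresses.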

Training Procedure and Loss

Training minimizes the Negative Evidence Lower Bound (NELBO), which comprises three terms: prior loss, reconstruction loss, and diffusion loss. The prior loss regularizes the latent distribution at $t=1$. The reconstruction loss focuses on the clean data at $t=0$, while the diffusion loss optimizes the denoising trajectory across intermediate timesteps.
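For orientation, the three terms can be written out in the form used by the standard continuous-time VDM objective. This is a sketch under that assumption: the summary does not give Plaid's or RePlaid's exact expressions, and the standard normal prior $p(\mathbf{z}_1)$ and the signal-to-noise ratio $\mathrm{SNR}(t) = \alpha_t^2 / \sigma_t^2$ are notation introduced here.

```latex
\begin{aligned}
-\log p_\theta(\mathbf{x})
  &\le \underbrace{D_{\mathrm{KL}}\!\left(q(\mathbf{z}_1 \mid \mathbf{x}) \,\|\, p(\mathbf{z}_1)\right)}_{\text{prior loss}}
   + \underbrace{\mathbb{E}_{q(\mathbf{z}_0 \mid \mathbf{x})}\!\left[-\log p_\theta(\mathbf{x} \mid \mathbf{z}_0)\right]}_{\text{reconstruction loss}}
   + \underbrace{\mathcal{L}_{\mathrm{diff}}(\mathbf{x})}_{\text{diffusion loss}}, \\
\mathcal{L}_{\mathrm{diff}}(\mathbf{x})
  &= -\frac{1}{2}\, \mathbb{E}_{t \sim \mathcal{U}(0,1),\; \mathbf{z}_t \sim q(\mathbf{z}_t \mid \mathbf{x})}
     \!\left[ \mathrm{SNR}'(t)\, \left\| \mathbf{e} - \mathbf{e}_\theta(\mathbf{z}_t, t) \right\|_2^2 \right],
\qquad \mathrm{SNR}(t) = \frac{\alpha_t^2}{\sigma_t^2}.
\end{aligned}
```

Because the signal-to-noise ratio decreases in $t$, $\mathrm{SNR}'(t)$ is negative and the diffusion loss is non-negative.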

During training, the batch is adaptively split into a reconstruction sub-batch and a diffusion sub-batch. The reconstruction sub-batch samples time $t=0$ exactly, while the diffusion sub-batch samples $t$ from a low-discrepancy distribution over $[0, 1]$. The prior loss utilizes the entire batch. Additionally, self-conditioning is employed where, for a fraction of the batch, an initial gradient-free forward pass estimates the clean data to condition the subsequent prediction.
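The time-sampling scheme can be sketched as follows. The sampler uses the shifted-grid construction from the VDM paper, $t_i = (u_0 + i/k) \bmod 1$, which is one common low-discrepancy choice and is assumed here rather than confirmed for RePlaid. The fixed `recon_fraction` is also an assumption: the text says the split is adaptive but does not say how the fraction is chosen.

```python
import random

def low_discrepancy_times(k, rng):
    """Draw k times in [0, 1) from one shared uniform offset on an even grid."""
    u0 = rng.random()
    return [(u0 + i / k) % 1.0 for i in range(k)]

def split_batch(batch_size, recon_fraction):
    """Split example indices into a reconstruction part and a diffusion part."""
    n_recon = round(batch_size * recon_fraction)
    indices = list(range(batch_size))
    return indices[:n_recon], indices[n_recon:]

rng = random.Random(0)
recon_idx, diff_idx = split_batch(batch_size=16, recon_fraction=0.25)

# Reconstruction examples use t = 0 exactly; diffusion examples share a shifted grid.
times = {i: 0.0 for i in recon_idx}
times.update(zip(diff_idx, low_discrepancy_times(len(diff_idx), rng)))
```

Compared with drawing each $t$ independently, the shifted grid places exactly one sample in every interval of width $1/k$, which reduces the variance of the Monte Carlo estimate of the diffusion loss.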

The noise schedule is learnable, parameterized as $\gamma(t) = \gamma_0 + (\gamma_1 - \gamma_0)\tilde{\gamma}(t)$. The endpoints $\gamma_0$ and $\gamma_1$ minimize the diffusion loss directly, while the interior shape $\tilde{\gamma}(t)$ is updated to minimize the variance of the loss estimator. This learning process ensures that the per-timestep diffusion loss remains constant, effectively distributing denoising difficulty uniformly across time.
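One way to realize such a parameterization is sketched below. The piecewise-linear form of $\tilde{\gamma}$, built from softplus-positive increments so that it rises monotonically from 0 to 1, is an assumption for illustration; the text does not say how RePlaid represents the interior shape. The class name and the endpoint values are hypothetical.

```python
import math

def softplus(u):
    """Numerically stable log(1 + exp(u)); always positive."""
    return math.log1p(math.exp(-abs(u))) + max(u, 0.0)

class NoiseSchedule:
    """gamma(t) = gamma_0 + (gamma_1 - gamma_0) * gamma_tilde(t)."""

    def __init__(self, gamma_0, gamma_1, raw_weights):
        self.gamma_0 = gamma_0
        self.gamma_1 = gamma_1
        # Positive increments, normalized so gamma_tilde(0) = 0 and gamma_tilde(1) = 1.
        increments = [softplus(w) for w in raw_weights]
        total = sum(increments)
        self.knots = [0.0]
        for inc in increments:
            self.knots.append(self.knots[-1] + inc / total)
        self.knots[-1] = 1.0  # guard against rounding error

    def gamma_tilde(self, t):
        """Piecewise-linear interpolation between evenly spaced knots."""
        k = len(self.knots) - 1
        pos = min(max(t, 0.0), 1.0) * k
        i = min(int(pos), k - 1)
        frac = pos - i
        return self.knots[i] + frac * (self.knots[i + 1] - self.knots[i])

    def gamma(self, t):
        return self.gamma_0 + (self.gamma_1 - self.gamma_0) * self.gamma_tilde(t)

# Hypothetical endpoints and shape parameters; in training these would be learned.
schedule = NoiseSchedule(gamma_0=-5.0, gamma_1=5.0, raw_weights=[0.3, -1.2, 0.8, 2.0])
values = [schedule.gamma(i / 20) for i in range(21)]
```

Separating the endpoints from the normalized shape mirrors the text: the endpoints can follow the gradient of the loss, while the shape parameters can be trained on a different objective (the variance of the loss estimator) without moving the endpoints.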

The overall training step involves computing the NELBO terms, backpropagating through the loss, and updating the optimizer. The schedule parameters and the denoiser weights are updated jointly to maximize the likelihood of the data under the model.

Experiment

The evaluation utilizes a unified scaling benchmark on SlimPajama and generation tests on OpenWebText and LM1B to compare RePlaid against discrete and continuous diffusion language models. IsoFLOP analysis reveals that RePlaid scales competitively with autoregressive baselines while demonstrating superior parameter efficiency and outperforming MDLM in over-trained regimes. Furthermore, likelihood and sampling evaluations confirm that RePlaid achieves the best perplexity bounds among considered models and generates high-quality text comparable to discrete counterparts, highlighting the benefits of optimizing a true variational bound.

The experiment assesses the stability of the ODE-based likelihood estimator by varying solver settings and divergence estimation parameters. Mean perplexity remains consistent across configurations, indicating that the baseline solver is sufficiently converged and the estimator is unbiased. Solver variations, including higher step counts and adaptive methods, produce negligible changes to the likelihood estimate. The divergence estimator performs consistently whether it uses Rademacher or Gaussian probe distributions. Raising the Hutchinson sample count increases computational cost substantially while having little effect on the final perplexity.
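For readers unfamiliar with the divergence estimator mentioned here, the sketch below shows the Hutchinson trace estimator on a small fixed matrix. In an ODE-based likelihood computation the matrix would be the Jacobian of the ODE vector field, accessed only through vector products; the explicit matrix `A`, the function names, and the sample counts are assumptions for illustration.

```python
import random

def matvec(A, v):
    """Matrix-vector product for a list-of-lists matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def hutchinson_trace(A, num_samples, rng, probe="rademacher"):
    """Estimate trace(A) as the average of v^T A v over random probe vectors v."""
    n = len(A)
    total = 0.0
    for _ in range(num_samples):
        if probe == "rademacher":
            v = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        elif probe == "gaussian":
            v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        else:
            raise ValueError(f"unknown probe distribution: {probe}")
        Av = matvec(A, v)
        total += sum(vi * avi for vi, avi in zip(v, Av))
    return total / num_samples

A = [[2.0, 0.5, 0.0],
     [0.5, 1.0, 0.3],
     [0.0, 0.3, 3.0]]   # toy stand-in for a Jacobian; true trace is 6.0

rng = random.Random(0)
est_rademacher = hutchinson_trace(A, 1000, rng, probe="rademacher")
est_gaussian = hutchinson_trace(A, 1000, rng, probe="gaussian")
```

Both probe distributions give an unbiased estimate of the trace, so averaging more samples only reduces variance. That is consistent with the reported finding that extra Hutchinson samples add cost without changing the mean perplexity.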

The authors investigate the impact of the chain-rule term on ODE-based likelihood estimation for RePlaid and LangFlow. Omitting this term substantially deflates PPL for both models, producing a large downward bias and implausible scores. Including it ensures the estimate is a valid upper bound on negative log-likelihood and yields results consistent with VDM NELBO baselines. Under the corrected protocol, RePlaid consistently achieves lower perplexity than LangFlow.

The table compares test perplexity across autoregressive, discrete diffusion, and continuous diffusion language models on the LM1B and OpenWebText datasets. RePlaid with self-conditioning achieves the lowest perplexity among diffusion models on OpenWebText, outperforming strong discrete baselines such as MDLM and Duo. On LM1B, it outperforms Duo but trails MDLM. Even without self-conditioning, RePlaid outperforms the other continuous diffusion methods and the Duo baseline, beating both Duo and LangFlow on OpenWebText.

The table presents an ablation study evaluating how individual components contribute to RePlaid's perplexity. The full configuration with self-conditioning achieves the lowest perplexity among all tested configurations. Removing learnable embeddings causes the largest degradation. The learnable noise schedule, the output prior, and self-conditioning each improve perplexity, with the learnable noise schedule and self-conditioning providing substantial gains over their ablated versions.

Experiments evaluate the stability of an ODE-based likelihood estimator, demonstrating that baseline solver settings yield consistent results without requiring increased computational complexity. Further analysis validates the necessity of a chain-rule correction to prevent biased perplexity scores, confirming that RePlaid outperforms LangFlow when the estimation is properly calibrated. Comparative and ablation studies reveal that RePlaid with self-conditioning achieves superior performance against various diffusion and discrete baselines, with learnable embeddings identified as the most critical architectural component.

