Nemotron-Labs targets speed-of-light text with diffusion models
NVIDIA has unveiled Nemotron-Labs Diffusion, a new family of language models designed to overcome the speed limits of traditional autoregressive text generation. Autoregressive models have been the industry standard for code generation, summarization, and problem-solving, but they generate text one token at a time. This sequential approach forces the model to load its weights from memory for every token, creating a bottleneck that leaves much of the GPU's computing power underutilized. In addition, once a token is generated it cannot be revised, so early errors can propagate through the rest of the output.

Nemotron-Labs Diffusion instead uses diffusion language models. Rather than generating tokens one by one, these models generate multiple tokens in parallel and then iteratively refine them. This improves runtime performance by using modern GPU architectures more efficiently, and it lets the model revise previously generated tokens, which makes it well suited to editing existing text and filling in the middle of a passage.

The family includes text models at 3 billion, 8 billion, and 14 billion parameters, all available under the commercially friendly NVIDIA Nemotron Open Model License. An 8 billion parameter vision-language model is also available under the NVIDIA Source Code License. NVIDIA is releasing the training code as well, through the Megatron Bridge framework, to give researchers more flexibility.

A key innovation of this release is the unification of autoregressive and diffusion capabilities within a single model. Developers can choose among three generation modes without altering their application code. The autoregressive mode operates like a standard left-to-right language model, ensuring compatibility with existing workflows. The diffusion mode generates text in blocks, gradually refining the tokens in each block over multiple steps.
A third mode, self-speculation, uses diffusion to draft multiple candidate tokens and then uses autoregressive decoding to verify them, combining the speed of diffusion with the reliability of standard models. This flexibility allows fast generation across a range of batch sizes, including single-query workloads.

The reported benchmarks show gains in both accuracy and speed. The 8 billion parameter version of Nemotron-Labs Diffusion achieved a 1.2% improvement in average accuracy over Qwen3 8B. Inference speed was measured in tokens per forward pass: the diffusion mode reached 2.6 times the figure for autoregressive models, and the self-speculation mode reached 6 times with linear speculation and 6.4 times with quadratic speculation, while maintaining comparable accuracy across the evaluated tasks.

The training methodology builds on recent work that converts pretrained autoregressive models into diffusion models by altering the attention mechanism and performing continued pretraining. Nemotron-Labs Diffusion was trained on 1.3 trillion tokens from the NVIDIA Nemotron Pretraining datasets and then underwent supervised fine-tuning on 45 billion tokens from post-training datasets. The training uses a joint objective, which allows the model to retain its original autoregressive capabilities while gaining the ability to draft and refine in parallel.

Deployment is being integrated into the SGLang inference framework, allowing developers to serve the same model checkpoint in three different ways via a simple configuration change.

With Nemotron-Labs Diffusion, NVIDIA offers a practical option for developers who want to accelerate text generation without abandoning familiar workflows. By enabling draft-refine-verify cycles and offering multiple inference modes, the new model family gives developers a tool for building latency-sensitive applications and improving overall generation quality.
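
To make the draft-then-verify idea behind the self-speculation mode concrete, here is a minimal Python sketch of generic greedy speculative verification. It is an illustration under stated assumptions, not NVIDIA's implementation: the names `verify_draft`, `self_speculate`, `toy_next_token`, and `toy_draft_block` are hypothetical, the "models" are toy functions over integers, and the article does not say how Nemotron-Labs Diffusion accepts or rejects drafted tokens. In a real system the verifier would score every drafted position in a single forward pass; the sketch counts one round per draft to mirror that.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]        # stand-in for the autoregressive verifier
DraftFn = Callable[[List[Token], int], List[Token]]  # stand-in for the diffusion drafter


def verify_draft(context: List[Token], draft: List[Token],
                 next_token: NextTokenFn) -> List[Token]:
    """Accept drafted tokens while they match the verifier's greedy choice.

    On the first mismatch, keep the verifier's token instead and stop.
    If the whole draft is accepted, append one extra verifier token.
    """
    accepted: List[Token] = []
    for tok in draft:
        expected = next_token(context + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the rejected draft token
            return accepted
        accepted.append(tok)
    accepted.append(next_token(context + accepted))  # bonus token after a full accept
    return accepted


def self_speculate(prompt: List[Token], draft_block: DraftFn, next_token: NextTokenFn,
                   max_new_tokens: int, block_size: int = 4) -> Tuple[List[Token], int]:
    """Alternate drafting and verifying; return the new tokens and the round count."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new_tokens:
        draft = draft_block(out, block_size)
        out.extend(verify_draft(out, draft, next_token))
        rounds += 1
    start = len(prompt)
    return out[start:start + max_new_tokens], rounds


def toy_next_token(seq: List[Token]) -> Token:
    """Toy verifier: the next token is always the previous one plus 1, modulo 10."""
    return (seq[-1] + 1) % 10


def toy_draft_block(seq: List[Token], k: int) -> List[Token]:
    """Toy drafter: guesses the right continuation but gets the last token wrong."""
    draft: List[Token] = []
    last = seq[-1]
    for _ in range(k):
        last = (last + 1) % 10
        draft.append(last)
    draft[-1] = (draft[-1] + 5) % 10  # deliberate error, so verification has work to do
    return draft


if __name__ == "__main__":
    tokens, rounds = self_speculate([0], toy_draft_block, toy_next_token,
                                    max_new_tokens=8, block_size=4)
    print(tokens, rounds)
```

In this toy setup the drafter's last guess in every block is wrong, yet the verified output is identical to what plain token-by-token decoding would produce, and it arrives in two rounds rather than eight steps. That is the property the article points to when it says self-speculation combines the speed of diffusion with the reliability of standard models.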

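The tokens-per-forward-pass figures quoted above can be turned into a rough count of forward passes for a given output length. The short Python sketch below does that arithmetic. It assumes that autoregressive decoding produces exactly one token per forward pass and that the quoted multipliers hold as averages; the function name `forward_passes_needed` and the 1,000-token output length are illustrative choices, not details from NVIDIA.

```python
import math

# Multipliers quoted in the article, read here as average tokens per forward pass
# with autoregressive decoding as the 1.0 baseline (an assumption of this sketch).
TOKENS_PER_FORWARD_PASS = {
    "autoregressive": 1.0,
    "diffusion": 2.6,
    "self-speculation (linear)": 6.0,
    "self-speculation (quadratic)": 6.4,
}


def forward_passes_needed(num_tokens: int, tokens_per_pass: float) -> int:
    """Return how many forward passes are needed to emit num_tokens tokens."""
    if num_tokens < 0 or tokens_per_pass <= 0:
        raise ValueError("num_tokens must be >= 0 and tokens_per_pass must be > 0")
    return math.ceil(num_tokens / tokens_per_pass)


if __name__ == "__main__":
    for mode, rate in TOKENS_PER_FORWARD_PASS.items():
        print(f"{mode}: {forward_passes_needed(1000, rate)} forward passes")
```

Under these assumptions, a 1,000-token response needs 1,000 forward passes in autoregressive mode, 385 in diffusion mode, and 167 or 157 with linear or quadratic self-speculation. Fewer forward passes do not translate one-to-one into wall-clock time, since a pass that handles a whole block of tokens can cost more than a single-token pass.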