Seed Diffusion: 2000+ tokens/s, Outperforms Gemini Diffusion
Researchers from Tsinghua University's Institute for AI Research (AIR), in collaboration with ByteDance's Seed team and SIA-Lab, have unveiled Seed Diffusion Preview, a diffusion-based large language model optimized for code generation. The model achieves an inference speed of 2,146 tokens per second, surpassing Google's Gemini Diffusion and representing a 5.4x speedup over comparable autoregressive models, while matching or exceeding their accuracy on major benchmarks. This combination of efficiency and capability could redefine language-model architecture, potentially establishing diffusion models as a leading paradigm for next-generation AI systems.
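The two headline numbers are consistent with each other; a back-of-the-envelope check (the autoregressive baseline below is derived from the quoted figures, not stated in the announcement):

```python
# Implied autoregressive baseline from the reported figures:
# 2,146 tokens/s at a 5.4x speedup over comparable AR models.
diffusion_tps = 2146
speedup = 5.4
ar_baseline_tps = diffusion_tps / speedup   # implied AR throughput
print(round(ar_baseline_tps))               # roughly 397 tokens/s
```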
Crucially, it excels at tasks requiring global planning and structural awareness, such as code editing (e.g., the CanItEdit benchmark), where its ability to generate coherent, well-structured outputs gives it an edge over autoregressive approaches.

The success of Seed Diffusion Preview rests on four key innovations.

Two-Stage Curriculum Learning

Standard masked diffusion models attend only to masked positions and lack global consistency. To overcome this, the team implemented a two-stage training strategy: the first stage focuses on local reconstruction, while the second refines the entire sequence for global coherence, enabling more accurate, context-aware generation.

Structured Prior Incorporation

Natural language and code exhibit strong causal dependencies (e.g., a variable must be declared before it is used). Standard diffusion models, which generate tokens in arbitrary order, often ignore these structures. To fix this, the team introduced constraint-based sequential training: using a pre-trained model, they generated and filtered high-quality generation trajectories, then distilled them to teach the diffusion model proper dependency awareness.

Same-Strategy (On-Policy) Learning Paradigm

Despite the theoretical promise of parallel decoding in diffusion models, practical implementations have been hindered by the high computational cost of each step and the risk of quality degradation when the number of steps is reduced. The team proposed a same-strategy (on-policy) learning approach: the model is trained to minimize the number of generation steps |τ| while a verifier V ensures output quality. To stabilize training, they introduced a surrogate loss based on the edit distance between successive steps, encouraging the model to converge faster and more efficiently. This process implicitly prunes low-quality or inefficient generation paths, similar to mode filtering in non-autoregressive models.
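The edit-distance surrogate can be made concrete with a small sketch. The Levenshtein routine below is standard; treating the summed distance between consecutive denoising outputs as a stability penalty is an illustration of the idea, not the team's exact loss.

```python
def edit_distance(a, b):
    """Levenshtein distance via a rolling 1-D DP row."""
    m, n = len(a), len(b)
    row = list(range(n + 1))
    for i in range(1, m + 1):
        prev, row[0] = row[0], i
        for j in range(1, n + 1):
            prev, row[j] = row[j], min(
                row[j] + 1,                      # deletion
                row[j - 1] + 1,                  # insertion
                prev + (a[i - 1] != b[j - 1]),   # substitution (0 if equal)
            )
    return row[n]

def trajectory_penalty(step_outputs):
    """Sum of edit distances between consecutive denoising steps.
    A small value means each step makes stable, convergent progress,
    which is the behavior the surrogate loss rewards."""
    return sum(edit_distance(s, t) for s, t in zip(step_outputs, step_outputs[1:]))
```

A trajectory that flips tokens back and forth accrues a larger penalty than one that settles quickly, which is exactly the kind of inefficient path the training procedure implicitly prunes.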
System-Level Engineering Optimization

To balance throughput and latency, the team designed a block-level parallel sampling scheme that preserves causal ordering between blocks, using KV-caching to reuse previously generated blocks as context for subsequent ones. Combined with custom infrastructure optimized for diffusion sampling, this system enables flexible, high-speed inference. Experiments show that block size significantly affects performance, with well-chosen configurations achieving the best trade-off between speed and quality.

Extensive testing confirms that Seed Diffusion Preview not only accelerates inference by 5.4x but also matches or exceeds autoregressive models in code-generation accuracy. It performs particularly well on tasks requiring global planning and structural integrity, highlighting the inherent advantages of the diffusion framework for complex reasoning.

While speed is the most immediate benefit, the team believes the deeper value lies in the model's ability to support more sophisticated, structured reasoning at scale. Seed Diffusion Preview is not just a faster model; it is a new paradigm with the potential to transform how we build and deploy large language models, especially for tasks requiring long-range coherence, planning, and logical consistency.
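The block-level scheme described above can be sketched in a few lines. Everything here is schematic: `denoise_fn` stands in for the real model, and the cache argument only marks where reused key-value activations would live in the actual system.

```python
def block_parallel_decode(denoise_fn, prompt, num_blocks, block_size, steps_per_block):
    """Schematic block-level parallel sampling: blocks are emitted
    left to right (causal ordering between blocks), while tokens
    inside a block are denoised in parallel over a few steps.
    Finished blocks become fixed context for later blocks, which is
    where KV-caching avoids recomputation in the real system."""
    out, cache = list(prompt), None
    for _ in range(num_blocks):
        block = ["<mask>"] * block_size              # each block starts fully masked
        for _ in range(steps_per_block):
            block, cache = denoise_fn(block, context=out, cache=cache)
        out.extend(block)                            # block is now frozen context
    return out

def toy_denoiser(block, context, cache):
    """Stand-in denoiser: fills every mask with its absolute position."""
    filled = [t if t != "<mask>" else str(len(context) + i) for i, t in enumerate(block)]
    return filled, cache
```

Running `block_parallel_decode(toy_denoiser, ["p"], num_blocks=2, block_size=2, steps_per_block=1)` yields `["p", "1", "2", "3", "4"]`, showing how each block conditions on everything generated before it while its own tokens are produced together.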