Qwen3-Next Unveils Hybrid Attention and High-Sparsity MoE for Faster, Smarter Inference
Qwen3-Next introduces a significant architectural advance: a hybrid attention mechanism paired with a high-sparsity Mixture of Experts (MoE) design, a major step forward in balancing efficiency and performance for large language models. The model has been officially integrated into the Hugging Face Transformers library, solidifying its place in the open ecosystem.

At the core of Qwen3-Next's innovation is the hybrid attention framework, which interleaves three Gated DeltaNet layers with one gated softmax-attention layer. This 3:1 ratio was determined through extensive experimentation as the best trade-off between computational efficiency and modeling capability. Gated DeltaNet reduces attention complexity from O(n²) to O(n), enabling faster inference on long sequences while preserving strong contextual understanding.

Gated DeltaNet first passes the input through a linear projection to produce parameters a and b. From these, two controls are derived: beta, obtained by applying a sigmoid to b, and a decay gate g, computed from a trainable weight A_log and bias dt_bias through a softplus activation. These values drive a recurrence that mimics RNN dynamics, running across token positions rather than attending over all token pairs. At each step, a key-value memory is updated with the current key and a delta: the difference between the incoming value and what the memory currently predicts for that key, scaled by beta. This delta update continually corrects the recurrent state, allowing the model to capture long-range dependencies efficiently. The attention output is then read out by projecting the updated state with the query, yielding a linear-time operation.

A further enhancement in this architecture is the inclusion of a Z projection during QKV computation, followed by a normalization and gating step: an RMSNormGated module that combines RMSNorm with a SiLU-activated gate. Its zero-centered initialization helps stabilize training by preventing unbounded growth of the norm parameters.
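The recurrence described above can be sketched in a few lines of NumPy. This is a minimal illustration of the delta-rule idea under simplifying assumptions, not the production kernel: the function names (`gated_delta_step`, `gated_delta_net`) are made up here, and the real module uses per-head gating and fused kernels rather than scalar gates and a Python loop.

```python
import numpy as np

def gated_delta_step(S, k, v, beta, g):
    """One recurrent step of a (simplified) gated delta rule.

    S    : (d_k, d_v) key-value memory state
    k    : (d_k,) key for the current token (assumed unit-norm)
    v    : (d_v,) value for the current token
    beta : write strength in (0, 1), a sigmoid of the b projection
    g    : decay gate in (0, 1), from A_log / dt_bias via softplus
    """
    S = g * S                            # decay the old memory
    pred = S.T @ k                       # what memory predicts for this key
    delta = v - pred                     # prediction error
    S = S + beta * np.outer(k, delta)    # write the correction
    return S

def gated_delta_net(K, Q, V, betas, gates):
    """Run the recurrence over a sequence; cost is linear in length."""
    d_k, d_v = K.shape[1], V.shape[1]
    S = np.zeros((d_k, d_v))
    outs = []
    for k, q, v, b, g in zip(K, Q, V, betas, gates):
        S = gated_delta_step(S, k, v, b, g)
        outs.append(S.T @ q)             # read the memory with the query
    return np.stack(outs)
```

With beta = 1 and no decay, a single write stores v exactly at a unit-norm key, which is the delta rule's defining property: the state is corrected by exactly the prediction error.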
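The RMSNormGated step can likewise be sketched as follows. The class name and plain-NumPy shapes are assumptions for illustration, but the key detail matches the description above: the effective scale is (1 + weight) with weight initialized to zero, and the normalized output is gated by SiLU of the Z projection.

```python
import numpy as np

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

class ZeroCenteredRMSNormGated:
    """Gated RMSNorm with a zero-centered scale: the learned weight
    starts at 0 and the effective scale is (1 + weight), so at
    initialization the layer is pure normalization."""

    def __init__(self, dim, eps=1e-6):
        self.weight = np.zeros(dim)   # zero-centered initialization
        self.eps = eps

    def __call__(self, x, z):
        rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + self.eps)
        normed = x / rms * (1.0 + self.weight)
        return normed * silu(z)       # gate with SiLU of the z projection
```

Because the scale deviation (rather than the scale itself) is what is learned, weight decay pulls the scale toward 1 instead of toward 0, which is one reason this parameterization avoids unbounded norm growth.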
The weight starts at zero, so the layer behaves like pure normalization early in training, which reduces gradient instability and improves convergence.

The MoE component in Qwen3-Next achieves remarkable sparsity: only about 3.7% of parameters are activated per token during inference, making it one of the most efficient MoE designs to date. It uses a dual-track structure: one path routes each token to a small set of specialized experts via a router, while a shared expert handles general-purpose processing. The design mirrors a medical consultation system, where the shared expert acts as a general practitioner covering foundational language patterns and the routed experts serve as specialists for niche knowledge. This dual-path mechanism improves robustness and keeps performance consistent even at high sparsity.

Qwen3-Next additionally adopts Multi-Token Prediction (MTP), which accelerates inference by drafting multiple future tokens in a single forward pass, reducing latency without sacrificing accuracy.

Together, these innovations position Qwen3-Next as a leading model in the next generation of efficient, scalable AI systems: hybrid attention, high-sparsity MoE, and MTP, with Zero-Centered RMSNorm further strengthening training stability across deep architectures.

These architectural choices reflect a broader industry trend: moving beyond pure-attention models toward hybrid, sparse, and optimized designs. From Google's Infini-Attention to MiniMax's Lightning Attention, efficient, scalable attention mechanisms are no longer experimental; they are becoming the standard. Qwen3-Next exemplifies this evolution, delivering high performance at significantly reduced computational cost.
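The dual-track MoE idea can be sketched as follows. This is a minimal single-token illustration under assumed sizes: the class name is hypothetical, plain linear maps stand in for full FFN experts, and the expert count and top-k are placeholders, not the model's actual configuration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class DualTrackMoE:
    """Sketch of a dual-track MoE layer: a shared expert always runs,
    while a router selects top-k of many experts; only the selected
    experts are computed for a given token, hence the sparsity."""

    def __init__(self, dim, n_experts=8, top_k=2, seed=0):
        rng = np.random.default_rng(seed)
        self.router = rng.normal(scale=0.02, size=(dim, n_experts))
        self.experts = [rng.normal(scale=0.02, size=(dim, dim))
                        for _ in range(n_experts)]
        self.shared = rng.normal(scale=0.02, size=(dim, dim))
        self.top_k = top_k

    def __call__(self, x):                       # x: (dim,) one token
        logits = x @ self.router
        top = np.argsort(logits)[-self.top_k:]   # indices of chosen experts
        w = softmax(logits[top])                 # renormalized gate weights
        routed = sum(wi * (x @ self.experts[i]) for wi, i in zip(w, top))
        return routed + x @ self.shared          # shared path always active
```

Only top_k of n_experts matrices are ever multiplied per token, which is where the "activate a few percent of parameters" figure comes from; the shared expert guarantees a dense baseline computation regardless of routing.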
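The MTP mechanism can be pictured as extra output heads that read the same hidden state and each propose one of the next few tokens; the drafts can then be verified cheaply, as in speculative decoding. The class name, head count, and shapes below are illustrative assumptions, not the model's actual MTP implementation.

```python
import numpy as np

class MTPHeads:
    """Sketch of Multi-Token Prediction: k output heads map one hidden
    state to k draft tokens in a single forward pass."""

    def __init__(self, dim, vocab, k=3, seed=0):
        rng = np.random.default_rng(seed)
        self.heads = [rng.normal(scale=0.02, size=(dim, vocab))
                      for _ in range(k)]

    def __call__(self, h):                    # h: (dim,) final hidden state
        # Each head proposes a token for position t+1, t+2, ..., t+k.
        return [int(np.argmax(h @ W)) for W in self.heads]
```

The latency win comes from amortization: one forward pass yields several candidate tokens, and only mispredicted positions fall back to ordinary step-by-step decoding.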
