
NeurIPS 2025 Best Paper Reveals How Gated Attention Boosts LLM Stability, Scaling, and Context Length

The NeurIPS 2025 Best Paper Award was given to the Qwen team for their groundbreaking work, "Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free." This research offers a systematic and practical exploration of attention gating in transformers, delivering clear, actionable insights that are already influencing the next generation of LLM development.

At its core, the paper investigates how inserting a simple gating mechanism, typically a learned sigmoid-modulated weight, after the scaled dot-product attention (SDPA) output can dramatically improve model behavior. The key finding is that placing the gate at the SDPA output (the G1 position) provides the most significant benefits, including enhanced training stability, the ability to use higher learning rates, and improved scaling across model depth and context length.

The Qwen team's work reveals that attention gating introduces two critical advantages: non-linearity and input-dependent sparsity. By modulating the attention output with a learned gate, the model can suppress irrelevant or noisy interactions, leading to cleaner, more focused attention patterns. This is especially effective in curbing the "attention sink" phenomenon, where early tokens, particularly the first token, dominate attention across layers, distorting representation learning and causing training instability. Gating mitigates this by reducing extreme activations, allowing the model to maintain balanced attention even in deep networks.

The paper also demonstrates that gated models can be extended to much longer context lengths without retraining from scratch. By first training on 4,096-token sequences, then expanding the RoPE base from 10k to 1M, and finally applying YaRN (Yet another RoPE extensioN) to scale to 128k, the model maintains performance and stability.
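The gating mechanism described above can be sketched in a few lines of PyTorch. This is a minimal illustration of the G1 placement (a head-specific, elementwise sigmoid gate multiplied into the SDPA output), not the Qwen team's actual implementation; layer names are made up here, and details such as RoPE, masking, and dropout are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttention(nn.Module):
    """Multi-head attention with a sigmoid gate on the SDPA output (G1).

    A simplified sketch of the idea in the paper; not the official code.
    """

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        # Head-specific, elementwise gate: one value per head dimension.
        self.gate = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (B, H, T, d_head) for multi-head attention.
        q, k, v = (
            t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
            for t in (q, k, v)
        )
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = attn.transpose(1, 2).reshape(B, T, -1)
        # G1: multiply the SDPA output by a learned sigmoid gate computed
        # from the same hidden states, suppressing irrelevant interactions.
        attn = attn * torch.sigmoid(self.gate(x))
        return self.out(attn)
```

Because the gate is a single linear projection plus a sigmoid, its cost is small relative to the attention and FFN blocks, consistent with the paper's reported sub-2% latency overhead.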
The gated variant outperforms the baseline in the long-context regime, suggesting it is less dependent on the attention sink for stable operation and more robust to changes in positional encoding.

The study evaluates multiple configurations, including gate placement (on Q, K, V, or the SDPA output), activation functions (sigmoid vs. SiLU), and whether gates are shared across heads or head-specific. The results are clear: head-specific, elementwise, sigmoid-gated attention at the SDPA output (G1) delivers the best performance. Multiplicative gating is superior to additive, and shared gates degrade head specialization. Importantly, the paper shows that the gating module adds less than 2% latency, making it highly efficient.

The authors also emphasize the value of open research, noting that such findings, especially those derived from industrial-scale experiments, are rare in today's competitive AI landscape. For practitioners, the takeaway is straightforward: a simple SDPA-output gate with a sigmoid activation and head-specific weights can improve training stability, enable higher learning rates, reduce attention-sink effects, and support longer context lengths with minimal overhead. The technique is immediately applicable to both dense and MoE-based LLMs.

The paper stands as a testament to the power of fundamental architectural refinement. While the transformer has been the dominant architecture for years, the Qwen team shows that even small, well-placed modifications can yield substantial gains. Their work is not just a technical advance but a methodological one, demonstrating the value of systematic, data-driven analysis in a field often driven by scale and speculation. In a conference that saw a record 21,575 submissions and a surge in reinforcement learning, agent systems, and multimodal AI, the Qwen paper stands out as a rare example of deep, principled innovation.
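The head-specific versus shared-gate distinction comes down to parameter and tensor shapes. The sketch below uses hypothetical sizes and random tensors in place of a real attention output, purely to illustrate how the two variants differ.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: batch B, heads H, sequence length T, head dim D.
B, H, T, D = 2, 8, 16, 64
x = torch.randn(B, T, H * D)        # hidden states feeding the gates
sdpa_out = torch.randn(B, H, T, D)  # stand-in for the SDPA output

# Best-performing variant: head-specific, elementwise sigmoid gate,
# applied multiplicatively to the SDPA output (G1). Each head gets its
# own D-dimensional gate at every position.
head_gate = nn.Linear(H * D, H * D)
g = torch.sigmoid(head_gate(x)).view(B, T, H, D).transpose(1, 2)
gated = sdpa_out * g                # (B, H, T, D)

# Shared-across-heads variant: one gate broadcast to every head.
# The paper finds this degrades head specialization.
shared_gate = nn.Linear(H * D, D)
g_shared = torch.sigmoid(shared_gate(x)).unsqueeze(1)  # (B, 1, T, D)
gated_shared = sdpa_out * g_shared  # broadcasts over the head axis
```

Multiplicative gating with a sigmoid keeps each gate value in (0, 1), so it can only attenuate the attention output, which is what enables the input-dependent sparsity the paper highlights.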
It reminds the community that progress in AI is not just about bigger models, but about better understanding the mechanisms that make them work.
