Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning

Abstract
In this technical report, we present the Ring-linear model series, specifically Ring-mini-linear-2.0 and Ring-flash-linear-2.0. Ring-mini-linear-2.0 comprises 16B parameters with 957M activated, while Ring-flash-linear-2.0 contains 104B parameters with 6.1B activated. Both models adopt a hybrid architecture that effectively integrates linear attention and softmax attention, significantly reducing I/O and computational overhead in long-context inference scenarios. Compared to a 32-billion-parameter dense model, this series reduces inference cost to 1/10, and compared to the original Ring series, the cost is reduced by over 50%. Furthermore, through systematic exploration of the ratio between the different attention mechanisms in the hybrid architecture, we have identified the currently optimal model structure. Additionally, by leveraging our self-developed high-performance FP8 operator library, linghe, overall training efficiency is improved by 50%. Benefiting from the high alignment between the training and inference engine operators, the models can undergo long-term, stable, and highly efficient optimization during the reinforcement learning phase, consistently maintaining SOTA performance across multiple challenging complex reasoning benchmarks.
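To illustrate the general idea of a hybrid attention stack, the following is a minimal sketch that interleaves kernel-based linear attention (O(n) in sequence length) with standard softmax attention (O(n^2)) at a fixed ratio. The specific ratio, the ELU+1 feature map, and the module structure are illustrative assumptions for this sketch, not the actual Ring-linear-2.0 design.

```python
# Minimal sketch of a hybrid layer stack interleaving linear and softmax
# attention. The ratio and attention implementations are illustrative
# assumptions, not the Ring-linear-2.0 implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftmaxAttention(nn.Module):
    """Standard multi-head softmax attention (quadratic in sequence length)."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(b, n, self.n_heads, -1).transpose(1, 2) for t in (q, k, v))
        o = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(o.transpose(1, 2).reshape(b, n, d))


class LinearAttention(nn.Module):
    """Kernel-based causal linear attention (linear in sequence length)."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(b, n, self.n_heads, -1).transpose(1, 2) for t in (q, k, v))
        q, k = F.elu(q) + 1, F.elu(k) + 1  # positive feature map
        # Causal running sums of k_i v_i^T and of k_i (the normalizer).
        kv = torch.cumsum(k.unsqueeze(-1) * v.unsqueeze(-2), dim=2)
        z = torch.cumsum(k, dim=2)
        o = (q.unsqueeze(-2) @ kv).squeeze(-2)
        o = o / (q * z).sum(-1, keepdim=True).clamp(min=1e-6)
        return self.out(o.transpose(1, 2).reshape(b, n, d))


class HybridStack(nn.Module):
    """Place one softmax-attention layer every `ratio` layers; the rest are linear."""
    def __init__(self, n_layers=8, d_model=256, n_heads=4, ratio=4):
        super().__init__()
        self.layers = nn.ModuleList(
            SoftmaxAttention(d_model, n_heads) if (i + 1) % ratio == 0
            else LinearAttention(d_model, n_heads)
            for i in range(n_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            x = x + layer(x)  # residual connection; norms and MLPs omitted for brevity
        return x


x = torch.randn(1, 128, 256)
print(HybridStack()(x).shape)  # torch.Size([1, 128, 256])
```

Because only a small fraction of layers keep the quadratic softmax attention and its KV cache, the memory and compute of long-context inference are dominated by the constant-size state of the linear-attention layers, which is the source of the cost reductions reported above.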