HyperAI

Google's New MoR Architecture: Doubling Transformer Inference Speed with Half the Memory

Google DeepMind and researchers from the Korea Advanced Institute of Science and Technology (KAIST) have recently introduced a novel language model architecture called "Mixture-of-Recursions" (MoR). The architecture reportedly doubles inference speed, reduces training compute requirements, and cuts KV cache memory usage by about 50%, while maintaining model performance.

Since its introduction in 2017, the Transformer architecture has been the cornerstone of large language models, underpinning almost all advanced AI systems. As models grow, however, the compute and memory demands of the Transformer become increasingly prohibitive, making training and deployment extremely costly. Previous efficiency improvements have usually targeted a single aspect, such as reducing model size through parameter sharing or optimizing compute allocation with adaptive methods. MoR stands out by addressing multiple efficiency goals within a unified framework.

The core innovation of MoR is combining recursive computation with a dynamic routing mechanism, so that different tokens are processed to different depths depending on their complexity. In a standard Transformer, every token passes through the same number of layers. MoR instead adapts the processing depth per token: shared parameter blocks improve parameter efficiency, and a lightweight "router" decides how many recursion steps each token should undergo. The research team tested several routing strategies, including expert-choice and token-choice, to balance computational load and avoid logical problems in how information is processed; a simplified code sketch of this per-token recursion appears below, after the memory-management strategies.

For parameter sharing, the "Middle-Cycle" strategy proved most effective: the first and last layers keep unique parameters while the intermediate layers share weights, striking a good balance between parameter efficiency and model expressiveness.

Memory management is another critical improvement in MoR. Even with parameter sharing, traditional recursive models still keep a separate KV cache for every layer, leading to high memory usage. MoR introduces two strategies to address this:

Recursive Caching: KV data is stored only for the tokens routed to a given recursive step, and attention at that step is restricted to that local data, reducing both memory usage and data read/write operations.

Recursive Sharing: KV data is cached only in the first recursive block and reused in subsequent steps, maximizing memory savings.
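To make the idea concrete, here is a minimal PyTorch sketch of a Mixture-of-Recursions-style forward pass: a unique first and last layer, one shared middle block applied a token-dependent number of times, and a lightweight router that assigns each token its recursion depth. The class name TinyMoR, the depth cap max_recursions, and the argmax-based token-choice routing are assumptions made for illustration, not the authors' implementation; the real model also restricts attention and KV caching to the tokens active at each recursion step.

```python
# Minimal, illustrative sketch of a Mixture-of-Recursions-style forward pass
# (not the authors' code). "Middle-Cycle" sharing: unique first/last layers,
# one shared middle block applied a token-dependent number of times.
import torch
import torch.nn as nn

class TinyMoR(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, vocab=1000, d=64, max_recursions=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.first = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.shared_mid = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.last = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        # Lightweight router: scores how many recursion steps each token needs.
        self.router = nn.Linear(d, max_recursions)
        self.max_recursions = max_recursions
        self.lm_head = nn.Linear(d, vocab)

    def forward(self, tokens):
        h = self.first(self.embed(tokens))                  # (B, T, d)
        # Token-choice-style routing: each token picks a depth in 1..max_recursions.
        # (A trained model would need a differentiable, load-balanced routing scheme.)
        depths = self.router(h).argmax(dim=-1) + 1          # (B, T)
        for step in range(1, self.max_recursions + 1):
            active = (depths >= step).unsqueeze(-1)         # tokens still recursing
            # For clarity the shared block is computed for all tokens and the update
            # is kept only for active ones; the real design skips the compute for
            # exited tokens and keeps KV entries only for tokens routed to this step.
            h = torch.where(active, self.shared_mid(h), h)
        return self.lm_head(self.last(h))

model = TinyMoR()
logits = model(torch.randint(0, 1000, (2, 16)))
print(logits.shape)  # torch.Size([2, 16, 1000])
```

Roughly speaking, expert-choice routing would instead have each recursion step select a fixed top-k subset of tokens to continue, which keeps the per-step compute budget balanced.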
The researchers ran extensive tests on models ranging from 135 million to 1.7 billion parameters. The results were striking: despite having nearly half the parameters of baseline Transformer models, MoR models achieved a mean accuracy of 43.1% on few-shot learning tasks, surpassing the baseline's 42.3%. MoR's higher computational efficiency also allowed it to process more training data within the same compute budget, further improving overall model quality. In experiments with a fixed amount of training data, one MoR configuration outperformed the baseline while using 25% less training compute, with a 19% reduction in training time and a 25% decrease in peak memory usage.

MoR's inference performance is particularly noteworthy. Using a continuous depth-wise batching technique, MoR groups tokens at different recursion stages into the same batch, since they all run through the same shared parameter block. Combined with early-exit mechanisms, this significantly boosts throughput: in a 360-million-parameter model, the MoR-4 configuration achieved an inference speedup of up to 2.06x in specific settings. The researchers also observed that MoR allocates more recursive steps to semantically rich tokens, such as “People” or “defensively confident”, while simpler tokens like “and” require fewer steps, suggesting that the model concentrates its compute on the most important information.
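As a rough illustration of continuous depth-wise batching, the toy scheduler below (a sketch based on the description above, not the paper's implementation) keeps a fixed number of batch slots busy: because every recursion step runs the same shared block, items at different depths can share one batch, and a token that exits early immediately frees its slot for the next pending token. The function name, batch size, and the (token, remaining_steps) work-item format are invented for this example.

```python
# Toy scheduler illustrating continuous depth-wise batching with early exit.
from collections import deque

def continuous_depth_batching(work_items, batch_size=4):
    """Items are [token_id, remaining_recursion_steps]; all depths share one batch."""
    queue = deque(list(item) for item in work_items)  # pending work
    in_flight = []                                    # items occupying batch slots
    step = 0
    while queue or in_flight:
        # Refill free slots so the shared block always sees a full batch.
        while queue and len(in_flight) < batch_size:
            in_flight.append(queue.popleft())
        step += 1
        print(f"step {step}: shared block applied to {[tok for tok, _ in in_flight]}")
        for item in in_flight:
            item[1] -= 1                              # one recursion step consumed
        # Early exit: finished items leave the batch and free their slot.
        in_flight = [item for item in in_flight if item[1] > 0]

# Tokens with heterogeneous recursion depths, e.g. as assigned by the router.
continuous_depth_batching([("t0", 3), ("t1", 1), ("t2", 2), ("t3", 4), ("t4", 2), ("t5", 1)])
```

In a scheme that batches by layer instead, tokens would have to wait for others at the same depth, leaving slots idle; keeping the batch full is, roughly, where the reported throughput gain comes from.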

MoR builds on earlier Google DeepMind research, including the Mixture-of-Depths (MoD) technique, which explored dynamic allocation of compute, while recursive Transformers provided the foundation for its parameter sharing. Together, these developments mark a shift from single-dimensional optimizations toward a more holistic approach that jointly addresses parameters, compute, and memory.

While it is premature to declare MoR a complete replacement for the Transformer architecture, its potential to significantly enhance both performance and efficiency makes it a compelling direction for future language model design, one that could substantially reduce the deployment and usage costs of large language models across a wide range of applications.

References:
1. https://arxiv.org/abs/2507.10524

Operated and Edited by Chen Long