Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation

Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, Se-Young Yun
Abstract

Scaling language models unlocks impressive capabilities, but the accompanying computational and memory demands make both training and deployment expensive. Existing efficiency efforts typically target either parameter sharing or adaptive computation, leaving open the question of how to attain both simultaneously. We introduce Mixture-of-Recursions (MoR), a unified framework that combines the two axes of efficiency inside a single Recursive Transformer. MoR reuses a shared stack of layers across recursion steps to achieve parameter efficiency, while lightweight routers enable adaptive token-level thinking by dynamically assigning different recursion depths to individual tokens. This allows MoR to focus quadratic attention computation only among tokens still active at a given recursion depth, further improving memory access efficiency by selectively caching only their key-value pairs. Beyond these core mechanisms, we also propose a KV sharing variant that reuses KV pairs from the first recursion, specifically designed to decrease prefill latency and memory footprint. Across model scales ranging from 135M to 1.7B parameters, MoR forms a new Pareto frontier: at equal training FLOPs and smaller model sizes, it significantly lowers validation perplexity and improves few-shot accuracy, while delivering higher throughput compared with vanilla and existing recursive baselines. These gains demonstrate that MoR is an effective path towards large-model quality without incurring large-model cost.
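To make the core mechanism concrete, the following is a minimal sketch (not the authors' implementation) of the idea described in the abstract: a single parameter-shared Transformer block is applied recursively, and a lightweight router decides per token whether it stays active for the next recursion step, so that attention is computed only among still-active tokens. Names such as `MixtureOfRecursionsSketch`, `max_recursions`, and the 0.5 routing threshold are illustrative assumptions, not details from the paper.

```python
# Illustrative sketch of token-level recursive depth routing, assuming a
# simple threshold router; not the paper's actual architecture or code.
import torch
import torch.nn as nn


class MixtureOfRecursionsSketch(nn.Module):
    def __init__(self, d_model: int, n_heads: int, max_recursions: int = 3):
        super().__init__()
        # One parameter-shared Transformer block reused at every recursion step.
        self.shared_block = nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True
        )
        # Lightweight router scoring each token's need for further computation.
        self.router = nn.Linear(d_model, 1)
        self.max_recursions = max_recursions

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); every token starts active.
        active = torch.ones(x.shape[:2], dtype=torch.bool, device=x.device)
        for _ in range(self.max_recursions):
            if not active.any():
                break
            # Apply the shared block, masking out inactive tokens so attention
            # is paid only among tokens still active at this recursion depth.
            updated = self.shared_block(x, src_key_padding_mask=~active)
            # Only active tokens receive the updated representation.
            x = torch.where(active.unsqueeze(-1), updated, x)
            # Router decides, per token, whether to continue recursing.
            keep = torch.sigmoid(self.router(x)).squeeze(-1) > 0.5
            active = active & keep
        return x


# Example usage: 8 tokens of width 64 routed through up to 3 recursion steps.
model = MixtureOfRecursionsSketch(d_model=64, n_heads=4)
out = model(torch.randn(2, 8, 64))
```

In this sketch the per-token exit decision is a hard threshold for readability; the KV-caching behavior described in the abstract (caching key-value pairs only for active tokens, or sharing KV pairs from the first recursion) is not modeled here.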