
Abstract

Reasoning over long contexts is essential for large language models. While reinforcement learning (RL) enhances short-context reasoning by inducing "Aha" moments in chain-of-thought, the advanced thinking patterns required for long-context reasoning remain largely unexplored, and high-difficulty RL data are scarce. In this paper, we introduce LoongRL, a data-driven RL method for advanced long-context reasoning. Central to LoongRL is KeyChain, a synthesis approach that transforms short multi-hop QA into high-difficulty long-context tasks by inserting UUID chains that hide the true question among large collections of distracting documents. Solving these tasks requires the model to trace the correct chain step by step, identify the true question, retrieve relevant facts, and reason over them to answer correctly. RL training on KeyChain data induces an emergent plan-retrieve-reason-recheck reasoning pattern that generalizes far beyond the training length. Models trained at 16K effectively solve 128K tasks without prohibitive full-length RL rollout costs. On Qwen2.5-7B and 14B, LoongRL substantially improves long-context multi-hop QA accuracy, with absolute gains of +23.5% and +21.1%, respectively. The resulting LoongRL-14B reaches a score of 74.2, rivaling much larger frontier models such as o3-mini (74.5) and DeepSeek-R1 (74.9). It also improves long-context retrieval, passes all 128K needle-in-a-haystack stress tests, and preserves short-context reasoning capabilities.
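
To make the KeyChain idea concrete, the following Python sketch shows one way such a task could be assembled. It is an illustration under stated assumptions, not the authors' implementation: the function names (`make_chain`, `build_task`, `trace_chain`), the key-value pair layout, the use of distractor chains, and the default chain length are all hypothetical choices made for this example.

```python
import random
import uuid


def make_chain(payload, length, rng):
    """Build `length` key-value pairs: each value is the next key,
    and the last value is the payload (a question)."""
    keys = [str(uuid.UUID(int=rng.getrandbits(128), version=4))
            for _ in range(length)]
    values = keys[1:] + [payload]
    return keys[0], list(zip(keys, values))


def build_task(true_question, distractor_questions, documents,
               chain_length=3, seed=0):
    """Hypothetical task builder: hides the true question behind a UUID
    chain and scatters all pairs among the documents."""
    rng = random.Random(seed)
    start_key, pairs = make_chain(true_question, chain_length, rng)
    # Assumption: distractor chains end in other questions, so only the
    # chain that begins at `start_key` leads to the true one.
    for question in distractor_questions:
        _, extra = make_chain(question, chain_length, rng)
        pairs.extend(extra)
    # Documents are plain strings; chain links are (key, value) tuples.
    items = list(documents) + pairs
    rng.shuffle(items)
    return start_key, items


def trace_chain(start_key, items):
    """Follow the chain from `start_key` until the value is no longer a key."""
    table = dict(item for item in items if isinstance(item, tuple))
    value = start_key
    while value in table:
        value = table[value]
    return value
```

In this sketch, a prompt would present the shuffled `items` as the long context and give the model only `start_key`. Answering correctly then requires the same steps the abstract lists: trace the chain, recover the true question, and only then retrieve and reason over the relevant documents. `trace_chain` plays the role of a reference solver for the tracing step.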
