QwenLong-L1.5: A Post-Training Recipe for Long-Context Reasoning and Memory Management
Abstract
We introduce QwenLong-L1.5, a model that achieves superior long-context reasoning capabilities through systematic post-training innovations. The main technical advances of QwenLong-L1.5 are as follows: (1) Long-context data synthesis pipeline: we developed a systematic synthesis framework for generating demanding reasoning tasks that require multi-hop grounding over globally distributed evidence. By decomposing documents into atomic facts and their underlying relationships, then programmatically composing verifiable reasoning questions, our approach produces high-quality training data at scale, moving well beyond simple retrieval tasks to enable genuine long-context reasoning capabilities. (2) Stabilized reinforcement learning for long-context training: to overcome the critical instability observed in long-context RL training, we introduce task-balanced sampling with task-specific advantage estimation to reduce reward bias, and propose Adaptive Entropy-Controlled Policy Optimization (AEPO), which dynamically regulates the exploration-exploitation trade-off. (3) Memory-augmented architecture for ultra-long contexts: recognizing that even extended context windows cannot accommodate arbitrarily long sequences, we designed a memory management framework trained with multi-stage fusion reinforcement learning, enabling seamless integration of single-pass reasoning and iterative memory-based processing for tasks exceeding 4 million tokens. Built on Qwen3-30B-A3B-Thinking, QwenLong-L1.5 achieves performance comparable to GPT-5 and Gemini-2.5-Pro on long-context reasoning benchmarks, surpassing its base model by 9.90 points on average. On ultra-long tasks (1M to 4M tokens), QwenLong-L1.5's memory-agent framework delivers a 9.48-point gain over the baseline. Moreover, the acquired long-context reasoning ability also translates into improved performance in general domains such as scientific reasoning, memory-tool use, and extended dialogues.
One-sentence Summary
The authors from Tongyi Lab and Alibaba Group propose QwenLong-L1.5, a memory-augmented model that achieves GPT-5- and Gemini-2.5-Pro-level long-context reasoning via a scalable data synthesis pipeline, stabilized reinforcement learning with adaptive entropy control, and iterative memory-based processing, enabling robust performance on tasks up to 4M tokens and enhancing capabilities in scientific reasoning and extended dialogue.
Key Contributions
- The paper addresses the critical gap in post-training long-context reasoning by introducing a scalable data synthesis pipeline that generates complex, multi-hop reasoning tasks through structured decomposition of documents into atomic facts and relationships, enabling training on verifiable, globally distributed evidence rather than simple retrieval.
- It proposes a stabilized reinforcement learning framework with task-balanced sampling and adaptive entropy-controlled policy optimization (AEPO), which mitigates reward bias and enables stable training on progressively longer sequences, overcoming key instabilities in long-context RL.
- A memory-augmented architecture with multi-stage fusion RL training allows QwenLong-L1.5 to handle tasks exceeding 4 million tokens by combining single-pass reasoning within a 256K context window with iterative memory-based processing, achieving a 9.48-point gain over baselines on ultra-long tasks and improving performance across general domains like scientific reasoning and extended dialogue.
Introduction
Long-context reasoning is essential for advanced LLM applications such as single-pass inference and multi-turn agent systems, enabling models to perform complex, multi-hop reasoning over extensive information. However, prior work has largely focused on pre- and mid-training techniques or architectural changes, leaving a critical gap in mature, end-to-end post-training solutions for long-context tasks. Existing methods often rely on simplistic data like "needle-in-a-haystack" retrieval or single-hop RAG, lacking the complexity needed for robust reasoning over globally distributed evidence. The authors introduce QwenLong-L1.5, a comprehensive post-training recipe that addresses these limitations through three key contributions: a principled, scalable data synthesis pipeline that generates complex, multi-hop reasoning tasks from structured facts; a novel reinforcement learning framework with task-balanced sampling and Adaptive Entropy-Controlled Policy Optimization (AEPO) to stabilize training on long sequences; and a memory management architecture that combines single-pass reasoning with iterative memory updates to extend reasoning beyond the model’s context window. This integrated approach enables significant performance gains on long-context benchmarks and generalizes to diverse domains like math, science, and dialogue.
Dataset
- The dataset for QwenLong-L1.5 is built from a multi-source corpus of long documents, including code repositories, academic literature, professional documents, general knowledge content, and simulated multi-turn dialogues, totaling 82,175 high-quality documents and approximately 9.2 billion tokens after filtering.
- From this corpus, the authors synthesized 42.7k initial long-context question-answer pairs using a large-scale LLM-based pipeline, focusing on complex reasoning tasks such as numerical calculation, multi-hop reasoning, temporal analysis, viewpoint analysis, long in-context learning, causal analysis, and hypothetical scenarios.
- The synthesis process involved three key steps: (1) generating challenging QA pairs by leveraging structured data and a multi-agent self-evolution framework, (2) extending context length by inserting irrelevant documents to increase difficulty, and (3) applying rigorous validation checks (knowledge grounding and contextual robustness) to ensure answers depend solely on the provided context and remain stable under perturbations; a minimal sketch of these checks follows this list.
- After filtering, deduplication, and test set decontamination, the final training set contains 14.1k high-quality samples, a significant increase in scale and diversity compared to QwenLong-L1.
- The dataset emphasizes long-context complexity, with a substantial portion of samples exceeding 64K tokens, enabling training on highly demanding reasoning tasks.
- The training data is used in a mixture ratio tailored for reinforcement learning, with samples drawn across multiple question types to ensure balanced exposure to different reasoning modalities.
- Contexts are strategically expanded with irrelevant content during synthesis to simulate real-world information retrieval challenges, and metadata such as question type, domain, and reasoning complexity are explicitly constructed to support training and evaluation.
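To make the validation step concrete, here is a minimal sketch of the two checks described above (knowledge grounding and contextual robustness). It assumes a hypothetical `llm_answer` helper for querying a judge model and is an illustration, not the authors' implementation.

```python
# Hypothetical sketch of the two validation checks described above.
# `llm_answer(question, context)` is an assumed helper that queries a judge LLM;
# it is not part of the released QwenLong-L1.5 code.

import random

def knowledge_grounded(qa, llm_answer) -> bool:
    """Answer must be derivable from the provided context alone
    (without the context, the judge should fail to reproduce the answer)."""
    with_ctx = llm_answer(qa["question"], context=qa["context"])
    without_ctx = llm_answer(qa["question"], context="")
    return with_ctx == qa["answer"] and without_ctx != qa["answer"]

def contextually_robust(qa, llm_answer, distractors, n_trials=3) -> bool:
    """Answer must stay stable when irrelevant documents are inserted."""
    for _ in range(n_trials):
        noisy_ctx = qa["context"] + "\n\n" + "\n\n".join(random.sample(distractors, k=2))
        if llm_answer(qa["question"], context=noisy_ctx) != qa["answer"]:
            return False
    return True

def filter_candidates(candidates, llm_answer, distractors):
    """Keep only QA pairs that pass both validation checks."""
    return [qa for qa in candidates
            if knowledge_grounded(qa, llm_answer)
            and contextually_robust(qa, llm_answer, distractors)]
```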
Method
The authors leverage a multi-stage training paradigm to systematically enhance the long-context reasoning capabilities of QwenLong-L1.5, building upon the Qwen3-30B-A3B-Thinking base model. The overall training process is structured to progressively scale the model's ability to handle increasingly complex and lengthy inputs, culminating in a unified architecture capable of both single-pass full-context reasoning and iterative memory-based processing for ultra-long contexts. The framework begins with a series of three full-context reinforcement learning (RL) stages, each designed to extend the model's input and output length capabilities. The first stage operates with a maximum input of 32K tokens and a maximum output of 12K tokens, followed by a second stage with 60K input and 20K output, and a third stage with 120K input and 50K output. This progressive length extension is designed to avoid training instability that would arise from an abrupt transition to long-context patterns. During the transition between these stages, a difficulty-aware retrospective sampling strategy is employed to filter training data based on the input-output length settings of the subsequent stage, ensuring a smooth progression in task complexity.
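The staged length schedule and the retrospective filtering between stages can be pictured as a loop over stage configurations. The sketch below uses the input/output limits quoted above; the filtering criterion (keeping samples whose lengths fit the next stage's budget) and the field names are illustrative assumptions rather than the paper's exact difficulty-aware rule.

```python
# Sketch of the progressive length-extension schedule described above.
# Stage limits come from the text; the filtering criterion is an illustrative
# assumption (retain samples whose lengths fit the *next* stage's budget).

STAGES = [
    {"name": "stage1", "max_input": 32_000,  "max_output": 12_000},
    {"name": "stage2", "max_input": 60_000,  "max_output": 20_000},
    {"name": "stage3", "max_input": 120_000, "max_output": 50_000},
]

def retrospective_filter(samples, next_stage):
    """Difficulty-aware retrospective sampling (illustrative): keep samples
    whose input fits the next stage's window and whose observed solution
    length does not exceed its output budget."""
    return [
        s for s in samples
        if s["input_tokens"] <= next_stage["max_input"]
        and s["observed_output_tokens"] <= next_stage["max_output"]
    ]

def schedule(samples):
    """Apply the retrospective filter at each stage transition."""
    for prev, nxt in zip(STAGES, STAGES[1:]):
        samples = retrospective_filter(samples, nxt)
        print(f"{prev['name']} -> {nxt['name']}: {len(samples)} samples retained")
    return samples
```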

Following the completion of the third full-context RL stage, the model undergoes a specialized training phase focused on memory management. This is achieved by continuing RL training on the QwenLong-L1.5-RL-Stage3 model to create an expert model specifically for memory processing. To integrate this capability without compromising the stability of the full-context reasoning skills, the authors employ a model merging technique. The expert memory model is merged with the QwenLong-L1.5-RL-Stage3 model using the SCE merging algorithm. This merging process results in a single, cohesive model that possesses both long-context reasoning and memory management capabilities. The final step in the training pipeline is a fourth full-context RL stage, where the merged model is trained again to refine its overall performance and ensure the seamless integration of its dual capabilities. This multi-stage fusion paradigm allows the model to scale to ultra-long contexts, with the memory management framework enabling it to process sequences exceeding 4 million tokens by breaking down the input into manageable chunks and iteratively updating a compact memory representation.
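Conceptually, the memory-agent mode splits an ultra-long input into chunks that fit the context window and carries a compact memory across chunks before answering. The sketch below is a minimal illustration of that loop; `model_update_memory` and `model_answer` are hypothetical wrappers around the merged model, and the chunk size is a placeholder rather than the exact production value.

```python
# Illustrative memory-agent loop for inputs beyond the context window.
# `model_update_memory` and `model_answer` are assumed callables wrapping the
# merged QwenLong-L1.5 model; the chunk size is a placeholder value.

def chunk(tokens, chunk_size=200_000):
    """Split the tokenized input into window-sized pieces."""
    for i in range(0, len(tokens), chunk_size):
        yield tokens[i:i + chunk_size]

def memory_agent_answer(question, tokens, model_update_memory, model_answer):
    """Iteratively compress each chunk into a compact memory, then answer
    from the accumulated memory alone."""
    memory = ""
    for piece in chunk(tokens):
        # The model reads (question, current memory, new chunk) and returns
        # an updated memory containing only question-relevant evidence.
        memory = model_update_memory(question=question, memory=memory, chunk=piece)
    return model_answer(question=question, memory=memory)
```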

The core of the long-context reasoning capability is built upon a robust reinforcement learning framework. The authors formulate the task as a policy optimization problem, where the goal is to maximize a reward function that evaluates the quality of the generated response. To address the computational intractability of standard PPO methods on long inputs due to quadratic attention complexity, they employ Group Relative Policy Optimization (GRPO). This method eliminates the need for a separate value network by estimating the advantage through group-wise reward z-score normalization, which is computed by normalizing the sequence-level rewards of a group of candidate responses. The training objective is further refined by setting the KL regularization coefficient to zero and operating in a strictly on-policy setting with a single gradient update per batch, which simplifies the objective and enhances stability. To ensure stable and efficient training, the authors implement several key innovations. Task-balanced sampling is used to prevent distributional drift by ensuring an equal number of samples from each of the five primary task types (multiple choice, doc multi-hop reasoning, general reading comprehension, dialogue memory, and corpus-level numerical calculation) are drawn in each training batch. This is complemented by task-specific advantage estimation, which computes the reward standard deviation within each task type, providing a more accurate and isolated advantage signal that mitigates bias from noisy samples and accommodates the distinct reward distributions across different tasks.
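A compact way to see how task-balanced sampling and task-specific advantage estimation fit together is the sketch below: batches draw an equal number of samples per task type, group means are computed per prompt as in GRPO, and the normalizing standard deviation is computed within each task type. The exact normalization details are the authors'; this is an illustrative reading of the description above.

```python
# Sketch of task-balanced sampling and task-specific advantage estimation,
# as an illustrative reading of the description above (not the exact formula).

import random
import statistics
from collections import defaultdict

def task_balanced_batch(pool_by_task, per_task):
    """Draw an equal number of samples from each task type."""
    return [s for pool in pool_by_task.values()
            for s in random.sample(pool, per_task)]

def task_specific_advantages(rollouts):
    """rollouts: list of dicts with keys 'prompt_id', 'task', 'reward'."""
    # 1) Group-wise means, computed per prompt over its candidate responses (GRPO-style).
    by_prompt = defaultdict(list)
    for r in rollouts:
        by_prompt[r["prompt_id"]].append(r["reward"])
    group_mean = {pid: statistics.mean(v) for pid, v in by_prompt.items()}

    # 2) Standard deviation computed within each task type, not over the whole
    #    batch, so tasks with different reward scales do not bias each other.
    by_task = defaultdict(list)
    for r in rollouts:
        by_task[r["task"]].append(r["reward"])
    task_std = {t: statistics.pstdev(v) or 1.0 for t, v in by_task.items()}

    return [
        (r["reward"] - group_mean[r["prompt_id"]]) / task_std[r["task"]]
        for r in rollouts
    ]
```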

To address the challenge of training instability caused by the high similarity between correct and incorrect reasoning paths in long-context tasks, the authors introduce a novel negative gradient clipping strategy. This approach clips a portion of the negative gradients generated by incorrect responses, which are often high-entropy tokens that produce large gradients and increase optimization variance. The clipping is guided by the policy's entropy, with high-entropy tokens or sequences being identified as candidates for gradient reduction. This helps to stabilize the training process by preventing excessive penalization of exploratory behavior, which is crucial for the model to correct erroneous paths. Building upon this, the authors propose the Adaptive Entropy-Controlled Policy Optimization (AEPO) algorithm. AEPO dynamically masks rollout sequences with negative advantages based on the current batch-level entropy. If the entropy exceeds a predefined upper bound, all negative samples are masked, effectively performing an advantage-weighted online rejection sampling to reduce entropy. Conversely, if the entropy drops below a lower bound, negative gradients are reintroduced to prevent entropy collapse and maintain exploration. This dynamic control mechanism provides a stable and effective way to balance exploration and exploitation, enabling the model to scale RL training to a larger number of steps without degradation.
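The entropy-gated masking in AEPO behaves like a hysteresis controller over negative-advantage rollouts. The sketch below assumes batch-level entropy has already been measured and uses placeholder thresholds; it illustrates the described control logic rather than the paper's exact algorithm.

```python
# Minimal sketch of AEPO's entropy gate. Thresholds are placeholders,
# not the paper's values; batch entropy is assumed to be measured elsewhere.

class AEPOEntropyGate:
    """Hysteresis controller: once batch entropy exceeds the upper bound,
    negative-advantage rollouts are masked (advantage-weighted online
    rejection sampling) until entropy falls below the lower bound, at which
    point negative gradients are reintroduced to prevent entropy collapse."""

    def __init__(self, entropy_high=0.7, entropy_low=0.3):
        self.entropy_high = entropy_high
        self.entropy_low = entropy_low
        self.drop_negatives = False

    def loss_weights(self, advantages, batch_entropy):
        """Return a per-rollout weight: 0.0 masks the rollout's gradient."""
        if batch_entropy > self.entropy_high:
            self.drop_negatives = True    # mask negatives to reduce entropy
        elif batch_entropy < self.entropy_low:
            self.drop_negatives = False   # reintroduce negative gradients
        return [0.0 if (self.drop_negatives and a < 0) else 1.0
                for a in advantages]
```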
Experiment
- Conducted multi-stage reinforcement learning post-training with synthetic data and the AEPO algorithm, validating improved long-context reasoning; ablation studies show a +3.27 average-score gain over the baseline with GRPO, and a +7.47 gain over Qwen3-30B-A3B-Thinking-2507 in the Qwen3-4B-Thinking-2507 ablation setting.
- Achieved state-of-the-art performance on MRCR (82.99) and strong results on CorpusQA (81.25), outperforming flagship models like GPT-5 and Gemini-2.5-Pro on key long-context benchmarks.
- On LongBench-V2, Frames, and DocMath, QwenLong-L1.5-30B-A3B achieved average scores of 55.27, 74.76, and 66.26 respectively, surpassing baseline by +6.16, +4.49, and +4.00 points.
- Demonstrated significant generalization: +15.60 gain on LongMemEval (dialogue memory), +5.80 on Memory-KV (agentic memory), and +3.65 on AIME25, indicating transferable information integration skills.
- Achieved robust ultra-long context performance: 22.53 on MRCR (512K~1M) and 14.29 on CorpusQA (4M), outperforming full-context models and agent-based methods at extreme scales.
- Multi-stage training progression shows consistent improvement, with full-context RL Stage-1 delivering the largest initial gain, and memory-RL followed by model merging enabling balanced full-context and memory-agent capabilities.
The authors use a multi-stage reinforcement learning framework to enhance long-context reasoning in Qwen3-30B-A3B-Thinking, resulting in significant performance improvements across multiple benchmarks. Results show that the final model, QwenLong-L1.5-30B-A3B, achieves an average score of 71.82, outperforming the baseline by 9.90 points and demonstrating strong gains on tasks requiring complex information integration, such as MRCR and CorpusQA.

The authors use the AEPO algorithm to improve long-context reasoning in Qwen3-4B-Thinking-2507, with ablation experiments showing that adding AEPO increases the average score from 52.79 to 59.36 across benchmarks. The results indicate that AEPO enhances performance across all evaluated tasks, particularly on MRCR and CorpusQA, where scores rise by 7.03 and 15.31 points, respectively.

The authors compare the performance of Qwen3-30B-A3B-Thinking-2507 and QwenLong-L1.5-30B-A3B on general, agentic memory, and dialogue memory benchmarks. Results show that QwenLong-L1.5-30B-A3B achieves higher scores across most tasks, with notable improvements on AIME25 (+3.65), GPQA-Diamond (+0.90), Memory-KV (+5.80), and LongMemEval (+15.60), indicating that long-context training enhances general reasoning and memory capabilities without significant degradation on other domains.

The authors compare QwenLong-L1 and QwenLong-L1.5, showing that the latter uses a significantly larger and more diverse training dataset, including synthetic data and additional domains such as code repositories and dialogue data. This expansion more than doubles the maximum input length, from 59,563 to 119,932 tokens, and also raises the average input length, indicating a stronger focus on handling longer and more complex contexts.

The authors use ablation experiments to evaluate the impact of different optimization strategies on the AEPO algorithm. Results show that combining task-balanced sampling with batch-standardization and task-batch-standardization leads to the highest average score of 58.62, demonstrating that these techniques significantly improve performance across multiple benchmarks compared to the baseline.
