QwenLong-L1.5: A Post-Training Recipe for Long-Context Reasoning and Memory Management

Abstract

We introduce QwenLong-L1.5, a model that achieves strong long-context reasoning through systematic post-training innovations. Its main technical contributions are as follows. (1) Long-context data synthesis pipeline: we build a systematic synthesis framework that generates challenging tasks requiring multi-step, evidence-grounded reasoning over globally distributed evidence. By decomposing documents into atomic facts and the relationships behind them and programmatically composing verifiable reasoning questions, the pipeline produces high-quality training data at scale that goes far beyond simple retrieval, enabling genuine long-range reasoning. (2) Stabilized reinforcement learning for long-context training: to overcome the severe instability of RL on long contexts, we introduce task-balanced sampling and task-specific advantage estimation to mitigate per-task reward bias, and we propose Adaptive Entropy-Controlled Policy Optimization (AEPO), which dynamically controls the exploration-exploitation trade-off. (3) Memory-augmented architecture for ultra-long contexts: recognizing that no extension of the context window can fully handle arbitrarily long sequences, we build a memory management framework trained with multi-stage fusion RL, seamlessly integrating single-pass reasoning with iterative memory-based processing to handle ultra-long-context tasks exceeding 4 million tokens. Built on Qwen3-30B-A3B-Thinking, QwenLong-L1.5 matches GPT-5 and Gemini-2.5-Pro on long-context reasoning benchmarks and improves over its base model by an average of 9.90 points. On ultra-long-context tasks of 1M to 4M tokens in particular, its memory-agent framework yields a 9.48-point gain over agent baselines. The acquired long-context reasoning skills also carry over to general domains such as scientific reasoning, memory tool use, and extended dialogue.

One-sentence Summary

The authors from Tongyi Lab and Alibaba Group propose QwenLong-L1.5, a memory-augmented model that achieves GPT-5- and Gemini-2.5-Pro-level long-context reasoning via a scalable data synthesis pipeline, stabilized reinforcement learning with adaptive entropy control, and iterative memory-based processing, enabling robust performance on tasks up to 4M tokens and enhancing capabilities in scientific reasoning and extended dialogue.

Key Contributions

  • The paper addresses the critical gap in post-training long-context reasoning by introducing a scalable data synthesis pipeline that generates complex, multi-hop reasoning tasks through structured decomposition of documents into atomic facts and relationships, enabling training on verifiable, globally distributed evidence rather than simple retrieval.
  • It proposes a stabilized reinforcement learning framework with task-balanced sampling and adaptive entropy-controlled policy optimization (AEPO), which mitigates reward bias and enables stable training on progressively longer sequences, overcoming key instabilities in long-context RL.
  • A memory-augmented architecture with multi-stage fusion RL training allows QwenLong-L1.5 to handle tasks exceeding 4 million tokens by combining single-pass reasoning within a 256K context window with iterative memory-based processing, achieving a 9.48-point gain over baselines on ultra-long tasks and improving performance across general domains like scientific reasoning and extended dialogue.

Introduction

Long-context reasoning is essential for advanced LLM applications such as single-pass inference and multi-turn agent systems, enabling models to perform complex, multi-hop reasoning over extensive information. However, prior work has largely focused on pre- and mid-training techniques or architectural changes, leaving a critical gap in mature, end-to-end post-training solutions for long-context tasks. Existing methods often rely on simplistic data like "needle-in-a-haystack" retrieval or single-hop RAG, lacking the complexity needed for robust reasoning over globally distributed evidence. The authors introduce QwenLong-L1.5, a comprehensive post-training recipe that addresses these limitations through three key contributions: a principled, scalable data synthesis pipeline that generates complex, multi-hop reasoning tasks from structured facts; a novel reinforcement learning framework with task-balanced sampling and Adaptive Entropy-Controlled Policy Optimization (AEPO) to stabilize training on long sequences; and a memory management architecture that combines single-pass reasoning with iterative memory updates to extend reasoning beyond the model’s context window. This integrated approach enables significant performance gains on long-context benchmarks and generalizes to diverse domains like math, science, and dialogue.

Dataset

  • The dataset for QwenLong-L1.5 is built from a multi-source corpus of long documents, including code repositories, academic literature, professional documents, general knowledge content, and simulated multi-turn dialogues, totaling 82,175 high-quality documents and approximately 9.2 billion tokens after filtering.
  • From this corpus, the authors synthesized 42.7k initial long-context question-answer pairs using a large-scale LLM-based pipeline, focusing on complex reasoning tasks such as numerical calculation, multi-hop reasoning, temporal analysis, viewpoint analysis, long in-context learning, causal analysis, and hypothetical scenarios.
  • The synthesis process involved three key steps: (1) generating challenging QA pairs by leveraging structured data and a multi-agent self-evolution framework, (2) extending context length by inserting irrelevant documents to increase difficulty, and (3) applying rigorous validation checks (knowledge grounding and contextual robustness) to ensure answers depend solely on the provided context and remain stable under perturbations; steps (2) and (3) are sketched in code after this list.
  • After filtering, deduplication, and test set decontamination, the final training set contains 14.1k high-quality samples, a significant increase in scale and diversity compared to QwenLong-L1.
  • The dataset emphasizes long-context complexity, with a substantial portion of samples exceeding 64K tokens, enabling training on highly demanding reasoning tasks.
  • The training data is used in a mixture ratio tailored for reinforcement learning, with samples drawn across multiple question types to ensure balanced exposure to different reasoning modalities.
  • Contexts are strategically expanded with irrelevant content during synthesis to simulate real-world information retrieval challenges, and metadata such as question type, domain, and reasoning complexity are explicitly constructed to support training and evaluation.
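
A minimal sketch of synthesis steps (2) and (3) above, under assumed names: `build_context` pads a synthesized QA pair with irrelevant documents up to a target token budget, and `passes_validation` stands in for the LLM-judge-based knowledge-grounding and contextual-robustness checks. The whitespace token count and the `judge` callable are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only; not the paper's pipeline code.
import random

def count_tokens(text: str) -> int:
    # Crude whitespace proxy; the real pipeline would use the model tokenizer.
    return len(text.split())

def build_context(evidence_docs, distractor_pool, target_tokens=64_000, seed=0):
    """Mix the evidence documents with irrelevant distractors up to a token budget."""
    rng = random.Random(seed)
    docs = list(evidence_docs)
    budget = sum(count_tokens(d) for d in docs)
    pool = list(distractor_pool)
    rng.shuffle(pool)
    for doc in pool:
        if budget >= target_tokens:
            break
        docs.append(doc)
        budget += count_tokens(doc)
    rng.shuffle(docs)  # scatter the evidence among the distractors
    return docs

def passes_validation(question, answer, evidence_docs, distractor_pool, judge):
    """Knowledge grounding plus contextual robustness, with `judge` as an LLM stub."""
    context = build_context(evidence_docs, distractor_pool, seed=0)
    perturbed = build_context(evidence_docs, distractor_pool, seed=1)
    return (judge(question, context=[]) != answer          # not answerable from memory alone
            and judge(question, context=context) == answer
            and judge(question, context=perturbed) == answer)
```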

Method

The authors leverage a multi-stage training paradigm to systematically enhance the long-context reasoning capabilities of QwenLong-L1.5, building upon the Qwen3-30B-A3B-Thinking base model. The overall training process is structured to progressively scale the model's ability to handle increasingly complex and lengthy inputs, culminating in a unified architecture capable of both single-pass full-context reasoning and iterative memory-based processing for ultra-long contexts. The framework begins with a series of three full-context reinforcement learning (RL) stages, each designed to extend the model's input and output length capabilities. The first stage operates with a maximum input of 32K tokens and a maximum output of 12K tokens, followed by a second stage with 60K input and 20K output, and a third stage with 120K input and 50K output. This progressive length extension is designed to avoid training instability that would arise from an abrupt transition to long-context patterns. During the transition between these stages, a difficulty-aware retrospective sampling strategy is employed to filter training data based on the input-output length settings of the subsequent stage, ensuring a smooth progression in task complexity.
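
A schematic of this three-stage length schedule and the retrospective filter between stages is sketched below. The field names, the K = 1,024 convention, and the pass-rate band used to express "difficulty-aware" are assumptions for illustration, not the released training configuration.

```python
# Sketch of the progressive length schedule; not the authors' training config.
from dataclasses import dataclass

@dataclass(frozen=True)
class RLStage:
    max_input_tokens: int
    max_output_tokens: int

STAGES = [
    RLStage(32 * 1024, 12 * 1024),    # Stage 1
    RLStage(60 * 1024, 20 * 1024),    # Stage 2
    RLStage(120 * 1024, 50 * 1024),   # Stage 3
]

def keep_for_next_stage(input_len, output_len, pass_rate, next_stage,
                        min_pass=0.1, max_pass=0.9):
    """Retrospective filter between stages: the sample must fit the next stage's
    length budget, and (one assumed reading of 'difficulty-aware') its rollout
    pass rate under the current policy should be neither trivial nor hopeless."""
    fits = (input_len <= next_stage.max_input_tokens
            and output_len <= next_stage.max_output_tokens)
    informative = min_pass <= pass_rate <= max_pass
    return fits and informative
```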

Following the completion of the third full-context RL stage, the model undergoes a specialized training phase focused on memory management. This is achieved by continuing RL training on the QwenLong-L1.5-RL-Stage3 model to create an expert model specifically for memory processing. To integrate this capability without compromising the stability of the full-context reasoning skills, the authors employ a model merging technique. The expert memory model is merged with the QwenLong-L1.5-RL-Stage3 model using the SCE merging algorithm. This merging process results in a single, cohesive model that possesses both long-context reasoning and memory management capabilities. The final step in the training pipeline is a fourth full-context RL stage, where the merged model is trained again to refine its overall performance and ensure the seamless integration of its dual capabilities. This multi-stage fusion paradigm allows the model to scale to ultra-long contexts, with the memory management framework enabling it to process sequences exceeding 4 million tokens by breaking down the input into manageable chunks and iteratively updating a compact memory representation.
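
The memory-based processing path can be pictured as the loop below: split an input that exceeds the context window into chunks, fold each chunk into a compact memory, then answer from the memory. Here `llm` is a stand-in for any text-generation call, and the prompt wording and word-level chunking are illustrative assumptions rather than the paper's templates.

```python
# Sketch of iterative memory-based processing for inputs beyond the context window.
def chunk_by_words(text: str, chunk_words: int):
    words = text.split()
    for start in range(0, len(words), chunk_words):
        yield " ".join(words[start:start + chunk_words])

def answer_with_memory(llm, question: str, document: str, chunk_words: int = 100_000) -> str:
    memory = ""
    for chunk in chunk_by_words(document, chunk_words):
        # Compress the new chunk into the running memory, keeping only
        # information relevant to the question.
        memory = llm(
            f"Question: {question}\n"
            f"Current memory:\n{memory}\n"
            f"New document chunk:\n{chunk}\n"
            "Update the memory, keeping only facts needed to answer the question."
        )
    return llm(f"Question: {question}\nMemory:\n{memory}\nAnswer using the memory above.")
```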

The core of the long-context reasoning capability is built upon a robust reinforcement learning framework. The authors formulate the task as a policy optimization problem, where the goal is to maximize a reward function that evaluates the quality of the generated response. To address the computational intractability of standard PPO methods on long inputs due to quadratic attention complexity, they employ Group Relative Policy Optimization (GRPO). This method eliminates the need for a separate value network by estimating the advantage through group-wise z-score normalization of the sequence-level rewards across a group of candidate responses. The training objective is further refined by setting the KL regularization coefficient to zero and operating in a strictly on-policy setting with a single gradient update per batch, which simplifies the objective and enhances stability. To ensure stable and efficient training, the authors implement several key innovations. Task-balanced sampling is used to prevent distributional drift by ensuring that an equal number of samples from each of the five primary task types (multiple choice, doc multi-hop reasoning, general reading comprehension, dialogue memory, and corpus-level numerical calculation) is drawn in each training batch. This is complemented by task-specific advantage estimation, which computes the reward standard deviation within each task type, providing a more accurate and isolated advantage signal that mitigates bias from noisy samples and accommodates the distinct reward distributions across different tasks.
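
One plausible reading of task-specific advantage estimation is sketched below: the reward mean is still computed per prompt group as in GRPO, but the normalizing standard deviation is pooled per task type. The field names and the pooling choice are assumptions, not the authors' exact formulation.

```python
# Illustrative sketch of task-specific advantage estimation.
from collections import defaultdict
import statistics

def task_specific_advantages(rollouts):
    """rollouts: list of dicts with keys 'group_id', 'task', and 'reward'."""
    # GRPO-style baseline: mean reward within each rollout group (same prompt).
    by_group = defaultdict(list)
    for r in rollouts:
        by_group[r["group_id"]].append(r["reward"])
    group_mean = {g: statistics.mean(v) for g, v in by_group.items()}

    # Standard deviation pooled per task type, so noisy tasks do not bias others.
    by_task = defaultdict(list)
    for r in rollouts:
        by_task[r["task"]].append(r["reward"] - group_mean[r["group_id"]])
    task_std = {t: (statistics.pstdev(v) or 1.0) for t, v in by_task.items()}

    return [(r["reward"] - group_mean[r["group_id"]]) / task_std[r["task"]]
            for r in rollouts]
```

Under this reading, a batch containing both dialogue-memory and corpus-level numerical-calculation rollouts would normalize each by its own task's reward spread rather than by a single pooled deviation.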

To address the challenge of training instability caused by the high similarity between correct and incorrect reasoning paths in long-context tasks, the authors introduce a novel negative gradient clipping strategy. This approach clips a portion of the negative gradients generated by incorrect responses; these gradients are often concentrated on high-entropy tokens, which produce large updates and increase optimization variance. The clipping is guided by the policy's entropy, with high-entropy tokens or sequences being identified as candidates for gradient reduction. This helps to stabilize the training process by preventing excessive penalization of exploratory behavior, which is crucial for the model to correct erroneous paths. Building upon this, the authors propose the Adaptive Entropy-Controlled Policy Optimization (AEPO) algorithm. AEPO dynamically masks rollout sequences with negative advantages based on the current batch-level entropy. If the entropy exceeds a predefined upper bound, all negative samples are masked, effectively performing an advantage-weighted online rejection sampling to reduce entropy. Conversely, if the entropy drops below a lower bound, negative gradients are reintroduced to prevent entropy collapse and maintain exploration. This dynamic control mechanism provides a stable and effective way to balance exploration and exploitation, enabling the model to scale RL training to a larger number of steps without degradation.
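
A compact sketch of AEPO's entropy gate as described above, written as a stateful mask over rollout-level advantages. The thresholds, the hysteresis behavior between the two bounds, and the tensor interface are illustrative assumptions, not the released training code.

```python
# Illustrative sketch of the AEPO entropy gate; not the authors' implementation.
import torch

class AEPOGate:
    """Entropy-gated masking of negative-advantage rollouts, with hysteresis."""

    def __init__(self, entropy_high: float = 2.0, entropy_low: float = 0.5):
        self.entropy_high = entropy_high   # above this, suppress negative samples
        self.entropy_low = entropy_low     # below this, reintroduce them
        self.mask_negatives = False

    def step_mask(self, advantages: torch.Tensor, batch_entropy: float) -> torch.Tensor:
        """Return a 0/1 mask over rollout sequences for the policy-gradient loss."""
        if batch_entropy > self.entropy_high:
            # Entropy too high: drop failed rollouts (advantage-weighted online
            # rejection sampling) so the update reduces entropy.
            self.mask_negatives = True
        elif batch_entropy < self.entropy_low:
            # Entropy collapsing: keep negative gradients to restore exploration.
            self.mask_negatives = False
        if self.mask_negatives:
            return (advantages >= 0).float()
        return torch.ones_like(advantages)
```

In a GRPO-style objective, the returned mask would simply multiply each sequence's advantage before the token-level loss is computed.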

Experiment

  • Conducted multi-stage reinforcement learning post-training with synthetic data and the AEPO algorithm, validating improved long-context reasoning; ablation studies show a +3.27 average-score gain over a GRPO baseline, and a +7.47 gain over Qwen3-30B-A3B-Thinking-2507 on Qwen3-4B-Thinking-2507.
  • Achieved state-of-the-art performance on MRCR (82.99) and strong results on CorpusQA (81.25), outperforming flagship models like GPT-5 and Gemini-2.5-Pro on key long-context benchmarks.
  • On LongBench-V2, Frames, and DocMath, QwenLong-L1.5-30B-A3B achieved average scores of 55.27, 74.76, and 66.26 respectively, surpassing baseline by +6.16, +4.49, and +4.00 points.
  • Demonstrated significant generalization: +15.60 gain on LongMemEval (dialogue memory), +5.80 on Memory-KV (agentic memory), and +3.65 on AIME25, indicating transferable information integration skills.
  • Achieved robust ultra-long context performance: 22.53 on MRCR (512K~1M) and 14.29 on CorpusQA (4M), outperforming full-context models and agent-based methods at extreme scales.
  • Multi-stage training progression shows consistent improvement, with full-context RL Stage-1 delivering the largest initial gain, and memory-RL followed by model merging enabling balanced full-context and memory-agent capabilities.

The authors use a multi-stage reinforcement learning framework to enhance long-context reasoning in Qwen3-30B-A3B-Thinking, resulting in significant performance improvements across multiple benchmarks. Results show that the final model, QwenLong-L1.5-30B-A3B, achieves an average score of 71.82, outperforming the baseline by 9.90 points and demonstrating strong gains on tasks requiring complex information integration, such as MRCR and CorpusQA.

The authors use the AEPO algorithm to improve long-context reasoning in Qwen3-4B-Thinking-2507, with ablation experiments showing that adding AEPO increases the average score from 52.79 to 59.36 across benchmarks. The results indicate that AEPO enhances performance across all evaluated tasks, particularly on MRCR and CorpusQA, where the average score rises by 7.03 and 15.31 points, respectively.

The authors compare the performance of Qwen3-30B-A3B-Thinking-2507 and QwenLong-L1.5-30B-A3B on general, agentic memory, and dialogue memory benchmarks. Results show that QwenLong-L1.5-30B-A3B achieves higher scores across most tasks, with notable improvements on AIME25 (+3.65), GPQA-Diamond (+0.90), Memory-KV (+5.80), and LongMemEval (+15.60), indicating that long-context training enhances general reasoning and memory capabilities without significant degradation on other domains.

The authors compare QwenLong-L1 and QwenLong-L1.5, showing that the latter uses a significantly larger and more diverse training dataset, including synthetic data and additional domains such as code repositories and dialogue data. This expansion more than doubles the maximum input length, from 59,563 to 119,932 tokens, and also increases the average input length, indicating a stronger focus on longer and more complex contexts.

The authors use ablation experiments to evaluate the impact of different optimization strategies on the AEPO algorithm. Results show that combining task-balanced sampling with batch-standardization and task-batch-standardization leads to the highest average score of 58.62, demonstrating that these techniques significantly improve performance across multiple benchmarks compared to the baseline.

