HyperAIHyperAI

Command Palette

Search for a command to run...

QwenLong-L1.5: 긴 컨텍스트 추론 및 메모리 관리에 대한 후기훈련 레시피

vLLM+Open WebUI를 사용하여 QwenLong-L1.5 배포

단 20시간의 RTX 5090 컴퓨팅 리소스 $1 (가치 $7)
노트북으로 이동

초록

우리는 체계적인 후처리 훈련 혁신을 통해 우수한 장문맥 추론 능력을 달성한 QwenLong-L1.5 모델을 소개합니다. QwenLong-L1.5의 핵심 기술적 돌파구는 다음과 같습니다: (1) 장문맥 데이터 합성 파이프라인: 전 세계적으로 분산된 증거를 다단계 기반으로 요구하는 복잡한 추론 과제를 생성하는 체계적인 합성 프레임워크를 개발했습니다. 문서를 원자적 사실과 그 뒤에 있는 관계로 분해한 후, 검증 가능한 추론 질문을 프로그래밍 방식으로 조합함으로써, 단순한 검색 과제를 뛰어넘어 진정한 장거리 추론 능력을 가능하게 하는 고품질 훈련 데이터를 대규모로 생성할 수 있었습니다. (2) 장문맥 훈련을 위한 안정화된 강화학습: 장문맥 강화학습에서 발생하는 핵심적인 불안정성 문제를 해결하기 위해, 보상 편향을 완화하기 위해 과제별로 균형 잡힌 샘플링과 과제별 특화된 이점 추정을 도입하였으며, 탐색과 활용 간의 균형을 동적으로 조절하는 적응형 엔트로피 제어 정책 최적화(Adaptive Entropy-Controlled Policy Optimization, AEPO)를 제안했습니다. (3) 초장문맥을 위한 메모리 증강 아키텍처: 심지어 확장된 문맥 창이 임의로 긴 시퀀스를 수용할 수는 없음을 인지하고, 다단계 융합 강화학습 훈련을 기반으로 한 메모리 관리 프레임워크를 개발하여, 400만 토큰을 초과하는 과제에 대해 단일 패스 추론과 반복적 메모리 기반 처리를 원활하게 통합할 수 있도록 했습니다. Qwen3-30B-A3B-Thinking 기반으로 개발된 QwenLong-L1.5는 장문맥 추론 벤치마크에서 GPT-5 및 Gemini-2.5-Pro 수준의 성능을 달성하였으며, 기준 모델 대비 평균 9.90점 향상률을 기록했습니다. 초장문 과제(100만~400만 토큰)에서는 메모리 에이전트 프레임워크가 에이전트 기준 대비 9.48점의 성능 향상을 이끌어냈습니다. 또한, 획득한 장문맥 추론 능력은 과학적 추론, 메모리 도구 사용, 장기 대화 등 일반 영역에서의 성능 향상으로도 이어졌습니다.

One-sentence Summary

Tongyi Lab, Alibaba Group introduces QwenLong-L1.5, featuring a novel long-context data synthesis pipeline for multi-hop reasoning, task-balanced reinforcement learning with Adaptive Entropy-Controlled Policy Optimization (AEPO) for training stability, and a memory-augmented architecture handling contexts up to 4M tokens. It achieves 9.9-point average gains over baselines on long-context benchmarks while enhancing scientific reasoning and extended dialogue capabilities.

Key Contributions

  • Addressing the limitation of prior long-context models that rely on simple retrieval tasks rather than genuine multi-hop reasoning, the authors introduce a systematic data synthesis pipeline that deconstructs documents into atomic facts and programmatically composes verifiable reasoning questions requiring globally distributed evidence, enabling scalable training for complex long-range reasoning. This approach moves beyond basic retrieval to generate high-quality synthetic data that directly targets challenging reasoning scenarios.
  • To overcome critical instability in long-context reinforcement learning, the method proposes task-balanced sampling with task-specific advantage estimation and Adaptive Entropy-Controlled Policy Optimization (AEPO), which dynamically regulates exploration-exploitation trade-offs and mitigates reward bias. These techniques enable stable training on sequences of progressively increasing length, a key hurdle in scaling long-context capabilities.
  • Recognizing inherent context window limitations, the work develops a memory-augmented architecture with multi-stage fusion RL training that integrates single-pass reasoning and iterative memory-based processing for ultra-long contexts exceeding 4M tokens. Evaluated on benchmarks, this framework achieves a 9.48-point improvement over agent baselines on 1M–4M token tasks while matching GPT-5 and Gemini-2.5-Pro performance on standard long-context reasoning.

Introduction

Handling long-context reasoning is critical for real-world applications like legal document analysis or scientific research, where models must process extensive inputs while maintaining coherent reasoning and memory efficiency. Prior approaches faced limitations in data diversity—relying heavily on text-only inputs and short outputs—and struggled with reinforcement learning instabilities caused by coarse reward assignment across entire reasoning steps, leading to brittle training dynamics. The authors introduce QwenLong-L1.5, featuring an automated data synthesis pipeline for complex long-input tasks and AEPO, a refined reinforcement learning algorithm that stabilizes training through gradient clipping to enable more granular credit assignment within reasoning trajectories.

Dataset

The authors use a synthetically constructed long-context reinforcement learning dataset for QwenLong-L1.5, developed through a three-stage pipeline:

  • Composition and sources:
    Built from 82,175 high-quality long documents (9.2B tokens) across five categories:
    • Code repositories (high-starred Python projects)
    • Academic literature (STEM, medicine, law, arXiv papers)
    • Professional documents (financial reports, manuals, government publications)
    • General knowledge (novels, Wikipedia)
    • Simulated multi-turn dialogues

  • Key subset details:
    • Initial pool: 42.7k synthesized question-answer pairs
    • Final training set: 14.1k samples after rigorous filtering
    • Filtering rules:

    • Knowledge grounding check: Removed samples answerable without context (testing internal knowledge reliance)
    • Contextual robustness check: Discarded samples where answers failed when irrelevant documents were added
      • Enhanced complexity: Increased samples exceeding 64K tokens versus prior versions
  • Data usage:
    • Solely for RL training (no separate validation/test splits described)
    • Mixture ratios not specified, but emphasizes domain diversity: multi-hop reasoning, numerical calculation, temporal analysis, viewpoint analysis, and dialogue memory

  • Processing strategies:
    • Context expansion: Strategically inserted irrelevant documents to force sparse information retrieval
    • Difficulty scaling: Used knowledge graphs/tables and multi-agent self-evolution to create challenging questions
    • Decontamination: Applied multi-stage deduplication and test set cleaning
    • Quality control: Combined rule-based and LLM-as-judge filtering during corpus curation

Method

The authors leverage a comprehensive post-training framework to endow QwenLong-L1.5 with robust long-context reasoning capabilities, integrating a novel data synthesis pipeline, stabilized reinforcement learning, and a memory-augmented architecture. The overall training paradigm is structured as a multi-stage, progressive process designed to scale reasoning capacity while maintaining training stability.

The core of the method begins with the systematic generation of high-quality, complex training data. As shown in the figure below, the data synthesis pipeline is designed to move beyond simple retrieval tasks by constructing challenges that demand multi-hop grounding over globally distributed evidence. The process starts by ingesting a diverse document corpus, including web-crawled content, code repositories, dialogue, and literature. From this corpus, the system extracts structured representations: knowledge graphs are built for multi-hop reasoning tasks, while unstructured documents are parsed into formalized corpus tables to enable numerical reasoning. For general long-context tasks, a multi-agent self-evolving framework is employed, where a proposer agent generates questions, a solver agent attempts to answer them, and a verifier agent validates the correctness of the QA pairs. This iterative process, depicted in the diagram, ensures the generated dataset covers a broad spectrum of reasoning types—including causal analysis, hypothetical scenarios, statistical aggregation, and viewpoint analysis—while undergoing rigorous verification for knowledge grounding and contextual robustness.

To train the model on this data, the authors formulate long-context reasoning as a reinforcement learning problem, optimizing a policy model to maximize a reward function while being regularized against deviation from a reference policy. Given the computational intractability of standard PPO for long sequences, they adopt Group Relative Policy Optimization (GRPO), which estimates advantages via group-wise reward z-score normalization, eliminating the need for a separate value network. To address the instability inherent in multi-task, long-context RL, they introduce two key innovations. First, they implement task-balanced sampling, ensuring each training batch contains an equal number of samples from five distinct task types (e.g., multiple choice, multi-hop QA, numerical calculation). Second, they propose task-specific advantage estimation, which computes the reward standard deviation within each task’s subset of the batch rather than across the entire group, thereby isolating tasks with dense versus sparse reward distributions and yielding a more accurate, less biased advantage signal.

Further stabilizing the training process, the authors introduce Adaptive Entropy-Controlled Policy Optimization (AEPO). This algorithm dynamically regulates the exploration-exploitation trade-off by monitoring the batch-level entropy of the policy. When entropy exceeds a predefined upper bound, AEPO masks all samples with negative advantages, effectively performing advantage-weighted online rejection sampling to reduce entropy. Conversely, when entropy falls below a lower bound, negative gradients are reintroduced to prevent collapse and encourage exploration. This mechanism enables stable training over progressively longer sequences and is critical for scaling the model’s reasoning capabilities.

For tasks exceeding the model’s physical context window, the authors deploy a memory agent framework. As illustrated in the figure below, this architecture reframes long-context reasoning as a sequential decision-making process. The user query is first decomposed into a core question and formatting instructions. The long document is split into chunks, and at each step, the policy observes the current chunk and the historical memory state to update both the memory and a navigational plan for the next chunk. This recurrent mechanism allows the model to “fold” global context into a compact, evolving representation. After processing all chunks, the final answer is generated by integrating the accumulated memory with the original formatting instructions. The policy is optimized end-to-end using GRPO, with trajectory-level rewards computed based on the correctness of the final answer.

The overall training pipeline, shown in the final figure, is executed in stages. Starting from the Qwen3-30B-A3B-Thinking base model, the authors first perform three stages of full-context RL, progressively increasing the input and output lengths (from 20K/12K to 120K/50K tokens). After Stage 3, they train a specialized memory agent model using the same RL framework but with chunked inputs. This expert model is then merged with the Stage 3 model using the SCE algorithm, and a final stage of full-context RL is applied to the merged model to produce the final QwenLong-L1.5.

This integrated approach—combining scalable data synthesis, stabilized RL, and memory-augmented inference—enables QwenLong-L1.5 to achieve state-of-the-art performance on long-context reasoning benchmarks and generalize effectively to domains requiring extended coherence and multi-hop inference.

Experiment

  • QwenLong-L1.5-30B-A3B achieved 71.82 average score across long-context benchmarks including LongBench-V2 and MRCR, surpassing DeepSeek-R1-0528 (68.67) and Gemini-2.5-Flash-Thinking (68.73), with 82.99 on MRCR setting state-of-the-art.
  • Demonstrated +9.90 point improvement over baseline Qwen3-30B-A3B-Thinking-2507, with largest gains on long-context tasks: MRCR (+31.72), CorpusQA (+9.69), and LongBench-V2 (+6.16).
  • Showed generalization to non-training domains, improving AIME25 by +3.65 and LongMemEval by +15.60 without performance degradation on general benchmarks like MMLU-PRO.
  • Handled ultra-long contexts up to 4M tokens, scoring 14.29 on CorpusQA 4M subset and outperforming baseline by 18.32 points on MRCR 128K-512K subsets.
  • Multi-stage training validated progressive gains: Stage-1 increased average score by +7.67 points, while later stages specifically enhanced long-context reasoning and memory agent capabilities without compromising full-context performance.

The authors use ablation experiments on Qwen3-4B-Thinking-2507 to evaluate the impact of different negative gradient clipping strategies within the AEPO algorithm. Results show that clipping high-entropy tokens at the token level yields the highest average score (57.02), while clipping high-entropy sequences at the sequence level achieves the best overall performance (57.36), indicating that sequence-level clipping better preserves useful reasoning paths during training.

The authors evaluate QwenLong-L1.5-30B-A3B against its baseline Qwen3-30B-A3B-Thinking-2507 on general, agentic memory, and dialogue memory benchmarks. Results show that QwenLong-L1.5 maintains or improves performance across most tasks, with notable gains in AIME25 (+3.65), Memory-KV (+5.80), and LongMemEval (+15.60), indicating effective generalization of long-context reasoning skills to out-of-domain tasks.

The authors evaluate ablations of the AEPO algorithm on Qwen3-4B-Thinking-2507, testing combinations of task-balanced sampling and batch standardization. Results show that adding task-batch standardization yields the highest average score of 58.62, with notable gains on Frames and CorpusQA, while MRCR performance remains lower than with task-balanced sampling alone. This suggests that task-batch standardization improves overall performance but may not uniformly benefit all task types.

The authors evaluate QwenLong-L1.5-30B-A3B against flagship and lightweight reasoning models on LongBench-V1 QA benchmarks, showing it outperforms its baseline Qwen3-30B-A3B-Thinking-2507 by +3.30 points on average. Notable gains are observed on Musique (+7.00) and NarrativeQA (+9.00), while Qasper shows a slight decline (-3.00). The model matches or exceeds top performers like Gemini-2.5-Pro and DeepSeek-R1-0528 on most subtasks, demonstrating competitive multi-hop reasoning capabilities.

The authors use Qwen3-4B-Thinking-2507 as a base model and compare ablation variants of their AEPO algorithm against GRPO. Results show that adding AEPO yields a 3.29-point average improvement over the base model and a 2.29-point gain over GRPO, with consistent performance boosts across all evaluated benchmarks including DocMath, Frames, and CorpusQA.


AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩
바로 사용 가능한 GPU
최적의 가격

HyperAI Newsletters

최신 정보 구독하기
한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다
이메일 서비스 제공: MailChimp