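RubricEM: Meta-Reinforcement Learning with Rubric-Guided Policy Decomposition Beyond Verifiable Rewards

Abstract

Training deep research agents, systems that plan, search, evaluate evidence, and synthesize long-form reports, pushes reinforcement learning beyond the regime of verifiable rewards. Their outputs have no ground-truth answers and their trajectories span many tool-augmented decisions, so standard post-training offers little mechanism for turning past attempts into reusable experience. This work argues that rubrics should serve not merely as a tool for grading final answers but as a shared interface that structures policy execution, judge feedback, and agent memory. Building on this view, the paper presents RUBRICEM, a rubric-guided reinforcement learning framework that combines stage-wise policy decomposition with reflection-based meta-policy training. RUBRICEM first makes research trajectories stage-aware by conditioning planning, evidence gathering, review, and synthesis on self-generated rubrics. It then applies Stage-Structured GRPO, which uses stagewise rubric judgments to provide denser semantic feedback for long-horizon optimization. In parallel, RUBRICEM trains a shared-backbone reflection meta-policy that distills judged trajectories into reusable rubric-grounded guidance for future attempts. The resulting RUBRICEM-8B achieves strong performance on four representative long-form research benchmarks, outperforming comparable open models and approaching proprietary deep-research systems. Beyond final performance, the paper conducts thorough analyses to understand the key components of RUBRICEM.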

One-sentence Summary

The authors introduce RUBRICEM, a rubric-guided reinforcement learning framework combining stagewise policy decomposition with reflection-based meta-policy training to optimize deep research agents beyond verifiable rewards by employing Stage-Structured GRPO for denser semantic feedback and distilling judged trajectories into reusable guidance, enabling RUBRICEM-8B to achieve strong performance across four representative long-form research benchmarks while outperforming comparable open models and approaching proprietary deep-research systems.

Key Contributions

  • This work introduces RUBRICEM, a rubric-guided reinforcement learning framework that uses rubrics as a shared interface to structure policy execution, judge feedback, and agent memory. The method makes research trajectories stage-aware by conditioning planning, evidence gathering, review, and synthesis on self-generated rubrics.
  • Credit assignment relies on Stage-Structured GRPO, which leverages stagewise rubric judgments to provide denser semantic feedback for long-horizon optimization. A shared-backbone reflection meta-policy runs asynchronously to distill judged trajectories into reusable rubric-grounded guidance without imposing sequential bottlenecks.
  • Experiments demonstrate that RUBRICEM-8B achieves strong performance across four representative long-form research benchmarks. The resulting system outperforms comparable open models and approaches proprietary deep-research systems in these evaluations.

Introduction

Training deep research agents requires moving beyond verifiable rewards, since their long-form outputs lack ground-truth answers and standard post-training offers little reusable experience. Prior methods rely on verifiable search proxies or imitation data, leaving a gap in handling the coarse and delayed feedback of open-ended research trajectories. The authors introduce RUBRICEM, a framework that treats rubrics as a shared interface to structure policy execution and agent memory. This system combines stagewise policy decomposition with reflection-based meta-policy training to provide denser semantic feedback and distill reusable guidance from judged trajectories.

Dataset

  • Dataset Composition and Sources

    • The authors construct a supervised fine-tuning dataset comprising approximately 11,000 samples.
    • Data originates from agent trajectories generated by a Gemini teacher model and adapted for Qwen3.
    • The final volume is roughly 2,000 samples smaller than the DR Tulu baseline due to rigorous filtering.
  • Filtering and Rejection Rules

    • Trajectories lacking a closing </answer> tag are discarded as a hard reject.
    • Samples missing valid tool calls in non-final rounds are removed to prevent reliance on internal knowledge.
    • Data without required XML structures like <structured_plan> or <state_evaluation> is excluded.
    • Any trajectory with two or more consecutive tool errors is rejected to ensure reliability.
  • Processing and Training Format

    • Reasoning tags are converted from <scratchpad> to <think> for template compatibility.
    • Tool names are normalized to canonical identifiers such as google_search.
    • Each sample is formatted as a single-turn ChatML conversation with system, user, and assistant messages.
    • Tool output tokens are masked during training so the model does not memorize search results.
  • Evaluation Benchmarks

    • Performance is measured on four long-form datasets including HealthBench and ResearchQA.
    • These benchmarks range from 100 to 1,000 questions covering medical and scientific research domains.
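
The filtering and processing rules listed above can be sketched as a small Python routine. This is a minimal illustration, not the authors' pipeline: the per-round record layout (`text`, `tool_calls`, `tool_error`), the alias table, and the reading of "missing valid tool calls" as "every non-final round contains at least one tool call" are all assumptions. Only the tag names and `google_search` come from the text.

```python
# Tags named in the source; the record layout below is hypothetical.
REQUIRED_TAGS = ("<structured_plan>", "<state_evaluation>")
# Assumed aliases, for illustration only; the source gives just the canonical name.
TOOL_ALIASES = {"search": "google_search", "web_search": "google_search"}


def keep_trajectory(rounds):
    """Apply the four rejection rules to one trajectory.

    rounds: list of dicts with keys 'text' (str), 'tool_calls' (list),
    and 'tool_error' (bool). This layout is an assumption.
    """
    if not rounds:
        return False
    full_text = "".join(r["text"] for r in rounds)
    # Hard reject: no closing </answer> tag.
    if "</answer>" not in full_text:
        return False
    # Assumed reading: every non-final round must issue a tool call.
    if any(not r["tool_calls"] for r in rounds[:-1]):
        return False
    # Required XML structures must be present.
    if any(tag not in full_text for tag in REQUIRED_TAGS):
        return False
    # Reject two or more consecutive tool errors.
    consecutive = 0
    for r in rounds:
        consecutive = consecutive + 1 if r["tool_error"] else 0
        if consecutive >= 2:
            return False
    return True


def normalize_text(text):
    """Convert <scratchpad> reasoning tags to <think>."""
    return text.replace("<scratchpad>", "<think>").replace("</scratchpad>", "</think>")


def normalize_tool(name):
    """Map a tool name to its canonical identifier, leaving unknown names as-is."""
    return TOOL_ALIASES.get(name, name)
```

A trajectory that passes `keep_trajectory` would then have its text and tool names normalized before being packed into the single-turn ChatML sample. Masking tool-output tokens is a loss-level step and is not shown here.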

Method

The RubricEM framework operates by treating rubrics as a shared interface across the agent's planning, execution, and learning phases. As illustrated in the framework diagram, the system integrates three core components: a rubric-guided structured trajectory for the task policy, a stage-structured reinforcement learning algorithm for credit assignment, and a reflection meta-policy for experience reuse.

Structured Reasoning Scaffold

To manage long-horizon research tasks, the authors impose an explicit stage structure on agent trajectories. This scaffold decomposes the generation process into four semantically distinct stages: Plan, Research, Review, and Answer. Each stage is marked by XML tags and governed by specific behavioral requirements. In the planning phase, the agent generates task-specific rubrics that define the criteria for success, including a knowledge checklist and negative constraints. These rubrics then guide the subsequent research and synthesis phases. The detailed workflow shows how the agent analyzes needs, defines grading criteria, and plans the search before executing tool calls.

This structure allows the policy to condition its decisions on the current stage, avoiding the aliasing of decision modes that occurs in flat autoregressive processes. A concrete example demonstrates how a query about sleep patterns is broken down into these stages, with specific rubrics generated during planning to direct the research and review steps.
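
To make the scaffold concrete, the sketch below splits a trajectory into its four stages and checks that they appear in order. The tag names `<plan>`, `<research>`, `<review>`, and `<answer>` are hypothetical stand-ins; the source says only that each stage is marked by XML tags.

```python
import re

# Hypothetical tag names for the four stages described in the text.
STAGES = ("plan", "research", "review", "answer")


def split_stages(trajectory):
    """Return {stage: text} for the first occurrence of each stage tag found."""
    found = {}
    for stage in STAGES:
        match = re.search(rf"<{stage}>(.*?)</{stage}>", trajectory, flags=re.DOTALL)
        if match:
            found[stage] = match.group(1).strip()
    return found


def is_well_formed(trajectory):
    """True when all four stage tags are present and open in canonical order."""
    positions = [trajectory.find(f"<{stage}>") for stage in STAGES]
    return all(p >= 0 for p in positions) and positions == sorted(positions)
```

Once a trajectory is segmented this way, each span can be judged separately against its own rubric, which is what the stagewise credit assignment described next relies on.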

Stage-Structured GRPO and Meta-Policy Training

Standard reinforcement learning methods often broadcast a single terminal reward to all tokens, which is inefficient for long-horizon tasks. RubricEM employs Stage-Structured GRPO (SS-GRPO) to provide finer-grained credit assignment. Instead of a single score, the LLM judge evaluates each stage (Plan, Research, Review, Answer) against stage-specific rubrics. These stagewise scores are combined using a causal stage-dependence matrix to compute returns that account for downstream impact. The training pipeline visualizes how task rollouts are judged to generate discriminative rubrics, which are then used to score the trajectories and update the policy.
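
One way to read the stagewise credit assignment is sketched below in Python. It is an illustration under stated assumptions, not the paper's formulation: the weight values in `W`, the upper-triangular shape (a stage is credited for itself and for downstream stages), and the per-stage mean and standard-deviation normalization borrowed from standard GRPO are all assumed here.

```python
from statistics import mean, pstdev

STAGES = ("plan", "research", "review", "answer")

# Assumed stage-dependence weights. Row i lists how much each stage's judge
# score contributes to stage i's return. Values are illustrative only.
W = [
    [1.0, 0.5, 0.25, 0.5],
    [0.0, 1.0, 0.5, 0.5],
    [0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.0, 1.0],
]


def stage_returns(scores, weights=W):
    """Combine one rollout's stagewise scores into stagewise returns."""
    return [sum(w * s for w, s in zip(row, scores)) for row in weights]


def stage_advantages(group_scores, weights=W, eps=1e-6):
    """Group-relative advantages per stage for rollouts of the same prompt.

    group_scores: list of per-rollout score lists, one score per stage.
    """
    returns = [stage_returns(scores, weights) for scores in group_scores]
    advantages = [[0.0] * len(weights) for _ in returns]
    for k in range(len(weights)):
        column = [r[k] for r in returns]
        mu, sigma = mean(column), pstdev(column)
        for g, value in enumerate(column):
            advantages[g][k] = (value - mu) / (sigma + eps)
    return advantages
```

In this reading, the tokens of each stage would receive that stage's advantage rather than a single terminal reward broadcast to the whole trajectory.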

Beyond optimizing the task policy, the framework explicitly trains a reflection meta-policy to reuse experience. The task policy and reflection meta-policy share the same backbone. After a task rollout is judged, the backbone samples rubric-grounded reflection candidates. A separate judge scores these candidates based on their utility for within-episode refinement and cross-episode transfer. The highest-scoring reflection is stored in a rubric bank, serving as natural-language memory for future queries. To ensure efficiency, the system uses an asynchronous execution pipeline where reflection generation and training run in parallel with task rollouts, avoiding sequential bottlenecks. The infrastructure diagram highlights this asynchronous design, showing how the training engine consumes deferred reflection batches while the inference engine generates new rollouts.
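
The rubric bank described above can be pictured as a small store that keeps the top-scoring reflection per judged rollout. The Python sketch below is hypothetical: the class name `RubricBank`, its methods, and especially the word-overlap retrieval are assumptions, since the text does not say how stored reflections are matched to future queries.

```python
from dataclasses import dataclass, field


@dataclass
class RubricBank:
    """Hypothetical natural-language memory of (query, reflection) pairs."""

    entries: list = field(default_factory=list)

    def add_best(self, query, candidates, scores):
        """Store the candidate reflection with the highest judge score."""
        best = max(zip(scores, candidates), key=lambda pair: pair[0])[1]
        self.entries.append((query, best))
        return best

    def retrieve(self, query, k=1):
        """Assumed retrieval: rank stored reflections by word overlap with the query."""
        words = set(query.lower().split())
        ranked = sorted(
            self.entries,
            key=lambda entry: len(words & set(entry[0].lower().split())),
            reverse=True,
        )
        return [reflection for _, reflection in ranked[:k]]
```

For example, after `bank.add_best("sleep patterns in adults", ["cite trials", "check sample sizes"], [0.4, 0.9])`, a later call to `bank.retrieve("adult sleep patterns")` returns the stored reflection `"check sample sizes"`.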

Experiment

The evaluation assesses RUBRICEM on four representative long-form benchmarks using an infrastructure adapted from DR Tulu. Results demonstrate that the proposed reinforcement learning recipe significantly improves performance over supervised fine-tuning and outperforms strong open baselines while remaining competitive with proprietary systems. Ablation studies validate that stagewise credit assignment and structured scaffolding contribute complementary gains, whereas inference-time experience reuse proves effective only with the learned meta-policy. Furthermore, the model exhibits strong generalization to short-form tasks, confirming that the training teaches transferable tool-use and evidence-grounding skills rather than just long-form report writing.

The table compares short-form search performance between the proposed RUBRICEM model and DR Tulu baselines. RUBRICEM demonstrates consistent improvements across all benchmarks, with the RL-fine-tuned version achieving the best overall results. RUBRICEM-8B (RL) achieves the highest average score, surpassing strong open baselines like DR Tulu-8B (RL). The model generalizes effectively to short-form tasks despite being trained primarily on long-form deep research data. RUBRICEM reaches superior performance using fewer RL training steps than the DR Tulu baseline.

The authors evaluate RUBRICEM against proprietary and open-source deep research models across multiple benchmarks. RUBRICEM-8B-RL achieves the highest average performance among the non-proprietary deep research systems evaluated, surpassing strong open baselines like DR Tulu and Tongyi DeepResearch. The reinforcement learning stage yields significant performance gains over the supervised fine-tuning baseline. Against proprietary systems, the model outperforms OpenAI Deep Research on the DRB benchmark while remaining competitive with other closed models overall.

The authors evaluate RUBRICEM against open-source and proprietary deep research models to validate its performance across short-form search and deep research benchmarks. Results indicate that the RL-fine-tuned version generalizes effectively from long-form training data, achieving top performance among non-proprietary systems. Additionally, the model surpasses strong open baselines and remains competitive with top-tier proprietary systems while requiring fewer reinforcement learning training steps.

