4달 전

Jiangshan Duo Hanyu Li Hailin Zhang Yudong Wang Sujian Li Liang Zhao

초록

검증 가능한 보상과 함께한 강화 학습(Reinforcement Learning with Verifiable Rewards, RLVR)은 대규모 언어 모델에서 추론을 수행하는 표준적 패러다임으로 자리 잡았다. 그러나 최종 답변의 정확성에만 초점을 맞추는 최적화는 모델이 방향성 없고 지나치게 긴 탐색으로 빠지게 하며, 구조적 계획 대신 포괄적인 시도-오류 전략에 의존하게 만든다. 길이 페널티와 같은 히우리스틱 제약 조건은 이러한 지나친 반복을 줄일 수는 있으나, 종종 필수적인 추론 단계를 자르게 되어 효율성과 검증 가능성 사이에 어려운 트레이드오프를 초래한다. 본 논문에서는 탐색의 효율성을 높이기 위해 구분 능력(discriminative capability)이 필수적이라고 주장한다. 즉, 유효한 해답을 구분할 수 있도록 학습함으로써 모델은 탐색 공간을 줄이는 내재적 안내 신호를 획득할 수 있다. 이를 바탕으로 우리는 ‘JudgeRLVR’이라는 이단계적 판단-생성 파라다임을 제안한다. 첫 번째 단계에서는 검증 가능한 답변을 가진 해답 응답을 판단할 수 있도록 모델을 훈련시킨다. 두 번째 단계에서는 판단 모델에서 초기화된 일반적인 생성형 RLVR를 사용하여 동일한 모델을 미세 조정한다. 동일한 수학 도메인 훈련 데이터를 사용하는 일반적인 RLVR와 비교했을 때, JudgeRLVR는 Qwen3-30B-A3B 모델에서 더 우수한 품질-효율성 균형을 달성한다. 도메인 내 수학 문제에서는 평균 정확도가 약 +3.7점 향상되면서 평균 생성 길이는 약 42% 감소하였으며, 도메인 외 벤치마크에서는 평균 정확도가 약 +4.5점 개선되어 일반화 능력이 향상됨을 입증한다.

One-sentence Summary

The authors from Peking University and Xiaomi propose JudgeRLVR, a two-stage reinforcement learning framework that first trains a model to judge verifiable solutions, enabling more efficient reasoning by pruning the search space; this approach improves both accuracy and generation efficiency in Qwen3-30B-A3B, achieving up to +4.5 accuracy gain on out-of-domain math benchmarks with significantly reduced verbosity.

Key Contributions

The paper identifies a key limitation of Reinforcement Learning with Verifiable Rewards (RLVR): optimizing solely for final-answer correctness leads to verbose, unstructured exploration with excessive backtracking, undermining efficiency without guaranteeing higher-quality reasoning.
It proposes JudgeRLVR, a two-stage paradigm where a model first learns to judge solution validity before being fine-tuned for generation, enabling it to internalize discriminative priors that prune low-quality reasoning paths without explicit length penalties.
On Qwen3-30B-A3B, JudgeRLVR achieves +3.7 average accuracy gain with -42% generation length on in-domain math and +4.5 accuracy improvement on out-of-domain benchmarks, demonstrating superior quality-efficiency trade-offs and reduced reliance on explicit backtracking cues.

Introduction

The authors address the challenge of inefficient and verbose reasoning in large language models trained via Reinforcement Learning with Verifiable Rewards (RLVR), where models often resort to trial-and-error exploration instead of structured, goal-directed thinking. While RLVR improves final answer accuracy, it fails to promote high-quality reasoning patterns, leading to low information density and excessive generation length—issues exacerbated by heuristic fixes like length penalties that trade off accuracy for brevity. To overcome this, the authors propose JudgeRLVR, a two-stage training paradigm: first, the model is trained as a judge to discriminate between correct and incorrect solution responses, internalizing a strong signal for valid reasoning; second, the same model is fine-tuned for generation using vanilla RLVR, initialized from the judge. This approach implicitly prunes unproductive reasoning paths without explicit constraints, resulting in more direct, coherent, and efficient reasoning. On Qwen3-30B-A3B, JudgeRLVR achieves a +3.7 accuracy gain and 42% reduction in generation length on in-domain math tasks, along with +4.5 accuracy improvement on out-of-domain benchmarks, demonstrating superior quality-efficiency trade-offs and reduced reliance on backtracking cues.

Dataset

The dataset comprises math reasoning problems from multiple sources, each with a gold final answer $a^*(x)$ , and solution responses $z = (o_1, \ldots, o_T)$ that include a step-by-step logical process ending in a final answer.
The final answer $a(z)$ is extracted from the solution response using a deterministic parser, and correctness is determined by comparing $a(z)$ to $a^*(x)$ via a binary label $\ell(x, z) = \mathbb{I}(a(z) = a^*(x))$ .
The model is trained to judge the correctness of clean solution responses rather than full reasoning trajectories, enabling more precise error detection by avoiding noise from lengthy, distracting CoT sequences.
The training and evaluation use in-domain math benchmarks: AIME24, AIME25, MATH500, HMMT_feb_2025, and BeyondAIME, all sourced from published works and curated for high-quality reasoning problems.
Out-of-domain evaluation covers diverse capabilities using benchmarks such as GPQA Diamond (science reasoning), IFEval (instruction following), LiveCodeBenchv6 (coding), MMLU-Redux (general knowledge), and ZebraLogic (logical reasoning).
The model is trained on a mixture of these datasets with carefully balanced ratios to ensure robust performance across domains, with no explicit cropping strategy applied—data is used as-is after standard preprocessing.
Metadata for each problem includes source, difficulty level, and problem type, constructed during dataset curation to support fine-grained analysis and evaluation.

Method

The authors leverage a two-stage training paradigm, JudgeRLVR, to enhance reasoning efficiency by decoupling discriminative error awareness from solution generation. The overall framework consists of a judging stage followed by a generating stage, where the model first learns to evaluate the correctness of candidate solutions before being optimized to produce high-quality outputs. This sequential approach is designed to enable the model to internalize valid reasoning patterns, thereby improving generation quality and reducing unnecessary computation.

As shown in the figure below, the training pipeline begins with a supervised fine-tuned model, $\pi_{\text{SFT}}$ , which serves as the base for both stages. In the judging stage, the model is trained to act as a critic. Given a problem $x$ and a candidate solution response $z$ , the policy $\pi_{\text{judge}}$ generates a commentary $c$ that includes the chain-of-thought (CoT) trajectory and a discrete verdict token $v \in \{0, 1\}$ , where 0 indicates an incorrect solution and 1 indicates a correct one. The model is trained on a discriminative dataset $D_{\text{judge}}$ constructed from triplets $(x, z, \ell)$ , where $\ell$ is the ground truth label derived from comparing the final answer $a(z)$ with the correct answer $a^*(x)$ . Data construction involves rollout generation from diverse large language models (LLMs), hard negative mining to focus on moderately difficult problems, and class balancing to ensure equal representation of correct and incorrect solutions. The reward for this stage is binary, based on whether the predicted verdict $v_i$ matches the true label $\ell(x, z_i)$ , and the policy gradient optimizes both the verdict and the explanatory commentary.

In the generating stage, the model is initialized with the weights from the judging stage and is trained under a vanilla reinforcement learning with verification (RLVR) setting. The policy $\pi_{\text{final}}$ generates a solution response $z$ directly from the problem $x$ , producing a CoT trajectory and the final answer $a(z)$ . The reward is again a sparse binary signal based on the correctness of the final answer, defined as $r = \ell(x, z) = \mathbb{I}(a(z) = a^*(x))$ . This stage leverages the discriminative knowledge acquired during the judging phase to produce more concise and accurate solutions, as the model has learned to identify and avoid erroneous reasoning paths early in the generation process.

The model architecture used is Qwen3-30B-A3B, a mixture-of-experts (MoE) model, which is first fine-tuned on curated open-source CoT datasets to obtain Qwen3-30B-A3B-SFT. For training, the authors collect a dataset of 113k math problems with gold answers and employ a strict sampling strategy for the judging stage, generating 16 distinct reasoning paths per problem using MiMo-7B RL and the target model. The training algorithm is DAPO, a GRPO-family policy gradient method, with dynamic sampling to filter out trivial problems, a batch size of 256, and a rollout size of $n=8$ . Training uses the AdamW optimizer with a learning rate of $3 \times 10^{-6}$ , a 10-step warmup, and a training temperature of 1.0, while inference uses a temperature of 0.6.

Experiment

Main experiment: JudgeRLVR vs. Vanilla RLVR and Base SFT on reasoning quality and efficiency. JudgeRLVR achieves higher or comparable accuracy with significantly shorter generation lengths across multiple benchmarks, demonstrating improved quality-efficiency trade-off. On AIME24/25, HMMT_feb_2025, and BeyondAIME, JudgeRLVR improves accuracy while reducing reasoning verbosity; on MATH500, it maintains accuracy with drastic length reduction. On out-of-domain tasks (GPQA Diamond, LiveCodeBenchv6, MMLU-Redux), JudgeRLVR improves accuracy and reduces length, indicating generalization of error-aware reasoning. On IFEval and ZebraLogic, accuracy improves but length increases, reflecting task-specific need for explicit verification.
Ablation experiments: Judge Only (judging-only training) degrades accuracy and increases length on in-domain math, showing that judging alone does not improve generation. Mixed Strategy (interleaved judge/generate training) is less stable and often produces longer outputs, indicating that sequential staging is essential for clean policy learning and efficient reasoning.
Mechanism verification: Perplexity (PPL) analysis shows a significant increase in Base SFT’s PPL on JudgeRLVR outputs during the judging stage, confirming a style transfer toward error-sensitive reasoning. Transition-word statistics reveal a substantial decrease in backtracking markers (e.g., but, however, wait) during JudgeRLVR’s generating stage, indicating reduced explicit self-correction and internalized verification.

The authors use the table to compare the performance of Base SFT, Vanilla RLVR, and JudgeRLVR across multiple benchmarks, showing that JudgeRLVR achieves higher or comparable accuracy while significantly reducing generation length in most cases. Results show that JudgeRLVR improves the quality-efficiency trade-off by enabling more concise reasoning, particularly in math domains, and demonstrates consistent gains in accuracy and efficiency across diverse tasks.

The authors use a two-stage judge-then-generate approach, JudgeRLVR, to improve the quality-efficiency trade-off in reasoning tasks compared to Vanilla RLVR. Results show that JudgeRLVR achieves higher or comparable accuracy while significantly reducing generation length across most benchmarks, indicating more efficient and focused reasoning.

The authors use a two-stage judge-then-generate approach in JudgeRLVR to improve the quality-efficiency trade-off compared to Vanilla RLVR, achieving higher or comparable accuracy while significantly reducing generation length across most benchmarks. Results show that JudgeRLVR consistently outperforms Vanilla RLVR in in-domain math tasks, with substantial gains in accuracy and shorter outputs, and maintains competitive performance on out-of-domain tasks, demonstrating improved reasoning efficiency and generalization.

소스 PDF

AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩

바로 사용 가능한 GPU

최적의 가격

시작하기 가격 보기

HyperAI Newsletters

최신 정보 구독하기

한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다

이메일 서비스 제공: MailChimp

4달 전

Jiangshan Duo Hanyu Li Hailin Zhang Yudong Wang Sujian Li Liang Zhao

초록

One-sentence Summary

Key Contributions

The paper identifies a key limitation of Reinforcement Learning with Verifiable Rewards (RLVR): optimizing solely for final-answer correctness leads to verbose, unstructured exploration with excessive backtracking, undermining efficiency without guaranteeing higher-quality reasoning.
It proposes JudgeRLVR, a two-stage paradigm where a model first learns to judge solution validity before being fine-tuned for generation, enabling it to internalize discriminative priors that prune low-quality reasoning paths without explicit length penalties.
On Qwen3-30B-A3B, JudgeRLVR achieves +3.7 average accuracy gain with -42% generation length on in-domain math and +4.5 accuracy improvement on out-of-domain benchmarks, demonstrating superior quality-efficiency trade-offs and reduced reliance on explicit backtracking cues.

Introduction

Dataset

The dataset comprises math reasoning problems from multiple sources, each with a gold final answer $a^*(x)$ , and solution responses $z = (o_1, \ldots, o_T)$ that include a step-by-step logical process ending in a final answer.
The final answer $a(z)$ is extracted from the solution response using a deterministic parser, and correctness is determined by comparing $a(z)$ to $a^*(x)$ via a binary label $\ell(x, z) = \mathbb{I}(a(z) = a^*(x))$ .
The model is trained to judge the correctness of clean solution responses rather than full reasoning trajectories, enabling more precise error detection by avoiding noise from lengthy, distracting CoT sequences.
The training and evaluation use in-domain math benchmarks: AIME24, AIME25, MATH500, HMMT_feb_2025, and BeyondAIME, all sourced from published works and curated for high-quality reasoning problems.
Out-of-domain evaluation covers diverse capabilities using benchmarks such as GPQA Diamond (science reasoning), IFEval (instruction following), LiveCodeBenchv6 (coding), MMLU-Redux (general knowledge), and ZebraLogic (logical reasoning).
The model is trained on a mixture of these datasets with carefully balanced ratios to ensure robust performance across domains, with no explicit cropping strategy applied—data is used as-is after standard preprocessing.
Metadata for each problem includes source, difficulty level, and problem type, constructed during dataset curation to support fine-grained analysis and evaluation.

Method

Experiment

Main experiment: JudgeRLVR vs. Vanilla RLVR and Base SFT on reasoning quality and efficiency. JudgeRLVR achieves higher or comparable accuracy with significantly shorter generation lengths across multiple benchmarks, demonstrating improved quality-efficiency trade-off. On AIME24/25, HMMT_feb_2025, and BeyondAIME, JudgeRLVR improves accuracy while reducing reasoning verbosity; on MATH500, it maintains accuracy with drastic length reduction. On out-of-domain tasks (GPQA Diamond, LiveCodeBenchv6, MMLU-Redux), JudgeRLVR improves accuracy and reduces length, indicating generalization of error-aware reasoning. On IFEval and ZebraLogic, accuracy improves but length increases, reflecting task-specific need for explicit verification.
Ablation experiments: Judge Only (judging-only training) degrades accuracy and increases length on in-domain math, showing that judging alone does not improve generation. Mixed Strategy (interleaved judge/generate training) is less stable and often produces longer outputs, indicating that sequential staging is essential for clean policy learning and efficient reasoning.
Mechanism verification: Perplexity (PPL) analysis shows a significant increase in Base SFT’s PPL on JudgeRLVR outputs during the judging stage, confirming a style transfer toward error-sensitive reasoning. Transition-word statistics reveal a substantial decrease in backtracking markers (e.g., but, however, wait) during JudgeRLVR’s generating stage, indicating reduced explicit self-correction and internalized verification.

소스 PDF

AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩

바로 사용 가능한 GPU

최적의 가격

시작하기 가격 보기

HyperAI Newsletters

최신 정보 구독하기

한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다

이메일 서비스 제공: MailChimp

Command Palette

JudgeRLVR: 효율적인 추론을 위한 먼저 판단하고 나서 생성하기

Jiangshan Duo Hanyu Li Hailin Zhang Yudong Wang Sujian Li Liang Zhao

초록

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AI로 AI 구축

HyperAI Newsletters

Command Palette

JudgeRLVR: 효율적인 추론을 위한 먼저 판단하고 나서 생성하기

Jiangshan Duo Hanyu Li Hailin Zhang Yudong Wang Sujian Li Liang Zhao

초록

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AI로 AI 구축

HyperAI Newsletters

Command Palette

JudgeRLVR: 효율적인 추론을 위한 먼저 판단하고 나서 생성하기

Jiangshan Duo Hanyu Li Hailin Zhang Yudong Wang Sujian Li Liang Zhao

초록

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AI로 AI 구축

HyperAI Newsletters