Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models

Abstract

Building general-purpose reasoning models with reinforcement learning (RL) involves pronounced cross-domain heterogeneity in response lengths and verification latency, and this variability complicates RL infrastructure, slows training, and makes training curricula (e.g., response-length extension) and hyperparameter selection difficult. To build the general-purpose reasoning model Nemotron-Cascade, this work proposes cascaded domain-wise reinforcement learning (Cascade RL), which yields a model that can operate in both instruct mode and deep-thinking mode. Unlike conventional approaches that blend heterogeneous prompts drawn from different domains, Cascade RL runs reinforcement learning sequentially, one domain at a time, reducing engineering complexity while achieving state-of-the-art performance across a broad set of benchmarks. Notably, using RLHF (reinforcement learning from human feedback) for alignment as a preliminary stage improves reasoning ability far beyond mere preference optimization, and in the subsequent domain-wise RLVR stages, benchmark performance attained in earlier domains rarely degrades and sometimes even improves (see Figure 1). After RL, the 14B-parameter model developed in this work surpasses its SFT teacher, DeepSeek-R1-0528, delivering superior results on LiveCodeBench v5/v6/Pro and a silver-medal-level result at the 2025 International Olympiad in Informatics (IOI). The training process and data recipes are fully released.

One-sentence Summary

Researchers at NVIDIA propose Cascade RL to train Nemotron-Cascade reasoning models via sequential domain-wise reinforcement learning, avoiding conventional blended-prompt approaches and thereby handling cross-domain heterogeneity in response lengths and verification latency. This reduces engineering complexity while achieving state-of-the-art performance on coding benchmarks such as LiveCodeBench and securing a silver medal at the 2025 IOI, with an initial RLHF stage enhancing reasoning capabilities well beyond standard alignment.

Key Contributions

  • Cross-domain heterogeneity in reinforcement learning for reasoning models causes significant infrastructure complexity and training slowdowns due to varying response lengths and verification latency across tasks.
  • The proposed cascaded domain-wise reinforcement learning (Cascade RL) method sequentially trains on distinct domains instead of blending heterogeneous prompts, enabling dual instruct/deep thinking modes while preventing performance degradation between stages.
  • Evaluated on benchmarks including LiveCodeBench v5/v6/Pro and the 2025 International Olympiad in Informatics, their 14B Nemotron-Cascade model surpassed its teacher (DeepSeek-R1-0528) and achieved silver-medal performance without degrading prior domain results.

Introduction

Training general-purpose reasoning models with reinforcement learning faces significant hurdles due to cross-domain heterogeneity—varying response lengths and verification latencies across tasks like math, coding, and alignment. This diversity complicates RL infrastructure, slows training, and hinders curriculum design and hyperparameter tuning. Prior approaches that blend heterogeneous prompts from multiple domains simultaneously often degrade performance in unified models, forcing compromises between thinking-mode reasoning and instruct-mode responsiveness.

The authors leverage cascaded domain-wise reinforcement learning (Cascade RL) to address these challenges. Their method sequences RL stages per domain—starting with RLHF for alignment, followed by sequential math, code, and software engineering RL—reducing engineering complexity while minimizing catastrophic forgetting. Crucially, early-stage RL (e.g., RLHF) unexpectedly boosts reasoning beyond alignment, and subsequent domain training rarely degrades prior gains. This enables a single unified model (Nemotron-Cascade) to operate effectively in both instruct and deep-thinking modes, with their 14B variant outperforming its teacher model on coding benchmarks and achieving competitive results in the 2025 IOI.

Dataset

  • Dataset composition and sources: The authors use a multi-stage supervised fine-tuning (SFT) curriculum spanning math, coding, science, tool use, software engineering, and general domains (e.g., dialogue, knowledge QA, creative writing). Sources include public datasets like AceMath, NuminaMath, TACO, APPS, SWE-Bench variants, and Llama-Nemotron tool-calling data, supplemented by synthetically generated samples.
  • Key subset details:
    • General-domain: 2.8M samples (3.2B tokens) from diverse sources (e.g., Lian et al., 2023; Xu et al., 2024), with parallel thinking/non-thinking responses. Filtered for response quality, length, and stylistic consistency.
    • Math: Stage 1 (16K tokens): 353K prompts → 2.77M samples (DeepSeek-R1); Stage 2 (32K): 163K filtered "hard" prompts → 1.88M samples (DeepSeek-R1-0528). Decontaminated via 9-gram overlap removal.
    • Code: Stage 1: 172K prompts → 1.42M samples; Stage 2: 79K prompts → 1.39M samples. Sources include TACO, APPS, and OpenCodeReasoning.
    • Science: 226K prompts → 289K Stage-1 samples; 345K Stage-2 samples. Filtered for complex reasoning and decontaminated.
    • Tool calling: 310K conversations (1.41M turns) from Llama-Nemotron, with tools listed in system prompts.
    • Software engineering: 127K code repair instances (e.g., SWE-Bench-Train, SWE-Smith), filtered via patch similarity (Unidiff ≥0.5).
  • Usage in training: The SFT curriculum runs in two stages: Stage 1 (16K tokens) trains on general-domain + math/science/code data for one epoch; Stage 2 (32K tokens) recombines general data with new Stage-2 reasoning data, tool-calling, and software engineering datasets (also one epoch). Science data is upsampled 2×; software engineering data is upsampled 3× in Stage 2. All reasoning/tool data uses thinking-mode formatting.
  • Processing details: Responses are generated via DeepSeek models (e.g., R1-0528), with multi-response sampling per prompt (avg. 7–17 responses). Data undergoes 9-gram decontamination against benchmarks (a minimal sketch of this filter follows this list), ground-truth verification (e.g., discarding mismatched MCQ answers), and cross-validation with auxiliary models (e.g., Qwen2.5-32B). For software engineering RL, prompts exceed SFT context limits (up to 60K tokens via YaRN scaling) and include noisy localized files to simulate real-world complexity.
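
The 9-gram decontamination step can be illustrated with a small sketch. This is a minimal, hypothetical implementation assuming whitespace/word-level tokenization and lowercase normalization; the paper does not specify the exact tokenizer or matching rules.

```python
import re

def ngrams(text: str, n: int = 9) -> set[tuple[str, ...]]:
    """Lowercase, strip punctuation, and return the set of word n-grams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_benchmark_index(benchmark_texts: list[str], n: int = 9) -> set[tuple[str, ...]]:
    """Union of all n-grams appearing in any benchmark problem."""
    index: set[tuple[str, ...]] = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index

def is_contaminated(prompt: str, benchmark_index: set[tuple[str, ...]], n: int = 9) -> bool:
    """A training prompt is dropped if it shares any n-gram with a benchmark problem."""
    return not ngrams(prompt, n).isdisjoint(benchmark_index)

# Example: filter a training set against benchmark prompts.
benchmark_index = build_benchmark_index(["Compute the sum of the first 100 positive integers."])
train_prompts = ["Compute the sum of the first 100 positive integers and explain your steps."]
clean = [p for p in train_prompts if not is_contaminated(p, benchmark_index)]
```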

Method

The authors leverage a cascaded reinforcement learning (Cascade RL) framework to progressively refine model capabilities across increasingly specialized domains. The overall training pipeline begins with a base model that undergoes multi-stage supervised fine-tuning (SFT) to establish foundational skills. From this SFT checkpoint, the model enters a sequential RL pipeline: first, Reinforcement Learning from Human Feedback (RLHF) is applied to align outputs with human preferences and reduce verbosity; this is followed by Instruction-Following RL (IF-RL) to enhance precise adherence to user directives. Subsequent stages—Math RL, Code RL, and finally Software Engineering RL (SWE RL)—target domain-specific reasoning and generation tasks, culminating in the Nemotron-Cascade model. This staged progression from general to specialized domains is designed to mitigate catastrophic forgetting by ensuring that reward structures across stages are aligned and that prompt overlap is minimized.
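
The staged ordering can be summarized in a short schematic sketch. The `train_rl_stage` helper and the IF-RL reward label are hypothetical placeholders; only the stage order and the other reward descriptions follow the text above.

```python
# Illustrative ordering of the Cascade RL stages (hypothetical helper, not the authors' code).
# Each stage starts from the checkpoint produced by the previous one, moving from
# general alignment to increasingly specialized domains.

def train_rl_stage(checkpoint: str, domain: str, reward: str) -> str:
    """Hypothetical stub: run on-policy GRPO for one domain and return the new checkpoint."""
    # Real training would generate rollouts, compute group-normalized advantages,
    # and update the policy; here we only tag the checkpoint name.
    return f"{checkpoint}+{domain.lower().replace(' ', '_')}"

STAGES = [
    # (stage name, reward signal)
    ("RLHF",    "scalar score from a 72B reward model"),
    ("IF-RL",   "instruction-adherence check"),          # reward label assumed
    ("Math RL", "binary correctness from a rule-based verifier"),
    ("Code RL", "execution-free verifier"),
    ("SWE RL",  "lexical/semantic patch similarity"),
]

checkpoint = "sft_checkpoint"
for name, reward in STAGES:
    checkpoint = train_rl_stage(checkpoint, name, reward)
print(checkpoint)  # sft_checkpoint+rlhf+if-rl+math_rl+code_rl+swe_rl
```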

Refer to the framework diagram for the full training pipeline.

Each RL stage employs the Group Relative Policy Optimization (GRPO) algorithm under a strict on-policy regime, with no KL divergence term, simplifying the objective to a group-normalized REINFORCE formulation. At each iteration, the policy generates a group of G rollouts, and the advantage for each token is computed relative to the group’s mean and standard deviation of rewards. This design ensures stable updates and avoids entropy collapse. The reward functions vary by domain: RLHF uses a scalar score from a 72B reward model trained on human preferences; Math RL assigns binary rewards based on answer correctness via a rule-based verifier; Code RL and SWE RL use execution-free verifiers that compute lexical and semantic similarity between generated and ground-truth patches.
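
A minimal sketch of the group-normalized advantage described above, assuming one scalar reward per rollout; the small epsilon for numerical stability is an assumption, not taken from the paper.

```python
import numpy as np

def group_normalized_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style advantages: normalize each rollout's scalar reward by the
    mean and standard deviation of its group of G rollouts.

    rewards: shape (G,) -- one scalar reward per rollout for the same prompt.
    Returns: shape (G,) -- the advantage shared by every token of rollout i.
    """
    mean, std = rewards.mean(), rewards.std()
    return (rewards - mean) / (std + eps)

# Example: a group of G = 4 rollouts with binary correctness rewards (Math RL).
advs = group_normalized_advantages(np.array([1.0, 0.0, 0.0, 1.0]))
# Correct rollouts receive positive advantages, incorrect ones negative.
```

With strictly on-policy updates and no KL penalty, the per-token objective then reduces to REINFORCE weighted by these advantages, matching the group-normalized formulation described above.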

For interaction control, the authors adopt a simplified ChatML-based template with explicit /think and /no_think flags appended to each user prompt, enabling fine-grained, turn-level control over the reasoning mode. This contrasts with prior work that embeds mode control in the system prompt or uses redundant template-based cues. For tool calling, available functions are declared within dedicated tags in the system prompt, and model-generated calls are enclosed in <tool_call> tags, as illustrated in the system prompt example.
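
The turn-level mode control can be pictured as follows. Apart from the /think and /no_think flags and the <tool_call> tags mentioned above, the special tokens and rendering below are assumptions about a generic ChatML-style template, not the model's verbatim format.

```python
def render_turn(role: str, content: str) -> str:
    """Generic ChatML-style rendering (token names assumed, not the official template)."""
    return f"<|im_start|>{role}\n{content}<|im_end|>\n"

def build_prompt(system: str, user: str, thinking: bool) -> str:
    """Append the per-turn reasoning flag to the user message."""
    flag = "/think" if thinking else "/no_think"
    return (
        render_turn("system", system)
        + render_turn("user", f"{user} {flag}")
        + "<|im_start|>assistant\n"
    )

# Deep-thinking turn:
print(build_prompt("You are a helpful assistant.", "Prove that sqrt(2) is irrational.", thinking=True))
# An instruct-mode turn in the same conversation would instead pass thinking=False.
```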


In the SWE RL stage, the authors employ a simplified Agentless framework that decomposes software repair into localization, repair, and patch validation. The repair phase generates targeted, diff-style patches by concatenating localized files and surrounding context into a unified prompt, preserving code structure to reduce hallucinations. Patch validation proceeds through regression, reproduction, and majority voting phases to ensure functional correctness and robustness. For training stability, a two-stage curriculum extends input context from 16K to 24K tokens, allowing the model to gradually develop multi-file reasoning capabilities without early degradation.
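
The majority-voting step of patch validation can be sketched as follows; the normalization and tie-breaking rules here are illustrative assumptions, since the summary does not spell them out.

```python
from collections import Counter

def normalize_patch(patch: str) -> str:
    """Canonicalize a unified diff so trivially different candidates compare equal."""
    lines = [ln.rstrip() for ln in patch.strip().splitlines()]
    # Drop hunk headers whose line offsets may differ between otherwise identical patches.
    lines = [ln for ln in lines if not ln.startswith("@@")]
    return "\n".join(lines)

def majority_vote(candidate_patches: list[str]) -> str:
    """Pick the most frequent (normalized) candidate among those that survived the
    regression and reproduction checks; ties fall back to the first cluster seen."""
    counts = Counter(normalize_patch(p) for p in candidate_patches)
    winner, _ = counts.most_common(1)[0]
    # Return an original (un-normalized) patch belonging to the winning cluster.
    return next(p for p in candidate_patches if normalize_patch(p) == winner)
```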

Experiment

  • Cascade RL framework validated across human-feedback alignment, instruction following, math reasoning, competitive programming, and software engineering, demonstrating minimal catastrophic forgetting and domain-specific performance gains.
  • Nemotron-Cascade-14B-Thinking achieved 78.0/74.8 on LiveCodeBench v5/v6, surpassing DeepSeek-R1-0528 (74.8/73.3) and Gemini-2.5-Pro-06-05 despite using a 64K-token inference budget.
  • Nemotron-Cascade-8B unified model matched DeepSeek-R1-0528 on LiveCodeBench v5/v6 (75.3/71.5) with only 8B parameters versus 671B, while achieving silver-medal performance on IOI 2025.
  • Nemotron-Cascade-14B reached 43.1% pass@1 on SWE-bench Verified, exceeding specialized models like DeepSWE-32B (42.2%) and general-purpose 14B models (Qwen3-14B: 27.4%).

The authors evaluate the impact of maximum prompt length on code repair performance using a 14B model across four conditions. Results show that increasing prompt length from 16K to 32K improves repair accuracy, particularly when ground-truth file localization is provided, but performance degrades at 40K, suggesting diminishing returns beyond 32K context.

The authors evaluate their Nemotron-Cascade models on a series of Codeforces contests, reporting scores, penalties, and estimated ELO rankings across multiple divisions. Results show consistent performance across contests, with estimated ELO scores ranging from approximately 1500 to 2600, indicating competitive standing among human participants. The model’s performance varies by contest difficulty and division, with higher scores and rankings typically observed in lower divisions and more recent rounds.

The authors evaluate different reward functions for SWE RL, finding that semantic similarity-based rewards outperform lexical similarity in code repair tasks, especially when ground-truth file localization is provided. Reward shaping improves performance with lexical similarity but offers no additional benefit with semantic similarity, suggesting the latter provides more reliable training signals even at low similarity scores.
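
A lexical-similarity reward of the kind compared above can be sketched with difflib; the thresholded reward-shaping variant is an illustrative assumption, and the semantic-similarity scorer (e.g., an embedding- or model-based judge) is not reproduced here.

```python
import difflib

def lexical_similarity_reward(generated_patch: str, reference_patch: str) -> float:
    """Execution-free lexical reward: sequence similarity between the generated
    and ground-truth unified diffs, in [0, 1]."""
    return difflib.SequenceMatcher(None, generated_patch, reference_patch).ratio()

def shaped_reward(similarity: float, threshold: float = 0.5) -> float:
    """Illustrative reward shaping: zero out low-similarity patches so noisy
    lexical matches do not dominate the training signal."""
    return similarity if similarity >= threshold else 0.0
```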

The authors apply SWE RL as the final stage in their Cascade RL pipeline and observe substantial gains on SWE-bench Verified, with the 14B-Thinking model achieving 43.1% pass@1, outperforming specialized 32B models. While SWE RL improves software engineering performance, it has minimal impact on other domains, with most changes attributable to evaluation variance. The unified 8B model closes the performance gap with its dedicated 8B-Thinking counterpart on SWE-bench Verified after full training, achieving 37.2% versus 38.5%.


