HyperAI

4 months ago

OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling

Zengzhi Wang, Fan Zhou, Xuefeng Li, Pengfei Liu


Abstract

Different base language model families, such as Llama and Qwen, exhibit divergent behaviors during post-training with reinforcement learning (RL), especially on reasoning-intensive tasks. What makes a base language model suitable for reinforcement learning? Gaining deeper insight into this question is essential for developing RL-scalable foundation models of the next generation. In this work, we investigate how mid-training strategies shape RL dynamics, focusing on two representative model families: Qwen and Llama. Our study reveals that (1) high-quality mathematical corpora, such as MegaMath-Web-Pro, significantly improve both base model and RL performance, while existing alternatives (e.g., FineMath-4plus) fail to do so; (2) further adding QA-style data, particularly long chain-of-thought (CoT) reasoning examples, enhances RL outcomes, and instruction data further unlocks this effect; (3) while long CoT improves reasoning depth, it can also induce verbosity in model responses and instability in RL training, underscoring the importance of data formatting; (4) scaling mid-training consistently leads to stronger downstream RL performance. Building on these insights, we introduce a two-stage mid-training strategy, Stable-then-Decay, in which base models are first trained on 200B tokens with a constant learning rate, followed by 20B tokens across three CoT-focused branches with learning rate decay. This yields OctoThinker, a family of models demonstrating strong RL compatibility and closing the performance gap with more RL-friendly model families, i.e., Qwen. We hope our work will help shape pre-training strategies for foundation models in the RL era. To support further research, we release our open-source models along with a curated math reasoning-intensive corpus of over 70 billion tokens (i.e., MegaMath-Web-Pro-Max).
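The Stable-then-Decay schedule described in the abstract can be sketched as a simple learning-rate function of tokens seen. Only the 200B-token stable stage and 20B-token decay stage come from the abstract; the peak and minimum learning rates and the cosine decay shape below are illustrative assumptions, not values reported by the paper.

```python
import math

def stable_then_decay_lr(tokens_seen: float,
                         peak_lr: float = 3e-4,      # assumed peak LR (not from the paper)
                         min_lr: float = 3e-5,       # assumed final LR (not from the paper)
                         stable_tokens: float = 200e9,  # stable stage: 200B tokens (from abstract)
                         decay_tokens: float = 20e9) -> float:  # decay stage: 20B tokens (from abstract)
    """Two-stage schedule: constant LR, then (assumed) cosine decay to min_lr."""
    if tokens_seen <= stable_tokens:
        # Stage 1: hold the learning rate constant for the full stable budget.
        return peak_lr
    # Stage 2: decay over the remaining budget; clamp progress at 1.0.
    progress = min((tokens_seen - stable_tokens) / decay_tokens, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

In practice each of the three CoT-focused branches would run this decay stage independently from the same stable-stage checkpoint, which is what makes the branching cheap relative to repeating the full 200B-token run.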

Code Repositories

gair-nlp/octothinker
Official
pytorch
Mentioned in GitHub
