Reinforcement Learning on Pre-Training Data

Abstract

The growing disparity between the exponential scaling of computational resources and the finite growth of high-quality text data now constrains conventional scaling approaches for large language models (LLMs). To address this challenge, we introduce Reinforcement Learning on Pre-Training data (RLPT), a new training-time scaling paradigm for optimizing LLMs. In contrast to prior approaches that scale training primarily through supervised learning, RLPT enables the policy to autonomously explore meaningful trajectories to learn from pre-training data and improve its capability through reinforcement learning (RL). While existing RL strategies such as reinforcement learning from human feedback (RLHF) and reinforcement learning with verifiable rewards (RLVR) rely on human annotation for reward construction, RLPT eliminates this dependency by deriving reward signals directly from pre-training data. Specifically, it adopts a next-segment reasoning objective, rewarding the policy for accurately predicting subsequent text segments conditioned on the preceding context. This formulation allows RL to be scaled on pre-training data, encouraging the exploration of richer trajectories across broader contexts and thereby fostering more generalizable reasoning skills. Extensive experiments on both general-domain and mathematical reasoning benchmarks across multiple models validate the effectiveness of RLPT. For example, when applied to Qwen3-4B-Base, RLPT yields absolute improvements of 3.0, 5.1, 8.1, 6.0, 6.6, and 5.3 on MMLU, MMLU-Pro, GPQA-Diamond, KOR-Bench, AIME24, and AIME25, respectively. The results further demonstrate favorable scaling behavior, suggesting strong potential for continued gains with more compute. In addition, RLPT provides a solid foundation, extending the reasoning boundaries of LLMs and enhancing RLVR performance.
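
The next-segment objective described in the abstract can be illustrated with a small sketch. The abstract does not say how segments are delimited or how prediction accuracy is scored, so everything below is an assumption made for illustration: segments are taken to be sentences, the reward is a token-overlap F1 standing in for whatever scoring the authors actually use, and the function names (`split_segments`, `next_segment_examples`, `overlap_f1`) are hypothetical.

```python
from __future__ import annotations

import re
from collections import Counter


def split_segments(text: str) -> list[str]:
    """Split raw text into sentence-like segments (assumed segmentation unit)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def next_segment_examples(text: str) -> list[tuple[str, str]]:
    """Build (preceding context, next segment) pairs from unlabeled text."""
    segments = split_segments(text)
    return [(" ".join(segments[:i]), segments[i]) for i in range(1, len(segments))]


def overlap_f1(predicted: str, reference: str) -> float:
    """Stand-in reward: token-overlap F1 between prediction and true segment.

    This is an illustrative proxy, not the reward used in the paper.
    """
    pred = predicted.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    doc = "The sky is blue. Grass is green. Snow is white."
    for context, target in next_segment_examples(doc):
        # A real policy would generate this guess from the context alone.
        guess = target
        print(repr(context), "->", repr(target), overlap_f1(guess, target))
```

The point of the sketch is that both the training pairs and the reward come from the raw text itself, with no human annotation, which is the property the abstract attributes to RLPT.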
