
Absolute Zero: Reinforced Self-play Reasoning with Zero Data

Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Yang Yue, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, Gao Huang
Release Date: 5/11/2025
Abstract

Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards. Recent RLVR works that operate under the zero setting avoid supervision in labeling the reasoning process, but still depend on manually curated collections of questions and answers for training. The scarcity of high-quality, human-produced examples raises concerns about the long-term scalability of relying on human supervision, a challenge already evident in the domain of language model pretraining. Furthermore, in a hypothetical future where AI surpasses human intelligence, tasks provided by humans may offer limited learning potential for a superintelligent system. To address these concerns, we propose a new RLVR paradigm called Absolute Zero, in which a single model learns to propose tasks that maximize its own learning progress and improves reasoning by solving them, without relying on any external data. Under this paradigm, we introduce the Absolute Zero Reasoner (AZR), a system that self-evolves its training curriculum and reasoning ability by using a code executor to both validate proposed code reasoning tasks and verify answers, serving as a unified source of verifiable reward to guide open-ended yet grounded learning. Despite being trained entirely without external data, AZR achieves overall SOTA performance on coding and mathematical reasoning tasks, outperforming existing zero-setting models that rely on tens of thousands of in-domain human-curated examples. Furthermore, we demonstrate that AZR can be effectively applied across different model scales and is compatible with various model classes.
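
To make the executor-as-verifier idea concrete, the sketch below shows one illustrative propose-validate-solve-verify step in Python. It is a minimal sketch under assumptions, not the paper's implementation: the `execute` helper, the hard-coded proposed program, and the placeholder `prediction` are hypothetical stand-ins, and a real system would sandbox execution and let the model itself generate both the task and the answer.

```python
import subprocess
import sys


def execute(program: str, task_input: str, timeout: float = 5.0) -> str | None:
    """Run a candidate program on an input and return its stdout.

    Hypothetical stand-in for a code executor; a real system would sandbox
    execution rather than spawning a raw interpreter.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            input=task_input, capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None
    if result.returncode != 0:
        return None  # proposed task is invalid (it crashes), so it is discarded
    return result.stdout.strip()


# Propose: a (program, input) task the model might emit -- illustrative only.
proposed_program = "import sys; print(sum(int(x) for x in sys.stdin.read().split()))"
proposed_input = "3 5 7"

# Validate: the executor runs the proposed task to obtain a ground-truth answer.
gold = execute(proposed_program, proposed_input)

# Solve + verify: the solver's answer is checked against the executor's output,
# yielding a verifiable reward without any human-labeled data.
prediction = "15"  # placeholder for the model's own predicted output
reward = 1.0 if gold is not None and prediction == gold else 0.0
print(f"gold={gold!r} prediction={prediction!r} reward={reward}")
```

In this toy step, the executor plays both roles described in the abstract: it filters out ill-formed proposed tasks and supplies the ground truth against which the solver's answer is scored.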