SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning

Bo Liu, Leon Guertler, Simon Yu, Zichen Liu, Penghui Qi, Daniel Balcells, Mickel Liu, Cheston Tan, Weiyan Shi, Min Lin, Wee Sun Lee, Natasha Jaques

Abstract

Recent advances in reinforcement learning have shown that language models can develop sophisticated reasoning through training on tasks with verifiable rewards, but these approaches depend on human-curated problem-answer pairs and domain-specific reward engineering. We introduce SPIRAL, a self-play framework where models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves, eliminating the need for human supervision. Through self-play, SPIRAL generates an infinite curriculum of progressively challenging problems as models must constantly adapt to stronger opponents. To enable this self-play training at scale, we implement a fully online, multi-turn, multi-agent reinforcement learning system for LLMs and propose role-conditioned advantage estimation (RAE) to stabilize multi-agent training. Using SPIRAL, self-play on zero-sum games produces reasoning capabilities that transfer broadly. Training Qwen3-4B-Base on Kuhn Poker alone achieves an 8.6% improvement on math and 8.4% on general reasoning, outperforming SFT on 25,000 expert game trajectories. Analysis reveals that this transfer occurs through three cognitive patterns: systematic decomposition, expected value calculation, and case-by-case analysis. Multi-game training (TicTacToe, Kuhn Poker, Simple Negotiation) further enhances performance as each game develops distinct reasoning strengths. Applying SPIRAL to a strong reasoning model (DeepSeek-R1-Distill-Qwen-7B) can still lead to a 2.0% average improvement. These results demonstrate that zero-sum games naturally develop transferable reasoning capabilities, highlighting a promising direction for autonomous reasoning development.
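
The abstract names role-conditioned advantage estimation (RAE) without giving its formula. As a rough illustration of the idea, the sketch below assumes that each (game, role) pair keeps its own running baseline, updated as an exponential moving average of episode returns, and that the advantage is the return minus that baseline. The class name `RoleConditionedBaseline`, the `decay` parameter, and the update rule are assumptions made for this example, not details confirmed by the abstract.

```python
from collections import defaultdict


class RoleConditionedBaseline:
    """Hypothetical sketch: one moving-average baseline per (game, role)."""

    def __init__(self, decay: float = 0.95):
        # decay is an assumed smoothing factor, not a value from the paper.
        self.decay = decay
        self.baselines = defaultdict(float)  # every baseline starts at 0.0

    def advantage(self, game: str, role: int, episode_return: float) -> float:
        key = (game, role)
        # Update the baseline for this game and role only.
        updated = (self.decay * self.baselines[key]
                   + (1.0 - self.decay) * episode_return)
        self.baselines[key] = updated
        # Advantage is measured against the role-specific baseline.
        return episode_return - updated


rae = RoleConditionedBaseline(decay=0.5)
adv_player0 = rae.advantage("kuhn_poker", 0, 1.0)   # player 0 wins
adv_player1 = rae.advantage("kuhn_poker", 1, -1.0)  # player 1 loses (zero-sum)
```

Keeping the baselines separate means that a role with a structural edge in a game (for example, moving first) does not distort the advantage estimate of the other role, which is one plausible reading of how such a scheme could stabilize multi-agent training.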