
SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning

Bo Liu, Leon Guertler, Simon Yu, Zichen Liu, Penghui Qi, Daniel Balcells, Mickel Liu, Cheston Tan, Weiyan Shi, Min Lin, Wee Sun Lee, Natasha Jaques
Abstract

Recent advances in reinforcement learning have shown that language models can develop sophisticated reasoning through training on tasks with verifiable rewards, but these approaches depend on human-curated problem-answer pairs and domain-specific reward engineering. We introduce SPIRAL, a self-play framework where models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves, eliminating the need for human supervision. Through self-play, SPIRAL generates an infinite curriculum of progressively challenging problems as models must constantly adapt to stronger opponents. To enable this self-play training at scale, we implement a fully online, multi-turn, multi-agent reinforcement learning system for LLMs and propose role-conditioned advantage estimation (RAE) to stabilize multi-agent training. Using SPIRAL, self-play on zero-sum games produces reasoning capabilities that transfer broadly. Training Qwen3-4B-Base on Kuhn Poker alone achieves 8.6% improvement on math and 8.4% on general reasoning, outperforming SFT on 25,000 expert game trajectories. Analysis reveals that this transfer occurs through three cognitive patterns: systematic decomposition, expected value calculation, and case-by-case analysis. Multi-game training (TicTacToe, Kuhn Poker, Simple Negotiation) further enhances performance as each game develops distinct reasoning strengths. Applying SPIRAL to a strong reasoning model (DeepSeek-R1-Distill-Qwen-7B) can still lead to 2.0% average improvement. These results demonstrate that zero-sum games naturally develop transferable reasoning capabilities, highlighting a promising direction for autonomous reasoning development.
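
The abstract names role-conditioned advantage estimation (RAE) without defining it. As a rough illustration only, the sketch below assumes RAE keeps a separate running baseline for each (game, role) pair and subtracts it from the episode return. The class name `RoleBaselines`, the exponential-moving-average update, and the `decay` value are all assumptions made for this example, not details taken from the abstract.

```python
from collections import defaultdict


class RoleBaselines:
    """Hypothetical helper: one running baseline per (game, role) pair."""

    def __init__(self, decay: float = 0.95):
        # Assumed EMA decay; the abstract does not give a value.
        self.decay = decay
        # Maps (game, role) -> running estimate of that role's expected return.
        self.baselines = defaultdict(float)

    def advantage(self, game: str, role: int, ret: float) -> float:
        """Update the baseline for this role, then return ret minus baseline."""
        key = (game, role)
        self.baselines[key] = (
            self.decay * self.baselines[key] + (1.0 - self.decay) * ret
        )
        return ret - self.baselines[key]


# Usage: in a zero-sum game the two roles receive opposite returns.
rb = RoleBaselines(decay=0.5)
adv_p0 = rb.advantage("kuhn_poker", role=0, ret=1.0)
adv_p1 = rb.advantage("kuhn_poker", role=1, ret=-1.0)
```

Under this reading, the point of conditioning on role is that one seat may have a structural edge (for example, moving first), so its expected return is not zero; a per-role baseline centers each role's returns separately rather than sharing one baseline across both.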