
Sharing is Caring: Efficient LM Post-Training with Collective RL Experience Sharing

Jeffrey Amico, Gabriel Passamani Andrade, John Donaghy, Ben Fielding, Tristin Forbus, Harry Grieve, Semih Kara, Jari Kolehmainen, Yihua Lou, Christopher Nies, Edward Phillip Flores Nuño, Diogo Ortega, Shikhar Rastogi, Austin Virts, Matthew J. Wright
Abstract

Post-training language models (LMs) with reinforcement learning (RL) can enhance their complex reasoning capabilities without supervised fine-tuning, as demonstrated by DeepSeek-R1-Zero. However, effectively utilizing RL for LMs requires significant parallelization to scale up inference, which introduces non-trivial technical challenges (e.g., latency, memory, and reliability) alongside ever-growing financial costs. We present Swarm sAmpling Policy Optimization (SAPO), a fully decentralized and asynchronous RL post-training algorithm. SAPO is designed for decentralized networks of heterogeneous compute nodes, where each node manages its own policy model(s) while "sharing" rollouts with others in the network; no explicit assumptions about latency, model homogeneity, or hardware are required, and nodes can operate in isolation if desired. As a result, the algorithm avoids common bottlenecks in scaling RL post-training while also allowing (and even encouraging) new possibilities. By sampling rollouts "shared" across the network, it enables "Aha moments" to propagate, thereby bootstrapping the learning process. In this paper we show that SAPO achieved cumulative reward gains of up to 94% in controlled experiments. We also share insights from tests on a network with thousands of nodes contributed by Gensyn community members running the algorithm on diverse hardware and models during an open-source demo.
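
To make the rollout-sharing idea concrete, below is a minimal, illustrative sketch of how a node might mix its own rollouts with ones received from peers when assembling a training batch. All names here (`Rollout`, `SwarmNode`, `local_fraction`, and the toy reward) are assumptions introduced for illustration; they are not the paper's actual API or implementation, and the real algorithm's sampling and reward details are described in the paper itself.

```python
# Illustrative sketch only: a node that samples a training batch from a mix of
# locally generated rollouts and rollouts shared by other nodes in the network.
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    """A single (prompt, completion, reward) sample produced by some node."""
    prompt: str
    completion: str
    reward: float
    origin: str  # identifier of the node that generated it


class SwarmNode:
    """One compute node: keeps its own policy and a pool of shared rollouts."""

    def __init__(self, node_id: str, local_fraction: float = 0.5):
        self.node_id = node_id
        self.local_fraction = local_fraction  # share of each batch drawn locally
        self.shared_pool: List[Rollout] = []  # rollouts received from other nodes

    def generate_local_rollouts(self, prompts: List[str]) -> List[Rollout]:
        # Placeholder for sampling completions from this node's own policy
        # and scoring them with a reward function (random rewards here).
        return [
            Rollout(p, completion=f"<completion for {p}>",
                    reward=random.random(), origin=self.node_id)
            for p in prompts
        ]

    def receive_shared(self, rollouts: List[Rollout]) -> None:
        # Asynchronously accumulate rollouts broadcast by other nodes.
        self.shared_pool.extend(rollouts)

    def build_training_batch(self, prompts: List[str], batch_size: int) -> List[Rollout]:
        # Mix locally generated rollouts with rollouts sampled from the shared
        # pool, so useful experience from other nodes can propagate.
        n_local = int(batch_size * self.local_fraction)
        local = self.generate_local_rollouts(prompts)[:n_local]
        n_shared = min(batch_size - len(local), len(self.shared_pool))
        shared = random.sample(self.shared_pool, n_shared) if n_shared else []
        return local + shared


# Toy usage: node "b" receives rollouts from node "a" and builds a mixed batch.
node_a, node_b = SwarmNode("a"), SwarmNode("b")
node_b.receive_shared(node_a.generate_local_rollouts(["2+2=?", "prove sqrt(2) is irrational"]))
batch = node_b.build_training_batch(["what is 3*7?"], batch_size=4)
print([(r.origin, round(r.reward, 2)) for r in batch])
```

In this sketch the only coordination between nodes is the exchange of rollouts themselves, which mirrors the abstract's point that no assumptions about latency, model homogeneity, or hardware are needed; each node could update its own policy on the mixed batch with whatever RL objective it uses locally.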