RLFR: Extending Reinforcement Learning for LLMs with Flow Environment
Jinghao Zhang Naishan Zheng Ruilin Li Dongzhou Cheng Zheming Liang Feng Zhao Jiaqi Wang

Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a promising framework for improving reasoning abilities in Large Language Models (LLMs). However, a policy optimized with binary verification is prone to overlooking potentially valuable exploration in the reasoning trajectory. In view of the heavy annotation cost of golden Process Reward Models (PRMs), recent works attempt to use auxiliary signals for reward shaping of process tokens, such as entropy and likelihood collected from the logit space. In this work, we offer a novel perspective on shaping RLVR with flow rewards derived from the latent space, and propose RLFR, in which flow fields of model latents are constructed from either off-policy high-quality data or on-policy rejection-sampling data, and the velocity deviations of policy latents within these fields are quantified to serve as a reward signal. RLFR first demonstrates that a well-established flow field can be a sound environment for reward signal collection, highlighting that the expressive latent space remains largely underexplored. Moreover, RLFR is able to compress any off-policy expert data as a reference for constituting reward signals, and we show that the efficient context dependence compressed within the hidden states is utilized, rather than individual token-level denotations, for context comprehension. Experiments on both language and multimodal reasoning benchmarks demonstrate the reliability of flow rewards and suggest a promising paradigm for reward shaping with auxiliary signals.
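To make the core idea concrete, below is a minimal sketch of how a velocity-deviation reward over policy latents could be computed. It assumes a rectified-flow-style interpolation between Gaussian noise and a hidden state, with a small velocity network trained beforehand on reference (off-policy expert or rejection-sampled) latents; the class and function names (`VelocityField`, `flow_reward`) and the exact interpolation path are illustrative assumptions, not the paper's specified implementation.

```python
import torch
import torch.nn as nn


class VelocityField(nn.Module):
    """Hypothetical MLP predicting the flow velocity of a latent at time t.

    Stands in for the flow field assumed to be pre-trained on reference
    hidden states (off-policy expert or on-policy rejection-sampled data).
    """

    def __init__(self, dim: int, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Condition the velocity prediction on the interpolation time t.
        return self.net(torch.cat([x_t, t], dim=-1))


def flow_reward(v_theta: VelocityField, h: torch.Tensor) -> torch.Tensor:
    """Score policy latents h of shape (batch, dim) by velocity deviation.

    Assumes a linear interpolation x_t = (1 - t) * noise + t * h, whose
    ground-truth velocity is (h - noise). Latents whose velocity the
    reference flow field predicts well receive a higher (less negative) reward.
    """
    noise = torch.randn_like(h)
    t = torch.rand(h.size(0), 1, device=h.device)
    x_t = (1.0 - t) * noise + t * h
    target_v = h - noise
    with torch.no_grad():
        pred_v = v_theta(x_t, t)
    deviation = (pred_v - target_v).pow(2).mean(dim=-1)
    return -deviation  # smaller deviation -> higher process reward


if __name__ == "__main__":
    dim = 4096
    v_theta = VelocityField(dim)           # assumed already trained on reference latents
    policy_latents = torch.randn(8, dim)   # hidden states of sampled policy tokens
    rewards = flow_reward(v_theta, policy_latents)
    print(rewards.shape)  # torch.Size([8])
```

In practice such a per-token score would be combined with the verifiable outcome reward during policy optimization; the sketch only illustrates how a deviation from a learned flow field can be turned into a dense process-level signal.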