
Abstract
We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt reward-maximizing methods (e.g., PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced optimization method that promotes diverse exploration and generalizable reasoning trajectories. We conduct experiments on math and code reasoning tasks: FlowRL achieves a significant average improvement of 10.0% over GRPO and 5.1% over PPO on math benchmarks, and performs consistently better on code reasoning tasks. These results highlight reward distribution-matching as a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.
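To make the distribution-matching idea more concrete, the sketch below shows one plausible instantiation under stated assumptions: the target distribution is taken as p(y|x) ∝ exp(β·r(x, y)), the partition function is estimated by a small learnable network, and the policy is pushed toward the target with a flow-balance-style squared residual on sampled responses. The names `PartitionEstimator`, `flow_balance_loss`, and the temperature `beta` are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn


class PartitionEstimator(nn.Module):
    """Learnable log-partition estimate log Z_phi(x), conditioned on a prompt embedding."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, prompt_emb: torch.Tensor) -> torch.Tensor:
        # Returns log Z_phi(x) for each prompt in the batch, shape (B,)
        return self.net(prompt_emb).squeeze(-1)


def flow_balance_loss(log_pi: torch.Tensor,
                      reward: torch.Tensor,
                      log_z: torch.Tensor,
                      beta: float = 1.0) -> torch.Tensor:
    """Squared flow-balance residual over a batch of sampled responses.

    log_pi : sum of token log-probs of each sampled response under the policy, shape (B,)
    reward : scalar reward for each response, shape (B,)
    log_z  : learnable log-partition estimate per prompt, shape (B,)

    Assumption: the target distribution is p(y|x) ∝ exp(beta * r(x, y)).
    Driving (log Z + log pi - beta * r) to zero matches the policy to that target.
    """
    residual = log_z + log_pi - beta * reward
    return (residual ** 2).mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    B, H = 8, 64
    estimator = PartitionEstimator(H)
    prompt_emb = torch.randn(B, H)      # stand-in for prompt representations
    log_pi = -torch.rand(B) * 50.0      # stand-in for policy log-probs of sampled responses
    reward = torch.rand(B)              # stand-in for scalar rewards

    loss = flow_balance_loss(log_pi, reward, estimator(prompt_emb), beta=10.0)
    loss.backward()
    print(f"flow-balance loss: {loss.item():.4f}")
```

In practice the same residual would be backpropagated into the policy's log-probabilities as well as the partition estimator; the toy driver above only exercises the partition network to keep the sketch self-contained.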