
Abstract
We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt reward-maximizing methods (e.g., PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced optimization method that promotes diverse exploration and generalizable reasoning trajectories. We conduct experiments on math and code reasoning tasks: FlowRL achieves a significant average improvement of 10.0% over GRPO and 5.1% over PPO on math benchmarks, and performs consistently better on code reasoning tasks. These results highlight reward distribution-matching as a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.
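To make the distribution-matching idea more concrete, the sketch below shows one plausible instantiation under stated assumptions: the target distribution is taken as p(y|x) ∝ exp(β·r(x, y)), the partition function is estimated by a small learnable network, and the policy is pushed toward the target with a flow-balance-style squared residual on sampled responses. The names `PartitionEstimator`, `flow_balance_loss`, and the temperature `beta` are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn


class PartitionEstimator(nn.Module):
    """Learnable log-partition estimate log Z_phi(x), conditioned on a prompt embedding."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, prompt_emb: torch.Tensor) -> torch.Tensor:
        # Returns log Z_phi(x) for each prompt in the batch, shape (B,)
        return self.net(prompt_emb).squeeze(-1)


def flow_balance_loss(log_pi: torch.Tensor,
                      reward: torch.Tensor,
                      log_z: torch.Tensor,
                      beta: float = 1.0) -> torch.Tensor:
    """Squared flow-balance residual over a batch of sampled responses.

    log_pi : sum of token log-probs of each sampled response under the policy, shape (B,)
    reward : scalar reward for each response, shape (B,)
    log_z  : learnable log-partition estimate per prompt, shape (B,)

    Assumption: the target distribution is p(y|x) ∝ exp(beta * r(x, y)).
    Driving (log Z + log pi - beta * r) to zero matches the policy to that target.
    """
    residual = log_z + log_pi - beta * reward
    return (residual ** 2).mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    B, H = 8, 64
    estimator = PartitionEstimator(H)
    prompt_emb = torch.randn(B, H)      # stand-in for prompt representations
    log_pi = -torch.rand(B) * 50.0      # stand-in for policy log-probs of sampled responses
    reward = torch.rand(B)              # stand-in for scalar rewards

    loss = flow_balance_loss(log_pi, reward, estimator(prompt_emb), beta=10.0)
    loss.backward()
    print(f"flow-balance loss: {loss.item():.4f}")
```

In practice the same residual would be backpropagated into the policy's log-probabilities as well as the partition estimator; the toy driver above only exercises the partition network to keep the sketch self-contained.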