Command Palette
Search for a command to run...
Poly-EPO: 探索的推論モデルのトレーニング
Poly-EPO: 探索的推論モデルのトレーニング
Ifdita Hasan Orney Jubayer Ibn Hamid Shreya S Ramanujam Shirley Wu Hengyuan Hu Noah Goodman Dorsa Sadigh Chelsea Finn
概要
経験からの学習において、探索は中核的な役割を果たす。それはエージェントが複雑な問題に対する解を見つけ、新規な問題にも泛化し、テスト時の計算量(test-time compute)に応じて性能をスケールさせることを可能にする。本論文では、楽観的な探索を明示的に促し、探索と活用(exploitation)の相乗効果を促進する、言語モデル(LMs)のトレーニング後のフレームワークを提案する。中心的なアイデアは、報酬関数(reward function)の下で集合的に正確であり、かつ推論戦略において探索的な反応セットを生成するようLMを訓練することである。まず、任意の目的関数(arbitrary objective functions)の下でセット強化学習(set RL)を用いてLMを最適化する一般的なレシピを開発し、アドバンテージ計算(advantage computation)の変更を通じて標準的なRLアルゴリズムをこの設定に適応させる方法を示す。次に、このフレームワークをインスタンス化し、探索と活用の相乗効果を明示的に目指す目的関数を用いた、Polychromatic Exploratory Policy Optimization (POLY-EPO) を提案する。複数の推論ベンチマークにおいて、POLY-EPOが高いpass@kカバレッジによって示されるような泛化性能の向上、モデル生成におけるより高い多様性の保持、およびテスト時の計算量に対する効果的なスケーリングを実現することを示す。
One-sentence Summary
This paper presents Polychromatic Exploratory Policy Optimization (POLY-EPO), a post-training framework for language models that explicitly encourages optimistic exploration and promotes synergy between exploration and exploitation via set reinforcement learning with modified advantage computation to generate diverse response sets, demonstrating improved generalization evidenced by higher pass@k coverage, greater generation diversity, and effective scaling with test-time compute across reasoning benchmarks.
Key Contributions
- The paper develops a general recipe for optimizing language models with set reinforcement learning under arbitrary objective functions. This approach adapts standard reinforcement learning algorithms to the setting through a modification to the advantage computation.
- Polychromatic Exploratory Policy Optimization (POLY-EPO) instantiates this framework with an objective that explicitly synergizes exploration and exploitation. The method encodes this synergy directly in the advantage function by depending on the covariance between average reward and diversity across generations in a set.
- Experiments across a range of reasoning benchmarks show that POLY-EPO improves generalization as evidenced by higher pass@k coverage. The method also preserves greater diversity in model generations and effectively scales with test-time compute.
Introduction
Exploration is essential for language models to generalize to novel problems and scale performance with test-time compute, but standard reinforcement learning fine-tuning often collapses generation diversity onto narrow high-reward behaviors. Previous methods typically address this by adding exploration bonuses to the reward function, yet they treat exploration and exploitation as separate objectives that rely on careful hyperparameter tuning rather than explicit synergy. The authors leverage a scalable set reinforcement learning framework to introduce Polychromatic Exploratory Policy Optimization, which optimizes sets of responses based on both average reward and reasoning strategy diversity. By incorporating the covariance between reward and diversity into the advantage function, their approach encourages optimistic exploration of novel strategies while maintaining high task performance and improving generalization across reasoning benchmarks.
Method
The authors leverage a framework called Set Reinforcement Learning (Set RL) to optimize language models. Standard reinforcement learning assigns rewards to individual trajectories, whereas Set RL assigns rewards to sets of sampled actions. In the language model setting, this means evaluating a set of n generations sampled from the same prompt. The optimization goal is to maximize the expected set-level reward function f(x,y1:n), where x is the prompt and y1:n represents the set of generations. This formulation couples all generations in a set under a shared learning signal, distinguishing it from standard approaches where each generation receives an independent advantage.
To make this framework computationally feasible, the authors derive an unbiased gradient estimator. For a given prompt, the policy samples N independent generations where N>n. From these samples, K sets of size n are constructed without replacement. The set-level score for each constructed set is computed, and a baseline is defined as the average of these scores across all sets. The set advantage is then calculated as the difference between the set score and the baseline. To apply this within standard policy gradient algorithms, a marginal set advantage is defined for each individual generation. This value is the average of the set advantages for all sets that contain the specific generation. This marginal advantage replaces the standard advantage function in algorithms such as PPO or GRPO, enabling the use of existing optimization infrastructure.
The specific objective implemented in this work is the polychromatic objective. This function is designed to encourage both exploration and exploitation. It is defined as the product of the mean reward of the generations in a set and a diversity metric for that set. The mathematical form is given by:
fpoly(x,y1:n)=n1i=1∑nr(x,yi)⋅d(x,y1:n)where r(x,yi) is the reward for an individual generation and d(x,y1:n) measures the diversity of the set. Because the objective is a product, a set must achieve both high reward and high diversity to maximize the score. This structure allows an incorrect generation to receive a positive learning signal if it contributes to the diversity of a set that contains correct generations.
Diversity is quantified using a language model judge that clusters generations based on reasoning strategies. The judge groups responses according to their underlying semantic approach, ignoring superficial differences like tone or phrasing. Responses that exhibit degenerate behaviors, such as reward hacking or unintelligible text, are isolated into a dedicated cluster and excluded from diversity calculations. The diversity of a set is calculated as the number of distinct clusters represented in the set divided by the set size. This ensures that the diversity metric reflects genuine strategic variation rather than noise.
The final training algorithm, POLY-EPO, integrates these components into a standard on-policy loop. The process involves sampling generations, constructing sets, computing polychromatic scores, and deriving marginal set advantages. These advantages are then used to update the policy parameters. This design allows the method to scale efficiently while maintaining the benefits of set-level credit assignment. The resulting policy update internalizes the balance between exploration and exploitation, as the advantage assigned to a trajectory depends on its contribution to the synergy between reward and diversity across sets.
Experiment
The study evaluates POLY-EPO on mathematical reasoning benchmarks and synthetic domains with infinitely many valid strategies, comparing it against GRPO and a diversity-augmented baseline. In mathematical reasoning tasks, POLY-EPO demonstrates superior exploration by maintaining diverse reasoning clusters throughout training and branching earlier during generation, which leads to improved generalization and effective use of test-time compute. Experiments in synthetic domains further reveal that while standard methods collapse to single strategies, POLY-EPO successfully uncovers a significantly broader repertoire of successful solutions, confirming the objective's ability to balance exploration and exploitation.