Poly-EPO: Training Exploratory Modeling Approaches
Ifdita Hasan Orney Jubayer Ibn Hamid Shreya S Ramanujam Shirley Wu Hengyuan Hu Noah Goodman Dorsa Sadigh Chelsea Finn
Abstract
Exploration is a cornerstone of learning systems: it enables agents to find solutions to complex problems, generalize to novel situations, and scale performance with test-time compute. In this paper, we present a framework for post-training language models (LMs) that explicitly encourages optimistic exploration and promotes a synergy between exploration and exploitation. The central idea is to train the LM to generate sets of responses that are collectively correct under the reward function and exploratory in their reasoning strategies. First, we develop a general recipe for optimizing LMs with set reinforcement learning (set RL) under arbitrary objectives, and we show how standard RL algorithms can be adapted to this setting through a modification to the advantage computation. We then propose Polychromatic Exploratory Policy Optimization (POLY-EPO), which instantiates this framework with an objective that explicitly synergizes exploration and exploitation. Across a range of reasoning benchmarks, we show that POLY-EPO improves generalization, as evidenced by higher pass@k coverage, preserves diversity in model generations, and scales efficiently with test-time compute.
One-sentence Summary
This paper presents Polychromatic Exploratory Policy Optimization (POLY-EPO), a post-training framework for language models that explicitly encourages optimistic exploration and promotes synergy between exploration and exploitation via set reinforcement learning with modified advantage computation to generate diverse response sets, demonstrating improved generalization evidenced by higher pass@k coverage, greater generation diversity, and effective scaling with test-time compute across reasoning benchmarks.
Key Contributions
- The paper develops a general recipe for optimizing language models with set reinforcement learning under arbitrary objective functions. This approach adapts standard reinforcement learning algorithms to the setting through a modification to the advantage computation.
- Polychromatic Exploratory Policy Optimization (POLY-EPO) instantiates this framework with an objective that explicitly synergizes exploration and exploitation. The advantage function encodes this synergy directly: it depends on the covariance between average reward and diversity across the generations in a set.
- Experiments across a range of reasoning benchmarks show that POLY-EPO improves generalization as evidenced by higher pass@k coverage. The method also preserves greater diversity in model generations and effectively scales with test-time compute.
Introduction
Exploration is essential for language models to generalize to novel problems and scale performance with test-time compute, but standard reinforcement learning fine-tuning often collapses generation diversity onto narrow high-reward behaviors. Previous methods typically address this by adding exploration bonuses to the reward function, yet they treat exploration and exploitation as separate objectives that rely on careful hyperparameter tuning rather than explicit synergy. The authors leverage a scalable set reinforcement learning framework to introduce Polychromatic Exploratory Policy Optimization, which optimizes sets of responses based on both average reward and reasoning strategy diversity. By incorporating the covariance between reward and diversity into the advantage function, their approach encourages optimistic exploration of novel strategies while maintaining high task performance and improving generalization across reasoning benchmarks.
Method
The authors leverage a framework called Set Reinforcement Learning (Set RL) to optimize language models. Standard reinforcement learning assigns rewards to individual trajectories, whereas Set RL assigns rewards to sets of sampled actions. In the language model setting, this means evaluating a set of $n$ generations sampled from the same prompt. The optimization goal is to maximize the expected set-level reward function $f(x, y_{1:n})$, where $x$ is the prompt and $y_{1:n}$ represents the set of generations. This formulation couples all generations in a set under a shared learning signal, distinguishing it from standard approaches where each generation receives an independent advantage.
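Written out, the objective described above might look as follows. This is only a sketch: the policy parameters $\theta$, the policy $\pi_\theta$, and the prompt distribution $\mathcal{D}$ are notation assumed here for illustration and do not appear in the summary itself.

$$J(\theta) = \mathbb{E}_{x \sim \mathcal{D}} \; \mathbb{E}_{y_1, \dots, y_n \overset{\text{i.i.d.}}{\sim} \pi_\theta(\cdot \mid x)} \left[ f(x, y_{1:n}) \right]$$

Under that notation, the usual score-function identity gives a gradient in which every generation in the set shares the same set-level signal:

$$\nabla_\theta J(\theta) = \mathbb{E} \left[ f(x, y_{1:n}) \sum_{i=1}^{n} \nabla_\theta \log \pi_\theta(y_i \mid x) \right]$$

Setting $n = 1$ recovers the standard single-generation RL objective.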
To make this framework computationally feasible, the authors derive an unbiased gradient estimator. For a given prompt, the policy samples N independent generations where N>n. From these samples, K sets of size n are constructed without replacement. The set-level score for each constructed set is computed, and a baseline is defined as the average of these scores across all sets. The set advantage is then calculated as the difference between the set score and the baseline. To apply this within standard policy gradient algorithms, a marginal set advantage is defined for each individual generation. This value is the average of the set advantages for all sets that contain the specific generation. This marginal advantage replaces the standard advantage function in algorithms such as PPO or GRPO, enabling the use of existing optimization infrastructure.
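The baseline and marginal set advantage described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `build_sets` and `marginal_set_advantages` are hypothetical, sets are drawn as distinct index subsets by enumerating all combinations (practical only for small N), and a generation that lands in no set is assumed to get an advantage of zero.

```python
import itertools
import random
from typing import List, Sequence, Tuple


def build_sets(num_generations: int, set_size: int, num_sets: int,
               seed: int = 0) -> List[Tuple[int, ...]]:
    """Pick up to K distinct index sets of size n out of N generations.

    Assumption: "without replacement" is read as no repeated index inside
    a set and no repeated set. Enumerating all subsets is fine for a toy N.
    """
    all_sets = list(itertools.combinations(range(num_generations), set_size))
    rng = random.Random(seed)
    return rng.sample(all_sets, min(num_sets, len(all_sets)))


def marginal_set_advantages(sets: Sequence[Tuple[int, ...]],
                            set_scores: Sequence[float],
                            num_generations: int) -> List[float]:
    """Average, per generation, the advantages of the sets containing it."""
    baseline = sum(set_scores) / len(set_scores)   # mean score over all K sets
    totals = [0.0] * num_generations
    counts = [0] * num_generations
    for members, score in zip(sets, set_scores):
        set_advantage = score - baseline           # set score minus baseline
        for i in members:
            totals[i] += set_advantage
            counts[i] += 1
    # Assumption: generations that appear in no set get zero advantage.
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


# Toy usage: 3 generations, sets of size 2, set score = mean reward.
rewards = [1.0, 0.0, 0.0]
sets = build_sets(num_generations=3, set_size=2, num_sets=3)
scores = [sum(rewards[i] for i in s) / len(s) for s in sets]
advantages = marginal_set_advantages(sets, scores, num_generations=3)
```

In the toy usage the resulting per-generation values would stand in for the usual advantage term in a PPO- or GRPO-style update.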
The specific objective implemented in this work is the polychromatic objective. This function is designed to encourage both exploration and exploitation. It is defined as the product of the mean reward of the generations in a set and a diversity metric for that set. The mathematical form is given by:
$$f_{\text{poly}}(x, y_{1:n}) = \left( \frac{1}{n} \sum_{i=1}^{n} r(x, y_i) \right) \cdot d(x, y_{1:n})$$

where $r(x, y_i)$ is the reward for an individual generation and $d(x, y_{1:n})$ measures the diversity of the set. Because the objective is a product, a set must achieve both high reward and high diversity to maximize the score. This structure allows an incorrect generation to receive a positive learning signal if it contributes to the diversity of a set that contains correct generations.
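As a small numerical illustration of the product form, the sketch below scores two hypothetical sets of three generations. The function name `polychromatic_score` and the toy reward and diversity values are assumptions made for this example only.

```python
from typing import Sequence


def polychromatic_score(rewards: Sequence[float], diversity: float) -> float:
    """Mean reward of the set multiplied by the set's diversity."""
    if not rewards:
        raise ValueError("a set needs at least one generation")
    mean_reward = sum(rewards) / len(rewards)
    return mean_reward * diversity


# Three correct answers that all use one strategy (1 cluster out of 3).
all_correct_one_strategy = polychromatic_score([1.0, 1.0, 1.0], diversity=1 / 3)

# Two correct answers plus one incorrect answer that uses a second strategy.
mixed_two_strategies = polychromatic_score([1.0, 1.0, 0.0], diversity=2 / 3)
```

Here the mixed set scores 4/9 against 1/3 for the uniform set, which shows how an incorrect but strategically distinct generation can raise the score of the set it belongs to.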
Diversity is quantified using a language model judge that clusters generations based on reasoning strategies. The judge groups responses according to their underlying semantic approach, ignoring superficial differences like tone or phrasing. Responses that exhibit degenerate behaviors, such as reward hacking or unintelligible text, are isolated into a dedicated cluster and excluded from diversity calculations. The diversity of a set is calculated as the number of distinct clusters represented in the set divided by the set size. This ensures that the diversity metric reflects genuine strategic variation rather than noise.
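The diversity computation can be sketched as below, assuming the judge has already assigned a cluster label to each generation. The `DEGENERATE` label and the function name `set_diversity` are hypothetical, and the sketch assumes that degenerate generations still count toward the set size in the denominator, which the description above does not spell out.

```python
from typing import Hashable, Sequence

# Hypothetical label for the judge's dedicated cluster of degenerate outputs.
DEGENERATE = "degenerate"


def set_diversity(cluster_ids: Sequence[Hashable],
                  degenerate_label: Hashable = DEGENERATE) -> float:
    """Distinct non-degenerate clusters in the set, divided by the set size."""
    if not cluster_ids:
        raise ValueError("a set needs at least one generation")
    distinct = {c for c in cluster_ids if c != degenerate_label}
    # Assumption: the denominator is the full set size, degenerates included.
    return len(distinct) / len(cluster_ids)


# Four generations: two strategies ("a", "b") and one degenerate output.
example = set_diversity(["a", "b", "a", DEGENERATE])  # 2 clusters / 4 = 0.5
```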
The final training algorithm, POLY-EPO, integrates these components into a standard on-policy loop. The process involves sampling generations, constructing sets, computing polychromatic scores, and deriving marginal set advantages. These advantages are then used to update the policy parameters. This design allows the method to scale efficiently while maintaining the benefits of set-level credit assignment. The resulting policy update internalizes the balance between exploration and exploitation, as the advantage assigned to a trajectory depends on its contribution to the synergy between reward and diversity across sets.
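Putting the steps together, the advantage computation for a single prompt might look like the sketch below. It is an illustration under stated assumptions rather than the published algorithm: it uses every size-n subset of the sampled generations instead of K sampled sets, takes rewards and cluster labels as given inputs, ignores degenerate clusters, and leaves out the policy update itself. The name `poly_epo_advantages` is hypothetical.

```python
from itertools import combinations
from typing import List, Sequence


def poly_epo_advantages(rewards: Sequence[float], clusters: Sequence[str],
                        set_size: int) -> List[float]:
    """Per-generation marginal set advantages under the polychromatic score."""
    num_generations = len(rewards)
    if not 1 <= set_size <= num_generations:
        raise ValueError("set_size must be between 1 and the number of generations")

    # Assumption: enumerate all subsets rather than sampling K of them.
    sets = list(combinations(range(num_generations), set_size))

    scores = []
    for members in sets:
        mean_reward = sum(rewards[i] for i in members) / set_size
        diversity = len({clusters[i] for i in members}) / set_size
        scores.append(mean_reward * diversity)     # polychromatic set score

    baseline = sum(scores) / len(scores)           # mean score over all sets
    advantages = []
    for i in range(num_generations):
        own = [s - baseline for members, s in zip(sets, scores) if i in members]
        advantages.append(sum(own) / len(own))     # average over sets containing i
    return advantages


# Three correct answers sharing strategy "A", one incorrect answer using "B".
adv = poly_epo_advantages([1.0, 1.0, 1.0, 0.0], ["A", "A", "A", "B"], set_size=3)
```

In this toy case the incorrect generation receives a small positive advantage (1/36), while each of the three correct but redundant generations receives a slightly negative one (-1/108), consistent with the optimistic exploration behavior described above.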
Experiment
The study evaluates POLY-EPO on mathematical reasoning benchmarks and synthetic domains with infinitely many valid strategies, comparing it against GRPO and a diversity-augmented baseline. In mathematical reasoning tasks, POLY-EPO demonstrates superior exploration by maintaining diverse reasoning clusters throughout training and branching earlier during generation, which leads to improved generalization and effective use of test-time compute. Experiments in synthetic domains further reveal that while standard methods collapse to single strategies, POLY-EPO successfully uncovers a significantly broader repertoire of successful solutions, confirming the objective's ability to balance exploration and exploitation.