Abstract
Reinforcement learning from human feedback (RLHF) has emerged as the standard paradigm for aligning large language models (LLMs) with human preferences. However, reward-based methods built on the Bradley-Terry assumption struggle to capture the non-transitive and heterogeneous nature of real-world preferences. To address this, recent studies have reframed alignment as a two-player Nash game, giving rise to Nash learning from human feedback (NLHF). While this perspective has inspired algorithms such as INPO, ONPO, and EGPO with strong theoretical and empirical guarantees, they remain fundamentally restricted to two-player interactions, creating a single-opponent bias that fails to capture the full complexity of realistic preference structures. In this work, we introduce Multiplayer Nash Preference Optimization (MNPO), a novel framework that generalizes NLHF to the multiplayer regime. It formulates alignment as an n-player game, where each policy competes against a population of opponents while being regularized toward a reference model. Our framework establishes well-defined Nash equilibria in multiplayer settings and extends the concept of duality gap to quantify approximation quality. We demonstrate that MNPO inherits the equilibrium guarantees of two-player methods while enabling richer competitive dynamics and improved coverage of diverse preference structures. Through comprehensive empirical evaluation, we show that MNPO consistently outperforms existing NLHF baselines on instruction-following benchmarks, achieving superior alignment quality under heterogeneous annotator conditions and mixed-policy evaluation scenarios. Together, these results establish MNPO as a principled and scalable framework for aligning LLMs with complex, non-transitive human preferences. Code is available at https://github.com/smiles724/MNPO.
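As an illustration only (not the paper's exact formulation), the kind of KL-regularized n-player preference game the abstract describes can be sketched as follows; here the policies pi_i, the preference oracle P, the prompt distribution rho, and the regularization coefficient tau are all assumed notation introduced for this sketch.

% Illustrative sketch only: assumed notation, not the exact MNPO objective.
% Each player i best-responds to the population of opponents {pi_j}_{j != i},
% maximizing its expected preference wins while staying close to pi_ref.
\begin{equation*}
\pi_i^{\star} \in \arg\max_{\pi_i}\;
\mathbb{E}_{x \sim \rho}\!\left[
\frac{1}{n-1}\sum_{j \neq i}
\mathbb{E}_{y \sim \pi_i(\cdot \mid x),\; y' \sim \pi_j(\cdot \mid x)}
\bigl[\mathcal{P}(y \succ y' \mid x)\bigr]
\;-\; \tau\,\mathrm{KL}\!\bigl(\pi_i(\cdot \mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\bigr)
\right],
\qquad i = 1,\dots,n.
\end{equation*}

A joint solution of these coupled best-response problems is a Nash equilibrium of the regularized multiplayer game; with n = 2 this reduces to the two-player NLHF setting the abstract contrasts against.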