Abstract
Reinforcement learning from human feedback (RLHF) has emerged as the standard paradigm for aligning large language models (LLMs) with human preferences. However, reward-based methods built on the Bradley-Terry assumption struggle to capture the non-transitive and heterogeneous nature of real-world preferences. To address this, recent studies have reframed alignment as a two-player Nash game, giving rise to Nash learning from human feedback (NLHF). While this perspective has inspired algorithms such as INPO, ONPO, and EGPO with strong theoretical and empirical guarantees, they remain fundamentally restricted to two-player interactions, creating a single-opponent bias that fails to capture the full complexity of realistic preference structures. In this work, we introduce Multiplayer Nash Preference Optimization (MNPO), a novel framework that generalizes NLHF to the multiplayer regime. It formulates alignment as an n-player game, where each policy competes against a population of opponents while being regularized toward a reference model. Our framework establishes well-defined Nash equilibria in multiplayer settings and extends the concept of duality gap to quantify approximation quality. We demonstrate that MNPO inherits the equilibrium guarantees of two-player methods while enabling richer competitive dynamics and improved coverage of diverse preference structures. Through comprehensive empirical evaluation, we show that MNPO consistently outperforms existing NLHF baselines on instruction-following benchmarks, achieving superior alignment quality under heterogeneous annotator conditions and mixed-policy evaluation scenarios. Together, these results establish MNPO as a principled and scalable framework for aligning LLMs with complex, non-transitive human preferences. Code is available at https://github.com/smiles724/MNPO.
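As an illustration only (not the paper's exact formulation), the kind of KL-regularized n-player preference game the abstract describes can be sketched as follows; here the policies pi_i, the preference oracle P, the prompt distribution rho, and the regularization coefficient tau are all assumed notation introduced for this sketch.

% Illustrative sketch only: assumed notation, not the exact MNPO objective.
% Each player i best-responds to the population of opponents {pi_j}_{j != i},
% maximizing its expected preference wins while staying close to pi_ref.
\begin{equation*}
\pi_i^{\star} \in \arg\max_{\pi_i}\;
\mathbb{E}_{x \sim \rho}\!\left[
\frac{1}{n-1}\sum_{j \neq i}
\mathbb{E}_{y \sim \pi_i(\cdot \mid x),\; y' \sim \pi_j(\cdot \mid x)}
\bigl[\mathcal{P}(y \succ y' \mid x)\bigr]
\;-\; \tau\,\mathrm{KL}\!\bigl(\pi_i(\cdot \mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\bigr)
\right],
\qquad i = 1,\dots,n.
\end{equation*}

A joint solution of these coupled best-response problems is a Nash equilibrium of the regularized multiplayer game; with n = 2 this reduces to the two-player NLHF setting the abstract contrasts against.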