Pre-Trained Policy Discriminators are General Reward Models

We offer a novel perspective on reward modeling by formulating it as a policy discriminator, which quantifies the difference between two policies to generate a reward signal, guiding the training policy towards a target policy with desired behaviors. Based on this conceptual insight, we propose a scalable pre-training method named Policy Discriminative Learning (POLAR), which trains a reward model (RM) to discern identical policies and discriminate different ones. Unlike traditional reward modeling methods relying on absolute preferences, POLAR captures the relative difference between one policy and an arbitrary target policy, which is a scalable, high-level optimization objective suitable for modeling generic ranking relationships. Leveraging the POLAR pre-training paradigm, we present a series of RMs with parameter scales from 1.8B to 7B. Empirical results show that POLAR substantially outperforms traditional non-pre-trained methods, significantly enhancing RM performance. For instance, compared to SOTA baselines, POLAR-7B improves preference accuracy from 54.8% to 81.0% on STEM tasks and from 57.9% to 85.5% on creative writing tasks. POLAR also shows robust generalization capabilities in RLHF using Reinforcement Fine-tuning (RFT), providing reliable reward signals and markedly enhancing policy performance, improving LLaMa3.1-8B from an average of 47.36% to 56.33% and Qwen2.5-32B from 64.49% to 70.47% on 20 benchmarks. Moreover, scaling experiments reveal a clear power-law relationship between computation and performance, supported by linear correlation coefficients approaching 0.99. The impressive performance, strong generalization, and scaling properties suggest that POLAR is a promising direction for developing general and strong reward models.
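To make the core idea concrete, below is a minimal sketch (not the authors' released code) of policy-discriminative pre-training as described above: the RM scores a candidate trajectory against a reference trajectory and is trained so that candidates sampled from the same policy as the reference outscore candidates sampled from a different policy. The tiny Transformer encoder, pooling, and Bradley-Terry-style pairwise loss are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of policy-discriminative pre-training (assumed formulation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class PolicyDiscriminatorRM(nn.Module):
    """Scores a (reference, candidate) trajectory pair with a single scalar."""

    def __init__(self, vocab_size: int = 32000, d_model: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)

    def forward(self, ref_ids: torch.Tensor, cand_ids: torch.Tensor) -> torch.Tensor:
        # Encode the concatenated reference + candidate trajectories,
        # mean-pool, and project to one reward scalar per pair.
        x = self.embed(torch.cat([ref_ids, cand_ids], dim=1))
        h = self.encoder(x).mean(dim=1)
        return self.head(h).squeeze(-1)


def polar_style_loss(rm, ref, same_policy_cand, diff_policy_cand):
    """Bradley-Terry-style objective: the candidate drawn from the same policy
    as the reference should receive a higher score than the candidate drawn
    from a different policy."""
    r_pos = rm(ref, same_policy_cand)
    r_neg = rm(ref, diff_policy_cand)
    return -F.logsigmoid(r_pos - r_neg).mean()


if __name__ == "__main__":
    rm = PolicyDiscriminatorRM()
    ref = torch.randint(0, 32000, (4, 64))   # trajectories from the reference policy
    pos = torch.randint(0, 32000, (4, 64))   # trajectories from the same policy
    neg = torch.randint(0, 32000, (4, 64))   # trajectories from a different policy
    loss = polar_style_loss(rm, ref, pos, neg)
    loss.backward()
    print(f"loss = {loss.item():.4f}")
```

Because the objective only asks the RM to rank "same policy" above "different policy" relative to a reference, positive and negative pairs can be constructed automatically from sampled model outputs, which is what makes the pre-training signal scalable without human preference labels.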