Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation

Yujun Zhou, Zhenwen Liang, Haolin Liu, Wenhao Yu, Kishan Panaganti, Linfeng Song, Dian Yu, Xiangliang Zhang, Haitao Mi, Dong Yu


Abstract

Large language models (LLMs) are increasingly trained with reinforcement learning from verifiable rewards (RLVR), yet real-world deployment demands models that can self-improve without labels or external judges. Existing label-free methods (confidence minimization, self-consistency, or majority-vote objectives) stabilize learning but steadily shrink exploration, causing an entropy collapse: generations become shorter, less diverse, and brittle. Unlike prior approaches such as Test-Time Reinforcement Learning (TTRL), which primarily adapt models to the immediate unlabeled dataset at hand, our goal is broader: to enable general improvements without sacrificing the model's inherent exploration capacity and generalization ability, i.e., evolving. We formalize this issue and propose EVolution-Oriented and Label-free Reinforcement Learning (EVOL-RL), a simple rule that couples stability with variation under a label-free setting. EVOL-RL keeps the majority-voted answer as a stable anchor (selection) while adding a novelty-aware reward that favors responses whose reasoning differs from what has already been produced (variation), measured in semantic space. Implemented with GRPO, EVOL-RL also uses asymmetric clipping to preserve strong signals and an entropy regularizer to sustain search. This majority-for-selection + novelty-for-variation design prevents collapse, maintains longer and more informative chains of thought, and improves both pass@1 and pass@n. EVOL-RL consistently outperforms the majority-only TTRL baseline; e.g., training on label-free AIME24 lifts Qwen3-4B-Base AIME25 pass@1 from TTRL's 4.6% to 16.4%, and pass@16 from 18.5% to 37.9%. EVOL-RL not only prevents diversity collapse but also unlocks stronger generalization across domains (e.g., GPQA). Furthermore, we demonstrate that EVOL-RL also boosts performance in the RLVR setting, highlighting its broad applicability.
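To make the selection + variation rule concrete, the following Python sketch scores one group of sampled responses: the majority-voted answer supplies the stability anchor, and the semantic novelty of each response's reasoning supplies the variation bonus. The embedding model, the exact-match answer voting, and the additive weighting are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of a majority-for-selection + novelty-for-variation
# reward. A generic sentence-embedding model stands in for whatever
# semantic encoder the paper uses; names, weights, and the additive
# shaping below are illustrative assumptions.
from collections import Counter

import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def evol_rl_rewards(answers, reasonings, novelty_weight=0.5):
    """Score one group of sampled responses for a single prompt.

    answers    -- final answers extracted from each response
    reasonings -- reasoning text of each response (group size > 1)
    """
    # Selection: the majority-voted answer is the label-free anchor.
    majority, _ = Counter(answers).most_common(1)[0]
    selection = np.array([1.0 if a == majority else 0.0 for a in answers])

    # Variation: reward reasoning that is semantically far from the
    # rest of the group (mean cosine dissimilarity to other samples).
    emb = encoder.encode(reasonings, normalize_embeddings=True)
    sims = emb @ emb.T                              # pairwise cosine sims
    n = len(reasonings)
    mean_sim = (sims.sum(axis=1) - 1.0) / (n - 1)   # drop self-similarity
    novelty = 1.0 - mean_sim

    return selection + novelty_weight * novelty
```

In a GRPO-style pipeline, these per-response rewards would then be normalized within the group to form advantages. The abstract also mentions asymmetric clipping and an entropy regularizer on top of GRPO; below is a hedged sketch of what an asymmetrically clipped surrogate loss with an entropy bonus could look like. The specific clip bounds, the direction of the asymmetry (modeled here on clip-higher-style variants), and the entropy coefficient are assumptions, not values from the paper.

```python
import torch

def grpo_loss_asymmetric(logp_new, logp_old, advantages,
                         clip_low=0.2, clip_high=0.28,
                         entropy=None, entropy_coef=1e-3):
    # Per-token importance ratio between the new and old policies.
    ratio = torch.exp(logp_new - logp_old)
    # Asymmetric clipping: a looser upper bound than lower bound, so
    # tokens with large positive advantages are clipped away less
    # aggressively (one reading of "preserve strong signals").
    clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high)
    surrogate = torch.minimum(ratio * advantages, clipped * advantages)
    loss = -surrogate.mean()
    if entropy is not None:
        # Entropy bonus to sustain search/exploration.
        loss = loss - entropy_coef * entropy.mean()
    return loss
```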
