Language Models Can Learn from Verbal Feedback Without Scalar Rewards
Renjie Luo Zichen Liu Xiangyan Liu Chao Du Min Lin Wenhu Chen Wei Lu Tianyu Pang

Abstract
LLMs are often trained with RL from human or AI feedback, yet such methods typically compress nuanced feedback into scalar rewards, discarding much of their richness and inducing scale imbalance. We propose treating verbal feedback as a conditioning signal. Inspired by language priors in text-to-image generation, which enable novel outputs from unseen prompts, we introduce the feedback-conditional policy (FCP). FCP learns directly from response-feedback pairs, approximating the feedback-conditional posterior through maximum likelihood training on offline data. We further develop an online bootstrapping stage where the policy generates under positive conditions and receives fresh feedback to refine itself. This reframes feedback-driven learning as conditional generation rather than reward optimization, offering a more expressive way for LLMs to directly learn from verbal feedback. Our code is available at https://github.com/sail-sg/feedback-conditional-policy.
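
As a rough illustration of the two stages described in the abstract, the sketch below casts FCP training as ordinary conditional language modeling: the verbal feedback is prepended to the prompt and the response is fit by maximum likelihood, and an online step regenerates under a fixed positive condition and trains on freshly collected feedback. GPT-2 as the policy, the `Feedback:/Prompt:/Response:` template, the `POSITIVE_CONDITION` string, and the `get_feedback` callback are all placeholder assumptions for illustration, not the authors' actual setup (see the linked repository for that).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small stand-in policy; the paper's actual base model is not specified in the abstract.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Hypothetical positive condition used for the online bootstrapping stage.
POSITIVE_CONDITION = "Excellent response: correct, clear, and complete."


def fcp_loss(prompt: str, feedback: str, response: str) -> torch.Tensor:
    """Maximum-likelihood loss on the response, conditioned on (feedback, prompt)."""
    # Treat verbal feedback as a conditioning signal by prepending it to the prompt.
    context = f"Feedback: {feedback}\nPrompt: {prompt}\nResponse:"
    context_ids = tokenizer(context, return_tensors="pt").input_ids
    response_ids = tokenizer(" " + response, return_tensors="pt").input_ids
    input_ids = torch.cat([context_ids, response_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : context_ids.shape[1]] = -100  # only response tokens contribute to the loss
    return model(input_ids, labels=labels).loss


def offline_step(example: dict) -> None:
    """Offline stage: fit the feedback-conditional posterior on a logged triple."""
    loss = fcp_loss(example["prompt"], example["feedback"], example["response"])
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()


def online_bootstrap_step(prompt: str, get_feedback) -> None:
    """Online stage: generate under a positive condition, then refine on fresh feedback."""
    context = f"Feedback: {POSITIVE_CONDITION}\nPrompt: {prompt}\nResponse:"
    ids = tokenizer(context, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=64, do_sample=True)
    response = tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
    # `get_feedback` is a placeholder for a human or AI annotator returning verbal feedback.
    fresh_feedback = get_feedback(prompt, response)
    offline_step({"prompt": prompt, "feedback": fresh_feedback, "response": response})
```

Note that, in line with the abstract's framing, no scalar reward appears anywhere in this sketch: feedback enters only as text that the policy conditions on.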