Language Models Can Learn from Verbal Feedback Without Scalar Rewards
Renjie Luo Zichen Liu Xiangyan Liu Chao Du Min Lin Wenhu Chen Wei Lu Tianyu Pang

Abstract
LLMs are often trained with RL from human or AI feedback, yet such methods typically compress nuanced feedback into scalar rewards, discarding much of their richness and inducing scale imbalance. We propose treating verbal feedback as a conditioning signal. Inspired by language priors in text-to-image generation, which enable novel outputs from unseen prompts, we introduce the feedback-conditional policy (FCP). FCP learns directly from response-feedback pairs, approximating the feedback-conditional posterior through maximum likelihood training on offline data. We further develop an online bootstrapping stage where the policy generates under positive conditions and receives fresh feedback to refine itself. This reframes feedback-driven learning as conditional generation rather than reward optimization, offering a more expressive way for LLMs to directly learn from verbal feedback. Our code is available at https://github.com/sail-sg/feedback-conditional-policy.
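
As a rough illustration of the two stages described in the abstract, the sketch below casts FCP training as ordinary conditional language modeling: the verbal feedback is prepended to the prompt and the response is fit by maximum likelihood, and an online step regenerates under a fixed positive condition and trains on freshly collected feedback. GPT-2 as the policy, the `Feedback:/Prompt:/Response:` template, the `POSITIVE_CONDITION` string, and the `get_feedback` callback are all placeholder assumptions for illustration, not the authors' actual setup (see the linked repository for that).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small stand-in policy; the paper's actual base model is not specified in the abstract.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Hypothetical positive condition used for the online bootstrapping stage.
POSITIVE_CONDITION = "Excellent response: correct, clear, and complete."


def fcp_loss(prompt: str, feedback: str, response: str) -> torch.Tensor:
    """Maximum-likelihood loss on the response, conditioned on (feedback, prompt)."""
    # Treat verbal feedback as a conditioning signal by prepending it to the prompt.
    context = f"Feedback: {feedback}\nPrompt: {prompt}\nResponse:"
    context_ids = tokenizer(context, return_tensors="pt").input_ids
    response_ids = tokenizer(" " + response, return_tensors="pt").input_ids
    input_ids = torch.cat([context_ids, response_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : context_ids.shape[1]] = -100  # only response tokens contribute to the loss
    return model(input_ids, labels=labels).loss


def offline_step(example: dict) -> None:
    """Offline stage: fit the feedback-conditional posterior on a logged triple."""
    loss = fcp_loss(example["prompt"], example["feedback"], example["response"])
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()


def online_bootstrap_step(prompt: str, get_feedback) -> None:
    """Online stage: generate under a positive condition, then refine on fresh feedback."""
    context = f"Feedback: {POSITIVE_CONDITION}\nPrompt: {prompt}\nResponse:"
    ids = tokenizer(context, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=64, do_sample=True)
    response = tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
    # `get_feedback` is a placeholder for a human or AI annotator returning verbal feedback.
    fresh_feedback = get_feedback(prompt, response)
    offline_step({"prompt": prompt, "feedback": fresh_feedback, "response": response})
```

Note that, in line with the abstract's framing, no scalar reward appears anywhere in this sketch: feedback enters only as text that the policy conditions on.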