12 days ago

Beyond the Trade-off: Self-Supervised Reinforcement Learning for Reasoning Models' Instruction Following

Qingyu Ren, Qianyu He, Bowei Zhang, Jie Zeng, Jiaqing Liang, Yanghua Xiao, Weikang Zhou, Zeye Sun, Fei Yu
Abstract

Reasoning models excel in complex problem solving but exhibit a concerning trade-off between reasoning capabilities and instruction-following abilities. Existing approaches for improving instruction following rely on stronger external models, creating methodological bottlenecks and practical limitations, including increased costs and accessibility constraints. We propose a self-supervised RL framework that leverages reasoning models' own internal signals to improve instruction-following capabilities without external supervision. Extensive experiments demonstrate that our framework significantly improves instruction-following capabilities while maintaining reasoning performance, offering a scalable and cost-effective approach to enhancing instruction following in reasoning models. The data and code are publicly available at https://github.com/Rainier-rq/verl-if.