Beyond the Trade-off: Self-Supervised Reinforcement Learning for Reasoning Models' Instruction Following

Reasoning models excel at complex problem solving but exhibit a concerning trade-off between reasoning capabilities and instruction-following abilities. Existing approaches for improving instruction following rely on stronger external models, creating methodological bottlenecks and practical limitations, including increased costs and accessibility constraints. We propose a self-supervised RL framework that leverages reasoning models' own internal signals to improve instruction following without external supervision. Extensive experiments demonstrate that our framework significantly improves instruction-following capabilities while maintaining reasoning performance, offering a scalable and cost-effective approach to enhancing instruction following in reasoning models. The data and code are publicly available at https://github.com/Rainier-rq/verl-if.