Xiangxin Zhou, Zichen Liu, Haonan Wang, Chao Du, Min Lin, Chongxuan Li, Liang Wang, Tianyu Pang

Abstract
We introduce a variational reasoning framework for language models that treats thinking traces as latent variables and optimizes them through variational inference. Starting from the evidence lower bound (ELBO), we extend it to a multi-trace objective for tighter bounds and propose a forward-KL formulation that stabilizes the training of the variational posterior. We further show that rejection sampling finetuning and binary-reward RL, including GRPO, can be interpreted as local forward-KL objectives, where an implicit weighting by model accuracy naturally arises from the derivation and reveals a previously unnoticed bias toward easier questions. We empirically validate our method on the Qwen 2.5 and Qwen 3 model families across a wide range of reasoning tasks. Overall, our work provides a principled probabilistic perspective that unifies variational inference with RL-style methods and yields stable objectives for improving the reasoning ability of language models. Our code is available at https://github.com/sail-sg/variational-reasoning.
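As a reference point for the ELBO mentioned in the abstract, the following is a minimal sketch in standard variational-inference notation, assuming x denotes the question, y the final answer, z the latent thinking trace, p_theta the language model, and q_phi the variational posterior; the paper's multi-trace and forward-KL objectives build on a bound of this form, but its exact parameterization is not reproduced here.

\[
\log p_\theta(y \mid x)
\;\ge\;
\mathbb{E}_{q_\phi(z \mid x, y)}\!\left[\log p_\theta(y \mid x, z)\right]
\;-\;
\mathrm{KL}\!\left(q_\phi(z \mid x, y)\,\|\,p_\theta(z \mid x)\right)
\]

Maximizing the right-hand side jointly over the model p_theta and the posterior q_phi tightens the bound on the marginal likelihood of the answer while keeping sampled thinking traces close to the model's own prior over traces.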