
L2T: An Information-Theoretic Reinforcement Learning Framework Enhances LLM Reasoning Efficiency

Recently, a research team from the Institute of Software, Chinese Academy of Sciences, addressed the challenge of optimizing large language models (LLMs) for complex reasoning tasks by proposing Learning to Think (L2T), a novel reinforcement fine-tuning framework grounded in information theory. As LLMs continue to advance, their applications have expanded beyond basic natural language processing to intricate, multi-step logical reasoning. However, existing approaches often rely solely on the final outcome of reasoning as the reward signal, providing no timely feedback on intermediate reasoning steps. This limitation leads to redundant computation, inefficient resource use, and sometimes degraded reasoning performance.

To overcome these issues, L2T restructures the reasoning process as a multi-turn hierarchical dialogue and introduces a dense process reward mechanism, grounded in information theory, that quantifies the information gain contributed by each reasoning step. Using an enhanced GRPO (Group Relative Policy Optimization) algorithm, the framework encourages meaningful, logically sound reasoning steps while suppressing unnecessary or redundant content. This enables fine-grained control over the reasoning path, improving both reasoning quality and computational efficiency.

Extensive evaluations on established reasoning benchmarks, including AIME, AMC, and HumanEval, demonstrated consistent performance gains across model sizes such as DeepScaleR-1.5B-Preview and DeepSeek-R1-Distill-Qwen-1.5B. Compared to outcome-based reward methods, L2T achieved over a 3.2% improvement in accuracy while doubling token efficiency. Benchmarked against other process-based reward approaches, L2T still delivered roughly a 2% accuracy gain and a 1.2-fold improvement in efficiency.
Moreover, in multi-task evaluations across varying difficulty levels, L2T achieved an average accuracy improvement of nearly 3% and maintained stable performance advantages under different token budget constraints. The findings highlight L2T as a promising new direction for enhancing the reasoning capabilities of LLMs in practical applications. The research has been accepted for presentation at NeurIPS 2025, one of the top-tier conferences in artificial intelligence. The full paper is available via the provided link.
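To make the two ingredients described above more concrete, here is a minimal illustrative sketch, not the authors' implementation: an information-gain reward that scores a reasoning step by how much it reduces the model's surprisal about the correct answer, and a GRPO-style group-relative advantage that normalizes rewards across a group of sampled responses. The function names and the specific surprisal-based formulation are assumptions for illustration only.

```python
import math

def info_gain_reward(p_correct_before: float, p_correct_after: float) -> float:
    """Hypothetical dense process reward: information gain of one reasoning step,
    measured as the reduction in surprisal (-log p) about the correct answer."""
    return math.log(p_correct_after) - math.log(p_correct_before)

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: standardize each reward against the mean and
    standard deviation of its group of sampled responses."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) if var > 0 else 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# A step that doubles the probability of the correct answer earns ln(2) reward.
step_reward = info_gain_reward(0.25, 0.5)

# Advantages within a group always center on zero, so only relative
# step quality matters, not the absolute reward scale.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Because the advantages are mean-centered within each group, a step is reinforced only if it contributes more information than its sampled peers, which is one way to suppress redundant reasoning content.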

Related Links