OThink-R1: AI Learns When to Think Deeply—and When Not To
Researchers from Zhejiang University, led by master's student Shengjia Zhang, have introduced OThink-R1, a novel approach that enables large language models to decide autonomously whether a problem requires deep thinking. The work addresses a critical inefficiency in current deep reasoning models: their tendency to perform extensive, computationally expensive reasoning even on trivial tasks, such as answering "1+1=?", where a quick, intuitive response would suffice.

Drawing inspiration from human cognition, which distinguishes between fast, intuitive thinking (System 1) and slow, deliberate reasoning (System 2), the researchers set out to equip large language models with the ability to switch between these two modes dynamically. They observed that deep reasoning models such as DeepSeek-R1 and OpenAI o1 significantly outperform standard models on complex tasks by simulating human-like, step-by-step analysis, yet they routinely overthink straightforward questions. This both wastes computational resources and increases response latency.

To tackle the problem, the team analyzed reasoning traces from non-reasoning models (which mimic fast thinking) and from deep reasoning models on simple tasks such as elementary math and common-sense questions. By comparing these traces, they identified patterns that separate necessary reasoning from redundant or superfluous steps. They then manually curated a dataset of hybrid reasoning chains, retaining only the essential parts of deep reasoning while removing unnecessary elaborations, yielding a training set that teaches models when deep thinking can be skipped.

The core innovation is supervised fine-tuning on this hybrid dataset. Through it, the model learns to recognize when a problem can be solved intuitively and when deeper analysis is required. As a result, OThink-R1 produces fast, direct answers for simple queries while reserving deep reasoning for more complex problems, making more efficient use of computational resources.

The research was conducted as a joint project between OPPO and Zhejiang University, inspired by recent advances in deep reasoning models such as DeepSeek-R1. Early experiments revealed a major challenge: the lack of consistent output formatting in models like DeepSeek-R1 made it difficult to validate results reliably. Initial attempts using reinforcement learning (GRPO) and Direct Preference Optimization (DPO) failed because of poor instruction-following behavior and unstable training dynamics, especially when the goal was to shorten reasoning. The team eventually settled on a supervised fine-tuning approach, which proved more stable and effective, although early versions still struggled to generalize across different tasks.

The breakthrough came when they re-examined the assumption that all deep reasoning on simple problems is redundant. By using large language models themselves as a judge (LLM-Judge) to classify individual reasoning steps as either effective or redundant, they refined their dataset and achieved significantly better performance; a minimal sketch of this pruning step appears below.

OThink-R1 currently relies on an external model (the LLM-Judge) to assess reasoning redundancy. In future work, the team aims to develop a fully end-to-end solution that allows the model to determine its own thinking mode without external supervision.

The study, titled OThink-R1: Intrinsic Fast/Slow Thinking Mode Switching for Over-Reasoning Mitigation, is available on arXiv (https://arxiv.org/abs/2506.02397).
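To make the pruning and hybrid-dataset construction concrete, here is a minimal sketch in Python. It assumes an OpenAI-compatible chat client, a hypothetical judge prompt, a gpt-4o-mini judge model, and DeepSeek-R1-style <think> tags as the target format; none of these specifics come from the paper, which should be consulted for the actual procedure.

```python
# Minimal sketch: prune a verbose reasoning trace with an LLM judge and build
# hybrid SFT records. Prompts, model names, and the data format are assumptions
# for illustration, not the authors' released pipeline.
from dataclasses import dataclass
from openai import OpenAI  # any OpenAI-compatible chat API works here

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "You are grading one step of a chain-of-thought for the question below.\n"
    "Reply with exactly one word: EFFECTIVE if the step is needed to reach the\n"
    "answer, or REDUNDANT if it repeats, restates, or over-elaborates.\n\n"
    "Question: {question}\nStep: {step}"
)

def judge_step(question: str, step: str, model: str = "gpt-4o-mini") -> bool:
    """Return True if the judge model labels this reasoning step as effective."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, step=step)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper().startswith("EFFECTIVE")

@dataclass
class HybridExample:
    prompt: str
    target: str  # either a direct answer or a pruned reasoning chain plus answer

def build_hybrid_example(question: str, steps: list[str], answer: str) -> HybridExample:
    """Keep only the steps the judge marks effective; if none survive, emit a
    fast, direct answer so the fine-tuned model learns to skip deep thinking."""
    kept = [s for s in steps if judge_step(question, s)]
    if not kept:  # the whole chain was redundant -> System-1-style target
        return HybridExample(prompt=question, target=answer)
    pruned_chain = "\n".join(kept)
    return HybridExample(prompt=question,
                         target=f"<think>\n{pruned_chain}\n</think>\n{answer}")

if __name__ == "__main__":
    example = build_hybrid_example(
        question="1+1=?",
        steps=["We need to add 1 and 1.", "Double-checking: 1 plus 1 equals 2."],
        answer="2",
    )
    print(example.target)
```

Records whose every step is judged redundant collapse to a bare answer, while the rest keep a shortened chain of thought; the resulting prompt/target pairs can then be fed to an ordinary supervised fine-tuning run, which is what teaches the model to answer simple queries in fast-thinking mode and reserve deliberate reasoning for harder ones.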
The authors include Shengjia Zhang, Junjie Wu, Jiawei Chen, Changwang Zhang, Xingyu Lou, Wangchunshu Zhou, Sheng Zhou, Can Wang, and Jun Wang.
