Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search
Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, Hengshuang Zhao

Abstract
Recent advances in large multimodal models have leveraged image-based tools with reinforcement learning to tackle visual problems. However, existing open-source approaches often exhibit monotonous reasoning patterns and allow only a limited number of interaction turns, making them inadequate for difficult tasks that require trial-and-error exploration. In this work, we address this limitation by scaling up tool-based interactions and introduce Mini-o3, a system that executes deep, multi-turn reasoning spanning tens of steps and achieves state-of-the-art performance on challenging visual search tasks. Our recipe for reproducing OpenAI o3-style behaviors comprises three key components. First, we construct the Visual Probe Dataset, a collection of thousands of challenging visual search problems designed for exploratory reasoning. Second, we develop an iterative data collection pipeline to obtain cold-start trajectories that exhibit diverse reasoning patterns, including depth-first search, trial-and-error, and goal maintenance. Third, we propose an over-turn masking strategy that prevents penalization of over-turn responses (those that hit the maximum number of turns) during reinforcement learning, thereby balancing training-time efficiency with test-time scalability. Despite training with an upper bound of only six interaction turns, our model generates trajectories that naturally scale to tens of turns at inference time, with accuracy improving as the number of turns increases. Extensive experiments demonstrate that Mini-o3 produces rich reasoning patterns and deep thinking paths, effectively solving challenging visual search problems.
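To make the over-turn masking idea concrete, the following is a minimal sketch and not the authors' implementation. The abstract states only that over-turn responses are not penalized during reinforcement learning. The group-normalized advantage, the `Rollout` type, the `hit_turn_limit` flag, and the `masked_advantages` function are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    """One sampled response to a question (hypothetical structure)."""
    reward: float          # e.g. 1.0 for a correct final answer, 0.0 otherwise
    hit_turn_limit: bool   # True if the response used up the turn budget


def masked_advantages(rollouts, eps=1e-6):
    """Return one advantage per rollout, with over-turn rollouts masked out.

    Assumption: advantages are normalized within the group of rollouts
    sampled for the same question. Over-turn rollouts are excluded from
    the statistics and receive advantage 0.0, so they contribute no
    gradient signal in either direction.
    """
    kept = [r.reward for r in rollouts if not r.hit_turn_limit]
    if not kept:
        # Every rollout ran out of turns: nothing to learn from this group.
        return [0.0] * len(rollouts)
    mu = mean(kept)
    sigma = pstdev(kept)
    return [
        0.0 if r.hit_turn_limit else (r.reward - mu) / (sigma + eps)
        for r in rollouts
    ]


group = [
    Rollout(reward=1.0, hit_turn_limit=False),  # answered correctly
    Rollout(reward=0.0, hit_turn_limit=False),  # answered incorrectly
    Rollout(reward=0.0, hit_turn_limit=True),   # ran out of turns
]
advantages = masked_advantages(group)
```

In this sketch the third rollout gets an advantage of exactly zero rather than the negative value an unanswered response would otherwise receive. Under these assumptions, that is one way long trajectories could avoid being discouraged even when the training-time turn budget is small.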