Inference-time Scaling
Inference-time scaling is a method for improving the performance of large language models (LLMs) by increasing the compute spent during the inference phase. OpenAI's o1 series of models brought the concept to prominence, achieving significant performance gains on tasks such as mathematics, programming, and scientific reasoning by lengthening the Chain-of-Thought reasoning process.
Inference-time scaling aims to improve model performance by allocating additional compute during reasoning (more computation steps, more elaborate reasoning strategies, and so on) to generate and evaluate multiple candidate results and select the best solution. It breaks through the traditional limitation of improving model capability solely by increasing training resources, allowing the model to deliberate strategically and solve problems systematically when facing complex tasks.
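The "generate multiple results and select the best" strategy described above can be sketched as best-of-N sampling. The sketch below is illustrative, not any specific system's implementation: `sampler` and `scorer` are hypothetical stand-ins for a stochastic model call and a verifier or reward model, and spending more inference compute simply means raising `n`.

```python
import random

def best_of_n(prompt, n, sampler, scorer):
    """Spend extra inference compute: draw n candidate answers,
    score each with a verifier, and keep the highest-scoring one."""
    candidates = [sampler(prompt) for _ in range(n)]
    return max(candidates, key=scorer)

# Toy stand-ins: a "model" that guesses integers, and a verifier
# that scores a candidate by its closeness to the true answer 42.
random.seed(0)
sampler = lambda prompt: random.randint(0, 100)
scorer = lambda answer: -abs(answer - 42)

print(best_of_n("What is 6 * 7?", n=16, sampler=sampler, scorer=scorer))
```

With a reliable scorer, the probability that at least one of the `n` samples is correct (and is then selected) grows with `n`, which is why accuracy improves as more inference compute is allocated.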