Scaling Test-time Compute for LLM Agents

Scaling test-time compute has shown remarkable success in improving the reasoning abilities of large language models (LLMs). In this work, we conduct the first systematic exploration of applying test-time scaling methods to language agents and investigate the extent to which doing so improves their effectiveness. Specifically, we explore different test-time scaling strategies, including: (1) parallel sampling algorithms; (2) sequential revision strategies; (3) verifiers and merging methods; (4) strategies for diversifying rollouts. We carefully analyze and ablate the impact of different design strategies on applying test-time scaling to language agents, and arrive at the following findings: (1) scaling test-time compute can improve agent performance; (2) knowing when to reflect is important for agents; (3) among the verification and result-merging approaches we consider, the list-wise method performs best; (4) increasing the diversity of rollouts has a positive effect on the agent's task performance.
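
As a concrete illustration of the parallel-sampling-plus-verifier pattern the abstract describes, the sketch below generates several diversified rollouts in parallel and merges them with a list-wise selector. This is a minimal sketch under stated assumptions, not the paper's implementation: `generate_rollout` and `listwise_rank` are hypothetical placeholders standing in for real LLM calls and the actual verifier, which the abstract does not specify.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def generate_rollout(task: str, temperature: float) -> str:
    """Hypothetical placeholder for one agent rollout; a real system would
    call an LLM and interact with the environment until the task ends."""
    return f"trajectory(task={task!r}, temp={temperature:.2f}, seed={random.random():.3f})"


def listwise_rank(task: str, candidates: list[str]) -> str:
    """Hypothetical list-wise verifier: a real verifier would present all
    candidate trajectories together and ask a model to pick the best one."""
    return max(candidates, key=len)  # stand-in scoring rule for illustration


def best_of_n(task: str, n: int = 8) -> str:
    # Diversify rollouts by varying the sampling temperature per worker.
    temps = [0.4 + 0.1 * i for i in range(n)]
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda t: generate_rollout(task, t), temps))
    # Merge the parallel candidates with the list-wise selector.
    return listwise_rank(task, candidates)


if __name__ == "__main__":
    print(best_of_n("book a flight from SFO to JFK"))
```

In this pattern, test-time compute scales with `n`: more parallel rollouts mean more candidates for the list-wise verifier to choose among, at proportionally higher inference cost.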