Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology

Models like OpenAI-o3 pioneer visual grounded reasoning by dynamically referencing visual regions, just as humans "think with images". However, no benchmark exists to evaluate these capabilities holistically. To bridge this gap, we propose TreeBench (Traceable Evidence Evaluation Benchmark), a diagnostic benchmark built on three principles: (1) focused visual perception of subtle targets in complex scenes, (2) traceable evidence via bounding box evaluation, and (3) second-order reasoning that tests object interactions and spatial hierarchies beyond simple object localization. Prioritizing images with dense objects, we initially sample 1K high-quality images from SA-1B and enlist eight LMM experts to manually annotate questions, candidate options, and answers for each image. After three stages of quality control, TreeBench consists of 405 challenging visual question-answering pairs. Even the most advanced models struggle with this benchmark: none reaches 60% accuracy, e.g., OpenAI-o3 scores only 54.87. Furthermore, we introduce TreeVGR (Traceable Evidence Enhanced Visual Grounded Reasoning), a training paradigm that jointly supervises localization and reasoning with reinforcement learning, enabling accurate localization and explainable reasoning pathways. Initialized from Qwen2.5-VL-7B, it improves V* Bench (+16.8), MME-RealWorld (+12.6), and TreeBench (+13.4), demonstrating that traceability is key to advancing vision-grounded reasoning. The code is available at https://github.com/Haochen-Wang409/TreeVGR.
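The abstract describes TreeVGR's joint supervision only at a high level. As a rough illustration, the sketch below shows one way a reinforcement-learning reward could couple answer correctness with bounding-box traceability. The function names, the IoU-based localization term, and the `alpha` weighting are assumptions for illustration only; the paper's actual reward design and RL algorithm may differ.

```python
# Illustrative sketch (not the authors' exact implementation): a reward that
# jointly scores reasoning and localization, assuming the policy emits a final
# answer plus bounding boxes for the regions it referenced while reasoning.

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def traceable_reward(pred_answer, gt_answer, pred_boxes, gt_boxes, alpha=0.5):
    """Combine answer correctness with evidence-localization quality.

    `alpha` is a hypothetical weighting between the two terms.
    """
    # Accuracy term: 1 if the chosen option matches the ground truth.
    r_acc = 1.0 if pred_answer == gt_answer else 0.0

    # Localization term: mean best-match IoU of each predicted evidence box
    # against the annotated ground-truth boxes.
    if pred_boxes and gt_boxes:
        r_loc = sum(max(iou(p, g) for g in gt_boxes) for p in pred_boxes) / len(pred_boxes)
    else:
        r_loc = 0.0

    return alpha * r_acc + (1.0 - alpha) * r_loc
```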