DeepPHY: Benchmarking Agentic VLMs on Physical Reasoning

Although Vision Language Models (VLMs) exhibit strong perceptual abilities and impressive visual reasoning, they struggle with attention to detail and precise action planning in complex, dynamic environments, leading to subpar performance. Real-world tasks typically require complex interactions, advanced spatial reasoning, long-term planning, and continuous strategy refinement, and usually demand an understanding of the physical rules governing the target scenario. However, evaluating these capabilities in real-world scenarios is often prohibitively expensive. To bridge this gap, we introduce DeepPHY, a novel benchmark framework designed to systematically evaluate VLMs' understanding of and reasoning about fundamental physical principles through a series of challenging simulated environments. DeepPHY integrates multiple physical reasoning environments of varying difficulty levels and incorporates fine-grained evaluation metrics. Our evaluation finds that even state-of-the-art VLMs struggle to translate descriptive physical knowledge into precise, predictive control.
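
For concreteness, the sketch below illustrates what an agentic evaluation loop over a simulated physical-reasoning environment could look like. It is an assumption-laden illustration, not DeepPHY's actual API: the Gymnasium-style `reset`/`step` interface and the `query_vlm` helper are hypothetical stand-ins.

```python
def query_vlm(observation, instruction):
    """Hypothetical VLM call: returns an action given an image observation
    and a task instruction. Plug in any model or API backend here."""
    raise NotImplementedError

def evaluate_episode(env, instruction, max_steps=50):
    """Roll out one episode and record per-step outcomes, so that
    fine-grained metrics (success, step count, action trace) can be computed."""
    observation, info = env.reset()
    trace = []
    for t in range(max_steps):
        action = query_vlm(observation, instruction)          # model proposes an action
        observation, reward, terminated, truncated, info = env.step(action)
        trace.append({"step": t, "action": action, "reward": reward})
        if terminated or truncated:
            break
    return {
        "success": bool(info.get("success", False)),
        "steps": len(trace),
        "trace": trace,
    }
```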