PhyX: Does Your Model Have the "Wits" for Physical Reasoning?

Shen, Hui; Wu, Taiqiang; Han, Qi; Hsieh, Yunta; Wang, Jizhou; Zhang, Yuyue; Cheng, Yuxin; Hao, Zijian; Ni, Yuansheng; Wang, Xin; Wan, Zhongwei; Zhang, Kai; Xu, Wendong; Xiong, Jing; Luo, Ping; Chen, Wenhu; Tao, Chaofan; Mao, Zhuoqing; Wong, Ngai
Publication date: 5/26/2025
Abstract

Existing benchmarks fail to capture a crucial aspect of intelligence: physical reasoning, the integrated ability to combine domain knowledge, symbolic reasoning, and understanding of real-world constraints. To address this gap, we introduce PhyX: the first large-scale benchmark designed to assess models' capacity for physics-grounded reasoning in visual scenarios. PhyX includes 3K meticulously curated multimodal questions spanning 6 reasoning types across 25 sub-domains and 6 core physics domains: thermodynamics, electromagnetism, mechanics, modern physics, optics, and wave & acoustics. In our comprehensive evaluation, even state-of-the-art models struggle significantly with physical reasoning. GPT-4o, Claude3.7-Sonnet, and GPT-o4-mini achieve only 32.5%, 42.2%, and 45.8% accuracy, respectively, leaving performance gaps exceeding 29% compared to human experts. Our analysis exposes critical limitations in current models: over-reliance on memorized disciplinary knowledge, excessive dependence on mathematical formulations, and surface-level visual pattern matching rather than genuine physical understanding. We provide in-depth analysis through fine-grained statistics, detailed case studies, and multiple evaluation paradigms to thoroughly examine physical reasoning capabilities. To ensure reproducibility, we implement a compatible evaluation protocol based on widely used toolkits such as VLMEvalKit, enabling one-click evaluation.
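
As a rough illustration of what such a one-click evaluation might look like, the sketch below simply shells out to VLMEvalKit's `run.py` entry point from Python. The dataset key (`PhyX_mini`), model key (`GPT4o`), and local toolkit path are assumptions for illustration, not identifiers confirmed by the paper; the actual keys registered in VLMEvalKit may differ.

```python
# Minimal sketch of launching a VLMEvalKit-style evaluation from Python.
# Assumptions: VLMEvalKit is cloned locally and its run.py accepts
# --data and --model flags; the dataset key "PhyX_mini" and model key
# "GPT4o" are placeholders and may not match the registered names.
import subprocess
from pathlib import Path

VLMEVALKIT_DIR = Path("VLMEvalKit")  # local clone of the toolkit (assumed path)


def evaluate(model: str = "GPT4o", dataset: str = "PhyX_mini") -> None:
    """Launch a single benchmark run and let the toolkit handle scoring."""
    cmd = [
        "python", "run.py",
        "--data", dataset,
        "--model", model,
        "--verbose",
    ]
    subprocess.run(cmd, cwd=VLMEVALKIT_DIR, check=True)


if __name__ == "__main__":
    evaluate()
```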