
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models

Weiye Xu, Jiahao Wang, Weiyun Wang, Zhe Chen, Wengang Zhou, Aijun Yang, Lewei Lu, Houqiang Li, Xiaohua Wang, Xizhou Zhu, Wenhai Wang, Jifeng Dai, Jinguo Zhu
Release Date: 4/25/2025
Abstract

Visual reasoning is a core component of human intelligence and a critical capability for advanced multimodal models. Yet current reasoning evaluations of multimodal large language models (MLLMs) often rely on text descriptions and allow language-based reasoning shortcuts, failing to measure genuine vision-centric reasoning. To address this, we introduce VisuLogic: a benchmark of 1,000 human-verified problems spanning six categories (e.g., quantitative shifts, spatial relations, attribute comparisons). These question types allow the visual reasoning capabilities of MLLMs to be assessed from multiple perspectives. We evaluate leading MLLMs on this benchmark and analyze their results to identify common failure modes. Most models score below 30% accuracy, only slightly above the 25% random baseline and far below the 51.4% achieved by humans, revealing significant gaps in visual reasoning. Furthermore, we provide a supplementary training dataset and a reinforcement-learning baseline to support further progress.
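
For context, scoring on a four-option multiple-choice benchmark of this kind reduces to comparing each model's predicted option letter with the answer key and reporting the fraction correct against the 25% random baseline. The sketch below is a hypothetical illustration only; the file name `predictions.json` and the `prediction`/`answer` fields are assumptions, not the official VisuLogic tooling.

```python
# Minimal sketch (not the official VisuLogic evaluation script):
# score a model's multiple-choice answers against ground truth.
import json


def accuracy(records):
    """Fraction of records where the predicted option letter matches the key."""
    correct = sum(
        1
        for r in records
        if r["prediction"].strip().upper() == r["answer"].strip().upper()
    )
    return correct / len(records)


if __name__ == "__main__":
    # Each record is assumed to pair a ground-truth option ("A"-"D")
    # with the model's predicted option letter.
    with open("predictions.json") as f:
        records = json.load(f)

    print(f"Accuracy: {accuracy(records):.1%}")
    print("Random baseline (4 options): 25.0%")
```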