VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models

Visual reasoning is a core component of human intelligence and a critical capability for advanced multimodal models. Yet current reasoning evaluations of multimodal large language models (MLLMs) often rely on text descriptions and allow language-based reasoning shortcuts, failing to measure genuine vision-centric reasoning. To address this, we introduce VisuLogic: a benchmark of 1,000 human-verified problems across six categories (e.g., quantitative shifts, spatial relations, attribute comparisons). These diverse question types assess the visual reasoning capabilities of MLLMs from multiple perspectives. We evaluate leading MLLMs on this benchmark and analyze their results to identify common failure modes. Most models score below 30% accuracy, only slightly above the 25% random baseline and far below the 51.4% achieved by humans, revealing significant gaps in visual reasoning. Furthermore, we provide a supplementary training dataset and a reinforcement-learning baseline to support further progress.