Pixels, Patterns, but No Poetry: To See The World like Humans

Achieving human-like perception and reasoning in Multimodal Large Language Models (MLLMs) remains a central challenge in artificial intelligence. While recent research has primarily focused on enhancing reasoning capabilities in MLLMs, a fundamental question persists: can Multimodal Large Language Models truly perceive the world as humans do? This paper shifts the focus from reasoning to perception. Rather than constructing benchmarks specifically for reasoning, we introduce the Turing Eye Test (TET), a challenging perception-oriented benchmark comprising four diagnostic tasks that evaluate MLLMs' performance on synthetic images that humans process intuitively. Our findings reveal that state-of-the-art MLLMs exhibit catastrophic failures on perceptual tasks that are trivial for humans. Both in-context learning and training of the language backbone, which are effective on previous benchmarks, fail to improve performance on our tasks, whereas fine-tuning the vision tower enables rapid adaptation. This suggests that our benchmark challenges the generalization of the vision tower rather than the knowledge and reasoning capabilities of the language backbone, exposing a key gap between current MLLMs and human perception. We release a representative subset of TET tasks in this version, and will introduce more diverse tasks and methods to enhance visual generalization in future work.