Study Confirms Multimodal AI Models Possess Human-like Object Concept Representation

Scientists from the Institute of Automation, Chinese Academy of Sciences (CAS) and the CAS Center for Excellence in Brain Science and Intelligence Technology have made a groundbreaking discovery: multimodal large language models (MLLMs) can naturally develop object concept representations similar to those of humans. The finding, published in Nature Machine Intelligence, opens new avenues in AI cognitive science and provides a theoretical framework for building AI systems with human-like cognitive structures.

From Machine Recognition to Machine Understanding

Traditionally, AI research has focused on the accuracy of object recognition, rarely asking whether models genuinely "understand" the objects they identify. Dr. Huiguang He, the lead researcher on the study, explained: "Current AI systems can distinguish between images of cats and dogs, but the difference between this 'recognition' and a human's 'understanding' of what a cat or dog really is remains unexplored."

To address this gap, the team designed an experimental approach combining computational modeling, behavioral experiments, and brain imaging. They used the classic "odd-one-out task" from cognitive psychology, in which three objects are presented and the participant picks the one that is most different. Both AI models and humans were tested on triplets drawn from 1,854 everyday concepts. By analyzing 4.7 million behavioral judgments, the researchers built a comprehensive "concept map" for the AI models.

Key Findings: AI's Mental Dimensions Parallel Human Cognition

The study revealed that MLLMs extract 66 "mental dimensions" from this vast dataset. These dimensions, which were assigned semantic labels, proved highly interpretable and aligned closely with neural activity patterns in the brain's category-selective regions, such as the fusiform face area (FFA) for faces, the parahippocampal place area (PPA) for scenes, and the extrastriate body area (EBA) for bodies.

The team also compared the decision-making patterns of various AI models with those of humans. Multimodal models such as Gemini_Pro_Vision and Qwen2_VL showed higher consistency with human behavior. A notable difference remained, however: humans tend to integrate visual features and semantic information when making decisions, while the models rely more heavily on semantic labels and abstract concepts.
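To make the triplet protocol and the consistency comparison concrete, here is a minimal sketch. It is illustrative only, not the authors' pipeline: the embeddings are random placeholders standing in for representations fit to model or human judgments, dot-product similarity is one plausible choice among several, and the names emb_model and emb_human are hypothetical; the 66 dimensions merely echo the count reported above.

```python
# Sketch of the odd-one-out triplet task and a human-model consistency score.
# Embeddings are random stand-ins; in the study they would be derived from
# millions of observed behavioral judgments, not drawn at random.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n_concepts, n_dims = 20, 66  # 66 echoes the interpretable dimensions reported
emb_model = rng.random((n_concepts, n_dims))            # hypothetical MLLM embedding
emb_human = emb_model + 0.1 * rng.random((n_concepts, n_dims))  # noisy human stand-in

def odd_one_out(i, j, k, E):
    """Pick the concept least similar to the other two (dot-product similarity)."""
    # For each candidate outlier, score the similarity of the REMAINING pair:
    # if (j, k) is the most similar pair, then i is the odd one out.
    remaining_pair_sim = {i: E[j] @ E[k], j: E[i] @ E[k], k: E[i] @ E[j]}
    return max(remaining_pair_sim, key=remaining_pair_sim.get)

def agreement(E_a, E_b, triplets):
    """Fraction of triplets where two embedding spaces pick the same odd one out."""
    same = sum(odd_one_out(*t, E_a) == odd_one_out(*t, E_b) for t in triplets)
    return same / len(triplets)

triplets = list(combinations(range(n_concepts), 3))
print(f"choice agreement over {len(triplets)} triplets:",
      round(agreement(emb_model, emb_human, triplets), 3))
```

With real data, the embeddings would be optimized so that their predicted odd-one-out choices reproduce the observed judgments (the study's 4.7 million of them), and the agreement score is one simple way to quantify how consistently a model's choices track human behavior.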
Implications for AI Development

This research challenges the notion that language models are merely "stochastic parrots," suggesting instead that they possess an internal understanding of real-world concepts comparable to that of humans. The study's lead author, Dr. Changde Du, emphasized: "Our work shows that these models can go beyond simple recognition and develop a deeper, more nuanced understanding of objects."

The findings not only advance our understanding of AI cognition but also carry significant implications for building more sophisticated, human-like AI systems. By aligning AI models with human cognitive processes, researchers can create tools that better match human intuition and integrate more readily into daily life.

Funding and Collaborations

The research was supported by several prestigious grants, including the Chinese Academy of Sciences' Frontier Research Program, the National Natural Science Foundation of China, the Beijing Municipal Natural Science Foundation, and the National Key Laboratory for Brain Cognitive and Intelligent Technology. Key collaborators include Dr. Le Chang from the Center for Excellence in Brain Science and Intelligence Technology.

Conclusion

The emergence of human-like object concept representations in MLLMs marks a significant step toward bridging the gap between machine recognition and machine understanding. The study paves the way for AI systems that not only recognize objects accurately but also grasp their multifaceted significance much as humans do. The detailed findings and methodology are available in the paper "Human-like object concept representations emerge naturally in multimodal large language models," published in Nature Machine Intelligence.
