Study Reveals Large Language Models Outperform Humans on Emotional Intelligence Tests and Can Create Valid New Tests
A recent study conducted by researchers at the University of Bern and the University of Geneva has revealed that large language models (LLMs) like ChatGPT can perform exceptionally well on emotional intelligence (EI) tests traditionally designed for humans. The study, published in Communications Psychology, assessed the ability of six widely used LLMs to solve and create EI tests: ChatGPT-4, ChatGPT-o1, Gemini 1.5 Flash, Copilot 365, Claude 3.5 Haiku, and DeepSeek V3.

Katja Schlegel, the lead researcher, has spent many years studying EI and developing performance-based tests that measure the ability to recognize, understand, and regulate emotions. When advanced LLMs like ChatGPT became publicly available, she and her colleagues, Nils R. Sommer and Marcello Mortillaro, wanted to see how these models would fare on EI assessments. Their inquiry was driven by ongoing debates about whether AI can genuinely possess empathy, the capacity to understand and respond to others' emotions.

To conduct the study, the team used five different EI tests that present short emotional scenarios and ask for the most emotionally intelligent response, such as identifying what someone might be feeling or suggesting the best way to handle an emotional situation. These tests were administered to the selected LLMs, and their scores were compared to human averages from previous studies. The results were striking: the LLMs achieved an average accuracy of 81% on the EI tests, significantly higher than the average human accuracy of 56%. This suggests that LLMs are highly adept at understanding and interpreting emotional contexts, at least within the structured scenarios presented in these tests.

In the second phase of the study, the researchers tasked ChatGPT-4 with generating new EI test items, each consisting of an emotional scenario, a question, answer options, and the correct response. The newly created tests were then given to over 460 human participants to evaluate their difficulty, clarity, realism, and correlation with established EI tests and measures of cognitive intelligence. Participants rated the AI-generated items as being as clear and realistic as the originals, and the new tests showed comparable psychometric quality. According to Schlegel, this indicates that LLMs not only solve EI tests effectively but also demonstrate a deep enough conceptual understanding of emotions to construct valid and reliable test items.

The implications of these findings are significant for several fields. For psychology, the ability of LLMs to generate EI tests and training materials could streamline a process that is currently time-consuming and done by hand, leading to more efficient and scalable EI assessments and interventions. The study also suggests that LLMs could create tailored role-play scenarios for social workers, enhancing their training and preparedness for real-world emotional interactions. For the development of social agents like mental health chatbots, educational tutors, and customer service avatars, the research highlights the potential of LLMs to enhance emotional reasoning capabilities. These agents often operate in emotionally sensitive contexts where understanding human emotions is crucial, and the study suggests that LLMs can at least emulate the emotional reasoning skills necessary for effective interaction.
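To make the scoring procedure described above concrete, here is a minimal sketch in Python of how accuracy on multiple-choice EI items might be computed and compared between a model and a human respondent. This is not the authors' code; the item content, answers, and helper names (EIItem, accuracy) are illustrative assumptions.

```python
# Hypothetical sketch of the study's scoring logic: each EI item is a
# short scenario with answer options and one keyed-correct response,
# and accuracy is the share of items answered correctly. The paper
# reports LLM accuracy (~81%) against human averages (~56%); the toy
# items and answers below are placeholders, not the real materials.

from dataclasses import dataclass

@dataclass
class EIItem:
    scenario: str        # short emotional vignette
    options: list[str]   # candidate responses
    correct: int         # index of the keyed (most emotionally intelligent) answer

def accuracy(answers: list[int], items: list[EIItem]) -> float:
    """Fraction of items where the chosen option matches the key."""
    hits = sum(a == item.correct for a, item in zip(answers, items))
    return hits / len(items)

# A hypothetical two-item mini-test.
items = [
    EIItem("A colleague's project was cancelled without warning.",
           ["relief", "frustration", "pride"], correct=1),
    EIItem("A friend just received unexpected good news.",
           ["joy", "guilt", "boredom"], correct=0),
]

llm_answers = [1, 0]    # e.g. parsed from model output
human_answers = [1, 2]  # e.g. one participant's responses

print(f"LLM accuracy:   {accuracy(llm_answers, items):.0%}")
print(f"Human accuracy: {accuracy(human_answers, items):.0%}")
```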
However, the researchers note that further studies are needed to explore how LLMs perform in less structured, real-life emotional conversations, since the current tests are based on controlled, scenario-driven formats. They also aim to investigate the cultural sensitivity of these models: most are trained on Western-centric data, which might limit their applicability in diverse cultural settings.

Industry insiders and experts are optimistic about the findings, recognizing the potential of LLMs to change how emotional intelligence is measured and applied. The ability of these models to generate high-quality test items and scenarios could significantly reduce the workload for psychologists and educators, making EI assessments more accessible and effective. Skepticism remains, though, about whether LLMs can truly embody empathy and emotional depth in unstructured, real-world interactions, a concern that underscores the need for continued research and development.

The University of Bern and the University of Geneva are renowned institutions with strong programs in psychology and artificial intelligence, making this collaboration a powerful effort to bridge the gap between human and AI emotional understanding.