Humans and AI Agree on Confusing Code: Study Reveals Shared Cognitive Struggles and Enables Smarter Development Tools
Researchers from Saarland University and the Max Planck Institute for Software Systems have discovered that humans and large language models (LLMs) react similarly when confronted with confusing or misleading program code, marking a significant step toward more effective human-AI collaboration in software development. The findings, published on the arXiv preprint server, are based on a novel interdisciplinary study that compared human brain activity with model uncertainty in LLMs.

Led by Sven Apel, Professor of Software Engineering at Saarland University, and Mariya Toneva, a researcher at the Max Planck Institute for Software Systems, the team investigated how both humans and AI models process code containing subtle, deceptive patterns known as "atoms of confusion": short, syntactically correct code snippets that mislead even experienced developers because they behave counterintuitively.

To explore this, the researchers drew on data from a prior study in which participants read both confusing and clean code variants while their brain activity was recorded using electroencephalography (EEG) and eye tracking. On the same code snippets, the team measured the uncertainty of LLMs using perplexity, a standard metric that quantifies how surprised a model is by a given sequence of text tokens; the higher the perplexity, the greater the model's uncertainty.

The results revealed a strong alignment: wherever human participants showed increased brain activity, particularly the late frontal positivity, a neural signal linked to processing unexpected or confusing information, the LLMs also exhibited higher perplexity. The correlation was statistically significant, indicating that humans and models struggle in the same code regions.

Youssef Abdelsalam, a doctoral researcher working under Toneva and Apel, was surprised by the result: "We were astounded that the peaks in brain activity and model uncertainty showed such strong correspondence," he said.
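To make the perplexity metric concrete: it can be computed directly from a model's per-token probabilities as the exponential of the average negative log-probability. The sketch below uses made-up probability values (not data from the study) to show why confusing code scores higher: each low-probability, surprising token pushes the average surprisal, and therefore the exponentiated score, upward.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Hypothetical probabilities a model might assign to tokens of clean code:
clean = perplexity([0.9, 0.8, 0.95, 0.85])      # model is confident -> low score

# Hypothetical probabilities for a confusing snippet's tokens:
confusing = perplexity([0.3, 0.2, 0.4, 0.25])   # model is surprised -> high score
```

In practice the probabilities come from the LLM itself, but the aggregation step is exactly this simple.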
Building on this insight, the team developed a data-driven method to automatically detect confusing code patterns. The algorithm identified known confusing snippets in over 60% of test cases and uncovered more than 150 previously unrecognized patterns that also triggered increased human brain activity.

"This work brings us closer to understanding how humans and machines perceive code in similar ways," said Toneva. "By identifying shared points of confusion, we can create tools that make code more transparent and improve how developers and AI assistants work together." Apel added: "If we know when and why both humans and AI stumble, we can design better software, better training data, and better AI assistants that truly support developers."

The study represents a convergence of neuroscience, software engineering, and artificial intelligence, and has been accepted for presentation at the International Conference on Software Engineering (ICSE).
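The article does not describe the detection algorithm itself. Purely as an illustration of the general idea it rests on (flagging code regions where a model's token-level surprisal spikes relative to its surroundings), here is a minimal sketch; the function name, the z-score heuristic, and the threshold are assumptions for illustration, not the team's actual method.

```python
import math

def flag_confusing_tokens(tokens, probs, z_threshold=1.5):
    """Illustrative heuristic (not the study's method): flag tokens whose
    surprisal (-log p) is unusually high relative to the sequence,
    measured as a z-score against the sequence's own statistics."""
    surprisal = [-math.log(p) for p in probs]
    mean = sum(surprisal) / len(surprisal)
    var = sum((s - mean) ** 2 for s in surprisal) / len(surprisal)
    std = math.sqrt(var) or 1.0  # avoid division by zero on flat sequences
    return [tok for tok, s in zip(tokens, surprisal)
            if (s - mean) / std > z_threshold]

# Hypothetical tokenization and probabilities: one token stands out.
suspects = flag_confusing_tokens(
    ["int", "v", "=", "x+++y"],
    [0.9, 0.9, 0.9, 0.001],
)
```

A real pipeline would take the probabilities from an LLM and validate flagged regions against human data, as the study did with EEG recordings.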
