AI struggles to grade university essays, favoring style over substance
A University of Cambridge-led study involving psychologists and AI experts has concluded that current generative artificial intelligence models are not sufficiently accurate to grade university essays. The research, titled "AI in University Assessment: Evaluating the Opportunities and Risks of Automated Marking", tested three leading AI systems, including the latest versions of Claude and ChatGPT as of April 2026, on over 750 undergraduate psychology essays submitted between 2022 and 2025.
The results revealed that AI matched human-assigned degree classifications only about half the time, with accuracy varying significantly by institution. At Cambridge, AI matched the human-assigned degree band 63 percent of the time, compared to 53 percent at Nottingham and just 35 percent at Manchester Metropolitan.
The study highlighted a critical flaw: AI systems frequently undervalued top-tier work while overvaluing poor submissions. Instead of assessing academic reasoning, the models were found to be oversensitive to linguistic features such as essay length, vocabulary range, and sentence complexity, effectively rewarding style over substance. Researchers observed a central tendency bias, in which AI models assigned middling marks to most submissions. Consequently, an essay graded 75 by a human was often scored lower by AI, while one graded 50 was scored higher. The models were consistent with human graders only in the mid-range of grades, precisely where distinguishing between pass and fail or a First and an Upper Second is most challenging.
Despite these limitations, the study suggests AI could serve as a supplementary tool rather than a replacement for human judgment. It could be useful for detecting errors, ensuring consistency, and triaging feedback by flagging assignments where AI and human marks diverge significantly.
However, the authors warn against heavy reliance on automation, noting that it risks homogenizing grading and eroding the social contract between students and educators. Dr. Deborah Talmi, who leads the OpRaise project, emphasized that assessment is fundamental to educational meaning and trust. Students said they would feel cheated by the prospect of AI grading, and staff warned that removing human engagement could weaken motivation and professional judgment.
AI-generated feedback was three to eight times longer than human comments, yet when word counts were matched, focus groups found it difficult to distinguish AI feedback from human feedback. Once the source was revealed, acceptance of the AI feedback dropped.
The researchers chose psychology as the test bed because the subject prioritizes evidence synthesis and critical judgment over single correct answers, making it a rigorous environment for evaluating AI. The study found that while AI provided consistent marks across different testing sessions, its internal logic differed fundamentally from human academic judgment. Human grading is based on reasoning, whereas AI relies on statistical predictions.
In summary, the report cautions that while AI may assist in reducing staff workload and improving efficiency, it remains too shallow and inconsistent to determine final grades. The consensus is that human examiners must always determine the final mark to uphold standards, ensure fairness, and maintain the integrity of the higher education system.
