AI Matches Human Graders in Evaluating Economics Exam Responses
How does high population growth affect gross domestic product? This is a common question on macroeconomics exams, one that requires students to demonstrate not just factual knowledge but the ability to construct coherent, evidence-based arguments. Free-text responses like these are valuable for assessing deep understanding, yet they pose a significant challenge for instructors and teaching assistants, who must read, interpret, and grade each answer by hand, a time-intensive process.

Now, a new study shows that artificial intelligence is increasingly capable of matching human graders in evaluating these complex responses. Researchers tested AI models, including advanced language models, on their ability to assess student answers to macroeconomics exam questions. The AI systems performed at a level comparable to human graders in accuracy, consistency, and alignment with established grading rubrics.

The AI was trained on thousands of previously graded student responses, learning to identify key economic concepts such as labor force expansion, capital per worker, productivity growth, and the potential trade-offs between population growth and per capita income. It could also detect nuances such as logical structure, use of relevant data, and the quality of economic reasoning, elements that go beyond simple keyword matching.

This development could significantly reduce the grading burden on academic staff, freeing them to focus on teaching and student feedback rather than administrative tasks. It also opens the door to faster, more consistent feedback for students, which can improve learning outcomes. While human oversight remains essential, especially for highly subjective or creative responses, the integration of AI into grading marks a major step forward in educational technology.
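To see what "simple keyword matching" means as a baseline, here is a hypothetical sketch in Python. Everything in it (the rubric concepts, phrases, and function name) is illustrative, not from the study; the article's point is that the AI models go beyond this kind of check by also judging reasoning and structure.

```python
# Hypothetical illustration: a naive keyword-matching baseline for scoring
# a free-text answer on population growth and GDP. The rubric entries below
# are drawn from concepts named in the article; the phrase lists are made up.
# The study's AI reportedly surpasses this by weighing logical structure,
# use of data, and the quality of economic reasoning.

RUBRIC_CONCEPTS = {
    "labor force expansion": ["labor force", "workforce", "more workers"],
    "capital per worker": ["capital per worker", "capital dilution"],
    "productivity growth": ["productivity"],
    "per capita trade-off": ["per capita", "per-capita"],
}

def keyword_score(response: str) -> float:
    """Return the fraction of rubric concepts mentioned in the response."""
    text = response.lower()
    hits = sum(
        any(phrase in text for phrase in phrases)
        for phrases in RUBRIC_CONCEPTS.values()
    )
    return hits / len(RUBRIC_CONCEPTS)

answer = (
    "High population growth expands the labor force, raising total GDP, "
    "but it can dilute capital per worker and slow per capita income growth "
    "unless productivity rises."
)
print(keyword_score(answer))  # mentions all four concepts -> 1.0
```

A baseline like this would award full marks to any answer that name-drops the right terms, regardless of whether the argument holds together, which is precisely the gap the graded-response training described above is meant to close.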
As AI continues to improve in understanding context and reasoning, its role in academic assessment is likely to expand, offering scalable solutions for evaluating complex, open-ended responses across disciplines.
