OpenAI's New Benchmark Shows ChatGPT Matches Human Performance on Real-World Work Tasks
OpenAI has released a new benchmark, GDPval, to demonstrate that AI can already perform certain real-world work tasks at a level comparable to human professionals. The initiative comes amid growing skepticism about the practical value of AI in the workplace: recent studies suggest that most companies see little to no return on their AI investments. An MIT Media Lab study found that fewer than 10% of AI pilot projects generated measurable revenue, while a Harvard Business Review and Stanford study identified "workslop" (AI-generated content that appears productive but lacks real substance) as a major barrier to success.

GDPval is designed to address the limitations of traditional AI benchmarks, which often focus on abstract academic problems rather than the everyday tasks people perform in their jobs. OpenAI says the benchmark is grounded in the concept of Gross Domestic Product, selecting 44 occupations across the nine industries that contribute most to U.S. economic output, including finance, manufacturing, government, and real estate. The focus is on high-wage, knowledge-based roles.

To build the test set, OpenAI worked with professionals averaging 14 years of experience in their fields. These experts created real-world tasks and supplied human-written examples of high-quality work. The final evaluation includes 30 reviewed tasks per occupation, with five open-sourced "gold" tasks per role. The tasks range from drafting legal briefs and writing nursing care plans to designing engineering blueprints and handling customer support. Independent experts from the same fields then graded AI-generated outputs against the human-written ones in blind evaluations, ranking each AI output as better than, as good as, or worse than the human version. The results show that leading AI models can now produce work that matches or exceeds human performance on many of these tasks.
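The grading scheme described above reduces to a simple calculation: each blind comparison yields one of three labels, and a model's score is the share of comparisons where its output was rated at least as good as the human's. A minimal illustrative sketch (not OpenAI's actual evaluation code; the labels and data are hypothetical):

```python
# Illustrative sketch of a "win or tie" rate from blind pairwise grading,
# where each grade compares one AI output against a human-written version.
from collections import Counter

def win_or_tie_rate(grades):
    """grades: list of 'better', 'as_good', or 'worse' labels
    assigned by expert reviewers in blind comparisons."""
    counts = Counter(grades)
    # A "win or tie" is any comparison where the AI output was rated
    # at least as good as the human expert's deliverable.
    return (counts["better"] + counts["as_good"]) / len(grades)

# Hypothetical gradings for one model across ten tasks:
grades = ["better", "worse", "as_good", "worse", "better",
          "worse", "as_good", "worse", "worse", "better"]
print(f"{win_or_tie_rate(grades):.1%}")  # → 50.0%
```

Under this scheme, a 47.6% rate means expert reviewers judged the model's output at least as good as the human professional's in nearly half of the blind comparisons.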
Claude Opus 4.1 led with a combined win-or-tie rate of 47.6%, excelling in areas like document formatting and visual layout. GPT-5 high followed at 38.8%, showing strength in accuracy and instruction following. GPT-4o trailed at just 12.4%.

AI models performed particularly well on routine, structured tasks, such as those done by counter clerks, inventory clerks, sales managers, and software developers. They struggled more with complex, judgment-intensive roles like industrial engineering, pharmacy, financial management, and video editing. OpenAI claims these models can complete GDPval tasks roughly 100 times faster and 100 times cheaper than human experts.

Still, the company emphasizes that AI is not poised to replace humans entirely. Instead, it aims to handle repetitive, rule-based work so people can focus on higher-level creativity, decision-making, and problem-solving. As OpenAI put it, "most jobs are more than just a collection of tasks that can be written down." The goal is not replacement but augmentation: freeing workers from mundane work to focus on what humans do best.