LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?

Recent reports claim that large language models (LLMs) now outperform elite humans in competitive programming. Drawing on knowledge from a group of medalists in international algorithmic contests, we revisit this claim, examining how LLMs differ from human experts and where limitations still remain. We introduce LiveCodeBench Pro, a benchmark composed of problems from Codeforces, ICPC, and IOI that are continuously updated to reduce the likelihood of data contamination. A team of Olympiad medalists annotates every problem for algorithmic categories and conducts a line-by-line analysis of failed model-generated submissions. Using this new data and benchmark, we find that frontier models still have significant limitations: without external tools, the best model achieves only 53% pass@1 on medium-difficulty problems and 0% on hard problems, domains where expert humans still excel. We also find that LLMs succeed at implementation-heavy problems but struggle with nuanced algorithmic reasoning and complex case analysis, often generating confidently incorrect justifications. High performance appears largely driven by implementation precision and tool augmentation, not superior reasoning. LiveCodeBench Pro thus highlights the significant gap to human grandmaster levels, while offering fine-grained diagnostics to steer future improvements in code-centric LLM reasoning.
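
As a point of reference for the pass@1 figures above, the sketch below gives the standard unbiased pass@k estimator (Chen et al., 2021), with pass@1 as the k = 1 special case. The symbols n (samples drawn per problem) and c (samples passing all tests) are illustrative assumptions; whether the benchmark reports a single-sample rate or this multi-sample estimate is specified in the paper body, not here.

% Standard unbiased pass@k estimator; pass@1 reduces to the fraction of passing samples.
% n = generations sampled per problem (assumed), c = generations passing all tests (assumed).
\[
\text{pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right],
\qquad
\text{pass@}1 \;=\; \mathbb{E}_{\text{problems}}\!\left[\frac{c}{n}\right].
\]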