LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?
Under review at the Conference on Neural Information Processing Systems (NeurIPS), 2025
We introduce LiveCodeBench Pro, a new benchmark for evaluating Large Language Models (LLMs) on competitive programming problems, with judgments provided by human Olympiad medalists. Our work, co-led by a team of 20 researchers, quantifies a major performance gap: LLMs score nearly 0% on hard problems and consistently fail on tasks requiring deep observation and reasoning.
Recommended citation: Zihan Zheng*, Zerui Cheng*, Zeyu Shen*, Shang Zhou*, Kaiyuan Liu*, Hansen He*, et al. (2025). "LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?." arXiv preprint arXiv:2506.11928.
Download Paper