Publications

You can also find my articles on my Google Scholar profile.

Conference Papers


LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?

Under review at the Conference on Neural Information Processing Systems (NeurIPS), 2025

We introduce LiveCodeBench Pro, a new benchmark for evaluating Large Language Models on competitive programming problems, judged by human Olympiad medalists. Our work, carried out by a team of 20 researchers, quantifies a major performance gap: LLMs score near 0% on hard problems and consistently fail on tasks that require deep observation and reasoning.

Recommended citation: Zihan Zheng*, Zerui Cheng*, Zeyu Shen*, Shang Zhou*, Kaiyuan Liu*, Hansen He*, et al. (2025). "LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?." arXiv preprint arXiv:2506.11928.
Download Paper

Scaling LLM Inference with Optimized Sample Compute Allocation

Published in the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2025

This paper introduces OSCA, an algorithm that formulates sample allocation for LLM inference as a learning problem: rather than drawing all samples from a single configuration, OSCA learns a mixed allocation across inference configurations. The optimized allocation achieves better accuracy with significantly less compute, using 128x less on code generation and 25x less on reasoning tasks.
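To make the idea concrete, here is a toy sketch, not the paper's actual OSCA algorithm: given invented per-sample success rates and costs for a few hypothetical inference configurations, it brute-forces the mixed allocation that maximizes the chance of at least one correct sample under a fixed compute budget. OSCA learns the allocation from data instead of searching, but the objective has the same flavor.

```python
from itertools import product

# Hypothetical inference configurations: (name, per-sample success prob, per-sample cost).
# All numbers here are invented for illustration.
configs = [
    ("model-A, temp 0.2", 0.30, 4.0),
    ("model-A, temp 0.8", 0.25, 4.0),
    ("model-B, temp 0.8", 0.15, 1.0),
]
BUDGET = 16.0  # total compute budget, arbitrary units

def cost(alloc):
    return sum(c * n for (_, _, c), n in zip(configs, alloc))

def success_prob(alloc):
    # P(at least one sampled answer is correct), assuming independent samples.
    p_all_fail = 1.0
    for (_, p, _), n in zip(configs, alloc):
        p_all_fail *= (1.0 - p) ** n
    return 1.0 - p_all_fail

# Brute-force search over small allocations; OSCA learns this instead.
feasible = (a for a in product(range(17), repeat=len(configs)) if cost(a) <= BUDGET)
best = max(feasible, key=success_prob)

print("best allocation:", {name: n for (name, _, _), n in zip(configs, best)})
print(f"success prob: {success_prob(best):.3f}, cost: {cost(best):.1f}")
```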

Recommended citation: Kexun Zhang*, Shang Zhou*, Danqing Wang, William Yang Wang, Lei Li. (2025). "Scaling LLM Inference with Optimized Sample Compute Allocation." In Proceedings of the 2025 Annual Conference of the North American Chapter of the Association for Computational Linguistics.
Download Paper

Evaluating the Smooth Control of Attribute Intensity in Text Generation with LLMs

Published in Findings of the Association for Computational Linguistics (ACL), 2024

We developed a novel evaluation framework to assess fine-grained control of attribute intensity in text generated by LLMs. Using GPT-4 as a judge together with an Elo rating system, our work quantifies control calibration and consistency across five attributes for various prompting and representation editing (RepE) methods.
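For intuition, the sketch below shows a minimal Elo update over hypothetical pairwise judgments of attribute intensity; in the paper's setting the comparisons would come from GPT-4, and smooth control corresponds to ratings that rise monotonically with the requested intensity level. All names and outcomes here are illustrative, not the paper's data.

```python
def elo_update(r_winner, r_loser, k=32.0):
    """Standard Elo update: shift both ratings by K times the surprise of the outcome."""
    expected = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected)
    return r_winner + delta, r_loser - delta

# Hypothetical generations, one per requested intensity level. In the paper's
# setting, a GPT-4 judge picks which of two texts expresses the attribute more
# strongly; here the outcomes are hard-coded for illustration.
ratings = {"level-1": 1000.0, "level-2": 1000.0, "level-3": 1000.0}
judged_pairs = [("level-3", "level-1"), ("level-2", "level-1"), ("level-3", "level-2")]

for winner, loser in judged_pairs:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser])

# Smooth control shows up as ratings that increase monotonically with the level.
print({level: round(r) for level, r in ratings.items()})
```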

Recommended citation: Shang Zhou*, Feng Yao*, Chengyu Dong, Zihan Wang, Jingbo Shang. (2024). "Evaluating the Smooth Control of Attribute Intensity in Text Generation with LLMs." In Findings of the Association for Computational Linguistics: ACL 2024.
Download Paper