Scaling LLM Inference with Optimized Sample Compute Allocation

Published in Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2025

In this work, we address the high computational cost of LLM inference, especially for complex tasks that require drawing many samples per problem. We propose OSCA (Optimized Sample Compute Allocation), a strategy that allocates a fixed sample budget across different sampling configurations rather than spending it uniformly. Our experiments show that OSCA achieves higher accuracy at a substantially lower inference cost than uniform sample allocation.
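To make the idea concrete, below is a minimal, hypothetical sketch of allocating a sample budget across configurations. It assumes we have estimated per-sample success rates for each configuration on a small set of training problems, and it greedily spends the budget to maximize the average probability that at least one sample solves each problem. The function names, the toy data, and the greedy search are illustrative only and are not the paper's actual implementation.

```python
from typing import Dict, List

def allocate_samples(
    success_rates: Dict[str, List[float]],  # config -> per-problem success rate
    budget: int,                            # total number of samples to draw
) -> Dict[str, int]:
    """Greedily assign samples to configurations so as to maximize the
    estimated fraction of problems solved by at least one sample."""
    configs = list(success_rates)
    n_problems = len(next(iter(success_rates.values())))
    allocation = {c: 0 for c in configs}
    # fail[i] = current probability that no drawn sample solves problem i
    fail = [1.0] * n_problems

    for _ in range(budget):
        best_cfg, best_score = None, -1.0
        for c in configs:
            # Expected coverage if one more sample comes from config c.
            score = sum(
                1.0 - fail[i] * (1.0 - success_rates[c][i])
                for i in range(n_problems)
            ) / n_problems
            if score > best_score:
                best_cfg, best_score = c, score
        allocation[best_cfg] += 1
        fail = [fail[i] * (1.0 - success_rates[best_cfg][i])
                for i in range(n_problems)]
    return allocation

# Toy example: config "A" is strong on problem 1, "B" on problem 2, so a mixed
# allocation covers more problems than spending the whole budget on one config.
rates = {"A": [0.9, 0.05], "B": [0.05, 0.9]}
print(allocate_samples(rates, budget=4))  # -> {'A': 2, 'B': 2}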

Download paper here

Recommended citation: Kexun Zhang*, Shang Zhou*, Danqing Wang, William Yang Wang, Lei Li. (2025). "Scaling LLM Inference with Optimized Sample Compute Allocation." In Proceedings of the 2025 Annual Conference of the North American Chapter of the Association for Computational Linguistics.
