Scaling LLM Inference with Optimized Sample Compute Allocation
Published in the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2025
In this work, we address the high computational cost of LLM inference, especially for complex tasks that require drawing multiple samples. We propose OSCA (Optimized Sample Compute Allocation), a strategy that allocates sample compute across different inference configurations rather than spending it uniformly. Our experiments show that OSCA sets a new state of the art in inference efficiency.
Recommended citation: Kexun Zhang*, Shang Zhou*, Danqing Wang, William Yang Wang, Lei Li. (2025). "Scaling LLM Inference with Optimized Sample Compute Allocation." In Proceedings of the 2025 Annual Conference of the North American Chapter of the Association for Computational Linguistics.
Download Paper