Large language models often suffer from inefficient resource utilization during inference because of their auto-regressive, token-by-token generation.
Existing literature typically explains the performance plateau observed in large-batch inference as a shift into the compute-bound regime, but a new study shows that inference remains memory-bound.
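To make the distinction concrete, the sketch below illustrates the conventional roofline-style argument that the study challenges: in this simplified model, larger batches raise arithmetic intensity until decoding appears compute-bound. The hardware numbers and model size are illustrative assumptions, and KV-cache traffic, which grows with batch size and is one commonly cited reason decoding stays memory-bound in practice, is deliberately omitted.

```python
# Minimal roofline-style sketch (illustrative, not from the paper): classifies one
# decode step as memory- or compute-bound from its arithmetic intensity.
# All hardware and model numbers below are assumptions.

def decode_regime(batch_size: int,
                  n_params: float = 7e9,         # assumed 7B-parameter model
                  bytes_per_param: int = 2,      # fp16 weights
                  peak_flops: float = 312e12,    # assumed A100-class peak (FLOP/s)
                  mem_bw: float = 2.0e12) -> str:  # assumed HBM bandwidth (B/s)
    """Classify a decode step via arithmetic intensity vs. the GPU ops:byte ridge."""
    # Each generated token uses every weight once (~2 FLOPs per weight per token).
    flops = 2 * n_params * batch_size
    # Weights are read once per step regardless of batch size (KV cache ignored here).
    bytes_moved = n_params * bytes_per_param
    intensity = flops / bytes_moved             # FLOPs per byte of weight traffic
    ridge = peak_flops / mem_bw                 # GPU ops:byte ratio
    return "compute-bound" if intensity > ridge else "memory-bound"

for b in (1, 8, 64, 256):
    print(b, decode_regime(b))
```

Under these simplifying assumptions, only very large batches cross the ridge point; accounting for per-sequence KV-cache reads shifts the picture back toward memory-bound, which is the regime the study reports.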
Researchers propose a Batching Configuration Advisor (BCA) to optimize memory allocation, reducing GPU memory requirements and improving resource utilization.
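The paper's BCA is not reproduced here, but the following hypothetical sketch shows the kind of calculation such an advisor might perform: choosing the largest batch size whose worst-case KV cache fits in the memory left after model weights and a workspace reserve. All parameter names and values are assumptions for illustration.

```python
# Hypothetical batching-configuration sketch (not the paper's actual BCA):
# advise the largest batch size whose KV cache fits in remaining GPU memory.

def advise_batch_size(gpu_mem_gb: float = 80.0,   # assumed GPU memory
                      weight_gb: float = 14.0,     # e.g. 7B params in fp16
                      reserve_gb: float = 6.0,     # activations / fragmentation slack
                      max_seq_len: int = 4096,
                      n_layers: int = 32,
                      n_kv_heads: int = 32,
                      head_dim: int = 128,
                      kv_bytes: int = 2) -> int:   # fp16 KV entries
    """Return the largest batch size whose worst-case KV cache fits in memory."""
    # Per-sequence KV cache: 2 (K and V) * layers * heads * head_dim * seq_len * bytes
    kv_per_seq = 2 * n_layers * n_kv_heads * head_dim * max_seq_len * kv_bytes
    free_bytes = (gpu_mem_gb - weight_gb - reserve_gb) * 1024**3
    return max(int(free_bytes // kv_per_seq), 0)

print(advise_batch_size())  # batch-size suggestion for an assumed 7B model on 80 GB
```

Sizing the batch this way caps the GPU memory actually allocated to the KV cache, which is the kind of memory-allocation tuning the BCA is proposed to automate.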
The study challenges conventional assumptions and offers insights and strategies for better resource utilization, particularly for smaller language models.