Large Language Model (LLM) inference is memory-bound: its performance is constrained by the limited bandwidth of High Bandwidth Memory (HBM).
To address this bottleneck, an asynchronous KV Cache prefetching method is proposed for LLM inference.
By exploiting idle memory bandwidth during active computation windows, the method prefetches the required KV Cache into the GPU's L2 cache, so that subsequent accesses hit in cache instead of going to HBM.
Experiments on NVIDIA H20 GPUs demonstrate significant improvements in attention kernel efficiency and end-to-end throughput over existing baselines.
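The abstract does not specify the prefetch mechanism; the sketch below illustrates one plausible realization, not the authors' implementation. It assumes a hypothetical lightweight kernel `prefetch_kv_into_l2`, launched on a dedicated CUDA stream, that issues PTX `prefetch.global.L2` hints over the next layer's KV Cache while the current layer's compute kernels occupy the device, consuming otherwise idle HBM bandwidth during the compute window.

```cuda
// Minimal sketch (assumptions: kernel name, stream layout, and the use of
// PTX prefetch.global.L2 as the prefetch mechanism are illustrative only).
#include <cuda_runtime.h>
#include <cstdio>

__global__ void prefetch_kv_into_l2(const float4* kv, size_t n_vec) {
    // Grid-stride loop: each thread issues L2 prefetch hints for a slice of
    // the KV Cache. Nothing is written, so the kernel itself is very cheap.
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n_vec;
         i += (size_t)gridDim.x * blockDim.x) {
        const float4* p = kv + i;
        asm volatile("prefetch.global.L2 [%0];" :: "l"(p));
    }
}

int main() {
    const size_t n_floats = 1 << 24;        // stand-in for one layer's KV Cache
    const size_t n_vec    = n_floats / 4;   // prefetch at float4 granularity
    float* kv = nullptr;
    cudaMalloc(&kv, n_floats * sizeof(float));

    cudaStream_t compute_stream, prefetch_stream;
    cudaStreamCreate(&compute_stream);   // would carry the attention/GEMM kernels
    cudaStreamCreate(&prefetch_stream);  // carries only the prefetch kernel

    // While the current layer's kernels run on compute_stream, issue the
    // prefetch for the next layer's KV Cache on prefetch_stream so the loads
    // overlap with computation and land in L2 before they are needed.
    prefetch_kv_into_l2<<<256, 256, 0, prefetch_stream>>>(
        reinterpret_cast<const float4*>(kv), n_vec);

    cudaDeviceSynchronize();
    printf("prefetch issued: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaStreamDestroy(compute_stream);
    cudaStreamDestroy(prefetch_stream);
    cudaFree(kv);
    return 0;
}
```

In this sketch the overlap comes from launching the prefetch on its own stream; in a real serving pipeline the prefetch would be scheduled per layer, sized to the L2 capacity of the target GPU, and synchronized with the attention kernel that consumes the prefetched KV Cache.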