Large Language Model (LLM) inference is memory-bound: its performance is constrained by the limited bandwidth of High Bandwidth Memory (HBM).
To address this bottleneck, an asynchronous KV Cache prefetching method is proposed for LLM inference.
By exploiting idle memory bandwidth during active computation windows, the method prefetches the required KV Cache into the GPU's L2 cache, so that subsequent accesses hit in cache instead of going to HBM.
Experiments on NVIDIA H20 GPUs demonstrate significant improvements in attention kernel efficiency and end-to-end throughput over existing baselines.
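The abstract does not specify the prefetch mechanism; the sketch below illustrates one plausible realization, not the authors' implementation. It assumes a hypothetical lightweight kernel `prefetch_kv_into_l2`, launched on a dedicated CUDA stream, that issues PTX `prefetch.global.L2` hints over the next layer's KV Cache while the current layer's compute kernels occupy the device, consuming otherwise idle HBM bandwidth during the compute window.

```cuda
// Minimal sketch (assumptions: kernel name, stream layout, and the use of
// PTX prefetch.global.L2 as the prefetch mechanism are illustrative only).
#include <cuda_runtime.h>
#include <cstdio>

__global__ void prefetch_kv_into_l2(const float4* kv, size_t n_vec) {
    // Grid-stride loop: each thread issues L2 prefetch hints for a slice of
    // the KV Cache. Nothing is written, so the kernel itself is very cheap.
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n_vec;
         i += (size_t)gridDim.x * blockDim.x) {
        const float4* p = kv + i;
        asm volatile("prefetch.global.L2 [%0];" :: "l"(p));
    }
}

int main() {
    const size_t n_floats = 1 << 24;        // stand-in for one layer's KV Cache
    const size_t n_vec    = n_floats / 4;   // prefetch at float4 granularity
    float* kv = nullptr;
    cudaMalloc(&kv, n_floats * sizeof(float));

    cudaStream_t compute_stream, prefetch_stream;
    cudaStreamCreate(&compute_stream);   // would carry the attention/GEMM kernels
    cudaStreamCreate(&prefetch_stream);  // carries only the prefetch kernel

    // While the current layer's kernels run on compute_stream, issue the
    // prefetch for the next layer's KV Cache on prefetch_stream so the loads
    // overlap with computation and land in L2 before they are needed.
    prefetch_kv_into_l2<<<256, 256, 0, prefetch_stream>>>(
        reinterpret_cast<const float4*>(kv), n_vec);

    cudaDeviceSynchronize();
    printf("prefetch issued: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaStreamDestroy(compute_stream);
    cudaStreamDestroy(prefetch_stream);
    cudaFree(kv);
    return 0;
}
```

In this sketch the overlap comes from launching the prefetch on its own stream; in a real serving pipeline the prefetch would be scheduled per layer, sized to the L2 capacity of the target GPU, and synchronized with the attention kernel that consumes the prefetched KV Cache.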