Efficient inference of large language models on GPUs remains challenging due to memory bandwidth limitations. SiftAttention is a new approximate attention method proposed to address this bottleneck in attention computation.
SiftAttention replaces the top-$k$ selection step in attention with a computationally efficient, element-wise filtering operation based on a threshold value.
This threshold is estimated dynamically per prompt at each generation step, reducing data movement between High Bandwidth Memory (HBM) and SRAM.
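To make the idea concrete, the following is a minimal PyTorch-style sketch, not the authors' kernel: `sift_attention_scores` applies an element-wise threshold `tau` to the query-key scores in place of a top-$k$ selection, and `estimate_threshold` stands in for the paper's dynamic per-prompt threshold estimation with a simple quantile heuristic. The function names, the `keep_ratio` parameter, and the quantile rule are illustrative assumptions; the actual method operates at the kernel level so that filtered entries are never moved from HBM to SRAM.

```python
import torch

def sift_attention_scores(q, k, v, tau):
    """Threshold-based approximate attention (illustrative sketch).

    Instead of keeping the top-k scores, every query-key score below the
    threshold `tau` is masked out element-wise before the softmax, so no
    sorting or top-k selection is required.
    """
    d = q.shape[-1]
    scores = (q @ k.transpose(-2, -1)) / d ** 0.5       # raw attention logits
    keep = scores >= tau                                 # element-wise filter, no sort
    scores = scores.masked_fill(~keep, float("-inf"))    # drop filtered entries
    # Note: a real kernel would guarantee at least one surviving key per query
    # to avoid an all -inf row; this sketch omits that safeguard.
    probs = torch.softmax(scores, dim=-1)
    return probs @ v


def estimate_threshold(scores_sample, keep_ratio=0.05):
    """Hypothetical per-prompt threshold estimate.

    The paper estimates the threshold dynamically per prompt at each
    generation step; the exact estimator is not reproduced here. This
    placeholder simply takes a high quantile of a sampled slice of scores.
    """
    return torch.quantile(scores_sample.float(), 1.0 - keep_ratio)
```

In this sketch the accuracy/efficiency trade-off is controlled entirely by how `tau` is chosen: a higher threshold filters more keys and values, which in a fused kernel translates directly into fewer HBM reads per generation step.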