SageAttention is a highly efficient and accurate quantization method for attention in transformer architectures. Attention has O(N^2) computational complexity in sequence length and becomes the primary time-consuming component when handling long sequences. SageAttention outperforms FlashAttention2 and xformers in operations per second (OPS) by about 2.1x and 2.7x, respectively. Comprehensive experiments show that SageAttention incurs almost no end-to-end metrics loss across diverse models.
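
Below is a minimal usage sketch of how a quantized attention kernel like this can be dropped in place of standard PyTorch attention. It assumes the package exposes a `sageattn(q, k, v, is_causal=...)` function with the same call pattern as `torch.nn.functional.scaled_dot_product_attention`; the exact signature and supported layouts may differ in the installed version, so check the package's own documentation.

```python
import torch
from sageattention import sageattn  # assumed import path; verify against the installed package

# Typical attention inputs: (batch, heads, seq_len, head_dim), fp16 on GPU.
q = torch.randn(2, 16, 4096, 128, dtype=torch.float16, device="cuda")
k = torch.randn(2, 16, 4096, 128, dtype=torch.float16, device="cuda")
v = torch.randn(2, 16, 4096, 128, dtype=torch.float16, device="cuda")

# Baseline: standard PyTorch attention.
ref = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=False)

# SageAttention: quantized attention with (assumed) the same call pattern.
out = sageattn(q, k, v, is_causal=False)

# The quantized output should stay close to the fp16 baseline.
print((out - ref).abs().max())
```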