The efficiency of attention is crucial because its time complexity is quadratic in sequence length.
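For reference, the quadratic cost comes from the two $N \times N$ matrix products in standard attention (notation assumed here: sequence length $N$, head dimension $d$):

$$
O = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V,
\qquad QK^{\top} \in \mathbb{R}^{N \times N}
\;\Rightarrow\; \text{time} = O(N^{2} d).
$$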
Attention efficiency is improved by leveraging the new FP4 Tensor Cores in Blackwell GPUs, yielding a 5x speedup over the fastest FlashAttention on RTX5090.
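To make the idea concrete, below is a minimal PyTorch sketch that simulates per-block FP4 (E2M1) quantization of Q and K before the QK^T product. The block size, scale handling, and function names are illustrative assumptions for exposition only; the reported speedup comes from dedicated FP4 Tensor Core kernels on Blackwell, which this simulation does not use.

```python
import torch

# Representable magnitudes of the FP4 E2M1 format (zero plus positive values);
# the full grid adds their negatives.
FP4_E2M1_VALUES = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = torch.cat([-FP4_E2M1_VALUES.flip(0)[:-1], FP4_E2M1_VALUES])


def fake_quant_fp4(x: torch.Tensor, block_size: int = 32) -> torch.Tensor:
    """Simulate per-block FP4 quantization: each block of `block_size` elements
    shares one scale so its largest magnitude maps to the largest FP4 value (6.0)."""
    orig_shape = x.shape
    x = x.reshape(-1, block_size)
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 6.0
    # Snap each scaled element to the nearest representable FP4 value.
    grid = FP4_GRID.to(x.device, x.dtype)
    idx = ((x / scale).unsqueeze(-1) - grid).abs().argmin(dim=-1)
    return (grid[idx] * scale).reshape(orig_shape)


def fp4_sim_attention(q, k, v, block_size: int = 32):
    """Attention with Q and K fake-quantized to FP4 before the QK^T product."""
    d = q.shape[-1]
    qq, kq = fake_quant_fp4(q, block_size), fake_quant_fp4(k, block_size)
    scores = qq @ kq.transpose(-2, -1) / d**0.5
    return torch.softmax(scores, dim=-1) @ v


if __name__ == "__main__":
    q, k, v = (torch.randn(1, 8, 1024, 64) for _ in range(3))
    out = fp4_sim_attention(q, k, v)
    ref = torch.softmax(q @ k.transpose(-2, -1) / 64**0.5, dim=-1) @ v
    print("max abs error vs. full precision:", (out - ref).abs().max().item())
```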
Low-bit attention is introduced into training tasks, and its effectiveness in both forward and backward propagation is explored.
Experiments show that 8-bit attention achieves lossless performance in fine-tuning tasks but converges more slowly in pretraining tasks.
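As a rough illustration of how 8-bit attention can be exercised during training, the sketch below passes Q, K, V, and the attention weights through per-tensor INT8 fake quantization in the forward pass and uses a straight-through estimator so gradients flow in backward propagation. The per-tensor granularity and the straight-through gradient are assumptions made for this example, not the actual training kernels.

```python
import torch


class Int8FakeQuant(torch.autograd.Function):
    """Per-tensor symmetric INT8 fake quantization with a straight-through gradient."""

    @staticmethod
    def forward(ctx, x):
        scale = x.abs().amax().clamp(min=1e-8) / 127.0
        return torch.round(x / scale).clamp(-127, 127) * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out  # straight-through estimator: pass the gradient unchanged


def int8_attention(q, k, v):
    """Attention whose inputs and attention weights see INT8 quantization error,
    so both forward and backward propagation are affected."""
    d = q.shape[-1]
    qq, kq, vq = (Int8FakeQuant.apply(t) for t in (q, k, v))
    p = torch.softmax(qq @ kq.transpose(-2, -1) / d**0.5, dim=-1)
    return Int8FakeQuant.apply(p) @ vq


if __name__ == "__main__":
    q, k, v = (torch.randn(1, 4, 256, 64, requires_grad=True) for _ in range(3))
    out = int8_attention(q, k, v)
    out.sum().backward()  # gradients reach q, k, v via the straight-through estimator
    print(q.grad.shape, k.grad.shape, v.grad.shape)
```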