The efficiency of attention is crucial because its time complexity is quadratic in sequence length.
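For reference, the quadratic cost comes from the two $N \times N$ matrix products in standard attention (notation assumed here: sequence length $N$, head dimension $d$):

$$
O = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V,
\qquad QK^{\top} \in \mathbb{R}^{N \times N}
\;\Rightarrow\; \text{time} = O(N^{2} d).
$$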
Attention efficiency is improved by leveraging the new FP4 Tensor Cores in Blackwell GPUs, yielding a 5x speedup over the fastest FlashAttention on RTX5090.
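To make the idea concrete, below is a minimal PyTorch sketch that simulates per-block FP4 (E2M1) quantization of Q and K before the QK^T product. The block size, scale handling, and function names are illustrative assumptions for exposition only; the reported speedup comes from dedicated FP4 Tensor Core kernels on Blackwell, which this simulation does not use.

```python
import torch

# Representable magnitudes of the FP4 E2M1 format (zero plus positive values);
# the full grid adds their negatives.
FP4_E2M1_VALUES = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = torch.cat([-FP4_E2M1_VALUES.flip(0)[:-1], FP4_E2M1_VALUES])


def fake_quant_fp4(x: torch.Tensor, block_size: int = 32) -> torch.Tensor:
    """Simulate per-block FP4 quantization: each block of `block_size` elements
    shares one scale so its largest magnitude maps to the largest FP4 value (6.0)."""
    orig_shape = x.shape
    x = x.reshape(-1, block_size)
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 6.0
    # Snap each scaled element to the nearest representable FP4 value.
    grid = FP4_GRID.to(x.device, x.dtype)
    idx = ((x / scale).unsqueeze(-1) - grid).abs().argmin(dim=-1)
    return (grid[idx] * scale).reshape(orig_shape)


def fp4_sim_attention(q, k, v, block_size: int = 32):
    """Attention with Q and K fake-quantized to FP4 before the QK^T product."""
    d = q.shape[-1]
    qq, kq = fake_quant_fp4(q, block_size), fake_quant_fp4(k, block_size)
    scores = qq @ kq.transpose(-2, -1) / d**0.5
    return torch.softmax(scores, dim=-1) @ v


if __name__ == "__main__":
    q, k, v = (torch.randn(1, 8, 1024, 64) for _ in range(3))
    out = fp4_sim_attention(q, k, v)
    ref = torch.softmax(q @ k.transpose(-2, -1) / 64**0.5, dim=-1) @ v
    print("max abs error vs. full precision:", (out - ref).abs().max().item())
```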
Low-bit attention is introduced into training tasks, and its effectiveness in both forward and backward propagation is explored.
Experiments show that 8-bit attention achieves lossless performance in fine-tuning tasks but converges more slowly in pretraining tasks.
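As a rough illustration of how 8-bit attention can be exercised during training, the sketch below passes Q, K, V, and the attention weights through per-tensor INT8 fake quantization in the forward pass and uses a straight-through estimator so gradients flow in backward propagation. The per-tensor granularity and the straight-through gradient are assumptions made for this example, not the actual training kernels.

```python
import torch


class Int8FakeQuant(torch.autograd.Function):
    """Per-tensor symmetric INT8 fake quantization with a straight-through gradient."""

    @staticmethod
    def forward(ctx, x):
        scale = x.abs().amax().clamp(min=1e-8) / 127.0
        return torch.round(x / scale).clamp(-127, 127) * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out  # straight-through estimator: pass the gradient unchanged


def int8_attention(q, k, v):
    """Attention whose inputs and attention weights see INT8 quantization error,
    so both forward and backward propagation are affected."""
    d = q.shape[-1]
    qq, kq, vq = (Int8FakeQuant.apply(t) for t in (q, k, v))
    p = torch.softmax(qq @ kq.transpose(-2, -1) / d**0.5, dim=-1)
    return Int8FakeQuant.apply(p) @ vq


if __name__ == "__main__":
    q, k, v = (torch.randn(1, 4, 256, 64, requires_grad=True) for _ in range(3))
    out = int8_attention(q, k, v)
    out.sum().backward()  # gradients reach q, k, v via the straight-through estimator
    print(q.grad.shape, k.grad.shape, v.grad.shape)
```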