Researchers have developed subquadratic algorithms for computing Attention in Transformers with head dimension d = Theta(log n).
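For concreteness, the computational problem in question can be stated as follows; this is one common formulation, and the exact normalization may vary across papers. Given matrices Q, K, V in R^{n x d}, compute

\[
  \mathrm{Att}(Q,K,V) = D^{-1} A V,
  \qquad
  A = \exp\!\bigl(QK^{\top}\bigr) \in \mathbb{R}^{n \times n},
  \qquad
  D = \mathrm{diag}\!\bigl(A \mathbf{1}_n\bigr),
\]

where exp is applied entrywise. Forming A explicitly already takes Theta(n^2 d) time, which is the quadratic barrier the subquadratic algorithms aim to avoid.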
Subquadratic computation of Attention is feasible when the input matrices have entries bounded in magnitude by B = o(sqrt(log n)), or when the softmax is applied at sufficiently high temperature, again for d = Theta(log n).
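The temperature condition can be read as a rescaling of the same bounded-entry condition; as a sketch, applying the softmax at temperature tau replaces the exponent QK^T by QK^T/tau:

\[
  \mathrm{Att}_{\tau}(Q,K,V) = D_{\tau}^{-1} \exp\!\bigl(QK^{\top}/\tau\bigr) V,
  \qquad
  D_{\tau} = \mathrm{diag}\!\bigl(\exp\!\bigl(QK^{\top}/\tau\bigr)\mathbf{1}_n\bigr),
\]

so a large tau shrinks the magnitude of the exponent's entries and plays much the same role as a small entry bound B.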
Efficient computation of Attention without strong assumptions on the temperature has also been explored, with subquadratic algorithms presented for constant head dimension d = O(1).
The study concludes that in certain parameter regimes the standard quadratic-time algorithm for Attention is essentially optimal under fine-grained complexity assumptions.
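For reference, the "standard algorithm" referred to here is the direct evaluation of the formula above, which costs Theta(n^2 d) time and Theta(n^2) space. The following is a minimal NumPy sketch; the code is illustrative only and is not drawn from any of the cited works, and the function name and interface are invented for this example.

import numpy as np

def standard_attention(Q, K, V, tau=1.0):
    # Naive softmax attention: Theta(n^2 d) time, Theta(n^2) space.
    # Q, K, V are (n, d) arrays; tau is the softmax temperature.
    scores = Q @ K.T / tau                        # (n, n) dot-product matrix: Theta(n^2 d) work
    scores -= scores.max(axis=1, keepdims=True)   # numerical-stability shift; final output is unchanged
    A = np.exp(scores)                            # entrywise exponential
    D_inv = 1.0 / A.sum(axis=1, keepdims=True)    # row normalization, i.e. diag(A 1_n)^{-1}
    return (D_inv * A) @ V                        # (n, d) output: another Theta(n^2 d) work

# Example usage on random inputs.
n, d = 1024, 16
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = standard_attention(Q, K, V)
print(out.shape)  # (1024, 16)

The two matrix products dominate the running time, which is why lower bounds in this line of work target the n^2 term rather than the dependence on d.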