Second-order optimization methods such as KFAC can converge faster than first-order methods by exploiting curvature information about the loss landscape.
MAC, a computationally efficient optimization method, is derived from an analysis of the components of the layer-wise Fisher information matrix (FIM) used in KFAC.
MAC is unique in applying Kronecker factorization to the FIM of attention layers in transformers and in integrating attention scores into its preconditioning.
Extensive evaluations show that MAC outperforms KFAC and other methods in terms of accuracy, training time, and memory usage across various network architectures and datasets.
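
For orientation, the Kronecker-factored preconditioning that KFAC applies to a fully connected layer can be sketched as follows. This is a minimal NumPy illustration of standard KFAC, not of MAC itself. The function name `kfac_precondition`, the `damping` value, and the array shapes are assumptions made for the example.

```python
import numpy as np


def kfac_precondition(grad_w, acts, grad_out, damping=1e-3):
    """Precondition a linear layer's weight gradient with Kronecker factors.

    grad_w:   (out_dim, in_dim) gradient of the loss w.r.t. the weights
    acts:     (batch, in_dim) inputs to the layer
    grad_out: (batch, out_dim) gradients w.r.t. the layer outputs
    damping:  assumed Tikhonov damping added to both factors
    """
    batch = acts.shape[0]
    # Layer-wise FIM is approximated as kron(A, G).
    A = acts.T @ acts / batch          # input second-moment factor, (in, in)
    G = grad_out.T @ grad_out / batch  # output-gradient factor, (out, out)
    A_damped = A + damping * np.eye(A.shape[0])
    G_damped = G + damping * np.eye(G.shape[0])
    # Identity used: kron(A, G)^{-1} vec(grad_w) = vec(G^{-1} grad_w A^{-1}),
    # so the full (in*out) x (in*out) matrix is never formed.
    return np.linalg.solve(G_damped, grad_w) @ np.linalg.inv(A_damped)
```

The point of the factorization is cost: only the two small factors are inverted, rather than one matrix whose side length is the number of weights in the layer.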