<ul><li>Second-order optimization methods such as KFAC achieve superior convergence by using curvature information about the loss landscape.</li><li>MAC, a computationally efficient optimization method, is derived by analyzing the components of the layer-wise Fisher information matrix (FIM) used in KFAC.</li><li>MAC is unique in applying Kronecker factorization to the FIM of attention layers in transformers and in integrating attention scores into the preconditioning.</li><li>Extensive evaluations show that MAC outperforms KFAC and other methods in accuracy, training time, and memory usage across various network architectures and datasets.</li></ul>
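
For readers unfamiliar with the KFAC baseline mentioned above, the following is a minimal NumPy sketch of standard Kronecker-factored preconditioning for a single linear layer. It is not MAC and is not taken from the paper; the function name `kfac_precondition`, its arguments, and the damping value are assumptions made for illustration only.

```python
import numpy as np


def kfac_precondition(grad_w, acts, grad_out, damping=1e-3):
    """Illustrative KFAC-style preconditioning for one linear layer y = W a.

    grad_w   : (out, in) gradient of the loss w.r.t. W
    acts     : (batch, in) layer inputs (activations)
    grad_out : (batch, out) per-example gradients w.r.t. the layer outputs
    damping  : Tikhonov damping added to both factors (hypothetical default)
    """
    batch = acts.shape[0]

    # Kronecker factors of the layer-wise FIM approximation F ~ A (x) S.
    A = acts.T @ acts / batch          # (in, in) activation covariance
    S = grad_out.T @ grad_out / batch  # (out, out) output-gradient covariance

    A_d = A + damping * np.eye(A.shape[0])
    S_d = S + damping * np.eye(S.shape[0])

    # Apply (A_d (x) S_d)^{-1} to vec(grad_w) without forming the big matrix:
    # the result equals S_d^{-1} @ grad_w @ A_d^{-1}.
    left = np.linalg.solve(S_d, grad_w)    # S_d^{-1} G
    return np.linalg.solve(A_d, left.T).T  # (S_d^{-1} G) A_d^{-1}; A_d is symmetric


# Small usage example on random data.
rng = np.random.default_rng(0)
acts = rng.normal(size=(32, 4))
grad_out = rng.normal(size=(32, 3))
grad_w = grad_out.T @ acts / acts.shape[0]
update = kfac_precondition(grad_w, acts, grad_out, damping=1e-2)
print(update.shape)
```

The point of the factorization is cost: two small systems of size `in` and `out` are solved instead of one system of size `in * out`, which is the computational burden that methods in this family try to reduce further.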

MAC: An Efficient Gradient Preconditioning using Mean Activation Approximated Curvature
