The success of Shampoo in the AlgoPerf contest has led to a resurgence of interest in Kronecker-factorization-based optimization algorithms for training neural networks.
However, at scale Shampoo relies on heuristics such as learning rate grafting and stale preconditioning, which add complexity, require additional hyperparameter tuning, and lack solid theoretical backing.
This study examines these heuristics through the lens of the Frobenius-norm approximation to full-matrix Adam, decoupling the updates of the preconditioner's eigenvalues from those of its eigenbasis.
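As a concrete illustration of this decoupling, the following NumPy sketch (illustrative only; the toy gradient and the names L, R, Q_L, lam_L are assumptions, not the paper's code) shows how the Kronecker-factored preconditioner for a single weight matrix splits into an eigenbasis rotation and an eigenvalue scaling that can be tracked and updated separately.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 4
G = rng.standard_normal((m, n))      # gradient of a single m x n weight matrix

# Shampoo's Kronecker factors: accumulated gradient outer products.
L = G @ G.T + 1e-6 * np.eye(m)       # left factor  (m x m)
R = G.T @ G + 1e-6 * np.eye(n)       # right factor (n x n)

# Decouple each factor into an eigenbasis (rotation) and eigenvalues (scaling).
lam_L, Q_L = np.linalg.eigh(L)
lam_R, Q_R = np.linalg.eigh(R)

# Shampoo's update L^{-1/4} G R^{-1/4}, written so that the eigenbasis and the
# eigenvalue scaling appear as separate, independently refreshable steps:
G_rot    = Q_L.T @ G @ Q_R                              # rotate into eigenbasis
G_scaled = G_rot / np.outer(lam_L**0.25, lam_R**0.25)   # scale by eigenvalues
update   = Q_L @ G_scaled @ Q_R.T                       # rotate back
```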
It shows how grafting from Adam corrects the staleness and mis-scaling of the preconditioner's eigenvalues, removing the need for learning rate grafting, and proposes adaptive criteria for how often the eigenbasis is recomputed, keeping the resulting approximation error in check and improving convergence.
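A minimal sketch of both ideas, continuing the single-matrix example above; the helper names (adam_scale_in_basis, basis_is_stale), the tolerance tol, and the off-diagonal staleness test are illustrative assumptions under this reading of the abstract, not the paper's exact method.

```python
import numpy as np

def adam_scale_in_basis(G, Q_L, Q_R, v, beta2=0.999, eps=1e-8):
    """Maintain an Adam-style second moment of the gradient rotated into the
    current eigenbasis; its values serve as freshly estimated eigenvalue
    scalings, so stale factor eigenvalues no longer set the step size and no
    separate learning rate grafting is needed (illustrative sketch)."""
    G_rot = Q_L.T @ G @ Q_R                    # gradient in the current eigenbasis
    v[:] = beta2 * v + (1.0 - beta2) * G_rot**2
    return Q_L @ (G_rot / (np.sqrt(v) + eps)) @ Q_R.T

def basis_is_stale(factor, Q, tol=0.1):
    """Adaptive refresh criterion (one plausible choice): recompute the
    eigenbasis once the factor, expressed in the current basis, carries too
    much off-diagonal mass relative to its total Frobenius norm."""
    M = Q.T @ factor @ Q
    off_diag = M - np.diag(np.diag(M))
    return np.linalg.norm(off_diag) > tol * np.linalg.norm(M)

# In a training loop, one would call np.linalg.eigh to refresh Q_L (or Q_R)
# only when basis_is_stale(L, Q_L) fires, amortizing the eigendecomposition.
```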