Understanding the efficacy of Adam in training transformer-based language models is a key focus for the optimization community.
To gain deeper insight, multiple simplifications of Adam have been proposed, including signed-gradient and signed-momentum methods.
An empirical study training over 1,300 language models compared Adam to these simplified variants, revealing that constraining Adam's two momentum parameters to be equal is a promising route to near-optimal performance. This constrained Adam variant not only delivers robust performance but also yields new theoretical insight: it implements a natural online algorithm for estimating the mean and variance of the gradients.
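To make the mean-and-variance interpretation concrete, the following is a minimal sketch of an Adam-style update with the two momentum parameters tied to a single value β. The function name, hyperparameter defaults, and state layout are illustrative assumptions, not the paper's exact algorithm; the point is only that with equal betas the two exponential moving averages read directly as online estimates of E[g] and E[g²], so v − m² estimates the gradient variance.

```python
import torch


def constrained_adam_step(param, grad, state, lr=1e-3, beta=0.95, eps=1e-8):
    """One Adam-style update with beta1 = beta2 = beta (illustrative sketch).

    With equal betas the moving averages can be read as online moment estimates:
      m        ~ E[g]      (mean of the gradient)
      v        ~ E[g^2]    (second moment)
      v - m**2 ~ Var[g]    (variance of the gradient)
    """
    m = state.setdefault("m", torch.zeros_like(param))
    v = state.setdefault("v", torch.zeros_like(param))
    state["t"] = t = state.get("t", 0) + 1

    # Shared decay rate for both moment estimates (the beta1 = beta2 constraint).
    m.mul_(beta).add_(grad, alpha=1 - beta)
    v.mul_(beta).addcmul_(grad, grad, value=1 - beta)

    # Bias correction is identical for both averages since the betas are equal.
    bias_correction = 1 - beta ** t
    m_hat = m / bias_correction
    v_hat = v / bias_correction

    # Adam-style step: roughly (estimated mean) / sqrt(estimated second moment).
    param.add_(m_hat / (v_hat.sqrt() + eps), alpha=-lr)
    return param


# Usage sketch on a synthetic noisy-gradient stream.
p = torch.zeros(10)
state = {}
for _ in range(100):
    g = 0.5 + 0.1 * torch.randn(10)  # noisy gradients with nonzero mean
    constrained_adam_step(p, g, state)
```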