Recent empirical evidence shows that gradient noise in machine learning is often heavy-tailed, challenging the bounded-variance assumption that underlies standard analyses of stochastic optimization.
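A common formalization of this setting, stated here as background since the summary itself does not spell it out, replaces the bounded-variance condition with a bounded p-th central moment of the gradient noise:

```latex
% Classical assumption: the stochastic gradient g(x) has bounded variance,
%   E[ ||g(x) - \nabla f(x)||^2 ] <= \sigma^2 .
% Heavy-tailed relaxation: only a p-th central moment is bounded, with p in (1, 2]:
\mathbb{E}\bigl[\, \| g(x) - \nabla f(x) \|^{p} \,\bigr] \;\le\; \sigma^{p},
\qquad p \in (1, 2].
```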
Gradient clipping is the most common remedy for heavy-tailed noise, but existing theoretical analyses have notable limitations: they rely on large clipping thresholds and yield sub-optimal sample complexities.
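For reference, a minimal sketch of a generic clipped-SGD update follows; the function names, threshold, and step size are illustrative assumptions, not the specific method or constants analyzed in the paper.

```python
import numpy as np

def clipped_sgd_step(x, stoch_grad, step_size=0.01, clip_threshold=1.0):
    """One generic clipped-SGD update: if the stochastic gradient is longer
    than clip_threshold, rescale it down to that length, then step."""
    g = stoch_grad(x)
    g_norm = np.linalg.norm(g)
    if g_norm > clip_threshold:
        g = g * (clip_threshold / g_norm)  # clip: shrink the gradient to the threshold
    return x - step_size * g
```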
Normalized SGD (NSGD) is analyzed as an alternative that overcomes these issues, establishing a parameter-free sample complexity that requires no knowledge of problem parameters, and improved convergence rates when those parameters are known.
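For contrast, here is an equally minimal sketch of a plain normalized-SGD update; again, the interface and step size are assumptions for illustration, not the exact algorithm from the paper.

```python
import numpy as np

def nsgd_step(x, stoch_grad, step_size=0.01, eps=1e-12):
    """One generic normalized-SGD update: divide the stochastic gradient by
    its norm, so every step has length (approximately) step_size regardless
    of how large individual heavy-tailed gradients are."""
    g = stoch_grad(x)
    return x - step_size * g / (np.linalg.norm(g) + eps)  # eps guards against a zero gradient
```

Intuitively, because the update direction has unit norm, the step length is controlled by the step size alone, which is one way to avoid tuning a clipping threshold to an unknown noise scale.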
The analysis of NSGD yields improved sample complexities that match lower bounds for first-order methods, and it establishes high-probability convergence with only a mild dependence on the failure probability.