Training large language models (LLMs) is challenging because of their massive scale and complex architectures.
This work introduces Scaling with Gradient Grouping (SGG), an optimizer wrapper designed to improve adaptive learning rate estimation.
SGG groups the gradient statistics within each layer into clusters, applies cluster-specific scaling, and uses the result to calibrate a learning rate for each parameter.
Experiments indicate that SGG integrates seamlessly with existing optimizers and delivers consistent gains, faster convergence, and stable behavior across a range of batch sizes and learning rates in LLM optimization.
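To make the grouping-and-scaling idea concrete, the following is a minimal sketch, not the authors' implementation: it assumes per-parameter gradient magnitude as the layer-wise statistic and a simple 1-D k-means as the clustering step, then uses the cluster means to calibrate per-parameter learning rates. The function name `sgg_like_scaling` and all hyperparameters are illustrative assumptions.

```python
import numpy as np

def sgg_like_scaling(grad, n_clusters=2, base_lr=1e-3, alpha=0.5):
    """Illustrative per-layer grouping of gradient statistics (assumed variant, not the paper's exact method).

    grad: gradient array for one layer (any shape).
    Returns a per-parameter learning-rate map with the same shape as grad.
    """
    # Per-parameter statistic: gradient magnitude (an assumption for this sketch).
    stats = np.abs(grad).ravel()

    # Simple 1-D k-means over the statistics, standing in for the clustering step.
    centers = np.quantile(stats, np.linspace(0.1, 0.9, n_clusters))
    for _ in range(10):
        assign = np.argmin(np.abs(stats[:, None] - centers[None, :]), axis=1)
        for k in range(n_clusters):
            if np.any(assign == k):
                centers[k] = stats[assign == k].mean()

    # Cluster-specific scaling: pull each parameter's statistic toward its cluster mean,
    # then convert the scaled statistic into a per-parameter learning rate.
    cluster_mean = centers[assign]
    scaled = (1 - alpha) * stats + alpha * cluster_mean
    lr = base_lr * scaled / (stats.mean() + 1e-12)
    return lr.reshape(grad.shape)

# Example usage: calibrate rates for one layer and take a plain SGD-style step.
g = np.random.randn(512, 512) * 0.01
per_param_lr = sgg_like_scaling(g)
update = -per_param_lr * g
```

In an actual training loop, such calibrated rates would be applied on top of a base optimizer's update rather than replacing it, which is the sense in which SGG acts as a wrapper around existing optimizers.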