Marktechpost

Moonshot AI and UCLA Researchers Release Moonlight: A 3B/16B-Parameter Mixture-of-Expert (MoE) Model Trained with 5.7T Tokens Using Muon Optimizer

  • Moonlight is a Mixture-of-Experts (MoE) model developed by Moonshot AI and UCLA, trained with the Muon optimizer to address challenges in large language model training.
  • Muon addresses issues like vanishing/exploding gradients, inconsistent updates, and resource demands in training models with billions of parameters and trillions of tokens.
  • The Muon optimizer applies matrix orthogonalization through Newton-Schulz iterations to keep gradient updates uniform across the model (a sketch of this step follows the list).
  • Technical adjustments to Muon include adding weight decay and scaling per-parameter updates so they align with AdamW's behavior (also sketched below).
  • Muon's distributed implementation reduces memory overhead and communication costs in large-scale training environments.
  • Empirical evaluations show that Moonlight trained with Muon outperformed other models in language understanding and code generation tasks.
  • Scaling law experiments demonstrate Muon's ability to match AdamW performance with reduced computational cost.
  • Moonlight's training with Muon leads to a diverse range of singular values in weight matrices, aiding generalization across tasks (a small diagnostic sketch follows the list).
  • The project demonstrates improvements in training efficiency and stability, providing a viable alternative to traditional optimization methods.
  • The open-sourcing of Muon implementation is expected to encourage further research into scalable optimization techniques for large language models.
  • Transitioning from AdamW to Muon does not require extensive tuning, simplifying the integration process for researchers.
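
The orthogonalization step referenced above can be illustrated with a short sketch. The snippet below is a minimal, hypothetical PyTorch implementation of a Newton-Schulz iteration that pushes the singular values of an update matrix toward 1; it uses the classical cubic polynomial for readability, whereas real Muon implementations use a tuned higher-order polynomial, so the function name, coefficients, and step count here are assumptions rather than the released code.

```python
import torch

def newton_schulz_orthogonalize(M: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize a 2-D update matrix (illustrative sketch).

    The matrix is first normalized so its spectral norm is at most 1, then the
    cubic Newton-Schulz polynomial X <- 1.5*X - 0.5*(X @ X.T) @ X is applied a
    few times, which drives every singular value toward 1 while preserving the
    singular vectors. Textbook coefficients are used here for clarity.
    """
    X = M / (M.norm() + eps)                 # Frobenius norm upper-bounds the spectral norm
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                              # iterate on the wide orientation so X @ X.T stays small
    for _ in range(steps):
        A = X @ X.T
        X = 1.5 * X - 0.5 * A @ X
    return X.T if transposed else X
```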
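The weight-decay and update-scaling adjustments mentioned above can then be sketched as a single optimizer step. This is an assumed composition for illustration, not the released Moonlight optimizer: the hyperparameter defaults, the decoupled weight decay, and the 0.2 * sqrt(max(rows, cols)) factor used to bring the orthogonalized update toward AdamW-like per-element magnitudes are all placeholders.

```python
def muon_style_step(param: torch.Tensor,
                    grad: torch.Tensor,
                    momentum: torch.Tensor,
                    lr: float = 1e-2,
                    beta: float = 0.95,
                    weight_decay: float = 0.1) -> None:
    """Apply one illustrative Muon-style update to a 2-D weight matrix, in place.

    Momentum is accumulated on the raw gradient, the momentum matrix is
    orthogonalized, the result is rescaled toward AdamW-like update size,
    and decoupled weight decay is applied. All constants are assumptions.
    """
    momentum.mul_(beta).add_(grad)                   # heavy-ball momentum buffer
    update = newton_schulz_orthogonalize(momentum)   # equalize singular values
    scale = 0.2 * max(param.shape) ** 0.5            # assumed AdamW-matching scale
    param.mul_(1.0 - lr * weight_decay)              # decoupled weight decay
    param.add_(update, alpha=-lr * scale)            # scaled orthogonal update
```

In a real training loop, Muon-style updates are typically restricted to 2-D weight matrices, with embeddings, norms, and scalar parameters left to AdamW, which is consistent with the low-tuning migration path described in the summary.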
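The singular-value observation above can also be checked directly on a checkpoint. The helper below is a small, self-contained diagnostic, not something taken from the paper or repository: it reports the normalized entropy of a weight matrix's singular values, where values closer to 1 indicate the spectrum is spread over many directions.

```python
def svd_entropy(weight: torch.Tensor) -> float:
    """Normalized entropy of a 2-D matrix's singular-value distribution.

    Returns a value in [0, 1]; higher means the matrix's energy is spread
    across many singular directions rather than concentrated in a few.
    """
    s = torch.linalg.svdvals(weight)
    p = s / s.sum()
    entropy = -(p * (p + 1e-12).log()).sum()
    return (entropy / torch.log(torch.tensor(float(s.numel())))).item()
```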
