Marktechpost

Microsoft Releases GRIN MoE: A Gradient-Informed Mixture of Experts (MoE) Model for Efficient and Scalable Deep Learning

  • Microsoft has developed the Gradient-Informed Mixture of Experts (GRIN MoE) to make deep-learning models more efficient and scalable.
  • Existing models such as GPT-3 and GPT-4 are resource-heavy, while sparse models like GShard and Switch Transformers rely on token dropping to balance load across their experts.
  • GRIN addresses these shortcomings through its routing mechanism: only the top-two experts are assigned to each input token, which keeps computation efficient and scalable.
  • Researchers tested GRIN MoE against comparable models across a range of tasks, and it matched or surpassed them while activating fewer parameters.
  • On the MMLU benchmark, GRIN MoE scored 79.4; on HumanEval, it scored 74.4 on coding problems; and on HellaSwag, it scored 83.7.
  • Architecturally, GRIN MoE stacks MoE layers with 16 experts each, routed by a gating mechanism; a key component, SparseMixer-v2, estimates gradients related to expert routing (a minimal routing sketch follows this list).
  • GRIN MoE uses only 6.6 billion activated parameters during inference, but it still outperforms competing models.
  • GRIN also improves training efficiency: when trained on 64 H100 GPUs, it achieved 86.56% throughput, faster than previous comparable models, while maintaining accuracy.
  • The researchers' work on GRIN presents a scalable solution for developing high-performing models that can be used in natural language processing, mathematics, coding and more.
  • GRIN MoE marks a significant step forward in artificial intelligence (AI) research, paving the way for increasingly efficient, scalable, and high-performing models.
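
The top-two routing described above can be made concrete with a small sketch. Below is a minimal, illustrative top-2 MoE layer in PyTorch; the class name, layer sizes, and feed-forward expert definition are assumptions made for illustration and do not reproduce Microsoft's GRIN MoE implementation, which additionally relies on SparseMixer-v2 to estimate gradients through the discrete expert selection rather than the plain top-k gating shown here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoELayer(nn.Module):
    """Sketch of a Mixture-of-Experts layer with 16 experts and top-2 routing."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 16, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # gating network: scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        logits = self.router(x)                                    # (num_tokens, num_experts)
        weights, indices = torch.topk(logits, self.top_k, dim=-1)  # keep only the top-2 experts per token
        weights = F.softmax(weights, dim=-1)                       # renormalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e                       # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Usage: only 2 of the 16 experts run for any given token, so the activated
# parameter count is a small fraction of the layer's total parameters.
layer = Top2MoELayer(d_model=512, d_hidden=2048)
tokens = torch.randn(8, 512)
print(layer(tokens).shape)  # torch.Size([8, 512])
```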
