Microsoft has developed the Gradient-Informed Mixture of Experts (GRIN MoE) to make deep-learning models more efficient and scalable.
Existing models such as GPT-3 and GPT-4 are resource-heavy, while sparse alternatives like GShard and Switch Transformers rely on token dropping to balance the load across their experts.
GRIN addresses these shortcomings through its routing mechanism: each input token is routed to only its top two experts, keeping computation sparse and therefore more efficient and scalable.
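To make the top-two idea concrete, below is a minimal sketch of a generic top-2 MoE layer in PyTorch. The class name `Top2MoELayer`, the layer sizes, and the per-expert loop are illustrative assumptions for readability; this is not GRIN MoE's actual implementation, which uses far more efficient batched dispatch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoELayer(nn.Module):
    """Minimal top-2 mixture-of-experts layer (illustrative sketch, not GRIN's code)."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 16):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        logits = self.router(x)                          # (num_tokens, num_experts)
        probs = F.softmax(logits, dim=-1)
        top2_probs, top2_idx = probs.topk(2, dim=-1)     # keep only the two best experts per token
        top2_probs = top2_probs / top2_probs.sum(dim=-1, keepdim=True)  # renormalise gate weights

        out = torch.zeros_like(x)
        for slot in range(2):
            for e, expert in enumerate(self.experts):
                mask = top2_idx[:, slot] == e            # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += top2_probs[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Usage: route 8 tokens of width 64 through 16 experts, 2 active per token.
layer = Top2MoELayer(d_model=64, d_hidden=256)
tokens = torch.randn(8, 64)
print(layer(tokens).shape)  # torch.Size([8, 64])
```

Production systems batch tokens per expert rather than looping, but the routing logic itself is the same: score all experts, keep the top two, re-weight their gates, and sum the two expert outputs.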
The researchers evaluated GRIN MoE against a range of models across diverse tasks, and it achieved impressive results, outperforming some competitors and matching others while activating far fewer parameters.
On the MMLU benchmark, GRIN MoE scored 79.4, and on the HumanEval coding benchmark it scored 74.4. It also performed strongly on HellaSwag, reaching a score of 83.7.
Architecturally, GRIN MoE combines MoE layers, each containing 16 experts, with its top-two routing mechanism. A key component is SparseMixer-v2, which estimates the gradient of the discrete expert-routing decision so that the router itself can be trained effectively.
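The routing decision is discrete, which would normally block gradient flow back to the router. The snippet below uses a simple straight-through estimator to illustrate the general idea of pushing a useful gradient through a hard expert selection; it is an expository stand-in only, since SparseMixer-v2 uses a different, more principled estimator described in the GRIN paper.

```python
import torch
import torch.nn.functional as F

def straight_through_top1(logits: torch.Tensor) -> torch.Tensor:
    """Hard one-hot expert selection whose backward pass uses the softmax gradient.

    Illustrative stand-in only: GRIN MoE's SparseMixer-v2 applies a different,
    more accurate gradient estimator to the discrete routing decision.
    """
    probs = F.softmax(logits, dim=-1)
    hard = F.one_hot(probs.argmax(dim=-1), num_classes=logits.size(-1)).to(probs.dtype)
    # Forward pass returns the hard selection; backward pass sees the gradient of `probs`.
    return hard + probs - probs.detach()

logits = torch.randn(4, 16, requires_grad=True)   # 4 tokens, 16 experts
gates = straight_through_top1(logits)
loss = (gates * torch.randn(4, 16)).sum()         # pretend downstream loss
loss.backward()
print(logits.grad.shape)                          # torch.Size([4, 16]) -- the router now receives a gradient
```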
GRIN MoE activates only 6.6 billion parameters per token during inference, yet it still outperforms competing models.
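The gap between stored and activated parameters is simple arithmetic: with top-2 routing over 16 experts, each token touches only two experts' feed-forward weights plus the weights shared by every token (attention, embeddings, norms). The per-expert and shared counts below are assumed round numbers chosen for illustration, not the paper's actual breakdown; only the 16-expert, top-2 configuration and the 6.6B active total come from the model description.

```python
# Rough, illustrative accounting of activated vs. stored parameters in a top-2 MoE.
# The per-expert and shared counts are assumed round numbers, not GRIN MoE's real breakdown.

NUM_EXPERTS = 16           # experts per MoE layer
ACTIVE_EXPERTS = 2         # top-2 routing
per_expert_params = 2.2e9  # assumed: parameters inside one expert's feed-forward blocks
shared_params = 2.2e9      # assumed: attention, embeddings, norms shared by every token

total_params = shared_params + NUM_EXPERTS * per_expert_params
active_params = shared_params + ACTIVE_EXPERTS * per_expert_params

print(f"stored: {total_params/1e9:.1f}B")   # 37.4B under these assumed counts
print(f"active: {active_params/1e9:.1f}B")  # 6.6B used per token
print(f"expert weights touched per token: {ACTIVE_EXPERTS/NUM_EXPERTS:.1%}")  # 12.5%
```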
GRIN also improves training efficiency: when trained on 64 H100 GPUs, it sustained an 86.56% relative training throughput while maintaining accuracy.
The researchers' work on GRIN presents a scalable solution for developing high-performing models that can be used in natural language processing, mathematics, coding and more.
GRIN MoE marks a significant step forward in artificial intelligence (AI) research, paving the way for increasingly efficient, scalable, and high-performing models.