Mixture of Group Experts (MoGE) has been introduced as a new perspective on Mixture-of-Experts (MoE) models with top-k routing, addressing limitations of vanilla MoE models.
MoGE applies group sparse regularization to routing inputs, organizing them into a 2D topographic map that enhances expert diversity and specialization and improves performance on tasks such as image classification and language modeling.
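The source code is referenced but not shown here; the following is a minimal sketch, assuming a PyTorch-style top-k router, of how a group sparse (group-lasso) penalty might be applied to routing scores laid out on a 2D expert grid. The class `GroupSparseRouter`, the parameters `grid_size` and `top_k`, and the choice of rows as groups are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GroupSparseRouter(nn.Module):
    """Illustrative top-k router whose experts are arranged on a 2D grid.

    A group-lasso (sum-of-L2-norms) penalty over groups of the grid encourages
    each token to concentrate its routing mass in a few groups, which is one
    plausible reading of "group sparse regularization" for routing inputs.
    """

    def __init__(self, d_model: int, grid_size: int = 4, top_k: int = 2):
        super().__init__()
        self.grid_size = grid_size                 # experts form a grid_size x grid_size map
        self.num_experts = grid_size * grid_size
        self.top_k = top_k
        self.gate = nn.Linear(d_model, self.num_experts, bias=False)

    def forward(self, x: torch.Tensor):
        # x: (batch, d_model) -> routing probabilities over all experts
        logits = self.gate(x)                      # (B, E)
        probs = F.softmax(logits, dim=-1)

        # Standard top-k routing: keep the k largest gate values per token.
        topk_vals, topk_idx = probs.topk(self.top_k, dim=-1)

        # Group-lasso penalty: view the gates as a 2D map and sum the L2 norms
        # of each row group; sum-of-norms regularization pushes whole groups
        # toward zero, yielding group-sparse routing.
        grid = probs.view(-1, self.grid_size, self.grid_size)   # (B, G, G)
        group_penalty = grid.norm(dim=-1).sum(dim=-1).mean()

        return topk_vals, topk_idx, group_penalty


# Usage sketch: add the penalty to the task objective with a small coefficient.
router = GroupSparseRouter(d_model=64, grid_size=4, top_k=2)
tokens = torch.randn(8, 64)
gates, expert_ids, penalty = router(tokens)
task_loss = torch.zeros(())                        # placeholder for the real objective
total_loss = task_loss + 1e-2 * penalty
```

The grouping here (grid rows) and the penalty coefficient are placeholders; the actual choice of groups on the topographic map and how the regularizer enters training follow the paper's released code.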
Comprehensive evaluations show that MoGE outperforms traditional MoE models with negligible additional memory and computation, offering an efficient way to scale the number of experts while avoiding redundancy.
The source code for MoGE is included in the supplementary material and will be made publicly available for further exploration and implementation.