Sparsely activated Mixture-of-Experts (MoE) models are widely adopted to scale up model capacity without increasing the computation budget. ReMoE is a fully differentiable MoE architecture whose router serves as a drop-in replacement for conventional TopK+Softmax routing. ReMoE exhibits superior scalability with respect to the number of experts, surpassing traditional MoE architectures. The implementation of ReMoE, built on Megatron-LM, is available on GitHub.
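
To make the "drop-in replacement" idea concrete, below is a minimal sketch contrasting a conventional TopK+Softmax router with a fully differentiable gating function. The ReLU-based gate and the class names (`TopKRouter`, `ReLURouter`) are illustrative assumptions for this sketch, not the exact formulation from the ReMoE paper or the Megatron-LM implementation.

```python
# Sketch only: contrasts discrete TopK+Softmax routing with a fully
# differentiable gate. The ReLU gate is an assumed illustration, not
# necessarily ReMoE's exact router.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKRouter(nn.Module):
    """Conventional router: softmax over expert logits, keep only the top-k."""

    def __init__(self, hidden_size: int, num_experts: int, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.gate(x)                      # [tokens, num_experts]
        probs = F.softmax(logits, dim=-1)
        topk_vals, topk_idx = probs.topk(self.k, dim=-1)
        weights = torch.zeros_like(probs)
        weights.scatter_(-1, topk_idx, topk_vals)  # hard zeros outside the top-k
        return weights                             # discrete selection step


class ReLURouter(nn.Module):
    """Fully differentiable alternative: gate values below zero are clamped,
    so sparsity emerges without a discrete top-k selection."""

    def __init__(self, hidden_size: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Negative logits become exactly zero, leaving those experts inactive
        # while the routing function stays continuous in the parameters.
        return F.relu(self.gate(x))                # [tokens, num_experts]


if __name__ == "__main__":
    x = torch.randn(4, 16)                         # 4 tokens, hidden size 16
    print(TopKRouter(16, num_experts=8, k=2)(x))
    print(ReLURouter(16, num_experts=8)(x))
```

Because both routers take the same input and return a per-expert weight matrix of the same shape, the differentiable gate can stand in wherever the TopK+Softmax gate is used, which is the sense in which the replacement is "drop-in".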