Sparsely activated Mixture-of-Experts (MoE) models are widely adopted to scale up model capacity without increasing the computation budget. ReMoE is a fully differentiable MoE architecture whose router serves as a drop-in replacement for conventional TopK+Softmax routing. ReMoE exhibits superior scalability with respect to the number of experts, surpassing traditional MoE architectures. The implementation of ReMoE, built on Megatron-LM, is available on GitHub.
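
To make the "drop-in replacement" idea concrete, below is a minimal sketch contrasting a conventional TopK+Softmax router with a fully differentiable gating function. The ReLU-based gate and the class names (`TopKRouter`, `ReLURouter`) are illustrative assumptions for this sketch, not the exact formulation from the ReMoE paper or the Megatron-LM implementation.

```python
# Sketch only: contrasts discrete TopK+Softmax routing with a fully
# differentiable gate. The ReLU gate is an assumed illustration, not
# necessarily ReMoE's exact router.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKRouter(nn.Module):
    """Conventional router: softmax over expert logits, keep only the top-k."""

    def __init__(self, hidden_size: int, num_experts: int, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.gate(x)                      # [tokens, num_experts]
        probs = F.softmax(logits, dim=-1)
        topk_vals, topk_idx = probs.topk(self.k, dim=-1)
        weights = torch.zeros_like(probs)
        weights.scatter_(-1, topk_idx, topk_vals)  # hard zeros outside the top-k
        return weights                             # discrete selection step


class ReLURouter(nn.Module):
    """Fully differentiable alternative: gate values below zero are clamped,
    so sparsity emerges without a discrete top-k selection."""

    def __init__(self, hidden_size: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Negative logits become exactly zero, leaving those experts inactive
        # while the routing function stays continuous in the parameters.
        return F.relu(self.gate(x))                # [tokens, num_experts]


if __name__ == "__main__":
    x = torch.randn(4, 16)                         # 4 tokens, hidden size 16
    print(TopKRouter(16, num_experts=8, k=2)(x))
    print(ReLURouter(16, num_experts=8)(x))
```

Because both routers take the same input and return a per-expert weight matrix of the same shape, the differentiable gate can stand in wherever the TopK+Softmax gate is used, which is the sense in which the replacement is "drop-in".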