Recent works improve MoE inference load balancing by dynamically duplicating popular experts onto additional GPUs so that the excess tokens routed to them can be processed in parallel.
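To make the duplication idea concrete, the sketch below shows one plausible way to turn per-expert token counts into replica counts; it is an illustrative assumption, not the scheme used by any particular prior work, and the function name `plan_expert_replicas` is hypothetical.

```python
def plan_expert_replicas(tokens_per_expert, num_gpus):
    """Assign each expert a replica count proportional to its token load.

    A minimal sketch: an expert whose load exceeds one GPU's ideal share
    of the total tokens is duplicated so the excess can be spread out.
    """
    total = sum(tokens_per_expert.values())
    capacity = total / num_gpus  # ideal number of tokens handled per GPU
    return {
        expert: max(1, round(load / capacity))
        for expert, load in tokens_per_expert.items()
    }

# Example: expert 0 is "hot" and receives two replicas on a 4-GPU setup.
print(plan_expert_replicas({0: 5000, 1: 1500, 2: 1500, 3: 2000}, num_gpus=4))
# {0: 2, 1: 1, 2: 1, 3: 1}
```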
MoE-GPS is a framework proposed to guide the selection of the optimal predictor design for multi-GPU Mixture-of-Experts networks.
It advocates Distribution-Only Prediction, a strategy that predicts only the aggregate token distribution across experts rather than each token's expert assignment, reducing prediction overhead compared to Token-to-Expert Prediction.
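The contrast between the two predictor interfaces can be sketched as follows; the class names and layer choices are illustrative assumptions rather than the actual MoE-GPS predictors.

```python
import torch
import torch.nn as nn

class TokenToExpertPredictor(nn.Module):
    """Predicts an expert id for every token (per-token output)."""
    def __init__(self, hidden_dim, num_experts):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, num_experts)

    def forward(self, token_states):               # (num_tokens, hidden_dim)
        # One expert index per token: output shape (num_tokens,)
        return self.proj(token_states).argmax(dim=-1)

class DistributionOnlyPredictor(nn.Module):
    """Predicts only the aggregate token share per expert (one small vector)."""
    def __init__(self, hidden_dim, num_experts):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, num_experts)

    def forward(self, token_states):               # (num_tokens, hidden_dim)
        # Pool over tokens first, so a single cheap projection yields the
        # expected load distribution used to size expert replicas.
        pooled = token_states.mean(dim=0)
        return torch.softmax(self.proj(pooled), dim=-1)  # shape (num_experts,)
```

The distribution-only output is a single vector regardless of batch size, which is the source of the overhead reduction relative to producing a prediction per token.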
On Mixtral 8x7B with the MMLU dataset, MoE-GPS suggests Distribution-Only Prediction, which improves end-to-end inference performance by over 23% compared to Token-to-Expert Prediction.