Recent works improve MoE inference load balancing by dynamically duplicating popular experts onto additional GPUs so that the excess tokens routed to them can be processed in parallel.
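To make the duplication idea concrete, the sketch below shows one plausible way to turn per-expert token counts into replica counts; it is an illustrative assumption, not the scheme used by any particular prior work, and the function name `plan_expert_replicas` is hypothetical.

```python
def plan_expert_replicas(tokens_per_expert, num_gpus):
    """Assign each expert a replica count proportional to its token load.

    A minimal sketch: an expert whose load exceeds one GPU's ideal share
    of the total tokens is duplicated so the excess can be spread out.
    """
    total = sum(tokens_per_expert.values())
    capacity = total / num_gpus  # ideal number of tokens handled per GPU
    return {
        expert: max(1, round(load / capacity))
        for expert, load in tokens_per_expert.items()
    }

# Example: expert 0 is "hot" and receives two replicas on a 4-GPU setup.
print(plan_expert_replicas({0: 5000, 1: 1500, 2: 1500, 3: 2000}, num_gpus=4))
# {0: 2, 1: 1, 2: 1, 3: 1}
```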
MoE-GPS is a framework proposed to guide the selection of the optimal predictor design for multi-GPU Mixture-of-Experts networks.
It advocates Distribution-Only Prediction, a strategy that predicts only the aggregate token distribution across experts rather than each token's expert assignment, reducing prediction overhead compared to Token-to-Expert Prediction.
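The contrast between the two predictor interfaces can be sketched as follows; the class names and layer choices are illustrative assumptions rather than the actual MoE-GPS predictors.

```python
import torch
import torch.nn as nn

class TokenToExpertPredictor(nn.Module):
    """Predicts an expert id for every token (per-token output)."""
    def __init__(self, hidden_dim, num_experts):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, num_experts)

    def forward(self, token_states):               # (num_tokens, hidden_dim)
        # One expert index per token: output shape (num_tokens,)
        return self.proj(token_states).argmax(dim=-1)

class DistributionOnlyPredictor(nn.Module):
    """Predicts only the aggregate token share per expert (one small vector)."""
    def __init__(self, hidden_dim, num_experts):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, num_experts)

    def forward(self, token_states):               # (num_tokens, hidden_dim)
        # Pool over tokens first, so a single cheap projection yields the
        # expected load distribution used to size expert replicas.
        pooled = token_states.mean(dim=0)
        return torch.softmax(self.proj(pooled), dim=-1)  # shape (num_experts,)
```

The distribution-only output is a single vector regardless of batch size, which is the source of the overhead reduction relative to producing a prediction per token.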
On Mixtral 8x7B with the MMLU dataset, MoE-GPS suggests Distribution-Only Prediction, which improves end-to-end inference performance by over 23% compared to Token-to-Expert Prediction.