Source: Arxiv

MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism

  • MegaScale-Infer is an efficient system for serving large-scale Mixture-of-Experts (MoE) models.
  • It disaggregates the attention and feed-forward network (FFN) modules within each model layer onto separate groups of GPUs, allowing each to be scaled and optimized independently.
  • MegaScale-Infer introduces ping-pong pipeline parallelism, alternating micro-batches between the attention and FFN stages so that the GPUs freed up by exploiting MoE's sparsity are not left idle.
  • Experimental results show that MegaScale-Infer achieves higher per-GPU throughput than existing serving solutions.

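The scheduling idea behind ping-pong pipeline parallelism can be illustrated with a toy simulation. This is a hypothetical sketch, not MegaScale-Infer's actual implementation: the `attention` and `ffn` functions are stand-ins for the disaggregated module groups, and the point is only the schedule, in which one micro-batch occupies the FFN stage while another occupies the attention stage, so neither GPU group idles.

```python
# Hypothetical illustration of ping-pong pipeline parallelism across
# disaggregated attention and FFN stages (stand-in compute, not real kernels).

def attention(x):
    # Stand-in for the attention module running on the attention GPU group.
    return [v + 1 for v in x]

def ffn(x):
    # Stand-in for the expert FFN module running on the FFN GPU group.
    return [v * 2 for v in x]

def ping_pong_layer(micro_batches):
    """Run one decoder layer over two micro-batches.

    The schedule records which stage each micro-batch occupies at each
    time step; at step 1 the two stages are busy simultaneously, which is
    the overlap that ping-pong pipelining exploits.
    """
    schedule = []                      # entries: (time_step, micro_batch_id, stage)
    states = list(micro_batches)
    # Step 0: micro-batch 0 enters attention; micro-batch 1 waits.
    states[0] = attention(states[0]);  schedule.append((0, 0, "attn"))
    # Step 1: micro-batch 0 moves to FFN while micro-batch 1 enters attention.
    states[0] = ffn(states[0]);        schedule.append((1, 0, "ffn"))
    states[1] = attention(states[1]);  schedule.append((1, 1, "attn"))
    # Step 2: micro-batch 1 moves to FFN.
    states[1] = ffn(states[1]);        schedule.append((2, 1, "ffn"))
    return states, schedule

outputs, schedule = ping_pong_layer([[1, 2], [3, 4]])
print(outputs)   # [[4, 6], [8, 10]]
print(schedule)
```

In the real system the two stages live on different GPU groups and the hand-off is an all-to-all transfer, so the overlap at step 1 also hides communication latency behind computation.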