<ul><li>Mixture-of-Experts (MoE) models achieve strong performance with efficient inference by activating only a subset of experts for each token.</li><li>Large-scale MoE models must still store every expert, which leads to significant memory overhead.</li><li>EASY-EP, the proposed pruning framework, uses a few domain-specific demonstrations to identify and retain the most relevant experts.</li><li>On the DeepSeek-R1 model, EASY-EP can achieve comparable performance and higher throughput while reducing memory usage by half.</li></ul>
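
The idea of using demonstrations to decide which experts to keep can be illustrated with a minimal sketch. This is not the EASY-EP scoring rule, which the summary above does not specify. It assumes a simpler stand-in: each expert is scored by the gate weight it accumulates over the demonstration tokens that route to it, and the highest-scoring experts are retained. The names `select_experts`, `gate_scores`, `top_k`, and `keep`, as well as the toy scores, are hypothetical.

```python
from typing import List, Sequence


def select_experts(
    gate_scores: Sequence[Sequence[float]], top_k: int, keep: int
) -> List[int]:
    """Return the indices of the `keep` experts with the largest accumulated gate weight.

    gate_scores[t][e] is the router score of expert e for demonstration token t.
    Assumption: this accumulated-weight score is an illustrative stand-in,
    not the scoring rule used by EASY-EP.
    """
    num_experts = len(gate_scores[0])
    totals = [0.0] * num_experts
    for token_scores in gate_scores:
        # Experts the router would activate for this token (top-k by gate score).
        active = sorted(
            range(num_experts), key=lambda e: token_scores[e], reverse=True
        )[:top_k]
        for e in active:
            totals[e] += token_scores[e]
    # Rank experts by accumulated weight and retain the best `keep` of them.
    ranked = sorted(range(num_experts), key=lambda e: totals[e], reverse=True)
    return sorted(ranked[:keep])


# Toy router scores for three demonstration tokens and four experts.
demo_scores = [
    [0.70, 0.20, 0.05, 0.05],
    [0.60, 0.10, 0.25, 0.05],
    [0.10, 0.15, 0.70, 0.05],
]
kept = select_experts(demo_scores, top_k=2, keep=2)
print(kept)
```

In this toy setup, keeping two of four experts halves the number of stored experts, which mirrors the kind of memory saving the summary describes. In a real model the scores would come from the router of each MoE layer while it processes the domain demonstrations.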

Domain-Specific Pruning of Large Mixture-of-Experts Models with Few-shot Demonstrations
