SlimCaching: Edge Caching of Mixture-of-Experts for Distributed Inference

  • Mixture-of-Experts (MoE) models enhance the scalability of large language models by activating relevant experts per input.
  • The high number of expert networks in an MoE model poses storage challenges for edge devices.
  • A study addresses expert caching on edge servers under storage constraints, enabling efficient distributed inference with Top-K expert selection.
  • The proposed algorithms minimize inference latency by accounting for expert co-activation within MoE layers, and simulations show improved inference speed; a simplified caching sketch follows this list.
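To make the caching problem concrete, here is a minimal, illustrative Python heuristic. It is not the paper's SlimCaching algorithm: it simply caches the experts with the highest activation frequency per megabyte under an assumed edge storage budget and estimates the expected per-request latency. All expert names, sizes, frequencies, and latency constants are invented for illustration.

from dataclasses import dataclass

@dataclass
class Expert:
    name: str
    size_mb: float          # storage footprint of the expert's weights
    activation_freq: float  # empirical probability the router selects this expert

def greedy_cache(experts, budget_mb):
    """Pick experts to cache, favoring high activation frequency per MB of storage."""
    ranked = sorted(experts, key=lambda e: e.activation_freq / e.size_mb, reverse=True)
    cached, used = [], 0.0
    for e in ranked:
        if used + e.size_mb <= budget_mb:
            cached.append(e)
            used += e.size_mb
    return cached

def expected_latency(experts, cached, hit_ms=1.0, miss_ms=25.0):
    """Expected per-request expert latency: cache hits are fast, misses pay a fetch penalty."""
    cached_names = {e.name for e in cached}
    return sum(
        e.activation_freq * (hit_ms if e.name in cached_names else miss_ms)
        for e in experts
    )

if __name__ == "__main__":
    # Illustrative numbers only: five experts, an edge budget that fits roughly three of them.
    experts = [
        Expert("e0", size_mb=120, activation_freq=0.30),
        Expert("e1", size_mb=120, activation_freq=0.25),
        Expert("e2", size_mb=120, activation_freq=0.20),
        Expert("e3", size_mb=120, activation_freq=0.15),
        Expert("e4", size_mb=120, activation_freq=0.10),
    ]
    cached = greedy_cache(experts, budget_mb=400)
    print("cached:", [e.name for e in cached])
    print("expected latency (ms):", expected_latency(experts, cached))

A real formulation, as the study describes, would also exploit which experts tend to be co-activated within the same MoE layer rather than treating activation frequencies independently.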
