Mixture of Experts (MoE) has emerged as a key architectural paradigm for efficiently scaling Large Language Models (LLMs): for each input token, only a selected subset of the model's parameters (the routed experts) is activated.
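As background for the baseline mechanism, the following PyTorch sketch shows a conventional top-k routed MoE layer in which a gating network selects a few experts per token. The class name, dimensions, and the dense routing loop are illustrative choices, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Minimal sparse MoE layer (illustrative): a router picks the top-k experts
    per token, and only those experts' parameters are applied to that token."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (tokens, d_model)
        logits = self.router(x)                            # (tokens, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)         # choose k experts per token
        weights = F.softmax(weights, dim=-1)               # normalize over the chosen k
        out = torch.zeros_like(x)
        for slot in range(self.k):                         # simple loop for clarity
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                   # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out
```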
In this paper, the authors introduce Mixture of Latent Experts (MoLE), a novel parameterization that maps the experts into a shared latent space.
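The precise decomposition is defined in the paper; as a rough illustration of the idea of sharing a latent space across experts, the sketch below keeps a single shared projection into a low-dimensional latent space and only a small per-expert read-out. The class name, dimensions, and factorization (`LatentExpertFFN`, `d_latent`, `expert_out`) are assumptions made for illustration, not the authors' exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentExpertFFN(nn.Module):
    """Illustrative latent-expert block (a sketch, not the paper's exact
    parameterization): all experts share one projection into a low-dimensional
    latent space, so only a small read-out matrix is stored per expert."""

    def __init__(self, d_model: int, d_latent: int, n_experts: int):
        super().__init__()
        # Shared map into the latent space, reused by every expert.
        self.shared_in = nn.Linear(d_model, d_latent, bias=False)
        # Expert-specific read-outs from the latent space (hypothetical layout).
        self.expert_out = nn.Parameter(0.02 * torch.randn(n_experts, d_latent, d_model))

    def forward(self, x: torch.Tensor, expert_id: int) -> torch.Tensor:
        z = F.gelu(self.shared_in(x))           # (tokens, d_latent), shared computation
        return z @ self.expert_out[expert_id]   # (tokens, d_model), expert-specific part
```

Under this kind of factorization, the per-expert cost scales with the latent dimension rather than the full feed-forward width, which is one way the shared latent space can shrink the overall parameter count.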
The MoLE architecture significantly reduces parameter count and computational cost, mitigating the excessive memory utilization and communication overhead that MoE models incur during training and inference.
Empirical evaluations demonstrate that MoLE achieves performance comparable to standard MoE implementations while substantially reducing resource requirements.