Language model pre-training uses broad data mixtures to enhance performance across domains and languages. DEPT proposes a communication-efficient pre-training framework that decouples token embeddings from the transformer body, allowing it to handle significant data heterogeneity while minimizing the number of token embedding parameters. DEPT improves the transformer body's plasticity, generalization, and overall performance.
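The decoupling can be illustrated with a minimal sketch, assuming toy sizes and a linear map as a stand-in for the transformer body; the class names `SharedBody` and `SourceEmbedding` are hypothetical, not DEPT's actual API. Each data source keeps a private embedding table (with its own vocabulary size), while all sources share one body, so distributed workers would only need to synchronize the body's parameters:

```python
import random

random.seed(0)
D_MODEL = 8  # hidden width shared by the body and all embedding tables

def rand_matrix(rows, cols):
    # Small random init, pure-Python stand-in for a real parameter tensor.
    return [[random.uniform(-0.02, 0.02) for _ in range(cols)] for _ in range(rows)]

class SharedBody:
    """Stand-in for the shared transformer body (a single linear map here)."""
    def __init__(self, d_model):
        self.W = rand_matrix(d_model, d_model)

    def forward(self, seq):
        # seq: list of d_model-vectors -> list of d_model-vectors
        return [[sum(v[i] * self.W[i][j] for i in range(len(v)))
                 for j in range(len(self.W[0]))] for v in seq]

class SourceEmbedding:
    """Per-source embedding table; vocabulary sizes may differ across sources."""
    def __init__(self, vocab_size, d_model):
        self.table = rand_matrix(vocab_size, d_model)

    def embed(self, token_ids):
        return [self.table[t] for t in token_ids]

body = SharedBody(D_MODEL)
src_a = SourceEmbedding(vocab_size=100, d_model=D_MODEL)  # e.g. one language/domain
src_b = SourceEmbedding(vocab_size=250, d_model=D_MODEL)  # e.g. a different corpus

# Heterogeneous sources flow through the same body via their own embeddings.
h_a = body.forward(src_a.embed([1, 5, 99]))
h_b = body.forward(src_b.embed([200, 3]))
print(len(h_a), len(h_a[0]))  # 3 8
print(len(h_b), len(h_b[0]))  # 2 8
```

Because only `SharedBody` is common across sources, only its weights would be exchanged during distributed training, while each embedding table stays local to the worker that owns that data source.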