Pretraining large language models effectively requires strategic data selection, blending, and ordering. A two-phase pretraining approach outperforms both random data ordering and the natural distribution of tokens, improving average accuracies by 3.4% and 17%, respectively. Guidance is also provided on crafting optimal data blends based on data source quality and the number of epochs.
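
To make the idea of a two-phase, quality-aware blend concrete, here is a minimal Python sketch of how such a schedule might be wired up. The source names, blend weights, and the 80/20 phase split are illustrative assumptions, not values taken from the work summarized above; the only point it demonstrates is switching from a broad-coverage mixture to a higher-quality-weighted mixture partway through training.

```python
import random

# Hypothetical per-source sampling weights for each phase. Phase 1 favors
# broad web-scale coverage; phase 2 upweights higher-quality sources.
# These sources and weights are illustrative only.
PHASE_1_BLEND = {"web_crawl": 0.60, "books": 0.15, "code": 0.15, "academic": 0.10}
PHASE_2_BLEND = {"web_crawl": 0.30, "books": 0.25, "code": 0.25, "academic": 0.20}


def blend_for_step(step: int, total_steps: int, phase_1_fraction: float = 0.8) -> dict:
    """Return the active blend: phase 1 for the first portion of training, phase 2 after."""
    return PHASE_1_BLEND if step < phase_1_fraction * total_steps else PHASE_2_BLEND


def sample_source(blend: dict, rng: random.Random) -> str:
    """Pick the data source for the next batch according to the blend weights."""
    sources, weights = zip(*blend.items())
    return rng.choices(sources, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    total_steps = 10
    for step in range(total_steps):
        source = sample_source(blend_for_step(step, total_steps), rng)
        print(f"step {step}: draw next batch from '{source}'")
```

In practice the phase boundary, the number of epochs each source can sustain before repetition hurts, and the per-source quality weights would all be tuned per corpus; the sketch only shows the control flow of a two-phase blend.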