Zyphra Technologies has released Zyda-2, an open pretraining dataset comprising 5 trillion tokens. Zyda-2 has been distilled to retain the strengths of existing datasets while eliminating their weaknesses. The Zamba2 small language model performs significantly better when trained on Zyda-2 than when trained on other state-of-the-art language modeling datasets. The dataset aims to help enterprises train high-accuracy small language models for edge and consumer devices.