Microsoft has unveiled its Phi-4 AI language model, a 14bn-parameter system that uses synthetic data for high-performance reasoning and problem solving. It delivers top-tier performance without relying on massive data inputs and scaling, and also comes with native support for up to 10 Indian languages. Microsoft's emphasis on high-quality datasets over sheer quantity reflects an emerging trend: much potentially valuable data remains untapped, either locked in corporate vaults or not yet available in easily digestible digital formats.
Phi-4's synthetic data serves as a more effective mechanism for learning, drawing on datasets that are structured, diverse and nuanced.
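To give a sense of what "structured" synthetic training data can mean in practice, the toy sketch below generates simple arithmetic word problems paired with worked reasoning and answers. It is purely illustrative: the format, templates and field names are assumptions for this example and are not taken from Microsoft's Phi-4 pipeline.

import json
import random

# Illustrative sketch only: a toy generator of structured synthetic
# training examples (arithmetic word problems with worked answers).
# The schema here is hypothetical, not Microsoft's actual format.

TEMPLATES = [
    "A shop sells {a} boxes with {b} items each. How many items are there in total?",
    "A train travels {a} km per hour for {b} hours. How far does it go?",
]

def make_example(rng: random.Random) -> dict:
    a, b = rng.randint(2, 20), rng.randint(2, 20)
    question = rng.choice(TEMPLATES).format(a=a, b=b)
    # Each example stores the question, a reasoning step and the answer,
    # mimicking the structured, curated quality of synthetic datasets.
    return {
        "question": question,
        "reasoning": f"{a} * {b} = {a * b}",
        "answer": a * b,
    }

if __name__ == "__main__":
    rng = random.Random(0)
    dataset = [make_example(rng) for _ in range(5)]
    print(json.dumps(dataset, indent=2))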
Frontier models will have to rely on techniques beyond simply increasing parameter counts, as the supply of readily available training data may already have peaked.
Microsoft's Phi-4 AI language model has surpassed many of its competitors.
It is expected to have a far-reaching impact in countries where most people cannot afford top-of-the-line models.
Microsoft's focus on synthetic datasets marks a step away from the original scaling hypothesis, prioritising the quality of datasets over their quantity.
Frontier models have been a hot topic recently, with AI researchers contemplating the end of pre-training.
Untapped and locked data sources may also be a means to improve machine learning models.
Microsoft's new language model packs plenty of promise.
Phi-4's use of synthetic data marks a paradigm shift away from the earlier assumption that more data always means better performance.