As new, real data becomes increasingly hard to come by, AI firms have been turning to synthetically generated data.
Because human annotation is costly and comes with a number of other limitations, synthetic alternatives have been developed.
This is possible because of the statistical nature of machine learning: what matters is the distribution of examples a model is fed, not the specific source of the data.
The quality of synthetic data depends on the accuracy of the generative model that produced it, yet many AI labs now fine-tune their models on AI-generated data.
However, a recent study found that over-reliance on synthetic data can reduce the diversity of models and produce training data that is increasingly error-ridden, a degradation sometimes called "model collapse".
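The loss of diversity can be illustrated with a toy simulation (not taken from the study itself): repeatedly fit a Gaussian to a finite sample drawn from the previous generation's fitted Gaussian. Because each refit uses only a limited sample, the fitted spread drifts downward over generations, and later "models" cover less and less of the original distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_generations(n_samples=25, n_generations=300):
    """Refit a Gaussian on its own samples for many 'generations'."""
    mu, sigma = 0.0, 1.0  # the original "real" data distribution
    for _ in range(n_generations):
        # Each generation trains only on data from the previous model.
        synthetic = rng.normal(mu, sigma, size=n_samples)
        mu, sigma = synthetic.mean(), synthetic.std()
    return sigma

final_sigma = train_generations()
print(final_sigma)  # far smaller than the original sigma of 1.0
```

This is a deliberately minimal sketch; real model collapse involves far richer distributions, but the mechanism, compounding estimation error when each generation learns only from the previous generation's output, is the same.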
Concerns have been raised over the safety of relying on synthetic data without first thoroughly reviewing, curating, and filtering it.
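A basic curation pass of the kind described might be sketched as follows; the thresholds, error markers, and function name here are illustrative assumptions, not a standard pipeline.

```python
def curate(samples, min_words=5, banned_markers=("lorem ipsum", "[error]")):
    """Filter a batch of synthetic text samples before training.

    Drops exact duplicates, items that are too short to be useful,
    and items containing known failure markers. All thresholds and
    markers are hypothetical examples.
    """
    seen = set()
    kept = []
    for text in samples:
        normalized = " ".join(text.lower().split())
        if normalized in seen:
            continue  # exact duplicate of an earlier sample
        if len(normalized.split()) < min_words:
            continue  # too short to carry a useful signal
        if any(marker in normalized for marker in banned_markers):
            continue  # matches a known generation-failure pattern
        seen.add(normalized)
        kept.append(text)
    return kept

batch = [
    "The model answered the question correctly and cited its source.",
    "The model answered the question correctly and cited its source.",
    "too short",
    "[ERROR] generation failed midway through the response entirely.",
]
print(curate(batch))  # only the first sample survives the filters
```

Real curation pipelines add model-based quality scoring and semantic deduplication on top of heuristics like these, but even simple filters remove a large share of obviously unusable synthetic samples.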
While it is possible AI could learn to create its own synthetic training data, the concept has yet to be fully realised, and major AI labs continue to supplement generative models with large datasets of real-world information.
For the foreseeable future, it seems that humans will need to stay involved in the training process to ensure that a model's training is effective and not skewed by bias.