The advancement of artificial intelligence hinges on the availability and quality of training data, particularly as multimodal foundation models grow in prominence.
Web platforms such as YouTube and Wikipedia account for a significant share of speech, text, and video datasets, leaving underrepresented languages and regions poorly covered.
Although only about 33% of widely used datasets are explicitly licensed for non-commercial use, nearly 80% carry some form of implicit restriction inherited from their sources, creating legal ambiguities and ethical challenges.
There is an urgent need for a systematic audit of multimodal datasets that holistically considers their sourcing, licensing, and representation to enable the development of unbiased and legally sound technologies.
Researchers from the Data Provenance Initiative conducted the largest longitudinal audit of multimodal datasets to date, revealing the dominance of web-crawled data and providing valuable insights for developers and policymakers.
The lack of transparency and persistent Western-centric biases call for more rigorous audits, more equitable dataset-curation practices, and greater transparency in data provenance.
The study highlights significant inconsistencies in how data is licensed and documented, and reveals stark geographical imbalances: North American and European organizations dominate dataset creation, while African and South American organizations remain underrepresented.
The research provides a roadmap for creating more transparent, equitable, and responsible AI systems, and underscores the need for continued vigilance as data sourcing practices evolve.
The audit showed that over 70% of speech and video datasets are derived from web platforms, while synthetic sources are becoming increasingly popular, accounting for nearly 10% of all text data tokens.
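The shares reported above are simple proportions over dataset metadata. As a minimal sketch of how such figures can be tallied, the snippet below computes the fraction of a modality's token volume that comes from a given source category; the records, field names, and numbers are illustrative assumptions, not the audit's actual schema or data.

```python
from collections import Counter

# Hypothetical dataset-metadata records (illustrative only).
datasets = [
    {"modality": "speech", "source": "web",       "tokens": 120},
    {"modality": "speech", "source": "curated",   "tokens": 40},
    {"modality": "video",  "source": "web",       "tokens": 300},
    {"modality": "text",   "source": "synthetic", "tokens": 25},
    {"modality": "text",   "source": "web",       "tokens": 200},
]

def source_share(records, modality, source):
    """Share of a modality's total token volume drawn from one source type."""
    total = sum(r["tokens"] for r in records if r["modality"] == modality)
    matched = sum(
        r["tokens"]
        for r in records
        if r["modality"] == modality and r["source"] == source
    )
    return matched / total if total else 0.0

# e.g. the synthetic share of text tokens in this toy sample:
print(source_share(datasets, "text", "synthetic"))
```

With real audit metadata, the same tally over source categories would yield figures like the 70% web-platform share for speech and video reported above.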