<ul><li>The authors conducted a longitudinal audit of popular text, speech, and video datasets.</li><li>They analyzed nearly 4000 public datasets released from 1990 to 2024, spanning 608 languages, 798 sources, 659 organizations, and 67 countries.</li><li>They found that web-crawled data, synthetic data, and social media platforms are the primary sources for multimodal machine learning applications.</li><li>They also found that a significant portion of widely used datasets carry non-commercial restrictions, and that coverage of languages and geographies has not significantly improved in recent years.</li></ul>

Bridging the Data Provenance Gap Across Text, Speech and Video
