The authors conducted a longitudinal audit of popular text, speech, and video datasets.
They analyzed nearly 4000 public datasets released between 1990 and 2024, spanning 608 languages, 798 sources, 659 organizations, and 67 countries.
They found that web-crawled content, synthetic data, and social media platforms are the primary data sources for multimodal machine learning applications.
They also discovered that a significant portion of widely used datasets carry non-commercial restrictions, and that coverage of languages and geographies has not significantly improved in recent years.