Arxiv

Bridging the Data Provenance Gap Across Text, Speech and Video

  • The authors conducted a longitudinal audit of popular text, speech, and video datasets.
  • They analyzed nearly 4,000 public datasets published between 1990 and 2024, spanning 608 languages, 798 sources, 659 organizations, and 67 countries.
  • They found that web-crawled data, synthetic data, and social media platforms are the primary data sources for multimodal machine learning applications.
  • They also found that a significant portion of widely used datasets carries non-commercial restrictions, and that coverage of languages and geographies has not significantly improved in recent years.
