The paper introduces EcoDatum, a data curation approach that aims to improve data efficiency when training on web-crawled datasets.
EcoDatum targets the unstructured, heterogeneous nature of such datasets and seeks to avoid the biases and the exclusion of relevant data that traditional curation methods often introduce.
The method applies quality-guided deduplication to balance feature distributions and combines multiple data curation operators within a weak supervision ensemble framework.
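To make the idea of quality-guided deduplication concrete, here is a minimal sketch rather than the paper's implementation. It assumes each sample has an embedding vector and a scalar quality score, treats two samples as near-duplicates when their cosine similarity reaches a threshold, and keeps the highest-quality member of each near-duplicate group. The function name `quality_guided_dedup`, the threshold value, and the toy data are all hypothetical.

```python
import numpy as np

def quality_guided_dedup(embeddings, quality, threshold=0.95):
    """Greedy near-duplicate removal that prefers higher-quality samples.

    Assumption: cosine similarity >= threshold marks two samples as duplicates.
    """
    emb = np.asarray(embeddings, dtype=float)
    # Normalize rows so that dot products equal cosine similarities.
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    emb = emb / np.clip(norms, 1e-12, None)

    # Visit samples from highest to lowest quality.
    order = np.argsort(-np.asarray(quality, dtype=float), kind="stable")

    kept = []
    for i in order:
        # Keep a sample only if it is not a near-duplicate of one already kept.
        if all(float(emb[i] @ emb[j]) < threshold for j in kept):
            kept.append(int(i))
    return sorted(kept)

# Toy data: samples 0 and 1 are near-duplicates, sample 1 has higher quality.
embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
quality = [0.4, 0.9, 0.7]
kept_indices = quality_guided_dedup(embeddings, quality)
print(kept_indices)  # [1, 2]
```

The greedy pass is quadratic in the number of kept samples, so a web-scale pipeline would likely need clustering or approximate nearest-neighbor search instead. The sketch only shows how a quality score can decide which duplicate survives.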
An automated optimization step is used to score each data point, which the authors report improves curation quality and efficiency compared to existing techniques.
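One way to picture a weak supervision ensemble of curation operators is as a set of noisy voters whose votes are combined into a single score per data point. The sketch below is illustrative only and is not taken from the paper. The three operators, their thresholds, and their weights are hypothetical and fixed by hand, whereas the paper describes an automated optimization step that this sketch omits.

```python
import numpy as np

ABSTAIN, DISCARD, KEEP = 0, -1, 1

def votes_from_scores(scores, low, high):
    """Turn one operator's raw scores into votes.

    KEEP at or above `high`, DISCARD at or below `low`, ABSTAIN in between.
    """
    scores = np.asarray(scores, dtype=float)
    votes = np.full(scores.shape, ABSTAIN)
    votes[scores >= high] = KEEP
    votes[scores <= low] = DISCARD
    return votes

def ensemble_scores(vote_matrix, weights):
    """Weighted average of the non-abstaining votes for each sample, in [-1, 1]."""
    votes = np.asarray(vote_matrix, dtype=float)  # shape: (n_samples, n_operators)
    w = np.asarray(weights, dtype=float)
    # An operator's weight counts only for samples where it actually voted.
    active = (votes != ABSTAIN) * w
    total = active.sum(axis=1)
    weighted = (votes * w).sum(axis=1)
    # Samples on which every operator abstained get a neutral score of 0.
    return np.divide(weighted, total, out=np.zeros_like(weighted), where=total > 0)

# Three hypothetical operators scoring four samples.
op_a = votes_from_scores([0.31, 0.12, 0.25, 0.05], low=0.15, high=0.28)
op_b = votes_from_scores([0.8, 0.2, 0.9, 0.5], low=0.3, high=0.7)
op_c = votes_from_scores([0.6, 0.9, 0.1, 0.5], low=0.2, high=0.8)

vote_matrix = np.stack([op_a, op_b, op_c], axis=1)
scores = ensemble_scores(vote_matrix, weights=[2.0, 1.0, 1.0])
selected = np.flatnonzero(scores > 0)  # indices of samples the ensemble favors
print(scores, selected)
```

Letting operators abstain means a filter that is uninformative for a given sample does not drag its score toward either decision, which is a common design choice in weak supervision setups.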
EcoDatum outperforms state-of-the-art methods and ranks 1st on the DataComp leaderboard, with an average performance score of 0.182 across 38 evaluation datasets.
The approach achieves a 28% improvement over the DataComp baseline method, indicating more effective dataset curation and more efficient model training.
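The summary does not state the baseline's score. If the 28% figure is read as a relative improvement in the same average score (an assumption), the implied baseline can be back-computed as a quick sanity check. The result is an inferred value, not a reported one.

```python
ecodatum_score = 0.182        # average score reported in the summary
relative_improvement = 0.28   # assumed to be relative to the baseline score

# If score = baseline * (1 + improvement), then baseline = score / (1 + improvement).
implied_baseline = ecodatum_score / (1 + relative_improvement)
print(round(implied_baseline, 3))  # 0.142
```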