ICONS: Influence Consensus for Vision-Language Data Selection
Training vision-language models often relies on large mixtures of data spanning diverse tasks and domains. However, these mixtures can include redundant information, increasing computational costs without performance gains.
Effective data selection strategies are needed to address this issue. Existing methods rely on task-agnostic heuristics or optimize for a single task, limiting their effectiveness in multitask settings.
This work introduces ICONS, a gradient-based influence-consensus approach for vision-language data selection. ICONS leverages training dynamics to estimate the influence of individual examples on each target task's validation performance and aggregates these estimates across tasks via majority voting.
By identifying data points that are consistently valuable across tasks, ICONS prioritizes the examples that drive overall performance. The voting-based consensus mitigates cross-task score calibration and outlier sensitivity issues, yielding robust data selection for diverse multitask mixtures.
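To make the consensus step concrete, the sketch below shows one way per-task influence scores could be combined by majority voting; the function name, quantile threshold, and score shapes are illustrative assumptions rather than the released implementation.

```python
import numpy as np

def influence_consensus(influence, keep_ratio=0.2, vote_quantile=0.8):
    """Aggregate per-task influence scores into one selection via majority voting.

    influence: array of shape (num_examples, num_tasks); influence[i, t] is the
    estimated effect of training example i on task t's validation performance.
    Names and thresholds here are illustrative assumptions, not the authors' code.
    """
    num_examples, _ = influence.shape
    # Within each task, mark an example as "influential" if its score falls in
    # the top (1 - vote_quantile) fraction for that task; voting on per-task
    # ranks sidesteps calibrating raw scores across tasks and dampens outliers.
    thresholds = np.quantile(influence, vote_quantile, axis=0)
    votes = (influence >= thresholds).sum(axis=1)
    # Rank examples by how many tasks voted for them and keep the top fraction.
    order = np.argsort(-votes)
    k = int(keep_ratio * num_examples)
    return order[:k]

# Example: 1,000 candidate examples scored against 10 target tasks.
rng = np.random.default_rng(0)
scores = rng.normal(size=(1000, 10))
selected = influence_consensus(scores, keep_ratio=0.2)
print(len(selected))  # 200 selected example indices
```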
With only 20% of the data from LLaVA-665K and Cambrian-7M, the selected subsets achieve 98.6% and 98.8% of full-dataset performance, respectively, and training on a 60% selection of LLaVA-665K can even surpass training on the full dataset.
The approach also generalizes to unseen tasks and model architectures, demonstrating strong transfer. Two compact subsets, LLaVA-ICONS-133K and Cambrian-ICONS-1.4M, are released, containing impactful training examples for efficient vision-language model development.