Researchers from ByteDance and the Chinese Academy of Sciences have introduced InfiMM-WebMath-40B, a large-scale multimodal dataset designed specifically for mathematical reasoning.
The dataset includes 24 million web pages, 85 million associated image URLs, and approximately 40 billion text tokens extracted and filtered from the CommonCrawl repository.
It was constructed with a rigorous processing pipeline that filtered an initial pool of 122 billion web pages down to high-quality, math-relevant content, making it the first openly released dataset of its kind.
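While the full pipeline is not reproduced here, its core idea, scoring web pages for mathematical content and keeping only those above a quality threshold, can be illustrated with a minimal sketch. The patterns, keywords, and threshold below are assumptions for illustration only, not the authors' actual filtering rules; a production pipeline would typically rely on a trained classifier (such as a fastText model) rather than hand-written heuristics.

```python
import re

# Heuristic markers that often indicate mathematical content on a web page.
# These patterns and the threshold are illustrative assumptions, not the
# authors' actual filtering rules.
LATEX_PATTERN = re.compile(
    r"(\$\$.+?\$\$|\\begin\{equation\}|\\frac|\\sum|\\int)", re.DOTALL
)
MATH_KEYWORDS = ("theorem", "lemma", "proof", "equation", "integral", "matrix")

def math_score(text: str) -> float:
    """Crude proxy for a trained math-content classifier."""
    latex_hits = len(LATEX_PATTERN.findall(text))
    keyword_hits = sum(text.lower().count(k) for k in MATH_KEYWORDS)
    # Normalize by page length so long pages are not favored automatically.
    return (latex_hits * 2 + keyword_hits) / max(len(text.split()), 1)

def keep_page(text: str, threshold: float = 0.01) -> bool:
    """Keep pages whose math density clears the threshold."""
    return math_score(text) >= threshold

pages = [
    "We prove the theorem using the integral \\int_0^1 x^2 dx = 1/3.",
    "Top ten travel destinations for summer vacations this year.",
]
kept = [p for p in pages if keep_page(p)]
print(kept)  # only the math-heavy page survives
```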
In evaluations on benchmarks such as MathVerse and We-Math, models pre-trained on InfiMM-WebMath-40B outperformed other open-source models at processing combined textual and visual information, underscoring the value of integrating visual elements with text to improve mathematical reasoning.
The dataset bridges the gap between proprietary and open-source models and paves the way for future research into AI systems that can solve complex mathematical problems. It gives the community an unprecedented resource for training Multimodal Large Language Models (MLLMs), enabling them to process and reason over more complex mathematical concepts than before.
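For practitioners who want to explore the data, the sketch below shows how a corpus of this size could be streamed through the Hugging Face datasets library rather than downloaded up front. The repository ID Infi-MM/InfiMM-WebMath-40B and the record layout are assumptions to verify against the official release.

```python
from datasets import load_dataset

# Stream the corpus instead of materializing ~40B tokens on disk.
# The repo ID and record layout are assumptions; check the official
# release card before relying on them.
ds = load_dataset("Infi-MM/InfiMM-WebMath-40B", split="train", streaming=True)

for i, sample in enumerate(ds):
    # Each record is expected to interleave text segments with image URLs.
    print(sample.keys())
    if i >= 2:
        break
```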
The growing sophistication of LLMs has made them indispensable in fields that demand advanced reasoning. Proprietary models such as GPT-4 and Claude 3.5 Sonnet have leveraged extensive private datasets during pre-training, but the lack of comparably comprehensive multimodal datasets that integrate text and visual data has limited open-source projects; that is precisely the gap InfiMM-WebMath-40B was built to fill.