<ul><li>AutoSDT is an automatic pipeline designed to address data scarcity in building AI co-scientists for scientific discovery tasks.</li><li>It collects high-quality coding tasks from real-world data-driven discovery workflows, using LLMs to search for sources, select tasks, and synthesize instructions and code solutions.</li><li>The AutoSDT-5K dataset, created with this pipeline, comprises 5,404 coding tasks spanning four scientific disciplines and 756 Python packages, making it the largest open dataset for data-driven scientific discovery generated automatically.</li><li>Expert feedback indicates that 93% of the collected tasks are ecologically valid and 92.2% of the synthesized programs are functionally correct. AutoSDT-Coder models trained on this dataset show significant improvements on data-driven discovery benchmarks, matching the performance of GPT-4o on ScienceAgentBench and improving scores on DiscoveryBench.</li></ul>
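The three stages named above (search for sources, select tasks, synthesize instructions and code) can be pictured as a simple chain. The following is only an illustrative sketch, not the authors' implementation: the names `Task`, `run_pipeline`, `toy_search`, `toy_select`, and `toy_synthesize` are hypothetical, and plain Python functions stand in for the LLM calls the summary describes.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Task:
    """Hypothetical record for one collected coding task."""
    source: str        # where the candidate program was found
    instruction: str   # synthesized natural-language task description
    solution: str      # synthesized code solution


def run_pipeline(
    search: Callable[[str], Iterable[str]],
    select: Callable[[str], bool],
    synthesize: Callable[[str], Task],
    query: str,
) -> List[Task]:
    """Chain the three stages: search for sources, select tasks, synthesize."""
    tasks = []
    for source in search(query):
        if select(source):                    # keep only suitable candidates
            tasks.append(synthesize(source))  # produce instruction + solution
    return tasks


# Toy stand-ins (assumptions): in the described pipeline each stage is LLM-driven.
def toy_search(query: str) -> List[str]:
    return [f"{query}/repo_a/analysis.py", f"{query}/repo_b/README.md"]


def toy_select(source: str) -> bool:
    return source.endswith(".py")


def toy_synthesize(source: str) -> Task:
    return Task(
        source=source,
        instruction=f"Reproduce the analysis in {source}",
        solution="# solution placeholder",
    )


tasks = run_pipeline(toy_search, toy_select, toy_synthesize, "bioinformatics")
```

Passing the stages in as callables keeps the chain itself independent of any particular model or search backend, which is a design choice of this sketch rather than something the summary specifies.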

AutoSDT: Scaling Data-Driven Discovery Tasks Toward Open Co-Scientists
