AutoSDT is an automatic pipeline designed to address the challenge of data scarcity in building AI co-scientists for scientific discovery tasks.
It collects high-quality coding tasks from real-world data-driven workflows, using LLMs in three stages: searching for sources, selecting tasks, and synthesizing task instructions and code solutions.
The AutoSDT-5K dataset, created with this pipeline, comprises 5,404 coding tasks spanning four scientific disciplines and 756 Python packages, making it the largest open dataset for data-driven scientific discovery generated automatically.
Expert feedback indicates that 93% of the collected tasks are ecologically valid and 92.2% of the synthesized programs are functionally correct. AutoSDT-Coder models trained on this dataset show significant improvements on data-driven discovery benchmarks, matching the performance of GPT-4o on ScienceAgentBench and improving scores on DiscoveryBench.
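
To make the search, select, and synthesize flow described above concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the AutoSDT implementation: the `LLM` callable type, the prompt wording, the `Task` record, and all function names are hypothetical, and `stub_llm` is a canned stand-in for a real model so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: any function mapping a prompt string to a reply string.
LLM = Callable[[str], str]


@dataclass
class Task:
    source: str       # where the task was found (hypothetical identifier)
    instruction: str  # synthesized natural-language task instruction
    solution: str     # synthesized code solution


def search_sources(llm: LLM, discipline: str) -> List[str]:
    """Stage 1 (assumed shape): ask the LLM for candidate sources, one per line."""
    reply = llm(f"SEARCH: list code sources for data-driven {discipline} workflows")
    return [line.strip() for line in reply.splitlines() if line.strip()]


def select_tasks(llm: LLM, sources: List[str]) -> List[str]:
    """Stage 2 (assumed shape): keep only sources the LLM judges to be suitable tasks."""
    kept = []
    for source in sources:
        verdict = llm(f"SELECT: is {source} a data-driven discovery task? yes/no")
        if verdict.strip().lower().startswith("yes"):
            kept.append(source)
    return kept


def synthesize(llm: LLM, source: str) -> Task:
    """Stage 3 (assumed shape): generate an instruction and a code solution."""
    instruction = llm(f"INSTRUCT: write a task instruction for {source}")
    solution = llm(f"SOLVE: write a standalone Python program for {source}")
    return Task(source, instruction.strip(), solution.strip())


def run_pipeline(llm: LLM, discipline: str) -> List[Task]:
    """Chain the three stages: search, then select, then synthesize."""
    sources = search_sources(llm, discipline)
    return [synthesize(llm, s) for s in select_tasks(llm, sources)]


def stub_llm(prompt: str) -> str:
    """Canned replies standing in for a real model (illustration only)."""
    if prompt.startswith("SEARCH"):
        return "repo_a/analysis.py\nrepo_b/setup.py"
    if prompt.startswith("SELECT"):
        return "yes" if "analysis" in prompt else "no"
    if prompt.startswith("INSTRUCT"):
        return "Compute summary statistics."
    return "print('ok')"


tasks = run_pipeline(stub_llm, "bioinformatics")
```

With the stub, only the first candidate passes the selection stage, so `tasks` holds a single `Task`. Passing the model in as a plain callable keeps the stages testable without committing to any particular LLM client.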