Arxiv

AutoSDT: Scaling Data-Driven Discovery Tasks Toward Open Co-Scientists

  • AutoSDT is an automatic pipeline designed to address the challenge of data scarcity in building AI co-scientists for scientific discovery tasks.
  • It collects high-quality coding tasks from real-world data-driven workflows using LLMs to search for sources, select tasks, and synthesize instructions and code solutions.
  • The AutoSDT-5K dataset, built with this pipeline, comprises 5,404 coding tasks spanning four scientific disciplines and 756 Python packages, making it the largest open, automatically generated dataset for data-driven scientific discovery.
  • Expert feedback indicates that 93% of the collected tasks are ecologically valid and 92.2% of the synthesized programs are functionally correct.
  • AutoSDT-Coder models trained on this dataset show significant improvements on data-driven discovery benchmarks, matching the performance of GPT-4o on ScienceAgentBench and improving scores on DiscoveryBench.
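
The three stages named in the summary (search for sources, select tasks, synthesize instructions and code solutions) can be pictured with a minimal Python sketch. This is an illustration only, not AutoSDT's actual implementation: the `Task` class, the function names, the prompts, and the `fetch_programs` helper are all hypothetical, and `llm` stands for any function that maps a prompt string to a reply string.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: an "LLM" here is any callable from prompt text to reply text.
LLM = Callable[[str], str]


@dataclass
class Task:
    source: str            # where the workflow was found (hypothetical field)
    path: str              # file the program came from (hypothetical field)
    program: str           # original program text
    instruction: str = ""  # filled in by the synthesis stage
    solution: str = ""     # filled in by the synthesis stage


def search_sources(llm: LLM, discipline: str) -> List[str]:
    """Stage 1: ask the model for candidate sources, one per line."""
    reply = llm(f"List repositories with data-driven {discipline} workflows, one per line.")
    return [line.strip() for line in reply.splitlines() if line.strip()]


def select_tasks(llm: LLM, source: str, programs: Dict[str, str]) -> List[Task]:
    """Stage 2: keep only programs the model judges to be data-driven tasks."""
    kept = []
    for path, code in programs.items():
        verdict = llm(f"Is this a data-driven scientific task? Answer yes or no.\n{code}")
        if verdict.strip().lower().startswith("yes"):
            kept.append(Task(source=source, path=path, program=code))
    return kept


def synthesize(llm: LLM, task: Task) -> Task:
    """Stage 3: write a task instruction and a standalone code solution."""
    task.instruction = llm(f"Write a task instruction for this program:\n{task.program}")
    task.solution = llm(f"Rewrite this program as a standalone solution:\n{task.program}")
    return task


def run_pipeline(llm: LLM, discipline: str,
                 fetch_programs: Callable[[str], Dict[str, str]]) -> List[Task]:
    """Chain the three stages; fetch_programs is a hypothetical source reader."""
    tasks = []
    for source in search_sources(llm, discipline):
        for task in select_tasks(llm, source, fetch_programs(source)):
            tasks.append(synthesize(llm, task))
    return tasks
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM provider and makes each stage easy to exercise with a stub.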
