<ul><li>ScienceAgentBench is a benchmark for evaluating language agents on data-driven scientific discovery.</li><li>It aims to assess how well large language models (LLMs) can automate scientific discovery tasks.</li><li>The benchmark includes 102 tasks extracted from peer-reviewed publications in four disciplines and validated by subject matter experts.</li><li>Results show that current language agents remain limited both in generating code for data-driven discovery and in automating scientific research end to end.</li></ul>

ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery
