ScienceAgentBench is a benchmark for evaluating language agents on data-driven scientific discovery. It assesses how well large language models (LLMs) can automate the individual tasks that make up a scientific discovery workflow.
The benchmark comprises 102 tasks extracted from peer-reviewed publications across four disciplines, each validated by subject matter experts.
Results show that current language agents still fall short at generating code for data-driven discovery, let alone at end-to-end automation of scientific research.
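To make the evaluation setting concrete, the loop below is a minimal, hypothetical sketch of how a benchmark like this scores an agent: for each task, the agent produces a program from a natural-language instruction, the program is executed, and its output is checked. The task schema, `stub_agent`, `run_program`, and the checker are all illustrative assumptions, not ScienceAgentBench's actual API.

```python
# Hypothetical sketch of a ScienceAgentBench-style evaluation loop.
# Task structure, agent interface, and scoring are illustrative
# assumptions, not the benchmark's real interface.

def stub_agent(instruction: str) -> str:
    """Placeholder 'language agent': returns a Python program as a string."""
    return "result = sum(range(1, 11))"  # pretend the agent wrote this

def run_program(code: str) -> dict:
    """Execute generated code in an isolated namespace and return it."""
    namespace: dict = {}
    exec(code, namespace)
    return namespace

# A real benchmark would load tasks (instructions, data, reference
# outputs) from disk; one toy task stands in for the suite here.
tasks = [
    {
        "instruction": "Compute the sum of the integers 1 through 10.",
        "check": lambda ns: ns.get("result") == 55,
    },
]

solved = 0
for task in tasks:
    program = stub_agent(task["instruction"])
    namespace = run_program(program)
    if task["check"](namespace):
        solved += 1

success_rate = solved / len(tasks)
print(f"success rate: {success_rate:.0%}")  # prints "success rate: 100%"
```

In practice the checker would compare the generated program's saved outputs (figures, tables, processed datasets) against expert-validated references rather than inspect variables directly.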