OpenAI unveiled PaperBench, a new benchmark that measures how well AI agents can reproduce cutting-edge AI research. The benchmark consists of 20 top papers from the International Conference on Machine Learning (ICML) 2024, spanning 12 different topics. Anthropic's Claude 3.5 Sonnet was the best-performing model with a 21.0% replication score, while human PhDs averaged 41.4%. PaperBench's code is available on GitHub, along with PaperBench Code-Dev, a lightweight version of the benchmark.