OpenAI unveiled PaperBench, a new benchmark that measures how well AI agents can reproduce cutting-edge AI research. The benchmark consists of 20 top papers from the International Conference on Machine Learning (ICML) 2024, spanning 12 different topics. Anthropic's Claude 3.5 Sonnet was the best-performing model with a 21.0% replication score, while human PhDs averaged 41.4%. PaperBench's code is available on GitHub, along with PaperBench Code-Dev, a lightweight version of the benchmark.