OpenAI has introduced PaperBench, a benchmark designed to evaluate the ability of AI agents to autonomously replicate state-of-the-art machine learning research.
PaperBench requires AI agents to understand research papers, build code repositories from scratch, and run experiments to reproduce the papers' empirical results.
Evaluations show that replication scores on PaperBench vary considerably across AI models.
The results highlight strengths in initial code generation and experimental setup, but expose weaknesses in sustained task execution and strategic problem-solving.