OpenAI Introduces PaperBench, an AI Agent Evaluation Benchmark
2025-04-03
Author: Editorial Staff

On April 2, OpenAI, the American artificial intelligence research company, released PaperBench, a new benchmark for evaluating how well AI agents can replicate cutting-edge AI research. The benchmark asks agents to reproduce 20 ICML 2024 Spotlight and Oral papers from scratch, which involves understanding each paper's contributions, building a working codebase, and successfully running the experiments. In OpenAI's tests, the best-performing agent, Claude 3.5 Sonnet (New) paired with open-source scaffolding, achieved an average replication score of 21.0% on PaperBench. However, when top machine learning Ph.D. students were recruited to attempt a subset of the papers, they outperformed the model, showing that agent performance still falls well short of human capability.
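For readers wondering how a single number like 21.0% is produced, PaperBench grades each replication attempt against a hierarchical rubric: fine-grained leaf requirements are scored individually, roll up into a weighted per-paper replication score, and the benchmark-level figure is the average across papers. The sketch below illustrates that kind of weighted roll-up in Python. It is a hypothetical illustration under those assumptions, not OpenAI's actual grading code: the `RubricNode` class, the weights, and the example requirements are all invented for this example.

```python
from dataclasses import dataclass, field

@dataclass
class RubricNode:
    """A node in a hypothetical hierarchical grading rubric.

    Leaves carry a binary pass/fail score; internal nodes aggregate
    their children as a weighted average, so partial credit propagates
    up to a single replication score per paper.
    """
    name: str
    weight: float = 1.0
    score: float | None = None            # set only on leaf nodes (0.0 or 1.0)
    children: list["RubricNode"] = field(default_factory=list)

    def replication_score(self) -> float:
        if not self.children:              # leaf: graded directly
            return self.score or 0.0
        total_weight = sum(c.weight for c in self.children)
        return sum(c.weight * c.replication_score()
                   for c in self.children) / total_weight

# Toy rubric for one paper: two graded requirements with unequal weights.
paper_rubric = RubricNode("example paper", children=[
    RubricNode("code runs end to end", weight=2.0, score=1.0),
    RubricNode("results match the paper", weight=3.0, score=0.0),
])

# Benchmark-level score: the mean replication score across all papers
# (PaperBench uses 20; one toy paper shown here).
papers = [paper_rubric]
print(sum(p.replication_score() for p in papers) / len(papers))  # 0.4
```

The weighted-average design means an agent earns partial credit for intermediate progress, such as code that runs but produces the wrong numbers, rather than being scored all-or-nothing on full reproduction.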