OpenAI Introduces PaperBench, an AI Agent Evaluation Benchmark
2025-04-03
Author: Editorial Staff

On April 2, OpenAI, the American artificial intelligence research company, released PaperBench, a new benchmark for evaluating how well AI agents can replicate cutting-edge AI research. The benchmark asks agents to reproduce 20 ICML 2024 Spotlight and Oral papers from scratch, which involves understanding each paper's contributions, building a working codebase, and successfully running the experiments. In OpenAI's tests, the best-performing agent, Claude 3.5 Sonnet (New) paired with open-source scaffolding, achieved an average replication score of 21.0% on PaperBench. However, when top machine learning Ph.D. students were recruited to attempt a subset of the papers, they outperformed the model, showing that agent performance still falls well short of human capability.
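For readers wondering how a single number like 21.0% is produced, PaperBench grades each replication attempt against a hierarchical rubric: fine-grained leaf requirements are scored individually, roll up into a weighted per-paper replication score, and the benchmark-level figure is the average across papers. The sketch below illustrates that kind of weighted roll-up in Python. It is a hypothetical illustration under those assumptions, not OpenAI's actual grading code: the `RubricNode` class, the weights, and the example requirements are all invented for this example.

```python
from dataclasses import dataclass, field

@dataclass
class RubricNode:
    """A node in a hypothetical hierarchical grading rubric.

    Leaves carry a binary pass/fail score; internal nodes aggregate
    their children as a weighted average, so partial credit propagates
    up to a single replication score per paper.
    """
    name: str
    weight: float = 1.0
    score: float | None = None            # set only on leaf nodes (0.0 or 1.0)
    children: list["RubricNode"] = field(default_factory=list)

    def replication_score(self) -> float:
        if not self.children:              # leaf: graded directly
            return self.score or 0.0
        total_weight = sum(c.weight for c in self.children)
        return sum(c.weight * c.replication_score()
                   for c in self.children) / total_weight

# Toy rubric for one paper: two graded requirements with unequal weights.
paper_rubric = RubricNode("example paper", children=[
    RubricNode("code runs end to end", weight=2.0, score=1.0),
    RubricNode("results match the paper", weight=3.0, score=0.0),
])

# Benchmark-level score: the mean replication score across all papers
# (PaperBench uses 20; one toy paper shown here).
papers = [paper_rubric]
print(sum(p.replication_score() for p in papers) / len(papers))  # 0.4
```

The weighted-average design means an agent earns partial credit for intermediate progress, such as code that runs but produces the wrong numbers, rather than being scored all-or-nothing on full reproduction.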