OpenAI has introduced PaperBench, a benchmark that measures how well AI agents can replicate cutting-edge research. Agents must replicate 20 papers from ICML 2024 from scratch, which means understanding each paper, rebuilding its codebase, and running its experiments. In preliminary tests, the top-performing agent, Claude 3.5 Sonnet, achieved an average replication score of 21.0%, which remains below the human baseline.
