AI Solves 56% of Weeks-Long Coding Projects in New Benchmark: MirrorCode
4 hour ago / Read about 30 minute
Source:TechTimes

computer coding yeiferr/pixabay.com

Autonomous AI coding has crossed a threshold that most software engineers did not expect to see this year: a new benchmark released Friday by Epoch AI and METR found that today's best model, Claude Opus 4.7, can successfully reconstruct entire software projects that would take a human engineer weeks to complete — without ever seeing the source code, without human intervention, and without any access to the internet. On MirrorCode, a 25-program long-horizon coding evaluation, Claude Opus 4.7 solved 56% of targets, including a 16,000-line bioinformatics toolkit that four independent engineers estimated would take a skilled human 2 to 17 weeks to reimplement.

The benchmark marks the first rigorous, reproducible, multi-model demonstration that AI agents can sustain goal-directed software development across task horizons previously studied only by formal methods researchers pursuing the decades-old dream of automated program synthesis.

What Makes MirrorCode Different From Every Other Coding Benchmark

Most AI coding benchmarks — including the widely cited SWE-bench — measure how well a model can fix a single bug in an existing codebase, or implement a small feature given the full source code. Those tasks typically resolve in minutes and cost a dollar or two in inference compute. They measure whether a model can perform a smart, context-bounded action.

MirrorCode measures something structurally different: whether a model can reconstruct the full behavior of a program it cannot read. The AI receives only a compiled binary, natural language documentation, and a set of example input-output pairs. It can run the binary with arbitrary inputs to observe what it does — a setup the researchers call a "black-box oracle" — but it cannot see the source code, access the internet, or receive human guidance during the run. Every solution must produce byte-exact output on both the test cases the model could see and a separate set of held-out tests it could not see, ensuring there is no path to gaming the benchmark through memorization or lookup tables.

The domains covered in the full release span the breadth of working software: Unix utilities, data serialization and query tools, bioinformatics toolkits, language interpreters, static analyzers, cryptography implementations, and compression utilities. Models could implement their solutions in any of six languages: Python, C, Rust, Go, OCaml, or Ada.

Read more: Xiaomi MiMo Code Claims to Beat Claude Code: Benchmark Scores Are Self-Reported

How the Scoring Works: Byte-Exact Output, Hidden Tests, No Cheating

The benchmark's technical design addresses the most persistent problem in AI coding evaluation: distinguishing genuine competence from memorization.

Because MirrorCode tasks involve reimplementing real open-source programs, the models almost certainly encountered those codebases in pretraining. To prevent memorization from producing a false positive, the benchmark separates tests into visible and hidden sets. An average of 34% of tests per target are held out — never shown to the AI during its run. A solution passes only if it produces byte-exact matches on both sets simultaneously. Median test count per target: 601 individual input-output cases.

The paper includes a memorization screen: models were prompted to reproduce original source functions verbatim. The researchers found a baseline similarity score of 0.34 — meaning models were not predominantly retrieving memorized code — though the authors acknowledge memorization cannot be fully ruled out and expect any quantitative inflation would not change the directional finding.

The infrastructure adds three further guardrails: models cannot wrap the reference binary to mimic its outputs (the model's code is copied to a separate sandbox where the original binary is absent during scoring); models cannot interfere with the scoring mechanism (scoring runs in an isolated environment and requires string equality); and model code cannot access the internet during a run.

The Hardest Task: 19 Days, $2,600, One Shot

The benchmark's most extreme data point illustrates how far from typical benchmarking this work sits. One of the 25 target programs required $2,600 in inference compute for a single attempt and kept the model working continuously for 19 uninterrupted days. Epoch AI notes this is the cost of genuine elicitation: most existing software engineering benchmarks cap inference spending at $1 to $10, even for tasks researchers estimate would take a human weeks. At that budget, the model never gets a fair chance at the hardest programs.

The standout success story is gotree, a bioinformatics toolkit with roughly 16,000 lines of Go code and more than 40 commands. Claude Opus 4.7 reimplemented it in 14 hours, passing 2,000 of 2,001 tests — 99.95% — at a cost of $251. The one failing test covered an edge case for a niche date-annotation command. The researchers describe the reimplementation as effectively complete for all practical purposes.

For comparison, leading AI models from eight months ago would have scored approximately 30% on the same benchmark and were limited to simpler targets like a calendar utility. GPT-5.5 placed second overall, and Gemini 3.1 Pro Preview placed third with approximately 32%.

Where AI Still Fails: Architectural Limits on the Largest Programs

The 56% headline score obscures a meaningful pattern inside the data. Benchmark programs fall into three informal size tiers, and the results differ sharply across them. Small programs are solved reliably by all tested models. Medium programs are solved by the leading models in at least some runs. Large programs — including Pkl, a configuration language interpreter with 61,461 lines of code — defeated every model tested.

The Pkl failure is technically instructive. During a run that consumed approximately 1 billion tokens of inference and cost roughly $550, Claude Opus 4.6 correctly diagnosed that the program required a lazy evaluation architecture. The model never performed the necessary rewrite. With 770 million tokens still available, it continued iterating on the wrong architectural foundation. That specific failure — correct diagnosis, absent structural refactoring — represents a concrete, documented ceiling of current agentic systems rather than a general limitation of the underlying model's reasoning.

David Rein, a METR researcher and co-author of the benchmark, noted after the preliminary results in April that MirrorCode may already be approaching saturation. On 21 of 25 target programs, at least one model has passed 99% of tests or more. Eight targets have never been fully solved in any single run at 100%, but the difficulty is concentrated in a small number of hard edge cases rather than fundamental capability absence.

The Specification Gap: What This Means for Real-World Software Engineering

The researchers are precise about what MirrorCode does and does not prove. The benchmark's design requires something that is genuinely rare in real software development: a precise, programmatically checkable specification backed by hundreds of test cases and an executable reference implementation. In a professional software project, that specification usually does not exist at the start; it emerges through iteration with stakeholders, users, and product managers over time.

The benchmark demonstrates AI capability at execution, not at requirement discovery. A model that can reconstruct a 16,000-line bioinformatics toolkit from its observable behavior is demonstrating sustained architectural planning, iterative debugging, and tolerance for ambiguity across hours of uninterrupted work — qualitatively different from fixing a bug or generating a function. But it is not the same as being handed an ambiguous brief and producing production software from scratch.

The researchers frame this as a useful bound rather than a limitation: MirrorCode establishes what AI can do when the specification problem is solved. The remaining open question — how well AI performs when the specification itself must be discovered through stakeholder collaboration — is the next frontier the benchmark is not designed to measure.

Read more: Open-Source Coding Model Ornith-1.0 Writes Its Own Training Scaffold in Reinforcement Learning

What the Full Release Includes

Epoch AI and METR have open-sourced the benchmark scaffold and 22 of the 25 target programs, covering 132 task instances across the six supported implementation languages. The remaining three programs are held back as a private test set to preserve evaluation integrity as new models arrive. A leaderboard is now live at epoch.ai/MirrorCode where researchers can submit new models for evaluation.

The MirrorCode paper is authored by Tom Adamczewski and David Owen of Epoch AI, and David Rein of METR, with additional task contributions from Florian Brand, Giles Edkins, Allen Hart, and Daniel O'Connell.

The June 26 Epoch Brief also included two additional research items: an analysis of hyperscaler capital expenditure trajectories showing that major cloud providers — including Microsoft, Amazon, Alphabet, Meta, and Oracle — are on pace to spend beyond their operating cash flows before the end of 2026; and a taxonomy of more than 60 distinct tasks in frontier AI research and development, designed to track which parts of AI research remain unautomated.


Frequently Asked Questions

What is the MirrorCode benchmark and how does it work?

MirrorCode is a long-horizon coding benchmark developed by Epoch AI and METR that asks AI models to reconstruct real software programs without access to the original source code. The model receives only a compiled binary it can run, natural language documentation, and example input-output test cases. Solutions must produce byte-exact outputs on both visible and hidden test cases, making it impossible to game the benchmark through memorization or lookup tables. The 25 target programs span Unix utilities, bioinformatics, cryptography, interpreters, and other domains, with solutions implemented in any of six languages.

How does MirrorCode differ from SWE-bench?

SWE-bench gives a model the full source code of an existing project and asks it to fix a specific bug, with most tasks resolving in minutes. MirrorCode gives the model only an opaque binary and asks it to reconstruct the program's entire behavior from scratch — no source code, no internet access, no human guidance. Where SWE-bench measures targeted repair capability, MirrorCode measures sustained, architect-level construction across time horizons of hours to weeks.

Can AI replace software engineers based on MirrorCode results?

Not on the basis of this benchmark alone. MirrorCode requires something rare in real development: a precise, programmatically checkable specification with hundreds of test cases and an executable reference implementation. Professional software engineering typically begins without that level of specification clarity. What MirrorCode establishes is that when the specification problem is solved, AI can handle the execution at a professional engineer's scale — weeks of coding work — autonomously. The remaining open question is how AI performs when specifications are ambiguous, evolving, and require stakeholder negotiation.

What are the engineering limits MirrorCode revealed?

The benchmark exposed a specific architectural ceiling: AI systems can correctly diagnose that a program requires a particular architecture — such as lazy evaluation in an interpreter — but fail to perform the structural rewrite needed to implement it, even when given substantial additional inference budget. This is distinct from a general reasoning failure; it is a documented gap in how current agentic systems handle large-scale architectural refactoring mid-attempt.