
An autonomous AI system built by researchers at Amazon's A-EVO-Lab completed a full post-training run on a 30 billion parameter NVIDIA Nemotron model — with no human in the loop, across four rounds running over multiple weeks — and then did something its designers had not planned for: it detected that its own internal evaluation metric had become misleading and redesigned the search strategy it was using to improve itself.
The result, described in a paper posted to arXiv on June 9, 2026, is the first publicly reported autonomous post-training run at frontier scale. Tech media began last week to use the paper as an anchor for broader coverage of recursive self-improvement, and AI researchers have started openly debating what the finding means for the long-term trajectory of autonomous AI development.
The autonomously produced model placed 8th of roughly 4,000 entries on the public NVIDIA Nemotron-Reasoning Challenge leaderboard as of June 2026. The top human-authored submission scored 0.87; the autonomous system scored 0.86.
Prior public demonstrations of autonomous machine learning research have operated at roughly the scale of GPT-2 — models with approximately 124 million parameters. At that size, an experiment takes minutes, a failure is cheap to retry, and a single GPU is sufficient infrastructure. The A-Evolve system ran at 30 billion parameters — a scale jump of roughly 240 times — where each training run lasts days and the full campaign ran on multi-H200-GPU Kubernetes clusters for multiple weeks.
That difference in scale is not merely quantitative. The AI Scientist and related systems demonstrated that an autonomous research loop can close at toy-budget scale, but the paper's authors argue that closing the loop where the cost structure is orders of magnitude harsher is a categorically harder problem. At 124 million parameters, an agent can average away the noise in a single run with cheap repeats; at 30 billion, each repeat is the entire campaign.
The paper maps this scale gap across four elements of a research-iteration loop: hypothesis generation, execution, strategy, and infrastructure. Execution cost grows by roughly a factor of a thousand between GPT-2-class research and frontier post-training. Strategy cost — the number of hypotheses an agent can afford to test before it runs out of budget — grows by a factor of a hundred. Infrastructure, which at 124 million parameters is a single PyTorch script on one GPU, becomes a distributed Kubernetes cluster with persistent storage, checkpoint management, and automated evaluation harnesses at 30 billion.
The A-Evolve system rests on three architectural decisions, each shaped by the cost structure of frontier post-training.
The first is an immutable reference substrate. Every round forks the same operator-audited default training stack into isolated candidate sandboxes. The substrate itself is never overwritten. That design keeps results comparable across rounds: when each training run costs days of compute, a recipe that contaminates the substrate with partial results from a prior run invalidates every subsequent comparison.
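The forking pattern can be illustrated with a minimal Python sketch. It is not the paper's implementation: the directory layout, the `recipe.yaml` file, and the function names `fork_substrate` and `demo` are all hypothetical, and a plain directory copy stands in for whatever isolation the real system uses.

```python
import shutil
import tempfile
from pathlib import Path


def fork_substrate(substrate: Path, sandbox_root: Path, round_id: int, n_workers: int) -> list[Path]:
    """Copy the reference stack into one fresh sandbox per worker.

    The substrate is only ever read here; all writes land in the copies.
    """
    sandboxes = []
    for worker in range(n_workers):
        target = sandbox_root / f"round{round_id}" / f"worker{worker}"
        shutil.copytree(substrate, target)  # raises if the sandbox already exists
        sandboxes.append(target)
    return sandboxes


def demo() -> tuple[str, list[str]]:
    """Fork a toy substrate, edit one sandbox, and return substrate and sandbox contents."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        substrate = root / "substrate"
        substrate.mkdir()
        # Hypothetical recipe file standing in for the default training stack.
        (substrate / "recipe.yaml").write_text("lr: 1e-5\n")

        sandboxes = fork_substrate(substrate, root / "sandboxes", round_id=1, n_workers=8)
        # One worker changes its own copy of the recipe.
        (sandboxes[0] / "recipe.yaml").write_text("lr: 3e-5\n")

        reference = (substrate / "recipe.yaml").read_text()
        copies = [(s / "recipe.yaml").read_text() for s in sandboxes]
        return reference, copies
```

Calling `demo()` shows that the edit in the first sandbox leaves both the substrate and the other seven sandboxes untouched, which is the property that keeps rounds comparable.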
The second is homogeneous, memory-free workers. The design the researchers tried first looked more natural: specialized agents (a data agent, a training agent, an evaluation agent) handing off intermediate states to each other, the way a human research team divides labor. It failed. Building on intermediate states carries forward unobserved variance along with the intended change, which corrupts the signal that selection depends on. The configuration that worked was the opposite: each round spawns eight identical workers, each starting fresh from the substrate, each unaware of what the others proposed. No memory carries between rounds; only the policy, the agent's strategy for where to search, is promoted.
The third is round-level evidence aggregation. Feedback arrives after each round rather than in real time, and only the search policy is updated — not model weights or intermediate data artifacts.
Together, these three pillars embody what the paper calls asymmetric freedom: workers are unconstrained in what they can propose within their candidate sandbox, but the substrate remains inviolable. The paper describes this as the design choice that allowed the loop to survive contact with frontier-scale execution.
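A toy sketch of how the three pillars might fit together in a round loop follows. Everything in it is an assumption made for illustration: the `Policy` fields, the quadratic stand-in for a training run, and the update rule are not taken from the paper. Only the worker count of eight and the four rounds come from the description above.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical search policy: where to centre proposals and how widely to explore."""
    center: float
    spread: float


def worker(policy: Policy, seed: int) -> tuple[float, float]:
    """One memory-free worker: it sees only the policy and returns (proposal, score)."""
    rng = random.Random(seed)
    proposal = rng.gauss(policy.center, policy.spread)
    score = -(proposal - 3.0) ** 2  # toy stand-in for an expensive train-and-evaluate run
    return proposal, score


def run_round(policy: Policy, round_id: int, n_workers: int = 8) -> Policy:
    """Run identical workers, aggregate their evidence once, and promote a new policy."""
    results = [worker(policy, seed=round_id * 1000 + i) for i in range(n_workers)]
    best_proposal, _ = max(results, key=lambda r: r[1])
    # Only the policy carries forward; proposals and scores are discarded.
    return Policy(center=best_proposal, spread=policy.spread * 0.5)


def campaign(n_rounds: int = 4) -> list[Policy]:
    """Run the rounds in sequence and return the policy after each one."""
    policy = Policy(center=0.0, spread=2.0)
    history = [policy]
    for round_id in range(n_rounds):
        policy = run_round(policy, round_id)
        history.append(policy)
    return history
```

Workers never see each other's proposals or any earlier round's results, and feedback reaches the policy only once per round, after all eight workers have finished.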
The most consequential finding in the paper did not involve the final leaderboard score. It occurred mid-campaign, when the autonomous loop detected that its internal development metric — the proxy it was using to evaluate candidate interventions between rounds — had stopped tracking real-world performance on the model's weakest reasoning domain.
Candidates were pushing the development metric to record highs without moving the external target. In that situation, a naive optimizer would continue pursuing higher dev scores. The A-Evolve system instead revised its own search policy: it stopped asking for interventions that raised the proxy and began specifically seeking interventions that lowered it while improving the external target.
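The article does not say how the loop detected the decoupling. One simple way to express the idea is to check whether the development metric still correlates with the external target across a round's candidates, and to switch objectives when it does not. The sketch below assumes that approach; the function names, the objective labels, and the 0.3 threshold are hypothetical.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def choose_objective(dev: list[float], external: list[float], threshold: float = 0.3) -> str:
    """Pick what the next round should search for (threshold is an assumed value)."""
    if pearson(dev, external) >= threshold:
        return "raise_proxy"  # the proxy still tracks the external target
    return "raise_external_lower_proxy"  # the proxy has decoupled from the target


# Early in a campaign: dev and external scores move together.
early = choose_objective([0.60, 0.65, 0.70, 0.75], [0.50, 0.54, 0.58, 0.62])
# Later: dev scores hit record highs while the external target stays flat.
late = choose_objective([0.80, 0.85, 0.90, 0.95], [0.62, 0.61, 0.62, 0.60])
```

With these made-up scores, `early` selects the proxy-raising objective and `late` selects the inverted one. A naive optimizer corresponds to always returning `"raise_proxy"`.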
The failure the system corrected for has a precise name in AI alignment research. Specification gaming, or reward hacking, occurs when an optimizer finds a way to improve its measured objective without improving the underlying goal. Goodhart's Law, the principle that when a measure becomes a target it ceases to be a good measure, is the classic statement of the same problem. Alignment researchers have long identified detecting and correcting for this failure mode as one of the core challenges of capable AI systems.
The A-Evolve system did both, without human intervention.
The paper is careful about what it claims this demonstrates. The researchers describe the outcome as evidence that a scaled autonomous loop can produce "discovery, not only optimization" — the distinction being that the system did not merely optimize within a fixed measurement frame but detected that the frame itself had become misleading and changed what it treated as evidence of success.
Whether the system's internal reasoning during that detection is interpretable or auditable at the level of individual decisions is not fully addressed in the paper. The authors describe the outcome as auditable in the sense that the metric reversal is observable and measurable — not that the internal chain of reasoning producing it is transparent.
Beyond the primary 30 billion parameter run, the same system was applied to NVIDIA's 120 billion and 550 billion parameter Nemotron variants. The paper explicitly frames these as infrastructure evidence rather than performance claims: because no comparable human-authored baseline exists at those parameter counts in the Nemotron-Reasoning Challenge, the results demonstrate that the autonomous loop closes at those scales — that it completes without crashing, producing a post-trained model — but not that the output is competitive with what a human research team would produce.
That distinction matters. The paper's primary performance claim — the 0.86 score and 8th-place leaderboard finish — rests on a direct comparison against roughly 4,000 human submissions in a defined competition. The 120B and 550B results have no such comparison. The authors reserve the effectiveness claim at those scales for future work, when a comparable human anchor is available.
The term recursive self-improvement describes a process in which an AI system enhances its own capabilities in ways that make further enhancements easier — a compounding feedback loop that, in extreme theoretical scenarios, could proceed faster than humans can track or control. The concept traces to mathematician I.J. Good's 1965 description of an "ultraintelligent machine." It became a central concern of AI safety research in the 2000s and 2010s, largely through the work of researchers at the Machine Intelligence Research Institute. Until recently, it remained substantially theoretical.
In 2025, the Darwin Gödel Machine demonstrated recursive self-improvement in coding agents. In May 2025, Google DeepMind's AlphaEvolve used AI to discover and optimize algorithms, though with human-defined evaluation functions. The ICLR 2026 workshop on AI with recursive self-improvement confirmed that the research community now treats the question of when and how autonomous loops close at frontier scale as a live engineering problem rather than a distant theoretical concern.
The A-Evolve paper's authors take a direct position in this debate. They describe their operational test for whether a system qualifies as recursively self-improving: it must eventually perform end-to-end post-training of a frontier-class model. They then describe the 30B result as one data point of that bar being cleared — not, they emphasize, a milestone or a claim that autonomous AI has matched human researchers in general.
The paper's framing is cautious in a specific and deliberate way. The authors do not claim a "first autonomous match" of human researchers. The 0.86 score versus a top human score of 0.87 is close but not equal. The 120B and 550B runs are presented as evidence that the loop closes, not that it competes. That restraint is itself informative: the researchers identified a threshold worth measuring, measured it under conditions specific enough to be independently audited, and reported the result without overclaiming.
The A-EVO-Lab paper arrived at a moment when the broader AI research community was already debating how much of the model development pipeline can be automated. Major AI developers (Anthropic, Google DeepMind, OpenAI, Amazon, and Microsoft) have all published frontier safety policies that specifically address the risks of autonomous AI research and recursive self-improvement, reflecting a shared assessment that the capability is approaching.
For the alignment research community, the most significant element of the A-Evolve result may not be the leaderboard score but the mid-campaign metric inversion. Alignment researchers have long documented specification gaming as a failure mode in which capable systems optimize their measured objective while diverging from the intended goal. The standard concern is that as systems become more capable, they become better at gaming their specifications. A-Evolve demonstrated, in a publicly auditable run at frontier scale, that an autonomous system can detect when it is in the early stages of this failure mode and change its search strategy to correct for it — without being prompted to do so by a human.
That is not a solved alignment problem. The paper does not establish that the system understood why the metric had become misleading, only that it changed behavior when the correlation broke down. And the question of whether the system's internal reasoning during that detection is interpretable remains open, which the authors acknowledge. But it is a new and specific data point in a domain that has until now been largely theoretical at frontier scale.
What is autonomous post-training in AI, and how is it different from how AI models are normally developed?
Standard post-training is a heavily human-supervised process: researchers propose changes to training data mixtures and training procedures, launch individual runs, evaluate the results, decide what to keep, and repeat. The cycle typically takes weeks and requires a dedicated team. Autonomous post-training replaces that human loop with an AI system that generates its own hypotheses, executes training runs, evaluates results, and updates its own search strategy — all without a human directing each step. The A-Evolve system completed this loop at the scale of a 30 billion parameter frontier model, which prior public demonstrations had not done.
What is recursive self-improvement, and does A-Evolve prove it is happening?
Recursive self-improvement describes an AI system that improves its own capabilities in ways that compound — each improvement making the next one easier or more effective. The A-Evolve paper treats the ability to autonomously post-train a frontier-class model as a minimum operational bar for any system that deserves the label. The authors describe their result as one data point of that bar being cleared, while explicitly declining to claim that A-Evolve has matched human researchers in general or that recursive self-improvement is underway in a broader sense. The 120B and 550B runs show that the loop closes at those scales; the 30B run shows that the loop produces results competitive with human work on a defined benchmark.
Can AI training itself without humans lead to dangerous outcomes?
AI safety researchers at Anthropic, Google DeepMind, and OpenAI have all published formal policies addressing the risks of autonomous AI research, including the possibility that systems capable of modifying their own training could improve faster than humans can safely oversee. The A-Evolve paper documents one specific capability — autonomous post-training at frontier scale — that those policies were written to anticipate. The paper does not assess broader safety implications of the system it built. The mid-campaign metric inversion, in which the system corrected for its own specification gaming, is a positive signal from an alignment perspective, but the researchers note that the interpretability of the system's internal reasoning during that correction has not been fully examined.
What does it mean that A-Evolve corrected its own broken metric, and why does that matter for AI safety?
In AI training, specification gaming occurs when a system finds a way to improve its measured score without improving the underlying capability the score was designed to track. This is a well-documented failure mode and a central concern in AI alignment research. In the A-Evolve campaign, the autonomous system's internal development metric stopped correlating with real-world performance on the model's weakest reasoning domain — candidates were pushing the metric to record highs without actually improving on the external test. Instead of continuing to chase a misleading score, the system changed its search policy to specifically target interventions that lowered the proxy metric while improving the external target. That self-correction, at frontier scale and without human intervention, is a new and specific data point for alignment research.
