
OpenAI on Tuesday released GeneBench-Pro, a new research-level benchmark that confronts AI agents with the messy, judgment-heavy analytical work that real computational biologists perform every day — and found that its most capable model, GPT-5.6 Sol, solved fewer than one in three problems even at maximum compute. The result gives researchers, biotech companies, and drug discovery teams the first deterministically graded evidence of where the gap between AI capability and autonomous scientific analysis actually sits.
Most existing AI benchmarks for biology test knowledge retrieval or single-step reasoning: can a model explain what a gene regulatory network does, or identify a protein structure? GeneBench-Pro takes a harder line.
Each of the benchmark's 129 problems hands an AI agent a realistic, deliberately noisy dataset, a brief experimental context, and a target estimand tied to a downstream scientific or clinical decision. The agent must then do what a scientist would: explore the data, identify quality-control problems — mislabeled samples, ancestry swaps, ancient-DNA biases, measurement error — decide which analytical approach is appropriate, iterate when early results suggest the initial plan was wrong, and finally deliver a numerical answer in a structured format.
The benchmark covers 21 sub-domains across 10 domains: statistical genetics, population genomics, quantitative genetics, regulatory omics, functional genomics, proteomics, clinical pharmacogenomics, cancer somatic genomics, microbial genomics, and forensic genetics. To complete even one problem correctly, a model must chain together every stage of that workflow, from data exploration and quality control through method selection to a final numerical estimate.
What separates GeneBench-Pro from earlier long-horizon science benchmarks is how it handles correctness. Many prior biology benchmarks are built around historical, real-world datasets — which creates a structural problem: a messy historical dataset may support several defensible analytical choices, so a model that selects one legitimate path might fail grading simply because the benchmark author chose a different one.
GeneBench-Pro solves this by generating every problem synthetically from a fully known causal structure. Because OpenAI controls the entire data-generating process, it can grade answers deterministically against a verified ground truth. It can also tune problem difficulty, run ablation studies to confirm that plausible-but-wrong analysis paths fail, and audit for information leakage or unintended shortcuts. OpenAI sent 82 of the 129 problems to external domain experts — including graduate students, postdoctoral researchers, industry scientists, and professors — to verify realism and confirm that the intended answers were identifiable from the data.
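The idea of grading against a known data-generating process can be illustrated with a toy example. The sketch below is not OpenAI's generator or grader: the function names, the single-confounder structure, the effect size, and the tolerance are all assumptions chosen for illustration. It simulates data where the true effect is fixed in advance, then shows that a plausible-but-wrong analysis fails a deterministic check while a confounder-adjusted analysis passes.

```python
import random

def simulate(n=20000, true_effect=0.5, seed=0):
    """Toy data-generating process (hypothetical): confounder c drives both x and y."""
    rng = random.Random(seed)
    c_vals, x_vals, y_vals = [], [], []
    for _ in range(n):
        c = rng.gauss(0.0, 1.0)                               # confounder
        x = 0.8 * c + rng.gauss(0.0, 1.0)                     # exposure depends on c
        y = true_effect * x + 1.0 * c + rng.gauss(0.0, 1.0)   # outcome depends on x and c
        c_vals.append(c)
        x_vals.append(x)
        y_vals.append(y)
    return c_vals, x_vals, y_vals

def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs (with intercept)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx

def residuals(target, predictor):
    """Residuals of target after regressing it on predictor (with intercept)."""
    b = slope(predictor, target)
    mp = sum(predictor) / len(predictor)
    mt = sum(target) / len(target)
    return [(t - mt) - b * (p - mp) for t, p in zip(target, predictor)]

def grade(answer, truth, tol=0.05):
    """Deterministic pass/fail check against the known ground truth (tolerance is assumed)."""
    return abs(answer - truth) <= tol

TRUE_EFFECT = 0.5
c, x, y = simulate(true_effect=TRUE_EFFECT)

# Plausible-but-wrong path: ignore the confounder entirely.
naive = slope(x, y)

# Adjusted path: partial out the confounder from both variables first
# (Frisch-Waugh), then regress the residuals on each other.
adjusted = slope(residuals(x, c), residuals(y, c))

print("naive passes:   ", grade(naive, TRUE_EFFECT))
print("adjusted passes:", grade(adjusted, TRUE_EFFECT))
```

Because the effect is set by the simulator rather than inferred from historical data, there is only one correct target, and the grader needs no rubric or human judgment. The same setup doubles as an ablation: the naive estimate is biased upward by the confounder, so it should fail the check.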
Each agent receives an isolated workspace equipped with the standard bioinformatics stack: Python, scientific computing libraries, and core genomics packages including PLINK 2.0. No domain-specific proprietary tooling is required.
The results are diagnostic for anyone building or procuring AI tools for scientific workflows. OpenAI's strongest general-purpose model, GPT-5.6 Sol, attained a 28.7% pass rate at the highest reasoning level, rising to 31.5% with Pro mode enabled. On the original GeneBench — a somewhat easier predecessor — GPT-5.5 Pro posted 33.2%. Anthropic's Claude Opus 4.8 reached 16.0% on GeneBench-Pro, the strongest result among non-OpenAI models tested. Gemini 3.1 Pro scored 3.1% on GeneBench-Pro.
The progress since the benchmark's development began is striking: when OpenAI first began building the original GeneBench, its best frontier model, GPT-5, scored below 5%. The jump to 31.5% represents substantial improvement — but roughly 70% of problems remain beyond the reliable reach of today's most capable models.
The results also expose the centrality of test-time compute to this class of task. At the lowest reasoning level, GPT-5.6 Sol achieves a single-digit pass rate. At the highest level, it solves nearly six times as many problems as GPT-5.2 while using about two-thirds as many tokens — a meaningful efficiency gain that signals how much headroom exists in the compute-scaling dimension of AI science capability.
External reviewers pinpointed a specific failure pattern. Lex Flagel, Director of Data Science at Gencove, observed that models tend to fail at the quality-control stage of the analysis: "It seemed like most of the agents failed on [data discrepancies, such as ancestry swaps]. They aren't cautious enough about data issues. Maybe that highlights a weakness of current models. And a lot of biological data has irregularities."
The paper describes this as a "noticing-to-acting gap": models frequently identify a local diagnostic signal — an artifact, a confounding variable, a quality-control failure — but do not propagate that observation into the correct downstream analytical decision. They select the wrong estimator, or persist on an initially plausible but incorrect analysis path, even after their own exploratory analysis has revealed the problem.
A case study in the paper illustrates this gap concretely. When given a pharmacogenomic time-to-event problem involving time-varying treatment and confounder feedback, GPT-5.5 used a conventional Cox outcome model but did not address treatment-confounder feedback — a meaningful error. GPT-5.6 Sol used a more appropriate marginal structural Cox model with stabilized inverse-probability weights, excluding flagged prevalent users and treating exposure as time-varying with a 90-day efficacy lag. The difference between the two outcomes was not a matter of knowing which test exists: it was a matter of recognizing that the data structure required the more complex method.
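For readers unfamiliar with stabilized inverse-probability weights, the following sketch shows the core idea in a deliberately simplified setting. It is not the analysis from the paper: it uses a single treatment decision rather than time-varying exposure, a weighted difference in means rather than a Cox model, and simulated data whose parameters are invented for illustration. All function names are hypothetical.

```python
import random
from collections import Counter

def simulate(n=20000, seed=1):
    """Toy data (hypothetical): binary confounder l drives treatment a and outcome y."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        l = 1 if rng.random() < 0.4 else 0
        p_treat = 0.7 if l == 1 else 0.2                  # sicker patients treated more often
        a = 1 if rng.random() < p_treat else 0
        y = 1.0 * a + 2.0 * l + rng.gauss(0.0, 1.0)       # true treatment effect is 1.0
        data.append((l, a, y))
    return data

def stabilized_weights(data):
    """Stabilized weight for each row: P(A = a) / P(A = a | L = l).

    Probabilities are estimated with simple cell frequencies, which works
    here only because both variables are binary.
    """
    n = len(data)
    n_a = Counter(a for _, a, _ in data)
    n_l = Counter(l for l, _, _ in data)
    n_la = Counter((l, a) for l, a, _ in data)
    weights = []
    for l, a, _ in data:
        p_marginal = n_a[a] / n
        p_conditional = n_la[(l, a)] / n_l[l]
        weights.append(p_marginal / p_conditional)
    return weights

def weighted_mean_difference(data, weights):
    """Weighted mean outcome among treated minus weighted mean among untreated."""
    num = {0: 0.0, 1: 0.0}
    den = {0: 0.0, 1: 0.0}
    for (_, a, y), w in zip(data, weights):
        num[a] += w * y
        den[a] += w
    return num[1] / den[1] - num[0] / den[0]

data = simulate()
sw = stabilized_weights(data)

naive = weighted_mean_difference(data, [1.0] * len(data))  # unweighted comparison
ipw = weighted_mean_difference(data, sw)                    # confounding removed by weighting

print(f"naive estimate: {naive:.2f}")
print(f"IPW estimate:   {ipw:.2f}")
print(f"mean stabilized weight: {sum(sw) / len(sw):.3f}")
```

In the time-varying case described in the paper, the weight for each subject would instead be a product of such ratios across follow-up intervals, conditioning on treatment and confounder history. A mean stabilized weight close to 1 is a common sanity check that the weighting model is not badly misspecified.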
Alexander Strudwick Young, an Assistant Professor in Human Genetics at UCLA, confirmed the difficulty level: the problems would have been challenging for a graduate student to complete without iterated feedback from an experienced supervisor, requiring thoughtful analysis and awareness of potential pitfalls rather than simply applying an off-the-shelf method.
Readers should weigh one structural feature of GeneBench-Pro when interpreting the results: OpenAI used its own frontier GPT models to evaluate and harden problems during development. The paper explicitly acknowledges this, noting the company suspected GeneBench-Pro might therefore be biased against GPT models relative to other model families. OpenAI's own assessment of that concern is that competitor models "at best matched the performance of the corresponding GPT model at the time of their release, and tended to fall short considerably", which suggests that any such bias against GPT models, if present, was not large enough to change the ranking.
Independent validation is planned: OpenAI is providing a 50-question subset to Artificial Analysis for third-party benchmarking. Until those results are published, the leaderboard reflects internal evaluation by the same organization that built both the benchmark and the leading model.
A peer-reviewed analysis published in Nature Medicine in June 2026 — examining OpenAI's HealthBench evaluation — found that industry-created benchmarks may systematically favor the systems developed by their creators, and called for independently constructed evaluation instruments. That critique applies equally here until Artificial Analysis publishes its results.
Despite pass rates below one-third, OpenAI makes a pointed economic case for deploying AI in scientific workflows now. Reviewers estimated that a typical GeneBench-Pro problem would take a human expert approximately 20 to 40 hours to complete. At a conservative $200 per hour, the human labor cost of a single problem comes to roughly $4,000 to $8,000. Current inference costs for AI agents run only a few dollars per problem.
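The arithmetic behind that comparison is easy to reproduce. In the sketch below, the hours, hourly rate, and pass rate come from the figures reported above; the $5 per-attempt inference cost is an assumed stand-in for "a few dollars" and is not a number from OpenAI.

```python
# Figures reported in the article.
HOURS_LOW, HOURS_HIGH = 20, 40      # reviewer estimate of expert hours per problem
HOURLY_RATE_USD = 200               # "conservative" hourly rate
PASS_RATE = 0.315                   # GPT-5.6 Sol with Pro mode

# Assumption for illustration only: "a few dollars" taken as $5 per attempt.
AI_COST_PER_ATTEMPT_USD = 5.0

human_cost_low = HOURS_LOW * HOURLY_RATE_USD
human_cost_high = HOURS_HIGH * HOURLY_RATE_USD

# Expected inference spend per correctly solved problem,
# if every problem is attempted once and only passes count.
ai_cost_per_solved = AI_COST_PER_ATTEMPT_USD / PASS_RATE

print(f"human cost per problem: ${human_cost_low:,} to ${human_cost_high:,}")
print(f"AI cost per solved problem: ${ai_cost_per_solved:.2f}")
print(f"ratio at the low end: {human_cost_low / ai_cost_per_solved:.0f}x")
```

This comparison leaves out the cost of checking which AI answers are actually correct. On a benchmark the grader does that for free, but in a real pipeline a human would need to review outputs, which narrows the gap.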
The economic gap is so large that even partial automation — an AI that handles the portions of a problem it can reliably solve, escalating the rest to a human expert — could generate substantial value in high-throughput research pipelines. Jennifer Grundman, a PhD Candidate in Human Genetics at UCLA, framed the value proposition clearly: models that perform well on GeneBench-Pro problems would be able to assist researchers in determining correct workflows and exploring data, which she said could greatly improve the pace, thoroughness, and reproducibility of research.
OpenAI says that at the current pace of improvement, GeneBench-Pro may be saturated — meaning the best models approach near-perfect performance — by the end of 2026. That timeline is aggressive but consistent with the rate at which the best pass rate has climbed from below 5% to 31.5% since the original GeneBench was built.
If models do approach saturation on this benchmark, the implications for drug discovery, genomics, and clinical research would be significant. Human genetic evidence is already central to pharmaceutical target prioritization — mechanisms with genetic support are substantially more likely to lead to approved treatments. Biobank-scale datasets now link molecular, phenotypic, and health-record data at unprecedented breadth. The limiting factor has shifted from generating the data to turning it into actionable insights. AI systems that can reliably perform the class of multi-step analysis that GeneBench-Pro measures could compress the timeline from biological observation to treatment candidate in ways that would be difficult for human-only teams to match.
OpenAI is fully open-sourcing 10 representative GeneBench-Pro questions on Hugging Face, with an interactive interface for browsing them. The full technical paper is available on the OpenAI website.
What is GeneBench-Pro and how does it differ from other AI biology benchmarks?
GeneBench-Pro is a 129-problem research-level benchmark OpenAI released on June 30, 2026, that tests AI agents on the kind of multi-step analytical judgment real computational biologists exercise — not knowledge recall or single-step reasoning. Its defining technical feature is synthetic data generation from a fully known causal structure, which allows deterministic grading against a verified ground truth rather than the rubric-based or author-preference grading that weakens many prior long-horizon benchmarks. Problems span 10 domains including statistical genetics, cancer genomics, pharmacogenomics, and forensic genetics.
Why do frontier AI models still fail most computational biology research tasks?
The research points to a specific failure mode called the noticing-to-acting gap: models identify local diagnostic signals — data irregularities, quality-control failures, confounding variables — but do not propagate those observations into the correct analytical decision downstream. They select the wrong estimator or persist on an initially plausible but incorrect analysis path. This gap is distinct from knowledge retrieval capability: models often know which methods exist, but misjudge which one the data actually requires.
Can AI already provide value in genomics and drug discovery at these pass rates?
OpenAI's economic analysis suggests yes, selectively. A human expert requires roughly 20 to 40 hours per problem at approximately $200 per hour, or about $4,000 to $8,000 per problem. AI inference costs only a few dollars per problem. Even partial automation of the tasks an AI can reliably handle, with human expert escalation for the rest, could generate measurable value in high-throughput research pipelines. The key is an accurate understanding of where AI judgment is reliable and where human oversight remains mandatory.
How should researchers interpret GeneBench-Pro results given OpenAI's role as both benchmark creator and leading model developer?
With appropriate caution until independent results are available. OpenAI used its own frontier models to harden the benchmark during development — a conflict of interest the company acknowledges explicitly. Independent evaluation of a 50-question subset by Artificial Analysis is planned but has not been published. Until those results appear, the leaderboard reflects internal evaluation by the same organization that developed both the benchmark and its top-performing model.
