OpenAI's GeneBench-Pro makes biology benchmarks about judgment

OpenAI published GeneBench-Pro on June 30, describing it as a research-level benchmark for AI agents working in computational biology. The benchmark is built around a simple problem with current AI evaluation: real scientific work is full of judgment calls, but many benchmarks reward clean recall, clean math, or a fixed workflow.

GeneBench-Pro has 129 questions across 10 domains and 21 subdomains, including statistical genetics, population genetics, quantitative genetics, regulatory omics, clinical diagnostics, pharmacogenomics, cancer genomics, microbial genomics, and forensic genetics. OpenAI says the benchmark is meant to test whether models can choose an analysis path, handle ambiguity, revise assumptions, and know when a result is decision-ready.

That makes it a useful marker for scientific AI. The hard question is no longer only whether a model can run code or read papers. It is whether it can make consequential analytical choices under uncertainty.

Synthetic data is the control mechanism

OpenAI says GeneBench-Pro problems are built synthetically so the benchmark creators know the full causal structure and data-generating process. That is not a minor implementation detail. It is the mechanism that lets the benchmark grade judgment-heavy tasks without collapsing into arbitrary author preference.

Many long-horizon science benchmarks use messy historical datasets. Those are realistic, but they often have no single correct path. Two competent analysts may pick different but equally defensible cutoffs or methods. If both approaches are reasonable, a benchmark can accidentally score the benchmark author’s taste rather than the model’s capability.

OpenAI is trying to avoid that failure mode. Because it controls the simulated data, it can tune complexity, accept reasonable analytical variation, and verify that incorrect analysis fails. It also says problem drafts were audited through trace analysis to check for leakage and unintended solution shortcuts.

That is the right direction for scientific evaluation. A benchmark should punish shallow shortcuts, not alternative but legitimate statistical choices.
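
The idea of grading against a known data-generating process can be shown with a toy sketch. This is not OpenAI's grader or data format. The simulation, the function names (simulate_trait, ols_slope, group_contrast, grade), and the 20% tolerance are all hypothetical choices made for illustration. The point is only that when the true effect is planted by the benchmark author, two different but legitimate estimators can both pass, while a plausible mistake fails.

```python
import random
import statistics

def simulate_trait(n, true_effect, noise_sd, seed):
    """Simulate genotype dosages (0, 1, 2) and a trait with a planted additive effect."""
    rng = random.Random(seed)
    genotypes = [rng.choice([0, 1, 2]) for _ in range(n)]
    traits = [true_effect * g + rng.gauss(0.0, noise_sd) for g in genotypes]
    return genotypes, traits

def ols_slope(x, y):
    """Per-allele effect estimated by ordinary least squares."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

def group_contrast(x, y):
    """Raw difference in mean trait between dosage 2 and dosage 0 carriers."""
    high = [b for a, b in zip(x, y) if a == 2]
    low = [b for a, b in zip(x, y) if a == 0]
    return statistics.fmean(high) - statistics.fmean(low)

def grade(estimate, truth, rel_tol=0.2):
    """Pass if the estimate is within a relative tolerance of the planted truth."""
    return abs(estimate - truth) <= rel_tol * abs(truth)

TRUE_EFFECT = 0.5  # known only because the data are synthetic
genotypes, traits = simulate_trait(n=5000, true_effect=TRUE_EFFECT, noise_sd=1.0, seed=7)

# Legitimate path A: regression slope.
regression_passed = grade(ols_slope(genotypes, traits), TRUE_EFFECT)

# Legitimate path B: homozygote contrast, scaled to a per-allele effect.
contrast_passed = grade(group_contrast(genotypes, traits) / 2, TRUE_EFFECT)

# Incorrect path: same contrast, but reported without dividing by the
# two-allele dosage gap, so it is about twice the per-allele effect.
unscaled_passed = grade(group_contrast(genotypes, traits), TRUE_EFFECT)

print(regression_passed, contrast_passed, unscaled_passed)
```

The tolerance is what separates taste from error in this sketch. Both legitimate paths land close to the planted value, so the grader does not care which one was chosen, but the unscaled contrast is off by roughly a factor of two and falls outside any reasonable band.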

Expert review matters here

OpenAI says it sent 82 of the 129 questions to external domain experts, including graduate students, postdoctoral researchers, industry scientists, and professors. Reviewers assessed realism, whether the target answer was identifiable, and whether the methods and estimators were appropriate.

That external review does not make the benchmark final or neutral. It does make it more credible than a closed internal task set with no outside domain pressure. Biology tasks can look plausible to non-specialists while hiding unrealistic assumptions, unmeasurable targets, or grading traps. Expert review is one way to catch those problems.

The benchmark also includes metadata around intended analysis structure, attached data files, case studies, and expert review outcomes. OpenAI says it is open-sourcing 10 representative questions on Hugging Face and will provide a 50-question subset to Artificial Analysis for independent third-party benchmarking in the near future.

That last part is important. OpenAI-reported results are useful, but the benchmark becomes more valuable when an outside evaluator can run the subset and compare models under a published method.

The reported scores show how early this is

OpenAI says GPT-5.6 Sol reaches a 28.7% pass rate at the highest reasoning level, or 31.5% with Pro mode enabled. It says GPT-5 scored below 5% when OpenAI began building the original GeneBench. OpenAI also says GPT-5.6 Sol at the highest reasoning level solves nearly six times as many questions as GPT-5.2 while using about two-thirds as many tokens.

Those are OpenAI’s reported numbers. They should not be treated as settled cross-model rankings until third-party runs are available. But the shape is still useful. A pass rate around 30% on the strongest configuration means the task family is not saturated today. It also means a model can be impressive and still fail most of the time on judgment-heavy scientific work.

That is a healthier message than “AI can do science now.” The useful claim is narrower: frontier models are improving on parts of scientific analysis, but the benchmark is still hard enough to expose gaps.
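
One reason to hold off on rankings is sampling noise: 129 questions is a small sample. The sketch below puts a rough 95% Wilson score interval around the two reported pass rates. It assumes each question is a single independent pass/fail trial, which is a simplification. This article does not describe how OpenAI aggregates attempts, and the function name wilson_interval is our own.

```python
import math

def wilson_interval(pass_rate, n, z=1.96):
    """Wilson score interval for a proportion observed over n trials (z=1.96 for ~95%)."""
    denom = 1 + z**2 / n
    center = (pass_rate + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(pass_rate * (1 - pass_rate) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

N_QUESTIONS = 129  # benchmark size reported by OpenAI

# Reported pass rates: highest reasoning level, and the same with Pro mode.
low_std, high_std = wilson_interval(0.287, N_QUESTIONS)
low_pro, high_pro = wilson_interval(0.315, N_QUESTIONS)

print(f"28.7%: {low_std:.1%} to {high_std:.1%}")
print(f"31.5%: {low_pro:.1%} to {high_pro:.1%}")
print("intervals overlap:", low_pro < high_std)
```

Under that assumption, the interval around 28.7% runs from roughly 22% to 37%, and it overlaps heavily with the interval around 31.5%. A gap of a few points between two configurations is within the noise of a benchmark this size, while the jump from below 5% is not.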

Our analysis

The benchmark is a product signal, not an automation license

OpenAI has been building a life-sciences line through GPT-Rosalind, rare-disease work, an AI chemist study, and now GeneBench-Pro. The pattern is clear. OpenAI wants scientific reasoning to be a measurable frontier, not just a collection of demos.

GeneBench-Pro gives that strategy a more inspectable target. If a model can handle ambiguous computational-biology tasks, it may be more useful in research settings where the value comes from choosing the right analysis, checking assumptions, and understanding when an answer is not ready.

The operational caveat is serious. Scientific agents need provenance, audit trails, human review, and domain-specific validation. A model that scores well on GeneBench-Pro should still not be trusted to make clinical or research decisions alone. The benchmark is evidence about analytical capability, not a license to automate scientific judgment end to end.

The best use of GeneBench-Pro is comparative and diagnostic. Which models fail because they choose the wrong method? Which fail because they make numerical mistakes? Which fail because they ignore uncertainty? Those questions are more useful than a single score.

That is why this release matters. It moves the scientific-AI conversation toward the messy middle of research: not facts, not code execution alone, but judgment under uncertainty.
