
OpenAI's GeneBench-Pro makes biology benchmarks about judgment

GeneBench-Pro tests whether AI agents can handle ambiguous computational-biology analysis, not just clean benchmark questions.

OpenAI published GeneBench-Pro on June 30, describing it as a research-level benchmark for AI agents working in computational biology. The benchmark is built around a simple problem with current AI evaluation: real scientific work is full of judgment calls, but many benchmarks reward clean recall, clean math, or a fixed workflow.

GeneBench-Pro has 129 questions across 10 domains and 21 subdomains, including statistical genetics, population genetics, quantitative genetics, regulatory omics, clinical diagnostics, pharmacogenomics, cancer genomics, microbial genomics, and forensic genetics. OpenAI says the benchmark is meant to test whether models can choose an analysis path, handle ambiguity, revise assumptions, and know when a result is decision-ready.

That makes it a useful marker for scientific AI. The hard question is no longer only whether a model can run code or read papers. It is whether it can make consequential analytical choices under uncertainty.

Synthetic data is the control mechanism

OpenAI says GeneBench-Pro problems are built synthetically so the benchmark creators know the full causal structure and data-generating process. That is not a minor implementation detail. It is the mechanism that lets the benchmark grade judgment-heavy tasks without collapsing into arbitrary author preference.

Many long-horizon science benchmarks use messy historical datasets. Those are realistic, but they often have no single correct path. Two competent analysts may pick different cutoffs or different but equally defensible methods. If both approaches are reasonable, a benchmark that accepts only one of them ends up scoring agreement with its author’s taste rather than the model’s capability.

OpenAI is trying to avoid that failure mode. Because it controls the simulated data, it can tune complexity, accept reasonable analytical variation, and verify that incorrect analysis fails. It also says problem drafts were audited through trace analysis to check for leakage and unintended solution shortcuts.

That is the right direction for scientific evaluation. A benchmark should punish shallow shortcuts, not legitimate alternative statistical choices.
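
The grading idea in this section can be illustrated with a toy Python sketch. It is not OpenAI's harness: the data-generating process, the effect size, the tolerance, and every function name below are hypothetical choices made for illustration. The point is only that when the true effect is known by construction, a grader can accept two different but valid adjustment strategies and still reject an analysis that ignores the confounder.

```python
import random

TRUE_EFFECT = 0.5  # hypothetical ground truth, known because the data is simulated
TOLERANCE = 0.1    # hypothetical acceptance band around the truth


def simulate(n=5000, seed=0):
    """Draw (confounder, exposure, outcome) rows from a known process."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        c = rng.gauss(0, 1)                    # confounder
        x = 0.8 * c + rng.gauss(0, 1)          # exposure depends on the confounder
        y = TRUE_EFFECT * x + c + rng.gauss(0, 1)
        rows.append((c, x, y))
    return rows


def slope(xs, ys):
    """Least-squares slope of ys on xs (both are mean-centered internally)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx


def naive(rows):
    """Shortcut analysis: regress outcome on exposure, ignoring the confounder."""
    return slope([x for _, x, _ in rows], [y for _, _, y in rows])


def residualized(rows):
    """Valid path 1: remove the confounder from both variables, then regress."""
    cs = [c for c, _, _ in rows]
    xs = [x for _, x, _ in rows]
    ys = [y for _, _, y in rows]
    bx, by = slope(cs, xs), slope(cs, ys)
    rx = [x - bx * c for c, x in zip(cs, xs)]
    ry = [y - by * c for c, y in zip(cs, ys)]
    return slope(rx, ry)


def stratified(rows, bins=20):
    """Valid path 2: pool within-stratum slopes across confounder quantile bins."""
    ordered = sorted(rows)  # tuples sort by the confounder first
    size = len(ordered) // bins
    sxy = sxx = 0.0
    for i in range(bins):
        chunk = ordered[i * size:(i + 1) * size]
        xs = [x for _, x, _ in chunk]
        ys = [y for _, _, y in chunk]
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        sxy += sum((a - mx) * (b - my) for a, b in zip(xs, ys))
        sxx += sum((a - mx) ** 2 for a in xs)
    return sxy / sxx


def grade(estimate):
    """Pass any estimate close enough to the known truth, whatever the method."""
    return abs(estimate - TRUE_EFFECT) <= TOLERANCE


if __name__ == "__main__":
    rows = simulate()
    for name, fn in [("naive", naive), ("residualized", residualized), ("stratified", stratified)]:
        est = fn(rows)
        print(f"{name:>12}: estimate={est:.3f} pass={grade(est)}")
```

In this toy setup the naive slope is biased upward because the confounder drives both exposure and outcome, so it lands outside the band, while the two adjusted estimators differ slightly from each other and both pass. With a historical dataset there is no `TRUE_EFFECT` to compare against, which is why the grader would have to fall back on a reference answer chosen by the author.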

Expert review matters here

OpenAI says it sent 82 of the 129 questions to external domain experts, including graduate students, postdoctoral researchers, industry scientists, and professors. Reviewers assessed realism, whether the target answer was identifiable, and whether the methods and estimators were appropriate.

That external review does not make the benchmark final or neutral. It does make it more credible than a closed internal task set with no outside domain pressure. Biology tasks can look plausible to non-specialists while hiding unrealistic assumptions, unmeasurable targets, or grading traps. Expert review is one way to catch that.

The benchmark also includes metadata around intended analysis structure, attached data files, case studies, and expert review outcomes. OpenAI says it is open-sourcing 10 representative questions on Hugging Face and will provide a 50-question subset to Artificial Analysis for independent third-party benchmarking in the near future.

That last part is important. OpenAI-reported results are useful, but the benchmark becomes more valuable when an outside evaluator can run the subset and compare models under a published method.

The reported scores show how early this is

OpenAI says GPT-5.6 Sol reaches a 28.7% pass rate at the highest reasoning level, or 31.5% with Pro mode enabled. It says GPT-5 scored below 5% when OpenAI began building the original GeneBench. OpenAI also says GPT-5.6 Sol at the highest reasoning level solves nearly six times as many questions as GPT-5.2 while using about two-thirds as many tokens.

Those are OpenAI’s reported numbers. They should not be treated as settled cross-model rankings until third-party runs are available. But the shape is still useful. A pass rate around 30% on the strongest configuration means the task family is not saturated today. It also means a model can be impressive and still fail most of the time on judgment-heavy scientific work.

That is a healthier message than “AI can do science now.” The useful claim is narrower: frontier models are improving on parts of scientific analysis, but the benchmark is still hard enough to expose gaps.
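
One way to see why the reported pass rates are not yet a ranking is to ask how much sampling noise a 129-question benchmark carries. The sketch below computes a 95% Wilson score interval in Python. It assumes, purely for illustration, that the 28.7% figure corresponds to a single pass/fail attempt on each of the 129 questions and that questions are independent; the summary above does not say how OpenAI aggregates attempts, so treat the numbers as a rough guide rather than OpenAI's own error bars.

```python
import math


def wilson_interval(passes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = passes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


N_QUESTIONS = 129                      # benchmark size reported by OpenAI
REPORTED_PASS_RATE = 0.287             # highest reasoning level, as reported
PRO_MODE_PASS_RATE = 0.315             # with Pro mode enabled, as reported

# Assumption: convert the rate to a whole number of solved questions.
passes = round(REPORTED_PASS_RATE * N_QUESTIONS)
low, high = wilson_interval(passes, N_QUESTIONS)

if __name__ == "__main__":
    print(f"solved (assumed): {passes} of {N_QUESTIONS}")
    print(f"95% interval: {low:.3f} to {high:.3f}")
    print(f"Pro-mode rate inside interval: {low < PRO_MODE_PASS_RATE < high}")
```

Under these assumptions the interval spans roughly 22% to 37%, wide enough to contain the 31.5% Pro-mode figure. Gaps of a few points on a benchmark this size are hard to separate from noise without repeated or independent runs.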

The AI Feed Desk

Editorial desk

The AI Feed Desk tracks AI provider updates, model releases, agent tooling, and enterprise adoption, turning fast-moving announcements into source-linked context for builders and operators.
