OpenAI shows GPT-5.4 improving a medicinal-chemistry reaction

OpenAI published a medicinal-chemistry result on June 17, 2026, saying GPT-5.4, connected to Molecule.one’s Maria AI and high-throughput lab, found a useful additive for a difficult Chan-Lam coupling reaction. The company says the selected proposal led to 10,080 physical reactions in Maria Lab and bench-scale validation by human chemists.

The result matters because it is not another static benchmark score. OpenAI is showing a model inside a real experimental loop: proposing research ideas, helping design experiments, reading results, and suggesting follow-up tests. The company is also careful about where autonomy ends. It calls the workflow near-autonomous, not fully autonomous, because human chemists still selected proposals, corrected plans, handled lab operations, and repeated representative reactions.

The experiment changed the evidence standard

Most AI-for-science claims are evaluated in software. A model answers biology questions, predicts a structure, writes an analysis, or ranks hypotheses. Those tasks matter, but they do not prove that a model can move a physical research program forward.

This project is more concrete. OpenAI says GPT-5.4 was given an open-ended goal to improve one of several important reaction classes. The selected proposal focused on primary sulfonamide Chan-Lam coupling, where chemists use boronic acids to form carbon-nitrogen bonds. OpenAI says the model identified the substrate class as valuable and suggested mild oxidants, including TEMPO, as possible additives.

That is the useful part of the story: the model’s idea had to survive physical chemistry. OpenAI says Maria Lab ran two cycles of experiments, then human chemists repeated representative reactions by hand at bench scale.

10,080: Maria Lab reactions reported for OAI-M1-03 (source: OpenAI)

88%: tested boronic acids with improved measured yields (source: OpenAI)

11 of 14: bench-scale substrate pairs with higher yields (source: OpenAI)

The chemistry hook is narrow, but real

The reaction class matters because synthesis can bottleneck drug discovery. Researchers can only test molecules they can make or obtain. If a coupling reaction works unreliably across a substrate class, a medicinal chemistry team may spend time redesigning routes instead of exploring candidates.

OpenAI says the optimized conditions raised the mean yield from 16.6% to 25.2%, increased the share of reactions above 30% yield from 15.6% to 37.5%, and improved measured yields across most tested substrate groups. It also says 4-hydroxy-TEMPO, a cheaper analog, produced similar performance in follow-up testing.
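The size of those gains depends on whether they are read as percentage-point changes or as relative changes. The short Python sketch below works through that arithmetic using only the figures OpenAI reported; the helper name `gain` is an illustrative choice, not something from OpenAI's write-up.

```python
def gain(before: float, after: float) -> tuple[float, float]:
    """Return (change in percentage points, relative change as a fraction of `before`)."""
    delta = after - before
    return delta, delta / before


# Figures as reported by OpenAI: baseline vs. optimized conditions.
mean_pts, mean_rel = gain(16.6, 25.2)    # mean yield, in percent
share_pts, share_rel = gain(15.6, 37.5)  # share of reactions above 30% yield, in percent

print(f"mean yield: +{mean_pts:.1f} points ({mean_rel:.0%} relative)")
print(f"share above 30% yield: +{share_pts:.1f} points ({share_rel:.0%} relative)")
```

Read this way, the mean yield rose by 8.6 percentage points, roughly a 52% relative increase, while the share of reactions above 30% yield rose by 21.9 points, about 2.4 times the baseline share. Both framings describe the same data; the absolute yields remain modest.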

That does not make the result a new general manufacturing method. OpenAI says more work is needed to map the reaction mechanism, define substrate scope, test broader lab conditions, and reproduce the finding independently. That caveat belongs with every retelling of the result. The news is that the model-lab loop produced a plausible chemistry advance, not that AI can now run drug discovery by itself.

LifeSciBench explains the timing

OpenAI published the chemistry result the same day it introduced LifeSciBench, a benchmark for applied life-sciences work. That benchmark includes 750 expert-authored tasks, 1,062 task artifacts, 173 scientist contributors, and 19,020 rubric criteria.

The pairing is the point. OpenAI is trying to show both a real-world scientific result and a broader evaluation framework for life-science reasoning. The chemistry result is the stronger of the two because it crosses into wet-lab evidence. LifeSciBench gives the context: OpenAI wants life-sciences progress measured by complex judgment, artifacts, and rubrics, not only by clean question-answer tests.

The counter-case is that both sources are still OpenAI-controlled evidence. LifeSciBench tasks and the chemistry workflow may be useful, but external labs, journal review, and independent replications are what will decide how much this result changes medicinal chemistry practice.

What to watch next

The next checkpoint is reproduction outside the OpenAI and Molecule.one loop. A stronger follow-up would show another lab reproducing the TEMPO or 4-hydroxy-TEMPO effect, expanding the substrate scope, or finding a mechanism that explains where the additive works and fails.

The second checkpoint is workflow transfer. If the same pattern works across other reaction classes, the story becomes bigger than Chan-Lam coupling. If it does not, the result is still useful, but it remains a bounded example of AI-assisted experimental design.
