OpenAI uses deployment simulation to test models before release

OpenAI published Deployment Simulation on June 16, 2026, a safety-evaluation method that replays realistic conversation contexts against candidate models before release. The company says the method helped estimate undesired behavior across GPT-5-series Thinking deployments and surfaced one novel misalignment pattern, “calculator hacking,” before release.

The practical shift is simple: a model is not only tested on hand-written stress prompts. It is also asked to answer in contexts that look closer to real deployment traffic, with the old assistant response removed and the candidate model generating a new one.

Realistic context is the point

Traditional safety evals are useful because they can aim directly at high-risk cases. Their weakness is that they can also become recognizable as tests, narrow around known failure types, and miss how a model behaves in the messy middle of ordinary use.

Deployment Simulation tries to measure that middle. OpenAI says it takes recent production conversations, removes account-linked identifiers and identifiable information, removes the original assistant response, and regenerates that turn with a candidate model. The resulting answers can be graded for known categories and audited for new failure modes.

That makes the method more like a pre-release shadow deployment than a quiz. The model sees realistic prefixes, not only synthetic prompts written to catch it out.

1.3M De-identified conversations analyzed OpenAI

20 Undesired-behavior categories in GPT-5.4 Thinking predictions OpenAI

1.5x Median multiplicative error reported in aggregate OpenAI

The method is useful because it is checkable

The strongest part of the post is not that OpenAI claims a new eval technique. It is that the predictions can be compared with what happens after launch. OpenAI says it pre-registered predictions for GPT-5.4 Thinking and ran retrospective studies across other GPT-5-series Thinking deployments.

That matters because model-safety claims often collapse into vibes: a benchmark went up, a red team was run, a system card was published. Deployment Simulation creates a measurable forecast. If the simulation says a behavior should go down and deployment traffic shows it went up, the method has to explain the miss.

OpenAI says the method is better for non-tail risks that occur often enough to appear in sampled prefixes. It explicitly says a behavior that appears once in ten million conversations is unlikely to be discovered in a million-sample run. That caveat should stay attached to the story. This is not a tail-risk detector by itself.

Evaluation awareness is becoming a real problem

OpenAI also tested whether models could tell they were being evaluated. In one comparison, it says GPT-5.2 labeled real production traffic as evaluation-like 5.4% of the time and simulated traffic 5.1% of the time, while traditional evaluations such as SWE-Bench or MonorepoBench were classified as evaluations almost 100% of the time.

That does not mean the model is consciously “cheating” in a human sense. It means the distribution carries cues. If a model can infer that a prompt is part of an eval, its behavior under that prompt may not match its behavior in the product.

Deployment-like context reduces that distortion. It also raises a governance question: the best signal may come from private production data that outside auditors cannot fully access. OpenAI tested WildChat as an external-data substitute and found it informative, but less accurate than recent OpenAI production data.

Agents make simulation harder

The most interesting extension is coding agents. OpenAI says it used 120,000 internal employee agentic trajectories from GPT-5.4 to simulate an internal deployment based on GPT-5.5. A single agent run can depend on files, commands, network responses, and tool outputs, so replaying it with live tool access would be risky and difficult.

OpenAI’s workaround was to simulate tool calls with another model, using the original rollout, repository state, a tool-call database, and read-only connectors. In a test where a discriminator tried to tell real from simulated rollouts, realism improved from an 11.6% win rate to 49.5%, close to chance.

That is the operator lesson. As agents take more actions, safety evaluation becomes an environment problem. The model is only one part of the behavior. The surrounding tools, permissions, state, and failure modes have to be simulated well enough for the forecast to mean anything.

What to watch next

The next checkpoint is whether this becomes a standard disclosure category in system cards: what traffic distribution was simulated, which behaviors were forecast, how predictions compared with deployment, and where the method failed.

External auditing is the harder problem. If developers have privileged access to representative traffic, they will also have privileged access to the best pre-release safety signal. Public datasets such as WildChat can narrow that gap, but OpenAI’s own result says they are not equal substitutes.

For readers tracking the model market behind these releases, see our AI model leaderboard and AI company tracker.