Before shipping a new model, OpenAI now runs it on replayed conversations first


OpenAI is now running its newest AI models on replays of real conversations from earlier deployments before releasing them, a method the company calls Deployment Simulation.

The method, described by OpenAI on June 16, 2026, takes conversation traces from prior deployments and runs them through a candidate model under a privacy-preserving setup. The aim is to estimate how often undesired behaviors will surface in real use, rather than only checking whether a model can be made to produce them in response to a test prompt.

Deployment Simulation sits alongside the traditional pre-release toolkit: targeted evaluations (curated test prompts for specific capabilities and risks), red-teaming (adversarial human testers probing for failure modes), and other safety checks. OpenAI frames it as a complementary signal that can surface where new forms of misbehavior may emerge, and at what frequency, before the model reaches users.

The method also extends to agentic rollouts, the industry's term for multi-step tasks in which the model uses tools such as browsing, code execution, or file operations to pursue a goal. Those settings are harder to test with static prompts, and OpenAI says Deployment Simulation is designed to cover them too.

OpenAI says it is already using Deployment Simulation across multiple models in its GPT-5-series Thinking line, the reasoning-oriented generation in its current flagship family. The lab is positioning the practice as a response to a broader problem it names plainly: as model capabilities grow, safety work has to track not just what a model can do, but what it is likely to do in deployment.

That shift is the story. Pre-release safety testing has historically tried to bound risk by asking a model a battery of hard questions and probing for failure. Deployment Simulation tries to estimate the rate at which a model will misbehave when it sees traffic that looks like the traffic it will actually face. That is a different kind of statistical claim: an estimate of how often a failure occurs, not a demonstration that it can occur.
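To make the statistical framing concrete, here is a minimal sketch of what a rate estimate from replayed traces could look like. It is not OpenAI's method, which the announcement does not detail. It assumes each replayed conversation is independently labeled as flagged or not flagged, and it uses a standard Wilson score interval. The function name and the counts in the usage line are hypothetical.

```python
import math


def wilson_interval(flagged: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a misbehavior rate.

    flagged: replayed conversations in which the behavior appeared (hypothetical label)
    total:   replayed conversations evaluated
    z:       normal quantile; 1.96 corresponds to roughly 95% confidence
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= flagged <= total:
        raise ValueError("flagged must be between 0 and total")

    p_hat = flagged / total
    z2 = z * z
    denom = 1.0 + z2 / total
    center = (p_hat + z2 / (2.0 * total)) / denom
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / total + z2 / (4.0 * total * total)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Illustrative numbers only: 12 flagged conversations out of 10,000 replayed.
low, high = wilson_interval(flagged=12, total=10_000)
print(f"point estimate: {12 / 10_000:.4%}, interval: {low:.4%} to {high:.4%}")
```

The interval describes only the replayed traffic. It says nothing about conversations that differ from the stored traces, which is the gap the rest of this article turns on.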

What the announcement does not say is as important as what it does. OpenAI asserts that the conversation replay is privacy-preserving, but the public blog post does not describe the mechanism by which traces are protected. The company does not publish a false-negative rate, meaning an estimate of how many real deployment misbehaviors the simulation misses, either because they were absent from the replayed traces or because the candidate model behaves differently when it can tell it is being tested. The gap between simulated deployment and live traffic also remains unaddressed. A model run on stored traces is not the same as a model responding to a live user, and behavioral estimates from one do not automatically transfer to the other.
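The false-negative concern has a simple arithmetic side, sketched below under stated assumptions. If a behavior never appears in a set of independently sampled replayed conversations, the most that can be concluded is an upper bound on its rate in traffic like those traces. The function name and the sample size are hypothetical, and the bound says nothing about live traffic that differs from the replayed set.

```python
def zero_event_upper_bound(total: int, confidence: float = 0.95) -> float:
    """Exact one-sided upper bound on a rate when 0 events are seen in `total` trials.

    Solves (1 - p) ** total = 1 - confidence for p. For large `total` and
    95% confidence this is close to the familiar "rule of three", 3 / total.
    Assumes the trials are independent draws from one fixed distribution.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / total)


# Illustrative only: a behavior never observed in 10,000 replayed conversations.
bound = zero_event_upper_bound(10_000)
print(f"rate could still be as high as about {bound:.4%} at 95% confidence")
```

A rate that looks small per conversation can still mean many incidents once multiplied by production-scale traffic, which is one reason a published false-negative estimate would matter.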

Independent researchers and regulators will, in the coming months, be asked to evaluate claims like these as the practice spreads. The questions worth asking of any lab that adopts a similar method are these: which traces are replayed, what is the privacy mechanism, what is the estimated false-negative rate, and how is the simulation validated against live deployment data once the model is in production?

For now, Deployment Simulation is one lab describing one approach. The strongest signal is methodological: pre-release safety work is moving from a fixed battery of adversarial prompts toward deployment-shaped estimates of behavioral frequency. Whether that move produces better models is a question the announcement itself cannot answer.

