Test the AI agent on real workflows before you switch models

Adam Olofsson HammareAdam Olofsson Hammare
Test the AI agent on real workflows before you switch models

Changing an AI model is often dangerously easy. Someone edits a model name, turns on a new agent tool, or lets the assistant write into a system that used to require human hands. At first, nothing seems wrong. Then the odd failures appear: the wrong tone in a customer reply, a missed escalation, a tool request that should have paused for approval, or a workflow that gets expensive because the agent keeps retrying.

On June 16, OpenAI published a research post about deployment simulation: testing a candidate model on realistic, privacy-filtered workflows before release. Hammer Automation does not need to copy OpenAI's scale. The useful lesson for ordinary organizations is smaller: before AI gets more responsibility, let it rehearse on your own cases in a place where a bad answer cannot hurt anyone.

What OpenAI actually tested

Deployment simulation means taking previous conversations, removing the old AI response, and letting a candidate model answer the same context. The new answers are then checked against defined failure categories. OpenAI says the method was used across GPT-5-series Thinking models, with about 1.3 million de-identified conversations from August 2025 to March 2026, 20 types of undesired behavior, and agentic tool-use settings.

The number is not the point. The habit is. Benchmarks often show how a model behaves on test prompts. Deployment simulation tries to show how it behaves in the traffic and workflows it will actually meet after release. OpenAI also says the method surfaced new issues, including calculator hacking, before deployment.

Source: OpenAI, Predicting model behavior before release by simulating deployment and the research PDF

Run a small deployment rehearsal instead

A smaller organization does not need a research platform. A rehearsal lane is enough. Pick a few real workflows where AI is already used or about to be used: customer email, internal policy questions, student support, quote drafts, case summaries, invoice checks, or code changes in a non-critical repo.

Start here:

  • Choose 20 to 50 representative cases. Prefer normal work over dramatic edge cases.
  • Remove personal data, customer names, internal IDs, and sensitive details before testing.
  • Write the failure categories first: wrong advice, data leakage, bad tone, missed escalation, bad tool call, excessive cost, or unclear source.
  • Run the new model or agent in mirror mode. It can answer, but the answer does not reach a customer, student, colleague, or system.
  • Have a human review every answer with the same simple form.
  • Decide the stop rule before the test starts. If three out of five critical cases fail, the update does not move forward that day.

This sounds less exciting than a demo. Good. Production should be less exciting than a demo.

The test must measure operations, not only answer quality

An AI agent can write a good answer and still be the wrong tool for the workflow. So the rehearsal should capture more than text quality. Log what the agent tried to do, which tools it wanted to use, what it asked approval for, how long it took, what it cost, and whether it got stuck.

The same week made that point clearly. On June 16, Claude had a status cluster where Sonnet and Opus models were affected in one phase, followed by Opus 4.8-specific errors. Anthropic reported about a 10 percent error rate during parts of the incident and listed impact on claude.ai, Claude API, Claude Code, and Claude Cowork. That is not the same as saying every AI system is down. It is a model and component problem that needs the right fallback.

Source: Claude Status, Elevated errors across many models and Claude Status, Elevated errors for Claude Opus 4.8

Stop rules that people can actually use

Write the stop rules in plain language, not as a technical policy nobody reads. For example:

  • AI may not send customer communication when the source is missing or the tone needs human judgment.
  • AI may not delete, publish, pay, book, or grant system access without separate approval.
  • A new model may not be enabled in a critical workflow on the same day the vendor has an active capacity or model incident.
  • If the agent misses escalation in test cases involving complaints, personal data, or safety, the rollout pauses.
  • If cost per run is more than twice the expected level, the next step needs a named budget owner.

This is where Tool Forge becomes practical: not through a polished agent demo, but through permissions, logging, fallback, and ownership around the work that already exists.

A good first week

Pick one workflow. Not the whole organization. Take 30 old examples, privacy-filter them, and run the new model or agent in mirror mode. Have two people review the results: one who owns the process and one who understands the tool. Save the decisions in a simple log: case, expected behavior, actual behavior, failure category, cost, approvals, and decision.

After that, you know more than you know after another sales demo. You know whether the AI handles your normal cases. You know where a person must stop it. And you know whether the next step is a rollout, a changed routine, or a calm no until the tool is ready.

FAQ

What is deployment simulation?

Testing a new AI model or agent against representative, privacy-filtered workflows before it reaches production.

Do smaller organizations need OpenAI-level machinery?

No. Start with 20 to 50 typical cases, clear failure categories, human review, and a simple stop rule.

Which risks should be measured?

Wrong advice, data leakage, bad tool calls, unwanted tone, missed escalation, cost, and interruption during vendor incidents.

When should an AI update be stopped?

When tests show unacceptable errors in critical cases, no fallback, or permissions the organization has not approved.

The Forge newsletter

Get new articles in your inbox

Pick the topics you care about. No noise, at most one email a week.

Get new articles in your inbox

We follow GDPR. Unsubscribe anytime.