AI agent incident runbook: fallback, replay and vendor status

AI incident runbook: what to do when the vendor stumbles

AI status pages are easy to ignore until they suddenly describe your own workflow. On June 7, four different kinds of disruption showed up in the same day: model errors at Claude, OCR degradation at Mistral, higher token usage in Gemini API context caching, and ChatGPT Conversations issues for OpenAI Free and Go users.

That is not a reason to stop using AI. It is a reason to stop saying "AI is down" as if every failure is the same.

If AI already drafts customer replies, reads incoming documents, searches internal policies, or runs admin steps, you need a small incident runbook. Not a heavy crisis binder. Just a routine that says: what do we pause, what may continue, what switches to a fallback, and what must be replayed?

Source: Claude Status, Mistral AI Status, Google AI Studio Status, OpenAI Status.

An AI incident is almost never "all of AI"

The component matters. A chat issue for a subscription tier is not the same as an API issue. An OCR incident can affect document workflows without breaking normal text generation. A caching problem can return good answers while giving you the wrong cost picture.

That makes incident handling more practical than dramatic. Ask first:

Which vendor and component are affected?
Is it chat, API, a model, OCR, file upload, caching, or account access?
Which of our workflows depend on that exact component?
Did any jobs run during the incident window that should be reviewed or replayed?

For Claude, June 7 involved elevated errors across several models and later Opus 4.7. Mistral's incident was clearly scoped to the OCR API for just under half an hour. OpenAI's status item affected ChatGPT Conversations for Free/Go users, not the OpenAI API or Codex. Gemini was trickier: context caching could cause higher token usage than expected, which is a cost incident even if the response still looks fine.

Source: Claude Status incident history, Mistral OCR API Degraded, OpenAI Status RSS, Google AI Studio Status.
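The scoping questions above can be sketched as a small lookup from vendor component to your own workflows. This is a minimal illustration, not a vendor API: the workflow names, the component labels, and the `affected_workflows` helper are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Dependency:
    workflow: str   # your own name for the workflow (hypothetical)
    vendor: str     # the company behind the component
    component: str  # your own label for the component, not an official one


# Hypothetical inventory; replace it with the workflows you actually run.
DEPENDENCIES = [
    Dependency("incoming-pdf-intake", "Mistral", "ocr-api"),
    Dependency("support-drafts", "Anthropic", "model-api"),
    Dependency("nightly-evals", "Google", "context-caching"),
    Dependency("staff-chat", "OpenAI", "chatgpt-conversations"),
]


def affected_workflows(vendor: str, component: str) -> List[str]:
    """Return the workflows that depend on exactly this vendor component."""
    return [
        d.workflow
        for d in DEPENDENCIES
        if d.vendor.lower() == vendor.lower()
        and d.component.lower() == component.lower()
    ]


# An OCR incident touches the PDF intake, nothing else.
print(affected_workflows("Mistral", "ocr-api"))
# A chat-tier incident does not touch anything that uses the API.
print(affected_workflows("OpenAI", "model-api"))
```

An empty result is useful information too: it means the status line is real, but none of your workflows depend on that component.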

Turn the status line into a decision

A useful incident runbook does not have to start with engineering. It can start with a small decision card for each AI workflow.

Write down the primary vendor, critical component, acceptable fallback, stop rule, owner, and logging requirement. That gets you surprisingly far. The point is to avoid inventing the decision while someone is waiting for a reply, an invoice, or a customer draft.

Examples:

An internal summarization flow can often wait or switch to another model.
An agent that approves invoice material should pause rather than guess.
An OCR flow for applications should queue documents and mark which files were parsed during the incident window.
A cost incident should not be logged as a plain "successful run" if token usage was wrong.

It sounds dry. Good. Incident routines should be a little boring. Panic costs more.
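For teams that want the decision card in a form a script can read, here is one possible shape. It is a sketch under assumptions: the field names follow the list above (vendor, component, fallback, stop rule, owner, logging), while the `Action` values, the `decide` function, and the example cards are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Action(Enum):
    CONTINUE = "continue"
    FALLBACK = "fallback"
    QUEUE = "queue"
    PAUSE = "pause"


@dataclass(frozen=True)
class DecisionCard:
    workflow: str
    primary_vendor: str
    critical_component: str
    owner: str
    on_incident: Action             # the stop rule: what to do when the component is hit
    fallback: Optional[str] = None  # acceptable fallback, if there is one
    log_fields: Tuple[str, ...] = ("time_window", "input_id", "vendor", "model", "cost")


def decide(card: DecisionCard, affected_component: str) -> Action:
    """Apply a card: react only if the incident hits the critical component."""
    if affected_component != card.critical_component:
        return Action.CONTINUE
    if card.on_incident is Action.FALLBACK and card.fallback is None:
        # A fallback rule without a named fallback is not safe to follow.
        return Action.PAUSE
    return card.on_incident


# Hypothetical cards mirroring the examples above.
summaries = DecisionCard("internal-summaries", "Anthropic", "model-api",
                         "knowledge-team", Action.FALLBACK, fallback="secondary-model")
invoices = DecisionCard("invoice-approval", "Anthropic", "model-api",
                        "finance-lead", Action.PAUSE)
applications = DecisionCard("application-ocr", "Mistral", "ocr-api",
                            "admissions", Action.QUEUE)

for card in (summaries, invoices, applications):
    print(card.workflow, decide(card, "model-api").value)
```

The card stays a business decision. The code only makes sure the same answer comes out at 14:00 on a bad day as it did when the owner wrote it down.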

The first version of the runbook

Start here, not with a twenty-page document.

Detect. Subscribe to status sources for the vendors you actually use. Also log your own errors, latency, token cost, and unusual replays.
Scope. Write the incident as component plus workflow: "Mistral OCR for incoming PDFs", "Claude API for support drafts", "Gemini context caching for evals". Avoid the broad label "AI problem".
Pause risky steps. Stop jobs that change data, send customer communication, approve finance items, or build decision material directly from uncertain input.
Switch to a fallback where it is safe. Drafts, summaries, and internal search can often switch to another model or wait in a queue. Legal, finance, HR, and database writes need stricter rules.
Replay with traceability. Save the time window, input ID, vendor, model, component, cost, and result status. Replay only the jobs that were actually affected.
Write plain human copy. If customers or colleagues are affected, say what happened, what you paused, and when you expect the next update. Do not write "our AI is down" if the real issue was OCR, caching, or a specific chat tier.
Do the review. Adjust fallback rules, budget alerts, and test cases while the event is still fresh.
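The replay step is the one that benefits most from a concrete log format. The sketch below assumes you store one record per job with the fields named in the list above; the `JobRecord` type, the `review_reason` helper, the 1.5x cost tolerance, and the timestamps are illustrative assumptions, not a standard.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class JobRecord:
    input_id: str
    vendor: str
    component: str
    model: str
    started_at: datetime
    tokens_used: int
    expected_tokens: int  # your own baseline for this kind of job
    status: str           # "ok" or "error", as reported by the run


def review_reason(job: JobRecord, vendor: str, component: str,
                  start: datetime, end: datetime,
                  cost_tolerance: float = 1.5) -> Optional[str]:
    """Return why a job needs review, or None if the incident did not touch it."""
    if (job.vendor, job.component) != (vendor, component):
        return None
    if not (start <= job.started_at <= end):
        return None
    if job.status != "ok":
        return "failed during incident"
    if job.tokens_used > job.expected_tokens * cost_tolerance:
        return "cost anomaly"
    # A "successful" run inside the window still deserves a look.
    return "ran during incident, verify output"


# Hypothetical incident window and job log.
utc = timezone.utc
start = datetime(2030, 1, 1, 14, 0, tzinfo=utc)
end = datetime(2030, 1, 1, 14, 30, tzinfo=utc)
jobs = [
    JobRecord("doc-1", "Mistral", "ocr-api", "ocr", datetime(2030, 1, 1, 14, 10, tzinfo=utc), 1000, 1000, "ok"),
    JobRecord("doc-2", "Mistral", "ocr-api", "ocr", datetime(2030, 1, 1, 15, 30, tzinfo=utc), 1000, 1000, "ok"),
    JobRecord("doc-3", "Mistral", "ocr-api", "ocr", datetime(2030, 1, 1, 14, 20, tzinfo=utc), 0, 1000, "error"),
    JobRecord("eval-1", "Google", "context-caching", "gemini", datetime(2030, 1, 1, 14, 5, tzinfo=utc), 9000, 1000, "ok"),
]

for job in jobs:
    reason = review_reason(job, "Mistral", "ocr-api", start, end)
    if reason is not None:
        print(job.input_id, reason)
```

Filtering on vendor, component, and time window together is what keeps the replay list short. The cost check is there because a run can report "ok" and still be part of the incident.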

OpenAI's SchemaFlow example is relevant here even though it is about database changes, not outages. The pattern is useful: let AI produce traceable artifacts and plans, but do not run risky changes without checks, tests, and approval.

Source: OpenAI Cookbook: SchemaFlow.

This is not only for developers

Most incident decisions are business decisions. Should customer service wait before replying? Should finance pause approvals? Should a school send student summaries if the source file was read during an OCR incident? Should a consultant use the fallback model for a first draft but not for final advice?

That is why ownership has to sit close to the workflow, not only with the person who connected the API. A process owner can say: "this can wait", "this can be replayed", or "this needs human review before it leaves the building".

A simple Hammer way forward

If you already use AI in real routines, start with a 30-minute drill. Pick one workflow. Write down vendor, component, stop rule, fallback, and replay log. Then test the question: what would we do if this failed today at 14:00?

Hammer can help make this practical through Tool Forge or Mindset Forge: map AI workflows, set stop rules, add status and cost logs, and build replay where it is needed. Not because every incident is dangerous. Because AI used in everyday work needs everyday routines too.

FAQ

What is an AI incident runbook?

A simple routine for detecting vendor problems, pausing risky AI jobs, choosing fallbacks, and documenting what must be replayed.

When should an AI workflow be paused?

When the incident affects the component the workflow actually depends on, such as OCR, context caching, model API, file upload, or chat conversations.

What is the difference between a chat incident and an API incident?

A chat incident may affect users in an interface or subscription tier. An API incident can affect integrations, automations, and your own apps. Always check the component before pausing everything.

What should be logged after an AI incident?

Time window, vendor, component, model, affected jobs, cost anomaly, fallback decision, customer messaging, and which tasks were replayed.
