Give your AI agent a performance review before customers do

Adam Olofsson Hammare

There is an uncomfortable question behind all the new AI agents in customer service and sales: who tells the agent when it is doing a bad job?

Not when it crashes. You notice that. Not when it invents something absurd. You often notice that too. The harder part is when the answer is almost right, when the tone almost fits, when the next step almost helps the customer. That is where small businesses risk ending up with an AI coworker that sounds efficient while quietly learning the wrong habits.

On May 20, 2026, Salesforce published new research on AI agents in customer service. The survey covers 3,075 customer service professionals worldwide. According to Salesforce, adoption of AI agents in customer service organizations rose from 39 percent in 2025 to 66 percent in 2026, a 1.7x increase in one year. 70 percent of organizations using AI agents say they see measurable value within 60 days, and the KPI that improves most is customer satisfaction.

Source: Salesforce, "New Research: AI Service Agents Are Scaling and Delivering CSAT"

It is easy to read those numbers as another argument for "getting an agent". I read them a little differently. If AI agents are now affecting customer satisfaction, they are no longer a side experiment. They are part of how the company meets people. And that means they need management.

Agentic AI does not have to start big

Agentic AI means an AI system that can take several steps toward a goal: read information, choose the next action, use a tool, hand off a case, or suggest a reply. It goes beyond answering a single chat question, but it does not have to mean a fully autonomous robot doing everything alone.

For a hair salon, accounting firm, online shop, consultant or school, an agent can be much simpler:

  • It summarizes incoming customer questions and suggests replies.
  • It sorts booking requests by urgency.
  • It writes a usable first draft of follow-up emails after meetings.
  • It surfaces the relevant routine before someone answers a parent, customer or supplier.
  • It drafts a quote, but sends nothing without approval.

That is useful. But once the agent works close to real customers, real prices, real student matters or real order flows, it needs more than a good prompt. It also requires a job, a mandate and a recurring review.

Salesforce published a concrete example in April from Asymbl, where the AI agent "Teddy" is treated as a digital worker with a job description, KPIs and regular feedback from humans. The point is not that every small business should build a Salesforce setup. The point is that the behavior can be copied: write down what the agent is allowed to do, measure what it actually does and improve it from real examples.

Source: Salesforce, "Your AI Agent Needs a Performance Review. Here’s How to Give One."

The performance review beats the perfect prompt

Many AI projects begin with a hunt for the perfect prompt. That makes sense. A prompt feels cheap. You can change it in five minutes. It often gives a quick sense of progress.

But a prompt without follow-up quickly becomes a wish list. "Be clear, helpful and professional" says almost nothing about how your company wants to handle an unhappy customer, an expired quote, a parent asking about absence or a supplier who sent the wrong invoice.

A performance review for an AI agent is more grounded. You take five to ten real interactions, redact them if needed and ask:

  • What would the agent do here?
  • What did it do well?
  • Where was it too fast, too vague or too confident?
  • Which step must always be approved by a human?
  • Which rule should change before the agent gets more responsibility?

It sounds almost too simple. That is why it works. Small teams rarely have time for large AI programs. They may have 45 minutes a week if the meeting saves hours of bad replies, missed follow-ups and unnecessary customer friction.

Build a 45-minute review routine

This routine is for a small team already using ChatGPT, Claude, Gemini, Copilot, Notion AI, a CRM tool or a basic automation setup. It also works before you have a "real" agent. In that case, review the answers, summaries and drafts you already ask AI to produce.

Prepare for 10 minutes

Choose five interactions from the week. They can be customer emails, chats, support cases, quote requests or internal questions. Remove national ID numbers, student details, card numbers, passwords and anything else that is not needed for the review. If you use a system with real integrations, make sure the agent only has the access it needs: read access where that is enough, scoped API keys, secrets in a secret manager or environment variables, and logs of what the agent does.
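
The redaction step above can be partly scripted. What follows is a minimal Python sketch, not a complete privacy tool. The three patterns and the placeholder labels are my own assumptions, and the national ID pattern assumes a Swedish-style number (six to eight digits, a separator, four digits). Names, addresses and free-text details still need a human read-through.

```python
import re

# Hypothetical patterns for illustration; adapt them to the identifiers
# that actually appear in your own emails and chats.
PATTERNS = [
    # Assumed Swedish-style national ID: 6-8 digits, "-" or "+", 4 digits.
    ("[NATIONAL_ID]", re.compile(r"\b\d{6,8}[-+]\d{4}\b")),
    # Email addresses.
    ("[EMAIL]", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    # 13-19 digits, optionally separated by spaces or hyphens (card-like numbers).
    ("[CARD]", re.compile(r"\b\d(?:[ -]?\d){12,18}\b")),
]


def redact(text: str) -> str:
    """Replace every match of each pattern with its placeholder label."""
    for label, pattern in PATTERNS:
        text = pattern.sub(label, text)
    return text


if __name__ == "__main__":
    sample = "Mail anna@example.com, card 4111 1111 1111 1111, id 19900101-1234."
    print(redact(sample))
```

The national ID pattern runs first so that the broader card pattern never gets a chance to swallow part of an ID number.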

Put the agent's job on one line

Write a simple job description. For example: "The agent should suggest replies to product questions and flag cases that require human judgment. It must not promise delivery times, discounts or returns outside our rules."

That line matters more than it seems. Without a job description, you cannot tell whether the agent did the right job or merely wrote something that sounded good.

Review four things

Do not review everything. Choose four measures that fit small businesses:

  • The customer's next step became clear.
  • The reply used the right source, policy or price list.
  • The agent handed off uncertain or sensitive cases.
  • The tone sounded like the company, not like an anonymous support bot.

For a school, the same routine can cover parent questions, lesson material or internal administration. For an agency, it can cover client briefs and follow-ups. For a shop, it can cover product questions, return rules and stock status.
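
If you want to keep the scores from week to week, the four measures fit in a very small data structure. This is a hedged sketch in Python: the class name, field names and helper are hypothetical, and a shared spreadsheet would do the same job. The helper returns the measure that failed most often, which is a reasonable candidate for the single change you make after the meeting.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Review:
    """One reviewed interaction, scored on the four measures (True = passed)."""
    example_id: str
    clear_next_step: bool   # the customer's next step became clear
    right_source: bool      # the reply used the right source, policy or price list
    handed_off: bool        # uncertain or sensitive parts were handed off
                            # (mark True when there was nothing to hand off)
    company_tone: bool      # the tone sounded like the company

    def failed(self) -> list[str]:
        """Return the names of the measures this interaction did not pass."""
        scores = {
            "clear_next_step": self.clear_next_step,
            "right_source": self.right_source,
            "handed_off": self.handed_off,
            "company_tone": self.company_tone,
        }
        return [name for name, passed in scores.items() if not passed]


def most_common_failure(reviews: list[Review]) -> Optional[str]:
    """Return the measure that failed most often, or None if nothing failed."""
    counts = Counter(name for review in reviews for name in review.failed())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    week = [
        Review("email-1", True, False, True, True),
        Review("chat-2", True, False, False, True),
        Review("quote-3", True, True, True, True),
    ]
    print(most_common_failure(week))
```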

Change one thing

Do not end the meeting with ten improvement points. Pick one change. Add one rule. Clarify one source. Create one stop point where human approval is required. Limit one tool. Add one log line that makes the next review easier.

This is where AI starts to become operations, not a demo.

Copy the prompt: weekly AI-agent review

Paste the prompt below into the AI tool you use. Add five redacted examples after the prompt. If the material contains sensitive information, redact it first or run the review in a tool and environment you have approved for that type of data.

You are the reviewer for our AI agent. Your job is to help us improve the agent's way of working, not to write a prettier answer.

Agent job description:
[write one sentence about what the agent should do]

The agent may do:
[list allowed actions, for example summarize, suggest replies, retrieve policy]

The agent may not do without human approval:
[list stop rules, for example promise a price, change an order, send a reply, handle sensitive personal data]

Assess each example using four questions:
1. Did the customer or colleague get a clear next step?
2. Did the agent use the right source, routine, price list or policy?
3. Did the agent hand off uncertain, sensitive or financially important parts?
4. Did the tone sound like our company?

Answer in this format:
- Short verdict: approved / needs change / should not be automated yet
- What worked
- What could go wrong
- One rule we should add or change
- An example of a better reply or better next step

Here are this week's five redacted examples:
[paste examples]

What should you measure in the first month?

Start with what you can actually see. Many small teams get stuck in measures that are too large: total productivity, automation rate, return on investment. Those can come later. The first month is about seeing whether the agent makes daily work calmer or messier.

Measure things like:

  • How many suggestions could be used after light editing?
  • How often did a human need to stop a reply?
  • Which types of questions were handed off correctly?
  • Which mistakes appeared more than once?
  • Did the agent save time in follow-up, sorting or first drafts?

If you run customer service, you can add customer satisfaction or a simple after-check: did the customer get help without asking the same thing again? If you work in sales, measure whether follow-up became faster and more relevant. If you work in education, measure whether staff received better drafts, not whether AI "replaced" professional judgment.
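
The first three numbers can be counted from a simple log. Below is a minimal Python sketch, assuming you record one entry per AI suggestion with an "outcome" and an optional "mistake" label. The field names and outcome values are my own invention for illustration, not a standard.

```python
from collections import Counter

# Hypothetical outcome labels; "usable" here means sent as-is or after light editing.
USABLE_OUTCOMES = {"used_as_is", "light_edit"}


def first_month_metrics(entries: list[dict]) -> dict:
    """Summarize a list of logged suggestions.

    Each entry is assumed to look like:
    {"outcome": "light_edit", "mistake": None}
    where outcome is one of: used_as_is, light_edit, heavy_edit, stopped.
    """
    total = len(entries)
    if total == 0:
        return {"usable_rate": 0.0, "stop_rate": 0.0, "repeated_mistakes": []}

    usable = sum(1 for e in entries if e["outcome"] in USABLE_OUTCOMES)
    stopped = sum(1 for e in entries if e["outcome"] == "stopped")

    # Count only entries that carry a mistake label, then keep the repeats.
    mistakes = Counter(e["mistake"] for e in entries if e.get("mistake"))
    repeated = sorted(name for name, count in mistakes.items() if count > 1)

    return {
        "usable_rate": usable / total,
        "stop_rate": stopped / total,
        "repeated_mistakes": repeated,
    }


if __name__ == "__main__":
    log = [
        {"outcome": "light_edit", "mistake": None},
        {"outcome": "used_as_is", "mistake": None},
        {"outcome": "stopped", "mistake": "promised delivery date"},
        {"outcome": "heavy_edit", "mistake": "promised delivery date"},
    ]
    print(first_month_metrics(log))
```

The repeated mistakes matter more than the rates in the first weeks: a mistake that shows up twice is usually a missing rule, not bad luck.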

Integrate safely, not timidly

It is tempting to solve risk by saying AI cannot connect to anything. That is safe on paper, but often useless in practice. An agent that never sees the right policy, order status or meeting notes will guess more than it should.

A better pattern is narrow, traceable integration. Give the agent read access before write access. Use separate and scoped API keys. Put secrets in environment variables or a secret manager, not in the chat. Redact personal data when it is not needed. Require approval before the agent sends, books, changes prices or does anything that affects a customer's money, time or trust. Keep logs that humans can actually read.
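
In code, that pattern can be quite small. The Python sketch below is an illustration under stated assumptions, not a finished integration: the action names, the AGENT_API_KEY variable and the log fields are all hypothetical. It shows three of the ideas above together: the key comes from an environment variable, read-only actions pass without approval, and every decision becomes one readable log line.

```python
import json
import os
from datetime import datetime, timezone
from typing import Optional

# Hypothetical list of actions that only read or draft; everything else needs approval.
READ_ONLY_ACTIONS = {"summarize", "suggest_reply", "retrieve_policy"}


def load_api_key(env_var: str = "AGENT_API_KEY") -> str:
    """Read the (scoped) API key from the environment instead of the chat or the code."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set {env_var} in the environment or your secret manager.")
    return key


def decide(action: str, approved_by: Optional[str] = None) -> dict:
    """Allow read-only actions; allow anything else only with a named approver."""
    allowed = action in READ_ONLY_ACTIONS or approved_by is not None
    return {
        "time": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "action": action,
        "approved_by": approved_by,
        "allowed": allowed,
    }


def to_log_line(entry: dict) -> str:
    """One JSON object per line is easy to append to a file and easy to read later."""
    return json.dumps(entry)


if __name__ == "__main__":
    print(to_log_line(decide("suggest_reply")))
    print(to_log_line(decide("send_email")))
    print(to_log_line(decide("send_email", approved_by="anna")))
```

The default is deliberately strict: an action that is not on the read-only list is blocked until a person puts their name on it.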

This is a typical Tool Forge problem for Hammer Automation: the useful question is not "which AI should we buy?" It is "which small workflow can we connect safely, measure clearly and improve every week?" For many teams, the answer does not start with another license. It starts with a performance review for the AI they already use.
