AI Enablement Radar week 21: agent work needs a receipt

Adam Olofsson Hammare
The clearest signal this week is simple: AI is moving from open-ended chat into workflows where someone needs to see what happened. KPMG is giving Claude to more than 276,000 employees, GitHub lets the Copilot cloud agent handle code-review fixes, OpenAI Codex makes its longer-running Goal Mode generally available, and security models are finding vulnerabilities fast enough that verification and patching become the bottleneck. For a small Nordic team, the lesson is not to connect everything on Monday. The lesson is to make the next AI test produce a source receipt (a record of which sources the AI used), set clear permissions, and leave a human in charge of approval.

Top signals this week

  • KPMG is rolling out Claude broadly. Anthropic and KPMG announced a global alliance where Claude will be embedded in KPMG's Digital Gateway and made available to more than 276,000 employees across 138 countries and territories. That is a clear example of AI being built into existing client and knowledge platforms, not bolted on as a separate chat window.

Source: Anthropic, KPMG integrates Claude across its core business and workforce of more than 276,000

  • AI agents need better connection surfaces. Anthropic is acquiring Stainless, a company that builds SDKs, CLIs, and MCP servers. MCP (Model Context Protocol) is an open protocol that lets AI agents connect to tools and data sources through defined interfaces. The point is hard to miss: more of the value now sits in how the agent reaches the right system, not only in the model.

Source: Anthropic, Anthropic acquires Stainless

  • Coding agents get more responsibility, and more audit surface. GitHub added a REST API for auditing Copilot cloud agent configuration per repository, including MCP servers, enabled tools, GitHub Actions workflow policy, and firewall configuration. In the same week, the Copilot flow for code-review feedback became more explicit, with choices for where the change should land, which model should be used, and what instructions the agent receives.

Source: GitHub Changelog, Audit repository Copilot cloud agent configuration via the REST API

Source: GitHub Changelog, Easily apply Copilot code review feedback with Copilot cloud agent

  • OpenAI Codex is moving toward longer agent jobs. The Codex changelog for May 21 lists Appshots on macOS, Goal Mode as generally available, and remote computer use. Goal Mode matters because the agent can work toward a goal for hours or days. That makes the starting brief, boundaries, and final review more important than a polished prompt.

Source: OpenAI Developers, Codex changelog

  • Security work turns into a patch queue. Anthropic says Project Glasswing and Claude Mythos Preview have found more than 10,000 high- or critical-severity vulnerabilities together with roughly 50 partners. Cloudflare found 2,000 bugs, including 400 high- or critical-severity issues. The new bottleneck is not only finding defects. It is verifying, disclosing, and patching them.

Source: Anthropic Research, Project Glasswing: An initial update

What organizations are actually doing with AI

The KPMG signal is worth pausing on. Digital Gateway is built on Microsoft Azure and used in client work across tax, legal, private equity, cybersecurity, and business-function modernization. This is not a tiny side pilot. It is AI in an operating layer where data, tools, accountability, and client work already meet.

Source: Anthropic, KPMG integrates Claude across its core business and workforce of more than 276,000

PwC's recent Anthropic partnership points in the same direction. PwC plans to train and certify 30,000 people on Claude, create a joint Center of Excellence, and use Claude Code and Claude Cowork in areas such as underwriting, cybersecurity, HR, and mainframe modernization. In one example, Anthropic says insurance underwriting that took 10 weeks now takes 10 days. Numbers like that are tempting to chase, but the practical lesson is plainer: measure one real lead-time metric before and after AI, or nobody knows what improved.

Source: Anthropic, PwC is deploying Claude to build technology, execute deals, and reinvent enterprise functions for clients
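
One way to make that before-and-after measurement concrete is sketched below. It assumes you can export an opened date and a closed date for each case. The function names, the sample dates, and the choice of the median are illustrative assumptions, not something taken from the PwC or Anthropic material.

```python
from datetime import date
from statistics import median


def lead_time_days(opened: date, closed: date) -> int:
    """Whole days from when a case was opened until it was closed."""
    if closed < opened:
        raise ValueError("closed date is before opened date")
    return (closed - opened).days


def median_lead_time(cases: list[tuple[date, date]]) -> float:
    """Median lead time in days for a list of (opened, closed) pairs."""
    if not cases:
        raise ValueError("need at least one case")
    return median(lead_time_days(opened, closed) for opened, closed in cases)


def percent_change(before: float, after: float) -> float:
    """Relative change in percent; a negative value means lead time went down."""
    if before == 0:
        raise ValueError("the 'before' value must not be zero")
    return (after - before) / before * 100


# Hypothetical sample data: the same kind of case, before and after the AI test.
before_ai = [
    (date(2026, 3, 2), date(2026, 3, 12)),  # 10 days
    (date(2026, 3, 4), date(2026, 3, 16)),  # 12 days
    (date(2026, 3, 9), date(2026, 3, 23)),  # 14 days
]
after_ai = [
    (date(2026, 5, 4), date(2026, 5, 9)),   # 5 days
    (date(2026, 5, 5), date(2026, 5, 11)),  # 6 days
    (date(2026, 5, 6), date(2026, 5, 16)),  # 10 days
]

if __name__ == "__main__":
    change = percent_change(median_lead_time(before_ai), median_lead_time(after_ai))
    print(f"Median lead time changed by {change:.0f} percent")
```

Applied to the underwriting figure quoted above, 10 weeks (70 days) down to 10 days is a reduction of roughly 86 percent. The median is used here because a single slow case distorts it less than it distorts an average.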

Databricks' report on AI agents adds a wider benchmark. It draws on insights from more than 20,000 global organizations and finds that organizations using AI governance tools push 12 times more AI projects into production. Those using evaluation tools move nearly 6 times more AI systems into production. Evals are recurring tests of AI outputs against examples, rules, and quality requirements. For a small team, that can be as simple as 20 old cases with an answer key and a scored human review.

Source: Databricks, Enterprise AI Agent Trends
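
A small eval of that kind can be sketched in a few lines. The version below is a minimal illustration, not the method from the Databricks report: it assumes each old case lists phrases a good answer must contain and phrases it must never contain, and it takes the AI as a plain function so the same cases can be rerun whenever the prompt or model changes. The case fields and function names are hypothetical.

```python
from typing import Callable


def check_output(output: str, case: dict) -> list[str]:
    """Return the rule violations for one AI output (an empty list means pass)."""
    text = output.lower()
    problems = []
    for phrase in case.get("must_include", []):
        if phrase.lower() not in text:
            problems.append(f"missing: {phrase}")
    for phrase in case.get("must_not_include", []):
        if phrase.lower() in text:
            problems.append(f"forbidden: {phrase}")
    return problems


def run_eval(cases: list[dict], answer_fn: Callable[[str], str]) -> dict:
    """Run every case through answer_fn and collect the failures per case id."""
    failures = {}
    for case in cases:
        problems = check_output(answer_fn(case["input"]), case)
        if problems:
            failures[case["id"]] = problems
    return {
        "total": len(cases),
        "passed": len(cases) - len(failures),
        "failures": failures,
    }
```

Phrase checks like these only catch mechanical failures. They complement the scored human review described above and do not replace it.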

Google Cloud's agent report gives concrete examples of what agent workflows are used for. Telus is said to have more than 57,000 AI-using team members and to save 40 minutes per AI interaction. Danfoss, according to Google, automates 80 percent of transactional decisions in email-based order handling and has reduced response time from 42 hours to near real time. For a small organization, the pattern matters more than the scale: choose one recurring case, define what AI may suggest, and let a human approve before anything is sent.

Source: Google Cloud, 5 ways AI agents will transform the way we work in 2026

The tooling layer: platforms, agents, and workflows

An agentic workflow is a multi-step workflow where AI does more than write a response. It plans, retrieves context, uses tools, and suggests or performs the next step. This week's tooling news is less about magic and more about control points.

GitHub is useful because it shows both sides of the same shift. Copilot cloud agent can take code-review feedback and make changes, while GitHub also adds inspection of the agent's configuration: which MCP servers exist, which tools are active, which Actions policy applies, and what the firewall setup looks like. Translated to a non-technical organization, the question becomes: which systems may AI read, which may it write to, and where does the log live?

Source: GitHub Changelog, Audit repository Copilot cloud agent configuration via the REST API

OpenAI Codex is moving in the same direction. Appshots send context from the frontmost macOS window to Codex, and Goal Mode makes longer-running agent work possible. That can become useful outside coding too: proposal review, document cleanup, or an internal knowledge base. But the longer the job runs, the more you need a starting brief, a stop rule, and a final report a person can review.

Source: OpenAI Developers, Codex changelog

GitHub's npm update also matters beyond software teams. Staged publishing puts a package into a queue first, where it must be approved with 2FA before it becomes installable. The new install flags can restrict packages from files, remote URLs, directories, and git sources. Copy the pattern: let automation prepare the work, then place a human approval step before anything reaches production, customers, or students.

Source: GitHub Changelog, Staged publishing and new install-time controls for npm
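
The prepare-then-approve pattern can be sketched outside npm as a small staging queue. This is an in-memory analogy with hypothetical class and method names. It does not describe how npm implements staged publishing, and a real version would also need to verify the reviewer's identity and persist the queue.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StagedItem:
    item_id: str
    content: str
    prepared_by: str                    # for example "ai-assistant"
    approved_by: Optional[str] = None   # stays None until a human approves


class StagingQueue:
    """Holds prepared work until a named human approves it."""

    def __init__(self) -> None:
        self._items: dict[str, StagedItem] = {}

    def stage(self, item_id: str, content: str, prepared_by: str) -> StagedItem:
        """Automation calls this; nothing leaves the queue yet."""
        item = StagedItem(item_id, content, prepared_by)
        self._items[item_id] = item
        return item

    def approve(self, item_id: str, reviewer: str) -> None:
        """Record the approver; the preparer may not approve its own work."""
        item = self._items[item_id]
        if reviewer == item.prepared_by:
            raise PermissionError("the preparer cannot approve its own work")
        item.approved_by = reviewer

    def release(self, item_id: str) -> str:
        """Hand the content over only if someone has approved it."""
        item = self._items[item_id]
        if item.approved_by is None:
            raise PermissionError(f"{item_id} has not been approved")
        del self._items[item_id]
        return item.content
```

The design choice worth copying is that `release` checks for an approval itself, so skipping the human step raises an error instead of silently sending.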

Governance and risk: what needs to be in place before scaling

AI governance means the rules, roles, and controls that make AI use traceable and accountable. It does not have to start as a large policy document. It can start with four questions on one page: what may AI read, what may AI suggest, who approves, and where are source receipts and decisions stored?
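
The four questions can also be written down as data, which makes them checkable. The sketch below is one possible shape under stated assumptions: the policy keys, the system names, and the log file name are all hypothetical examples, and the receipt is simply one JSON line per suggestion.

```python
import json
from datetime import datetime, timezone
from typing import Optional

# Hypothetical one-page policy; every name in it is an example.
POLICY = {
    "may_read": ["support_inbox_copy", "faq_document"],
    "may_suggest": ["reply_draft", "next_action"],
    "approver": "team lead",
    "receipt_log": "receipts.jsonl",
}


def is_allowed(policy: dict, action: str, target: str) -> bool:
    """Allow only listed reads and suggestions; deny everything else."""
    if action == "read":
        return target in policy["may_read"]
    if action == "suggest":
        return target in policy["may_suggest"]
    return False  # send, write, delete and similar actions are never allowed


def make_receipt(case_id: str, sources: list[str], suggestion: str,
                 approved_by: Optional[str] = None) -> str:
    """Build one JSON line recording what the AI used and who approved it."""
    for source in sources:
        if not is_allowed(POLICY, "read", source):
            raise PermissionError(f"source not permitted by policy: {source}")
    record = {
        "case_id": case_id,
        "sources": sources,
        "suggestion": suggestion,
        "approved_by": approved_by,  # None means still waiting for a human
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(record)
```

Appending each returned line to the file named in `receipt_log` would give a simple, searchable trail of sources and decisions.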

The EU AI Act uses a risk-based model with prohibited uses, high-risk uses, transparency risks, and minimal-risk uses. It matters especially for schools, recruitment, credit, welfare, worker management, and other areas where automated decisions can affect people. Small organizations do not need to pretend they are enterprises, but they should be able to explain why an AI workflow is low risk, or which controls exist if it touches sensitive decisions.

Source: European Commission, AI Act

The Commission's guidelines for providers of general-purpose AI models, GPAI, clarify which obligations apply and how enforcement is being phased in. The obligations have applied since August 2, 2025, and the Commission's enforcement powers take effect on August 2, 2026. Even if many Hammer readers are not model providers, this affects procurement: ask vendors about documentation, data sources, incident reporting, and how the model may be used in the EU.

Source: European Commission, Guidelines for providers of general-purpose AI models

NIST AI RMF is voluntary, but practical. The framework helps organizations think about risks to individuals, organizations, and society, and in April 2026 NIST released a concept note for an AI RMF profile for critical infrastructure. For small teams, NIST's language works as a checklist: govern, map, measure, and manage. You can start with a simple risk log.

Source: NIST, AI Risk Management Framework

This week's practical Hammer test

Try a "source receipt" workflow in 30 to 45 minutes. Choose one real but bounded workflow: support cases, student questions, quote requests, policy questions, or internal tickets. Use copies or redacted examples.

Do this:

  1. Choose 10 old cases where you already know what a good answer or next step looked like.
  2. Write three rules: AI may read this, AI may suggest this, AI may not send or change anything by itself.
  3. Put any keys and system access behind environment variables, a secret manager, or an account with limited permissions. Do not give the agent more access than the task needs.
  4. Ask AI to summarize the case, propose the next human action, mark uncertainty, and list exactly which sources or fields it used.
  5. Review every suggestion manually and score it: correct, useful after editing, or wrong.
  6. After 10 cases, decide whether the workflow should stop, be adjusted, or be built further with logging, approval gates, and redaction of sensitive information.
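
Steps 5 and 6 can be tallied with a short script. The three labels come from the list above; the thresholds that turn the tally into stop, adjust, or build further are assumptions chosen for illustration and should be set by the team before the test starts.

```python
from collections import Counter

LABELS = ("correct", "useful after editing", "wrong")


def tally(scores: list[str]) -> Counter:
    """Count the review labels and reject anything outside the three allowed."""
    unknown = [s for s in scores if s not in LABELS]
    if unknown:
        raise ValueError(f"unknown labels: {unknown}")
    return Counter(scores)


def decide(scores: list[str]) -> str:
    """Turn the scored cases into one of: stop, adjust, build further."""
    if not scores:
        raise ValueError("no scored cases")
    counts = tally(scores)
    wrong_share = counts["wrong"] / len(scores)
    correct_share = counts["correct"] / len(scores)
    # Assumed thresholds: more than 30 percent wrong means stop.
    if wrong_share > 0.3:
        return "stop"
    # Assumed thresholds: at least 70 percent correct and at most 10 percent wrong.
    if correct_share >= 0.7 and wrong_share <= 0.1:
        return "build further"
    return "adjust"
```

For example, nine correct answers and one that was useful after editing would come out as "build further" under these assumed thresholds, while four wrong answers out of ten would come out as "stop".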

You can copy this prompt:

You are a work assistant, not a decision-maker. Read the material below and do four things:
1. Summarize the case in no more than five bullets.
2. Propose the next human action.
3. State which sources, fields, or quotes you used as evidence.
4. Mark uncertainty and what a human must check before anything is sent or changed.

Rules:
- Do not invent missing information.
- Do not send, publish, change system data, or contact anyone.
- If the material is not enough, say what is missing.

Material:
[paste redacted case here]
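
If you run the prompt above from a script instead of pasting it by hand, a minimal sketch could look like this. It fills the same prompt with a case, masks email addresses first, and reads the API key from an environment variable as step 3 suggests. The variable name `AI_API_KEY` and the function names are hypothetical, the email pattern is deliberately simple, and the actual call to a model is left out because it depends on the vendor.

```python
import os
import re

PROMPT_TEMPLATE = """You are a work assistant, not a decision-maker. Read the material below and do four things:
1. Summarize the case in no more than five bullets.
2. Propose the next human action.
3. State which sources, fields, or quotes you used as evidence.
4. Mark uncertainty and what a human must check before anything is sent or changed.

Rules:
- Do not invent missing information.
- Do not send, publish, change system data, or contact anyone.
- If the material is not enough, say what is missing.

Material:
{material}
"""

# Simple pattern for email addresses; names, phone numbers and IDs are NOT covered.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text: str) -> str:
    """Mask email addresses. Other personal data still needs a manual check."""
    return EMAIL_PATTERN.sub("[email]", text)


def build_prompt(case_text: str) -> str:
    """Insert the redacted case into the prompt template."""
    return PROMPT_TEMPLATE.format(material=redact(case_text))


def load_api_key(var_name: str = "AI_API_KEY") -> str:
    """Read the key from the environment so it never sits in the script."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"set {var_name} in the environment first")
    return key
```

Automatic masking is only a first pass. Reading the redacted case once before it is sent remains part of the human review.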

If the test shows time savings without lowering quality, it is a good first step toward Tool Forge, where Hammer helps build workflows with the right permissions, approvals, logs, and day-to-day usability. Book a short walkthrough if you want to try the same pattern on your own cases.

Companies and tools to watch

  • Anthropic and KPMG: shows how AI is being rolled into large knowledge platforms where accountability and client work already exist.
  • Stainless and MCP: important for anyone who wants agents to reach the right APIs and tools without improvised connections.
  • GitHub Copilot cloud agent: worth watching even for non-developers because it shows agent work getting configuration, policy, and audit trails.
  • OpenAI Codex: shows why longer agent jobs need clearer goals, boundaries, and review.
  • Databricks and Deloitte: provide useful numbers showing that governance and evals are not brakes. They are often what makes production possible.

Deloitte writes that worker access to AI rose by 50 percent in 2025, while only one in five organizations has a mature governance model for autonomous AI agents. That sums up the week fairly well. The tools are getting more capable. The practical gain arrives when the work can be followed, measured, and approved.

Source: Deloitte, The State of AI in the Enterprise 2026
