AI Enablement Radar week 22: build traceability around everyday agents

Adam Olofsson Hammare

The clearest signal in week 22 is not another chat feature. AI is being pushed into the tools where work already happens: Microsoft 365, GitHub, Notion, support flows, document parsing, and tax production. For small teams, that makes the next step fairly concrete. It is less about finding one more model and more about creating traces: who may the agent help, which systems may it reach, which outputs need review, and what do you learn when it gets something wrong?

Top signals this week

  • Microsoft is making Copilot more relevant for small businesses with Microsoft 365 Business Standard with Copilot and Business Premium with Copilot, launching July 1. The point is not only pricing or packaging. It is that AI is connected to Word, Excel, PowerPoint, Outlook, Work IQ, and more than 1,000 business app connectors.

Source: Introducing Microsoft 365 Business with Copilot

  • Microsoft's new Copilot design points in the same direction: the prompt becomes more like a workspace, with better context, history, and output design. Prompt habits are becoming part of the work environment, not a side hobby.

Source: Introducing a new design for Microsoft 365 Copilot

  • Google Cloud's May customer examples show that AI projects often start with very specific operating problems. BASF uses AlphaEvolve to model global supply chains, Urban Outfitters is moving an order-management system from Oracle to AlloyDB, and Movix uses agentic AI for quality control in dental aligner production.

Source: Cool stuff Google Cloud customers built, May edition

  • Lyft shows what happens when domain experts can build and tune support agents themselves. With LangGraph and LangSmith, the time to create new configurable agents fell from roughly six months to roughly two weeks. Lyft also reports automated LLM-as-a-judge evaluation pipelines for 100 percent of production agents, 20 percent lower hallucination and contradiction rates, and a 16 percent increase in AI Resolution Rate.

Source: How Lyft Built a Self-Serve AI Agent Platform

  • GitHub is adding measurement on top of Copilot usage. The new ai_adoption_phase field groups users by behavior over a rolling 28-day window: code first, agent first, and multi-agent. That is a better steering metric than “number of licenses”.

Source: Copilot usage metrics API adds cohorts for AI adoption

  • OpenAI and Thrive Holdings describe Tax AI across Crete's network of more than 30 accounting firms. The system processed 7,000 tax returns during the pilot season, saved about one-third of preparation time, drafted returns with up to 97 percent accuracy, and increased throughput by about 50 percent. The important detail is the improvement loop: practitioner feedback, production traces, and targeted evals.

Source: Building self-improving tax agents with Codex

What organizations are actually doing with AI

This week's more useful examples have one shape in common: AI is not just asked to “answer”. It is put into a bounded process where people already know what good work looks like.

Lyft is a strong customer-service example. Voice of Customer, operations, and product roles can shape agent behavior with prompts and configuration, while LangGraph manages the flow between specialized subagents. An agentic workflow here means AI does not only write a response. It takes several steps through a support process: classifies the case, chooses the right subagent, gathers context, and hands off when the rules require it.

Source: How Lyft Built a Self-Serve AI Agent Platform
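To make the four steps concrete, here is a minimal sketch in plain Python. It is not Lyft's implementation and it does not use the LangGraph API. The keyword-based classifier, the category names, the subagent functions, and the hand-off rule are all assumptions made up for illustration; in a real system a model call would replace the keyword check.

```python
from dataclasses import dataclass, field

# Categories that must always go to a person (an illustrative rule).
HANDOFF_CATEGORIES = {"safety"}


@dataclass
class Case:
    text: str
    category: str = "unknown"
    context: dict = field(default_factory=dict)
    handed_off: bool = False
    steps: list = field(default_factory=list)  # trace of what happened


def classify(case: Case) -> Case:
    # Keyword matching stands in for a model call.
    text = case.text.lower()
    if "refund" in text or "charge" in text:
        case.category = "billing"
    elif "accident" in text or "unsafe" in text:
        case.category = "safety"
    else:
        case.category = "general"
    case.steps.append(f"classified:{case.category}")
    return case


def gather_context(case: Case) -> Case:
    # Placeholder for looking up order or account history.
    case.context["history"] = f"placeholder history for a {case.category} case"
    case.steps.append("context_gathered")
    return case


def draft_billing(case: Case) -> str:
    return "Draft: we are reviewing the charge you mentioned."


def draft_general(case: Case) -> str:
    return "Draft: thanks for reaching out, here is what we found."


# Hypothetical subagents, one per category that AI may handle.
SUBAGENTS = {"billing": draft_billing, "general": draft_general}


def run_workflow(text: str) -> Case:
    case = gather_context(classify(Case(text=text)))
    if case.category in HANDOFF_CATEGORIES:
        # The rules require a human, so no draft is produced.
        case.handed_off = True
        case.steps.append("handed_off_to_human")
        return case
    case.context["draft"] = SUBAGENTS[case.category](case)
    case.steps.append(f"subagent:{case.category}")
    return case
```

The `steps` list is the useful part for a small team: every case leaves a short record of which path it took, so a wrong hand-off or a wrong subagent choice can be found afterward.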

The OpenAI/Thrive example is narrower, which is why it is useful. Tax work has clear documents, fields, checks, and accountability. Tax AI improved not because of magic in the model, but because the product started saving the right evidence: source material, extracted fields, citations, practitioner corrections, and the final filed return. A smaller Nordic team needs the same kind of traceability when AI helps with proposals, student material, contracts, project reports, or finance admin.

Source: Building self-improving tax agents with Codex
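For a small team, "saving the right evidence" can start as one record per AI-assisted task. The sketch below is an assumption about what such a record could look like, not a description of how Tax AI stores its data. The class name `TraceRecord`, its fields, and the JSON Lines output are hypothetical choices.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass
class TraceRecord:
    task: str
    source_documents: list   # what the AI was given
    extracted_fields: dict   # what the AI produced
    citations: dict          # field name -> where in the source it came from
    reviewer_corrections: dict = field(default_factory=dict)
    final_output: dict = field(default_factory=dict)
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def apply_correction(self, field_name, corrected_value):
        # Keep both values so the difference can be studied later.
        self.reviewer_corrections[field_name] = {
            "ai_value": self.extracted_fields.get(field_name),
            "human_value": corrected_value,
        }

    def finalize(self):
        # The final output is the AI draft with human corrections applied on top.
        self.final_output = dict(self.extracted_fields)
        for name, change in self.reviewer_corrections.items():
            self.final_output[name] = change["human_value"]
        return self.final_output


def to_jsonl_line(record: TraceRecord) -> str:
    # One line per record, ready to append to a log file.
    return json.dumps(asdict(record), ensure_ascii=False)


record = TraceRecord(
    task="proposal draft",
    source_documents=["customer_brief.pdf"],
    extracted_fields={"price": 1000, "delivery_weeks": 4},
    citations={"price": "customer_brief.pdf, page 2"},
)
record.apply_correction("price", 1200)
record.finalize()
```

The corrections are where the learning sits. A weekly look at which fields humans keep changing tells you where the routine needs better sources or clearer instructions.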

Google Cloud's customer roundup shows the same point from another angle. BASF is trying to understand supply chains with 180 production sites and more than 5,000 value chains. That is far from a local small business, but the principle scales down: start with a process where you already have history, rules, and human decisions. Then AI can help find patterns without getting the final word.

Source: Cool stuff Google Cloud customers built, May edition

For schools and learning environments, the lesson is similar. Do not start with “AI across the whole school”. Pick a workflow with a clear boundary: summarize open course material, compare a lesson plan with learning objectives, draft a first question bank for teacher review, or simplify student instructions. These are small systems, but they build the habits needed before AI handles heavier work.

The tooling layer: platforms, agents, and workflows

MCP, the Model Context Protocol, is a way to connect AI systems to external tools and data sources through a shared connection model. But this week's tooling signal is broader than MCP. Platforms are trying to build control planes around agents: measurement, model rules, memory, sandboxes, document parsing, and reusable workflows.

GitHub's new Copilot metrics make adoption more concrete. If only two people use agent surfaces while the rest stay in code completion, training should look different from the training you would run for a team that already works with multiple agents. GitHub also released targeted model rules, so enterprise owners can control which Copilot models are available to different organizations.

Sources: Copilot usage metrics API adds cohorts for AI adoption, Target Copilot models to organizations with model rules
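Once per-user data is exported, turning it into a steering number is a few lines of work. The sketch below assumes a list of rows that each carry an `ai_adoption_phase` value. The row layout, the logins, and the exact phase strings are assumptions for illustration; check the real API response before relying on them.

```python
from collections import Counter

# Hypothetical export: one row per user. The real payload may look different.
users = [
    {"login": "user-a", "ai_adoption_phase": "code first"},
    {"login": "user-b", "ai_adoption_phase": "code first"},
    {"login": "user-c", "ai_adoption_phase": "agent first"},
    {"login": "user-d", "ai_adoption_phase": "multi-agent"},
]


def cohort_shares(rows):
    """Return the share of users in each adoption phase."""
    counts = Counter(row.get("ai_adoption_phase", "unknown") for row in rows)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {phase: round(n / total, 2) for phase, n in counts.items()}


shares = cohort_shares(users)
```

With the sample rows above, half the team sits in the code-first cohort, which would argue for training on agent surfaces rather than advanced multi-agent patterns.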

GitHub's memory controls are small but important. Copilot Memory can now be disabled at repository level and managed through the CLI, and it is clearer about whether a stored item is a personal preference or a repository fact. For non-technical teams, the translation is simple: decide what AI may remember at individual, team, and process level. Write that down before memory becomes an invisible habit.

Source: Copilot Memory has more controls for deletion, scope, and the Copilot CLI

LangChain is moving in a practical direction with interpreter skills. A skill is a reusable agent instruction. An interpreter skill can also include a tested TypeScript module that the agent imports and runs in a controlled runtime. That means important steps can live in code while the model decides when to use the routine.

Source: Building workflows for agents with Skills and Interpreters

LlamaIndex released ParseBench, a benchmark for document parsing with roughly 2,000 human-verified enterprise pages and more than 167,000 test rules. It sounds narrow, but many AI projects break exactly there: a PDF is parsed incorrectly, a table row lands under the wrong header, or an agent reads a crossed-out price as the active price.

Source: ParseBench: The First Document Parsing Benchmark for AI Agents
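The idea of rule-based parsing checks can be tried on your own documents without any benchmark tooling. The sketch below is not ParseBench's rule format, which is not described here. It assumes a parsed table stored as a dictionary with `headers` and `rows`, and a hypothetical rule shape of (row index, header, expected value).

```python
def check_cell(table, row_index, header, expected):
    """Rule: the cell under `header` in the given row must equal `expected`."""
    if header not in table["headers"] or row_index >= len(table["rows"]):
        return False
    column = table["headers"].index(header)
    row = table["rows"][row_index]
    return column < len(row) and row[column] == expected


def pass_rate(table, rules):
    """Share of rules that hold for one parsed table."""
    if not rules:
        return 0.0
    passed = sum(check_cell(table, *rule) for rule in rules)
    return passed / len(rules)


# Rules written by a person who has looked at the original page.
RULES = [(0, "Item", "Widget"), (0, "Qty", "2"), (0, "Price", "40.00")]

good_parse = {
    "headers": ["Item", "Qty", "Price"],
    "rows": [["Widget", "2", "40.00"]],
}
# Same page, but the parser put two values under the wrong headers.
shifted_parse = {
    "headers": ["Item", "Qty", "Price"],
    "rows": [["Widget", "40.00", "2"]],
}
```

A handful of such rules for your most common invoice or price-list layout is enough to notice when a parser update or a new template quietly breaks the input the agent depends on.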

Notion's developer platform points to the same everyday future where agents share a workspace with people. Notion describes data sync, Workers, external agents, and custom tools for Custom Agents. That is a sign that an “AI workspace” is not just chat. It is a place where data, routines, integrations, and accountability meet.

Source: What’s New – Notion

Governance and risk: what needs to be in place before scaling

AI governance does not have to mean a policy binder nobody reads. In practice, AI governance is the set of simple rules that decide when AI may act, which data it may use, who approves the result, and how you can see what happened afterward.

The EU AI Act is risk-based and groups AI use into levels such as unacceptable risk, high risk, transparency risk, and minimal risk. For smaller organizations, the practical starting point is to sort your own AI ideas by consequence: does this affect hiring, education, credit, health, safety, or public-service-like decisions? If so, you need more documentation, testing, and human control before scaling.

Source: AI Act – European Commission
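The sorting exercise can be done in a spreadsheet, but a short script shows the logic. This is a rough triage aid, not a legal classification under the AI Act. The area list mirrors the question above, and the idea names and the `affects` tags are hypothetical examples you would replace with your own.

```python
# Areas taken from the question above; adjust to your own context.
SENSITIVE_AREAS = {"hiring", "education", "credit", "health", "safety", "public service"}


def triage(ideas):
    """Split AI ideas into those that touch a sensitive area and those that do not."""
    result = {"needs_more_control": [], "lighter_touch": []}
    for idea in ideas:
        touched = sorted(SENSITIVE_AREAS & set(idea["affects"]))
        bucket = "needs_more_control" if touched else "lighter_touch"
        result[bucket].append({"name": idea["name"], "sensitive_areas": touched})
    return result


ideas = [
    {"name": "Rank job applicants", "affects": ["hiring"]},
    {"name": "Summarize meeting notes", "affects": ["internal admin"]},
    {"name": "Suggest grades for essays", "affects": ["education"]},
]
sorted_ideas = triage(ideas)
```

Anything that lands in `needs_more_control` is a prompt to check the actual regulation and to plan documentation, testing, and human review before the idea moves past a pilot.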

OpenAI wrote this week about third-party evaluations for modern agentic systems. They point out that an eval does not only test the model. It also tests the harness: prompts, tools, memory, retries, validators, and control logic around the model. For Hammer readers, that is a useful definition of evals: small recurring tests that show whether the AI routine works in the environment where it will actually be used.

Source: A shared playbook for trustworthy third party evaluations
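A recurring eval of this kind can be very small. The sketch below assumes the whole routine (prompt, tools, validators, and model) is wrapped in one callable, so the test exercises the harness and not only the model. The function names, the stub routine, and the four test cases are hypothetical; the stub stands in for a real AI call so the example runs on its own.

```python
def run_eval(routine, cases):
    """Run every case through the routine and record pass or fail."""
    results = []
    for case in cases:
        try:
            output = routine(case["input"])
            passed = bool(case["check"](output))
        except Exception as exc:  # a crash in the harness is also a failed case
            output, passed = f"error: {exc}", False
        results.append({"name": case["name"], "passed": passed, "output": output})
    return results


def summarize(results):
    passed = sum(r["passed"] for r in results)
    return f"{passed}/{len(results)} passed"


def stub_routine(text):
    # Stand-in for the real routine (prompt + tools + model).
    reply = "Thank you for contacting us."
    if "refund" in text.lower():
        reply += " A colleague will review your refund request."
    return reply


CASES = [
    {"name": "greets politely", "input": "Hi",
     "check": lambda out: out.startswith("Thank you")},
    {"name": "never promises a refund", "input": "I want a refund",
     "check": lambda out: "we will refund" not in out.lower()},
    {"name": "mentions human review on refunds", "input": "Refund please",
     "check": lambda out: "colleague" in out},
    {"name": "includes a case number", "input": "Where is my order?",
     "check": lambda out: "case #" in out.lower()},
]

report = summarize(run_eval(stub_routine, CASES))
```

Run the same cases after every change to the prompt, the tools, or the model version. A score that drops tells you the environment changed, even if the model did not.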

LangSmith's Auth Proxy shows what safe integration can look like when agents run code and call services. Instead of putting API keys inside the agent runtime, a proxy can keep secrets outside the sandbox, allow only approved destinations, and inject authentication at the network layer. The same principle works at a smaller scale: use environment variables or a secret manager, scoped API keys, least-privilege permissions, logs, output redaction, and approval gates for actions that affect customers, money, staff, or students.

Source: How Auth Proxy secures network access for LangSmith agent sandboxes
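The smaller-scale version of that principle fits in a short script. This is not how LangSmith's Auth Proxy works internally, and an in-process check is weaker than enforcement at the network layer. It only illustrates three habits: read the secret from the environment, refuse destinations that are not on an approved list, and redact the secret before logging. The host `api.example.com` and the variable name `SERVICE_API_TOKEN` are placeholders.

```python
import os
from urllib.parse import urlparse

# Only these hosts may be called by the routine (placeholder value).
ALLOWED_HOSTS = {"api.example.com"}


def is_allowed(url):
    """True only when the URL's host is exactly on the approved list."""
    return urlparse(url).hostname in ALLOWED_HOSTS


def build_request(url, env=os.environ):
    """Prepare a request description; the secret never appears in the code."""
    if not is_allowed(url):
        raise PermissionError(f"destination not approved: {url}")
    token = env.get("SERVICE_API_TOKEN")  # hypothetical variable name
    if not token:
        raise RuntimeError("SERVICE_API_TOKEN is not set")
    return {"url": url, "headers": {"Authorization": f"Bearer {token}"}}


def redact(text, secret):
    """Remove the secret from anything that is about to be logged."""
    return text.replace(secret, "[REDACTED]") if secret else text


request = build_request(
    "https://api.example.com/v1/orders",
    env={"SERVICE_API_TOKEN": "demo-token"},
)
log_line = redact(str(request), "demo-token")
```

Passing `env` as a parameter keeps the function easy to test without touching real credentials. In production the default `os.environ` or a secret manager client would supply the value.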

Google Cloud's security article for the public sector uses a good phrase: treat AI as a muse, not an oracle. That advice also works outside government. Let AI reduce admin and gather context, but keep humans responsible for final decisions when the work has legal, financial, or personal consequences.

Source: Cloud CISO Perspectives: How to build an AI-ready security program for the public sector

This week's practical Hammer test

The test takes 30–45 minutes and fits a small team that already uses AI but has not yet connected it properly to daily work.

Choose one recurring task where AI already helps a little: customer replies, proposal drafts, meeting notes, lesson planning, project reporting, support triage, or document review. Then create a simple agent card.

  • Task: What should AI help with, and what should it not do?
  • Sources: Which files, systems, or pages may it read?
  • Actions: May it only suggest, or may it also write, create, change, or send?
  • Access: Which keys, accounts, or integrations are needed, and how do you scope them?
  • Evidence: Which citations, logs, or before/after values must be saved?
  • Human review: Who approves before anything reaches a customer, student, supplier, or finance workflow?
  • Eval: Which five examples will you retest every week to see whether the routine is improving or getting worse?
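If you prefer to keep the card next to your scripts rather than in a document, the seven points above can be sketched as a small data structure. The class name `AgentCard`, the field names, and the gap rules are assumptions for illustration, not a standard format.

```python
from dataclasses import dataclass, field


@dataclass
class AgentCard:
    task: str
    out_of_scope: list      # what AI should not do
    sources: list           # files, systems, or pages it may read
    actions: list           # e.g. ["suggest"] or ["suggest", "send"]
    access: list            # keys, accounts, integrations and their scope
    evidence: list          # citations, logs, before/after values to save
    human_reviewer: str     # who approves before anything leaves the team
    eval_examples: list = field(default_factory=list)

    def gaps(self):
        """List what is still missing before the routine should be used."""
        problems = []
        if not self.sources:
            problems.append("no sources listed")
        if not self.evidence:
            problems.append("no evidence to save")
        if set(self.actions) - {"suggest"} and not self.human_reviewer:
            problems.append("actions beyond 'suggest' need a named reviewer")
        if len(self.eval_examples) < 5:
            problems.append(f"only {len(self.eval_examples)} of 5 eval examples defined")
        return problems


card = AgentCard(
    task="Draft replies to customer questions about delivery times",
    out_of_scope=["promise refunds", "change orders"],
    sources=["delivery policy page", "order system (read only)"],
    actions=["suggest"],
    access=["read-only order system account"],
    evidence=["cited policy section", "draft and final reply"],
    human_reviewer="support lead",
    eval_examples=["late parcel", "wrong address"],
)
```

Calling `card.gaps()` on the example returns one item, because only two of the five weekly test examples are filled in. An empty list means the card is complete enough for a first trial.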

Copy this prompt:

You are our AI workflow reviewer. Help us create an agent card for this recurring task: [describe the task].

Suggest:
1. which sources AI may read,
2. which actions AI may take on its own and which require human approval,
3. which permissions or integrations are needed,
4. how we protect secrets with scoped access, environment variables, or a secret manager,
5. which logs or citations must be saved,
6. five test cases we can reuse as a simple eval.

Be practical. Assume we are a small team and want a working first version next week.

The boundary is simple: let AI prepare and propose. Let a human approve decisions, external messages, money, personal data, and system changes until the routine has proved itself.
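In a scripted routine, that boundary can be a single check in front of every action. The sketch below is an assumption about how such a gate might look. The action kinds mirror the list above, while the function name `execute`, the dictionary layout, and the status strings are hypothetical.

```python
# Kinds of action that a human must approve, taken from the boundary above.
NEEDS_APPROVAL = {
    "decision", "external_message", "money", "personal_data", "system_change",
}


def execute(action, approved_by=None):
    """Carry out an action only if it is low-risk or a named person approved it."""
    if action["kind"] in NEEDS_APPROVAL and not approved_by:
        # Nothing happens yet; the proposal waits for a person.
        return {"status": "pending_approval", "action": action["summary"]}
    return {
        "status": "done",
        "action": action["summary"],
        "approved_by": approved_by or "not required",
    }


draft = {"kind": "internal_draft", "summary": "Prepare reply draft"}
send = {"kind": "external_message", "summary": "Send reply to customer"}

results = [execute(draft), execute(send), execute(send, approved_by="support lead")]
```

Saving each returned dictionary to a log gives you the record of who approved what. Once a routine has proved itself, you shrink `NEEDS_APPROVAL` deliberately instead of letting approvals fade away.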

Companies and tools to watch

  • Microsoft 365 Copilot: important for small businesses because Copilot is being built into existing work apps and business connectors.
  • GitHub Copilot: interesting beyond developer teams because adoption, memory, and model access are becoming measurable and governable.
  • LangChain/LangSmith: shows how agent work moves from prompt to operations, evals, sandboxes, and safe network access.
  • LlamaIndex/LlamaParse: reminds us that document quality often decides whether AI can act correctly.
  • Notion Developer Platform: turns the workspace into an integration surface for people, data, and agents.

If you want to move from loose AI experiments to a traceable workflow, this fits Hammer Automation's Tool Forge. We map the task, set the right integrations, and build a first routine with permissions, logs, and human approvals. Start with an agent card and contact us through Tool Forge if you want to build the next version together.
