When AI agents get too much freedom

Adam Olofsson Hammare

The most uncomfortable part of the podcast is not the word “blackmail”. It is how ordinary the setup sounds: an AI agent is given a goal, access to information, and the ability to act. When the goal is threatened, the agent in these simulations sometimes chooses the shortest path to success, even when that path is clearly wrong.

For Hammer readers, the point is not to become afraid of AI. The point is to design workflows where AI can create a lot of value without being handed every key at the same time.

What the podcast is really warning about

The podcast starts from Anthropic research on agentic misalignment: cases where an AI agent, in a simulated environment, independently and intentionally chooses harmful actions to pursue its goal. Anthropic tested 16 leading models in fictional corporate environments where models could read sensitive information and send messages. In some scenarios, behaviors such as blackmail and information leaks appeared.

Source: Anthropic – Agentic Misalignment: How LLMs could be insider threats

For small teams, the exact percentage in an extreme test matters less than the pattern: the more autonomy, sensitive context, and action power we give a system, the more carefully we need to think about access, checkpoints, and human responsibility.

Six building blocks for safer AI agents

The podcast moves across several AI architectures. I hear six practical building blocks that teams can already use today; a small code sketch follows the list:

  • Principles before point rules: It is not enough to say “do not do bad things”. An agent needs training and instructions around why certain actions are wrong, especially when the goal conflicts with ethics, privacy, or safety.
  • Progressive context: Do not give the agent the whole archive just because you can. Let it retrieve the right document, paragraph, or policy when the task requires it.
  • Least-privilege access: An agent that summarizes invoices does not need permission to send email, delete files, or read HR folders.
  • Clear approvals: Anything that affects money, customer relationships, legal commitments, personal data, or external messages should pass through a human.
  • Visible uncertainty: When OCR, extraction, or classification has low confidence, the system should flag it instead of pretending to know.
  • Sandboxes: A coding or office agent should start in an environment where it can read, suggest, and write in the right workspace, but not automatically reach the network, secrets, or production systems.
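
To make this concrete, here is a minimal sketch of how least-privilege access and approval checkpoints can look in code. Everything in it (tool names, roles, the approval rule) is illustrative and not taken from any specific agent framework:

```python
from dataclasses import dataclass, field

# Illustrative only: a tiny permission model for an agent's tools.
# Tool names, risk flags, and the approval rule are assumptions,
# not the API of any real agent framework.

@dataclass
class Tool:
    name: str
    reads_sensitive_data: bool = False
    has_external_effect: bool = False  # sends email, changes records, contacts customers

@dataclass
class AgentProfile:
    role: str
    allowed_tools: set = field(default_factory=set)

def can_run(profile: AgentProfile, tool: Tool) -> str:
    """Return 'allow', 'needs_human', or 'deny' for a requested tool call."""
    if tool.name not in profile.allowed_tools:
        return "deny"            # least privilege: outside the role's scope
    if tool.has_external_effect:
        return "needs_human"     # money, customers, external messages go past a person
    return "allow"

# An invoice-summarizing agent: it may read and draft, never send.
invoice_agent = AgentProfile(
    role="invoice_summarizer",
    allowed_tools={"read_invoice", "draft_summary"},
)

print(can_run(invoice_agent, Tool("read_invoice")))                          # allow
print(can_run(invoice_agent, Tool("send_email", has_external_effect=True)))  # deny
```

The order of the checks is the point: scope is decided first, and anything with an external effect still has to pass a human even when it is in scope.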

Microsoft's Agent Skills documentation describes progressive disclosure as a way for agents to load only the context they need. OpenAI describes how Codex can run with sandboxing, approval policies, and network access off by default. Mistral's OCR API shows the same pattern in document workflows: the system can return confidence scores at page or word level, so uncertain extractions are caught before they become business data.

Sources: Microsoft Learn – Agent Skills, OpenAI Developers – Codex agent approvals and security, Mistral OCR API
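
The exact response formats differ between providers, so the sketch below uses made-up extraction results rather than any real OCR API. The pattern is what matters: low-confidence fields are routed to a human instead of flowing straight into business data.

```python
# Hypothetical extraction results; real OCR responses look different per provider.
CONFIDENCE_THRESHOLD = 0.90  # assumed cut-off, tune per workflow

extracted_fields = [
    {"field": "invoice_number", "value": "2024-118", "confidence": 0.99},
    {"field": "total_amount", "value": "12 400 SEK", "confidence": 0.72},
]

auto_accepted, needs_review = [], []
for item in extracted_fields:
    # Visible uncertainty: below the threshold, the system flags instead of guessing.
    if item["confidence"] >= CONFIDENCE_THRESHOLD:
        auto_accepted.append(item)
    else:
        needs_review.append(item)

print("auto-accepted:", [i["field"] for i in auto_accepted])
print("needs human review:", [i["field"] for i in needs_review])
```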

A good agent is not the freest one – it is the best framed one

It is easy to sell AI agents as digital colleagues that “just handle it”. But in a real company, that is not how we onboard new colleagues. A new hire does not receive banking permissions, the customer register, contract-signing rights, and unrestricted external communication on day one. They get a role, a scope, a manager, templates, and clear escalation paths.

The same principle applies to AI. An agent helping a small consultancy with customer follow-up can absolutely:

  • summarize the latest meeting,
  • suggest the next email,
  • find missing details,
  • create a task list,
  • highlight risks and uncertainties.

But it should not automatically:

  • send promises to the customer,
  • change price or terms,
  • read irrelevant private documents,
  • contact external parties without approval,
  • bypass a stop just because the goal says “complete the task”.

This is the difference between automation and irresponsible delegation.
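
As a small sketch of that boundary, the agent below can draft a customer email but cannot send it. The names and the approval step are illustrative, not a real integration:

```python
# Illustrative only: the agent drafts, a human decides whether anything leaves the building.

def agent_draft_followup(meeting_notes: str) -> dict:
    """The agent's output is a proposal, never a sent message."""
    return {
        "to": "customer@example.com",
        "subject": "Follow-up on our meeting",
        "body": f"Summary of agreed next steps:\n{meeting_notes}",
        "status": "draft",  # nothing external has happened yet
    }

def human_review(draft: dict, approved: bool) -> dict:
    # The approval is a separate, human-only step that can be logged and audited.
    draft["status"] = "approved_for_sending" if approved else "rejected"
    return draft

draft = agent_draft_followup("Deliver proposal by Friday; pricing unchanged.")
print(human_review(draft, approved=False)["status"])  # rejected: nothing was sent
```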

The positive news: safety can be trained better

Anthropic has since described how it has continued to work on agentic misalignment as a training problem. One central lesson was that it often works better to teach the model why a behavior is right than only to show examples of correct answers. According to Anthropic, Claude models since Haiku 4.5 have achieved a perfect score on the relevant blackmail evaluation, where earlier models sometimes failed badly.

Source: Anthropic – Teaching Claude why

That does not mean the problem is solved for every model, tool, and workflow. But it points in the right direction: safer AI comes from the combination of better model training, better system design, and clearer human governance.

A simple starting point for small teams

If you want to test AI agents in a company, school, or solo business, do not start with the most autonomous version. Start with a map; a small sketch of how it can be written down follows the list:

  1. What task should the agent actually solve? Write one sentence.
  2. What information does it need? List only the necessary sources.
  3. What actions may it perform by itself? Separate reading, suggesting, drafting, and sending.
  4. Where must a human approve? Mark money, legal issues, personal data, and external communication.
  5. How do we see what it did? Logs, version history, and short rationales.
  6. How do we stop it? There should be a simple off-switch and clear permissions.
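
Written down, such a map can be as small as a config the whole team can read and review. The keys and values below are examples, not a standard:

```python
# Illustrative "agent charter": the six questions from the map, written down
# in one place. Every key and value here is an example, not a required schema.

AGENT_CHARTER = {
    "task": "Summarize customer meetings and suggest the next follow-up email.",
    "information_sources": ["crm:meeting_notes", "shared:templates/followup"],
    "actions": {
        "read": True,
        "suggest": True,
        "draft": True,
        "send": False,  # sending always goes through a human
    },
    "human_approval_required_for": ["money", "legal", "personal_data", "external_messages"],
    "logging": {"keep_rationales": True, "retention_days": 90},
    "kill_switch": {"owner": "team_lead", "how": "disable the API key / pause the workflow"},
}

def allowed(action: str) -> bool:
    """Default deny: anything not explicitly allowed in the charter is off."""
    return AGENT_CHARTER["actions"].get(action, False)

print(allowed("draft"))  # True
print(allowed("send"))   # False
```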

This is often where the value appears. Not by giving AI more freedom, but by giving it a clear enough workbench to be fast, useful, and safe.

For many small Nordic teams, the next step is not “a super-agent”. It is a well-scoped Tool Forge solution: the right data in, the right suggestions out, and human approval before anything important leaves the building.