OpenAI Codex release notes: self-improving eval loops in practice

Adam Olofsson Hammare

This is not a new button in the Codex CLI. It still belongs on an OpenAI Codex release-notes watchlist, because OpenAI is showing how Codex gets used when a production system has to improve from real feedback rather than from another pile of prompts.

For Hammer readers, the point is practical. A coding agent is an AI agent that can read code, suggest changes and run checks inside a bounded environment. An eval is a test or measurement that tells you whether the agent actually improved a specific task. When those two are tied to production evidence, human review and clear stop rules, Codex starts to look less like a chat box and more like a controlled improvement workshop.

OpenAI Codex release notes: the signal is feedback becoming work

The formal Codex changelog still lists Codex CLI 0.134.0 as the latest stable CLI entry. We already covered that release, with profiles, MCP and local history search. Today's new signal comes instead from OpenAI Engineering: the Tax AI article, built by OpenAI and Thrive Holdings for Crete's network of accounting firms.

Source: Codex changelog: Codex CLI 0.134.0 and OpenAI Engineering: Building self-improving tax agents with Codex.

OpenAI describes a production workflow where practitioner corrections become more than support tickets. They become structured findings, targeted evals and bounded Codex tasks. That is the release-note value here: Codex moves from "help me write code" toward "investigate this repeated failure, improve the right part of the system and prove it with evals."

What OpenAI showed in Tax AI

Tax AI helps US tax practitioners prepare 1040 and 1041 returns. In the pilot environment, the system processed 7,000 returns, saved about one-third of preparation time, drafted returns with up to 97% accuracy and increased throughput by about 50%. OpenAI also says the share of returns reaching 75% correct field completion rose from a quarter at launch to 86% within six weeks.

Source: OpenAI Engineering: measurable self-improvement in Tax AI.

The important part is not US tax itself. It is how the team made errors usable. A changed value could mean a real extraction miss, a mapping problem, missing product support, a practitioner preference or normal workflow noise. Before Codex was allowed to act, the differences had to be grouped, reviewed and made measurable.
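A minimal sketch of that grouping step, assuming a human reviewer has already labelled each correction with a cause. The cause names come from the list above; the `Correction` record, the `ACTIONABLE` set and the choice of which causes count as product defects are my own assumptions, not OpenAI's implementation.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Cause(Enum):
    EXTRACTION_MISS = "extraction_miss"
    MAPPING_PROBLEM = "mapping_problem"
    MISSING_SUPPORT = "missing_product_support"
    PREFERENCE = "practitioner_preference"
    NOISE = "workflow_noise"


# Assumption: only these causes point at a product defect worth an eval.
ACTIONABLE = {Cause.EXTRACTION_MISS, Cause.MAPPING_PROBLEM, Cause.MISSING_SUPPORT}


@dataclass(frozen=True)
class Correction:
    field: str
    before: str
    after: str
    cause: Cause  # assigned by a human reviewer, not guessed by the agent


def actionable_groups(corrections):
    """Count reviewed corrections per (field, cause), skipping preference and noise."""
    return Counter(
        (c.field, c.cause) for c in corrections if c.cause in ACTIONABLE
    )
```

The largest groups are the natural candidates for a targeted eval. Preferences and noise stay out of the agent's queue.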

From production trace to controlled Codex task

OpenAI describes three parts: stay close to experts, build the product so production creates evidence, and let Codex work against reviewed evals. A production trace is the saved chain from source document to extracted field, provenance, mapping, human correction and final result. Without that trace, the agent does not know what to improve.
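To make the trace concrete, here is a hedged sketch of what one saved record could look like. The links in the chain follow the description above; the class name, field names and JSON serialization are hypothetical and not taken from the Tax AI system.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class Trace:
    source_document: str             # where the value came from
    extracted_field: str             # which field was filled
    extracted_value: str             # what the system produced
    provenance: str                  # e.g. page or region in the source
    mapping: str                     # target field in the downstream system
    human_correction: Optional[str]  # None if the reviewer accepted the value
    final_value: str                 # what was actually filed or sent

    def was_corrected(self) -> bool:
        """True when a human changed the extracted value."""
        return (
            self.human_correction is not None
            and self.human_correction != self.extracted_value
        )

    def to_json(self) -> str:
        """Serialize the trace so it can be stored as read-only evidence."""
        return json.dumps(asdict(self), sort_keys=True)
```

The point of the immutable record is that every correction can later be replayed as a test case without touching the original data.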

Source: OpenAI Engineering: practitioner feedback, production traces and Codex-driven eval loops.

The most useful part of the article is the work environment. OpenAI shows a bounded candidate setup where Codex gets a writable worktree, relevant product files, targeted evals, regression suites, skills and documentation. Production traces, source documents and tax-engine documentation sit as read-only evidence. That is a healthy integration pattern: let the agent see enough to debug, but do not give it free rein over original data or production systems.

Source: OpenAI Engineering: bounded Codex task environment.
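The writable versus read-only split can be illustrated with a small path check. This is only a sketch of the idea under my own assumptions (the function name and folder layout are hypothetical). It is not how Codex enforces its sandbox, which has its own approval and permission settings.

```python
from pathlib import Path


def is_writable(path, writable_roots):
    """Return True only if `path` resolves inside one of the writable roots.

    Resolving first means tricks like `worktree/../evidence` are rejected.
    Requires Python 3.9+ for Path.is_relative_to.
    """
    resolved = Path(path).resolve()
    return any(
        resolved.is_relative_to(Path(root).resolve()) for root in writable_roots
    )
```

In a real setup you would still rely on file-system permissions or a container mount for enforcement. A check like this mainly helps in tooling that prepares or audits the candidate folder.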

What Swedish teams can test now

A Swedish automation team can use the same pattern at a smaller scale: customer support tickets, quote workflows, scheduling exceptions, invoice extraction, internal knowledge bases or report automation. Pick one repeated error that people already correct. Save before-and-after values, source, decision and test case. Then build a small eval before Codex gets to suggest changes.
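A small eval does not need a framework. The sketch below assumes each reviewed case is a dict with `id`, `input` and `expected`, and that the code under test is a plain function. Both `run_eval` and the `parse_amount` example for Swedish-formatted invoice amounts are hypothetical names for illustration.

```python
def run_eval(cases, predict):
    """Run `predict` on every reviewed case and report pass rate and failures."""
    failures = [c["id"] for c in cases if predict(c["input"]) != c["expected"]]
    passed = len(cases) - len(failures)
    pass_rate = passed / len(cases) if cases else 0.0
    return {"pass_rate": pass_rate, "failures": failures}


def parse_amount(text):
    """Hypothetical function under test: parse amounts like '1 000,50'."""
    return float(text.replace(" ", "").replace(",", "."))


if __name__ == "__main__":
    cases = [
        {"id": "inv-1", "input": "1 000,50", "expected": 1000.5},
        {"id": "inv-2", "input": "12,00", "expected": 12.0},
    ]
    print(run_eval(cases, parse_amount))
```

Run the same cases before and after any change, so the eval measures the improvement and not just the final state.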

This fits Tool Forge work: build an AI work environment where the integration is useful but controlled. Use environment variables or a secret manager for keys, scoped permissions, read-only traces, redaction of sensitive fields, approval gates before changes and logs people can review. Then Codex becomes part of the improvement process, not a black box next to it.
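Two of those controls are easy to sketch: reading keys from the environment instead of from code, and redacting sensitive fields before a trace is shared with the agent. The helper names and the key names in `SENSITIVE_KEYS` are assumptions you would replace with your own schema.

```python
import os

# Hypothetical field names; adjust to the fields your traces really contain.
SENSITIVE_KEYS = frozenset({"name", "email", "personal_id"})


def require_env(name):
    """Read a secret from the environment and fail loudly if it is missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing environment variable: {name}")
    return value


def redact(record, sensitive=SENSITIVE_KEYS):
    """Return a copy of `record` with sensitive values masked."""
    return {
        key: "[REDACTED]" if key in sensitive else value
        for key, value in record.items()
    }
```

Redacting by key only works for flat, well-named records. Free text and nested data need a separate review before they leave the production system.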

Short example: try the pattern in Codex

Human step: choose one repeated, already reviewed failure and collect examples, expected output, relevant files and one eval command in a separate branch or candidate folder. Share only the evidence the agent needs.

Then paste a short instruction into Codex:

Review this candidate improvement loop. Use the provided trace, expected output and eval command to identify the smallest product change that could fix the repeated failure. Do not edit files yet. Return: suspected root cause, files you would inspect, evals to run, data that must stay read-only, and the approval point before any code change.

Good output should:

  • separate writable code from read-only production evidence
  • suggest a small change, not a rebuild of the whole workflow
  • name the evals and regressions that would prove the improvement
  • stop at human approval before files change
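The stop rule before human approval can also be written down. This sketch assumes you have pass rates for a targeted eval and a regression suite, both before and after the candidate change. The function name and dict keys are hypothetical.

```python
def ready_for_review(baseline, candidate):
    """Decide whether a candidate change is worth a human reviewer's time.

    Both arguments are dicts with 'targeted' and 'regression' pass rates
    between 0.0 and 1.0. A True result is not an approval; it only means
    the candidate may be shown to a person who decides.
    """
    improved = candidate["targeted"] > baseline["targeted"]
    no_regression = candidate["regression"] >= baseline["regression"]
    return improved and no_regression
```

A candidate that fixes the targeted failure but lowers the regression pass rate is rejected automatically, which keeps the review queue short.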

What to watch next

I would not migrate a process just because a GitHub alpha tag exists. Wait for a clear Codex changelog or stable release if you need version decisions. But start collecting traces and evals now. When the next Codex CLI release lands, the team with good test cases, clear permissions and reviewed improvement candidates will be in a much better place than the team with a long list of "AI ideas."
