Anthropic releases Claude Opus 4.8 with a reliability gain for agentic coding

Anthropic released Claude Opus 4.8 on May 28, 2026. It is available immediately via the API as claude-opus-4-8. The model’s one substantive improvement over its predecessor is reliability in agentic coding: Anthropic says it is around four times less likely than its predecessor to allow flaws in code it has written to pass unremarked. Pricing is unchanged from Claude 4.7 — $5 per million input tokens and $25 per million output at standard rates, $10 and $50 in fast mode.

The reliability number is the story

For most model releases, the benchmark grid is the story. For Opus 4.8, one number stands above the rest. Anthropic’s claim is specific: the model is around four times less likely than its predecessor to let flaws in code it has written pass unremarked. That is not a claim about fewer bugs entering the codebase from outside, or about reasoning quality in general. It is about self-review — whether the model catches what it introduced.

In a single-shot completion, that difference rarely matters. In an agentic loop that writes code, calls tools, checks its own output, and iterates, a model that misses its own mistakes compounds those errors across steps. The fix has to come from the agent, not from a human watching every turn. That is the workload where catching self-introduced flaws is worth more than another benchmark point.

Anthropic also cites an 84% score on Online-Mind2Web and calls Opus 4.8 the first model to break 10% all-pass on the Legal Agent Benchmark.

~4× Fewer self-introduced code flaws passing unflagged vs. its predecessor, per Anthropic Anthropic

84% Online-Mind2Web Agentic web tasks benchmark Anthropic

$5 / $25 Price per million tokens (standard) Input / output — unchanged from 4.7 Anthropic

What else ships with 4.8

Beyond the reliability improvement, Opus 4.8 arrives with two additions to claude.ai that extend how the model works in practice.

Dynamic workflows let the model spin up parallel subagents and hand work off across them — a research preview, not a generally available feature. If you are building multi-step pipelines that currently run sequentially, this is the surface worth watching: structured parallel execution changes the economics of long research or code-generation jobs in ways that sequential tool use does not.

Effort control gives users a knob over how much reasoning the model applies to a given request. Spending compute on a hard problem and saving it on a routine one is the correct behavior for high-volume agentic traffic, and wiring that control in explicitly matters more than leaving it to the model’s defaults.

The counter-case: this is an incremental release

Opus 4.8 is not a generational step. The benchmark improvements are real but targeted, and the headline reliability figure applies to a specific behavior — self-review of self-written code — not to output quality across the board. Teams whose workloads are not agentic coding loops will see little practical difference from 4.7.

The benchmark numbers also come from Anthropic’s own post. Anthropic’s workload is not your workload. The responsible move, as with any model update, is to run representative tasks from your own pipeline against both versions before treating the upgrade as a free win.

Who should care and what to test

If your team runs agentic coding pipelines — the model writes code, executes it in a loop, reviews and patches — Opus 4.8 is worth a direct comparison against 4.7. The test is straightforward: give both versions a realistic coding task with a known set of self-introduced errors and check whether the model flags them in its own review pass. That is the behavior the release is built around.

If your work leans on legal research, multi-step document reasoning, or web-based agentic tasks, the Legal Agent Benchmark and Online-Mind2Web scores are the relevant signals. Neither benchmark is a guarantee on your data, but both point to where the model has been tuned.

Teams on standard API access should see no pricing change. Fast mode rates also hold. The practical upgrade path is simple: swap the model identifier to claude-opus-4-8, run your existing eval suite, and check whether the reliability improvement shows up in your tasks before committing.

For context on where Opus 4.8 sits against the wider field, see our AI model leaderboard. For Anthropic’s broader product and research direction, see our Anthropic company profile.

Sources

Anthropic: Introducing Claude Opus 4.8