A coding agent workflow loops through traces, grading, failure clusters, and approved fixes

Google gives coding agents an eval flywheel instead of another prompt tweak

Google's new quality-flywheel skill lets coding agents run structured agent evaluations with independent grading and production-trace loops.

about 2 hours ago

Google published a developer-facing agent quality workflow on June 30 that turns evaluation into something a coding agent can drive. The update centers on a skill that runs a five-stage loop: prepare data, run inference, grade, analyze failures, and optimize.
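To make the five stages concrete, here is a minimal Python sketch of such a loop. It is not Google's implementation: the function names, the `LoopResult` type, and the toy stand-ins are all hypothetical, and a real setup would call an agent and an external grading service where the toy functions sit.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoopResult:
    scores: list
    failures: list
    proposal: str


def run_quality_loop(
    prepare: Callable[[], list],
    infer: Callable[[dict], dict],
    grade: Callable[[dict], float],
    analyze: Callable[[list, list], list],
    optimize: Callable[[list], str],
) -> LoopResult:
    cases = prepare()                             # 1. prepare data
    traces = [infer(case) for case in cases]      # 2. run inference
    scores = [grade(trace) for trace in traces]   # 3. grade
    failures = analyze(traces, scores)            # 4. analyze failures
    proposal = optimize(failures)                 # 5. optimize (proposes only)
    return LoopResult(scores, failures, proposal)


# Hypothetical toy stand-ins so the sketch runs end to end.
def toy_prepare():
    return [{"id": 1, "expected": "July 12"}, {"id": 2, "expected": "July 15"}]


def toy_infer(case):
    # Pretend the agent always answers with the same date, stale for case 2.
    return {"case": case, "final_answer": "Your trip starts July 12."}


def toy_grade(trace):
    return 1.0 if trace["case"]["expected"] in trace["final_answer"] else 0.0


def toy_analyze(traces, scores):
    return [t for t, s in zip(traces, scores) if s < 1.0]


def toy_optimize(failures):
    return f"review {len(failures)} failing trace(s) before changing the prompt"


result = run_quality_loop(toy_prepare, toy_infer, toy_grade, toy_analyze, toy_optimize)
print(result.scores, result.proposal)
```

Passing each stage in as a separate callable keeps the grader swappable, so the component that proposes a change never has to be the one that scores it.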

The useful part is the separation of roles. Google says the optimizer and evaluator stay decoupled: whatever proposes a fix does not grade that fix. The Gemini Enterprise Agent Platform GenAI evaluation service scores the traces independently through AutoRaters or custom metrics.

That is the right lesson for agent builders. Prompt tweaks are easy. Knowing whether a tweak improved an agent without breaking other behavior is the hard part.

Agent quality fails quietly

Traditional software often fails loudly. A test turns red. A type check fails. A service returns an error. Agents can fail with much more polish.

Google’s example is a travel-concierge agent that internally stores a user’s revised travel dates correctly but still gives the stale date back in the final message. The agent appears to be working: it calls tools, updates state, and writes a plausible answer. The actual user-facing output is wrong.

That is the failure class agent evaluation needs to catch. The model did not crash. The workflow did not obviously stop. A quick read could miss the defect. The right metric has to inspect the trace, the user’s revision, and the final answer together.

Google’s skill does that by letting the developer describe the concern in plain language. In the example, the concern is whether the agent honors mid-conversation changes. The skill chooses built-in multi-turn AutoRaters, adds a custom rubric called revision_honored, synthesizes scenarios with a user simulator, grades the traces, and isolates the failure rate.
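A rough illustration of why the check has to look at the final answer rather than the stored state. In Google's description, revision_honored is a rubric scored by the evaluation service; the sketch below swaps in a plain string check as a simplification, and the trace fields (`original_value`, `revised_value`, `stored_state`, `final_answer`) are assumed for illustration only.

```python
def state_updated(trace: dict) -> bool:
    # Checks only internal state; this misses the stale-answer failure.
    return trace["stored_state"] == trace["revised_value"]


def revision_honored(trace: dict) -> bool:
    # Checks the user-facing answer against the user's revision.
    final = trace["final_answer"]
    return trace["revised_value"] in final and trace["original_value"] not in final


def failure_rate(traces: list, check) -> float:
    if not traces:
        return 0.0
    failed = sum(1 for t in traces if not check(t))
    return failed / len(traces)


# Hypothetical traces: state is always correct, but two answers are stale.
traces = [
    {"original_value": "July 12", "revised_value": "July 15",
     "stored_state": "July 15", "final_answer": "Your flight departs July 12."},
    {"original_value": "July 12", "revised_value": "July 15",
     "stored_state": "July 15", "final_answer": "Your flight departs July 15."},
    {"original_value": "May 3", "revised_value": "May 9",
     "stored_state": "May 9", "final_answer": "Booked for May 9."},
    {"original_value": "May 3", "revised_value": "May 9",
     "stored_state": "May 9", "final_answer": "Booked for May 3."},
]

state_failures = failure_rate(traces, state_updated)
answer_failures = failure_rate(traces, revision_honored)
print(state_failures, answer_failures)
```

On these made-up traces the state-only check reports no failures, while the answer-level check flags half of them. That gap is the quiet failure the article describes.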

The failure rate Google reports comes from its own demo and is not universal evidence. The method is still valuable because it makes a behavioral concern countable.

The evaluator should not be the optimizer

The most important design choice is that the same agent should not judge its own fix.

If the optimizer grades itself, it can learn to satisfy the metric superficially or rationalize the change. That is the same problem as asking a model to verify its own answer with no external check. Self-verification may help, but it is not enough when the output controls user-facing behavior.

Google’s loop keeps the scoring service separate. The coding agent can propose a plan, run traces, read verdicts, and make a targeted change, but the evaluation result comes from the GenAI evaluation service. The developer still approves the plan and the fix.
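One way to picture that separation is a small Python sketch in which the accept decision ignores whatever the optimizer says about its own fix. The `Proposal` type, the `decide` function, and the scores are hypothetical; in Google's loop the independent score would come from the evaluation service and the approval from the developer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    description: str
    self_reported_score: float  # the optimizer's own opinion, deliberately unused


def decide(
    proposal: Proposal,
    independent_score: float,
    baseline_score: float,
    human_approved: bool,
) -> str:
    # The decision reads only the external score and the human sign-off.
    if not human_approved:
        return "rejected: not approved"
    if independent_score <= baseline_score:
        return "rejected: no measured improvement"
    return "accepted"


confident_but_worse = Proposal("rewrite the date-handling prompt", 0.99)
outcome = decide(confident_but_worse, independent_score=0.70,
                 baseline_score=0.80, human_approved=True)
print(outcome)
```

The optimizer's 0.99 never enters the decision. Only the independently measured score and the developer's approval do.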

That human-in-the-loop shape is more credible than a fully autonomous “agent improves itself” pitch. The system can do repetitive evaluation work, but the human remains responsible for deciding whether the change is appropriate.

Production traces are the next test set

The skill can start with synthetic cases, but Google is explicit that synthetic scenarios are a cold-start tool. The sharper loop comes from production traffic.

As an agent serves real users, each failure becomes a ready-made test case. If the agent emits OpenTelemetry traces, those sessions can be graded later. Google says online monitors can evaluate live traffic and write quality scores to Cloud Monitoring, then the same skill can use failing traces for the next eval-fix cycle.
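A sketch of how graded production sessions might be turned into the next test set. It assumes the sessions have already been exported and scored, and it uses plain dictionaries rather than any OpenTelemetry or Cloud Monitoring API; the field names and the threshold are illustrative assumptions.

```python
def failing_sessions_to_cases(sessions: list, threshold: float = 0.5) -> list:
    """Keep low-scoring sessions as replayable cases, dropping exact repeats."""
    cases = []
    seen = set()
    for session in sessions:
        if session["score"] >= threshold:
            continue  # passing sessions are not added to the failure set
        key = tuple(session["user_turns"])
        if key in seen:
            continue  # the same conversation already produced a case
        seen.add(key)
        cases.append({
            "source_session": session["session_id"],
            "user_turns": list(session["user_turns"]),
        })
    return cases


# Hypothetical graded sessions.
sessions = [
    {"session_id": "s1", "score": 0.9,
     "user_turns": ["Book a flight on July 12"]},
    {"session_id": "s2", "score": 0.2,
     "user_turns": ["Book a flight on July 12", "Actually, make it July 15"]},
    {"session_id": "s3", "score": 0.1,
     "user_turns": ["Book a flight on July 12", "Actually, make it July 15"]},
    {"session_id": "s4", "score": 0.3,
     "user_turns": ["Find a hotel", "Skip breakfast"]},
]

next_eval_set = failing_sessions_to_cases(sessions)
print([case["source_session"] for case in next_eval_set])
```

Deduplicating on the user turns keeps one common failure from dominating the next evaluation run.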

That matters because agent behavior changes with real context. Synthetic tests can cover obvious paths, but production sessions reveal where users revise goals, skip steps, provide partial information, or ask for work the team did not anticipate.

The practical implication is that agent teams need telemetry before they need more prompt tricks. Without traces, there is little to inspect. Without stable metrics, there is little to compare. Without a baseline, every fix is a guess.
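The baseline point can be sketched as a per-metric comparison that reports regressions alongside improvements. Apart from revision_honored, the metric names, scores, and tolerance below are made up for illustration.

```python
def compare_to_baseline(baseline: dict, candidate: dict, tolerance: float = 0.01) -> dict:
    """Sort each metric into improved, regressed, or unchanged."""
    report = {"improved": [], "regressed": [], "unchanged": []}
    for metric in sorted(baseline):
        delta = candidate[metric] - baseline[metric]
        if delta > tolerance:
            report["improved"].append(metric)
        elif delta < -tolerance:
            report["regressed"].append(metric)
        else:
            report["unchanged"].append(metric)
    return report


# Hypothetical scores before and after a fix.
baseline = {"revision_honored": 0.60, "task_completion": 0.75, "tool_use": 0.90}
candidate = {"revision_honored": 0.85, "task_completion": 0.75, "tool_use": 0.80}

report = compare_to_baseline(baseline, candidate)
print(report)
```

In this example the fix raises the targeted metric and lowers another, which is the kind of trade-off that a single headline score would hide.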

The AI Feed Desk

Editorial desk

The AI Feed Desk tracks AI provider updates, model releases, agent tooling, and enterprise adoption, turning fast-moving announcements into source-linked context for builders and operators.
