Google gives coding agents an eval flywheel instead of another prompt tweak

Google published a developer-facing agent quality workflow on June 30 that turns evaluation into something a coding agent can drive. The update centers on a skill that runs a five-stage loop: prepare data, run inference, grade, analyze failures, and optimize.

The useful part is the separation of roles. Google says the optimizer and evaluator stay decoupled: whatever proposes a fix does not grade that fix. The Gemini Enterprise Agent Platform GenAI evaluation service scores the traces independently through AutoRaters or custom metrics.

That is the right lesson for agent builders. Prompt tweaks are easy. Knowing whether a tweak improved an agent without breaking other behavior is the hard part.

Agent quality fails quietly

Traditional software often fails loudly. A test turns red. A type check fails. A service returns an error. Agents can fail with much more polish.

Google’s example is a travel-concierge agent that internally stores a user’s revised travel dates correctly but still gives the stale date back in the final message. The agent appears to be working: it calls tools, updates state, and writes a plausible answer. The actual user-facing output is wrong.

That is the failure class agent evaluation needs to catch. The model did not crash. The workflow did not obviously stop. A quick read could miss the defect. The right metric has to inspect the trace, the user’s revision, and the final answer together.

Google’s skill does that by letting the developer describe the concern in plain language. In the example, the concern is whether the agent honors mid-conversation changes. The skill chooses built-in multi-turn AutoRaters, adds a custom rubric called revision_honored, synthesizes scenarios with a user simulator, grades the traces, and isolates the failure rate.
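To make the idea concrete, here is a minimal sketch of what a check like revision_honored has to look at. It is a deterministic stand-in and not Google's rubric: the Trace fields, the substring matching, and the sample data are all assumptions made for illustration. The point is only that the check reads the final message, not the stored state.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trace:
    """Hypothetical minimal trace; not the evaluation service's schema."""
    revised_value: str   # what the user changed the request to mid-conversation
    stale_value: str     # the original value the user replaced
    stored_value: str    # what the agent wrote to its internal state
    final_message: str   # what the user actually sees


def revision_honored(trace: Trace) -> bool:
    """Pass only if the user-facing answer reflects the revision."""
    mentions_revised = trace.revised_value in trace.final_message
    mentions_stale = trace.stale_value in trace.final_message
    return mentions_revised and not mentions_stale


def failure_rate(traces: List[Trace]) -> float:
    """Share of traces where the revision did not reach the final answer."""
    if not traces:
        return 0.0
    failures = sum(1 for t in traces if not revision_honored(t))
    return failures / len(traces)


# Illustrative data: the second trace stores the right date but reports the old one.
traces = [
    Trace("July 12", "July 10", "July 12", "Your trip is booked for July 12."),
    Trace("May 3", "May 1", "May 3", "Your trip is booked for May 1."),
]
```

In the second trace the stored state is correct, so a check that only inspected state would pass it. Counting failures over a set of traces is what turns the concern into a rate.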

The specific numbers come from a Google demo and are not universal evidence. The method is still valuable because it turns a behavioral concern into something countable.

The evaluator should not be the optimizer

The most important design choice is that the same agent should not judge its own fix.

If the optimizer grades itself, it can learn to satisfy the metric superficially or rationalize the change. That is the same problem as asking a model to verify its own answer with no external check. It may help, but it is not enough when the output controls user-facing behavior.

Google’s loop keeps the scoring service separate. The coding agent can propose a plan, run traces, read verdicts, and make a targeted change, but the evaluation result comes from the GenAI evaluation service. The developer still approves the plan and the fix.
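The separation can be sketched in a few lines. This is an assumption-laden illustration, not the skill's implementation: eval_fix_cycle, pass_rate, and the callables passed in are hypothetical names, and the evaluator here is a plain function standing in for an external scoring service. What matters is that the function proposing the fix never computes its own score, and that a human approval step gates the change.

```python
from typing import Callable, List, Tuple

Agent = Callable[[str], str]              # prompt -> final answer
Evaluator = Callable[[str, str], bool]    # (prompt, answer) -> pass/fail


def pass_rate(agent: Agent, cases: List[str], evaluator: Evaluator) -> float:
    """Score an agent using only the external evaluator."""
    if not cases:
        return 0.0
    results = [evaluator(case, agent(case)) for case in cases]
    return sum(results) / len(results)


def eval_fix_cycle(
    agent: Agent,
    propose_fix: Callable[[Agent], Agent],     # the optimizer: proposes, never grades
    evaluator: Evaluator,                      # the grader: scores, never proposes
    cases: List[str],
    approve: Callable[[float, float], bool],   # the human decision
) -> Tuple[Agent, float, float]:
    baseline = pass_rate(agent, cases, evaluator)
    candidate = propose_fix(agent)
    score = pass_rate(candidate, cases, evaluator)
    if score > baseline and approve(baseline, score):
        return candidate, baseline, score
    return agent, baseline, score


# Toy stand-ins for illustration only.
cases = ["Move my trip from May 1 to May 3"]
evaluator = lambda prompt, answer: "May 3" in answer
stale_agent = lambda prompt: "Booked for May 1."
propose_fix = lambda agent: (lambda prompt: "Booked for May 3.")
approve = lambda before, after: True

chosen, baseline, score = eval_fix_cycle(
    stale_agent, propose_fix, evaluator, cases, approve
)
```

If approve returns False, or the candidate does not beat the baseline, the original agent is returned unchanged.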

That human-in-the-loop shape is more credible than a fully autonomous “agent improves itself” pitch. The system can do repetitive evaluation work, but the human remains responsible for deciding whether the change is appropriate.

Production traces are the next test set

The skill can start with synthetic cases, but Google is explicit that synthetic scenarios are a cold-start tool. The sharper loop comes from production traffic.

As an agent serves real users, each failure becomes a ready-made test case. If the agent emits OpenTelemetry traces, those sessions can be graded later. Google says online monitors can evaluate live traffic and write quality scores to Cloud Monitoring, then the same skill can use failing traces for the next eval-fix cycle.
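A rough sketch of that hand-off, assuming graded sessions have already been exported as plain dictionaries. The field names (session_id, score, user_turns), the threshold, and the function name are hypothetical; they are not the OpenTelemetry schema or any Google API.

```python
from typing import Dict, List


def failing_sessions_to_cases(scored_sessions: List[Dict], threshold: float) -> List[Dict]:
    """Turn low-scoring production sessions into replayable test cases."""
    cases: Dict[str, Dict] = {}
    for session in scored_sessions:
        if session["score"] < threshold:
            # Keep one case per session so duplicate exports do not inflate the set.
            cases.setdefault(
                session["session_id"],
                {
                    "session_id": session["session_id"],
                    "user_turns": list(session["user_turns"]),
                    "source": "production",
                },
            )
    return list(cases.values())


# Illustrative export: one healthy session, one failing session seen twice.
scored = [
    {"session_id": "a1", "score": 0.9, "user_turns": ["Book July 12"]},
    {"session_id": "b2", "score": 0.2, "user_turns": ["Book May 1", "Actually, make it May 3"]},
    {"session_id": "b2", "score": 0.2, "user_turns": ["Book May 1", "Actually, make it May 3"]},
]
regression_cases = failing_sessions_to_cases(scored, threshold=0.5)
```

Only the user turns are kept, because the next cycle needs to replay the inputs against a changed agent, not reuse the old outputs.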

That matters because agent behavior changes with real context. Synthetic tests can cover obvious paths, but production sessions reveal where users revise goals, skip steps, provide partial information, or ask for work the team did not anticipate.

The practical implication is that agent teams need telemetry before they need more prompt tricks. Without traces, there is little to inspect. Without stable metrics, there is little to compare. Without a baseline, every fix is a guess.

The useful test is one behavior at a time

Google lists two packages: google-agents-cli-eval for teams building ADK agents with agents-cli, and agent-platform-eval-flywheel for teams working directly with the Evaluation SDK. Both depend on the same evaluation service.

That means the skill is not a standalone proof that an agent is good. It is orchestration around a testing discipline. Teams still need representative traces, meaningful rubrics, thresholds that match product risk, and review habits that prevent overfitting to the latest failure.

The best use is narrow. Pick one behavior that matters: honoring user revisions, preserving tool evidence, refusing unsafe actions, using the right data source, or ending with the correct handoff. Make that behavior measurable. Run before and after. Keep the broad AutoRater score as a health signal, but track the specific behavior you changed.
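One way to express that before-and-after discipline is a small comparison that tracks the targeted behavior and the broad score together. The metric names, the scores, and the tolerance below are illustrative assumptions, not figures from Google's demo.

```python
from typing import Dict


def compare_runs(
    before: Dict[str, float],
    after: Dict[str, float],
    target: str,
    health: str,
    max_health_drop: float = 0.02,
) -> Dict[str, object]:
    """Accept a change only if the target improves and the health signal holds."""
    target_delta = after[target] - before[target]
    health_delta = after[health] - before[health]
    return {
        "target_delta": round(target_delta, 4),
        "health_delta": round(health_delta, 4),
        "accept": target_delta > 0 and health_delta >= -max_health_drop,
    }


# Illustrative scores only.
before = {"revision_honored": 0.60, "overall_quality": 0.82}
after_ok = {"revision_honored": 0.90, "overall_quality": 0.81}
after_regressed = {"revision_honored": 0.90, "overall_quality": 0.70}

ok = compare_runs(before, after_ok, "revision_honored", "overall_quality")
regressed = compare_runs(before, after_regressed, "revision_honored", "overall_quality")
```

The second comparison improves the targeted behavior by the same amount but is rejected because the broad score drops past the tolerance. The tolerance itself should reflect product risk.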

That is more useful than asking whether an agent is “better.” Better at what, under which traces, with what trade-off, and at what cost?

Google’s update is worth watching because it frames agent quality as an engineering loop. The next stage of agent adoption will not be won by the team with the cleverest prompt. It will be won by teams that can prove their agents improve without hiding new failures behind fluent output.
