Google published a developer-facing agent quality workflow on June 30 that turns evaluation into something a coding agent can drive. The update centers on a skill that runs a five-stage loop: prepare data, run inference, grade, analyze failures, and optimize.
The useful part is the separation of roles. Google says the optimizer and evaluator stay decoupled: whatever proposes a fix does not grade that fix. The Gemini Enterprise Agent Platform GenAI evaluation service scores the traces independently through AutoRaters or custom metrics.
That is the right lesson for agent builders. Prompt tweaks are easy. Knowing whether a tweak improved an agent without breaking other behavior is the hard part.
Agent quality fails quietly
Traditional software often fails loudly. A test turns red. A type check fails. A service returns an error. Agents can fail with much more polish.
Google’s example is a travel-concierge agent that correctly stores a user’s revised travel dates internally but still repeats the stale dates in the final message. The agent appears to be working: it calls tools, updates state, and writes a plausible answer. The actual user-facing output is wrong.
That is the failure class agent evaluation needs to catch. The model did not crash. The workflow did not obviously stop. A quick read could miss the defect. The right metric has to inspect the trace, the user’s revision, and the final answer together.
Google’s skill does that by letting the developer describe the concern in plain language. In the example, the concern is whether the agent honors mid-conversation changes. The skill chooses built-in multi-turn AutoRaters, adds a custom rubric called revision_honored, synthesizes scenarios with a user simulator, grades the traces, and isolates the failure rate.
The specific numbers are a Google demo, not universal evidence. The method is still valuable because it makes a behavioral concern countable.
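To make "countable" concrete, here is a minimal sketch of a revision check and the failure rate it yields. It is not Google's rubric: in the article's workflow, revision_honored is graded by the evaluation service, while this stand-in uses a plain substring match. The Trace fields and both function names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trace:
    """Hypothetical, simplified record of one multi-turn session."""
    session_id: str
    revised_value: Optional[str]  # latest value the user asked for, None if no revision
    stored_value: Optional[str]   # what the agent saved in its internal state
    final_answer: str             # the message the user actually saw


def revision_honored(trace: Trace) -> Optional[bool]:
    """Stand-in check: does the final answer mention the revised value?

    Returns None when the user never revised anything, so the rubric
    does not apply to that session. Correct internal state alone does
    not count as a pass.
    """
    if trace.revised_value is None:
        return None
    return trace.revised_value in trace.final_answer


def failure_rate(traces: List[Trace]) -> float:
    """Share of applicable sessions where the revision was not honored."""
    verdicts = [v for v in map(revision_honored, traces) if v is not None]
    if not verdicts:
        return 0.0
    return verdicts.count(False) / len(verdicts)
```

The second session in a sample like Trace("s2", "June 12", "June 12", "Your trip starts June 5.") is the quiet failure described above: the stored state is right, and the check still fails because it looks at what the user was told.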
The evaluator should not be the optimizer
The most important design choice is that the same agent should not judge its own fix.
If the optimizer grades itself, it can learn to satisfy the metric superficially or rationalize the change. That is the same problem as asking a model to verify its own answer with no external check. Self-verification may help, but it is not enough when the output controls user-facing behavior.
Google’s loop keeps the scoring service separate. The coding agent can propose a plan, run traces, read verdicts, and make a targeted change, but the evaluation result comes from the GenAI evaluation service. The developer still approves the plan and the fix.
That human-in-the-loop shape is more credible than a fully autonomous “agent improves itself” pitch. The system can do repetitive evaluation work, but the human remains responsible for deciding whether the change is appropriate.
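The separation of roles can be sketched as one eval-fix cycle in which the optimizer, the evaluator, and the human approver are three different callables. This is an illustrative shape under assumed names, not the skill's actual interface: eval_fix_cycle, its parameters, and the Config and Verdict shapes are all hypothetical.

```python
from typing import Callable, Dict, List

Config = Dict[str, str]      # hypothetical agent configuration, e.g. prompt text
Verdict = Dict[str, object]  # hypothetical shape: {"case": str, "passed": bool}


def eval_fix_cycle(
    config: Config,
    cases: List[str],
    run_agent: Callable[[Config, str], str],
    evaluate: Callable[[List[str], List[str]], List[Verdict]],
    propose_fix: Callable[[Config, List[Verdict]], Config],
    approve: Callable[[Config, Config], bool],
) -> Config:
    """Run one cycle and return the config that should be kept."""

    def failures_for(cfg: Config) -> List[Verdict]:
        # Inference and grading are separate steps; `evaluate` never
        # sees who proposed the config it is scoring.
        outputs = [run_agent(cfg, case) for case in cases]
        return [v for v in evaluate(cases, outputs) if not v["passed"]]

    baseline_failures = failures_for(config)
    if not baseline_failures:
        return config  # nothing to fix

    # The optimizer reads verdicts but does not produce them.
    candidate = propose_fix(config, baseline_failures)

    # A human gate: the developer can reject the change outright.
    if not approve(config, candidate):
        return config

    # The candidate is re-graded by the same independent evaluator.
    candidate_failures = failures_for(candidate)
    if len(candidate_failures) < len(baseline_failures):
        return candidate
    return config
```

The design point is that propose_fix only receives verdicts as input. It has no way to mark its own change as passing, so a fix is kept only when the independent evaluator reports fewer failures.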
Production traces are the next test set
The skill can start with synthetic cases, but Google is explicit that synthetic scenarios are a cold-start tool. The sharper loop comes from production traffic.
As an agent serves real users, each failure becomes a ready-made test case. If the agent emits OpenTelemetry traces, those sessions can be graded later. Google says online monitors can evaluate live traffic and write quality scores to Cloud Monitoring, and the same skill can then use failing traces for the next eval-fix cycle.
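A rough sketch of that hand-off, assuming graded sessions have already been exported as plain dictionaries. The session shape, the 0.5 threshold, and the names EvalCase and failing_sessions_to_cases are hypothetical; this does not reproduce the OpenTelemetry schema or any Cloud Monitoring API.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class EvalCase:
    """Hypothetical regression case derived from one failing session."""
    case_id: str
    user_turns: Tuple[str, ...]
    failed_metric: str


def failing_sessions_to_cases(
    sessions: List[Dict], threshold: float = 0.5
) -> List[EvalCase]:
    """Turn graded sessions into replayable cases, one per failing metric.

    Each session is assumed to look like:
      {"session_id": str, "user_turns": [str, ...], "scores": {metric: float}}
    Identical conversations that fail the same metric are kept only once.
    """
    cases: List[EvalCase] = []
    seen = set()
    for session in sessions:
        turns = tuple(session["user_turns"])
        for metric, score in session["scores"].items():
            if score >= threshold:
                continue  # this metric passed for this session
            key = (turns, metric)
            if key in seen:
                continue  # duplicate failure, already captured
            seen.add(key)
            case_id = f'{session["session_id"]}:{metric}'
            cases.append(EvalCase(case_id, turns, metric))
    return cases
```

Keeping the failed metric on each case means the next cycle can re-run the conversation and check the specific behavior that broke, rather than only an overall score.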
That matters because agent behavior changes with real context. Synthetic tests can cover obvious paths, but production sessions reveal where users revise goals, skip steps, provide partial information, or ask for work the team did not anticipate.
The practical implication is that agent teams need telemetry before they need more prompt tricks. Without traces, there is little to inspect. Without stable metrics, there is little to compare. Without a baseline, every fix is a guess.
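As a minimal illustration of why the baseline matters, the sketch below compares per-metric scores before and after a change and flags regressions alongside improvements. The function name, the metric names in the usage note, and the 0.02 tolerance are assumptions chosen for the example, not values from Google's workflow.

```python
from typing import Dict, List


def compare_to_baseline(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.02,
) -> Dict[str, List[str]]:
    """Classify each metric as improved, regressed, or unchanged.

    Scores are assumed to be "higher is better". Differences smaller
    than `tolerance` are treated as noise, not as a real change.
    """
    if set(baseline) != set(candidate):
        raise ValueError("baseline and candidate must report the same metrics")

    report: Dict[str, List[str]] = {"improved": [], "regressed": [], "unchanged": []}
    for metric in sorted(baseline):
        delta = candidate[metric] - baseline[metric]
        if delta > tolerance:
            report["improved"].append(metric)
        elif delta < -tolerance:
            report["regressed"].append(metric)
        else:
            report["unchanged"].append(metric)
    return report
```

For example, a fix that lifts revision_honored from 0.60 to 0.85 while tool_use drops from 0.90 to 0.80 shows up as one improvement and one regression. Without the stored baseline, only the first half of that result would be visible.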