NVIDIA says Blackwell leads the first AgentPerf benchmark

NVIDIA says its GB300 NVL72 platform leads the first published AgentPerf results, running up to 20x more agents per megawatt than the prior-generation H200 system on an agentic coding workload. The June 12, 2026 post ties the result to AgentPerf, a new Artificial Analysis benchmark built for multi-step agent workloads rather than single chat completions.

The important change is the unit of measurement. Agentic systems need more than one fast answer: they chain model calls, tool calls, code edits, command results, and growing context. NVIDIA’s pitch is that infrastructure buyers should evaluate how many active agents a system can keep responsive per GPU and per megawatt, not only how quickly it returns a single response.

20x: NVIDIA-claimed gain in agents per megawatt (source: NVIDIA Blog)

61.4K: GB300 NVL72 concurrent agents per MW in the SLO=30 table (source: NVIDIA Technical Blog)

2.6K: H200 concurrent agents per MW in the SLO=30 table (source: NVIDIA Technical Blog)

Agent benchmarks are getting closer to the workload

Artificial Analysis describes AA-AgentPerf as a hardware benchmark that measures how many active agents an inference deployment can support under realistic agentic workloads while meeting per-agent performance targets for time to first token and output speed. The methodology page says the benchmark uses real agentic trajectories, sustained concurrent load, market-derived service-level objective tiers, and production-scale deployment assumptions.
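To make the per-agent target idea concrete, here is a minimal Python sketch that checks which agents in a load test meet both a time-to-first-token ceiling and an output-speed floor. The `AgentSample` type, the threshold values, and the sample numbers are all assumptions made for illustration; they are not AA-AgentPerf's actual SLO tiers or its scoring rules.

```python
from dataclasses import dataclass


@dataclass
class AgentSample:
    """Observed serving metrics for one agent in a load test (hypothetical type)."""
    ttft_s: float            # time to first token, in seconds
    output_tok_per_s: float  # output speed, in tokens per second


def meets_slo(sample, max_ttft_s, min_tok_per_s):
    """An agent passes only if it meets BOTH per-agent targets."""
    return sample.ttft_s <= max_ttft_s and sample.output_tok_per_s >= min_tok_per_s


def fraction_meeting_slo(samples, max_ttft_s, min_tok_per_s):
    """Share of agents that stayed within the assumed targets."""
    if not samples:
        return 0.0
    passed = sum(meets_slo(s, max_ttft_s, min_tok_per_s) for s in samples)
    return passed / len(samples)


# Made-up measurements, not benchmark data.
samples = [
    AgentSample(ttft_s=1.2, output_tok_per_s=42.0),  # passes
    AgentSample(ttft_s=0.8, output_tok_per_s=35.0),  # passes
    AgentSample(ttft_s=4.5, output_tok_per_s=31.0),  # first token too slow
    AgentSample(ttft_s=1.0, output_tok_per_s=18.0),  # output too slow
]

# Assumed thresholds: 2 s to first token, 30 tokens/s output speed.
share = fraction_meeting_slo(samples, max_ttft_s=2.0, min_tok_per_s=30.0)
print(f"{share:.0%} of agents met the assumed SLO")
```

A concurrency benchmark of this kind would, roughly speaking, raise the number of simultaneous agents until too many of them fall outside the targets. The exact pass criterion is defined by the benchmark's methodology, not by this sketch.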

That is a different question from the usual leaderboard prompt. A coding agent session may read files, reason, call tools, edit code, run commands, and then repeat. The Artificial Analysis methodology says its dataset contains trajectories covering several use cases, programming languages, and models, generated in an agentic harness from real public code repositories. Input sequence lengths range from roughly 5,000 to 131,000 tokens, with an average around 27,000 tokens.

For buyers, that matters because agent systems are capacity systems. The pain is not only whether one agent feels fast in a demo. It is whether hundreds or thousands of agent sessions remain usable when they are all carrying long context and waiting on interleaved actions.

NVIDIA is framing power as the capacity metric

NVIDIA’s public blog says the first AgentPerf round uses DeepSeek V4 Pro, a large mixture-of-experts model, and that GB300 NVL72 delivers the highest performance in the benchmark. The technical post gives the most useful planning numbers: under the SLO=30 configuration, NVIDIA lists GB300 NVL72 at 61.4K concurrent agents per megawatt and H200 at 2.6K. It also lists concurrent agents per GPU at 57.5 for GB300 NVL72 and 1.4 for H200.
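The ratios implied by those four figures can be worked out directly. The short Python sketch below does the arithmetic on the SLO=30 numbers quoted above. The published values are rounded, so the results are approximate. The "implied power per GPU" is a back-calculation that assumes both figures in each row describe the same deployment and that the megawatt figure covers the whole system; the text quoted here does not confirm either assumption.

```python
# SLO=30 figures as quoted from NVIDIA's technical post (rounded by the source).
gb300 = {"agents_per_mw": 61_400, "agents_per_gpu": 57.5}
h200 = {"agents_per_mw": 2_600, "agents_per_gpu": 1.4}

# How much more concurrency GB300 NVL72 shows on each normalization.
per_mw_ratio = gb300["agents_per_mw"] / h200["agents_per_mw"]
per_gpu_ratio = gb300["agents_per_gpu"] / h200["agents_per_gpu"]


def implied_kw_per_gpu(row):
    """Back out all-in kW per GPU from the two published figures (an assumption)."""
    gpus_per_mw = row["agents_per_mw"] / row["agents_per_gpu"]
    return 1000 / gpus_per_mw


print(f"agents per MW ratio:  {per_mw_ratio:.1f}x")
print(f"agents per GPU ratio: {per_gpu_ratio:.1f}x")
print(f"implied kW per GPU, GB300 NVL72: {implied_kw_per_gpu(gb300):.2f}")
print(f"implied kW per GPU, H200:        {implied_kw_per_gpu(h200):.2f}")
```

On these rounded inputs the per-megawatt ratio comes out near 23.6x and the per-GPU ratio near 41x. The gap between the two reflects a higher implied power draw per GPU for the newer system. The per-megawatt ratio also differs from the headline "up to 20x" figure; the material quoted here does not say which configuration the headline number refers to.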

Those are NVIDIA-published results, so they should be treated as vendor evidence, not as independent purchasing advice. The stronger point is that the metric is changing. If agents become a major inference workload, power-normalized concurrency may become a core infrastructure question. A rack that can serve more useful agent sessions per megawatt has a direct effect on data-center planning, queue times, and the unit economics of agent products.

NVIDIA attributes the result to full-stack design: GB300 NVL72 links 72 GPUs into one rack-scale system; software optimizations distribute mixture-of-experts execution; and TensorRT LLM separates input processing from output generation so each side can be optimized as concurrency rises.

What this means for agent builders

The practical read is not “buy the biggest rack.” It is that agent products should start measuring their own serving demand in agent-native terms. A team shipping coding agents, research agents, sales agents, or support agents should know how many model calls an average task consumes, how context grows over the session, which steps are latency-sensitive, and what output-speed floor keeps the product useful.
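One way to start collecting those numbers is to log every model call in a session and summarize the trace. The Python sketch below assumes a hypothetical `ModelCall` record and a made-up four-call session; the field names are illustrative and not taken from any particular serving framework.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ModelCall:
    """One model call inside an agent session (hypothetical log record)."""
    input_tokens: int        # context size sent with this call
    output_tokens: int       # tokens generated by this call
    latency_sensitive: bool  # True if a user or tool is blocked on the reply


def summarize_session(calls):
    """Reduce a session trace to the agent-native numbers discussed above."""
    if not calls:
        raise ValueError("session has no model calls")
    return {
        "model_calls": len(calls),
        "peak_context": max(c.input_tokens for c in calls),
        # Growth from the first call's context to the last call's context.
        "context_growth": calls[-1].input_tokens - calls[0].input_tokens,
        "mean_output_tokens": mean(c.output_tokens for c in calls),
        "latency_sensitive_calls": sum(c.latency_sensitive for c in calls),
    }


# Made-up trace for one coding-agent task.
session = [
    ModelCall(input_tokens=6_000, output_tokens=300, latency_sensitive=True),
    ModelCall(input_tokens=9_500, output_tokens=450, latency_sensitive=False),
    ModelCall(input_tokens=14_000, output_tokens=250, latency_sensitive=False),
    ModelCall(input_tokens=21_000, output_tokens=600, latency_sensitive=True),
]

summary = summarize_session(session)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Averaging these summaries across many real sessions gives a team its own workload profile to set against a benchmark's published trajectory statistics.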

Once those numbers exist, AgentPerf-style metrics become easier to interpret. The right question is not only “which accelerator is fastest?” It is “how many complete agent sessions can we support at our required service level, for our power budget, with our model and tool stack?”
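That question can be turned into a rough planning calculation. The Python sketch below scales a published agents-per-megawatt figure by a power budget and estimates how many GPUs a target session count would need. The `derate` factor is a hypothetical knob for the gap between a benchmark workload and a team's own model and tool stack. The inputs are the vendor-published SLO=30 figures cited in this article, and the 0.25 MW budget and 0.5 derate are arbitrary example values.

```python
import math


def supported_sessions(agents_per_mw, power_budget_mw, derate=1.0):
    """Estimate concurrent sessions for a power budget.

    derate (0 < derate <= 1) is an assumed discount for how far your own
    workload departs from the benchmark; it is not a published quantity.
    """
    if not 0 < derate <= 1:
        raise ValueError("derate must be in (0, 1]")
    return math.floor(agents_per_mw * power_budget_mw * derate)


def gpus_needed(target_sessions, agents_per_gpu):
    """GPUs required to hold a target number of concurrent sessions."""
    return math.ceil(target_sessions / agents_per_gpu)


# Example: a 0.25 MW budget, discounting the benchmark figure by half.
gb300_sessions = supported_sessions(61_400, power_budget_mw=0.25, derate=0.5)
h200_sessions = supported_sessions(2_600, power_budget_mw=0.25, derate=0.5)

# Example: GPUs needed for 1,000 concurrent sessions at the published rates.
gb300_gpus = gpus_needed(1_000, agents_per_gpu=57.5)
h200_gpus = gpus_needed(1_000, agents_per_gpu=1.4)

print(f"GB300 NVL72: {gb300_sessions} sessions, {gb300_gpus} GPUs for 1,000 sessions")
print(f"H200:        {h200_sessions} sessions, {h200_gpus} GPUs for 1,000 sessions")
```

The output is only as good as the inputs: a different model, service level, or tool mix would change the per-megawatt and per-GPU figures, which is why the derate factor should come from a team's own measurements and not from guesswork.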

That distinction also helps teams avoid benchmark overreach. AgentPerf is valuable because it tests a more realistic shape than a single prompt. It still cannot represent every product. A browser-use agent, a coding agent, a claims-processing agent, and a customer-support agent may all stress the system differently. Tool latency, retrieval, permissions checks, and human approval steps can dominate the end-to-end experience even when model serving is fast.

The benchmark market will get messier

NVIDIA benefits from a benchmark that rewards full-stack hardware and software integration. Cloud providers and model-serving companies will benefit from results that show better capacity per watt, better price per completed task, or better performance under a specific model mix. The useful outcome for readers is not a single permanent winner. It is a vocabulary for comparing agent infrastructure without pretending agent workloads are ordinary chat completions.

The next checkpoint is whether Artificial Analysis publishes broader cross-vendor hardware results and whether providers expose enough configuration detail for buyers to map benchmark numbers to production deployments. Agentic AI will make inference less uniform. Benchmarks that capture concurrency, long context, service levels, and power will matter more than benchmark charts built for one clean response at a time.
