TraceLab, a University of Washington project, released a public dataset for studying coding agents as real serving workloads. The project site says the public pool contains 4,265 sessions, 357,161 agent steps, 432,510 tool calls, 54.9 billion input tokens, 52.6 billion cached-read tokens, and 186.9 million output tokens across Claude Code and Codex traces.
That makes it a different kind of coding-agent artifact from a benchmark. A benchmark asks whether an agent solved a task. TraceLab asks what agent workloads look like when people actually use them: how long sessions run, how many tools are called, how much context is reused, how latency behaves, and where serving systems spend work.
The arXiv paper, published June 29, gives rounded figures: roughly 4,300 coding-agent sessions with about 350,000 LLM steps and 430,000 tool calls. The public site gives the more precise pool counts.
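As a rough illustration, the site's pool counts can be turned into per-session averages. This is a minimal sketch that uses only the three counts quoted above; the variable names are made up for the example. Averages say nothing about the distribution, which the paper describes as heavy-tailed for tool calls.

```python
# Public-pool counts as quoted from the TraceLab project site.
SESSIONS = 4_265
AGENT_STEPS = 357_161
TOOL_CALLS = 432_510

# Simple means; these hide any heavy tail in the underlying sessions.
steps_per_session = AGENT_STEPS / SESSIONS
tool_calls_per_session = TOOL_CALLS / SESSIONS
tool_calls_per_step = TOOL_CALLS / AGENT_STEPS

print(f"steps per session:      {steps_per_session:.1f}")
print(f"tool calls per session: {tool_calls_per_session:.1f}")
print(f"tool calls per step:    {tool_calls_per_step:.2f}")
```

On these figures the mean session runs to roughly 84 agent steps and about 101 tool calls.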
Coding agents are weird serving workloads
Most LLM serving optimization was shaped around request-response traffic. A user sends a prompt, the model returns an answer, and the system tries to reduce latency and cost for that exchange.
Coding agents behave differently. They run loops. They read files, edit files, run tools, inspect results, and continue. They can accumulate long contexts while producing relatively short outputs. They can pause around human timing. They can trigger many small tool calls instead of a single clean response.
TraceLab’s paper says coding-agent workloads feature long autonomous loops, long contexts with short outputs, diverse and heavy-tailed tool calls, and high but imperfect prefix-cache hit rates. Those details matter because each one points to a different systems problem.
Long contexts stress prefill. Short outputs change the balance between prefill and decode. Tool calls add overhead and create latency that is not only model time. Cache behavior becomes central because agents repeatedly reuse large parts of prior context while appending new observations.
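A toy model makes the prefill and decode imbalance concrete. The sketch below is an illustration and is not taken from TraceLab: it assumes an append-only session, an idealized prefix cache that never evicts, and invented token counts. The function name and parameters are hypothetical.

```python
def simulate_append_only_session(system_prompt_tokens, steps):
    """Tally tokens for an append-only agent loop with an ideal prefix cache.

    steps is a list of (appended_input_tokens, output_tokens) per agent step.
    Returns (total_input, total_cached_reads, total_output).
    """
    context = system_prompt_tokens  # tokens carried into the next step
    cached_prefix = 0               # tokens the ideal cache already holds
    total_input = total_cached = total_output = 0

    for appended, output in steps:
        prompt = context + appended      # full prompt the model sees this step
        total_input += prompt
        total_cached += cached_prefix    # everything before the append is reused
        total_output += output
        context = prompt + output        # the reply joins the context
        cached_prefix = context          # assumption: nothing is ever evicted

    return total_input, total_cached, total_output


# Invented workload: 10,000-token system prompt, then 50 steps that each
# append 500 tokens of tool output and generate 100 tokens.
totals = simulate_append_only_session(10_000, [(500, 100)] * 50)
total_input, total_cached, total_output = totals
print(f"input tokens:   {total_input:,}")
print(f"cached reads:   {total_cached:,}")
print(f"output tokens:  {total_output:,}")
print(f"cached share:   {total_cached / total_input:.1%}")
```

Even in this small example, input tokens outnumber output tokens by more than 250 to 1, and nearly all of that input is repeated context. Only the first prompt and each step's appended tokens need fresh prefill.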
The cache numbers explain the economics
The project site’s snapshot lists 54.9 billion total input tokens and 52.6 billion cached-read tokens. If cached reads are counted within the input total, roughly 96% of input tokens were served from cache. That ratio shows why prefix caching is not a nice-to-have for agent systems. It is part of the cost and latency story.
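The arithmetic behind that ratio is easy to check. The sketch below uses the rounded token totals quoted from the project site and assumes that cached-read tokens are counted inside the input total, which the site figures suggest but this article does not confirm.

```python
# Rounded token totals as quoted from the TraceLab project site.
INPUT_TOKENS = 54.9e9
CACHED_READ_TOKENS = 52.6e9
OUTPUT_TOKENS = 186.9e6

# Assumption: cached reads are a subset of the input total.
cached_share = CACHED_READ_TOKENS / INPUT_TOKENS
uncached_tokens = INPUT_TOKENS - CACHED_READ_TOKENS
input_per_output = INPUT_TOKENS / OUTPUT_TOKENS

print(f"cached share of input:   {cached_share:.1%}")
print(f"uncached input tokens:   {uncached_tokens / 1e9:.1f} billion")
print(f"input tokens per output: {input_per_output:.0f}")
```

Under that assumption, about 2.3 billion input tokens missed the cache, and the pool carries close to 300 input tokens for every output token.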
The paper also describes cache hit rates as high but imperfect, and that gap is the hard part. If every step reused context perfectly, serving optimization would be straightforward. In real sessions, users interrupt, agents branch, tools return new data, and context changes shape.
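One way to see why a changed context hurts is that a prefix cache can only reuse the part of the new prompt that matches the old one from the very start. The sketch below is a simplified illustration, not TraceLab's method: strings stand in for token chunks, and the helper name is hypothetical.

```python
def cached_prefix_length(previous_tokens, new_tokens):
    """Count how many leading tokens two prompts share."""
    shared = 0
    for old, new in zip(previous_tokens, new_tokens):
        if old != new:
            break  # reuse stops at the first mismatch
        shared += 1
    return shared


previous = ["sys", "user_1", "tool_out_1", "assistant_1"]

# Normal agent step: the new observation is appended at the end.
appended = previous + ["tool_out_2"]

# Branch or edit: something early in the context changes.
edited = ["sys", "user_1_edited", "tool_out_1", "assistant_1", "tool_out_2"]

append_hit = cached_prefix_length(previous, appended)
edit_hit = cached_prefix_length(previous, edited)
print(f"append reuses {append_hit} of {len(appended)} chunks")
print(f"edit reuses   {edit_hit} of {len(edited)} chunks")
```

The appended prompt reuses all four earlier chunks. The edited prompt reuses only the first, even though most of its later content is unchanged, because everything after the first mismatch must be prefilled again.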
TraceLab points to several optimization opportunities: lower-overhead tool calling, append-length-aware prefill, semantic-aware tool-latency prediction, and better KV-cache management around human-paced gaps.
Those are infrastructure problems, not prompt-engineering problems. They matter to labs, cloud providers, IDE vendors, and teams building agent platforms because coding agents can look efficient in a demo and still be expensive or slow at scale.
The Claude and Codex split is also useful
The public site says 140,338 agent steps came from Claude and 216,823 from Codex, a 39% to 61% split in the public pool. The dataset should not be read as a measure of market share. It reflects the contributors and collection process, not the entire coding-agent world.
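The split can be reproduced from the two step counts, which also sum to the 357,161 agent steps reported for the whole pool. This is a small check using only the figures quoted above.

```python
# Agent-step counts per agent family, as quoted from the project site.
CLAUDE_STEPS = 140_338
CODEX_STEPS = 216_823

total_steps = CLAUDE_STEPS + CODEX_STEPS
claude_share = CLAUDE_STEPS / total_steps
codex_share = CODEX_STEPS / total_steps

print(f"total agent steps: {total_steps:,}")
print(f"Claude share:      {claude_share:.1%}")
print(f"Codex share:       {codex_share:.1%}")
```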
It is still useful because it covers two major agent families and exposes enough structure for systems research. Real traces let researchers ask questions that synthetic task suites cannot answer cleanly: how often agents call tools, how much context they carry forward, where latency clusters, and which parts of a session look predictable.
That is the practical value of TraceLab. It does not tell a team which coding agent to buy. It gives infrastructure builders a sharper picture of what they have to serve when coding agents become a daily workflow instead of a novelty.