Artificial Analysis has published AA-Briefcase, an agentic knowledge-work benchmark built around realistic business projects rather than single prompts. The benchmark evaluates models across four multi-week projects and 91 tasks, using nearly 2,000 source files, more than 3,500 emails, and 25,000 Slack messages.
That structure is the story. Agent benchmarks are moving toward the shape of actual office work: messy context, conflicting inputs, multiple deliverables, and judgments about whether the answer is correct, analytically sound, and presentable.
AA-Briefcase is not just asking whether a model can produce a polished response. It asks whether the model can navigate a project folder.
The benchmark is closer to how agents are sold
Most enterprise agent promises are not about trivia answers. They are about knowledge work: read the documents, find the relevant thread, understand the spreadsheet, produce the memo, prepare the slide, reconcile the contradictions, and finish the task.
AA-Briefcase tries to simulate that environment. Artificial Analysis says the scenarios are multi-week workflows in data science, product management, and corporate strategy. The tasks were built by experts from companies including Google, McKinsey & Company, and Boston Consulting Group.
That matters because the unit of work is a deliverable, not a chat response. A model can sound competent in a short answer while still missing the hidden constraint in an email, misunderstanding a Slack discussion, or producing a presentation that looks good but fails the rubric.
The grading mixes correctness and quality
Artificial Analysis describes AA-Briefcase as combining rubric checks with pairwise grading. The rubric side tests verifiable task success. The pairwise side evaluates analytical quality and presentation quality.
That combination is useful because business work has more than one failure mode. A model can be factually wrong. It can be analytically weak. It can present the right answer in a way that a stakeholder cannot use. It can produce something polished but unsupported.
The benchmark’s combined AA-Briefcase Elo aggregates rubric pass rate, analytical quality Elo, and presentation Elo. That makes the score more holistic than a pure pass/fail metric. It also means readers should be careful: an Elo score is a benchmark-specific measurement, not a universal promise that one model will perform best inside a particular company.
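To make the Elo part of that score concrete, here is a minimal sketch of the standard Elo update for a single pairwise comparison. It illustrates the general technique only. The article does not say how Artificial Analysis computes or aggregates its Elo values, so the function names, the starting ratings, and the K-factor of 32 are all assumptions.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one pairwise comparison.

    score_a is 1.0 if A's deliverable was preferred, 0.0 if B's was,
    and 0.5 for a tie. k is an assumed step size, not a published value.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two hypothetical models start level; A's deliverable wins one comparison.
print(update_elo(1000.0, 1000.0, 1.0))  # (1016.0, 984.0)
```

Because each update depends on the opponents and comparisons in the pool, the resulting ratings only have meaning relative to the other models graded on the same tasks. That is the reason to treat an Elo score as benchmark-specific.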
Long-running tasks create a cost question
Artificial Analysis followed the benchmark launch with a time-per-task article. That is the right next question. Long-horizon agents differ not only in quality but also in how long they run, how many steps they take, and how much they cost per deliverable.
For enterprise buyers, that is often the real comparison. A model that produces better analysis but takes much longer may still be worth it for a strategy memo. The same trade-off may be unacceptable for routine operations work. A cheaper model that is “good enough” for structured tasks may beat a frontier model on cost-performance.
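The trade-off above can be put into rough numbers. The sketch below compares two hypothetical models on cost per passed task and minutes per task. Every figure and name in it is invented for illustration and none comes from AA-Briefcase results.

```python
from dataclasses import dataclass


@dataclass
class ModelRun:
    name: str
    tasks_attempted: int
    tasks_passed: int
    total_cost_usd: float
    total_minutes: float


def cost_per_passed_task(run: ModelRun) -> float:
    """Total spend divided by the number of deliverables that passed."""
    if run.tasks_passed == 0:
        return float("inf")  # nothing usable was produced
    return run.total_cost_usd / run.tasks_passed


def minutes_per_task(run: ModelRun) -> float:
    """Average wall-clock minutes per attempted task."""
    return run.total_minutes / run.tasks_attempted


# Hypothetical numbers, chosen only to show the calculation.
frontier = ModelRun("frontier-model", 100, 80, 400.0, 1500.0)
budget = ModelRun("budget-model", 100, 60, 90.0, 600.0)

for run in (frontier, budget):
    print(run.name, cost_per_passed_task(run), minutes_per_task(run))
# frontier-model 5.0 15.0
# budget-model 1.5 6.0
```

In this made-up case the budget model passes fewer tasks but costs less per passing deliverable. Whether that wins depends on how expensive a failed or late deliverable is for the work in question.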
That is why the benchmark’s business-work framing matters. It gives teams a way to discuss agent performance in units closer to work output: tasks, deliverables, time, and quality.
The caveat is transfer
No benchmark can fully represent a company’s internal reality. Every business has its own document formats, jargon, permission boundaries, messy data, and stakeholder expectations. AA-Briefcase is useful because it is more realistic than many static evaluations, but it is still an external benchmark.
The practical use is comparative and diagnostic. It can show which models handle messy, long-horizon work better under Artificial Analysis’ setup. It can also show failure modes that enterprises should test for in their own evals: missing context, weak analysis, polished wrong answers, and poor presentation.
The next checkpoint is whether companies build private versions of this benchmark shape. The useful pattern is clear: use realistic source files, evaluate deliverables, grade correctness and quality separately, and measure time and cost per task.
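One way a team might record results for a private eval of this shape is sketched below. It keeps rubric correctness, pairwise quality, time, and cost as separate fields so that no single number hides a failure mode. The field names, the baseline-comparison setup, and the sample values are assumptions for illustration, not AA-Briefcase's schema.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskResult:
    task_id: str
    rubric_checks_passed: int  # verifiable checks the deliverable satisfied
    rubric_checks_total: int
    pairwise_wins: int         # quality comparisons won against a baseline
    pairwise_total: int
    minutes: float
    cost_usd: float


def summarize(results: list[TaskResult]) -> dict[str, float]:
    """Report correctness, quality, time, and cost as separate metrics."""
    return {
        "rubric_pass_rate": sum(r.rubric_checks_passed for r in results)
        / sum(r.rubric_checks_total for r in results),
        "quality_win_rate": sum(r.pairwise_wins for r in results)
        / sum(r.pairwise_total for r in results),
        "mean_minutes": mean(r.minutes for r in results),
        "mean_cost_usd": mean(r.cost_usd for r in results),
    }


# Hypothetical results for two internal tasks.
results = [
    TaskResult("pricing-memo", 8, 10, 3, 4, 12.0, 2.0),
    TaskResult("churn-analysis", 6, 10, 1, 4, 20.0, 4.0),
]
summary = summarize(results)
print(summary)
```

With these sample values the summary shows a 0.7 rubric pass rate next to a 0.5 quality win rate, which is the kind of gap a single blended score would obscure.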
That is the direction agent evaluation needs. The question is no longer only which model answers the hardest prompts. It is which system can finish useful work when the inputs look like a real inbox, drive, and project room.