
AA-Briefcase tests agents on messy business work

Artificial Analysis' AA-Briefcase benchmark evaluates models on multi-week knowledge-work projects with documents, email, Slack data, deliverables, and graded analysis quality.

Artificial Analysis has published AA-Briefcase, an agentic knowledge-work benchmark built around realistic business projects rather than single prompts. The benchmark evaluates models across four multi-week projects and 91 tasks, using nearly 2,000 source files, more than 3,500 emails, and 25,000 Slack messages.

That structure is the story. Agent benchmarks are moving toward the shape of actual office work: messy context, conflicting inputs, multiple deliverables, and judgments about whether the answer is correct, analytical, and presentable.

AA-Briefcase is not just asking whether a model can produce a polished response. It asks whether the model can navigate a project folder.

The benchmark is closer to how agents are sold

Most enterprise agent promises are not about trivia answers. They are about knowledge work: read the documents, find the relevant thread, understand the spreadsheet, produce the memo, prepare the slide, reconcile the contradictions, and finish the task.

AA-Briefcase tries to simulate that environment. Artificial Analysis says the scenarios are multi-week workflows in data science, product management, and corporate strategy. The tasks were built by experts from companies including Google, McKinsey & Company, and Boston Consulting Group.

That matters because the unit of work is a deliverable, not a chat response. A model can sound competent in a short answer while still missing the hidden constraint in an email, misunderstanding a Slack discussion, or producing a presentation that looks good but fails the rubric.

The grading mixes correctness and quality

Artificial Analysis describes AA-Briefcase as combining rubric checks with pairwise grading. The rubric side tests verifiable task success. The pairwise side evaluates analytical quality and presentation quality.

That combination is useful because business work has more than one failure mode. A model can be factually wrong. It can be analytically weak. It can present the right answer in a way that a stakeholder cannot use. It can produce something polished but unsupported.

The benchmark’s combined AA-Briefcase Elo aggregates rubric pass rate, analytical quality Elo, and presentation Elo. That makes the score more holistic than a pure pass/fail metric. It also means readers should be careful: an Elo score is a benchmark-specific measurement, not a universal promise that one model will perform best inside a particular company.
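For readers unfamiliar with how pairwise judgments become a rating, here is a minimal sketch of the standard Elo update. It is not Artificial Analysis' implementation: the model names, the starting rating of 1000, the K-factor of 32, and the sample judgments are all assumptions for illustration, and the way the benchmark folds rubric pass rate into the combined score is not described here, so the sketch covers only the pairwise part.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one pairwise comparison.

    score_a is 1.0 if A's deliverable was preferred, 0.0 if B's was,
    and 0.5 for a tie. k=32 is an assumed K-factor, not a benchmark setting.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Hypothetical judgments: (model_a, model_b, score for model_a)
judgments = [
    ("model_x", "model_y", 1.0),
    ("model_y", "model_x", 0.0),
    ("model_x", "model_y", 0.5),
]

# Assumed starting rating for every model
ratings = {"model_x": 1000.0, "model_y": 1000.0}
for a, b, score_a in judgments:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], score_a)

print(ratings)
```

Sequential updates like this depend on the order in which comparisons are processed, which is one reason a rating only means something relative to the other models and the comparison set that produced it.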

Long-running tasks create a cost question

Artificial Analysis followed the benchmark launch with a time-per-task article. That is the right next question. Long-horizon agents do not only differ in quality; they differ in how long they run, how many steps they take, and how much they cost per deliverable.

For enterprise buyers, that is often the real comparison. A model that produces better analysis but takes much longer may still be worth it for a strategy memo. The same trade-off may be unacceptable for routine operations work. A cheaper model that is “good enough” for structured tasks may beat a frontier model on cost-performance.
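The trade-off can be made concrete with simple arithmetic. The sketch below compares two hypothetical models on cost per passed deliverable and minutes per task. Every number and name in it is invented for illustration and does not come from AA-Briefcase results.

```python
from dataclasses import dataclass


@dataclass
class ModelRun:
    name: str
    tasks_attempted: int
    tasks_passed: int
    total_cost_usd: float
    total_minutes: float


def cost_per_passed_task(run: ModelRun) -> float:
    """Dollars spent per deliverable that passed; infinite if nothing passed."""
    if run.tasks_passed == 0:
        return float("inf")
    return run.total_cost_usd / run.tasks_passed


def minutes_per_task(run: ModelRun) -> float:
    """Average wall-clock minutes per attempted task."""
    return run.total_minutes / run.tasks_attempted


# Hypothetical figures, for illustration only
runs = [
    ModelRun("frontier", tasks_attempted=20, tasks_passed=16,
             total_cost_usd=80.0, total_minutes=400.0),
    ModelRun("budget", tasks_attempted=20, tasks_passed=12,
             total_cost_usd=24.0, total_minutes=160.0),
]

for run in runs:
    print(run.name, cost_per_passed_task(run), minutes_per_task(run))

cheapest = min(runs, key=cost_per_passed_task)
print("lowest cost per passed task:", cheapest.name)
```

In this made-up case the budget model passes fewer tasks but costs less per passed deliverable, which is the kind of result that could favor it for routine work and still leave the frontier model preferable where each miss is expensive.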

That is why the benchmark’s business-work framing matters. It gives teams a way to discuss agent performance in units closer to work output: tasks, deliverables, time, and quality.

The caveat is transfer

No benchmark can fully represent a company’s internal reality. Every business has its own document formats, jargon, permission boundaries, messy data, and stakeholder expectations. AA-Briefcase is useful because it is more realistic than many static evaluations, but it is still an external benchmark.

The practical use is comparative and diagnostic. It can show which models handle messy, long-horizon work better under Artificial Analysis’ setup. It can also show failure modes that enterprises should test for in their own evals: missing context, weak analysis, polished wrong answers, and poor presentation.

The next checkpoint is whether companies build private benchmarks with the same shape. The useful pattern is clear: use realistic source files, evaluate deliverables, grade correctness and quality separately, and measure time and cost per task.
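As a rough sketch of what that pattern could look like in a private eval, the record below keeps rubric checks, quality scores, time, and cost in separate fields instead of collapsing them into one number. The field names, the 0-to-1 score scale, the thresholds, and the sample tasks are all assumptions for illustration, not part of AA-Briefcase.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskResult:
    task_id: str
    rubric_checks: Dict[str, bool]  # verifiable criterion -> passed?
    analysis_score: float           # assumed 0..1 scale, from a grader
    presentation_score: float       # assumed 0..1 scale, from a grader
    minutes: float
    cost_usd: float

    @property
    def rubric_pass_rate(self) -> float:
        """Fraction of rubric checks that passed (0.0 if there are none)."""
        if not self.rubric_checks:
            return 0.0
        return sum(self.rubric_checks.values()) / len(self.rubric_checks)


def polished_but_wrong(result: TaskResult,
                       rubric_floor: float = 0.5,
                       presentation_bar: float = 0.8) -> bool:
    """Flag deliverables that look good but fail most rubric checks.

    Both thresholds are arbitrary assumptions; tune them to your own data.
    """
    return (result.rubric_pass_rate < rubric_floor
            and result.presentation_score >= presentation_bar)


# Hypothetical results, for illustration only
results: List[TaskResult] = [
    TaskResult("memo-01",
               {"cites_q3_revenue": True, "flags_contract_conflict": True},
               analysis_score=0.7, presentation_score=0.6,
               minutes=42.0, cost_usd=3.10),
    TaskResult("deck-02",
               {"uses_updated_forecast": False, "names_owner": False,
                "matches_budget_cap": True},
               analysis_score=0.4, presentation_score=0.9,
               minutes=55.0, cost_usd=4.25),
]

flagged = [r.task_id for r in results if polished_but_wrong(r)]
print(flagged)
```

Keeping the dimensions separate is what makes a failure mode such as a polished wrong answer visible at all; a single blended score would hide it.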

That is the direction agent evaluation needs. The question is no longer only which model answers the hardest prompts. It is which system can finish useful work when the inputs look like a real inbox, drive, and project room.

The AI Feed Desk
