IBM Research published ScarfBench on June 30, a benchmark for testing whether AI coding agents can migrate Java applications across enterprise frameworks while preserving behavior.
That last clause is the point of the benchmark. ScarfBench does not ask whether an agent can make code compile or generate a plausible patch. It asks whether the migrated application still behaves the way the original did.
The result is blunt. IBM Research says the strongest current agents achieve less than 10% behavioral success on the benchmark.
Framework migration is a harder coding task
Enterprise modernization work often looks boring from the outside. A team moves an application from one framework to another, updates annotations, changes dependency injection, replaces configuration patterns, and preserves routes, services, data behavior, and tests.
That is exactly why it is hard for agents. The job is not a single algorithmic puzzle. It is many small semantic commitments across files. A migration can build and still be wrong. It can pass shallow tests and still change runtime behavior. It can produce idiomatic code in the target framework while losing one important edge case from the old system.
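To make "builds but is still wrong" concrete, here is a minimal sketch in Python rather than Java, so it runs without any framework installed. Everything in it is hypothetical and is not taken from ScarfBench: the `ORDERS` data, the two handler functions, and the `behavioral_diffs` helper are illustrative stand-ins for an original endpoint and its migrated counterpart.

```python
from dataclasses import dataclass

# Hypothetical in-memory data standing in for the application's store.
ORDERS = {"A1": "2 widgets"}


@dataclass(frozen=True)
class Response:
    status: int
    body: str


def original_get_order(order_id: str) -> Response:
    """Stand-in for the legacy endpoint: a blank id is a client error (400)."""
    if not order_id.strip():
        return Response(400, "order id required")
    if order_id not in ORDERS:
        return Response(404, "not found")
    return Response(200, ORDERS[order_id])


def migrated_get_order(order_id: str) -> Response:
    """Stand-in for the migrated endpoint: the blank-id check was lost,
    so a blank id now falls through to the 404 branch."""
    if order_id not in ORDERS:
        return Response(404, "not found")
    return Response(200, ORDERS[order_id])


def behavioral_diffs(inputs):
    """Return (input, original, migrated) for every input where responses differ."""
    diffs = []
    for value in inputs:
        before = original_get_order(value)
        after = migrated_get_order(value)
        if before != after:
            diffs.append((value, before, after))
    return diffs


# The happy path and the unknown-id path agree; only the blank id exposes the drift.
diffs = behavioral_diffs(["A1", "missing", " "])
for value, before, after in diffs:
    print(repr(value), before.status, "->", after.status)
```

Both handlers are valid code, and a test suite that only exercised `"A1"` and `"missing"` would pass against either one. The difference shows up only when the blank-id case is tested explicitly.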
ScarfBench is built around that problem. IBM Research says the benchmark includes 34 applications, 102 framework implementations, 204 migration tasks, roughly 151,000 lines of code, about 2,000 source and test files, and 1,331 expert-written tests. The tasks cover migrations across frameworks such as Spring Boot, Quarkus, and Micronaut.
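The published counts fit together under one simple reading, sketched below. The reading is an assumption and is not stated in the summary: each application is implemented in the same number of frameworks, and each ordered source-to-target pair counts as one migration task.

```python
from itertools import permutations

# Counts as reported in the IBM Research summary.
APPLICATIONS = 34
IMPLEMENTATIONS = 102
MIGRATION_TASKS = 204

# Assumption: implementations are spread evenly across applications.
frameworks_per_app = IMPLEMENTATIONS // APPLICATIONS

# Assumption: every ordered (source, target) framework pair is one task.
pairs_per_app = len(list(permutations(range(frameworks_per_app), 2)))

derived_tasks = APPLICATIONS * pairs_per_app

print(frameworks_per_app, pairs_per_app, derived_tasks)
```

Under those assumptions the arithmetic gives three implementations per application, six directed migrations per application, and 204 tasks overall, which matches the reported total.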
The benchmark evaluates three stages: whether the agent produces code, whether the result builds, and whether the migrated application passes behavioral tests. That separation matters because each stage catches a different failure class.
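The three stages can be pictured as a funnel in which each attempt is classified by the first stage it fails. The sketch below is illustrative only and is not ScarfBench's harness: the `Attempt` record, the stage labels, and the rule that every behavioral test must pass are assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    """One agent attempt at one migration task (hypothetical record shape)."""
    produced_code: bool
    build_ok: bool
    tests_passed: int
    tests_total: int


def failure_stage(attempt: Attempt) -> str:
    """Classify an attempt by the first stage it fails, or 'success'."""
    if not attempt.produced_code:
        return "no_output"
    if not attempt.build_ok:
        return "build_failed"
    # Assumption: behavioral success requires every test to pass.
    if attempt.tests_total == 0 or attempt.tests_passed < attempt.tests_total:
        return "behavior_failed"
    return "success"


def funnel(attempts):
    """Fraction of attempts that clear each of the three stages."""
    counts = Counter(failure_stage(a) for a in attempts)
    total = len(attempts)
    produced = total - counts["no_output"]
    built = produced - counts["build_failed"]
    return {
        "produced": produced / total,
        "built": built / total,
        "behavior": counts["success"] / total,
    }


# Made-up attempts, one per outcome, purely to exercise the functions.
attempts = [
    Attempt(produced_code=False, build_ok=False, tests_passed=0, tests_total=40),
    Attempt(produced_code=True, build_ok=False, tests_passed=0, tests_total=40),
    Attempt(produced_code=True, build_ok=True, tests_passed=37, tests_total=40),
    Attempt(produced_code=True, build_ok=True, tests_passed=40, tests_total=40),
]
rates = funnel(attempts)
print(rates)
```

Reporting the three rates separately shows where attempts drop out. A single pass/fail number would hide whether an agent fails to produce code, fails to build, or builds something that behaves differently.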
Build success is not modernization success
The public summary says agents can get part of the way through the migration pipeline, but behavior preservation remains weak. For enterprise teams, the lesson is that a migration that builds has not yet been shown to work.
A modernization agent that makes code compile is useful. It may save time by finding API changes, updating repetitive patterns, and proposing an initial patch. But compilation is not the end state. The application still has to serve the same contract.
ScarfBench’s value is that it makes that distinction visible. It gives teams a way to ask whether an agent preserves behavior across a realistic migration, rather than rewarding a patch for looking plausible.
That connects with other recent agent benchmarks. TraceLab turned real coding-agent sessions into serving data. AA-Briefcase tested longer corporate workflows. ScarfBench adds a narrower but highly practical enterprise task: framework migration under tests.
Sources
- Hugging Face / IBM Research: ScarfBench: Benchmarking AI Agents for Enterprise Java Framework Migration
- ScarfBench leaderboard
- ScarfBench dataset
- The AI Feed: TraceLab turns real Codex and Claude Code sessions into serving data
- The AI Feed: Artificial Analysis launches AA-Briefcase to test long-horizon agents