ScarfBench shows coding agents still struggle with Java migrations

IBM Research published ScarfBench on June 30, a benchmark for testing whether AI coding agents can migrate Java applications across enterprise frameworks while preserving behavior.

That last clause is the point of the benchmark. ScarfBench is not asking whether an agent can make code compile or generate a plausible patch. It asks whether the migrated application still behaves the way the original application behaved.

The result is blunt. IBM Research says the strongest current agents achieve less than 10% behavioral success on the benchmark.

Key figures, per IBM Research: 34 applications, 204 migration tasks, 1,331 expert-written tests, and a best behavioral success rate under 10%.

Framework migration is a harder coding task

Enterprise modernization work often looks boring from the outside. A team moves an application from one framework to another, updates annotations, changes dependency injection, replaces configuration patterns, and preserves routes, services, data behavior, and tests.

That is exactly why it is hard for agents. The job is not a single algorithmic puzzle. It is many small semantic commitments across files. A migration can build and still be wrong. It can pass shallow tests and still change runtime behavior. It can produce idiomatic code in the target framework while losing one important edge case from the old system.

ScarfBench is built around that problem. IBM Research says the benchmark includes 34 applications, 102 framework implementations, 204 migration tasks, roughly 151,000 lines of code, about 2,000 source and test files, and 1,331 expert-written tests. The tasks cover migrations across frameworks such as Spring Boot, Quarkus, and Micronaut.

The benchmark evaluates three stages: whether the agent produces code, whether the result builds, and whether the migrated application passes behavioral tests. That separation matters because each stage catches a different failure class.
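To make the staged view concrete, here is a minimal sketch of how a single migration attempt could be sorted into the first stage it fails. It is not ScarfBench's actual harness. The names `MigrationResult`, `Stage`, and `classify` are hypothetical, and the rule that every behavioral test must pass is an assumption made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    """Outcome labels for one migration attempt (hypothetical names)."""
    NO_OUTPUT = "no_output"
    BUILD_FAILED = "build_failed"
    BEHAVIOR_FAILED = "behavior_failed"
    BEHAVIOR_PRESERVED = "behavior_preserved"


@dataclass
class MigrationResult:
    """Raw signals collected from one agent run (hypothetical schema)."""
    produced_code: bool
    build_ok: bool
    tests_passed: int
    tests_total: int


def classify(result: MigrationResult) -> Stage:
    """Return the first stage the attempt fails, or BEHAVIOR_PRESERVED."""
    if not result.produced_code:
        return Stage.NO_OUTPUT
    if not result.build_ok:
        return Stage.BUILD_FAILED
    # Assumption: success means every behavioral test passes.
    # An empty test suite cannot demonstrate preserved behavior.
    if result.tests_total == 0 or result.tests_passed < result.tests_total:
        return Stage.BEHAVIOR_FAILED
    return Stage.BEHAVIOR_PRESERVED


if __name__ == "__main__":
    built_but_wrong = MigrationResult(True, True, 37, 40)
    print(classify(built_but_wrong).value)  # behavior_failed
```

Checking the stages in order means a run that builds cleanly but fails three of forty tests is reported as a behavior failure, not as a partial success.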

Build success is not modernization success

The public summary says agents can get part of the way through the migration pipeline, but behavior preservation remains weak. That is the enterprise lesson.

A modernization agent that makes code compile is useful. It may save time by finding API changes, updating repetitive patterns, and proposing an initial patch. But compilation is not the end state. The application still has to serve the same contract.

ScarfBench’s value is that it makes that distinction visible. It gives teams a way to ask whether an agent preserves behavior across a realistic migration, rather than rewarding a patch for looking plausible.

That connects with other recent agent benchmarks. TraceLab measured real coding-agent serving data. AA-Briefcase tested longer corporate workflows. ScarfBench adds a narrower but highly practical enterprise task: framework migration under tests.

The useful score is behavior preserved per review hour

ScarfBench should not be read as proof that coding agents are useless for modernization. It is better read as evidence that the value is still in assisted migration, not autonomous migration.

The useful operational metric is not only behavioral pass rate. It is behavior preserved per review hour. If an agent can produce a reasonable first migration in minutes, a human team may still save time if review and test repair are straightforward. If the agent scatters subtle behavior changes across many files, the cleanup can cost more than writing the migration directly.
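One way to put a number on that idea is sketched below. The formula is an assumption, not a definition from IBM Research: it divides the fraction of behavioral tests that pass after human review by the hours reviewers spent getting there. The function name and the sample figures are hypothetical.

```python
def behavior_preserved_per_review_hour(
    tests_passed: int, tests_total: int, review_hours: float
) -> float:
    """Fraction of behavioral tests passing, per hour of human review.

    Assumed formula for illustration: (tests_passed / tests_total) / review_hours.
    """
    if tests_total <= 0:
        raise ValueError("tests_total must be positive")
    if not 0 <= tests_passed <= tests_total:
        raise ValueError("tests_passed must be between 0 and tests_total")
    if review_hours <= 0:
        raise ValueError("review_hours must be positive")
    return (tests_passed / tests_total) / review_hours


if __name__ == "__main__":
    # Hypothetical pilot data: a clean first draft versus a scattered one.
    clean_draft = behavior_preserved_per_review_hour(90, 100, 2.0)
    scattered_draft = behavior_preserved_per_review_hour(95, 100, 10.0)
    print(f"clean draft:     {clean_draft:.3f} per review hour")
    print(f"scattered draft: {scattered_draft:.3f} per review hour")
```

In this made-up comparison the second draft ends with a higher pass rate but scores far lower, because reviewers needed five times as long to get it there.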

That means teams should pilot modernization agents with strong regression tests, small bounded services, and explicit rollback points. A weak test suite turns an agent into a generator of hidden risk. A strong test suite turns it into a fast but supervised patch producer.

ScarfBench is valuable because it puts that distinction on the table. The benchmark asks the right enterprise question: not “can the agent code?”, but “can the agent preserve the business behavior while changing the framework underneath it?”
