
ScarfBench shows coding agents still struggle with Java migrations

IBM Research's ScarfBench tests whether AI coding agents can preserve behavior while migrating Java applications across enterprise frameworks.


IBM Research published ScarfBench on June 30, a benchmark for testing whether AI coding agents can migrate Java applications across enterprise frameworks while preserving behavior.

That last part is the point of the benchmark. ScarfBench is not asking whether an agent can make code compile or generate a plausible patch. It asks whether the migrated application still behaves the way the original application behaved.

The result is blunt. IBM Research says the strongest current agents achieve less than 10% behavioral success on the benchmark.

34 applications
204 migration tasks
1,331 expert-written tests
<10% best behavioral success
Source: IBM Research

Framework migration is a harder coding task

Enterprise modernization work often looks boring from the outside. A team moves an application from one framework to another, updates annotations, changes dependency injection, replaces configuration patterns, and preserves routes, services, data behavior, and tests.

That is exactly why it is hard for agents. The job is not a single algorithmic puzzle. It is many small semantic commitments across files. A migration can build and still be wrong. It can pass shallow tests and still change runtime behavior. It can produce idiomatic code in the target framework while losing one important edge case from the old system.
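To make "builds but is still wrong" concrete, here is a minimal, language-neutral sketch in Python. The two handlers and the `limit` parameter are hypothetical and are not taken from ScarfBench or from any of its Java applications. They model an original endpoint and a migrated one that agree on ordinary requests but diverge on one edge case.

```python
def original_handler(params):
    # Hypothetical legacy behavior: a missing OR blank "limit" falls back to 10.
    raw = params.get("limit", "")
    limit = int(raw) if raw.strip() else 10
    return {"status": 200, "items": list(range(limit))}


def migrated_handler(params):
    # Hypothetical migrated behavior: a missing "limit" still defaults to 10,
    # but a blank value is now rejected instead of defaulted.
    raw = params.get("limit", "10")
    if not raw.strip().isdigit():
        return {"status": 400, "items": []}
    return {"status": 200, "items": list(range(int(raw)))}


def find_divergences(before, after, probes):
    """Return the probe inputs on which the two handlers disagree."""
    return [probe for probe in probes if before(probe) != after(probe)]


probes = [{}, {"limit": "3"}, {"limit": " "}]
divergent = find_divergences(original_handler, migrated_handler, probes)
print(divergent)
```

Both handlers run without error and agree on the first two probes, so a shallow test suite would pass. Only a probe that exercises the blank-value case exposes the change in behavior.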

ScarfBench is built around that problem. IBM Research says the benchmark includes 34 applications, 102 framework implementations, 204 migration tasks, roughly 151,000 lines of code, about 2,000 source and test files, and 1,331 expert-written tests. The tasks cover migrations across frameworks such as Spring Boot, Quarkus, and Micronaut.
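The reported counts fit a simple pattern, sketched below. This is an inference from the numbers, not something the summary states: it assumes each application exists in three framework variants and that every ordered source-to-target pair counts as one migration task. The framework names are placeholders.

```python
from itertools import permutations

applications = 34
# Placeholder names; the assumption is only that there are three variants per app.
frameworks = ["framework_a", "framework_b", "framework_c"]

# One implementation of each application per framework.
implementations = applications * len(frameworks)

# Every ordered (source, target) pair of distinct frameworks is one migration direction.
directions = list(permutations(frameworks, 2))
migration_tasks = applications * len(directions)

print(implementations, migration_tasks)
```

Under those assumptions the totals come out to 102 implementations and 204 tasks, matching the figures IBM Research reports.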

The benchmark evaluates three stages: whether the agent produces code, whether the result builds, and whether the migrated application passes behavioral tests. That separation matters because each stage catches a different failure class.
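The three-stage split can be pictured as a funnel. The sketch below is illustrative only: the `MigrationResult` record, its field names, and the sample data are hypothetical, and it assumes a task counts as behaviorally successful only when every test passes, which the public summary does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MigrationResult:
    task_id: str
    produced_code: bool
    builds: bool
    tests_passed: int
    tests_total: int


def furthest_stage(result):
    """Classify a task by the first stage at which it fails."""
    if not result.produced_code:
        return "no_output"
    if not result.builds:
        return "build_failed"
    # Assumption: behavioral success requires all tests to pass.
    if result.tests_total == 0 or result.tests_passed < result.tests_total:
        return "behavior_failed"
    return "behavior_preserved"


def stage_rates(results):
    """Share of tasks that clear each stage, measured against all tasks."""
    total = len(results)
    if total == 0:
        return {"generated": 0.0, "built": 0.0, "behavior": 0.0}
    stages = [furthest_stage(r) for r in results]
    generated = sum(s != "no_output" for s in stages)
    built = sum(s in ("behavior_failed", "behavior_preserved") for s in stages)
    preserved = sum(s == "behavior_preserved" for s in stages)
    return {
        "generated": generated / total,
        "built": built / total,
        "behavior": preserved / total,
    }


# Made-up sample data, not benchmark results.
sample = [
    MigrationResult("task-1", False, False, 0, 5),
    MigrationResult("task-2", True, False, 0, 5),
    MigrationResult("task-3", True, True, 3, 5),
    MigrationResult("task-4", True, True, 5, 5),
]
rates = stage_rates(sample)
print(rates)
```

Reporting all three rates side by side shows where agents drop out: a high build rate next to a low behavior rate is the gap the benchmark is designed to expose.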

Build success is not modernization success

The public summary says agents can get part of the way through the migration pipeline, but behavior preservation remains weak. That is the enterprise lesson.

A modernization agent that makes code compile is useful. It may save time by finding API changes, updating repetitive patterns, and proposing an initial patch. But compilation is not the end state. The application still has to serve the same contract.

ScarfBench’s value is that it makes that distinction visible. It gives teams a way to ask whether an agent preserves behavior across a realistic migration, rather than rewarding a patch for looking plausible.

That connects with other recent agent benchmarks. TraceLab measured real coding-agent serving data. AA-Briefcase tested longer corporate workflows. ScarfBench adds a narrower but highly practical enterprise task: framework migration under tests.

The AI Feed Desk