DeepSWE makes coding-agent rankings a cost question

DeepSWE updated its public coding-agent leaderboard on June 20 with a simple but useful message: long-horizon software engineering is not only a pass-rate race. The benchmark now shows 113 tasks across 91 repositories and 5 languages, with model results listed next to average cost, output tokens, and agent steps.

That makes the table more useful than a single “best coding model” ranking. Claude Fable 5 leads DeepSWE v1.1 at 70% pass@1, GPT-5.5 follows at 67%, Claude Opus 4.8 reaches 59%, GPT-5.4 reaches 52%, and GLM-5.2 reaches 44%. But those rows have very different costs and operating profiles.

The AI Feed’s read is that DeepSWE is becoming a deployment benchmark, not just a capability benchmark. The question is not only which agent solves the most tasks. It is which agent solves enough tasks at a cost, token budget, and action count a team can afford to run repeatedly.

The leaderboard exposes a real tradeoff

DeepSWE reports average cost per task alongside pass@1. In the June 20 table, Claude Fable 5 leads at 70% with an average cost of $21.63 per task. GPT-5.5 is three points behind at 67% but costs $7.23, about a third as much. Claude Opus 4.8 scores lower than both at 59% yet costs $13.22, nearly twice GPT-5.5’s average. GLM-5.2 sits at 44% and is the cheapest row at $3.92.

Model	DeepSWE pass@1	Average cost per task	Output tokens	Agent steps
Claude Fable 5 max	70%	$21.63	119k	88
GPT-5.5 xhigh	67%	$7.23	46k	82
Claude Opus 4.8 max	59%	$13.22	135k	120
GPT-5.4 xhigh	52%	$5.65	71k	70
GLM-5.2 max	44%	$3.92	78k	129
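
One way to read the table is to fold pass rate and cost into a single figure. The Python sketch below divides each row's average cost by its pass@1 to estimate spend per solved task. It assumes the reported average covers every attempted task, failures included, which the leaderboard text quoted here does not spell out. The function names are illustrative and are not something DeepSWE publishes.

```python
# Rows copied from the June 20 DeepSWE v1.1 table:
# (model, pass@1 as a fraction, average cost per task in USD).
ROWS = [
    ("Claude Fable 5 max", 0.70, 21.63),
    ("GPT-5.5 xhigh", 0.67, 7.23),
    ("Claude Opus 4.8 max", 0.59, 13.22),
    ("GPT-5.4 xhigh", 0.52, 5.65),
    ("GLM-5.2 max", 0.44, 3.92),
]


def cost_per_solved_task(pass_rate: float, avg_cost: float) -> float:
    """Estimated spend per solved task.

    Assumption: avg_cost is averaged over all attempts, so the cost of
    failed attempts is spread across the tasks that were solved.
    """
    if pass_rate <= 0:
        raise ValueError("pass_rate must be positive")
    return avg_cost / pass_rate


def rank_by_cost_per_solve(rows):
    """Return (model, cost per solved task) pairs, cheapest first."""
    scored = [(model, cost_per_solved_task(p, c)) for model, p, c in rows]
    return sorted(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    for model, cost in rank_by_cost_per_solve(ROWS):
        print(f"{model:<22} ${cost:6.2f} per solved task")
```

On these figures the order changes. GLM-5.2 comes out near $8.91 per solved task, GPT-5.5 and GPT-5.4 near $10.79 and $10.87, Claude Opus 4.8 near $22.41, and Claude Fable 5 near $30.90. The figure says nothing about which tasks each agent solves, so it complements pass@1 rather than replacing it.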

Those numbers should not be read as a universal buyer’s guide. DeepSWE is one benchmark. It measures a specific kind of long-horizon coding work. But it does show the operational shape of coding-agent evaluation: an agent that wins on pass rate may not win on cost, and an agent that is cheaper may still need more steps, more supervision, or a narrower task mix.

That is especially important for engineering teams that plan to run agents continuously. A one-off benchmark win matters less if the agent will be asked to triage issues, draft fixes, iterate on failing tests, and open pull requests hundreds or thousands of times.

DeepSWE is trying to avoid benchmark comfort

The benchmark’s design is part of the story. DeepSWE says its tasks are written from scratch rather than adapted from existing commits or pull requests. It also says the tasks span a broad set of repositories and require more code and more output tokens than SWE-bench Pro, while using hand-written verifiers to test behavior.

That matters because coding benchmarks can become too comfortable. If public tasks are close to training data, or if solutions can be memorized through common benchmark artifacts, the score starts measuring exposure as much as agent ability. DeepSWE is explicitly trying to make that harder.

The long-horizon part also changes what teams should watch. Short code-generation tests can tell you whether a model knows syntax and common patterns. Long-horizon tasks stress planning, repository navigation, tool use, testing behavior, and recovery after mistakes. Those are the skills that determine whether an agent helps in a real codebase or just produces plausible patches.

GLM-5.2 gets a different kind of proof point

GLM-5.2 does not lead DeepSWE. It is fifth in the displayed v1.1 table. But its row is still useful because it puts an open-weight model into the same operational comparison as the highest-scoring proprietary frontier models.

That is a better test than generic hype around open models. A team deciding whether to evaluate GLM-5.2 for coding-agent work can now ask specific questions: Is 44% pass@1 good enough for the class of tasks we can supervise? Does the lower average cost matter more than the lower pass rate? Are the extra agent steps acceptable inside our workflow? How does it behave on our repositories, languages, and test harnesses?

The answer may still be no for high-risk production changes. But DeepSWE gives teams a more grounded way to test that no than a leaderboard that only reports capability.

What to test next

Teams using coding agents should not copy DeepSWE’s ranking straight into procurement. They should use it to design a local evaluation. Pick representative issues, require the agent to run the real test suite, measure cost per accepted change, track how often humans need to intervene, and separate easy automation wins from changes that require architectural judgment.

The most useful local metric may be accepted fixes per dollar after review, not raw pass rate. DeepSWE’s June 20 update points in that direction by putting cost and behavior near the headline score.
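
A local evaluation along these lines can start small. The Python sketch below computes accepted fixes per dollar and a human-intervention rate from a team's own run log. The `AgentRun` record and its fields are hypothetical, chosen only to mirror the measurements listed above, and the sample data is made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class AgentRun:
    """One agent attempt on a local issue. All fields are hypothetical."""
    issue_id: str
    cost_usd: float            # total spend for this attempt
    tests_passed: bool         # the real test suite passed
    accepted_in_review: bool   # a human reviewer accepted the change
    human_interventions: int   # times a person had to step in


def accepted_fixes_per_dollar(runs: list[AgentRun]) -> float:
    """Accepted changes divided by total spend, failed attempts included."""
    total_cost = sum(r.cost_usd for r in runs)
    if total_cost <= 0:
        raise ValueError("total cost must be positive")
    accepted = sum(1 for r in runs if r.tests_passed and r.accepted_in_review)
    return accepted / total_cost


def intervention_rate(runs: list[AgentRun]) -> float:
    """Share of runs in which a human had to intervene at least once."""
    if not runs:
        raise ValueError("no runs recorded")
    return sum(1 for r in runs if r.human_interventions > 0) / len(runs)


if __name__ == "__main__":
    # Made-up sample log, not benchmark data.
    sample = [
        AgentRun("ISSUE-1", 4.0, True, True, 0),
        AgentRun("ISSUE-2", 6.0, True, False, 2),
        AgentRun("ISSUE-3", 5.0, False, False, 1),
        AgentRun("ISSUE-4", 5.0, True, True, 0),
    ]
    print(f"accepted fixes per dollar: {accepted_fixes_per_dollar(sample):.2f}")
    print(f"intervention rate: {intervention_rate(sample):.0%}")
```

Counting the cost of rejected and failed attempts in the denominator is deliberate: it keeps the metric tied to what the team actually spends, not to what successful runs cost in isolation.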


Sources

DeepSWE leaderboard