DeepSWE makes coding-agent rankings a cost question

DeepSWE's June 20 leaderboard update separates frontier coding agents by pass rate, cost, output tokens, and agent steps across long-horizon software tasks.

DeepSWE updated its public coding-agent leaderboard on June 20 with a simple but useful message: long-horizon software engineering is not only a pass-rate race. The benchmark now shows 113 tasks across 91 repositories and 5 languages, with model results listed next to average cost, output tokens, and agent steps.

That makes the table more useful than a single “best coding model” ranking. Claude Fable 5 leads DeepSWE v1.1 at 70% pass@1, GPT-5.5 follows at 67%, Claude Opus 4.8 reaches 59%, GPT-5.4 reaches 52%, and GLM-5.2 reaches 44%. But those rows have very different costs and operating profiles.

The AI Feed’s read is that DeepSWE is becoming a deployment benchmark, not just a capability benchmark. The question is not only which agent solves the most tasks. It is which agent solves enough tasks at a cost, token budget, and action count a team can afford to run repeatedly.

The leaderboard exposes a real tradeoff

DeepSWE reports average cost per task alongside pass@1. In the June 20 table, Claude Fable 5 leads at 70% with an average cost of $21.63 per task. GPT-5.5 is close behind at 67% at roughly a third of that cost, $7.23. Claude Opus 4.8 scores lower at 59% yet costs more than GPT-5.5, at $13.22. GPT-5.4 reaches 52% at $5.65, and GLM-5.2 sits at 44% with the lowest average cost in the table, $3.92.

Model                  DeepSWE pass@1   Average cost   Output tokens   Agent steps
Claude Fable 5 max     70%              $21.63         119k            88
GPT-5.5 xhigh          67%              $7.23          46k             82
Claude Opus 4.8 max    59%              $13.22         135k            120
GPT-5.4 xhigh          52%              $5.65          71k             70
GLM-5.2 max            44%              $3.92          78k             129

Those numbers should not be read as a universal buyer’s guide. DeepSWE is one benchmark. It measures a specific kind of long-horizon coding work. But it does show the operational shape of coding-agent evaluation: an agent that wins on pass rate may not win on cost, and an agent that is cheaper may still need more steps, more supervision, or a narrower task mix.

That is especially important for engineering teams that plan to run agents continuously. A one-off benchmark win matters less if the agent will be asked to triage issues, draft fixes, iterate on failing tests, and open pull requests hundreds or thousands of times.
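One way to make the pass-rate and cost tradeoff concrete is to divide each model's average cost per task by its pass@1, which gives a rough cost per solved task. The sketch below does that with the June 20 figures. This derived number is not something DeepSWE reports; it is our own illustration, and it assumes failed attempts cost about the same as the average attempt. The published pass rates are rounded, so the results are approximate.

```python
# Figures copied from the June 20 DeepSWE v1.1 table:
# model -> (pass@1 as a fraction, average cost per task in USD).
rows = {
    "Claude Fable 5 max": (0.70, 21.63),
    "GPT-5.5 xhigh": (0.67, 7.23),
    "Claude Opus 4.8 max": (0.59, 13.22),
    "GPT-5.4 xhigh": (0.52, 5.65),
    "GLM-5.2 max": (0.44, 3.92),
}


def cost_per_solved_task(pass_rate: float, avg_cost: float) -> float:
    """Average spend per attempted task divided by the fraction solved.

    Assumption (ours, not DeepSWE's): every attempt, passing or failing,
    costs roughly the reported average.
    """
    return avg_cost / pass_rate


# Derived metric, rounded to cents.
derived = {
    name: round(cost_per_solved_task(p, c), 2) for name, (p, c) in rows.items()
}

# Print cheapest-per-solve first.
for name, value in sorted(derived.items(), key=lambda kv: kv[1]):
    print(f"{name:<22} ${value:>6.2f} per solved task")
```

On these figures, GLM-5.2 comes out lowest at about $8.91 per solved task and Claude Fable 5 highest at about $30.90, with GPT-5.5 and GPT-5.4 nearly tied just under $11. That ordering says nothing about which tasks each model solves, so it complements the pass rate rather than replacing it.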

DeepSWE is trying to avoid benchmark comfort

The benchmark’s design is part of the story. DeepSWE says its tasks are written from scratch rather than adapted from existing commits or pull requests. It also says the tasks span a broad set of repositories and require more code and more output tokens than SWE-bench Pro, while using hand-written verifiers to test behavior.

That matters because coding benchmarks can become too comfortable. If public tasks are close to training data, or if solutions can be memorized through common benchmark artifacts, the score starts measuring exposure as much as agent ability. DeepSWE is explicitly trying to make that harder.

The long-horizon part also changes what teams should watch. Short code-generation tests can tell you whether a model knows syntax and common patterns. Long-horizon tasks stress planning, repository navigation, tool use, testing behavior, and recovery after mistakes. Those are the skills that determine whether an agent helps in a real codebase or just produces plausible patches.

GLM-5.2 gets a different kind of proof point

GLM-5.2 does not lead DeepSWE. It is fifth in the displayed v1.1 table. But its row is still useful because it puts an open-weight model into the same operational comparison as the highest-scoring proprietary frontier models.

That is a better test than generic hype around open models. A team deciding whether to evaluate GLM-5.2 for coding-agent work can now ask specific questions: Is 44% pass@1 good enough for the class of tasks we can supervise? Does the lower average cost matter more than the lower pass rate? Are the extra agent steps acceptable inside our workflow? How does it behave on our repositories, languages, and test harnesses?

The answer may still be no for high-risk production changes. But DeepSWE gives teams a more grounded basis for reaching that answer than a leaderboard that reports only capability.

What to test next

Teams using coding agents should not copy DeepSWE’s ranking straight into procurement. They should use it to design a local evaluation. Pick representative issues, require the agent to run the real test suite, measure cost per accepted change, track how often humans need to intervene, and separate easy automation wins from changes that require architectural judgment.

The most useful local metric may be accepted fixes per dollar after review, not raw pass rate. DeepSWE’s June 20 update points in that direction by putting cost and behavior near the headline score.
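A local evaluation along these lines could be tallied with a few lines of code. The sketch below is a minimal illustration, not a DeepSWE tool: the `AgentRun` record, its field names, and the sample runs are all hypothetical, and a real team would populate them from its own issue tracker, CI results, and billing data.

```python
from dataclasses import dataclass


@dataclass
class AgentRun:
    """One agent attempt on a local issue. All fields are hypothetical."""

    issue_id: str
    cost_usd: float
    tests_passed: bool           # did the real test suite pass?
    accepted_after_review: bool  # did a human reviewer merge the change?
    human_interventions: int     # times a person had to step in


def summarize(runs: list[AgentRun]) -> dict[str, float]:
    """Aggregate raw pass rate, acceptance, cost, and intervention metrics."""
    if not runs:
        raise ValueError("need at least one run")
    total_cost = sum(r.cost_usd for r in runs)
    accepted = sum(r.accepted_after_review for r in runs)
    passed = sum(r.tests_passed for r in runs)
    return {
        "raw_pass_rate": passed / len(runs),
        "acceptance_rate": accepted / len(runs),
        "accepted_fixes_per_dollar": accepted / total_cost if total_cost else 0.0,
        "cost_per_accepted_fix": total_cost / accepted if accepted else float("inf"),
        "interventions_per_run": sum(r.human_interventions for r in runs) / len(runs),
    }


# Made-up sample data, for illustration only.
runs = [
    AgentRun("ISSUE-1", 4.00, True, True, 0),
    AgentRun("ISSUE-2", 6.00, True, False, 2),
    AgentRun("ISSUE-3", 5.00, False, False, 1),
    AgentRun("ISSUE-4", 5.00, True, True, 1),
]
summary = summarize(runs)
for metric, value in summary.items():
    print(f"{metric}: {value:.2f}")
```

In this made-up sample the agent passes tests on three of four issues, but only two changes survive review, so the cost per accepted fix is $10.00 even though the average run costs $5.00. The gap between raw pass rate and acceptance rate is the part a public leaderboard cannot show.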

The AI Feed Desk