Cognition introduced FrontierCode, a benchmark for AI coding systems that asks a harder question than whether generated code passes tests: would a maintainer actually merge the pull request?
The benchmark surfaced in the live Hacker News AI sweep on June 29; Cognition’s source post provides the details. FrontierCode comprises 150 tasks in three nested difficulty subsets: Extended, Main, and Diamond. Diamond contains the 50 hardest tasks, Main contains the 100 hardest, and Extended contains all 150.
The benchmark reports pass rate and score. A solution passes only if it clears blocker criteria that a maintainer would treat as hard stops. If it does not clear those blockers, its score is zero. If it does pass, the score reflects weighted rubric items.
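The pass-then-score rule can be sketched in a few lines. This is only an illustration under stated assumptions: the `RubricItem` fields, the example weights, and the normalization of the score to a 0-to-1 range are hypothetical, since Cognition's post does not describe how its grader is implemented.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    # All field names here are hypothetical, chosen for illustration.
    name: str
    weight: float
    satisfied: bool
    blocker: bool = False  # a hard stop a maintainer would not waive


def score_solution(items):
    """Return (passed, score) for one submitted solution.

    Assumption: any failed blocker zeroes the score; otherwise the score
    is the weighted share of satisfied non-blocker rubric items.
    """
    if any(item.blocker and not item.satisfied for item in items):
        return False, 0.0
    graded = [item for item in items if not item.blocker]
    total = sum(item.weight for item in graded)
    if total == 0:
        return True, 1.0  # nothing left to grade beyond the blockers
    earned = sum(item.weight for item in graded if item.satisfied)
    return True, earned / total


clean_patch = [
    RubricItem("fixes reported bug", 0.0, True, blocker=True),
    RubricItem("follows repo conventions", 3.0, True),
    RubricItem("adds focused regression test", 1.0, False),
]
broken_patch = [
    RubricItem("fixes reported bug", 0.0, False, blocker=True),
    RubricItem("follows repo conventions", 3.0, True),
]

print(score_solution(clean_patch))   # (True, 0.75)
print(score_solution(broken_patch))  # (False, 0.0)
```

Under this sketch, a patch that is tidy but fails a blocker earns nothing, which is what separates this kind of scoring from a plain rubric average.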
That design is the point. The benchmark is not trying to measure whether an agent can satisfy a narrow verifier. It is trying to measure whether the code is good enough to merge.
Correctness is no longer enough
Earlier coding benchmarks made sense when models struggled to produce working patches at all. But as coding agents become more capable, functional correctness is only the first gate. A patch can pass a test and still be too broad, too brittle, poorly scoped, unmaintainable, or inconsistent with the codebase.
Cognition says FrontierCode evaluates behavioral correctness, regression safety, mechanical cleanliness, test correctness, scope, and code quality. It uses a mix of unit tests, command checks, reverse-classical tests, scope checks, adaptive grading, and prompt-based rubric review.
That mix is important because production code review is not a single yes-or-no test. Maintainers judge whether a change belongs in the codebase. They ask whether it solves the actual problem, whether it creates maintenance debt, whether tests prove the right behavior, and whether the patch touches only what it needs to touch.
That is exactly where many AI coding demos are weakest. A model can produce something plausible, even useful, while still failing the standards of a real repository.
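Of the checks listed above, the scope check is the easiest to picture. The sketch below is a guess at the general idea, not Cognition's implementation: the function name, the file paths, and the use of glob patterns to describe what a task may touch are all assumptions made for illustration.

```python
from fnmatch import fnmatchcase


def out_of_scope(changed_files, allowed_patterns):
    """Return the changed paths that match none of the allowed glob patterns.

    Assumption: a task declares, as glob patterns, the paths a patch
    may legitimately touch. Anything else counts as scope creep.
    """
    return [
        path
        for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in allowed_patterns)
    ]


# Hypothetical patch: two expected edits plus an unrelated README change.
changed = ["src/parser/lexer.py", "tests/test_lexer.py", "README.md"]
allowed = ["src/parser/*", "tests/*"]

print(out_of_scope(changed, allowed))  # ['README.md']
```

A check like this is cheap to automate, but it only flags where a patch went, not whether the edits inside the allowed paths were necessary.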
Maintainers are built into the benchmark
Cognition says 20-plus open-source developers built realistic coding tasks from repositories they maintain, spending more than 40 hours per task. Later in the post, Cognition says the work involved maintainers of 36 flagship open-source repositories.
That maintainer involvement is the benchmark’s strongest claim. The people defining “mergeable” are not merely writing synthetic tasks; they are encoding the judgment they use when reviewing real contributions.
The post also says every task is manually reviewed by a Cognition researcher and that FrontierCode has an 81% lower false positive rate than SWE-Bench Pro. Cognition argues that this makes the scores a stronger signal of whether a model can write high-quality, maintainable code.
That claim should still be read with care. Cognition both created the benchmark and reports the results, so it is a vendor-published artifact, not an independent, neutral standard. But the benchmark's direction is useful regardless: code quality has to become part of coding-agent evaluation.
The hardest subset is still mostly unsolved
The reported results are sobering. Cognition says Claude Opus 4.8 scores 13.4% on FrontierCode Diamond, the hardest subset. GPT-5.5 scores 6.3%, Gemini 3.1 Pro scores 4.7%, and other models score lower. Cognition also says GPT-5.5 uses as little as one quarter of the tokens Opus 4.8 does, giving it a better cost-intelligence tradeoff in that comparison.
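The tradeoff can be made concrete with rough arithmetic. The Diamond scores below come from the post; everything else is an assumption for illustration: the token gap is taken at its reported upper bound (a factor of four), and "score per relative token unit" is a metric invented here, not one Cognition reports.

```python
def score_per_token_unit(score_pct, relative_tokens):
    """Diamond score divided by token usage relative to GPT-5.5 (= 1.0)."""
    return score_pct / relative_tokens


OPUS_DIAMOND = 13.4  # percent, as reported
GPT_DIAMOND = 6.3    # percent, as reported

# Assumption: Opus 4.8 uses the full 4x tokens, the upper bound of the claim.
opus_efficiency = score_per_token_unit(OPUS_DIAMOND, 4.0)
gpt_efficiency = score_per_token_unit(GPT_DIAMOND, 1.0)

# Token ratio at which the two models would tie on this invented metric.
break_even_ratio = OPUS_DIAMOND / GPT_DIAMOND

print(round(opus_efficiency, 2))   # 3.35
print(round(gpt_efficiency, 2))    # 6.3
print(round(break_even_ratio, 2))  # 2.13
```

On these assumptions, GPT-5.5 comes out ahead per token whenever Opus 4.8 uses more than roughly 2.1 times as many tokens, while Opus 4.8 still has the higher absolute score.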
On easier subsets, Cognition reports Opus 4.8 at 34.3% on Main and 51.8% on Extended. It also says Kimi K2.6 is the best-performing open-source model in its run, with 3.8% on Diamond, 16% on Main, and 37% on Extended.
The practical read is not that one model wins forever. It is that mergeable production code remains difficult even for leading models. If a benchmark includes scope, maintainability, test quality, and repository standards, the scores look much less saturated than on simpler correctness tests.
Teams should read benchmarks by failure mode
For engineering teams, the value of FrontierCode is not only the ranking. It is the failure taxonomy. If an agent fails because it writes bad tests, over-edits files, misses style conventions, or passes incomplete tests with a wrong solution, the mitigation is different in each case.
That is how coding-agent evaluation needs to mature. Teams should not ask only which model scored highest. They should ask which failure modes matter in their repositories, which checks can be automated, where human review is still essential, and how much token cost is justified for better code quality.
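A team could start on this with a simple tally of why runs failed in its own repositories. The sketch below assumes a hypothetical record format and hypothetical failure labels, loosely named after the examples above. It does not reflect any data or schema from FrontierCode.

```python
from collections import Counter


def tally_failure_modes(run_results):
    """Count failure modes across agent runs, most frequent first.

    Assumption: each run is a dict whose "failure_mode" is None when the
    patch was accepted, or a short label when it was rejected.
    """
    counts = Counter(
        run["failure_mode"]
        for run in run_results
        if run["failure_mode"] is not None
    )
    return counts.most_common()


# Hypothetical review outcomes for six agent-written patches.
runs = [
    {"task": "t1", "failure_mode": "over_edited_files"},
    {"task": "t2", "failure_mode": None},
    {"task": "t3", "failure_mode": "bad_tests"},
    {"task": "t4", "failure_mode": "over_edited_files"},
    {"task": "t5", "failure_mode": "over_edited_files"},
    {"task": "t6", "failure_mode": "bad_tests"},
]

print(tally_failure_modes(runs))
# [('over_edited_files', 3), ('bad_tests', 2)]
```

Even a crude count like this shows whether the next investment should go into automated scope checks, stricter test review, or something else.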
The benchmark is a reminder that AI code is not done when CI turns green. In real software, the merge button is a judgment call.