FrontierCode is a benchmark from Cognition that evaluates whether an AI coding agent’s patch on a real open-source issue is good enough to merge, not just whether it passes tests. Each task pairs a checked-out repository with a single issue, and the agent works autonomously in a container. Patches are graded against held-out tests and a maintainer-authored rubric covering behavioral correctness, regression safety, build and style cleanliness, and adherence to project conventions. Tasks are split into nested difficulty subsets, with the Diamond subset holding the 50 hardest problems.
We source results from Cognition’s public FrontierCode data. Our chart reports the Diamond score: each model’s rubric score on the 50-task Diamond subset, taken at the reasoning effort where that model scores highest. Cognition also reports a separate binary pass rate, which we do not show.
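As a rough illustration of how one score per model is picked, the sketch below keeps the highest-scoring reasoning effort for each model. The row layout, the field names (`model`, `effort`, `diamond_score`), and the sample values are hypothetical placeholders, not Cognition’s export schema or real results.

```python
def best_effort_per_model(rows):
    """Return, for each model, the row with the highest Diamond score.

    `rows` is a list of dicts with hypothetical keys:
    "model", "effort", and "diamond_score".
    """
    best = {}
    for row in rows:
        current = best.get(row["model"])
        # Keep the first row seen for a model, then replace it only
        # when a strictly higher score turns up.
        if current is None or row["diamond_score"] > current["diamond_score"]:
            best[row["model"]] = row
    return best


# Placeholder data for illustration only.
sample_rows = [
    {"model": "model-a", "effort": "low", "diamond_score": 0.10},
    {"model": "model-a", "effort": "high", "diamond_score": 0.20},
    {"model": "model-b", "effort": "medium", "diamond_score": 0.15},
]

chart_rows = best_effort_per_model(sample_rows)
```

Ties keep the first row encountered; a real pipeline would need an explicit tie-breaking rule.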
FrontierCode grades each submission with a mean@5 aggregation against a weighted rubric, where failing any blocker criterion yields a zero. Models are run through agent harnesses such as Claude Code, Codex, Gemini CLI, mini-SWE-agent, and Devin; we keep each model’s harness and best-performing reasoning effort in the data export. Because the tasks come from real open-source repositories and are graded for mergeability rather than just test-passing, scores are low even for frontier models.
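The grading rule above can be sketched in a few lines. This is an illustration under stated assumptions, not Cognition’s grader: it assumes mean@5 means averaging the rubric score over five independent attempts at a task, and that a non-zero rubric score is the weighted fraction of criteria passed. The source does not spell out the exact formula, and the criterion fields (`weight`, `blocker`, `passed`) are hypothetical.

```python
from statistics import mean


def rubric_score(criteria):
    """Score one attempt against a weighted rubric.

    Each criterion is a dict with hypothetical keys "weight" (number),
    "blocker" (bool), and "passed" (bool). Failing any blocker
    criterion zeroes the whole attempt.
    """
    if any(c["blocker"] and not c["passed"] for c in criteria):
        return 0.0
    total = sum(c["weight"] for c in criteria)
    earned = sum(c["weight"] for c in criteria if c["passed"])
    # Assumed normalization: weighted fraction of criteria passed.
    return earned / total if total else 0.0


def mean_at_k(attempts, k=5):
    """Average the rubric score over exactly k attempts at one task."""
    if len(attempts) != k:
        raise ValueError(f"expected {k} attempts, got {len(attempts)}")
    return mean(rubric_score(criteria) for criteria in attempts)


def make_attempt(blocker_ok, style_ok):
    """Build a placeholder three-criterion rubric result."""
    return [
        {"weight": 2, "blocker": True, "passed": blocker_ok},
        {"weight": 1, "blocker": False, "passed": style_ok},
        {"weight": 1, "blocker": False, "passed": True},
    ]
```

Under these assumptions, an attempt that passes everything except one non-blocker criterion still earns partial credit, while an attempt that fails the blocker scores zero regardless of the rest. That all-or-nothing gate is one reason a rubric score can sit well below a plain test pass rate.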
A benchmark testing whether coding agents can produce mergeable fixes for real, hard open-source issues.