FrontierSWE evaluates coding agents on hard, long-horizon software engineering tasks collected from real-world technical domains. The benchmark covers implementation tasks, performance engineering tasks, and research tasks. The public leaderboard reports average rank across tasks and a dominance metric: the win rate against a random opponent on a task.
We source results from the public FrontierSWE leaderboard. Our chart defaults to the leaderboard’s dominance metric. We also keep the leaderboard’s average rank and category-specific ranks for implementation, performance, and research tasks in the data export.
FrontierSWE evaluates coding agents on long-horizon software engineering tasks, each intended to require a substantial amount of work. The tasks are grouped into implementation, performance, and research categories, and submissions are run through agent harnesses such as Claude Code, Codex, Gemini CLI, and Kimi CLI. The leaderboard aggregates repeated runs using its Mean@5 setting and reports both average rank and category-specific ranks. We use dominance as the headline metric because it summarizes, in a single number, how often a system would beat another randomly selected system on a task.
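To make the two summary metrics concrete, here is a minimal Python sketch of how dominance and average rank could be computed from per-task scores. It is an illustration only, not the leaderboard's actual implementation, and it rests on several assumptions: each score is already aggregated over repeated runs, higher is better, a tie counts as half a win, and tied systems share the midpoint rank. The system names, task names, and scores are hypothetical.

```python
import math

# Hypothetical per-task scores, assumed already averaged over repeated runs.
# Layout: scores[system][task] = score, where higher is better (assumption).
scores = {
    "agent_a": {"t1": 0.8, "t2": 0.5},
    "agent_b": {"t1": 0.6, "t2": 0.5},
    "agent_c": {"t1": 0.2, "t2": 0.9},
}


def dominance(scores):
    """Win rate against a random opponent on a random shared task.

    Assumption: a tie counts as half a win.
    """
    result = {}
    for system, own in scores.items():
        wins = 0.0
        comparisons = 0
        for opponent, theirs in scores.items():
            if opponent == system:
                continue
            # Compare only on tasks that both systems attempted.
            for task in own.keys() & theirs.keys():
                comparisons += 1
                if own[task] > theirs[task]:
                    wins += 1.0
                elif own[task] == theirs[task]:
                    wins += 0.5
        result[system] = wins / comparisons if comparisons else math.nan
    return result


def average_rank(scores):
    """Mean per-task rank, where rank 1 is best.

    Assumption: tied systems share the midpoint of the ranks they span.
    """
    ranks = {name: [] for name in scores}
    tasks = set().union(*(s.keys() for s in scores.values()))
    for task in tasks:
        entrants = [(name, s[task]) for name, s in scores.items() if task in s]
        for name, value in entrants:
            better = sum(1 for _, v in entrants if v > value)
            tied = sum(1 for _, v in entrants if v == value) - 1
            ranks[name].append(1 + better + tied / 2)
    return {name: sum(r) / len(r) for name, r in ranks.items() if r}


print(dominance(scores))
print(average_rank(scores))
```

In this toy data, agent_a has the highest dominance because it wins or ties most pairwise comparisons, even though agent_c takes first place on one task. Under the half-win tie rule, dominance values across all systems average to 0.5, which makes the metric easy to read as "better or worse than a typical opponent".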
An ultra-long-horizon software engineering benchmark testing coding agents on implementation, performance, and research tasks.