Benchmarking AI on unsolved math problems
This work was supported by a grant from Schmidt Sciences.
AI math capabilities have come far and fast. In mid-2024, high school math was still a challenge. By the end of 2025, AI systems were solving extremely hard problems designed to be solvable only by top human experts. As we write this, it seems likely that AI systems will soon be able to solve problems that no human has solved before.
Indeed, there are already glimmers of this, e.g. AI systems solving several previously-unsolved Erdős problems. However, these results have been difficult to contextualize. Are these problems mathematically significant? How hard had humans previously tried to solve them? Does this say anything new about AI capabilities?
Today we are releasing a pilot version of a new benchmark, FrontierMath: Open Problems, which we hope will shed light on this topic. The benchmark consists of open problems from research mathematics that professional mathematicians have tried and failed to solve. To facilitate evaluation at scale, we include only problems for which proposed solutions can be verified programmatically.
At the time of release, to the best of our knowledge, none of these problems has been solved by humans or AI systems. If an AI system solves any of them, it will be a meaningful advance at the frontier of human knowledge. Moreover, we will be able to say something about just how meaningful: contributing mathematicians have characterized the significance of the problems, ranging from moderately interesting results to major breakthroughs.
The pilot release consists of 14 problems with:
- Write-ups explaining the problem’s significance and difficulty
- Precise prompts that can be given to AI systems – try them yourself!
- Initial attempts by AI systems to solve the problems
We will add more problems in the coming months and are actively commissioning contributions. See our problem proposal form if you are interested in contributing.
As for the verifiers — the programs that evaluate candidate solutions — we offer access for a fee. We structure access this way to help defray costs of expanding the benchmark. Problems are labor intensive to create and each solution effectively reduces the number of problems in the benchmark, so we ask that those who wish to use the verifiers partner with us to help fund further expansion. We commit to grant access uniformly, and not exclusively to any entity. Inquire at math@epoch.ai.
Example problems
Combinatorics
Moderately interesting: Let $B_n$ denote a triangular book graph on $n$ vertices. For a given $n$, construct a graph showing that the Ramsey number $R(B_{n-1}, B_n)$ is greater than $4n - 2$.
A general construction establishing this lower bound on $R(B_{n-1}, B_n)$ would be of interest to computational graph theorists. Such a construction would likely be useful for proving bounds on other values of $R(B_m, B_n)$, and possibly on other Ramsey numbers as well.
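To give a sense of what automatic verification looks like for a problem of this kind, the sketch below shows how a candidate coloring could be checked. It is illustrative rather than the actual verifier for this problem, and it assumes the convention that $B_m$ is the book with a spine edge and $m - 2$ triangular pages (so $m$ vertices in total); the function names and the edge-coloring dictionary format are likewise only for illustration.

```python
# Illustrative check (not the actual benchmark verifier): a red/blue edge
# coloring of K_{4n-2} witnesses R(B_{n-1}, B_n) > 4n - 2 if it contains
# no red B_{n-1} and no blue B_n.  Here B_m is taken to be the book graph
# with a spine edge plus m - 2 triangular pages, i.e. m vertices in total.
from itertools import combinations

def has_mono_book(coloring, num_vertices, color, m):
    """True if some edge of the given color has at least m - 2 common
    neighbors joined to both endpoints in that color (a monochromatic B_m)."""
    for u, v in combinations(range(num_vertices), 2):
        if coloring[(u, v)] != color:
            continue
        pages = sum(
            1
            for w in range(num_vertices)
            if w != u and w != v
            and coloring[(min(u, w), max(u, w))] == color
            and coloring[(min(v, w), max(v, w))] == color
        )
        if pages >= m - 2:
            return True
    return False

def certifies_lower_bound(coloring, n):
    """coloring maps each sorted vertex pair of K_{4n-2} to 'red' or 'blue'."""
    N = 4 * n - 2
    if any(coloring[(u, v)] not in ("red", "blue") for u, v in combinations(range(N), 2)):
        return False
    return (not has_mono_book(coloring, N, "red", n - 1)
            and not has_mono_book(coloring, N, "blue", n))
```

The whole check is a straightforward cubic-time scan over vertex triples, which is the sense in which a proposed construction is cheap to verify even though finding one is hard.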
Algebraic geometry
Solid result: Construct a family of explicit normal projective KLT del Pezzo surfaces $X$ over an algebraically closed field of characteristic $3$ with Picard number $1$ and arbitrarily many singular points.
A construction in characteristic $3$ would uncover a new small-characteristic phenomenon and would matter for the broader program of understanding Fano varieties and the MMP in positive characteristic.
Number theory
Major advance: Find a degree $23$ polynomial in $\mathbb{Z}[x]$ whose splitting field over $\mathbb{Q}$ has Galois group $M_{23}$.
The inverse Galois problem is one of the most basic open questions in number theory, asking for the construction of polynomials with prescribed symmetries. This is the minimal case for which the problem is open; it is especially interesting because it is the last of the sporadic simple groups for which no construction is known.
Topology
Breakthrough: Give an algorithm that takes a knot as input and determines whether the unknotting number of the knot is equal to one.
This problem is one of the fundamental questions in low-dimensional topology. It asks a basic question about how hard it is to trivialize a knot. A solution would be a major result in knot theory.
Problems are mathematically meaningful, diverse, and hard
Problems are contributed by professional mathematicians. Contributors suggest problems they are familiar with from their own research, and which they would be interested to see solved. Contributors rate the meaningfulness (notability) of solutions, ranging from results of moderate interest within a subfield all the way up to major breakthroughs. We aim for problems to be evenly distributed across these tiers.
Our goal is to find problems that are meaningful to mathematicians on their own terms. We do not select problems to be difficult for AI per se. Unlike problems contrived for a test, these problems are core to doing research math.[1] We want to know whether AI systems can solve them: if they can, so be it.
That said, at least for humans, the problems are hard. Contributors estimate how many mathematicians have made serious attempts to solve each problem, with responses ranging from 2–4 mathematicians to 50–100.
Contributors also estimate human time-to-solve: how long it would take the mathematician most capable of solving the problem to have a 50% chance of solving it, working full time. Responses range from 1–4 weeks to 3–10 years.[2] In other words, the human baseline is high.
Problems cover a range of mathematical topic areas. The pilot set has a tilt toward combinatorics and number theory, where we happen to have found the most problems amenable to automatic verification. We aim to keep up the diversity of topic areas as we expand the benchmark.
See the problem proposal form for more details on the problem sourcing process.
Solutions can be verified automatically
Evaluating AI solutions to unsolved math problems is a major logistical challenge. Math research typically proceeds via natural-language papers. Evaluating such papers is labor intensive and error-prone even for humans. While AI systems have made progress at evaluating natural-language mathematics, we cannot rely on the accuracy of their evaluations for advanced material.[3]
Our approach is to find problems where, even though no solution is currently known, a proposed solution can be checked by a relatively straightforward computer program running on a typical computer. It is not obvious that such verifiable problems exist, but they do.
For example, some problems ask for a very concrete mathematical object. One such case asks for a polynomial with a certain property.[4] It is quick to check whether a given polynomial has the desired property, but finding one is beyond the reach of any known technique, including highly optimized, large-scale search. The problem’s meaningfulness stems from the fact that a conceptual approach appears to be required to construct the desired object.
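As an illustration, here is the kind of quick (but far from complete) check one can run on a candidate for the $M_{23}$ problem above. This is not the actual verifier, and it does not prove that the Galois group is $M_{23}$; it only screens for cheap necessary conditions: irreducibility, a square discriminant (since $M_{23} \subseteq A_{23}$), and factorization patterns modulo small primes whose factor degrees must have an lcm equal to the order of some element of $M_{23}$. The function name, interface, and choice of the SymPy library are only for illustration.

```python
# Screening check for a candidate M_23 polynomial: necessary conditions only,
# not a proof that the Galois group is M_23 (and not the benchmark's verifier).
from math import isqrt, lcm
from sympy import Poly, discriminant, primerange, symbols

# Orders of elements of the Mathieu group M_23.
M23_ELEMENT_ORDERS = {1, 2, 3, 4, 5, 6, 7, 8, 11, 14, 15, 23}

def screen_m23_candidate(coeffs, prime_bound=200):
    """coeffs: integer coefficients of f, highest degree first."""
    x = symbols("x")
    f = Poly(coeffs, x)
    if f.degree() != 23 or not f.is_irreducible:
        return False
    disc = int(discriminant(f.as_expr(), x))
    # M_23 is contained in A_23, so the discriminant must be a nonzero square.
    if disc <= 0 or isqrt(disc) ** 2 != disc:
        return False
    for p in primerange(2, prime_bound):
        if disc % p == 0 or f.LC() % p == 0:
            continue  # skip ramified primes and primes dividing the leading coefficient
        _, factors = Poly(f.as_expr(), x, modulus=p).factor_list()
        degrees = [g.degree() for g, _ in factors]
        # The factor degrees mod p give the cycle type of a Frobenius element,
        # whose order (the lcm of the degrees) must occur in M_23.
        if lcm(*degrees) not in M23_ELEMENT_ORDERS:
            return False
    return True  # consistent with M_23 so far; a full verification needs much more
```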
In other cases, we want a construction that works for all positive integers. We can’t verify this in general, but we can ask for an algorithm that takes an integer and returns a construction for that integer. We can verify the algorithm on a challenge set of integers where no constructions are currently known and where the integers are large enough that search is intractable. Success here gives strong evidence that the algorithm implements a general solution.
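The harness below sketches this evaluation pattern. It is not our verifier code: `construct` stands in for the submitted algorithm, `verify_instance` for the per-instance check, and the challenge integers, time limits, and sandboxing in the real verifiers will differ.

```python
# Sketch of the challenge-set evaluation pattern described above (not the
# benchmark's code).  `construct` is the submitted algorithm (integer ->
# candidate object) and `verify_instance` checks one candidate; both are
# placeholders supplied by the caller.
import signal

class _Timeout(Exception):
    pass

def _alarm_handler(signum, frame):
    raise _Timeout()

def evaluate_on_challenge_set(construct, verify_instance, challenge_ns,
                              per_instance_seconds=3600):
    """Accept only if every held-out challenge integer is solved within the
    time limit.  Uses SIGALRM for the per-instance timeout, so Unix-only."""
    signal.signal(signal.SIGALRM, _alarm_handler)
    for n in challenge_ns:
        signal.alarm(per_instance_seconds)
        try:
            ok = verify_instance(n, construct(n))
        except Exception:  # timeout, or any crash in the submitted algorithm
            ok = False
        finally:
            signal.alarm(0)  # cancel any pending alarm
        if not ok:
            return False
    return True
```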
The downside is that this approach limits what we can ask about. Our ideal would be to take a random sample of all unsolved math problems, but this constraint introduces a bias. The benchmark problems tend to be relatively concrete, and may not require fuzzier mathematical pursuits like “theory building”. Still, we have been pleasantly surprised by how readily mathematicians have been able to come up with a diversity of mathematically meaningful problems that satisfy this constraint.[5]
Some benchmark problems may not be solvable
One risk inherent to this benchmark is that problems might not have solutions as stated. This issue takes two forms: either the desired mathematical object does not exist, or it does exist but is too large for the verifier to certify it as valid.
We don’t think such cases undermine interpretation of the benchmark overall. Successful solves are clearly meaningful. A failure to solve any of the problems at a given difficulty level is also probably meaningful, and will become more so as we grow the problem set beyond the current pilot size. We thus encourage interpretations of the form, “We have seen several examples of AI solving moderately interesting open math problems, but not yet any examples of major advances.”
That said, we strove to include only problems where a solution is at least likely to exist. For some problems there are heuristic reasons to believe that a solution of the desired form exists, and, in all cases, there is at least an absence of any reason to believe that no solution exists.[6]
Our target was an assessment from the problem contributor of at least an 80% probability that the problem was solvable as stated.[7] However, mathematicians often expressed high uncertainty when assigning a probability in the 50–80% range.
Solved problems will be removed from the benchmark
If a problem is solved — whether by humans or AI — the result will be published.[8] Thus, any future AI system attempting the same problem would simply be able to look up the solution. For this reason, we will remove solved problems from the benchmark.
While this “first-to-solve” set-up is atypical, we don’t think it undermines the value of the Open Problems benchmark as a whole. The benchmark won’t give a score that can be used to compare models’ ability to solve open problems, but it will tell us whether AI systems are capable of solving problems of a given difficulty and significance.
The benchmark helps track AI “research taste”
Most immediately, this benchmark asks whether AI can solve unsolved math problems. We think this also helps us track fuzzier concepts like “research taste” — how good AI systems are at choosing the right direction to pursue, noticing the right patterns, etc.
Such capabilities seem likely to be relevant for theoretical math, where the right ideas can be especially difficult to find. If AI systems are able to solve math problems that have resisted significant human effort, then they might be moving toward superhuman research taste in general.
It’s by no means a guarantee. Perhaps, like chess or Go, it will turn out that math’s formal nature makes it unusually “easy” for AI systems. Or perhaps AI systems will solve these problems in ways that we regard as relatively inelegant.[9] Still, we are glad to add this benchmark to our toolbox for tracking harder-to-quantify capabilities.
We hope to see strong attempts to get AI systems to solve these problems
Our primary goal is to understand the frontier of AI math capabilities. However, it is not clear how best to elicit AI capabilities on the benchmark problems.
So far we have tried simply prompting GPT-5.2 Pro and Gemini 3 Deep Think in their web apps.[10] The results are shown on the individual problem pages. In this setting the models are generally capable of solving “warm-up” problems: variants of the open problems where solutions are known. This tells us that they understand the instructions and are familiar with the subject area. It also helps test the verifiers.
But, when given the actual open problems, the models don’t show much promise. Sometimes they seem set on trying optimization approaches instead of the more conceptual approaches that are likely necessary. Other times they recognize the problem as open and simply give up.
It seems likely that more “thinking” will be an important ingredient. Models already plan, execute, revise, and iterate — but they may need far more time and resources to do so if they are going to have a chance at these problems. Exactly how to enable this kind of extended effort is an open research question.[11]
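One natural shape for that kind of extended effort is a long-running propose, check, and revise loop wrapped around a problem’s verifier. The sketch below shows only the general pattern, not the scaffold we are building; `ask_model` and `run_verifier` are hypothetical placeholders for a model API and a problem verifier.

```python
# Illustrative propose-check-revise loop for an extended attempt.  `ask_model`
# and `run_verifier` are hypothetical placeholders, not real interfaces from
# this project; the point is only the overall shape of the loop.

def extended_attempt(problem_prompt, ask_model, run_verifier, max_iterations=1000):
    transcript = [problem_prompt]
    for i in range(max_iterations):
        # Ask for the next candidate, conditioned on every previous failure.
        candidate = ask_model("\n\n".join(transcript))
        accepted, feedback = run_verifier(candidate)
        if accepted:
            return candidate
        # Feed the verifier's output back so later attempts can build on it.
        transcript.append(f"Attempt {i + 1} was rejected: {feedback}")
    return None  # budget exhausted without an accepted solution
```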
We’re working on a scaffold that can facilitate this sort of extended attempt, and we hope others will try as well. Contact us at math@epoch.ai with any questions.
Appendix: caveats on future AI solutions
This benchmark is essentially a preregistration of the interestingness of a set of math problems. That said, if these problems end up being solved in certain ways, some caveats may be warranted. Here we try to preregister these caveats as well, to help mitigate concerns about moving the goalposts later.[12]
Collaborations. There have already been productive mathematical collaborations between humans and AI systems. A typical pattern is for the AI system to work out some examples, and for the human to generalize to a complete solution. Indeed, using computers to search for useful examples long predates LLM-based AI systems. We will have to assess the division of labor behind any such solution. The more an AI system is responsible for the conceptual part, the more that will be indicative of advancing capabilities.
Prior art. AI systems’ breadth of mathematical knowledge may already outstrip that of top humans. It is possible that an existing result has already done most of the work to solve a problem and the mathematicians who attempted the problem were simply unaware of it. This is unlikely for better-known problems, but, in any case, an AI solution relying on such a result would be significantly less indicative of advancing capabilities. Of course, if an AI system adapts known results in new ways, that needs no caveat: much of math research has this character.
Conventional compute. If an AI system suggests an optimized, parallelizable search algorithm and an AI company devotes a supercomputer-month to executing the search, then a problem could be solved with less mathematical insight than expected. While we selected problems where brute-force alone is unlikely to be successful, it is hard to be certain. Most problems have not had industrial-scale resources brought to bear on them.
Verifier misspecification. A verifier may accept an AI solution even though the solution does not represent the broader conceptual achievement that the contributing mathematician intended the verifier to identify. In simple cases these misspecifications will be no more than bugs. In more complex cases, it may be that a solution was harder to verify than the mathematician initially believed. We will report on any such cases and fix the verifiers when possible.
Sample bias. We must contend with the bias introduced by the verifiability criterion. Previous caveats aside, progress on this benchmark really should amount to solving meaningful open math problems. But it may turn out that AI systems are uniquely suited to solving the sort of meaningful open math problems that are amenable to automatic verification. If so, progress on the benchmark may not generalize well to progress in other areas of math.
Notes
1. In other words, this benchmark has high construct validity.
2. Mathematicians generally emphasized that these time-to-solve guesses were, quite possibly, no better than noise. However, we believe that the large range gives us at least a bit of information.
3. See, e.g., here for work on AI systems grading prose proofs.
4. Namely, a polynomial whose Galois group is the Mathieu group $M_{23}$. Mathematicians have tried and failed to find such a polynomial, but still believe that one exists.
5. A different approach would have been to pursue full formalization, most likely asking AI systems to implement solutions in Lean. We decided not to take this approach for three reasons, each having to do with the fact that Lean is still maturing as a platform. First, many subfields’ foundations are not yet formalized in Lean. Second, even if a problem statement can be formalized, it is possible that the concepts required to solve it cannot be. Third, Lean is far less battle-tested than other programming paradigms. In particular, there are likely still a fair number of inscrutable bugs which models could exploit. For now we prefer the simplicity of one-off verification programs, even if in the long run Lean (or another formal system) may become a more practical and scalable solution.
6. In particular, we don’t include problems asking for counterexamples to conjectures where mathematicians generally believe that the conjecture is true. For example, exhibiting an even number that cannot be written as the sum of two prime numbers would disprove Goldbach’s conjecture. Such a counterexample would be easy to verify, at least if not astronomically large. But if mathematicians are correct in believing that Goldbach’s conjecture is true, then no such counterexample exists. Failure of an AI system to find such a counterexample would tell us nothing.
7. This included consideration of any restrictions on solution size necessitated for the sake of making the problem automatically verifiable. For example, not just that such-and-such a mathematical object exists, but that one exists that is small enough for the verifier to handle.
8. Indeed, a condition for any entity purchasing access to the verifiers is that they must notify Epoch and the problem contributor about any success from the verifier. They are given joint publication rights of such a solution with the problem contributor and Epoch. Note that problem contributors are not restricted in any way from pursuing their research, including on the problems they have contributed to the benchmark.
9. Not that this would invalidate the solutions. And isn’t beauty truth, truth beauty?
10. Simple prompting in web apps is often sufficient to elicit state-of-the-art performance, e.g. from Gemini 2.5 Deep Think and GPT-5.2 Pro.
11. One whimsical prompt used in AlphaEvolve tells the model to believe in itself. Who knows!
12. To paraphrase Douglas Adams: We love goalposts. We love the whooshing noise they make as they go by.