FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI

Published

Nov 08, 2024

Authors

The FrontierMath Benchmark

FrontierMath is a benchmark of hundreds of original mathematics problems spanning the breadth of modern mathematical research. These range from computationally intensive problems in number theory and real analysis to abstract questions in algebraic geometry and category theory. We developed it through collaboration with over 60 mathematicians from leading institutions, including professors, IMO question writers, and Fields medalists.

FrontierMath problems typically demand hours or even days for specialist mathematicians to solve. The following Fields Medalists shared their impressions after reviewing some of the research-level problems in the benchmark:

“These are extremely challenging. I think that in the near term basically the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages…” —Terence Tao, Fields Medal (2006)

“[The questions I looked at] were all not really in my area and all looked like things I had no idea how to solve…they appear to be at a different level of difficulty from IMO problems.” — Timothy Gowers, Fields Medal (1998)

We release representative samples of the benchmark problems, which you may download here.

Sample problem 1: Testing Artin’s primitive root conjecture

Definitions. For a positive integer \( n \), let \( v_p(n) \) denote the largest integer \( v \) such that \( p^v \mid n \). For a prime \( p \) and \( a \not\equiv 0 \pmod{p} \), let \( \text{ord}_p(a) \) denote the smallest positive integer \( o \) such that \( a^o \equiv 1 \pmod{p} \). For \( x > 0 \), let

\[ \text{ord}_{p, x}(a) = \prod_{\substack{q \le x \\ q \text{ prime}}} q^{v_q(\text{ord}_p(a))} \prod_{\substack{ q > x \\ q \text{ prime} } } q^{v_q(p-1)}. \]

Problem. Let \( S_x \) denote the set of primes \( p \) for which

\[ \text{ord}_{p, x}(2) > \text{ord}_{p, x}(3), \]

and let \( d_x \) denote the density

\[ d_x = \frac{|S_x|}{|\{p \le x : p \text{ is prime}\}|} \]

of \( S_x \) in the primes. Let

\[ d_{\infty} = \lim_{x \to \infty} d_x. \]

Compute \( \lfloor 10^6 d_{\infty} \rfloor \).

Answer: 367707

MSC classification: 11 Number theory
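
The definitions in this problem translate directly into code. The following is a minimal Python sketch, not the authors' solution: it computes \( v_q(n) \), \( \text{ord}_p(a) \), and \( \text{ord}_{p,x}(a) \), then estimates the proportion of primes \( 5 \le p \le B \) with \( \text{ord}_{p,x}(2) > \text{ord}_{p,x}(3) \) for a fixed cutoff \( x \). All function names and the finite bound `bound` (standing in for \( B \)) are assumptions introduced for illustration. A finite sample only approximates the density and does not by itself yield \( d_{\infty} \).

```python
def prime_factors(n):
    """Return {prime: exponent} for an integer n >= 1, by trial division."""
    factors = {}
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors[d] = factors.get(d, 0) + 1
            n //= d
        d += 1
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors


def is_prime(n):
    return n >= 2 and prime_factors(n) == {n: 1}


def valuation(q, n):
    """v_q(n): the largest v such that q**v divides n (n >= 1)."""
    v = 0
    while n % q == 0:
        n //= q
        v += 1
    return v


def mult_order(a, p):
    """ord_p(a) for a prime p that does not divide a."""
    order = p - 1  # ord_p(a) always divides p - 1
    for q, e in prime_factors(p - 1).items():
        for _ in range(e):
            if pow(a, order // q, p) != 1:
                break
            order //= q
    return order


def ord_px(p, x, a):
    """ord_{p,x}(a): agrees with ord_p(a) at primes q <= x and with p - 1 at q > x."""
    order = mult_order(a, p)
    result = 1
    # Only primes dividing p - 1 can contribute a nontrivial factor.
    for q, e in prime_factors(p - 1).items():
        result *= q ** (valuation(q, order) if q <= x else e)
    return result


def empirical_fraction(x, bound):
    """Share of primes 5 <= p <= bound with ord_{p,x}(2) > ord_{p,x}(3)."""
    primes = [p for p in range(5, bound + 1) if is_prime(p)]
    hits = sum(ord_px(p, x, 2) > ord_px(p, x, 3) for p in primes)
    return hits / len(primes)


if __name__ == "__main__":
    print(empirical_fraction(x=50, bound=5000))
```

Primes 2 and 3 are skipped because the order of 2 or 3 is undefined modulo them. Trial division keeps the sketch dependency-free, at the cost of being slow for large bounds.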

Sample problem 2: Find the degree 19 polynomial

Construct a degree 19 polynomial \(p(x) \in \mathbb{C}[x]\) such that \(X:=\{p(x) = p(y)\} \subset \mathbb{P}^1 \times \mathbb{P}^1\) has at least 3 (but not all linear) irreducible components over \(\mathbb{C}\). Choose \(p(x)\) to be odd, monic, have real coefficients and linear coefficient -19 and calculate \(p(19)\).

Answer: 1876572071974094803391179

MSC classification: 14 Algebraic geometry; 20 Group theory and generalizations; 11 Number theory
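
One candidate that satisfies the stated normalizations is the degree-19 Dickson polynomial \( D_{19}(x) = 2\,T_{19}(x/2) \), a rescaled Chebyshev polynomial defined by \( D_0 = 2 \), \( D_1 = x \), and \( D_n = x D_{n-1} - D_{n-2} \). Treating it as the intended polynomial is an assumption here, since the post does not state the solution. The Python sketch below builds the coefficients, checks that the polynomial is monic, odd, and has linear coefficient -19, and evaluates it at 19 for comparison with the listed answer. It does not check the condition on irreducible components, and the function names are illustrative.

```python
def dickson(n):
    """Integer coefficients of D_n, lowest degree first (D_0 = 2, D_1 = x)."""
    prev, cur = [2], [0, 1]
    if n == 0:
        return prev
    for _ in range(n - 1):
        shifted = [0] + cur                               # x * D_k
        padded = prev + [0] * (len(shifted) - len(prev))  # D_{k-1}, same length
        prev, cur = cur, [s - t for s, t in zip(shifted, padded)]
    return cur


def evaluate(coeffs, x):
    """Horner evaluation with exact integer arithmetic."""
    value = 0
    for c in reversed(coeffs):
        value = value * x + c
    return value


coeffs = dickson(19)
is_monic = len(coeffs) == 20 and coeffs[19] == 1
is_odd = all(c == 0 for c in coeffs[0::2])  # no even-degree terms
linear_coefficient = coeffs[1]
value_at_19 = evaluate(coeffs, 19)

LISTED_ANSWER = 1876572071974094803391179  # from the problem statement
print(is_monic, is_odd, linear_coefficient, value_at_19 == LISTED_ANSWER)
```

The family is a natural candidate because, for odd \( n \), the difference \( T_n(x) - T_n(y) \) factors over the reals into \( x - y \) times \( (n-1)/2 \) quadratic factors, which gives several components that are not all linear.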

Sample problem 3: Prime field continuous extensions

Let \(a_n\) for \(n\in\mathbb Z\) be the sequence of integers satisfying the recurrence formula \[ \begin{aligned} a_n = (1.981 \times 10^{11})a_{n-1} + (3.549 \times 10^{11})a_{n-2} \\ \hspace{0.4cm} + (4.277 \times 10^{11})a_{n-3} + (3.706 \times 10^8)a_{n-4} \end{aligned} \] with initial conditions \(a_i=i\) for \(0\leq i\leq 3\). Find the smallest prime \(p\equiv4\bmod7\) for which the function \(\mathbb Z\to\mathbb Z\) given by \(n\mapsto a_n\) can be extended to a continuous function on \(\mathbb Z_p\).

Answer: 9811

MSC classification: 11 Number theory

Each problem is carefully designed to test genuine mathematical understanding. Problems must be novel and unpublished, with answers that can be automatically verified through computation, either as exact integers or as mathematical objects such as matrices and symbolic expressions in SymPy. A verification script checks each submission, either through exact matching or by confirming that the submitted answer matches the known solution.
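
The actual verification script is not shown in this post. As a rough illustration of the two checking modes described above, here is a minimal Python sketch using SymPy. The function name `answers_match` and the choice of `simplify(difference) == 0` as the symbolic test are assumptions, not the benchmark's implementation.

```python
import sympy as sp


def answers_match(submitted, reference):
    """Hypothetical checker: exact match for ints, symbolic comparison otherwise."""
    # Integer answers: exact comparison only (bool is deliberately excluded).
    if isinstance(reference, int) and not isinstance(reference, bool):
        return (
            isinstance(submitted, int)
            and not isinstance(submitted, bool)
            and submitted == reference
        )
    # Matrix answers: same shape and every entry of the difference simplifies to 0.
    if isinstance(reference, sp.MatrixBase):
        if not isinstance(submitted, sp.MatrixBase) or submitted.shape != reference.shape:
            return False
        return all(sp.simplify(entry) == 0 for entry in (submitted - reference))
    # Symbolic expressions: the difference must simplify to 0.
    return sp.simplify(sp.sympify(submitted) - sp.sympify(reference)) == 0


x = sp.symbols("x")
print(answers_match(367707, 367707))
print(answers_match(sp.sin(x) ** 2 + sp.cos(x) ** 2, sp.Integer(1)))
```

A simplification-based test can fail to recognize two equal expressions, so a production checker would likely need a canonical answer format or additional checks.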

They are also designed to be “guessproof”—problems have large numerical answers or complex mathematical objects as solutions, with less than a 1% chance of guessing correctly without the mathematical work. Problems are reviewed specifically for this property, with reviewers checking that shortcuts or pattern matching generally cannot bypass the need for genuine understanding.

Each problem undergoes peer review by expert mathematicians who verify correctness, check for ambiguities, and assess difficulty ratings. Additionally, we conducted second reviews on a random subsample of problems, finding that approximately 1 in 20 problems had errors requiring correction—comparable to error rates in other major machine learning benchmarks like ImageNet. We recognize the importance of benchmark accuracy and are expanding both our expert review process and error-bounty program to reduce this error rate.

Current Performance on FrontierMath

To evaluate how well current AI models can tackle advanced mathematical problems, we provided them with extensive support to maximize their performance. Our evaluation framework grants models ample thinking time and the ability to experiment and iterate. Models interact with a Python environment where they can write and execute code to test hypotheses, verify intermediate results, and refine their approaches based on immediate feedback.
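
The post does not describe how the Python environment is implemented. As a minimal sketch of one building block such a framework might use, the function below runs a model-written snippet in a fresh interpreter with a time limit and returns its output as feedback. The name `run_snippet`, the result format, and the 30-second default are all hypothetical, and this is not a security sandbox.

```python
import subprocess
import sys


def run_snippet(code, timeout_s=30.0):
    """Run `code` in a separate Python process and capture what it prints."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "stdout": "", "stderr": f"timed out after {timeout_s} s"}
    return {"ok": proc.returncode == 0, "stdout": proc.stdout, "stderr": proc.stderr}


result = run_snippet("print(pow(2, 10, 1000))")
print(result["ok"], result["stdout"].strip())
```

Returning both standard output and the error stream lets the caller pass tracebacks back to the model, so it can revise its approach on the next turn.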


Figure 2. Performance of leading language models on FrontierMath. All models show consistently poor performance, with even the best models solving less than 2% of problems.

Despite this support framework, FrontierMath has proven exceptionally challenging for today’s AI systems. We evaluated six leading language models—including Claude 3.5 Sonnet, o1-preview, GPT-4o, and Gemini 1.5 Pro—and found that none could solve more than 2% of the problems. This is in sharp contrast to other popular mathematical benchmarks such as GSM-8K and MATH, where top models now achieve over 90% accuracy.

Our next steps

FrontierMath represents a significant step toward evaluating whether AI systems possess research-level mathematical reasoning capabilities. While current models solve less than 2% of problems, we expect this benchmark to become increasingly valuable as AI systems advance.

Our next steps include:

Regular evaluations: Conducting and publishing ongoing assessments of leading AI models to provide a standardized measure of progress, and evaluating how advanced mathematical reasoning abilities improve over time and with scale.
Benchmark expansion: Adding more problems to FrontierMath while maintaining both our rigorous standards and the current distribution of problem types, difficulty levels, and mathematical domains.
Public problem release: Building on our initial release of five representative problems with solutions, we plan to release additional problems in the coming months to further engage the community and facilitate benchmarking.
Enhanced quality assurance: Strengthening our quality control through expanded expert review, increased error-bounties, and improved peer review processes.

Conclusion

FrontierMath represents a significant step toward evaluating whether AI systems possess research-level mathematical reasoning capabilities. While current models solve less than 2% of problems—revealing a substantial gap between AI capabilities and the collective prowess of the mathematical community—we expect this benchmark to become increasingly valuable as AI systems advance.

We look forward to working with both the mathematics and AI research communities to refine and expand this benchmark. By regularly evaluating state-of-the-art models, we aim to deepen our understanding of AI’s capabilities and limitations.

You can read more about FrontierMath in our technical report. To contact us about FrontierMath evaluations, please email math_evals@epoch.ai.

We gratefully acknowledge OpenAI for their support in creating the benchmark.

Conflict of interest statement

OpenAI commissioned the production of 300 questions for FrontierMath, including all problems up to the upcoming version FrontierMath_XX-XX-25. OpenAI fully owns these 300 questions. They have access to all statements and solutions for questions up to version FrontierMath_XX-XX-25, except for a subset of 50 solutions added in version FrontierMath_XX-XX-25 randomly withheld for holdout evaluation. Epoch AI retains the right to conduct and publish internal evaluations using all questions in FrontierMath. Learn more.

Notes

The analysis in this post was conducted using version FrontierMath_10-22-24 of the benchmark. See here for the full list of versions.

About the authors

Former employee

Tamay Besiroglu co-founded Epoch AI and continues to contribute to the organization as a research advisor. He left Epoch to co-lead Mechanize, a startup building virtual work environments, benchmarks, and training data for AI development. His research expertise focuses on the economics of computing and broader trends in machine learning.

Elliot Glazer holds a Ph.D. in Mathematics from Harvard under Hugh Woodin, with research in set theory and formal systems, especially paradoxes in the axiom of choice. He has recently worked on the foundations of proof assistants, and enjoys developing mathematical puzzles in both finite and infinite settings.

Caroline Falkman Olsson is an Operations Associate at Epoch AI, with a background in Economics and Statistics. She has previously worked as a predoctoral researcher at LSE’s International Inequalities Institute (III) and as a data analyst at the Institute for International Economic Studies (IIES) at Stockholm University.
