FrontierMath Tier 4 (v2)

FrontierMath is a benchmark of hundreds of original, exceptionally challenging mathematics problems crafted and vetted by expert mathematicians. The questions cover most major branches of modern mathematics – from computationally intensive problems in number theory and real analysis to abstract questions in algebraic geometry and category theory. Solving a typical problem requires multiple hours of effort from a researcher in the relevant branch of mathematics; the hardest questions require multiple days.

On 2026-06-12, we released a major update, addressing errors in 42% of problems. Following this update, the full FrontierMath dataset consists of 338 problems. This is split into a base set of 295 problems, which we call Tiers 1-3, and an expansion set of 43 exceptionally difficult problems, which we call Tier 4. We have made twelve problems public: ten from Tiers 1-3 and two from Tier 4. Unless stated otherwise, all the numbers on this hub correspond to evaluations on the private sets. You can find the public problems here.

FrontierMath was developed with funding from OpenAI, which has exclusive access to a subset of the benchmark. See our conflict of interest statement for details.


Methodology

For FrontierMath, we recommend using the log viewer on the public questions as the best way to understand the evaluation settings (e.g. click here for o4-mini-2025-04-16 with high reasoning effort).

For each FrontierMath question, the model needs to submit a Python function answer() that returns the answer. The answer is a Python object, often (although not always) an integer or a sympy object. Our implementation allows the model to reason and run Python code.
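As a minimal sketch of what such a submission could look like, the snippet below defines a hypothetical answer() for an invented toy question (the number of primes below 1,000). The helper check_submission is our own illustration, not part of the actual grader; it only checks two of the constraints stated in the prompt, namely that the function takes no parameters and prints nothing.

```python
import contextlib
import inspect
import io

# Hypothetical submission for a toy question: "How many primes are below 1000?"
# The submitted source has no comments and prints nothing, as the prompt requires.
SUBMISSION = '''
def answer():
    limit = 1000
    sieve = [True] * limit
    sieve[0] = sieve[1] = False
    for p in range(2, int(limit ** 0.5) + 1):
        if sieve[p]:
            for multiple in range(p * p, limit, p):
                sieve[multiple] = False
    return sum(sieve)
'''


def check_submission(source):
    """Illustrative check (not the real grader): execute the submitted source,
    confirm answer() takes no parameters and prints nothing, return its value."""
    namespace = {}
    exec(source, namespace)
    func = namespace["answer"]
    if inspect.signature(func).parameters:
        raise ValueError("answer() must take no parameters")
    captured = io.StringIO()
    with contextlib.redirect_stdout(captured):
        result = func()
    if captured.getvalue():
        raise ValueError("answer() must not print output")
    return result


print(check_submission(SUBMISSION))  # prints 168
```

A real submission would typically return an integer or a sympy object, depending on the answer type requested by the question.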

We use the following prompt, which specifies the model’s affordances:

You will be solving a challenging mathematics question. Here’s how it works:

You can:

Think out loud and explore the problem

Use the python tool to execute arbitrary Python code

Submit your answer using the submit_answer tool when you are confident in your answer.

Token limits:

There is a hard limit of 1,000,000 tokens. If you exceed this, the conversation will end immediately (even if you haven’t submitted an answer).

If you reach 660,000 tokens (but less than the hard limit of 1,000,000), you will be forced to use the submit_answer tool in your next message. This forced submission stage is designed to give you the best chance of submitting an answer before reaching the hard token limit. But it is not a guarantee. It is still your responsibility to avoid hitting the hard limit.

Both input and output tokens count towards the limits.

Scoring:

If your answer is correct you will get 1 point. If it is incorrect, or if you don’t submit an answer, you will get 0 points.

Explain your reasoning to me before submitting an answer.

Tips

I strongly recommend that you start by making a high-level plan for how you will attack the problem. If you can, think about different approaches that could be used to solve the problem. To help you stay on track, periodically summarize your key findings and potentially revise your plan.

Before submitting, verify your answer satisfies all problem requirements. It may be worth trying a different approach if you can see that your current answer is not correct.

For using the submit_answer tool:

Pass in the code of a Python function named ‘answer’ that:

Takes no parameters

Returns your answer as a {answer_type}

Prints no output

Contains no code comments

When scoring your answer, the maximum runtime for the answer function is 30 seconds. The code is executed on typical commodity hardware for the year 2025.

For using the python tool:

The tool will only return stdout (and stderr), so you must make sure to use print() to see your results. If you don’t get any output from a python tool call, you probably forgot to print.

Example:

x = 5 * 12

print("The result is", x)

In this example, you must include the print statement. Otherwise, you won’t see the value of x.

The tool is completely stateless and doesn’t come with anything pre-imported. This is very important. If you need modules (e.g. math, sympy), you must import them each time. You cannot access variables defined in a previous call to python, so you must re-define anything you need in each call.

You have access to the standard library, and the following libraries (expressed in requirements.txt format):

galois==0.4.4

gmpy2==2.2.1

mpmath==1.3.0

networkx==3.4.2

numpy==2.1.3

pyadic==0.2.3

scipy==1.15.2

sympy==1.13.3

Do not submit your answer using the python tool. Use the submit_answer tool when you’re ready to submit.

The maximum runtime for a python tool call is 30 seconds. The code is executed on typical commodity hardware for the year 2025.

Here is the problem to solve. The answer type is {answer_type}.

{question}
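The token limits described in the prompt can be read as a three-stage rule. The sketch below is one possible interpretation, not the actual implementation: the function name conversation_stage and the stage labels are hypothetical, and the handling of a count of exactly 1,000,000 tokens is an assumption, since the prompt only says the conversation ends when the hard limit is exceeded.

```python
HARD_LIMIT = 1_000_000      # conversation ends immediately once this is exceeded
FORCED_SUBMIT_AT = 660_000  # from here on, the next message must call submit_answer


def conversation_stage(tokens_used):
    """Classify a conversation by total tokens used (input plus output).

    Assumption: exactly HARD_LIMIT tokens still counts as the forced
    submission stage, because the prompt says "exceed" for the hard limit.
    """
    if tokens_used > HARD_LIMIT:
        return "ended"
    if tokens_used >= FORCED_SUBMIT_AT:
        return "forced_submission"
    return "normal"


for used in (100_000, 660_000, 1_000_001):
    print(used, conversation_stage(used))
```

Under this reading, a model has at most 340,000 tokens between the forced submission threshold and the hard limit in which to submit.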

The code used in our implementation can be found here.
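To illustrate what the prompt means by a stateless python tool, the sketch below runs each snippet in a fresh interpreter process. The helper run_python_tool is hypothetical and assumes nothing about the real sandbox beyond what the prompt states: no state is shared between calls, only stdout and stderr are returned, and each call has a 30-second limit.

```python
import subprocess
import sys


def run_python_tool(code, timeout=30):
    """Run code in a fresh interpreter and return stdout followed by stderr.

    Hypothetical stand-in for the python tool; raises
    subprocess.TimeoutExpired if the call exceeds the timeout.
    """
    completed = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return completed.stdout + completed.stderr


# First call: imports what it needs and prints its result.
first = run_python_tool("import math\nx = math.factorial(5)\nprint(x)")

# Second call: x from the first call no longer exists, so this fails.
second = run_python_tool("print(x)")

# Third call: computes a value but never prints it, so nothing comes back.
third = run_python_tool("x = 5 * 12")

print(repr(first))
print("NameError" in second)
print(repr(third))
```

The second call shows why modules and variables must be re-imported and re-defined in every call, and the third shows why a missing print() yields no output.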

Changelog

2026-06-12. We released v2, which corrected 123 problems in Tiers 1-3 and 12 problems in Tier 4. We also removed 5 problems from Tiers 1-3 and 7 problems from Tier 4. v1 can be found here.
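The counts in this entry can be cross-checked against the dataset sizes given above. The short calculation below is a sketch under two assumptions that the page does not state explicitly: that v2 added no new problems, so the v1 sizes equal the v2 sizes plus the removed problems, and that the 42% figure counts both corrected and removed problems as a share of the v1 total.

```python
# Counts taken from the changelog entry and the dataset description.
v2_sizes = {"Tiers 1-3": 295, "Tier 4": 43}
corrected = {"Tiers 1-3": 123, "Tier 4": 12}
removed = {"Tiers 1-3": 5, "Tier 4": 7}

# Assumption: v1 size = v2 size + removed problems (no problems added in v2).
v1_sizes = {tier: v2_sizes[tier] + removed[tier] for tier in v2_sizes}
v1_total = sum(v1_sizes.values())

# Assumption: a problem "had errors addressed" if it was corrected or removed.
affected = sum(corrected.values()) + sum(removed.values())
share = affected / v1_total

print(v1_sizes)                              # {'Tiers 1-3': 300, 'Tier 4': 50}
print(v1_total, affected, f"{share:.0%}")    # 350 147 42%
```

Under these assumptions the figures are mutually consistent: 147 of 350 v1 problems were corrected or removed, which matches the stated 42%.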
