FrontierMath
300 expert-written problems in advanced mathematics, requiring multiple hours or even days to solve.
About FrontierMath
FrontierMath is a benchmark of hundreds of original, exceptionally challenging mathematics problems crafted and vetted by expert mathematicians. The questions cover most major branches of modern mathematics – from computationally intensive problems in number theory and real analysis to abstract questions in algebraic geometry and category theory. Solving a typical problem requires multiple hours of effort from a researcher in the relevant branch of mathematics, and, for the upper-end questions, multiple days.
The full FrontierMath dataset contains 300 problems. As of 2025-03-04, we have made 10 of these problems public: we call this public set frontiermath-2025-02-28-public, and the remaining 290 problems frontiermath-2025-02-28-private. Unless explicitly mentioned otherwise, all the numbers on this hub correspond to evaluations on frontiermath-2025-02-28-private. You can find more information about the public problems here.
Methodology
For FrontierMath, we recommend using the log viewer as the best way to understand the evaluation settings (e.g. click here for claude-3.7-sonnet-20250219 with 16k tokens of extended thinking). The log viewer is not available for the 290 private questions in frontiermath-2025-02-28-private, on which we report our headline numbers, but you can see the logs for the 10 public questions in frontiermath-2025-02-28-public.
For each FrontierMath question, the model needs to submit a Python function answer() that returns the answer. The answer is a Python object, often (although not always) an integer or a sympy object. Our implementation allows the model to reason and run Python code.
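For illustration only, a submitted function might look like the sketch below. The problem and answer are invented rather than drawn from the benchmark, and comments are included here for readability even though actual submissions must contain none:

```python
def answer():
    # Invented example: a problem whose answer is the exact value of
    # the Gaussian integral, returned as a sympy expression.
    import sympy
    x = sympy.Symbol("x")
    return sympy.integrate(sympy.exp(-x**2), (x, -sympy.oo, sympy.oo))  # sqrt(pi)
```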
We use the following prompt, which specifies the model’s affordances:
You will be solving a challenging mathematics question. Here’s how it works:
- You can:
- Think out loud and explore the problem
- Use the python tool to execute arbitrary Python code
- Submit your answer using the submit_answer tool when you are confident in your answer.
- Token limits:
- There is a hard limit of 100,000 tokens. If you exceed this, the conversation will end immediately (even if you haven’t submitted an answer).
- If you reach 66,000 tokens (but less than the hard limit of 100,000), you will be forced to use the submit_answer tool in your next message. This forced submission stage is designed to give you the best chance of submitting an answer before reaching the hard token limit. But it is not a guarantee. It is still your responsibility to avoid hitting the hard limit.
- Both input and output tokens count towards the limits.
- Scoring:
- If your answer is correct you will get 1 point. If it is incorrect, or if you don’t submit an answer, you will get 0 points.
- Explain your reasoning to me before submitting an answer.
- Tips
- I strongly recommend that you start by making a high-level plan for how you will attack the problem. If you can, think about different approaches that could be used to solve the problem. To help you stay on track, periodically summarize your key findings and potentially revise your plan.
- Before submitting, verify your answer satisfies all problem requirements. It may be worth trying a different approach if you can see that your current answer is not correct.
- For using the submit_answer tool:
- Pass in the code of a Python function named ‘answer’ that:
- Takes no parameters
- Returns your answer as a {answer_type}
- Prints no output
- Contains no code comments
- When scoring your answer, the maximum runtime for the answer function is 30 seconds. The code is executed on typical commodity hardware for the year 2025.
- For using the python tool:
- The tool will only return stdout (and stderr), so you must make sure to use print() to see your results. If you don’t get any output from a python tool call, you probably forgot to print.
- Example:
  x = 5 * 12
  print("The result is", x)
  In this example, you must include the print statement. Otherwise, you won’t see the value of x.
- The tool is completely stateless and doesn’t come with anything pre-imported. This is very important. If you need modules (e.g. math, sympy), you must import them each time. You cannot access variables defined in a previous call to python, so you must re-define anything you need in each call.
- You have access to the standard library, and the following libraries (expressed in requirements.txt format):
- galois==0.4.4
- gmpy2==2.2.1
- mpmath==1.3.0
- networkx==3.4.2
- numpy==2.1.3
- pyadic==0.2.3
- scipy==1.15.2
- sympy==1.13.3
- Do not submit your answer using the python tool. Use the submit_answer tool when you’re ready to submit.
- The maximum runtime for a python tool call is 30 seconds. The code is executed on typical commodity hardware for the year 2025.
Here is the problem to solve. The answer type is {answer_type}.
{question}
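As the prompt above emphasizes, the python tool is stateless and returns only stdout/stderr, so every call must re-import its dependencies and print anything it wants to inspect. A minimal sketch of two consecutive tool calls (the computation is invented for illustration):

```python
# First python tool call: import what you need and print the result.
import sympy
n = sympy.prime(1000)  # the 1000th prime
print("p =", n)
```

```python
# A later call cannot see `n` from the previous call: it must re-import
# sympy and recompute (or re-define) the value before using it.
import sympy
n = sympy.prime(1000)
print("p mod 7 =", n % 7)
```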
This implementation is different from the code we used to run preliminary evaluations in the paper. It is also not the methodology used by OpenAI in their own FrontierMath evaluations, such as for the o3 and o3-mini models: we were not involved in running these evaluations. The difference between our results and OpenAI’s might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time compute, or running on a different subset of FrontierMath (the 180 problems in frontiermath-2024-11-26 vs the 290 problems in frontiermath-2025-02-28-private).
Currently, our Inspect agent only works for models that support tool calling. This is why we do not yet have results for models that do not support tool calling, such as DeepSeek-R1 or Gemini 2.0 Flash Thinking Experimental. We will run DeepSeek-R1 and Gemini 2.0 Flash Thinking Experimental evaluations as soon as these models add function calling. In the meantime, we have evaluated these two models on the 191 non-public problems in the frontiermath-2024-12-04 checkpoint using a scaffold similar to the one we used in the paper. We found accuracies of 5.2% for DeepSeek-R1 and 2.6% for Gemini 2.0 Flash Thinking Experimental, compared to 8.9% for o3-mini (medium reasoning) with the same scaffold.
Gemini 2.5 models
For Gemini 2.5 models, we faced difficulties using the API between March and June 2025: some requests systematically failed without an informative error message, either returning a status code of 500 or failing to send any data through the network at all. In June 2025, Google staff confirmed that they did not intend to address the remaining issues we had documented.
As a result, as of July 2025, we are evaluating Gemini 2.5 models with specific scoring rules that apply only to Gemini 2.5 models:
- Requests that fail in the ways described above are retried up to 10 times, with exponential backoff and a maximum backoff of 30 minutes (a sketch of this schedule appears after this list).
- If all 10 retries fail, the corresponding sample is marked as incorrect.
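For reference, here is a minimal sketch of this retry schedule. The function name, base delay, and error types are our own illustrative assumptions, not the exact implementation in our evaluation harness:

```python
import random
import time

MAX_RETRIES = 10
MAX_BACKOFF_S = 30 * 60  # cap each wait at 30 minutes

def call_with_retries(send_request, base_delay_s=10.0):
    # send_request stands in for the actual Gemini API call; the base delay
    # and the error types caught below are illustrative assumptions.
    for attempt in range(1 + MAX_RETRIES):  # one initial attempt plus up to 10 retries
        try:
            return send_request()
        except (TimeoutError, ConnectionError):
            if attempt == MAX_RETRIES:
                raise  # all retries failed: the sample is marked as incorrect
            delay = min(base_delay_s * 2 ** attempt, MAX_BACKOFF_S)
            time.sleep(delay + random.uniform(0, 1))  # exponential backoff with jitter
```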
Samples were marked as incorrect due to this policy in the following cases:
| Model | Benchmark | Accuracy | Samples marked as incorrect after 10 retries | Run ID |
|---|---|---|---|---|
| gemini-2.5-pro-preview-06-05 | FrontierMath-2025-02-28-Public | 30% (±15%) | 1/10 = 10% | DVgoky7pTLzriyfcj3oiTC |
| gemini-2.5-pro-preview-06-05 | FrontierMath-2025-02-28-Private | 10% (±2%) | 21/290 = 7% | hNLQNC6TASXNNzat8kVS77 |
| gemini-2.5-pro | FrontierMath-2025-02-28-Public | 40% (±16%) | 1/10 = 10% | g7m7nM2iRMskxAUbyrV766 |
| gemini-2.5-pro | FrontierMath-2025-02-28-Private | 11% (±2%) | 25/290 = 9% | HN7YDEAzGvPRMBDo7mGuXG |
Grok 4
For grok-4-0709, we experienced timeouts and network errors using the API in July 2025.
As a result, as of July 2025, we evaluated Grok 4 using specific scoring rules:
- FrontierMath-2025-02-28-Private was evaluated using our standard settings, using the streaming API. The run ID is gda5UeWrA8HcbDCRuLJ56H. 1 of the 290 samples was not scored because the server never sent a response. (We allow up to 1% of samples to fail without being scored.)
- For OTIS Mock AIME 2024-2025, GPQA diamond, and FrontierMath-Tier-4-2025-07-01-Private, we used a maximum output length per request of 128,000 tokens (the default is no maximum), as recommended by xAI. If a request failed due to network errors or timeouts, we moved the corresponding sample directly to the scoring phase of the evaluation (which generally causes it to be marked as incorrect). This was due to the highly time-sensitive nature of these evaluations.
| Benchmark | Accuracy | Samples with API errors | Run ID |
|---|---|---|---|
| OTIS Mock AIME 2024-2025 | 84% (±5%) | 4 out of 45 × 8 = 360 (1%) | cvTPRDCM38zSTn9Y3MUb9d |
| GPQA diamond | 87% (±2%) | 7 out of 198 × 8 = 1584 (0.4%) | A85Zfq2qguE4X9xXBweBHP |
| FrontierMath-Tier-4-2025-07-01-Private | 2% (±2%) | 8 out of 48 (16%) | QxtNUmV2L34UyrySmBLTbv |
xAI compensated us for this evaluation and provided compute credits. We signed no NDA and maintained complete editorial independence: we publish all results regardless of performance.