About OTIS Mock AIME 2024-2025

OTIS Mock AIME 2024-2025 is a collection of problems from the OTIS Mock AIME exams administered in 2024 and 2025. The problems were written by students from the Olympiad Training for Individual Study (OTIS) program. The OTIS Mock AIME is an annual 3-hour exam consisting of 15 problems whose answers are integers between 0 and 999. As the name suggests, it is an unofficial exam intended to emulate the American Invitational Mathematics Exam (AIME).

The dataset contains 45 problems from 3 exams, 15 problems per exam.

These problems are harder than those in MATH Level 5 but easier than those in FrontierMath.

Methodology

For OTIS Mock AIME, the log viewer is the best way to understand the evaluation settings (e.g., click here for o3-mini-2025-01-31).

We use the following prompt:

Please solve this AIME problem step by step. The answer is an integer ranging from 000 to 999, inclusive.

{question}

Remember to show your work clearly and end with ‘ANSWER: X’ where X is your final numerical answer.

For the 3 of the 45 questions that contain an image, we did not include the image itself in the input, so that all models were exposed to the same input, including those that don't support images. However, we did include the TikZ or Asymptote code that generates the image as part of the input.
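As a rough sketch of how the input for each problem might be assembled under these settings (the function and variable names below are illustrative, not taken from our actual evaluation code), the figure source is simply appended to the problem statement before the template is filled:

```python
# Illustrative sketch only; names are hypothetical, not our actual evaluation code.
PROMPT_TEMPLATE = """Please solve this AIME problem step by step. The answer is an integer ranging from 000 to 999, inclusive.

{question}

Remember to show your work clearly and end with 'ANSWER: X' where X is your final numerical answer."""


def build_input(problem_text: str, figure_source: str | None = None) -> str:
    """Fill the prompt template; for the 3 problems with a figure, append the
    TikZ/Asymptote source in place of the rendered image."""
    question = problem_text
    if figure_source is not None:
        question = f"{problem_text}\n\n{figure_source}"
    return PROMPT_TEMPLATE.format(question=question)
```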

We use a model-based scorer that looks at the model's submission and the true answer (which must be an integer between 0 and 999, inclusive) and returns whether the model's final answer is correct. A model can still receive credit for a question even if its answer is not in the exact "ANSWER: X" format, as long as the scorer model accepts it.
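A minimal sketch of this scoring step, using a regex in place of the scorer model (a simplified stand-in; the real scorer is model-based and more permissive):

```python
import re


def extract_final_answer(submission: str) -> int | None:
    """Look for the requested 'ANSWER: X' format; otherwise fall back to the
    last standalone 0-999 integer in the submission."""
    match = re.search(r"ANSWER:\s*(\d{1,3})\b", submission)
    if match:
        return int(match.group(1))
    numbers = re.findall(r"\b\d{1,3}\b", submission)
    return int(numbers[-1]) if numbers else None


def score(submission: str, true_answer: int) -> bool:
    """Return True if the extracted final answer matches the true answer.
    In the actual pipeline, a scorer model makes this judgment."""
    answer = extract_final_answer(submission)
    return answer is not None and answer == true_answer
```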

Grok 4

For grok-4-0709, we experienced timeouts and network errors when using the API in July 2025.

As a result, as of July 2025, we evaluated Grok 4 using the following adjusted settings:

FrontierMath-2025-02-28-Private was evaluated using our standard settings. The record ID is gda5UeWrA8HcbDCRuLJ56H. We used the streaming API. 1 of 290 samples was not scored because the server never sent a response (we allow up to 1% of samples to fail without being scored).

For OTIS Mock AIME 2024-2025, GPQA diamond, and FrontierMath-Tier-4-2025-07-01-Private, we used a maximum output length of 128,000 tokens per request (the default is no maximum), as recommended by xAI. If any requests failed due to network errors or timeouts, we moved the corresponding sample directly to the scoring phase of the evaluation, which generally causes it to be marked as incorrect. We did this because the evaluations were highly time-sensitive.
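A sketch of this fallback, assuming a failed request simply yields an empty submission that then proceeds to scoring (the generate callable and other names here are hypothetical):

```python
# Hypothetical sketch of the fallback for failed Grok 4 requests.
MAX_OUTPUT_TOKENS = 128_000  # per-request cap recommended by xAI


def generate_or_fallback(generate, prompt: str) -> str:
    """Call the provided generation function; on a network error or timeout,
    return an empty submission so the sample goes straight to scoring,
    where it will generally be marked incorrect."""
    try:
        return generate(prompt, max_tokens=MAX_OUTPUT_TOKENS)
    except (TimeoutError, ConnectionError) as err:
        print(f"Request failed ({err!r}); sending empty submission to scoring.")
        return ""
```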

| Benchmark | Accuracy | Samples with API errors | Run ID |
| --- | --- | --- | --- |
| OTIS Mock AIME 2024-2025 | 84% (±5%) | 4 out of 45 × 8 = 360 (1%) | cvTPRDCM38zSTn9Y3MUb9d |
| GPQA diamond | 87% (±2%) | 7 out of 198 × 8 = 1584 (0.4%) | A85Zfq2qguE4X9xXBweBHP |
| FrontierMath-Tier-4-2025-07-01-Private | 2% (±2%) | 8 out of 48 (16%) | QxtNUmV2L34UyrySmBLTbv |
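For reference, a simple binomial standard error over questions, sqrt(p(1 − p)/n), reproduces error bars of roughly the size shown above; this is an illustrative calculation, not necessarily the exact interval procedure we used:

```python
import math


def binomial_standard_error(accuracy: float, num_questions: int) -> float:
    """Binomial standard error over independent questions; illustrative only."""
    return math.sqrt(accuracy * (1 - accuracy) / num_questions)


# Example: 84% accuracy over the 45 OTIS Mock AIME questions
print(round(binomial_standard_error(0.84, 45), 3))  # ~0.055, i.e. about ±5 percentage points
```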

xAI compensated us for this evaluation and provided compute credits. We signed no NDA and maintained complete editorial independence: we publish all results regardless of performance.