US models currently outperform non-US models
The best US models have consistently higher accuracies than the best non-US models on GPQA Diamond and MATH Level 5. For example, on GPQA Diamond the best-performing model is OpenAI’s o1, while on MATH Level 5 the leading model is o3-mini.
However, with the release of DeepSeek-R1 in January 2025, the gap between US and non-US models has narrowed substantially: DeepSeek-R1 trails o3-mini by only 2 percentage points on MATH Level 5 and o1 by only 4 percentage points on GPQA Diamond.
Epoch’s work is free to use, distribute, and reproduce provided the source and authors are credited under the Creative Commons BY license.