Data Insight
Jan. 30, 2025
Updated Jun. 6, 2025

Over 30 AI models have been trained at the scale of GPT-4

By Robi Rahman, Lovis Heindrich, David Owen, and Luke Emberson

The largest AI models today are trained with over 10^25 floating-point operations (FLOP) of compute. The first model trained at this scale was GPT-4, released in March 2023. As of June 2025, we have identified over 30 publicly announced AI models from 12 different AI developers that we believe to be over the 10^25 FLOP training compute threshold.

Training a model of this scale costs tens of millions of dollars with current hardware. Despite the high cost, we expect such models to proliferate: during 2024, roughly two models above this threshold were announced per month on average. Models trained at this scale will be subject to additional requirements under the EU AI Act, coming into force in August 2025.
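The "tens of millions of dollars" figure can be sanity-checked with a back-of-the-envelope calculation. The numbers below are illustrative assumptions, not figures from this article: an NVIDIA H100 delivering roughly 1e15 FLOP/s at peak dense BF16 throughput, 40% hardware utilization, and a rental price of about $2 per GPU-hour.

```python
# Back-of-the-envelope cost of a 10^25 FLOP training run.
# All inputs are assumed values for illustration, not measured figures.
TRAINING_FLOP = 1e25
PEAK_FLOP_PER_S = 1e15    # assumed H100 dense BF16 peak throughput
UTILIZATION = 0.40        # assumed fraction of peak actually achieved
PRICE_PER_GPU_HOUR = 2.0  # assumed cloud rental price in USD

gpu_seconds = TRAINING_FLOP / (PEAK_FLOP_PER_S * UTILIZATION)
gpu_hours = gpu_seconds / 3600
cost_usd = gpu_hours * PRICE_PER_GPU_HOUR

print(f"{gpu_hours:,.0f} GPU-hours, ~${cost_usd / 1e6:.0f}M")
```

Under these assumptions the run takes about seven million GPU-hours and costs on the order of $14 million, consistent with the "tens of millions" estimate; real costs vary with hardware generation, utilization, and whether the cluster is owned or rented.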

Epoch's work is free to use, distribute, and reproduce provided the source and authors are credited under the Creative Commons BY license.

Learn more about this graph

To search for AI models exceeding 10^25 FLOP we examined three sources:

  • Recent releases from leading AI labs
  • Top-scoring models on AI benchmarks and leaderboards
  • The most-downloaded large models on the open model repository, Hugging Face

We performed an exhaustive review of recent model releases from AI developers known to have access to enough AI hardware for large-scale training: Google (including Google DeepMind), Meta, Microsoft, OpenAI, xAI, NVIDIA, and ByteDance. Other organizations with models in the Epoch database of notable AI models were briefly reviewed for recent, large-scale releases.

To identify large models from lesser-known labs, we examined common language benchmarks such as MMLU, MATH and GPQA, as well as the Chatbot Arena and HELM leaderboards. We searched for any scores that matched or exceeded the performance of the lowest-performing models known to be trained with over 10^25 FLOP (i.e. the minimum score across the original GPT-4, Inflection-2, and Gemini Ultra). We also searched for large models from Chinese developers. To do this, we examined the China-focused CompassRank and CompassArena leaderboards, looking at all models scoring above models known to be trained with just under 10^25 FLOP, such as Yi Lightning and Llama-3 70B.
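The screening rule above amounts to a simple threshold filter: take the minimum benchmark score among models already known to exceed 10^25 FLOP, and flag any candidate that matches or beats it. A minimal sketch, using made-up placeholder scores rather than real leaderboard values:

```python
# Screening rule sketch: flag any model whose benchmark score matches
# or exceeds the weakest model already known to exceed 10^25 FLOP.
# All scores below are illustrative placeholders, not real data.
KNOWN_OVER_THRESHOLD = {
    "GPT-4 (original)": 86.4,
    "Inflection-2": 79.6,
    "Gemini Ultra": 83.7,
}
cutoff = min(KNOWN_OVER_THRESHOLD.values())

candidates = {"Model A": 85.1, "Model B": 72.3, "Model C": 80.2}
flagged = [name for name, score in candidates.items() if score >= cutoff]
print(flagged)  # candidates worth a closer compute estimate
```

Note that this filter is deliberately permissive: passing it only nominates a model for a compute estimate, since a high benchmark score can also come from a smaller, well-trained model.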

Models that met these criteria were added to our wider database of AI models. We then estimated their training compute, and whether it might exceed 10^25 FLOP. For several AI models, developers provided insufficient information for us to directly estimate training compute. In these cases, we estimated training compute from model performance. For language models, we did this using imputation based on benchmark scores. For image and video models, where fewer standardized benchmark scores were available, we relied on other measures of output quality, such as user preference testing.

There are no standardized benchmarks for frontier image and video models, meaning we could not use benchmark scores to search for these models. For example, recent video models such as Veo 2 were evaluated in their release materials by human preference for their outputs versus competitors. To address this, we individually examined the leading video and image generation services, using user preferences and developers' compute resources to assess whether such models might exceed the 10^25 FLOP threshold.
