To search for AI models exceeding 10²⁵ FLOP, we examined three sources:
- Recent releases from leading AI labs
- Top-scoring models on AI benchmarks and leaderboards
- The most-downloaded large models on the open model repository, Hugging Face
We performed an exhaustive review of recent model releases from AI developers known to have access to enough AI hardware for large-scale training: Google (including Google DeepMind), Meta, Microsoft, OpenAI, xAI, NVIDIA, and ByteDance. We also briefly reviewed other organizations with models in Epoch's database of notable AI models for recent, large-scale releases.
To identify large models from lesser-known developers, we examined common language benchmarks such as MMLU, MATH, and GPQA, as well as the Chatbot Arena and HELM leaderboards. We searched for any scores that matched or exceeded the lowest score among models known to be trained with over 10²⁵ FLOP (i.e. the minimum across the original GPT-4, Inflection-2, and Gemini Ultra). We also searched for large models from Chinese developers by examining the China-focused CompassRank and CompassArena leaderboards, looking at all models that scored above models known to have been trained with just under 10²⁵ FLOP, such as Yi Lightning and Llama-3 70B.
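This screening step amounts to comparing each leaderboard entry against the lowest score achieved by a model already known to exceed 10²⁵ FLOP. Below is a minimal sketch of that filter; the leaderboard rows are hypothetical and the reference scores are illustrative (they should be taken from the models' reported results).

```python
# Sketch of the leaderboard screening filter (illustrative values only).

# 5-shot MMLU scores of models known to be trained with over 10^25 FLOP.
reference_scores = {
    "GPT-4 (original)": 86.4,
    "Inflection-2": 79.6,
    "Gemini Ultra": 83.7,
}
threshold = min(reference_scores.values())  # weakest known >10^25 FLOP model

# Hypothetical leaderboard rows: (model name, MMLU score).
leaderboard = [
    ("Candidate A", 81.2),
    ("Candidate B", 74.5),
    ("Candidate C", 85.0),
]

# Keep any model that matches or exceeds the weakest reference model.
candidates = [name for name, score in leaderboard if score >= threshold]
print(candidates)  # ['Candidate A', 'Candidate C']
```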
Models that met these criteria were added to our wider database of AI models. We then estimated their training compute and whether it might exceed 10²⁵ FLOP. For several models, developers provided too little information for us to estimate training compute directly. In these cases, we estimated training compute from model performance: for language models, by imputation based on benchmark scores; for image and video models, where fewer standardized benchmark scores were available, from other measures of output quality, such as user preference testing.
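As an illustration of benchmark-based imputation, one simple approach is to fit the relationship between benchmark score and log training compute on models where compute is known, then apply it to the model in question. The sketch below uses placeholder data and a plain log-linear fit; our actual imputation is more involved.

```python
# Illustrative sketch of imputing training compute from a benchmark score
# (placeholder data; not the production imputation model).
import numpy as np

# Models with both a known training compute and a benchmark score.
known_log10_flop = np.array([24.5, 24.8, 25.1, 25.4, 25.6])  # placeholder values
known_score = np.array([68.0, 72.0, 79.0, 84.0, 87.0])       # placeholder values

# Fit a simple linear relationship: log10(compute) ~ a * score + b.
a, b = np.polyfit(known_score, known_log10_flop, deg=1)

def impute_log10_compute(score: float) -> float:
    """Predict log10(training FLOP) from a benchmark score."""
    return a * score + b

estimate = impute_log10_compute(82.0)
print(f"Imputed training compute: ~10^{estimate:.1f} FLOP")
print("Likely above 10^25 FLOP" if estimate > 25 else "Likely below 10^25 FLOP")
```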
There are no standardized benchmarks for frontier image and video models, so we could not use benchmark scores to search for these models. For example, recent video models such as Veo 2 were evaluated in their release materials by human preference for their outputs versus competitors'. To address this, we individually examined the leading image and video generation services, using user preference results and developers' compute resources to assess whether such models might exceed the 10²⁵ FLOP threshold.
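Where benchmark-based imputation is not possible, a rough bound can come from a developer's compute resources: total training FLOP is approximately the number of chips, times peak FLOP per second per chip, times utilization, times training time. The sketch below illustrates this arithmetic with assumed, hypothetical parameter values rather than figures for any particular model.

```python
# Rough hardware-based bound on training compute (illustrative values).
# training_FLOP ~= num_chips * peak_FLOP_per_s * utilization * training_seconds

def hardware_compute_estimate(num_chips: int,
                              peak_flop_per_s: float,
                              utilization: float,
                              training_days: float) -> float:
    """Estimate total training FLOP from cluster size and training duration."""
    seconds = training_days * 24 * 3600
    return num_chips * peak_flop_per_s * utilization * seconds

# Example: 10,000 H100-class GPUs (~1e15 dense BF16 FLOP/s each),
# 40% utilization, 90 days of training.
estimate = hardware_compute_estimate(10_000, 1e15, 0.4, 90)
print(f"~{estimate:.2e} FLOP")  # ~3e25 FLOP, above the 10^25 threshold
```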