Notable AI Models Documentation
Overview
Epoch AI’s Notable AI Models Dataset is a collection of historically significant or cutting-edge machine learning models, and key information about their training. This dataset is useful for research about trends in the history and future of artificial intelligence.
This documentation describes which models are contained within the dataset, its records (including data fields and definitions), and our processes for adding new entries and auditing accuracy. It also includes a changelog and acknowledgements.
The dataset is available on our website as a visualization or table, and is available for download as a daily-updated CSV file. For a quick-start example of loading the data and working with it in your research, see this Google Colab demo notebook.
If you would like to ask any questions about the database, or suggest a model that should be added, feel free to contact us at data@epochai.org.
If this dataset is useful for you, please cite it.
Citation
Epoch AI, ‘Parameter, Compute and Data Trends in Machine Learning’. Published online at epoch.ai. Retrieved from: ‘https://epoch.ai/data/notable-ai-models’ [online resource]
BibTeX citation
@misc{epoch2022pcdtrends,
title = "Parameter, Compute and Data Trends in Machine Learning",
author = {{Epoch AI}},
year = 2022,
url = {https://epoch.ai/data/notable-ai-models},
note = "Accessed: "
}
Inclusion
The dataset focuses on notable ML models: models that have advanced the state of the art, had a large influence in the field’s history, or had a large impact within the world. Here, we detail criteria for inclusion, and give an overview of how the data have been collected.
Criteria
To be included in the dataset, an ML model must satisfy all inclusion criteria:
- there must be reliable documentation of its existence and relevance to machine learning;
- the model must include a learning component; it cannot be a non-learned algorithm;
- the model must actually have been trained; it cannot be a theoretical description without experimental results;
- the model must be notable, per any of the notability criteria defined below.
Notability
Models are notable if they satisfy any of the following:
- highly cited (over 1000 citations);
- large training cost (over $1,000,000, measured in 2023 USD);
- significant use (over one million monthly active users);
- state of the art performance (typically on a recognised ML benchmark, see below for further discussion);
- indisputable historical significance.
Where there are many related models, for example several checkpoints along training or several sizes of a given model family, the dataset preferentially includes the version that used the most compute. Other versions may be included where they are notable in their own right.
State of the art
Identifying whether a model is state of the art can be a more involved process, compared to simply checking citations or the training compute budget. We consider a model to be state of the art if there is good reason to believe that it was the best existing model at the time for a task of genuine interest. The default way to provide evidence for this is state-of-the-art performance on a recognised benchmark.
To be recognised, a benchmark should satisfy at least one of the following:
- 100+ citations.
- 10+ submissions in total from 3+ research groups.
- An associated publication in a reputable peer-reviewed academic venue. The publication does not need to focus exclusively on the benchmark; however, the benchmark should be a key result.
At our discretion, we may also identify models as state of the art where no benchmark result exists, but there is convincing evidence that a model truly is state of the art. Eligible sources of evidence here are comparison on a non-benchmark database, a high-quality user preference study, or demonstration of state-of-the-art capabilities. For example, GraphCast is compared against other weather prediction models on a weather database that is not a standalone benchmark. Nevertheless, we take this as convincing evidence that it is state of the art.
Historical significance
Models can be included on the grounds of historical significance if they marked a significant advance in AI history, even if they did not strictly advance the state of the art on any application. For example, many neural network breakthroughs performed worse than other ML techniques, but were directly influential for later AI development. Evidence to support this status may come from citations in later notable models, discussion in reviews or textbooks, or other unambiguous identification as an influential result.
Example | Include? | Why |
---|---|---|
Human-level control through deep reinforcement learning | Yes | Well-documented learned model, over 1000 citations, advanced state of the art for game play. |
Stochastic Neural Analog Reinforcement Calculator | Yes | No individual associated paper, but other sources confirm its existence, and it was indisputably historically significant as one of the first neural learning systems. |
Theory of neural-analog reinforcement systems and its application to the brain model problem | No | Historically significant, but no experimentally trained models; it is entirely a theoretical result. |
Scaling scaling laws with board games | No | Doesn’t meet any notability criteria. In addition to not being highly cited and using small compute models, there is no attempt at state of the art results. Rather, this is a paper examining scaling details. |
Search process
This data has been collected from a variety of sources: literature reviews, historical accounts of AI development, highly-cited publications from top conferences, high-profile models from leading industry labs, bibliographies of notable papers, pre-existing datasets curating AI papers (see Acknowledgements), and ad hoc suggestions from contributors.
We monitor news coverage, releases from key AI labs, and benchmarks to identify new notable models as they are released. This can lead to a lag for new models. Typically, we aim to add the most prominent releases (e.g. GPT-4) within days of release. For less prominent models, reporting lags may extend to months.
Coverage
As of December 03, 2024, the dataset contains 869 models, 425 of which have compute estimates.
The dataset does not provide exhaustive coverage of notable models. However, data collection efforts to support Epoch AI research projects have led to more thorough coverage within particular niches, such as models trained with large-scale compute, and biological sequence models.
- Coverage is most thorough for language and vision models developed since 2018 (271 models and 118 models respectively), albeit with a lag for the newest models. More specialist domains, such as robotics, likely have worse coverage in this period.
- Coverage is fair, but less thorough, for deep learning language and vision models from 2010 to 2018 (77 models and 106 models respectively). Again, other domains may have worse coverage.
- Coverage is quite sparse for historical models before 2010 (160 models before 2010 compared to 709 models after), particularly models outside the paradigm of deep learning. Entries here are focused on notable models mentioned in textbooks and reviews, rather than a systematic search across sources.
If you would like to ask any questions about the database, or suggest a model that should be added, feel free to contact us at data@epochai.org.
Records
The database focuses on information relevant to trends in AI model development. Records in this dataset have information about three broad areas:
- Bibliographic information about the model and its associated publication, for example its title, URL, authors, citations, date, etc.
- Training details such as training compute, parameters, dataset size, hardware used for training, etc.
- Metadata about the record, such as notes on the above fields with supporting evidence and context, our confidence in the key estimates, etc.
We provide a comprehensive guide to the dataset’s fields below. This includes examples taken from Llama-2 70B, one of the best-documented recent models. If you would like to ask any questions about the database, or request a field that should be added, feel free to contact us at data@epochai.org.
Database Updates
This section provides more information about recurring processes in the database: adding new models, updating citation counts, and updating the hosted files by which the dataset can be accessed for analysis.
Adding new models
Entries are added to the dataset near-daily, including both newly-released models and older models newly identified as notable. Typically, most information that can easily be determined from public information is added at the time a model is entered in the database. However, it is common for some information to gradually be entered later. For example, a compute estimate might be omitted at first and only added after we devote further effort to calculating it.
If you would like to ask any questions about the database, or suggest a model that should be added, feel free to contact us at data@epochai.org.
Updating citation counts
When models are added to the database, citation counts are recorded for those with academic publications or preprints. At the beginning of each month, citation counts are automatically updated for publications listed in Semantic Scholar. Publications not listed in Semantic Scholar rely on manual entry of citation count.
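As a rough illustration of this kind of lookup (not necessarily Epoch AI's actual pipeline), the sketch below queries the public Semantic Scholar Graph API for a paper's citation count; the arXiv ID shown is for "Attention Is All You Need", used here only as an example.

```python
# Rough illustration of a citation-count lookup against the public
# Semantic Scholar Graph API; this is not Epoch AI's actual update pipeline.
import requests

def citation_count(paper_id: str) -> int:
    """paper_id can be a Semantic Scholar ID or a prefixed ID such as 'arXiv:1706.03762'."""
    url = f"https://api.semanticscholar.org/graph/v1/paper/{paper_id}"
    response = requests.get(url, params={"fields": "citationCount"}, timeout=30)
    response.raise_for_status()
    return response.json()["citationCount"]

print(citation_count("arXiv:1706.03762"))  # citations for "Attention Is All You Need"
```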
Updating hosted files
Epoch AI’s Notable Models dataset is hosted as a CSV that is synced from the database daily. The easiest way to load the data in scripts is using the CSV URL. If you need the most up-to-date version reflecting unsynced changes, a CSV can be manually generated from the table view on the website.
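As a minimal sketch of loading the data in a script with pandas (the URL below is a placeholder; use the actual CSV link from the dataset page or the Colab demo notebook):

```python
# Minimal sketch of loading the daily-updated CSV with pandas. The URL below is
# a placeholder; use the actual CSV link from the dataset page.
import pandas as pd

CSV_URL = "https://epoch.ai/data/notable-ai-models.csv"  # placeholder URL

df = pd.read_csv(CSV_URL)
print(len(df), "models")
print(df.columns.tolist())  # the fields described under Records
```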
Estimation
Some fields within the database require estimation, because they are often not straightforwardly reported within papers or other sources. Here, we detail how estimation works for compute, model size, dataset size, and the metadata on estimate confidence.
Estimating compute
Training compute is one of the most important pieces of information in our dataset, as reflected in its usage across Epoch AI’s research and elsewhere. However, estimating compute can be challenging. Here we outline how compute estimation is performed in the notable models dataset.
Compute is measured in units of floating point operations (FLOP). For older models, the relevant operations were sometimes integer operations; in this case we report these instead. We do not apply any multiplier to adjust for operations potentially being more valuable under different tensor formats or precisions, for example FP16 versus FP32 or BF16. Some sources report compute in multiply-add operations, fused multiply-adds (FMAs), or similar. We treat one multiply-add/FMA as equivalent to two FLOP, to match typical reporting of chip performance.
For a given model in the database, training compute is provided as the total training compute, including pretraining, and including pretrained base models used as components. Finetuning compute is recorded in its associated column. Finetuning is distinguished by authors’ descriptions of the training as finetuning, or unambiguous use of a pretrained model in a distinct phase of training.
In the simplest case, training compute is directly reported in a paper, and we enter this figure into the database. When compute is not reported, we use two main methods to estimate it:
- Hardware details and usage.
- Counting the operations based on model architecture and data.
When there is enough information to count the operations, this is preferred in our dataset, because typically hardware-based estimates require assumptions about utilization, which may reduce the estimates’ accuracy.
Hardware details and usage
Estimating compute from hardware details and usage is relatively straightforward, when the necessary details are known:
- The usage in chip-time, e.g. “trained on a cluster of 128 TPUv3 instances for two days” means 256 chip-days = 128 chips × 2 days. Sometimes this is reported as separate chips and time used, other times this may be reported directly in chip-time. When it is not reported, we may create estimates from publicly-known information, comparison to typical training runs, etc.
- The type of hardware used, e.g. NVIDIA H100, TPUv3. Ideally, this is reported in the paper. Otherwise, for more speculative estimates, one may have to make assumptions based on institution and year, e.g. that Google would have used TPUs of the corresponding generation in that year.
- The type of number representation used, e.g. FP32, FP16, BF16. Ideally, this is reported in the paper. When not reported, it can often be guessed. For example, the number representation was typically FP32 for models trained before 2019.
Once these details are known, the corresponding peak FLOP/s performance for that hardware and number representation can be found in hardware documentation. Finally, utilization rates account for real training runs falling significantly short of peak performance due to memory bottlenecks, network latency, etc. Typical utilization rates for large distributed training runs are around 30-50%. When utilization is not reported, it is estimated by reference to comparable models from a similar time period.
Example: ImageGPT
Some training details are provided in the blogpost: "[…] iGPT-L was trained for roughly 2500 V100-days […]" The number representation is not specified, but given this was trained by a major corporation in 2020, we assume the number format was FP16. The V100 has 125 TFLOP/s tensor FP16 performance. Assuming a utilization of 0.3, this leads to the following compute estimate: 8.1e21 FLOP = 2500 V100-days × 125e12 FLOP/s × 0.3 utilization × 86.4e3 s/day
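The arithmetic in this example can be expressed as a short script. The sketch below simply reproduces the ImageGPT numbers above; the 125 TFLOP/s peak throughput and 0.3 utilization are the assumptions stated in the example.

```python
# Minimal sketch reproducing the ImageGPT arithmetic above. The 125 TFLOP/s
# peak throughput and 0.3 utilization are the assumptions stated in the example.

SECONDS_PER_DAY = 86_400

def hardware_compute_estimate(chip_days: float, peak_flop_per_s: float, utilization: float) -> float:
    """Training compute = chip-time x peak throughput x assumed utilization."""
    return chip_days * SECONDS_PER_DAY * peak_flop_per_s * utilization

# ImageGPT-L: ~2500 V100-days at 125e12 FLOP/s (tensor FP16), 30% utilization.
print(f"{hardware_compute_estimate(2500, 125e12, 0.3):.1e} FLOP")  # ~8.1e21 FLOP
```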
Counting the operations
Counting the number of operations is often useful for older research, where hardware and usage details might be unavailable. A widely applicable heuristic for the training compute of dense models is: Training compute ≈ 2 × # of connections × 3 × # of training examples × # of epochs. This works by first estimating the FLOP required for a forward pass, which is approximately twice the number of connections. This can be modified for sparsity such as Mixture-of-Experts: in that case, the heuristic should count only the connections in the active experts.
The forward pass FLOP is then multiplied by three to account for the backward pass, as the ratio between forward and backward pass FLOP is 1:2 for non-recurrent dense models. Finally, this is multiplied by the number of passes performed over the data: the number of training examples multiplied by the number of epochs the model was trained for. For transformer-based language models, this formula is equivalent to the commonly-used heuristic: Compute = 6 × # of parameters × # of training examples × # of epochs.
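As a quick illustration of the dense-transformer heuristic, the sketch below applies Compute = 6 × parameters × training examples to Llama-2 70B, whose developers report pretraining on roughly 2 trillion tokens; treat the inputs as approximate.

```python
# Quick illustration of the dense-transformer heuristic (Compute = 6 x N x D).
# The Llama-2 70B figures (70B parameters, ~2T pretraining tokens) are taken
# from its developers' reporting and are approximate.

def dense_transformer_training_flop(n_params: float, n_tokens: float, epochs: float = 1.0) -> float:
    return 6 * n_params * n_tokens * epochs

print(f"{dense_transformer_training_flop(70e9, 2e12):.1e} FLOP")  # ~8.4e23 FLOP
```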
Sometimes, the FLOP for a forward pass is reported directly in a paper. In this case, this value can be used directly instead of 2 × # of connections. Otherwise, the FLOP for a forward pass are evaluated by summing FLOP over the network’s layers. These are set out in Table 3.
Layer | Forward pass FLOP per token (approx) |
---|---|
Fully connected layer from N neurons to M neurons | 2×N×M |
CNN from a tensor of shape H×W×C with D filters of shape K×K×C, applied with stride S and padding P | 2×H×W×C×D×K^2/S^2 |
Transpose CNN from a tensor of shape H×W×C with D filters of shape K×K×C, applied with stride S and padding P | 2×D×H×W×C^2×K^2 |
RNN with bias vectors taking an input of size N and producing an output of size M | 2×(N+M)×M |
Fully gated GRU with bias vectors taking an input of size N and producing an output of size M | 6×(N+M)×M |
LSTM with bias vectors taking an input of size N and producing an output of size M | 8×(N+M)×M |
Word Embedding for vocabulary size V and embedding dimension W | 0 |
Self attention layer with sequence length L, inputs of size W, key of size D and output of size N | 2×W×(2×D+N) + 2×L×(D+N) |
Multi-headed attention layer with sequence length L, inputs of size W, key of size D, head output of size N, output of size M and H attention heads | 2×H×(W×(2×D+N) + L×(D+N) + N×M) |
Example: Attention Is All You Need
The input is a sequence of tokens, with an average length of 20 and a vocabulary size of 30,000. Each token is embedded and represented as a vector of size 1024. There are six encoder and decoder layers. Each encoder-decoder pair has a total of 3 multi-headed attention (MHA) sublayers and 2 fully connected (FCN) sublayers. At the end there is a final linear layer and a softmax.
Each MHA sublayer has the following parameters: input size W=64, head output size N=64, key size D=64, number of heads H=16, final output size M=1024. Hence each MHA sublayer has 2×16×(64×(2×64+64) + 20×(64+64) + 64×1024) = 2.6e6 FLOP per token. Each FCN sublayer has an input size of 1024, output size of 1024, and a single hidden layer with 4096 units. Hence each FCN sublayer has 2×2×1024×4096 = 1.7e7 FLOP per token.
Summing all its layers, the encoder-decoder stack has 6 × (3 × 2.6e6 + 2 × 1.7e7) ≈ 2.5e8 FLOP per token. The final linear layer has 2 × 1024 × 3e4 = 6.1e7 FLOP per token. Summing these, a forward pass takes 3.1e8 FLOP per token.
The paper says they use batches of 25,000 tokens and run the training for 300,000 steps, so the total training compute is 2.5e4 × 3e5 × 3 × 3.1e8 = 6.97e18 FLOP.
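For concreteness, the sketch below reproduces this worked example using the Table 3 formulas; all architecture constants are taken directly from the example.

```python
# Sketch reproducing the worked example above using the Table 3 formulas.
# All architecture constants are taken directly from the example.

def mha_flop_per_token(H, W, D, N, M, L):
    # Multi-headed attention: 2*H*(W*(2D+N) + L*(D+N) + N*M)
    return 2 * H * (W * (2 * D + N) + L * (D + N) + N * M)

def fcn_flop_per_token(n_in, n_hidden, n_out):
    # Two fully connected layers: 2*(n_in*n_hidden + n_hidden*n_out)
    return 2 * (n_in * n_hidden + n_hidden * n_out)

mha = mha_flop_per_token(H=16, W=64, D=64, N=64, M=1024, L=20)  # ~2.6e6
fcn = fcn_flop_per_token(1024, 4096, 1024)                      # ~1.7e7
stack = 6 * (3 * mha + 2 * fcn)                                 # ~2.5e8
final_linear = 2 * 1024 * 30_000                                # ~6.1e7
forward_flop_per_token = stack + final_linear                   # ~3.1e8

tokens_seen = 25_000 * 300_000  # batch size in tokens x training steps
training_flop = 3 * forward_flop_per_token * tokens_seen
print(f"{training_flop:.2e} FLOP")  # ~6.95e18 FLOP, matching the example's ~6.97e18
```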
Estimating model size
Parameter counts are often reported by the model developer, but if the parameter count is not stated, it can sometimes be estimated from architectural details. Similar to estimating compute, estimating parameter count requires finding a description of the architecture, i.e. the type, number, and configuration of the layers, then calculating the parameters in each layer and summing them. Table 5 lists the parameter counts for different layers. Alternatively, if an implementation is available, it can be simpler to load the architecture in code and count the parameters directly.
Layer | Parameters (approx) |
---|---|
Fully connected layer from N neurons to M neurons | N×M |
CNN from a tensor of shape H×W×C with D filters of shape K×K×C, applied with stride S and padding P | D×K^2×C |
Transpose CNN from a tensor of shape H×W×C with D filters of shape K×K×C, applied with stride S and padding P | D×K^2×C |
RNN with bias vectors taking an input of size N and producing an output of size M | (N+M)×M |
Fully gated GRU with bias vectors taking an input of size N and producing an output of size M | 3×(N+M)×M |
LSTM with bias vectors taking an input of size N and producing an output of size M | 4×(N+M)×M |
Word Embedding for vocabulary size V and embedding dimension W | W×V |
Self attention layer with sequence length L, inputs of size W, key of size D and output of size N | W×(2×D+N) |
Multi-headed attention layer with sequence length L, inputs of size W, key of size D, head output of size N, output of size M and H attention heads | H×(W×(2×D + N) + N×M) |
Example: Attention Is All You Need
The input is a sequence of tokens, with an average length of 20 and a vocabulary size of 30,000. Each token is embedded and represented as a vector of size 1024. There are six encoder and decoder layers. Each encoder-decoder pair has a total of 3 multi-headed attention (MHA) sublayers and 2 fully connected (FCN) sublayers. At the end there is a final linear layer and a softmax.
Each MHA sublayer has the following parameters: input size W=64, head output size N=64, key size D=64, number of heads H=16, final output size M=1024. Hence each MHA sublayer has 16×(64×(2×64 + 64) + 64×1024) = 1.2e6 parameters. Each FCN sublayer has an input size of 1024, output size of 1024, and a single hidden layer with 4096 units. Hence each FCN sublayer has 2×1024×4096 = 8.4e6 parameters.
Summing all its layers, the encoder-decoder stack has 6 × (3 × 1.2e6 + 2 × 8.4e6) ≈ 1.2e8 parameters. The final linear layer has 1024 × 3e4 = 3.1e7 parameters. Two embedding layers each have 30e3 × 1024 parameters, 6.2e7 in total. Summing these, the model has 2.1e8 parameters, matching the reported 213 million parameters in the paper.
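The same worked example can be reproduced with the Table 5 formulas; the sketch below uses only the constants given above.

```python
# Sketch reproducing the parameter-count example above using the Table 5 formulas.

def mha_params(H, W, D, N, M):
    # Multi-headed attention: H*(W*(2D+N) + N*M)
    return H * (W * (2 * D + N) + N * M)

def fcn_params(n_in, n_hidden, n_out):
    return n_in * n_hidden + n_hidden * n_out

mha = mha_params(H=16, W=64, D=64, N=64, M=1024)  # ~1.2e6
fcn = fcn_params(1024, 4096, 1024)                # ~8.4e6
stack = 6 * (3 * mha + 2 * fcn)                   # ~1.2e8
final_linear = 1024 * 30_000                      # ~3.1e7
embeddings = 2 * 30_000 * 1024                    # ~6.1e7
total = stack + final_linear + embeddings
print(f"{total:.2e} parameters")  # ~2.15e8, matching the reported 213M
```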
Estimating dataset size
Dataset size is measured in the number of datapoints used as training examples, as outlined in Table 7. The objective here is to provide an intuitive idea of how large a dataset is. We record the number of distinct datapoints, not multiplied by the number of epochs. The number of epochs is recorded separately. Table 8 provides worked examples across several of these ML problems.
ML problem | Way of measuring dataset size |
---|---|
Classification | # training examples |
Image classification | # images |
Image captioning | # captions |
Language modeling | # words |
Translation | # words in input language |
Text classification | # training examples |
Speech recognition | # words |
Reinforcement learning | # timesteps |
Example (image classification): Deep Residual Learning for Image Recognition
"We evaluate our method on the ImageNet 2012 classification dataset that consists of 1000 classes. The models are trained on the 1.28 million training images, and evaluated on the 50k validation images." We thus note down a dataset size of 1.28e6 (the number of images).
Example (image captioning)
According to the authors, the MSCOCO dataset is "arguably the largest and highest quality dataset" that they used. This had 82,783 training examples, each containing a single image and 5 sentences that are "relatively visual and unbiased". To determine the dataset size, we consider the number of image-caption pairs. Thus we count 82,783 × 5 = 413,915 training examples.
Example (language modeling): Language Models are Few-Shot Learners
From the paper, we determine that there are 410 + 19 + 12 + 55 + 3 = 499 billion tokens. We convert this to words by multiplying by 0.75, giving 374B words.
Example (speech recognition): An RNN-based prosodic information synthesizer for Mandarin text-to-speech
"A continuous-speech Mandarin database provided by the Telecommunication Laboratories, MOTC, R.O.C. was used… The data base was divided into two parts: a training set and an open test set. These two sets consisted of 28191 and 7051 syllables, respectively." We convert this to words by multiplying 28,191 syllables by 0.62 to get 17,478 words.
Language dataset sizes are usually reported in terms of tokens or gigabytes. These are converted to words per Table 9. The conversion factors are based on the OpenAI GPT-3 tokenizer for Western languages, and on manual inspection showing that tokenizers typically use about one token per word in Mandarin, Japanese and Korean. The ratio is tokenizer-dependent, so when estimating a dataset's size in words, one should consider whether the tokenizer in question might have a substantially different ratio. Speech recognition data are often expressed in terms of duration or syllables, which are converted to words per Table 10. A small conversion sketch follows the tables below.
Language | Words per token | Words per GB (approx) |
---|---|---|
English | 0.75 | 200M |
Mandarin Chinese | 1 | 167M |
German | 0.75 | 167M |
Spanish | 0.75 | 200M |
Japanese | 1 | 200M |
Korean | 1 | 111M |
Language | Words per minute (WPM) | Words per hour (WPH) | Words per syllable (WPS) |
---|---|---|---|
English | 228 | 13,680 | ~0.73 |
Mandarin Chinese | 158 | 9,480 | ~0.62 |
German | 179 | 10,740 | ~0.59 |
Spanish | 218 | 13,080 | ~0.41 |
Japanese | 193 | 11,580 | ~0.43 |
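For convenience, the sketch below applies these conversion factors in code; the factors are copied from Tables 9 and 10, and the example inputs come from the worked examples above.

```python
# Sketch applying the conversion factors from Tables 9 and 10. Factors are
# copied from the tables above; example inputs come from the worked examples.

WORDS_PER_TOKEN = {"English": 0.75, "Mandarin Chinese": 1.0, "German": 0.75,
                   "Spanish": 0.75, "Japanese": 1.0, "Korean": 1.0}
WORDS_PER_SYLLABLE = {"English": 0.73, "Mandarin Chinese": 0.62, "German": 0.59,
                      "Spanish": 0.41, "Japanese": 0.43}

def tokens_to_words(n_tokens: float, language: str = "English") -> float:
    return n_tokens * WORDS_PER_TOKEN[language]

def syllables_to_words(n_syllables: float, language: str) -> float:
    return n_syllables * WORDS_PER_SYLLABLE[language]

print(f"{tokens_to_words(499e9):.3g} words")                          # GPT-3: ~374B words
print(f"{syllables_to_words(28_191, 'Mandarin Chinese'):.0f} words")  # ~17,478 words
```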
Estimating power draw
The field “Training power draw (W)” contains the power draw of the hardware used to train the model, measured in watts. This field is filled in when the training hardware type and quantity are known, and is calculated as follows:
The power draw is first calculated for the processing hardware (GPUs or TPUs), by multiplying the TDP (thermal design power) of the processor by the number of processors.
The power draw of the computing hardware is then scaled up to account for overhead due to networking equipment and data center efficiency, as sketched below:
- We multiply by 2.03x to account for non-GPU server hardware (networking, switches, and CPUs), based on the specifications of NVIDIA DGX H100 servers.
- We also multiply by the data center PUE (power usage effectiveness) to obtain the total power drawn from the grid while training the model. We use an average PUE of 1.09x for 2024; this value decreases by about 4% per year, based on quarterly historical PUE records of Google data centers.
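As an illustration of this calculation, the sketch below multiplies chip TDP by chip count, the 2.03x server overhead, and an assumed PUE; the chip count and the 700 W TDP figure for an H100 SXM are illustrative assumptions, not values from the dataset.

```python
# Minimal sketch of the power-draw calculation described above. The chip count
# and the 700 W TDP (NVIDIA H100 SXM) are illustrative assumptions; the 2.03x
# server overhead and 1.09 PUE are the values given in the text.

SERVER_OVERHEAD = 2.03  # non-GPU server hardware: networking, switches, CPUs

def training_power_draw_w(n_chips: int, chip_tdp_w: float, pue: float = 1.09) -> float:
    """Total power drawn from the grid during training, in watts."""
    return n_chips * chip_tdp_w * SERVER_OVERHEAD * pue

# Illustrative: 1,000 H100s at 700 W TDP each, 2024-average PUE.
print(f"{training_power_draw_w(1000, 700):.3g} W")  # ~1.55e6 W
```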
Estimating confidence
As discussed in Records, the confidence statuses specify the following bounds as 90% confidence intervals:
- Confident - ±3x (~0.5 orders of magnitude).
- Likely - ±10x (1 order of magnitude).
- Speculative - ±31x (~1.5 orders of magnitude).
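In code, these labels correspond to multiplicative bounds; the sketch below is a hypothetical helper showing how a central estimate and a confidence label translate into the implied 90% interval.

```python
# Hypothetical helper: translate a central estimate and a confidence label
# into the implied multiplicative 90% interval from the list above.
CONFIDENCE_FACTOR = {"Confident": 3, "Likely": 10, "Speculative": 31}

def estimate_interval(value: float, confidence: str) -> tuple[float, float]:
    factor = CONFIDENCE_FACTOR[confidence]
    return value / factor, value * factor

print(estimate_interval(8.1e21, "Likely"))  # (8.1e+20, 8.1e+22)
```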
Confidence applies to the recorded values for Training compute, Parameters, and Training dataset size. It describes our confidence in the most uncertain of these values, among those that have a non-empty entry.
To estimate confidence statuses, we consider which parts of an estimate are uncertain, and how large the uncertainty is.
- If details (compute, model size, and dataset size) are all directly reported, then the value is Confident. There is little room for error.
- If a detail is estimated without any assumptions having to be made, then the value is Confident. For example, if hardware type, quantity, training time, number format, and utilization are all reported, then the ensuing compute estimate is unambiguous.
- When details are ambiguous, and an assumption has to be made, we consider the uncertainty in that assumption.
- For example, it is often necessary to estimate utilization when estimating training compute from hardware details. Given that the typical utilization range for language models is 0.3-0.5, such an estimate falls within the Confident category.
- Further ambiguity may move estimates into the Likely category. For example, MedBERT was trained for one week using one V100 GPU, but the authors do not report the arithmetic precision or usage of tensor cores during training, which could affect the compute usage by a factor of 4x.
- Finally, some estimates are based almost entirely on credible ranges for (unreported) key parameters such as training time and hardware. These typically fall into the Speculative category. An example of this is GPT-4, where our compute estimate is based on secondhand reporting that lets us roughly estimate training duration and hardware.
Changelog
2024-06-19
The documentation was updated for the launch of the dataset on Epoch AI’s “Data on AI” webpage.
- Updates particularly affected sections on estimating compute, parameters, and dataset sizes.
- The confidence field was updated to be defined in terms of 90% confidence intervals for estimated values.
- The documentation was restructured for clarity.
Downloads
Notable Machine Learning Models
CSV, Updated December 03, 2024
In addition to the notable models dataset, we also host an “all models” dataset with entries covering additional models used in our other research projects. Many of these models do not qualify as notable under our inclusion criteria. We do not recommend using this broader dataset unless you have a specific reason to do so, for example because less-notable models are of interest in your research.
Acknowledgements
We would like to thank the authors of several sources where we have found one or more ML models to include in the database: Stanford CRFM’s foundation model ecosystem graph, AI Tracker, Stella Biderman’s directory of LLMs, Terry Um’s repo of deep learning papers, Alan Thompson’s models table, the OpenCompass Chinese LM leaderboard, the Akronomikon by LightOn AI, Papers With Code, the Metaculus 2022 AI Forecasting Database, and Hugging Face. We would also like to thank the authors of AI and compute and Compute and Energy Consumption Trends in Deep Learning Inference.
The data have been collected by Epoch AI’s employees and collaborators, including Jaime Sevilla, Pablo Villalobos, Juan Felipe Cerón, Matthew Burtell, Lennart Heim, Amogh B. Nanjajjar, Tilman Rauker, Nuño Sempere, Max Rauker, Anson Ho, Tamay Besiroglu, Marius Hobbhahn, Jean-Stanislas Denain, Owen Dudney, David Atkinson, Ben Cottier, David Owen, Robi Rahman, Carl Guo, Josh You, Nicole Maug, Aidan O’Gara, Bartosz Podkanowicz, Luke Frymire, and Natalia Martemianova.
This documentation was written by David Owen and Robi Rahman. Material on estimating compute, parameters and dataset sizes was adapted from previous documents by Jaime Sevilla, Lennart Heim, Marius Hobbhahn, Tamay Besiroglu, Anson Ho, Pablo Villalobos, and Robi Rahman.