Some fields within the database require estimation, because they are often not straightforwardly reported within papers or other sources. Here, we detail how estimation works for compute, model size, dataset size, and the metadata on estimate confidence.
Training compute is one of the most important pieces of information in our dataset, as reflected in its usage across Epoch AI’s research and elsewhere. However, estimating compute can be challenging. Here we outline how compute estimation is performed in the notable models dataset.
Compute is measured in units of floating point operations (FLOP). For older models, the relevant operations were sometimes integer operations; in this case we report these instead. We do not apply any multiplier to adjust for operations potentially being more valuable under different tensor formats or precisions, for example FP16 versus FP32 or BF16. Some sources report compute in multiply-and-add operations, fused multiply-adds (FMAs), or similar. We treat one multiply-add/FMA as equivalent to two FLOP, to match typical reporting of chip performance.
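As a minimal illustration of this convention (the operation count below is a hypothetical reported figure):

```python
# One multiply-add / fused multiply-add (FMA) counts as two FLOP.
reported_fmas = 1.5e20        # hypothetical operation count reported in a paper
flop = 2 * reported_fmas      # 3.0e20 FLOP is the value entered in the database
```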
For a given model in the database, training compute is recorded as the total over all training stages, including pretraining and any pretrained base models used as components. Finetuning compute is recorded separately in its associated column. Training is classified as finetuning when the authors describe it as such, or when a pretrained model is unambiguously used in a distinct phase of training.
In the simplest case, training compute is directly reported in a paper, and we enter this figure into the database. When compute is not reported, we use two main methods to estimate it: estimating from details of the hardware used for training, and counting the operations performed by the model during training.
When there is enough information to count the operations, we prefer this method, because hardware-based estimates typically require assumptions about utilization, which may reduce their accuracy.
Estimating compute from hardware details and usage is relatively straightforward, provided the necessary details are known: the type of hardware, the number of chips used, the training duration (or total chip-time), and the number representation used during training.
Once these details are known, the corresponding peak FLOP/s performance for the hardware and number representation can be found from hardware documentation, or from the tool below. Finally, a utilization rate accounts for real training runs falling significantly short of peak performance due to memory bottlenecks, network latency, etc. Typical utilization rates for large distributed training runs are around 30-50%. When not reported, utilization is estimated by reference to comparable models from a similar time period. The estimate is then: Training compute = number of chips × training time (s) × peak FLOP/s × utilization rate.
| Worked example of estimating training compute from hardware details: ImageGPT |
|---|
| Some training details are provided in the blogpost: "[…] iGPT-L was trained for roughly 2500 V100-days […]" |
| The number representation is not specified, but given this was trained by a major corporation in 2020, we assume the number format was FP16. |
| The V100 has 125 TFLOP/s peak tensor FP16 performance. Assuming a utilization of 0.3, this leads to the following compute estimate: |
| 8.1e21 FLOP = 2500 V100-days × 125e12 FLOP/s × 0.3 utilization × 86.4e3 s/day |
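The calculation in the worked example can be sketched in a few lines of Python. The function below simply encodes the chip-time × peak throughput × utilization formula; the peak FLOP/s and utilization values are the assumptions stated above, not universal constants:

```python
SECONDS_PER_DAY = 86_400

def training_compute_flop(chip_days: float, peak_flop_per_s: float, utilization: float) -> float:
    """Training compute = chip-time (s) x peak FLOP/s x utilization rate."""
    return chip_days * SECONDS_PER_DAY * peak_flop_per_s * utilization

# iGPT-L: ~2500 V100-days; V100 peak tensor FP16 = 125e12 FLOP/s; assumed 30% utilization
print(f"{training_compute_flop(2500, 125e12, 0.3):.1e} FLOP")  # ~8.1e21 FLOP
```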
Counting the number of operations is often useful for older research, where hardware and usage details may be unavailable. A widely applicable heuristic for the training compute of dense models is: Training compute = (FLOP per forward pass) × 3 × (# of training examples) × (# of epochs). The FLOP for a forward pass is approximately twice the number of connections in the network. This can be modified for sparsity such as Mixture-of-Experts: in that case, only the connections in the active experts should be counted.
The factor of three accounts for the backward pass: for non-recurrent dense models, the backward pass requires roughly twice the FLOP of the forward pass, so forward plus backward together is three times the forward pass. This is then multiplied by the number of passes performed over the data, i.e. the number of training examples times the number of epochs trained. For transformer-based language models, where the number of connections approximately equals the number of parameters, this formula is equivalent to the commonly used heuristic: Compute = 6 × # of parameters × # of training examples × # of epochs.
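A minimal sketch of this heuristic for a dense transformer (the model size and token count below are hypothetical):

```python
def training_compute_flop(n_params: float, n_tokens: float, n_epochs: float = 1.0) -> float:
    """Compute = 6 x parameters x training examples x epochs.

    The factor of 6 is 2 FLOP per parameter per token for the forward pass
    plus roughly 4 FLOP per parameter per token for the backward pass.
    """
    return 6 * n_params * n_tokens * n_epochs

# Hypothetical dense model: 7e9 parameters trained on 2e12 tokens for one epoch
print(f"{training_compute_flop(7e9, 2e12):.1e} FLOP")  # ~8.4e22 FLOP
```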
Sometimes, the FLOP for a forward pass is reported directly in a paper; in this case, that value can be used in place of 2 × # of connections. Otherwise, the FLOP for a forward pass is estimated by summing FLOP over the network's layers, using the per-layer costs set out in Table 3.
| Layer | Forward pass FLOP per token (approx) |
|---|---|
| Fully connected layer from N neurons to M neurons | 2×N×M |
| CNN from a tensor of shape H×W×C with D filters of shape K×K×C, applied with stride S and padding P | 2×H×W×K^2×C×D/S^2 |
| Transpose CNN from a tensor of shape H×W×C with D filters of shape K×K×C, applied with stride S and padding P | 2×H×W×K^2×C×D |
| RNN with bias vectors taking an input of size N and producing an output of size M | 2×(N+M)×M |
| Fully gated GRU with bias vectors taking an input of size N and producing an output of size M | 6×(N+M)×M |
| LSTM with bias vectors taking an input of size N and producing an output of size M | 8×(N+M)×M |
| Word Embedding for vocabulary size V and embedding dimension W | 0 |
| Self attention layer with sequence length L, inputs of size W, key of size D and output of size N | 2×W×(2×D+N) + 2×L×(D+N) |
| Multi-headed attention layer with sequence length L, inputs of size W, key of size D, head output of size N, output of size M and H attention heads | 2×H×(W×(2×D+N) + L×(D+N) + N×M) |
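For illustration, a few rows of Table 3 transcribed directly into Python helper functions (the symbols follow the table's notation):

```python
def fc_flop(n: int, m: int) -> int:
    """Fully connected layer from N to M neurons: 2*N*M FLOP per token."""
    return 2 * n * m

def lstm_flop(n: int, m: int) -> int:
    """LSTM taking an input of size N, producing an output of size M: 8*(N+M)*M."""
    return 8 * (n + m) * m

def self_attention_flop(l: int, w: int, d: int, n: int) -> int:
    """Self-attention with sequence length L, input W, key D, output N."""
    return 2 * w * (2 * d + n) + 2 * l * (d + n)
```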
| Worked example of estimating training compute from architecture: Attention Is All You Need |
|---|
| The input is a sequence of tokens, with an average length of 20 and a vocabulary size of 30,000. Each token is embedded and represented as a vector of size 1024. There are six encoder and decoder layers. Each encoder-decoder pair has a total of 3 multi-headed attention (MHA) sublayers, and 2 fully connected (FCN) sublayers. At the end there is a final linear layer and a softmax. |
| Each MHA sublayer has the following parameters: input size W=64, head output size N=64, key size D=64, number of heads H=16, final output size M=1024. Hence each MHA sublayer has 2×16×(64×(2×64+64) + 20×(64+64) + 64×1024) ≈ 2.6e6 FLOP per token. |
| Each FCN sublayer has an input size of 1024, output size of 1024, and a single hidden layer with 4096 units. Hence each FCN sublayer has 2×2×1024×4096 ≈ 1.7e7 FLOP per token. |
| Summing all its layers, the encoder-decoder stack has 6 × (3 × 2.6e6 + 2 × 1.7e7) ≈ 2.5e8 FLOP per token. The final linear layer has 2 × 1024 × 3e4 ≈ 6.1e7 FLOP per token. Summing these, a forward pass takes 3.1e8 FLOP per token. |
| The paper says they use batches of 25,000 tokens, and run the training for 300,000 steps. So the total training compute is 2.5e4 × 3e5 × 3 × 3.1e8 ≈ 6.97e18 FLOP. |
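The same worked example, reproduced with the per-token formulas from Table 3 (all figures are taken from the example above):

```python
L, V, D_MODEL, HIDDEN = 20, 30_000, 1024, 4096      # seq length, vocab, model dim, FCN hidden
H, W, D, N, M = 16, 64, 64, 64, 1024                # heads, input, key, head output, output

mha = 2 * H * (W * (2 * D + N) + L * (D + N) + N * M)   # ~2.6e6 FLOP/token per MHA sublayer
fcn = 2 * D_MODEL * HIDDEN + 2 * HIDDEN * D_MODEL       # ~1.7e7 FLOP/token per FCN sublayer
stack = 6 * (3 * mha + 2 * fcn)                         # ~2.5e8 FLOP/token
final_linear = 2 * D_MODEL * V                          # ~6.1e7 FLOP/token
forward_per_token = stack + final_linear                # ~3.1e8 FLOP/token

tokens_seen = 25_000 * 300_000                          # batch tokens x training steps
total = 3 * tokens_seen * forward_per_token             # x3 for forward + backward passes
print(f"{total:.2e} FLOP")                              # ~7e18 FLOP, matching the example
```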
When details about model architecture, training data, hardware, and development time are scarce, it may be informative to compare the model's benchmark performance to that of other models. Within a given model family, scaling laws can predict how benchmark performance improves with compute (for example, coding performance across GPT-4 scaling runs, or ARC Challenge accuracy across Llama 3 model sizes). Across models with different architectures, data, and training procedures, benchmark performance is less predictable from compute, but it nevertheless remains correlated.
This process of estimating training compute from benchmark performance can be improved by aggregating performance across many benchmarks, especially when several or many models with known training compute have been evaluated on those benchmarks.
The procedure is, roughly: collect scores on common benchmarks for models with known training compute, fit the relationship between aggregate benchmark performance and log-compute, and then invert that fit to estimate compute for the model in question.
This process is demonstrated in a public Colab notebook, Compute Estimation from Benchmark Scores. Because these compute estimates are already based on benchmark performance, they should be excluded from analyses of the relationship between benchmarks and compute. Such compute estimates can be filtered using the Training compute estimation method field.
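A minimal sketch of the idea, assuming a set of reference models with known compute (all scores and compute values below are made up for illustration; the real pipeline, in the linked notebook, aggregates many benchmarks and models):

```python
import numpy as np

# Reference models: known training compute (FLOP) and an aggregate benchmark score in [0, 1]
log_compute = np.log10([1e23, 3e23, 1e24, 5e24, 2e25])
agg_score = np.array([0.42, 0.48, 0.55, 0.63, 0.71])

# Fit a simple linear relationship between aggregate score and log10(compute)
slope, intercept = np.polyfit(agg_score, log_compute, deg=1)

# Invert the fit for a model with a known score but unknown compute
target_score = 0.60
print(f"Estimated compute: {10 ** (slope * target_score + intercept):.1e} FLOP")
```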
Parameter counts are often reported by the model developer. If the parameter count is not stated, it can sometimes be estimated from architectural details. As with estimating compute, this requires finding a description of the architecture, i.e. the type, number, and configuration of the layers, then calculating the parameters in each layer and summing them. Table 5 lists the parameter counts for different layer types. Alternatively, if an implementation of the architecture is available, it is often simpler to load the model in code and count its parameters directly, as in the sketch below.
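For instance, a minimal sketch using PyTorch (the model below is a hypothetical stand-in for an architecture loaded from a published implementation):

```python
import torch.nn as nn

# Hypothetical stand-in for a published architecture loaded from code
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params:,} parameters")  # 8,393,728 (weights plus biases)
```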
| Layer | Parameters (approx) |
|---|---|
| Fully connected layer from N neurons to M neurons | N×M |
| CNN from a tensor of shape H×W×C with D filters of shape K×K×C, applied with stride S and padding P | D×K^2×C |
| Transpose CNN from a tensor of shape H×W×C with D filters of shape K×K×C, applied with stride S and padding P | D×K^2×C |
| RNN with bias vectors taking an input of size N and producing an output of size M | (N+M)×M |
| Fully gated GRU with bias vectors taking an input of size N and producing an output of size M | 3×(N+M)×M |
| LSTM with bias vectors taking an input of size N and producing an output of size M | 4×(N+M)×M |
| Word Embedding for vocabulary size V and embedding dimension W | W×V |
| Self attention layer with sequence length L, inputs of size W, key of size D and output of size N | W×(2×D+N) |
| Multi-headed attention layer with sequence length L, inputs of size W, key of size D, head output of size N, output of size M and H attention heads | H×(W×(2×D + N) + N×M) |
| Worked example of estimating parameter count from architecture: Attention Is All You Need |
|---|
| The input is a sequence of tokens, with an average length of 20 and a vocabulary size of 30,000. Each token is embedded and represented as a vector of size 1024. There are six encoder and decoder layers. Each encoder-decoder pair has a total of 3 multi-headed attention (MHA) sublayers, and 2 fully connected (FCN) sublayers. At the end there is a final linear layer and a softmax. |
| Each MHA sublayer has the following parameters: input size W=64, head output size N=64, key size D=64, number of heads H=16, final output size M=1024. Hence each MHA sublayer has 16×(64×(2×64 + 64) + 64×1024) ≈ 1.2e6 parameters. |
| Each FCN sublayer has an input size of 1024, output size of 1024, and a single hidden layer with 4096 units. Hence each FCN sublayer has 2×1024×4096 ≈ 8.4e6 parameters. |
| Summing all its layers, the encoder-decoder stack has 6 × (3 × 1.2e6 + 2 × 8.4e6) ≈ 1.2e8 parameters. The final linear layer has 1024 × 3e4 ≈ 3.1e7 parameters. Two embedding layers each have 3e4 × 1024 parameters, so 6.2e7 in total. Summing these, the model has 2.1e8 parameters, matching the 213 million parameters reported in the paper. |
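The parameter worked example can likewise be reproduced with the formulas from Table 5 (all figures come from the example above):

```python
V, D_MODEL, HIDDEN = 30_000, 1024, 4096             # vocab, model dim, FCN hidden size
H, W, D, N, M = 16, 64, 64, 64, 1024                # heads, input, key, head output, output

mha = H * (W * (2 * D + N) + N * M)                 # ~1.2e6 params per MHA sublayer
fcn = D_MODEL * HIDDEN + HIDDEN * D_MODEL           # ~8.4e6 params per FCN sublayer
stack = 6 * (3 * mha + 2 * fcn)                     # ~1.2e8 params
final_linear = D_MODEL * V                          # ~3.1e7 params
embeddings = 2 * V * D_MODEL                        # ~6.2e7 params

total = stack + final_linear + embeddings
print(f"{total:.2e} parameters")                    # ~2.1e8, vs. the reported 213M
```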
The field “Training power draw (W)” contains the power draw of the hardware used to train the model, measured in watts. This field is filled in when the training hardware type and quantity are known, and is calculated as follows:
Training power draw = PUE × Server overhead × Power per GPU × Hardware quantity
where:
PUE = 1.23 for models published in 1950-2008; 1.08 in 2025
From 2009 onwards, PUE decays exponentially at a rate of ln(1.08/1.23)/16 ≈ 0.8% per year. The value is chosen based on the publication date of the model.
Server overhead = 1 if hardware quantity = 1; 1.82 if hardware quantity > 1
This formula multiplies the power per chip by the number of chips to get the peak power draw of the computing hardware, then adjusts for the server hardware needed to connect the processors, and for the power usage effectiveness (PUE) of the facility containing the hardware.
The server overhead factor represents the power consumption of server hardware that is needed to connect multiple GPUs or TPUs to use them on the same computing task.
The value is derived from the NVIDIA DGX H100 server, by comparing the power consumption of the full server to that of its GPUs alone: 10.2 kW server power consumption / (8 × 700 W H100 TDP) ≈ 1.82.
This server is chosen because it is representative of the hardware used to train modern notable machine learning models. As of May 2025, the geometric mean server power overhead for hardware used to train models in Epoch's AI models dataset (weighted by the number of notable models trained with each hardware type) was 1.79, very similar to the overhead of the DGX H100. The H100 also accounts for the majority of total training compute of notable models in Epoch's AI models dataset and the majority of installed AI cluster compute capacity in Epoch's AI supercomputers dataset; Hopper chips, including the H100, constituted the majority of NVIDIA's AI chip compute stock as of the end of 2024.
The power usage effectiveness (PUE) factor is the ratio of the total energy consumed by a data center facility to the energy delivered to computing equipment, and represents the additional overhead required to run processors and servers due to cooling and other non-computing power consumption. We select values of 1.23 in 2008 and before, 1.08 in 2025, and an exponential decay at a rate of ln(1.08/1.23)/16 ≈ 0.8% per year from 2009 onwards, based on Google's data center efficiency disclosures. Other hyperscalers, such as Meta, report similar data center efficiency. Non-AI data centers tend to have higher PUE (i.e. lower efficiency), but the industry average follows a similar trend.
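Putting the pieces together, a minimal sketch of the power draw calculation (the cluster size and chip in the example are hypothetical, and holding PUE at 1.08 after 2025 is our own assumption, since the methodology specifies values only through 2025):

```python
import math

SERVER_OVERHEAD = 1.82  # DGX H100: 10.2 kW server / (8 x 700 W GPUs)

def pue(year: int) -> float:
    """Data center PUE: 1.23 through 2008, decaying to 1.08 by 2025."""
    if year <= 2008:
        return 1.23
    # Held at 1.08 after 2025 -- an assumption beyond the stated methodology
    return 1.23 * math.exp(math.log(1.08 / 1.23) / 16 * min(year - 2009, 16))

def training_power_draw_w(publication_year: int, n_chips: int, chip_power_w: float) -> float:
    """Training power draw = PUE x server overhead x power per chip x chip count."""
    overhead = 1.0 if n_chips == 1 else SERVER_OVERHEAD
    return pue(publication_year) * overhead * chip_power_w * n_chips

# Hypothetical example: a model published in 2024, trained on 1,000 H100s (700 W TDP each)
print(f"{training_power_draw_w(2024, 1000, 700) / 1e6:.2f} MW")  # ~1.39 MW
```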
To facilitate data collection, we use an LLM-based categorization pipeline. On a validation set of 192 models, its error rate was 4%. See here for methodological details.
As discussed in Records, each confidence status corresponds to a bound on the recorded value, interpreted as a 90% confidence interval.
Confidence applies to the recorded values for Training compute, Parameters, and Training dataset size. Among whichever of these fields have a non-empty entry, the status describes the most uncertain one.
To estimate confidence statuses, we consider which parts of an estimate are uncertain, and how large that uncertainty is.