Biology AI Models Documentation

Epoch AI's Biology AI Models Dataset is a collection of machine learning models with applications in biology, compiled to support research on trends at the intersection of AI and biology.

Overview

Epoch AI’s Biology AI Models Dataset is a collection of machine learning models trained on biological data, and key information about their training. This dataset is useful for research about trends in the application of artificial intelligence to biology. This documentation describes which models are contained within the dataset and its records (including data fields and definitions). The dataset is available on our website as a visualization, and is available for download as a daily-updated CSV file.

If you would like to ask any questions about the database, or suggest a model that should be added, feel free to contact us at data@epoch.ai. If this dataset is useful for you, please cite it. To request access to data about biological model safeguards, please contact safeguards@epoch.ai.

Citation

[This data is part of our broader model database, so please cite it as such]

Epoch AI, ‘Parameter, Compute and Data Trends in Machine Learning’. Published online at epoch.ai. Retrieved from: ‘https://epoch.ai/data/notable-ai-models’ [online resource]

BibTeX citation

@misc{epoch2022pcdtrends,
  title = "Parameter, Compute and Data Trends in Machine Learning",
  author = {{Epoch AI}},
  year = 2022,
  url = {https://epoch.ai/data/notable-ai-models},
  note = "Accessed: "
}

Inclusion

The dataset focuses on biological ML models: those that are trained on biological data, including biological sequences, molecular structures or data about molecular properties, among others. Here, we detail criteria for inclusion, and give an overview of how the data have been collected.


Criteria

To be included in the dataset, an ML model must satisfy all inclusion criteria:

  • there must be reliable documentation of its existence and relevance to machine learning;
  • the model must include a learning component and cannot be a non-learned algorithm;
  • the model must have been trained; it cannot be a theoretical description without experimental results;
  • the model must be directly and explicitly trained on biological data, including:
    • biological sequence data;
    • biomolecule structure data;
    • fitness, pathogenicity or other biological properties of proteins or other biomolecules;
    • cell-level data (cell type, expression levels, spatial or imaging data…).

Search process

This data has been collected mainly through a literature review, although some models have been added from other sources, such as announcements of high-profile models from leading industry labs, bibliographies of notable papers, and ad hoc suggestions from contributors.


Coverage

As of February 21, 2025, the dataset contains 350 models, 154 of which have compute estimates. The dataset does not provide exhaustive coverage of biological models. We attempt to cover the most historically relevant models, as well as significant models released in 2023 and 2024, but we expect some important models to be missing from our dataset.

Records

The database focuses on information relevant to trends in AI model development. Records in this dataset have information about three broad areas:

Bibliographic information about the model and its associated publication, for example its title, URL, authors, citations, date, etc.

Training details such as training compute, parameters, dataset size, hardware used for training, etc.

Metadata about the record, such as notes on the above fields with supporting evidence and context, our confidence in the key values, etc.

We provide a comprehensive guide to the dataset’s fields below. This includes examples taken from DNABERT, one of the best-documented models in the dataset. If you would like to ask any questions about the database, or request a field that should be added, feel free to contact us at data@epoch.ai.

Column Type Definition Example from DNABERT Coverage
Abstract Text

Abstract text from the publication associated with the model.

Motivation Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. Gene regulatory code is highly complex due to the existence of polysemy and distant semantic relationship, which previous informatics methods often fail to capture especially in data-scarce scenarios. Results To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture global and transferrable understanding of genomic DNA sequences based on up and downstream nucleotide contexts. We compared DNABERT to the most widely used programs for genome-wide regulatory elements prediction and demonstrate its ease of use, accuracy and efficiency. We show that the single pre-trained transformers model can simultaneously achieve state-of-the-art performance on prediction of promoters, splice sites and transcription factor binding sites, after easy fine-tuning using small task-specific labeled data. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationship within input sequences for better interpretability and accurate identification of conserved sequence motifs and functional genetic variant candidates. Finally, we demonstrate that pre-trained DNABERT with human genome can even be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fined tuned to many other sequence analyses tasks. 97%
338 out of 350 models
Authors Text

Comma-separated list of authors.

Authors are named in the way that they report their names in their publications, if applicable. For example, Lê Viết Quốc is credited as “Quoc V. Le” in his publications.

Yanrong Ji, Zhihan Zhou, Han Liu, Ramana V Davuluri 99%
346 out of 350 models
Base model Categorical (single select)

Which base model the model was fine-tuned from, if applicable.

SciBERT 5%
19 out of 350 models
Batch size Numeric

Batch size used during training.

128 4%
15 out of 350 models
Citations Numeric

Number of citations as of last update. Values are collected from Semantic Scholar where available, otherwise manually from Google Scholar.

479 60%
209 out of 350 models
Confidence Categorical (single select)

Metadata describing our confidence in the recorded values for Training compute, Parameters, and Training dataset size. This describes confidence for the most uncertain of these parameters, where they have a non-empty entry (compute is typically the most uncertain).

The confidence statuses specify 90% confidence that the recorded values are within the following bounds:

  • Confident - ±3x, 0.5 orders of magnitude.
  • Likely - ±10x, 1 order of magnitude.
  • Speculative - ±31x, 1.5 orders of magnitude.

We also provide further statuses:

  • Unknown - we have too little information to even make a speculative estimate.
  • Wrong - we know this estimate is incorrect, and it has been queued for correction.
  • Unverified - this estimate has not yet been assessed for confidence.
Confident 99%
347 out of 350 models
Country (from Organization) Categorical (multiple select)

Country/countries associated with the developing organization(s). Multinational is used to mark organizations associated with multiple countries.

United States of America 98%
343 out of 350 models
Domain Categorical (multiple select)

The machine learning domain(s) of application associated with the model. This is fairly high-level, for example “Language” incorporates many different ML tasks.

Possible values: Biology, Image generation, Language, Materials science, Mathematics, Medicine, Multimodal, Vision

Biology 100%
350 out of 350 models
Task Categorical (multiple select)

The fine-grained task(s) that the model is designed to perform. These are specific applications of the model to different problems, and can span multiple domains.

Task labels are assigned by following a flowchart. Each applicable branch of the flowchart is followed until a leaf node is reached. If the task is already in the database, the model is tagged with that task. If the task does not yet exist in the database, the model is tagged with that task and the task is added to the flowchart.

Examples:

  • Face recognition
  • Visual question answering
  • Tic Tac Toe
  • Weather forecasting
Protein or nucleotide language model (pLM/nLM) 100%
350 out of 350 models
Epochs Numeric

The number of epochs (repetitions of the training dataset) used to train the model.

4 12%
43 out of 350 models
Finetune compute Numeric

Compute used to fine-tune the model, if applicable.

4.980528e+16 1%
5 out of 350 models
Hardware quantity Numeric

Indicates the quantity of the hardware used in training, i.e. the number of chips.

4 15%
51 out of 350 models
Hardware utilization Numeric

Number between 0.00 and 1.00 indicating the hardware utilization ratio, i.e. utilized FLOPs / theoretical maximum FLOPs.

Where available, we record Model FLOP Utilization (MFU), which does not depend on implementation details such as checkpointing. However, when Hardware FLOP Utilization is the only reported value, it can be recorded instead.

[empty] 0%
0 out of 350 models
Link URL

Link(s) to best-choice sources documenting a model. This should preferentially be a journal or conference paper, preprint, or technical report. If these are not available, the links should point to other supporting evidence, such as an announcement post, a news article, or similar.

https://academic.oup.com/bioinformatics/article/37/15/2112/6128680 100%
349 out of 350 models
Model Text

The name of the model. This should be unique within the database, and should be the best-known name for a given model.

This column must be filled in, and is used as the primary key for indexing entries in the dataset.

DNABERT 100%
350 out of 350 models
Notability criteria Categorical (multiple select)

The criteria met by the model which qualify it for notability and therefore inclusion in the dataset. To be notable, a model must meet at least one criterion.

Possible values are highly cited, large training cost, significant use, state of the art, or historical significance. These are discussed further in Inclusion.

SOTA improvement 12%
43 out of 350 models
Organization Categorical (multiple select)

Organization(s) who created the model.

Organizations may have multiple different names, but we aim to standardize organization names where they refer to the same organization. Therefore, organizations are periodically reviewed in Airtable and standardized to the most common name for them.

For example, “University of California, Berkeley” and “Berkeley” have been changed to “UC Berkeley”. Note that some organizations have similar names but genuinely are different organizations, for example Google Brain versus Google versus Google DeepMind.

Northeastern University 98%
344 out of 350 models
Organization categorization Categorical (multiple select)

Categorization of the organization(s), automatically populated from the Organization entry. Models are categorized as “Industry” if their authors are affiliated with private sector organizations, “Academia” if the authors are affiliated with universities or academic institutions, or “Industry - Academia Collaboration” when at least 30% of the authors are from each.

Possible values: Industry, Research Collective, Academia, Industry - Academia Collaboration (Industry leaning), Industry - Academia Collaboration (Academia leaning), Non-profit

Academia 97%
339 out of 350 models
Parameters Numeric

Number of learnable parameters in the model. For neural networks, these are the weights and biases. Further information is provided in Estimation.

1.6e7 24%
85 out of 350 models
Publication date Date

The publication, announcement, or release date of the model, in YYYY-MM-DD format. If the year and month are known but the day is unknown, the day is filled in as YYYY-MM-15. If the year is known but the month and day are unknown, the month and day are filled in as YYYY-07-01.

2021-08-15 100%
349 out of 350 models
Reference Text

The literature reference for the model, such as the title of the journal or conference paper, academic preprint, or technical report.

DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome 100%
350 out of 350 models
Training compute Numeric

Quantity of compute used to train the model, in FLOP. This is the total training compute for a given model, i.e. pretrain + finetune. It should be filled in here when directly reported, or calculated via GPU-hours or backpropagation gradient updates. Further guidance is provided in Estimation.

1.1e20 44%
154 out of 350 models
Training compute estimation method Categorical (multiple select)

Indicates how the quantity of training compute was found or estimated. Options include:

  • “Reported”, when the developers report how much training compute was used.
  • “Operation counting”, when the parameters, training data, and/or architecture are known, and these are used to estimate the number of operations performed while training.
  • “Hardware”, when the training time and/or hardware type are known, allowing an estimate based on the rate of computation.
  • “Cost”, when the training cost is known and the compute is estimated through the hardware usage budget.
  • “Benchmarks”, when the model’s training compute was estimated using its benchmark performance.
Hardware, Operation counting 43%
150 out of 350 models
Training dataset Categorical (multiple select)

Standard datasets are often used, and can be selected as multiple choice options. If a custom, unreleased dataset is used, it is set as "Unspecified Unreleased". Where this entry is empty, it has not yet been entered for a given model.

Datasets are standardized to their most common name. For example, “MS COCO” and “Microsoft COCO” are standardized as “COCO”.

Human Reference Genome (GRCh38/hg38) 22%
76 out of 350 models
Training dataset size Numeric

Number of datapoints in the training dataset, in the unit specified for a given task, for example number of images in image classification, or number of words in language modeling. Further guidance is provided in Estimation. This counts the dataset size as used for training, so e.g. if a model is trained on a subset of a public dataset, this field reflects the size of that subset.

3.0e9 74%
260 out of 350 models
Training hardware Categorical (multiple select)

Type of training hardware used. Entries are cross-referenced against Epoch AI’s database of ML training hardware.

NVIDIA Quadro RTX 5000 20%
69 out of 350 models
Training time (hours) Numeric

Training time of the model, if reported. This refers to the time elapsed over the training run, not the number of GPU-hours. So for example, if a model were trained with 10 GPUs for 1 hour, the training time would be 1 hour.

600 11%
37 out of 350 models
Training power draw (W) Numeric

Power draw of the hardware used to train the model, in watts. Calculated as hardware quantity times processor TDP times datacenter PUE times server overhead. More details are provided in Estimating power draw.

2695 14%
48 out of 350 models
Frontier model Boolean

Indicates whether a model was within the frontier, defined as models that were in the top 10 by training compute as of their release date.

True 0%
1 out of 350 models
Model accessibility Categorical (multiple select)

The accessibility of the model in terms of whether the model weights can be downloaded or, if the model weights are not accessible, whether the model can be used in an API or product.

Open weights (unrestricted) 19%
66 out of 350 models
Training code accessibility Categorical (single select)

Denotes whether the code used to train the model is publicly available. “Open source” indicates that the training code has been publicly released.

Open source 14%
48 out of 350 models
Notes fields, e.g. “Training compute notes” Text

Metadata documenting the reasoning and/or evidence for a given column, e.g. training compute or dataset size. This is particularly important to note in cases where such information isn’t obvious. This field is unstructured text.

Training compute notes:

"Since the pre-training of DNABERT model is resource-intensive (about 25 days on 8 NVIDIA 2080Ti GPUs)" Assuming FP16 and 30% utilization Calculation = (25 * 24 *3600) s * 2.7e13 FLOP/s per GPU * 8 GPUs * 0.3 utilization = 1.4e20 FLOP Alternatively: "DNABERT takes a sequence with a max length of 512 as input... We pre-trained DNABERT for 120k steps with a batch size of 2000" 6 * 512 * 2000 * 120k * 110M = 8.11e19 Geometric mean: 1.07e20
47%
166 out of 350 models

Estimation

Some fields within the database require estimation, because they are often not straightforwardly reported within papers or other sources. Here, we detail how estimation works for compute, model size, dataset size, and the metadata on estimate confidence.


Estimating compute

Training compute is one of the most important pieces of information in our dataset, as reflected in its usage across Epoch AI’s research and elsewhere. However, estimating compute can be challenging. Here we outline how compute estimation is performed in the notable models dataset.

Compute is measured in units of floating point operations (FLOP). For older models, the relevant operations were sometimes integer operations - in this case we report these instead. We do not apply any multiplier to adjust for operations potentially being more valuable under different tensor formats or precisions, for example FP16 versus FP32 or BF16. Some sources report compute in multiply-add operations, fused multiply-adds (FMAs), or similar. We treat one multiply-add/FMA as equivalent to two FLOP, to match typical reporting of chip performance.

For a given model in the database, training compute is provided as the total training compute, including pretraining, and including pretrained base models used as components. Finetuning compute is recorded in its associated column. Finetuning is distinguished by authors’ descriptions of the training as finetuning, or unambiguous use of a pretrained model in a distinct phase of training.

In the simplest case, training compute is directly reported in a paper, and we enter this figure into the database. When compute is not reported, we use two main methods to estimate it:

  1. Hardware details and usage.
  2. Counting the operations based on model architecture and data.

When there is enough information to count the operations, this is preferred in our dataset, because typically hardware-based estimates require assumptions about utilization, which may reduce the estimates’ accuracy.

Hardware details and usage

Hardware details and usage is a relatively straightforward way to estimate compute, when the necessary details are known:

  1. The usage in chip-time, e.g. “trained on a cluster of 128 TPUv3 instances for two days” means 256 chip-days = 128 chips × 2 days. Sometimes this is reported as separate chips and time used, other times this may be reported directly in chip-time. When it is not reported, we may create estimates from publicly-known information, comparison to typical training runs, etc.
  2. The type of hardware used, e.g. NVIDIA H100, TPUv3. Ideally, this is reported in the paper. Otherwise, for more speculative estimates, one may have to make assumptions based on institution and year, e.g. that Google would have used TPUs of the corresponding generation in that year.
  3. The type of number representation used, e.g. FP32, FP16, BF16. Ideally, this is reported in the paper. When not reported, it can often be guessed. For example, the number representation was typically FP32 for models trained before 2019.

Once these details are known, the corresponding peak FLOP/s performance by hardware and number representation can be found from hardware documentation or from Epoch AI’s database of ML training hardware. Finally, utilization rates account for real training runs falling significantly short of peak performance due to memory bottlenecks, network latency, etc. Typical utilization rates for large distributed training runs are around 30-50%. When these are not reported, they are estimated by reference to comparable models from a similar time period.

Table 2: Worked example of estimating training compute from hardware details.
ImageGPT

Some training details are provided in the blogpost: “[..]iGPT-L was trained for roughly 2500 V100-days […]”

The number representation is not specified, but given this was trained by a major corporation in 2020, we assume the number format was FP16.

The V100 has 125 TFLOP/s tensor FP16 performance. Assuming a utilization of 0.3, this leads to the following compute estimate:

8.1e21 FLOP = 2500 V100-days × 125e12 FLOP/s × 0.3 utilization × 86.4e3 s/day
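
The calculation above can also be expressed as a short script. The following is a minimal Python sketch of the hardware-based estimate, using the ImageGPT figures from Table 2; the peak FLOP/s value and the 0.3 utilization are assumptions that must be sourced per model.

# Minimal sketch of a hardware-based training compute estimate.
# Peak FLOP/s and utilization must be sourced per model (hardware
# documentation, reported MFU, or a typical 0.3-0.5 assumption).

SECONDS_PER_DAY = 86_400

def compute_from_hardware(chip_days: float, peak_flop_per_s: float, utilization: float) -> float:
    """Training compute (FLOP) = chip-time x peak FLOP/s x utilization."""
    return chip_days * SECONDS_PER_DAY * peak_flop_per_s * utilization

# Worked example from Table 2 (ImageGPT): 2500 V100-days,
# 125e12 FLOP/s tensor FP16 peak, assumed 0.3 utilization.
estimate = compute_from_hardware(chip_days=2500, peak_flop_per_s=125e12, utilization=0.3)
print(f"{estimate:.1e} FLOP")  # ~8.1e21 FLOP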

Counting the operations

Counting the number of operations is often useful for older research, where hardware and usage details might be unavailable. A widely applicable heuristic for the training compute of dense models is:

Training compute ≈ 2 × # of connections × 3 × # of training examples × # of epochs

This works by first estimating the FLOP required for a forward pass, which is approximately twice the number of connections. This can be modified for sparsity such as Mixture-of-Experts: in this case, the heuristic should count only the connections in the active experts.

The forward pass FLOP is then multiplied by three to account for the backward pass, as the ratio between forward and backward passes is 1:2 for non-recurrent dense models. Finally, this is multiplied by the number of passes performed over the data - the number of training examples multiplied by the number of epochs the model was trained for. For transformer-based language models, this formula is equivalent to the commonly-used heuristic: Compute = 6 × # of parameters × # of training examples × # of epochs.
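
As an illustration, the heuristic can be written as a one-line function. This is a minimal Python sketch; the example figures are taken from the DNABERT training compute notes shown earlier (roughly 110M parameters and 512 × 2000 × 120k training tokens), not from any additional source.

def training_compute_flop(n_parameters: float, n_training_examples: float, n_epochs: float = 1.0) -> float:
    # Compute = 6 x parameters x training examples x epochs (dense models).
    return 6 * n_parameters * n_training_examples * n_epochs

# E.g. ~110M parameters and 512 x 2000 x 120,000 ~= 1.23e11 training tokens:
print(f"{training_compute_flop(1.1e8, 512 * 2000 * 120_000):.2e}")  # ~8.1e19 FLOP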

Sometimes, the FLOP for a forward pass is reported directly in a paper. In this case, this value can be used directly instead of 2 × # of connections. Otherwise, the FLOP for a forward pass are evaluated by summing FLOP over the network’s layers. These are set out in Table 3.

Table 3: Common neural network layers and associated FLOP per token in a forward pass.
Layer Forward pass FLOP per token (approx)
Fully connected layer from N neurons to M neurons 2×N×M
CNN from a tensor of shape H×W×C with D filters of shape K×K×C, applied with stride S and padding P 2×H×W×D×K^2×C/S^2
Transpose CNN from a tensor of shape H×W×C with D filters of shape K×K×C, applied with stride S and padding P 2×H×W×D×K^2×C
RNN with bias vectors taking an input of size N and producing an output of size M 2×(N+M)×M
Fully gated GRU with bias vectors taking an input of size N and producing an output of size M 6×(N+M)×M
LSTM with bias vectors taking an input of size N and producing an output of size M 8×(N+M)×M
Word Embedding for vocabulary size V and embedding dimension W 0
Self attention layer with sequence length L, inputs of size W, key of size D and output of size N 2×W×(2×D+N) + 2×L×(D+N)
Multi-headed attention layer with sequence length L, inputs of size W, key of size D, head output of size N, output of size M and H attention heads 2×H×(W×(2×D+N) + L×(D+N) + N×M)
Table 4: Worked example of estimating training compute from architecture.
Attention Is All You Need

The input is a sequence of tokens, with an average length of 20 and a vocabulary size of 30,000. Each token is embedded and represented as a vector of size 1024. There are six encoder and decoder layers. Each encoder-decoder pair has a total of 3 multi-headed attention (MHA) sublayers, and 2 fully connected (FCN) sublayers. At the end there is a final linear layer and a softmax.

Each MHA sublayer has the following parameters: input size W=64, head output size N=64, key size D=64, number of heads H=16, final output size M=1024. Hence each MHA sublayer has 2×16×(64×(2×64+64) + 20×(64+64) + 64×1024) = 2.6e6 FLOP per token.

Each FCN sublayer has an input size of 1024, output size of 1024, and a single hidden layer with 4096 units. Hence each FCN sublayer has 2×2×1024×4096 = 1.7e7 FLOP per token.

Summing all its layers, the encoder-decoder stack has 6 × (3 × 2.6e6 + 2 × 1.7e7) ~= 2.5e8 FLOP per token. The final linear layer has 2 × 1024 × 3e4 = 6.1e7 FLOP per token. Summing these, a forward pass takes 3.1e8 FLOP per token.

The paper says they use batches of 25,000 tokens, and run the training for 300,000 steps. So the total training FLOP would be 2.5e4 × 3e5 × 3 × 3.1e8 = 6.97e18 FLOP.
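
The same count can be reproduced programmatically. Below is a minimal Python sketch of the Table 4 calculation, applying the per-layer formulas from Table 3 to the layer dimensions assumed in the worked example above.

# Minimal sketch reproducing the Table 4 operation count for the original
# Transformer ("Attention Is All You Need"), under the assumptions stated above.

def mha_flop_per_token(W, D, N, M, H, L):
    # Multi-headed attention: 2*H*(W*(2*D+N) + L*(D+N) + N*M)
    return 2 * H * (W * (2 * D + N) + L * (D + N) + N * M)

def fcn_flop_per_token(n_in, n_hidden, n_out):
    # Two fully connected layers: 2*n_in*n_hidden + 2*n_hidden*n_out
    return 2 * n_in * n_hidden + 2 * n_hidden * n_out

mha = mha_flop_per_token(W=64, D=64, N=64, M=1024, H=16, L=20)    # ~2.6e6
fcn = fcn_flop_per_token(1024, 4096, 1024)                        # ~1.7e7
stack = 6 * (3 * mha + 2 * fcn)                                   # ~2.5e8
final_linear = 2 * 1024 * 30_000                                  # ~6.1e7
forward_per_token = stack + final_linear                          # ~3.1e8

tokens_per_batch, steps = 25_000, 300_000
training_flop = 3 * forward_per_token * tokens_per_batch * steps
print(f"{training_flop:.2e} FLOP")  # ~6.97e18 FLOP, matching the figure above up to rounding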

Benchmark performance

When details about model architecture, training data, hardware, and development time are scarce, it may be informative to compare the model’s performance on benchmarks to that of other models. Scaling laws can predict benchmark performance improvements against compute when scaling a given model family (for example coding performance for GPT-4 scaling and ARC Challenge for Llama-3). When there are differences in model/data/training, benchmark performance is less predictable from compute, but nevertheless remains correlated.

This process of estimating training compute from benchmark performance can be improved by aggregating performance across many benchmarks, especially when several or many models with known training compute have been evaluated on those benchmarks.

The procedure is roughly as follows:

  • Collect a dataset of many benchmarks (e.g. MMLU, GPQA, BigCodeBench) and models (e.g. Llama 3, Mistral Large, Nemotron 4) with known training compute and benchmark scores.
  • For each benchmark, fit a curve that best matches the benchmark scores of each model as a function of their training compute. The curve-fitting procedure uses sigmoid functions, based on our study How Predictable is Language Model Benchmark Performance?
  • Using the fitted curves, impute an x-value (training compute) from the y-values (benchmark scores) for models with unknown training compute. This represents the approximate training compute necessary to achieve the benchmark scores demonstrated by those models.
    • The resulting estimates are cross-validated to ensure that the fitted values are reasonable even if some benchmark evaluation datapoints are held out.
  • If applicable, combine any other information about the compute resources available to the developers with the evidence from the benchmarks, to obtain overall estimates of the compute used to train the models.
  • It may be helpful to constrain fitting to models with the most similar algorithmic efficiency. For example, when using this approach to collect information on leading LLMs from 2024, we constrained fitting to models with algorithmic efficiency similar to, or better than, the Llama 3 family.

This process is demonstrated in a public Colab notebook, Compute Estimation from Benchmark Scores. Because these compute estimates are already based on benchmark performance, they should be excluded from analyses of the relationship between benchmarks and compute. Such compute estimates can be filtered using the Training compute estimation method field.
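
For illustration only, the core of this procedure, fitting a sigmoid of benchmark score against log-compute and then inverting it, can be sketched as below. This is not Epoch AI's exact pipeline, and the data points are made up; refer to the Colab notebook for the real procedure.

# Illustrative sketch: fit benchmark score vs. log10(training compute) for
# models with known compute, then invert the curve for an unknown model.
# All data points below are hypothetical.

import numpy as np
from scipy.optimize import curve_fit, brentq

def sigmoid(log_compute, lo, hi, mid, slope):
    return lo + (hi - lo) / (1 + np.exp(-slope * (log_compute - mid)))

# Hypothetical (log10 FLOP, benchmark score) pairs for reference models.
log_c = np.array([22.0, 23.0, 24.0, 25.0, 25.7])
score = np.array([0.30, 0.42, 0.58, 0.74, 0.82])

params, _ = curve_fit(sigmoid, log_c, score, p0=[0.25, 0.95, 24.0, 1.0], maxfev=10_000)

def impute_log_compute(observed_score):
    # Solve sigmoid(x) = observed_score for x within the calibrated range.
    return brentq(lambda x: sigmoid(x, *params) - observed_score, 20.0, 28.0)

print(f"Imputed compute: 1e{impute_log_compute(0.66):.1f} FLOP")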


Estimating model size

Parameter counts are often reported by the model developer, but if the parameter count is not stated, it can sometimes be estimated from reported architectural details. As with estimating compute, this requires finding a description of the architecture, i.e. the type, number, and configuration of the layers, then calculating the parameters in each layer and summing them. Table 5 lists the parameter counts for different layers. Alternatively, if an implementation of the architecture is available, it can be simpler to load it in code and count the parameters directly.

Table 5: Common neural network layers and parameters.
Layer Parameters (approx)
Fully connected layer from N neurons to M neurons N×M
CNN from a tensor of shape H×W×C with D filters of shape K×K×C, applied with stride S and padding P D×K^2×C
Transpose CNN from a tensor of shape H×W×C with D filters of shape K×K×C, applied with stride S and padding P D×K^2×C
RNN with bias vectors taking an input of size N and producing an output of size M (N+M)×M
Fully gated GRU with bias vectors taking an input of size N and producing an output of size M 3×(N+M)×M
LSTM with bias vectors taking an input of size N and producing an output of size M 4×(N+M)×M
Word Embedding for vocabulary size V and embedding dimension W W×V
Self attention layer with sequence length L, inputs of size W, key of size D and output of size N W×(2×D+N)
Multi-headed attention layer with sequence length L, inputs of size W, key of size D, head output of size N, output of size M and H attention heads H×(W×(2×D + N) + N×M)
Table 6: Worked example of estimating model size from architecture.
Attention Is All You Need

The input is a sequence of tokens, with an average length of 20 and a vocabulary size of 30,000. Each token is embedded and represented as a vector of size 1024. There are six encoder and decoder layers. Each encoder-decoder pair has a total of 3 multi-headed attention (MHA) sublayers, and 2 fully connected (FCN) sublayers. At the end there is a final linear layer and a softmax.

Each MHA sublayer has the following parameters: input size W=64, head output size N=64, key size D=64, number of heads H=16, final output size M=1024. Hence each MHA sublayer has 16×(64×(2×64 + 64) + 64×1024) = 1.2e6 parameters.

Each FCN layer has an input size of 1024, output size of 1024, and a single hidden layer with 4096 units. Hence each FCN layer has 2×1024×4096 = 8.4e6 parameters.

Summing all its layers, the encoder-decoder stack has 6 × (3 × 1.2e6 + 2 × 8.4e6) ~= 1.2e8 parameters. The final linear layer has 1024 × 3e4 = 3.1e7 parameters. Two embedding layers each have 30e3 × 1024 parameters, so 6.2e7 in total. Summing these, the model has 2.1e8 parameters, matching the reported 213 million parameters in the paper.
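
As with the compute example, this parameter count can be reproduced in a few lines. The following minimal Python sketch applies the Table 5 formulas to the same assumed layer dimensions.

# Minimal sketch reproducing the Table 6 parameter count for the original
# Transformer, using the per-layer formulas in Table 5.

def mha_params(W, D, N, M, H):
    # Multi-headed attention: H*(W*(2*D+N) + N*M)
    return H * (W * (2 * D + N) + N * M)

def fcn_params(n_in, n_hidden, n_out):
    return n_in * n_hidden + n_hidden * n_out

mha = mha_params(W=64, D=64, N=64, M=1024, H=16)   # ~1.2e6
fcn = fcn_params(1024, 4096, 1024)                 # ~8.4e6
stack = 6 * (3 * mha + 2 * fcn)                    # ~1.2e8
final_linear = 1024 * 30_000                       # ~3.1e7
embeddings = 2 * 30_000 * 1024                     # ~6.1e7
total = stack + final_linear + embeddings
print(f"{total:.2e} parameters")                   # ~2.1e8, roughly matching the reported 213M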


Estimating dataset size

Dataset size is measured in the number of datapoints used as training examples, as outlined in Table 7. The objective here is to provide an intuitive idea of how large a dataset is. We record the number of distinct datapoints, not multiplied by the number of epochs. The number of epochs is recorded separately. Table 8 provides worked examples across several of these ML problems.

Table 7: How to measure dataset size for different ML problems.
ML problem Way of measuring dataset size
Classification # training examples
Image classification # images
Image captioning # captions
Language modeling # words
Translation # words in input language
Text classification # training examples
Speech recognition # words
Reinforcement learning # timesteps
Table 8: Worked examples for calculating dataset size across different ML problems.
Image classification: Deep Residual Learning for Image Recognition

“We evaluate our method on the ImageNet 2012 classification dataset that consists of 1000 classes. The models are trained on the 1.28 million training images, and evaluated on the 50k validation images.”

We thus note down a dataset size of 1.28e6 (the number of images).

Image captioning

According to the authors, the MSCOCO dataset is “arguably the largest and highest quality dataset” that they used. This had 82,783 training examples, each containing a single image and 5 sentences that are “relatively visual and unbiased”. To determine the dataset size, we consider the number of image-caption pairs. Thus we count 82,783 * 5 = 413,915 training examples.

Language modeling: Language Models are Few-Shot Learners

From the paper, we determine that there are 410 + 19 + 12 + 55 + 3 = 499 billion tokens.

We convert this to words by multiplying by 0.75 to give 374B words.

Speech: An RNN-based prosodic information synthesizer for Mandarin text-to-speech

“A continuous-speech Mandarin database provided by the Telecommunication Laboratories, MOTC,1 R.O.C. was used… The data base was divided into two parts: a training set and an open test set. These two sets consisted of 28191 and 7051 syllables, respectively.”

We convert this to words by multiplying 28,191 syllables by 0.62 to get 17,478 words.

Language dataset sizes are usually reported in terms of tokens or gigabytes. These are converted to words per Table 9. These factors are based on the OpenAI GPT-3 tokenizer for Western languages, and on manual inspection showing that tokenizers typically have one token per word in Mandarin, Japanese and Korean. The ratio is tokenizer-dependent, meaning that when estimating a dataset’s size in words, one should consider whether the tokenizer might have a substantially different ratio. Speech recognition data are often expressed in terms of duration or syllables, which are converted to words per Table 10.

Table 9: Conversion between words, tokens, and GB for different languages.
Language Words per token Words per GB (approx)
English 0.75 200M
Mandarin Chinese 1 167M
German 0.75 167M
Spanish 0.75 200M
Japanese 1 200M
Korean 1 111M
Table 10: Conversion between words, minutes/hours, and syllables for different languages. Adapted from Trauzettel-Klosinski et al. (2012).
Language Words per minute (WPM) Words per hour (WPH) Words per syllable (WPS)
English 228 13,680 ~0.73
Mandarin Chinese 158 9,480 ~0.62
German 179 10,740 ~0.59
Spanish 218 13,080 ~0.41
Japanese 193 11,580 ~0.43
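
For convenience, the conversions in Tables 9 and 10 can be wrapped in a small helper. The Python sketch below simply applies the factors from the tables above; the GPT-3 and Mandarin TTS figures are those used in Table 8.

# Small sketch applying the Table 9 / Table 10 conversion factors to express
# dataset sizes in words. Factors are copied from the tables above.

WORDS_PER_TOKEN = {"English": 0.75, "Mandarin Chinese": 1.0, "German": 0.75,
                   "Spanish": 0.75, "Japanese": 1.0, "Korean": 1.0}
WORDS_PER_SYLLABLE = {"English": 0.73, "Mandarin Chinese": 0.62, "German": 0.59,
                      "Spanish": 0.41, "Japanese": 0.43}

def tokens_to_words(n_tokens, language="English"):
    return n_tokens * WORDS_PER_TOKEN[language]

def syllables_to_words(n_syllables, language="English"):
    return n_syllables * WORDS_PER_SYLLABLE[language]

# GPT-3: 499B tokens -> ~374B words; Mandarin TTS: 28,191 syllables -> ~17,478 words.
print(tokens_to_words(499e9))                          # ~3.74e11
print(syllables_to_words(28_191, "Mandarin Chinese"))  # ~17,478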

Estimating power draw

The field “Training power draw (W)” contains the power draw of the hardware used to train the model, measured in watts. This field is filled in when the training hardware type and quantity are known, and is calculated as follows:

The power draw is calculated for the processing hardware (GPUs or TPUs), by multiplying the TDP (thermal design power) of the processor by the number of processors.

The power draw of the computing hardware is then scaled up to account for overhead due to networking equipment and data center efficiency, giving:

Training power draw (W) = hardware quantity × processor TDP × data center PUE × server overhead
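
As a minimal Python sketch, the calculation looks as follows; the PUE and server overhead values used here are illustrative assumptions rather than Epoch AI's exact defaults.

# Minimal sketch of the power draw calculation. The default PUE and server
# overhead values are illustrative assumptions.

def training_power_draw_w(n_chips: int, chip_tdp_w: float,
                          datacenter_pue: float = 1.1,
                          server_overhead: float = 1.2) -> float:
    """Power draw (W) = hardware quantity x TDP x PUE x server overhead."""
    return n_chips * chip_tdp_w * datacenter_pue * server_overhead

# E.g. 8 GPUs with a 250 W TDP under the assumed overhead factors:
print(training_power_draw_w(8, 250))  # 2640 W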


Estimating confidence

As discussed in Records, the confidence statuses specify the following bounds as 90% confidence intervals:

  • Confident - ±3x (~0.5 orders of magnitude).
  • Likely - ±10x (1 order of magnitude).
  • Speculative - ±31x ( ~1.5 orders of magnitude).

Confidence applies to the recorded values for Training compute, Parameters, and Training dataset size. It describes confidence in the most uncertain of these values, among those that have a non-empty entry.

To estimate confidence statuses, we consider which parts of an estimate are uncertain, and how large the uncertainty is.

  • If details (compute, model size, and dataset size) are all directly reported, then the value is Confident. There is little room for error.
  • If a detail is estimated without any assumptions having to be made, then the value is Confident. For example, if hardware type, quantity, training time, number format, and utilization are all reported, then the ensuing compute estimate is unambiguous.
  • When details are ambiguous, and an assumption has to be made, we consider the uncertainty in that assumption.
  • For example, it is often necessary to estimate utilization when estimating training compute from hardware details. Given that typical utilization for language model training falls in the 0.3-0.5 range, such an estimate should fall within the Confident category.
  • Further ambiguity may move estimates into the Likely category. For example, MedBERT was trained for one week using one V100 GPU, but the authors do not report the arithmetic precision or usage of tensor cores during training, which could affect the compute usage by a factor of 4x.
  • Finally, some estimates are based almost entirely on credible ranges for (unreported) key parameters such as training time and hardware. These typically fall into the Speculative category. An example of this is GPT-4, where our compute estimate is based on secondhand reporting that lets us roughly estimate training duration and hardware.

Downloads

Biology AI Models

CSV, Updated February 21, 2025

Acknowledgements

Sentinel Bio provided a grant to fund this data collection project and make it publicly available; we thank them for their generous support. Epoch AI owns the resulting dataset.