The database focuses on information relevant to trends in AI model development. Records in the database have information about three broad areas:
- Bibliographic information about the model and its associated publication, for example its title, URL, authors, citations, and publication date.
- Training details, such as training compute, parameters, dataset size, and the hardware used for training.
- Metadata about the record, such as notes on the above fields with supporting evidence and context, our confidence in the key values, etc.
We provide a comprehensive guide to the database's fields below. This includes examples taken from Llama 2-70B, one of the best-documented recent models. If you would like to ask any questions about the database, or request a field that should be added, feel free to contact us at data@epoch.ai.
| Column | Type | Definition | Example | Coverage |
|---|---|---|---|---|
| Abstract | Text | Abstract text from the publication associated with the model. | In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs. | 85% (2979 out of 3519 models) |
| Authors | Text | Comma-separated list of authors. | Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom | 76% (2680 out of 3519 models) |
| Base model | Categorical (single select) | Which base model the model was fine-tuned from, if applicable. | | 19% (684 out of 3519 models) |
| Batch size | Numeric | Batch size used during training. | 4000000 | 7% (251 out of 3519 models) |
| Citations | Numeric | Number of citations as of the last update. Values are collected from Semantic Scholar where available, and otherwise manually from Google Scholar. | 15053 | 42% (1463 out of 3519 models) |
| Confidence | Categorical (single select) | Metadata describing our confidence in the recorded values for Training compute, Parameters, and Training dataset size. This describes confidence for the most uncertain of these values, where they have a non-empty entry (compute is typically the most uncertain). | Confident | 94% (3320 out of 3519 models) |
| Country (of organization) | Categorical (multiple select) | Country/countries associated with the developing organization(s). Multinational is used to mark organizations associated with multiple countries. | United States of America | 97% (3424 out of 3519 models) |
| Domain | Categorical (multiple select) | The machine learning domain(s) of application associated with the model. This is fairly high-level; for example, "Language" incorporates many different ML tasks. | Language | 97% (3426 out of 3519 models) |
| Task | Categorical (multiple select) | The fine-grained task(s) that the model is designed to perform. These are specific applications of the model to different problems, and can span multiple domains. Task labels are assigned by following a flowchart: each applicable branch is followed until a leaf node is reached. If the task is already in the database, the model is tagged with it; if not, the model is tagged with the new task and the task is added to the flowchart. | Language modeling, Language modeling/generation, Question answering | 97% (3397 out of 3519 models) |
| Epochs | Numeric | The number of epochs (repetitions of the training dataset) used to train the model. | 1 | 23% (803 out of 3519 models) |
| Finetune compute (FLOP) | Numeric | Compute used to fine-tune the model, if applicable. | | 7% (260 out of 3519 models) |
| Hardware quantity | Numeric | Indicates the quantity of the hardware used in training, i.e. the number of chips. | 1000 | 24% (843 out of 3519 models) |
| Hardware utilization (MFU) | Numeric | Number between 0.00 and 1.00 indicating the hardware utilization ratio, i.e. utilized FLOPs / theoretical maximum FLOPs. This value reflects utilization based on computations successfully applied to model training, and does not include computations performed by the hardware which do not ultimately affect the model. | 0.4191975017 | 2% (61 out of 3519 models) |
| Hardware utilization (HFU) | Numeric | Number between 0.00 and 1.00 indicating the hardware utilization ratio, i.e. utilized FLOPs / theoretical maximum FLOPs. This value reflects utilization based on measured computational throughput in the hardware during training. (Model FLOPs utilization is a better measure of utilization if it is available.) | | 1% (25 out of 3519 models) |
| Link | URL | Link(s) to best-choice sources documenting a model. This should preferentially be a journal or conference paper, preprint, or technical report. If these are not available, the links should point to other supporting evidence, such as an announcement post, a news article, or similar. | | 99% (3475 out of 3519 models) |
| Model | Text | The name of the model. This should be unique within the database, and should be the best-known name for a given model. This column must be filled in, and is used as the primary key for indexing entries in the dataset. | Llama 2-70B | 100% (3519 out of 3519 models) |
| Notability criteria | Categorical (multiple select) | The criteria met by the model which qualify it for notability. To be notable, a model must meet at least one criterion. Possible values are highly cited, large training cost, significant use, state of the art, or historical significance. These are discussed further in Inclusion. | Historical significance, Significant use, Highly cited, Training cost | 29% (1038 out of 3519 models) |
| Organization | Categorical (multiple select) | Organization(s) who created the model. Organizations may have multiple different names, but we aim to standardize organization names where they refer to the same organization. Therefore, organizations are periodically reviewed in Airtable and standardized to the most common name for them. For example, "University of California, Berkeley" and "Berkeley" have been changed to "UC Berkeley". Note that some organizations have similar names but genuinely are different organizations, for example Google Brain versus Google versus Google DeepMind. | Meta AI | 97% (3431 out of 3519 models) |
| Organization categorization | Categorical (multiple select) | Categorization of the organization(s), automatically populated from the Organization entry. Models are categorized as "Industry" if their authors are affiliated with private-sector organizations, "Academia" if the authors are affiliated with universities or academic institutions, or "Industry - Academia Collaboration" when at least 30% of the authors are from each (a sketch of this rule appears after the table). Possible values: Industry, Research Collective, Academia, Industry - Academia Collaboration (Industry leaning), Industry - Academia Collaboration (Academia leaning), Non-profit | Industry | 97% (3412 out of 3519 models) |
| Parameters | Numeric | Number of learnable parameters in the model. For neural networks, these are the weights and biases. Further information is provided in Estimation. | 7.0e+10 | 65% (2292 out of 3519 models) |
| Publication date | Date | The publication, announcement, or release date of the model, in YYYY-MM-DD format. If the year and month are known but the day is unknown, the day is filled in as YYYY-MM-15. If the year is known but the month and day are unknown, the month and day are filled in as YYYY-07-01. (A sketch of this imputation rule appears after the table.) | 2023-07-18 | 99% (3496 out of 3519 models) |
| Reference | Text | The literature reference for the model, such as the title of the journal or conference paper, academic preprint, or technical report. | Llama 2: Open Foundation and Fine-Tuned Chat Models | 95% (3345 out of 3519 models) |
| Training compute (FLOP) | Numeric | Quantity of compute used to train the model, in FLOP. This is the total training compute for a given model, i.e. pretrain + finetune. It should be filled in here when directly reported, or calculated via GPU-hours or backpropagation gradient updates (a worked sketch appears after the table). Further guidance is provided in Estimation. | 8.1e+23 | 39% (1388 out of 3519 models) |
| Training compute cost (2023 USD) | Numeric | The training compute cost, estimated using the "amortized hardware capex plus energy" approach documented in our training cost methodology. Values are converted to 2023 US dollars. | 1,102,561.19 | 6% (224 out of 3519 models) |
| Training compute estimation method | Categorical (multiple select) | Indicates how the quantity of training compute was found or estimated, for example from training hardware details and training time, or by counting operations. | Hardware, Operation counting | 41% (1440 out of 3519 models) |
| Training hardware | Categorical (multiple select) | Type of training hardware used. Entries are cross-referenced against Epoch AI's database of ML training hardware. | NVIDIA A100 SXM4 80 GB | 34% (1179 out of 3519 models) |
| Training time (hours) | Numeric | Training time of the model, if reported. This refers to the time elapsed during the training process, not the number of chip-hours. For example, if a model were trained with 10 GPUs for 1 hour, the training time would be 1 hour. Includes the duration of all training phases conducted to develop the model, such as pre-training, post-training, RL, SFT, etc. If the model is fine-tuned from a previously published model, then that base model's training time is not included. | 1728 | 16% (550 out of 3519 models) |
| Training power draw (W) | Numeric | Power draw of the hardware used to train the model, in watts. Calculated as hardware quantity times processor TDP times datacenter PUE times server overhead (a sketch of this formula appears after the table). More details are provided in Estimating power draw. | 795557 | 22% (771 out of 3519 models) |
| Frontier model | Boolean | Indicates whether a model was within the frontier, defined as models that were in the top 10 by training compute as of their release date. | | 4% (137 out of 3519 models) |
| Possibly over 1e23 FLOP | Boolean | Indicates whether a model was (or may have been) trained with at least 10^23 floating-point operations, which qualifies it for inclusion in the large-scale models dataset. | True | 15% (521 out of 3519 models) |
| Model accessibility | Categorical (multiple select) | The accessibility of the model, in terms of whether the model weights can be downloaded or, if the weights are not accessible, whether the model can be used via an API or product. "Open weights (unrestricted)", "Open weights (restricted use)", and "Open weights (non-commercial)" all mean that the model weights are downloadable by the public, but with different restrictions on use. "API access" means the model can only be interacted with via an application programming interface, and possibly also a hosted service. "Hosted access (no API)" means the model can only be interacted with via a hosted service. "Unreleased" means there is no way for the public to access the model. | Open weights (restricted use) | 75% (2650 out of 3519 models) |
| Training code accessibility | Categorical (single select) | Denotes whether the code used to train the model is accessible to the public. | Unreleased | 68% (2403 out of 3519 models) |
| Notes fields, e.g. "Training compute notes" | Text | Metadata documenting the reasoning and/or evidence for a given column, e.g. training compute or dataset size. This is particularly important in cases where such information isn't obvious. This field is unstructured text. | "Pretraining utilized a cumulative 3.3M GPU hours of computation on hardware of type A100-80GB", of which 1720320 GPU hours were used to train the 70B model. 311.84 BF16 TFLOP/s * 1720320 hours * 3600 s/hour * 0.40 utilization = 7.725e+23 FLOP. Alternatively: the model was trained for 1 epoch on 2 trillion tokens and has 70B parameters. C = 6ND = 6 * 70e9 * 2e12 = 8.4e+23 FLOP. | 46% (1635 out of 3519 models) |
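To make the "Training compute notes" example above concrete, here is a worked sketch of the two estimation approaches it describes: the hardware method (chip throughput × GPU-hours × utilization) and operation counting (C = 6ND). The numbers are taken from that example; this is an illustration, not the exact script used to populate the database.

```python
# Worked sketch of the two compute-estimation approaches from the
# "Training compute notes" example above (Llama 2-70B).

# Hardware method: chip throughput x GPU-hours x utilization.
peak_flop_per_s = 311.84e12   # NVIDIA A100 BF16 peak throughput, in FLOP/s
gpu_hours = 1_720_320         # reported GPU-hours for the 70B model
utilization = 0.40            # assumed hardware utilization

hardware_estimate = peak_flop_per_s * gpu_hours * 3600 * utilization
print(f"Hardware method:    {hardware_estimate:.3e} FLOP")  # ~7.725e+23

# Operation counting: C = 6ND for a dense transformer trained for 1 epoch.
parameters = 70e9             # N
tokens = 2e12                 # D (2 trillion tokens)
operation_estimate = 6 * parameters * tokens
print(f"Operation counting: {operation_estimate:.3e} FLOP")  # 8.4e+23
```

The two methods agree to within about 10%, which is why this entry carries a high confidence status.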
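The "Training power draw (W)" field is described as hardware quantity times processor TDP times datacenter PUE times server overhead. The sketch below implements that product; the PUE and server-overhead multipliers are illustrative assumptions, not Epoch AI's published coefficients.

```python
# Sketch of the "Training power draw (W)" formula:
#   hardware quantity x chip TDP x datacenter PUE x server overhead.
# The default multipliers are illustrative assumptions only.
def training_power_draw_w(chips: int, tdp_w: float,
                          pue: float = 1.1, server_overhead: float = 1.5) -> float:
    """Estimated power draw of a training run, in watts."""
    return chips * tdp_w * pue * server_overhead

# e.g. 1,000 A100s (400 W TDP each) under these assumed multipliers:
print(f"{training_power_draw_w(1000, 400.0):,.0f} W")
```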
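The "Publication date" imputation rule is mechanical enough to state in code. A minimal sketch, with the function name being illustrative:

```python
# Sketch of the publication-date imputation rule: unknown day -> 15th of the
# month; unknown month and day -> July 1 of the year.
from datetime import date

def impute_publication_date(year: int, month: int | None = None,
                            day: int | None = None) -> date:
    if month is None:
        return date(year, 7, 1)       # only the year is known
    if day is None:
        return date(year, month, 15)  # year and month are known
    return date(year, month, day)

print(impute_publication_date(2023, 7, 18))  # 2023-07-18 (fully known)
print(impute_publication_date(2023, 7))      # 2023-07-15
print(impute_publication_date(2023))         # 2023-07-01
```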
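Similarly, the 30% threshold in "Organization categorization" can be sketched as a function. Note that only the 30% collaboration threshold is documented above; the tie-break used here for the "leaning" labels (majority sector) is an assumption for illustration.

```python
# Sketch of the organization-categorization rule. The 30% threshold is
# documented; the majority-sector tie-break for "leaning" is an assumption.
def categorize(industry_authors: int, academia_authors: int) -> str:
    total = industry_authors + academia_authors
    industry_share = industry_authors / total
    if min(industry_share, 1 - industry_share) >= 0.30:
        # Both sectors contribute at least 30% of authors.
        lean = "Industry leaning" if industry_share >= 0.5 else "Academia leaning"
        return f"Industry - Academia Collaboration ({lean})"
    return "Industry" if industry_share > 0.5 else "Academia"

print(categorize(60, 2))  # Industry
print(categorize(4, 6))   # Industry - Academia Collaboration (Academia leaning)
```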
Have a question? Noticed something wrong? Let us know.
The AI Models dataset is a collection of machine learning models for research on trends in the history and future of artificial intelligence. It includes over 3,500 models, encompassing a broad array of domains and scales.