The database focuses on information relevant to trends in AI model development. Records in the database have information about three broad areas:
- Bibliographic information about the model and its associated publication, for example its title, URL, authors, citations, and publication date.
- Training details, such as training compute, parameters, dataset size, and the hardware used for training.
- Metadata about the record, such as notes on the above fields with supporting evidence and context, our confidence in the key values, etc.
We provide a comprehensive guide to the database's fields below. This includes examples taken from Llama 2-70B, one of the best-documented recent models. If you would like to ask any questions about the database, or request a field that should be added, feel free to contact us at data@epoch.ai.
| Column | Type | Definition | Example | Coverage |
|---|---|---|---|---|
| Abstract | Text | Abstract text from the publication associated with the model. | In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs. | 85% (2979 out of 3519 models) |
| Authors | Text | Comma-separated list of authors. | Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom | 76% (2680 out of 3519 models) |
| Base model | Categorical (single select) | Which base model the model was fine-tuned from, if applicable. | | 19% (684 out of 3519 models) |
| Batch size | Numeric | Batch size used during training. | 4000000 | 7% (251 out of 3519 models) |
| Citations | Numeric | Number of citations as of the last update. Values are collected from Semantic Scholar where available, and otherwise manually from Google Scholar. | 15053 | 42% (1463 out of 3519 models) |
| Confidence | Categorical (single select) | Metadata describing our confidence in the recorded values for Training compute, Parameters, and Training dataset size. This describes confidence for the most uncertain of these values, where they have a non-empty entry (compute is typically the most uncertain). | Confident | 94% (3320 out of 3519 models) |
| Country (of organization) | Categorical (multiple select) | Country/countries associated with the developing organization(s). Multinational is used to mark organizations associated with multiple countries. | United States of America | 97% (3424 out of 3519 models) |
| Domain | Categorical (multiple select) | The machine learning domain(s) of application associated with the model. This is fairly high-level; for example, "Language" incorporates many different ML tasks. | Language | 97% (3426 out of 3519 models) |
| Task | Categorical (multiple select) | The fine-grained task(s) that the model is designed to perform. These are specific applications of the model to different problems, and can span multiple domains. Task labels are assigned by following a flowchart: each applicable branch is followed until a leaf node is reached. If the task is already in the database, the model is tagged with it; if not, the model is tagged with the new task and the task is added to the flowchart. | Language modeling, Language modeling/generation, Question answering | 97% (3397 out of 3519 models) |
| Epochs | Numeric | The number of epochs (repetitions of the training dataset) used to train the model. | 1 | 23% (803 out of 3519 models) |
| Finetune compute (FLOP) | Numeric | Compute used to fine-tune the model, if applicable. | | 7% (260 out of 3519 models) |
| Hardware quantity | Numeric | Indicates the quantity of the hardware used in training, i.e. the number of chips. | 1000 | 24% (843 out of 3519 models) |
| Hardware utilization (MFU) | Numeric | Number between 0.00 and 1.00 indicating the hardware utilization ratio, i.e. utilized FLOPs / theoretical maximum FLOPs. This value reflects utilization based on computations successfully applied to model training, and does not include computations performed by the hardware which do not ultimately affect the model. | 0.4191975017 | 2% (61 out of 3519 models) |
| Hardware utilization (HFU) | Numeric | Number between 0.00 and 1.00 indicating the hardware utilization ratio, i.e. utilized FLOPs / theoretical maximum FLOPs. This value reflects utilization based on measured computational throughput in the hardware during training. (Model FLOPs utilization is a better measure of utilization if it is available.) | | 1% (25 out of 3519 models) |
| Link | URL | Link(s) to best-choice sources documenting a model. This should preferentially be a journal or conference paper, preprint, or technical report. If these are not available, the links should point to other supporting evidence, such as an announcement post, a news article, or similar. | | 99% (3475 out of 3519 models) |
| Model | Text | The name of the model. This should be unique within the database, and should be the best-known name for a given model. This column must be filled in, and is used as the primary key for indexing entries in the dataset. | Llama 2-70B | 100% (3519 out of 3519 models) |
| Notability criteria | Categorical (multiple select) | The criteria met by the model which qualify it for notability. To be notable, a model must meet at least one criterion. Possible values are highly cited, large training cost, significant use, state of the art, or historical significance. These are discussed further in Inclusion. | Historical significance, Significant use, Highly cited, Training cost | 29% (1038 out of 3519 models) |
| Organization | Categorical (multiple select) | Organization(s) who created the model. Organizations may have multiple different names, but we aim to standardize organization names where they refer to the same organization. Therefore, organizations are periodically reviewed in Airtable and standardized to the most common name for them. For example, "University of California, Berkeley" and "Berkeley" have been changed to "UC Berkeley". Note that some organizations have similar names but genuinely are different organizations, for example Google Brain versus Google versus Google DeepMind. | Meta AI | 97% (3431 out of 3519 models) |
| Organization categorization | Categorical (multiple select) | Categorization of the organization(s), automatically populated from the Organization entry. Models are categorized as "Industry" if their authors are affiliated with private-sector organizations, "Academia" if the authors are affiliated with universities or academic institutions, or "Industry - Academia Collaboration" when at least 30% of the authors are from each (a sketch of this rule appears after the table). Possible values: Industry, Research Collective, Academia, Industry - Academia Collaboration (Industry leaning), Industry - Academia Collaboration (Academia leaning), Non-profit | Industry | 97% (3412 out of 3519 models) |
| Parameters | Numeric | Number of learnable parameters in the model. For neural networks, these are the weights and biases. Further information is provided in Estimation. | 7.0e+10 | 65% (2292 out of 3519 models) |
| Publication date | Date | The publication, announcement, or release date of the model, in YYYY-MM-DD format. If the year and month are known but the day is unknown, the day is filled in as YYYY-MM-15. If the year is known but the month and day are unknown, the month and day are filled in as YYYY-07-01. (A sketch of this imputation rule appears after the table.) | 2023-07-18 | 99% (3496 out of 3519 models) |
| Reference | Text | The literature reference for the model, such as the title of the journal or conference paper, academic preprint, or technical report. | Llama 2: Open Foundation and Fine-Tuned Chat Models | 95% (3345 out of 3519 models) |
| Training compute (FLOP) | Numeric | Quantity of compute used to train the model, in FLOP. This is the total training compute for a given model, i.e. pretrain + finetune. It should be filled in here when directly reported, or calculated via GPU-hours or backpropagation gradient updates (a worked sketch appears after the table). Further guidance is provided in Estimation. | 8.1e+23 | 39% (1388 out of 3519 models) |
| Training compute cost (2023 USD) | Numeric | The training compute cost, estimated using the "amortized hardware capex plus energy" approach documented in our training cost methodology. Values are converted to 2023 US dollars. | 1,102,561.19 | 6% (224 out of 3519 models) |
| Training compute estimation method | Categorical (multiple select) | Indicates how the quantity of training compute was found or estimated, for example from training hardware details and training time, or by counting operations. | Hardware, Operation counting | 41% (1440 out of 3519 models) |
| Training hardware | Categorical (multiple select) | Type of training hardware used. Entries are cross-referenced against Epoch AI's database of ML training hardware. | NVIDIA A100 SXM4 80 GB | 34% (1179 out of 3519 models) |
| Training time (hours) | Numeric | Training time of the model, if reported. This refers to the time elapsed during the training process, not the number of chip-hours. For example, if a model were trained with 10 GPUs for 1 hour, the training time would be 1 hour. Includes the duration of all training phases conducted to develop the model, such as pre-training, post-training, RL, SFT, etc. If the model is fine-tuned from a previously published model, then that base model's training time is not included. | 1728 | 16% (550 out of 3519 models) |
| Training power draw (W) | Numeric | Power draw of the hardware used to train the model, in watts. Calculated as hardware quantity times processor TDP times datacenter PUE times server overhead (a sketch of this formula appears after the table). More details are provided in Estimating power draw. | 795557 | 22% (771 out of 3519 models) |
| Frontier model | Boolean | Indicates whether a model was within the frontier, defined as models that were in the top 10 by training compute as of their release date. | | 4% (137 out of 3519 models) |
| Possibly over 1e23 FLOP | Boolean | Indicates whether a model was (or may have been) trained with at least 10^23 floating-point operations, which qualifies it for inclusion in the large-scale models dataset. | True | 15% (521 out of 3519 models) |
| Model accessibility | Categorical (multiple select) | The accessibility of the model, in terms of whether the model weights can be downloaded or, if the weights are not accessible, whether the model can be used via an API or product. "Open weights (unrestricted)", "Open weights (restricted use)", and "Open weights (non-commercial)" all mean that the model weights are downloadable by the public, but with different restrictions on use. "API access" means the model can only be interacted with via an application programming interface, and possibly also a hosted service. "Hosted access (no API)" means the model can only be interacted with via a hosted service. "Unreleased" means there is no way for the public to access the model. | Open weights (restricted use) | 75% (2650 out of 3519 models) |
| Training code accessibility | Categorical (single select) | Denotes whether the code used to train the model is accessible to the public. | Unreleased | 68% (2403 out of 3519 models) |
| Notes fields, e.g. "Training compute notes" | Text | Metadata documenting the reasoning and/or evidence for a given column, e.g. training compute or dataset size. This is particularly important in cases where such information isn't obvious. This field is unstructured text. | "Pretraining utilized a cumulative 3.3M GPU hours of computation on hardware of type A100-80GB", of which 1720320 GPU hours were used to train the 70B model. 311.84 BF16 TFLOP/s * 1720320 hours * 3600 s/hour * 0.40 utilization = 7.725e+23 FLOP. Alternatively: the model was trained for 1 epoch on 2 trillion tokens and has 70B parameters. C = 6ND = 6 * 70e9 * 2e12 = 8.4e+23 FLOP. | 46% (1635 out of 3519 models) |
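To make the "Training compute notes" example above concrete, here is a worked sketch of the two estimation approaches it describes: the hardware method (chip throughput × GPU-hours × utilization) and operation counting (C = 6ND). The numbers are taken from that example; this is an illustration, not the exact script used to populate the database.

```python
# Worked sketch of the two compute-estimation approaches from the
# "Training compute notes" example above (Llama 2-70B).

# Hardware method: chip throughput x GPU-hours x utilization.
peak_flop_per_s = 311.84e12   # NVIDIA A100 BF16 peak throughput, in FLOP/s
gpu_hours = 1_720_320         # reported GPU-hours for the 70B model
utilization = 0.40            # assumed hardware utilization

hardware_estimate = peak_flop_per_s * gpu_hours * 3600 * utilization
print(f"Hardware method:    {hardware_estimate:.3e} FLOP")  # ~7.725e+23

# Operation counting: C = 6ND for a dense transformer trained for 1 epoch.
parameters = 70e9             # N
tokens = 2e12                 # D (2 trillion tokens)
operation_estimate = 6 * parameters * tokens
print(f"Operation counting: {operation_estimate:.3e} FLOP")  # 8.4e+23
```

The two methods agree to within about 10%, which is why this entry carries a high confidence status.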
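The "Training power draw (W)" field is described as hardware quantity times processor TDP times datacenter PUE times server overhead. The sketch below implements that product; the PUE and server-overhead multipliers are illustrative assumptions, not Epoch AI's published coefficients.

```python
# Sketch of the "Training power draw (W)" formula:
#   hardware quantity x chip TDP x datacenter PUE x server overhead.
# The default multipliers are illustrative assumptions only.
def training_power_draw_w(chips: int, tdp_w: float,
                          pue: float = 1.1, server_overhead: float = 1.5) -> float:
    """Estimated power draw of a training run, in watts."""
    return chips * tdp_w * pue * server_overhead

# e.g. 1,000 A100s (400 W TDP each) under these assumed multipliers:
print(f"{training_power_draw_w(1000, 400.0):,.0f} W")
```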
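The "Publication date" imputation rule is mechanical enough to state in code. A minimal sketch, with the function name being illustrative:

```python
# Sketch of the publication-date imputation rule: unknown day -> 15th of the
# month; unknown month and day -> July 1 of the year.
from datetime import date

def impute_publication_date(year: int, month: int | None = None,
                            day: int | None = None) -> date:
    if month is None:
        return date(year, 7, 1)       # only the year is known
    if day is None:
        return date(year, month, 15)  # year and month are known
    return date(year, month, day)

print(impute_publication_date(2023, 7, 18))  # 2023-07-18 (fully known)
print(impute_publication_date(2023, 7))      # 2023-07-15
print(impute_publication_date(2023))         # 2023-07-01
```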
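Similarly, the 30% threshold in "Organization categorization" can be sketched as a function. Note that only the 30% collaboration threshold is documented above; the tie-break used here for the "leaning" labels (majority sector) is an assumption for illustration.

```python
# Sketch of the organization-categorization rule. The 30% threshold is
# documented; the majority-sector tie-break for "leaning" is an assumption.
def categorize(industry_authors: int, academia_authors: int) -> str:
    total = industry_authors + academia_authors
    industry_share = industry_authors / total
    if min(industry_share, 1 - industry_share) >= 0.30:
        # Both sectors contribute at least 30% of authors.
        lean = "Industry leaning" if industry_share >= 0.5 else "Academia leaning"
        return f"Industry - Academia Collaboration ({lean})"
    return "Industry" if industry_share > 0.5 else "Academia"

print(categorize(60, 2))  # Industry
print(categorize(4, 6))   # Industry - Academia Collaboration (Academia leaning)
```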
Have a question? Noticed something wrong? Let us know.
The AI Models dataset is a collection of machine learning models for research on trends in the history and future of artificial intelligence. It includes over 3,500 models, encompassing a broad array of domains and scales.