The dataset focuses on biological ML models: those trained on biological data, including biological sequences, molecular structures, and data about molecular properties, among others. Here, we detail the criteria for inclusion and give an overview of how the data have been collected.
To be included in the dataset, an ML model must satisfy all inclusion criteria:
These data have been collected mainly through a literature review. Some models were added from other sources, such as high-profile releases from leading industry labs, bibliographies of notable papers, and ad hoc suggestions from contributors.
As of Feb. 17, 2026, the dataset contains 383 models, of which have compute estimates. The dataset does not provide exhaustive coverage of biological models. We attempt to cover the most historically relevant models, as well as significant models released in 2023 and 2024, but we expect some important models to be missing from our dataset.