Inclusion

The dataset focuses on biological ML models: those that are trained on biological data, including biological sequences, molecular structures or data about molecular properties, among others. Here, we detail criteria for inclusion, and give an overview of how the data have been collected.

Criteria

To be included in the dataset, an ML model must satisfy all inclusion criteria:

  • there must be reliable documentation of its existence and relevance to machine learning;
  • the model must include a learning component and cannot be a non-learned algorithm;
  • the model must have been trained and cannot be a purely theoretical description without experimental results;
  • the model must be directly and explicitly trained on biological data, including:
    • biological sequence data;
    • biomolecule structure data;
    • fitness, pathogenicity or other biological properties of proteins or other biomolecules;
    • cell-level data (cell type, expression levels, spatial or imaging data…).
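
The criteria above work as a conjunction: a candidate is included only if every condition holds. A minimal sketch of that check follows. The `Candidate` record, the `BioDataType` enum, and the function name are hypothetical names used for illustration; they are not part of the dataset's actual tooling. The enum lists only the data categories named above, which the criteria present as examples rather than a closed list.

```python
from dataclasses import dataclass
from enum import Enum


class BioDataType(Enum):
    """Hypothetical labels for the biological data categories listed above."""
    SEQUENCE = "biological sequence data"
    STRUCTURE = "biomolecule structure data"
    PROPERTY = "fitness, pathogenicity or other biomolecule properties"
    CELL_LEVEL = "cell-level data"


@dataclass
class Candidate:
    """Hypothetical record describing a model considered for inclusion."""
    name: str
    reliably_documented: bool      # reliable documentation of existence and ML relevance
    has_learning_component: bool   # not a non-learned algorithm
    has_been_trained: bool         # not only a theoretical description
    # Biological data types the model was directly and explicitly trained on.
    training_data: frozenset = frozenset()


def meets_inclusion_criteria(c: Candidate) -> bool:
    """Return True only if all four inclusion criteria are satisfied."""
    return (
        c.reliably_documented
        and c.has_learning_component
        and c.has_been_trained
        and len(c.training_data) > 0
    )
```

Under these assumptions, a documented, trained, learned model with at least one biological training data type passes the check, and a candidate that fails any single criterion does not.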

Search process

These data have been collected mainly through a literature review. Some models were added from other sources, such as high-profile releases from leading industry labs, bibliographies of notable papers, and ad hoc suggestions from contributors.

Coverage

As of Feb. 17, 2026, the dataset contains 383 models, of which have compute estimates. The dataset does not provide exhaustive coverage of biological models. We attempt to cover the most historically relevant models, as well as significant models released in 2023 and 2024, but we expect some important models to be missing from our dataset.