Announcing our Expanded Biology AI Coverage
We've expanded our Biology AI Dataset, which now covers 360+ models. Our analysis reveals rapid scaling of training compute and dataset sizes from 2017 to 2021, followed by a notable slowdown.

We’re pleased to announce an expansion of our Biological Model Dataset, a component of Epoch AI’s larger database of machine learning models. The role of AI in biology continues to grow, powering advances in drug design, protein engineering, and genomics. Biological AI models bring both opportunities and governance challenges, which makes tracking advances in this field increasingly important.
Our goal with this project is to provide a comprehensive resource for researchers and policymakers. To this end, we have curated information on over 360 models in this update, prioritizing recent models at the frontier of capability, scale, or scientific impact. Alongside details on their developers, intended tasks, and training datasets, we’ve included new estimates of the training compute that went into developing them.
Analyzing compute and data trends can help us understand how invested the field is in scaling as a means to increase performance. The plot above tracks the evolution of training compute and dataset sizes in biological models, highlighting a substantial increase from 2017 to 2021, followed by a relative slowdown. This visualization underscores how quickly the field has advanced—and also suggests that the pace may be changing. By providing transparent, easy-to-explore compute estimates, we hope to enable deeper discussion of what’s driving progress and where bottlenecks may arise in the near future.
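For readers who want to explore this kind of trend themselves, here is a minimal sketch of one way to compare growth rates before and after 2021: fit a least-squares line to log10(training compute) against publication year in each period. This is only an illustration and not necessarily the method behind our plot. The function name `growth_per_year` and every data point below are hypothetical, and the values are synthetic placeholders rather than entries from the dataset.

```python
import math

def growth_per_year(points):
    """Least-squares slope of log10(compute) against year.

    `points` is a list of (year, training_compute_in_FLOP) pairs.
    The result is in orders of magnitude per year.
    """
    years = [year for year, _ in points]
    logs = [math.log10(compute) for _, compute in points]
    n = len(points)
    mean_year = sum(years) / n
    mean_log = sum(logs) / n
    cov = sum((y - mean_year) * (l - mean_log) for y, l in zip(years, logs))
    var = sum((y - mean_year) ** 2 for y in years)
    return cov / var

# SYNTHETIC placeholder values for illustration only (NOT from the dataset).
synthetic = [
    (2017, 1e17), (2018, 1e18), (2019, 1e19), (2020, 1e20), (2021, 1e21),
    (2022, 2e21), (2023, 4e21), (2024, 8e21),
]

# Split at 2021, the approximate turning point mentioned in the text.
early = [p for p in synthetic if p[0] <= 2021]
late = [p for p in synthetic if p[0] >= 2021]

early_slope = growth_per_year(early)
late_slope = growth_per_year(late)

# 10 ** slope converts orders of magnitude per year into a yearly growth factor.
print(f"2017 to 2021: {10 ** early_slope:.1f}x per year")
print(f"2021 onward:  {10 ** late_slope:.1f}x per year")
```

To try this on real data, the synthetic list would be replaced with (year, compute) pairs taken from the dataset. Fitting in log space matters because compute spans many orders of magnitude, so a fit on raw values would be dominated by the few largest models.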
Finally, because biological models can pose dual-use concerns, we have compiled information about the safeguards that developers have adopted to mitigate these risks, including data filtering, risk evaluations, inference-time refusal, and access controls. In our dataset, fewer than 3% of models have any such safeguards, although the most capable models (large foundation models like EvolutionaryScale’s ESM 3, or powerful specialized models like AlphaFold 3) tend to have more of them. We encourage developers to continue sharing best practices for mitigating potential misuse.
To protect sensitive information about model safeguards while enabling responsible research, detailed safeguards data is available upon request. Researchers and developers interested in accessing this information can email safeguards@epoch.ai.
You can find additional information about the dataset in our database documentation.
Sentinel Bio provided a grant to fund this data collection project and make it publicly available, and we thank them for their generous support. Epoch AI owns the resulting dataset.