Request for Information

Jan 23, 2025

Data Requirements for Microbial Machine Learning Models

What microbes can do

Microbes have fundamentally shaped our bodies, our environment, and our understanding of biology. As a window into living systems, they offer enormous taxonomic diversity while remaining simple and tractable enough to generate data and insights at high throughput. Microbes underpin our basic tools for understanding molecular biology and drive applications in domains from agriculture to drug discovery, biomanufacturing, biosecurity, and human health. Despite this impact, we have experimentally studied less than 0.001% of extant microbial species to date.

The paucity of publicly available genotypic and phenotypic data for diverse microbial species—which make up 95% of all biological diversity on Earth—has likewise been a major bottleneck for advancement of machine learning models useful for predicting biological functions and advancing biotechnological applications. To fully realize the potential impact of microbes and machine learning in biology, we need better ways to extract information across a broader swath of microbial species and better public datasets for training predictive models.

The need for data

Predictive models of microbial function are increasingly important for new research and development in life science and biotechnology. Beyond protein structure data, the scarcity of high-quality training data on biological function remains a critical bottleneck limiting progress in applications of AI in biology.

We want to fill this gap. Astera’s Data group is launching a new project to generate and publish rich microbial datasets, both across application domains and across the tree of life. We aim to create public datasets from next-generation sequencing, multi-omics, and high-dimensional phenotyping to support modelers and researchers seeking to understand microbes through machine learning.

The project will comprise four iterative phases, some of which will overlap with each other:

Calling all data users

To create data that are useful right away, we are requesting information to help define the key requirements for this Astera-funded data generation effort. What species of bacteria, archaea, or protists would be most valuable and informative for you, the users of these datasets? What types, specifications, and requirements do your models have for training data? What metadata is required or beneficial? Would you be willing to join a user group in advance of data generation?

Your submissions will help direct our dedicated investment, select initial targets, and guide long-term prioritization. The first species and data types we generate will be chosen based on the requirements we gather from submissions to this RFI.

Supporting iterative public data generation at scale

All data will be placed in the public domain upon validation, and deposited in FAIR databases of record when available. Astera will provide data engineering and informatics resources to support data linkage and computational analysis. Microbes selected for analysis and data generation will also be available for order from a repository.

We expect to iterate on and adapt both microbe selection and data types as data generation gets underway and teaches us what’s informative for both basic and applied science. Protocols, workflows, and data deposit methods will be version controlled and publicly available so that other laboratories can re-run data generation locally, generate foundational data on unprofiled microbes, and extend data available on profiled microbes.

For each microbe, we aim to generate data that supports machine learning, and we request information on the data types most informative to that goal (both at the per-microbe level and across a dataset of thousands of microbes). Specific examples could include types of sequencing platforms or approaches, relative preference of transcriptomics/proteomics/metabolomics (and assay preferences within those fields), types of mass spectrometry, emergent or novel high-throughput phenotyping approaches, and which phenotypic information is most broadly useful (ranging from molecular-scale properties to complex behavior). Specific examples of how these data might support machine learning, or how their absence is constraining current models, are welcome.
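To make the question of data types and metadata concrete, the sketch below shows one way a modeler might describe, per microbe, which modalities are present and which are still missing for a given model. It is purely illustrative: the RFI does not define a schema, and every class, field, and function name here (`Assay`, `MicrobeRecord`, `missing_modalities`) is a hypothetical placeholder, as are the example organism and conditions.

```python
from dataclasses import dataclass, field
from typing import List, Set

# Hypothetical schema for illustration only; not defined by this RFI.
@dataclass
class Assay:
    modality: str          # e.g. "genomics", "transcriptomics", "metabolomics"
    platform: str          # instrument or approach, free text
    file_format: str       # e.g. "FASTQ" for reads, "mzML" for mass spectrometry
    growth_condition: str  # metadata a model may need as a covariate

@dataclass
class MicrobeRecord:
    species: str
    strain: str
    assays: List[Assay] = field(default_factory=list)

    def modalities(self) -> Set[str]:
        """Set of data modalities available for this microbe."""
        return {a.modality for a in self.assays}

def missing_modalities(record: MicrobeRecord, required: List[str]) -> List[str]:
    """Return the required modalities absent from the record, sorted by name."""
    return sorted(set(required) - record.modalities())

# Example record with two of three modalities a hypothetical model needs.
record = MicrobeRecord(
    species="Escherichia coli",
    strain="K-12 MG1655",
    assays=[
        Assay("genomics", "long-read sequencing", "FASTQ", "LB, 37 C"),
        Assay("transcriptomics", "RNA-seq", "FASTQ", "LB, 37 C"),
    ],
)

print(missing_modalities(record, ["genomics", "transcriptomics", "metabolomics"]))
```

A response that states requirements at roughly this level of detail (modality, platform, file format, and the condition metadata a model depends on) makes it easier to see where missing data limits current models.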

Impact across applications

Microbes are critical in biotechnology and impact human, plant, animal, and environmental health in myriad ways. We are eager to support model builders with microbial genotype and phenotype data that will have direct impact in translation. The clusters of data we generate around microbes in a specific scientific domain would then map into our ongoing phylogenomic selection of microbes. We are interested in any domain that will have direct scientific or engineering benefit, and in particular we envision drawing on the perspectives of experts working in:

Given that it may not be possible to list specific microbes without a rational approach to sourcing, feel free to respond in free text: which microbes are you particularly interested in, in these or other domains, and why?

Response format

We are interested in short responses of 2-3 pages; this should not be a lengthy exercise that consumes enormous amounts of your time. We do, however, have some specific areas we hope you will explore in your answer.

For your domain, please consider providing:

  1. Priority list of microbial species with justification
  2. Ranked importance of specific data types (feel free to specify down to the instrument, e.g. NovaSeq or Ion Torrent, specific types of mass spectrometry, and file formats)
  3. Brief description of your modeling experience with these organisms and data types
  4. Sense of your background in the domain, including experience with machine learning on microbial -omics data and with multi-omics data integration
  5. Indication of your willingness to join a small working group of data users as we develop our laboratory protocols, data engineering approach, and computational workflows
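
For respondents who find it helpful to organize their thoughts before writing, the five items above can be sketched as a simple structure. This is only an organizational aid under stated assumptions: the RFI asks for a short written response and does not request any machine-readable format, and all names below (`SpeciesPriority`, `RfiResponse`, and their fields) are hypothetical, with placeholder values.

```python
import json
from dataclasses import asdict, dataclass
from typing import List

# Hypothetical outline mirroring the five requested items; not an official format.
@dataclass
class SpeciesPriority:
    species: str
    justification: str

@dataclass
class RfiResponse:
    domain: str
    priority_species: List[SpeciesPriority]  # item 1, highest priority first
    ranked_data_types: List[str]             # item 2, most important first
    modeling_experience: str                 # item 3
    background: str                          # item 4
    join_working_group: bool                 # item 5

# Placeholder values; replace with your own content.
response = RfiResponse(
    domain="example domain",
    priority_species=[SpeciesPriority("Species A", "why it matters for my models")],
    ranked_data_types=["whole-genome sequencing", "transcriptomics", "metabolomics"],
    modeling_experience="brief description",
    background="brief description",
    join_working_group=True,
)

# Serialize to JSON so the outline can be shared or checked for completeness.
as_json = json.dumps(asdict(response), indent=2)
print(as_json)
```

Keeping ranked items in ordered lists preserves the priority order that items 1 and 2 ask for.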

Our advisors

This project is supported by scientific advisors including Prachee Avasthi, Seemay Chou, and Jonathan Eisen.

Submission timeline and process

We’ll begin reviewing submissions on February 24 and will prioritize those received by that date. Please submit your responses through this form. We’ll get back to you before the start of the next phase, in mid-March. We look forward to working with you!