A perspective on the use of machine learning on the human microbiome for phenotype prediction and possible adaptations for plant microbiomes
Summary
There has been a fundamental shift from viewing organisms as isolated entities to considering them as holobionts. This change highlights the critical role of microbiomes in influencing host phenotypes in plants and humans. The microbiome of a host organism is directly related to its phenotype and affects various physiological traits, playing roles in immune response, nutrient acquisition, and protection against pathogens. To explore the potential of deep learning (DL) models for the classification of host phenotypes in plants, we reviewed studies that use classic machine learning (ML) and DL to predict microbiome-related host disease phenotypes (inflammatory bowel disease, colorectal cancer, type II diabetes, obesity, arthritis, liver cirrhosis) in humans from microbiome data, as these methods have been applied more extensively to such use cases.
In classic ML applications, the data are processed in bioinformatic pipelines to produce abundance tables and identify gene pathways. Features are often selected through an expert-driven process, and support vector machines and random forests are the most successful classifiers. In DL applications, studies either transform standard metagenomic outputs such as abundance tables into other representations, such as phylogenetic trees or “synthetic images”, or perform dimensionality reduction through DL. Features from these methods are then fed to both DL and classic ML classifiers. DL classifiers such as convolutional neural networks (CNNs) and multilayer perceptrons (MLPs) occasionally outperform traditional ML techniques by small margins. However, current applications fail to leverage significant advantages of DL such as end-to-end prediction and automatic feature selection. Moreover, there is a severe lack of explainability when using DL classifiers, whereas classic approaches such as random forests are highly explainable. Explainability is critical for understanding the underlying processes that cause disease phenotypes in the host.
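The classic ML workflow described above can be illustrated with a minimal sketch. It is not taken from any of the reviewed studies: the abundance table is synthetic, the planted signal in the first three taxa is an assumption made purely for illustration, and scikit-learn is assumed as the ML library. The sketch trains a random forest on relative abundances and reads out per-taxon feature importances, which is one common way such classifiers are interpreted.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical toy abundance table: 60 samples x 200 taxa,
# i.e. far more features than samples, as is typical in metagenomics.
n_samples, n_taxa = 60, 200
counts = rng.poisson(lam=5.0, size=(n_samples, n_taxa)).astype(float)
labels = np.repeat([0, 1], n_samples // 2)  # 0 = healthy, 1 = diseased (synthetic)

# Assumption for illustration: the first three taxa are enriched in the diseased group.
counts[labels == 1, :3] += 20.0

# Convert raw counts to relative abundances so that each sample sums to 1.
rel_abundance = counts / counts.sum(axis=1, keepdims=True)

# Estimate classification accuracy with stratified 5-fold cross-validation.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
cv_accuracy = cross_val_score(clf, rel_abundance, labels, cv=5).mean()

# Fit on all samples and rank taxa by impurity-based feature importance.
clf.fit(rel_abundance, labels)
importances = clf.feature_importances_
top_taxa = np.argsort(importances)[::-1][:5]

print(f"mean 5-fold CV accuracy: {cv_accuracy:.2f}")
print("most important taxa (column indices):", top_taxa)
```

In a real study, the ranked taxa would be the starting point for biological interpretation. Note that impurity-based importances are only one explainability measure and can be biased when features are correlated.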
As human and plant microbiome sequencing data share the same structure, there is no technical barrier to applying the methods reviewed in this study to plant microbiomes. However, plant microbiomes are much more diverse, exhibiting compositional differences among different types of plants. This necessitates larger datasets for DL models to generalize appropriately. As it stands, DL classifiers already face limitations due to feature-sample size imbalance in metagenomics, on top of limited dataset availability. This issue will only be exacerbated in plant applications. Although DL holds promise for future large-scale microbiome analysis, its current performance and the need for explainable models and extensive datasets remain significant hurdles. Future DL classification in this field should focus on interpretability, end-to-end prediction, and multi-omics integration.