High-throughput sequencing techniques have accelerated functional metagenomics studies by generating large volumes of ‘omics’ data. Integrating these data with computational approaches is potentially useful for predicting metagenomic functions. Machine learning models can be trained on microbial features (e.g. taxonomic units in the human microbiome) and then used to classify microbial data into different functional classes (e.g. healthy versus diseased states). When analyzing omics data, both the features (i.e. the microbial taxa) and the taxonomic relations between them are important. These relations can potentially be recovered from the phylogenetic tree of the microbial taxa. In this paper, we propose a novel integrative framework, Phy-PMRFI, which is driven by phylogeny-based modelling of omics data and predicts metagenomic functions from important features selected by a Random Forest Importance (RFI) strategy. The proposed framework integrates the underlying phylogenetic tree information with abundance measures of microbial species (features) by creating a novel phylogeny and abundance aware matrix structure (PAAM). Phy-PMRFI then ranks the columns of the resulting matrix (i.e. the microbial features) using the RFI measure, and the ranked features are used as input for microbiome classification. The resulting feature set enhances the performance of the most popular state-of-the-art methods, such as Support Vector Machines. Our proposed integrative framework also outperforms the state-of-the-art pipelines Phylogenetic Isometric Log-Ratio Transform (PhILR) and MetaPhyl (e.g. Phy-PMRFI obtains 90% accurate predictions on the human throat microbiome, compared with 53% for PhILR and 71% for MetaPhyl).
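The ranking-then-classification step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not say how the PAAM matrix is built, so the sketch uses a synthetic samples-by-features matrix as a stand-in. The use of scikit-learn, the value of `top_k`, the synthetic labels, and all variable names are assumptions made for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for a samples x features matrix.
# ASSUMPTION: this is random data, not real microbiome data and not PAAM.
n_samples, n_features, top_k = 120, 30, 5
X = rng.random((n_samples, n_features))

# Hypothetical binary labels that depend only on the first three columns.
y = (X[:, 0] + X[:, 1] + X[:, 2] > 1.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit a random forest on the training split and rank the columns
# by impurity-based feature importance, most important first.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
ranking = np.argsort(forest.feature_importances_)[::-1]

# Keep the top_k ranked columns (top_k is an illustrative choice).
selected = ranking[:top_k]

# Train a Support Vector Machine on the selected columns only
# and score it on the held-out split.
svm = SVC(kernel="rbf")
svm.fit(X_train[:, selected], y_train)
accuracy = svm.score(X_test[:, selected], y_test)

print("selected columns:", selected.tolist())
print("held-out accuracy:", round(accuracy, 3))
```

The importances are computed on the training split only, so the held-out samples do not influence which columns are selected.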
|Number of pages|9|
|Journal|IEEE Transactions on Nanobioscience|
|Publication status|Accepted/In press - 1 Apr 2019|
- Machine Learning (ML)
- Operational Taxonomic Units (OTUs)
- Random Forest Importance (RFI)