Phy-PMRFI: Phylogeny-aware Prediction of Metagenomic Functions using Random Forest Feature Importance

Jyotsna Talreja Wassan, Haiying / HY Wang, Fiona Browne, Huiru Zheng

Research output: Contribution to journal › Article

Abstract

High-throughput sequencing techniques have accelerated functional metagenomics studies through the generation of large volumes of ‘omics’ data. The integration of these data using computational approaches is potentially useful for predicting metagenomic functions. Machine learning models can be trained using microbial features (e.g. taxonomical units in the human microbiome), which are then used to classify microbial data into different functional classes (e.g. healthy versus diseased states). For analyzing the omics data, both the features (i.e. the microbial taxa) and the taxonomical relations between them are important. These relationships can potentially be uncovered from the phylogenetic tree of the microbial taxa. In this paper, we propose a novel integrative framework, namely Phy-PMRFI, driven by phylogeny-based modelling of omics data to predict metagenomic functions by using important features selected by a Random Forest Importance (RFI) strategy. The proposed framework integrates the underlying phylogenetic tree information with abundance measures of microbial species (features) by creating a novel phylogeny- and abundance-aware matrix structure (PAAM). Phy-PMRFI proceeds by ranking the columns of the obtained matrix (i.e. the microbial features) using the RFI measure; the top-ranked features are then used as input for microbiome classification. The resultant feature set enhances the performance of popular state-of-the-art methods such as Support Vector Machines. Our proposed integrative framework also outperforms the state-of-the-art pipelines Phylogenetic Isometric Log-Ratio Transform (PhILR) and MetaPhyl (e.g. obtaining 90% accurate predictions with Phy-PMRFI on the human throat microbiome, compared with 53% for PhILR and 71% for MetaPhyl).
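The ranking-then-classification step described in the abstract can be illustrated with a minimal sketch. It assumes scikit-learn and NumPy, uses a synthetic samples-by-taxa count matrix as a stand-in for the PAAM (whose construction is not specified in this record), and picks a linear SVM and impurity-based importances as illustrative choices. The function names, `top_k`, and the data are hypothetical and are not taken from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def rank_features_by_rf_importance(X, y, n_trees=200, seed=0):
    """Return column indices of X sorted from most to least important."""
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    forest.fit(X, y)
    # feature_importances_ holds the impurity-based importance of each column.
    return np.argsort(forest.feature_importances_)[::-1]


def classify_with_top_features(X, y, top_k=5, seed=0):
    """Rank features on the training split only, then fit an SVM on the top_k columns."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y
    )
    ranking = rank_features_by_rf_importance(X_train, y_train, seed=seed)
    selected = ranking[:top_k]
    svm = SVC(kernel="linear")
    svm.fit(X_train[:, selected], y_train)
    accuracy = svm.score(X_test[:, selected], y_test)
    return selected, accuracy


# Synthetic stand-in for a samples-by-taxa abundance matrix (hypothetical data,
# not the PAAM from the paper).
rng = np.random.default_rng(0)
n_samples, n_taxa = 200, 30
X = rng.poisson(lam=5.0, size=(n_samples, n_taxa)).astype(float)
y = np.repeat([0, 1], n_samples // 2)
# Make the first three columns differ between the two classes.
X[y == 1, :3] += 10.0

selected, accuracy = classify_with_top_features(X, y, top_k=5)
print("selected columns:", sorted(selected.tolist()))
print("held-out accuracy:", round(accuracy, 3))
```

Ranking on the training split only keeps the held-out samples from influencing which features are selected, so the reported accuracy is not inflated by the selection step.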
Language: English
Number of pages: 9
Journal: IEEE Transactions on Nanobioscience
Publication status: Accepted/In press - 1 Apr 2019

Fingerprint

Mathematical transformations
Support vector machines
Learning systems
Pipelines
Throughput
Phylogeny

Keywords

  • Metagenomics
  • Phylogeny
  • Classification
  • Machine Learning (ML)
  • Operational Taxonomic Units (OTUs)
  • Random Forest Importance (RFI)

Cite this

@article{525a11ebdcff4239bf43226e15abeaf7,
title = "Phy-PMRFI: Phylogeny-aware Prediction of Metagenomic Functions using Random Forest Feature Importance",
journal = "IEEE Transactions on Nanobioscience",
abstract = "High-throughput sequencing techniques have accelerated functional metagenomics studies through the generation of large volumes of ‘omics’ data. The integration of these data using computational approaches is potentially useful for predicting metagenomic functions. Machine learning models can be trained using microbial features (e.g. taxonomical units in the human microbiome), which are then used to classify microbial data into different functional classes (e.g. healthy versus diseased states). For analyzing the omics data, both the features (i.e. the microbial taxa) and the taxonomical relations between them are important. These relationships can potentially be uncovered from the phylogenetic tree of the microbial taxa. In this paper, we propose a novel integrative framework, namely Phy-PMRFI, driven by phylogeny-based modelling of omics data to predict metagenomic functions by using important features selected by a Random Forest Importance (RFI) strategy. The proposed framework integrates the underlying phylogenetic tree information with abundance measures of microbial species (features) by creating a novel phylogeny- and abundance-aware matrix structure (PAAM). Phy-PMRFI proceeds by ranking the columns of the obtained matrix (i.e. the microbial features) using the RFI measure; the top-ranked features are then used as input for microbiome classification. The resultant feature set enhances the performance of popular state-of-the-art methods such as Support Vector Machines. Our proposed integrative framework also outperforms the state-of-the-art pipelines Phylogenetic Isometric Log-Ratio Transform (PhILR) and MetaPhyl (e.g. obtaining 90{\%} accurate predictions with Phy-PMRFI on the human throat microbiome, compared with 53{\%} for PhILR and 71{\%} for MetaPhyl).",
keywords = "Metagenomics, Phylogeny, Classification, Machine Learning (ML), Operational Taxonomic Units (OTUs), Random Forest Importance (RFI)",
author = "Wassan, {Jyotsna Talreja} and Wang, {Haiying / HY} and Browne, Fiona and Zheng, Huiru",
note = "contact author: h.zheng@ulster.ac.uk",
year = "2019",
month = "4",
day = "1",
language = "English",

}

TY - JOUR

T1 - Phy-PMRFI: Phylogeny-aware Prediction of Metagenomic Functions using Random Forest Feature Importance

AU - Wassan, Jyotsna Talreja

AU - Wang, Haiying / HY

AU - Browne, Fiona

AU - Zheng, Huiru

N1 - contact author: h.zheng@ulster.ac.uk

PY - 2019/4/1

Y1 - 2019/4/1

AB - High-throughput sequencing techniques have accelerated functional metagenomics studies through the generation of large volumes of ‘omics’ data. The integration of these data using computational approaches is potentially useful for predicting metagenomic functions. Machine learning models can be trained using microbial features (e.g. taxonomical units in the human microbiome), which are then used to classify microbial data into different functional classes (e.g. healthy versus diseased states). For analyzing the omics data, both the features (i.e. the microbial taxa) and the taxonomical relations between them are important. These relationships can potentially be uncovered from the phylogenetic tree of the microbial taxa. In this paper, we propose a novel integrative framework, namely Phy-PMRFI, driven by phylogeny-based modelling of omics data to predict metagenomic functions by using important features selected by a Random Forest Importance (RFI) strategy. The proposed framework integrates the underlying phylogenetic tree information with abundance measures of microbial species (features) by creating a novel phylogeny- and abundance-aware matrix structure (PAAM). Phy-PMRFI proceeds by ranking the columns of the obtained matrix (i.e. the microbial features) using the RFI measure; the top-ranked features are then used as input for microbiome classification. The resultant feature set enhances the performance of popular state-of-the-art methods such as Support Vector Machines. Our proposed integrative framework also outperforms the state-of-the-art pipelines Phylogenetic Isometric Log-Ratio Transform (PhILR) and MetaPhyl (e.g. obtaining 90% accurate predictions with Phy-PMRFI on the human throat microbiome, compared with 53% for PhILR and 71% for MetaPhyl).

KW - Metagenomics

KW - Phylogeny

KW - Classification

KW - Machine Learning (ML)

KW - Operational Taxonomic Units (OTUs)

KW - Random Forest Importance (RFI)

M3 - Article

ER -