A Comprehensive Study on Predicting Functional Role of Metagenomes Using Machine Learning Methods

Jyotsna Talreja Wassan, Haiying / HY Wang, Fiona Browne, Huiru Zheng

Research output: Contribution to journal › Article

Abstract

"Metagenomics" is the study of genomic sequences obtained directly from environmental microbial communities with the aim to linking their structures with functional roles. The field has been aided in the unprecedented advancement through high-throughput omics data sequencing. The outcome of sequencing are biologically rich data sets. Metagenomic data consisting of microbial spe-cies which outnumber microbial samples, lead to the "curse of dimensionality". Hence the focus in metagenomics studies has moved towards developing efficient computational models using Machine Learning (ML), reducing the computational cost. In this paper, we comprehensively assessed various ML approaches to classifying high-dimensional human microbiota effectively into their functional phenotypes. We propose the application of embedded feature selection methods, namely, Extreme Gradient Boost-ing and Penalized Logistic Regression to determine important species. The resultant feature set enhanced the performance of one of the most popular state-of-the-art methods, Random Forest (RF) over metagenomic studies. Experimental results indicate that the proposed method achieved best results in terms of accuracy, area under Receiver Operating Characteristic curve (ROC-AUC) and major improvement in processing time. It outperformed other feature selection methods of filters or wrappers over RF and classifiers such as Support Vector Machine (SVM), Extreme Learning Machine (ELM), and k -Nearest Neighbors ( k -NN).

Keywords

  • Metagenomics
  • Microbiota
  • Embedded Feature Selection
  • Operational Taxonomic Units (OTUs)
  • Classification

Cite this

@article{798ce09677054d019cf388ec302c8a80,
title = "A Comprehensive Study on Predicting Functional Role of Metagenomes Using Machine Learning Methods",
abstract = "{"}Metagenomics{"} is the study of genomic sequences obtained directly from environmental microbial communities with the aim of linking their structures with functional roles. The field has advanced at an unprecedented pace through high-throughput omics data sequencing. The outcomes of sequencing are biologically rich data sets. Metagenomic data, in which microbial species outnumber microbial samples, lead to the {"}curse of dimensionality{"}. Hence the focus in metagenomics studies has moved towards developing efficient computational models using Machine Learning (ML) that reduce the computational cost. In this paper, we comprehensively assessed various ML approaches to classifying high-dimensional human microbiota effectively into their functional phenotypes. We propose the application of embedded feature selection methods, namely Extreme Gradient Boosting and Penalized Logistic Regression, to determine important species. The resultant feature set enhanced the performance of one of the most popular state-of-the-art methods, Random Forest (RF), across metagenomic studies. Experimental results indicate that the proposed method achieved the best results in terms of accuracy and area under the Receiver Operating Characteristic curve (ROC-AUC), along with a major improvement in processing time. It outperformed other filter- and wrapper-based feature selection methods applied with RF, as well as classifiers such as Support Vector Machine (SVM), Extreme Learning Machine (ELM), and k-Nearest Neighbors (k-NN).",
keywords = "Metagenomics, Microbiota, Embedded Feature Selection, Operational Taxonomic Units (OTUs), Classification",
author = "Wassan, {Jyotsna Talreja} and Wang, {Haiying / HY} and Fiona Browne and Huiru Zheng",
year = "2018",
language = "English",
journal = "IEEE/ACM Transactions on Computational Biology and Bioinformatics",
issn = "1545-5963",

}

TY - JOUR

T1 - A Comprehensive Study on Predicting Functional Role of Metagenomes Using Machine Learning Methods

AU - Wassan, Jyotsna Talreja

AU - Wang, Haiying / HY

AU - Browne, Fiona

AU - Zheng, Huiru

PY - 2018

Y1 - 2018

N2 - "Metagenomics" is the study of genomic sequences obtained directly from environmental microbial communities with the aim of linking their structures with functional roles. The field has advanced at an unprecedented pace through high-throughput omics data sequencing. The outcomes of sequencing are biologically rich data sets. Metagenomic data, in which microbial species outnumber microbial samples, lead to the "curse of dimensionality". Hence the focus in metagenomics studies has moved towards developing efficient computational models using Machine Learning (ML) that reduce the computational cost. In this paper, we comprehensively assessed various ML approaches to classifying high-dimensional human microbiota effectively into their functional phenotypes. We propose the application of embedded feature selection methods, namely Extreme Gradient Boosting and Penalized Logistic Regression, to determine important species. The resultant feature set enhanced the performance of one of the most popular state-of-the-art methods, Random Forest (RF), across metagenomic studies. Experimental results indicate that the proposed method achieved the best results in terms of accuracy and area under the Receiver Operating Characteristic curve (ROC-AUC), along with a major improvement in processing time. It outperformed other filter- and wrapper-based feature selection methods applied with RF, as well as classifiers such as Support Vector Machine (SVM), Extreme Learning Machine (ELM), and k-Nearest Neighbors (k-NN).

AB - "Metagenomics" is the study of genomic sequences obtained directly from environmental microbial communities with the aim of linking their structures with functional roles. The field has advanced at an unprecedented pace through high-throughput omics data sequencing. The outcomes of sequencing are biologically rich data sets. Metagenomic data, in which microbial species outnumber microbial samples, lead to the "curse of dimensionality". Hence the focus in metagenomics studies has moved towards developing efficient computational models using Machine Learning (ML) that reduce the computational cost. In this paper, we comprehensively assessed various ML approaches to classifying high-dimensional human microbiota effectively into their functional phenotypes. We propose the application of embedded feature selection methods, namely Extreme Gradient Boosting and Penalized Logistic Regression, to determine important species. The resultant feature set enhanced the performance of one of the most popular state-of-the-art methods, Random Forest (RF), across metagenomic studies. Experimental results indicate that the proposed method achieved the best results in terms of accuracy and area under the Receiver Operating Characteristic curve (ROC-AUC), along with a major improvement in processing time. It outperformed other filter- and wrapper-based feature selection methods applied with RF, as well as classifiers such as Support Vector Machine (SVM), Extreme Learning Machine (ELM), and k-Nearest Neighbors (k-NN).

KW - Metagenomics

KW - Microbiota

KW - Embedded Feature Selection

KW - Operational Taxonomic Units (OTUs)

KW - Classification

M3 - Article

JO - IEEE/ACM Transactions on Computational Biology and Bioinformatics

T2 - IEEE/ACM Transactions on Computational Biology and Bioinformatics

JF - IEEE/ACM Transactions on Computational Biology and Bioinformatics

SN - 1545-5963

ER -