Abstract
Selecting informative features, such as accurate biomarkers for disease diagnosis, prognosis and response to treatment, is an essential task in the field of bioinformatics. Medical data often contain thousands of features, and identifying potential biomarkers is challenging because of the small number of samples in the data, method dependence and non-reproducibility. This paper proposes a novel ensemble feature selection method, named Filter and Wrapper Stacking Ensemble (FWSE), to identify reproducible biomarkers from high-dimensional omics data. In FWSE, filter feature selection methods are run on numerous subsets of the data to eliminate irrelevant features, and then wrapper feature selection methods are applied to rank the top features. The method was validated on four high-dimensional medical datasets related to mental illnesses and cancer. The results indicate that the features selected by FWSE are stable and statistically more significant than those obtained by existing methods, while also demonstrating biological relevance. Furthermore, FWSE is a generic method, applicable to various high-dimensional datasets in the fields of machine intelligence and bioinformatics.
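The two-stage idea described in the abstract (filters on many data subsets, then a wrapper to rank the survivors) can be illustrated with a minimal sketch. This is not the authors' FWSE implementation: the single filter score (standardized difference of class means), the single wrapper (greedy forward selection around a leave-one-out nearest-centroid classifier), the function names, the thresholds `top_k` and `min_freq`, and the synthetic data are all assumptions made for illustration. It also assumes binary labels 0/1 with at least two samples per class.

```python
import random
from statistics import mean, pstdev

def filter_score(X, y, j):
    """Assumed filter: absolute difference of class means for feature j, scaled by spread."""
    a = [row[j] for row, lab in zip(X, y) if lab == 0]
    b = [row[j] for row, lab in zip(X, y) if lab == 1]
    spread = pstdev(a + b) or 1.0  # guard against constant features
    return abs(mean(a) - mean(b)) / spread

def filter_stage(X, y, n_subsets=50, frac=0.8, top_k=5, min_freq=0.6, seed=0):
    """Keep features that reach the top_k on at least min_freq of stratified random subsets."""
    rng = random.Random(seed)
    p = len(X[0])
    by_class = {c: [i for i, lab in enumerate(y) if lab == c] for c in (0, 1)}
    counts = [0] * p
    for _ in range(n_subsets):
        idx = [i for c in (0, 1)
               for i in rng.sample(by_class[c], max(2, int(frac * len(by_class[c]))))]
        Xs, ys = [X[i] for i in idx], [y[i] for i in idx]
        ranked = sorted(range(p), key=lambda j: filter_score(Xs, ys, j), reverse=True)
        for j in ranked[:top_k]:
            counts[j] += 1
    return [j for j in range(p) if counts[j] / n_subsets >= min_freq]

def loo_accuracy(X, y, feats):
    """Leave-one-out accuracy of a nearest-centroid classifier restricted to feats."""
    hits = 0
    for i in range(len(X)):
        dist = {}
        for c in (0, 1):
            rows = [X[k] for k in range(len(X)) if k != i and y[k] == c]
            centroid = [mean(r[j] for r in rows) for j in feats]
            dist[c] = sum((X[i][j] - m) ** 2 for j, m in zip(feats, centroid))
        hits += min(dist, key=dist.get) == y[i]
    return hits / len(X)

def wrapper_rank(X, y, candidates):
    """Assumed wrapper: greedy forward selection; the order of addition is the ranking."""
    chosen, remaining = [], list(candidates)
    while remaining:
        best = max(remaining, key=lambda j: loo_accuracy(X, y, chosen + [j]))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Synthetic data (an assumption): 40 samples, 20 features, only features 0 and 1 informative.
data_rng = random.Random(1)
y = [i % 2 for i in range(40)]
X = [[data_rng.gauss(0, 1) + (4.0 * y[i] if j < 2 else 0.0) for j in range(20)]
     for i in range(40)]

selected = filter_stage(X, y)          # features that survive the filter stage
ranking = wrapper_rank(X, y, selected)  # survivors ordered by the wrapper stage
print("survivors:", selected)
print("ranking:", ranking)
```

Counting how often a feature survives across subsets is one simple way to favour stable features; a fuller ensemble would combine several filter and wrapper methods rather than the single one of each used here.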
Original language | English |
---|---|
Article number | bbad382 |
Pages (from-to) | 1-17 |
Number of pages | 17 |
Journal | Briefings in Bioinformatics |
Volume | 24 |
Issue number | 6 |
Early online date | 26 Oct 2023 |
DOIs | |
Publication status | Published (in print/issue) - Nov 2023 |
Bibliographical note
Publisher Copyright: © The Author(s) 2023. Published by Oxford University Press.
Data Access Statement
This study employed four distinct datasets. The LYRIKS dataset, owned by the Institute for Mental Health, Singapore, is not publicly available due to privacy considerations and the absence of participant consent for public data sharing. Researchers interested in accessing this dataset for scientific purposes may reach out directly to the Institute for Mental Health, Singapore, to explore potential data access arrangements. The Bipolar dataset, on the other hand, is publicly accessible and can be downloaded from the following link: [Link]. The Lung Adenocarcinoma (LUAD) dataset, part of The Cancer Genome Atlas (TCGA) PanCancer Atlas study, can be downloaded from the following link: [Link]. The Pancreatic Ductal Adenocarcinoma (PDAC) dataset is publicly available at the following link: [Link]. We encourage researchers to utilize these resources in accordance with the respective data use agreements and ethical guidelines.
Keywords
- feature selection
- biomarker discovery
- ensemble learning
- high-dimensional data
- genomics
- proteomics