Abstract
DNA sequencing is the process of reading the individual base pairs in a section of DNA. Genes are the parts of DNA that encode proteins; for example, ion channels are proteins that maintain concentrations of ions within cells. Sequencing these genes can offer insights into factors such as evolution and disease. During sequencing, an unknown value 'N' can be substituted into the sequence wherever the sequencing machine is unable to identify a nucleotide as Adenine (A), Cytosine (C), Thymine (T), or Guanine (G). These gene sequences vary in length, including between individual genes within the same species. This has led to the use of k-mer encoding, which allows a machine learning algorithm to assess these genes without the need for pre-alignment. K-mer encoding works by taking short sections of the sequence and tallying the number of times each one appears, for example how many times the k-mer 'ACCT' appears in the overall sequence. The unknown 'N' value presents a problem for k-mer encoding: as an additional symbol it increases the size of the k-mer feature vector, and that increase grows exponentially with the k-mer length. In this paper we investigate the accuracy and computational impact of including, removing, or ignoring the 'N' value for k-mer lengths 3, 6, and 9 across four machine learning algorithms: Random Forest, Multinomial Naive Bayes, Neural Networks, and Linear Support Vector Machine.
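The k-mer tallying described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `n_mode` parameter, and the exact meaning given to the three strategies are assumptions made for illustration. Here "include" is taken to mean treating 'N' as a fifth symbol, "remove" to mean deleting every 'N' before counting, and "ignore" to mean skipping any k-mer that contains an 'N'.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence: str, k: int, n_mode: str = "include") -> Counter:
    """Tally overlapping k-mers in `sequence`.

    n_mode values are hypothetical names for an assumed reading of the
    three strategies:
      "include": keep 'N' as a fifth symbol
      "remove":  delete every 'N' before sliding the window
      "ignore":  skip any k-mer that contains an 'N'
    """
    if n_mode not in {"include", "remove", "ignore"}:
        raise ValueError(f"unknown n_mode: {n_mode!r}")

    seq = sequence.upper()
    if n_mode == "remove":
        seq = seq.replace("N", "")

    counts = Counter()
    # Slide a window of width k across the sequence, one base at a time.
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if n_mode == "ignore" and "N" in kmer:
            continue
        counts[kmer] += 1
    return counts


def feature_vector(counts: Counter, k: int, alphabet: str = "ACGT") -> list:
    """Fixed-length vector with one slot per possible k-mer over `alphabet`."""
    return [counts["".join(p)] for p in product(alphabet, repeat=k)]


if __name__ == "__main__":
    seq = "ACCTNACCT"
    for mode in ("include", "remove", "ignore"):
        print(mode, dict(kmer_counts(seq, 4, mode)))
```

For the toy sequence `ACCTNACCT` with k = 4, all three modes count 'ACCT' twice, but "include" also produces four k-mers containing 'N', "remove" produces three k-mers that span the deleted position, and "ignore" produces nothing else. Under the "include" assumption the feature vector has 5^k slots rather than 4^k: 125 versus 64 for k = 3, 15,625 versus 4,096 for k = 6, and 1,953,125 versus 262,144 for k = 9.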
Original language | English |
---|---|
Title of host publication | 2023 IEEE Symposium Series on Computational Intelligence, SSCI 2023 |
Publisher | IEEE |
Pages | 1043-1048 |
Number of pages | 6 |
ISBN (Electronic) | 978-1-6654-3065-4, 978-1-6654-3064-7 |
Publication status | Published online - 1 Jan 2024 |
Event | 2023 IEEE Symposium Series on Computational Intelligence: SSCI 2023 - Sheraton Mexico City Maria Isabel Hotel, Mexico City, Mexico Duration: 5 Dec 2023 → 8 Dec 2023 https://attend.ieee.org/ssci-2023/ |
Publication series
Name | |
---|---|
ISSN (Print) | 2770-0097 |
ISSN (Electronic) | 2472-8322 |
Conference
Conference | 2023 IEEE Symposium Series on Computational Intelligence |
---|---|
Abbreviated title | SSCI 2023 |
Country/Territory | Mexico |
City | Mexico City |
Period | 5/12/23 → 8/12/23 |
Bibliographical note
Publisher Copyright: © 2023 IEEE.
Keywords
- k-mer
- DNA
- Machine Learning
- Random Forest
- Neural Network
- Multinomial Naive Bayes
- Linear Support Vector Machine
- SVM
- NN