A Computational Approach to Uncertainty in DNA Sequences

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

DNA sequencing is the process of reading individual base pairs from a section of DNA. Genes are the parts of the DNA that encode proteins; for example, ion channels are proteins that maintain concentrations of ions within cells. Sequencing these genes can offer insights into factors such as evolution and disease. During the sequencing process, an unknown value 'N' can be substituted in the sequence where the sequencing machine is unable to identify a nucleotide as Adenine (A), Cytosine (C), Thymine (T), or Guanine (G). Gene sequences vary in length, even for individual genes within the same species. This has led to the use of a process known as k-mer encoding, so that a machine learning algorithm can assess these genes without the need for pre-alignment. K-mer encoding works by taking small sections of the sequence and tallying the number of times each one appears, for example, how many times the k-mer 'ACCT' appears in the overall sequence. The unknown 'N' value presents a problem in k-mer encoding: it adds a fifth symbol to the alphabet, so the k-mer feature vector grows from 4^k to 5^k entries, and the gap widens exponentially as the k-mer length increases. In this paper we investigate the accuracy and computational impact of including, removing, or ignoring this 'N' value for k-mer lengths 3, 6, and 9 across four machine learning algorithms: Random Forest, Multinomial Naive Bayes, Neural Networks, and Linear Support Vector Machine.
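
The k-mer tallying described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names (`kmer_counts`, `feature_vector`), the `n_policy` parameter, and the example sequence are hypothetical. The abstract does not define the three 'N' treatments, so the sketch makes assumptions. "Including" keeps 'N' as a fifth symbol, "removing" deletes every 'N' before windows are taken, and "ignoring" discards any k-mer window that contains an 'N'.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k, n_policy="include"):
    """Count overlapping k-mers under one of three assumed 'N' policies."""
    seq = sequence.upper()
    if n_policy == "remove":
        # Assumption: "removing" deletes every N before windows are taken.
        seq = seq.replace("N", "")
    # Slide a window of width k across the sequence, one base at a time.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    if n_policy == "ignore":
        # Assumption: "ignoring" drops any k-mer window that contains an N.
        counts = Counter({kmer: c for kmer, c in counts.items() if "N" not in kmer})
    return counts


def feature_vector(counts, k, alphabet):
    """Lay the counts out in a fixed order over every possible k-mer."""
    return [counts.get("".join(p), 0) for p in product(alphabet, repeat=k)]


seq = "ACCTNACCT"  # made-up sequence, for illustration only
for policy, alphabet in (("include", "ACGTN"), ("remove", "ACGT"), ("ignore", "ACGT")):
    vec = feature_vector(kmer_counts(seq, 4, policy), 4, alphabet)
    print(policy, "vector length:", len(vec), "k-mers counted:", sum(vec))

# Feature vector sizes for the k-mer lengths studied, without and with 'N'.
for k in (3, 6, 9):
    print(k, 4 ** k, 5 ** k)
```

Under these assumptions a dense vector has 4^k entries without 'N' and 5^k with it: 64 versus 125 at k = 3, 4,096 versus 15,625 at k = 6, and 262,144 versus 1,953,125 at k = 9. A sparse representation would store only the k-mers actually observed, which is a design choice the abstract does not address.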
Original language: English
Title of host publication: 2023 IEEE Symposium Series on Computational Intelligence, SSCI 2023
Publisher: IEEE
Pages: 1043-1048
Number of pages: 6
ISBN (Electronic): 978-1-6654-3065-4, 978-1-6654-3064-7
Publication status: Published online - 1 Jan 2024
Event: 2023 IEEE Symposium Series on Computational Intelligence: SSCI 2023 - Sheraton Mexico City Maria Isabel Hotel, Mexico City, Mexico
Duration: 5 Dec 2023 to 8 Dec 2023
https://attend.ieee.org/ssci-2023/

Publication series

ISSN (Print): 2770-0097
ISSN (Electronic): 2472-8322

Conference

Conference: 2023 IEEE Symposium Series on Computational Intelligence
Abbreviated title: SSCI 2023
Country/Territory: Mexico
City: Mexico City
Period: 5/12/23 to 8/12/23

Bibliographical note

Publisher Copyright:
© 2023 IEEE.

Keywords

  • k-mer
  • DNA
  • Machine Learning
  • Random Forest
  • Neural Network
  • Multinomial Naive Bayes
  • Linear Support Vector Machine
  • SVM
  • NN
