A Computational Approach to Uncertainty in DNA Sequences

Research output: Contribution to conference › Paper › peer-review


DNA sequencing is the process of reading individual base pairs from a section of DNA. Genes are the parts of the DNA that encode proteins; ion channels, for example, are proteins that maintain the concentrations of ions within cells. Sequencing these genes can offer insights into factors such as evolution and disease. During sequencing, an unknown value 'N' can be substituted into the sequence wherever the sequencing machine is unable to identify a nucleotide as Adenine (A), Cytosine (C), Thymine (T), or Guanine (G). Gene sequences vary in length, and this holds even for individual genes within the same species. This has led to the use of a process known as k-mer encoding, which lets a machine learning algorithm assess these genes without the need for pre-alignment. K-mer encoding works by taking short subsequences of length k from the sequence and tallying how many times each one appears, for example how many times the k-mer 'ACCT' appears in the overall sequence. The unknown 'N' value presents a problem for k-mer encoding: as an additional symbol alongside the four bases, it enlarges the k-mer feature vector, and that increase grows exponentially with the k-mer length. In this paper we investigate the accuracy and computational impact of including, removing, or ignoring the 'N' value for k-mer lengths 3, 6, and 9 across four machine learning algorithms: Random Forest, Multinomial Naive Bayes, Neural Networks, and Linear Support Vector Machine.
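The k-mer tallying described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `kmer_counts` and `feature_vector`, the `n_policy` parameter, and the example sequence are hypothetical. The reading of the three options is also an assumption: "include" treats 'N' as a fifth symbol, "remove" deletes every 'N' before windowing, and "ignore" skips any k-mer that contains 'N'. The paper itself may define these differently.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq, k, n_policy="include"):
    """Tally overlapping k-mers under an assumed policy for the unknown base 'N'."""
    seq = seq.upper()
    if n_policy == "remove":
        # Assumption: "removing" deletes each 'N', joining the flanking bases.
        seq = seq.replace("N", "")
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if n_policy == "ignore" and "N" in kmer:
            # Assumption: "ignoring" drops every window that contains an 'N'.
            continue
        counts[kmer] += 1
    return counts


def feature_vector(counts, k, alphabet):
    """Lay the tallies out as a fixed-length vector with len(alphabet) ** k entries."""
    return [counts.get("".join(p), 0) for p in product(alphabet, repeat=k)]


example = "ACNTACGT"  # made-up sequence containing one unknown base
included = kmer_counts(example, 3, "include")
ignored = kmer_counts(example, 3, "ignore")
removed = kmer_counts(example, 3, "remove")

with_n = feature_vector(included, 3, "ACGTN")     # 5 ** 3 = 125 entries
without_n = feature_vector(ignored, 3, "ACGT")    # 4 ** 3 = 64 entries
```

Under these assumptions, a four-letter alphabet gives vectors of 64, 4,096, and 262,144 entries for k = 3, 6, and 9, while a five-letter alphabet that includes 'N' gives 125, 15,625, and 1,953,125. Because every sequence maps to a vector of the same length regardless of its own length, no pre-alignment is needed.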
Original language: English
Number of pages: 6
Publication status: Accepted/In press - 15 Sept 2023
Event: 2023 IEEE Symposium Series on Computational Intelligence: SSCI 2023 - Sheraton Mexico City Maria Isabel Hotel, Mexico City, Mexico
Duration: 5 Dec 2023 → 8 Dec 2023


Conference: 2023 IEEE Symposium Series on Computational Intelligence
Abbreviated title: SSCI 2023
City: Mexico City


  • k-mer
  • DNA
  • Machine Learning
  • Random Forest
  • Neural Network
  • Multinomial Naive Bayes
  • Linear Support Vector Machine
  • SVM
  • NN


