Algorithmic Bioinformatics

Holger Schmeisky

Approximate String Matching (Hamming Distance)

Academic Advisor: Manuel Holtgrewe , Knut Reinert
Discipline: Bioinformatics
Degree: Bachelor of Science (B.Sc.)
Status: finished


C++ knowledge


Approximate String Matching is an important problem in Bioinformatics applications, e.g. [Weese et al., 2009]. There are various ways of quantifying "approximate". For example Hamming or Levensthein distance are considered. In this thesis, approximate string matching with Hamming distance, sometimes also called k Mismatches Problem, is considered: Find a "short" string needle in a longer string haystack such that Hamming distance between the needle and the aligned part of the haystack is less than a given value k.

There a large body of literature on approximate string matching, see e.g. [Navarro, 2001]. However, it is still an active area of research, e.g. [Salmela et al., 2010].

The first aim of this thesis is to implement existing algorithms for the k Mismatches Problem that have shown to be competetive in recent studies. The second aim of this thesis is to perform an experimental study of the algorithms on real world inputs. The third aim of this thesis is a clear presentation and evaluation of the study's results.