You are here: ABI » ThesesHome » ThesisApproxHammingMatch

Approximate String Matching (Hamming Distance)

This Bachelor thesis gets you in touch with current research for string matching, modern implementation in C++/SeqAn, good Algorithm Engineering practice and data presentation. One aim is to include your code in a future release of the SeqAn library for sequence analysis.

Note that the project description is subject to discussion, depending on your interest and focus.

Rationale/Aims

Approximate String Matching is an important problem in Bioinformatics applications, e.g. [Weese et al., 2009]. There are various ways of quantifying "approximate". For example Hamming or Levensthein distance are considered. In this thesis, approximate string matching with Hamming distance, sometimes also called k Mismatches Problem, is considered: Find a "short" string needle in a longer string haystack such that Hamming distance between the needle and the aligned part of the haystack is less than a given value k.

There a large body of literature on approximate string matching, see e.g. [Navarro, 2001]. However, it is still an active area of research, e.g. [Salmela et al., 2010].

The first aim of this thesis is to implement existing algorithms for the k Mismatches Problem that have shown to be competetive in recent studies. The second aim of this thesis is to perform an experimental study of the algorithms on real world inputs. The third aim of this thesis is a clear presentation and evaluation of the study's results.

Proposed Schedule

The thesis is planed for 8 weeks. After 2 weeks, you can decide whether you want to complete the thesis or look for another topic.

  • Literature research, getting started with SeqAn. (2 weeks)
  • Select promising candidates for implementation, prioritize the list.
  • Implement and test the selected candidates. (4 weeks)
  • Design the experimental study, collect data, perform study. (1 week)
  • Thesis writeup. (1 week)

There will be regular meetings with your supervisor.

Proposed Requirements

The theoretical work in the thesis should:

  • give a short overview of the area of string matching
  • give an overview of the general ideas in approximate string matching
  • describe the approaches for the k Mismatch Problem.

The implementation must fulfill the following:

  • At least 4 (to be discussed) fast approximate string matching algorithms (hamming distance) must be implemented.
  • The implementation should be in C++, in SeqAn, using the SeqAn coding standard.
  • The code must be well-tested.

The experimental study should:

  • use real-world data, ideally similar to the data in existing studies
  • follow good scientific standards and best practice, results should be repeatable
  • give a clear evaluation with good presentation of the yielded data (e.g. [Sanders, 2002]).

Optional aims:

  • The code is in such good shape that is can be included in the next release of SeqAn.
  • The scripts supporting the experimental study can be adapted to be used for further benchmarks in SeqAn.

Literature

The following could be a good starting point for literature review.

  • [Weese et al., 2009] RazerS - Fast Read Mapping with Sensitivity Control. David Weese, Anne-Katrin Emde, Tobias Rausch, Andreas Döring, and Knut Reinert. Genome Research, Sep 2009, 19: pp. 1646-1654
  • [Navarro, 2001] Navarro, G. 2001. A guided tour to approximate string matching. ACM Comput. Surv. 33, 1 (Mar. 2001), 31-88.
  • [Salmela et al., 2010] Approximate Boyer-Moore string matching for small alphabets. L. Salmela, J. Tarhio, P. Kalsi. Volume 58, Number 3 of Algorithmica, pages 591-609.
  • [Sanders, 2002] Presenting Data from Experiments in Algorithmics. P. Sanders. In Experimental Algorithmics -- From Algorithm Design to Robust and Efficient Software, volume 2547 of LNCS, pages 181-196. Springer, 2002.
This site is powered by FoswikiCopyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Foswiki? Send feedback