Implementing 01*0 seed search strategy using the bidirectional FM index in SeqAn

Background

Approximate string matching is an important subtask in many bioinformatic applications such as read mapping. Seeds (short substrings of reads) are searched in a reference genome allowing up to a certain number of mismatches or indels. This can be implemented straight-forward using backtracking but is only feasible for a small number of errors since the running time grows exponentially in the number of errors. Recently, there has been more research to address this problem and improve the running time, such as [1] and [2],

Topic

The goal of this thesis is to implement 01*0 seeds with a bidirectional FM index and compare it with the implementation "Bwolo" in [1]. It can then be integrated into the SeqAn find module. Bwolo is based on the unidirectional FM index, searches for 01*0 seeds from right to left and thus has to perform a banded dynamic programming algorithm to extend the pattern to the right after finding a 01*0 seed.

Schedule

Preliminary schedule:

[1] Vroland, Christophe, et al. "Approximate search of short patterns with high error rates using the 01⁎0 lossless seeds." Journal of Discrete Algorithms (2016).

[2] Kucherov, Gregory, Kamil Salikhov, and Dekel Tsur. "Approximate string matching using a bidirectional index." Symposium on Combinatorial Pattern Matching. Springer International Publishing, 2014.

[3] Pockrandt, Christopher. Generic Implementation of a Bidirectional FM-Index in SeqAn and Applications. Master’s thesis, Freie Universit at Berlin, 2015.