Algorithmic Bioinformatics

Jochen Singer

A Wavelet Tree Based FM-Index for Biological Sequences in SeqAn

Academic Advisor: David Weese , Knut Reinert
Discipline: Bioinformatik
Degree: Master of Science (M.Sc.)
Degree: Jan 30, 2012
Status: finished


The technological development in the field of genome research has resulted in a massive generation of data that has to be stored and analyzed. The enormous amount of information demands special data structures and algorithms for an efficient analysis. Such an analysis often requires the identification of interesting sequences in genomes, which can be realized using full-text indices. Until recently, the major problem of this approach was its memory consumption, which now can be overcome using the well known FM-index. Therefore, in this thesis we extended the software library SeqAn that provides data structures and algorithms for analyzing biological sequences, with sophisticated FM-index versions designed for fast and memory efficient pattern search. We show that in comparison with existing FM-index implementations our variants are not only competitive to other approaches, but also outperform them. 



[1] M. Burrows and D. J. Wheeler (1994) A Block-sorting Lossless Data Compression Algorithm, SRC Research Report 124
[2] G. Navarro and V. Ma ̈kinen (2007) Compressed Full-Text Indexes, ACM Computing Surveys 39(1), Article no. 2 (61 pages), Section 5.3
[3] D. Adjeroh, T. Bell, and A. Mukherjee (2008) The Burrows-Wheeler Transform: Data Compression, Suffix Arrays and Pattern Matching, Springer
[4] P. Ferragina, G. Manzini (2000) Opportunistic data structures with applications, Proceedings of the 41st IEEE Symposium on Foundations of Computer Science
[5] P. Ferragina, G. Manzini (2001) An experimental study of an opportunistic index, Proceedings of the 12th ACM-SIAM Symposium on Discrete Algorithms, pp. 269-278
[6] P. Ferragina, G. Manzini (2005) Indexing compressed text, Journal of the ACM, Vol. 52, No. 4, July 2005, pp. 552–581