Page FMIndex

TITLE


TITLE

Student
Jochen Singer

Academic Advisor
Prof. Dr. Knut Reinert, David Weese

Expose

Recent developments in the field of DNA research have reduced the cost of genome sequencing drastically. While the analyses of individual genomes was very expensive in the past, leading to only a few studies, the numbers of studies has increased by magnitudes recently. In addition to the increase in studies, the amount of data per study has increased dramatically as well. Therefore fast and at the same time space efficient analysis algorithms have to be developed.

One such approach is the FM-Index. The index is based on the Burrows-Wheeler transformation (BWT) which is a rearrangement of a given text T into T* (BWT(T) = T*). The transformation is computes using a suffix array. In doing so the BWT rearranges T in such a way that T is usally more compressible than T. Furthermore one can efficiently search for patterns P in T or count them. In order to do so some auxiliary data is needed:

This project focuses on the implementation of the FM-Index. More precisely we will implement:

Compression (this paragraph has been copied from [6], page 558)

Searching

The bottleneck in terms of speed is the occ function. The function is used to search in the compressed text and can be computed efficiently by creating buckets which cover the whole sequence and store some auxiliary data. The figure below is a graphical representation of the described situation.

buckets.jpg

In addition to the structure described above the tables NO and W are stored in superbuckets of size lb.

In order to determine occ(x,'c') one has to determine which bucket contains the position x. Afterwards one adds the entries from the superbuckets and buckets before the one which contains x. The current bucket is then decompressed and the cs before x are counted. Using sophisticated data structures, such as wavelet trees, this can be done in constant time.

The implementation will be done in such a way that the Finder interface of SeqAn can be used. In addition the implementation should support the use of a string set.

Literature

[1] M. Burrows and D. J. Wheeler (1994) A Block-sorting Lossless Data Compression Algorithm, SRC Research Report 124
[2] G. Navarro and V. Ma ̈kinen (2007) Compressed Full-Text Indexes, ACM Computing Surveys 39(1), Article no. 2 (61 pages), Section 5.3
[3] D. Adjeroh, T. Bell, and A. Mukherjee (2008) The Burrows-Wheeler Transform: Data Compression, Suffix Arrays and Pattern Matching, Springer
[4] P. Ferragina, G. Manzini (2000) Opportunistic data structures with applications, Proceedings of the 41st IEEE Symposium on Foundations of Computer Science
[5] P. Ferragina, G. Manzini (2001) An experimental study of an opportunistic index, Proceedings of the 12th ACM-SIAM Symposium on Discrete Algorithms, pp. 269-278
[6] P. Ferragina, G. Manzini (2005) Indexing compressed text, Journal of the ACM, Vol. 52, No. 4, July 2005, pp. 552–581

Work Plan

31.05.11 - 05.06.11

06.06-11 - 12.06.11 I used quiet some time to create a document describing in more detail the original FM-index (FMIndex.pdf). In addition the document contains a description of later versions of the index. It is especially useful to describe the differences and the resulting advantages and disadvantages of the different implementations. Note that this document is subject to change and will be updated constantly.

Furthermore I tried to download, compile and run different versions of the FM-index provided by: http://pizzachili.dcc.uchile.cl/ I was able to run the original version, only after getting the hint to use a compiler <= gcc-4.2. As a first result I built an index of a 600MB text file. The index occupies less then 100MB of memory and it has to be noted that the index is not loaded completely into main memory when used for searching. Instead, the implemented version maps the corresponding file and uses virtual memory. This is an important feature and should be considered in our implementation. However, I was not able to run the alphabet friendly version, even though I used gcc-4.2 and g++-4.2 as compiler. While running the program it crashes for unknown reasons.

In addition I read more paper and was looking at different approaches to determine the rank of a bit string. There are several implementations available and as David pointed out one could use Intrinsics of the gcc compiler.

In order to determine which compression scheme is the most appropriate I implemented a little program. The program performs an MTF-encoding and counts the number of zeros. In addition the program determines how many runs of length x of zeros can be observed. This is done for the original as well as the Burrows-Wheeler transformed text.

13.06.11 - 19.06.11 I fixed bugs in the little program described above and created a spreadsheet to determine the memory consumption of an encoded string as described in [4] and [6].

I updated FMIndex.pdf

Re-reading and designing a FMIndex using BitStrings as described in https://www.mi.fu-berlin.de/w/ABI/AdvancedAlgorithms11_Indices (script 10 and 11).

20.06.11 - 26.06.11 Implementation of a FMIndex with BitStrings and wavelet trees. Note that the first version contains many "work arounds".

Weekly Report

Comments