You are here: Foswiki>ABI Web>ThesesHome>FMIndex (27 Jun 2011, JochenSinger)Edit

`TITLE`

Jochen Singer

Prof. Dr. Knut Reinert, David Weese

- occ - occ('c',loc) = how many occurrences of c are found in T* up to position loc
- c('c') = total number of characters lexicographically smaller than c

- The Burrows-Wheeler transformation (this should already be included in SeqAn)
- The auxiliary tables occ, c and sparseSA (I still need to find out what precisely the sparseSA is): A first version will be a naive implementation which does not include the compression of occ. Afterwards a more sophisticated implementation focusing on the reduction of space consumption will be implemented. In order to do so different approaches will be compared and one chosen which shows a reasonable trade off between time and space consumption.

- Use move-to-front encoding to replace each character of T* with the number of distinct characters seen since its previous occurrence. the encoder maintains a list, called the MTF-list, initialized with all characters of ordered alphabetically. When the next character arrives, the encoder outputs its current rank in the MTF-list and moves it to the front of the list. At any instant, the MTF-list contains the characters ordered by recency of occurrence. Note that runs of identical characters in L are transformed into runs of zeros in the resulting string L
^{mtf}= mtf(T*). The string L^{mtf}is defined over the alphabet {0, 1, 2, . . . , |Sigma| - 1}. - Encode each run of zeros in L mtf using a run-length encoder. More precisely, replace any sequence 0m with the number (m + 1) written in binary, least significant bit first, discarding the most significant bit. For this encoding, we use two new symbols 0 and 1. For example, 05 is encoded with 01, and 07 is encoded with 000. Note that the resulting string L rle = rle(L
^{mtf}) is defined over the alphabet {**0**,**1**, 1, 2, . . . , |Sigma| -1}. - Compress L
^{rle}with the following variable-length prefix code. Encode 0 and 1 using two bits (10 for 0, 11 for 1). For i = 1, 2, . . . , |Sigma| - 1, encode the symbol i using 1 + 2 log(i + 1) bits: log(i + 1) 0's followed by the binary representation of i + 1 that takes 1 + log(i + 1) bits. The resulting string is the final output of algorithm BW_RLX and is defined over the alphabet {0, 1}.

- First of all the original text T is converted into T* (T*=bwt(T)).
- T* is divide into several buckets each of size l.
- A table NO for each bucket is created which stores how many occurrences of 'c' have been counted before the start of the current bucket. For example: NO
_{3}[c]= 222 would mean that there are 222 occurrences of c before the start of the current bucket. - W[i] states the size of the buckets up to the current bucket
- MTF[i] = Move to Front, stores a picture of the MTF-list (see compression) at the beginning of the current bucket
- S[c, h, BZ
_{i}, MTF[j]] stores the number of occurrences of c among the first h charactes of the compressed bwt of the current bucket. BZ_{i}is the compressed bwt of the current bucket i and MTF[j] is a picture of the MTF-list at the start of the current bucket.

[2] G. Navarro and V. Ma ̈kinen (2007) Compressed Full-Text Indexes, ACM Computing Surveys 39(1), Article no. 2 (61 pages), Section 5.3

[3] D. Adjeroh, T. Bell, and A. Mukherjee (2008) The Burrows-Wheeler Transform: Data Compression, Suffix Arrays and Pattern Matching, Springer

[4] P. Ferragina, G. Manzini (2000) Opportunistic data structures with applications, Proceedings of the 41st IEEE Symposium on Foundations of Computer Science

[5] P. Ferragina, G. Manzini (2001) An experimental study of an opportunistic index, Proceedings of the 12th ACM-SIAM Symposium on Discrete Algorithms, pp. 269-278

[6] P. Ferragina, G. Manzini (2005) Indexing compressed text, Journal of the ACM, Vol. 52, No. 4, July 2005, pp. 552–581

- FMIndex.pdf: A document describing different FM-index implementations.

I | Attachment | Action | Size | Date | Who | Comment |
---|---|---|---|---|---|---|

FMIndex.pdf | manage | 112 K | 15 Jun 2011 - 12:28 | JochenSinger | A document describing different FM-index implementations. | |

jpg | buckets.jpg | manage | 31 K | 23 May 2011 - 11:59 | JochenSinger |

Edit | Attach | Print version | History: r11 < r10 < r9 < r8 | Backlinks | View wiki text | Edit wiki text | More topic actions

Topic revision: r11 - 27 Jun 2011, JochenSinger