Online String Searching and Filtering 11.04.-06.05. Index Data Structures 09.05.-30.05. Review 1 11.05. Genome Comparison and RNA Analysis 03.06.-17.06. Review 2 08.06. Klausur 29.06. Individual Projects 18.06.-16.07.

##### Project A: Suffix array search
• Student 1
• Read in a text from disk
• Construct the suffix array with Quicksort or an algorithm of this lecture
• Construct the lcp-table with the Kasai algorithm
• Write the 2 tables to disk
• Student 2
• Read in a text the corresponding suffix array and lcp table from disk
• Implement the suffix array search with mlr-heuristic
• Implement the O(m + log n) suffix array search and therefor construct an lcp-interval tree
• Compare the search times of both algorithms (use a genomic text and a large set of substrings)

##### Project B: FM-Index
• Student 1
• Read in a text from disk and construct its suffix array (you are allowed to use the results of Project A, you could also read it in after executing the tool of Project A)
• Construct the Burrows-Wheeler Transformation (BWT) using the suffix array from Project A
• Construct the Occ and c-table
• Provide a function for the L-to-F mapping (used by Student 2)
• Student 2
• Implement the exact match algorithm that uses BWT, Occ and c table to search a string in a text
• Student 3
• Use the exact match algorithm of Student 2 and backtracking to implement an approximate string search with up to k errors

##### Project C: Pigeonhole Filter
Implement the hierarchical filter (PEX) to find approximate matches of a pattern with up to k edit-errors in a text.
• Student 1
• You are given the name of a text file, a pattern P and the number of edit-errors k on command line
• Read in the text from disk
• Construct the hierarchical tree as described in the PEX algorithm
• Student 2
• Write a semi-global DP-aligner to search a substring of P in a substring of T with edit-errors
• Implement the PEX search algorithm
• (Optional: Student 3)
• Implement a banded version of Myers' bit-vector algorithm as a replacement for the DP-aligner

##### Project D: QUASAR
Implement QUASAR to search semi-global alignments with up to k edit-errors of patterns in a text.
• Student 1
• Read in a text and construct the q-gram index (assume a DNA alphabet and q=10)
• Read in a set of patterns (you can assume that all have the same size w)
• For each pattern P
• Count the occurrences of all overlapping q-grams of P in overlapping blocks of the text
• Student 2
• For each pattern P
• For each block that reaches the threshold t (from q-gram lemma)
• Verify if the block contains a semi-global alignment of P with up to k edit-errors
• If so, output the pattern P, its end position in the text and the number of errors

##### Project E: Calculation of the exact q-gram threshold
• Student 1
• Read in (from command line) a gapped shape, a text length m and a number of mismatches k
• Calculate the exact gapped q-gram threshold t using the DP algorithm described in [1]
• Student 2
• Calculate a lower bound for t using the q-gram lemma
• Calculate an upper bound using a greedy algorithm that tries to destroy as many q-grams as possible with k errors (see [2])
• For different gapped shapes and values for m and k compare the results of Student 1 with your bounds

##### Project F: Skew algorithm
Implement the skew-algorithm for the difference cover D = {1,2,4} and m = 7
• Student 1
• Sort (7 radix passes) and name the septuples T[i..i+6] where i mod 7 equals 1, 2 or 4
• Create the text T' = t1 t2 t4 of septuple-names
• Create suffix array A' of T' recursively
• Student 2
• Derive suffix array A124 from A'
• Construct suffix arrays A0, A3, A5, A6 iteratively from A124
• Student 3
• Merge suffix arrays from Student 2 using a priority queue and appropriate comparisons

##### Project G: Nussinov SCFG
• Student 1
• Implement the Nussinov-SCFG for RNA-folding with the Nussinov CYK algorithm.
• Together with Student 2 compare the folds predicted by your algorithm to those predicted by mfold and RNAFold.
• Student 2
• Implement a training procedure that uses RFAM-alignments as training input to learn the SCFG's parameters that can be used by Student 1.

## References

Please note that some journals might only be accessible from granted subnets like the FU subnet (or VPN).

[1] Stefan Burkhardt, Juha Kärkkäinen, Better Filtering with Gapped q-Grams, Fundamenta Informaticae, 2003
[2] http://en.wikipedia.org/wiki/Maximum_coverage_problem
Topic revision: r1 - 19 Mar 2012, KnutReinert