Post-processing long sequence alignments
Long pairwise alignments suffer from two problems. First, dynamic programming based alignment algorithms can generate a local alignment where internal segments score poorly. It is thus desirable to decompose these local alignments into high-scoring subalignments. Second, seed and extend approaches such as BLAST might compute a local alignment where the entire alignment actually scores less than a prefix or suffix of the alignment. In these cases it would be desirable to report only the high-scoring prefix or suffix. The goal of this thesis is to implement an algorithm in SeqAn that avoids both problems. The algorithm decomposes a long alignment into subalignments where (1) each subalignment is free of very low scoring regions and (2) the score of each subalignment cannot be improved by shortening the subalignment.
An X-drop within an alignment, where X>0, is a region of consecutive columns scoring less than -X. Alignments containing no such X-drop are called X-alignments. Obviously, X-alignments avoid the first problem that local alignments contain internal segments scoring less than -X. A "normal" alignment is an alignment where each prefix or suffix has a non-negative score. Such an alignment is called maximal if it is not contained in any longer normal alignment. Maximal normal alignments clearly avoid the second problem that an entire alignment scores less than a prefix or suffix. The algorithm proposed by Zhang et al. constructs a tree that allows to decompose an alignment into all X-full subalignments where X-full refers to subalignments that are maximal normal alignments and X-alignments. The tree encodes all X-full alignments for all X greater or equal to 0. Hence, the decomposition corresponding to any particular value of X can be readily extracted from the tree. The goal of this thesis is to implement this algorithm and test it on real data.
 Z. Zhang, P. Berman, T. Wiehe and W. Miller, Post-processing long pairwise alignments. (1999).