With the rapid development of next-generation sequencing technologies, more and more genomes are being sequenced and assembled. Aligning fully assembled genomes is often the first step for further biological analyses, e.g. the reconstruction of a phylogenetic tree. These analyses are limited by the number of genomes that current genome alignment systems can align. There is therefore a need for genome aligners that scale well to large numbers of genomes.
The bottleneck in genome alignment is an initial all-against-all local alignment step: genome aligners usually start by computing local alignments for all pairs of genomes. Instead, it is possible to select only a subset of genome pairs. The authors of the program FSA [1] showed that the loss of accuracy may be small when a simple spanning tree is used to select pairs of genomes. We think that the loss of accuracy can be further reduced by intelligently selecting important pairs of genomes.
The goal of this master's thesis is to adapt an existing alignment program to work with reduced sets of pairwise local alignments. The main focus will be on exploring methods for selecting important pairs of genomes. These may, for example, be based on alignment-free distance matrices and corresponding guide trees, or on approaches similar to [2]. The methods will be developed, implemented (in SeqAn), and evaluated.
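To make the idea of pair selection concrete, the following is a minimal sketch in plain Python, not SeqAn code and not the method used by FSA. It assumes a very simple alignment-free distance (one minus the Jaccard index of the k-mer sets of two genomes) and selects pairs along a minimum spanning tree of the resulting distance matrix. All function names (`kmer_set`, `jaccard_distance`, `distance_matrix`, `spanning_tree_pairs`) and the choice of distance are illustrative assumptions; the thesis may well use different distances and selection strategies.

```python
from itertools import combinations


def kmer_set(seq, k):
    """Return the set of all k-mers occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b):
    """Alignment-free distance (assumed here): 1 - |a & b| / |a | b|."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def distance_matrix(genomes, k):
    """Compute a symmetric matrix of k-mer based distances."""
    profiles = [kmer_set(g, k) for g in genomes]
    n = len(genomes)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = jaccard_distance(profiles[i], profiles[j])
    return dist


def spanning_tree_pairs(dist):
    """Select genome pairs along a minimum spanning tree (Kruskal)."""
    n = len(dist)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Consider the closest pairs first.
    edges = sorted((dist[i][j], i, j) for i, j in combinations(range(n), 2))
    selected = []
    for _, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # the pair connects two separate components
            parent[root_i] = root_j
            selected.append((i, j))
    return selected


if __name__ == "__main__":
    # Toy sequences standing in for genomes (hypothetical data).
    genomes = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "TTTTGGGGCA"]
    print(spanning_tree_pairs(distance_matrix(genomes, k=3)))
```

For n genomes, a spanning tree yields n - 1 pairs instead of the n(n - 1)/2 pairs of the all-against-all step. A more refined selection could start from such a tree and add further pairs that are judged important.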
There will be weekly meetings to discuss progress and problems.