Scaling genome alignment to hundreds of genomes

Topic

With the rapid development of next-generation sequencing technologies, more and more genomes are being sequenced and assembled. Aligning fully assembled genomes is often the first step for further biological analyses, e.g. the reconstruction of a phylogenetic tree. These analyses are limited by the number of genomes that current genome alignment systems can align. Genome aligners that scale well to large numbers of genomes are therefore needed.

The bottleneck during genome alignment is an initial all-against-all local alignment step. Usually, genome aligners start with local alignments of all pairs of genomes, so the number of pairwise comparisons grows quadratically: n genomes yield n(n-1)/2 pairs. Instead, it is possible to select only a subset of genome pairs. The authors of the program FSA [1] already showed that the loss of accuracy may be small when a simple spanning tree is used to select pairs of genomes. We think that the loss of accuracy can be reduced further by intelligently selecting important pairs of genomes.
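To make the difference in the number of pairs concrete, the following minimal Python sketch compares the all-against-all pair set with a pair set taken from a spanning tree. It assumes that a symmetric matrix of pairwise genome distances is already available and that the tree is a minimum spanning tree built with Prim's algorithm; both are assumptions for illustration, not necessarily the procedure used by FSA [1]. The function names `all_pairs` and `spanning_tree_pairs` are hypothetical.

```python
from itertools import combinations


def all_pairs(n):
    """All-against-all selection: every unordered pair of n genomes."""
    return list(combinations(range(n), 2))


def spanning_tree_pairs(dist):
    """Select n-1 genome pairs as the edges of a minimum spanning tree.

    dist is a symmetric matrix (list of lists) of pairwise distances.
    Assumption: a minimum spanning tree (Prim's algorithm, started at
    genome 0) is one reasonable reading of "simple spanning tree".
    """
    n = len(dist)
    if n < 2:
        return []
    # For each genome not yet in the tree: (cheapest distance, tree endpoint).
    best = {j: (dist[0][j], 0) for j in range(1, n)}
    pairs = []
    while best:
        # Attach the genome that is closest to the current tree.
        j = min(best, key=lambda v: best[v][0])
        _, i = best.pop(j)
        pairs.append((min(i, j), max(i, j)))
        # The new tree member may offer cheaper connections to the rest.
        for v in best:
            if dist[j][v] < best[v][0]:
                best[v] = (dist[j][v], j)
    return pairs


if __name__ == "__main__":
    # Toy distances between four genomes (made-up values).
    dist = [
        [0, 1, 4, 5],
        [1, 0, 2, 6],
        [4, 2, 0, 3],
        [5, 6, 3, 0],
    ]
    print(len(all_pairs(len(dist))), "pairs all-against-all")
    print(spanning_tree_pairs(dist), "pairs from the spanning tree")
```

For n genomes the tree selects n-1 pairs instead of n(n-1)/2, which is where the reduction in local alignment work comes from.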

The goal of this master's thesis is to adapt an existing alignment program to work with reduced sets of pairwise local alignments. The main focus will be on exploring methods to select important pairs of genomes. These may, for example, be based on alignment-free distance matrices and corresponding guide trees, or on approaches similar to [2]. The methods will be developed, implemented (in SeqAn), and evaluated.
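As an illustration of what an alignment-free distance matrix could look like, the sketch below counts k-mers in each sequence and uses the cosine distance between the count profiles. This is only one possible choice: the distance measure, the value of k, and the function names `kmer_profile`, `cosine_distance` and `distance_matrix` are assumptions made for this example and are not prescribed by the thesis topic. It is written in plain Python and does not use SeqAn.

```python
from collections import Counter
from math import sqrt


def kmer_profile(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_distance(p, q):
    """1 - cosine similarity of two k-mer count profiles."""
    dot = sum(count * q[kmer] for kmer, count in p.items())
    norm = sqrt(sum(c * c for c in p.values())) * sqrt(sum(c * c for c in q.values()))
    # Sequences shorter than k have empty profiles; treat them as maximally distant.
    return 1.0 - dot / norm if norm else 1.0


def distance_matrix(seqs, k=3):
    """Symmetric matrix of pairwise alignment-free distances."""
    profiles = [kmer_profile(s, k) for s in seqs]
    n = len(seqs)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = cosine_distance(profiles[i], profiles[j])
    return dist


if __name__ == "__main__":
    # Toy sequences standing in for genomes (made-up data).
    seqs = ["ACGTACGT", "ACGTACGA", "TTTTTTTT"]
    for row in distance_matrix(seqs, k=3):
        print(["%.3f" % d for d in row])
```

Such a matrix can be computed without any alignment and could then serve as input for building a guide tree or for ranking candidate genome pairs.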

Timeline

  • [Week 1-4] Getting familiar with genome alignment and SeqAn.
  • [Week 5-10] Developing different ideas/algorithms to select pairs of genomes.
  • [Week 11-14] Implementation using SeqAn.
  • [Week 15-20] Extensive evaluation of different methods, start writing.
  • [Week 21-24] Finishing experiments, writing up.

There will be weekly meetings to discuss progress and problems.

Literature

[1] RK Bradley, A Roberts, M Smoot, S Juvekar, J Do, C Dewey, I Holmes, L Pachter. Fast Statistical Alignment. PLoS Comput Biol 5(5):e1000392, 2009.

[2] F Sievers, A Wilm, D Dineen, TJ Gibson, K Karplus, W Li, R Lopez, H McWilliam, M Remmert, JD Thompson, DG Higgins. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Molecular Systems Biology 7:539, 2011.