Page DfgWorkPackages

This is the list of work packages from the DFG project.

Design and verification of a suitable model for cross-species genome comparison (3 months)

Complex models that capture each and every biological detail quickly become computationally intractable. Simple models, however, are vulnerable to an excess of abstraction despite their ease of computability: they can ultimately lead to efficient algorithms producing useless results. In the first project task we want to analyze whether the discussed alignment graph model is a suitable computational model for genome alignments. This includes a comparison to the models used by current genome aligners, such as the anchor-based alignment method. We will investigate whether the representation of a genome comparison as a k-partite graph for k sequences has the potential to capture all the essential genomic mechanisms, and we will identify the strengths and weaknesses of the model. For instance, overlapping segment matches cannot readily be turned into an alignment graph because a single vertex corresponds 1:1 to a part of a sequence or an abstract entity. Hence, we need to refine overlapping matches, and in the case of genomic sequences, segments containing gaps should be incorporated into the model. We need to analyze the pros and cons of refining matches [55] versus trimming minimally overlapping matches.
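The refinement step mentioned above can be illustrated with a small sketch. This is not the algorithm of [55], only a minimal illustration under simplifying assumptions: matches are represented as half-open intervals per sequence, and refinement simply cuts every interval at every match endpoint, so that the resulting segments are non-overlapping and each can serve as a 1:1 vertex of the alignment graph. The function name `refine_segments` is hypothetical.

```python
from collections import defaultdict

def refine_segments(matches):
    """Sketch of segment-match refinement for an alignment graph.

    matches: list of (seq_id, start, end) intervals, end exclusive.
    Overlapping intervals on the same sequence are cut at every match
    endpoint, yielding non-overlapping segments that can each become
    one vertex of the k-partite alignment graph.
    """
    cuts = defaultdict(set)
    for seq_id, start, end in matches:
        cuts[seq_id].update((start, end))
    refined = {}
    for seq_id, points in cuts.items():
        p = sorted(points)
        # consecutive cut points delimit the refined segments
        refined[seq_id] = [(a, b) for a, b in zip(p, p[1:])]
    return refined
```

For example, two overlapping matches covering positions 10–20 and 15–25 of the same sequence are refined into the three segments 10–15, 15–20 and 20–25, each of which maps 1:1 to a sequence interval.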

Design, analysis, implementation and validation of cross-species genome comparison algorithms (5 months)

Because of the sheer size of genomes, fast and efficient algorithms are crucial for comparing them. Progressive alignment has proved to be highly efficient for aligning large numbers of sequences with only a minimal loss in accuracy compared to optimal dynamic programming. The succinctness of the algorithm and the idea of aligning sequences along a guide tree that indicates when each sequence (a leaf of the tree) is added to the growing alignment are very appealing. Because of that, our first falsifiable hypothesis of this project task is whether the current graph-based progressive alignment algorithm can be substituted with a greedy progressive matching algorithm in order to drop the collinearity constraint. We are therefore going to design, analyze, implement and validate novel progressive matching algorithms, including different methods to compute a bipartite matching in a sparse graph and possible means of matching two subsets of already matched sequences. The second question we want to address in this project task is whether classical assignment algorithms can be adapted to the problem at hand. The obvious shortcoming of applying assignment algorithms is, however, that conserved regions do not have to occur in all sequences. Nevertheless, these conserved regions certainly induce a highly connected subgraph in the alignment graph. We therefore plan to investigate a third falsifiable hypothesis, which relates to the question whether clique-finding algorithms can be executed efficiently on a k-partite sparse graph to elucidate locally conserved genomic features.
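A greedy bipartite matching of the kind discussed above can be sketched very compactly. This is only an illustrative baseline, not the proposed algorithm: edges between two sequence partitions are sorted by weight and accepted whenever both endpoints are still free, which is fast on sparse graphs and, unlike collinear alignment, imposes no ordering constraint on the matched segments. The function name `greedy_matching` is hypothetical.

```python
def greedy_matching(edges):
    """Greedy maximum-weight matching sketch for a sparse bipartite graph.

    edges: list of (u, v, weight) with u and v from the two partitions.
    Heaviest edges are taken first; an edge is accepted only if neither
    endpoint is already matched. No collinearity constraint is imposed.
    """
    matched_u, matched_v, matching = set(), set(), []
    for u, v, w in sorted(edges, key=lambda e: -e[2]):
        if u not in matched_u and v not in matched_v:
            matched_u.add(u)
            matched_v.add(v)
            matching.append((u, v, w))
    return matching
```

The greedy choice is not optimal in general (an exact assignment algorithm, e.g. the Hungarian method, can yield a heavier matching), which is exactly the trade-off between the first and the second hypothesis above.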

Parallel read mapping algorithm research (5 months)

The increasing prevalence of multi-core CPUs, multithreading and cluster computing requires novel computational models and algorithms that can capitalize on these parallel processing units. We believe that the filtration and verification phases of a read mapper are ideal candidates for studying in depth the advantages and requirements of efficient parallel data processing. The SWIFT [53] filter algorithm used by our read mapper RazerS performs a serial scan of the reference genome to identify potential read matches. We plan to design, analyze, implement and experimentally validate algorithms that perform multiple scans in parallel over different and independent fractions of the reference genome. We are going to analyze the computational implications of using multi-core libraries such as Intel Threading Building Blocks (www.threadingbuildingblocks.org) or OpenCL (www.khronos.org/opencl). Likewise, the verification of each potential match found by the filter is independent of all other potential matches; therefore, the verification process can also be easily parallelized. Special emphasis will be placed on analyzing the pros and cons of the parallel version in comparison to the serial version. The systematic study of different parallel data distribution and communication models will also be part of this project task. Besides using the multi-core libraries mentioned above, we plan to use the GPU via the CUDA library for the parallel verification.
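The independence of the verification step can be sketched as follows. This is not RazerS or SWIFT code; it is a minimal Python illustration (thread pool instead of TBB/CUDA, Hamming-distance verification instead of the actual semi-global verification) of why candidate positions can be checked concurrently without coordination. The names `verify` and `parallel_verify` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def verify(reference, read, pos, max_errors):
    """Verify one candidate position by counting mismatches (Hamming distance)."""
    window = reference[pos:pos + len(read)]
    if len(window) < len(read):
        return None
    errors = sum(1 for a, b in zip(window, read) if a != b)
    return (pos, errors) if errors <= max_errors else None

def parallel_verify(reference, read, candidates, max_errors, workers=4):
    """Verify candidate positions concurrently.

    Each candidate is independent of all others, so the work can be
    distributed over a pool of workers without any synchronization.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda p: verify(reference, read, p, max_errors),
                           candidates)
    return [r for r in results if r is not None]
```

The same partitioning idea carries over to the filtration phase: each worker scans its own fraction of the reference genome, and only the final result lists need to be merged.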

Designing a read mapping benchmark (3 months)

A number of read mapping algorithms have been published in the recent past. Unfortunately, read mappers are very difficult to compare because they all pursue different objectives. Due to algorithmic constraints, some favor speed over sensitivity, some focus on returning only the best match for every read (at the expense of losing all matches occurring in repeats), and others return all potential matches within some error bound. All these differences make a thorough and objective analysis of read mappers difficult if not impossible, and these shortcomings can only be overcome by the provision of a publicly available, generally accepted benchmark data set of realistic read mapping problem instances. In this project task, we plan to generate various problem instances for current read mappers, including reads simulated from highly repetitive genomes, reads of variable lengths and error rates, and reads derived from mate-pair libraries. The benchmark will be disseminated to other developers over the web. Tools to evaluate the performance of different read mappers on the benchmark will be developed and published. We hope that such a benchmark will further spur algorithm engineering activities in the area of new sequencing technologies.
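The core of such a benchmark generator can be sketched in a few lines. This is only an illustration under simplifying assumptions (uniform sampling from the forward strand, substitution errors only, no indels or quality values); the function name `simulate_reads` is hypothetical. Keeping the true origin of each read makes evaluation straightforward: a mapper's output can be checked against the known positions.

```python
import random

def simulate_reads(genome, n_reads, read_len, error_rate, seed=42):
    """Sample reads uniformly from a genome and inject substitution errors.

    Returns (read_sequence, true_start) pairs; the true start position
    is the ground truth against which mapper output can be evaluated.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible benchmark
    alphabet = "ACGT"
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        read = list(genome[start:start + read_len])
        for i in range(read_len):
            if rng.random() < error_rate:
                # substitute with a different base
                read[i] = rng.choice([c for c in alphabet if c != read[i]])
        reads.append(("".join(read), start))
    return reads
```

Variable read lengths, repeat-rich references and mate-pair constraints would be layered on top of this basic sampler.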

Realigner (3 months)

Given the output of a read mapper, one can only readily deduce the alignment of a read on the target genome. To enable SNP calling and variant detection, this local view of a single read aligned to the target genome must be integrated into a global view in which all reads are placed in a multi-read alignment with respect to the target genome. To achieve this goal, we plan to design, analyze, develop and experimentally validate an algorithm derived from the realigner tool [5]. We plan to extend this algorithm to the new short-read technologies, and we will generalize it so that it can also be applied to protein sequences.
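The transition from the local to the global view can be illustrated with a pileup sketch. This is not the realigner algorithm of [5] (which iteratively removes and realigns reads against the consensus); it only shows the data structure that turns individually mapped reads into per-column stacks, the starting point for any multi-read alignment. The function name `pileup` is hypothetical and gaps are ignored for simplicity.

```python
from collections import defaultdict

def pileup(reads):
    """Build a column-wise view from mapped reads.

    reads: list of (sequence, start) pairs placed on the reference.
    Returns a dict mapping each covered reference column to the list
    of read bases stacked over it (gaps ignored in this sketch).
    """
    columns = defaultdict(list)
    for seq, start in reads:
        for offset, base in enumerate(seq):
            columns[start + offset].append(base)
    return dict(columns)
```

In this global view, every reference column collects the bases of all reads covering it, which is exactly the representation the SNP calling task below operates on.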

Genomic variation detection / SNP calling (3 months)

Based upon the proposed realigner tool and the already existing consensus multi-read aligner, we want to address the problem of calling SNPs and detecting genomic variants within the grant's duration. First, we want to concentrate on a column-by-column SNP calling approach, because windowing approaches that simultaneously process entire portions of reads spanning a region of sequence variation [20] seem to be very sensitive to a high error rate of the sequenced reads. Second, mate pairs have the potential to allow a cataloging of all genetic variation events, which is crucial for our understanding of many diseases such as cancer or diabetes. Mate-pair analysis is, however, non-trivial because of the error-prone technology used to generate mate pairs and the high deviations mate-pair libraries display. We plan to design, analyze and develop clustering algorithms that group mapped mate pairs in order to gain support for a certain variation event. As shown in Fig. 1, a novel insert would be questionable if only one or two reads supported it; the more reads support a certain variant, the more believable it is.
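A column-by-column caller of the kind favored above can be sketched with a simple support rule. This is only an illustration, not the proposed method: a non-reference base is called only if enough reads support it both absolutely and relatively, mirroring the principle that a variant backed by one or two reads is questionable. The function name `call_column` and the thresholds are hypothetical.

```python
from collections import Counter

def call_column(bases, ref_base, min_count=3, min_freq=0.2):
    """Call a SNP in a single alignment column.

    bases: read bases stacked over one reference column.
    A non-reference base is reported only if it is supported by at
    least min_count reads AND by at least min_freq of the coverage,
    so sparsely supported variants are rejected as likely errors.
    """
    counts = Counter(bases)
    depth = len(bases)
    for base, count in counts.most_common():
        if base != ref_base and count >= min_count and count / depth >= min_freq:
            return base
    return None
```

Looking at one column at a time keeps the decision local and robust against the high per-read error rates that make windowing approaches fragile; the same support-counting idea carries over to clustering mate pairs around a candidate variation event.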

The aim of this project task is to reverse engineer current genome comparison tools and to integrate their key algorithmic components into the SeqAn library (see Fig. 4). This also implies experimenting with different suffix array construction or chaining algorithms. Within the grant's duration, we plan to establish for each algorithmic component the basic parameters and conditions under which it performs best.

Library maintenance (3 months)

We regularly prepare a library release. This requires finishing all the ongoing implementation work, extensive code testing, finalizing the documentation and preparing tutorials.

 