There are two main research goals in the proposed project: (1) To design, analyze, implement, and experimentally validate efficient and versatile genome comparison algorithms based upon suitable computation models and (2) to integrate all required algorithmic components in the SeqAn library for biological sequence analysis for the purpose of disseminating the core algorithms and data structures to the bioinformatics and algorithm engineering community. The software developed will be part of the SeqAn library.
In the last two decades, the field of genomics has seen large changes, fueled by technological advances in sequencing technology. Nowadays, important large mammalian genomes (e.g. human, rat, mouse) are completely available. Maybe even more impressive, the time/cost to read genomic data has sunk by at four/five orders of magnitude. In the next years, the rate of generated data will continue to increase as the cost will continue to decrease.
Clearly, this huge amount of data calls for the usage of computers and algorithms and both have successfully been applied to support biologists in their work. Without the contribution of computer science and bioinformatics, the biological and medical advances driven by genomics would have been impossible.
The pioneering work in Bioinformatics is done very closely to the sequencing labs and biologists. There, it is most important for a computational result to be available quickly. As long as the result is not obviously lacking, the methodology of designing, implementing and evaluating the programs, respectively algorithms is not questioned. Often, sequence analysis programs are written with ad hoc heuristic, tailored to the one use case at hand. Besides some simple metrics like "number of aligned bases/genes", experimental evaluation is often limited to anecdotal considerations. One reason for this is the lack of established metrics.
Algorithm Engineering is a field in computer science. Here, the process of developing an algorithm is described as a cycle: The algorithm is designed (i.e. on paper) and then theoretically analyzed in a second step. The aim here is to yield theoretical insights into the problem, give worst case running time and result quality guarantees. Then, it is implemented and experiments are performed with the resulting program on real world and/or artificial data. This step is done in a careful way and used to verify the theoretical analysis. In many cases, the theoretical analysis was only preliminary and thus the experimental analysis could help to improve the theoretical analysis by maybe suggesting an even better bound on the running time or solution quality. Thus, the cycle can then be continued at the first step, using the experimental results in the design. For example, new heuristics can be introduced when there is experimental evidence that some special cases are common.
Read Mapping is one key component for modern genomics. In the last decade, tens of papers have appeared on this topic, due to the large advances that have been made in sequencing technology. Clearly, it is important for Biologists and Bioinformaticians to select a good read mapper objectively. However, most published read mappers solve slightly different problems.
Beginning from a precise definition of the problem of read mapping we are developing a precise formulation of a solution to this problem. Using these precise formulations, we are then designing algorithms and writing programs to evaluate read mapping programs in a fair and objective manner. These are then tied together by wrapping scripts to provide a comprehensive read mapping benchmark.
Complex models that capture each and every biological detail quickly become computationally intractable. Simple models, however, are vulnerable to an excess of abstraction despite their ease of computability. These models can ultimately lead to efficient algorithms producing useless results. We plan to review approaches from the literature, create precise problem formulations from the whole genome alignment and comparison from these, introduce metrics for whole genome alignments and design/implement/evaluate (i.e. engineer) algorithms that solve these problems.