Over the last five years, modern sequencing technologies have brought super-exponential growth in sequencing capacity. With current sequencing technologies it is possible to sequence up to 30 billion nucleotides per day. This tremendous growth threatens to swamp the data archives of sequencing centers. Furthermore, the cost of sequencing has decreased dramatically as sequencing instruments have matured technically, making these technologies available to the broad scientific community. The cost of sequencing a human genome was reduced to almost $21,000 by January 2011, and the first "$1000 Genome" is expected within the next few years. In general, the rate at which data is generated will continue to increase as costs continue to decrease.
Accompanying these colossal advances in sequencing, novel procedures have emerged that interrogate the entire genomic sequences of multiple individuals. Such studies (often called genome-wide association studies, GWAS) focus on sequence differences between a group of individuals sharing a common pathological phenotype and an appropriate control group of healthy individuals. For this purpose, the same analysis procedures are often performed repeatedly for each control and subject individual.
Clearly, this huge amount of data calls for the use of computers and algorithms, and both have been applied successfully to support biologists in their work. Without the contribution of computer science and bioinformatics, the biological and medical advances driven by genomics would have been impossible.
In the daily routine of sequencing labs and biologists, it is not merely desired but demanded that computational results be available quickly. Although pioneering work in bioinformatics responds quickly to new demands arising from the practical use of known methods, the tremendous growth of sequencing data requires additional effort to develop novel approaches and data structures that can efficiently analyse large numbers of sequences. Such approaches and structures would be essential milestones on the way to personalised genomics and other applications that benefit from next-generation sequencing.
This work aims to respond to the described increase in genomic sequence data with algorithmic approaches that benefit from redundancies across multiple datasets.
We begin by designing a data structure that represents one or more genomic sequences by storing only their differences from a similar reference sequence, while retaining the ability to navigate quickly in all sequences. We then use this data structure to develop algorithms that transform the substring index of a reference into the substring index of a new genome without rebuilding it from scratch, storing only the differences from the reference index.
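To make the idea of a reference-relative representation concrete, the following minimal Python sketch stores a genome as a sorted list of edits against a reference and supports random access by binary search over the edit positions. It is an illustration under our own assumptions, not the data structure developed in this work: the names `Edit` and `DeltaSequence` are hypothetical, edits are assumed not to overlap, and the substring index mentioned above is not covered.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    """One difference to the reference (hypothetical record layout)."""
    ref_pos: int  # start position in the reference
    ref_len: int  # number of reference characters replaced (0 = insertion)
    alt: str      # replacement text ("" = deletion)


class DeltaSequence:
    """A sequence stored only as its edits against a shared reference."""

    def __init__(self, reference, edits):
        self.reference = reference
        self.edits = sorted(edits, key=lambda e: e.ref_pos)
        self.starts = []  # start of each edit in derived-sequence coordinates
        shift = 0         # accumulated length difference to the reference
        prev_end = 0
        for e in self.edits:
            if e.ref_pos < prev_end:
                raise ValueError("edits must not overlap")
            self.starts.append(e.ref_pos + shift)
            shift += len(e.alt) - e.ref_len
            prev_end = e.ref_pos + e.ref_len
        self.length = len(reference) + shift

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        """Return character i of the derived sequence in O(log #edits)."""
        if not 0 <= i < self.length:
            raise IndexError(i)
        k = bisect_right(self.starts, i) - 1  # last edit starting at or before i
        if k < 0:
            return self.reference[i]          # before the first edit
        e = self.edits[k]
        offset = i - self.starts[k]
        if offset < len(e.alt):
            return e.alt[offset]              # inside the replacement text
        # Past the replacement text: copy from the reference after the edit.
        return self.reference[e.ref_pos + e.ref_len + offset - len(e.alt)]


if __name__ == "__main__":
    ref = "ACGTACGT"
    genome = DeltaSequence(ref, [
        Edit(1, 1, "T"),   # substitution C -> T
        Edit(3, 2, ""),    # deletion of "TA"
        Edit(6, 0, "GG"),  # insertion of "GG" before reference position 6
    ])
    print("".join(genome[i] for i in range(len(genome))))  # ATGCGGGT
```

Several genomes can share one reference string and differ only in their edit lists, so memory grows with the number of differences rather than with genome length. The sorted list is simply the smallest structure that shows this trade-off.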
Building on the representation described above, we plan to develop algorithms that efficiently process multiple genomes in parallel. Such data-parallel methods enable the swift processing of large sets of human genomes by exploiting the similarities between them. Example applications include genome alignment, structural variant detection, and GWAS.