
Algorithm Engineering for High Throughput Sequencing Data

Summary

During the last five years, modern sequencing technologies have brought a super-exponential growth in sequencing capacity. With current technologies it is possible to sequence up to 30 billion nucleotides per day. This growth threatens to swamp the data archives of sequencing centers. Furthermore, the cost of sequencing has decreased dramatically as sequencing instruments have matured technically, making these technologies available to the broad scientific community. The cost of sequencing a human genome had fallen to almost $21,000 by January 2011, and the first "$1000 Genome" is expected within the next few years. In general, the rate of data generation will continue to increase as the cost continues to decrease.

Alongside these advances in sequencing, novel procedures have emerged that interrogate the entire genomic sequences of multiple individuals. Such studies (often called genome-wide association studies, GWAS) focus on sequence differences between a group of individuals sharing a common pathological phenotype and an appropriate control group of healthy individuals. To this end, the same analysis procedures are often performed repeatedly for each control and subject individual.

Clearly, this huge amount of data calls for the use of computers and algorithms, and both have been applied successfully to support biologists in their work. Without the contribution of computer science and bioinformatics, the biological and medical advances driven by genomics would have been impossible.

In the daily routine of sequencing labs and biologists, fast availability of computational results is not merely desired but demanded. Although pioneering work in bioinformatics responds quickly to new demands arising from the practical use of known methods, the tremendous growth of sequencing data requires extra effort in developing novel approaches and data structures that can analyse large numbers of sequences efficiently. These approaches and structures would be essential milestones on the way to personalized genomics and to further applications that benefit from next-generation sequencing.

Scope of Research

This work aims to respond to the described increase of genomic sequence data with algorithmic approaches that benefit from redundancies across multiple datasets.

Sequence Aggregation

At the beginning of this work we design a data structure that represents one or more genomic sequences by storing only their differences to a similar reference sequence, while maintaining the ability to navigate quickly in all sequences. We then use this data structure to develop algorithms that transform the substring index of a reference into the substring index of a new genome without rebuilding it from scratch, storing only the differences to the reference index.
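
To make the idea of storing only differences to a reference more concrete, the following is a minimal sketch and not the data structure developed in this project. It assumes a sequence can be described as an ordered list of segments, each pointing either into the shared reference or into a private buffer of inserted characters; the class name `JournaledSequence`, its methods, and the segment layout are hypothetical and chosen for illustration only. Random access uses a binary search over the segment start positions.

```python
from bisect import bisect_right


class JournaledSequence:
    """Illustrative sketch: a sequence stored as segments that point into
    a shared reference ("ref") or a private insertion buffer ("buf")."""

    def __init__(self, reference):
        self.reference = reference   # shared, never copied or modified
        self.buffer = ""             # holds inserted characters only
        self._rebuild([("ref", 0, len(reference))])

    def _rebuild(self, pieces):
        # Recompute virtual start positions, dropping empty pieces.
        self.segments, self.starts, pos = [], [], 0
        for source, src_start, length in pieces:
            if length > 0:
                self.segments.append((pos, source, src_start, length))
                self.starts.append(pos)
                pos += length
        self.length = pos

    def _split(self, pos):
        # Pieces left and right of virtual position pos.
        left, right = [], []
        for start, source, src_start, length in self.segments:
            if start + length <= pos:
                left.append((source, src_start, length))
            elif start >= pos:
                right.append((source, src_start, length))
            else:
                cut = pos - start
                left.append((source, src_start, cut))
                right.append((source, src_start + cut, length - cut))
        return left, right

    def insert(self, pos, text):
        left, right = self._split(pos)
        new_piece = ("buf", len(self.buffer), len(text))
        self.buffer += text
        self._rebuild(left + [new_piece] + right)

    def delete(self, pos, length):
        left, _ = self._split(pos)
        _, right = self._split(pos + length)
        self._rebuild(left + right)

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        k = bisect_right(self.starts, i) - 1   # segment containing i
        start, source, src_start, _ = self.segments[k]
        text = self.reference if source == "ref" else self.buffer
        return text[src_start + (i - start)]

    def __str__(self):
        return "".join(
            (self.reference if source == "ref" else self.buffer)[s:s + n]
            for _, source, s, n in self.segments
        )


reference = "ACGTACGT"
genome = JournaledSequence(reference)
genome.insert(4, "TT")   # two inserted bases relative to the reference
genome.delete(0, 2)      # two deleted bases at the start
```

In this sketch the memory used per genome grows with the number of edits rather than with the genome length, which is the property the paragraph above describes. The edit operations here rescan all segments for simplicity; a practical structure would likely use a balanced tree instead.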

Data Parallelization

Here, we plan to develop algorithms, based on the representation described above, that efficiently process multiple genomes in parallel. Such data-parallel methods enable the swift processing of large sets of human genomes by exploiting the similarities between them. Example applications are genome alignment, structural variant detection, and GWAS.
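
As a rough illustration of exploiting similarities between genomes, the sketch below searches a pattern in several genomes while scanning the shared reference only once. It is a simplification under loudly stated assumptions, not the planned method: each genome differs from the reference by substitutions only (so coordinates stay aligned), the work is shared rather than executed on parallel hardware, and the names `find_all` and `search_genomes` are hypothetical.

```python
def find_all(text, pattern):
    """Return all start positions of pattern in text (overlaps allowed)."""
    hits, i = [], text.find(pattern)
    while i != -1:
        hits.append(i)
        i = text.find(pattern, i + 1)
    return hits


def search_genomes(reference, variants_per_genome, pattern):
    """variants_per_genome maps a genome name to {position: base}.
    Assumption: substitutions only, so positions match the reference."""
    m, n = len(pattern), len(reference)
    ref_hits = find_all(reference, pattern)   # computed once, shared by all
    results = {}
    for name, snps in variants_per_genome.items():
        # Window starts whose window covers at least one substituted base.
        dirty = set()
        for pos in snps:
            dirty.update(range(max(0, pos - m + 1), min(pos, n - m) + 1))
        # Untouched windows inherit the reference result unchanged.
        hits = [h for h in ref_hits if h not in dirty]
        # Only the touched windows are re-examined for this genome.
        for start in dirty:
            window = "".join(
                snps.get(j, reference[j]) for j in range(start, start + m)
            )
            if window == pattern:
                hits.append(start)
        results[name] = sorted(hits)
    return results


reference = "ACGTTCGTAC"
variants = {
    "genome_a": {},          # identical to the reference
    "genome_b": {1: "G"},    # substitution destroys the match at 0
    "genome_c": {4: "A"},    # substitution creates a new match at 4
}
matches = search_genomes(reference, variants, "ACG")
```

The per-genome cost in this sketch depends on the number of variants and the pattern length, not on the genome length, which is why the approach pays off when the genomes are highly similar.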

Organization

Work Breakdown Structure and Milestones

The WBS for the project and its milestones are listed below.

  • Exposé
    • Read DFG-Proposal [2h]
    • List key-features [0.5h]
    • Literature Research [40h]
    • Create WBS [8h]
    • Write Exposé [8h]

Milestone                 Due Date
Journal-String-Set        28.09.2011
Delta-Index               05.03.2012
Compact Journal-Strings   12.10.2012
Data Parallelism          28.11.2013
Dissertation              30.06.2014

Progress and Reports

The weekly reports summarize the steps achieved in the current week and the steps planned for the next week.

Bug Report

Here you can find a list of known bugs and their current progress.

 