BscRawSeqJournaling

BSc-Thesis: Journaling raw sequences.

Weekly Progress

Introduction

Modern sequencing technologies produce a tremendous number of sequences in a short period of time. Several compression techniques are therefore used to relieve network and disk resources while keeping the sequence information persistently available. Since many projects work with sequences from the same population, the vertical compression mode, in which sequences are compressed relative to each other rather than individually, has been shown to give very good compression results, because individuals belonging to the same population share a high sequence identity.

In this thesis we want to implement an application that takes a set of sequences in FASTA format as input and compresses each sequence against a common reference sequence. First, we compute a large-scale alignment of each sequence to the common reference sequence. Afterwards, the compressed sequences are streamed and the detected variants are joined into a common delta map, which stores the joined variant information of all compressed sequences in ascending order of reference position. A bit vector records the coverage of each variant, that is, which of the compressed sequences contain it. In the following we describe the different work tasks.
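The joining step can be illustrated with a minimal sketch. It is not the planned implementation (which is expected to build on SeqAn). The `Variant` record, the function names `build_delta_map` and `reconstruct`, and the use of a list of booleans as the bit vector are assumptions made for this example only. The sketch also assumes that the variants of a single sequence do not overlap.

```python
from collections import namedtuple

# Hypothetical variant record (an assumption for this sketch):
#   pos     - 0-based position in the reference
#   ref_len - number of reference characters replaced (0 for an insertion)
#   alt     - characters inserted instead ("" for a deletion)
Variant = namedtuple("Variant", ["pos", "ref_len", "alt"])


def build_delta_map(variants_per_sequence):
    """Join per-sequence variant lists into one map sorted by reference position.

    Returns a list of (variant, coverage) pairs. coverage holds one boolean
    per sequence and is True where that sequence contains the variant.
    """
    num_seqs = len(variants_per_sequence)
    coverage = {}
    for seq_id, variants in enumerate(variants_per_sequence):
        for variant in variants:
            # Identical variants from different sequences share one entry.
            bits = coverage.setdefault(variant, [False] * num_seqs)
            bits[seq_id] = True
    return sorted(coverage.items(), key=lambda item: item[0])


def reconstruct(reference, delta_map, seq_id):
    """Rebuild one sequence from the reference and the variants covering it."""
    parts = []
    cursor = 0  # next reference position that has not been copied yet
    for variant, bits in delta_map:
        if not bits[seq_id]:
            continue
        parts.append(reference[cursor:variant.pos])
        parts.append(variant.alt)
        cursor = variant.pos + variant.ref_len
    parts.append(reference[cursor:])
    return "".join(parts)


reference = "ACGTACGT"
variants_per_sequence = [
    [Variant(2, 1, "T")],                      # sequence 0: SNP G->T
    [Variant(5, 2, ""), Variant(2, 1, "T")],   # sequence 1: same SNP plus a deletion
    [Variant(0, 0, "GG")],                     # sequence 2: insertion at the start
]
delta_map = build_delta_map(variants_per_sequence)
sequences = [reconstruct(reference, delta_map, i) for i in range(3)]
```

In this toy input the SNP at position 2 is stored once, and its coverage marks sequences 0 and 1. Sharing entries like this is what makes a joined map compact when the sequences are highly similar.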

Task Description

T1 - Large Scale Alignment (3 Weeks):

T2 - Variant Joining (3 Weeks):

T3 - Evaluation (2 Weeks):

Optional:

Expected outcome for student

References

[1] http://docs.seqan.de/seqan/develop/specialization_IndexQGram.html

[2] Brudno, M. et al. LAGAN and Multi-LAGAN: Efficient Tools for Large-Scale Multiple Alignment of Genomic DNA. Genome Res. (2003): 721-731. Published in Advance March 12, 2003, doi:10.1101/gr.926603

[3] Rahn, R. Genomes per E-Mail - Efficient Compression of Biological Sequences. Master's Thesis (2011)

[4] http://docs.seqan.de/seqan/develop/specialization_JournaledString.html

[5] http://docs.seqan.de/seqan/develop/specialization_JournaledSet.html

[6] Holtgrewe, M. Mason – a read simulator for second generation sequencing data. Technical Report TR-B-10-06, Institut für Mathematik und Informatik, Freie Universität Berlin (2010)

[7] Siva, N. 1000 Genomes project. Nature Biotechnology 26.3 (2008): 256-256.

[8] Rahn, R., Weese, D., & Reinert, K. Journaled String Tree - A scalable data structure for analyzing thousands of similar genomes on your laptop. Bioinformatics (2014).