BscRawSeqJournaling

BSc-Thesis: Journaling raw sequences.

Weekly Progress

Introduction

Nowadays sequencing technologies lead to a tremendous number of available sequences in a short period of time. Hence, several compression techniques are used to relieve network and disk resources to make the sequence information persistently available. Since many projects work with sequences from the same population the vertical compression mode has been shown to give very good compression results due to the high sequence identity shared among the individuals belonging to the same population.

In this thesis we want to implement an application that takes as an input a set of sequences in fasta format and compresses each sequence by means of a common reference sequence. First, we compute a large scale alignment of the sequences based to a common reference sequence. Afterwards the compressed sequences are streamed and the detected variants are joined into a common delta map, which represents the joined variant information of all compressed sequences in an ascending order given their reference position. A bit vector is used to determine the coverage of the variants for all compressed sequences. In the following we describe the different work tasks.

Task Description

T1 - Large Scale Alignment (3 Weeks):

  • Implement an application that reads a set of fasta sequences.
    • In the base version of this program the first sequence is used as the common reference sequence.
  • Build a qGram-Index [1] over the reference sequence. (Parameters like the qGram length should be given as argument)
  • Iterate over all sequences and align them globally with an adaption of the LAGAN [2] algorithm.
    1. We expect the sequences to be very similar, hence we look for neighbouring q-grams and extend them to long runs.
    2. Chain the long runs globally.
    3. Compute a global banded chain alignment:
      1. Connect gaps between anchors with manhattan distance.
      2. Connect gaps between anchors using bi-affine gap costs [3].
  • Store the result of the alignment as a JournaledString [4] in a JournaledSet [5]

T2 - Variant Joining (3 Weeks):

  • Join the variant information to a global DeltaMap [8].
    1. Parse the JournaledStrings and collect all variants.
    2. Sort the variants by means of the reference position and the type of the variant.
    3. Stream the sorted variants and create the DeltaMap.

T3 - Evaluation (2 Weeks):

  • Generate test sets using Mason2 [6] with different parameterisations for the SNPs and InDels.
  • Analyse the performance of the application.
  • Analyse and evaluate the compression ratio of the generated DeltaMap.

Optional:

  • Use real data sets from the 1000 Genome project [7].
  • The program should support an incremental add operation, such that sequences can be added to an existing delta map.
  • Obviously the compression ratio depends on the "goodness" of the reference sequence. Develop a heuristic that reduces the overall memory footprint by modifying the reference sequence and compare the results with the base version of the program.

Expected outcome for student

  • learn how read and understand scientific papers, pseudo code and/or other implementations' source code
  • learn how to efficiently implement an existing algorithm in SeqAn, do I/O and develop an application
  • learn how the SeqAn project works, i.e. development, documentation, communication in team...
  • write a thesis

References

[1] http://docs.seqan.de/seqan/develop/specialization_IndexQGram.html

[2] Brudno, M et al. LAGAN and Multi-LAGAN: Efficient Tools for Large-Scale Multiple Alignment of Genomic DNA. Genome Res. (2003): 721-731. Published in Advance March 12, 2003, doi:10.1101/gr.926603

[3] Rahn, R. Genomes per E-Mail - Efficient Compression of Biological Sequences. Master's Thesis (2011)

[4] http://docs.seqan.de/seqan/develop/specialization_JournaledString.html

[5] http://docs.seqan.de/seqan/develop/specialization_JournaledSet.html

[6] Holtgrewe, M. Mason – a read simulator for second generation sequencing data. Technical Report TR-B-10-06, Institut für Mathematik und Informatik, Freie Universität Berlin (2010)

[7] Siva, N. 1000 Genomes project. Nature biotechnology 26.3 (2008): 256-256.

[8] Rahn, R., Weese D., & Reinert, K. Journaled String Tree - A scalable data structure for analyzing thousands of similar genomes on your laptop. Bioinformatics (2014).

 
Topic revision: r3 - 12 Feb 2015, ReneMaerker
 
  • Printable version of this topic (p) Printable version of this topic (p)