Overlap Module for NGS Pipeline

Summary

The overlap module combines the information obtained by mapping reads to a genome with annotation information (for example genes, transcripts, or known intervals of genomic aberrations such as inversions). It counts the mapped reads that fall into each annotated interval, and can be used to measure gene expression levels from RNA-Seq data (NGS of mRNAs) or to produce visualisation files for sequencing coverage.

Exposé

This thesis uses data structures that are already implemented in SeqAn. The most important one is the FragmentStore, which stores reads, contigs and alignments, among other things. The first step is to write a function that extracts, for the alignment of a given read, the matched intervals on the contig.
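The extraction step can be illustrated with a minimal sketch. It does not use the SeqAn FragmentStore; it assumes a hypothetical CIGAR-like list of operations as the alignment representation, and the function name `matched_intervals` is chosen for illustration only. Long gaps (for example introns in a spliced RNA-Seq alignment) split the alignment into several matched intervals on the contig.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # half-open [begin, end) in contig coordinates


def matched_intervals(contig_begin: int, ops: List[Tuple[str, int]]) -> List[Interval]:
    """Return the contig intervals covered by one read alignment.

    `ops` is a hypothetical CIGAR-like list (an assumption of this sketch):
      'M' = aligned bases, 'D' = small deletion in the read,
      'I' = insertion in the read, 'N' = long gap (e.g. an intron).
    """
    intervals: List[Interval] = []
    pos = contig_begin
    start: Optional[int] = None  # begin of the interval currently being built
    for op, length in ops:
        if op in ("M", "D"):
            # Both consume contig positions and stay inside the same interval.
            if start is None:
                start = pos
            pos += length
        elif op == "N":
            # A long gap closes the current interval and skips contig positions.
            if start is not None:
                intervals.append((start, pos))
                start = None
            pos += length
        elif op == "I":
            # Insertions consume read bases only; the contig position is unchanged.
            continue
        else:
            raise ValueError(f"unknown operation: {op!r}")
    if start is not None:
        intervals.append((start, pos))
    return intervals
```

For a read that starts at contig position 100, aligns 30 bases, skips 500 and aligns another 20, the call `matched_intervals(100, [("M", 30), ("N", 500), ("M", 20)])` yields the two intervals `(100, 130)` and `(630, 650)`.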

In addition, a function is needed to read the given GFF file, which contains the annotations for specific intervals (such as genes, exons, etc.) on a contig. The FragmentStore will be extended by an annotationStore and an annotationNameStore, in which the information from the GFF file will be stored. The annotationNameStore will hold the names of the intervals in lexicographical order, so that the position of an entry implicitly defines its id. The intervals themselves will be stored in the annotationStore at the positions given by these ids, together with the parent id (e.g. the gene id for an exon), the contig id, and the start and end positions. Furthermore, an intervalTreeStore will be created as part of the FragmentStore. For each contig two interval trees will be created (one for each orientation); they hold the intervals of the annotations on that contig, each with a pointer to the corresponding annotationStore entry. (Interval trees are already implemented in SeqAn.)
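The layout of the stores described above might look roughly like the following sketch. It is not the SeqAn implementation. It assumes GFF3-style `ID=` and `Parent=` attributes in the ninth column, uses the hypothetical names `Annotation`, `load_gff` and `NO_PARENT`, and keeps a plain list per (contig, strand) pair where the real module would build an interval tree.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple

NO_PARENT = -1  # hypothetical marker for annotations without a parent


@dataclass
class Annotation:
    parent_id: int  # id of the parent annotation (e.g. the gene of an exon)
    contig: str
    begin: int      # 0-based, half-open [begin, end)
    end: int
    strand: str


def parse_attributes(field: str) -> Dict[str, str]:
    """Split a GFF3-style attribute column such as 'ID=exon1;Parent=geneA'."""
    attrs: Dict[str, str] = {}
    for part in field.strip().split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def load_gff(lines: Iterable[str]):
    """Build a name store, an annotation store and a per-strand interval index."""
    records: Dict[str, Tuple[Optional[str], str, int, int, str]] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.split("\t")
        if len(cols) < 9:
            continue  # not a complete feature line
        contig, start, end, strand = cols[0], cols[3], cols[4], cols[6]
        attrs = parse_attributes(cols[8])
        name = attrs.get("ID")
        if name is None:
            continue  # this sketch only keeps named annotations
        # GFF coordinates are 1-based and inclusive; convert to half-open.
        records[name] = (attrs.get("Parent"), contig, int(start) - 1, int(end), strand)

    # Names in lexicographical order: the position of a name is its id.
    name_store: List[str] = sorted(records)
    id_of = {name: i for i, name in enumerate(name_store)}

    annotation_store: List[Annotation] = []
    index: Dict[Tuple[str, str], List[Tuple[int, int, int]]] = defaultdict(list)
    for ann_id, name in enumerate(name_store):
        parent, contig, begin, end, strand = records[name]
        parent_id = id_of.get(parent, NO_PARENT)
        annotation_store.append(Annotation(parent_id, contig, begin, end, strand))
        # One list per contig and orientation, standing in for an interval tree.
        index[(contig, strand)].append((begin, end, ann_id))
    return name_store, annotation_store, dict(index)
```

Because the name store is sorted, looking up the name of annotation `i` is a plain index access (`name_store[i]`), and each entry of the index carries the id that points back into the annotation store.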

Next, the interval trees will be searched for the matched intervals of each read. The existing function findIntervals() will be used for this, but we will search with shortened intervals to compensate for sequencing errors or inaccuracies. After that, a new function will increment the counts for the current interval and, if necessary, for its parent intervals. To do this it is necessary to distinguish the ways in which a matched interval can occur (e.g. the read matches two overlapping exons, but only one of them completely). There will be three different results. They are stored in containers (SeqAn Strings and StringSets) at the position corresponding to the annotationStore id. Finally, a GFF file will be created as output; the names can be retrieved quickly from the annotationNameStore by using the ids.
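The search with shortened intervals and the propagation of counts to parent intervals can be sketched as follows. This is a simplification under stated assumptions: a linear scan replaces the interval-tree query (it is not SeqAn's findIntervals()), the names `shorten`, `find_containing`, `count_read` and the default `margin` are hypothetical, and only annotations that fully contain the shortened interval are counted. It does not reproduce the three result types mentioned above.

```python
from typing import List, Set, Tuple

NO_PARENT = -1  # hypothetical marker for annotations without a parent


def shorten(begin: int, end: int, margin: int) -> Tuple[int, int]:
    """Trim `margin` bases from both ends, keeping at least one base."""
    if end - begin <= 2 * margin:
        mid = (begin + end) // 2
        return mid, mid + 1
    return begin + margin, end - margin


def find_containing(intervals: List[Tuple[int, int, int]], begin: int, end: int) -> List[int]:
    """Linear scan standing in for an interval-tree query.

    Returns the ids of all annotations [b, e) that contain [begin, end).
    """
    return [ann_id for b, e, ann_id in intervals if b <= begin and end <= e]


def count_read(intervals: List[Tuple[int, int, int]], parent_of: List[int],
               counts: List[int], read_begin: int, read_end: int,
               margin: int = 2) -> None:
    """Increment the counts of every annotation hit by a read, and of its ancestors."""
    query_begin, query_end = shorten(read_begin, read_end, margin)
    touched: Set[int] = set()
    for ann_id in find_containing(intervals, query_begin, query_end):
        # Walk up the parent chain; stop at ancestors that were already visited
        # so that each annotation is counted at most once per read.
        while ann_id != NO_PARENT and ann_id not in touched:
            touched.add(ann_id)
            ann_id = parent_of[ann_id]
    for ann_id in touched:
        counts[ann_id] += 1  # counts are indexed by annotation id
```

Shortening the query lets a read that overhangs an exon boundary by a base or two still count for that exon. Collecting the touched ids in a set first means that a gene is counted once even when the read lies in two of its overlapping exons.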

[1] Marc Sultan, Marcel H. Schulz, Hugues Richard, Alon Magen, Andreas Klingenhoff, Matthias Scherf, Martin Seifert, Tatjana Borodina, Aleksey Soldatov, Dmitri Parkhomchuk, Dominic Schmidt, Sean O'Keeffe, Stefan Haas, Martin Vingron, Hans Lehrach and Marie-Laure Yaspo, A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science (2008), published online July 3.

Weekly Reports

Week 1 ( - 09.08):

Week 2 (10.08. - 14.08.):

Implementation of:

Comments

Great, that is exactly how it should be!

-- Main.maschulz - 13 Jul 2009