Overlap Module for NGS Pipeline


The overlap module merges the information obtained by mapping reads to a genome with annotation information (for example genes, transcripts, or known intervals of genomic aberrations such as inversions). It can be used to measure gene expression levels from RNA-Seq data (NGS of mRNAs) or to produce visualisation files for sequencing coverage. To do so, it counts the matched reads for each annotated interval.


This thesis uses data structures that are already implemented in SeqAn. The most important one is the FragmentStore, which stores reads, contigs and alignments, among other things. The first step is to write a function that extracts, for a given read, the intervals of the contig that are matched in its alignment.
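The extraction step can be illustrated with a minimal sketch, written in Python rather than SeqAn C++ to keep it short. It assumes (this is not specified above) that an alignment is available as a contig start position plus a list of (operation, length) pairs in the style of CIGAR strings: 'M' and 'D' consume contig positions inside a matched interval, 'N' is a long gap such as an intron and closes the current interval, and 'I' consumes read bases only. The name matched_intervals is hypothetical and not part of SeqAn.

```python
def matched_intervals(contig_start, ops):
    """Return the half-open contig intervals [begin, end) matched by one read.

    Assumption: `ops` is a CIGAR-like list of (operation, length) pairs.
    """
    intervals = []
    pos = contig_start      # current position on the contig
    begin = None            # start of the interval currently being built
    for op, length in ops:
        if op in ("M", "D"):        # consumes contig, stays inside an interval
            if begin is None:
                begin = pos
            pos += length
        elif op == "N":             # long gap: close the current interval
            if begin is not None:
                intervals.append((begin, pos))
                begin = None
            pos += length
        elif op == "I":             # consumes read only, contig position unchanged
            continue
        else:
            raise ValueError(f"unsupported operation: {op!r}")
    if begin is not None:           # close the last open interval
        intervals.append((begin, pos))
    return intervals


# A read spliced across a 500 bp gap yields two matched intervals.
print(matched_intervals(100, [("M", 30), ("N", 500), ("M", 20)]))
# → [(100, 130), (630, 650)]
```

Treating short deletions as part of an interval while splitting at long gaps is a design choice of this sketch; the actual function may draw that line differently.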

In addition, a function is needed to read the given GFF file, which contains the annotations for particular intervals (genes, exons, etc.) of a contig. The FragmentStore will be extended by an annotationStore and an annotationNameStore, in which the information from the GFF file will be stored. The annotationNameStore will hold the names of the intervals in lexicographical order, so the position of a name implicitly gives the id of the entry. The intervals will be stored in the annotationStore under the corresponding ids, together with their parent ids (e.g. the gene id for an exon), contig ids, and start and end positions. Furthermore, an intervalTreeStore will be created as part of the FragmentStore. For each contig two interval trees will be created (one per orientation); each holds the intervals of the annotations located on that contig and strand, with a pointer to the corresponding annotationStore entry. (Interval trees are already implemented in SeqAn.)
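The store layout described above can be sketched roughly as follows, again in Python instead of SeqAn C++. The sketch assumes GFF3-style attributes (ID=...;Parent=...) with unique IDs and at most one parent per entry, which the text does not state; load_annotations and group_by_contig_and_strand are hypothetical names, and a plain list per (contig, strand) stands in for the interval tree.

```python
from collections import defaultdict


def load_annotations(gff_lines):
    """Build (name_store, annotation_store) from GFF lines.

    Assumption: GFF3-style column 9 with unique ID and optional single Parent.
    """
    records = []
    for line in gff_lines:
        line = line.strip()
        if not line or line.startswith("#"):    # skip blanks and comments
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            raise ValueError(f"expected 9 columns, got {len(cols)}")
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        records.append({
            "name": attrs["ID"],
            "parent": attrs.get("Parent"),
            "contig": cols[0],
            "begin": int(cols[3]) - 1,          # GFF is 1-based, inclusive
            "end": int(cols[4]),                # stored as half-open [begin, end)
            "strand": cols[6],
        })
    # Names in lexicographical order: the position of a name is its id.
    name_store = sorted(r["name"] for r in records)
    id_of = {name: i for i, name in enumerate(name_store)}
    annotation_store = [None] * len(name_store)
    for r in records:
        annotation_store[id_of[r["name"]]] = {
            "parent_id": id_of.get(r["parent"]),   # None for top-level entries
            "contig": r["contig"],
            "begin": r["begin"],
            "end": r["end"],
            "strand": r["strand"],
        }
    return name_store, annotation_store


def group_by_contig_and_strand(annotation_store):
    """One interval collection per (contig, strand), each entry pointing to its id."""
    groups = defaultdict(list)
    for anno_id, a in enumerate(annotation_store):
        groups[(a["contig"], a["strand"])].append((a["begin"], a["end"], anno_id))
    return groups


gff = [
    "##gff-version 3",
    "\t".join(["chr1", ".", "gene", "101", "400", ".", "+", ".", "ID=geneA"]),
    "\t".join(["chr1", ".", "exon", "301", "400", ".", "+", ".", "ID=exonA2;Parent=geneA"]),
    "\t".join(["chr1", ".", "exon", "101", "200", ".", "+", ".", "ID=exonA1;Parent=geneA"]),
]
names, annos = load_annotations(gff)
print(names)    # → ['exonA1', 'exonA2', 'geneA']
```

Because the id is just the position in the sorted name list, looking up a name from an id is a single index access.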

Next, the interval trees will be searched for the matched intervals of each read. The existing function findIntervals() will be used for this, but the search will use shortened intervals to compensate for sequencing errors or other inaccuracies. After that, a new function will increment the counts for the current interval and, if necessary, for its parent intervals. This requires distinguishing the ways in which a matched interval can occur (e.g. a read that matches two overlapping exons, but only one of them completely). There will be three different results:
  • 1. For each read the ids of the intervals are stored.
  • 2. For each interval of the annotationStore the counts of matched reads are stored.
  • 3. For each interval of the annotationStore the intervals connected to it by reads, and the counts of these connections, are stored.
The results are stored in containers (SeqAn Strings and StringSets) at the position corresponding to the annotationStore id. At the end, a GFF file will be created as output; the ids allow fast access to the annotationNameStore to retrieve the names.
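The counting step and its three results could look roughly like the following sketch. It is an illustration under several assumptions, not the planned implementation: a linear scan stands in for findIntervals() on an interval tree, the parameter trim (hypothetical) controls how much each matched interval is shortened, only entries that are nobody's parent are searched directly, and the selection between partially and completely matched intervals is left out.

```python
from collections import Counter
from itertools import combinations


def find_overlapping(intervals, begin, end):
    """Linear-scan stand-in for an interval tree query on [begin, end)."""
    return [anno_id for b, e, anno_id in intervals if b < end and begin < e]


def count_reads(annotations, reads, trim=2):
    """Return (ids per read, count per annotation, counts of connected pairs)."""
    parents = {a["parent_id"] for a in annotations}
    # Assumption: only leaf entries are searched; parents are counted by propagation.
    intervals = [(a["begin"], a["end"], i)
                 for i, a in enumerate(annotations) if i not in parents]
    ids_per_read = []                       # result 1
    counts = [0] * len(annotations)         # result 2
    connections = Counter()                 # result 3
    for read_intervals in reads:
        hit = set()
        for begin, end in read_intervals:
            if end - begin > 2 * trim:      # shorten the query on both sides
                begin, end = begin + trim, end - trim
            hit.update(find_overlapping(intervals, begin, end))
        ids_per_read.append(sorted(hit))
        counted = set(hit)
        for anno_id in hit:                 # add all ancestors, once per read
            parent = annotations[anno_id]["parent_id"]
            while parent is not None:
                counted.add(parent)
                parent = annotations[parent]["parent_id"]
        for anno_id in counted:
            counts[anno_id] += 1
        for pair in combinations(sorted(hit), 2):
            connections[pair] += 1          # intervals connected by this read
    return ids_per_read, counts, connections


annotations = [
    {"parent_id": 2, "begin": 100, "end": 200},     # id 0: exon
    {"parent_id": 2, "begin": 300, "end": 400},     # id 1: exon
    {"parent_id": None, "begin": 100, "end": 400},  # id 2: gene
]
reads = [
    [(150, 200), (300, 330)],   # spliced read connecting both exons
    [(198, 230)],               # touches exon 0 with only two bases
]
ids_per_read, counts, connections = count_reads(annotations, reads)
print(ids_per_read)         # → [[0, 1], []]
print(counts)               # → [1, 1, 1]
print(dict(connections))    # → {(0, 1): 1}
```

The second read shows the effect of shortening: its two-base overlap with the first exon disappears once the query is trimmed, so it is not counted.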

[1] Marc Sultan, Marcel H. Schulz, Hugues Richard, Alon Magen, Andreas Klingenhoff, Matthias Scherf, Martin Seifert, Tatjana Borodina, Aleksey Soldatov, Dmitri Parkhomchuk, Dominic Schmidt, Sean O'Keeffe, Stefan Haas, Martin Vingron, Hans Lehrach and Marie-Laure Yaspo, A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science (2008), published online July 3.

Weekly Reports

Week 1 ( - 09.08):
  • detailed working plan created

Week 2 (10.08. - 14.08.):

Implementation of:
  • function to read gff-file and to store the information in annotationNameStore and annotationStore
  • function to extract the matched intervals of a contig in an alignment
  • function to get the annotationStore-IDs of one read, to select the right ones and to store them in a readCountStore
  • function to get the counts of read- and matepair-connections from the readCountStore and to store them in a tupleCountStore
  • function to append the parent-IDs to the readCountStore
  • function to build an annoCountStore by use of the readCountStore, which stores the counts of all annotationStore-IDs
  • function to calculate the start- and end-positions from parent-entries and to store them in the annotationStore

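One possible reading of the last item is that parent entries (e.g. genes) without explicit positions receive the span covered by their children. The sketch below illustrates that reading only; the name fill_parent_positions and the use of None for unknown positions are assumptions, and the actual function may work differently.

```python
def fill_parent_positions(annotations):
    """Widen every ancestor so that it spans all of its descendants.

    Assumption: entries with unknown positions have begin and end set to None.
    """
    for a in annotations:
        if a["begin"] is None:              # nothing to propagate from this entry
            continue
        parent = a["parent_id"]
        while parent is not None:           # walk up to the top-level entry
            p = annotations[parent]
            p["begin"] = a["begin"] if p["begin"] is None else min(p["begin"], a["begin"])
            p["end"] = a["end"] if p["end"] is None else max(p["end"], a["end"])
            parent = p["parent_id"]
    return annotations


entries = [
    {"parent_id": 2, "begin": 100, "end": 200},       # exon
    {"parent_id": 2, "begin": 300, "end": 400},       # exon
    {"parent_id": None, "begin": None, "end": None},  # gene, positions unknown
]
fill_parent_positions(entries)
print(entries[2]["begin"], entries[2]["end"])   # → 100 400
```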

Great, that is exactly how it should be!

-- Main.maschulz - 13 Jul 2009

Topic revision: 16 Aug 2009, krakau