The purpose of this thesis is to allow "easy parallelism" in the SeqAn library. This will consist of identifying parts of the library where parallelization can be applied effectively and actually engineering (designing, implementing and evaluating) parts of the library to use parallelism. Common parallelization patterns should be extracted into a generalized framework/library that can then expose high-level parallelism to the users of SeqAn. The parallel execution of the string search algorithms in the module "find" as well as the algorithms for constructing indices are obvious candidate for parallelization. The parallel execution of string matching algorithms is a good starting point. Additionally, it allows the extension of the task to many interesting research questions, e.g. how to collect the results, how to process them etc.
In this work the existing implementation of the read mapping tool RazerS should be parallelized and improved in the handling of repeats.
Next-Generation Sequencing technologies combined with a bisulfite treatment of DNA (BS-Seq) present an efficient method in the field of epigenomics, allowing for a precise analysis of methylation patterns at single-nucleotide resolution. The expanding field leads to the production of enormous amounts of data, that need to be analysed using precise and efficient software tools. The analysis needs to address the specific challenges arising from the bisulfite treatment, which converts unmethylated Cs to Ts. We present a powerful analysis workflow, implemented using the C++ software library SeqAn, for accurate bisulfite read mapping, the detection of single-nucleotide polymorphisms and methylation level calling at single-nucleotide resolution. Furthermore, we provide a multiple sequence realignment method to improve the alignment accuracy for indel reads. We show that in comparison to existing methods, we achieve comparable and even better results.
Read mapping is a fundamental task in DNA sequence analysis. Current read mapping tools are very fast and precise but usually fail to map reads that cross breakpoints of structural variants (SVs), or exon-exon junctions in the case of RNA-sequencing. These breakpoints cause one or even multiple splits in the read-to-reference alignment, with parts of the read mapping to different locations on the reference sequence. Identification and classification of SVs is important to evaluate their functional impact but remains challenging. So far, there is no sophisticated SV detection method that can determine all types of SVs at single-nucleotide resolution while being independent from different platforms like Illumina or 454, or from paired-end and single-end reads.
We designed and implemented a sound generic multi-split chaining method using the C++ library SeqAn that uses SeqAn’s exact local aligner Stellar to detect splits of a read. Compatible local matches of a read are then identified, and all compatibility information is stored in a split-read graph representation of the matches. We then use a DAG shortest path algorithm to determine the most probable chain of splits, and report the underlying breakpoints.
Our approach is more versatile compared to existing split-read methods. It allows for multiple splits at arbitrary locations in the read, and is able to detect inversions, inter- and intra-chromosomal translocations, duplications, insertions, and deletions. At the same time, it is independent of the read length. We successfully applied our method to simulated Illumina read data and also to 454 RNA-Seq data, yielding robust results that can compete with the results of the tool SVDetect and the Illumina and the 454 analysis software.
This work presents a useful data structure to manage multiple biological sequences both efficiently and generically. The objective is to create a data structure to aggregate simila- rities between biological sequences. The implemented method ensures that the contai- ned sequences can be compressed while simultaneously the efficient analysis of multiple sequences exploiting parallel computing approaches is supported. The development of new techniques in both categories has a strong impact on the state of the art research in the field of genome research, especially since the current development and wide spread use of next generation sequencing is growing exceptionally fast.
Es sollen Methoden entwickelt werden, mit denen inkrementelle Änderungen eines Textes (bspw. das Einfügen oder Entfernen von kurzen Sequenzen) auch auf bereits erzeugte zugehörige Substring-Indizes (bspw. Enhanced Suffix Array, Lazy Suffix Tree) angewendet werden können, ohne sie komplett neu aufzubauen.
This thesis deals with the problem of post-processing read mapping data to resolve ambiguously mapped reads, i.e. regions within the multi-alignment produced by a read mapper that are covered by reads that actually originate from different regions. For that purpose, an algorithm work flow was designed, implemented and evaluated for different read mapping scenarios.
In der Masterarbeit soll eine Methode zur Detektion von Copy number variations (CNVs) in Second Generation Sequencing Daten implementiert werden. Der entwickelte Algorithmus wird anschließend auf simulierte und reale Sequenzierdaten von einem Tumorgenom und Daten aus dem 1000 Genomes Projekt angewendet.
Motivation. Next-generation sequencing technologies produce unprecedented amounts of data, leading to completely new research fields. One of these is metagenomics, the study of large-size DNA samples containing diverse organisms. A key problem in metagenomics is functionally and taxonomically classifying the sequenced DNA, to which end the well known BLAST program is usually used. But BLAST has dramatic resource requirements at metagenomic scales of data, imposing a high financial or technical burden on the researcher. Multiple attempts have been made to overcome these limitations and present a viable alternative to BLAST, but as of yet, they have not gained widespread adoption.
Results. In this work we present Lambda, our own alternative for BLAST in the context of se- quence classification. In our tests Lambda is among the best tools at reproducing BLAST’s results and is faster than all existing solutions at comparable levels of sensitivity.
The technological development in the field of genome research has resulted in a massive generation of data that has to be stored and analyzed. The enormous amount of information demands special data structures and algorithms for an efficient analysis. Such an analysis often requires the identification of interesting sequences in genomes, which can be realized using full-text indices. Until recently, the major problem of this approach was its memory consumption, which now can be overcome using the well known FM-index. Therefore, in this thesis we extended the software library SeqAn that provides data structures and algorithms for analyzing biological sequences, with sophisticated FM-index versions designed for fast and memory efficient pattern search. We show that in comparison with existing FM-index implementations our variants are not only competitive to other approaches, but also outperform them.
Next-Generation Sequencing (NGS) allows to sequence and output millions of short reads in a single run. In many NGS pipelines for each of these reads a matching alignment in a reference genome needs to be found. This process is called read mapping. Some of these reads can not be unambiguously aligned to one position. Therefore some read mappers use mapping quality scores to indicate the reliabil- ity of the alignments. These scores are assigned to individual alignments and do not directly indicate ambiguous regions in the genome. Also some read mappers, especially read mappers which do not output subsequent matches, do not calculate mapping quality scores directly.
In this thesis we present a novel approach to estimate mapping quality scores without using any information about subsequent alignments. Our approach calculates the mapping quality score of perfect sequences extracted from the genome. For every position in the genome the average of the mapping quality scores is saved, as is the score at the starting position of the perfect sequence. The highest score covered by an alignment is used to calculate the mapping quality. This score is used to annotate the result file of a read mapper.
We evaluate the results by comparing them with the results of a read mapper which calculates mapping quality scores and directly calculated mapping quality scores. Our results indicate that the number of errors and the base call qualities have only a very small influence on the mapping quality score of an alignment. Because we only find a small influence we can precalculate mapping quality scores for a given a genome.
Additionally for a given genome we use the different read mapping results to find simulated single nucleotide polymorphisms (SNP). In our experiment the read map- per results without mapping quality scores generate better results than those with annotated mapping quality scores.
The goal is to develop a carefully desigend benchmark dataset of real and artificial data sets for "approximate" read mapping. Also all current read mappers should be compared using the benchmark.
This work will be about the possibilty to use GPGPU methodes in Seqan (especially with NVidia CUDA) and the implementation and parallelization of Myers fast bitvector algorithm for approximate string matchings. It is stated that using massive multiprocessor programming can dramatically increases the speed and performance of algorithms, depending on the amount that could be parallized and the efficiency it is implemented. Using hundreds or thousands of parallel threads for read mapping/ localizing/matching to a genom with the fast GPU cores and Myers fast matching algorithm can give an great gain of speed for sequence analyse and string matchings.
Develop an OpenMS/TOPP tool addressing the following tasks: (1) correction of instabilities of the retention time (2) multiple alignment of the LC-MS/MS signals caused by peptides (3) compute and maintain a consensus over many experiments.