Algorithmic Bioinformatics

Kyanoush Seyed Yahosseini

Precalculated Mapping Quality Scores

Academic Advisor: David Weese, Knut Reinert
Discipline: Bioinformatics
Degree: Master of Science (M.Sc.)
Date: Dec 09, 2013
Project:
Status: finished

Abstract

Next-Generation Sequencing (NGS) makes it possible to sequence and output millions of short reads in a single run. In many NGS pipelines, a matching alignment in a reference genome needs to be found for each of these reads. This process is called read mapping. Some of these reads cannot be unambiguously aligned to one position. Therefore, some read mappers use mapping quality scores to indicate the reliability of the alignments. These scores are assigned to individual alignments and do not directly indicate ambiguous regions in the genome. Also, some read mappers, especially those that do not output subsequent matches, do not calculate mapping quality scores directly.

In this thesis we present a novel approach to estimate mapping quality scores without using any information about subsequent alignments. Our approach calculates the mapping quality score of perfect sequences extracted from the genome. For every position in the genome the average of the mapping quality scores is saved, as is the score at the starting position of the perfect sequence. The highest score covered by an alignment is used to calculate the mapping quality. This score is used to annotate the result file of a read mapper.
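The following Python sketch illustrates one possible reading of the precalculation described above; it is not the implementation from the thesis. The scoring function is a stand-in assumption: a perfect window that occurs n times exactly in the genome is treated as correct with probability 1/n, and that probability is Phred-scaled and capped at 60. The function names, the cap, and the choice to take the maximum over the per-position average track are hypothetical.

```python
import math
from collections import Counter


def start_scores(genome, read_len, max_score=60):
    """Toy mapping quality for every perfect (error-free) window.

    Assumption: a window with n exact occurrences is mapped correctly
    with probability 1/n. The thesis may use a different score.
    """
    windows = [genome[i:i + read_len]
               for i in range(len(genome) - read_len + 1)]
    counts = Counter(windows)
    scores = []
    for window in windows:
        p_wrong = 1.0 - 1.0 / counts[window]
        if p_wrong == 0.0:
            scores.append(max_score)  # unique window: capped score
        else:
            scores.append(min(max_score, round(-10 * math.log10(p_wrong))))
    return scores


def average_track(scores, genome_len, read_len):
    """Per position, average the scores of all windows covering it."""
    sums = [0.0] * genome_len
    cover = [0] * genome_len
    for start, score in enumerate(scores):
        for pos in range(start, start + read_len):
            sums[pos] += score
            cover[pos] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, cover)]


def annotate(track, begin, end):
    """Highest precalculated score covered by an alignment [begin, end)."""
    return max(track[begin:end])


genome = "AAAAACGT"
read_len = 3
starts = start_scores(genome, read_len)
track = average_track(starts, len(genome), read_len)
repeat_quality = annotate(track, 0, 3)   # alignment inside the A-repeat
unique_quality = annotate(track, 5, 8)   # alignment in the unique tail
```

In this toy genome, the alignment inside the repeated region receives a low score and the alignment in the unique tail receives the capped score. Both tracks depend only on the genome and the read length, so they can be computed once and reused to annotate any mapper's output.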

We evaluate the results by comparing them with the results of a read mapper that calculates mapping quality scores and with directly calculated mapping quality scores. Our results indicate that the number of errors and the base call qualities have only a very small influence on the mapping quality score of an alignment. Because we find only a small influence, we can precalculate mapping quality scores for a given genome.

Additionally, for a given genome we use the different read mapping results to find simulated single nucleotide polymorphisms (SNPs). In our experiment, the read mapper results without mapping quality scores yield better results than those with annotated mapping quality scores.

Contents

Literature:

  1. Li, Heng, Jue Ruan, and Richard Durbin. 2008. “Mapping Short DNA Sequencing Reads and Calling Variants Using Mapping Quality Scores.” Genome Research 18 (11) (November): 1851–1858. doi:10.1101/gr.078212.108.
  2. Lee, H, and M C Schatz. 2012. “Genomic Dark Matter: the Reliability of Short Read Mapping Illustrated by the Genome Mappability Score.” Bioinformatics (Oxford, England) 28 (16) (August 7): 2097–2105. doi:10.1093/bioinformatics/bts330.
  3. Siragusa, Enrico, David Weese, and Knut Reinert. 2013. “Fast and Accurate Read Mapping with Approximate Seeds and Multiple Backtracking.” Nucleic Acids Research (January 28). doi:10.1093/nar/gkt005.
  4. Weese, David, M Holtgrewe, and Knut Reinert. 2012. “RazerS 3: Faster, Fully Sensitive Read Mapping.” Bioinformatics (Oxford, England) 28 (20) (October 10): 2592–2599. doi:10.1093/bioinformatics/bts505.
