You are here: ABI » SnpStore

Page SnpStore

Short overview of SnpStore implementation.

General structure

Here is a general pseduocode-like outline of how SnpStore proceeds. The functionality is implemented in the function given in brackets.

  1. Option parsing (main)
  2. Import genome (detectSNPs)
  3. Scan over each chromosome (detectSNPs)
  4. Import reads of current genomic window, store in FragmentStore (readMatchesFrom<GFF/SamBam>_Batch)
  5. Apply pile up correction to decrease the effect of PCR amplification artifacts (applyPileupCorrection)
  6. Transform coordinates to local ones (transformCoordinates)
  7. Apply read clipping (clip tags as well as soft clipping) (clipReads)
    1. IF realignment switched on:
      1. Copy each local group into an own FragmentStore, transform coordinates (realignReadsBatchWrap)
      2. IF there are indel-containing reads:
        1. Do realignment in two rounds: once without reference, then with reference sequence included (dumpVariantsRealignBatch)
        2. Do SNP calling by walking through the multi-read-to-ref alignment columns and applying one of the two SNP calling models described below (dumpVariantsRealignBatch)
        3. Generate a (column by column) indel consensus call and later do indel calling based on the indel consensus sequence (dumpVariantsRealignBatch)
      3. ELSE: see line 7.2.2, do SNP calling without realignment (dumpSNPsBatch)
    2. ELSE:
      1. Do indel calling by inspecting all candidate indels in the pairwise alignments of mapped reads (dumpShortIndelPolymorphismsBatch)
      2. Do SNP calling based on the multi-read-to-reference alignment induced by the pairwise alignments of mapped reads (dumpSNPsBatch)

Indels and SNPs are written out on the fly. The actual criteria to do SNP and indel calling are described in the following.

SNP calling

There are two methods implemented for SNP calling: the threshold model and the Maq-like model. Both operate on an alignment column of the (realigned) multi-read-to-reference alignment. And for both calling methods, first the following criteria are checked for each candidate variant:
  • A minimal coverage, i.e. min. read depth, needs to be observed at the candidate position.
  • A map of read positions records how many different read positions support the variant. Positions that are within X (user parameter -eb) bases of the read border are excluded from this map. Only variant candidates supported by at least Y (user parameter -dp) different read positions may be called as SNPs. (usually we set X <= 5% of read length, and Y >= 2 depending on coverage).
  • The two most frequent bases in the alignment column need to make up at least a certain fraction of the alignment column (user parameter -mec).

The threshold model then checks the following criteria
  • There needs to be at least a certain number of reads supporting the variant (user parameter -mm)
  • The variant-supporting reads need to constitute at least a certain fraction of the alignment column (user parameter -pt)
  • The average base quality value associated with the variant base needs to be at least a certain value (user parameter -mq)

The Maq-model on the other hand uses a Bayesian statistical framework and computes genotype probabilites for diploid genotypes aa, ab, and bb for the two most frequent bases a and b in each candidate alignment column. See the original paper "Mapping short DNA sequencing reads and calling variants using mapping quality scores" by Heng Li for details. The only difference is that SnpStore uses raw base call quality values instead of mapping qualities. Also, SnpStore offers a version of the Maq-statistics that corrects for a larger variance than expected from a binomial distribution at heterzygote variant positions. This comes from amplification biases as described in "The allele distribution in next-generation sequencing data sets is accurately described as the result of a stochastic branching process" by Heinrich et al. However, this option is switched off by default, as the increase in heterozygote calling sensitivity comes at the cost of a considerable increase in false positive SNP calls.

Indel calling

For indel calling only a threshold model is implemented. The criteria checked are:
  • There needs to be at least a certain number of reads supporting the indel (user parameter -it)
  • The indel-supporting reads need to constitute at least a certain fraction of the alignment column (user parameter -ipt)
  • Optionally, only reads overlapping at least X bases with the indel candidate position will be considered as indel-contradicting (user parameter -ebi)
  • The average base quality value associated with the indel-neighboring bases needs to be at least a certain value (user parameter -iqt)
  • The indel needs to be observed on reads from both strands (if user parameter -bsi is switched on)

 
This site is powered by FoswikiCopyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Foswiki? Send feedback