Page VariantDetectionPapers

Remarks to papers concerning variant detection in NGS data.

Method Papers

Simple variation

VarScan: variant detection in massively parallel sequencing of individual and pooled samples

Daniel C. Koboldt∗, Ken Chen, Todd Wylie, David E. Larson, Michael D. McLellan, Elaine R. Mardis, George M. Weinstock, Richard K. Wilson and Li Ding

Tool Name: VarScan (available for download)

Variation Types: SNPs, indels

Assumptions: Illumina/454

Method: "Given an alignments file, VarScan scores and sorts the alignments on a perread basis, discarding reads that aligned with low identity or to multiple (ambiguous) locations in the reference sequence. Next, the single best alignment for each read is screened for sequence changes. Variants detected in multiple reads are then combined together into unique SNPs and indels. For each predicted variant, VarScan determines the overall coverage, as well as the number of supporting reads, average base quality, and number of strands observed for each allele. Thresholds for coverage, quality, variant frequency, and/or number of reads required to call a variant are set automatically with the easyrun command, but can be manually adjusted by the user. VarScan reports SNPs, insertions, and deletions with their chromosomal coordinates, alleles, flanking sequence, and read counts."


"Here, we describe VarScan, an open source tool for detecting SNPs, insertions, and deletions and assessing their frequencies in massively parallel sequencing data. Unlike currently available variant detection tools, VarScan is compatible with several read aligners (BLAT, Newbler, cross_match, Bowtie and Novoalign), and calls variants in both individual and pooled samples."

SNP detection for massively parallel whole-genome resequencing

Ruiqiang Li, Yingrui Li, Xiaodong Fang, Huanming Yang, Jian Wang, Karsten Kristiansen and Jun Wang

Tool Name: SOAPsnp (available for download)

Variation Types: SNPs

Assumptions: Illumina sequencing data

  • Bayesian genotype calling (for each position likelihood is computed, priors are set according to the observation that transitions are four times more frequent than transversions)
  • Recalibration of quality values according to observed mismatch rate
  • Account for intrinsic bias of errors in Illumina data
  • Sum rank test is used to eliminate artificial heterozygote calls

Bayesian model:

  • posterior probability of genotype $T_i$: $P(T_i|D) = frac{P(T_i)P(D|T_i)}{sum_{x=1}^{S}P(T_x)P(D|T_x)}$

  • likelihood of diploid genotype $T_i=H_{m}H_n$: $P(d_k|T)=frac{P(d_k|H_m)+P(d_k|H_n)}{2}$

  • likelihood of $P(D|T)=\prod_{k=1}^{n}P(d_k|T)$

  • Low false positive rate for any sequencing depth

galign: A tool for rapid genome polymorphism discovery

Shai Shaham September 2009 PLoS ONE

Dindel: Accurate indel calls from short-read data

Cornelis A. Albers, Gerton Lunter, Daniel G. MacArthur, Gliean McVean, Willem H. Ouwehand, and Richard Durbin. Genome Reserach Oct 2010

Tool Name: Dindel

Variation Type: small indels, SNPs

Assumptions: Illumina data


produces candidate haplotypes of (indel) candidate regions, realigns reads to haplotypes (iteratively, using EM algo), and assign quality scores to indel calls.

Input: requires mapped reads (BAM) with mapping qualities.

Pairede-end mapping based

Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes

Fereydoun Hormozdiari, Can Alkan, Evan E. Eichler, and S. Cenk Sahinalp

Tool Name: ?

Variation Type: indels, inversions (SVs)

Assumptions: PE data

Method: ?

BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nature Methods 2009.

Ken Chen, John W Wallis, Michael D McLellan, ... Elaine R Mardis.

Tool Name: BreakDancer

PEMer: a computational framework with simulation-based error models for inferring genomic structural variants from massive paired-end sequencing data. Genome Biology 2009.

Jan O Korbel, Alexej Abyzov, Xinmeng J Mu, ... Michael Snyder and Mark B Gerstein.

Tool Name: PEMer

SVDetect: a tool to identify genomic structural variations from paired-end and mate-pair sequencing data. Bioinformatics 2010.

Bruno Zeitouni, Valentina Boeva, Isabelle Janoueix-Lerosey, ... Emmanuel Barillot.

Tool Name: SVDetect

MoDIL: detecting small indels from clone-end sequencing with mixtures of distributions. Nature Methods 2009.

Seunghak Lee, Fereydoun Hormozdiari, Can Alkan and Michael Brudno.

Tool Name: MoDIL

Depth-of-coverage (DOC) methods

Sensitive and accurate detection of copy number variants using read depth of coverage

Seungtai Yoon, Zhenyu Xuan, Vladimir Makarov, et al. August 2009 Genome Research

Tool Name: EWT (not available for download)

Variation Type: CNVs

Assumptions: High coverage (30x) -> normality in 100bp windows

Method: EWT (event wise testing). Number of read starts (read depth RD) in non-overlapping 100bp windows.
  • For each window:
RD r_i is corrected according to GC content in window $r_{i}^{'} = r_i * m/m_GC$ where $m =$ median RD, $m_GC =$ median RD for windows with this GC. Z-score: $z_i = (r_{i}^{'} - mean_readcount) / sigma_cov$. Upper/lower tail probability: $p_i^up = P(Z>z_i). p_i^low = P(Z<z_i)$.
  • For an interval A of consecutive windows:
unusual event, duplication: $max{p_{i}^{up} | i \in A} < (FPR/(L/l))^{(1/l)}$ unusual event, deletion: $max{p_{i}^{low} | i \in A} < (FPR/(L/l))^{(1/l)}$ where FPR is the nominal false-positive rate, $L$ is the number of windows of a chromosome, $l$ is the size of the interval A.
  • Iteration over l until N-1 where $(FPR/(L/N))^{(1/N)} > 0.5$
eg 100bp-windows on chromosome 1: L = 2.4Mio, FPR = 0.05 -> l iterates until 19 -> unusual events up to size 1.9kb
  • Merging of events, cluster formation: events within 500bp proximity are merged (if copy number change in the same direction). events with low absolute difference from the average (compare median rd in event vs global average) are discarded (if within 0.75..1.25). Significance of each cluster is tested by a one-sided Z-test (significance level $10^{-6}$)
  • Pairwise comparison of RD data from multiple individuals: RD in detected 100bp windows are compared by t-test, if p-values < 0.001 and absolute difference between read counts > 0.5 --> polymorphic.
  • Copy number = round(2*r_i^'/mean_readcount)

Difference arrayCGH <-> RD data: for array data variance in probe ratios is lowest for normal state, variance increases with copy number changes in both directions. for RD data, variance is lowest for the deletion state and increases proportionally with copy number.

Overlap PE detected SVs <-> RD detected SVs: very low (5-10%)! PE SVs enriched in simple repeats (12% of predicted PE SVs), RD SVs enriched in segmental duplications (40% of predicted RD SVs)

Implementation: Java

  • inconsistencies: the authors first write that reads with mapping quality < 30 were filtered out (MAQ alignments). later they say that due to maq's random distribution of multi-reads coverage in repetitive regions is not different from the mean coverage. ???

Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs

Joshua M Korn, Finny G Kuruvilla, Steven A McCarroll, Alec Wysoker, James Nemesh, Simon Cawley, Earl Hubbell, Jim Veitch, Patrick J Collins, Katayoon Darvishi, Charles Lee, Marcia M Nizzari, Stacey B Gabriel, Shaun Purcell, Mark J Daly, and David Altshuler October 2008 Nature Genetics

Tool Name: Birdsuite (available for download)

Variation Types: SNPs, CNVs

Assumptions: SNP array data, but ideas should be applicable to sequence data

  • CNP genotyping: Assign copy number across regions of known CNVs
one-dimensional Gaussian mixture model, clusters corresponding to different copy numbers are used to assign copy number to these known regions
  • SNP genotyping: Assign diploid genotypes to all regions having copy number two (+ detect probe-specific means and variances)
two-dimensional Gaussian mixture model, EM algo to determine clusters corresponding to genotypes AA,AB,BB
  • CNV discovery: Rare CNV detection by using probe-sepcific means and variances
HMM on CNV estimates for single probes. The hidden state is the true copy number of the individual’s genome; the observed states are the normalized intensity measurements of each probe on the array. emission probs: intensity measurements given underlying state. transition probs: between different copy numbers low, high within same copy number, also dependent on probe distance
  • Use CN information to estimate genotypes such as AAB and A
uses the imputed locations in A/B intensity space (imputation-extended two dimensional Gaussian mixture model) to assign allele-specific copy number genotype

Comments: integrated method that combines SNP with CNP calling, relies on a map of known CNVs and known SNPs. uses GMMs and HMMs.

Split/Spliced-read mapping

Fast and SNP-tolerant detection of complex variants and splicing in short reads

Thomas D. Wu and Serban Nacu Februray 2010, Bioinformatics

Tool Name: GSNAP

Implementation: C and Perl

Variation Types: complex variants, splicing


  • SNP-tolerant: aligning to a reference "space" (major/minor-allele genome)
  • index 12-mers every 3 nt in the genome (including all known SNP combinations)
  • spanning/complete set method to filter candidate regions
  • "full" sensitivity (as if every 12-mer were indexed) (100% sensitivity actually only guaranteed when read and ref share a common 14-mer --> floor(readLen/14)-1 mismatches)
  • ignores 12-mers that occur more than 10 times the mean (greedy option for subsequently including them, if no alignment is found)
  • indel can be close to the end of the read (vs our method: no shorter than q!)
  • probabilistic model for splice sites. if aligned exon sequence on one side is short, splice score has to be higher. if aligned sequence is long, splice score can be lower
  • also reports interchromosomal splice junctions and "partial-splicing"-matches
  • PE mode

results on simulated data:
  • 100K reads from whole genome, read lengths 36, 70, 100. ins up tp 9bp, del up to 30bp: compared with BWA, Bowtie, SOAP2, SOAP, MAQ. most sensitive for indels, runtimes comparable
  • simluated spliced reads based on RefSeq splice sites: missed only 0.1% of intragenic and 5% of intergenic splice sites (due to ow splice scores)
results on real RNA read data, read length 50, 100K reads:
  • known-SNP-tolerance increases alignment yield by 0.5%, 5% of reads were found to have additional best mapping locations, 0.3% of reads could be disambiguified by SNP tolerance

Comments:<\b> - can handle sequences of different length


Tool Name: SuperSlat

Implementation: c

Variation Types: splice junctions


Method: q-gram index of reference. No mismatches in alignment: "Supersplat works by partitioning each read into all possible two-chunk sizes, given minimum read chunk size (-c), and searching over each reference for locations where both chunks exactly match the reference, within given minimum intron size (-n) and maximum intron size (x)."

Comments: output can become extensive! outputs every possible split position on a read, i.e. each read multiple times

Global und unbiased detection of splice junctions from RNA-seq data

Tool Name: SplitSeek

Implementation: Perl

Variation Types: splice junctions


  • uses Applied Biosystems tools, splits reads into prefix-anchor, gap, and suffix-anchor. anchors are aligned separately, only reads where both anchors map uniquely are from then on considered!! combines anchor matches in perl script (extending half-matched reads into other exon if supported by other reads)
  • can identify junctions reads that overlap >= 5 bp with exon

results: evaluated on SOLiD RNA-seq data, 50bp:
  • low FDR (0.1%)
  • compare with RNA-MATE ?

Detection of splice junctions from paired-end RNA-seq data by SpliceMap

Tool Name: SpliceMap

Implementation: Python

Variation Types: splice junctions


  • Half-read mapping
  • Seeding selection
  • Junction search (* Paired-end filtering)

MapSplice: Accurate mapping of RNA-seq reads for splice junction discovery

Kai Wang, Darshan Singh, Zheng Zeng,Stephen J. Coleman, ... , Jan F. Prins and Jinze Liu. June 2010 Nucleic Acids Research

Tool Name: MapSplice

Implementation: Python

Variation Types: splice junctions



first part: single read mapping

  • partition reads into segments
  • (exonic?) alignment of segments (using Bowtie), keep multi-hits
  • spliced alignment of segments, double-anchored (first and last segment of read match, but middle one doesnt) or single-anchored (only one neighboring segment matches, search within maxDist window)
  • segment assembly, find best split position, assign quality value to read (simple mismatch counts basically, but somehow quality weighted)

second part: multi-read assignment

  • splice junction inference: each read gets an anchor significance score, dependent on the length of the shortest segment. each candidate splice junction get a significance score: the max anchor score of supporting reads, a shannon entropy score somehow rewarding uniformity of mapped reads, and an average read quality, somehow combined.
  • select best alignment for each read: combine read quality socre + junction quality score

Comments:<\b> tried on 1Mio 100bp reads, onto first 5Mb of chr21 --> used 90% of RAM on ergophobie!!! --> > 200Gb --> not of practical use... using a smaller file --> merge pair error or could not allocate memory error..

A robust framework for detecting structural variations in a genome

Seunghak Lee, Elango Cheran and Michael Brudno 2008 Bioinformatics


Topic revision: r20 - 18 May 2011, AnneKatrinEmde
  • Printable version of this topic (p) Printable version of this topic (p)