Page VariantDetectionPapers

Remarks to papers concerning variant detection in NGS data.

Method Papers

Simple variation

VarScan: variant detection in massively parallel sequencing of individual and pooled samples

Daniel C. Koboldt∗, Ken Chen, Todd Wylie, David E. Larson, Michael D. McLellan, Elaine R. Mardis, George M. Weinstock, Richard K. Wilson and Li Ding

Tool Name: VarScan (available for download)

Variation Types: SNPs, indels

Assumptions: Illumina/454

Method: "Given an alignments file, VarScan scores and sorts the alignments on a perread basis, discarding reads that aligned with low identity or to multiple (ambiguous) locations in the reference sequence. Next, the single best alignment for each read is screened for sequence changes. Variants detected in multiple reads are then combined together into unique SNPs and indels. For each predicted variant, VarScan determines the overall coverage, as well as the number of supporting reads, average base quality, and number of strands observed for each allele. Thresholds for coverage, quality, variant frequency, and/or number of reads required to call a variant are set automatically with the easyrun command, but can be manually adjusted by the user. VarScan reports SNPs, insertions, and deletions with their chromosomal coordinates, alleles, flanking sequence, and read counts."

Comments:

"Here, we describe VarScan, an open source tool for detecting SNPs, insertions, and deletions and assessing their frequencies in massively parallel sequencing data. Unlike currently available variant detection tools, VarScan is compatible with several read aligners (BLAT, Newbler, cross_match, Bowtie and Novoalign), and calls variants in both individual and pooled samples."

SNP detection for massively parallel whole-genome resequencing

Ruiqiang Li, Yingrui Li, Xiaodong Fang, Huanming Yang, Jian Wang, Karsten Kristiansen and Jun Wang

Tool Name: SOAPsnp (available for download)

Variation Types: SNPs

Assumptions: Illumina sequencing data

Method:

Bayesian model:

Comments:

galign: A tool for rapid genome polymorphism discovery

Shai Shaham September 2009 PLoS ONE

Dindel: Accurate indel calls from short-read data

Cornelis A. Albers, Gerton Lunter, Daniel G. MacArthur, Gliean McVean, Willem H. Ouwehand, and Richard Durbin. Genome Reserach Oct 2010

Tool Name: Dindel

Variation Type: small indels, SNPs

Assumptions: Illumina data

Method:

produces candidate haplotypes of (indel) candidate regions, realigns reads to haplotypes (iteratively, using EM algo), and assign quality scores to indel calls.

Input: requires mapped reads (BAM) with mapping qualities.

Pairede-end mapping based

Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes

Fereydoun Hormozdiari, Can Alkan, Evan E. Eichler, and S. Cenk Sahinalp

Tool Name: ?

Variation Type: indels, inversions (SVs)

Assumptions: PE data

Method: ?

BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nature Methods 2009.

Ken Chen, John W Wallis, Michael D McLellan, … Elaine R Mardis.

Tool Name: BreakDancer

PEMer: a computational framework with simulation-based error models for inferring genomic structural variants from massive paired-end sequencing data. Genome Biology 2009.

Jan O Korbel, Alexej Abyzov, Xinmeng J Mu, … Michael Snyder and Mark B Gerstein.

Tool Name: PEMer

SVDetect: a tool to identify genomic structural variations from paired-end and mate-pair sequencing data. Bioinformatics 2010.

Bruno Zeitouni, Valentina Boeva, Isabelle Janoueix-Lerosey, … Emmanuel Barillot.

Tool Name: SVDetect

MoDIL: detecting small indels from clone-end sequencing with mixtures of distributions. Nature Methods 2009.

Seunghak Lee, Fereydoun Hormozdiari, Can Alkan and Michael Brudno.

Tool Name: MoDIL

Depth-of-coverage (DOC) methods

Sensitive and accurate detection of copy number variants using read depth of coverage

Seungtai Yoon, Zhenyu Xuan, Vladimir Makarov, et al. August 2009 Genome Research

Tool Name: EWT (not available for download)

Variation Type: CNVs

Assumptions: High coverage (30x) → normality in 100bp windows

Method: EWT (event wise testing). Number of read starts (read depth RD) in non-overlapping 100bp windows. RD r_i is corrected according to GC content in window $r_{i}^{'} = r_i * m/m_GC$ where $m =$ median RD, $m_GC =$ median RD for windows with this GC. Z-score: $z_i = (r_{i}^{'} - mean_readcount) / sigma_cov$. Upper/lower tail probability: $p_i^up = P(Z>z_i). p_i^low = P(Z<z_i)$. unusual event, duplication: $max{p_{i}^{up} | i \in A} < (FPR/(L/l))^{(1/l)}$ unusual event, deletion: $max{p_{i}^{low} | i \in A} < (FPR/(L/l))^{(1/l)}$ where FPR is the nominal false-positive rate, $L$ is the number of windows of a chromosome, $l$ is the size of the interval A. eg 100bp-windows on chromosome 1: L = 2.4Mio, FPR = 0.05 → l iterates until 19 → unusual events up to size 1.9kb

Difference arrayCGH <-> RD data: for array data variance in probe ratios is lowest for normal state, variance increases with copy number changes in both directions. for RD data, variance is lowest for the deletion state and increases proportionally with copy number.

Overlap PE detected SVs <-> RD detected SVs: very low (5-10%)! PE SVs enriched in simple repeats (12% of predicted PE SVs), RD SVs enriched in segmental duplications (40% of predicted RD SVs)

Implementation: Java

Comments:

Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs

Joshua M Korn, Finny G Kuruvilla, Steven A McCarroll, Alec Wysoker, James Nemesh, Simon Cawley, Earl Hubbell, Jim Veitch, Patrick J Collins, Katayoon Darvishi, Charles Lee, Marcia M Nizzari, Stacey B Gabriel, Shaun Purcell, Mark J Daly, and David Altshuler October 2008 Nature Genetics

Tool Name: Birdsuite (available for download)

Variation Types: SNPs, CNVs

Assumptions: SNP array data, but ideas should be applicable to sequence data

Method: one-dimensional Gaussian mixture model, clusters corresponding to different copy numbers are used to assign copy number to these known regions two-dimensional Gaussian mixture model, EM algo to determine clusters corresponding to genotypes AA,AB,BB HMM on CNV estimates for single probes. The hidden state is the true copy number of the individual’s genome; the observed states are the normalized intensity measurements of each probe on the array. emission probs: intensity measurements given underlying state. transition probs: between different copy numbers low, high within same copy number, also dependent on probe distance uses the imputed locations in A/B intensity space (imputation-extended two dimensional Gaussian mixture model) to assign allele-specific copy number genotype

Comments: integrated method that combines SNP with CNP calling, relies on a map of known CNVs and known SNPs. uses GMMs and HMMs.

Split/Spliced-read mapping

Fast and SNP-tolerant detection of complex variants and splicing in short reads

Thomas D. Wu and Serban Nacu Februray 2010, Bioinformatics

Tool Name: GSNAP

Implementation: C and Perl

Variation Types: complex variants, splicing

Assumptions:

Method:

results on simulated data: results on real RNA read data, read length 50, 100K reads:

Comments:<\b> - can handle sequences of different length

SuperSplat

Tool Name: SuperSlat

Implementation: c

Variation Types: splice junctions

Assumptions:

Method: q-gram index of reference. No mismatches in alignment: "Supersplat works by partitioning each read into all possible two-chunk sizes, given minimum read chunk size (-c), and searching over each reference for locations where both chunks exactly match the reference, within given minimum intron size (-n) and maximum intron size (x)."

Comments: output can become extensive! outputs every possible split position on a read, i.e. each read multiple times

Global und unbiased detection of splice junctions from RNA-seq data

Tool Name: SplitSeek

Implementation: Perl

Variation Types: splice junctions

Assumptions:

Method:
  • uses Applied Biosystems tools, splits reads into prefix-anchor, gap, and suffix-anchor. anchors are aligned separately, only reads where both anchors map uniquely are from then on considered!! combines anchor matches in perl script (extending half-matched reads into other exon if supported by other reads)
  • can identify junctions reads that overlap >= 5 bp with exon

results: evaluated on SOLiD RNA-seq data, 50bp:
  • low FDR (0.1%)
  • compare with RNA-MATE ?

Detection of splice junctions from paired-end RNA-seq data by SpliceMap

Tool Name: SpliceMap

Implementation: Python

Variation Types: splice junctions

Assumptions:

Method:
  • Half-read mapping
  • Seeding selection
  • Junction search (* Paired-end filtering)

MapSplice: Accurate mapping of RNA-seq reads for splice junction discovery

Kai Wang, Darshan Singh, Zheng Zeng,Stephen J. Coleman, … , Jan F. Prins and Jinze Liu. June 2010 Nucleic Acids Research

Tool Name: MapSplice

Implementation: Python

Variation Types: splice junctions

Assumptions:

Method:

first part: single read mapping

  • partition reads into segments
  • (exonic?) alignment of segments (using Bowtie), keep multi-hits
  • spliced alignment of segments, double-anchored (first and last segment of read match, but middle one doesnt) or single-anchored (only one neighboring segment matches, search within maxDist window)
  • segment assembly, find best split position, assign quality value to read (simple mismatch counts basically, but somehow quality weighted)

second part: multi-read assignment

  • splice junction inference: each read gets an anchor significance score, dependent on the length of the shortest segment. each candidate splice junction get a significance score: the max anchor score of supporting reads, a shannon entropy score somehow rewarding uniformity of mapped reads, and an average read quality, somehow combined.
  • select best alignment for each read: combine read quality socre + junction quality score

Comments:<\b> tried on 1Mio 100bp reads, onto first 5Mb of chr21 --> used 90% of RAM on ergophobie!!! --> > 200Gb --> not of practical use… using a smaller file --> merge pair error or could not allocate memory error..

A robust framework for detecting structural variations in a genome

Seunghak Lee, Elango Cheran and Michael Brudno 2008 Bioinformatics

Comments