Page BoSSADraft

This page describes the current draft of a Benchmark of Seeds for Sequence Analysis. Sources of inspiration are Rabema, PAPI and Pizza&Chili Corpus.

Approximate String Matching

Purpose

Empirically assess the performance of various seed based filtration criteria into solving off-line k-mismatches and k-differences problems.

Specific Questions to be Answered

  • Which is the fastest lossless filtration approach?
  • Which is the fastest lossy filtration approach?
  • Which is the impact of cache misses, false positives, etc. on total time?
  • Which is the correlation between threshold and speed/precision?
  • How good are our specificity/sensitivity estimates w.r.t. random and real world sources?

Specific Applications

  • Read mapping
  • Assembly, or genome comparison

Approaches

Filtration Algorithms

  • Counting based approaches (Swift, QUASAR?)
  • Non-counting based approaches (i.e. Threshold = 1, e.g. PatternHunter)
  • Non-overlapping based approaches (i.e. Pigeonhole, e.g. MrFast indexing reads and genome, Eland?)
  • Adapted counting (HOBBES)

Verification Algorithms

  • Naive Hamming distance
  • Myers+Ukkonen Edit distance

Corpus

Real World Sources

  • Tiny length (Viruses?)
  • Small length (S. Cerevisiae)
  • Medium length (D. Melanogaster)
  • Large length (H. Sapiens)

Random Sources

They can be generated from PAPI.
  • Uniform text, which parameters?
  • Bernoulli i.i.d. text, which parameters?
  • Markov text, which parameters?

Methods

Queries

  • Mason simulated reads on real world sources (or randomly sampled from the corresponding random sources?)
  • Length m in [20,36,50,100,200,500,1000,2000]
  • Errors rate k/m <= 0.2 for time reasons, theoretical filtration bound is 1-e/sqrt(sigma)

Gold Standards

For lossy methods

Index Construction Statistics

  • Total time (User+System)
  • Memory (Peak)
  • Cache misses, page faults, etc. (Perf to access Intel PMU data])

Query Statistics

  • Total time (User+System)
  • Memory usage
  • Cache misses, page faults, etc. (Use Perf to access Intel PMU data])
  • Number of hits per lookup in query
  • Number of false/true positives
  • Number of false negatives (for lossy filters)
  • Lookup time
  • Counting time
  • Verification time

Construction Estimates

Query Estimates

  • Estimate on the number of hits per lookup in query
  • Estimate on the number of false/true positives (Based on Karp et al.)
  • Estimate on the number of false negatives (Based on Karp et al., Weese et al.)

Local Alignment

Purpose

Empirically assess the performance of various seed based filtration criteria into solving local alignment problems. Here, we have experiences with STELLAR and Birte.

Comments

 
This site is powered by FoswikiCopyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Foswiki? Send feedback