This page describes the current draft of a Benchmark of Seeds for Sequence Analysis.
Sources of inspiration are
Rabema,
PAPI and
Pizza&Chili Corpus.
Approximate String Matching
Purpose
Empirically assess the performance of various seed based filtration criteria into solving off-line k-mismatches and k-differences problems.
Specific Questions to be Answered
- Which is the fastest lossless filtration approach?
- Which is the fastest lossy filtration approach?
- Which is the impact of cache misses, false positives, etc. on total time?
- Which is the correlation between threshold and speed/precision?
- How good are our specificity/sensitivity estimates w.r.t. random and real world sources?
Specific Applications
- Read mapping
- Assembly, or genome comparison
Approaches
Filtration Algorithms
- Counting based approaches (Swift, QUASAR?)
- Non-counting based approaches (i.e. Threshold = 1, e.g. PatternHunter)
- Non-overlapping based approaches (i.e. Pigeonhole, e.g. MrFast indexing reads and genome, Eland?)
- Adapted counting (HOBBES)
Verification Algorithms
- Naive Hamming distance
- Myers+Ukkonen Edit distance
Corpus
Real World Sources
- Tiny length (Viruses?)
- Small length (S. Cerevisiae)
- Medium length (D. Melanogaster)
- Large length (H. Sapiens)
Random Sources
They can be generated from
PAPI.
- Uniform text, which parameters?
- Bernoulli i.i.d. text, which parameters?
- Markov text, which parameters?
Methods
Queries
- Mason simulated reads on real world sources (or randomly sampled from the corresponding random sources?)
- Length m in [20,36,50,100,200,500,1000,2000]
- Errors rate k/m <= 0.2 for time reasons, theoretical filtration bound is 1-e/sqrt(sigma)
Gold Standards
For lossy methods
Index Construction Statistics
- Total time (User+System)
- Memory (Peak)
- Cache misses, page faults, etc. (Perf to access Intel PMU data])
Query Statistics
- Total time (User+System)
- Memory usage
- Cache misses, page faults, etc. (Use Perf to access Intel PMU data])
- Number of hits per lookup in query
- Number of false/true positives
- Number of false negatives (for lossy filters)
- Lookup time
- Counting time
- Verification time
Construction Estimates
Query Estimates
- Estimate on the number of hits per lookup in query
- Estimate on the number of false/true positives (Based on Karp et al.)
- Estimate on the number of false negatives (Based on Karp et al., Weese et al.)
Local Alignment
Purpose
Empirically assess the performance of various seed based filtration criteria into solving local alignment problems.
Here, we have experiences with STELLAR and Birte.