You are here: ABI » ResearchLogEnricoSiragusa » BoSSADraft

Page BoSSADraft

09 Aug 2023 - 17:03 | Version 14 | UnknownUser

This page describes the current draft of a Benchmark of Seeds for Sequence Analysis. Sources of inspiration are Rabema, PAPI and Pizza&Chili Corpus.

Approximate String Matching

Purpose

Empirically assess the performance of various seed based filtration criteria into solving off-line k-mismatches and k-differences problems.

Specific Questions to be Answered

Which is the fastest lossless filtration approach?
Which is the fastest lossy filtration approach?
Which is the impact of cache misses, false positives, etc. on total time?
Which is the correlation between threshold and speed/precision?
How good are our specificity/sensitivity estimates w.r.t. random and real world sources?

Specific Applications

Read mapping
Assembly, or genome comparison

Approaches

Filtration Algorithms

Counting based approaches (Swift, QUASAR?)
- Maximum weight ungapped seed + optimum lossless threshold (e.g. razerS)
- Maximum weight gapped seed + optimum lossless threshold (From Burhkardt et al., Sahinalp et al.)
- Disjunctive seeds family + threshold value (From Kucherov et al., Sahinalp et al.)
- Conjunctive seeds family + threshold vector (From Fontaine et al., adapted Sahinalp et al.)
Non-counting based approaches (i.e. Threshold = 1, e.g. PatternHunter)
Non-overlapping based approaches (i.e. Pigeonhole, e.g. MrFast indexing reads and genome, Eland?)
Adapted counting (HOBBES)

Verification Algorithms

Naive Hamming distance
Myers+Ukkonen Edit distance

Corpus

Real World Sources

Tiny length (Viruses?)
Small length (S. Cerevisiae)
Medium length (D. Melanogaster)
Large length (H. Sapiens)

Random Sources

They can be generated from PAPI.

Uniform text, which parameters?
Bernoulli i.i.d. text, which parameters?
Markov text, which parameters?

Methods

Queries

Mason simulated reads on real world sources (or randomly sampled from the corresponding random sources?)
Length m in [20,36,50,100,200,500,1000,2000]
Errors rate k/m <= 0.2 for time reasons, theoretical filtration bound is 1-e/sqrt(sigma)

Gold Standards

For lossy methods

Produce a gold standard for each dataset, as in Holtgrewe et al.

Index Construction Statistics

Total time (User+System)
Memory (Peak)
Cache misses, page faults, etc. (Perf to access Intel PMU data])

Query Statistics

Total time (User+System)
Memory usage
Cache misses, page faults, etc. (Use Perf to access Intel PMU data])
Number of hits per lookup in query
Number of false/true positives
Number of false negatives (for lossy filters)
Lookup time
Counting time
Verification time

Construction Estimates

Query Estimates

Estimate on the number of hits per lookup in query
Estimate on the number of false/true positives (Based on Karp et al.)
Estimate on the number of false negatives (Based on Karp et al., Weese et al.)

Local Alignment

Purpose

Empirically assess the performance of various seed based filtration criteria into solving local alignment problems. Here, we have experiences with STELLAR and Birte.

Comments

Sitemap

GeschichteDesComputers

Copyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Foswiki? Send feedback

Page BoSSADraft

Approximate String Matching

Purpose

Specific Questions to be Answered

Specific Applications

Approaches

Filtration Algorithms

Verification Algorithms

Corpus

Real World Sources

Random Sources

Methods

Queries

Gold Standards

Index Construction Statistics

Query Statistics

Construction Estimates

Query Estimates

Local Alignment

Purpose

Comments

Attachments ($count)

Sitemap