This is the progress report of Enrico Siragusa for Spring 2011.
Accomplishments up to Fall 2010
Literature
Seeds
The literature on seeds is vaster than I expected; a complete list of papers can be found here.
The most important problems and approaches are summarized in my manuscript.
I also read about related topics, namely Boolean functions and approximation algorithms.
Research
Modeling
- Formal model, based on Boolean functions, of simple seeds and seed families.
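To make the Boolean view concrete, here is a minimal sketch under assumptions that are mine and not taken from the manuscript: an alignment is a 0/1 string (1 = match, 0 = mismatch), a seed is a 0/1 pattern (1 = required match, 0 = don't care), and the hypothetical helpers `seed_hits` and `family_hits` evaluate the resulting monotone DNF (an OR over offsets of an AND over the required positions).

```python
from typing import Sequence


def seed_hits(seed: str, alignment: str) -> bool:
    """Return True if the seed hits the alignment at some offset.

    seed:      string over {'1', '0'}; '1' = required match, '0' = don't care.
    alignment: string over {'1', '0'}; '1' = match, '0' = mismatch.
    """
    required = [i for i, c in enumerate(seed) if c == "1"]
    span = len(seed)
    # OR over offsets of an AND over the required positions (monotone DNF).
    return any(
        all(alignment[offset + i] == "1" for i in required)
        for offset in range(len(alignment) - span + 1)
    )


def family_hits(seeds: Sequence[str], alignment: str) -> bool:
    """A seed family hits if at least one of its seeds hits."""
    return any(seed_hits(s, alignment) for s in seeds)
```

For example, the spaced seed `101` hits the alignment `11011` at offset 1, while the contiguous seed `111` does not hit it at all.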
Approximate String Matching Framework
- APX-ratio for the minimum non-detected error.
- APX-ratio for the complementary threshold.
- FPRAS for seed sensitivity/specificity values.
- Heuristic (APX?) for the optimal seed BDD construction.
DNA Homology Framework
- FPRAS for the Hit-Probability/Expectation.
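As a rough illustration of what hit-probability estimation computes, the sketch below uses plain Monte Carlo sampling. It is not the FPRAS listed above: it only gives an additive-error estimate, and the i.i.d. Bernoulli match model, the function name `estimate_hit_probability`, and its parameters are all assumptions made for this example.

```python
import random


def estimate_hit_probability(seed, length, match_prob, samples=10000, rng=None):
    """Estimate the probability that `seed` hits a random alignment.

    Assumed model: each of the `length` alignment positions is a match
    independently with probability `match_prob`.
    seed: string over {'1', '0'}; '1' = required match, '0' = don't care.
    """
    rng = rng or random.Random(0)  # fixed default seed for reproducibility
    required = [i for i, c in enumerate(seed) if c == "1"]
    span = len(seed)
    hits = 0
    for _ in range(samples):
        # Draw one random alignment as a list of booleans (True = match).
        aln = [rng.random() < match_prob for _ in range(length)]
        # The seed hits if all required positions match at some offset.
        if any(
            all(aln[offset + i] for i in required)
            for offset in range(length - span + 1)
        ):
            hits += 1
    return hits / samples
```

By Hoeffding's inequality, about ln(2/δ) / (2ε²) samples suffice for an additive error of ε with probability at least 1 − δ; a relative-error guarantee, as an FPRAS provides, needs more care when the hit probability is very small.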
Goals for the Spring 2011
Literature
Reading
Writing
- Complete my manuscript.
- Survey on seeds in sequence analysis? There is already a survey on seeds here, but it covers only homology search.
Research
- Can we extend this formal framework to edit distance, indel seeds, and subset seeds?
- Can we improve the logic and engineering of indexing (cache-oblivious), filtering (AND), and verification?
- Can efficient linear/non-linear programs be formulated for seed design?
- Can submodularity and monotonicity be used somehow for seed design?
- Can we construct explicitly quasi-optimal classes of seeds?
Development
- Benchmark for Approximate String Matching and DNA Homology Frameworks.
- Sensitivity/Specificity, Hit-Probability/Expectation estimation via FPRAS.
- DNA Homology Search using Hit-Expectation.
- ILP for Exact Optimum Threshold Computation.
- Heuristic Seed Design?