Implementing and evaluating different strategies to assign the set of k-mers in a NGS sample to genomic bins


For metagenomics read mapping it becomes more and more difficult to use a single index for read mapping, due to the size of the data sets. One obvious possibility to circumvent this, is to partition the reference in several bins and index the bins. In order to avoid the overhead of mapping all reads to all indices, one can devise filters in a preprocessing step. Those need to be sufficiently fast and space efficient to achieve a speedup compared to the trivial strategy. One possible filter is based on k-mer counting, which in turn needs a data structure to quickly lookup all the bins in which a given k-mer occurs.


In this BSc thesis various possible implementations for k-mer counting based filters shall be evaluated and the results discussed. As data structures the thesis shall use:
  • direct addressing k-mer index (viable for k<=14,15)
  • open addressing k-mer index (viable for k<= 32) and not too many different k-mers
  • The Dadi-Bloom filter index based on SDSL bitvectors
  • The Dadi-Bloom filter index based on compressed SDSL bitvectors
As data set we will use a metagenomic set of bacterial sequences provided by the RKI.


Preliminary schedule:

  • [Week 1-2] Set up SeqAn and the test environment
  • [Week 3] Read about Bloom filters and k-mer counting filter and adapt the implementations of Dadi for the thesis
  • [Week 4-5] Implement all above versions of data structures
  • [Week 6-8] Run and evaluate different parameter settings, varying the number of bins, k, and the implementation. Report the number of needed verifications.
  • [Week 9-12] Write thesis
Topic revision: r1 - 29 Nov 2017, KnutReinert
  • Printable version of this topic (p) Printable version of this topic (p)