Implementing and evaluating different strategies to assign the set of k-mers in an NGS sample to genomic bins

Background

In metagenomic read mapping, it is becoming increasingly difficult to use a single index because of the size of the data sets. One obvious way around this is to partition the reference into several bins and to index each bin separately. To avoid the overhead of mapping all reads against all indices, one can apply filters in a preprocessing step. These need to be fast and space-efficient enough to achieve a speedup over the trivial strategy. One possible filter is based on k-mer counting, which in turn requires a data structure that can quickly look up all bins in which a given k-mer occurs.
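The idea of a k-mer counting filter can be illustrated with a minimal sketch. It assumes a plain in-memory dictionary as the lookup structure and a fixed count threshold per read; the function names (`kmers`, `build_index`, `candidate_bins`), the toy sequences, and the threshold are hypothetical and chosen only for illustration. They are not the data structures to be evaluated in the thesis.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield all overlapping k-mers of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(bins, k):
    """Map each k-mer to the set of bin ids whose sequences contain it.

    bins is a list of bins; each bin is a list of reference sequences.
    """
    index = defaultdict(set)
    for bin_id, sequences in enumerate(bins):
        for seq in sequences:
            for kmer in kmers(seq, k):
                index[kmer].add(bin_id)
    return index


def candidate_bins(read, index, k, num_bins, threshold):
    """Return the bins that share at least `threshold` k-mers with the read."""
    counts = [0] * num_bins
    for kmer in kmers(read, k):
        # .get avoids inserting unseen k-mers into the defaultdict
        for bin_id in index.get(kmer, ()):
            counts[bin_id] += 1
    return [b for b, c in enumerate(counts) if c >= threshold]


# Toy example with two bins (assumed data, not from the RKI set)
bins = [["ACGTACGTAA"], ["TTTTGGGGCC"]]
index = build_index(bins, k=4)
print(candidate_bins("ACGTACG", index, k=4, num_bins=2, threshold=3))
```

Only the bins returned by `candidate_bins` would then be passed to the read mapper. A dictionary of sets is easy to follow but not space-efficient, which is the reason more compact lookup structures are of interest here.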

Topic

In this BSc thesis, various possible implementations of filters based on k-mer counting shall be evaluated and the results discussed. As data structures, the thesis shall use:

As the data set, we will use a metagenomic set of bacterial sequences provided by the RKI.

Schedule

Preliminary schedule: