MSc Thesis: Faster HMM Learning with Indexing Structures
Motivation
Hidden Markov Models (HMMs) are among the most prominent methods in Bioinformatics. They are routinely applied to DNA sequences, for example for CpG-island or coding sequence detection, and to protein sequences for the important problem of protein domain identification among others. The statistical flexible foundation of HMMs make it the primary choice for many tasks.
Given the exponentially increasing amount of sequencing capacity by next and third generation sequencing approaches, their is a high demand for efficient HMM learning algorithms from sequence data.
SeqAn offers fast implementation of indexing algorithms, like suffix trees and suffix arrays, which combined with the HMM framework, can satisfy the above mentioned needs.
Overview
This thesis has the following parts.
- Implementation of an HMM learning algorithm based on indexing structures.
- Comparison with standard learning algorithms which are implemented in SeqAn to a large extend already.
Part (1) should be based on (Lifshits et al. 2009). The following plan consists of 6 months of work.
Work Plan
- week 1-2: Understanding the paper by Lifshits et al. 2009 and familiarization with HMMs
- week 3-4: Exploration of the current HMM implementation in SeqAn
- month2-3: Implementation of the new algorithm in SeqAn
- month4 : Performance Experiments on standard application with HMMs and comparison to standard learning algorithms in SeqAn.
- month5-6: Cleaning code, Documentation and Interpretation of results and writing the Master thesis.