Peptide Indexer using SeqAn and OpenMS

Background

In proteomics, one subtask of peptide identification is to search for peptide sequences in protein databases. This thesis will implement a parallel search using the SeqAn index interface and incorporate it into the OpenMS PeptideIndexer program.

Topic

The goal of this thesis is to implement and evaluate different search strategies for detecting peptide sequences in protein databases. The search should support both exact and approximate matching. The student should implement index-based searches using several index implementations in SeqAn (FM-index, enhanced suffix arrays, lazy suffix trees) and allow for multiple indices in case the protein database is too large for a single index. The approaches should be compared on use cases with varying query and database sizes. The resulting search function should be used in an OpenMS program to first compute the matches, which are then subjected to enzymatic filters using OpenMS functionality.
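To make the idea of an index-based exact search concrete, here is a minimal sketch that uses only the C++ standard library. It is a stand-in for illustration, not the SeqAn index interface: the type `SuffixArrayIndex` and its `find` method are hypothetical names, and the naive suffix sort is far slower than the construction algorithms a real index would use.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical helper type for illustration; not part of SeqAn or OpenMS.
struct SuffixArrayIndex
{
    std::string text;                   // the indexed protein sequence
    std::vector<std::size_t> suffixes;  // suffix start positions in lexicographic order

    explicit SuffixArrayIndex(std::string protein)
        : text(std::move(protein)), suffixes(text.size())
    {
        std::iota(suffixes.begin(), suffixes.end(), std::size_t{0});
        std::string_view view(text);
        // Naive construction: sort suffix start positions by comparing the suffixes.
        std::sort(suffixes.begin(), suffixes.end(),
                  [view](std::size_t a, std::size_t b) { return view.substr(a) < view.substr(b); });
    }

    // Returns all start positions of `peptide` in `text`, in ascending order.
    std::vector<std::size_t> find(std::string_view peptide) const
    {
        std::string_view view(text);
        // Binary search for the block of suffixes that start with `peptide`.
        auto lower = std::lower_bound(
            suffixes.begin(), suffixes.end(), peptide,
            [view](std::size_t pos, std::string_view p) { return view.substr(pos, p.size()) < p; });
        auto upper = std::upper_bound(
            lower, suffixes.end(), peptide,
            [view](std::string_view p, std::size_t pos) { return p < view.substr(pos, p.size()); });

        std::vector<std::size_t> hits(lower, upper);
        std::sort(hits.begin(), hits.end());
        // Debug-only sanity check: every reported hit is a real occurrence.
        for (std::size_t pos : hits)
        {
            assert(view.substr(pos, peptide.size()) == peptide);
        }
        return hits;
    }
};
```

Under these assumptions, a database that is too large for one index could be split into chunks, with one such index built per chunk and each peptide queried against every chunk in turn.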

Details

Two search modes should be available: exact and tolerant.

The tolerant mode might yield more protein hits for peptides that contain ambiguous amino acids. It also allows real sequence mismatches (see 'mismatches_max'), which is useful for finding related proteins that could be the origin of a peptide carrying, for example, a SNP.
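The difference between matching an ambiguous amino acid and tolerating a real mismatch can be sketched with a simple sliding-window scan. This is an illustrative sketch only, not the PeptideIndexer implementation: the function names are hypothetical, only the IUPAC ambiguity codes B (D or N), Z (E or Q), J (I or L) and X (any residue) are handled, and ambiguous matches are assumed not to count against 'mismatches_max'.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// True if the residue codes a and b can denote the same amino acid.
// Handles the IUPAC ambiguity codes B (D/N), Z (E/Q), J (I/L) and X (any).
bool residuesCompatible(char a, char b)
{
    if (a == b || a == 'X' || b == 'X') return true;
    auto covers = [](char ambiguous, char residue) {
        switch (ambiguous)
        {
            case 'B': return residue == 'D' || residue == 'N';
            case 'Z': return residue == 'E' || residue == 'Q';
            case 'J': return residue == 'I' || residue == 'L';
            default:  return false;
        }
    };
    return covers(a, b) || covers(b, a);
}

// Counts positions at which the two equally long sequences are incompatible.
std::size_t countMismatches(std::string_view window, std::string_view peptide)
{
    assert(window.size() == peptide.size());
    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < peptide.size(); ++i)
    {
        if (!residuesCompatible(window[i], peptide[i])) ++mismatches;
    }
    return mismatches;
}

// Returns all start positions in `protein` where `peptide` matches with at
// most `mismatches_max` real mismatches (assumption: ambiguity codes are free).
std::vector<std::size_t> findTolerant(std::string_view protein,
                                      std::string_view peptide,
                                      std::size_t mismatches_max)
{
    std::vector<std::size_t> hits;
    if (peptide.empty() || peptide.size() > protein.size()) return hits;
    for (std::size_t start = 0; start + peptide.size() <= protein.size(); ++start)
    {
        if (countMismatches(protein.substr(start, peptide.size()), peptide) <= mismatches_max)
        {
            hits.push_back(start);
        }
    }
    return hits;
}
```

With the protein `MKDLAAGKNLR`, the peptide `KBL` matches at two positions even with zero mismatches allowed, because B is compatible with both D and N. The peptide `KDL` needs one allowed mismatch to also match the `KNL` site.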

The input is a set of peptides and a set of proteins. The following parameters will be passed within the OpenMS program: