Peptide Indexer using SeqAn and OpenMS

09 Aug 2023 - 17:03 | Version 2 | UnknownUser

Background

In proteomics, one subtask of peptide identification is to search for peptide sequences in protein databases. This thesis will implement a parallel search using the SeqAn index interface and incorporate it into the OpenMS PeptideIndexer program.

Topic

The goal of this thesis is to implement and evaluate different search strategies for detecting peptide sequences in protein databases. The search should support both exact and approximate matching. The student should implement index-based searches using several index implementations in SeqAn (FM-index, enhanced suffix arrays, lazy suffix trees) and allow for multiple indices in case the protein database is too large for a single index. The approaches should be compared on use cases with varying query and database sizes. The resulting search function should be used in an OpenMS program to first compute the matches, which are then subjected to enzymatic filters using OpenMS functionality.
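The idea of an index-based exact search can be illustrated with a naive suffix array built from the C++ standard library alone. This is a minimal sketch and not SeqAn's index interface: `ProteinIndex`, `buildIndex`, and `findExact` are hypothetical names, the `'$'` separator between proteins is an assumption, and the simple sort-based construction is only suitable for toy inputs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical stand-in for an index over a protein database (not the SeqAn API).
struct ProteinIndex {
    std::string text;                 // all proteins joined, each followed by '$'
    std::vector<std::size_t> starts;  // start offset of each protein in `text`
    std::vector<std::size_t> sa;      // suffix start offsets in lexicographic order
};

ProteinIndex buildIndex(const std::vector<std::string>& proteins) {
    ProteinIndex idx;
    for (const auto& p : proteins) {
        // Assumption: '$' never occurs inside a protein sequence.
        assert(p.find('$') == std::string::npos);
        idx.starts.push_back(idx.text.size());
        idx.text += p;
        idx.text += '$';  // separator, so a match cannot span two proteins
    }
    idx.sa.resize(idx.text.size());
    std::iota(idx.sa.begin(), idx.sa.end(), std::size_t{0});
    std::string_view t(idx.text);
    // Naive construction by comparing whole suffixes; fine for a sketch only.
    std::sort(idx.sa.begin(), idx.sa.end(),
              [t](std::size_t a, std::size_t b) { return t.substr(a) < t.substr(b); });
    return idx;
}

// Returns (protein number, offset within that protein) for every exact occurrence.
std::vector<std::pair<std::size_t, std::size_t>>
findExact(const ProteinIndex& idx, std::string_view peptide) {
    std::vector<std::pair<std::size_t, std::size_t>> hits;
    if (peptide.empty()) return hits;
    std::string_view t(idx.text);
    const std::size_t m = peptide.size();
    // Binary search on the first m characters of each suffix.
    auto lo = std::lower_bound(idx.sa.begin(), idx.sa.end(), peptide,
        [t, m](std::size_t s, std::string_view q) { return t.substr(s, m) < q; });
    auto hi = std::upper_bound(lo, idx.sa.end(), peptide,
        [t, m](std::string_view q, std::size_t s) { return q < t.substr(s, m); });
    for (auto it = lo; it != hi; ++it) {
        // Last protein whose start offset is <= the match position.
        auto p = std::upper_bound(idx.starts.begin(), idx.starts.end(), *it) - 1;
        hits.emplace_back(static_cast<std::size_t>(p - idx.starts.begin()), *it - *p);
    }
    std::sort(hits.begin(), hits.end());
    return hits;
}
```

For a database that is too large for one index, one option under the same assumptions would be to build one `ProteinIndex` per chunk of proteins and merge the hits from all chunks.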

Details

Two search modes should be available:

exact: [default mode] Peptide sequences require an exact match in the protein database. If at least one protein hit is found, no tolerant search is performed for this peptide. If no protein can be found for this peptide, tolerant matching is used automatically.

tolerant: Allow ambiguous amino acids in the protein sequence, e.g., 'M' in the peptide will match 'X' in the protein.

This mode might yield more protein hits for some peptides (those that match protein regions containing ambiguous amino acids). Tolerant search also allows for real sequence mismatches (see 'mismatches_max'), in case you want to find related proteins that might be the origin of a peptide if, for example, it had a SNP.
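The interplay of the two modes (exact first, tolerant only as a fallback) can be sketched as follows. This is an illustration under stated assumptions, not the PeptideIndexer implementation: `Hit`, `WindowMatcher`, `searchPeptide`, and `matchWithX` are hypothetical names, the exact pass uses a plain `std::string::find` instead of an index, and the example matcher only handles the 'X' wildcard.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// One occurrence of a peptide in a protein (hypothetical type).
struct Hit {
    std::size_t protein;  // position of the protein in the input list
    std::size_t offset;   // start of the match within that protein
    bool tolerant;        // true if the hit came from the tolerant pass
};

// Decides whether `peptide` matches `protein` at `offset` under tolerant rules.
using WindowMatcher = std::function<bool(const std::string& peptide,
                                         const std::string& protein,
                                         std::size_t offset)>;

std::vector<Hit> searchPeptide(const std::string& peptide,
                               const std::vector<std::string>& proteins,
                               const WindowMatcher& tolerantMatch) {
    assert(tolerantMatch);  // a matcher must be supplied
    std::vector<Hit> hits;
    if (peptide.empty()) return hits;

    // Pass 1: exact matching over all proteins.
    for (std::size_t i = 0; i < proteins.size(); ++i) {
        for (std::size_t pos = proteins[i].find(peptide); pos != std::string::npos;
             pos = proteins[i].find(peptide, pos + 1)) {
            hits.push_back({i, pos, false});
        }
    }
    if (!hits.empty()) return hits;  // exact hit found: skip the tolerant pass

    // Pass 2: tolerant matching, only reached when there was no exact hit.
    for (std::size_t i = 0; i < proteins.size(); ++i) {
        const std::string& prot = proteins[i];
        for (std::size_t off = 0; off + peptide.size() <= prot.size(); ++off) {
            if (tolerantMatch(peptide, prot, off)) hits.push_back({i, off, true});
        }
    }
    return hits;
}

// Example matcher (assumption): only 'X' in the protein is treated as a wildcard.
bool matchWithX(const std::string& peptide, const std::string& protein,
                std::size_t offset) {
    for (std::size_t k = 0; k < peptide.size(); ++k) {
        const char b = protein[offset + k];
        if (b != peptide[k] && b != 'X') return false;
    }
    return true;
}
```

With the proteins "AAMKAA" and "AAXKAA", the peptide "MK" is found exactly in the first protein, so the 'X'-containing second protein is never reported for it; the peptide "QK" has no exact hit and is matched tolerantly against "XK" in the second protein.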

Input is a set of peptides and a set of proteins. The following parameters will be passed within the OpenMS program:

aaa_max [tolerant search only] Maximal number of ambiguous amino acids (AAAs) allowed when matching to a protein database with AAAs. AAAs are 'B', 'Z', 'J' and 'X' (B = D or N, Z = E or Q, J = I or L, X matches all). (default: '4', min: '0')
mismatches_max [tolerant search only] Maximal number of real mismatches (applied after checking for ambiguous amino acids, see the 'aaa_max' option). In general, this parameter should only be changed if you want to look for other potential origins of a peptide, which might have unknown SNPs or the like. (default: '0', min: '0')
IL_equivalent Treat the isobaric amino acids isoleucine ('I') and leucine ('L') as equivalent (indistinguishable); otherwise an I/L difference counts as a mismatch.
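How the three parameters might interact when comparing a peptide against one protein window can be sketched as below. The names `TolerantParams`, `covers`, and `windowMatches` are hypothetical, and two points are assumptions that the description above does not settle: `il_equivalent` defaults to off here, and ambiguous residues and real mismatches are counted in separate budgets, with the window rejected as soon as either budget is exceeded.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical parameter bundle mirroring aaa_max, mismatches_max, IL_equivalent.
struct TolerantParams {
    std::size_t aaa_max = 4;         // default taken from the parameter list
    std::size_t mismatches_max = 0;  // default taken from the parameter list
    bool il_equivalent = false;      // assumption: off unless requested
};

// True if the ambiguous protein residue `amb` can stand for peptide residue `aa`.
bool covers(char amb, char aa) {
    switch (amb) {
        case 'B': return aa == 'D' || aa == 'N';
        case 'Z': return aa == 'E' || aa == 'Q';
        case 'J': return aa == 'I' || aa == 'L';
        case 'X': return true;  // X matches all
        default:  return false;
    }
}

// Compares `peptide` with the protein window starting at `offset`.
bool windowMatches(const std::string& peptide, const std::string& protein,
                   std::size_t offset, const TolerantParams& p) {
    // Precondition: the window lies completely inside the protein.
    assert(offset <= protein.size() && peptide.size() <= protein.size() - offset);
    std::size_t aaa = 0;
    std::size_t mismatches = 0;
    for (std::size_t k = 0; k < peptide.size(); ++k) {
        const char a = peptide[k];
        const char b = protein[offset + k];
        if (a == b) continue;  // identical residues cost nothing
        const bool bothIL = (a == 'I' || a == 'L') && (b == 'I' || b == 'L');
        if (p.il_equivalent && bothIL) continue;  // I and L treated as the same
        if (covers(b, a)) {
            if (++aaa > p.aaa_max) return false;  // AAA budget used up
        } else {
            if (++mismatches > p.mismatches_max) return false;  // real mismatch
        }
    }
    return true;
}
```

Under these assumptions, the peptide "DIK" matches the protein window "BLK" only if the I/L difference is tolerated, either through `il_equivalent` or through a `mismatches_max` of at least 1; the 'B' against 'D' consumes one unit of the `aaa_max` budget in both cases.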