PSMB_Seqan_2015_p4

Introduction

Trypsin is a serine protease found in the digestive system of many vertebrates, where it hydrolyses proteins. It cleaves peptide chains at the carboxyl side of the amino acids lysine (K) or arginine (R), except when either is followed by proline (P). For example, given the following peptide:
AAAAKBBBBRCCCCKPDDDDKE
trypsin would cleave it whenever it encounters either K or R not followed by P, and thus produce the following factorization in tryptic peptides:
AAAAK
BBBBR
CCCCKPDDDDK
E

Trypsin sometimes misses cleavage sites. For this reason, we are interested into all tryptic peptides up to k miscleavages. For instance, on the above peptide, trypsin would produce the following tryptic peptides up to one miscleavage (k=1):
AAAAK
AAAAKBBBBR
BBBBR
BBBBRCCCCKPDDDDK
CCCCKPDDDDK
CCCCKPDDDDKE
E

Moreover, given a protein database, we are interested into enumerating each tryptic peptide only once, that is uniquely.

Methods

The key idea to enumerate tryptic peptides uniquely across the database is to use a dictionary data structure.

Naive Method

A naive method would consist into scanning the protein database and computing its factorization in tryptic peptides. Each factor would have to be looked up in a dictionary, e.g. an unordered set, if absent then inserted in the dictionary and counted as unique.

It is easy to see that this method is optimal for k=0 but is not so appealing for k>0. Indeed, we would have to store in the dictionary all possible tryptic peptides up to k miscleavages. Thus, the space consumption of the dictionary would depend on k. For this reason, below we propose a slightly more involved method that requires a dictionary of size linear in the protein database.

Radix Tree Method

For this method, the dictionary must be a radix tree. First, we scan the protein database and construct a radix tree representing all suffixes starting at some tryptic peptide. Subsequently, we perform a top-down traversal of the radix tree to find the end of any tryptic peptide from 0 up to k miscleavages.

The radix tree can be easily implemented as a radix array, i.e. a partial suffix array that contains only those suffixes starting at some tryptic peptide. Once filled, the radix array can be sorted using quicksort. The top-down traversal is better described as a recursive algorithm. A first draft of the pseudo-code of the traversal algorithm is left to the student.

Implementation

Write a program implementing the two above methods to uniquely enumerate all tryptic peptides up to k miscleavages.

Input

  • Protein database file in FastA format
  • Integer k bounding the trypsin miscleavages

Output

  • Number of unique tryptic peptides in the database (to the standard output)
  • List of all unique tryptic peptides (to a file provided as optional argument)

Steps

  • Read the database file
  • Enumerate all tryptic peptides
  • Write the list of tryptic peptides to the provided file

Guidelines

  • Properly design the software
  • Document the source code
  • Unit test the software

Example

Given the example file test.fa:

$ cat test.fa
>test1
AAAAKBBBBRCCCCKPDDDDKE
>test2
AKEE

the program invoked as:

$ trypsin_enumerator test.fa 3 -o factors.txt
12

for k=3, outputs to standard output the number of tryptic peptides (12) and writes the list to the file factors.txt:

$ cat factors.txt
AAAAK
AAAAKBBBBR
AAAAKBBBBRCCCCKPDDDDK
AAAAKBBBBRCCCCKPDDDDKE
AK
AKEE
BBBBR
BBBBRCCCCKPDDDDK
BBBBRCCCCKPDDDDKE
CCCCKPDDDDK
CCCCKPDDDDKE
EE

Evaluation

Evaluate both methods on the complete Swiss-Prot database. Consider values of k ranging from 0 to 10. Measure runtime and memory consumption, in addition to the number of unique tryptic peptides.

SeqAn Modules

The following SeqAn components are relevant for the project:

This site is powered by FoswikiCopyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Foswiki? Send feedback