BscSeqAnAlphabetReduction


Introduction

"Research on the functional redundancy of amino acids dates back to the late 70s of the 20th century (Sander and Schulz, 1979). It has mostly been used in structural research, i.e. the analysis and prediction of protein folds and the de-novo design of functional proteins (Regan and DeGrado, 1988). The main purpose of reducing the alphabet today is the reduction of computational complexity, while sacrificing as little sensitivity as possible. Many approaches are specific to certain protein families and derive their grouping of amino acids directly from the biochemical and physical properties of the respective amino acids (Regan and DeGrado, 1988). Other metrics include Miyazawa-Jernigan interactions (Miyazawa and Jernigan, 1996), used by Wang and Wang (1999), and substitution matrices, like Blosum (Henikoff and Henikoff, 1992), used by Murphy et al. (2000) and Li et al. (2003).

The latter approaches are especially useful for sequence alignment, because substitution matrices are also fundamental parts of scoring algorithms in most sequence alignment applications and the impact of reductions that are based on the same metric as the target function is intuitively clear. Beside the method of reduction, integral parameters are the size of the original alphabet and the desired output size of the target alphabet, i.e. the number of clusters that remain after reduction. All of the aforementioned methods begin the reduction on the canonical 20-letter amino acid alphabet that includes all proteinogenic amino acids, without the rare amino acids Selenocysteine (U) and Pyrrolysine (O), and that includes neither a character for the STOP-codon nor any of the wildcard characters frequently encountered (X for “any amino acid”; B for “N or D”; Z for “Q or E”). Depending on the method the target size may be fixed or variable, some research indicating that sizes as low as 5 are sufficient (Bacardit et al., 2009), most suggesting that 10-12 letters are required and/or most effective (Li et al., 2003; Murphy et al., 2000; Ye et al., 2011)."

-- from Hannes Hauswedell's master thesis: http://www.mi.fu-berlin.de/en/inf/groups/abi/theses/master_dipl/hauswedell/msc_thesis_hauswedell.pdf (pp. 14-15)
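
To make the idea of a reduction concrete, the following minimal sketch maps each of the 20 canonical amino acids to one representative letter per cluster. The 10-cluster grouping shown is the one commonly attributed to Murphy et al. (2000); it is reproduced here as an illustration only and should be checked against the paper before use. The names `MURPHY10_CLUSTERS`, `build_table` and `reduce_sequence` are hypothetical and are not part of SeqAn or Lambda; Python is used only for brevity.

```python
# Illustrative 10-letter grouping (assumption: as commonly attributed to
# Murphy et al., 2000). The first letter of each cluster is its representative.
MURPHY10_CLUSTERS = ["A", "C", "G", "H", "P", "LVIM", "ST", "FYW", "EDNQ", "KR"]


def build_table(clusters):
    """Map every member of a cluster to the cluster's first letter."""
    table = {}
    for cluster in clusters:
        for residue in cluster:
            table[residue] = cluster[0]
    return table


def reduce_sequence(sequence, table):
    """Rewrite a protein sequence over the reduced alphabet.

    Characters outside the table (e.g. X, B, Z, *) are passed through
    unchanged, because the reduction does not define them.
    """
    return "".join(table.get(residue, residue) for residue in sequence.upper())


table = build_table(MURPHY10_CLUSTERS)

# Two sequences that differ only by within-cluster substitutions
# collapse to the same string over the reduced alphabet.
print(reduce_sequence("MKVLDE", table))
print(reduce_sequence("LRIMEQ", table))
```

Both calls yield the same reduced string, which is the effect a reduced-alphabet index relies on: residues that are considered interchangeable no longer count as mismatches.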

Tasks

* study of literature: what are the alphabet reductions used historically? What methods were used for clustering? What are recent publications in the field? Which reductions are used by current protein alignment programs, e.g. Lambda, Diamond, MMSeqs2, Malt, Rapsearch2, Paladin?

* implementation: select an interesting sub-set of reductions and implement them in SeqAn. Write test and conversion functions...

* implementation: add support for the implemented alphabets to the Lambda application; possibly also add support to another application.

* evaluation: study the effect of different reductions on the performance and sensitivity of Lambda (and possibly another application). What can be said of the different reductions? What influence does the size of the reduced alphabet have? Can you recommend that Lambda choose a different alphabet in the future?
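
As a starting point for the test functions and the comparison of alphabet sizes mentioned above, the sketch below checks that a candidate reduction is a proper partition of the canonical 20-letter alphabet and reports the size of the resulting k-mer space, which is one simple, indirect indicator of the computational savings. Everything here is an assumption made for illustration: the function names `validate_reduction` and `kmer_space`, the example grouping, and the choice of k = 5 are hypothetical and are not taken from SeqAn or Lambda.

```python
CANONICAL = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical amino acids


def validate_reduction(clusters):
    """Raise ValueError unless the clusters partition the canonical alphabet."""
    seen = "".join(clusters)
    duplicates = {r for r in seen if seen.count(r) > 1}
    missing = set(CANONICAL) - set(seen)
    unknown = set(seen) - set(CANONICAL)
    if duplicates or missing or unknown:
        raise ValueError(
            f"duplicates={sorted(duplicates)} missing={sorted(missing)} "
            f"unknown={sorted(unknown)}"
        )


def kmer_space(alphabet_size, k):
    """Number of distinct k-mers over an alphabet of the given size."""
    return alphabet_size ** k


# Hypothetical candidate reductions to compare (illustrative groupings only).
candidates = {
    "full20": list(CANONICAL),
    "example10": ["A", "C", "G", "H", "P", "LVIM", "ST", "FYW", "EDNQ", "KR"],
}

for name, clusters in candidates.items():
    validate_reduction(clusters)
    # Print the name, the number of clusters, and the 5-mer space size.
    print(name, len(clusters), kmer_space(len(clusters), 5))
```

A check of this kind catches groupings that accidentally drop or duplicate a residue before they reach an index or an alignment run; the actual evaluation of sensitivity and running time still has to be done with the target application.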

Comments

 
Topic revision: r1 - 09 Mar 2017, HannesHauswedell
 