You are here: ABI » ThesesHome » ThesisScoreMatrix

Substitution Matrix Generation Algorithms

This Bachelor thesis gets you in touch with current research for string matching, modern implementation in C++/SeqAn, sound Algorithm Engineering practice and data presentation. One aim is to include your code in a future release of the SeqAn library for sequence analysis.

This project can be tackled both as a Bachelor's and as a Master's thesis. Note that the project description is subject to discussion, depending on your interest and focus.

Rationale/Aims

Amino acid substitution matrices are crucial for dynamic programming algorithms on proteins. They are applied in database search programs such as BLAST and multiple sequence alignment programs such as ClustalW or T-Coffee.

The first substitution matrices (PAM) were proposed by Dayhoff in [Dayhoff et al., 1978] and are based on modelling the mutation process as Markov chains, more than 30 years ago. Henikoff and Henikoff proposed their BLOSUM matrices in [Henikoff, Henikoff, 1992], based on larger databases and counting heuristics. Recently, new approaches for computing better matrices have been proposed in [Mueller, Vingron, 2000] and [Mueller et al., 2002] and even more recently the effect of these new matrices on ClustalW has been studied in [Edgar, 2009].

In this thesis, the classic PAM algorithm and the new, better algorithms for computing scoring matrices should be implemented. The experimental evaluation and repetition of the studies from the literature is the second path of the thesis. The scoring matrices should be used in the existing SeqAn::T-Coffe multiple sequence aligner.

Master's Thesis Extension

While the computation of matrices can be guided by (sometimes rigurous) mathematical models, this is harder for the gap scoring costs. In the literature, they are often determined by tuning by hand. The implementation and first evaluation of the matrix generation schemes is the basic and "hard coded" part of the Master's thesis (of course this is still subject to discussion and own thoughts are encouraged).

After this implementation, your work can become more open:

  • Research the literature for approaches of tuning the gap parameters automatically.
  • Something similar is done in black box optimization from the topic of matchine learning, you can focus on this aspect.
  • Matrices can also be generated by inverse alignment. Gap parameters could be deduced from this approach, or an inverse alignment implementation could be part of your thesis.

If you can think of any extension in the estimation of gap parameters or the computation of substitution matrices, you are encouraged to follow this path. Another possible extension could be comparing the impact of different scoring matrices on different alignment programs.

Proposed Schedule

The thesis is planed for 8 weeks. After 2 weeks, you can decide whether you want to complete the thesis or look for another topic.

  • Literature research, getting started with SeqAn. (2 weeks)
  • Select promising candidates for implementation, prioritize the list.
  • Implement and test the selected candidates. (4 weeks)
  • Design the experimental study, collect data, perform study. (1 week)
  • Thesis writeup. (1 week)

There will be regular meetings with your supervisor.

If you are tackling this for a Master's thesis, we will start out with the schedule above and write up the rest after the discussion of the initial aims.

Proposed Requirements

The theoretical work in the thesis should:

  • give a short overview of the application of scoring matrices
  • give an overview of the general ideas in scoring matrix generation
  • describe the implemented algorithms.

The implementation must fulfill the following:

  • The original approach by Dayhoff et al. and the VTML approach must be implemented, probably another approach, such as VT or BLOSUM, too.
  • The implementation should be in C++, using the SeqAn coding standard.
  • The code must be well-tested.

The experimental study should:

  • use real-world data, ideally similar to the data in existing studies
  • follow good scientific standards and best practice, results should be repeatable
  • give a clear evaluation with good presentation of the yielded data (e.g. [Sanders, 2002]).

Optional aims:

  • The code is in such good shape that is can be included in the next release of SeqAn.
  • The scripts supporting the experimental study can be adapted to be used for further benchmarks in SeqAn.

Literature

The following could be a starting point for your own research.

  • [Dayhoff et al., 1978] Dayhoff, M.O., Schwartz, R. and Orcutt, B.C. (1978), "A model of Evolutionary Change in Proteins", Atlas of protein sequence and structure (volume 5, supplement 3 ed.), Nat. Biomed. Res. Found., p. 345–358.
  • [Henikoff, Henikoff, 1992] Henikoff, S.; Henikoff, J.G. (1992). "Amino Acid Substitution Matrices from Protein Blocks". PNAS 89: 10915–10919.
  • [Mueller, Vingron, 2000] Müller and Vingron. Modeling amino acid replacement. Journal of computational biology : a journal of computational molecular cell biology (2000) vol. 7 (6) pp. 761-76
  • [Mueller et al., 2002] Müller, T., Spang, R., Vingron, M. (2002). Estimating amino acid substitution models: a comparison of Dayhoff's Estimator, the resolvent approach and a maximum likelihood method. Mol. Biol. Evol. 19(1), 8-13.
  • [Edgar, 2009] Optimizing substitution matrix choice and gap parameters for sequence alignment. Edgar RC. BMC Bioinformatics. 2009 Dec 2;10:396.
  • [Sanders, 2002] Presenting Data from Experiments in Algorithmics. P. Sanders. In Experimental Algorithmics -- From Algorithm Design to Robust and Efficient Software, volume 2547 of LNCS, pages 181-196. Springer, 2002.
This site is powered by FoswikiCopyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Foswiki? Send feedback