Detection of homologous regions with FFT

Given a set of sequences your task is to compute homologous segments for all pairs of the input set. The detection of homologous regions is the initial step in the MAFFT algorithm -- a multiple sequence alignment (MSA) algorithm with high accuracy [2]. Among other tasks, multi-alignments help to detect functional residues, primer sites, or to infer the evolutionary history.

Task

You can start with two sequences as input and use an existing FFT implementation. A nice template-recursive version is described on Dr. Dobbs page. If you choose Python as a programming language and its FFT library, and not C++, it is expected that you also implement the alignment step. In this case the expected output is an alignment of two sequences. In the latter case, you only output the pairs of homologuous regions between two sequences. You finally extend the approach to multiple (>2) sequences.

Background

MAFFT has four stages: pairwise homologous computation, guide tree construction, alignment, and a final re-alignment. Homologous regions are detected with the fast Fourier transform (FFT), which operates on numerical input. The sequences are therefore converted to sequences composed of volume and polarity values of each amino acid residue [1]. The reason is that substitutions between physically and chemically similar amino acids tend to preserve the protein structure.

Outlook

You can continue the project as a Bachelor's thesis. Two tracks are possible: implementation of the complete algorithm or the parallelization of the first stage.

  • Track 1 "Multiple sequence alignment with FFT in SeqAn": You complete the MSA algorithm by implementing the two succeeding stages "guide tree construction" and "alignment" with the SeqAn2 library. SeqAn2 provides functionalities for file i/o and other common sequence analysis tasks. Optionally, you can implement the final "re-alignment" step. The algorithm is described here: [1].
  • Track 2 "Parallelization of Homologous Region Detection with FFT": Apply parallelization techniques to the pairwise homologous region computation using either SIMD vectorization and multithreading (e.g. pthreads or OpenMP) or GPU parallelization (CUDA, RocM, or MIOpen). You would need a computer with a multicore processor supporting either SIMD data types up to 256 bits (most Intel chips do, e.g. check Intel's AVX SIMD) or a graphic card supporting CUDA, RocM or MIOpen. [3] describes an implementation of the complete MAFFT algorithm with CUDA.

References

[1] Katoh et al.: "MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform", https://www.ncbi.nlm.nih.gov/pubmed/12136088
[2] Pais et al.: "Assessing the efficiency of multiple sequence alignment programs", https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4015676/pdf/1748-7188-9-4.pdf
[3] Zhu et al.: "Parallel Implementation of MAFFT on CUDA-Enabled Graphics Hardware", https://www.researchgate.net/publication/272382888_Parallel_implementation_of_MAFFT_on_CUDA-enabled_graphics_hardware
This site is powered by FoswikiCopyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Foswiki? Send feedback