BscRegionFilter

Possible Project for a BSc Thesis in Bioinformatics or Computer Science

Introduction

"Many nucleotide and amino acid sequences are highly repetitive in nature. If your query sequence contains regions of low complexity or repeats, you can end up with many non-related, high scoring sequences being found during BLAST (or FASTA) searches (e.g. hits against proline-rich regions or poly-A tails). In other cases, your sequence may contain regions of vector sequence, or repeat regions such as Alu sequences, that you either do not want included in your sequence, or at the very least, wish to have discluded in any searches you carry out based on sequence similarity." [1]

Two project variants are possible, depending on the student's interests, skills, and time frame.

Implement multiple masking algorithms and compare (option 1)

The goal of this thesis is to reimplement the well-known filtering algorithm SEG (protein sequences) [2] in a stand-alone SeqAn application and to compare the result against the original implementation.
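
To illustrate the kind of computation involved, here is a minimal C++ sketch of window-based low-complexity detection. It is not SEG itself: it covers only a simplified first stage, flagging windows whose Shannon entropy falls at or below a threshold, and it omits SEG's extension and optimization steps. The function names are hypothetical, and the defaults (window length 12, threshold 2.2 bits) are assumptions taken from commonly cited SEG settings.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Shannon entropy (in bits) of the residue composition of seq[begin, begin + len).
double window_entropy(std::string const & seq, std::size_t begin, std::size_t len)
{
    assert(len > 0 && begin + len <= seq.size());

    std::array<std::size_t, 256> counts{};
    for (std::size_t i = begin; i < begin + len; ++i)
        ++counts[static_cast<unsigned char>(seq[i])];

    double entropy = 0.0;
    for (std::size_t c : counts)
    {
        if (c == 0)
            continue;
        double const p = static_cast<double>(c) / static_cast<double>(len);
        entropy -= p * std::log2(p);
    }
    return entropy;
}

// Hypothetical helper (not part of SeqAn): flags every position that is
// covered by at least one window with entropy <= threshold.
// Sequences shorter than the window are left completely unmasked.
std::vector<bool> low_complexity_mask(std::string const & seq,
                                      std::size_t window = 12,
                                      double threshold = 2.2)
{
    std::vector<bool> mask(seq.size(), false);
    if (window == 0 || seq.size() < window)
        return mask;

    for (std::size_t begin = 0; begin + window <= seq.size(); ++begin)
    {
        if (window_entropy(seq, begin, window) <= threshold)
        {
            for (std::size_t i = begin; i < begin + window; ++i)
                mask[i] = true;
        }
    }
    return mask;
}
```

Calling `low_complexity_mask("PPPPPPPPPPPP")` flags all twelve positions, whereas a window of twelve distinct residues stays unflagged. A comparison against the original implementation would have to account for the stages this sketch leaves out.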

Steps:

Stretch goals:

Expected outcome for student:

Add masking support to SeqAn3 (option 2)

The focus of this work would be to add masking functionality to the new library, SeqAn3. It is more about a clean implementation, proper documentation, and participation in the software project and its workflows.
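
As a rough illustration of what masking functionality could look like, the following C++ sketch pairs a sequence with one mask flag per position and renders it either soft-masked (lower case) or hard-masked (replacement character). The type `masked_sequence` and its members are hypothetical and do not reflect any SeqAn3 interface; the sketch uses only the standard library.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical type for illustration only; not SeqAn3 API.
struct masked_sequence
{
    std::string residues;      // the underlying sequence
    std::vector<bool> masked;  // one flag per position, true = masked

    explicit masked_sequence(std::string seq) :
        residues(std::move(seq)), masked(residues.size(), false)
    {}

    // Marks the half-open interval [begin, end) as masked.
    // The end is clamped to the sequence length.
    void mask_interval(std::size_t begin, std::size_t end)
    {
        if (end > residues.size())
            end = residues.size();
        for (std::size_t i = begin; i < end; ++i)
            masked[i] = true;
    }

    // Soft masking: masked residues become lower case, all others upper case.
    std::string soft() const
    {
        std::string out = residues;
        for (std::size_t i = 0; i < out.size(); ++i)
        {
            unsigned char const c = static_cast<unsigned char>(out[i]);
            out[i] = static_cast<char>(masked[i] ? std::tolower(c) : std::toupper(c));
        }
        return out;
    }

    // Hard masking: masked residues are replaced by a single character,
    // conventionally 'N' for nucleotides and 'X' for amino acids.
    std::string hard(char replacement) const
    {
        std::string out = residues;
        for (std::size_t i = 0; i < out.size(); ++i)
        {
            if (masked[i])
                out[i] = replacement;
        }
        return out;
    }
};
```

For example, masking positions 4 to 11 of `ACGTAAAAAAAACGT` gives `ACGTaaaaaaaaCGT` with `soft()` and `ACGTNNNNNNNNCGT` with `hard('N')`. Keeping the mask separate from the residues means the original sequence is never lost, which is one possible design; a library implementation might instead encode the mask in the alphabet type.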

Goals:

Stretch goals:

Expected outcome for student:

Comments


References

[1] http://www.molbiol.ox.ac.uk/analysis_tools/BLAST/BLAST_filtering.shtml

[2] http://www.sciencedirect.com/science/article/pii/009784859385006X and http://www.sciencedirect.com/science/article/pii/S0076687996660352 (more information and documentation are available; a public reference implementation in C/C++ is available in the ncbi-toolkit)

[3] original unpublished, improved version: http://www.ncbi.nlm.nih.gov/pubmed/16796549

[4] http://bioinformatics.oxfordjournals.org/content/22/24/2980.full

[5] http://bioinformatics.oxfordjournals.org/content/30/17/i349.abstract