
BscRegionFilter

Possible Project for a BSc Thesis in Bioinformatics or Computer Science

Introduction
Implement multiple masking algorithms and compare (option 1)
Add masking support to SeqAn3 (option 2)
Comments
References

Introduction

"Many nucleotide and amino acid sequences are highly repetitive in nature. If your query sequence contains regions of low complexity or repeats, you can end up with many non-related, high scoring sequences being found during BLAST (or FASTA) searches (e.g. hits against proline-rich regions or poly-A tails). In other cases, your sequence may contain regions of vector sequence, or repeat regions such as Alu sequences, that you either do not want included in your sequence, or at the very least, wish to have discluded in any searches you carry out based on sequence similarity." [1]

Two projects are possible, depending on the student's interests, skills and time frame.

Implement multiple masking algorithms and compare (option 1)

The goal of this thesis is to reimplement the well-known filtering algorithm SEG (for protein sequences) [2] in a stand-alone SeqAn application and to compare it against the original implementation.

Steps:

implement the SEG algorithm [2] as a function in SeqAn
add support for writing (and reading) SEG interval output to SeqAn
develop a tool that reads FASTA files and outputs intervals for them
benchmark and compare the solution to the original tool
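
To make the first and third steps more concrete, the following is a minimal sketch of a sliding-window complexity filter in plain C++. It is not the SEG algorithm from [2] and does not use any SeqAn API: it only reports windows whose Shannon entropy falls below a cutoff and merges them into intervals. The names window_entropy and low_complexity_intervals, the half-open interval representation and the default values for window and cutoff are assumptions chosen for illustration.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Half-open interval [first, second) of sequence positions (assumed representation).
using interval = std::pair<std::size_t, std::size_t>;

// Shannon entropy (in bits) of the character composition of seq[begin, end).
double window_entropy(std::string const & seq, std::size_t begin, std::size_t end)
{
    assert(begin < end && end <= seq.size());

    std::array<std::size_t, 256> counts{};
    for (std::size_t i = begin; i < end; ++i)
        ++counts[static_cast<unsigned char>(seq[i])];

    double const length = static_cast<double>(end - begin);
    double entropy = 0.0;
    for (std::size_t c : counts)
    {
        if (c == 0)
            continue;
        double const p = static_cast<double>(c) / length;
        entropy -= p * std::log2(p);
    }
    return entropy;
}

// Returns merged intervals covered by at least one window with entropy <= cutoff.
// The default window length and cutoff are illustrative assumptions.
std::vector<interval> low_complexity_intervals(std::string const & seq,
                                               std::size_t window = 12,
                                               double cutoff = 2.2)
{
    std::vector<interval> result;
    if (window == 0 || seq.size() < window)
        return result;

    for (std::size_t pos = 0; pos + window <= seq.size(); ++pos)
    {
        if (window_entropy(seq, pos, pos + window) > cutoff)
            continue;

        if (!result.empty() && pos <= result.back().second)
            result.back().second = pos + window;     // overlaps or touches the previous interval
        else
            result.emplace_back(pos, pos + window);  // start a new interval
    }
    return result;
}
```

Calling low_complexity_intervals on each sequence read from a FASTA file yields a list of intervals per sequence, which could then be written out and compared with the intervals reported by the original tool. The entropy is recomputed from scratch for every window here; an efficient implementation would update the character counts incrementally.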

Stretch goals:

include alternatives to SEG, like GBA [4]
parallelise the tool over the input sequences (should be fairly simple with OpenMP)
measure the influence of the algorithm on a tool like BLAST or Lambda [5]
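
The parallelisation over input sequences mentioned above could look roughly like the sketch below. It assumes that the sequences are already in memory and that the per-sequence routine has no shared state. The function count_run_residues is a hypothetical stand-in for the real masking routine (it only counts residues in homopolymer runs), and process_all is likewise an assumed name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the per-sequence masking routine:
// counts residues that lie in homopolymer runs of at least min_run characters.
std::size_t count_run_residues(std::string const & seq, std::size_t min_run = 4)
{
    assert(min_run > 0);

    std::size_t masked = 0;
    std::size_t i = 0;
    while (i < seq.size())
    {
        std::size_t j = i;
        while (j < seq.size() && seq[j] == seq[i])
            ++j;                      // j ends one past the current run
        if (j - i >= min_run)
            masked += j - i;
        i = j;
    }
    return masked;
}

// Processes all sequences; iterations are independent of each other.
std::vector<std::size_t> process_all(std::vector<std::string> const & seqs)
{
    std::vector<std::size_t> results(seqs.size());
    std::ptrdiff_t const n = static_cast<std::ptrdiff_t>(seqs.size());

    // Each iteration writes only its own results[i], so no synchronisation is needed.
    // A dynamic schedule helps when sequence lengths differ a lot.
    #pragma omp parallel for schedule(dynamic)
    for (std::ptrdiff_t i = 0; i < n; ++i)
        results[static_cast<std::size_t>(i)] = count_run_residues(seqs[static_cast<std::size_t>(i)]);

    return results;
}
```

With GCC or Clang the pragma takes effect when compiling with -fopenmp; without that flag the pragma is ignored and the loop runs serially, with the same results.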

Expected outcome for student:

learn how to read and understand scientific papers, pseudo code and/or other implementations' source code
learn how to efficiently implement an existing algorithm in SeqAn, do I/O and develop an application
learn how to benchmark and compare your implementation with others
learn how to write a thesis

Add masking support to SeqAn3 (option 2)

The focus of this work would be to add masking functionality to the new library, SeqAn3. It is more about a clean implementation, proper documentation and participation in the software project and its workflows.

Goals:

implement the SEG algorithm [2] or another simpler algorithm as a function in SeqAn3 (if permitted by license an existing solution could be imported with little change)
add alphabet types for mask (0 or 1) and a template masked that creates a masked alphabet from an existing one
Implement a masked_sequence_adaptor that stores masking information more efficiently than per-character; evaluate theoretical differences in space consumption and access time vs. a regular sequence over a masked alphabet
Write proper documentation and tests for all new functionality
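
To illustrate the difference between the two storage strategies named in the goals, here is a minimal sketch in plain C++. It does not use or describe the SeqAn3 API: the struct masked, the class masked_sequence_adaptor and all member names are hypothetical and only show one possible design, in which the mask is kept as a sorted list of half-open intervals next to the unchanged sequence.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

// Hypothetical composite alphabet: a letter of an existing alphabet plus a mask bit.
template <typename alphabet_t>
struct masked
{
    alphabet_t letter;
    bool is_masked;
};

// Hypothetical adaptor: refers to the original sequence and stores the mask
// as sorted, non-overlapping half-open intervals [begin, end).
template <typename sequence_t>
class masked_sequence_adaptor
{
public:
    using value_type = masked<typename sequence_t::value_type>;

    explicit masked_sequence_adaptor(sequence_t const & seq) : seq_{&seq} {}

    // Intervals must be added in increasing order and must not overlap.
    void add_masked_interval(std::size_t begin, std::size_t end)
    {
        assert(begin < end && end <= seq_->size());
        assert(intervals_.empty() || intervals_.back().second <= begin);
        intervals_.emplace_back(begin, end);
    }

    std::size_t size() const { return seq_->size(); }
    std::size_t interval_count() const { return intervals_.size(); }

    // Binary search over the intervals: O(log k) for k masked regions.
    bool is_masked(std::size_t pos) const
    {
        // Find the first interval that starts after pos; the one before it may contain pos.
        auto it = std::upper_bound(intervals_.begin(), intervals_.end(), pos,
                                   [](std::size_t p, std::pair<std::size_t, std::size_t> const & iv)
                                   { return p < iv.first; });
        return it != intervals_.begin() && pos < std::prev(it)->second;
    }

    // Returns the letter together with its mask state, computed on access.
    value_type operator[](std::size_t pos) const
    {
        return {(*seq_)[pos], is_masked(pos)};
    }

private:
    sequence_t const * seq_;  // non-owning; the sequence must outlive the adaptor
    std::vector<std::pair<std::size_t, std::size_t>> intervals_;
};
```

In this sketch a std::vector of masked<char> spends one extra bool per character regardless of how much is masked and offers constant-time access, whereas the adaptor spends two std::size_t values per masked region and needs a binary search per access. Which one is smaller therefore depends on the number of masked regions relative to the sequence length, which is the kind of trade-off the evaluation would quantify.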

Stretch goals:

Get the changes merged before handing in the thesis
evaluate more storage strategies for masked_sequence_adaptor
adapt the FASTA input/output code to be able to read masked sequences from file
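
This page does not say how masked sequences are represented in a file. One common convention is soft-masking, where masked residues are written in lower case. Under that assumption, the conversion between a soft-masked sequence line and an interval-based representation can be sketched as follows; masked_record, from_soft_masked and to_soft_masked are hypothetical names and not part of SeqAn.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record: upper-cased sequence plus masked half-open intervals [begin, end).
struct masked_record
{
    std::string sequence;
    std::vector<std::pair<std::size_t, std::size_t>> masked;
};

// Reading direction: lower-case runs become masked intervals (assumes soft-masking).
masked_record from_soft_masked(std::string const & raw)
{
    masked_record rec;
    rec.sequence.reserve(raw.size());
    for (std::size_t i = 0; i < raw.size(); ++i)
    {
        unsigned char const c = static_cast<unsigned char>(raw[i]);
        if (std::islower(c))
        {
            if (!rec.masked.empty() && rec.masked.back().second == i)
                rec.masked.back().second = i + 1;   // extend the current run
            else
                rec.masked.emplace_back(i, i + 1);  // start a new run
        }
        rec.sequence.push_back(static_cast<char>(std::toupper(c)));
    }
    return rec;
}

// Writing direction: masked intervals are written back as lower case.
std::string to_soft_masked(masked_record const & rec)
{
    std::string out = rec.sequence;
    for (auto const & iv : rec.masked)
    {
        assert(iv.first <= iv.second && iv.second <= out.size());
        for (std::size_t i = iv.first; i < iv.second; ++i)
            out[i] = static_cast<char>(std::tolower(static_cast<unsigned char>(out[i])));
    }
    return out;
}
```

A round trip through both functions reproduces the original soft-masked string, which makes this pair easy to cover with unit tests.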

Expected outcome for student:

learn how to read and understand Modern C++ library code and documentation
improvement of C++ skills
learn how to do good quality software engineering, including automated testing, documentation, version control
learn how the SeqAn project is organised and how to participate in the development work-flow
write a thesis

Comments

References

[1] http://www.molbiol.ox.ac.uk/analysis_tools/BLAST/BLAST_filtering.shtml

[2] http://www.sciencedirect.com/science/article/pii/009784859385006X and http://www.sciencedirect.com/science/article/pii/S0076687996660352; more information and documentation are available, and a public reference implementation in C/C++ is available in the ncbi-toolkit

[3] original unpublished, improved version: http://www.ncbi.nlm.nih.gov/pubmed/16796549

[4] http://bioinformatics.oxfordjournals.org/content/22/24/2980.full

[5] http://bioinformatics.oxfordjournals.org/content/30/17/i349.abstract

Topic revision: r3 - 24 Oct 2017, h4nn3s
