
NGS Data Cleaning

TODO: Manuel

Note that we can shift the focus of the thesis much more strongly towards programming/implementation if you want to program!

Next Generation Sequencing (NGS) creates huge amounts of data. Although NGS machines keep improving in throughput and accuracy, sequencing data often contain errors introduced by technical problems or contamination. Thus, an important first step in any pipeline working with such data is to clean the data.

Techniques for NGS data cleaning include BLAST searches against known contaminants, the identification of repetitive regions and PCR duplicates, the search for adaptors, and many more.
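To make two of these techniques concrete, here is a minimal Python sketch of adaptor trimming and duplicate removal. It is an illustration under simplifying assumptions, not a reference implementation: the function names `trim_adapter` and `remove_exact_duplicates`, the `min_overlap` parameter, and the example adaptor string are hypothetical. Only exact adaptor matches are considered, and "duplicates" are reads with identical sequence, whereas real tools usually tolerate mismatches and often detect PCR duplicates from mapping positions.

```python
from typing import Iterable, List, Tuple

Read = Tuple[str, str]  # (read name, sequence); hypothetical minimal read type


def trim_adapter(seq: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove an adaptor from the 3' end of a read (exact matching only).

    If the full adaptor occurs, the read is cut at its first occurrence.
    Otherwise, the longest adaptor prefix (at least min_overlap bases)
    that matches the end of the read is removed.
    """
    pos = seq.find(adapter)
    if pos != -1:
        return seq[:pos]
    # Try partial adaptor prefixes at the read end, longest first.
    longest = min(len(adapter) - 1, len(seq))
    for k in range(longest, min_overlap - 1, -1):
        if seq.endswith(adapter[:k]):
            return seq[:len(seq) - k]
    return seq


def remove_exact_duplicates(reads: Iterable[Read]) -> List[Read]:
    """Keep the first read for every distinct sequence.

    Assumption: identical sequence is used as a crude stand-in for
    PCR duplicate detection.
    """
    seen = set()
    unique = []
    for name, seq in reads:
        if seq not in seen:
            seen.add(seq)
            unique.append((name, seq))
    return unique


if __name__ == "__main__":
    adapter = "AGATCGGAAGAGC"  # example adaptor sequence, chosen for illustration
    reads = [
        ("r1", "ACGTACGTAGATCGGAAGAGC"),  # full adaptor at the 3' end
        ("r2", "ACGTACGT"),               # no adaptor
        ("r3", "TTTTGGGGAGATC"),          # partial adaptor (5 bases)
        ("r4", "CCCC"),
    ]
    trimmed = [(name, trim_adapter(seq, adapter)) for name, seq in reads]
    for name, seq in remove_exact_duplicates(trimmed):
        print(name, seq)
```

Trimming runs before duplicate removal here so that reads differing only in their adaptor remnant collapse into one. Whether that order is appropriate depends on the data and is one of the questions a systematic review could address.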

Note that the project description is subject to discussion, depending on your interest and focus. The area of NGS data cleaning is large, so the scope could be broadened to an MSc thesis.

Rationale/Aims

Cleaning NGS data is of great importance. Nevertheless, such data cleaning is often done with ad hoc methods. A methodical approach and review would be a beneficial contribution to bioinformatics practice. The aim of this thesis is to review the existing literature and also to document the methods used by practitioners (we are in contact with several groups in Berlin with whom we could cooperate to collect their "state of the art"). A systematic evaluation of the current state of the art would be as beneficial as proposals for benchmarking data cleaning methods and subsequently improving the state of the art.

The first aim of this thesis is to review the literature and existing methods for data cleaning, to collect information on current practice from practitioners in Berlin, and to catalog these methods. The second aim is to set up an evaluation environment for (a subset of) the identified data cleaning methods to facilitate a systematic evaluation. The third aim is to perform a study with a subset of the existing methods.

Depending on how large the subset from the second and third aim is, a fourth aim could be to implement an improved method.
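As one possible building block for the evaluation environment from the second aim, a cleaning method could be scored against reads whose status is known, for example simulated reads with spiked-in contaminants. The following Python sketch is only an assumption about how such a comparison might look. The function `evaluate_cleaning` and its inputs are hypothetical, and precision, recall, and F1 are just one choice of metrics.

```python
from typing import Dict, Set


def evaluate_cleaning(truth_bad: Set[str], flagged: Set[str]) -> Dict[str, float]:
    """Compare the reads a method flagged with the reads known to be bad.

    truth_bad: names of reads that really are contaminated/erroneous
               (assumed to be known, e.g. from a simulation).
    flagged:   names of reads the cleaning method removed or marked.
    """
    tp = len(truth_bad & flagged)   # bad reads that were flagged
    fp = len(flagged - truth_bad)   # good reads flagged by mistake
    fn = len(truth_bad - flagged)   # bad reads that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    truth = {"r1", "r2", "r3"}       # hypothetical ground truth
    method_a = {"r2", "r3", "r4"}    # hypothetical output of one method
    print(evaluate_cleaning(truth, method_a))
```

Running every method on the same labeled data sets and tabulating these scores would give a first, simple basis for the systematic comparison. Runtime and memory use could be recorded alongside.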

Proposed Schedule

The thesis is planned for 8 weeks. After 2 weeks, you can decide whether you want to complete the thesis or look for another topic.

  • Literature research (Optionally: getting started with SeqAn). (2-3 weeks)
    • Catalog and group existing methods, obtain real-world data.
  • Select subset of data type and methods to evaluate.
  • Design and set up a comparison environment for an evaluation. (2-3 weeks)
  • Perform a thorough and systematic evaluation of the methods. (1-2 weeks)
  • Optional: Design and implement a method, for example combining existing methods (e.g. in SeqAn, BioPython, …) (1-2 weeks)
  • You should hand in the rough structure of your thesis one week before the deadline.
  • Thesis writeup. (1 week)

There will be regular meetings with your supervisor.

Sub Pages

Literature

The following could be a good starting point for literature review.
