
Add BAM support to SeqAn SequenceFile

In this project you will write an interface that lets the sequence I/O of SeqAn read SAM/BAM files seamlessly, as if they were FASTA/FASTQ files.

Introduction

SAM/BAM Files: These are common file formats for storing the alignment of short sequences (often called “reads”) against a reference sequence, which is usually much longer. SAM is a tab-delimited text format and BAM is its compressed binary equivalent. To learn what SAM/BAM files look like, read the specification at https://samtools.github.io/hts-specs/SAMv1.pdf.

FASTA/FASTQ Files: These file formats store biological sequences, which can be DNA, RNA, or protein sequences. FASTQ additionally stores a quality value for each character of the sequence.

The SeqAn FormattedFile class supports reading and writing both FASTA/FASTQ sequence files and SAM/BAM alignment files. However, many people use SAM/BAM alignment files only for the sequences they contain and discard the mapping information. That is, given a SAM/BAM file, one wants to extract the sequences together with their identifiers and qualities.
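To make the extraction concrete, here is a minimal sketch in plain C++ (standard library only, no SeqAn) that takes one SAM alignment line and keeps only the identifier (QNAME), sequence (SEQ) and qualities (QUAL). The names SeqRecord, samLineToSeqRecord and toFastq are hypothetical and exist only for this illustration. The sketch assumes that reads flagged as reverse-strand (FLAG bit 0x10) should be turned back into their original orientation, and it treats every character other than A, C, G, T as N when complementing.

```cpp
#include <algorithm>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical record type for this sketch; not a SeqAn class.
struct SeqRecord
{
    std::string id;
    std::string seq;
    std::string qual;
};

// Complement one upper-case nucleotide; anything else becomes 'N' (a simplification).
inline char complement(char c)
{
    switch (c)
    {
        case 'A': return 'T';
        case 'C': return 'G';
        case 'G': return 'C';
        case 'T': return 'A';
        default:  return 'N';
    }
}

// Parse one SAM alignment line (11 mandatory tab-separated fields)
// and keep only QNAME (field 1), SEQ (field 10) and QUAL (field 11).
inline SeqRecord samLineToSeqRecord(std::string const & line)
{
    std::istringstream in(line);
    std::string field[11];
    for (auto & f : field)
        if (!std::getline(in, f, '\t'))
            throw std::runtime_error("SAM line has fewer than 11 fields");

    SeqRecord rec{field[0], field[9], field[10]};
    unsigned long const flag = std::stoul(field[1]);

    // FLAG bit 0x10: SEQ is stored reverse-complemented, so undo that.
    // "*" means the field is not stored and is left untouched.
    if (flag & 0x10)
    {
        if (rec.seq != "*")
        {
            std::reverse(rec.seq.begin(), rec.seq.end());
            std::transform(rec.seq.begin(), rec.seq.end(), rec.seq.begin(), complement);
        }
        if (rec.qual != "*")
            std::reverse(rec.qual.begin(), rec.qual.end());
    }
    return rec;
}

// Format the record as a four-line FASTQ entry.
inline std::string toFastq(SeqRecord const & rec)
{
    return "@" + rec.id + "\n" + rec.seq + "\n+\n" + rec.qual + "\n";
}
```

Calling toFastq(samLineToSeqRecord(line)) on an alignment line yields the corresponding FASTQ entry. A real implementation would also have to decide what to do when SEQ or QUAL is "*", which this sketch simply passes through.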

Tasks

  • Get familiar with the file formats (FASTA/FASTQ and SAM/BAM)
  • Take a closer look at the FormattedFile implementation of SeqAn
  • Implement the interface for reading SAM/BAM files as sequence files
  • Test the implementation with example data
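As a rough picture of what the third task could look like from the caller's side, the following sketch shows a small reader over SAM text that hands out one (id, sequence, qualities) triple per call. It is plain C++ and does not use SeqAn; the class SamSequenceReader and its readRecord member are hypothetical names chosen for this illustration. Two assumptions are built in: header lines (starting with '@') are skipped, and secondary or supplementary alignments (FLAG bits 0x100 and 0x800) are skipped so that a read with several alignment lines is not reported more than once per primary line.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical reader for this sketch; it is not part of SeqAn.
class SamSequenceReader
{
public:
    explicit SamSequenceReader(std::istream & in) : in_(in) {}

    // Fills id, seq and qual from the next primary alignment line.
    // Returns false once the input is exhausted.
    bool readRecord(std::string & id, std::string & seq, std::string & qual)
    {
        std::string line;
        while (std::getline(in_, line))
        {
            // Skip empty lines and SAM header lines.
            if (line.empty() || line[0] == '@')
                continue;

            // Split the 11 mandatory tab-separated fields.
            std::istringstream fields(line);
            std::string col[11];
            bool complete = true;
            for (auto & c : col)
            {
                if (!std::getline(fields, c, '\t'))
                {
                    complete = false;
                    break;
                }
            }
            if (!complete)
                continue; // assumption: silently skip malformed lines

            // Skip secondary (0x100) and supplementary (0x800) alignments.
            unsigned long const flag = std::stoul(col[1]);
            if (flag & (0x100UL | 0x800UL))
                continue;

            id = col[0];
            seq = col[9];
            qual = col[10];
            return true;
        }
        return false;
    }

private:
    std::istream & in_;
};
```

A caller would construct the reader over an input stream and loop with while (reader.readRecord(id, seq, qual)). This sketch returns SEQ and QUAL exactly as stored in the file and covers only the SAM text format; BAM input would need decompression and binary decoding first.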

Stretch goals

  • Create tests within SeqAn that check whether the implementation works.
  • Bring your code in line with the SeqAn coding standards and integrate it into SeqAn.

Extension as Bachelor Project

TODO

Literature

Topic revision: r2 - 28 Jan 2016, TemesgenDadi
 