Add BAM support to SeqAn SequenceFile

In this project you will write an interface that lets the sequence I/O of SeqAn read SAM/BAM files seamlessly, as if they were FASTA/FASTQ files.

Introduction

SAM/BAM files: These are common file formats for storing the alignment of short sequences (often called “reads”) against a reference sequence, which is usually much longer. SAM is a tab-delimited text format and BAM is its compressed binary equivalent. To see what SAM/BAM files look like, read the specification at https://samtools.github.io/hts-specs/SAMv1.pdf.

FASTA/FASTQ files: These file formats store biological sequences. FASTA can hold DNA, RNA or protein sequences; FASTQ additionally stores a quality value for every base.

The SeqAn FormattedFile class supports reading and writing both FASTA/FASTQ sequence files and SAM/BAM alignment files. However, many people use SAM/BAM alignment files only for the sequences inside them and discard the mapping information. That is, given a SAM/BAM file, one wants to extract the sequences together with their identifiers and qualities.
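To make the extraction concrete, here is a minimal sketch in plain C++ that turns one SAM text line into an identifier, a sequence and a quality string. It does not use SeqAn: SequenceRecord, complement, reverseComplement and samLineToRecord are hypothetical names invented for this illustration. The field positions (QNAME, FLAG, SEQ and QUAL in columns 1, 2, 10 and 11) and the flag bits come from the SAM specification. Skipping secondary and supplementary alignments, so that each read is reported only once, is an assumption of this sketch and not a requirement of the project.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type for this sketch; not a SeqAn class.
struct SequenceRecord
{
    std::string id;
    std::string seq;
    std::string qual;
};

// Complement of an upper-case DNA base; anything else becomes 'N'.
char complement(char base)
{
    switch (base)
    {
        case 'A': return 'T';
        case 'C': return 'G';
        case 'G': return 'C';
        case 'T': return 'A';
        default:  return 'N';
    }
}

std::string reverseComplement(std::string seq)
{
    std::reverse(seq.begin(), seq.end());
    std::transform(seq.begin(), seq.end(), seq.begin(), complement);
    return seq;
}

// Fills `record` from one SAM alignment line and returns true.
// Returns false for header lines, lines with fewer than the 11 mandatory
// fields, records without a stored sequence, and (an assumption of this
// sketch) secondary or supplementary alignments.
bool samLineToRecord(std::string const & line, SequenceRecord & record)
{
    if (line.empty() || line[0] == '@')
        return false;

    std::vector<std::string> fields;
    std::istringstream stream(line);
    std::string field;
    while (std::getline(stream, field, '\t'))
        fields.push_back(field);
    if (fields.size() < 11)
        return false;

    unsigned long const flag = std::stoul(fields[1]);
    bool const isReverse       = (flag & 0x10) != 0;   // SEQ is reverse complemented
    bool const isSecondary     = (flag & 0x100) != 0;
    bool const isSupplementary = (flag & 0x800) != 0;
    if (isSecondary || isSupplementary || fields[9] == "*")
        return false;

    record.id   = fields[0];
    record.seq  = fields[9];
    record.qual = (fields[10] == "*") ? std::string() : fields[10];

    // SAM stores reverse-strand reads as they align to the reference,
    // so undo that to recover the read as it was sequenced.
    if (isReverse)
    {
        record.seq = reverseComplement(record.seq);
        std::reverse(record.qual.begin(), record.qual.end());
    }
    return true;
}
```

For example, a line with QNAME read1, FLAG 16, SEQ AACG and QUAL IIHG yields the identifier read1, the sequence CGTT and the qualities GHII. Paired-end reads (flag bits 0x40 and 0x80) are not distinguished here; a real implementation would have to decide how to name or order the two mates.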

Tasks

  • Get familiar with the file formats (FASTA/FASTQ and SAM/BAM)
  • Take a closer look at the FormattedFile implementation of SeqAn
  • Implement the interface for reading SAM/BAM files as sequence files
  • Test the implementation with example data

Stretch goals

  • Create tests within SeqAn that check whether the implementation works
  • Bring your code in line with the SeqAn coding standards and integrate it into SeqAn

Extension as Bachelor Project

TODO

Literature
