Sign up at ProgrammingGroupList for one of the exercises on Tuesday 12th and Wednesday 14th

Exercise 3

Compression with BWT

Deadline: 24.06.2012 6:00 p.m.

Implement a program that (Mode c)
  • reads a single sequence from a fasta file
  • calculates the BWT of that sequence
  • implements move-to-front encoding and Huffman coding to compress the BWT
  • writes the Huffman code into an outfile

Mode x
  • reads a file containing the BWT compression of some sequence
  • writes the uncompressed sequence into a fasta outfile.

Remarks

For the compressed file it would be optimal if you output the Huffman as binary bitstring. But this is not required. As an alternative you can also use a single char for each bit, so you can output a string of ones and zeros. To obtain the compression rate you can then divide the file size by 8 (which would correspond to the size if binary bitstring was used).

Format specifications
The file format for the compressed file shall be the following:
Alphabet (first line)
<string of all characters that appeared in the sequence e.g. AGCT (required to revert the move to front encoding)>
HuffmanCodes (third line)
0   <code for 0> 
1   <code for 1>
2   <code for 2>
3   <code for 3>
<SequenceName starting with a ">"  (e.g. ) (>random10M)>
<Huffman code of sequence either as binary bit string or as string of ones and zeros>
The program usage should be: ./exercise3 <input file> <output file> <mode> where:
  • mode: single char either c (compress) or x (extract).
  • for mode c input file is fasta file and output file is compressed file
  • for more x input file is compressed file and output file is fasta file

For the application note you should test your program on different sequences with different alphabets. Compare the compression rate to that obtained by other file compression tools.
Topic revision: r5 - 12 Jun 2012, SandroAndreotti
 
  • Printable version of this topic (p) Printable version of this topic (p)