Sign up at ProgrammingGroupList for one of the exercises on Tuesday 15th and Wednesday 16th

Exercise 2

Read mapping with QUASAR

Deadline: 23.05.2012 9:00 a.m.

Implement the QUASAR algorithm for to improve the efficiency of you read mapper in Exercise 1 So you need to:
  • Efficiently create a q-gram index of the genome.
  • Divide the genome into blocks
  • Search for matching q-grams and count matches for each block
  • For each block where the matches exceed the threshold (q-gram Lemma) use the semi-global aligner from Exercise 1 for verification.

Format specifications
The input and output file formats and outfile name are the same as in Exercise 1 The program usage should be:

./exercise2 <genome.fasta> <reads.fasta> <k> <q> <b> where:
  • k is the allowed maximal edit distance of a match.
  • q is the length of the q-grams.
  • b the block size for the genome.

Remark: The choice of q in testing will always respect the memory usage of a q-gram index without hashing (table size 4q).
Your implementation should allow for a length of 12 at least on a normal desktop.
If the block size is smaller than the read length + k your program should write a warning "read length exceeds block size" into the result file and terminate.
If for the choice of k and q and the read length the calculated threshold is not greater than zero the program should write a warning "Bad choice of k and q for input read length" into the result file and terminate.

For the application note you should test different values of k, q and b and measure their effects on the runtime. Compare to the runtime in Exercise 1

Topic attachments
I Attachment Action Size Date Who Comment
DataExercise2.tar.gzgz DataExercise2.tar.gz manage 7050.9 K 20 May 2012 - 14:20 SandroAndreotti Data for Exercise 2
Topic revision: r5 - 20 May 2012, SandroAndreotti
 
  • Printable version of this topic (p) Printable version of this topic (p)