Weekly Reports for the Bachelor Thesis "Comparative Genomics with MEGAN and RazerS" by Hannes Hauswedell

Week 1 (2009-07-11..2009-07-18)

E-Values and Bit-Scores:
  • contacted the NCBI BLAST help/developer mailing list about BLAST's scoring schemes and parameter calculation
  • received large amounts of documentation on the topic (50+ pages)
  • started reading the docs; beginning to think that dynamically or statically linking against BLAST is the only way of reaching similar scores
  • postponed the decision on this topic; for now the functions and values implemented earlier are used
  • this will become important later, as we will definitely need protein-scoring in addition to nucleotide scoring
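
For reference, a minimal sketch of how a raw alignment score maps to a bit score and an E-value under Karlin-Altschul statistics. This is not the RazerBlastS code: the struct and function names are hypothetical, and lambda and K have to be supplied for the scoring scheme in use. Real BLAST additionally corrects the query and database lengths ("effective lengths"), which this sketch leaves out, so its numbers will not match BLAST exactly.

```cpp
#include <cassert>
#include <cmath>

// Statistical parameters of one scoring scheme (hypothetical struct name).
struct KarlinAltschulParams {
    double lambda;  // scale parameter of the scoring scheme
    double K;       // search-space constant of the scoring scheme
};

// Bit score: S' = (lambda * S - ln K) / ln 2
double bitScore(int rawScore, const KarlinAltschulParams& p) {
    assert(p.lambda > 0.0 && p.K > 0.0);
    return (p.lambda * rawScore - std::log(p.K)) / std::log(2.0);
}

// E-value: E = m * n * 2^(-S'), with m = query length and n = database length.
// No effective-length correction is applied here (simplifying assumption).
double eValue(int rawScore, double queryLen, double dbLen,
              const KarlinAltschulParams& p) {
    return queryLen * dbLen * std::pow(2.0, -bitScore(rawScore, p));
}
```

With lambda = ln 2 and K = 1 the bit score equals the raw score, which makes a convenient sanity check.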

Overall progress on "BLASTN-Mode" for RazerS:
  • implemented a dumpAlignment()-function for writing the final nucleotide-alignment to the report
  • format seems to be compatible with Blast-Output, but more tests have to be done to make sure

  • started to do some test runs with real-world data to have real reports to compare with output of RazerBlastS
  • ran out of memory very quickly -> need smaller samples and databases, or better (or simply different) hardware

=> all-in-all less progress than hoped for, but about as much as expected (I knew time would be limited since I am also preparing for my last exam)

Week 2 (2009-07-19..2009-07-26)

Overall progress on "BLASTN-Mode" for RazerS:
  • implemented some more command-line options:
  • it is now possible to choose the window size (-W N), which deactivates parameter-choosing and uses an ungapped shape of length N (this behavior might be desirable, as it is closer to BLAST)

  • generated an "artificial" dataset that can be used with the available hardware
  • first test-runs on the data with regular BLAST(2) and RazerBlastS
  • RazerBlastS didn't produce any results
  • spent a lot of time debugging this, fixed a lot of issues on the way, but found a new one
  • was able to narrow down the problem, but was not able to fix it yet (RazerBlastS throws SIGABRT somewhere deep in SeqAn)

Week 3 (2009-07-27..2009-08-02)

Overall progress on "BLASTN-Mode" for RazerS:
  • After some time the crash was resolved: the cause was a problem in the original verification function

  • With RazerBlastS running I did tests to compare output with BLAST's
  • RazerBlastS produced results!
  • => Most of the hits looked "similar" to BLAST, but actual alignments and scores differed
  • found out that they actually differ inside RazerBlastS as well - between the verification phase (where the actual alignment is computed) and the output phase (where it is recomputed from genome coordinates)
  • first thought this was just the difference between Gotoh and BandedGotoh -> wrong!
  • it turned out that in verification the genome was aligned against the read, and in output the other way around (which makes a difference because the DP-Matrix-Configuration is asymmetrical)
  • that solved most of the issues, however many alignments had a huge "gap-prefix" or "gap-suffix"
  • this was due to wrong parameter-retrieval from the match-fragments which could be fixed
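
The asymmetry described above can be illustrated with a small sketch. It is not the SeqAn configuration used in RazerBlastS: it uses linear gap costs instead of Gotoh's affine ones, and the function name, parameter names and score values are made up for illustration. The first sequence must be aligned completely, while unaligned prefixes and suffixes of the second sequence are free, so swapping the arguments changes the score.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Semi-global score: `inner` is aligned end to end, overhanging ends of
// `outer` cost nothing. Linear gap costs (simplifying assumption).
int semiGlobalScore(const std::string& inner, const std::string& outer,
                    int match = 1, int mismatch = -3, int gap = -4) {
    const std::size_t m = inner.size(), n = outer.size();
    // Row 0 stays 0: skipping a prefix of `outer` is free.
    std::vector<std::vector<int>> D(m + 1, std::vector<int>(n + 1, 0));
    // Column 0 is penalised: every character of `inner` must be consumed.
    for (std::size_t i = 1; i <= m; ++i)
        D[i][0] = D[i - 1][0] + gap;
    for (std::size_t i = 1; i <= m; ++i) {
        for (std::size_t j = 1; j <= n; ++j) {
            int diag = D[i - 1][j - 1] +
                       (inner[i - 1] == outer[j - 1] ? match : mismatch);
            int up   = D[i - 1][j] + gap;  // character of inner against a gap
            int left = D[i][j - 1] + gap;  // character of outer against a gap
            D[i][j] = std::max({diag, up, left});
        }
    }
    // Maximum over the last row: skipping a suffix of `outer` is free.
    return *std::max_element(D[m].begin(), D[m].end());
}
```

For a read "ACGT" and a genome window "TTACGTTT", calling semiGlobalScore(read, genome) gives 4, while semiGlobalScore(genome, read) gives -12, because in the second call the flanking T's of the genome window have to be paid for as gaps.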

  • many formatting improvements to the BLASTN-Output-Format

=> Current State:
  • on the test data RazerBlastS finds all of the matches that BLAST finds, with almost identical alignments
  • it finds very few additional useless matches (not yet sure where they come from)
  • the scores on the matches are nearly identical to BLAST's score, even the Bit-Score
  • The output-report already looks very similar and should satisfy MEGAN (no testing done there, yet)

Week 4 (2009-08-03..2009-08-09)

Overall progress on "BLASTN-Mode" for RazerS:

  • fixed a minor problem in e-Value-calculation and switched to "scientific" output, e-Values are now similar to BLAST
  • added "overview" tables to output (beginning of sections) and fixed some formatting issues

  • spent an entire day figuring out where strange hits with bad scores and alignments came from. It turned out that these were reverse hits marked as duplicates, which -- because of the way RazerS marks duplicate hits :O -- lose their "reverse" attribute and are therefore aligned against some forward sequence, resulting in a useless and confusing alignment during the output phase => matches marked as duplicates are now ignored

  • changed back to BandedGotoh() in verification, which halves the execution time on a medium-sized test set but at first produced bad results (while hunting for errors I had previously switched to regular Gotoh())
  • after a lot of debugging I found an error in calculation of diagonals
  • fixed that and added a general +-3 to k
  • now results are nearly identical to "real" Gotoh!
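
Why a band that is too narrow (or placed on the wrong diagonals) gives bad results can be shown with a small sketch. It uses unit-cost edit distance rather than Gotoh's affine gap scoring, a band centred on the main diagonal, and a hypothetical function name, so it only illustrates the idea and is not the BandedGotoh() code. Only cells with |i - j| <= k are filled; everything outside the band counts as unreachable.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <string>
#include <vector>

// Edit distance restricted to the diagonals within k of the main diagonal.
// Returns a large sentinel if the end cell lies outside the band.
int bandedEditDistance(const std::string& a, const std::string& b, int k) {
    assert(k >= 0);
    const int INF = std::numeric_limits<int>::max() / 2;
    const int m = static_cast<int>(a.size());
    const int n = static_cast<int>(b.size());
    std::vector<std::vector<int>> D(m + 1, std::vector<int>(n + 1, INF));
    for (int i = 0; i <= m; ++i) {
        const int lo = std::max(0, i - k);  // leftmost column inside the band
        const int hi = std::min(n, i + k);  // rightmost column inside the band
        for (int j = lo; j <= hi; ++j) {
            if (i == 0 && j == 0) { D[i][j] = 0; continue; }
            int best = INF;
            if (i > 0 && j > 0)
                best = std::min(best, D[i - 1][j - 1] + (a[i - 1] != b[j - 1] ? 1 : 0));
            if (i > 0) best = std::min(best, D[i - 1][j] + 1);  // deletion
            if (j > 0) best = std::min(best, D[i][j - 1] + 1);  // insertion
            D[i][j] = best;
        }
    }
    return D[m][n];
}
```

For "ACGTACGT" against "CGTACGTA" a band of k = 0 yields 8 (every position mismatches), whereas k = 1 already yields the true distance of 2. Widening the band by a small safety margin costs little and guards against off-by-one errors in the diagonal calculation.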

Overall progress on "BLASTX-Mode" for RazerS:

  • started work, CLI parameters added
  • began researching an efficient method for Codon->AminoAcid translation
  • didn't find anything useful in SeqAn
  • didn't manage to adapt ModView or ModifiedString<> because they don't like char[3] to char translation :-(
  • asked on seqan-dev for help

Weeks 5 & 6 (2009-08-10..2009-08-23)

  • spent lots of time trying to switch to real local alignments from the current more-or-less semi-global approach
  • had discussions with David about this and with Tobias via the list
  • no real progress, other than the strong impression that this is not going to work the way David (and I) had planned

  • progress on Protein-Mode:
  • wrote a codon-conversion table that enables codon translation in constant time
  • wrote calls for translating a nucleotide-sequence with 1, 3 or 6 Frames
  • wrote an import function for reading fasta-nucleotide-sequences from file and directly translating them
  • imported tables of kappa- and lambda-values for Protein-Scoring from BLAST-Source-Code
  • adapted e-Value and bitScore-calculation to also work for Protein-Scoring
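
A constant-time codon lookup with 1-, 3- or 6-frame translation could look roughly like the sketch below. It is only an illustration, not the table written for RazerBlastS: all names are hypothetical, the base order T, C, A, G and the plain std::string types are my own choice, only upper-case A/C/G/T are recognised, and any codon containing another character is translated to 'X'. The table itself is the standard genetic code.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Standard genetic code; codons ordered TTT, TTC, TTA, TTG, TCT, ...
// (base order T, C, A, G), '*' marks a stop codon.
static const char CODON_TABLE[] =
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

int baseIndex(char base) {
    switch (base) {
        case 'T': return 0;
        case 'C': return 1;
        case 'A': return 2;
        case 'G': return 3;
        default:  return -1;  // unknown or ambiguous base
    }
}

// One table lookup per codon: index = 16 * first + 4 * second + third.
char translateCodon(char b1, char b2, char b3) {
    const int i1 = baseIndex(b1), i2 = baseIndex(b2), i3 = baseIndex(b3);
    if (i1 < 0 || i2 < 0 || i3 < 0) return 'X';
    return CODON_TABLE[16 * i1 + 4 * i2 + i3];
}

// Translates one reading frame; trailing bases that do not fill a codon are dropped.
std::string translateFrame(const std::string& dna, std::size_t offset) {
    std::string protein;
    for (std::size_t i = offset; i + 3 <= dna.size(); i += 3)
        protein += translateCodon(dna[i], dna[i + 1], dna[i + 2]);
    return protein;
}

std::string reverseComplement(const std::string& dna) {
    std::string rc(dna.rbegin(), dna.rend());
    for (char& c : rc) {
        switch (c) {
            case 'A': c = 'T'; break;
            case 'T': c = 'A'; break;
            case 'C': c = 'G'; break;
            case 'G': c = 'C'; break;
            default: break;  // leave unknown characters untouched
        }
    }
    return rc;
}

// numFrames = 1: forward frame 0; 3: all forward frames;
// 6: forward frames followed by the three frames of the reverse complement.
std::vector<std::string> translateFrames(const std::string& dna, int numFrames) {
    assert(numFrames == 1 || numFrames == 3 || numFrames == 6);
    std::vector<std::string> frames;
    const int forward = std::min(numFrames, 3);
    for (int f = 0; f < forward; ++f)
        frames.push_back(translateFrame(dna, static_cast<std::size_t>(f)));
    if (numFrames == 6) {
        const std::string rc = reverseComplement(dna);
        for (int f = 0; f < 3; ++f)
            frames.push_back(translateFrame(rc, static_cast<std::size_t>(f)));
    }
    return frames;
}
```

For the sequence "ATGGCCTAA", the first forward frame translates to "MA*" and the first reverse frame to "LGH".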

  • some code-refactoring to increase reusability and readability

Week 7 (2009-08-24..2009-08-30)

  • fixed e-Value-based Sorting of Matches

  • switch to exact local Alignments via localAlignment()
  • this works well, but is very slow

  • removed the global typedefs to make read and genome-types generic (needed for BLASTX-Mode)
  • → this resulted in many changed signatures
  • a lot of improvements on Protein-Support
  • still no protein alignments, though

Week 8 (2009-08-31..2009-09-06)

  • fix in e-Value calculation makes it closer to Blast's

  • lots of progress with BLASTX, RazerBlastS now produces Protein-Alignments!
  • change in find_swift.h to prevent it from overwriting the threshold-parameter
  • we now have a lot more results than we need...

  • had another appointment with David to plan the last weeks of work

Week 9 (2009-09-07..2009-09-13)

  • implemented the verification function for ungapped alignments, already works for BLASTX!

  • started writing the thesis paper! spent a lot of time on organizational stuff and LaTeX…

Weeks 10-11.5 (2009-09-14..2009-10-01)

  • was busy writing for most of each day
  • fixed minor bugs along the way


Topic revision: r9 - 29 Sep 2009, hauswede