
Page Razers2Revision

General Points

TODO: Wait for evaluation of Hobbes on human, execute class, update tables.

[Manuel]: Section S5 was updated to reflect the usage of the read mappers for the new evaluation. Section 3.3 was moved to the appendix because of space constraints.

Answers to the reviews.

Dear Dr. ???? (contact?)

Please find below our answers to the referees' comments. As you can see from the time it took to revise our paper, we did indeed accept almost all criticism regarding the evaluation of our algorithm and substantially redid the evaluation part. We hope that the reviewers are satisfied with our changes and look forward to their comments.

For readability, we quote the points raised by the different reviewers and add our answer underneath each of them.

Reviewer: 1

Comments to the Author

The authors describe a new read alignment tool RazerS 3 that supports fast, sensitive alignment of sequencing reads. The authors discuss a pigeonhole filter used in RazerS 3 but not the original RazerS, a framework for predicting the sensitivity of the pigeonhole filter, a new variant on the Myers’ bit-vector algorithm, and a scheme for load balancing across many processes. The authors also demonstrate RazerS 3’s speed, sensitivity, and accuracy using the Rabema benchmark and various comparisons with respect to real-world data.

The manuscript is potentially quite strong, possibly representing an important contribution in the realm of q-gram-indexing-based read alignment. However, there are some major issues that make it difficult to evaluate the impact of this contribution.

Following are major issues that must be addressed by the authors:

1.1)

The comparison benchmarks are not adequate. Several issues should be addressed: None of the comparisons use the whole human genome as the reference. Most of the tools that are compared to (BWA, BWA-SW, Bowtie, Bowtie 2) were designed for alignment to whole human or mammalian genomes, and alignment to the whole human (or mouse) genome is a very common application of these tools. It’s hard to justify some of the manuscript’s performance claims (e.g. “mostly superior…in terms of sensitivity and performance”, “practicable performance”, “use of BWT…has its limitations”) without such comparisons.

We changed our evaluation substantially. In particular, we now map against the whole human genome.

Use of the Rabema benchmark is not well justified here. If my understanding is correct, use of Rabema would tend to give RazerS 3 an unfair advantage since RazerS 3 and Rabema share certain assumptions that aren’t shared by the competing tools. The authors should either choose a different way to compare the tools or explain why it’s fair to use Rabema here. For instance, is the Rabema “gold standard” based on edit distance? If so, does that not give the edit-distance-based tools (RazerS 3) an advantage over the tools that use a non-edit-distance-based scoring scheme (BWA, Bowtie 2, etc)?

This might be a misunderstanding. The Rabema benchmark is not built into RazerS 3, and the definition of "lakes" is independent of the tool. In our tool we chose to report one match per lake and parallelogram (i.e. there could be several hits per lake). However, the Rabema benchmark measures sensitivity in the edit distance model. Since our tool and many others are edit distance based, this is an appropriate benchmark. To address the quality-based mappers, we added to the Rabema benchmark a test for finding the original location, which is certainly independent of the chosen distance model. In addition, we changed our evaluation substantially and added new benchmarks similar in spirit to those used for Bowtie 2 and SHRiMP 2.

Some of the comparisons take reads originating from across the whole genome and align them to just a single chromosome of the genome (e.g. human chr2). Whether a tool performs well on such a test depends heavily on how quickly the tool throws away a read that fails to align. This is not a very relevant basis for comparing tools.

We changed that and compare to the whole genome.

The “island criterion” is in some sense “built in” to RazerS 3 but not into the tools compared against. This can lead to an unfair advantage for RazerS 3. For instance bowtie -k 100000 might “spend” its 100,000 alignments describing alignments to distinct positions within a long homopolymer, whereas RazerS 3 won’t (or shouldn’t) do that (and whether this is a good thing is, I think, debatable - see point 10 below).

See above. (We changed and expanded the evaluation section).

“To make results comparable, we filter the output of all read mappers to obtain only alignments with a maximal error rate of 4%.” Why does this make the results more comparable, and is it fair? This would seem to handicap the tools that use more sophisticated scoring schemes, like BWA and Bowtie 2.

We filtered the results to make them comparable, since we believe that an edit distance match with, say, 2 errors is always better than a match with more than 2 errors. However, to address this point, we now additionally report the unfiltered results in the new evaluation section.
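To make the procedure concrete, here is a minimal sketch of such a post-mapping filter, assuming the mappers write SAM with an NM (edit distance) tag and taking the read length from the SEQ field; the file name and the hard-coded 4% threshold are illustrative only, and this is not the script actually used in the evaluation.

// Minimal sketch (not the pipeline's actual filter): keep only SAM records
// whose edit distance (NM tag) is at most 4% of the read length.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main(int argc, char **argv) {
    double maxErrorRate = 0.04;                     // 4% as used in the paper
    std::ifstream in(argc > 1 ? argv[1] : "mapped.sam");
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line[0] == '@') {      // keep header lines
            std::cout << line << '\n';
            continue;
        }
        std::istringstream ss(line);
        std::string field, seq;
        int nm = -1;
        for (int col = 0; ss >> field; ++col) {
            if (col == 9)                           // SEQ column (0-based)
                seq = field;
            if (field.compare(0, 5, "NM:i:") == 0)  // optional edit distance tag
                nm = std::stoi(field.substr(5));
        }
        // drop unaligned/untagged records; keep records within the error budget
        if (nm >= 0 && !seq.empty() && nm <= maxErrorRate * seq.size())
            std::cout << line << '\n';
    }
    return 0;
}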

A way to solve many of these issues at a stroke would be to perform a comparison more like what Heng Li does here (http://lh3lh3.users.sourceforge.net/alnROC.shtml), which is essentially also what is done in the Bowtie 2 manuscript (http://www.nature.com/nmeth/journal/vaop/ncurrent/full/nmeth.1923.html), using the whole human genome as the reference.

Taking into account the comments of all reviewers, we changed the evaluation section substantially, incorporating ideas from the Bowtie 2 and SHRiMP 2 evaluations.

1.2)

A little too much of the methods section is a re-hashing of previous results from this group. For instance, the two subsections of section 2.1 pertaining to the SWIFT-like filtration algorithm mostly cover material from the previous RazerS paper. Also, the paragraph in section 2.4 starting “RazerS deals with this in an intelligent way...” and the following paragraph mostly discuss the first version of RazerS. Also, the “island criterion” section goes over ground covered by the Rabema paper. The authors should trim the methods section so that previous results are discussed only briefly.

We shortened the SWIFT section and point to the previous publication. In Section 2.4 we once wrote RazerS instead of RazerS 3; we corrected this.

1.3) [Manuel]

The discussion in section 2.4 is unclear. For example it’s not clear what is meant by “filter algorithms can produce duplicate matches for the verification step.” Duplicate in what sense? Because two or more q-grams can fall into the same parallelogram? The sentence beginning “This can be done by sorting the matches lexicographically...” is also confusing. The authors should clarify this section a bit.

We updated Section 2.4 and hope that it is clearer now.

Following are minor issues that must be addressed:

1.4) [Manuel]

In section 3.4: “Bowtie obviously loses a lot of intervals, presumably because of the missing support for indels.” If this is true, why does BWA also lose a lot (~20%) of intervals? A more detailed explanation is needed here.

We had not used the optimal BWA parameters. We redid the evaluation with the correct parameters.

1.5)

The speed / memory footprint tradeoff is not discussed prominently enough. Memory footprint is a major usability issue in practice. The discussion in S6 should be moved to the main text and the authors should add at least one memory-footprint result with respect to the whole human genome.

We added a note about the memory consumption of the whole human genome in Section 3.5. However, because of space constraints, we left the detailed discussion and tables in the supplement.

1.6)

The term “positional error probability” is not explained clearly. Presumably a “position” is an offset into the read. Are these the same as quality values? Also, regarding the procedure described at the beginning of section 3.2 whereby positional error probabilities are “trained” using real data: is this something the user must do in practice? Does anything about the procedure change when the input is mixed-length (e.g. 454 reads)?

We added an explanation of how we compute the positional error probabilities in Section 2.2. For mixed-length read sets, we compute loss rates for each read length and combine them into a weighted loss rate.
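As an illustration of the weighting, a hypothetical helper (our own names, not RazerS 3 code) that combines per-length loss rates into a single loss rate, weighted by the number of reads of each length:

// Hypothetical helper: combine per-length loss rates into one loss rate,
// weighted by how many reads of each length are present in the data set.
#include <cstddef>
#include <map>

double weightedLossRate(const std::map<unsigned, double> &lossRateByLength,
                        const std::map<unsigned, std::size_t> &readCountByLength) {
    double weightedSum = 0.0;
    std::size_t totalReads = 0;
    for (const auto &entry : readCountByLength) {
        unsigned length = entry.first;
        std::size_t count = entry.second;
        auto it = lossRateByLength.find(length);
        if (it == lossRateByLength.end())
            continue;                      // no loss rate estimate for this length
        weightedSum += it->second * count; // loss rate weighted by read count
        totalReads += count;
    }
    return totalReads ? weightedSum / totalReads : 0.0;
}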

1.7)

Tradeoffs between RazerS 3’s edit-distance-based scoring and BWA and Bowtie 2’s more flexible scoring schemes are not discussed. Why would a user prefer an edit-distance-based scheme over a scheme that can weigh mismatches and gaps differently? Or weigh mismatches in a quality-aware way? Or allow affine gap penalties, and separate penalties for read gaps versus reference gaps?

The referee raises some good questions. Edit distance based mappers that are agnostic to quality values often have an advantage in finding real variants (SNPs, SNVs), since those are not "marked" by bad qualities that could misguide a quality-based mapper, as convincingly shown in the SHRiMP 2 paper. We now discuss this in greater detail and added an evaluation similar to the one in SHRiMP 2. Affine gap costs can be advantageous for finding larger indels, which is mostly not addressed by standard read mappers but by specialized tools like SplazerS, Pindel, etc.

I’m guessing that the answer is: it’s a tradeoff. Edit-distance based tools are a bit simpler, and therefore usually a bit faster. For instance, use of an affine gap penalty can lead to a more challenging “verification” step because the low penalty for a gap extension leads to wider parallelograms. There should be at least some discussion of this tradeoff, since the leading tool (BWA) has what seems like a strictly more sophisticated policy than RazerS 3’s.

We agree with the referee and added some discussion.

1.8)

In table 3, Bowtie 2 aligns only 44.44% of the pairs in the E. coli data. Is this a typo? If not, can the authors suggest why this happens? This seems surprising.

Yes, this was a typo and is fixed in the updated Section 3.

Following are issues that, if addressed, would strengthen the paper:

1.9) [Manuel]

The authors describe how well their dynamic load balancing scheme performs relative to a naive scheme that divides the read set into t parts of equal size. Another simple scheme is to divide the read set into many parts of equal size and process each using a separate RazerS 3 process or thread, allowing up to t processes or threads. Is the dynamic load balancing scheme much better than this?

This is indeed a valid and interesting point. How to perform load balancing is always a tricky question and there are many possible ways to do it. We should have thought about this before, given that we manually parallelized mrFAST in this fashion. We note that the choice of package size is critical and adds another layer of complexity that our dynamic load balancing does not require.

We evaluated this using 8-way parallelism in the following way: the input of 10M reads was split into p = 8/16/32/40 packages of equal size (down to 250k reads per package for p = 40) and we ran RazerS the same way we ran mrFAST, using 8 processes. We concede that the load balancing achieved in this simple scheme may be poor; however, consider the following numbers of accumulated CPU time for each step.

mode      filtration [s]   verification [s]   filt.+verif. [s]
dynamic            242.6              200.5              443.1
p=8                216.4              203.5              419.9
p=16               388.0              191.9              579.9
p=32               694.3              184.7              878.0
p=40              1025.4              200.5             1225.9

Note that these times exclude the time for I/O of the reads but include the time for I/O of the contigs. The CPU times should speak for themselves, since even an optimal load balancing algorithm can only achieve a total wall-clock time of "TIME/t". The verification time is probably reduced because the processes run more isolated than the threads and can be scheduled better by the OS. As we can see, the overhead for scanning over the genome "eats up" any possible advantage of package-based load balancing.

Section S3.3 also describes these results.

1.10)

In section 2.3, the authors discuss how the “island criterion” groups alignments into equivalence classes of nearby alignments. However, this criterion can be criticized. Consider a short read that originated from within a long homopolymer in the reference genome. If the read aligns only within the homopolymer (and nowhere else in the genome), the “island criterion” would seem to indicate the alignment is unique. But that’s an unintuitive definition of “unique,” since we actually have no idea where within the homopolymer the read originated.

We are sorry if the impression was given that a hit would be counted as unique if it falls into a homopolymer island. We indeed do not do that. The intention of the island criterion is to be fair to any read mapper in such a region: any end position within the lake counts as correct for the Rabema benchmark. A read mapper can provide all locations in each lake (which is not practical), several, or only one. A good measure for the uniqueness of a match is the size of the lake, i.e. a lake of size 8 with maximal edit distance 4 around a perfect match is "unique", whereas a lake that is much larger (say, greater than half the size of a read) indicates a repetitive region. This is explained in detail in the Rabema paper.

RazerS actually creates multiple hits in a homopolymer, one in each parallelogram. This is stated in the last paragraph of Section 2.3:

""" The definition in (Holtgrewe et al., 2011) offers a clean, formal, and practical definition of the expected output of read mappers. RazerS reported one best match in every parallelogram whereas RazerS 3 reports one best match for each lake in each parallelogram. This increases the sensitivity and makes RazerS 3 a fully sensitive read mapper for the read mapping problem with the above match definition. """

Dave: Mmh. This point addresses the Rabema paper, not RazerS. RazerS outputs multiple matches in a homopolymer, but in that case only one per parallelogram. We cannot efficiently detect which matches stem from the same lake, let alone determine the size of the lake. Knut: Yes. But we do not claim that.

Another scheme is to say: two matches are different if they have no cells in common in the dynamic programming matrix. This solves the problem described in the second paragraph of the “island criterion” subsection, and the homopolymer problem described above.

The reviewer is partially correct with this statement. Sharing parts of their trace is exactly what we define as "trace equivalence" in the Rabema paper. However, as reasoned in the same paper, this alone is not sufficient: one also needs to combine trace equivalence with (some kind of) "neighbour equivalence" to obtain a mathematical representation of what is intuitive. This gives rise to k-trace equivalence.

The Rabema paper (and its erratum) gives a detailed discussion, as does the following link: http://www.seqan.de/wp-content/uploads/downloads/2012/03/rabema-supp_mat2.pdf

1.11) [Dave]

The dotted lines in Fig 4 are hard to interpret. The paper text says that the line represents “mean relative difference,” and that it’s below 1% for simulated reads and loss rates between 0 and 10%. But I don’t see how the reader could know this by looking at the plot. Is there a missing axis label on the right-hand side?

It seems that the right-hand label was cropped during a manual PDF-to-EPS conversion. We fixed this in the resubmission.

1.12) [Dave]

The discussion of how the approach of Hyyro was refined for RazerS 3 gets confusing; for me, specifically, the sentence beginning “The banded variant proposed in ...” is where I stop being able to follow the discussion. Also, it’s not clear what exactly is meant by “clipping,” and the nature of the “preprocessing” is also not clear. The authors should clarify and/or consider moving some of the more detailed discussion to the supplement.

We clarified the point.

Reviewer: 2

Comments to the Author

Weese et al. present a novel variant of their read mapper RazerS. To my understanding the major changes compared to the previously published tool are a "pigeonhole" filter and a banded bit-vector algorithm similar to a variant of Myers' bit-vector algorithm proposed by Hyyrö. The authors show that RazerS 3 is competitive to BWA and Bowtie 2. The manuscript gives a lot of details on the algorithmic base of RazerS 3 but seems to lack some important information.

2a) Major Points

2.1) [Manuel]

In figure S3 the comparison of the banded and unbanded Myers algorithm is unclear to me. It would be good to see how the banded variant performs against the unbanded one directly. Table S1 suggests that the verification time of the pigeonhole filter (significantly more candidates!) is improved by a factor of 6 using the banded version. See comment below.

We added data for one thread in Table S1. Furthermore, we updated Section S2 with an appropriate discussion to clarify this point further.

2.2) [Dave]

As for the banded bit vector algorithm proposed here I would kindly like to disagree with the statement made on page 3: "However, the read length is limited to 64 bp, the number of bits in a CPU word. For longer reads bit-vectors and operations can be emulated using multiple words resulting in a performance drop of factor 3–4 for reads of length 65 bp." In fact, the calculation of subsequent machine words is only necessary if score <= k (supplemental algorithm) for the first machine word. In the context of read matching to a reference genome the calculation of the second machine word is only necessary if there are no more than k mismatches and indels after matching 64(!) characters. For small k (the authors use low error rates in their tests) the computation of the second block (characters 65-128) is necessary only in few cases. It is unclear to me if the unbanded Myers was actually block-based or not.

A good remark by the reviewer. However, how many blocks need to be computed does in general not only depend on the scores in the first block. A generalization of the reviewer's idea is the so-called Ukkonen trick [Ukkonen, E. (1985). Finding approximate patterns in strings. J. Algorithms, 6(1), 132–137.], which reduces the running time from O(mn) to O(kn) on average by computing the DP matrix column-wise from the topmost cell down to the last active cell (the last cell in the column with score <= k). This idea was already implemented in our unbanded Myers algorithm for large patterns to improve its running time by computing only the first words, up to the word that contains the last active cell. We switch between both variants (single-word and multi-word) depending on the length of the reads (<= 64 or > 64). As the effective performance drop is influenced by the specificity of the preceding filter, we removed the absolute numbers and clarified the paragraph.
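For illustration, here is a minimal scalar (non-bit-parallel) sketch of the cutoff idea; it is not the RazerS 3/SeqAn implementation, and the function and variable names are ours. It verifies whether a pattern occurs somewhere in a text with at most k errors, computing each DP column only down to the last active cell.

// Minimal scalar sketch of Ukkonen's cutoff (not the bit-parallel RazerS 3 code):
// semi-global edit distance verification, computing each DP column only down to
// the last active cell (the last row with score <= k).
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

bool verifyWithCutoff(const std::string &pattern, const std::string &text, int k) {
    int m = static_cast<int>(pattern.size());
    if (m <= k)
        return true;                              // trivially within the error budget
    std::vector<int> col(m + 1);
    for (int i = 0; i <= m; ++i)
        col[i] = i;                               // column 0: D[i][0] = i
    int lact = k;                                 // last active cell of column 0
    for (std::size_t j = 1; j <= text.size(); ++j) {
        int diag = col[0];                        // D[0][j-1]
        col[0] = 0;                               // match may start at any text position
        int limit = std::min(lact + 1, m);
        for (int i = 1; i <= limit; ++i) {
            int up = col[i];                      // D[i][j-1], still the old column value
            col[i] = std::min({ up + 1,                                    // gap in pattern
                                col[i - 1] + 1,                            // gap in text
                                diag + (pattern[i - 1] != text[j - 1]) }); // (mis)match
            diag = up;
        }
        if (limit < m)
            col[limit + 1] = k + 1;               // safe proxy for the uncomputed cell below
        lact = limit;                             // recompute the last active cell
        while (lact > 0 && col[lact] > k)
            --lact;
        if (lact == m)
            return true;                          // pattern ends here with <= k errors
    }
    return false;
}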

2.3) [Manuel]

In the supplement the authors give a space complexity of O(n ⋅ (m − max Q)) when using SWIFT filtering, i.e. the memory requirement increases linearly with the number of reads. The authors state that for 10 million reads the memory requirement is 10GB. Current data sets, however, can be almost an order of magnitude bigger. It seems important to me to show how much memory is actually used when using sets of reads with 75M-130M reads.

The referee is correct in that the space consumption grows linearly. We added results and a discussion about this in the supplement S6.

2.4) [Manuel]

In Table 4 the overall percentage of mapped reads (for Drosophila) is astonishingly low. In the text the authors state that they have filtered the output to a max error rate of 4%. This seems rather inappropriate to me. Especially so because - as the authors obviously have experienced themselves during the computation of Table 4 - 454 reads may show significantly higher error rates due to low quality ends. I would kindly like to suggest to remove this filter. Since the output of the tools was filtered, it can not be ruled out that BWA-SW or Bowtie 2 in fact used the majority of the time to map the more erroneous reads. This makes the given time measures hard to interpret. The authors state that filtering with 4% error rate "[...] also removes some local alignments where large portions of the read are clipped [...]". In order to overcome the problem of such local alignments a semi-global realignment of Bowtie2 and BWA matches at the reported positions could be necessary. To compare the tools it might be appropriate to show the aligners performance at given error rates.

We substantially redid our evaluation. In general, we now report both unfiltered and filtered results (see above) and added some more benchmarks to the evaluation in the spirit of SHRiMP 2 and Bowtie 2.

Regarding the Drosophila reads: We contacted the author of the original paper using this data set and he stated that they were probably the first paired reads outside of Illumina, thus containing many errors. The author wrote that current reads exhibit much better quality.

When collecting the data sets for the original manuscript, these were the only 100bp genomic Drosophila Illumina reads we could find. We have now found an additional, better data set and have replaced the low-quality data set with the new one. The numbers are now more comparable to the ones for the other organisms.

2.5) [Manuel]

To evaluate the performance of RazerS 3 on real data sets the authors only use chromosome 2 (Tables 2, 3, 4). This, with all due respect, seems inappropriate to me. I don't see any reason to restrict such an evaluation to just one chromosome and I would like to suggest to extend the evaluation to a full mammalian genome.

The reviewer's point is valid. We changed the Single-End and Paired-End Real-World data to align against the whole human genome. See the updates in Section 3.5 and our comments in the new experimental section.

Again, the authors state that they have sampled only 100k reads. I believe it is necessary to extend the performance analysis to a full set of reads (eg. a HiSeq2000 lane) and a full mammalian chromosome to give the reader an impression of the real-life performance.

The sampling is only performed for the Rabema benchmark, and we now sample over the whole human genome. Sampling uniformly from the read set gives a good representation of the results and will not change with a larger number of reads. Because of time constraints on the whole human genome we stuck with 100k reads. For the real-world data sets, however, we used 10M reads already in the original submission and have increased the number of reads further, as can be seen in the updates to Section 3. Running times for larger data sets can be estimated by splitting the reads into multiple packages and mapping them independently, as indicated in the updated manuscript.

In addition, we refer to our new experimental section, which contains large data sets.

TODO(Manuel): Update Section 3 with the data after we collected it.

2b) Minor Points

2.6) [Manuel]

In the caption of Table 2 it should be noted which error rates were used for the simulated reads of m=400 and m=800.

We are using default Mason settings. A clarifying sentence has been added in Section 3.5.

2.7) [Manuel]

The frequent segfaults with mrFAST and Hobbes are quite striking. Have the authors of these tools been contacted?

As the reviewer suggested, we have contacted the authors of Hobbes and mrFAST. A new version of mrFAST has been released by Can Alkan, which is now used, as indicated in the updated Section S5.

For Hobbes, one problem was that the program has to be compiled on the exact machine it is run on. This resolved some but not all problems.

Hobbes is not able to align reads longer than 100 bp, as confirmed by its authors. More problems appeared for Hobbes when switching to the cluster and 12 threads: Hobbes kept crashing. These issues could not be resolved by the day of resubmission.

2.8) [Dave]

I am not aware of the level of parallelization of each tool. Hence, I would like to suggest to give real time AND cpu time for each comparison.

We added tables with such times in the supplement. However, we want to remark that the "SYSTEM + USER" time (i.e. the CPU time) is not a very good measure: First, parallel programs are usually run with a fixed number of threads T, and T cores are allocated exclusively to the program. In parallel computing, one normally only considers wall-clock time, since the CPU time of well parallelized programs increases due to synchronization, while it increases less for programs with multiple sequential sections. Second, I/O time is not considered.

Also note that we were not able to record the CPU time exactly in a straightforward fashion for mrFAST because of the manual parallelization that has to be employed.

Reviewer: 3

Comments to the Author

The paper describes a mapping method, RazerS 3, which is faster than its predecessor RazerS and supports longer reads with a higher error threshold. The novelties are multi-threading, implementation of the pigeonhole principle, and implementation of the banded Myers bit-vector algorithm to speed up the verification process.

Revisions:

3.1) [Manuel]

The result section lacks a subsection about the memory footprint of all the aligners that have been used.

We added memory footprints of all aligners.

3.2) [Manuel]

Aligner version/parameters that have been used in the experiments should be moved into the paper rather than supplementary section.

We also think that the exact parametrization of the tools should be clearly documented; many papers are missing this information. However, because of space constraints, we feel that the supplement is the best place for the detailed description.

3.3) [Manuel]

What is the explanation for the poor performance (% mapped reads) of BWA-SW on SRX016071 on D. melanogaster (Table 3)?

We completely changed our evaluation section, which does not contain 454 reads anymore (see point 4 of reviewer 2). The performance difference was mostly due to filtering the results according to edit distance.

3.4) [Manuel+Dave ]

(Comment) In paired end experiments, all the concordant reads are being filtered and suggested to use a variation pipeline to evaluate the mapping. Discordant reads can also be used to do structural variation discovery.

For paired-end read mapping we chose to compare only concordant reads, because most mappers only support those. Of course, the reviewer is correct that discordant reads are useful for variant detection; such mappings are addressed by other tools from our group. We also refer to our new evaluation section.

3.5) [Dave+Manuel]

Is RazerS a best mapper or an all mapper? The authors have tuned other aligners to report multiple locations per read. I believe this also means that RazerS 3 is not considering the quality scores. This should be clearly stated in the paper.

RazerS is an all-mapper, which of course allows limiting the output according to certain useful criteria. We added a clarification to the paper and distinguish between all-mappers and best-mappers in the results section.

TODO: Actually put this into the paper..

Some thoughts about improvements [Dave]

1. If RazerS 3's memory consumption is that high only due to the large amount of matches we keep, we should use "-m 1" (if not already done) and additionally could add a String<uint8_t> that stores for every read which edit distance new alignments need to have. We decrease these numbers regularly in every call of compactMatches (in setMaxErrors). Right now, we only turn off reads completely in PH mode (setMaxErrors(-1)), but using that String we could keep out lots of suboptimal matches much earlier.
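A minimal sketch of this idea, assuming best-match ("-m 1") semantics; this is illustrative C++ with our own names (PerReadErrorBudget, recordMatch, worthVerifying), not actual RazerS 3/SeqAn code, and it uses a signed type with -1 as the "read turned off" sentinel instead of the String<uint8_t> mentioned above.

// Hypothetical illustration: keep one maximal allowed error count per read and
// tighten it whenever a better match is found, so weaker candidates can be
// skipped before verification.
#include <cstddef>
#include <cstdint>
#include <vector>

struct PerReadErrorBudget {
    // maxErrors[readId]: maximal error count a new alignment may still have;
    // -1 means the read is turned off completely (cf. setMaxErrors(-1) above).
    std::vector<int8_t> maxErrors;

    PerReadErrorBudget(std::size_t numReads, int8_t initialMaxErrors)
        : maxErrors(numReads, initialMaxErrors) {}

    // Tighten the budget after a match with 'errors' errors was found for 'readId'
    // (best-match semantics: only strictly better alignments remain interesting).
    void recordMatch(std::size_t readId, int8_t errors) {
        int8_t newBudget = (errors == 0) ? int8_t(-1) : int8_t(errors - 1);
        if (newBudget < maxErrors[readId])
            maxErrors[readId] = newBudget;
    }

    // Skip candidates whose error lower bound already exceeds the budget.
    bool worthVerifying(std::size_t readId, int8_t minErrors) const {
        return maxErrors[readId] >= 0 && minErrors <= maxErrors[readId];
    }
};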

2. Maybe we can turn off the post-alignment filtering and delete/turn off quality values with the following options. That would ease the discussion of whether it is fair to filter alignments that quality-based tools consider to be better.

  • Bowtie 2
    • has a --ignore-quals option to ignore quality values --> if we use it we could weaken their argument "Bowtie2 prefers alignments with more than 4% errors due to low qualities"
    • --end-to-end turns off soft-clipping. Together with "--mp 1,1 --np 1 --rdg 0,1 --rfg 0,1 --score-min L,0,-0.04" this corresponds to a 4% edit distance alignment mode
  • Bowtie
    • -v 3 --best
  • BWA
    • -n 4

3. In SeqAn, clear() = reserve(x, 0, Exact()) doesn't release memory (Andreas' policy was to only enlarge sequences). That means that contigs we have already scanned remain in memory. Fixing this could save us up to 3 gigabytes.
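As a generic illustration of the fix (using std::vector rather than SeqAn strings, so not the actual SeqAn API), the usual way to really return memory to the allocator is to swap with an empty container, or shrink to fit, after clearing:

// Generic illustration: clear() alone keeps the capacity; swapping with an
// empty temporary (or shrink_to_fit) actually frees the contig's memory.
#include <vector>

void releaseContig(std::vector<char> &contig) {
    contig.clear();                       // size 0, but capacity (memory) is kept
    std::vector<char>().swap(contig);     // swap with empty vector: capacity freed
    // C++11 alternative: contig.shrink_to_fit();  (non-binding request)
}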