Page ForumSpace

Thsi page will be the discussion forum.

Lecture 1

REINERT (30.4): Hier könnte eine ausgearbeitetere Version der Vorlesungsabfolge stehen. Oder im SVN. Momentan kann ich dann keine Kommentare geben.

Read Mapping

  • 1 Review about RNA-Seq
    • 1.1 Motivation
    • 1.2 Pipeline
  • 2 Outline about Read Mapping
  • 3 Bowtie
    • 3.1 Backtracking (in general)
    • 3.2 Suffix Arrays
    • 3.3 BWT
      • Compression transformation
        • Cyclic permutation
        • Transformation output
      • Decompression transformation
        • Last to front mapping
      • BWT's advantages
    • 3.4 Exactmatch
      • Overline: What do we have? What do we want to do?
      • FM-Index
      • Short idea of BWT compression
      • Backward-search algorithm
        • Definition of C and Occ (with examples)
        • Algorithm of searching a pattern of length=1
        • Searching a longer pattern
      • Algorithm to locate P in T
    • 3.5 Backtracking of Bowtie

(Duc, Nicolas und Corinna)

Fragen zu lecture 1

F: Wenn ich ein Extrasymbol wie '#' anhänge, wird dann der Index I, den ich als BWT Output und auch für den Input zur Rücktransformation brauche, nicht trivial?
A: # lässt sich zumindest in F trivial finden. Beim Entpacken von hinten nach vorn wird dort auch angefangen, also braucht man I nicht extra zu speichern. -- DavidWeese - 04 May 2010

F: Indem ich L nach der Position von # scanne, weiß ich automat. mein I. Oder soll dies Performance steigernd sein?
A: Beim Entpacken von vorn nach hinten braucht man ja ein F-to-L mapping, damit findet man ja auch gleich das # in L, I oder Scannen ist dort also auch nicht nötig. -- DavidWeese - 04 May 2010:

F: Umgekehrt, ist '#' nötig, nur um bei der Rücktransformation in keinen Loop zu kommen? Das kann ich auch herausfinden, wenn ich weiß, dass der Eintrag F[I] erreicht wurde.
A: Probiert mal die BWT von abab ohne angehängtes # zu entpacken. -- DavidWeese - 04 May 2010:

Assignments lecture 1

Slides of lecture 1 (13 May 2010 5:26p.m.)

Task 1: Compute the BWT for S=ACGCACGTACGG. APply the exact match algorithm of the lecture to find the number of occurrences of the Pattern P=ACG. In the lecture we will learn how to also find the positions in the text and extend the example.

Task 2: Compute all matches with exactly one mismatch of the Pattern P=AGG using the backtracking algorithm of the lecture. (It will be given on Wednesday, nevertheless prepare ALL for it).

Comments to Lecture 1

Slides

Here are some additional comments to the slides:

-- DavidWeese - 06 May 2010, 11pm:
  • S41: "abracadabra" has 11 suffixes, only after appending # it has 12
  • S51: "characters in L occur in same order as in F (rank preserving)" is not what rank/order preserving means
  • S57: "every column in matrix M is a permutation of string S", in fact it is a permutation of S# (according to your definition of S)
  • S59: "T[i] is position of the character F[i] in L" - No. F[2]=F[3]=i. So I would expect T[2]=T[3] after reading this sentence.
  • S67: "that connect"
  • S69: "the number of P in T" -> "the number of occurrences of P in T"
  • S77: "pos(i) = C[L[i]] + Occ(L[i], i)" you use pos( ) with a different definition later
  • S78: "All suffixes prefixed by P occupies in a set of rows" should be "All suffixes prefixed by P occur in a consecutive set of rows"
  • S98: "It’s possible to find row(i) indirectly" - what is row(i), define or omit
  • I don't understand slides 98-99:
    • I thought pos(j) is unknown or do you describe the preprocessing alg. that marks positions in T?
    • successor(c)?
    • "Determine L[j]: Compare Occ(c, j) with Occ(c,j-i) for every c ∈ Σ ∪ {#}" - this is only necessary if L is compressed
    • "pos(i)=C[L[j]] + Occ(L[j], i)" seems wrong to me, "C[L[j]] + Occ(L[j], i)"" is the result of the L->F mapping of i
    • Backward_step(x) is simply the result of the L->F mapping of x, it should be explicitly pointed out
  • S101: "use Backward_step(i)" and "repeat t times" could be misleading, as you set "j=i" and actually repeat "j=Backward_step(j)" t times
  • (hint) you could introduce with "pos(Backward_step(i)) = pos(i) - 1"
  • (minor) S103: "Last = 10": You want to calculate pos(i) for i=9,..,10 so "i=10" would be more intuitive here
  • (minor) S105: same as above, "First = 9" is correct, but I would prefer "i=9"
  • S105: "Finding ssi"
  • S107: Shouldn't the red arrow point to [s|7]?
  • S111: Why Backward_step(4)? It should be Backward_step(11)=4.
  • S119: The ↑ arrows are maybe unusual, you could use "The higher the ..., the higher the ..." or "more ... results in ..."

BLASSE: Folien wurden überarbeitet

Script

-- DavidWeese - 19 May 2010

My comments to revision 6251:

Definitionen
  • p10: "Definition" Definitions
  • p10: "lexicographically sorted" -> "ordered"
  • p10: "The character P[i] is the..." -> "P[i] denotes the ..."
  • p10: You not only need prefixes, "P[i,j] is a pattern of symbols from the i-th to the j-th position in P"
  • p11: Introduce $ first and for both suffix array, see the SA definition in previous P4 lectures
BWT
  • p12: "an special"
  • p12: you already introduced $ as a character smaller than all characters of Σ, why introducing # with the same properties?
  • p12: "M whose rows are cyclic shifts of S" - either of S$ or of S' with S'=S$
  • p12: "column ... are ..."
  • p12: "position of character c in string X" - what is the position of character c= s? There are 4 occurrences of s in L and F. You have to bring characters of the same origin in T into relation. Either use cyclic shifts or the already mentioned "predecessor of" relation.
  • p12: What if c1= m and c2= $?
  • p12: "of given string"
  • p12: Use the latex theorem block for the "rank preserving property" and the latex proof for its proof.
  • p12: "There are regions in F with c" - inexact, they occur in one substring/block
  • p13: If I=5 your row counting seems to begin with 0, you should say that
  • p14: "c1andc2" and wrong formulation of rank preserving property
  • p14: "Advantages of BWT" - advantages are relative, compared to what? Better use "Properties..."
  • p14: the first paragraph of 6.2 could be shortened
  • p14: "it is reversible algorithm"
  • p14: cite "Move to Front algorithm", "Run-Length encoding"
  • p15: "..algorithm is the string ..."
  • p15: instead of writing "Σ ∪ special character" multiple times introduce Σ'=Σ ∪ {$}
  • p15: "character in L corresponding to character F[i]" - the same here, corresponding is undefined, first define corresponding as stemming from the same text position
  • p15: How do you get the row beginning with $ and ending with m (this is definitely not row I)? In this example its coincidence that L[I=5] = m, because your string begins with 1 and I is counted from 0.
  • p15: "successor character of S" -> "in S"
  • p15: "the i character"? better: "The character corresponding to T[i] in L ...."
  • p16: "the rank of c" is inexact. It can only be used to relate the corresponding cyclic rotations ending with the same character. Leave it out.
  • p16: "f you do not use" -> "Without"
  • p16: "If you construct.. you can use" - colloquial, omit "you", better: "The BWT can be used..."
kleine Änderungen beim L-to-F mapping,
  • p17: "to to"
  • p17: (minor) "C[L[i]] is the number of lexicographically smaller characters before the 􏰁rst L[i] in F occurs." shorter: "C[L[i]] is the position before the first L[i] in F."
  • p17: ".. calculate the first possible position ... add the number of L[i]s which occurred before the L[i]" not exactly how defined it. Its the position before the first L[i] and you add j if L[i] is the j-th occurrence in L.
Exactmatch, Backward_search, Anfang von Backward_search]
  • p18: "Bowtie used" You haven't introduced Bowtie. Remove or rephrase.
  • p18: "Here you have the challenge" sounds strange
  • p18: "a index.."
  • p18: the most significant advantage is: Exactmatch is O(|p|) (without compression of Occ)
  • p18: "For the performance it just needs" - What performance?
  • p18: "occupies in"
  • p18: "Because of the rank preserving property it is possible to calculate the first and last position of this set." Why that?
  • p19: Please consider rewriting. Most of the sentences are hard to understand. Please avoid "you" (if there's no other way use "we").
    • Try to precisely formulate each sentence in german first then translate
    • Read the sentence multiple times and check for exactness and unambiguity.
    • Let your group mates review the text.
  • p19: You could move 4) between 1) and 2) and use j instead of j-1 in 2). Additionally you need 5) "Repeat 2)-4) until j=1"
  • p20: The output is actually only First, Last (not number of occ.)
  • p20: "Initialisierung"
  • p20: "return no rows..." better: "output/print "no rows..."" and always return <First,Last>
  • p20: "The ... Algorithm runs..." -> "algorithm"
  • p20: "Because of space limits it isn't possible...." What limits, who sets them? Better: "To reduce the space consumption it is possible to..."
  • p20: "isn't" -> "is not"
  • p20: what is the position of a row in the T? You actually mean the originating position of F[i] the first character of row i.
  • p21: "pos(1)-1" -> "pos_T(1)-1"
  • p21: "Determine L[j]" again, Occ not L is compressed
  • p21: "Extension of the BWT" section empty
  • p21: Backtracking should be explained
  • ...

To all sections
  • p*: check and replace all ?? and cite all non-standard terms/algorithms

Lecture 2

REINERT: Dann haben wir zwei Vorlesungen über read mapping. Ist das gewünscht? (siehe auch unten)
  • RazerS
  • Filtering:
    • Pigeonhole
    • q-gram

(Hannes, Sebastian)

Comments to Lecture 2

Slides

-- DavidWeese - 25 May 2010

My comments to revision 6323:
  • please turn on page numbers in the slides (easier to refer to)
  • s13: "exponential runtime increasing" -> "exponential incease in running time"
  • s14: "easier" is not correct, it should be faster. A filter only makes sense if the work load of subsequent steps and also the overall running time is reduced.
  • s15: "parts ... does"
  • s15: "contain possible", "be match" - quantifiers missing
  • s16: "no guaranty to lose any significant match" - so the filter is/can be lossy? what is significant?
  • s19: "fast" is relative. faster than what?
  • s19: "improve" - compared to what?
  • s21: "at least one set has no object" - you should say that each object is contained in one set
  • s25: one could think you divide the pattern into pieces p1,...,pn and the text into t1,...,tx. Omit p1,...,pn for single characters, this is misleading. Better use |P|=m or "... of length m"
  • s25: "one of the pattern pieces has to match without error" - Match with what and why? You didn't mention that there is an occurrence with <=k errors. And maybe it is easier to start with something like: T differs from P with at most k errors.
  • s26: you need the generic pigeonhole lemma on s41 the first time. Try to rearrange
  • s29: is very similar to s25
  • s30: "p_i[start : end]" what's that? What is start and end? You use i_end and i_start later. I suggested to use positions s_i and e_i for pattern piece p_i
  • s30: "matches at position t[...]" - t[...] is not a position, it's a substring
  • s32: "before t_j" -> t[j] or better: position j in t
  • s35: "text area" -> "text area is"
  • s36: you only have one text, "text1 (t1)" -> "text"
  • s40: "many unnecessary verification" and "many verifications are unsuccessful" - is somehow the partially same, as they are unnecessary if they are unsuccessful.
  • s42: my pdf viewer (Mac, TexShop) doesn't display the arrows/lines(?) between the tree nodes.
  • s48: How do you partition the error rate in the unbalanced case
  • s50: Does Quasar really compute approx. local matches (including BLAST) or only potential approximate matches (w/o BLAST, I'm not sure)
  • s51: "with window length" undefined. Better: |Q'|=w (or |D'| depending on your algorithm)
  • s56: "devide"
  • s56: So if only one read reaches the threshold of a block a you have to verify all reads in this block?
  • s60: Maybe to much text on this slide.
  • s64: "higer"
  • s69: You should consider skipping Minimum Coverage to spare some time
  • s69: increases the filter specificity

Script

-- DavidWeese - 02 Jun 2010: Revision 6452: Comments to the proof on p39:
  • (1) should be omitted, is given in the assumption.
  • Use more connecting words.
  • before (2): What is k_i? "Let k_i be the number of errors in p^i. Then the following holds: ..."
  • "order to show ... assumption" can be written as "We assume"
  • The proof is wrong, as you argue: "Assume the Lemma is wrong. This is a contradiction to the Lemma. So it holds.".
  • (4) is wrong. The indirect assumption is "For all i=1,..,j holds k_i>floor((a_i*k)/A)." The k_i are integers so for all i holds: k_i >= floor(a_i*k/A) + 1 > a_i*k/A. Thus, sum over k_i>=sum over a_i*k/A. -> k > k (contradiction)

Lecture 3

REINERT: Passender wäre vielleicht die Arbeiten von Tammi und Kececioglu zusammen (wie im repeat resolution script) (TopHat und spliced mapping könnte man mit filter prinzipien in lecture 2 machen?)
  • Splicing:
  • Multireads:
    • Algorithmus von Kececioglu

(Duc, Nicolas und Corinna)

Comments

 

Topic attachments
I Attachment Action Size Date Who Comment
1Lecture.pdfpdf 1Lecture.pdf manage 6 MB 13 May 2010 - 15:18 UnknownUser Slides of lecture 1
project_outline_thieme.pdfpdf project_outline_thieme.pdf manage 65 K 04 May 2010 - 15:07 UnknownUser  
Topic revision: r34 - 02 Jun 2010, DavidWeese
 
  • Printable version of this topic (p) Printable version of this topic (p)