# Page ForumSpace

Thsi page will be the discussion forum.

## Lecture 1

REINERT (30.4): Hier könnte eine ausgearbeitetere Version der Vorlesungsabfolge stehen. Oder im SVN. Momentan kann ich dann keine Kommentare geben.

• 1.1 Motivation
• 1.2 Pipeline
• 3 Bowtie
• 3.1 Backtracking (in general)
• 3.2 Suffix Arrays
• 3.3 BWT
• Compression transformation
• Cyclic permutation
• Transformation output
• Decompression transformation
• Last to front mapping
• 3.4 Exactmatch
• Overline: What do we have? What do we want to do?
• FM-Index
• Short idea of BWT compression
• Backward-search algorithm
• Definition of C and Occ (with examples)
• Algorithm of searching a pattern of length=1
• Searching a longer pattern
• Algorithm to locate P in T
• 3.5 Backtracking of Bowtie

(Duc, Nicolas und Corinna)

### Fragen zu lecture 1

F: Wenn ich ein Extrasymbol wie '#' anhänge, wird dann der Index I, den ich als BWT Output und auch für den Input zur Rücktransformation brauche, nicht trivial?
A: # lässt sich zumindest in F trivial finden. Beim Entpacken von hinten nach vorn wird dort auch angefangen, also braucht man I nicht extra zu speichern. -- DavidWeese - 04 May 2010

F: Indem ich L nach der Position von # scanne, weiß ich automat. mein I. Oder soll dies Performance steigernd sein?
A: Beim Entpacken von vorn nach hinten braucht man ja ein F-to-L mapping, damit findet man ja auch gleich das # in L, I oder Scannen ist dort also auch nicht nötig. -- DavidWeese - 04 May 2010:

F: Umgekehrt, ist '#' nötig, nur um bei der Rücktransformation in keinen Loop zu kommen? Das kann ich auch herausfinden, wenn ich weiß, dass der Eintrag F[I] erreicht wurde.
A: Probiert mal die BWT von abab ohne angehängtes # zu entpacken. -- DavidWeese - 04 May 2010:

### Assignments lecture 1

Slides of lecture 1 (13 May 2010 5:26p.m.)

Task 1: Compute the BWT for S=ACGCACGTACGG. APply the exact match algorithm of the lecture to find the number of occurrences of the Pattern P=ACG. In the lecture we will learn how to also find the positions in the text and extend the example.

Task 2: Compute all matches with exactly one mismatch of the Pattern P=AGG using the backtracking algorithm of the lecture. (It will be given on Wednesday, nevertheless prepare ALL for it).

#### Slides

-- DavidWeese - 06 May 2010, 11pm:
• S41: "abracadabra" has 11 suffixes, only after appending # it has 12
• S51: "characters in L occur in same order as in F (rank preserving)" is not what rank/order preserving means
• S57: "every column in matrix M is a permutation of string S", in fact it is a permutation of S# (according to your definition of S)
• S59: "T[i] is position of the character F[i] in L" - No. F=F=i. So I would expect T=T after reading this sentence.
• S67: "that connect"
• S69: "the number of P in T" -> "the number of occurrences of P in T"
• S77: "pos(i) = C[L[i]] + Occ(L[i], i)" you use pos( ) with a different definition later
• S78: "All suffixes prefixed by P occupies in a set of rows" should be "All suffixes prefixed by P occur in a consecutive set of rows"
• S98: "It’s possible to find row(i) indirectly" - what is row(i), define or omit
• I don't understand slides 98-99:
• I thought pos(j) is unknown or do you describe the preprocessing alg. that marks positions in T?
• successor(c)?
• "Determine L[j]: Compare Occ(c, j) with Occ(c,j-i) for every c ∈ Σ ∪ {#}" - this is only necessary if L is compressed
• "pos(i)=C[L[j]] + Occ(L[j], i)" seems wrong to me, "C[L[j]] + Occ(L[j], i)"" is the result of the L->F mapping of i
• Backward_step(x) is simply the result of the L->F mapping of x, it should be explicitly pointed out
• S101: "use Backward_step(i)" and "repeat t times" could be misleading, as you set "j=i" and actually repeat "j=Backward_step(j)" t times
• (hint) you could introduce with "pos(Backward_step(i)) = pos(i) - 1"
• (minor) S103: "Last = 10": You want to calculate pos(i) for i=9,..,10 so "i=10" would be more intuitive here
• (minor) S105: same as above, "First = 9" is correct, but I would prefer "i=9"
• S105: "Finding ssi"
• S107: Shouldn't the red arrow point to [s|7]?
• S111: Why Backward_step(4)? It should be Backward_step(11)=4.
• S119: The ↑ arrows are maybe unusual, you could use "The higher the ..., the higher the ..." or "more ... results in ..."

BLASSE: Folien wurden überarbeitet

#### Script

-- DavidWeese - 19 May 2010

Definitionen
• p10: "Definition" Definitions
• p10: "lexicographically sorted" -> "ordered"
• p10: "The character P[i] is the..." -> "P[i] denotes the ..."
• p10: You not only need prefixes, "P[i,j] is a pattern of symbols from the i-th to the j-th position in P"
• p11: Introduce \$ first and for both suffix array, see the SA definition in previous P4 lectures
BWT
• p12: "an special"
• p12: you already introduced \$ as a character smaller than all characters of Σ, why introducing # with the same properties?
• p12: "M whose rows are cyclic shifts of S" - either of S\$ or of S' with S'=S\$
• p12: "column ... are ..."
• p12: "position of character c in string X" - what is the position of character c= `s`? There are 4 occurrences of `s` in L and F. You have to bring characters of the same origin in T into relation. Either use cyclic shifts or the already mentioned "predecessor of" relation.
• p12: What if c1= `m` and c2= `\$`?
• p12: "of given string"
• p12: Use the latex theorem block for the "rank preserving property" and the latex proof for its proof.
• p12: "There are regions in F with c" - inexact, they occur in one substring/block
• p13: If I=5 your row counting seems to begin with 0, you should say that
• p14: "c1andc2" and wrong formulation of rank preserving property
• p14: "Advantages of BWT" - advantages are relative, compared to what? Better use "Properties..."
• p14: the first paragraph of 6.2 could be shortened
• p14: "it is reversible algorithm"
• p14: cite "Move to Front algorithm", "Run-Length encoding"
• p15: "..algorithm is the string ..."
• p15: instead of writing "Σ ∪ special character" multiple times introduce Σ'=Σ ∪ {\$}
• p15: "character in L corresponding to character F[i]" - the same here, corresponding is undefined, first define corresponding as stemming from the same text position
• p15: How do you get the row beginning with \$ and ending with m (this is definitely not row I)? In this example its coincidence that L[I=5] = m, because your string begins with 1 and I is counted from 0.
• p15: "successor character of S" -> "in S"
• p15: "the i character"? better: "The character corresponding to T[i] in L ...."
• p16: "the rank of c" is inexact. It can only be used to relate the corresponding cyclic rotations ending with the same character. Leave it out.
• p16: "f you do not use" -> "Without"
• p16: "If you construct.. you can use" - colloquial, omit "you", better: "The BWT can be used..."
kleine Änderungen beim L-to-F mapping,
• p17: "to to"
• p17: (minor) "C[L[i]] is the number of lexicographically smaller characters before the 􏰁rst L[i] in F occurs." shorter: "C[L[i]] is the position before the first L[i] in F."
• p17: ".. calculate the first possible position ... add the number of L[i]s which occurred before the L[i]" not exactly how defined it. Its the position before the first L[i] and you add j if L[i] is the j-th occurrence in L.
Exactmatch, Backward_search, Anfang von Backward_search]
• p18: "Bowtie used" You haven't introduced Bowtie. Remove or rephrase.
• p18: "Here you have the challenge" sounds strange
• p18: "a index.."
• p18: the most significant advantage is: Exactmatch is O(|p|) (without compression of Occ)
• p18: "For the performance it just needs" - What performance?
• p18: "occupies in"
• p18: "Because of the rank preserving property it is possible to calculate the first and last position of this set." Why that?
• p19: Please consider rewriting. Most of the sentences are hard to understand. Please avoid "you" (if there's no other way use "we").
• Try to precisely formulate each sentence in german first then translate
• Read the sentence multiple times and check for exactness and unambiguity.
• Let your group mates review the text.
• p19: You could move 4) between 1) and 2) and use j instead of j-1 in 2). Additionally you need 5) "Repeat 2)-4) until j=1"
• p20: The output is actually only First, Last (not number of occ.)
• p20: "Initialisierung"
• p20: "return no rows..." better: "output/print "no rows..."" and always return <First,Last>
• p20: "The ... Algorithm runs..." -> "algorithm"
• p20: "Because of space limits it isn't possible...." What limits, who sets them? Better: "To reduce the space consumption it is possible to..."
• p20: "isn't" -> "is not"
• p20: what is the position of a row in the T? You actually mean the originating position of F[i] the first character of row i.
• p21: "pos(1)-1" -> "pos_T(1)-1"
• p21: "Determine L[j]" again, Occ not L is compressed
• p21: "Extension of the BWT" section empty
• p21: Backtracking should be explained
• ...

To all sections
• p*: check and replace all ?? and cite all non-standard terms/algorithms

## Lecture 2

REINERT: Dann haben wir zwei Vorlesungen über read mapping. Ist das gewünscht? (siehe auch unten)
• RazerS
• Filtering:
• Pigeonhole
• q-gram

(Hannes, Sebastian)

#### Slides

-- DavidWeese - 25 May 2010

• please turn on page numbers in the slides (easier to refer to)
• s13: "exponential runtime increasing" -> "exponential incease in running time"
• s14: "easier" is not correct, it should be faster. A filter only makes sense if the work load of subsequent steps and also the overall running time is reduced.
• s15: "parts ... does"
• s15: "contain possible", "be match" - quantifiers missing
• s16: "no guaranty to lose any significant match" - so the filter is/can be lossy? what is significant?
• s19: "fast" is relative. faster than what?
• s19: "improve" - compared to what?
• s21: "at least one set has no object" - you should say that each object is contained in one set
• s25: one could think you divide the pattern into pieces p1,...,pn and the text into t1,...,tx. Omit p1,...,pn for single characters, this is misleading. Better use |P|=m or "... of length m"
• s25: "one of the pattern pieces has to match without error" - Match with what and why? You didn't mention that there is an occurrence with <=k errors. And maybe it is easier to start with something like: T differs from P with at most k errors.
• s26: you need the generic pigeonhole lemma on s41 the first time. Try to rearrange
• s29: is very similar to s25
• s30: "p_i[start : end]" what's that? What is start and end? You use i_end and i_start later. I suggested to use positions s_i and e_i for pattern piece p_i
• s30: "matches at position t[...]" - t[...] is not a position, it's a substring
• s32: "before t_j" -> t[j] or better: position j in t
• s35: "text area" -> "text area is"
• s36: you only have one text, "text1 (t1)" -> "text"
• s40: "many unnecessary verification" and "many verifications are unsuccessful" - is somehow the partially same, as they are unnecessary if they are unsuccessful.
• s42: my pdf viewer (Mac, TexShop) doesn't display the arrows/lines(?) between the tree nodes.
• s48: How do you partition the error rate in the unbalanced case
• s50: Does Quasar really compute approx. local matches (including BLAST) or only potential approximate matches (w/o BLAST, I'm not sure)
• s51: "with window length" undefined. Better: |Q'|=w (or |D'| depending on your algorithm)
• s56: "devide"
• s56: So if only one read reaches the threshold of a block a you have to verify all reads in this block?
• s60: Maybe to much text on this slide.
• s64: "higer"
• s69: You should consider skipping Minimum Coverage to spare some time
• s69: increases the filter specificity

#### Script

-- DavidWeese - 02 Jun 2010: Revision 6452: Comments to the proof on p39:
• (1) should be omitted, is given in the assumption.
• Use more connecting words.
• before (2): What is k_i? "Let k_i be the number of errors in p^i. Then the following holds: ..."
• "order to show ... assumption" can be written as "We assume"
• The proof is wrong, as you argue: "Assume the Lemma is wrong. This is a contradiction to the Lemma. So it holds.".
• (4) is wrong. The indirect assumption is "For all i=1,..,j holds k_i>floor((a_i*k)/A)." The k_i are integers so for all i holds: k_i >= floor(a_i*k/A) + 1 > a_i*k/A. Thus, sum over k_i>=sum over a_i*k/A. -> k > k (contradiction)

## Lecture 3

REINERT: Passender wäre vielleicht die Arbeiten von Tammi und Kececioglu zusammen (wie im repeat resolution script) (TopHat und spliced mapping könnte man mit filter prinzipien in lecture 2 machen?)
• Splicing:
• Algorithmus von Kececioglu

(Duc, Nicolas und Corinna) pdf 1Lecture.pdf manage 6 MB 13 May 2010 - 15:18 UnknownUser Slides of lecture 1 pdf project_outline_thieme.pdf manage 65 K 04 May 2010 - 15:07 UnknownUser