You are here: Foswiki>ABI Web>LectureWiki>AdvancedAlgorithms>RnaSeqP4>ForumSpace (02 Jun 2010, DavidWeese)Edit Attach

Page ForumSpace

Thsi page will be the discussion forum.

Lecture 1
Lecture 2
- Comments to Lecture 2
  - Slides
  - Script
Lecture 3
Comments

Lecture 1

REINERT (30.4): Hier könnte eine ausgearbeitetere Version der Vorlesungsabfolge stehen. Oder im SVN. Momentan kann ich dann keine Kommentare geben.

Read Mapping

1 Review about RNA-Seq
- 1.1 Motivation
- 1.2 Pipeline
2 Outline about Read Mapping
3 Bowtie
- 3.1 Backtracking (in general)
- 3.2 Suffix Arrays
- 3.3 BWT
  - Compression transformation
    - Cyclic permutation
    - Transformation output
  - Decompression transformation
    - Last to front mapping
  - BWT's advantages
- 3.4 Exactmatch
  - Overline: What do we have? What do we want to do?
  - FM-Index
  - Short idea of BWT compression
  - Backward-search algorithm
    - Definition of C and Occ (with examples)
    - Algorithm of searching a pattern of length=1
    - Searching a longer pattern
  - Algorithm to locate P in T
- 3.5 Backtracking of Bowtie

(Duc, Nicolas und Corinna)

Fragen zu lecture 1

F: Wenn ich ein Extrasymbol wie '#' anhänge, wird dann der Index I, den ich als BWT Output und auch für den Input zur Rücktransformation brauche, nicht trivial?
A: # lässt sich zumindest in F trivial finden. Beim Entpacken von hinten nach vorn wird dort auch angefangen, also braucht man I nicht extra zu speichern. -- DavidWeese - 04 May 2010

F: Indem ich L nach der Position von # scanne, weiß ich automat. mein I. Oder soll dies Performance steigernd sein?
A: Beim Entpacken von vorn nach hinten braucht man ja ein F-to-L mapping, damit findet man ja auch gleich das # in L, I oder Scannen ist dort also auch nicht nötig. -- DavidWeese - 04 May 2010:

F: Umgekehrt, ist '#' nötig, nur um bei der Rücktransformation in keinen Loop zu kommen? Das kann ich auch herausfinden, wenn ich weiß, dass der Eintrag F[I] erreicht wurde.
A: Probiert mal die BWT von abab ohne angehängtes # zu entpacken. -- DavidWeese - 04 May 2010:

Assignments lecture 1

Slides of lecture 1 (13 May 2010 5:26p.m.)

Task 1: Compute the BWT for S=ACGCACGTACGG. APply the exact match algorithm of the lecture to find the number of occurrences of the Pattern P=ACG. In the lecture we will learn how to also find the positions in the text and extend the example.

Task 2: Compute all matches with exactly one mismatch of the Pattern P=AGG using the backtracking algorithm of the lecture. (It will be given on Wednesday, nevertheless prepare ALL for it).

Comments to Lecture 1

Slides

Here are some additional comments to the slides:

-- DavidWeese - 06 May 2010, 11pm:

S41: "abracadabra" has 11 suffixes, only after appending # it has 12
S51: "characters in L occur in same order as in F (rank preserving)" is not what rank/order preserving means
S57: "every column in matrix M is a permutation of string S", in fact it is a permutation of S# (according to your definition of S)
S59: "T[i] is position of the character F[i] in L" - No. F[2]=F[3]=i. So I would expect T[2]=T[3] after reading this sentence.
S67: "that connect"
S69: "the number of P in T" → "the number of occurrences of P in T"
S77: "pos(i) = C[L[i]] + Occ(L[i], i)" you use pos( ) with a different definition later
S78: "All suffixes prefixed by P occupies in a set of rows" should be "All suffixes prefixed by P occur in a consecutive set of rows"
S98: "It’s possible to find row(i) indirectly" - what is row(i), define or omit
I don't understand slides 98-99:
- I thought pos(j) is unknown or do you describe the preprocessing alg. that marks positions in T?
- successor(c)?
- "Determine L[j]: Compare Occ(c, j) with Occ(c,j-i) for every c ∈ Σ ∪ {#}" - this is only necessary if L is compressed
- "pos(i)=C[L[j]] + Occ(L[j], i)" seems wrong to me, "C[L[j]] + Occ(L[j], i)"" is the result of the L→F mapping of i
- Backward_step(x) is simply the result of the L→F mapping of x, it should be explicitly pointed out
S101: "use Backward_step(i)" and "repeat t times" could be misleading, as you set "j=i" and actually repeat "j=Backward_step(j)" t times
(hint) you could introduce with "pos(Backward_step(i)) = pos(i) - 1"
(minor) S103: "Last = 10": You want to calculate pos(i) for i=9,..,10 so "i=10" would be more intuitive here
(minor) S105: same as above, "First = 9" is correct, but I would prefer "i=9"
S105: "Finding ssi"
S107: Shouldn't the red arrow point to [s|7]?
S111: Why Backward_step(4)? It should be Backward_step(11)=4.
S119: The ↑ arrows are maybe unusual, you could use "The higher the …, the higher the ..." or "more … results in ..."

BLASSE: Folien wurden überarbeitet

Script

-- DavidWeese - 19 May 2010

My comments to revision 6251:

Definitionen

p10: "Definition" Definitions
p10: "lexicographically sorted" → "ordered"
p10: "The character P[i] is the..." → "P[i] denotes the ..."
p10: You not only need prefixes, "P[i,j] is a pattern of symbols from the i-th to the j-th position in P"
p11: Introduce $ first and for both suffix array, see the SA definition in previous P4 lectures

BWT

p12: "an special"
p12: you already introduced $ as a character smaller than all characters of Σ, why introducing # with the same properties?
p12: "M whose rows are cyclic shifts of S" - either of S$ or of S' with S'=S$
p12: "column … are ..."
p12: "position of character c in string X" - what is the position of character c= s? There are 4 occurrences of s in L and F. You have to bring characters of the same origin in T into relation. Either use cyclic shifts or the already mentioned "predecessor of" relation.
p12: What if c1= m and c2= $?
p12: "of given string"
p12: Use the latex theorem block for the "rank preserving property" and the latex proof for its proof.
p12: "There are regions in F with c" - inexact, they occur in one substring/block
p13: If I=5 your row counting seems to begin with 0, you should say that
p14: "c1andc2" and wrong formulation of rank preserving property
p14: "Advantages of BWT" - advantages are relative, compared to what? Better use "Properties..."
p14: the first paragraph of 6.2 could be shortened
p14: "it is reversible algorithm"
p14: cite "Move to Front algorithm", "Run-Length encoding"
p15: "..algorithm is the string ..."
p15: instead of writing "Σ ∪ special character" multiple times introduce Σ'=Σ ∪ {$}
p15: "character in L corresponding to character F[i]" - the same here, corresponding is undefined, first define corresponding as stemming from the same text position
p15: How do you get the row beginning with $ and ending with m (this is definitely not row I)? In this example its coincidence that L[I=5] = m, because your string begins with 1 and I is counted from 0.
p15: "successor character of S" → "in S"
p15: "the i character"? better: "The character corresponding to T[i] in L …."
p16: "the rank of c" is inexact. It can only be used to relate the corresponding cyclic rotations ending with the same character. Leave it out.
p16: "f you do not use" → "Without"
p16: "If you construct.. you can use" - colloquial, omit "you", better: "The BWT can be used..."

kleine Änderungen beim L-to-F mapping,

p17: "to to"
p17: (minor) "C[L[i]] is the number of lexicographically smaller characters before the 􏰁rst L[i] in F occurs." shorter: "C[L[i]] is the position before the first L[i] in F."
p17: ".. calculate the first possible position … add the number of L[i]s which occurred before the L[i]" not exactly how defined it. Its the position before the first L[i] and you add j if L[i] is the j-th occurrence in L.

Exactmatch, Backward_search, Anfang von Backward_search]

p18: "Bowtie used" You haven't introduced Bowtie. Remove or rephrase.
p18: "Here you have the challenge" sounds strange
p18: "a index.."
p18: the most significant advantage is: Exactmatch is O(|p|) (without compression of Occ)
p18: "For the performance it just needs" - What performance?
p18: "occupies in"
p18: "Because of the rank preserving property it is possible to calculate the first and last position of this set." Why that?
p19: Please consider rewriting. Most of the sentences are hard to understand. Please avoid "you" (if there's no other way use "we").
- Try to precisely formulate each sentence in german first then translate
- Read the sentence multiple times and check for exactness and unambiguity.
- Let your group mates review the text.
p19: You could move 4) between 1) and 2) and use j instead of j-1 in 2). Additionally you need 5) "Repeat 2)-4) until j=1"
p20: The output is actually only First, Last (not number of occ.)
p20: "Initialisierung"
p20: "return no rows..." better: "output/print "no rows..."" and always return <First,Last>
p20: "The … Algorithm runs..." → "algorithm"
p20: "Because of space limits it isn't possible...." What limits, who sets them? Better: "To reduce the space consumption it is possible to..."
p20: "isn't" → "is not"
p20: what is the position of a row in the T? You actually mean the originating position of F[i] the first character of row i.
p21: "pos(1)-1" → "pos_T(1)-1"
p21: "Determine L[j]" again, Occ not L is compressed
p21: "Extension of the BWT" section empty
p21: Backtracking should be explained
…

To all sections

p*: check and replace all ?? and cite all non-standard terms/algorithms

Lecture 2

REINERT: Dann haben wir zwei Vorlesungen über read mapping. Ist das gewünscht? (siehe auch unten)

RazerS
Filtering:
- Pigeonhole
- q-gram

(Hannes, Sebastian)

Comments to Lecture 2

Slides

-- DavidWeese - 25 May 2010

My comments to revision 6323:

please turn on page numbers in the slides (easier to refer to)
s13: "exponential runtime increasing" → "exponential incease in running time"
s14: "easier" is not correct, it should be faster. A filter only makes sense if the work load of subsequent steps and also the overall running time is reduced.
s15: "parts … does"
s15: "contain possible", "be match" - quantifiers missing
s16: "no guaranty to lose any significant match" - so the filter is/can be lossy? what is significant?
s19: "fast" is relative. faster than what?
s19: "improve" - compared to what?
s21: "at least one set has no object" - you should say that each object is contained in one set
s25: one could think you divide the pattern into pieces p1,...,pn and the text into t1,...,tx. Omit p1,...,pn for single characters, this is misleading. Better use |P|=m or "... of length m"
s25: "one of the pattern pieces has to match without error" - Match with what and why? You didn't mention that there is an occurrence with <=k errors. And maybe it is easier to start with something like: T differs from P with at most k errors.
s26: you need the generic pigeonhole lemma on s41 the first time. Try to rearrange
s29: is very similar to s25
s30: "p_i[start : end]" what's that? What is start and end? You use i_end and i_start later. I suggested to use positions s_i and e_i for pattern piece p_i
s30: "matches at position t[...]" - t[...] is not a position, it's a substring
s32: "before t_j" → t[j] or better: position j in t
s35: "text area" → "text area is"
s36: you only have one text, "text1 (t1)" → "text"
s40: "many unnecessary verification" and "many verifications are unsuccessful" - is somehow the partially same, as they are unnecessary if they are unsuccessful.
s42: my pdf viewer (Mac, TexShop) doesn't display the arrows/lines(?) between the tree nodes.
s48: How do you partition the error rate in the unbalanced case
s50: Does Quasar really compute approx. local matches (including BLAST) or only potential approximate matches (w/o BLAST, I'm not sure)
s51: "with window length" undefined. Better: |Q'|=w (or |D'| depending on your algorithm)
s56: "devide"
s56: So if only one read reaches the threshold of a block a you have to verify all reads in this block?
s60: Maybe to much text on this slide.
s64: "higer"
s69: You should consider skipping Minimum Coverage to spare some time
s69: increases the filter specificity

Script

-- DavidWeese - 02 Jun 2010: Revision 6452: Comments to the proof on p39:

(1) should be omitted, is given in the assumption.
Use more connecting words.
before (2): What is k_i? "Let k_i be the number of errors in p^i. Then the following holds: ..."
"order to show … assumption" can be written as "We assume"
The proof is wrong, as you argue: "Assume the Lemma is wrong. This is a contradiction to the Lemma. So it holds.".
(4) is wrong. The indirect assumption is "For all i=1,..,j holds k_i>floor((a_i*k)/A)." The k_i are integers so for all i holds: k_i >= floor(a_i*k/A) + 1 > a_i*k/A. Thus, sum over k_i>=sum over a_i*k/A. → k > k (contradiction)

Lecture 3

REINERT: Passender wäre vielleicht die Arbeiten von Tammi und Kececioglu zusammen (wie im repeat resolution script) (TopHat und spliced mapping könnte man mit filter prinzipien in lecture 2 machen?)

Splicing:
- TopHat
Multireads:
- Algorithmus von Kececioglu

(Duc, Nicolas und Corinna)

Comments

1Lecture.pdf: Slides of lecture 1

Attachments 2

I	Attachment	Action	Size	Date	Who	Comment
pdf	1Lecture.pdf	manage	6 MB	13 May 2010 - 17:18	UnknownUser	Slides of lecture 1
pdf	project_outline_thieme.pdf	manage	65 K	04 May 2010 - 17:07	UnknownUser

Topic revision: r34 - 02 Jun 2010, DavidWeese - This page was cached on 09 Mar 2025 - 15:48.

ABI

Copyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Foswiki? Send feedback