Bioinformatics Algorithms
(Fundamental Algorithms, module 2)
Zsuzsanna Lipták
Masters in Medical Bioinformatics academic year 2018/19, II. semester
Suffix Trees
Pattern matching with the suffix tree
Recall: Pattern matching
[Figure: suffix tree of BANANA$. The leaves, from left to right, are labeled with the starting positions of the suffixes: 7, 6, 4, 2, 1, 5, 3.]
Recall that every outgoing edge from an inner node starts with a different character.
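The matching walk can be sketched as follows. This is only an illustration under assumptions: the tree of BANANA$ is hard-coded from the figure as nested dictionaries (inner nodes map edge labels to children, leaves are starting positions), and the function name `find_locus` is hypothetical. It returns only the node u at or below the point where the match ends; the second component d of the locus is left out because its exact definition is not given in the surviving text.

```python
# Suffix tree of BANANA$ taken from the figure. Inner nodes are dicts keyed by
# edge label; leaves are the 1-based starting positions of the suffixes.
TREE = {
    "$": 7,
    "A": {"$": 6, "NA": {"$": 4, "NA$": 2}},
    "BANANA$": 1,
    "NA": {"$": 5, "NA$": 3},
}


def find_locus(tree, pattern):
    """Return the node at or below the end of the match, or None if P does not occur."""
    node, i = tree, 0
    while i < len(pattern):
        if not isinstance(node, dict):
            return None  # reached a leaf, but pattern characters remain
        # Outgoing edges start with distinct characters, so at most one edge fits.
        edge = next((e for e in node if e[0] == pattern[i]), None)
        if edge is None:
            return None
        k = min(len(edge), len(pattern) - i)
        if edge[:k] != pattern[i:i + k]:
            return None  # mismatch inside the edge label
        i += k
        node = node[edge]
    return node


print(find_locus(TREE, "ANA"))  # {'$': 4, 'NA$': 2}
print(find_locus(TREE, "NAB"))  # None
```

Because each step picks the edge by its first character, the walk reads every pattern character at most once (the linear scan over the outgoing edges is bounded by the alphabet size).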
Let locus(P) = (u, d). Traverse the subtree rooted at u and report the labels of its leaves. This takes time linear in the size of the subtree, which is O(occ_P), so altogether O(m + occ_P), where m = |P|.
(Proof for the size of the subtree: every inner node has at least two children, so number of leaves of the subtree = occ_P ⇒ number of nodes < 2 occ_P ⇒ number of edges < 2 occ_P − 1 ⇒ size of subtree < 4 occ_P.)
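As a small illustration of the traversal and of the size bound, the sketch below assumes the locus node u has already been found and uses the tree of BANANA$ hard-coded from the figure. The nested-dictionary representation and the helper names `leaves_below` and `subtree_size` are assumptions made for this example.

```python
# Suffix tree of BANANA$ taken from the figure. Inner nodes are dicts keyed by
# edge label; leaves are the 1-based starting positions of the suffixes.
TREE = {
    "$": 7,
    "A": {"$": 6, "NA": {"$": 4, "NA$": 2}},
    "BANANA$": 1,
    "NA": {"$": 5, "NA$": 3},
}


def leaves_below(node):
    """Collect the leaf labels (occurrence positions) in the subtree of node."""
    if not isinstance(node, dict):
        return [node]
    result = []
    for child in node.values():
        result.extend(leaves_below(child))
    return result


def subtree_size(node):
    """Return (number of nodes, number of edges) of the subtree of node."""
    if not isinstance(node, dict):
        return 1, 0
    nodes, edges = 1, 0
    for child in node.values():
        n, e = subtree_size(child)
        nodes += n
        edges += e + 1  # +1 for the edge leading to this child
    return nodes, edges


u = TREE["A"]                   # locus node for P = "A"
occ = leaves_below(u)
print(occ)                      # [6, 4, 2]
nodes, edges = subtree_size(u)
print(nodes < 2 * len(occ), edges < 2 * len(occ) - 1)  # True True
```

For P = "A" the subtree has 5 nodes and 4 edges for 3 occurrences, in line with the bound size of subtree < 4 occ_P.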
(i) if X_c is a singleton, create a leaf;
(ii) otherwise, find the longest common prefix of the suffixes in X_c, create an internal node, and recursively continue with Step 2, X being the set [...] prefix.
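The recursive idea in steps (i) and (ii) can be sketched as below. Since the earlier steps are not part of the surviving text, two assumptions are made and should be read as such: that the set X is split into groups X_c by the first character c, and that the recursion continues on the suffixes of X_c with their longest common prefix removed. The function names `build` and `suffix_tree` and the nested-dictionary representation are hypothetical. The text is assumed to end with a unique sentinel such as $.

```python
import os


def build(suffixes):
    """Build the subtree for a list of (remaining_text, start_position) pairs."""
    # Assumed Step 2: group the suffixes by their first character c.
    groups = {}
    for text, pos in suffixes:
        groups.setdefault(text[0], []).append((text, pos))
    node = {}
    for c in sorted(groups):
        group = groups[c]
        if len(group) == 1:
            # (i) singleton: create a leaf, labeled with the start position
            text, pos = group[0]
            node[text] = pos
        else:
            # (ii) longest common prefix becomes the edge to a new internal node
            lcp = os.path.commonprefix([t for t, _ in group])
            node[lcp] = build([(t[len(lcp):], p) for t, p in group])
    return node


def suffix_tree(t):
    """Suffix tree of t as nested dicts; t must end with a unique sentinel."""
    return build([(t[i:], i + 1) for i in range(len(t))])


print(suffix_tree("BANANA$"))
```

This simple construction takes quadratic time in the worst case, because the suffixes are handled as explicit strings; it is meant to show the recursion, not to be efficient.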
Recall the pattern matching problem, counting variant: Return the number of occurrences of the pattern P in the text.
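[Figure: suffix tree of BANANA$ with leaves 7, 6, 4, 2, 1, 5, 3. Every node u is annotated with g(u): 1 for each leaf, 3 for the node A, 2 for the node ANA, 2 for the node NA, and 7 for the root.]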
If we store g(u) in u, then we can solve the counting problem in O(m) time: match P in the suffix tree; if P is found, with locus(P) = (u, d), return g(u), otherwise return 0. E.g. the number [...]
g(u) = 1 if u is a leaf, and g(u) = Σ_{v child of u} g(v) if u is an inner node.
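A minimal sketch of how g can be computed bottom-up and then used for counting, assuming the tree of BANANA$ from the figure is given as nested dictionaries. The representation and the names `annotate` and `count` are assumptions for illustration. The annotated copy drops the leaf labels, which are not needed for counting.

```python
# Suffix tree of BANANA$ taken from the figure. Inner nodes are dicts keyed by
# edge label; leaves are the 1-based starting positions of the suffixes.
TREE = {
    "$": 7,
    "A": {"$": 6, "NA": {"$": 4, "NA$": 2}},
    "BANANA$": 1,
    "NA": {"$": 5, "NA$": 3},
}


def annotate(node):
    """Return a copy of the tree in which every node is a pair (g, children)."""
    if not isinstance(node, dict):
        return (1, {})  # a leaf has exactly one leaf in its subtree
    children = {edge: annotate(child) for edge, child in node.items()}
    g = sum(child_g for child_g, _ in children.values())  # sum over children
    return (g, children)


def count(annotated, pattern):
    """Number of occurrences of pattern: g of the locus node, or 0 if not found."""
    g, children = annotated
    i = 0
    while i < len(pattern):
        edge = next((e for e in children if e[0] == pattern[i]), None)
        if edge is None:
            return 0
        k = min(len(edge), len(pattern) - i)
        if edge[:k] != pattern[i:i + k]:
            return 0
        i += k
        g, children = children[edge]
    return g


ST = annotate(TREE)
print(count(ST, "A"), count(ST, "ANA"), count(ST, "NAB"))  # 3 2 0
```

The annotation is computed once in time linear in the size of the tree; afterwards each query only walks down the tree and reads one stored value.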
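Another piece of information we often need is the string depth sd(u) of a node u (the length of its label).
[Figure: suffix tree of BANANA$ with every node annotated with its string depth. The leaves 7, 6, 4, 2, 1, 5, 3 have string depths 1, 2, 4, 6, 7, 3, 5; the inner nodes A, ANA, NA have string depths 1, 3, 2.]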
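String depths can be computed in one top-down pass, since the string depth of a child is the string depth of its parent plus the length of the connecting edge label. The sketch below assumes the tree of BANANA$ from the figure as nested dictionaries; the function name `string_depths` is hypothetical, and the node labels are built only to make the output readable.

```python
# Suffix tree of BANANA$ taken from the figure. Inner nodes are dicts keyed by
# edge label; leaves are the 1-based starting positions of the suffixes.
TREE = {
    "$": 7,
    "A": {"$": 6, "NA": {"$": 4, "NA$": 2}},
    "BANANA$": 1,
    "NA": {"$": 5, "NA$": 3},
}


def string_depths(node, depth=0, label=""):
    """Yield (label, sd) for every node, using sd(child) = sd(parent) + len(edge)."""
    yield label, depth
    if isinstance(node, dict):
        for edge, child in node.items():
            yield from string_depths(child, depth + len(edge), label + edge)


sd = dict(string_depths(TREE))
print(sd["A"], sd["ANA"], sd["NA"], sd["BANANA$"])  # 1 3 2 7
```

In an implementation that stores sd(u) in the node, the label itself would not be carried along; only the depth is passed down, so the pass is linear in the size of the tree.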