SLIDE 1 Suffix tree and Suffix array
Karatsuba
CS214: Algorithms and Complexity Shanghai Jiao Tong University 2016.12.22
SLIDE 2 Q: How to find a match of S in a target DNA sequence?
S: DNA:
SLIDE 9 Q: How to find a match of S in a target DNA sequence?
S: DNA:
Worst-case time complexity: O(|S| ∗ |DNA|)
SLIDE 10 Q: How to find a match of S in a target DNA sequence?
S: DNA:
Too Slow!
Worst-case time complexity: O(|S| ∗ |DNA|)
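The brute-force scan animated on the preceding slides can be sketched as follows. This is a minimal illustration, not code from the lecture: the function name `naive_find` and the two DNA strings are made up for the example. Each alignment of S against the text may compare up to |S| characters, which is where the O(|S| ∗ |DNA|) bound comes from.

```python
def naive_find(pattern: str, text: str) -> int:
    """Return the first index where pattern occurs in text, or -1 if absent."""
    n, m = len(text), len(pattern)
    for start in range(n - m + 1):
        # Compare the pattern against the text window beginning at `start`.
        k = 0
        while k < m and text[start + k] == pattern[k]:
            k += 1
        if k == m:  # every character of the pattern matched
            return start
    return -1


# Hypothetical inputs, chosen only for illustration.
print(naive_find("GATTA", "ACGGATTACA"))
```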
SLIDE 11
We need a much more efficient algorithm!
SLIDE 12 Structure
– Suffix Trees
SLIDE 13
What is a suffix?
Suffix Tries
SLIDE 14
What is a suffix?
Example: S = abcdefg
Suffix Tries
SLIDE 19
What is a suffix?
Example: S = abcdefg
And so on.
Suffix Tries
SLIDE 20
What is a suffix?
Example: S = abcdefg
And so on.
Suffix Tries
Note: we use $ to represent the empty suffix.
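As a small sketch of the definition, the suffixes of a string can be listed by slicing from every start position. The helper name `suffixes` is an assumption for this example; it appends $ so that the last entry is the empty suffix, following the convention above.

```python
def suffixes(s: str) -> list[str]:
    """List all suffixes of s + '$', longest first; the last one is just '$'."""
    text = s + "$"
    return [text[i:] for i in range(len(text))]


for suffix in suffixes("abcdefg"):
    print(suffix)
```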
SLIDE 21
Suffix Tries
A suffix trie is a data structure that stores all suffixes of a string and allows many kinds of queries to be answered quickly.
SLIDE 22
Suffix Tries
S = abaa$ (positions 0 1 2 3 4)
Suffixes: $, a$, aa$, baa$, abaa$
[Figure, built up step by step over slides 22 to 36: the suffix trie of abaa$, with one root-to-leaf path per suffix; the leaves correspond to abaa$, aa$, a$, $ and baa$.]
SLIDE 37 Suffix Tries
S=abaa$
Every suffix of S is represented by a path from the root to a leaf node.
$ a a $ b a a $ $ b a a $ abaa$ aa$ a$ $ baa$
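A minimal sketch of this structure, assuming a nested-dict representation (each node is a dict from character to child; the names `build_suffix_trie` and `leaf_strings` are made up for the example). It inserts every suffix naively and then reads the root-to-leaf paths back out.

```python
def build_suffix_trie(text: str) -> dict:
    """Insert every suffix of text into a trie of nested dicts."""
    root: dict = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})  # reuse the child or create it
    return root


def leaf_strings(node: dict, prefix: str = "") -> list[str]:
    """Spell out every root-to-leaf path, visiting children in sorted order."""
    if not node:  # no children: this node is a leaf
        return [prefix]
    paths = []
    for ch in sorted(node):
        paths.extend(leaf_strings(node[ch], prefix + ch))
    return paths


trie = build_suffix_trie("abaa$")
print(leaf_strings(trie))
```

Because $ occurs only at the end of S, no suffix is a prefix of another, so each suffix ends at its own leaf.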
SLIDE 38
Application of Suffix Tries (1)
S=abaa$ What can we do with this?
- 1. Is "ba" a substring of S?
- 2. Is "ba" a suffix of S?
- 3. How many times does "ba" appear?
... ...
[Figure: the suffix trie of abaa$, with leaves abaa$, aa$, a$, $ and baa$.]
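The three queries can be sketched on a nested-dict suffix trie as below. The representation and all function names are assumptions for this example. A query is a substring if its path exists, a suffix if that path can be continued by $, and its number of occurrences is the number of leaves below the node where the path ends.

```python
def build_suffix_trie(text):
    """Naive suffix trie as nested dicts (character -> child)."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root


def walk(root, query):
    """Follow query from the root; return the node reached, or None."""
    node = root
    for ch in query:
        if ch not in node:
            return None
        node = node[ch]
    return node


def is_substring(root, query):
    return walk(root, query) is not None


def is_suffix(root, query):
    node = walk(root, query)
    return node is not None and "$" in node


def count_occurrences(root, query):
    node = walk(root, query)
    if node is None:
        return 0
    leaves, stack = 0, [node]
    while stack:  # count the leaves in the subtree below the query
        current = stack.pop()
        if current:
            stack.extend(current.values())
        else:
            leaves += 1
    return leaves


trie = build_suffix_trie("abaa$")
print(is_substring(trie, "ba"), is_suffix(trie, "ba"), count_occurrences(trie, "ba"))
```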
SLIDE 44 Suffix Tries
S=abaa$ What we cannot do yet
Find the longest common substring of S and t, e.g. S = abaa$, t = bba
$ a a $ b a a $ $ b a a $ abaa$ aa$ a$ $ baa$
SLIDE 45 Suffix Links
S=abaa$
Suffix links connect the node representing xα to the node representing α, where x is a single character and α is a (possibly empty) string.
01234 $ 5 134 a a 4 $ 5 b 2 3 4 a a 5 $ 5 $ b 2 3 4 a a 5 $
SLIDE 59
Suffix Tries with Links
S=abaa$
01234 $ 5 134 a a 4 $ 5 b 2 3 4 a a 5 $ 5 $ b 2 3 4 a a 5 $
SLIDE 60
Suffix Tries with Links
S=abaa$
$ a a $ b a a $ $ b a a $
SLIDE 61 S=abaa$
$ a a $ b a a $ $ b a a $
Find the longest common substring of S and t
S= abaa$ t= babaa
Application with Suffix Tries (2)
max common=$
SLIDE 64 S=abaa$
$ a a $ b a a $ $ b a a $
Find the longest common substring of S and t
S= abaa$ t= babaa
Application with Suffix Tries (2)
max common=$
Oops!
SLIDE 65 S=abaa$
$ a a $ b a a $ $ b a a $
Find the longest common substring of S and t
S= abaa$ t= babaa
Application with Suffix Tries (2)
max common=ba
SLIDE 66 S=abaa$
$ a a $ b a a $ $ b a a $
Find the longest common substring of S and t
S= abaa$ t= babaa
Application with Suffix Tries (2)
max common=ba What are the semantics of jumping?
SLIDE 67 S=abaa$
$ a a $ b a a $ $ b a a $
Find the longest common substring of S and t
S= abaa$ t= babaa
Application with Suffix Tries (2)
max common=ba LCS(t1,...,tk,S)= either (t1,...ti)
SLIDE 71 S=abaa$
$ a a $ b a a $ $ b a a $
Find the longest common substring of S and t
S= abaa$ t= babaa
Application with Suffix Tries (2)
max common=abaa
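The walk animated on these slides can be sketched as follows. This is one possible reading of the slides, not the lecturer's code: the `Node` class and function names are made up, the trie is built naively, and suffix links are filled in by a breadth-first pass. It also assumes t does not contain $. While scanning t, the current node always spells the longest suffix of the scanned part of t that occurs in S. On a mismatch we jump along the suffix link, which drops the first character of the current match.

```python
from collections import deque


class Node:
    def __init__(self):
        self.children = {}  # character -> Node
        self.link = None    # suffix link: node for x+alpha points to node for alpha


def build_trie_with_links(text):
    root = Node()
    for i in range(len(text)):  # naive insertion of every suffix
        node = root
        for ch in text[i:]:
            if ch not in node.children:
                node.children[ch] = Node()
            node = node.children[ch]
    queue = deque([root])
    while queue:  # breadth-first, so a node's link is set before its children's
        node = queue.popleft()
        for ch, child in node.children.items():
            child.link = root if node is root else node.link.children[ch]
            queue.append(child)
    return root


def longest_common_substring(s, t):
    root = build_trie_with_links(s + "$")
    node, length = root, 0  # length is the depth of node
    best_len, best_end = 0, 0
    for i, ch in enumerate(t):
        while node is not root and ch not in node.children:
            node = node.link  # jump: drop the first character of the match
            length -= 1
        if ch in node.children:
            node = node.children[ch]
            length += 1
        if length > best_len:
            best_len, best_end = length, i + 1
    return t[best_end - best_len:best_end]


print(longest_common_substring("abaa", "babaa"))
```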
SLIDE 72
Constructing Suffix Tries
SLIDE 73
To build a suffix trie of S:
First build a suffix trie of S[0].
Then add the characters one by one into the suffix trie.
SLIDE 74
r
SLIDE 75
r
a
a
S=abaa$
SLIDE 76
r
a
a
S=abaa$ Suffix: a
SLIDE 77
r
ab
a b
S=abaa$ Suffix: ab
SLIDE 78
r
ab
a b b
S=abaa$ Suffix: ab b
SLIDE 81
r
aba
a b b a
S=abaa$ Suffix: aba b
SLIDE 82
r
aba
a b b a a
S=abaa$ Suffix: aba ba
SLIDE 84
r
aba
a b b a a
S=abaa$ Suffix: aba ba a
SLIDE 86 r
abaa$
a b b a a a a a
$ $ $ $ $
S=abaa$ Suffix: abaa$ baa$ aa$ a$ $
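The character-by-character construction shown above can be sketched with suffix links. This is an illustrative reconstruction under assumptions, not the lecture's code: `Node`, `build_online` and `count_nodes` are hypothetical names. For each new character we start at the node spelling the whole prefix read so far and follow suffix links, adding a child wherever the character is missing.

```python
class Node:
    def __init__(self):
        self.children = {}  # character -> Node
        self.link = None    # suffix link


def build_online(text):
    root = Node()
    longest = root  # node spelling the entire prefix read so far
    for ch in text:
        node, previous_new = longest, None
        # Visit the suffixes of the current prefix from longest to shortest.
        while node is not None and ch not in node.children:
            new = Node()
            node.children[ch] = new
            if previous_new is not None:
                previous_new.link = new  # the longer new node links to the shorter
            previous_new = new
            node = node.link
        if previous_new is not None:
            # Link the last new node to the existing child, or to the root.
            previous_new.link = root if node is None else node.children[ch]
        longest = longest.children[ch]
    return root


def count_nodes(root):
    total, stack = 0, [root]
    while stack:
        node = stack.pop()
        total += 1
        stack.extend(node.children.values())
    return total


trie = build_online("abaa$")
print(count_nodes(trie))
```

The running time is proportional to the number of nodes created, so it can be quadratic in |S|.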
SLIDE 87
How many nodes can a suffix trie have?
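One way to explore this question: every node of a suffix trie spells a distinct substring of S (the root spells the empty string), so the node count equals the number of distinct substrings plus one. The sketch below counts that directly; `trie_node_count` and the test family aⁿbⁿ$ are choices made for this example. The count for this family grows roughly with n², which suggests a suffix trie can need quadratic space.

```python
def trie_node_count(text: str) -> int:
    """Nodes in the suffix trie of text: one per distinct substring, plus the root."""
    distinct = {
        text[i:j]
        for i in range(len(text))
        for j in range(i + 1, len(text) + 1)
    }
    return len(distinct) + 1


for n in (2, 4, 8, 16):
    s = "a" * n + "b" * n + "$"
    print(n, len(s), trie_node_count(s))
```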
SLIDE 88
Space-Efficient Suffix Trees
SLIDE 89 A More Compact Representation
r a b b a a a a a
$ $ $ $ $
S=abaa$
SLIDE 90 A More Compact Representation
r a baa$ a$
$ $
baa$
S=abaa$ 12345
SLIDE 91 A More Compact Representation
r 3:3 2:4 4:5
5:5
2:4
S=abaa$ 12345
5:5
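A rough sketch of the compact representation: chains of single-child nodes are merged into one edge, and each edge stores a pair of indices into S instead of the substring itself. The code uses Python-style 0-based, end-exclusive pairs, which is an assumption and differs from the numbering on the slide. The names `Node`, `build_compact` and `edge_labels` are hypothetical. Construction here is the naive quadratic one, and it assumes S ends with a unique $.

```python
class Node:
    def __init__(self):
        self.edges = {}  # first character -> (start, end, child), end exclusive


def build_compact(text):
    if not text.endswith("$") or text.count("$") != 1:
        raise ValueError("text must end with a unique '$'")
    root, n = Node(), len(text)
    for i in range(n):  # insert the suffix text[i:]
        node, pos = root, i
        while pos < n:
            ch = text[pos]
            if ch not in node.edges:
                node.edges[ch] = (pos, n, Node())  # new leaf edge
                break
            start, end, child = node.edges[ch]
            k = 0
            while start + k < end and text[start + k] == text[pos + k]:
                k += 1
            if start + k == end:  # whole edge matched: descend
                node, pos = child, pos + k
                continue
            middle = Node()  # mismatch inside the edge: split it
            middle.edges[text[start + k]] = (start + k, end, child)
            middle.edges[text[pos + k]] = (pos + k, n, Node())
            node.edges[ch] = (start, start + k, middle)
            break
    return root


def edge_labels(node, text, depth=0):
    """Return one indented line per edge: its index pair and the text it covers."""
    lines = []
    for ch in sorted(node.edges):
        start, end, child = node.edges[ch]
        lines.append("  " * depth + f"[{start}:{end}] {text[start:end]}")
        lines.extend(edge_labels(child, text, depth + 1))
    return lines


print("\n".join(edge_labels(build_compact("abaa$"), "abaa$")))
```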
SLIDE 92
SLIDE 93
How to construct a suffix tree in linear time
Further reading: Ukkonen's algorithm
SLIDE 94
Suffix arrays
SLIDE 95 Suffix Array Example
str = cattcat$
Suffixes and their starting positions:
1 cattcat$
2 attcat$
3 ttcat$
4 tcat$
5 cat$
6 at$
7 t$
8 $
Sort the suffixes alphabetically:
8 $
6 at$
2 attcat$
5 cat$
1 cattcat$
7 t$
4 tcat$
3 ttcat$
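A minimal sketch of this example: sort the start positions by the suffix that begins there. The function name `suffix_array` is an assumption. Python indices are 0-based, so the code prints `i + 1` to match the 1-based positions on the slide. It relies on $ comparing smaller than the letters, which holds for ASCII.

```python
def suffix_array(text: str) -> list[int]:
    """Start positions (0-based) of the suffixes of text in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


text = "cattcat$"
for i in suffix_array(text):
    print(i + 1, text[i:])
```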
SLIDE 96 Suffix Arrays
What can we do with this?
How many times does 'at' occur? All the suffixes that start with 'at' are next to each other in the array, so binary search finds the block of suffixes that start with 'at'; the size of that block is the number of occurrences.
8 $ 6 at$ 2 attcat$ 5 cat$ 1 cattcat$ 7 t$ 4 tcat$ 3 ttcat$
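The counting query can be sketched with two binary searches over the suffix array, one for each end of the block. The function name `count_occurrences` is made up for this example, and the array is built by plain sorting. Each comparison looks only at the first |pattern| characters of a suffix.

```python
def count_occurrences(text: str, pattern: str) -> int:
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    m = len(pattern)

    # First suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # First suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo - first


print(count_occurrences("cattcat$", "at"))
```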
SLIDE 97 Suffix Arrays
What can we do with this?
Find the k-length substrings that occur exactly i times.
8 $ 6 at$ 2 attcat$ 5 cat$ 1 cattcat$ 7 t$ 4 tcat$ 3 ttcat$
SLIDE 98
Suffix Arrays
K = 2 CurrentCount
1 8 $ 6 at$ 2 attcat$ 5 cat$ 1 cattcat$ 7 t$ 4 tcat$ 3 ttcat$
SLIDE 99
Suffix Arrays
K = 2 CurrentCount
1 1 8 $ 6 at$ 2 attcat$ 5 cat$ 1 cattcat$ 7 t$ 4 tcat$ 3 ttcat$
SLIDE 100
Suffix Arrays
K = 2 CurrentCount
1 1 2 8 $ 6 at$ 2 attcat$ 5 cat$ 1 cattcat$ 7 t$ 4 tcat$ 3 ttcat$
SLIDE 101
Suffix Arrays
K = 2 CurrentCount
1 1 2 1 (at,2) 8 $ 6 at$ 2 attcat$ 5 cat$ 1 cattcat$ 7 t$ 4 tcat$ 3 ttcat$
SLIDE 102
Suffix Arrays
8 $ 6 at$ 2 attcat$ 5 cat$ 1 cattcat$ 7 t$ 4 tcat$ 3 ttcat$
K = 2 CurrentCount
1 1 2 1 2 (at,2)
SLIDE 103
Suffix Arrays
8 $ 6 at$ 2 attcat$ 5 cat$ 1 cattcat$ 7 t$ 4 tcat$ 3 ttcat$
K = 2 CurrentCount
1 1 2 1 2 1 (at,2) (ca,2)
SLIDE 104
Suffix Arrays
8 $ 6 at$ 2 attcat$ 5 cat$ 1 cattcat$ 7 t$ 4 tcat$ 3 ttcat$
K = 2 CurrentCount
1 1 2 1 2 1 1 (at,2) (ca,2) (t$,1)
SLIDE 105
Suffix Arrays
8 $ 6 at$ 2 attcat$ 5 cat$ 1 cattcat$ 7 t$ 4 tcat$ 3 ttcat$
K = 2 CurrentCount
1 1 2 1 2 1 1 1 (at,2) (ca,2) (t$,1) (tc,1)
SLIDE 106
Suffix Arrays
8 $ 6 at$ 2 attcat$ 5 cat$ 1 cattcat$ 7 t$ 4 tcat$ 3 ttcat$
K = 2 CurrentCount
1 1 2 1 2 1 1 1 1 (at,2) (ca,2) (t$,1) (tc,1) (tt,1)
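The scan on these slides can be sketched as a single pass over the sorted suffixes, keeping a running count of the current k-character prefix. The function names are hypothetical. Suffixes shorter than k (here only $) are skipped.

```python
def k_substring_counts(text: str, k: int) -> list[tuple[str, int]]:
    """(substring, count) for every k-length substring, in sorted order."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    counts: list[tuple[str, int]] = []
    for i in sa:
        prefix = text[i:i + k]
        if len(prefix) < k:
            continue  # this suffix is too short to contain a k-length substring
        if counts and counts[-1][0] == prefix:
            counts[-1] = (prefix, counts[-1][1] + 1)  # same block: bump the count
        else:
            counts.append((prefix, 1))  # a new block starts here
    return counts


def occurring_exactly(text: str, k: int, i: int) -> list[str]:
    return [sub for sub, count in k_substring_counts(text, k) if count == i]


print(k_substring_counts("cattcat$", 2))
print(occurring_exactly("cattcat$", 2, 2))
```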
SLIDE 107 Constructing Suffix Arrays
- Easy O(n^2 log n) algorithm:
Sort the n suffixes, which takes O(n log n) comparisons, where each comparison takes O(n) time.
- A simple O(n) algorithm: the skew algorithm