
slide-1
SLIDE 1

Suffix tree and Suffix array

Karatsuba

CS214: Algorithms and Complexity Shanghai Jiao Tong University 2016.12.22

slide-2
SLIDE 2

Q: How do we find a match of S in a target DNA sequence?

[Figure: the pattern S and the target DNA sequence]

slide-10
SLIDE 10

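Q: How do we find a match of S in a target DNA sequence?

[Figure: the pattern S and the target DNA sequence]

Naive approach: try every alignment of S against the DNA and compare character by character.

Worst-case time complexity: O(|S| · |DNA|)

Too slow!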
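A minimal sketch of the naive search whose cost is quoted above, for illustration only. The function name `naive_find` and the sample strings are assumptions, not part of the original slides.

```python
def naive_find(pattern: str, text: str) -> int:
    """Return the first 0-based index where pattern occurs in text, or -1."""
    n, m = len(text), len(pattern)
    # Try every alignment of the pattern against the text.
    for start in range(n - m + 1):
        k = 0
        # Compare character by character until a mismatch or a full match.
        while k < m and text[start + k] == pattern[k]:
            k += 1
        if k == m:
            return start
    return -1


print(naive_find("GAT", "ACGATTACA"))
```

Each of the up to |DNA| alignments may compare up to |S| characters, which is where the O(|S| · |DNA|) bound comes from.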

slide-11
SLIDE 11

We need a much more efficient algorithm!

slide-12
SLIDE 12

Structure

  • Suffix Tries
      – Suffix Trees
  • Suffix Array
slide-20
SLIDE 20

Suffix Tries

What is a suffix?

Example: S = abcdefg

The suffixes of S are abcdefg, bcdefg, cdefg, and so on.

Note: we use $ to represent the empty suffix.
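As a small illustration of the definition, the sketch below lists every suffix of a string after appending the terminator $. The helper name `suffixes` is an assumption made for this example.

```python
def suffixes(s: str) -> list[str]:
    """Return all suffixes of s + '$', longest first; the last one is '$' alone."""
    t = s + "$"
    return [t[i:] for i in range(len(t))]


print(suffixes("abcdefg"))
```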

slide-21
SLIDE 21

Suffix Tries

A suffix trie is a data structure that stores every suffix of a string so that many kinds of queries can be answered quickly. It is not space-efficient by itself; the compact suffix tree later in this deck addresses that.

slide-22
SLIDE 22

Suffix Tries

S = abaa$

Suffixes: $, a$, aa$, baa$, abaa$

slide-36
SLIDE 36

Suffix Tries

S = abaa$

Suffixes: $, a$, aa$, baa$, abaa$

[Figure: suffix trie of abaa$; the leaves are labeled with the suffixes abaa$, aa$, a$, $, baa$]

slide-37
SLIDE 37

Suffix Tries

S = abaa$

Every suffix of S is represented by some path from the root to a leaf node.

[Figure: suffix trie of abaa$ with leaves abaa$, aa$, a$, $, baa$]
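The property above can be checked with a small sketch. It builds the trie by inserting every suffix one character at a time (a simple quadratic method, not necessarily the construction used later in the deck) and then reads back all root-to-leaf paths. The nested-dict representation and the function names are assumptions made for this example.

```python
def build_suffix_trie(s):
    """Build a suffix trie as nested dicts: each key is an edge character."""
    root = {}
    for i in range(len(s)):
        node = root
        for c in s[i:]:
            node = node.setdefault(c, {})
    return root


def root_to_leaf_paths(node, prefix=""):
    """Return the strings spelled by all root-to-leaf paths, in sorted order."""
    if not node:  # an empty dict is a leaf
        return [prefix]
    paths = []
    for c in sorted(node):
        paths.extend(root_to_leaf_paths(node[c], prefix + c))
    return paths


trie = build_suffix_trie("abaa$")
print(root_to_leaf_paths(trie))  # ['$', 'a$', 'aa$', 'abaa$', 'baa$']
```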

slide-43
SLIDE 43

Application of Suffix Tries (1)

S = abaa$

What can we do with this?

  • 1. Is "ba" a substring of S?
  • 2. Is "ba" a suffix of S?
  • 3. How many times does "ba" appear?
  • ...

[Figure: suffix trie of abaa$ with leaves abaa$, aa$, a$, $, baa$]
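The three questions above can be answered by walking the trie from the root. The sketch below is one possible way to do it, assuming a nested-dict trie and a string that ends with the terminator $; all function names are hypothetical.

```python
def build_suffix_trie(s):
    """Suffix trie as nested dicts; s is assumed to end with '$'."""
    root = {}
    for i in range(len(s)):
        node = root
        for c in s[i:]:
            node = node.setdefault(c, {})
    return root


def walk(root, p):
    """Follow p from the root; return the node reached, or None if p falls off."""
    node = root
    for c in p:
        if c not in node:
            return None
        node = node[c]
    return node


def count_leaves(node):
    return 1 if not node else sum(count_leaves(child) for child in node.values())


def is_substring(root, p):
    # Every substring is a prefix of some suffix, so it is a path from the root.
    return walk(root, p) is not None


def is_suffix(root, p):
    # p is a suffix if the path for p can be extended by the terminator.
    node = walk(root, p)
    return node is not None and "$" in node


def count_occurrences(root, p):
    # Each leaf below the node for p is one suffix that starts with p.
    node = walk(root, p)
    return 0 if node is None else count_leaves(node)


trie = build_suffix_trie("abaa$")
print(is_substring(trie, "ba"), is_suffix(trie, "ba"), count_occurrences(trie, "ba"))
```

For S = abaa$, "ba" is a substring, is not a suffix, and occurs once.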

slide-44
SLIDE 44

Suffix Tries

S = abaa$

What we cannot do:

Find the longest common substring of S and t, for example S = abaa$ and t = bba.

[Figure: suffix trie of abaa$ with leaves abaa$, aa$, a$, $, baa$]

slide-45
SLIDE 45

Suffix Links

S=abaa$

A suffix link connects the node representing xα to the node representing α, where x is a single character and α is a (possibly empty) string.

[Figure: suffix trie of abaa$ with suffix links drawn between nodes]
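One way to add suffix links to an already built suffix trie is a breadth-first pass: the children of the root link back to the root, and the child of a node xα along character c links to the child along c of the node α. This is a sketch under those assumptions; the `Node` class and function names are not from the slides.

```python
from collections import deque


class Node:
    def __init__(self):
        self.children = {}  # edge character -> child Node
        self.link = None    # suffix link (None for the root)


def build_suffix_trie(s):
    root = Node()
    for i in range(len(s)):
        node = root
        for c in s[i:]:
            if c not in node.children:
                node.children[c] = Node()
            node = node.children[c]
    return root


def add_suffix_links(root):
    """Set node.link so that the node for x+alpha points to the node for alpha."""
    queue = deque()
    for child in root.children.values():
        child.link = root  # a single character x links to the empty string
        queue.append(child)
    while queue:
        node = queue.popleft()
        for c, child in node.children.items():
            # node is x+alpha, child is x+alpha+c; alpha+c is also a substring,
            # so the node for alpha is guaranteed to have a child along c.
            child.link = node.link.children[c]
            queue.append(child)


root = build_suffix_trie("abaa$")
add_suffix_links(root)
ba = root.children["b"].children["a"]
print(ba.link is root.children["a"])
```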

slide-59
SLIDE 59

Suffix Tries with Links

S = abaa$

[Figure: suffix trie of abaa$ with numbered nodes and suffix links]

slide-60
SLIDE 60

Suffix Tries with Links

S = abaa$

[Figure: suffix trie of abaa$ with suffix links, edges labeled by characters]

slide-61
SLIDE 61

Application of Suffix Tries (2)

Find the longest common substring of S and t

S = abaa$, t = babaa

[Figure: suffix trie of abaa$ with suffix links; t is matched character by character from the root]

max common = $ (empty)

Matching t from the root, "ba" is found, but the node for "ba" has no edge for the next character "b". Oops!

slide-65
SLIDE 65

Application of Suffix Tries (2)

Find the longest common substring of S and t

S = abaa$, t = babaa

[Figure: suffix trie of abaa$ with suffix links; on a mismatch the search jumps along a suffix link]

max common = ba

What is the meaning of jumping? Following a suffix link drops the first character of the current match, so the search continues with the next suffix of t:

LCS(t1,...,tk,S) = either (t1,...,ti) or LCS(t2,...,tk,S)

where (t1,...,ti) is the longest prefix of t that occurs in S.

slide-71
SLIDE 71

Application of Suffix Tries (2)

Find the longest common substring of S and t

S = abaa$, t = babaa

[Figure: suffix trie of abaa$ with suffix links]

max common = abaa
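The walk-and-jump search traced on these slides might be sketched as follows. It assumes S ends with $ and t does not contain $; the `Node` class, the helper names, and the choice to build links with a breadth-first pass are illustrative assumptions.

```python
from collections import deque


class Node:
    def __init__(self):
        self.children = {}
        self.link = None


def build_suffix_trie_with_links(s):
    root = Node()
    for i in range(len(s)):
        node = root
        for c in s[i:]:
            if c not in node.children:
                node.children[c] = Node()
            node = node.children[c]
    queue = deque()
    for child in root.children.values():
        child.link = root
        queue.append(child)
    while queue:
        node = queue.popleft()
        for c, child in node.children.items():
            child.link = node.link.children[c]
            queue.append(child)
    return root


def longest_common_substring(s, t):
    """Longest substring shared by s (ending in '$') and t (without '$')."""
    root = build_suffix_trie_with_links(s)
    node, length = root, 0          # current match and its length
    best_len, best_end = 0, 0
    for pos, c in enumerate(t):
        # On a mismatch, jump along suffix links: drop the first character
        # of the current match until c can be appended (or we reach the root).
        while node is not root and c not in node.children:
            node = node.link
            length -= 1
        if c in node.children:
            node = node.children[c]
            length += 1
        if length > best_len:
            best_len, best_end = length, pos + 1
    return t[best_end - best_len:best_end]


print(longest_common_substring("abaa$", "babaa"))  # abaa
```

On t = babaa the match first grows to "ba", jumps to "a" when "b" cannot be appended, and then grows to "abaa", matching the trace above.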

slide-72
SLIDE 72

Constructing Suffix Tries

slide-73
SLIDE 73

To build a suffix trie of S: first build a suffix trie of S[0], then add the remaining characters one by one into the suffix trie.

slide-74
SLIDE 74

S = abaa$

[Figure: the trie grows from the root r as each character of S is read]

  • Start: the root r only.
  • Prefix read: a. Suffixes: a
  • Prefix read: ab. Suffixes: ab, b
  • Prefix read: aba. Suffixes: aba, ba, a
  • Prefix read: abaa$. Suffixes: abaa$, baa$, aa$, a$, $
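The character-by-character construction can be sketched as below. This version assumes that suffix links are maintained during construction, so that appending a character means walking from the deepest node along suffix links and adding a child wherever one is missing. The names `Node`, `build_online`, and `leaf_strings` are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.link = None  # suffix link; None for the root


def build_online(s):
    """Build the suffix trie of s by appending one character at a time."""
    root = Node()
    deepest = root  # node representing the whole prefix read so far
    for c in s:
        cur = deepest
        prev_new = None
        # Visit the suffixes of the current prefix, longest first, and
        # extend each one by c until a suffix already has that extension.
        while cur is not None and c not in cur.children:
            new = Node()
            cur.children[c] = new
            if prev_new is not None:
                prev_new.link = new
            prev_new = new
            cur = cur.link
        if prev_new is not None:
            prev_new.link = root if cur is None else cur.children[c]
        deepest = deepest.children[c]
    return root


def leaf_strings(node, prefix=""):
    """Strings spelled by all root-to-leaf paths, in sorted order."""
    if not node.children:
        return [prefix]
    out = []
    for c in sorted(node.children):
        out.extend(leaf_strings(node.children[c], prefix + c))
    return out


root = build_online("abaa$")
print(leaf_strings(root))
```

After the last character $ is added, the leaves spell exactly the five suffixes listed on the slide.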

slide-87
SLIDE 87

How many nodes can a suffix trie have?
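A suffix trie has one node per distinct substring of S, plus the root, so the question can be explored by counting distinct substrings. The sketch below does this by brute force; the function name and the test family a...ab...b$ are assumptions chosen to show quadratic growth.

```python
def suffix_trie_node_count(s: str) -> int:
    """Number of nodes in the suffix trie of s: distinct substrings plus the root."""
    substrings = {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
    return 1 + len(substrings)


print(suffix_trie_node_count("abaa$"))

# Strings of the form a^k b^k $ have more than k*k distinct substrings,
# so the node count grows quadratically in the length of the string.
for k in (2, 4, 8, 16):
    s = "a" * k + "b" * k + "$"
    print(len(s), suffix_trie_node_count(s))
```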

slide-88
SLIDE 88

Space-Efficient Suffix Trees

slide-89
SLIDE 89

A More Compact Representation

S = abaa$

[Figure: suffix trie of abaa$ rooted at r, one character per edge]

slide-90
SLIDE 90

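A More Compact Representation

S = abaa$ (positions 1 2 3 4 5)

[Figure: suffix tree of abaa$ rooted at r; every chain of single-child nodes is merged into one edge. Edges from the root: a, baa$, $. Edges below a: baa$, a$, $]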

slide-91
SLIDE 91

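A More Compact Representation

S = abaa$ (positions 1 2 3 4 5)

[Figure: the same suffix tree with each edge label stored as a pair of positions start:end into S instead of a string: 3:3 for a, 2:5 for baa$, 4:5 for a$, 5:5 for $]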
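A compact suffix tree with index-pair edge labels might be built as in the sketch below. It inserts the suffixes one by one and splits an edge at the first mismatch, which is a simple quadratic method and not the linear-time construction mentioned later. It assumes S ends with a terminator that occurs nowhere else; positions are 0-based and half-open here, and an edge may point at a different occurrence of the same label than the figure does.

```python
class Node:
    def __init__(self):
        # first character of the edge -> (start, end, child), label is s[start:end]
        self.edges = {}


def build_suffix_tree(s):
    if not s or s.count(s[-1]) != 1:
        raise ValueError("s must end with a unique terminator such as '$'")
    n = len(s)
    root = Node()
    for i in range(n):
        node, j = root, i
        while j < n:
            c = s[j]
            if c not in node.edges:
                node.edges[c] = (j, n, Node())  # new leaf edge for the rest
                break
            start, end, child = node.edges[c]
            k = 0
            while start + k < end and s[start + k] == s[j + k]:
                k += 1
            if start + k == end:
                node, j = child, j + k  # whole edge matched, keep descending
            else:
                # Split the edge at the mismatch and hang a new leaf off it.
                mid = Node()
                mid.edges[s[start + k]] = (start + k, end, child)
                mid.edges[s[j + k]] = (j + k, n, Node())
                node.edges[c] = (start, start + k, mid)
                break
    return root


def edge_labels(node, s):
    labels = []
    for start, end, child in node.edges.values():
        labels.append(s[start:end])
        labels.extend(edge_labels(child, s))
    return labels


tree = build_suffix_tree("abaa$")
print(sorted(edge_labels(tree, "abaa$")))
```

Storing two integers per edge instead of a substring is what keeps the total size linear in the length of S.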

slide-93
SLIDE 93

How to construct a suffix tree in linear time

Further reading: Ukkonen's algorithm

slide-94
SLIDE 94

Suffix arrays

slide-95
SLIDE 95

Suffix Array Example

str = cattcat$

  1 cattcat$
  2 attcat$
  3 ttcat$
  4 tcat$
  5 cat$
  6 at$
  7 t$
  8 $

Sort the suffixes alphabetically:

  8 $
  6 at$
  2 attcat$
  5 cat$
  1 cattcat$
  7 t$
  4 tcat$
  3 ttcat$

The suffix array is the list of starting positions in sorted order: 8 6 2 5 1 7 4 3
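The example can be reproduced with a direct sort of the suffixes. This is the simple approach, not an efficient construction; the function name `suffix_array` is an assumption, and positions are 0-based inside the function and shifted to 1-based only for printing.

```python
def suffix_array(s: str) -> list[int]:
    """Return the 0-based starting positions of the suffixes of s in sorted order."""
    return sorted(range(len(s)), key=lambda i: s[i:])


s = "cattcat$"
sa = suffix_array(s)
print([i + 1 for i in sa])  # [8, 6, 2, 5, 1, 7, 4, 3]
for i in sa:
    print(i + 1, s[i:])
```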

slide-96
SLIDE 96

Suffix Arrays

What can we do with this?

  • 1. Counting: how many times does 'at' occur?

All the suffixes that start with 'at' are next to each other in the array, so binary search can find the first and the last of them. The size of that range is the number of occurrences.

Sorted suffixes: 8 $, 6 at$, 2 attcat$, 5 cat$, 1 cattcat$, 7 t$, 4 tcat$, 3 ttcat$
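A sketch of the counting query: two binary searches locate the boundaries of the block of suffixes that start with the pattern. The function names are assumptions, and the suffix array is built by plain sorting for brevity.

```python
def suffix_array(s):
    return sorted(range(len(s)), key=lambda i: s[i:])


def count_occurrences(s, sa, p):
    """Count occurrences of p in s using binary search on the suffix array sa."""
    m = len(p)

    def boundary(after_equal):
        # First index whose length-m prefix is >= p (or > p if after_equal).
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = s[sa[mid]:sa[mid] + m]
            if prefix < p or (after_equal and prefix == p):
                lo = mid + 1
            else:
                hi = mid
        return lo

    return boundary(True) - boundary(False)


s = "cattcat$"
sa = suffix_array(s)
print(count_occurrences(s, sa, "at"))  # 2
```

Each search makes O(log n) comparisons of at most |p| characters.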

slide-97
SLIDE 97

Suffix Arrays

What can we do with this?

  • 2. K-mer counting: find the k-length substrings (k-mers) that occur exactly i times.

Sorted suffixes: 8 $, 6 at$, 2 attcat$, 5 cat$, 1 cattcat$, 7 t$, 4 tcat$, 3 ttcat$

slide-106
SLIDE 106

Suffix Arrays

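Sorted suffixes: 8 $, 6 at$, 2 attcat$, 5 cat$, 1 cattcat$, 7 t$, 4 tcat$, 3 ttcat$

K = 2

Scan the array from top to bottom, counting consecutive suffixes that share the same first K characters (CurrentCount), and output a pair each time the K-mer changes.

CurrentCount: 1 1 2 1 2 1 1 1 1

Result: (at,2) (ca,2) (t$,1) (tc,1) (tt,1)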
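The scan traced above can be written as a single pass over the suffix array. In this sketch the function name `kmer_counts` is an assumption, suffixes shorter than k are skipped, and the array is built by plain sorting.

```python
def kmer_counts(s, k):
    """Return (k-mer, count) pairs in sorted order using the suffix array of s."""
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    result = []
    current, count = None, 0
    for i in sa:
        kmer = s[i:i + k]
        if len(kmer) < k:
            continue  # suffix too short to contain a k-mer
        if kmer == current:
            count += 1
        else:
            # The k-mer changed: report the finished run and start a new one.
            if current is not None:
                result.append((current, count))
            current, count = kmer, 1
    if current is not None:
        result.append((current, count))
    return result


print(kmer_counts("cattcat$", 2))
# [('at', 2), ('ca', 2), ('t$', 1), ('tc', 1), ('tt', 1)]
```

Filtering the result for pairs whose count equals i answers the "occurs exactly i times" question.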

slide-107
SLIDE 107

Constructing Suffix Arrays

  • Easy O(n² log n) algorithm:

Sort the n suffixes, which takes O(n log n) comparisons, where each comparison takes O(n) time.

  • We can do better:

A simple O(n) algorithm: the skew algorithm