Suffix Trees: Construction and Applications, João Carreira, 2008 (PowerPoint presentation)

slide-2
SLIDE 2

Suffix Trees

Construction and Applications

João Carreira 2008

slide-3
SLIDE 3

Outline

  • Why Suffix Trees?
  • Definition
  • Ukkonen's Algorithm (construction)
  • Applications
slide-9
SLIDE 9

Why Suffix Trees?

  • Asymptotically fast.
  • The basis of state of the art data structures.
  • You don't need a Phd to use them.
  • Challenging.
  • Expose interesting algorithmic ideas.
slide-14
SLIDE 14

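Definition

Suffix Tree for an m-character string S:

  • m leaves numbered 1 to m
  • edge-label vs node-label: every edge carries a nonempty substring of S; the label of a node is the concatenation of the edge-labels on the path from the root to that node
  • each internal node (other than the root) has at least two children
  • the label of the leaf j is S[ j..m ]
  • no two edges out of the same node can have edge-labels beginning with the same character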

slide-15
SLIDE 15

Definition Example

  • String: xabxac
  • Length (m): 6 characters
  • Number of leaves: 6
  • Label of leaf 5: ac
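
The example can be checked with a minimal sketch that builds the tree the slow way, by inserting one suffix after another and splitting edges on a mismatch. This is not Ukkonen's algorithm; the names `Node`, `naive_suffix_tree` and `leaf_labels` are made up for illustration, and the code assumes the last character of the string occurs only once.

```python
class Node:
    def __init__(self, leaf=None):
        self.children = {}   # first character -> (edge_label, child Node)
        self.leaf = leaf     # 1-based suffix number for leaves, else None


def naive_suffix_tree(s):
    # Assumption: the last character occurs only once, so every suffix
    # ends at its own leaf (no suffix is a prefix of another one).
    assert s and s.count(s[-1]) == 1
    root = Node()
    for j in range(len(s)):
        node, rest = root, s[j:]
        while True:
            entry = node.children.get(rest[0])
            if entry is None:                   # no edge starts with rest[0]
                node.children[rest[0]] = (rest, Node(leaf=j + 1))
                break
            label, child = entry
            k = 0                               # length of the common prefix
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(label):                 # whole edge matched: descend
                node, rest = child, rest[k:]
            else:                               # mismatch inside the edge: split it
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[label[0]] = (label[:k], mid)
                node, rest = mid, rest[k:]
    return root


def leaf_labels(node, prefix=""):
    """Map each leaf number to the label of the path from the root."""
    if node.leaf is not None:
        return {node.leaf: prefix}
    labels = {}
    for label, child in node.children.values():
        labels.update(leaf_labels(child, prefix + label))
    return labels


labels = leaf_labels(naive_suffix_tree("xabxac"))
print(len(labels), labels[5])   # 6 ac
```

Each insertion may scan a whole suffix, so this construction is quadratic; it is only meant to make the definition concrete.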

slide-16
SLIDE 16

Implicit vs Explicit

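  • What if we have “axabx”? The suffix x is a prefix of the suffix xabx, so its path ends in the middle of the tree and it gets no leaf of its own: the tree is only implicit. Appending a terminal character that occurs nowhere else (e.g. $) makes the tree explicit.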
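
A small check, offered as an illustration only, lists the suffixes that would not get their own leaf because they are a prefix of another suffix. The helper name `hidden_suffixes` is hypothetical.

```python
def hidden_suffixes(s):
    """Suffixes of s that are a proper prefix of another suffix of s."""
    suffixes = [s[j:] for j in range(len(s))]
    return [a for a in suffixes
            if any(b != a and b.startswith(a) for b in suffixes)]


print(hidden_suffixes("axabx"))    # ['x']
print(hidden_suffixes("axabx$"))   # []
```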
slide-18
SLIDE 18

Ukkonen's Algorithm

suffix tree construction

  • Text: S[ 1..m ]
  • m phases
  • phase i + 1 is divided into i + 1 extensions

In extension j of phase i + 1:

  • find the end of the path from the root labeled with substring S[ j..i ]
  • extend the substring by adding the character S(i + 1) to its end
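
The phase and extension loops can be sketched on an uncompressed suffix trie (one character per edge, nested dictionaries). This is a simplification and an assumption of this sketch, not the data structure used in the slides: in a trie, "extending the substring" is just adding a child when it is missing. The function name `naive_phases` and the `steps` counter are illustrative.

```python
def naive_phases(s):
    """Naive phase/extension scheme on a suffix trie; returns (trie, steps)."""
    root = {}
    steps = 0                        # characters walked while locating path ends
    for i in range(len(s)):          # phase i + 1 adds the character S(i + 1) = s[i]
        for j in range(i + 1):       # extensions 1 .. i + 1 (0-based j here)
            node = root
            for ch in s[j:i]:        # find the end of the path labeled S[j..i]
                node = node[ch]
                steps += 1
            node.setdefault(s[i], {})    # add S(i + 1) unless it is already there
    return root, steps


trie, steps = naive_phases("xabxac")
print(steps)   # 35
```

After the last phase every substring of the text is a path from the root, and `steps` grows cubically with the text length, which is the cost the later tricks remove.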

slide-19
SLIDE 19

Extension Rules

  • Rule 1: Path β = S[ j..i ] ends at a leaf. S(i + 1) is added to the end of the label on that leaf edge.
slide-20
SLIDE 20

Extension Rules

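  • Rule 2: No path from the end of β starts with S(i + 1), but at least one labeled path continues from the end of β. A new leaf edge labeled S(i + 1) is created there; if β ends inside an edge, that edge is first split by a new internal node.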

slide-21
SLIDE 21

Extension Rules

  • Rule 3: Some path from the end of β starts with S(i + 1), so we do nothing.
slide-27
SLIDE 27

Ukkonen's Algorithm

suffix tree construction

Complexity:

  • m phases
  • phase j -> j extensions
  • find the end of the path of substring β: O(|β|) = O(m)
  • each extension: O(1)

Total: O(m^3)
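
The cubic bound can be made explicit with a short worked sum. It assumes, as a simplification, that extension j of phase i + 1 costs |S[j..i]| + 1 = i - j + 2 steps (walking the path, then one step for the extension rule):

```latex
\sum_{i=0}^{m-1} \sum_{j=1}^{i+1} (i - j + 2)
  = \sum_{i=0}^{m-1} \frac{(i+1)(i+2)}{2}
  = \frac{m(m+1)(m+2)}{6}
  = O(m^3)
```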

slide-28
SLIDE 28

“First make it run, then make it run fast.”

Brian Kernighan

slide-29
SLIDE 29

Suffix Links

Definition:

  • For an internal node v with path-label xα (x a single character, α a possibly empty string), if there is another node s(v) with path-label α, then a pointer from v to s(v) is called a suffix link.

slide-32
SLIDE 32

Suffix Links

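Lemma:

  • If a new internal node v with path-label xα is added to the current tree in extension j of some phase, then either the path labeled α already ends at an internal node, or an internal node at the end of the string α will be created in the next extension of the same phase.

Proof sketch. A new internal node is created only if Rule 2 applies:

  • S[ j..i ] continues with some character c ≠ S(i + 1)
  • so S[ j + 1..i ] continues with c as well. If it continues with other characters too, an internal node already exists there; otherwise extension j + 1 creates one.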
slide-36
SLIDE 36

Single Extension Algorithm

Extension j of phase i + 1:

  • 1. Find the first node v at or above the end of S[ j - 1..i ] that either has a suffix link from it or is the root. Let λ denote the string between v and the end of S[ j - 1..i ].
  • 2. If v is the root, follow the path for S[ j..i ] (as in the naive algorithm). Else traverse the suffix link and walk down from s(v) following the path for string λ.
  • 3. Using the extension rules, ensure that the string S[ j..i ] S(i + 1) is in the tree.
  • 4. If a new internal node w was created in extension j - 1 (by rule 2), then string α must end at node s(w), the end node for the suffix link from w. Create the suffix link (w, s(w)) from w to s(w).

slide-37
SLIDE 37

Node Depth

The node-depth of v is at most one greater than the node-depth of s(v).

[Figure: nodes with path-labels xβ, xα, xλ and their suffix-link targets β, α, λ, annotated with node-depths 4 and 3.]

slide-39
SLIDE 39

Skip/count Trick

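  • γ = the characters on an edge
  • “Directly implemented” edge traversal compares every character: O(|γ|)
  • Skip/count: the path being followed is known to exist, so compare only the first character of each edge and “jump” from node to node using the edge lengths.
  • K = number of nodes in a path
  • Time to traverse a path: O(K)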
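
A minimal sketch of the skip/count walk on the suffix tree of xabxac, written out by hand. The nested-dictionary layout, the 0-based half-open index pairs and the name `skip_count` are assumptions of this sketch. The walk is only valid for a string that is known to be in the tree.

```python
S = "xabxac"

# Hand-written suffix tree of S: first character -> (start, end, children),
# where the edge-label is S[start:end] (0-based, end exclusive).
TREE = {
    "x": (0, 2, {"b": (2, 6, {}), "c": (5, 6, {})}),   # edge "xa"
    "a": (1, 2, {"b": (2, 6, {}), "c": (5, 6, {})}),   # edge "a"
    "b": (2, 6, {}),                                    # edge "bxac"
    "c": (5, 6, {}),                                    # edge "c"
}


def skip_count(children, gamma):
    """Walk down gamma, inspecting one character per edge instead of all.

    Returns (edge, offset, hops): the edge on which gamma ends, how many
    characters of that edge are used, and the number of edges entered.
    """
    i, hops, edge, offset = 0, 0, None, 0
    while i < len(gamma):
        start, end, sub = children[gamma[i]]        # choose the edge by first character
        take = min(end - start, len(gamma) - i)     # skip the whole edge, or stop inside it
        edge, offset = (start, end), take
        i += take
        hops += 1
        children = sub
    return edge, offset, hops


print(skip_count(TREE, "xabx"))   # ((2, 6), 2, 2)
```

For xabx the walk enters two edges and inspects two characters, instead of comparing all four.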
slide-47
SLIDE 47

Ukkonen's Algorithm

Using the skip/count trick:

  • any phase of Ukkonen's algorithm takes O(m) time.

Proof:

  • There are i + 1 ≤ m extensions in phase i + 1
  • In a single extension, the algorithm walks up at most one edge, traverses one suffix link, walks down some number of nodes, applies the extension rules and may add a suffix link.
  • The up-walk decreases the current node-depth by at most one.
  • Each suffix link traversal decreases the node-depth by at most another one.
  • Each down-walk moves to a node of greater depth.
  • Over the entire phase the node-depth is decremented at most 2m times.
  • No node can have depth greater than m, so the total increment to current node-depth (down walks) is bounded by 3m over the entire phase.

slide-49
SLIDE 49

Ukkonen's Algorithm

  • m phases
  • 1 phase: O(m)

O(m^2)

slide-50
SLIDE 50

“First make it run fast, then make it run faster.”

João Carreira

slide-52
SLIDE 52

Edge-Label Compression

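  • A string with m characters has m suffixes.
  • If edge labels are represented with characters, O(m^2) space is needed.

To achieve O(m) space, each edge-label is stored as a pair of indices (p, q) denoting the substring S[ p..q ]. The tree has fewer than 2m edges, so the index pairs take O(m) space in total.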
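
A small illustration of the saving, using a string whose characters are all distinct so that its suffix tree is simply a root with m leaf edges, one per suffix. The variable names and the `decode` helper are hypothetical; the pairs are 1-based and inclusive, as on the slide.

```python
S = "abcdefgh"          # all characters distinct: the root has m leaf edges
m = len(S)

explicit = [S[j:] for j in range(m)]        # edge-labels stored as characters
pairs = [(j + 1, m) for j in range(m)]      # the same labels as (p, q) index pairs


def decode(p, q, text=S):
    """Recover the edge-label S[p..q] from its index pair."""
    return text[p - 1:q]


chars_stored = sum(len(label) for label in explicit)   # m(m + 1)/2 characters
ints_stored = 2 * len(pairs)                           # 2m integers
print(chars_stored, ints_stored)   # 36 16
```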

slide-53
SLIDE 53

Two more tricks...

slide-55
SLIDE 55

Rule 3 is a show stopper

If rule 3 applies in extension j, it will also apply in all further extensions until the end of the phase. Why?

  • When rule 3 applies, the path labeled S[ j..i ] must continue with character S(i + 1), and so the path labeled S[ j + 1..i ] does also, and rule 3 again applies in extensions j + 1, ..., i + 1.

slide-56
SLIDE 56

Rule 3 is a show stopper

  • End any phase i +1 the first time rule 3 applies.
  • The remaining extensions are said to be done implicitly.
slide-58
SLIDE 58

Once a leaf always a leaf

  • Leaf created => always a leaf in all successive trees.
  • No mechanism for extending a leaf edge beyond its current leaf.
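  • Once there is a leaf labeled j, extension rule 1 will always apply to extension j in any successive phase.

Leaf Edge Label: (p, e), where e is a global index standing for the current end of the text. Setting e to i + 1 once per phase performs all rule 1 extensions implicitly.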
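
One way to picture the (p, e) label is a single shared end object that every leaf edge points to. The classes `End` and `LeafEdge` are illustrative names, not part of the slides, and the three leaves below are simply those created in phases 1 to 3 of xabxac, where x, a and b each appear for the first time.

```python
class End:
    """The global index e, shared by all leaf edges."""
    def __init__(self):
        self.value = 0


class LeafEdge:
    def __init__(self, p, end):
        self.p = p        # 1-based start of the edge-label
        self.end = end    # shared End object, not a copy

    def label(self, s):
        return s[self.p - 1:self.end.value]


s = "xabxac"
e = End()
leaves = []
for i in range(1, 4):
    e.value = i                      # one assignment extends every existing leaf edge
    leaves.append(LeafEdge(i, e))    # rule 2 at the root creates leaf i

print([leaf.label(s) for leaf in leaves])   # ['xab', 'ab', 'b']
e.value = len(s)                            # effect of the remaining phases on these leaves
print([leaf.label(s) for leaf in leaves])   # ['xabxac', 'abxac', 'bxac']
```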

slide-59
SLIDE 59

Single Phase Algorithm

In each phase i:

slide-60
SLIDE 60

Single Phase Algorithm

During construction:

slide-61
SLIDE 61

Implicit to Explicit

One last phase to add character $: O(m)
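
Putting the pieces together (suffix links, skip/count, index-pair edge labels, stopping a phase at the first rule 3, open leaf edges and the final $), a compact linear-time construction might look like the sketch below. It follows the widely used "active point" bookkeeping, which is an assumption here and not something the slides spell out; all names are illustrative, and leaf edges simply store the final text length as their end, standing in for the global index e.

```python
class Node:
    def __init__(self, start, end):
        self.start, self.end = start, end   # edge into this node is text[start:end]
        self.children = {}                  # first character -> child Node
        self.link = None                    # suffix link


def build_suffix_tree(s, terminator="$"):
    text = s + terminator                   # assumes terminator does not occur in s
    n = len(text)
    root = Node(0, 0)
    node, edge, length = root, 0, 0         # active point
    remainder = 0                           # suffixes still to be inserted explicitly
    pending = None                          # internal node waiting for its suffix link

    def link_to(target):
        nonlocal pending
        if pending is not None:
            pending.link = target
        pending = target

    for pos, c in enumerate(text):          # one phase per character
        pending = None
        remainder += 1
        while remainder > 0:
            if length == 0:
                edge = pos
            first = text[edge]
            child = node.children.get(first)
            if child is None:               # rule 2: new leaf under an existing node
                node.children[first] = Node(pos, n)
                link_to(node)
            else:
                span = min(child.end, pos + 1) - child.start
                if length >= span:          # skip/count: jump over the whole edge
                    edge += span
                    length -= span
                    node = child
                    continue
                if text[child.start + length] == c:   # rule 3: end the phase
                    length += 1
                    link_to(node)
                    break
                split = Node(child.start, child.start + length)   # rule 2 with a split
                node.children[first] = split
                split.children[c] = Node(pos, n)
                child.start += length
                split.children[text[child.start]] = child
                link_to(split)
            remainder -= 1
            if node is root and length > 0:
                length -= 1
                edge = pos - remainder + 1
            else:
                node = node.link if node.link is not None else root
    return root, text


def leaf_labels(root, text):
    """Sorted path-labels of all leaves (should be exactly the suffixes)."""
    out, stack = [], [(root, "")]
    while stack:
        cur, prefix = stack.pop()
        label = prefix + text[cur.start:cur.end]
        if not cur.children and cur is not root:
            out.append(label)
        stack.extend((ch, label) for ch in cur.children.values())
    return sorted(out)


root, text = build_suffix_tree("xabxac")
print(leaf_labels(root, text))
```

The leaf labels should be exactly the suffixes of the text with $ appended, which gives a simple way to compare the result against a brute-force list of suffixes.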

slide-62
SLIDE 62

Suffix Trees are a Swiss Knife

slide-64
SLIDE 64

Applications

Exact String Matching:

Three occurrences of string aw. Preprocessing: O(m). Search: O(n + k), where m is the text length, n the pattern length and k the number of occurrences.
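
The search step can be sketched as follows: walk the pattern down from the root, then report every leaf below the point reached. To keep the example short it uses an uncompressed suffix trie built in quadratic time as a stand-in for the suffix tree, so the bounds above do not apply to this code. The sample text `awyawxawxz` and all names are assumptions chosen for illustration.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.suffix = None      # 1-based start position, set on leaves only


def build_suffix_trie(text, terminator="$"):
    text += terminator          # makes every suffix end at its own leaf
    root = TrieNode()
    for j in range(len(text)):
        node = root
        for ch in text[j:]:
            node = node.children.setdefault(ch, TrieNode())
        node.suffix = j + 1
    return root


def occurrences(root, pattern):
    node = root
    for ch in pattern:                      # match the pattern: n steps
        node = node.children.get(ch)
        if node is None:
            return []                       # pattern does not occur
    found, stack = [], [node]
    while stack:                            # every leaf below is an occurrence
        cur = stack.pop()
        if cur.suffix is not None:
            found.append(cur.suffix)
        stack.extend(cur.children.values())
    return sorted(found)


trie = build_suffix_trie("awyawxawxz")
print(occurrences(trie, "aw"))   # [1, 4, 7]
```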

slide-65
SLIDE 65

Applications

And much more...

  • Longest common substring: O(n)
  • Longest repeated substring: O(n)
  • Longest palindrome: O(n)
  • Most frequently occurring substrings of a minimum length: O(n)
  • Shortest substrings occurring only once: O(n)
  • Lempel-Ziv decomposition: O(n)
  • ...
slide-66
SLIDE 66

“Biology easily has 500 years of exciting problems to work on.”

Donald Knuth

slide-68
SLIDE 68

web.ist.utl.pt/joao.carreira

Questions?