slide-1
SLIDE 1

Tries and Suffix Trees

Inge Li Gørtz

slide-2
SLIDE 2
  • String matching problem. Given strings T (text, length n) and P (pattern, length m) over an alphabet Σ, report the starting positions of all occurrences of P in T.
  • Finite automaton: O(m|Σ| + n) time and space
  • KMP: O(m + n) time and space
  • String indexing problem. Given a string S of n characters from an alphabet Σ, preprocess S into a data structure that supports
  • Search(P): Return the starting positions of all occurrences of P in S.
  • Today: Data structure using O(n) space and supporting Search(P) in O(m) time, plus the time to report the occurrences.
  • Applications:
  • Search engines, e.g. prefix searches.
  • Finding common substrings of many biological strings
  • Finding repeating substructures in biological strings
  • Detecting DNA contamination

String indexing problem

slide-3
SLIDE 3
  • Tries
  • Compressed tries
  • Suffix trees
  • Applications of suffix trees

Outline

slide-4
SLIDE 4

Tries

slide-5
SLIDE 5
  • Text retrieval
  • Trie over the strings: sells, by, the, sea, shells, tea.

Tries

[Figure: trie over the six strings above; the leaves are labeled S1 to S6.]

slide-6
SLIDE 6
  • Text retrieval
  • Prefix-free?
  • Trie over the strings: sells, by, the, sea, shells, tea, she.

Tries

[Figure: the same trie with leaves S1 to S6. The added string she gets no leaf of its own, because it is a prefix of shells.]

slide-7
SLIDE 7
  • Text retrieval
  • Prefix-free?
  • Trie over the strings: sells, by, the, sea, shells, tea, she.

Tries

[Figure: the trie after appending an end marker $ to every string; each of the seven strings now ends in its own leaf, S1 to S7.]

slide-8
SLIDE 8

Tries

[Figure: trie over the seven $-terminated strings, with leaves S1 to S7. The search for “sea” follows the edges s, e, a, $ from the root and ends at leaf S4.]

  • Text retrieval
  • Search for “sea”
  • Trie over the strings: sells$, by$, the$, sea$, shells$, tea$, she$.
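The search above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: the nested-dict representation, the names build_trie and search, and the "leaf" key are assumptions made for the example. Each leaf stores the 1-based index of its string, matching the labels S1 to S7.

```python
END = "$"  # end marker, assumed not to occur inside any stored string


def build_trie(words):
    """Build a trie as nested dicts; each leaf stores the 1-based index of its string."""
    root = {}
    for index, word in enumerate(words, start=1):
        node = root
        for ch in word + END:
            node = node.setdefault(ch, {})
        # "leaf" is longer than one character, so it cannot clash with an edge label.
        node["leaf"] = index
    return root


def search(root, pattern):
    """Return the index of the stored string equal to pattern, or None if absent."""
    node = root
    for ch in pattern + END:
        if ch not in node:
            return None
        node = node[ch]
    return node.get("leaf")


words = ["sells", "by", "the", "sea", "shells", "tea", "she"]
trie = build_trie(words)
print(search(trie, "sea"))    # 4, i.e. leaf S4
print(search(trie, "short"))  # None: there is no edge labeled "o" below "sh"
```

Python dicts are hash tables, so each step down the trie takes expected constant time and a search for a string of length m takes O(m) expected time.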
slide-20
SLIDE 20
  • Text retrieval
  • Search for “short”
  • Trie over the strings: sells$, by$, the$, sea$, shells$, tea$, she$.

Tries

[Figure: the same trie. The search for “short” follows the edges s and h, then stops because no edge labeled o leaves that node, so “short” is not in the trie.]

slide-25
SLIDE 25
  • Build a trie over the strings: by$, sells$, sea$.

Tries

[Figure: the trie under construction for by$, sells$ and sea$, with leaves S2, S1 and S4.]

slide-26
SLIDE 26
  • Properties of the trie. A trie T storing a collection S of s strings of total length n from an alphabet of size d has the following properties:

  • How many children can a node have?
  • How many leaves does T have?
  • What is the height of T?
  • What is the number of nodes in T?

Trie
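One way to explore these questions is to measure a concrete trie. The sketch below is only an illustration under assumed names (build_trie, count_nodes, count_leaves, height) and an assumed nested-dict representation. For $-terminated strings it should agree with the usual bounds: one child per symbol at most (counting the end marker), one leaf per string, height equal to the longest string, and at most n + 1 nodes when n counts the end markers.

```python
def build_trie(words, end="$"):
    """Trie as nested dicts; a leaf is an empty dict."""
    root = {}
    for word in words:
        node = root
        for ch in word + end:
            node = node.setdefault(ch, {})
    return root


def count_nodes(node):
    """Number of nodes in the subtree rooted at node, including node itself."""
    return 1 + sum(count_nodes(child) for child in node.values())


def count_leaves(node):
    """Number of leaves (nodes without children) below node."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


def height(node):
    """Number of edges on the longest root-to-leaf path."""
    return 0 if not node else 1 + max(height(child) for child in node.values())


words = ["sells", "by", "the", "sea", "shells", "tea", "she"]
trie = build_trie(words)
n = sum(len(w) + 1 for w in words)  # total length, counting one end marker per string

print(count_nodes(trie), "nodes, bound n + 1 =", n + 1)   # 26 nodes, bound 33
print(count_leaves(trie), "leaves for", len(words), "strings")  # 7 leaves
print(height(trie), "= height")                            # 7, the length of shells$
```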

slide-27
SLIDE 27
  • Search time: O(d) in each node => O(dm).
  • O(m) if d constant.
  • d not constant: use dictionary
  • Hashing O(1)
  • Balanced BST: O(log d)
  • Time and space for a trie (for small/constant d):
  • O(m) for searching for a string of length m.
  • O(n) space.
  • Preprocessing: O(n)

Trie
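For a non-constant alphabet, the child lookup at each node is the part that changes. The sketch below uses a sorted array with binary search as a stand-in for the balanced BST option: lookups take O(log d), but inserting a child costs O(d) because the list is shifted, so a real balanced tree would be needed for O(log d) updates. The class SortedNode and the functions insert and contains are hypothetical names for this example.

```python
from bisect import bisect_left


class SortedNode:
    """Trie node whose children are kept in label order for binary search."""

    def __init__(self):
        self.labels = []    # edge labels, kept sorted
        self.children = []  # children[i] is reached by the edge labels[i]

    def child(self, ch):
        """Return the child reached by ch, or None. O(log d) comparisons."""
        i = bisect_left(self.labels, ch)
        if i < len(self.labels) and self.labels[i] == ch:
            return self.children[i]
        return None

    def add_child(self, ch):
        """Return the child reached by ch, creating it if necessary."""
        i = bisect_left(self.labels, ch)
        if i < len(self.labels) and self.labels[i] == ch:
            return self.children[i]
        node = SortedNode()
        self.labels.insert(i, ch)      # O(d) shift; a balanced BST avoids this
        self.children.insert(i, node)
        return node


def insert(root, word, end="$"):
    node = root
    for ch in word + end:
        node = node.add_child(ch)


def contains(root, word, end="$"):
    node = root
    for ch in word + end:
        node = node.child(ch)
        if node is None:
            return False
    return True


root = SortedNode()
for w in ["sells", "by", "the", "sea", "shells", "tea", "she"]:
    insert(root, w)
print(root.labels)               # ['b', 's', 't']
print(contains(root, "tea"))     # True
print(contains(root, "short"))   # False
```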

slide-28
SLIDE 28
  • Prefix search: return all words in the trie starting with “se”

Tries

[Figure: the trie over the seven strings. The words starting with “se” are the leaves below the node reached by s, e: sea (S4) and sells (S1).]

slide-31
SLIDE 31
  • Time for prefix search: O(m) + time to report all occurrences. The reporting time could be large, because the subtree below the prefix can contain many more nodes than words.
  • Solution: compact tries.

Trie
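A prefix search on an uncompacted trie can be sketched as follows: walk down along the prefix, then traverse the whole subtree and report every word that ends there. The function names and the nested-dict representation are assumptions for this example. The traversal visits every node below the prefix, which is the cost that compact tries reduce.

```python
def build_trie(words, end="$"):
    """Trie as nested dicts; a leaf is an empty dict."""
    root = {}
    for word in words:
        node = root
        for ch in word + end:
            node = node.setdefault(ch, {})
    return root


def words_below(node, prefix, end="$"):
    """Yield every stored word in the subtree of node; prefix spells the path to node."""
    for ch, child in sorted(node.items()):
        if ch == end:
            yield prefix
        else:
            yield from words_below(child, prefix + ch, end)


def prefix_search(root, prefix):
    """Return all stored words that start with prefix, in alphabetical order."""
    node = root
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    return list(words_below(node, prefix))


trie = build_trie(["sells", "by", "the", "sea", "shells", "tea", "she"])
print(prefix_search(trie, "se"))  # ['sea', 'sells']
print(prefix_search(trie, "sh"))  # ['she', 'shells']
```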

slide-32
SLIDE 32

Compact tries

slide-33
SLIDE 33
  • Compact trie: Chains of nodes with a single child are merged into a single edge.

Tries

[Figure: the uncompacted trie over the seven $-terminated strings, before merging.]

slide-34
SLIDE 34
  • Compact trie: Chains of nodes with a single child are merged into a single edge.

Tries

[Figure: the compact trie over the seven strings; edge labels, with the leaf each path ends in:]

root
  by$ → S2
  s
    e
      a$ → S4
      lls$ → S1
    he
      $ → S7
      lls$ → S5
  t
    ea$ → S6
    he$ → S3

slide-35
SLIDE 35
  • Properties of the compact trie. A compact trie T storing a collection S of s strings of total length n from an alphabet of size d has the following properties:

  • Every internal node of T has at least 2 and at most d children.
  • T has s leaves
  • The number of nodes in T is < 2s.
  • Time and space for a compact trie (constant d):
  • O(m) for searching for a string of length m.
  • O(m + occ) for prefix search, where occ = #occurrences
  • O(s) space.
  • Preprocessing: O(n)

Trie
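The merging step can be sketched by building an ordinary trie first and then collapsing every chain of single-child nodes into one edge. This is an illustrative approach with assumed names (build_trie, compact, count_nodes); it does not reach the O(s) space bound on its own, since this sketch stores each edge label as a copied string.

```python
def build_trie(words, end="$"):
    """Ordinary trie as nested dicts; a leaf is an empty dict."""
    root = {}
    for word in words:
        node = root
        for ch in word + end:
            node = node.setdefault(ch, {})
    return root


def compact(node):
    """Return a compact copy: every chain of single-child nodes becomes one edge."""
    result = {}
    for ch, child in node.items():
        label = ch
        # Extend the label while the node below has exactly one child.
        while len(child) == 1:
            (next_ch, child), = child.items()
            label += next_ch
        result[label] = compact(child)
    return result


def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.values())


words = ["sells", "by", "the", "sea", "shells", "tea", "she"]
tree = compact(build_trie(words))
print(sorted(tree))        # ['by$', 's', 't']
print(sorted(tree["s"]))   # ['e', 'he']
print(count_nodes(tree))   # 12, which is below 2s = 14
```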

slide-36
SLIDE 36

Suffix trees

slide-38
SLIDE 38
  • String indexing problem. Given a string S of characters from an alphabet Σ, preprocess S into a data structure that supports
  • Search(P): Return the starting positions of all occurrences of P in S.
  • Build a compressed trie over all suffixes of S (suffix tree). Label each leaf with the index of its suffix.
  • Observation: An occurrence of P is a prefix of a suffix of S.
  • Example: P = ana.

Suffix tree

[Figure: example text “b a n a n a s t r i n g s s a l a d s”; each occurrence of P is drawn as a prefix of a suffix of S.]

slide-39
SLIDE 39
  • Suffix tree: over the string banana$

Suffix Tree

[Figure: suffix tree of banana$; edge labels, with the suffix start position at each leaf:]

root
  banana$ → 1
  a
    $ → 6
    na
      $ → 4
      na$ → 2
  na
    na$ → 3
    $ → 5
  $ → 7

slide-41
SLIDE 41
  • Suffix tree: over the string banana$
  • Find all occurrences of P=“an”

Suffix Tree

[Figure: suffix tree of banana$. The path for “an” ends inside the edge na below a; the leaves below that point are 2 and 4.]

  • Search for P.
  • Report the labels of all leaves below the final node.
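The two steps can be sketched on a naively built suffix tree. The construction below inserts every suffix into a trie and then merges chains, which takes O(n²) time and space; it is a stand-in for illustration only, not the fast construction. The names build_suffix_tree, search and leaves_below are assumptions. A leaf is represented by its 1-based suffix index.

```python
def build_suffix_tree(text):
    """Naive construction. text must end with a unique end marker such as "$"."""
    trie = {}
    for start in range(len(text)):
        node = trie
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1  # 1-based suffix index
    return _compact(trie)


def _compact(node):
    """Merge chains of single-child nodes; a leaf becomes its suffix index."""
    if "leaf" in node:
        return node["leaf"]
    tree = {}
    for ch, child in node.items():
        label = ch
        while "leaf" not in child and len(child) == 1:
            (next_ch, child), = child.items()
            label += next_ch
        tree[label] = _compact(child)
    return tree


def leaves_below(node):
    if isinstance(node, int):
        return [node]
    return [leaf for child in node.values() for leaf in leaves_below(child)]


def search(tree, pattern):
    """Return the sorted starting positions of all occurrences of pattern."""
    node, rest = tree, pattern
    while rest:
        if isinstance(node, int):
            return []                      # pattern continues past a leaf
        for label, child in node.items():
            k = min(len(label), len(rest))
            if label[:k] == rest[:k]:      # match along this edge
                node, rest = child, rest[k:]
                break
        else:
            return []                      # no edge matches: no occurrence
    return sorted(leaves_below(node))


tree = build_suffix_tree("banana$")
print(search(tree, "an"))   # [2, 4]
print(search(tree, "na"))   # [3, 5]
print(search(tree, "nab"))  # []
```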

slide-42
SLIDE 42
  • Suffix tree: over the string banana$

Suffix Tree

[Figure: suffix tree of banana$, drawn above the string with its positions:]

1 2 3 4 5 6 7
b a n a n a $

  • Store S and store node labels by reference to S.
slide-43
SLIDE 43
  • Suffix tree: over the string banana$

Suffix Tree

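1 2 3 4 5 6 7
b a n a n a $

  • Store S and store node labels by reference to S.

[Figure: the same suffix tree with each edge label replaced by its interval in S, and the suffix start position at each leaf:]

root
  [1,7] → 1
  [2,2]
    [7,7] → 6
    [3,4]
      [7,7] → 4
      [5,7] → 2
  [3,4]
    [5,7] → 3
    [7,7] → 5
  [7,7] → 7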
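Storing labels by reference means each edge keeps two integers instead of a copy of its substring, which is what keeps the suffix tree at O(n) space. The sketch below converts a tree with explicit labels into one with intervals. The tree literal is the banana$ tree written out by hand, and the names to_intervals and min_leaf are assumptions for this example. An edge at string depth d above a leaf for suffix i covers positions starting at i + d.

```python
TEXT = "banana$"

# Suffix tree of banana$ with explicit edge labels; a leaf is its suffix index.
TREE = {
    "banana$": 1,
    "a": {"$": 6, "na": {"$": 4, "na$": 2}},
    "na": {"na$": 3, "$": 5},
    "$": 7,
}


def min_leaf(node):
    """Smallest suffix index below node."""
    if isinstance(node, int):
        return node
    return min(min_leaf(child) for child in node.values())


def to_intervals(node, depth=0):
    """Replace every edge label by a 1-based, inclusive interval (i, j) into TEXT."""
    if isinstance(node, int):
        return node
    out = {}
    for label, child in node.items():
        start = min_leaf(child) + depth  # position of the label's first character
        out[(start, start + len(label) - 1)] = to_intervals(child, depth + len(label))
    return out


intervals = to_intervals(TREE)
for (i, j) in intervals:
    print([i, j], TEXT[i - 1:j])  # each root edge with the substring it refers to
```

Any leaf below the edge would give a valid interval; the smallest one is used here because it reproduces the intervals in the figure.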

slide-44
SLIDE 44

Suffix trees and common substrings

[Figure: generalized suffix tree with 18 leaves, each marked $1 or $2.]

slide-45
SLIDE 45
  • Suffix tree of a string S: Compact trie over all suffixes of S.
  • Space and time:
  • Space: O(n)
  • Search time: O(m) + time to report all occurrences = O(m+occ)
  • Preprocessing: Can be done in O(sort(n,|Σ|)) time, where sort(n,|Σ|) is the time it takes to sort n characters from an alphabet Σ.

  • Suffix trees can be used to solve the String indexing problem in:
  • Space: O(n)
  • Search time: O(m+occ)
  • Preprocessing: O(sort(n,|Σ|)) time

Suffix tree

slide-46
SLIDE 46

Applications of suffix trees

slide-47
SLIDE 47
  • Find longest common substring of strings S1 and S2.
  • Construct the suffix tree over the concatenation S1$1S2$2, where $1 and $2 are distinct end markers that occur in neither string.
  • Example: Find longest common substring of piespies and piepiees:
  • Construct suffix tree of piespies$1piepiees$2.

Longest common substring

slide-48
SLIDE 48
  • Suffix tree of piespies$1piepiees$2.

Generalized suffix tree

[Figure: generalized suffix tree of piespies$1piepiees$2, with 18 leaves, one per suffix.]

slide-49
SLIDE 49
  • Suffix tree of piespies$1piepiees$2.
  • Mark a leaf with $1 if its suffix starts in S1, and with $2 if it starts in S2.

Generalized suffix tree

[Figure: the generalized suffix tree with every leaf marked $1 or $2.]

slide-52
SLIDE 52

  • Suffix tree of piespies$1piepiees$2.
  • Mark a leaf with $1 if its suffix starts in S1, and with $2 if it starts in S2.
  • Add string-depth: the length of the string spelled on the path from the root to a node.
  • The node of largest string-depth that has both $1 and $2 leaves below it spells the longest common substring.

Generalized suffix tree

[Figure: the generalized suffix tree with leaf marks, edge labels given as intervals in the concatenated string, and string-depths added.]

S[13,15] = “pie” is the longest common substring.

slide-53
SLIDE 53
  • Using a suffix tree we can solve the longest common substring problem in linear time (for a constant size alphabet).

Longest common substring
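The marking idea can be tried out on a simplified structure. The sketch below uses an uncompacted generalized suffix trie: every suffix of both strings is inserted, and each node records which string its suffixes came from. This takes quadratic time and space, so it does not achieve the linear bound; that requires a real suffix tree construction. The function name and the node layout are assumptions for this example. In a trie the string-depth of a node is simply its depth.

```python
def longest_common_substring(s1, s2):
    """Deepest trie node that lies on suffixes of both strings spells the answer."""
    root = {"marks": set(), "children": {}}

    # Insert all suffixes of both strings, marking every node on the way
    # with the string (1 or 2) the suffix starts in.
    for source, s in ((1, s1), (2, s2)):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["children"].setdefault(
                    ch, {"marks": set(), "children": {}}
                )
                node["marks"].add(source)

    # Walk only through nodes marked by both strings and keep the deepest path.
    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        for ch, child in node["children"].items():
            if child["marks"] == {1, 2}:
                if len(path) + 1 > len(best):
                    best = path + ch
                stack.append((child, path + ch))
    return best


print(longest_common_substring("piespies", "piepiees"))  # pie
```

When several common substrings share the maximum length, this sketch returns one of them.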