Tries and Suffix Trees
Inge Li Gørtz
String indexing problem
- String matching problem. Given strings T (text, of length n) and P (pattern, of length m) over an alphabet Σ,
report starting positions of all occurrences of P in T.
- Finite automaton: O(m|Σ| + n) time and space
- KMP: O(m + n) time and space
- String indexing problem. Given a string S of n characters from an alphabet Σ,
preprocess S into a data structure to support
- Search(P): Return the starting positions of all occurrences of P in S.
- Today: Data structure using O(n) space and supporting Search(P) in O(m + occ) time, where occ is the number of occurrences.
- Applications:
- Search engines, e.g. prefix searches.
- Finding common substrings of many biological strings
- Finding repeating substructures in biological strings
- Detecting DNA contamination
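For contrast with the indexed solution, here is a minimal baseline that answers Search(P) with no preprocessing at all. The function name `search_naive` and the 1-indexed positions are assumptions made for this sketch (1-indexing matches the later banana$ example). Every query has to read S again, so the cost grows with n rather than only with m.

```python
def search_naive(S: str, P: str) -> list[int]:
    """Return the 1-indexed starting positions of all occurrences of P in S."""
    positions = []
    start = S.find(P)                 # 0-indexed position of the first occurrence, or -1
    while start != -1:
        positions.append(start + 1)   # convert to 1-indexed
        start = S.find(P, start + 1)  # continue one position later, so overlaps are found
    return positions


print(search_naive("banana", "ana"))  # [2, 4]
```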
Outline
- Tries
- Compact tries
- Suffix trees
- Applications of suffix trees
Tries
- Text retrieval
- Trie over the strings: sells, by, the, sea, shells, tea.
[Figure: trie over sells, by, the, sea, shells, tea, with leaves labeled S1 to S6.]
- Prefix-free? Trie over the strings: sells, by, the, sea, shells, tea, she.
- she is a prefix of shells, so it would end at an internal node rather than at a leaf. Appending a terminator $ to every string makes the set prefix-free, so every string ends at its own leaf.
[Figure: trie over sells$, by$, the$, sea$, shells$, tea$, she$, with leaves labeled S1 to S7.]
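- Search for “sea” in the trie over the strings sells$, by$, the$, sea$, shells$, tea$, she$: starting at the root, follow the edges labeled s, e, a and finally $. The walk ends at leaf S4, so “sea” is in the set.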
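The search walk can be sketched in a few lines. This is an illustrative implementation, not the one from the lecture: the names `TrieNode`, `build_trie` and `search` are assumptions, children are kept in a Python dict, and leaves are labeled S1, S2, ... in the order the strings are given.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # character -> TrieNode
        self.label = None   # set on leaves only, e.g. "S4"


def build_trie(strings):
    """Insert every string followed by the terminator $ (assumed absent from the strings)."""
    root = TrieNode()
    for number, s in enumerate(strings, start=1):
        node = root
        for ch in s + "$":
            node = node.children.setdefault(ch, TrieNode())
        node.label = f"S{number}"
    return root


def search(root, pattern):
    """Return the leaf label of pattern, or None if pattern is not in the set."""
    node = root
    for ch in pattern + "$":
        node = node.children.get(ch)
        if node is None:      # no edge for this character: the search fails
            return None
    return node.label


words = ["sells", "by", "the", "sea", "shells", "tea", "she"]
root = build_trie(words)
print(search(root, "sea"))    # S4
print(search(root, "short"))  # None
```

Searching for "se" also returns None here, because the walk includes the terminator: "se" is only a prefix of stored strings, not a stored string itself.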
- Search for “short”: follow s, then h. The node reached has no child labeled o, so “short” is not in the set.
- Build a trie over the strings: by$, sells$, sea$.
[Figure: trie over by$, sells$, sea$, with leaves labeled S2, S1 and S4; sells$ and sea$ share the path s, e.]
- Properties of the trie. A trie T storing a collection S of s strings of total length n from
an alphabet of size d has the following properties:
- How many children can a node have?
- How many leaves does T have?
- What is the height of T?
- What is the number of nodes in T?
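One way to explore these questions is to measure the quantities on a small example. The sketch below assumes a trie stored as nested dicts and uses the hypothetical helper names `build_trie` and `trie_stats`; it is run on the three strings from the exercise above. For this input the node count (12) stays below n + 1, where n = 13 counts all characters including the terminators, and the number of leaves equals the number of strings because the set is prefix-free.

```python
def build_trie(strings):
    """Trie as nested dicts: each node maps a character to its child node."""
    root = {}
    for s in strings:
        node = root
        for ch in s + "$":            # $ is assumed not to occur in the strings
            node = node.setdefault(ch, {})
    return root


def trie_stats(root):
    """Count nodes and leaves, and find the height and the largest number of children."""
    nodes = leaves = height = max_children = 0
    stack = [(root, 0)]               # (node, depth) pairs still to visit
    while stack:
        node, depth = stack.pop()
        nodes += 1
        height = max(height, depth)
        max_children = max(max_children, len(node))
        if not node:                  # a node without children is a leaf
            leaves += 1
        for child in node.values():
            stack.append((child, depth + 1))
    return {"nodes": nodes, "leaves": leaves, "height": height, "max_children": max_children}


stats = trie_stats(build_trie(["by", "sells", "sea"]))
print(stats)  # {'nodes': 12, 'leaves': 3, 'height': 6, 'max_children': 2}
```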
- Search time: O(d) in each node (scanning its children) => O(dm) in total.
- O(m) if d is constant.
- If d is not constant, store the children of each node in a dictionary:
- Hashing: O(1) expected time per node
- Balanced BST: O(log d) time per node
- Time and space for a trie (for small/constant d):
- O(m) for searching for a string of length m.
- O(n) space.
- Preprocessing: O(n)
- Prefix search: return all words in the trie starting with “se”. Walk down along s and e, then report every leaf in the subtree below: S4 (sea) and S1 (sells).
- Time for prefix search: O(m) + time to report all occurrences. The reporting time could be large, since the subtree below the prefix may contain long chains of single-child nodes.
- Solution: compact tries.
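Prefix search on a plain trie can be sketched as follows. The nested-dict layout and the names `build_trie` and `prefix_search` are assumptions for illustration; each `$` edge leads to a leaf that stores the word itself. The second loop visits every node below the prefix, which is the part whose cost is not bounded by the number of reported words.

```python
def build_trie(strings):
    """Trie as nested dicts; the terminator edge "$" leads to a leaf holding the word."""
    root = {}
    for s in strings:
        node = root
        for ch in s:                  # the strings are assumed not to contain "$"
            node = node.setdefault(ch, {})
        node["$"] = s
    return root


def prefix_search(root, prefix):
    """Return all stored words that start with prefix, in sorted order."""
    node = root
    for ch in prefix:                 # O(m) walk down along the prefix
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:                      # traverse the whole subtree below the prefix
        current = stack.pop()
        for ch, child in current.items():
            if ch == "$":
                found.append(child)   # child is the stored word
            else:
                stack.append(child)
    return sorted(found)


root = build_trie(["sells", "by", "the", "sea", "shells", "tea", "she"])
print(prefix_search(root, "se"))  # ['sea', 'sells']
```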
Compact tries
- Compact trie: chains of nodes with a single child are merged into a single node.
[Figure: compact trie over sells$, by$, the$, sea$, shells$, tea$, she$. Below the root: by$ (S2), s and t. Below s: e, which leads to a$ (S4) and lls$ (S1), and he, which leads to $ (S7) and lls$ (S5). Below t: ea$ (S6) and he$ (S3).]
- Properties of the compact trie. A compact trie T storing a collection S of s strings of
total length n from an alphabet of size d has the following properties:
- Every internal node of T has at least 2 and at most d children.
- T has s leaves.
- The number of nodes in T is < 2s.
- Time and space for a compact trie (constant d):
- O(m) for searching for a string of length m.
- O(m + occ) for prefix search, where occ is the number of occurrences.
- O(s) space for the trie itself, in addition to the strings, if labels are stored by reference to the strings.
- Preprocessing: O(n)
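The merging step can be illustrated by compacting an ordinary trie after it has been built. This is a sketch under assumptions: tries are nested dicts, the helper names `build_trie`, `compact` and `count_nodes` are hypothetical, and labels are stored as explicit strings rather than by reference, so it shows the shape of the compact trie but not the O(s) space bound.

```python
def build_trie(strings):
    """Ordinary trie as nested dicts, with $ appended to every string."""
    root = {}
    for s in strings:
        node = root
        for ch in s + "$":
            node = node.setdefault(ch, {})
    return root


def compact(node):
    """Return a compact copy of the trie: keys are the merged labels (strings)."""
    result = {}
    for ch, child in node.items():
        label = ch
        while len(child) == 1:        # follow the chain of single-child nodes
            (next_ch, next_child), = child.items()
            label += next_ch
            child = next_child
        result[label] = compact(child)
    return result


def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.values())


words = ["sells", "by", "the", "sea", "shells", "tea", "she"]
compact_root = compact(build_trie(words))
print(sorted(compact_root))        # ['by$', 's', 't']
print(sorted(compact_root["s"]))   # ['e', 'he']
print(count_nodes(compact_root))   # 12
```

With s = 7 strings the compact trie has 7 leaves and 12 nodes in total, which is below 2s = 14.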
Suffix trees
- String indexing problem. Given a string S of characters from an alphabet Σ,
preprocess S into a data structure to support
- Search(P): Return the starting positions of all occurrences of P in S.
- Build a compact trie over all suffixes of S (suffix tree). Label each leaf with the
starting index of its suffix.
- Observation: An occurrence of P is a prefix of a suffix of S.
[Figure: an occurrence of P drawn as a prefix of a suffix of S. Example: P = ana.]
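The observation can be checked directly by testing every suffix. The sketch below is a brute-force illustration only, not the index itself; the function name `occurrences_via_suffixes` and the choice of S = banana for the example P = ana are assumptions, and positions are 1-indexed.

```python
def occurrences_via_suffixes(S, P):
    """P occurs at position i exactly when P is a prefix of the suffix starting at i."""
    return [i + 1 for i in range(len(S)) if S[i:].startswith(P)]


print(occurrences_via_suffixes("banana", "ana"))  # [2, 4]
```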
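- Suffix tree over the string banana$
[Figure: suffix tree of banana$ with leaves labeled 1 to 7. Below the root: $ (leaf 7), banana$ (leaf 1), a and na. Below a: $ (leaf 6) and na, which leads to $ (leaf 4) and na$ (leaf 2). Below na: $ (leaf 5) and na$ (leaf 3).]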
- Search for P: walk down from the root, matching the characters of P against the labels.
- Report the labels of all leaves below the final node.
- Example: find all occurrences of P = “an” in banana$. The search follows a and stops inside the label na below it. The leaves below that point are 4 and 2, so “an” occurs at positions 2 and 4.
- Store S and store node labels by reference to S: a label S[i..j] is stored as the pair [i,j], so each label takes O(1) space.
1 2 3 4 5 6 7
b a n a n a $
[Figure: suffix tree of banana$ with labels replaced by intervals: a = [2,2], na = [3,4], na$ = [5,7], $ = [7,7], banana$ = [1,7].]
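A small sketch can make the interval representation and the search concrete. It builds the tree by inserting the suffixes one at a time, which takes O(n^2) time in the worst case; this is a deliberately simple stand-in, not a linear-time construction. The names `Node`, `build_suffix_tree`, `search` and `edge_intervals` are assumptions, intervals are 1-indexed and inclusive as on the slides, and S is assumed to end with a terminator that occurs nowhere else.

```python
class Node:
    def __init__(self, start=0, end=0, leaf=None):
        self.start, self.end = start, end  # label into this node is S[start..end], 1-indexed
        self.leaf = leaf                   # starting position of the suffix (leaves only)
        self.children = {}                 # first character of a child's label -> Node


def build_suffix_tree(S):
    """Naive construction: insert each suffix, splitting a label at the first mismatch."""
    root, n = Node(), len(S)
    for i in range(1, n + 1):
        node, pos = root, i
        while True:
            child = node.children.get(S[pos - 1])
            if child is None:              # no matching child: add a new leaf
                node.children[S[pos - 1]] = Node(pos, n, leaf=i)
                break
            k = child.start
            while k <= child.end and S[k - 1] == S[pos - 1]:
                k, pos = k + 1, pos + 1
            if k > child.end:              # whole label matched: continue below child
                node = child
                continue
            mid = Node(child.start, k - 1) # mismatch inside the label: split it at k
            node.children[S[child.start - 1]] = mid
            child.start = k
            mid.children[S[k - 1]] = child
            mid.children[S[pos - 1]] = Node(pos, n, leaf=i)
            break
    return root


def search(root, S, P):
    """Return the sorted leaf labels below the point where P ends, or [] on a mismatch."""
    node, j = root, 0
    while j < len(P):
        child = node.children.get(P[j])
        if child is None:
            return []
        k = child.start
        while k <= child.end and j < len(P):
            if S[k - 1] != P[j]:
                return []
            k, j = k + 1, j + 1
        node = child
    result, stack = [], [node]
    while stack:                           # report all leaves below the final node
        current = stack.pop()
        if current.leaf is not None:
            result.append(current.leaf)
        stack.extend(current.children.values())
    return sorted(result)


def edge_intervals(node):
    out = []
    for child in node.children.values():
        out.append((child.start, child.end))
        out.extend(edge_intervals(child))
    return sorted(out)


S = "banana$"
tree = build_suffix_tree(S)
print(search(tree, S, "an"))  # [2, 4]
print(edge_intervals(tree))
```

For banana$ the intervals produced are [1,7], [2,2], [3,4] twice, [5,7] twice and [7,7] four times, which matches the labels listed for the figure.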
Suffix trees and common substrings
- Suffix tree of a string S: Compact trie over all suffixes of S.
- Space and time:
- Space: O(n)
- Search time: O(m) + time to report all occurrences = O(m+occ)
- Preprocessing: Can be done in O(sort(n,|Σ|)) time, where sort(n,|Σ|) is the time it
takes to sort n characters from an alphabet Σ.
- Suffix trees can be used to solve the String indexing problem in:
- Space: O(n)
- Search time: O(m+occ)
- Preprocessing: O(sort(n,|Σ|)) time
Suffix tree
Applications of suffix trees
Longest common substring
- Find the longest common substring of strings S1 and S2.
- Construct the suffix tree over the concatenation S1$1S2$2, where $1 and $2 are distinct terminators that occur in neither S1 nor S2.
- Example: find the longest common substring of piespies and piepiees by constructing the suffix tree of piespies$1piepiees$2.
Generalized suffix tree
- Suffix tree of S = piespies$1piepiees$2.
- Mark each leaf with $1 if its suffix starts in S1 and with $2 if it starts in S2.
- Add string-depth to each node. The node with the largest string-depth that has leaves of both kinds below it spells the longest common substring.
[Figure: generalized suffix tree of piespies$1piepiees$2 with 18 leaves, labels stored as intervals of S, leaf marks $1/$2, and string-depths at the internal nodes.]
- S[13,15] = “pie” is the longest common substring.
- Using a suffix tree we can solve the longest common substring problem in linear
time (for a constant size alphabet).
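The marking idea can be tried out on a simplified structure. The sketch below uses an uncompacted suffix trie instead of a suffix tree, so it needs quadratic time and space and does not achieve the linear bound; it only illustrates the marking and string-depth steps. The names `Node` and `longest_common_substring`, and the use of "#" and "$" as stand-ins for $1 and $2, are assumptions. Instead of marking leaves and propagating the marks upward, it marks every node on a suffix's path while inserting, which yields the same marks.

```python
class Node:
    def __init__(self):
        self.children = {}  # character -> Node
        self.marks = set()  # 1 if some suffix starting in S1 passes through, 2 for S2


def longest_common_substring(s1, s2, end1="#", end2="$"):
    """end1 and end2 are assumed to be distinct and to occur in neither input string."""
    text = s1 + end1 + s2 + end2
    root = Node()
    for i in range(len(text)):
        mark = 1 if i <= len(s1) else 2    # does the suffix start in s1 + end1 or later?
        node = root
        for ch in text[i:]:
            node = node.children.setdefault(ch, Node())
            node.marks.add(mark)
    # In an uncompacted trie the string-depth of a node is the length of its path,
    # so the answer is the longest path that stays on nodes carrying both marks.
    best, stack = "", [(root, "")]
    while stack:
        node, path = stack.pop()
        if len(path) > len(best):
            best = path
        for ch, child in node.children.items():
            if child.marks == {1, 2}:
                stack.append((child, path + ch))
    return best


print(longest_common_substring("piespies", "piepiees"))  # pie
```

A path that carries both marks cannot contain the first terminator, because no suffix starting in S2 contains it, so the reported string always lies inside both inputs.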