String Indexing for Patterns with Wildcards
Philip Bille, Inge Li Gørtz, Hjalte Wedel Vildhøj, and Søren Vind
Technical University of Denmark, DTU Informatics
SWAT 2012, Helsinki, July 6, 2012
1 / 37
[Figure: the text t = combinatorialpatternmatching with its positions 1 to 28, and below it two aligned copies of the query pattern ∗at∗∗∗n, where ∗ is a wildcard.]
2 / 37
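To make the matching semantics concrete for the text combinatorialpatternmatching and the pattern ∗at∗∗∗n, here is a minimal brute-force sketch in Python. It only illustrates what an occurrence is and is not the index from the talk; the function name `wildcard_occurrences` and the use of ASCII `*` in place of ∗ are assumptions made for this sketch.

```python
def wildcard_occurrences(text, pattern, wildcard="*"):
    """Return the 1-based start positions where pattern matches text.

    A wildcard character in the pattern matches any single text character.
    """
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        window = text[start:start + m]
        if all(p == wildcard or p == c for p, c in zip(pattern, window)):
            hits.append(start + 1)
    return hits


text = "combinatorialpatternmatching"
print(wildcard_occurrences(text, "*at***n"))  # [14, 21]
```

The two matches are the substrings "pattern" (position 14) and "matchin" (position 21). This scan takes O(nm) time per query, which is the cost an index is meant to avoid.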
[Figure: a tree over text positions 1 2 3 4 5 6 7 whose leaves, from left to right, are 2, 4, 6, 1, 3, 5, 7.]
9 / 37
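The leaf order 2, 4, 6, 1, 3, 5, 7 agrees with the lexicographic order of the suffixes of bananas, the string that appears in the suffix tree slides that follow. A small Python check, assuming that this is indeed the string behind the figure (the helper name `suffix_order` is made up for this sketch):

```python
def suffix_order(text):
    """1-based start positions of the suffixes of text, sorted lexicographically."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


print(suffix_order("bananas"))  # [2, 4, 6, 1, 3, 5, 7]
```

Sorting whole suffixes like this is quadratic in the worst case; it is meant only to confirm the order of the leaves, not as a construction algorithm.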
[Figure: suffix tree of bananas$; edge labels, with leaf numbers in parentheses.]
root
  a
    na
      nas$ (2)
      s$ (4)
    s$ (6)
  bananas$ (1)
  na
    nas$ (3)
    s$ (5)
  s$ (7)
11 / 37
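The suffix tree of bananas$ shown on these slides can be reproduced with a naive construction. The sketch below first builds an uncompacted trie of all suffixes and then merges unary paths into single edges. It is a quadratic-time illustration, not the construction used in the talk, and the names `suffix_trie` and `compact` are assumptions of this sketch. Unlike the figure, it also keeps the leaf for the suffix "$" (position 8).

```python
def suffix_trie(text):
    """Uncompacted trie of all suffixes; text must end in a unique terminator."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1  # 1-based start position of this suffix
    return root


def compact(node):
    """Merge chains of single-child nodes into one edge with a string label.

    Internal nodes become dicts {edge_label: subtree}; leaves become ints.
    """
    if "leaf" in node:
        return node["leaf"]
    out = {}
    for ch, child in node.items():
        label = ch
        while "leaf" not in child and len(child) == 1:
            (nxt, child), = child.items()
            label += nxt
        out[label] = compact(child)
    return out


tree = compact(suffix_trie("bananas$"))
print(tree["a"])   # {'na': {'nas$': 2, 's$': 4}, 's$': 6}
print(tree["na"])  # {'nas$': 3, 's$': 5}
```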
[Figure: the suffix tree of bananas$ augmented with additional subtrees reached by wildcard (∗) edges, nested up to two levels (∗ ∗).]
16 / 37
LCP(x, i, ℓ)
Cole, Gottlieb, Lewenstein. Dictionary matching and indexing with errors and don’t cares. Proc. 36th STOC, 2004.
17 / 37
◮ Build the LCP data structure for the suffix tree.
◮ Search with a query pattern containing wildcards:
  ◮ Search for complete subpatterns using LCP queries.
  ◮ Branch on a wildcard as in the simple suffix tree solution.
◮ An LCP query takes O(log log n) time; the data structure uses O(n log n) space.
◮ We show that O(log n) query time with O(n) space is also possible.
20 / 37
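The branching step can be sketched in Python on a plain suffix trie. This is a simplified illustration under stated assumptions: it walks subpatterns one character at a time instead of using LCP queries, it uses an uncompacted trie, and the names `suffix_trie`, `leaves` and `wildcard_search` are made up for this sketch. Only the handling of a wildcard (follow every outgoing edge) mirrors the slide.

```python
def suffix_trie(text):
    """Uncompacted trie of all suffixes; text must end in a unique terminator."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1  # 1-based start position of this suffix
    return root


def leaves(node):
    """Yield the suffix start positions stored below node."""
    for key, child in node.items():
        if key == "leaf":
            yield child
        else:
            yield from leaves(child)


def wildcard_search(root, pattern, wildcard="*", terminator="$"):
    """Start positions of all matches of pattern, found by walking the trie."""
    frontier = [root]
    for p in pattern:
        next_frontier = []
        for node in frontier:
            if p == wildcard:
                # Branch: follow every outgoing character except the terminator.
                next_frontier.extend(
                    child for key, child in node.items()
                    if key not in ("leaf", terminator)
                )
            elif p in node:
                next_frontier.append(node[p])
        frontier = next_frontier
    return sorted(pos for node in frontier for pos in leaves(node))


trie = suffix_trie("bananas$")
print(wildcard_search(trie, "a*a"))    # [2, 4]
print(wildcard_search(trie, "*a*as"))  # [3]
```

Each wildcard can multiply the number of active trie nodes by up to the alphabet size. In this sketch the walk between wildcards is character by character; the slides replace that part with LCP queries.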
21 / 37
◮ A bottom tree is a maximal subtree with at most log n leaves.
◮ Vertices not in a bottom tree constitute the top tree.
◮ The top tree has O(n / log n) leaves.
[Figure: a tree partitioned into a top tree and bottom trees B1, B2, ..., B9.]
Alstrup, Husfeldt, Rauhe. Marked ancestor problems. Proc. 39th FOCS, 1998.
25 / 37
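The definition of bottom trees and the top tree can be turned into a short Python sketch. It follows only the wording on the slide (a bottom tree is a maximal subtree with at most a given number of leaves) and is not taken from the cited paper; the function name `art_decomposition`, the dictionary representation of the tree, and the parameter `max_leaves` are assumptions of this sketch.

```python
import math


def art_decomposition(children, root, max_leaves):
    """Split a rooted tree into top-tree vertices and bottom-tree roots.

    children maps a vertex to the list of its children (leaves may be absent).
    A bottom tree is rooted at a vertex with at most max_leaves leaves below it
    whose parent (if any) has more than max_leaves leaves below it.
    """
    leaf_count = {}

    def count(v):
        kids = children.get(v, [])
        leaf_count[v] = 1 if not kids else sum(count(c) for c in kids)
        return leaf_count[v]

    count(root)
    top, bottom_roots = set(), []
    stack = [root]
    while stack:
        v = stack.pop()
        if leaf_count[v] <= max_leaves:
            bottom_roots.append(v)  # maximal: we only get here via a top vertex
        else:
            top.add(v)
            stack.extend(children.get(v, []))
    return top, bottom_roots


# Complete binary tree with n = 8 leaves, numbered heap-style 1..15.
children = {v: [2 * v, 2 * v + 1] for v in range(1, 8)}
top, bottoms = art_decomposition(children, 1, int(math.log2(8)))
print(sorted(top), sorted(bottoms))  # [1, 2, 3] [4, 5, 6, 7]
```

In this example the top tree has two leaves (vertices 2 and 3), which is within the n / log n = 8 / 3 bound on the number of top-tree leaves.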
◮ Use the ART decomposition to decompose the suffix tree into a top tree T′ with O(n / log n) leaves and bottom trees T(C1), ..., T(Cq).
◮ Store the top and bottom trees in LCP data structures.
◮ On the top tree T′: Add support for O(log log n) time LCP queries.
  ◮ This requires space O(|T′| log |T′|) = O((n / log n) · log(n / log n)) = O(n).
◮ On the bottom trees T(C1), ..., T(Cq): Add support for O(log n) time LCP queries.
  ◮ This requires O(Σ_{i=1..q} |Ci|) = O(n) space.
  ◮ The query time becomes O(log |Ci|) = O(log log n).
26 / 37
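The space and time bounds on this slide follow from two short calculations, written out below in LaTeX. The derivation assumes, as the slides state, that the top tree has size |T′| = O(n / log n) and that every bottom tree contains at most log n leaves; the remark that the bottom trees together hold at most n leaves is added here for completeness.

```latex
% Top tree: the O(n log n)-space LCP structure applied to a tree of size O(n / log n).
\[
  |T'| \log |T'|
  = O\!\left(\frac{n}{\log n}\,\log\frac{n}{\log n}\right)
  \le O\!\left(\frac{n}{\log n}\,\log n\right)
  = O(n).
\]
% Bottom trees: the linear-space structure, summed over disjoint trees.
\[
  \sum_{i=1}^{q} |C_i| \le n
  \quad\Longrightarrow\quad
  O\!\left(\sum_{i=1}^{q} |C_i|\right) = O(n).
\]
% Query time in a bottom tree with at most log n leaves.
\[
  |C_i| \le \log n
  \quad\Longrightarrow\quad
  O(\log |C_i|) = O(\log\log n).
\]
```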
[Figure: construction of the wildcard trees T^0_β(C), T^1_β(C), ..., T^k_β(C); labels in the figure: β − 1, lightstrings(v), and T^{k−1}_β(suff2(lightstrings(v))).]
31 / 37
… = O(n log^k_β n).
32 / 37
33 / 37
◮ linear space usage, and
◮ query time O(m + σ^j log log n + occ).
◮ If m + j > σ^k log log n > σ^j log log n (i.e., the query pattern is long) …
◮ If m + j ≤ σ^k log log n, we query a special wildcard index B for …
34 / 37
T^k_1(pref_G(C))
[…]_1(pref_G(C)) contains at most n strings. Consider a string x in one of the …
[…]_1(pref_G(C)) is bounded by k …
35 / 37
◮ Three new solutions for string indexing for patterns with wildcards:
  ◮ The fastest linear space index.
  ◮ A trade-off for k-bounded wildcard indexes.
  ◮ The first non-trivial linear time index.
◮ All solutions generalize to string indexing for patterns with …
37 / 37