SLIDE 1

String Indexing for Patterns with Wildcards

Philip Bille¹, Inge Li Gørtz¹, Hjalte Wedel Vildhøj¹, and Søren Vind¹

¹ Technical University of Denmark, DTU Informatics

SWAT 2012, Helsinki, July 6, 2012

SLIDE 2

String Indexing for Patterns with Wildcards

Problem Definition

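Build an index for a string t ∈ Σ∗ that, given a query pattern p, can quickly report where p occurs in t. The pattern consists of subpatterns p0, . . . , pj ∈ Σ∗ separated by wildcards (∗), each of which matches any single character:

p = p0 ∗ p1 ∗ . . . ∗ pj

Throughout, n is the length of t, p contains m characters and j wildcards, σ is the alphabet size, and occ is the number of occurrences of p in t.

Example

t = combinatorialpatternmatching
p = ∗at∗∗∗n

[Figure: t written out with positions 1 to 28, with the two occurrences of p aligned beneath it.]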
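To make the problem statement concrete, here is a minimal sketch that answers the example query by brute force, without any index. The function name `wildcard_occurrences` and the use of ASCII `*` for the wildcard are assumptions made for illustration; positions are 1-based as on the slide.

```python
def wildcard_occurrences(t, p, wildcard="*"):
    """Return the 1-based start positions where p matches t.

    A wildcard in p matches any single character of t.
    This scans every alignment, so it takes O(n * |p|) time.
    """
    n, length = len(t), len(p)
    return [
        i + 1
        for i in range(n - length + 1)
        if all(pc == wildcard or pc == t[i + offset] for offset, pc in enumerate(p))
    ]


print(wildcard_occurrences("combinatorialpatternmatching", "*at***n"))  # [14, 21]
```

The two reported positions correspond to the substrings "pattern" and "matchin". An index aims to answer the same query without scanning all of t.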

SLIDE 9

Two Simple Solutions

Suffix Tree Search

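p = ∗na∗
t = bananas

[Figure: suffix tree of t = bananas, with leaves labeled by the suffix start positions 1 to 7. The search for p branches to every child when it reaches a wildcard.]

Time: $O(\sigma^j m + occ)$ Space: $O(n)$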
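The branching search can be sketched as follows. This is an illustration only: it uses an uncompressed suffix trie that stores start positions at every node, which takes quadratic space, rather than the linear-space suffix tree from the slide. The names `TrieNode`, `build_suffix_trie` and `search` are hypothetical, and the wildcard is written as ASCII `*`.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # next character -> TrieNode
        self.starts = []    # 1-based start positions of suffixes passing through


def build_suffix_trie(t):
    root = TrieNode()
    for start in range(len(t)):
        node = root
        for ch in t[start:]:
            node = node.children.setdefault(ch, TrieNode())
            node.starts.append(start + 1)
    return root


def search(root, p, wildcard="*"):
    """Follow p from the root, branching to all children on a wildcard."""
    frontier = [root]
    for pc in p:
        if pc == wildcard:
            # Up to sigma branches per frontier node, hence the sigma^j factor.
            frontier = [c for v in frontier for c in v.children.values()]
        else:
            frontier = [v.children[pc] for v in frontier if pc in v.children]
    return sorted(s for v in frontier for s in v.starts)


root = build_suffix_trie("bananas")
print(search(root, "*na*"))  # [2, 4]
```

Each wildcard can multiply the number of active search paths by up to σ, which is where the $\sigma^j$ factor in the query time comes from.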

SLIDE 16

Two Simple Solutions

Simple Linear Time Index

[Figure: the suffix tree of bananas$ in which every internal node has an additional ∗-edge leading to a wildcard tree. The wildcard trees are nested, one level per wildcard.]

p = ∗na∗

Time: $O(m + j + occ)$ Space: $O(n^{k+1})$
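A rough sketch of the idea, under simplifying assumptions: the index below is an uncompressed trie in which every node also has a wildcard edge that skips one character of each suffix below it, nested up to k times. The names `WildcardNode`, `build_index` and `query` are hypothetical, the wildcard is written as ASCII `*`, and no attempt is made to match the space bound on the slide.

```python
class WildcardNode:
    """Trie node built from (remaining suffix, start position) pairs."""

    def __init__(self, items, k):
        # Start positions of all suffixes that reach this node.
        self.labels = [start for _, start in items]
        groups = {}
        for s, start in items:
            if s:
                groups.setdefault(s[0], []).append((s[1:], start))
        # Ordinary edges: one child per distinct next character.
        self.children = {ch: WildcardNode(g, k) for ch, g in groups.items()}
        # Wildcard edge: drop the next character of every suffix.
        # It is only built while wildcards remain in the budget k.
        rest = [(s[1:], start) for s, start in items if s]
        self.wild = WildcardNode(rest, k - 1) if k > 0 and rest else None


def build_index(t, k):
    return WildcardNode([(t[i:], i + 1) for i in range(len(t))], k)


def query(root, p, wildcard="*"):
    """Walk one edge per pattern symbol; no branching is needed."""
    node = root
    for pc in p:
        node = node.wild if pc == wildcard else node.children.get(pc)
        if node is None:
            return []
    return sorted(node.labels)


index = build_index("bananas", k=2)
print(query(index, "*na*"))  # [2, 4]
```

The query follows a single path, so its cost no longer depends on σ. The price is space: every wildcard level duplicates suffixes, and patterns with more than k wildcards are not supported (here they simply return no matches).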

SLIDE 17

The Longest Common Prefix Data Structure¹

LCP Queries

Let Ci be a set of substrings of the indexed string. Consider the following query on the compressed trie T(Ci) storing the strings in Ci.

LCP(x, i, ℓ): The location where the search for x ∈ Σ∗ stops when starting in location ℓ ∈ T(Ci).

Example: x = angry and Ci = suff(bananas).

[Figure: the compressed trie T(Ci) of the suffixes of bananas, with the location LCP(x, i, ℓ) marked.]

¹ R. Cole, L. Gottlieb, and M. Lewenstein. Dictionary matching and indexing with errors and don't cares. Proc. 36th STOC, 2004.
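To illustrate what an LCP query returns (not how the data structure of Cole et al. answers it quickly), here is a naive sketch on an uncompressed trie represented as nested dictionaries. The names `build_trie` and `lcp` are hypothetical, and a "location" is simply a trie node in this simplification.

```python
def build_trie(strings):
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
    return root


def lcp(location, x):
    """Walk x downward from location, one character at a time.

    Returns the node where the search stops and the number of matched
    characters. This takes O(|x|) time per query; the data structure on
    the slide exists precisely to avoid this character-by-character walk.
    """
    matched = 0
    for ch in x:
        if ch not in location:
            break
        location = location[ch]
        matched += 1
    return location, matched


t = "bananas"
trie = build_trie(t[i:] for i in range(len(t)))
stop, matched = lcp(trie, "angry")
print(matched)       # 2
print(sorted(stop))  # ['a']
```

Starting at the root, the search for "angry" matches "an" and then stops, because the only continuation of "an" in the suffixes of bananas is "a".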

SLIDE 19

The Longest Common Prefix Data Structure¹

An Application

Search for subpatterns in the suffix tree using the LCP data structure:

◮ Build the LCP data structure for the suffix tree.
◮ Search with a query pattern containing wildcards:
  ◮ Search for complete subpatterns using LCP queries.
  ◮ Branch on a wildcard as in the simple suffix tree solution.

How fast can you answer an LCP query?

◮ $O(\log\log n)$ time and $O(n \log n)$ space.
  ⇒ Index with query time $O(m + \sigma^j \log\log n + occ)$ and space $O(n \log n)$.
◮ We show that you can also do $O(\log n)$ time and $O(n)$ space.
  ⇒ Index with query time $O(m + \sigma^j \log n + occ)$ and space $O(n)$.

¹ R. Cole, L. Gottlieb, and M. Lewenstein. Dictionary matching and indexing with errors and don't cares. Proc. 36th STOC, 2004.

SLIDE 21

SOLUTION 1

An Unbounded Wildcard Index Using Linear Space

Query Time: $O(m + \sigma^j \log\log n + occ)$ Space Usage: $O(n)$

SLIDE 25

An Unbounded Wildcard Index Using Linear Space

ART Decomposition²

Definition:

◮ A bottom tree is a maximal subtree with at most log n leaves.
◮ Vertices not in a bottom tree constitute the top tree.

Example: A tree with n = 16 leaves (log n = 4).

[Figure: the example tree with its bottom trees B1, . . . , B9 outlined.]

Property: The top tree has $O(n / \log n)$ leaves.

² S. Alstrup, T. Husfeldt, and T. Rauhe. Marked ancestor problems. Proc. 39th FOCS, 1998.
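The definition can be sketched in a few lines. This is a minimal illustration, assuming a rooted tree given as a child-list dictionary and a base-2 logarithm; the function name `art_decomposition` is hypothetical. The example uses a complete binary tree with 16 leaves, which is not the tree drawn on the slide.

```python
import math


def art_decomposition(children, root, n_leaves):
    """Split a rooted tree into top-tree vertices and bottom-tree roots.

    children maps each internal vertex to its list of children;
    leaves are simply absent from the mapping.
    """
    limit = max(1, int(math.log2(n_leaves)))
    leaves = {}

    def count(v):
        kids = children.get(v, [])
        leaves[v] = 1 if not kids else sum(count(c) for c in kids)
        return leaves[v]

    count(root)
    top, bottom_roots = [], []
    stack = [root]
    while stack:
        v = stack.pop()
        if leaves[v] <= limit:
            # Maximal subtree with at most log n leaves: a bottom tree.
            bottom_roots.append(v)
        else:
            top.append(v)
            stack.extend(children.get(v, []))
    return top, bottom_roots


# Complete binary tree in heap numbering: vertices 1..31, leaves 16..31.
children = {v: [2 * v, 2 * v + 1] for v in range(1, 16)}
top, bottom_roots = art_decomposition(children, root=1, n_leaves=16)
print(sorted(top))           # [1, 2, 3]
print(sorted(bottom_roots))  # [4, 5, 6, 7]
```

In this example the top tree has two leaves (vertices 2 and 3), and each of the four bottom trees has exactly log n = 4 leaves.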

SLIDE 26

An Unbounded Wildcard Index Using Linear Space

Obtaining the Index

◮ Use the ART decomposition to decompose the suffix tree into a number of logarithmic-sized bottom trees and a single top tree containing $O(n / \log n)$ leaves.
◮ Store the top and bottom trees in LCP data structures.
◮ On the top tree T′: Add support for $O(\log\log n)$ time LCP queries using the method by Cole et al.³
  ◮ This requires space $O(|T'| \log |T'|) = O\left(\frac{n}{\log n} \log \frac{n}{\log n}\right) = O(n)$.
◮ On the bottom trees T(C1), . . . , T(Cq): Add support for $O(\log n)$ time LCP queries using our new method.
  ◮ This requires $O\left(\sum_{i=1}^{q} |C_i|\right) = O(n)$ space.
  ◮ The query time becomes $O(\log |C_i|) = O(\log\log n)$.

This gives an unbounded wildcard index using $O(n)$ space with query time $O(m + \sigma^j \log\log n + occ)$.

³ R. Cole, L. Gottlieb, and M. Lewenstein. Dictionary matching and indexing with errors and don't cares. Proc. 36th STOC, 2004.
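The two bounds used above can be spelled out as follows. This is a short worked step, assuming (as in the ART decomposition) that each bottom tree stores at most log n strings, one per leaf.

```latex
\begin{align*}
|T'| \log |T'|
  &= O\!\left(\frac{n}{\log n} \cdot \log \frac{n}{\log n}\right)
   = O\!\left(\frac{n}{\log n} \cdot (\log n - \log\log n)\right)
   = O(n), \\
\log |C_i| &\le \log(\log n) = \log\log n .
\end{align*}
```

So the $O(n \log n)$-space method becomes affordable on the smaller top tree, while the slower $O(\log n)$-time method becomes fast on the small bottom trees.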

SLIDE 27

SOLUTION 2

A Time-Space Trade-Off for k-Bounded Wildcard Indexes

Query Time: $O(m + \beta^j \log\log n + occ)$ Space Usage: $O(n \log(n) \log_\beta^{k-1} n)$

SLIDE 31

A Time-Space Trade-Off for Bounded Wildcard Indexes

General Idea

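Reduce the branching factor of the suffix tree search from σ to β by creating wildcard trees.

Query time: $O(m + \beta^j \log\log n + occ)$ when using the LCP data structure.

[Figure: a node v in $T_\beta^0(C)$, labeled with β − 1 and lightstrings(v), has a ∗-edge leading to the wildcard tree $T_\beta^{k-1}(\mathrm{suff}_2(\mathrm{lightstrings}(v)))$. The nested levels are labeled $T_\beta^0(C)$, $T_\beta^1(C)$, . . . , $T_\beta^k(C)$.]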

SLIDE 32

A Time-Space Trade-Off for Bounded Wildcard Indexes

Analysing the Space Usage

[Figure: the levels of the index. The top level holds n strings, the next level at most $n \log_\beta n$ strings, and level k at most $n \log_\beta^k n$ strings.]

Each string x in T(C) gives rise to at most lightdepth(x) ≤ $\log_\beta n$ strings on the next level. So the number of strings in a k-level index is at most

$$\sum_{i=0}^{k} n \log_\beta^i n = O(n \log_\beta^k n).$$

By using the LCP data structure to support LCP queries on every subtrie, we obtain a k-bounded wildcard index with query time $O(m + \beta^j \log\log n + occ)$ using space $O(n \log(n) \log_\beta^{k-1} n)$.
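The sum over the levels is geometric, which is why the last level dominates. A brief worked step, writing $L = \log_\beta n$ and assuming $L \ge 2$ (that is, $n \ge \beta^2$):

```latex
\[
\sum_{i=0}^{k} n L^i
  = n \cdot \frac{L^{k+1} - 1}{L - 1}
  \le 2 n L^k
  = O(n \log_\beta^k n),
\qquad L = \log_\beta n \ge 2 .
\]
```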

SLIDE 33

SOLUTION 3

A k-Bounded Wildcard Index with Linear Query Time

Query Time: $O(m + j + occ)$ Space Usage: $O(n \sigma^{k^2} \log^k \log n)$

SLIDE 34

A k-Bounded Wildcard Index with Linear Query Time

General Idea

Consider the previously described unbounded wildcard index A with

◮ linear space usage, and
◮ query time $O(m + \sigma^j \log\log n + occ)$.

Suppose the pattern is restricted to contain a maximum of k wildcards.

◮ If $m + j > \sigma^k \log\log n \ge \sigma^j \log\log n$ (i.e., the query pattern is long), the query time becomes linear: $O(m + j + occ)$.
◮ If $m + j \le \sigma^k \log\log n$, we query a special wildcard index B for short patterns with query time $O(m + j + occ)$.

In any case the query time is $O(m + j + occ)$. The space used by the index is $O(|A| + |B|)$.
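The case distinction amounts to a simple dispatch on the pattern length. Below is a minimal sketch of that decision only; the function name `choose_index`, the base-2 logarithms and the constant factor of 1 in the threshold are assumptions for illustration, and the indexes A and B themselves are not implemented here.

```python
import math


def choose_index(m, j, k, sigma, n):
    """Pick the index to query for a pattern with m characters and j wildcards.

    Returns "A" (the unbounded linear-space index) for long patterns and
    "B" (the special index for short patterns) otherwise.
    """
    if j > k:
        raise ValueError("pattern has more than k wildcards")
    threshold = sigma ** k * math.log2(math.log2(n))
    return "A" if m + j > threshold else "B"


# With sigma = 4, k = 2 and n = 2**16 the threshold is 16 * 4 = 64.
print(choose_index(m=100, j=2, k=2, sigma=4, n=2**16))  # A
print(choose_index(m=10, j=2, k=2, sigma=4, n=2**16))   # B
```

For long patterns the $\sigma^j \log\log n$ term of index A is absorbed by m + j, so both branches answer in time linear in the pattern length plus occ.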

SLIDE 35

A k-Bounded Wildcard Index with Linear Query Time

A Special Index for Patterns Shorter than $\sigma^k \log\log n$

[Figure: the index $T_1^k(\mathrm{pref}_G(C))$ built on top of T(C), with k levels of wildcard trees, where $G = \sigma^k \log\log n$.]

$T_1^0(\mathrm{pref}_G(C))$ contains at most n strings. Consider a string x in one of the subtries. At most |x| ≤ G suffixes of x appear in tries on the next level. Consequently, the number of strings in $T_1^k(\mathrm{pref}_G(C))$ is bounded by

$$\sum_{i=0}^{k} n G^i = O(n (\sigma^k \log\log n)^k) = O(n \sigma^{k^2} \log^k \log n).$$

Result: A k-bounded wildcard index with linear query time $O(m + j + occ)$ using space $O(n \sigma^{k^2} \log^k \log n)$.

SLIDE 37

Conclusions

◮ Three new solutions for string indexing for patterns with wildcards:
  ◮ The fastest linear space index.
  ◮ A trade-off for k-bounded wildcard indexes.
  ◮ The first non-trivial linear time index.
◮ All solutions generalize to string indexing for patterns with variable length gaps.

Thank you!
