String Indexing for Patterns with Wildcards



  1. String Indexing for Patterns with Wildcards. Philip Bille, Inge Li Gørtz, Hjalte Wedel Vildhøj, and Søren Vind (Technical University of Denmark, DTU Informatics). SWAT 2012, Helsinki, July 6, 2012.

  2. String Indexing for Patterns with Wildcards: Problem Definition. Build an index for a string t ∈ Σ* that, given a query pattern p, can quickly report where p occurs in t. The pattern has the form p = p_0 ∗ p_1 ∗ ... ∗ p_j, where ∗ is a wildcard that matches any single character. Example: t = combinatorialpatternmatching (positions 1 to 28) and p = ∗at∗∗∗n. [Figure: p aligned under t at its two occurrences.]
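The problem statement can be checked against a brute-force matcher. The following is a minimal sketch with no index at all, assuming ASCII `*` stands in for the wildcard ∗ and positions are 1-indexed as on the slide; the function name `occurrences` is hypothetical and not from the talk.

```python
WILDCARD = "*"  # ASCII stand-in for the slide's wildcard symbol (assumption)

def occurrences(t, p):
    """Return the 1-indexed start positions where p matches t.

    A wildcard in p matches any single character of t.
    This is a naive O(n * m) scan, not an index.
    """
    m = len(p)
    return [
        i + 1
        for i in range(len(t) - m + 1)
        if all(pc == WILDCARD or pc == t[i + k] for k, pc in enumerate(p))
    ]

print(occurrences("combinatorialpatternmatching", "*at***n"))  # [14, 21]
```

The two reported positions correspond to the two alignments drawn on the slide. An index aims to answer the same query without rescanning t each time.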

  3. Two Simple Solutions: Suffix Tree Search. p = ∗na∗, t = bananas (positions 1 to 7). [Figure: the suffix tree of bananas, with each leaf labeled by the starting position of its suffix.]

  9. Two Simple Solutions: Suffix Tree Search. p = ∗na∗, t = bananas. [Figure: the suffix tree of bananas.] Time: O(σ^j m + occ). Space: O(n).
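The σ^j factor comes from branching into every child whenever the pattern has a wildcard. The sketch below illustrates that branching on an uncompressed suffix trie, which is simpler to write than a suffix tree but uses O(n^2) space instead of O(n). The names `Node`, `build_suffix_trie`, and `search` are hypothetical, and `*` stands in for ∗.

```python
class Node:
    def __init__(self):
        self.children = {}   # character -> Node
        self.starts = set()  # 1-indexed starts of the suffixes passing through

def build_suffix_trie(t):
    """Uncompressed trie of all suffixes of t (a toy stand-in for a suffix tree)."""
    root = Node()
    for i in range(len(t)):
        node = root
        node.starts.add(i + 1)
        for ch in t[i:]:
            node = node.children.setdefault(ch, Node())
            node.starts.add(i + 1)
    return root

def search(node, p):
    """Match p from node; on a wildcard, branch into every child."""
    if not p:
        return set(node.starts)
    head, rest = p[0], p[1:]
    if head == "*":
        found = set()
        for child in node.children.values():  # up to sigma branches
            found |= search(child, rest)
        return found
    child = node.children.get(head)
    return search(child, rest) if child is not None else set()

root = build_suffix_trie("bananas")
print(sorted(search(root, "*na*")))  # [2, 4]
```

Each of the j wildcards can multiply the number of active branches by up to σ, which is where the O(σ^j m + occ) query time on the slide comes from.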

  10. Two Simple Solutions: Simple Linear Time Index. [Figure: the suffix tree of bananas$, with each leaf labeled by the starting position of its suffix.]

  11. Two Simple Solutions: Simple Linear Time Index (slides 11 to 15). [Figure, built up in steps: the suffix tree of bananas$ is extended with additional subtrees hanging off ∗-labeled edges, nested over several levels; the last step shows the query p = ∗na∗.]

  16. Two Simple Solutions: Simple Linear Time Index. p = ∗na∗. [Figure: the suffix tree of bananas$ with its ∗-labeled subtrees.] Time: O(m + j + occ). Space: O(n^(k+1)).
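One way to read the figure is that every node gets an extra ∗-labeled edge leading to a merged copy of its children's subtrees, nested up to k levels, so that a query never has to branch. The sketch below follows that reading under several assumptions: k is taken to be the maximum number of wildcards a query may contain (the slide does not define it), an uncompressed trie is used instead of a suffix tree (so the space bound on the slide is not reproduced), no `$` terminator is appended, and all names are hypothetical.

```python
WILDCARD = "*"  # ASCII stand-in for the slide's wildcard symbol

class Node:
    def __init__(self):
        self.children = {}   # character -> Node
        self.starts = set()  # 1-indexed starts of the suffixes passing through

def build_suffix_trie(t):
    root = Node()
    for i in range(len(t)):
        node = root
        node.starts.add(i + 1)
        for ch in t[i:]:
            node = node.children.setdefault(ch, Node())
            node.starts.add(i + 1)
    return root

def merged_copy(nodes):
    """Return a fresh trie that is the union of the given subtries."""
    out = Node()
    for n in nodes:
        out.starts |= n.starts
    for ch in {c for n in nodes for c in n.children}:
        out.children[ch] = merged_copy(
            [n.children[ch] for n in nodes if ch in n.children])
    return out

def add_wildcard_trees(node, k):
    """Hang a '*' subtree off every node, allowing up to k wildcards per path."""
    regular = list(node.children.values())
    if k > 0 and regular:
        star = merged_copy(regular)   # copied before any '*' edges are added
        add_wildcard_trees(star, k - 1)
        for child in regular:
            add_wildcard_trees(child, k)
        node.children[WILDCARD] = star

def build_index(t, k):
    root = build_suffix_trie(t)
    add_wildcard_trees(root, k)
    return root

def query(root, p, k):
    """Walk p once with no branching; '*' in p follows the '*' edge."""
    if p.count(WILDCARD) > k:
        raise ValueError("pattern has more wildcards than the index supports")
    node = root
    for ch in p:
        node = node.children.get(ch)
        if node is None:
            return []
    return sorted(node.starts)

index = build_index("bananas", 2)
print(query(index, "*na*", 2))  # [2, 4]
```

The query is a single root-to-node walk, which matches the spirit of the O(m + j + occ) bound; the price is the duplicated subtrees, which is why the space grows with k.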

  17. The Longest Common Prefix Data Structure [1]: LCP Queries. Let C_i be a set of substrings of the indexed string. Consider the following query on the compressed trie T(C_i) storing the strings in C_i. LCP(x, i, ℓ): the location where the search for x ∈ Σ* stops when starting in location ℓ ∈ T(C_i). Example: x = angry and C_i = suff(bananas). [Figure: the trie T(C_i) for the suffixes of bananas, with the start location ℓ and the resulting location LCP(x, i, ℓ) marked.] [1] R. Cole, L. Gottlieb, and M. Lewenstein. Dictionary matching and indexing with errors and don't cares. Proc. 36th STOC, 2004.
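The semantics of an LCP query can be illustrated with a naive walk. This sketch is only meant to show what the query returns, not how the data structure from [1] answers it quickly: it uses an uncompressed trie, takes time proportional to the matched length, and represents a location as the string spelled from the root to that point (an assumption made here so that mid-edge locations need no special handling). The names `build_trie` and `lcp` are hypothetical.

```python
def build_trie(strings):
    """Uncompressed trie as nested dicts: character -> subtrie."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
    return root

def lcp(trie, x, location=""):
    """Return the location where the search for x stops when started at `location`.

    Locations are represented by the string spelled from the root.
    Raises KeyError if `location` is not in the trie.
    """
    node = trie
    for ch in location:
        node = node[ch]
    matched = []
    for ch in x:
        if ch not in node:
            break
        node = node[ch]
        matched.append(ch)
    return location + "".join(matched)

t = "bananas"
suffix_trie = build_trie(t[i:] for i in range(len(t)))
print(lcp(suffix_trie, "angry"))         # an
print(lcp(suffix_trie, "angry", "ban"))  # banan
```

Starting at the root, the search for angry follows a and n and then stops because no edge continues with g, which is the situation the slide's example depicts.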

  18. The Longest Common Prefix Data Structure [1]: An Application. Search for subpatterns in the suffix tree using the LCP data structure. First, build the LCP data structure for the suffix tree. Then, to search with a query pattern containing wildcards, search for complete subpatterns using LCP queries and branch on a wildcard as in the simple suffix tree solution.

  19. The Longest Common Prefix Data Structure [1]: An Application (continued). How fast can you answer an LCP query? One option is O(log log n) time and O(n log n) space, which gives an index with query time O(m + σ^j log log n + occ) and space O(n log n). We show that O(log n) time and O(n) space is also possible, which gives an index with query time O(m + σ^j log n + occ) and space O(n).
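The search strategy can be sketched as follows: split the pattern at its wildcards, match each subpattern with one LCP-style query per active location, and branch into all children at each wildcard. This is only an illustration under assumptions: the `lcp` function here is a naive character-by-character walk standing in for the fast LCP data structure, the tree is an uncompressed suffix trie, `*` stands in for ∗, and all names are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}   # character -> Node
        self.starts = set()  # 1-indexed starts of the suffixes passing through

def build_suffix_trie(t):
    root = Node()
    for i in range(len(t)):
        node = root
        node.starts.add(i + 1)
        for ch in t[i:]:
            node = node.children.setdefault(ch, Node())
            node.starts.add(i + 1)
    return root

def lcp(node, x):
    """Naive stand-in for an LCP query: (stop node, number of characters matched)."""
    matched = 0
    for ch in x:
        nxt = node.children.get(ch)
        if nxt is None:
            break
        node, matched = nxt, matched + 1
    return node, matched

def search(root, p):
    """One lcp call per subpattern and active location; branch at each wildcard."""
    frontier = [root]
    for idx, sub in enumerate(p.split("*")):
        if idx > 0:  # a wildcard precedes this subpattern: branch on all children
            frontier = [c for node in frontier for c in node.children.values()]
        survivors = []
        for node in frontier:
            stop, matched = lcp(node, sub)
            if matched == len(sub):  # the whole subpattern was found
                survivors.append(stop)
        frontier = survivors
    return sorted(set().union(*(node.starts for node in frontier)))

root = build_suffix_trie("bananas")
print(search(root, "*na*"))  # [2, 4]
```

With a real LCP data structure, the cost of each `lcp` call would no longer depend on the subpattern length in the same way, which is how the σ^j factor ends up multiplying the LCP query time rather than m.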


  21. Solution 1: An Unbounded Wildcard Index Using Linear Space. Query time: O(m + σ^j log log n + occ). Space usage: O(n).

  22. An Unbounded Wildcard Index Using Linear Space: ART Decomposition [2]. Definition: a bottom tree is a maximal subtree with at most log n leaves; the vertices not in a bottom tree constitute the top tree. Example: a tree with n = 16 leaves (log n = 4). [2] S. Alstrup, T. Husfeldt, and T. Rauhe. Marked ancestor problems. Proc. 39th FOCS, 1998.

  23. An Unbounded Wildcard Index Using Linear Space: ART Decomposition (slides 23 to 24). [Figure: the example tree with n = 16 leaves, with its bottom trees marked B1 through B9.]

  25. An Unbounded Wildcard Index Using Linear Space: ART Decomposition. Property: the top tree has O(n / log n) leaves.
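The definition can be turned into a short computation: count the leaves below every vertex, put vertices with more than log n leaves into the top tree, and root a bottom tree at every vertex with at most log n leaves whose parent is in the top tree. The sketch below assumes a base-2 logarithm and a tree given as a child-list dictionary; the example is a complete binary tree with 16 leaves, which is not the tree drawn on the slide, and the function name is hypothetical.

```python
import math

def art_decomposition(children, root):
    """Split a rooted tree into a top tree and the roots of its bottom trees.

    children: dict mapping a vertex to the list of its children.
    Returns (top, bottom_roots).
    """
    order, parent, stack = [], {root: None}, [root]
    while stack:                      # iterative DFS; parents precede children
        v = stack.pop()
        order.append(v)
        for c in children.get(v, []):
            parent[c] = v
            stack.append(c)
    leaves = {}
    for v in reversed(order):         # children are counted before parents
        kids = children.get(v, [])
        leaves[v] = 1 if not kids else sum(leaves[c] for c in kids)
    limit = math.log2(leaves[root])   # log n, with n the number of leaves
    top = {v for v in order if leaves[v] > limit}
    bottom_roots = {v for v in order
                    if leaves[v] <= limit and parent[v] in top}
    return top, bottom_roots

# Complete binary tree with 16 leaves, vertices numbered 1..31 heap-style.
tree = {v: [2 * v, 2 * v + 1] for v in range(1, 16)}
top, bottom_roots = art_decomposition(tree, 1)
print(sorted(top), sorted(bottom_roots))  # [1, 2, 3] [4, 5, 6, 7]
```

The property on the slide follows from a counting argument: every leaf of the top tree has more than log n leaves of the original tree below it, and these sets are disjoint, so there are fewer than n / log n such top-tree leaves. In the example the top tree has 2 leaves (vertices 2 and 3), below the bound of 16 / 4 = 4.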
