Order-Preserving Incomplete Suffix Trees and Order-Preserving Indexes

  1. Order-Preserving Incomplete Suffix Trees and Order-Preserving Indexes
     Maxime Crochemore (3, 5), Costas S. Iliopoulos (3, 4), Tomasz Kociumaka (1), Marcin Kubica (1), Alessio Langiu (3), Solon P. Pissis (3, 4), Jakub Radoszewski (1), Wojciech Rytter (1, 2), Tomasz Waleń (1)
     (1) University of Warsaw, Warsaw, Poland
     (2) Copernicus University, Toruń, Poland
     (3) King's College London, London, UK
     (4) University of Western Australia, Perth, Australia
     (5) Université Paris-Est, France
     SPIRE 2013, 2013-10-09

  5. Order-preserving model
     Relation ≈: two words $x$ and $y$ are called order-isomorphic, written $x \approx y$, iff $|x| = |y|$ and for all $i, j$ we have $x_i \le x_j \iff y_i \le y_j$.
     Example: 1 3 2 4 ≈ 2 6 3 8, but 2 6 3 8 ≉ 3 7 4 5. For $x$ = 2 6 3 8 and $y$ = 3 7 4 5 there are positions $i$, $j$ (here $i = 2$, $j = 4$) with $x_i < x_j$ but $y_i > y_j$.
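The definition can be checked directly by comparing all pairs of positions. Below is a minimal Python sketch of that brute-force test; the function name `order_isomorphic` is an assumption made for illustration and does not come from the slides.

```python
def order_isomorphic(x, y):
    """Return True iff x and y are order-isomorphic (quadratic check by definition)."""
    if len(x) != len(y):
        return False
    n = len(x)
    # x_i <= x_j must hold exactly when y_i <= y_j, for every pair (i, j).
    return all((x[i] <= x[j]) == (y[i] <= y[j])
               for i in range(n) for j in range(n))


print(order_isomorphic([1, 3, 2, 4], [2, 6, 3, 8]))  # True
print(order_isomorphic([2, 6, 3, 8], [3, 7, 4, 5]))  # False
```

The quadratic cost is acceptable only for illustration; the encoding introduced on the following slides replaces this pairwise test by a plain equality test.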

  6. Applications
     Motivation:
     - melody matching of two musical scores,
     - recognition of trends in the stock market,
     - = is boring, ≈ has a nice combinatorial definition.
     Related problems:
     - suffix trees for quasi-suffix families,
     - pattern avoidance (as subsequences, not as subwords!),
     - parameterized matching,
     - partial words.

  7. Previous results
     Pattern matching in the order-preserving model: for a pattern of length $m$ and a text of length $n$, detect all order-preserving occurrences of the pattern in the text.
     Known results:
     - single pattern matching: $O(n + m)$, Kubica et al., IPL 2013,
     - multiple pattern matching: $O(n + M)$, Kim et al., arXiv 2013,
     - pattern matching with $k$ mismatches: $O(n (\log\log m + k \log\log k))$, Gawrychowski and Uznański, arXiv 2013.

  8. Our results
     Problem: preprocess a text $w$ of length $n$ so that occurrence queries can be answered efficiently.
     Our results:
     - $O(n \log\log n)$ preprocessing time,
     - $O(m + \mathit{Occ})$ query time for a pattern of length $m$, where $\mathit{Occ}$ is the number of reported occurrences.

  9. Algorithm outline
     - an encoding function Code that reduces testing the ≈ relation to ordinary equality,
     - a relaxation of the suffix tree definition that makes the implementation easier,
     - a modification of Ukkonen's algorithm,
     - an algorithmic toolbox for speeding up the encoding of factors and the navigation in the suffix tree.

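  10. Encoding function (1/2)
      For any $i \in \{1, \ldots, n\}$ define:
      - $\alpha_w(i)$ = distance to the predecessor of $w[i]$ among the values of $w[1..i-1]$,
      - $\beta_w(i)$ = distance to the successor of $w[i]$ among the values of $w[1..i-1]$.
      Example: for $w$ = 6 3 2 5 1 4 we have $\alpha_w(6) = 4$ (the predecessor of $w[6] = 4$ is $w[2] = 3$) and $\beta_w(6) = 2$ (the successor is $w[4] = 5$).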

  11. Encoding function (2/2)
      $\mathrm{Code}(w) = (\alpha_w(1), \beta_w(1)), \ldots, (\alpha_w(|w|), \beta_w(|w|))$.
      Example: for $w$ = 1 4 2 3 we have $\mathrm{Code}(w)$ = (−, −) (1, −) (2, 1) (1, 2), where − marks a missing predecessor or successor.
      Observation: $x \approx y \iff \mathrm{Code}(x) = \mathrm{Code}(y)$.
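The encoding can be reproduced with a direct quadratic scan. The sketch below is illustrative only: the name `code`, the use of `None` for "−", and the rule that ties between equal values go to the rightmost position are assumptions, since the slides do not spell out tie-breaking.

```python
def code(w):
    """Naive O(n^2) computation of Code(w); None plays the role of '-'."""
    result = []
    for i, c in enumerate(w):
        pred = succ = None  # positions of the predecessor / successor in w[0:i]
        for j in range(i):
            # Predecessor: largest value <= c; later (rightmost) ties win.
            if w[j] <= c and (pred is None or w[j] >= w[pred]):
                pred = j
            # Successor: smallest value >= c; later (rightmost) ties win.
            if w[j] >= c and (succ is None or w[j] <= w[succ]):
                succ = j
        alpha = i - pred if pred is not None else None
        beta = i - succ if succ is not None else None
        result.append((alpha, beta))
    return result


print(code([1, 4, 2, 3]))  # [(None, None), (1, None), (2, 1), (1, 2)]
print(code([1, 3, 2, 4]) == code([2, 6, 3, 8]))  # True
print(code([2, 6, 3, 8]) == code([3, 7, 4, 5]))  # False
```

The last two lines replay the Observation on the order-isomorphic and non-isomorphic pairs from the earlier example.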

  12. How to compute the encoding function?
      Lemma (off-line Code computation): for a string $w$ of length $n$, $\mathrm{Code}(w)$ can be computed in $O(n)$ time.
      Lemma (arbitrary factor Code computation): for a string $w$ of length $n$, after $O(n)$ preprocessing any element of $\mathrm{Code}(v)$, for any factor $v$ of $w$, can be computed in $O(\log n)$ time.
      Restricted case: if the computation of Code is restricted to a sliding window over $w$, the time drops to $O(\log\log n)$ per code element.
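The slides do not say how the off-line bound is obtained. One well-known technique for off-line predecessor and successor positions is to sort the positions by value, link them into a doubly linked list, and delete positions from right to left. The sketch below uses that technique as an assumption, not as the authors' method; `code_offline` and `list_neighbours` are hypothetical names, ties go to the rightmost position, and Python's comparison sort makes the whole thing $O(n \log n)$ rather than $O(n)$.

```python
def code_offline(w):
    """Code(w) via sorting plus linked-list deletions; None plays the role of '-'."""
    n = len(w)

    def list_neighbours(order):
        # Doubly linked list over `order`; index n is a sentinel at both ends.
        prev, nxt = [n] * (n + 1), [n] * (n + 1)
        last = n
        for idx in order:
            prev[idx], nxt[last] = last, idx
            last = idx
        nxt[last], prev[n] = n, last
        before, after = [None] * n, [None] * n
        # Deleting positions from right to left leaves exactly w[0:i+1]
        # in the list at the moment position i is inspected.
        for i in range(n - 1, -1, -1):
            p, s = prev[i], nxt[i]
            before[i] = p if p != n else None
            after[i] = s if s != n else None
            nxt[p], prev[s] = s, p  # unlink position i
        return before, after

    # Two orders so that both neighbours prefer the rightmost of equal values.
    pred, _ = list_neighbours(sorted(range(n), key=lambda i: (w[i], i)))
    _, succ = list_neighbours(sorted(range(n), key=lambda i: (w[i], -i)))
    return [(i - pred[i] if pred[i] is not None else None,
             i - succ[i] if succ[i] is not None else None)
            for i in range(n)]


print(code_offline([1, 4, 2, 3]))  # [(None, None), (1, None), (2, 1), (1, 2)]
```

After the sort, each position is visited and unlinked once, so the list phase is linear; replacing the comparison sort by a linear-time integer sort would be needed to match the bound stated in the lemma.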

  13. Order-preserving suffix trees
      The order-preserving suffix tree of $w$ (of length $n$) is a compacted trie of all the sequences in
      $\{ \mathrm{Code}(w[1..n])\#, \mathrm{Code}(w[2..n])\#, \ldots, \mathrm{Code}(w[n..n])\# \}$.
      Example: $w = (1, 2, 4, 4, 2, 5, 5, 1)$.
      [Figure: the order-preserving suffix tree of $w$; edges carry code pairs such as (1, 1), (2, 1), (1, 2), (2, 3), (1, 3), (2, 4), (3, 3), (4, 3) and the end marker #, and the leaves are numbered 1 to 8.]
      Additionally, each explicit node stores a suffix link.
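The family of sequences stored in the tree can be listed with a few lines of Python. This is a sketch under assumptions: `code` is a naive helper (with `None` for "−" and rightmost tie-breaking), and `suffix_codes` is a hypothetical name. It also shows why a standard suffix tree construction does not apply directly: the code of a suffix is in general not a suffix of the code.

```python
def code(w):
    """Naive Code(w); None stands for '-', ties go to the rightmost position."""
    out = []
    for i, c in enumerate(w):
        below = [j for j in range(i) if w[j] <= c]
        above = [j for j in range(i) if w[j] >= c]
        pred = max(below, key=lambda j: (w[j], j), default=None)
        succ = min(above, key=lambda j: (w[j], -j), default=None)
        out.append((None if pred is None else i - pred,
                    None if succ is None else i - succ))
    return out


def suffix_codes(w):
    """Code(w[i..n]) followed by the end marker '#', for every suffix of w."""
    return [code(w[i:]) + ["#"] for i in range(len(w))]


w = [1, 4, 2, 3]
for start, seq in enumerate(suffix_codes(w), start=1):
    print(start, seq)

# Dropping the first symbol of Code(w) does not give Code(w[2..n]).
print(code(w)[1:] == code(w[1:]))  # False
```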

  14. Suffix links
      - in standard suffix trees, suffix links always point to explicit nodes,
      - in order-preserving suffix trees, a suffix link may point to an implicit node.

  15. Incomplete suffix trees
      Relaxed definition: the incomplete order-preserving suffix tree of $w$ is an order-preserving suffix tree in which each explicit node $v$ can have one outgoing edge that does not store its first character.
      [Figure: a node $v$ below parent($v$); the edge labels shown are (2, 5), (3, 2) and (5, 10), and one outgoing edge of $v$ is marked "?" because its label is missing.]

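  16. Why are incomplete edges not harmful?
      Lemma: let $x$ and $y$ be two strings of length $t$, and let $x' = x[1..t-1]$, $y' = y[1..t-1]$. Then
      $x \approx y \iff x' \approx y' \land (y_i \le y_t \le y_j)$, where $i = t - \alpha_x(t)$ and $j = t - \beta_x(t)$.
      Each inequality is to be read as strict exactly when the corresponding inequality in $x$ is strict ($x_i < x_t$ or $x_t < x_j$), and it is dropped when the predecessor or the successor does not exist.
      [Figure: the strings $x$ and $y$ with positions $i$, $j$, $t$ marked, showing $x_i$, $x_t$, $x_j$ against $y_i$, $y_t$, $y_j$ and the distances $\alpha_x(t)$, $\beta_x(t)$.]
      So we need Code only for $x$.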
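The lemma means that a candidate string $y$ can be compared against $x$ position by position using Code($x$) and the raw characters of $y$, without ever encoding $y$. The Python sketch below illustrates this. The names are hypothetical, `None` stands for "−", ties go to the rightmost position, and a comparison is treated as an equality when $\alpha_x(t) = \beta_x(t)$ (an equal earlier character exists) and as strict otherwise, which is an assumption added here to handle repeated values.

```python
def code(w):
    """Naive Code(w); None stands for '-', ties go to the rightmost position."""
    out = []
    for i, c in enumerate(w):
        below = [j for j in range(i) if w[j] <= c]
        above = [j for j in range(i) if w[j] >= c]
        pred = max(below, key=lambda j: (w[j], j), default=None)
        succ = min(above, key=lambda j: (w[j], -j), default=None)
        out.append((None if pred is None else i - pred,
                    None if succ is None else i - succ))
    return out


def matches_by_extension(x, y):
    """Decide x ~ y one position at a time, using Code(x) and raw values of y."""
    if len(x) != len(y):
        return False
    for t, (a, b) in enumerate(code(x)):
        # a == b (both present) means x[t] equals an earlier character of x.
        strict = a != b
        if a is not None:
            y_i = y[t - a]
            if not (y_i < y[t] if strict else y_i == y[t]):
                return False
        if b is not None:
            y_j = y[t - b]
            if not (y[t] < y_j if strict else y[t] == y_j):
                return False
    return True


print(matches_by_extension([1, 3, 2, 4], [2, 6, 3, 8]))  # True
print(matches_by_extension([2, 6, 3, 8], [3, 7, 4, 5]))  # False
```

This is the property that lets an edge with a missing first label be verified later against the text instead of being stored.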

  19. Algorithm for constructing the incomplete suffix tree
      We basically re-implement Ukkonen's algorithm. The suffix tree is constructed using two basic operations.
      Branch($v$, $(p, q)$): create a new branch starting in $v$ with code $(p, q)$. If $v$ is an implicit node, the existing edge becomes incomplete (that is why the suffix tree can be incomplete).
      [Figure: a new branch labeled $(p, q)$ leaves the node $v$; the existing edge below $v$ is marked "this edge is incomplete".]

  21. How to implement Transition
      Transition($v$, $(p, q)$): checks whether $v$ has a child $v'$ such that the edge from $v$ to $v'$ represents the code $(p, q)$; returns $v'$ in that case, or nil if there is no such node.
      Implementation, case 1: $(p, q)$ is present among the child edges.
      Example: $v$ has child edges labeled (1, 2) and (3, 1); for $(p, q) = (3, 1)$ the child $v'$ is found directly.

  22. How to implement Transition (continued)
      Implementation, case 2: $(p, q)$ is not present among the child edges.
      Example: $v$ has child edges labeled (1, 2) and (3, 1); for $(p, q) = (4, 2)$ we have to verify the (single) incomplete edge.
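The two cases can be sketched in Python as follows. Everything about the node layout is a hypothetical illustration, not the authors' data structure: the `Node` fields, the `witness` position used to recover a missing label from the text, and the naive helper `code_at`, which stands in for the fast code computation from the algorithmic toolbox. Only explicit nodes are modeled.

```python
from dataclasses import dataclass, field
from typing import Optional


def code_at(w, start, k):
    """Element k (0-based) of Code(w[start:]), or '#' past the end; naive scan."""
    t = start + k
    if t >= len(w):
        return "#"
    c = w[t]
    window = range(start, t)
    pred = max((j for j in window if w[j] <= c), key=lambda j: (w[j], j), default=None)
    succ = min((j for j in window if w[j] >= c), key=lambda j: (w[j], -j), default=None)
    return (None if pred is None else t - pred,
            None if succ is None else t - succ)


@dataclass
class Node:
    depth: int    # number of code symbols spelled on the path from the root
    witness: int  # start (0-based) of one suffix of w that passes through this node
    children: dict = field(default_factory=dict)  # first code symbol -> child node
    incomplete: Optional["Node"] = None           # at most one edge without a first symbol


def transition(w, v, pq):
    """Return the child of v reached by code symbol pq, or None."""
    child = v.children.get(pq)
    if child is not None:
        return child  # case 1: the label is stored on a child edge
    u = v.incomplete
    if u is not None and code_at(w, u.witness, v.depth) == pq:
        return u      # case 2: the missing label, recomputed from the text, matches
    return None


w = [1, 4, 2, 3]
a = Node(depth=2, witness=0)
u = Node(depth=2, witness=1)
v = Node(depth=1, witness=0, children={(1, None): a}, incomplete=u)
print(transition(w, v, (1, None)) is a)  # True
print(transition(w, v, (None, 1)) is u)  # True
print(transition(w, v, (2, 1)))          # None
```

Because each node has at most one incomplete edge, case 2 costs a single code computation per call.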

  23. Algorithmic toolbox, continued
      We also require the following data structures:
      - Weak Character Oracle: a data structure based on y-fast trees (Willard 1983) for computing codes for newly created branches of the tree,
      - Dynamic Weighted Ancestor data structure (Kopelowitz, Lewenstein 2007), used for fast navigation over the constructed suffix tree.

  24. Example usage
      Theorem: given a word $w$ of length $n$, the incomplete order-preserving suffix tree can be constructed in $O(n \log\log n)$ expected time.
      Theorem: given the op-suffix tree $T(w)$ and a pattern $x$, we can locate all order-preserving occurrences of $x$ in $w$ in time $O(|x| + \mathit{Occ})$.
      - Compute $\mathrm{Code}(x)$ and traverse the tree $T(w)$ using successive symbols of the code. At each step we use the function Transition.
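The query procedure can be tried out on a deliberately naive index. The sketch below builds an uncompacted trie of the suffix codes in roughly cubic time, so it shows only the traversal idea and comes nowhere near the bounds in the theorems. All names (`code`, `build_index`, `occurrences`) and the dictionary-based node layout are assumptions; `None` stands for "−" and ties go to the rightmost position.

```python
def code(w):
    """Naive Code(w); None stands for '-', ties go to the rightmost position."""
    out = []
    for i, c in enumerate(w):
        below = [j for j in range(i) if w[j] <= c]
        above = [j for j in range(i) if w[j] >= c]
        pred = max(below, key=lambda j: (w[j], j), default=None)
        succ = min(above, key=lambda j: (w[j], -j), default=None)
        out.append((None if pred is None else i - pred,
                    None if succ is None else i - succ))
    return out


def build_index(w):
    """Uncompacted trie of Code(w[i:]) for all i; each node lists the suffix starts."""
    root = {"children": {}, "starts": list(range(len(w)))}
    for i in range(len(w)):
        node = root
        for sym in code(w[i:]):
            node = node["children"].setdefault(sym, {"children": {}, "starts": []})
            node["starts"].append(i)
    return root


def occurrences(index, x):
    """0-based starts of the factors of w that are order-isomorphic to x."""
    node = index
    for sym in code(x):
        node = node["children"].get(sym)
        if node is None:
            return []
    return node["starts"]


index = build_index([1, 2, 4, 4, 2, 5, 5, 1])
print(occurrences(index, [10, 20]))  # [0, 1, 4]
print(occurrences(index, [3, 3, 1]))  # [2, 5]
```

The traversal works because Code of a prefix is a prefix of Code, so Code($x$) labels a path from the root exactly when $x$ is order-isomorphic to a prefix of some suffix of $w$.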

  25. Complete suffix trees for the op-model
      Theorem: the order-preserving suffix tree of a string of length $n$ can be constructed in $O(n \log n / \log\log n)$ expected time.
      - This can be achieved by a slightly different encoding function that allows a character oracle with $O(\log n / \log\log n)$ query time and $o(n \log n / \log\log n)$ preprocessing.

  26. Thank you for your attention!
