
Faster Subsequence and Don’t-Care Pattern Matching on Compressed Texts



  1. Faster Subsequence and Don’t-Care Pattern Matching on Compressed Texts. Takanori Yamamoto, Hideo Bannai, Shunsuke Inenaga, Masayuki Takeda. Department of Informatics, Kyushu University, Japan. Originally presented at CPM 2011.

  2. Self introduction
     • Name: Shunsuke Inenaga (稲永 俊介)
     • Affiliation: Kyushu University, Japan
     • Research interests: String matching, Text compression, Algorithms, Data structures

  3. Agenda
     • Subsequence Pattern Matching
     • Compressed String Processing
     • Straight Line Program (SLP)
     • Algorithms
       ◦ Minimal Subsequence Occurrences on SLP
       ◦ Fixed Length Don’t Care Matching on SLP
       ◦ Variable Length Don’t Care Matching on SLP
     • Summary

  4. Subsequences
     • String $P$ of length $m$ is a subsequence of string $T$ of length $N$ $\iff$ $\exists\, i_0, \ldots, i_{m-1}$ s.t. $0 \le i_0 < \cdots < i_{m-1} \le N-1$ and $P[j] = T[i_j]$ for all $j = 0, \ldots, m-1$.

  5. Example: T = accbabbcab (positions 0123456789), P = abc.

  6. Example: T = accbabbcab, P = abc. Occurrences: (0, 3, 7).

  7. Example: T = accbabbcab, P = abc. Occurrences: (0, 3, 7), (0, 5, 7).

  8. Example: T = accbabbcab, P = abc. Occurrences: (0, 3, 7), (0, 5, 7), (0, 6, 7).

  9. Example: T = accbabbcab, P = abc. Occurrences: (0, 3, 7), (0, 5, 7), (0, 6, 7), (4, 5, 7).

  10. Example: T = accbabbcab, P = abc. Occurrences: (0, 3, 7), (0, 5, 7), (0, 6, 7), (4, 5, 7), (4, 6, 7).
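
The enumeration on slides 5 to 10 can be reproduced with a brute-force sketch. The function name `subsequence_occurrences` is an assumption made for illustration and does not come from the slides; it simply tries every increasing index tuple, so it is only suitable for tiny inputs.

```python
from itertools import combinations

def subsequence_occurrences(text, pattern):
    """Return every index tuple (i_0, ..., i_{m-1}) with text[i_j] == pattern[j].

    Brute force over all increasing index tuples; for illustration only.
    """
    m = len(pattern)
    return [idx for idx in combinations(range(len(text)), m)
            if all(text[i] == c for i, c in zip(idx, pattern))]

if __name__ == "__main__":
    for occ in subsequence_occurrences("accbabbcab", "abc"):
        print(occ)
```

Because `combinations` yields tuples in lexicographic order, the occurrences come out in the same order in which the slides reveal them.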

  11. There can be too many occurrences. T = ababababab (positions 0123456789), P = aaa. The number of choices of indices can be as large as $\binom{N}{m} = O(N^m)$.

  12. Consider only start & end. T = ababababab, P = aaa. Two occurrences are equivalent if they start and end at the same positions, e.g. all occurrences that start at 0 and end at 6 are represented by (0, 6). There still exist $O(N^2)$ non-equivalent occurrences.

  13. Minimal Subsequence Occurrences
      • An occurrence $(i_0, i_{m-1})$ of subsequence $P$ in $T$ is minimal if there is no occurrence of $P$ in $T[i_0 : i_{m-1}-1]$ or $T[i_0+1 : i_{m-1}]$.
      • In other words, $(i_0, i_{m-1})$ is minimal if there is no other occurrence of $P$ within $T[i_0 : i_{m-1}]$.

  14. Minimal Subsequence Occurrences. T = ababababab, P = aaa. Minimal occurrences: (0, 4), (2, 6), … There are only $O(N)$ minimal occurrences.
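
As a point of reference for the uncompressed case, here is a minimal sketch that lists minimal occurrences directly on a plain string. The name `minimal_occurrences` and the greedy strategy are assumptions for illustration, not the method of the slides: for each start position it finds the earliest possible end, and keeps the pair only if the next start position cannot reach an end that is at least as early.

```python
def minimal_occurrences(text, pattern):
    """Return all minimal occurrences (start, end) of pattern as a subsequence."""
    m = len(pattern)

    def earliest_end(start):
        # Greedy left-to-right match beginning at `start`.
        j = 0
        for pos in range(start, len(text)):
            if text[pos] == pattern[j]:
                j += 1
                if j == m:
                    return pos
        return None

    starts = [i for i, ch in enumerate(text) if ch == pattern[0]]
    ends = [earliest_end(i) for i in starts]
    result = []
    for k, (u, v) in enumerate(zip(starts, ends)):
        if v is None:
            break  # later starts cannot match either
        nxt = ends[k + 1] if k + 1 < len(ends) else None
        # (u, v) is minimal unless a later start reaches an end <= v.
        if nxt is None or nxt > v:
            result.append((u, v))
    return result

if __name__ == "__main__":
    print(minimal_occurrences("ababababab", "aaa"))
```

On the running example this yields (0, 4), (2, 6) and (4, 8); on T = accbabbcab with P = abc it yields the single pair (4, 7).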

  15. Problem setting
      • We want to solve the problem of computing minimal occurrences of a query pattern when the text is given in a compressed form.

  16. Compressed String Processing. [Diagram: a BIG string is compressed into a compressed representation; instead of decompressing it and running heavy string processing, the compressed string is processed directly with light processing.] Processing without explicit decompression can dramatically save time and space.

  17. Straight Line Program [1/2]. An SLP $S$ is a sequence of $n$ assignments $X_1 = expr_1;\ X_2 = expr_2;\ \ldots;\ X_n = expr_n$, where each $X_k$ is a variable and each $expr_k$ is either a single character $a$ ($a \in \Sigma$) or $X_i X_j$ ($i, j < k$). An SLP $S$ for string $T$ is a context-free grammar in Chomsky normal form s.t. $L(S) = \{T\}$.

  18. Straight Line Program [2/2]. SLP $S$ with $n = 8$ assignments: $X_1 = a$; $X_2 = b$; $X_3 = X_1 X_2$; $X_4 = X_3 X_1$; $X_5 = X_3 X_4$; $X_6 = X_5 X_5$; $X_7 = X_4 X_6$; $X_8 = X_7 X_5$. $T$ is the string derived from $X_8$, $|T| = N$, and $N = O(2^n)$.

  19. Straight Line Program [2/2]. The same SLP $S$ shown with its derivation tree: the root $X_8$ has children $X_7$ and $X_5$. [Figure: derivation tree of $T$.]
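
To make the example concrete, the sketch below encodes the SLP of slides 18 and 19 as a Python list. The encoding (a one-character string for a terminal rule, a pair of earlier indices for a concatenation rule) and the names `SLP`, `lengths` and `expand` are assumptions for illustration. Rules are 0-indexed here, so list entry k stands for $X_{k+1}$.

```python
# Entry k stands for X_{k+1} on the slide: a terminal character or a pair (l, r).
SLP = ["a", "b", (0, 1), (2, 0), (2, 3), (4, 4), (3, 5), (6, 4)]

def lengths(slp):
    """|X_k| for every rule, computed bottom-up without decompressing."""
    out = []
    for rule in slp:
        out.append(1 if isinstance(rule, str) else out[rule[0]] + out[rule[1]])
    return out

def expand(slp):
    """Decompress the last variable; exponential in the worst case, demo only."""
    texts = []
    for rule in slp:
        texts.append(rule if isinstance(rule, str) else texts[rule[0]] + texts[rule[1]])
    return texts[-1]

if __name__ == "__main__":
    print(lengths(SLP))
    print(expand(SLP))
```

The lengths come out as 1, 1, 2, 3, 5, 10, 13, 18, which illustrates how the derived string can grow much faster than the number of rules.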

  20. SLP: Abstract model of compression
      • Output of grammar-based compression algorithms (e.g., Re-pair, Sequitur, LZ78) of size $n$ can be trivially converted to SLPs of size $O(n)$ in $O(n)$ time.
      • Output of LZ77 of size $r$ can be converted to an SLP of size $O(r \log N)$ in $O(r \log N)$ time.
      • Therefore, algorithms that work on SLPs are widely useful: they can be applied to various types of compressed strings.

  21. Our contribution. Given an SLP-compressed text and an uncompressed pattern, we propose $O(nm)$ algorithms for:
      ◦ Subsequence pattern matching
      ◦ FLDC (fixed length don’t care) pattern matching
      ◦ VLDC (variable length don’t care) pattern matching
      where $n$ = size of SLP and $m$ = length of pattern.

  22. Subsequence matching

  23. Subsequence Problems on SLP [Cégielski et al. 2006]
      Minimal Subsequence Occurrences
      • Input: SLP of size $n$ representing string $T$, string $P$
      • Output: # of minimal subsequence occurrences of $P$ in $T$
      Several variations, e.g. Bounded Minimal Subsequence Occurrences
      • Input: SLP of size $n$ representing $T$, string $P$, integer $w$
      • Output: # of minimal subsequence occurrences $(i_0, i_{m-1})$ of $P$ in $T$ satisfying $i_{m-1} - i_0 \le w$

  24. Comparison to previous work
      • Decompression & [Troníček 2001]: $O(Nm) = O(2^n m)$
      • [Cégielski et al. 2006]: $O(nm^2 \log m)$
      • [Tiskin 2009]: $O(nm^{1.5})$
      • [Tiskin 2011]: $O(nm \log m)$
      • This work, subsequence problems: $O(nm)$
      • This work, extensions to pattern matching with fixed/variable length don’t care symbols: $O(nm)$

  25. 串: Stabbed occurrences. For $X_i = X_l X_r$, an occurrence $(u, v)$ of $P$ is said to be a stabbed occurrence in $X_i$ if $0 \le u < |X_l| \le v \le |X_i| - 1$. [Figure: $X_i$ split into $X_l$ and $X_r$, with occurrences $(u, v)$ and $(u', v')$ crossing the boundary, next to a picture of oden (おでん) on a skewer.] 串 (KUSHI) is a kanji character meaning “skewer”, used to stab food.

  26. Every occurrence is stabbed. Observation: for any interval $[u, v]$ with $0 \le u \le v \le N-1$, there exists a variable $X_i$ which stabs $[u, v]$. [Figure: interval $[u, v]$ inside $X_n$, covering positions 0 to $N-1$, stabbed by some $X_i$.]

  27. Counting minimal occurrences
      • $M_i$: # of minimal occurrences of $P$ in $X_i$
      • $M_{串}(l, r)$: # of stabbed minimal occurrences of $P$ in $X_i = X_l X_r$
      • $M_n$ is the solution to our problem.
      Computing $M_i$:
      • If $X_i = a$ ($a \in \Sigma$): $M_i = \begin{cases} 1 & \text{if } m = 1 \text{ and } P = a \\ 0 & \text{if } m \ne 1 \text{ or } P \ne a \end{cases}$
      • If $X_i = X_l X_r$ ($l, r < i$): $M_i = M_l + M_r + M_{串}(l, r)$
      Example: $P$ = abc, $X_i = X_l X_r$ = aabcxabaxcbxcxabxc, with $M_l = 1$, $M_r = 1$ and $M_{串}(l, r) = 2$ stabbed minimal occurrences, so $M_i = 4$.
      (June 27-29, CPM 2011 @ Palermo)

  28. Computing $M_{串}(l, r)$: there are at most $m - 1$ stabbed minimal occurrences. $R(l, k)$ is the length of the shortest suffix of $X_l$ containing $P[0 : m-k-1]$, and $L(r, m-k)$ is the length of the shortest prefix of $X_r$ containing $P[m-k : m-1]$. Example: $P$ = abbba, $X_i = X_l X_r$ = aabbab · abbaba.

      k    R(l, k)    L(r, m-k)
      0    ∞          -
      1    5          1
      2    5          4
      3    2          4
      4    2          6
      5    -          6

  29. Computing $M_{串}(l, r)$: there are at most $m - 1$ crossing minimal occurrences. Same example and table as on slide 28 ($P$ = abbba, $X_l$ = aabbab, $X_r$ = abbaba).

  30. Computing $M_{串}(l, r)$. Lemma: $M_{串}(l, r)$ for all $X_i = X_l X_r$ can be computed in a total of $O(nm)$ time using $L$ and $R$.

      C := 0, rmin := R(l, 0)
      for k := 1 to m - 1
          if rmin > R(l, k) and L(r, m-k) < L(r, m-k-1) then
              C := C + 1
              rmin := R(l, k)
          end if
      end for
      M串(l, r) := C

      • $L(i, j)$: length of the shortest prefix of $X_i$ s.t. $P[j : m-1]$ is a subsequence
      • $R(i, j)$: length of the shortest suffix of $X_i$ s.t. $P[0 : m-j-1]$ is a subsequence
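
The loop of slide 30 can be sketched in Python as follows. The function name `count_stabbed` and the convention of passing the two relevant table rows as plain lists are assumptions for illustration: `R_l[k]` stands for $R(l, k)$ and `L_r[j]` for $L(r, j)$, both indexed from 0 to $m$, with `math.inf` where the slides write ∞.

```python
import math

def count_stabbed(R_l, L_r):
    """Count stabbed minimal occurrences from one row of R and one row of L.

    R_l[k]: shortest suffix of X_l containing the first m-k pattern characters.
    L_r[j]: shortest prefix of X_r containing the pattern from position j on.
    """
    m = len(R_l) - 1
    count, rmin = 0, R_l[0]
    for k in range(1, m):
        # Split k puts the last k pattern characters into X_r.
        if rmin > R_l[k] and L_r[m - k] < L_r[m - k - 1]:
            count += 1
            rmin = R_l[k]
    return count

if __name__ == "__main__":
    # Values for P = abbba, X_l = aabbab, X_r = abbaba (slide 28).
    R_l = [math.inf, 5, 5, 2, 2, 0]
    L_r = [6, 6, 4, 4, 1, 0]
    print(count_stabbed(R_l, L_r))
```

For the slide 28 example the candidates at k = 1 and k = 3 pass both tests, which corresponds to the occurrences (1, 6) and (4, 9) in aabbababbaba.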

  31. Computing $Q$ (to compute $L$). $Q(i, j)$: length of the longest prefix of $P[j:]$ which is also a subsequence of $X_i$ ($i = 1, \ldots, n$, $j = 0, \ldots, m$).
      Computing $Q(i, j)$:
      • If $X_i = a$ ($a \in \Sigma$): $Q(i, j) = \begin{cases} 1 & \text{if } P[j] = a \\ 0 & \text{if } P[j] \ne a \end{cases}$
      • If $X_i = X_l X_r$ ($l, r < i$): $Q(i, j) = Q(l, j) + Q(r, j')$, where $j' = j + Q(l, j)$
      [Figure: the first $Q(l, j)$ characters of $P[j:]$ are matched in $X_l$, the next $Q(r, j')$ characters, starting at $j'$, in $X_r$.]

  32. Computing $Q$ (to compute $L$). Example: $X_i = X_l X_r$ = xxaxbxcdxexx, $P[j:]$ = abcdef. $Q(l, j) = 2$, $j' := j + Q(l, j)$, $P[j':]$ = cdef, $Q(r, j') = 3$, so $Q(i, j) := Q(l, j) + Q(r, j') = 2 + 3 = 5$. Lemma [Cégielski et al.]: for all $i = 1, \ldots, n$ and $j = 0, \ldots, m$, $Q(i, j)$ can be calculated in $O(nm)$ time using DP.
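
The recurrence for $Q$ translates into a short bottom-up table fill. This is a minimal sketch under an assumed encoding that is not part of the slides: an SLP is a Python list whose entries are either a one-character string (terminal rule) or a pair of earlier indices (concatenation rule), 0-indexed so that entry k stands for $X_{k+1}$. The name `q_table` is likewise hypothetical.

```python
def q_table(slp, pattern):
    """Q[i][j] = length of the longest prefix of pattern[j:] that is a
    subsequence of the string derived by rule i, for j = 0..m."""
    m = len(pattern)
    Q = []
    for rule in slp:
        if isinstance(rule, str):
            row = [1 if j < m and pattern[j] == rule else 0 for j in range(m + 1)]
        else:
            l, r = rule
            # Match greedily in the left child, continue in the right child.
            row = [Q[l][j] + Q[r][j + Q[l][j]] for j in range(m + 1)]
        Q.append(row)
    return Q

if __name__ == "__main__":
    # The eight-rule SLP from slide 18; it derives abaababaababaababa.
    slp = ["a", "b", (0, 1), (2, 0), (2, 3), (4, 4), (3, 5), (6, 4)]
    print(q_table(slp, "abab")[-1])
```

Each rule fills $m + 1$ entries in constant time per entry, which matches the $O(nm)$ bound stated in the lemma.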

  33. Computing $L$. $L(i, j)$: length of the shortest prefix of $X_i$ s.t. $P[j:]$ is a subsequence ($i = 1, \ldots, n$, $j = 0, \ldots, m$); $\infty$ if $P[j:]$ is not a subsequence of $X_i$.
      Computing $L(i, j)$:
      • If $X_i = a$ ($a \in \Sigma$): $L(i, j) = \begin{cases} 0 & \text{if } j = m \\ 1 & \text{if } P[j:] = a \\ \infty & \text{if } P[j:] \ne a \end{cases}$
      • If $X_i = X_l X_r$: $L(i, j) = \begin{cases} L(l, j) & \text{if } j' = m \\ |X_l| + L(r, j') & \text{if } j' < m \end{cases}$, where $j' = j + Q(l, j)$
      [Figure: when $P[j:]$ does not fit in $X_l$, the prefix spans all of $X_l$ plus the first $L(r, j')$ characters of $X_r$.]

  34. Computing $L$. Lemma: $L(i, j)$ can be computed for all $i = 1, \ldots, n$, $j = 0, \ldots, m$, in a total of $O(nm)$ time using $Q(i, j)$. (Previously: $O(nm^2 \log m)$ [Cégielski et al., 2007].) Example: $P[j:]$ = abcdef, $X_i = X_l X_r$ = xabxcx · xdexfxx. $L(l, j) = \infty$, $Q(l, j) = 3$, $j' := j + 3$, $P[j':]$ = def, $L(r, j') = 5$, $|X_l| = 6$, so $L(i, j) = |X_l| + L(r, j') = 11$.
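
The example on slide 34 can be checked on plain strings. The sketch below is not the SLP algorithm itself: it evaluates $Q$ and $L$ by direct scanning and then applies the concatenation rule once. All three function names are assumptions for illustration.

```python
import math

def longest_matched_prefix(x, pat):
    """Q on a plain string: number of leading characters of pat embedded in x."""
    j = 0
    for ch in x:
        if j < len(pat) and ch == pat[j]:
            j += 1
    return j

def shortest_prefix_len(x, pat):
    """L on a plain string: length of the shortest prefix of x containing pat."""
    if not pat:
        return 0
    j = 0
    for pos, ch in enumerate(x):
        if ch == pat[j]:
            j += 1
            if j == len(pat):
                return pos + 1
    return math.inf

def l_by_recurrence(x_l, x_r, pat):
    """Apply the rule for X_i = X_l X_r from slide 33 to two plain strings."""
    q = longest_matched_prefix(x_l, pat)
    if q == len(pat):
        return shortest_prefix_len(x_l, pat)
    return len(x_l) + shortest_prefix_len(x_r, pat[q:])

if __name__ == "__main__":
    print(l_by_recurrence("xabxcx", "xdexfxx", "abcdef"))
    print(shortest_prefix_len("xabxcx" + "xdexfxx", "abcdef"))
```

Both calls agree on 11, the value derived on the slide as $|X_l| + L(r, j') = 6 + 5$.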

  35. Result. Minimal Subsequence Occurrences Problem
      • Input: SLP of size $n$ representing string $T$, string $P$
      • Output: # of minimal occurrences of subsequence $P$ in $T$
      Theorem: given an SLP of size $n$ and a pattern of length $m$, minimal subsequence occurrences can be computed in $O(nm)$ time and space.
      Comparison:
      • Decompression & [Troníček 2001]: $O(Nm) = O(2^n m)$
      • [Cégielski et al. 2007]: $O(nm^2 \log m)$
      • [Tiskin 2009]: $O(nm^{1.5})$
      • [Tiskin 2011]: $O(nm \log m)$
      • This work: $O(nm)$
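
Putting the pieces together, the following is a sketch of the whole counting procedure as described on slides 27 to 34, not the authors' implementation. Assumptions made for illustration: the SLP is a 0-indexed Python list of one-character strings and index pairs, the pattern is non-empty, and $R$ is obtained by running the $L$ computation on the reversed pattern with the children of every rule swapped.

```python
import math

def count_minimal_occurrences(slp, pattern):
    """Number of minimal subsequence occurrences of pattern in the SLP text."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    length = []
    for rule in slp:
        length.append(1 if isinstance(rule, str) else length[rule[0]] + length[rule[1]])

    def shortest_prefix_table(pat, mirrored):
        # S[i][j]: shortest prefix of X_i (of reversed X_i if mirrored)
        # that contains pat[j:] as a subsequence; Q is the helper table.
        Q, S = [], []
        for rule in slp:
            if isinstance(rule, str):
                q = [1 if j < m and pat[j] == rule else 0 for j in range(m + 1)]
                s = [math.inf] * m + [0]
                if pat[m - 1] == rule:
                    s[m - 1] = 1
            else:
                a, b = (rule[1], rule[0]) if mirrored else rule
                q, s = [], []
                for j in range(m + 1):
                    jp = j + Q[a][j]
                    q.append(Q[a][j] + Q[b][jp])
                    s.append(S[a][j] if jp == m else length[a] + S[b][jp])
            Q.append(q)
            S.append(s)
        return S

    L = shortest_prefix_table(pattern, False)
    R = shortest_prefix_table(pattern[::-1], True)  # R(i, j) via mirroring
    M = []
    for rule in slp:
        if isinstance(rule, str):
            M.append(1 if pattern == rule else 0)
            continue
        l, r = rule
        stabbed, rmin = 0, R[l][0]
        for k in range(1, m):  # loop from slide 30
            if rmin > R[l][k] and L[r][m - k] < L[r][m - k - 1]:
                stabbed += 1
                rmin = R[l][k]
        M.append(M[l] + M[r] + stabbed)
    return M[-1]

if __name__ == "__main__":
    slp = ["a", "b", (0, 1), (2, 0), (2, 3), (4, 4), (3, 5), (6, 4)]
    print(count_minimal_occurrences(slp, "aba"))
```

Every table has $n(m+1)$ entries and each entry takes constant time, so the sketch stays within the $O(nm)$ time and space of the theorem; it never expands the text.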

  36. FLDC matching
