Faster Subsequence and Don't-Care Pattern Matching on Compressed Texts
Takanori Yamamoto, Hideo Bannai, Shunsuke Inenaga, Masayuki Takeda
Department of Informatics, Kyushu University, JAPAN
Originally presented at CPM 2011

Self introduction
• Name: Shunsuke Inenaga (稲永俊介)
• Affiliation: Kyushu University, Japan
• Research interests: String matching, Text compression, Algorithms, Data structures

Agenda
• Subsequence Pattern Matching
• Compressed String Processing
• Straight Line Program (SLP)
• Algorithms
  ◦ Minimal Subsequence Occurrences on SLP
  ◦ Fixed Length Don't Care Matching on SLP
  ◦ Variable Length Don't Care Matching on SLP
• Summary

Subsequences
• String $P$ of length $m$ is a subsequence of string $T$ of length $N$
  $\iff \exists\, i_0, \ldots, i_{m-1}$ s.t. $0 \le i_0 < \cdots < i_{m-1} \le N-1$ and $P[j] = T[i_j]$ for all $j = 0, \ldots, m-1$

Example
  position: 0123456789
  T = accbabbcab
  P = abc
Occurrences of P as a subsequence of T (index tuples):
  (0, 3, 7), (0, 5, 7), (0, 6, 7), (4, 5, 7), (4, 6, 7)
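The example above can be reproduced with a brute-force sketch in Python. It is for illustration only and is not part of the talk: it works on the uncompressed text, and the function names `is_subsequence` and `subsequence_occurrences` are hypothetical.

```python
from itertools import combinations

def is_subsequence(T, P):
    """Greedy left-to-right test: is P a subsequence of T?"""
    it = iter(T)
    # `c in it` advances the iterator past the first match of c.
    return all(c in it for c in P)

def subsequence_occurrences(T, P):
    """All index tuples (i_0, ..., i_{m-1}) with T[i_j] == P[j].

    Enumerates every increasing index tuple, so it is only usable
    on tiny inputs.
    """
    return [idx for idx in combinations(range(len(T)), len(P))
            if all(T[i] == c for i, c in zip(idx, P))]

print(subsequence_occurrences("accbabbcab", "abc"))
# [(0, 3, 7), (0, 5, 7), (0, 6, 7), (4, 5, 7), (4, 6, 7)]
```

Running the same function on T = ababababab and P = aaa returns 10 tuples, which shows how quickly the number of occurrences grows.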

There can be too many occurrences
  position: 0123456789
  T = ababababab
  P = aaa
• The number of choices of indices is $O(N^m)$.

Consider only start & end
  position: 0123456789
  T = ababababab
  P = aaa
• Two occurrences are equivalent $\iff$ they start and end at the same positions, e.g. all occurrences spanning (0, 6).
• There still exist $O(N^2)$ non-equivalent occurrences.

Minimal Subsequence Occurrences
• An occurrence $(i_0, i_{m-1})$ of subsequence $P$ in $T$ is minimal if there is no occurrence of $P$ in $T[i_0 : i_{m-1} - 1]$ or in $T[i_0 + 1 : i_{m-1}]$.
• In other words, $(i_0, i_{m-1})$ is minimal if there is no other occurrence of $P$ within $T[i_0 : i_{m-1}]$.

Minimal Subsequence Occurrences
  position: 0123456789
  T = ababababab
  P = aaa
  minimal occurrences: (0, 4), (2, 6), …
• There are only $O(N)$ minimal occurrences.
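As a point of reference, minimal occurrences can be listed directly on an uncompressed text with a forward scan followed by a backward scan. This is a minimal sketch, not the SLP algorithm of the talk; the function name `minimal_occurrences` is hypothetical and a non-empty pattern is assumed.

```python
def minimal_occurrences(T, P):
    """List minimal occurrences (start, end) of P in T as a subsequence.

    Forward scan: earliest end of an occurrence starting at or after `start`.
    Backward scan: latest start of an occurrence ending exactly there.
    The resulting window contains no smaller occurrence.
    """
    m = len(P)
    if m == 0:
        return []
    result, start = [], 0
    while True:
        j, end = 0, -1
        for i in range(start, len(T)):
            if T[i] == P[j]:
                j += 1
                if j == m:
                    end = i
                    break
        if end < 0:
            return result
        j, s = m - 1, end
        for i in range(end, -1, -1):
            if T[i] == P[j]:
                j -= 1
                if j < 0:
                    s = i
                    break
        result.append((s, end))
        start = s + 1  # the next minimal occurrence must start later

print(minimal_occurrences("ababababab", "aaa"))
# [(0, 4), (2, 6), (4, 8)]
```

On the earlier example T = accbabbcab, P = abc, the only minimal occurrence is (4, 7).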

Problem setting
• We want to solve the problem of computing minimal occurrences of a query pattern when the text is given in a compressed form.

Compressed String Processing
[Figure: a BIG string is compressed into a compressed representation; instead of decompressing and then processing the string, the compressed string is processed directly (light processing).]
• Processing without explicit decompression can dramatically save time and space.

Straight Line Program [1/2]
• An SLP $S$ is a sequence of $n$ assignments
  $X_1 = expr_1;\ X_2 = expr_2;\ \ldots;\ X_n = expr_n;$
  where $X_k$ is a variable and $expr_k$ is either $a$ ($a \in \Sigma$) or $X_i X_j$ ($i, j < k$).
• An SLP $S$ for string $T$ is a context-free grammar in Chomsky normal form s.t. $L(S) = \{T\}$.

Straight Line Program [2/2]
SLP $S$ ($n = 8$):
  $X_1 = a$
  $X_2 = b$
  $X_3 = X_1 X_2$
  $X_4 = X_3 X_1$
  $X_5 = X_3 X_4$
  $X_6 = X_5 X_5$
  $X_7 = X_4 X_6$
  $X_8 = X_7 X_5$
[Figure: derivation tree of $X_8 = X_7 X_5$ over the text $T$ of length $N$.]
• $N = O(2^n)$
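The SLP above can be written down as a small Python structure. The sketch below assumes a representation that is not from the slides: a list of rules indexed from 0, where a rule is either a one-character string (terminal) or a pair of indices of earlier rules. The names `slp_lengths` and `slp_expand` are hypothetical.

```python
# X_1 .. X_8 of the slide, stored at indices 0 .. 7.
SLP = ["a", "b", (0, 1), (2, 0), (2, 3), (4, 4), (3, 5), (6, 4)]

def slp_lengths(slp):
    """|X_i| for every variable, computed in O(n) time without expansion."""
    lengths = []
    for rule in slp:
        if isinstance(rule, str):
            lengths.append(1)
        else:
            l, r = rule
            lengths.append(lengths[l] + lengths[r])
    return lengths

def slp_expand(slp):
    """Explicit decompression of the last variable.

    The output can be exponentially longer than the SLP, which is exactly
    what compressed string processing tries to avoid.
    """
    texts = []
    for rule in slp:
        if isinstance(rule, str):
            texts.append(rule)
        else:
            texts.append(texts[rule[0]] + texts[rule[1]])
    return texts[-1]

print(slp_lengths(SLP))  # [1, 1, 2, 3, 5, 10, 13, 18]
print(slp_expand(SLP))   # abaababaababaababa
```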

SLP: Abstract model of compression
• The output of grammar-based compression algorithms (e.g., Re-pair, Sequitur, LZ78) of size $n$ can be trivially converted to an SLP of size $O(n)$ in $O(n)$ time.
• The output of LZ77 of size $r$ can be converted to an SLP of size $O(r \log N)$ in $O(r \log N)$ time.
• Therefore, algorithms that work on SLPs are widely applicable: they can be run on various types of compressed strings.

Our contribution
Given an SLP-compressed text and an uncompressed pattern, we propose $O(nm)$ algorithms for:
  ◦ Subsequence pattern matching
  ◦ FLDC (fixed length don't care) pattern matching
  ◦ VLDC (variable length don't care) pattern matching
where $n$ = size of the SLP and $m$ = length of the pattern.

Subsequence matching

Subsequence Problems on SLP [Cégielski et al. 2006]
Minimal Subsequence Occurrences
  Input: SLP of size $n$ representing string $T$, string $P$
  Output: # of minimal subsequence occurrences of $P$ in $T$
Several variations, e.g.:
Bounded Minimal Subsequence Occurrences
  Input: SLP of size $n$ representing $T$, string $P$, integer $w$
  Output: # of minimal subsequence occurrences $(i_0, i_{m-1})$ of $P$ in $T$ satisfying $i_{m-1} - i_0 \le w$

Comparison to previous work
• Decompression & [Troníček 2001]: $O(Nm) = O(2^n m)$
• [Cégielski et al. 2006]: $O(nm^2 \log m)$
• [Tiskin 2009]: $O(nm^{1.5})$
• [Tiskin 2011]: $O(nm \log m)$
This work
• Subsequence problems: $O(nm)$
• Extensions to pattern matching with Fixed/Variable Length Don't Care symbols: $O(nm)$

串: Stabbed occurrences
For $X_i = X_l X_r$, an occurrence $(u, v)$ of $P$ is said to be a stabbed occurrence in $X_i$ if
  $0 \le u < |X_l| \le v \le |X_i| - 1$.
[Figure: an occurrence $(u, v)$ crossing the boundary between $X_l$ and $X_r$, drawn like oden on a skewer.]
串 (KUSHI) is a Kanji character meaning "skewer", used to stab food.

Every occurrence is stabbed
Observation: For any interval $[u, v]$ with $0 \le u \le v \le N-1$, there exists a variable $X_i$ which stabs $[u, v]$.
[Figure: interval $[u, v]$ of $T = X_n$ stabbed by a variable $X_i$.]

Counting minimal occurrences
• $M_i$: # of minimal occurrences of $P$ in $X_i$
• $M_{串}(l, r)$: # of stabbed minimal occurrences of $P$ in $X_i = X_l X_r$
$M_n$ is the solution to our problem.
Computing $M_i$
• If $X_i = a$ ($a \in \Sigma$):
  $M_i = 1$ if $m = 1$ and $P = a$; $M_i = 0$ if $m \ne 1$ or $P \ne a$
• If $X_i = X_l X_r$ ($l, r < i$):
  $M_i = M_l + M_r + M_{串}(l, r)$
Example: $P$ = abc, $X_i$ = aabcxabaxcbxcxabxc
  $M_l = 1$, $M_r = 1$, $M_{串}(l, r) = 2$ (stabbed minimal occurrences), so $M_i = 4$
(June 27-29, CPM 2011 @ Palermo)

Computing $M_{串}(l, r)$
There are at most $m - 1$ stabbed minimal occurrences.
Example: $P$ = abbba, $X_i = X_l X_r$ = abbababbaba with $X_l$ = abbab and $X_r$ = abbaba

  k   R(l, k)   L(r, m−k)
  0   ∞         -
  1   5         1
  2   5         4
  3   2         4
  4   2         6
  5   -         6

• $R(l, k)$: length of the shortest suffix of $X_l$ containing $P[0 : m-k-1]$
• $L(r, m-k)$: length of the shortest prefix of $X_r$ containing $P[m-k : m-1]$

Computing $M_{串}(l, r)$
Lemma: $M_{串}(l, r)$ for all $X_i = X_l X_r$ can be computed in a total of $O(nm)$ time using $L$ and $R$.

  C := 0, rmin := R(l, 0)
  for k := 1 to m − 1
      if rmin > R(l, k) and L(r, m−k) < L(r, m−k−1) then
          C := C + 1
          rmin := R(l, k)
      end if
  end for
  M_串(l, r) := C

[Figure: $X_i = X_l X_r$ with the suffixes of lengths rmin and $R(l, k)$ in $X_l$ and the prefixes of lengths $L(r, m-k)$ and $L(r, m-k-1)$ in $X_r$.]
• $L(i, j)$: length of the shortest prefix of $X_i$ s.t. $P[j : m-1]$ is a subsequence
• $R(i, j)$: length of the shortest suffix of $X_i$ s.t. $P[0 : m-j-1]$ is a subsequence
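The loop of the lemma translates almost line by line into Python. The sketch below assumes the two rows $R(l, \cdot)$ and $L(r, \cdot)$ are already available as lists of length $m + 1$, with `math.inf` standing for ∞; the function name `count_stabbed` and this list layout are assumptions made for illustration.

```python
import math

def count_stabbed(R_l, L_r):
    """Count stabbed minimal occurrences for X_i = X_l X_r.

    R_l[k]: shortest suffix of X_l containing the first m-k characters of P.
    L_r[j]: shortest prefix of X_r containing P[j:].
    Both lists have length m + 1.
    """
    m = len(R_l) - 1
    count, rmin = 0, R_l[0]
    for k in range(1, m):
        # Split P after its first m-k characters. The candidate is counted
        # only if its left part is strictly shorter than every earlier
        # candidate and no later split shares the same right end.
        if rmin > R_l[k] and L_r[m - k] < L_r[m - k - 1]:
            count += 1
            rmin = R_l[k]
    return count

# Values for P = abbba, X_l = abbab, X_r = abbaba.
# Index m of either list is never read by the loop; 0 is used as a filler.
R_l = [math.inf, 5, 5, 2, 2, 0]
L_r = [6, 6, 4, 4, 1, 0]
print(count_stabbed(R_l, L_r))  # 2
```

The two counted splits are k = 1 and k = 3, which correspond to the occurrences (0, 5) and (3, 8) of abbba in abbababbaba.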

Computing Q (to compute L)
$Q(i, j)$: length of the longest prefix of $P[j:]$ which is also a subsequence of $X_i$ ($i = 1, \ldots, n$; $j = 0, \ldots, m$).
Computing $Q(i, j)$
• If $X_i = a$ ($a \in \Sigma$):
  $Q(i, j) = 1$ if $P[j] = a$; $Q(i, j) = 0$ if $P[j] \ne a$
• If $X_i = X_l X_r$ ($l, r < i$):
  $Q(i, j) = Q(l, j) + Q(r, j')$, where $j' = j + Q(l, j)$
[Figure: $X_l$ matches $Q(l, j)$ characters of $P$ starting at position $j$; $X_r$ matches $Q(r, j')$ characters starting at position $j'$.]

Computing Q (to compute L)
Example: $X_i = X_l X_r$ = xxaxbxcdxexx, $P[j:]$ = abcdef
  $Q(l, j) = 2$
  $j' := j + Q(l, j) = j + 2$, so $P[j':]$ = cdef
  $Q(r, j') = 3$
  $Q(i, j) := Q(l, j) + Q(r, j') = 2 + 3 = 5$
Lemma [Cégielski et al.]: For all $i = 1, \ldots, n$ and $j = 0, \ldots, m$, $Q(i, j)$ can be calculated in $O(nm)$ time using DP.
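The recurrence for $Q$ fills an $n \times (m+1)$ table bottom-up. The sketch below assumes an SLP stored as a 0-indexed list of rules, each either a one-character string or a pair of indices of earlier rules; this representation and the name `compute_Q` are illustrative choices, not taken from the talk.

```python
def compute_Q(slp, P):
    """Q[i][j] = length of the longest prefix of P[j:] that is a
    subsequence of the string derived from rule i. Runs in O(nm) time."""
    m = len(P)
    Q = []
    for rule in slp:
        if isinstance(rule, str):
            # Terminal: one character is matched iff it equals P[j].
            row = [1 if j < m and P[j] == rule else 0 for j in range(m + 1)]
        else:
            l, r = rule
            row = []
            for j in range(m + 1):
                jp = j + Q[l][j]          # j' = j + Q(l, j)
                row.append(Q[l][j] + Q[r][jp])
        Q.append(row)
    return Q

# Example SLP for abaababaababaababa (X_1 .. X_8 at indices 0 .. 7).
SLP = ["a", "b", (0, 1), (2, 0), (2, 3), (4, 4), (3, 5), (6, 4)]
Q = compute_Q(SLP, "aab")
print(Q[2][0], Q[3][0], Q[4][0])  # 1 2 3
```

Here ab contains only the prefix a of aab, aba contains aa, and ababa contains all of aab.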

Computing L
$L(i, j)$: length of the shortest prefix of $X_i$ s.t. $P[j:]$ is a subsequence ($i = 1, \ldots, n$; $j = 0, \ldots, m$); $\infty$ if $P[j:]$ is not a subsequence of $X_i$.
Computing $L(i, j)$
• If $X_i = a$ ($a \in \Sigma$):
  $L(i, j) = 0$ if $j = m$; $L(i, j) = 1$ if $P[j:] = a$; $L(i, j) = \infty$ otherwise ($P[j:] \ne a$)
• If $X_i = X_l X_r$:
  $L(i, j) = L(l, j)$ if $j' = m$; $L(i, j) = |X_l| + L(r, j')$ if $j' < m$, where $j' = j + Q(l, j)$
[Figure: $X_i = X_l X_r$; the prefix of length $L(i, j) = |X_l| + L(r, j')$ covers all of $X_l$ and the first $L(r, j')$ characters of $X_r$.]

Computing L
Previous work [Cégielski et al. 2006]: $O(nm^2 \log m)$
Lemma: $L(i, j)$ can be computed for all $i = 1, \ldots, n$, $j = 0, \ldots, m$, in a total of $O(nm)$ time using $Q(i, j)$.
Example: $P[j:]$ = abcdef, $X_i = X_l X_r$ = xabxcxxdexfxx with $|X_l| = 6$
  $L(l, j) = \infty$, $Q(l, j) = 3$
  $j' := j + 3$, so $P[j':]$ = def
  $L(r, j') = 5$
  $L(i, j) = |X_l| + L(r, j') = 6 + 5 = 11$

Result
Minimal Subsequence Occurrences Problem
  Input: SLP of size $n$ representing string $T$, string $P$
  Output: # of minimal occurrences of subsequence $P$ in $T$
Theorem: Given an SLP of size $n$ and a pattern of length $m$, minimal subsequence occurrences can be computed in $O(nm)$ time and space.
Comparison:
• Decompression & [Troníček 2001]: $O(Nm) = O(2^n m)$
• [Cégielski et al. 2006]: $O(nm^2 \log m)$
• [Tiskin 2009]: $O(nm^{1.5})$
• [Tiskin 2011]: $O(nm \log m)$
• This work: $O(nm)$
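Putting the recurrences for $Q$, $L$, $M_i$ and $M_{串}$ together gives an end-to-end sketch of the $O(nm)$ counting procedure. Several points are assumptions rather than statements from the slides: the SLP is a 0-indexed list of rules (a one-character string or a pair of earlier indices), the names `compute_L` and `count_minimal` are hypothetical, and $R$ is obtained by running the $L$ computation on the mirrored SLP with the reversed pattern, since the slides define $R$ but do not show how it is computed.

```python
import math

def compute_L(slp, P):
    """L[i][j] = length of the shortest prefix of X_i containing P[j:]
    as a subsequence (math.inf if none), via the Q table."""
    m = len(P)
    length, Q, L = [], [], []
    for rule in slp:
        if isinstance(rule, str):
            length.append(1)
            q = [1 if j < m and P[j] == rule else 0 for j in range(m + 1)]
            row = [math.inf] * (m + 1)
            row[m] = 0                      # empty suffix of P
            if m >= 1 and P[m - 1] == rule:
                row[m - 1] = 1              # P[j:] is exactly this character
        else:
            l, r = rule
            length.append(length[l] + length[r])
            q, row = [], []
            for j in range(m + 1):
                jp = j + Q[l][j]
                q.append(Q[l][j] + Q[r][jp])
                row.append(L[l][j] if jp == m else length[l] + L[r][jp])
        Q.append(q)
        L.append(row)
    return L

def count_minimal(slp, P):
    """Number of minimal subsequence occurrences of P in the SLP's text."""
    m = len(P)
    L = compute_L(slp, P)
    # Assumption: R(i, k) equals L(i, k) on the mirrored SLP (children
    # swapped) with the reversed pattern.
    mirrored = [r if isinstance(r, str) else (r[1], r[0]) for r in slp]
    R = compute_L(mirrored, P[::-1])
    M = []
    for rule in slp:
        if isinstance(rule, str):
            M.append(1 if m == 1 and P == rule else 0)
            continue
        l, r = rule
        stabbed, rmin = 0, R[l][0]
        for k in range(1, m):
            if rmin > R[l][k] and L[r][m - k] < L[r][m - k - 1]:
                stabbed += 1
                rmin = R[l][k]
        M.append(M[l] + M[r] + stabbed)
    return M[-1]

# Example SLP for abaababaababaababa (X_1 .. X_8 at indices 0 .. 7).
SLP = ["a", "b", (0, 1), (2, 0), (2, 3), (4, 4), (3, 5), (6, 4)]
print(count_minimal(SLP, "aba"))  # 7
```

Each table has $n(m+1)$ entries and every entry is filled in constant time, which matches the $O(nm)$ time and space of the theorem. A quick sanity check is to compare the result against a brute-force count on the decompressed text for small inputs.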

FLDC matching
