Streaming and property testing algorithms for string processing

Tatiana Starikovskaya
Based on joint work with: R. Clifford, P. Gawrychowski, A. Fontaine, E. Porat, B. Sach

▸ Pattern matching has been studied for 40+ years
▸ More than 85 algorithms
▸ The KMP algorithm uses O(|P|) space and O(|T|) time, and Aho-Corasick achieves similar bounds for dictionary matching
▸ We can't do better: we must store a description of the pattern(s) and we must read the whole text


Intrusion Detection Systems
▸ Large number of patterns
▸ Search patterns represent portions of known attack patterns and have length 1 to 30
▸ If only cache memory is used, the algorithm can benefit most from a high-performance cache

Outline of today's talk
Streaming model
▸ Exact pattern matching
▸ Approximate pattern matching (Hamming distance)
▸ Approximate pattern matching (edit distance)
▸ Preprocessing
Property testing model
▸ Exact pattern matching

Streaming model
We want to process the stream on the fly and in small space

Part I: Exact pattern matching

Exact pattern matching
[Figure: the pattern P slides along the text T; the answer shown is NO at every alignment except one, where it is YES]
▸ Query = "Is there an occurrence of P?"
▸ Space = total space used by the stream processor
▸ Time = time per position of T

Karp-Rabin algorithm
Karp-Rabin fingerprint: $\phi(s_1 s_2 \ldots s_m) = \sum_{i=1}^{m} s_i r^{m-i} \bmod p$, where p is a prime and r is a random integer in [0, p − 1]
It's a good hash function. Let $S_1$, $S_2$ be two strings of length m, and let the prime p be large:
1. $S_1 = S_2$ ⇒ $\phi(S_1) = \phi(S_2)$
2. $S_1 \neq S_2$ and the lengths of $S_1$, $S_2$ are equal ⇒ $\phi(S_1) \neq \phi(S_2)$ w.h.p.
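The fingerprint definition above can be sketched in a few lines of Python. This is only an illustration: the modulus `PRIME` (the Mersenne prime 2^61 − 1), the use of `ord` to turn characters into integers, and the function name `fingerprint` are assumptions made here, not choices stated on the slide.

```python
import random

# Assumed modulus for illustration: the Mersenne prime 2^61 - 1.
PRIME = (1 << 61) - 1

def fingerprint(s, r, p=PRIME):
    """Return sum_i s_i * r^(m-i) mod p, evaluated with Horner's rule."""
    h = 0
    for ch in s:
        # Multiplying by r shifts every earlier character one power up.
        h = (h * r + ord(ch)) % p
    return h

# r is drawn at random from [0, p - 1], as on the slide.
r = random.randrange(PRIME)

# Equal strings always get equal fingerprints.
same = fingerprint("bcaaac", r) == fingerprint("bcaaac", r)
```

Distinct strings of the same length can collide only if r happens to be a root of their difference polynomial modulo p, which is why a large p makes collisions unlikely.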

Karp-Rabin algorithm
text T: c a a b c a a a c a
pattern P: b c a a a c
When a new character $t_i$ = a arrives:
1. Compute the fingerprint $\phi(t_{i-m+1} \ldots t_{i-1} t_i)$ in O(1) time:
   $\phi(caaaca) = ((\phi(bcaaac) - b \cdot r^{m-1}) \cdot r + a) \bmod p$
2. If $\phi(t_{i-m+1} \ldots t_{i-1} t_i) = \phi(P)$, output "YES"
We need $t_{i-m}$ to update the fingerprint ⇒ we must store $t_{i-m}, \ldots, t_{i-1}$

Karp-Rabin algorithm
The Karp-Rabin algorithm is a streaming pattern matching algorithm that uses Θ(m) space and O(1) time per character of T
It finds all occurrences of P in T correctly w.h.p.
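A minimal sketch of the streaming Karp-Rabin matcher described above, assuming the same illustrative choices as before (modulus 2^61 − 1, characters mapped through `ord`). The function name `karp_rabin_stream` and the convention of reporting 0-indexed end positions are hypothetical. The buffer of the last m characters is exactly the Θ(m) space the slide refers to.

```python
import random
from collections import deque

PRIME = (1 << 61) - 1  # assumed modulus, not specified in the talk

def karp_rabin_stream(pattern, stream, r, p=PRIME):
    """Yield the 0-indexed end position of every window whose fingerprint equals phi(P)."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    target = 0
    for ch in pattern:
        target = (target * r + ord(ch)) % p
    top = pow(r, m - 1, p)      # r^(m-1), weight of the oldest character
    window = deque()            # the last m characters: Theta(m) space
    h = 0                       # fingerprint of the current window
    for i, ch in enumerate(stream):
        if len(window) == m:
            old = window.popleft()
            h = (h - ord(old) * top) % p   # remove the oldest character
        h = (h * r + ord(ch)) % p          # shift and append the new one
        window.append(ch)
        if len(window) == m and h == target:
            yield i

r = random.randrange(1, PRIME)
# Text and pattern as read off the slide's figure; the match ends at index 8 w.h.p.
hits = list(karp_rabin_stream("bcaaac", "caabcaaaca", r))
```

Each character costs a constant number of modular operations, and a reported position can be wrong only when two different windows share a fingerprint.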

Exact pattern matching

Space¹                   Time                 Authors
Single pattern
Θ(m)                     O(1)                 Karp & Rabin, 1987
O(log m)                 O(log m)             Porat & Porat, 2009 ★
O(log m)                 O(1)                 Breslauer & Galil, 2011
Dictionary of d patterns
O(d log m)               O(log log(m + d))    Clifford, Fontaine, Porat, Sach, S., 2015
O(d log m)               O(log log |Σ|)       Golan & Porat, 2017
O(|Σ|^ε d log(m/ε))      O(1/ε)               Golan & Porat, 2017

¹ In words

Porat & Porat, 2009 ★
Level 0 stores the occurrences of $p_1$ in the text T, level 1 the occurrences of $p_1 p_2$, level 2 the occurrences of $p_1 p_2 p_3 p_4$, ..., and the last level the occurrences of $P = p_1 p_2 \ldots p_m$
▸ If i is an occurrence of $p_1$, push it to level 0
▸ If lp is an occurrence of $p_1 p_2$, promote it

for each character $t_i$ do
    if $t_i = p_1$ then push i to level 0
    for each j = 0, ..., log m − 1
        lp ← leftmost position in level j
        if $i - lp + 1 = 2^{j+1}$ then
            pop lp from level j
            if $\phi(t_{lp} \ldots t_i) = \phi(p_1 \ldots p_{2^{j+1}})$ then push lp to level j + 1
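The level-by-level pseudocode can be turned into a runnable sketch under a few assumptions. Here m is assumed to be a power of two, positions are 0-indexed, and every candidate is kept in a per-level queue together with the fingerprint of the text before it. So this version illustrates the promotion logic only and does not achieve the O(log m) space bound, which needs the compressed per-level storage. The name `stream_match_levels` and the modulus are hypothetical.

```python
from collections import deque

PRIME = (1 << 61) - 1  # assumed modulus

def fingerprint(s, r, p=PRIME):
    h = 0
    for ch in s:
        h = (h * r + ord(ch)) % p
    return h

def stream_match_levels(pattern, text, r, p=PRIME):
    """Return 0-indexed start positions of pattern in text (correct w.h.p.)."""
    m = len(pattern)
    assert m >= 1 and m & (m - 1) == 0, "this sketch assumes m is a power of two"
    num_levels = m.bit_length() - 1                      # log2(m)
    # target[j]: fingerprint of the pattern prefix of length 2^(j+1)
    target = [fingerprint(pattern[:1 << (j + 1)], r, p) for j in range(num_levels)]
    # shift[j] = r^(2^(j+1)), used to cut a window out of prefix fingerprints
    shift = [pow(r, 1 << (j + 1), p) for j in range(num_levels)]
    levels = [deque() for _ in range(num_levels + 1)]
    prefix = 0                                           # fingerprint of t_1..t_i
    found = []
    for i, ch in enumerate(text):
        before = prefix                                  # fingerprint of t_1..t_(i-1)
        prefix = (prefix * r + ord(ch)) % p
        if ch == pattern[0]:
            levels[0].append((i, before))
        for j in range(num_levels):
            if levels[j] and i - levels[j][0][0] + 1 == 1 << (j + 1):
                lp, fp_before = levels[j].popleft()
                # phi(t_lp..t_i) = phi(t_1..t_i) - phi(t_1..t_(lp-1)) * r^(length)
                window = (prefix - fp_before * shift[j]) % p
                if window == target[j]:
                    levels[j + 1].append((lp, fp_before))
        while levels[num_levels]:                        # full occurrences of P
            found.append(levels[num_levels].popleft()[0])
    return found
```

Only the leftmost entry of a level needs checking at each step, because candidates enter a level in increasing order and each one is due exactly when its window reaches length 2^(j+1).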

Porat & Porat, 2009 ★
Lemma: If there are ≥ 3 occurrences of a $2^j$-length string in a $2^{j+1}$-length string, the occurrences form a run
For each level we store:
▸ The leftmost and the second leftmost positions lp, lp′
▸ The fingerprints of $t_1 t_2 \ldots t_{lp}$, $t_{lp+1} \ldots t_{lp'}$, and $t_1 \ldots t_i$
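A small check of the lemma on a concrete example may help. The sketch below reads "form a run" as "the occurrence positions form an arithmetic progression", which is an interpretation assumed here. The helper names `occurrences` and `forms_run` are hypothetical.

```python
def occurrences(u, v):
    """All 0-indexed positions at which u occurs in v."""
    return [k for k in range(len(v) - len(u) + 1) if v.startswith(u, k)]

def forms_run(positions):
    """True if consecutive positions are all the same distance apart."""
    gaps = {b - a for a, b in zip(positions, positions[1:])}
    return len(gaps) <= 1

# A string of length 4 inside a string of length 8.
occ = occurrences("abab", "abababab")   # [0, 2, 4]
is_run = forms_run(occ)                 # True
```

Because such a run is described by its first position and the common gap, the two leftmost positions per level are enough to recover all stored candidates.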

Porat & Porat, 2009 ★
For each level we need:
▸ O(1) space
▸ O(1) time for updating and extracting $\phi(t_{lp} \ldots t_i)$
Theorem: The Porat & Porat algorithm is a streaming pattern matching algorithm that uses O(log m) space and O(log m) time per character

Part II: Approximate pattern matching

Approximate pattern matching
[Figure: the pattern P aligned under the text T, annotated with dist(P, T)]
▸ Query = "Distance between P and T"
▸ Distance: Hamming, edit, ...

Approximate pattern matching (Hamming distance)
Any streaming algorithm for computing exact Hamming distances must use Ω(m) space
By Yao's minimax principle, it suffices to consider deterministic algorithms on a "hard" distribution of the inputs
text T: 1 0 1 1 0 0 0 0 0 0 0 0 (T[1, m] is random)
pattern P: 0 0 0 0 0 0
dist(P, T) = 3
After reading T[m], the algorithm cannot go back and read any of the letters T[1], T[2], ..., T[m], but it can restore T[1, m] from the distances it goes on to output
Therefore, it stores a full description of T[1, m] ⇒ Ω(m) space by an information-theoretic argument
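The "can restore T[1, m]" step can be made concrete with a short sketch. It follows one reading of the figure, which is an assumption: the pattern is all zeros, and the random prefix T[1, m] is followed by m zeros. Under that reading, the distance at each alignment counts the ones remaining in the prefix, so consecutive distances differ by exactly one prefix bit. The function names are hypothetical.

```python
def hamming(a, b):
    """Number of positions where two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def distances_after_prefix(prefix_bits):
    """Distances between P = 0^m and every length-m window of prefix + 0^m."""
    m = len(prefix_bits)
    text = prefix_bits + [0] * m
    pattern = [0] * m
    return [hamming(pattern, text[k:k + m]) for k in range(m + 1)]

def recover_prefix(distances):
    """Bit k of the prefix is the drop between consecutive distances."""
    return [distances[k] - distances[k + 1] for k in range(len(distances) - 1)]

bits = [1, 0, 1, 1, 0, 0]             # the random prefix from the slide
d = distances_after_prefix(bits)      # [3, 2, 2, 1, 0, 0, 0]
restored = recover_prefix(d)          # [1, 0, 1, 1, 0, 0]
```

Since the trailing zeros carry no information, everything needed to produce these distances must already be in memory once T[m] has been read, and there are 2^m possible prefixes to tell apart.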
