
Streaming and property testing algorithms for string processing



  1. Streaming and property testing algorithms for string processing
     Tatiana Starikovskaya
     Based on joint work with: R. Clifford, P. Gawrychowski, A. Fontaine, E. Porat, B. Sach

  2. ▸ Pattern matching has been studied for 40+ years
     ▸ More than 85 algorithms
     ▸ The KMP algorithm uses $O(|P|)$ space and $O(|T|)$ time, and Aho-Corasick achieves similar bounds for dictionary matching
     ▸ We can’t do better: we must store a description of the pattern(s) and we must read the whole text

  3. (Slide with no extractable text.)

  4. Intrusion Detection Systems
     ▸ Large number of patterns
     ▸ Search patterns represent portions of known attack patterns and have length 1 to 30
     ▸ If only cache memory is used, the algorithm can benefit most from a high-performance cache

  5. Outline of today’s talk
     Streaming model
     ▸ Exact pattern matching
     ▸ Approximate pattern matching (Hamming distance)
     ▸ Approximate pattern matching (edit distance)
     ▸ Preprocessing
     Property testing model
     ▸ Exact pattern matching

  6. Streaming model
     We want to process the stream on the fly and in small space

  7. Part I: Exact pattern matching

  8. Exact pattern matching
     text T (read so far): c a a b c a
     pattern P: b c a a a c
     Answer: NO
     ▸ Query = “Is there an occurrence of P?”
     ▸ Space = total space used by the stream processor
     ▸ Time = time per position of T

  9. Exact pattern matching (animation: next character of T arrives)
     text T (read so far): c a a b c a a
     Answer: NO

  10. Exact pattern matching (animation: next character of T arrives)
      text T (read so far): c a a b c a a a
      Answer: NO

  11. Exact pattern matching (animation: next character of T arrives)
      text T (read so far): c a a b c a a a c
      Answer: YES

  12. Exact pattern matching (animation: next character of T arrives)
      text T (read so far): c a a b c a a a c a
      Answer: NO

  13. Karp-Rabin algorithm
      Karp-Rabin fingerprint:
      $\phi(s_1 s_2 \ldots s_m) = \sum_{i=1}^{m} s_i r^{m-i} \bmod p$,
      where $p$ is a prime and $r$ is a random integer in $[0, p-1]$
      It’s a good hash function. If $S_1$, $S_2$ are two strings of length $m$ and the prime $p$ is large:
      1. $S_1 = S_2 \Rightarrow \phi(S_1) = \phi(S_2)$
      2. $S_1 \neq S_2$, lengths of $S_1$, $S_2$ are equal $\Rightarrow \phi(S_1) \neq \phi(S_2)$ w.h.p.
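The fingerprint definition on this slide can be illustrated with a short Python sketch. The function name `karp_rabin_fingerprint`, the constant `P_MOD`, and the choice of the Mersenne prime $2^{61}-1$ as the "large prime" are assumptions made for illustration; the slides do not fix them. Characters are mapped to integers with `ord`.

```python
import random

# Assumed modulus for illustration: the Mersenne prime 2^61 - 1.
P_MOD = (1 << 61) - 1


def karp_rabin_fingerprint(s, r, p=P_MOD):
    """Return phi(s_1 ... s_m) = sum_i s_i * r^(m-i) mod p.

    Horner's rule evaluates the sum left to right, so s_1 ends up
    multiplied by r^(m-1) and s_m by r^0.
    """
    h = 0
    for ch in s:
        h = (h * r + ord(ch)) % p
    return h


if __name__ == "__main__":
    r = random.randrange(P_MOD)  # random integer in [0, p - 1]
    # Property 1: equal strings always have equal fingerprints.
    print(karp_rabin_fingerprint("bcaaac", r) == karp_rabin_fingerprint("bcaaac", r))
```

Property 2 is probabilistic: for a fixed pair of distinct strings, only a few values of `r` can make the fingerprints collide, which is why `r` is drawn at random.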

  14. Karp-Rabin algorithm
      text T (read so far): c a a b c a a a c a
      pattern P: b c a a a c
      When a new character $t_i = a$ arrives:
      1. Compute the fingerprint $\phi(t_{i-m+1} \ldots t_{i-1} t_i)$ in $O(1)$ time:
         $\phi(caaaca) = \left((\phi(bcaaac) - b \cdot r^{m-1}) \cdot r + a\right) \bmod p$
      2. If $\phi(t_{i-m+1} \ldots t_{i-1} t_i) = \phi(P)$, output “YES”
      We need $t_{i-m}$ to update the fingerprint ⇒ we must store $t_{i-m}, \ldots, t_{i-1}$
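The constant-time update on this slide follows directly from the fingerprint definition $\phi(s_1 \ldots s_m) = \sum_{i=1}^{m} s_i r^{m-i} \bmod p$. A short derivation, with all equalities taken modulo $p$ and the summation index $k$ introduced only for this derivation:

```latex
\begin{align*}
\phi(t_{i-m} \ldots t_{i-1})
  &= \sum_{k=1}^{m} t_{i-m-1+k}\, r^{m-k}
   = t_{i-m}\, r^{m-1} + \sum_{k=2}^{m} t_{i-m-1+k}\, r^{m-k}, \\
\bigl(\phi(t_{i-m} \ldots t_{i-1}) - t_{i-m}\, r^{m-1}\bigr) \cdot r
  &= \sum_{k=2}^{m} t_{i-m-1+k}\, r^{m-k+1}
   = \sum_{k=1}^{m-1} t_{i-m+k}\, r^{m-k}, \\
\bigl(\phi(t_{i-m} \ldots t_{i-1}) - t_{i-m}\, r^{m-1}\bigr) \cdot r + t_i
  &= \sum_{k=1}^{m} t_{i-m+k}\, r^{m-k}
   = \phi(t_{i-m+1} \ldots t_i).
\end{align*}
```

The only character of the old window that the update reads is $t_{i-m}$, which is why the last $m$ characters have to be kept.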

  15. Karp-Rabin algorithm
      The Karp-Rabin algorithm is a streaming pattern matching algorithm that uses $\Theta(m)$ space and $O(1)$ time per character of $T$
      It finds all occurrences of $P$ in $T$ correctly w.h.p.
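A minimal Python sketch of the streaming Karp-Rabin matcher described above. The names `stream_karp_rabin` and `P_MOD`, the modulus $2^{61}-1$, and the use of `ord` to turn characters into integers are illustrative assumptions, not taken from the slides. The `deque` holding the last $m$ characters is the $\Theta(m)$ space the slide refers to.

```python
import random
from collections import deque

# Assumed modulus for illustration: the Mersenne prime 2^61 - 1.
P_MOD = (1 << 61) - 1


def stream_karp_rabin(pattern, stream, r=None, p=P_MOD):
    """Yield the 0-based end index of each window whose fingerprint equals phi(P).

    A reported index is an occurrence with high probability; fingerprint
    collisions are possible but unlikely for a random r and a large p.
    """
    m = len(pattern)
    if r is None:
        r = random.randrange(p)  # random integer in [0, p - 1]
    target = 0
    for ch in pattern:
        target = (target * r + ord(ch)) % p
    top = pow(r, m - 1, p)       # r^(m-1), the weight of the oldest character
    window = deque()             # the last m characters of the stream
    h = 0                        # fingerprint of the current window
    for i, ch in enumerate(stream):
        if len(window) == m:
            old = window.popleft()
            h = (h - ord(old) * top) % p   # drop the oldest character
        h = (h * r + ord(ch)) % p          # shift and append the new one
        window.append(ch)
        if len(window) == m and h == target:
            yield i


if __name__ == "__main__":
    print(list(stream_karp_rabin("bcaaac", "caabcaaaca")))
```

Each arriving character costs a constant number of arithmetic operations, matching the $O(1)$ time per character stated on the slide.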

  16. Exact pattern matching
      Space (in words)                                    Time                     Authors
      Single pattern
      $\Theta(m)$                                         $O(1)$                   Karp & Rabin, 1987
      $O(\log m)$                                         $O(\log m)$              Porat & Porat, 2009
      $O(\log m)$                                         $O(1)$                   Breslauer & Galil, 2011
      Dictionary of $d$ patterns
      $O(d \log m)$                                       $O(\log\log(m + d))$     Clifford, Fontaine, Porat, Sach, S., 2015
      $O(d \log m)$                                       $O(\log\log|\Sigma|)$    Golan & Porat, 2017
      $O(|\Sigma|^{\varepsilon} d \log(m/\varepsilon))$   $O(1/\varepsilon)$

  17. Exact pattern matching (animation: same table, with ★ marking Porat & Porat, 2009)

  18. Porat & Porat, 2009 ★
      Each level stores positions of the text T (marked ✖ on the slide):
      level 0: occurrences of $p_1$
      level 1: occurrences of $p_1 p_2$
      level 2: occurrences of $p_1 p_2 p_3 p_4$
      ⋮
      level $\log m$: occurrences of $P = p_1 p_2 \ldots p_m$

      for each character $t_i$ do
          if $t_i = p_1$ then push $i$ to level 0
          for each $j = 0, \ldots, \log m - 1$
              lp ← leftmost position in level $j$
              if $i - lp + 1 = 2^{j+1}$ then
                  pop lp from level $j$
                  if $\phi(t_{lp} \ldots t_i) = \phi(p_1 \ldots p_{2^{j+1}})$ then push lp to level $j + 1$

  19. Porat & Porat, 2009 ★ (animation: the current character $t_i$ is highlighted)

  20. Porat & Porat, 2009 ★ (animation) If $i$ is an occ. of $p_1$, push it to level 0

  21. Porat & Porat, 2009 ★ (animation, continued) If $i$ is an occ. of $p_1$, push it to level 0

  22. Porat & Porat, 2009 ★ (animation) If lp is an occ. of $p_1 p_2$, promote it

  23. Porat & Porat, 2009 ★ (animation, continued) If lp is an occ. of $p_1 p_2$, promote it
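The level-by-level pseudocode above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it assumes $m$ is a power of two, it keeps every candidate position of a level in an explicit queue (so it does not reach the $O(\log m)$ space bound), and the names `porat_porat_sketch`, `fingerprint`, and `P_MOD` as well as the modulus $2^{61}-1$ are hypothetical choices. Each queued position carries the fingerprint of the text prefix before it, so that $\phi(t_{lp} \ldots t_i)$ can be extracted from two prefix fingerprints.

```python
import random
from collections import deque

# Assumed modulus for illustration: the Mersenne prime 2^61 - 1.
P_MOD = (1 << 61) - 1


def fingerprint(s, r, p=P_MOD):
    """Karp-Rabin fingerprint of s, evaluated by Horner's rule."""
    h = 0
    for ch in s:
        h = (h * r + ord(ch)) % p
    return h


def porat_porat_sketch(pattern, text, r=None, p=P_MOD):
    """Return 0-based start positions of occurrences of pattern (w.h.p.).

    Assumes len(pattern) is a power of two. Level j holds starts of
    occurrences of the pattern prefix of length 2^j.
    """
    m = len(pattern)
    assert m >= 1 and m & (m - 1) == 0, "sketch assumes m is a power of two"
    if r is None:
        r = random.randrange(p)
    levels = m.bit_length() - 1                       # log2(m)
    prefix_fp = [fingerprint(pattern[:1 << (j + 1)], r, p) for j in range(levels)]
    shift = [pow(r, 1 << (j + 1), p) for j in range(levels)]  # r^(2^(j+1))
    queues = [deque() for _ in range(levels)]
    occurrences = []
    seen = 0                                          # fingerprint of t_1 ... t_i
    for i, ch in enumerate(text):
        before = seen                                 # fingerprint of t_1 ... t_{i-1}
        seen = (seen * r + ord(ch)) % p
        if ch == pattern[0]:
            if levels == 0:
                occurrences.append(i)                 # m = 1: level 0 is the top level
            else:
                queues[0].append((i, before))
        for j in range(levels):
            if not queues[j]:
                continue
            lp, fp_before_lp = queues[j][0]           # leftmost position in level j
            if i - lp + 1 == 1 << (j + 1):
                queues[j].popleft()
                # phi(t_lp ... t_i) from the two prefix fingerprints
                window = (seen - fp_before_lp * shift[j]) % p
                if window == prefix_fp[j]:
                    if j + 1 == levels:
                        occurrences.append(lp)
                    else:
                        queues[j + 1].append((lp, fp_before_lp))
    return occurrences


if __name__ == "__main__":
    print(porat_porat_sketch("abab", "abababxabab"))
```

Within a level, positions are pushed in increasing order, so the leftmost one is always the first to reach length $2^{j+1}$; that is why the pseudocode only inspects the leftmost position.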

  24. Porat & Porat, 2009 ★
      Lemma. If there are ≥ 3 occurrences of a $2^j$-length string in a $2^{j+1}$-length string, the occurrences form a run
      For each level we store:
      ▸ The leftmost and the second leftmost positions $lp$, $lp'$
      ▸ The fingerprints of $t_1 t_2 \ldots t_{lp}$, $t_{lp+1} \ldots t_{lp'}$, and $t_1 \ldots t_i$
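One way to read the lemma: when a level holds three or more positions, they are equally spaced (an arithmetic progression), so the leftmost and second leftmost positions plus a count describe all of them. The sketch below shows only this position bookkeeping; the class name `ProgressionQueue` is hypothetical, and the fingerprints that the slide stores alongside the positions are left out to keep the example short.

```python
class ProgressionQueue:
    """FIFO of increasing positions that form an arithmetic progression.

    Stores only the first position, the step, and the count, i.e. O(1)
    words regardless of how many positions are queued.
    """

    def __init__(self):
        self.first = None
        self.step = None
        self.count = 0

    def push(self, pos):
        if self.count == 0:
            self.first, self.count = pos, 1
        elif self.count == 1:
            self.step = pos - self.first   # the second position fixes the step
            self.count = 2
        else:
            expected = self.first + self.count * self.step
            if pos != expected:
                raise ValueError("position does not extend the progression")
            self.count += 1

    def leftmost(self):
        return self.first if self.count else None

    def pop(self):
        """Remove and return the leftmost position."""
        if self.count == 0:
            raise IndexError("pop from empty ProgressionQueue")
        pos = self.first
        self.count -= 1
        if self.count == 0:
            self.first = self.step = None
        else:
            self.first += self.step
        return pos


if __name__ == "__main__":
    q = ProgressionQueue()
    for pos in (3, 5, 7):
        q.push(pos)
    print(q.pop(), q.leftmost(), q.count)
```

The `ValueError` branch marks the case the lemma rules out for positions that are close enough together; a full implementation would need to argue that it never triggers.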

  25. Porat & Porat, 2009 ★
      For each level we need:
      ▸ $O(1)$ space
      ▸ $O(1)$ time for updating and extracting $\phi(t_{lp} \ldots t_i)$
      Theorem. The Porat & Porat algorithm is a streaming pattern matching algorithm that uses $O(\log m)$ space and $O(\log m)$ time per character

  26. Part II: Approximate pattern matching

  27. Approximate pattern matching
      text T: c a a b c a a a c a
      pattern P: b c a a a c
      Output: dist(P, T)
      ▸ Query = “Distance between P and T”
      ▸ Distance: Hamming, edit, ...

  28. Approximate pattern matching (Hamming distance)
      Any streaming algorithm for computing exact Hamming distances must use $\Omega(m)$ space
      By Yao’s minimax principle it suffices to consider deterministic algorithms on a “hard” distribution of the inputs
      text T: 1 0 1 1 0 0 0 0 0 0 0 0, where $T[1, m]$ is random and is followed by zeros
      pattern P: 0 0 0 0 0 0
      After reading $T[m]$, the algorithm cannot go back and read any of the letters $T[1], T[2], \ldots, T[m]$, but it can restore $T[1, m]$ from the distances it outputs afterwards
      Therefore, it stores a full description of $T[1, m]$ ⇒ $\Omega(m)$ space by an information-theoretic argument

  29. Approximate pattern matching (Hamming distance) (animation: same slide, showing dist(P, T) = 3 for the alignment of P with $T[1, m]$)
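The "can restore $T[1, m]$" step can be checked on the slide's own instance. With an all-zero pattern and zeros following $T[1, m]$, the distance at each alignment counts the ones of $T[1, m]$ still inside the window, so consecutive differences give back the bits one by one. The sketch below uses the hypothetical helper names `hamming_distances` and `restore_prefix`, and it computes the distances offline purely to illustrate the argument.

```python
def hamming_distances(text, pattern):
    """Hamming distance between pattern and every length-m window of text."""
    m = len(pattern)
    return [
        sum(a != b for a, b in zip(text[start:start + m], pattern))
        for start in range(len(text) - m + 1)
    ]


def restore_prefix(distances, m):
    """Recover T[1, m] from the distances of the m + 1 leftmost alignments.

    With an all-zero pattern, distances[k] counts the ones among
    T[k+1], ..., T[m], so distances[k] - distances[k + 1] equals T[k+1].
    """
    return [distances[k] - distances[k + 1] for k in range(m)]


if __name__ == "__main__":
    m = 6
    prefix = [1, 0, 1, 1, 0, 0]      # the random part T[1, m] from the slide
    text = prefix + [0] * m          # followed by zeros
    pattern = [0] * m
    d = hamming_distances(text, pattern)
    print(d[0], restore_prefix(d, m) == prefix)
```

Since the outputs produced after position $m$ determine all $m$ random bits, the memory contents at that moment must encode them, which is the information-theoretic step behind the $\Omega(m)$ bound.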
