Streaming and property testing algorithms for string processing


slide-1
SLIDE 1

Streaming and property testing algorithms for string processing

Tatiana Starikovskaya

Based on joint work with:

  • R. Clifford, P. Gawrychowski, A. Fontaine, E. Porat, B. Sach

1 / 31

slide-2
SLIDE 2

▸ Pattern matching has been studied for 40+ years
▸ More than 85 algorithms
▸ The KMP algorithm uses O(|P|) space and O(|T|) time, and Aho-Corasick achieves similar bounds for dictionary matching
▸ We can’t do better: we must store a description of the pattern(s) and we must read the whole text

2 / 31

slide-3
SLIDE 3

3 / 31

slide-4
SLIDE 4

Intrusion Detection Systems

▸ Large number of patterns
▸ Search patterns represent portions of known attack patterns and have length 1–30
▸ If only cache memory is used, the algorithm can benefit most from a high-performance cache

4 / 31

slide-5
SLIDE 5

Outline of today’s talk

Streaming model

▸ Exact pattern matching
▸ Approximate pattern matching (Hamming distance)
▸ Approximate pattern matching (edit distance)
▸ Preprocessing

Property testing model

▸ Exact pattern matching

5 / 31

slide-6
SLIDE 6

Streaming model

We want to process the stream on-the-fly & in small space

6 / 31

slide-7
SLIDE 7

Part I: Exact pattern matching

7 / 31

slide-8
SLIDE 8

Exact pattern matching

pattern P = b c a a a c

The text T arrives one character at a time, and the query is answered after each new character:

text T read so far        answer
c a a b c a               NO
c a a b c a a             NO
c a a b c a a a           NO
c a a b c a a a c         YES
c a a b c a a a c a       NO

▸ Query = “Is there an occurrence of P ending at the current position?”
▸ Space = total space used by the stream processor
▸ Time = time per position of T

8 / 31

slide-13
SLIDE 13

Karp-Rabin algorithm

Karp-Rabin fingerprint

$\varphi(s_1 s_2 \dots s_m) = \sum_{i=1}^{m} s_i r^{m-i} \bmod p$

where p is a prime and r is a random integer ∈ [0, p − 1]

It’s a good hash function: if S1, S2 are two strings of length m and the prime p is large, then

  • 1. S1 = S2 ⇒ ϕ(S1) = ϕ(S2)
  • 2. S1 ≠ S2, lengths of S1, S2 are equal ⇒ ϕ(S1) ≠ ϕ(S2) w.h.p.

9 / 31
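The fingerprint definition above can be sketched in a few lines of Python. This is only an illustration: the modulus `P_MOD` (the Mersenne prime 2^61 − 1), the use of `ord` to turn letters into integers, and the names `random_base` and `fingerprint` are assumptions, not something fixed by the slides.

```python
import random

P_MOD = (1 << 61) - 1  # assumed modulus: the Mersenne prime 2^61 - 1


def random_base(p=P_MOD):
    """Draw the random integer r from [0, p - 1]."""
    return random.randrange(p)


def fingerprint(s, r, p=P_MOD):
    """phi(s_1 ... s_m) = sum_{i=1}^{m} s_i * r^(m-i) mod p, via Horner's rule."""
    fp = 0
    for ch in s:
        fp = (fp * r + ord(ch)) % p
    return fp


r = random_base()
# Property 1: equal strings always have equal fingerprints.
assert fingerprint("bcaaac", r) == fingerprint("bcaaac", r)
```

Horner's rule evaluates the sum with one multiplication and one addition per letter, which is also what makes the sliding update on the next slide possible.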

slide-14
SLIDE 14

Karp-Rabin algorithm

pattern P = b c a a a c
text T = c a a b c a a a c a (the answer is YES once the occurrence b c a a a c has been read)

When a new character $t_i$ = a arrives:

  • 1. Compute the fingerprint $\varphi(t_{i-m+1} \dots t_{i-1} t_i)$ in O(1) time:

$\varphi(caaaca) = \big((\varphi(bcaaac) - b \cdot r^{m-1}) \cdot r + a\big) \bmod p$

  • 2. If $\varphi(t_{i-m+1} \dots t_{i-1} t_i) = \varphi(P)$, output “YES”

We need $t_{i-m}$ to update the fingerprint ⇒ we must store $t_{i-m}, \dots, t_{i-1}$

The K.-R. algorithm is a streaming pattern matching algorithm that uses Θ(m) space and O(1) time per character of T
It finds all occurrences of P in T correctly w.h.p.

10 / 31
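A minimal sketch of the streaming Karp-Rabin matcher described on these slides. The function name `karp_rabin_stream`, the modulus, and the 0-based reporting of start positions are assumptions made for the illustration. The sliding window of the last m characters is exactly the Θ(m) storage the slide refers to.

```python
import random
from collections import deque

P_MOD = (1 << 61) - 1  # assumed prime modulus


def karp_rabin_stream(pattern, stream, r=None, p=P_MOD):
    """Yield 0-based start positions of (probable) occurrences of pattern."""
    if r is None:
        r = random.randrange(p)  # random base, as in the fingerprint definition
    m = len(pattern)
    target = 0
    for ch in pattern:
        target = (target * r + ord(ch)) % p
    top = pow(r, m - 1, p)  # weight r^(m-1) of the oldest character
    window = deque()        # the last m characters: Theta(m) space
    fp = 0
    for i, ch in enumerate(stream):
        if len(window) == m:
            old = window.popleft()
            fp = (fp - ord(old) * top) % p  # drop the oldest character
        window.append(ch)
        fp = (fp * r + ord(ch)) % p         # shift and add the new character
        if len(window) == m and fp == target:
            yield i - m + 1
```

For the pattern and text of the slide, `list(karp_rabin_stream("bcaaac", "caabcaaaca"))` reports the single occurrence starting at 0-based position 3, up to the small probability of a fingerprint collision.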

slide-16
SLIDE 16

Exact pattern matching

Authors                                      Space¹                   Time

Single pattern
Karp & Rabin, 1987                           Θ(m)                     O(1)
Porat & Porat, 2009 ★                        O(log m)                 O(log m)
Breslauer & Galil, 2011                      O(log m)                 O(1)

Dictionary of d patterns
Clifford, Fontaine, Porat, Sach, S., 2015    O(d log m)               O(log log(m + d))
Golan & Porat, 2017                          O(d log m)               O(log log |Σ|)
                                             O(|Σ|^ε d log(m/ε))      O(1/ε)

¹ In words

11 / 31

slide-18
SLIDE 18

Porat & Porat, 2009 ★

[Figure: the text T with the stored candidate positions (✖) on each level]

  • Level 0: occurrences of p1
  • Level 1: occurrences of p1p2
  • Level 2: occurrences of p1p2p3p4
  • ⋮
  • Level log m: occurrences of P = p1p2 ... pm

for each character $t_i$ do
    if $t_i = p_1$ then push i to level 0
    for each j = 0, ..., log m − 1 do
        lp ← leftmost position in level j
        if $i - lp + 1 = 2^{j+1}$ then
            pop lp from level j
            if $\varphi(t_{lp} \dots t_i) = \varphi(p_1 \dots p_{2^{j+1}})$ then push lp to level j + 1

▸ If i is an occ. of p1, push it to level 0
▸ If lp is an occ. of p1p2, promote it (and similarly on the higher levels)

12 / 31
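The level-by-level promotion can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes m is a power of two, keeps every candidate position in a queue (so it does not achieve the small space bound, which needs a compressed representation of each level), and stores with each candidate the fingerprint of the text before it so that a window fingerprint can be extracted in O(1) time. The name `level_matcher` and the modulus are assumptions.

```python
import random
from collections import deque

P_MOD = (1 << 61) - 1  # assumed prime modulus


def level_matcher(pattern, text, r=None, p=P_MOD):
    """Return 0-based start positions of (probable) occurrences of pattern."""
    if r is None:
        r = random.randrange(p)
    m = len(pattern)
    assert m >= 1 and m & (m - 1) == 0, "sketch assumes m is a power of two"
    top = m.bit_length() - 1  # log2(m)

    # prefix_fp[j] = fingerprint of the pattern prefix of length 2^j
    prefix_fp, fp = [], 0
    for length, ch in enumerate(pattern, start=1):
        fp = (fp * r + ord(ch)) % p
        if length & (length - 1) == 0:
            prefix_fp.append(fp)

    levels = [deque() for _ in range(top + 1)]
    total = 0  # fingerprint of the text read so far
    for i, ch in enumerate(text):
        before = total
        total = (total * r + ord(ch)) % p
        if ch == pattern[0]:
            levels[0].append((i, before))  # occurrence of p_1
        for j in range(top):
            if not levels[j]:
                continue
            lp, fp_before = levels[j][0]  # leftmost position in level j
            span = 1 << (j + 1)
            if i - lp + 1 == span:
                levels[j].popleft()
                # fingerprint of t_lp ... t_i from two prefix fingerprints
                window = (total - fp_before * pow(r, span, p)) % p
                if window == prefix_fp[j + 1]:
                    levels[j + 1].append((lp, fp_before))  # promote
    return [lp for lp, _ in levels[top]]
```

The top level simply accumulates the reported occurrences here; a streaming version would output each one as soon as it is promoted.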

slide-24
SLIDE 24

Porat & Porat, 2009 ★

[Figure: the text T with the stored candidate positions (✖) on each level]

  • Level 0: occurrences of p1
  • Level 1: occurrences of p1p2
  • Level 2: occurrences of p1p2p3p4
  • ⋮
  • Level log m: occurrences of P = p1p2 ... pm

Lemma
If there are ≥ 3 occurrences of a $2^j$-length string in a $2^{j+1}$-length string, the occurrences form a run

For each level we store:

▸ The leftmost and the second leftmost positions lp, lp′
▸ The fingerprints of $t_1 t_2 \dots t_{lp}$, $t_{lp+1} \dots t_{lp'}$, and $t_1 \dots t_i$

For each level we need:

▸ O(1) space
▸ O(1) time for updating and extracting $\varphi(t_{lp} \dots t_i)$

Theorem
The Porat & Porat algorithm is a streaming pattern matching algorithm that uses O(log m) space and O(log m) time per character

13 / 31

slide-26
SLIDE 26

Part II: Approximate pattern matching

14 / 31

slide-27
SLIDE 27

Approximate pattern matching

dist(P,T)
pattern P = b c a a a c
text T = c a a b c a a a c a

▸ Query = “Distance between P and T”
▸ Distance: Hamming, edit, . . .

15 / 31

slide-28
SLIDE 28

Approximate pattern matching (Hamming distance)

Any streaming algorithm for computing exact Hamming distances must use Ω(m) space

By Yao’s minimax principle it suffices to consider deterministic algorithms on a “hard” distribution of the inputs

text T       1 0 1 1 0 0 0 0
pattern P    0 0 0 0

T[1,m] is random

▸ First alignment: dist(P,T) = 3
▸ Second alignment: dist(P,T) = 2, so T[1] = 3 − 2
▸ Third alignment: dist(P,T) = 2, so T[2] = 2 − 2

After reading T[m], the algorithm cannot go back and read one of the letters T[1], T[2], ..., T[m], but it can restore T[1,m] from the distances it outputs

Therefore, it stores a full description of T[1,m] ⇒ Ω(m) space by information-theoretic ideas

16 / 31
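The restoring step of the lower-bound argument can be checked on the small example above. The sketch assumes, as the numbers on the slide suggest, that the pattern is all zeros and that the random block T[1,m] is followed by m zeros. The helper names are hypothetical.

```python
def hamming(a, b):
    """Number of positions where two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def restore_prefix(distances):
    """Recover T[1..m] from the distances of m + 1 consecutive alignments.

    Sliding the all-zero pattern one step to the right drops T[s] and
    gains a zero, so consecutive distances differ by exactly T[s].
    """
    return [distances[s] - distances[s + 1] for s in range(len(distances) - 1)]


m = 4
prefix = [1, 0, 1, 1]      # the random block T[1, m]
text = prefix + [0] * m    # followed by zeros
pattern = [0] * m
distances = [hamming(pattern, text[s:s + m]) for s in range(m + 1)]
assert distances[:3] == [3, 2, 2]          # the values shown on the slide
assert restore_prefix(distances) == prefix
```

Since the outputs after position m determine all m random bits, the memory content at that moment must encode them.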

slide-32
SLIDE 32

Approximate pattern matching (Hamming distance)

Authors                                      Space²                   Time

Single pattern, only distances ≤ k
Porat & Porat, 2009 ★                        Õ(k³)                    Õ(k²)
Clifford, Fontaine, Porat, Sach, S., 2016    Õ(k²)                    Õ(√k)
Clifford, Kociumaka, Porat, 2018             O(k log(m/k))            O(k log³ m log(m/k))

Single pattern, (1 + ε)-approx.
Clifford, S., 2016                           O(ε⁻⁵ √m log⁴ m)         O(ε⁻⁴ log³ m)

² In words

17 / 31

slide-34
SLIDE 34

Porat & Porat, 2009 ★

dist(P,T)
pattern P = b c a a a c
text T = c a a b c a a a c a

▸ If HAM(P,T) > k, output “NO”
▸ Otherwise, output HAM(P,T)

18 / 31

slide-35
SLIDE 35

From 1 mismatch to exact pattern matching

string1 = a b a a c b a b a a b b
string2 = a b a c c b a b a a a b
(✖ ✖ mark the two mismatching positions)

▸ Is HAM(string1, string2) = 1?
▸ Partition the strings into substrings of q colors
▸ One mismatch ⇒ one pair of substrings does not match
▸ Hope: if there are ≥ 2 mismatches, they will end up in substrings of different colors ⇒ at least 2 pairs of substrings do not match

For each prime q ∈ [log m, log² m]:

▸ Partition string1 into q equi-spaced substrings
▸ Partition string2 into q equi-spaced substrings

In total: O(log m) primes, and for each prime there are O(log² m) pairs of substrings

Lemma
There are ≥ 2 mismatches ✖1, ✖2 ⇒ there exists a prime q such that at least two pairs of substrings do not match

▸ ✖1, ✖2 in the same pair ⇔ ✖1 − ✖2 = 0 (mod q)
▸ m ≥ ✖1 − ✖2 cannot be a multiple of log m distinct primes, since the product of log m distinct primes, each at least log m, exceeds m

19 / 31
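The partition idea can be tried on the two strings of the slide. This is an offline sketch under stated assumptions: the two 12-letter strings are read off the slide as shown above, substrings are compared directly rather than with streaming exact pattern matching, the prime range is taken as [⌈log₂ m⌉, ⌈log₂ m⌉²], and all function names are hypothetical.

```python
import math


def primes_between(lo, hi):
    """Primes q with lo <= q <= hi (trial division, fine for tiny ranges)."""
    return [q for q in range(max(lo, 2), hi + 1)
            if all(q % d for d in range(2, math.isqrt(q) + 1))]


def mismatching_pairs(s1, s2, q):
    """Number of residue classes mod q whose equi-spaced substrings differ."""
    return sum(s1[c::q] != s2[c::q] for c in range(q))


def exactly_one_mismatch(s1, s2):
    """True iff every prime in the range sees exactly one mismatching pair."""
    log_m = max(2, math.ceil(math.log2(len(s1))))
    return all(mismatching_pairs(s1, s2, q) == 1
               for q in primes_between(log_m, log_m ** 2))


s1 = "abaacbabaabb"
s2 = "abaccbabaaab"  # differs from s1 at 0-based positions 3 and 10
print(mismatching_pairs(s1, s2, 7))  # both mismatches share a class mod 7
print(mismatching_pairs(s1, s2, 5))  # separated mod 5
print(exactly_one_mismatch(s1, s2))
```

The two mismatches are 7 positions apart, so the prime 7 puts them into the same pair of substrings, while the prime 5 separates them and exposes the second mismatch.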

slide-39
SLIDE 39

From 1 mismatch to exact pattern matching

text T, pattern P

Is HAM(P, T) = 1?

for each position of the text T do
    for each prime q in [log m, log² m] do
        h ← number of (substream, subpattern) pairs that mismatch
        if h = 0 OR h > 1 return “NO”
    return “YES”

Compute number of mismatching pairs

for each prime q in [log m, log² m] do
    for each (substream, subpattern) do
        run streaming exact pattern matching

Complexity

$\text{Space} = O(\underbrace{\log m}_{\#\text{ of primes}} \cdot \underbrace{\log^2 m}_{\#\text{ of substr.}} \cdot \underbrace{\log^2 m}_{\#\text{ of subpatterns}} \cdot \log m)$

$\text{Time} = O(\underbrace{\log m}_{\#\text{ of primes}} \cdot \underbrace{\log^2 m}_{\#\text{ of substr.}} \cdot \underbrace{\log^2 m}_{\#\text{ of subpatterns}})$

20 / 31

slide-42
SLIDE 42

Approximate pattern matching (Hamming distance)

Porat & Porat, 2009: Õ(k³) space, Õ(k²) time
Same as for k = 1 but take more primes

Clifford, Fontaine, Porat, Sach, S., 2016: Õ(k²) space, Õ(√k) time
We can take fewer primes if we choose them at random + periodicity to improve time

Clifford, Kociumaka, Porat, 2018: O(k log(m/k)) space, O(k log³ m log(m/k)) time
New encoding for mismatch information + periodicity + exponentially growing prefixes

21 / 31

slide-43
SLIDE 43

Approximate pattern matching (edit distance)

ED(P,T)
pattern P = b c a a a c
text T = c a a b c a a a c a

ED(P,S) = minimum number of insertions, deletions, and replacements that transform P into S
Example: P = aaac, S = abacb, edit distance = 2

▸ If ED(P,T) > k, output “NO”
▸ Otherwise, output ED(P,T)

▸ Hybrid dynamic programming: O(m) space, O(k) time
▸ S., 2017: O(√m ⋅ poly(k, log m)) space, O(√m ⋅ poly(k, log m)) time

22 / 31
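The definition of edit distance can be checked on the example from the slide with the textbook dynamic program. This is the standard offline algorithm, not the streaming one from the talk, and the function name is an assumption.

```python
def edit_distance(p, s):
    """Minimum number of insertions, deletions and replacements turning p into s."""
    prev = list(range(len(s) + 1))  # distances from the empty prefix of p
    for i, pc in enumerate(p, start=1):
        cur = [i]                   # deleting the first i letters of p
        for j, sc in enumerate(s, start=1):
            cur.append(min(prev[j] + 1,                 # delete pc
                           cur[j - 1] + 1,              # insert sc
                           prev[j - 1] + (pc != sc)))   # match or replace
        prev = cur
    return prev[-1]


print(edit_distance("aaac", "abacb"))  # 2: replace one a by b, then append b
```

Only two rows of the table are kept at any time, so the sketch uses space linear in the length of the second string.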

slide-45
SLIDE 45

Embedding from edit to Hamming distance

Chakraborty, Goldenberg, Koucky, 2016

Pick 3n random functions $h_j \colon \{0,1\} \to \{0,1\}$

[Figure: table of the values of the random functions h_1, ..., h_3n; the string S (positions 1, ..., n) and the string S′ being filled in step by step]

Copy letters of S to S′, starting with text position i = 1 and j = 1:

  • 1. Copy S[i]. If $h_j(S[i]) = 1$, move to the right;
  • 2. j = j + 1.

If ED(S,T) = k, then k/2 ≤ HD(S′,T′) ≤ O(k²) w/ prob. 0.99

Belazzougui, Zhang, 2016

▸ Embedding + streaming alg’m for k²-mismatch ⇒ a good estimate for edit distance
▸ If ED(S,T) ≤ k, Õ(k²) embeddings + streaming alg’m for k²-mismatch ⇒ exact value!

23 / 31
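The copying procedure of the embedding can be sketched as below. Several details are assumptions made for the illustration: the output has length 3n, the symbol 0 is written once S is exhausted, each function h_j is represented as a pair (h_j(0), h_j(1)), and the names `random_functions` and `embed` are hypothetical.

```python
import random


def random_functions(n, seed=None):
    """3n random functions {0,1} -> {0,1}, each stored as (h(0), h(1))."""
    rng = random.Random(seed)
    return [(rng.randint(0, 1), rng.randint(0, 1)) for _ in range(3 * n)]


def embed(s, functions):
    """Copy letters of the bit list s into S' of length 3 * len(s)."""
    n = len(s)
    out = []
    i = 0  # current text position (0-based)
    for j in range(3 * n):
        if i < n:
            out.append(s[i])            # step 1: copy S[i]
            if functions[j][s[i]] == 1:
                i += 1                  # move to the right
        else:
            out.append(0)               # assumed padding once S is exhausted
    return out


hs = random_functions(3, seed=1)
s_prime = embed([1, 0, 1], hs)
assert len(s_prime) == 9
# Both strings must be embedded with the same functions to compare S' and T'.
assert embed([1, 0, 1], hs) == s_prime
```

A letter is repeated in S′ for as long as the functions return 0 on it, which is what lets two strings that are close in edit distance fall back into step.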

slide-56
SLIDE 56

Approximate pattern matching (edit distance)

[Figure: the text is split into blocks of length B ≃ √m; the pattern is split into P[1, B − r] and P[B − r + 1, m]]

Starting from each block i, run Belazzougui & Zhang, 2016

$\mathrm{ED}[j] = \min_{i \in [r-k,\, r+k]} \big( \mathrm{ED}(P[1, B-i], T_1) + \mathrm{ED}(P[B-i+1, m], T_2) \big)$

We compute $\mathrm{ED}(P[1, B-i], T_1)$ while reading $T_1$ using dynamic programming, then encode the distances to restore later

24 / 31

slide-57
SLIDE 57

Part III: Preprocessing

25 / 31

slide-58
SLIDE 58

Preprocessing for pattern matching

Can we preprocess the patterns in a streaming way?
If yes, do we need to read them several times? How much space do we need?

Periodicity: Ergün, Jowhari, Saglam, 2010

▸ Periodic patterns: O(log m) space, O(log m) time
▸ Non-periodic patterns: Ω(m) space
▸ 2 passes (periodic and non-periodic patterns): O(log m) space, O(log m) time

Periodicity with mismatches: Ergün et al., 2017

▸ Periodic patterns: O(k⁴ log⁹ n) space
▸ 2-pass algorithm for non-periodic patterns, lower bounds

26 / 31

slide-59
SLIDE 59

Part IV: Property testing model

27 / 31

slide-60
SLIDE 60

Pattern matching

pattern P, text T

Is T free from occurrences of P?
Same question when T and P are of dimension d ≥ 2

28 / 31

slide-61
SLIDE 61

Property testing model

If Sherlock wants to solve the problem fast, he can only query a few characters of T

29 / 31

slide-62
SLIDE 62

Property testing model

Task: develop an ultra-efficient randomised algorithm to decide whether T is free from occurrences of P

We must

▸ accept, if T is ε1-close to being P-free
▸ reject, if T is ε2-far from being P-free
▸ accept or reject otherwise

ε1-close = we can fix ≤ ε1·n characters of T so that the property is satisfied
ε2-far = we must fix ≥ ε2·n characters of T so that the property is satisfied

Ben-Eliezer, Korman, Reichman, 2017
There is an algorithm which queries O(ε⁻¹) letters of T and distinguishes between ε/2-close and ε-far (for almost all patterns)

30 / 31

slide-64
SLIDE 64

Summary of today’s talk

It’s all about pattern matching
Randomisation and approximation ⇒ more efficient algorithms
Many open questions

Thank you!

31 / 31