Streaming and communication complexity of Hamming distance (Tatiana Starikovskaya)




SLIDE 1

Streaming and communication complexity of Hamming distance

Tatiana Starikovskaya

IRIF, Université Paris-Diderot

(Joint work with Raphaël Clifford, ICALP’16)

SLIDE 3

Approximate pattern matching

Problem

Pattern P of length n, text T
Find the Hamming distance between P and each n-length substring of T

“Big Data” Applications

▸ Computational biology
▸ Signal processing
▸ Text retrieval

Standard algorithms: Ω(n) space

SLIDE 13

Model of computation

Problem

Pattern P of length n, text T
Find the Hamming distance between P and each n-length substring of T

Model

▸ T = stream of characters
▸ Length of the text and size of the universe are extremely large
▸ Can’t store a copy of T or P
▸ Space = total space used; Time = time per character of T

Pattern P: c b c a a a c
Text T (characters received so far): a a b c a a a c a
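
The problem statement can be made concrete with a naive baseline. The sketch below is not from the talk: the function name `sliding_hamming_distances` is hypothetical, and it deliberately keeps the whole pattern and the last n text characters in memory, which is the Ω(n)-space behaviour that the streaming model rules out.

```python
from collections import deque

def sliding_hamming_distances(pattern, stream):
    """Yield HD(P, window) each time an n-length window of the stream is complete.

    Naive baseline (an assumption, not the talk's algorithm): it stores P and
    the last n characters, so it uses Omega(n) space and O(n) time per character.
    """
    n = len(pattern)
    window = deque(maxlen=n)  # the oldest character is dropped automatically
    for ch in stream:
        window.append(ch)
        if len(window) == n:
            yield sum(p != t for p, t in zip(pattern, window))

# Pattern and text from the slide's running example.
print(list(sliding_hamming_distances("cbcaaac", "aabcaaaca")))  # [5, 1, 5]
```

The three outputs correspond to the three alignments of the 7-character pattern inside the 9-character text.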

SLIDE 15

What is known: Hamming distance

▸ All distances
  ▸ Space Ω(n) [Folklore]
  ▸ Time $O(\log^2 n)$ [Clifford et al., CPM’11]

▸ Only distances ≤ k [Clifford et al., SODA’16]
  ▸ Exact values: space $O(k^2 \,\mathrm{polylog}\, n)$, time $O(\sqrt{k} \log k + \mathrm{polylog}\, n)$
  ▸ (1 + ε)-approx.: space $O(\varepsilon^{-2} k^2 \,\mathrm{polylog}\, n)$, time $O(\varepsilon^{-2} \,\mathrm{polylog}\, n)$

SLIDE 17

Upper bounds: show a streaming algorithm
Lower bounds: reduction to a communication complexity (CC) problem
This work: (1+ε)-approximate Hamming distances (HDs) problem

Let's discuss that!

SLIDE 18

Lower bound for all HDs, approximate

3-party CC problem

Alice (pattern P): a a a a a a
Bob (T[1,n]): b a a b a b
Charlie (T[n + 1,2n]): a a a a a a

▸ Alice holds the pattern, Bob holds T[1,n], Charlie holds T[n + 1,2n]
▸ Charlie’s output: (1 + ε)-HD for each alignment of P and T
▸ Min. communication between Alice, Bob, and Charlie?

SLIDE 19

Lower bound for all HDs, approximate

Alice (pattern P): a a a a a a
Bob (T[1,n]): b a a b a b
Charlie (T[n + 1,2n]): a a a a a a

▸ Streaming algorithm: T = stream, not allowed to store a copy of P or T, output = (1 + ε)-HDs
▸ At time = n it stores all the information needed to compute the (1 + ε)-HDs
▸ Comm. protocol: send this information from A and B to C
▸ Lower bound for the CC problem ⇒ streaming lower bound
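
The reduction can be illustrated by treating the memory of a streaming algorithm as the message that is passed along. The sketch below is an illustration under assumptions, not the talk's construction: `ToyStreamingHD` and `run_protocol` are hypothetical names, the toy algorithm simply stores P and the last n characters, and the chain Alice → Bob → Charlie is one way to realise "send this information from A and B to C".

```python
import json

class ToyStreamingHD:
    """Toy streaming algorithm (assumption): keeps P and the last n characters."""

    def __init__(self, pattern, window=""):
        self.pattern = pattern
        self.window = window

    def feed(self, ch):
        """Consume one text character; return the HD once a full window is seen."""
        self.window = (self.window + ch)[-len(self.pattern):]
        if len(self.window) < len(self.pattern):
            return None
        return sum(p != t for p, t in zip(self.pattern, self.window))

    def dump(self):
        """Serialise the whole memory content: this is the message that is sent."""
        return json.dumps({"pattern": self.pattern, "window": self.window}).encode()

    @classmethod
    def load(cls, message):
        return cls(**json.loads(message.decode()))

def run_protocol(pattern, bob_text, charlie_text):
    # Alice loads P into the algorithm and sends its memory to Bob.
    alice_msg = ToyStreamingHD(pattern).dump()
    # Bob resumes from Alice's message, feeds T[1,n], and sends the memory on.
    algo = ToyStreamingHD.load(alice_msg)
    for ch in bob_text:
        algo.feed(ch)
    bob_msg = algo.dump()
    # Charlie resumes from Bob's message and outputs one distance per character.
    algo = ToyStreamingHD.load(bob_msg)
    outputs = [algo.feed(ch) for ch in charlie_text]
    return outputs, 8 * (len(alice_msg) + len(bob_msg))

outputs, bits = run_protocol("aaaaaa", "baabab", "aaaaaa")
print(outputs)  # [2, 2, 2, 1, 1, 0]
```

The number of communicated bits equals the size of the algorithm's memory at the hand-over points, so any lower bound on the communication also bounds the space of the streaming algorithm from below.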

SLIDE 24

Upper bounds: show a streaming algorithm
Lower bounds: reduction to a CC problem
This work: (1+ε)-Approximate HDs problem
3-party CC problem: upper bounds
Simpler CC problem (B and C know the pattern): upper bounds

SLIDE 25

Communication complexity

SLIDE 26

Simpler CC problem: B and C know the pattern

Lower bound: $\Omega(\varepsilon^{-1} \log^2(\varepsilon^{-1} n))$

Bob (T[1,n]): b a a b a b
Charlie (T[n + 1,2n]): a a a a a a

▸ Window counting: (1 + ε)-approx. of #(b) in a sliding window of width n = (1 + ε)-approx. of HD between P = aa...a and T
▸ $\Omega(\varepsilon^{-1} \log^2(\varepsilon^{-1} n))$ bits [Datar et al., 2013]
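
The equivalence behind this lower bound is easy to check on a small input. The following sketch uses hypothetical function names and assumes a text over the alphabet {a, b} of length at least n; it computes both quantities exactly, whereas the bound itself concerns (1 + ε)-approximations.

```python
def sliding_hd_to_all_a(text, n):
    """Hamming distance between P = 'a' * n and every n-length window of text."""
    pattern = "a" * n
    return [sum(p != t for p, t in zip(pattern, text[i:i + n]))
            for i in range(len(text) - n + 1)]

def sliding_count_of_b(text, n):
    """Number of 'b' characters in every n-length window, updated incrementally."""
    count = sum(ch == "b" for ch in text[:n])
    counts = [count]
    for i in range(n, len(text)):
        # One character enters the window and one leaves it.
        count += (text[i] == "b") - (text[i - n] == "b")
        counts.append(count)
    return counts

text = "baabab" + "aaaaaa"  # Bob's half followed by Charlie's half
print(sliding_hd_to_all_a(text, 6))  # [3, 2, 2, 2, 1, 1, 0]
print(sliding_count_of_b(text, 6))   # [3, 2, 2, 2, 1, 1, 0]
```

Because the two lists always coincide, a protocol for approximate Hamming distances with P = aa...a would also solve approximate window counting, so the window-counting lower bound carries over.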

SLIDE 27

3-party CC problem

Lower bound: $\Omega(\varepsilon^{-1} \log^2(\varepsilon^{-1} n) + \varepsilon^{-2} \log n)$

Bob (T[1,n]): b a a b a b
Charlie (T[n + 1,2n]): a a a a a a

▸ Output = (1 + ε)-HD between T[1,n] and T[n + 1,2n] = (1 + ε)-approx. of HD between T = T[1,n]00...0 (Bob and Charlie) and P = T[n + 1,2n] (Alice)
▸ $\Omega(\varepsilon^{-2} \log n)$ bits [Jayram & Woodruff, 2013]

SLIDE 29

Important notion: (1 + ε)-approximate sketch for HD

Intuition

▸ Sketch of a string is a very short vector
▸ L2-distance between sketches ≈ HD between strings

Formal definition (binary alphabets)

▸ Y = $(1/\varepsilon^2) \times n$ matrix of IID unbiased ±1 random variables

$$
\underbrace{\mathrm{sketch}(S)}_{\text{length } 1/\varepsilon^2}
= \underbrace{\begin{pmatrix} \pm 1 & \pm 1 & \cdots \\ \pm 1 & \ddots & \\ \vdots & & \end{pmatrix}}_{Y}
\underbrace{\begin{pmatrix} S[1] \\ S[2] \\ \vdots \end{pmatrix}}_{S}
$$
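
The definition translates almost directly into code. The sketch below is a minimal illustration, not the talk's implementation: the function names are hypothetical, the matrix Y is drawn with Python's seeded pseudorandom generator, and the strings are given as lists of 0/1 values.

```python
import random

def make_sign_matrix(rows, n, seed=0):
    """rows x n matrix of independent unbiased +1/-1 entries (the matrix Y)."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(rows)]

def sketch(Y, bits):
    """sketch(S) = Y * S for a binary string given as a list of 0/1 values."""
    return [sum(y * s for y, s in zip(row, bits)) for row in Y]

def estimate_hd(Y, s1, s2):
    """Squared L2-distance between the sketches, scaled by 1/rows (= eps^2)."""
    diff = [a - b for a, b in zip(sketch(Y, s1), sketch(Y, s2))]
    return sum(d * d for d in diff) / len(Y)

eps = 0.25
rows = round(1 / eps ** 2)          # 16 rows
s1 = [0, 1, 1, 0, 1, 0, 0, 1]
s2 = [0, 1, 0, 0, 1, 0, 0, 1]       # differs from s1 in exactly one position
Y = make_sign_matrix(rows, len(s1), seed=1)
print(estimate_hd(Y, s1, s1))       # 0.0
print(estimate_hd(Y, s1, s2))       # 1.0, exact when the strings differ in one position
```

For strings that differ in more positions the estimate is random and only concentrates around the true Hamming distance; note that it is the squared L2-distance, scaled by ε², that plays the role of the estimate.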

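SLIDE 33

Important notion: (1 + ε)-approximate sketch for HD

Formal definition (binary alphabets)

▸ Y = $(1/\varepsilon^2) \times n$ matrix of IID unbiased ±1 random variables
▸ $\mathrm{sketch}(S) = YS$, a vector of length $1/\varepsilon^2$

Lemma

$$(1 - \varepsilon) \cdot \mathrm{HD}(S_1, S_2) \le \varepsilon^2 \cdot \lVert \mathrm{sketch}(S_1) - \mathrm{sketch}(S_2) \rVert_2^2 \le (1 + \varepsilon) \cdot \mathrm{HD}(S_1, S_2)$$

Proof

$$
\begin{aligned}
\mathbb{E}\big[\varepsilon^2 \cdot \lVert \mathrm{sketch}(S_1) - \mathrm{sketch}(S_2) \rVert_2^2\big]
&= \mathbb{E}\big[\varepsilon^2 \cdot \lVert Y(S_1 - S_2) \rVert_2^2\big]
 = \varepsilon^2 \cdot \mathbb{E}\big[\lVert Y(S_1 - S_2) \rVert_2^2\big] \\
&= \varepsilon^2 \cdot \mathbb{E}\Big[\sum_{j=1}^{1/\varepsilon^2} \big(Y_j(S_1 - S_2)\big)^2\Big]
 = \mathbb{E}\big[\big(Y_1(S_1 - S_2)\big)^2\big]
 = \lVert S_1 - S_2 \rVert_2^2
\end{aligned}
$$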

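SLIDE 36

Important notion: (1 + ε)-approximate sketch for HD

Formal definition (binary alphabets)

▸ Y = $(1/\varepsilon^2) \times n$ matrix of IID unbiased ±1 random variables
▸ $\mathrm{sketch}(S) = YS$, a vector of length $1/\varepsilon^2$

Lemma

$$(1 - \varepsilon) \cdot \mathrm{HD}(S_1, S_2) \le \varepsilon^2 \cdot \lVert \mathrm{sketch}(S_1) - \mathrm{sketch}(S_2) \rVert_2^2 \le (1 + \varepsilon) \cdot \mathrm{HD}(S_1, S_2)$$

Proof

$$\mathbb{E}\big[\varepsilon^2 \cdot \lVert \mathrm{sketch}(S_1) - \mathrm{sketch}(S_2) \rVert_2^2\big] = \lVert S_1 - S_2 \rVert_2^2$$

$$
\begin{aligned}
\mathrm{Var}\big[\varepsilon^2 \cdot \lVert \mathrm{sketch}(S_1) - \mathrm{sketch}(S_2) \rVert_2^2\big]
&= \varepsilon^2 \cdot \mathrm{Var}\big[\big(Y_1(S_1 - S_2)\big)^2\big]
 \le \varepsilon^2 \cdot \mathbb{E}\big[\big(Y_1(S_1 - S_2)\big)^4\big] \\
&\le \varepsilon^2 C \cdot \mathbb{E}\big[\big(Y_1(S_1 - S_2)\big)^2\big]^2
 = \varepsilon^2 C \cdot \lVert S_1 - S_2 \rVert_2^4
\end{aligned}
$$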

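SLIDE 38

Important notion: (1 + ε)-approximate sketch for HD

Formal definition (binary alphabets)

▸ Y = $(1/\varepsilon^2) \times n$ matrix of IID unbiased ±1 random variables
▸ $\mathrm{sketch}(S) = YS$, a vector of length $1/\varepsilon^2$

Lemma

$$(1 - \varepsilon) \cdot \mathrm{HD}(S_1, S_2) \le \varepsilon^2 \cdot \lVert \mathrm{sketch}(S_1) - \mathrm{sketch}(S_2) \rVert_2^2 \le (1 + \varepsilon) \cdot \mathrm{HD}(S_1, S_2)$$

Proof

$$\mathbb{E}\big[\varepsilon^2 \cdot \lVert \mathrm{sketch}(S_1) - \mathrm{sketch}(S_2) \rVert_2^2\big] = \lVert S_1 - S_2 \rVert_2^2$$

$$\mathrm{Var}\big[\varepsilon^2 \cdot \lVert \mathrm{sketch}(S_1) - \mathrm{sketch}(S_2) \rVert_2^2\big] \le \varepsilon^2 C \cdot \lVert S_1 - S_2 \rVert_2^4$$

By Chebyshev’s inequality, with constant probability:

$$(1 - \varepsilon) \cdot \lVert S_1 - S_2 \rVert_2^2 \le \varepsilon^2 \cdot \lVert \mathrm{sketch}(S_1) - \mathrm{sketch}(S_2) \rVert_2^2 \le (1 + \varepsilon) \cdot \lVert S_1 - S_2 \rVert_2^2$$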
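
The Chebyshev step can be spelled out as follows. This is a sketch under stated assumptions rather than the talk's own derivation: $X$, $\mu$, and the row-count constant $c$ are notation introduced here, $C$ is the constant from the variance bound, and $S_1 \ne S_2$ is assumed so that $\mu > 0$.

```latex
\begin{aligned}
X &= \varepsilon^2 \cdot \lVert \mathrm{sketch}(S_1) - \mathrm{sketch}(S_2) \rVert_2^2,
\qquad \mu = \mathbb{E}[X] = \lVert S_1 - S_2 \rVert_2^2, \\
\Pr\big[\, |X - \mu| \ge \varepsilon \mu \,\big]
&\le \frac{\mathrm{Var}[X]}{\varepsilon^2 \mu^2}
 \le \frac{\varepsilon^2 C \cdot \mu^2}{\varepsilon^2 \mu^2} = C .
\end{aligned}
```

For the failure probability to be a constant below 1, one option (an assumption here) is to use $c/\varepsilon^2$ rows for a constant $c > C$ and to scale by $\varepsilon^2/c$: the expectation stays $\mu$, the variance becomes at most $\varepsilon^2 C \mu^2 / c$, and the same calculation gives a failure probability of at most $C/c$.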

SLIDE 40

Important notion: (1 + ε)-approximate sketch for HD

One more trick

▸ Y can be generated from O(log n) random bits (random → pseudorandom)

Summary

▸ Sketch of a string is a vector of length $O(\varepsilon^{-2} \log n)$ bits
▸ Sketches give (1 + ε)-approximation of HD
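
The practical effect of generating Y from a short seed is that no party has to store or transmit the matrix. The sketch below only illustrates that idea: `column` and `streamed_sketch` are hypothetical names, and Python's built-in generator is used as a stand-in, so it does not carry the O(log n)-bit guarantee mentioned on the slide.

```python
import random

def column(seed, position, rows):
    """Column `position` of Y, regenerated on demand from (seed, position)."""
    rng = random.Random(seed * 1_000_003 + position)
    return [rng.choice((-1, 1)) for _ in range(rows)]

def streamed_sketch(seed, rows, bits):
    """Compute Y * S one character at a time without ever storing Y."""
    acc = [0] * rows
    for position, bit in enumerate(bits):
        if bit:  # a 0 character contributes nothing to Y * S
            col = column(seed, position, rows)
            acc = [a + c for a, c in zip(acc, col)]
    return acc

# Two parties sharing the seed obtain identical sketches of the same string.
print(streamed_sketch(7, 8, [1, 0, 1, 1]) == streamed_sketch(7, 8, [1, 0, 1, 1]))  # True
```

Since the sketch is linear, strings that differ in a single position have sketches that differ by exactly the corresponding column of Y.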

SLIDE 41

Simpler CC problem: B and C know the pattern

B & C (pattern P): a a a a a a
Bob (T[1,n]): b a a b a b
Charlie (T[n + 1,2n]): a a a a a a

▸ B knows T[1,n], C knows T[n + 1,2n], B and C know P
▸ Observation: C doesn’t need any information to compute HDs between suffixes of P and T[n + 1,2n]

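SLIDE 42

Simpler CC problem: B and C know the pattern

[Figure: Bob's part of the text with nested prefixes marked HD ≤ $(1/\varepsilon)^2$, HD ≤ $(1/\varepsilon)^3$, ..., HD ≤ $(1/\varepsilon)^{10}$]

▸ Select $O(\log_{1/\varepsilon} n)$ prefixes of the pattern
▸ First prefix: Prefix of maximal length $\ell_1$ with HD ≤ $(1/\varepsilon)^2$
▸ Second prefix: Prefix of maximal length $\ell_2 \ge \ell_1$ with HD ≤ $(1/\varepsilon)^3$
▸ . . .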

SLIDE 45

Simpler CC problem: B and C know the pattern

[Figure: Bob's part of the text with nested prefixes marked HD ≤ $(1/\varepsilon)^2$, HD ≤ $(1/\varepsilon)^3$, ..., HD ≤ $(1/\varepsilon)^{10}$; blocks annotated with sketches]

▸ Divide prefix j into $1/\varepsilon^2$ blocks with HD ≤ $(1/\varepsilon)^{j-1}$
▸ Compute $O(1/\varepsilon^2)$ sketches for the text
▸ Send the block borders and the sketches to Charlie

SLIDE 50

Simpler CC problem: B and C know the pattern

[Figure: Bob's part of the text with nested prefixes marked HD ≤ $(1/\varepsilon)^2$, HD ≤ $(1/\varepsilon)^3$, ..., HD ≤ $(1/\varepsilon)^{10}$; pattern parts P1 and P2; blocks annotated with sketches]

▸ Find the shortest prefix containing P
▸ HD(P2, T): use sketches, (1 + ε)-approximation
▸ HD(P1, T): use the prefix’s block, additive error ≤ ε ⋅ HD(P,T)
▸ CC = $O(\varepsilon^{-4} \log^2 n)$ [Lower bound: $\Omega(\varepsilon^{-1} \log^2(\varepsilon^{-1} n))$]

SLIDE 52

3-party CC problem

Alice (pattern P): a a a a a a
Bob (T[1,n]): c a a b a b
Charlie (T[n + 1,2n]): a a a a a a

▸ B knows T[1,n], C knows T[n + 1,2n], only A knows P
▸ Observation: C doesn’t need any information to compute HDs between suffixes of P and his part of the text
▸ Can’t use prefixes of P to approximate T: C doesn’t know P

SLIDE 54

3-party CC problem

[Figure: Bob's text divided into blocks of length B, each with a sketch; pattern parts P1 and P2]

▸ Divide the text T into blocks of length $B = \sqrt{n}$
▸ Compute a sketch of each block
▸ Large Hamming distance: HD(prefix of P, T) ≥ B/ε
  ▸ HD(P2, T): use sketches to compute (1 + ε)-approx. H′
  ▸ HD(P1, T): ignore

Lemma

H′ is a good approximation of HD

Proof

1. H′ ≤ (1 + ε) ⋅ HD(P2,T) ≤ (1 + ε) ⋅ HD
2. H′ ≥ (1 − ε) ⋅ HD(P2,T) ≥ (1 − ε) ⋅ HD − HD(P1,T) ≥ (1 − 2ε) ⋅ HD, using HD(P1,T) ≤ B ≤ ε ⋅ HD

SLIDE 55

3-party CC problem

[Figure: Bob's text divided into blocks of length B, with mismatches marked ⊗; pattern parts P1 and P2]

▸ Small Hamming distance: HD(prefix of P, T) < B/ε
  ▸ If #(⊗) in a block ≤ 1, B sends it to C
  ▸ Starting from the first block where #(⊗) ≥ 2, T and P can be encoded in small space (periodicity)
  ▸ C can restore P and T from the encoding and compute HDs

▸ CC = $O(\varepsilon^{-2} \sqrt{n} \log n)$ [Lower bound: $\Omega(\varepsilon^{-2} \log n + \varepsilon^{-1} \log^2(\varepsilon^{-1} n))$]

SLIDE 57

Streaming algorithm

SLIDE 58

Streaming algorithm

[Figure: sketches of the text T and of the pattern P]

Reminder

▸ Y = $(1/\varepsilon^2) \times n$ matrix of IID unbiased ±1 random variables
▸ sketch(S) = Y ⋅ S

Problem

▸ How to maintain the sketch of T?
▸ We don’t have random access to T and we can’t store many of its characters

SLIDE 59

Streaming algorithm

[Figure: blocks of length B, each with its sketch, combined into one super-sketch]

Reminder

▸ Y = $(1/\varepsilon^2) \times n$ matrix of IID unbiased ±1 random variables
▸ sketch(S) = Y ⋅ S

New notion: super-sketch

▸ $\sigma_i$: IID unbiased ±1 variables
▸ super-sketch = $\sum_i \sigma_i \cdot \mathrm{sketch}_i$
▸ Analysis: similar to sketches
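
The super-sketch formula is a signed sum of vectors. The sketch below uses the hypothetical name `super_sketch` and small hand-picked block sketches and signs; in the algorithm the signs would be random ±1 values.

```python
def super_sketch(block_sketches, signs):
    """Signed sum of per-block sketches: sum_i signs[i] * block_sketches[i]."""
    total = [0] * len(block_sketches[0])
    for sign, block_sketch in zip(signs, block_sketches):
        for j, value in enumerate(block_sketch):
            total[j] += sign * value
    return total

# Three block sketches of length 2 and one sign per block (values chosen by hand).
blocks = [[1, -1], [3, 1], [-1, 1]]
signs = [1, -1, 1]
print(super_sketch(blocks, signs))  # [-3, -1]
```

The result has the same length as a single block sketch, which is why one super-sketch can summarise many blocks in the space of one.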

SLIDE 60

Streaming algorithm

[Figure: pattern split into P[1, B − i], P[B − i + 1, n − i], and P[n − i + 1, n]; text blocks of length B with sketches and a super-sketch]

▸ HD between P[B − i + 1,n − i] and T: super-sketch
▸ Store a super-sketch for each (n − B)-length substring of P
  ▸ $B = \sqrt{n}/\varepsilon$ super-sketches in total
▸ At each block border compute a super-sketch of the last n/B blocks from their sketches
  ▸ $O(n/B) = O(\varepsilon \sqrt{n})$ time, can be de-amortized
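
A quick numeric check of these parameters may help. The helper below is hypothetical; n and ε are chosen so that the arithmetic is exact, and it counts B + 1 substrings of length n − B, which the slide rounds to B.

```python
import math

def block_parameters(n, eps):
    """Block length B = sqrt(n)/eps and the counts derived from it."""
    B = round(math.sqrt(n) / eps)
    return {
        "block_length": B,
        "super_sketches": B + 1,           # one per (n - B)-length substring of P
        "blocks_per_super_sketch": n // B,  # n/B = eps * sqrt(n)
    }

print(block_parameters(1_000_000, 0.25))
# {'block_length': 4000, 'super_sketches': 4001, 'blocks_per_super_sketch': 250}
```

With n = 1,000,000 and ε = 0.25, each block border requires combining 250 block sketches, matching ε√n = 0.25 ⋅ 1000.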

SLIDE 62

Streaming algorithm

[Figure: pattern split into P[1, B − i], P[B − i + 1, n − i], and P[n − i + 1, n], handled by the simpler CC problem, a super-sketch, and a sketch respectively]

▸ HD between the suffix of P and T: sketch
▸ HD between the prefix of P and T: similar to the simpler CC problem for the pattern P[1,B]

Complexity: $O(\varepsilon^{-3} \sqrt{n} \log^2 n)$ bits of space, $O(\varepsilon^{-2} \log^2 n)$ time

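SLIDE 63

Upper bounds:
Lower bounds: reduction to a CC problem
This work: (1+ε)-Approximate HDs problem
3-party CC problem: upper bounds
Simpler CC problem (B and C know the pattern): upper bounds

Bounds shown on the diagram:

▸ $O(\varepsilon^{-3} \sqrt{n} \log^{1.5} n)$
▸ $O(\varepsilon^{-1} \log^2(\varepsilon^{-1} n))$
▸ $O(\varepsilon^{-4} \log^2 n)$
▸ $O(\varepsilon^{-2} \log n + \varepsilon^{-1} \log^2(\varepsilon^{-1} n))$
▸ $O(\varepsilon^{-2} \sqrt{n} \log^2 n)$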