Algorithms for Big Data (V)

Chihao Zhang

Shanghai Jiao Tong University

Oct. 18, 2019
Review of the Last Lecture

Last time, we learnt the Misra-Gries algorithm and the Count Sketch for frequency estimation. The latter has the advantage of being a linear sketch. It also generalizes to the turnstile model.

Count Sketch

Algorithm Count Sketch
Init: an array C[j] for j ∈ [k], where k = 3/ε²;
  a random hash function h : [n] → [k] from a 2-universal family;
  a random hash function g : [n] → {−1, 1} from a 2-universal family.
On input (y, ∆): C[h(y)] ← C[h(y)] + ∆ · g(y).
Output: on query a, output f̂a = g(a) · C[h(a)].
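As a concrete illustration, here is a minimal Python sketch of the single-row Count Sketch above. The 2-universal families are only sketched via random affine maps modulo a fixed prime p (assumed larger than the universe size n); the class name and structure are ours, not part of the lecture.

```python
import random

class CountSketch:
    """One row of Count Sketch: k = 3/eps^2 counters,
    h : [n] -> [k] and g : [n] -> {-1, +1} chosen at random."""

    P = 2_147_483_647  # a prime, assumed larger than the universe size n

    def __init__(self, eps):
        self.k = max(1, int(3 / eps ** 2))
        self.C = [0] * self.k
        # random affine functions standing in for 2-universal hashes h and g
        self.ha, self.hb = random.randrange(1, self.P), random.randrange(self.P)
        self.ga, self.gb = random.randrange(1, self.P), random.randrange(self.P)

    def _h(self, y):                 # bucket index in [k]
        return ((self.ha * y + self.hb) % self.P) % self.k

    def _g(self, y):                 # random sign in {-1, +1}
        return 1 if ((self.ga * y + self.gb) % self.P) % 2 == 0 else -1

    def update(self, y, delta=1):    # on input (y, Delta)
        self.C[self._h(y)] += delta * self._g(y)

    def query(self, a):              # estimate of f_a
        return self._g(a) * self.C[self._h(a)]
```

For a stream with a single distinct element the estimate is exact, since g(a)² = 1; in general it is only accurate to within ±ε∥f∥2 with constant probability, which is why the median trick below is needed.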

The Performance

We can apply the median trick to obtain:

▶ Pr[|f̂a − fa| ⩾ ε∥f∥2] ⩽ δ;
▶ it costs O((1/ε²) · log(1/δ) · (log m + log n)) bits of memory.

Today we will see another simple sketch algorithm.

Count-Min

We assume that for each entry (y, ∆), it holds that ∆ ⩾ 0.

Algorithm Count-Min
Init: an array C[i][j] for i ∈ [t] and j ∈ [k], where t = log(1/δ) and k = 2/ε;
  t independent random hash functions h1, …, ht : [n] → [k] from a 2-universal family.
On input (y, ∆): for each i ∈ [t], C[i][hi(y)] ← C[i][hi(y)] + ∆.
Output: on query a, output f̂a = min1⩽i⩽t C[i][hi(a)].
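A minimal Python version of Count-Min, under the same assumption as before that the 2-universal hash functions are sketched by random affine maps modulo a prime p > n; the names are ours.

```python
import math
import random

class CountMin:
    """t = log2(1/delta) rows of k = 2/eps counters; a query takes the row minimum."""

    P = 2_147_483_647  # a prime, assumed larger than the universe size n

    def __init__(self, eps, delta):
        self.t = max(1, math.ceil(math.log2(1 / delta)))
        self.k = max(1, math.ceil(2 / eps))
        self.C = [[0] * self.k for _ in range(self.t)]
        # one independent random hash function per row
        self.coef = [(random.randrange(1, self.P), random.randrange(self.P))
                     for _ in range(self.t)]

    def _h(self, i, y):
        a, b = self.coef[i]
        return ((a * y + b) % self.P) % self.k

    def update(self, y, delta=1):    # requires delta >= 0
        for i in range(self.t):
            self.C[i][self._h(i, y)] += delta

    def query(self, a):              # never underestimates f_a
        return min(self.C[i][self._h(i, a)] for i in range(self.t))
```

Note the one-sided guarantee: since all updates are non-negative, every row counter dominates fa, so the minimum can only overestimate.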

Analysis

Obviously we have fa ⩽ f̂a. Our algorithm overestimates only if for some b ≠ a, hi(b) = hi(a). Let Yi,b be the indicator of this event.

Let Xi be C[i][hi(a)]. Then

E[Xi] = ∑b∈[n] fb·E[Yi,b] = fa + ∑b∈[n]: b≠a fb·E[Yi,b] = fa + (∑b∈[n]: b≠a fb)/k ⩽ fa + ∥f∥1/k.

Thus, by Markov's inequality,

Pr[|Xi − fa| ⩾ ε∥f∥1] ⩽ (∥f∥1/k)/(ε∥f∥1) = 1/(kε) = 1/2.

Since our output is the minimum of t independent Xi's,

Pr[f̂a − fa ⩾ ε∥f∥1] = Pr[min{X1, …, Xt} − fa ⩾ ε∥f∥1] = Pr[⋀i=1..t (|Xi − fa| ⩾ ε∥f∥1)] = ∏i=1..t Pr[|Xi − fa| ⩾ ε∥f∥1] ⩽ 2^(−t) = δ.

The algorithm computes a linear sketch using O((1/ε) · log(1/δ) · (log m + log n)) bits of memory. It can be generalized to the turnstile model (Exercise).

Frequency Moments

The k-th frequency moment of a stream is

Fk ≜ ∑j∈[n] fj^k = ∥f∥k^k.

For example, F2 is the size of the self-join of a relation r. Many problems we met before can be viewed as estimating Fk for some special k.
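For concreteness, the following few lines compute Fk exactly by storing all frequencies in linear space, which is precisely what the streaming algorithms below try to avoid; the function name is ours.

```python
from collections import Counter

def frequency_moment(stream, k):
    """Exact F_k = sum over j of f_j ** k, computed in linear space."""
    freq = Counter(stream)
    return sum(f ** k for f in freq.values())

# F_1 is the stream length m, F_0 the number of distinct elements,
# and F_2 the self-join size of the corresponding relation.
```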

AMS Estimator for Fk

Given ⟨a1, …, am⟩, the algorithm first samples a uniform index J ∈ [m]. It then counts the number r of entries aj with aj = aJ and j ⩾ J.

Algorithm AMS Estimator for Fk
Init: (m, r, a) ← (0, 0, 0).
On input (y, ∆): m ← m + 1; β ∼ Ber(1/m);
  if β = 1 then a ← y, r ← 0;
  if y = a then r ← r + 1.
Output: m · (r^k − (r − 1)^k).
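A direct Python transcription of one copy of the AMS estimator (insertion-only stream, so ∆ = 1 per input); drawing β ∼ Ber(1/m) at step m is exactly reservoir sampling of the index J. Names are ours.

```python
import random

class AMSEstimator:
    """One copy of the AMS estimator for F_k on an insertion-only stream."""

    def __init__(self, k):
        self.k = k
        self.m = 0      # number of stream entries seen so far
        self.a = None   # the sampled element a_J
        self.r = 0      # occurrences of a_J at positions >= J

    def update(self, y):
        self.m += 1
        if random.random() < 1.0 / self.m:   # beta ~ Ber(1/m): resample J
            self.a, self.r = y, 0
        if y == self.a:
            self.r += 1

    def estimate(self):
        return self.m * (self.r ** self.k - (self.r - 1) ** self.k)
```

A single copy is unbiased but has large variance, as the next slides compute; many independent copies are then combined by averaging and taking medians.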

Analysis

We first compute the expectation of the output X. Given a = j at the end of the algorithm, r is uniform over {1, …, fj}, so

E[X | a = j] = E[m(r^k − (r − 1)^k) | a = j] = ∑i=1..fj (1/fj)·m(i^k − (i − 1)^k) = (m/fj)·fj^k,

where the sum telescopes. Therefore,

E[X] = ∑j=1..n Pr[a = j]·E[X | a = j] = ∑j=1..n (fj/m)·(m/fj)·fj^k = Fk.

The Variance

Var[X] ⩽ E[X²] = ∑j=1..n (fj/m) ∑i=1..fj (1/fj)·m²(i^k − (i − 1)^k)²
⩽ m ∑j=1..n ∑i=1..fj k·i^(k−1)·(i^k − (i − 1)^k)
⩽ m ∑j=1..n k·fj^(k−1) ∑i=1..fj (i^k − (i − 1)^k)
= m ∑j=1..n k·fj^(k−1)·fj^k
= k (∑j=1..n fj)(∑j=1..n fj^(2k−1)),

using i^k − (i − 1)^k ⩽ k·i^(k−1) in the second step and m = ∑j fj in the last.

Assume k ⩾ 1 and let f∗ ≜ maxj∈[n] fj. Since fj^(2k−1) ⩽ f∗^(k−1)·fj^k,

Var[X] ⩽ k (∑j=1..n fj)·f∗^(k−1)(∑j=1..n fj^k)
= k (∑j=1..n fj)·(f∗^k)^((k−1)/k)(∑j=1..n fj^k)
⩽ k (∑j=1..n fj)·(∑j=1..n fj^k)^((k−1)/k)(∑j=1..n fj^k).

Applying Jensen's inequality to g(z) = z^(1/k) gives ∑j=1..n fj = ∑j=1..n (fj^k)^(1/k) ⩽ n^(1−1/k)(∑j=1..n fj^k)^(1/k), so we can bound the above by

k·n^(1−1/k)(∑j=1..n fj^k)^(1/k)·(∑j=1..n fj^k)^((k−1)/k)·(∑j=1..n fj^k) = k·n^(1−1/k)·Fk².

slide-39
SLIDE 39

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Therefore, by Chebyshev's inequality,

Pr[|X − Fk| ⩾ εFk] ⩽ k·n^(1−1/k)/ε².

Now we can apply the standard averaging trick and median trick. To kill the n^(1−1/k) factor in the variance, we need to average Ω(n^(1−1/k)) estimates. An (ε, δ)-estimator requires O((1/ε²) · log(1/δ) · k·n^(1−1/k) · (log m + log n)) bits of memory.
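The averaging and median tricks combine as the usual "median of means" estimator; a small helper sketch, with names and grouping convention ours:

```python
import statistics

def median_of_means(estimates, groups):
    """Split independent estimates into `groups` equal blocks, average each
    block (averaging trick), then return the median of the block averages
    (median trick)."""
    size = len(estimates) // groups
    means = [sum(estimates[g * size:(g + 1) * size]) / size
             for g in range(groups)]
    return statistics.median(means)
```

Averaging within a block shrinks the variance; taking the median across blocks boosts the constant success probability to 1 − δ with groups = O(log(1/δ)).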

slide-43
SLIDE 43

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

The Tug-of-War Sketch

slide-44
SLIDE 44

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

The Tug-of-War Sketch

The following simple algorithm for F2 outperforms AMS by using only O(log n + log m) bits.

Algorithm Tug-of-War Sketch
Init: a random hash function h : [n] → {−1, 1} from a 4-universal family; x ← 0.
On input (y, ∆): x ← x + ∆ · h(y).
Output: x².
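A minimal Python version; a 4-universal hash family can be built from random degree-3 polynomials over a prime field, which is what we sketch here (the parity-to-sign map and all names are our own simplification).

```python
import random

class TugOfWar:
    """One tug-of-war counter: x = sum_j f_j * h(j) with h(j) in {-1, +1};
    x^2 is an unbiased estimate of F_2."""

    P = 2_147_483_647  # a prime, assumed larger than the universe size n

    def __init__(self):
        # a random cubic polynomial sketches a 4-universal hash over [P]
        self.coef = [random.randrange(self.P) for _ in range(4)]
        self.x = 0

    def _h(self, y):
        v = 0
        for c in self.coef:              # Horner evaluation mod P
            v = (v * y + c) % self.P
        return 1 if v % 2 == 0 else -1   # map the value to a sign

    def update(self, y, delta=1):
        self.x += delta * self._h(y)

    def estimate(self):
        return self.x ** 2
```

The sketch is linear in the frequency vector, so it also works in the turnstile model.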

slide-45
SLIDE 45

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Analysis

slide-46
SLIDE 46

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Analysis

Let X be the value of x at the end of our algorithm.

slide-47
SLIDE 47

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Analysis

Let X be the value of x at the end of our algorithm. E [ X2] = E     ∑

j∈[n]

fjh(j)  

2

 = E   ∑

j∈[n]

f2

jh(j)2 +

i,j∈[n]:i̸=j

fifjh(i)h(j)   = F2.

slide-48
SLIDE 48

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Analysis

Let X be the value of x at the end of our algorithm. Then

E[X²] = E[(∑j∈[n] fj·h(j))²] = E[∑j∈[n] fj²·h(j)² + ∑i,j∈[n]: i≠j fi·fj·h(i)·h(j)] = F2.

Using the property of the 4-universal hash family, we have

E[X⁴] = ∑i,j,k,ℓ∈[n] fi·fj·fk·fℓ·E[h(i)h(j)h(k)h(ℓ)]
= ∑j∈[n] fj⁴·E[h(j)⁴] + 6 ∑i,j∈[n]: j>i fi²·fj²·E[h(i)²·h(j)²]
= F4 + 6 ∑i,j∈[n]: j>i fi²·fj².

slide-49
SLIDE 49

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Therefore,

Var[X²] = E[X⁴] − (E[X²])² = F4 − F2² + 6 ∑i,j∈[n]: j>i fi²·fj² = F4 − F2² + 3(F2² − F4) = 2(F2² − F4) ⩽ 2F2².

Finally, we apply the standard averaging and median tricks, and the algorithm costs O((1/ε²) · log(1/δ) · (log n + log m)) bits of memory.