Estimating Frequency Moments



SLIDE 1

Estimating Frequency Moments Anil Maheshwari Frequency Moments Estimating F0 Algorithm Correctness Further Improvements Estimating F2 Correctness Improving Variance Complexity

Estimating Frequency Moments

Anil Maheshwari

School of Computer Science, Carleton University, Canada

SLIDE 2

Outline

1. Frequency Moments
2. Estimating F0
3. Algorithm
4. Correctness
5. Further Improvements
6. Estimating F2
7. Correctness
8. Improving Variance
9. Complexity

SLIDE 3

Frequency Moments

Definition

Let $A = (a_1, a_2, \ldots, a_n)$ be a stream, where elements are from the universe $U = \{1, \ldots, u\}$. Let $m_i$ be the number of elements in $A$ that are equal to $i$. The $k$-th frequency moment is

$$F_k = \sum_{i=1}^{u} m_i^k, \quad \text{where } 0^0 = 0.$$

An example for $n = 19$ and $u = 7$:

$A = (3, 2, 4, 7, 2, 2, 3, 2, 2, 1, 4, 2, 2, 2, 1, 1, 2, 3, 2)$

$m_1 = 3,\ m_2 = 10,\ m_3 = 3,\ m_4 = 2,\ m_5 = 0,\ m_6 = 0,\ m_7 = 1$

SLIDE 4

Example contd.

$A = (3, 2, 4, 7, 2, 2, 3, 2, 2, 1, 4, 2, 2, 2, 1, 1, 2, 3, 2)$ and $m_1 = m_3 = 3,\ m_2 = 10,\ m_4 = 2,\ m_7 = 1,\ m_5 = m_6 = 0$

$$F_0 = \sum_{i=1}^{7} m_i^0 = 3^0 + 10^0 + 3^0 + 2^0 + 0^0 + 0^0 + 1^0 = 5$$

(# of Distinct Elements in A)

$$F_1 = \sum_{i=1}^{7} m_i^1 = 3^1 + 10^1 + 3^1 + 2^1 + 0^1 + 0^1 + 1^1 = 19$$

(# of Elements in A)

$$F_2 = \sum_{i=1}^{7} m_i^2 = 3^2 + 10^2 + 3^2 + 2^2 + 0^2 + 0^2 + 1^2 = 123$$

(Surprise Number)
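The three values above can be reproduced with an exact (non-streaming) computation. This is only a minimal sketch for checking the arithmetic; the function name `frequency_moment` is an assumption for illustration, and the code uses linear space, unlike the estimators discussed in the following slides.

```python
from collections import Counter

def frequency_moment(stream, k):
    """Exact k-th frequency moment: sum of m_i**k over elements that occur.

    Counter only holds elements with m_i >= 1, so absent elements
    contribute 0, which matches the convention 0^0 = 0.
    """
    counts = Counter(stream)
    return sum(m ** k for m in counts.values())

A = (3, 2, 4, 7, 2, 2, 3, 2, 2, 1, 4, 2, 2, 2, 1, 1, 2, 3, 2)
print(frequency_moment(A, 0))  # 5
print(frequency_moment(A, 1))  # 19
print(frequency_moment(A, 2))  # 123
```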

SLIDE 5

Streaming Problem

Find frequency moments in a stream

Input: A stream $A$ consisting of $n$ elements from the universe $U = \{1, \ldots, u\}$.

Output: Estimates of the frequency moments $F_k$ for different values of $k$.

Our Task: Estimate $F_0$ and $F_2$ using sublinear space.

Reference: The space complexity of approximating the frequency moments by Noga Alon, Yossi Matias, and Mario Szegedy, Journal of Computer and System Sciences, 1999.

SLIDE 6

Estimating F0

Computation of F0

Input: Stream $A = (a_1, a_2, \ldots, a_n)$, where each $a_i \in U = \{1, \ldots, u\}$.

Output: An estimate $\hat{F}_0$ of the number of distinct elements $F_0$ in $A$ such that

$$\Pr\left(\frac{1}{c} \le \frac{\hat{F}_0}{F_0} \le c\right) \ge 1 - \frac{2}{c}$$

for some constant $c$, using sublinear space.

SLIDE 7

Algorithm for Estimating F0

Input: Stream $A$ and a hash function $h : U \to U$

Output: Estimate $\hat{F}_0$

Step 1: Initialize $R := 0$

Step 2: For each element $a_i \in A$ do:

1. Compute the binary representation of $h(a_i)$
2. Let $r$ be the location of the rightmost 1 in the binary representation
3. If $r > R$, set $R := r$

Step 3: Return $\hat{F}_0 = 2^R$

Space requirements: $O(\log u)$ bits
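The steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: bit locations are counted from 1 starting at the least significant bit, a hash value of 0 (which has no 1 bit) is treated as location 0, and `make_linear_hash` is a hypothetical helper using $h(x) = (ax + b) \bmod p$ that stands in for whatever hash family the analysis assumes.

```python
import random

def rightmost_one_location(value):
    """1-indexed position of the lowest set bit; 0 when value has no set bit."""
    if value == 0:
        return 0
    return (value & -value).bit_length()

def estimate_f0(stream, h):
    """Single pass over the stream; returns 2**R as in Steps 1 to 3."""
    R = 0
    for a in stream:
        r = rightmost_one_location(h(a))
        if r > R:
            R = r
    return 2 ** R

def make_linear_hash(seed=None, prime=(1 << 31) - 1):
    """Hypothetical hash h(x) = (a*x + b) mod prime (an illustrative assumption)."""
    rng = random.Random(seed)
    a = rng.randrange(1, prime)
    b = rng.randrange(prime)
    return lambda x: (a * x + b) % prime

A = (3, 2, 4, 7, 2, 2, 3, 2, 2, 1, 4, 2, 2, 2, 1, 1, 2, 3, 2)
print(estimate_f0(A, make_linear_hash(seed=0)))  # some power of two
```

Only `R` and the hash parameters are kept between stream elements, which is where the logarithmic space bound comes from.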

SLIDE 8

Observations

Let $d$ be the smallest integer such that $2^d \ge u$ ($d$ bits are sufficient to represent numbers in $U$).

Observation 1: $\Pr(\text{rightmost 1 in } h(a_i) \text{ is at location} \ge r + 1) = \frac{1}{2^r}$

SLIDE 9

Observations contd.

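Observation 2: For $a_i \ne a_j$,

$$\Pr(\text{rightmost 1 in } h(a_i) \ge r + 1 \text{ and rightmost 1 in } h(a_j) \ge r + 1) = \frac{1}{2^{2r}}$$

Fix $r \in \{1, \ldots, d\}$. For all $x \in A$, define the indicator r.v.

$$I^r_x = \begin{cases} 1, & \text{if the rightmost 1 is at location} \ge r + 1 \text{ in } h(x) \\ 0, & \text{otherwise} \end{cases}$$

Let $Z^r = \sum_x I^r_x$ (sum is over distinct elements of $A$).

Observation 3: The following holds:

1. $E[I^r_x] = \Pr(I^r_x = 1) = \frac{1}{2^r}$ (see Observation 1)
2. $Var[I^r_x] = E[(I^r_x)^2] - E[I^r_x]^2 = \frac{1}{2^r}\left(1 - \frac{1}{2^r}\right)$
3. $E[Z^r] = \frac{F_0}{2^r}$
4. $Var[Z^r] = F_0 \cdot \frac{1}{2^r}\left(1 - \frac{1}{2^r}\right) \le \frac{F_0}{2^r} = E[Z^r]$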
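As a rough sanity check of Observation 3, take the running example with $F_0 = 5$ distinct elements and pick $r = 3$ (an arbitrary value chosen here only for illustration):

```latex
\begin{align*}
E[Z^r] &= \frac{F_0}{2^r} = \frac{5}{8}, \\
Var[Z^r] &= F_0 \cdot \frac{1}{2^r}\left(1 - \frac{1}{2^r}\right)
          = 5 \cdot \frac{1}{8} \cdot \frac{7}{8} = \frac{35}{64}
          \le \frac{40}{64} = E[Z^r].
\end{align*}
```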

SLIDE 10

Observations contd.

Observation 4: If $2^r > cF_0$, then $\Pr(Z^r > 0) < \frac{1}{c}$.

Proof: Markov's Inequality states: $\Pr(X \ge a) \le \frac{E[X]}{a}$.

$$\Pr(Z^r > 0) = \Pr(Z^r \ge 1) \le E[Z^r] = \frac{F_0}{2^r} < \frac{1}{c}$$

Observation 5: If $c2^r < F_0$, then $\Pr(Z^r = 0) < \frac{1}{c}$.

Proof: Chebyshev's Inequality states: $\Pr(|X - E[X]| \ge \alpha) \le \frac{Var[X]}{\alpha^2}$.

Note $\Pr(Z^r = 0) \le \Pr(|Z^r - E[Z^r]| \ge E[Z^r])$. Thus

$$\Pr(Z^r = 0) \le \frac{Var[Z^r]}{E[Z^r]^2} \le \frac{1}{E[Z^r]} \le \frac{2^r}{F_0} < \frac{1}{c}$$
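To make the two bounds concrete, here is a worked instance with hypothetical values $F_0 = 1000$ and $c = 5$ (these numbers are assumptions chosen for illustration, not taken from the slides). For Observation 4 take $r = 13$, so that $2^r = 8192 > cF_0 = 5000$; for Observation 5 take $r = 7$, so that $c2^r = 640 < F_0$:

```latex
\begin{align*}
r = 13:&\quad \Pr(Z^r > 0) \le \frac{F_0}{2^r} = \frac{1000}{8192} \approx 0.122 < \frac{1}{5}, \\
r = 7:&\quad \Pr(Z^r = 0) \le \frac{2^r}{F_0} = \frac{128}{1000} = 0.128 < \frac{1}{5}.
\end{align*}
```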

SLIDE 11

Observations contd.

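Observation 6: In our algorithm, we set $\hat{F}_0 = 2^R$. We have

$$\Pr\left(\frac{1}{c} \le \frac{2^R}{F_0} \le c\right) \ge 1 - \frac{2}{c}.$$

Proof: From Observation 4, if $2^R > cF_0$, $\Pr(Z^r > 0) < \frac{1}{c}$.

From Observation 5, if $c2^R < F_0$, $\Pr(Z^r = 0) < \frac{1}{c}$.

With probability at most $\frac{2}{c}$, either $2^R > cF_0$ or $c2^R < F_0$. (Failure)

Thus, with probability at least $1 - \frac{2}{c}$, we have $\frac{1}{c} \le \frac{2^R}{F_0} \le c$. (Success)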

SLIDE 12

Improving success probability

Execute the algorithm $s$ times in parallel with independent hash functions. Let $R$ be the median value among these runs. Return $\hat{F}_0 = 2^R$.

Claim: For $c > 4$, there exists $s = O(\log \frac{1}{\epsilon})$, $\epsilon > 0$, such that $\frac{1}{c} \le \frac{\hat{F}_0}{F_0} \le c$ with probability at least $1 - \epsilon$, and the algorithm uses $O(s \log u)$ bits.

The proof uses Chernoff Bounds: if the r.v. $X$ is a sum of independent identical indicator r.v.s and $0 < \delta < 1$, then

$$\Pr(X \ge (1 + \delta)E[X]) \le e^{-\frac{\delta^2 E[X]}{3}}$$

SLIDE 13

Improving success probability contd.

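Define indicator r.v. $X_1, \ldots, X_s$:

$$X_i = \begin{cases} 0, & \text{if success, i.e. } \frac{1}{c} \le \frac{2^{R_i}}{F_0} \le c \\ 1, & \text{otherwise} \end{cases}$$

Note

1. $E[X_i] = \Pr(X_i = 1) \le \frac{2}{c} = \beta < \frac{1}{2}$
2. Let $X = \sum_{i=1}^{s} X_i$ = # Failures in $s$ runs
3. $E[X] \le s\beta < \frac{s}{2}$

We apply Chernoff Bounds by setting $s = O(\log \frac{1}{\epsilon})$.

Calculations will show that $\Pr\left(\frac{1}{c} \le \frac{2^R}{F_0} \le c\right) \ge 1 - \epsilon$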
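The median-of-runs idea can be sketched as a single pass that updates all $s$ counters per stream element. This is a minimal sketch, not the author's implementation: the function names are assumptions, the hash functions are supplied by the caller, and `statistics.median_low` is used so that the median stays an integer when $s$ is even. The hash functions in the usage lines are deterministic stand-ins chosen so the result is easy to follow by hand; the analysis assumes independent random ones.

```python
from statistics import median_low

def rightmost_one_location(value):
    """1-indexed position of the lowest set bit; 0 when value has no set bit."""
    if value == 0:
        return 0
    return (value & -value).bit_length()

def estimate_f0_median(stream, hashes):
    """Run one F0 estimator per hash function and return 2**(median R)."""
    R = [0] * len(hashes)
    for a in stream:
        for j, h in enumerate(hashes):
            r = rightmost_one_location(h(a))
            if r > R[j]:
                R[j] = r
    return 2 ** median_low(R)

# Deterministic stand-in hashes, for illustration only.
hashes = [lambda x: x, lambda x: 2 * x, lambda x: 4 * x]
print(estimate_f0_median([1, 3], hashes))  # R values are 1, 2, 3; prints 4
```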

SLIDE 14

Estimating F2

Input: Stream $A$ and hash function $h : U \to \{-1, +1\}$

Output: Estimate $\hat{F}_2$ of $F_2 = \sum_{i=1}^{u} m_i^2$

Algorithm (Tug of War)

Step 1: Initialize $Y := 0$.

Step 2: For each element $x \in U$, evaluate $r_x = h(x)$.

Step 3: For each element $a_i \in A$, $Y := Y + r_{a_i}$.

Step 4: Return $\hat{F}_2 = Y^2$.
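A minimal Python sketch of the Tug of War steps follows. The names `tug_of_war` and `make_random_sign` are assumptions for illustration. The random sign helper memoizes one sign per distinct element, which is simple but not space-efficient; it is a stand-in for the compact hash function the algorithm assumes. The parity-based sign function is deterministic and is used only so the arithmetic can be followed by hand.

```python
import random

def tug_of_war(stream, sign):
    """Add sign(a) for each stream element and return Y**2."""
    Y = 0
    for a in stream:
        Y += sign(a)
    return Y * Y

def make_random_sign(seed=None):
    """Hypothetical helper: lazily assigns an independent +/-1 to each element.

    Stores one sign per distinct element, so it does not meet the space bound.
    """
    rng = random.Random(seed)
    table = {}
    def sign(x):
        if x not in table:
            table[x] = rng.choice((-1, 1))
        return table[x]
    return sign

A = (3, 2, 4, 7, 2, 2, 3, 2, 2, 1, 4, 2, 2, 2, 1, 1, 2, 3, 2)

# Deterministic signs: +1 for even elements, -1 for odd ones.
# Y = (10 + 2) - (3 + 3 + 1) = 5, so the returned value is 25.
print(tug_of_war(A, lambda x: 1 if x % 2 == 0 else -1))  # 25

print(tug_of_war(A, make_random_sign(seed=1)))  # one random estimate of F2 = 123
```

A single run can be far from $F_2$; the observations that follow bound how far.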

SLIDE 15

Observations

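Observation 1: $Y = \sum_{i=1}^{u} r_i m_i$ and $E[r_i] = 0$.

Observation 2: $E[Y^2] = \sum_{i=1}^{u} m_i^2 = F_2$

1. $Y^2 = \left(\sum_{i=1}^{u} r_i m_i\right)^2 = \sum_{i=1}^{u} \sum_{j=1}^{u} r_i r_j m_i m_j$
2. $E[Y^2] = E\left[\sum_{i=1}^{u} \sum_{j=1}^{u} r_i r_j m_i m_j\right]$
3. By Linearity of Expectation, $E[Y^2] = \sum_{i=1}^{u} \sum_{j=1}^{u} m_i m_j E[r_i r_j]$
4. By independence, $E[r_i r_j] = E[r_i]E[r_j] = 0$ for $i \ne j$, while $r_i^2 = 1$ gives $E[r_i^2] = 1$. We have $E[Y^2] = \sum_{i=1}^{u} m_i^2 = F_2$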
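Observation 2 can be checked exactly for a small universe by averaging $Y^2$ over all $2^u$ sign vectors. This is a brute-force sketch for verification only; the function name is an assumption, and enumerating every sign vector corresponds to fully independent uniform signs, which is stronger than the pairwise independence the argument uses.

```python
from fractions import Fraction
from itertools import product

def exact_expected_Y2(m):
    """Exact E[Y^2] when each r_i is an independent uniform +/-1 sign."""
    total = 0
    count = 0
    for signs in product((-1, 1), repeat=len(m)):
        Y = sum(r * mi for r, mi in zip(signs, m))
        total += Y * Y
        count += 1
    return Fraction(total, count)

m = [3, 10, 3, 2, 0, 0, 1]            # frequencies from the running example
print(exact_expected_Y2(m))           # 123
print(sum(mi * mi for mi in m))       # 123, i.e. F2
```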

SLIDE 16

Observations contd.

Observation 3: $\Pr\left(|Y^2 - E[Y^2]| \ge \sqrt{2}\,c\,E[Y^2]\right) \le \frac{1}{c^2}$ for any positive constant $c$.

Proof Sketch:

1. Chebyshev's Inequality: $\Pr(|X - E[X]| \ge \alpha) \le \frac{Var[X]}{\alpha^2}$
2. $\Pr\left(|Y^2 - E[Y^2]| \ge c\sqrt{Var[Y^2]}\right) \le \frac{1}{c^2}$
3. $Var[Y^2] = E[Y^4] - E[Y^2]^2$

SLIDE 17

Observation 3 contd.

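$$E[Y^4] = E\left[\sum_{i,j,k,l} m_i m_j m_k m_l\, r_i r_j r_k r_l\right] = \sum_{i,j,k,l} m_i m_j m_k m_l\, E[r_i r_j r_k r_l] = \sum_{i=1}^{u} m_i^4 + 6 \sum_{1 \le i < j \le u} m_i^2 m_j^2$$

$$Var[Y^2] = E[Y^4] - E[Y^2]^2 = \sum_{i=1}^{u} m_i^4 + 6 \sum_{1 \le i < j \le u} m_i^2 m_j^2 - \left(\sum_{i=1}^{u} m_i^2\right)^2 = 4 \sum_{1 \le i < j \le u} m_i^2 m_j^2$$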

SLIDE 18

Observation 3 contd.

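$$Var[Y^2] = 4 \sum_{i<j} m_i^2 m_j^2 \le 2 \left(\sum_{i=1}^{u} m_i^2\right)^2 = 2 \left(E[Y^2]\right)^2$$

Thus, from

$$\Pr\left(|Y^2 - E[Y^2]| \ge c\sqrt{Var[Y^2]}\right) \le \frac{1}{c^2}$$

we obtain

$$\Pr\left(|Y^2 - E[Y^2]| \ge \sqrt{2}\,c\,E[Y^2]\right) \le \frac{1}{c^2}$$

Therefore $Y^2$ approximates $F_2 = E[Y^2]$ within a constant factor with probability at least $1 - \frac{1}{c^2}$.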
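The variance formula and the bound $Var[Y^2] \le 2F_2^2$ can be checked exactly on the running example by brute force over all sign vectors. This is a verification sketch under the assumption of fully independent uniform signs (which implies the 4-wise independence the derivation needs); the function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations, product

def exact_variance_of_Y2(m):
    """Exact Var[Y^2] = E[Y^4] - E[Y^2]^2 over all 2**u sign vectors."""
    n_vectors = 2 ** len(m)
    sum_y2 = 0
    sum_y4 = 0
    for signs in product((-1, 1), repeat=len(m)):
        Y = sum(r * mi for r, mi in zip(signs, m))
        sum_y2 += Y ** 2
        sum_y4 += Y ** 4
    mean_y2 = Fraction(sum_y2, n_vectors)
    mean_y4 = Fraction(sum_y4, n_vectors)
    return mean_y4 - mean_y2 ** 2

def closed_form_variance(m):
    """4 * sum over i < j of m_i^2 * m_j^2."""
    return 4 * sum(a * a * b * b for a, b in combinations(m, 2))

m = [3, 10, 3, 2, 0, 0, 1]
F2 = sum(mi * mi for mi in m)
print(exact_variance_of_Y2(m), closed_form_variance(m))  # 9900 9900
print(closed_form_variance(m) <= 2 * F2 ** 2)            # True
```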

SLIDE 19

Improving the Variance

Execute the algorithm $k$ times (using independent hash functions), resulting in $Y_1^2, Y_2^2, \ldots, Y_k^2$.

Output $\bar{Y}^2 = \frac{1}{k} \sum_{i=1}^{k} Y_i^2$

Observations:

1. $E[\bar{Y}^2] = E[Y^2] = F_2$
2. $Var[\bar{Y}^2] = \frac{1}{k} Var[Y^2]$
3. $\Pr\left(|\bar{Y}^2 - E[\bar{Y}^2]| \ge \sqrt{\frac{2}{k}}\,c\,E[\bar{Y}^2]\right) \le \frac{1}{c^2}$
4. Setting $k = O(\frac{1}{\epsilon^2})$, we have $\Pr\left(|\bar{Y}^2 - E[\bar{Y}^2]| \ge \epsilon\,c\,E[\bar{Y}^2]\right) \le \frac{1}{c^2}$
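Averaging $k$ runs can be sketched as a single pass that keeps one counter per sign function. This is a minimal sketch with an assumed function name; the two sign functions in the usage lines are hand-picked and deterministic so the result can be verified by hand, whereas the analysis assumes $k$ independent random hash functions.

```python
def averaged_tug_of_war(stream, signs):
    """Keep one counter Y_j per sign function; return the mean of Y_j**2."""
    Y = [0] * len(signs)
    for a in stream:
        for j, sign in enumerate(signs):
            Y[j] += sign(a)
    return sum(y * y for y in Y) / len(signs)

# Hand-picked deterministic signs, for illustration only.
signs = [
    lambda x: 1,                         # every element gets +1
    lambda x: 1 if x % 2 == 0 else -1,   # even +1, odd -1
]
# Stream (1, 2, 2): Y_1 = 3 and Y_2 = 1, so the average is (9 + 1) / 2.
print(averaged_tug_of_war([1, 2, 2], signs))  # 5.0
```

For this tiny stream the average happens to equal $F_2 = 1^2 + 2^2 = 5$; in general only the expectation matches.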

SLIDE 20

Space Complexity

Algorithm (Tug of War)

Step 1: Initialize $Y := 0$.

Step 2: For each element $x \in U$, evaluate $r_x = h(x)$.

Step 3: For each element $a_i \in A$, $Y := Y + r_{a_i}$.

Step 4: Return $\hat{F}_2 = Y^2$.

We need to store $Y$ and a representation of $(r_1, r_2, \ldots, r_u)$. $Y$ requires $O(\log n)$ bits. The analysis needed the $r_i$'s to be 2-wise independent (for $E[Y^2]$) and 4-wise independent (for $Var[Y^2]$). 4-wise independent hash functions can be maintained using $O(\log u)$ bits. Total space required is $O(\log n + \log u)$.
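One common way to obtain 4-wise independent values from a short seed is a random polynomial of degree 3 over a prime field, so only four coefficients are stored. The sketch below is an assumption-laden illustration, not the construction from the slides: the prime $2^{61} - 1$ is assumed to exceed the universe size, the name `make_four_wise_sign` is hypothetical, and taking the low bit of the polynomial value to get a sign introduces a tiny bias because the prime is odd.

```python
import random

P = (1 << 61) - 1  # Mersenne prime; assumed to be larger than the universe size u

def make_four_wise_sign(seed=None):
    """Sign function built from a random degree-3 polynomial mod P.

    Only the four coefficients are stored. Mapping the value to +/-1 via its
    low bit is an approximation (bias on the order of 1/P).
    """
    rng = random.Random(seed)
    coeffs = [rng.randrange(P) for _ in range(4)]

    def sign(x):
        value = 0
        for c in coeffs:          # Horner's rule for the cubic polynomial
            value = (value * x + c) % P
        return 1 if value & 1 else -1

    return sign

sign = make_four_wise_sign(seed=7)
A = (3, 2, 4, 7, 2, 2, 3, 2, 2, 1, 4, 2, 2, 2, 1, 1, 2, 3, 2)
Y = sum(sign(a) for a in A)
print(Y * Y)  # one Tug of War estimate of F2 using the compact sign function
```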

SLIDE 21

References

1. The space complexity of approximating the frequency moments by Noga Alon, Yossi Matias, and Mario Szegedy, Journal of Computer and System Sciences, 1999.
2. Probabilistic Counting by Philippe Flajolet and G. Nigel Martin, 24th Annual Symposium on Foundations of Computer Science, 1983.
3. Notes on Algorithm Design by A.M
4. Several Lecture Notes (Tim Roughgarden, Ankur Moitra, Lap Chi Lau, Yufei Tao, John Augustine, ...)