Data Streams & Communication Complexity
Lecture 1: Simple Stream Statistics in Small Space
Andrew McGregor, UMass Amherst
1/25
Data Stream Model
◮ Stream: m elements from universe of size n, e.g., x1, x2, . . . , xm
◮ Goal: Compute some function of the stream, e.g., the number of distinct elements.
◮ Catch: (i) limited working memory, sublinear in n and m, (ii) access the data sequentially, (iii) process each element quickly.
◮ Origins in the seventies but has become popular in the last ten years. . .
2/25
◮ Practical Appeal:
◮ Faster networks, cheaper data storage, ubiquitous data-logging
◮ Applications to network monitoring, query planning, I/O efficiency
◮ Theoretical Appeal:
◮ Easy to state problems but hard to solve.
◮ Links to communication complexity, compressed sensing, metric embeddings
3/25
◮ Given a stream of m elements from universe [n] = {1, 2, . . . , n}, let fi denote the frequency of element i, i.e., its number of occurrences.
◮ Problems: What can we approximate in sublinear space?
◮ Frequency moments: Fk = ∑i fi^k.
◮ Max frequency: F∞ = maxi fi.
◮ Number of distinct elements: F0 = ∑i fi^0 (with the convention 0^0 = 0), i.e., |{i : fi ≥ 1}|.
◮ Median: j such that f1 + f2 + . . . + fj ≈ m/2.
◮ Keep things simple: Could consider fi's being increased or decreased (the turnstile model); for now we stick to increments.
4/25
◮ Sampling is a general technique for tackling massive amounts of data
◮ Example: To find an ǫ-approximate median, i.e., j such that |f1 + f2 + . . . + fj − m/2| ≤ ǫm, it suffices to sample O(ǫ^−2 log δ^−1) elements and return the sample median.
◮ Beyond basic sampling: There are more powerful forms of sampling, e.g., the ℓp sampling discussed later in the lecture.
6/25
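The sampling bound above can be sketched in a few lines. This is an illustrative implementation, not from the lecture; the function name `approx_median` and the constant hidden in the O(ǫ^−2 log δ^−1) sample size are assumptions.

```python
import math
import random

def approx_median(stream, eps, delta=0.01, seed=0):
    """Return an eps-approximate median: an element whose rank is within
    eps*m of m/2, with probability at least 1 - delta."""
    rng = random.Random(seed)
    # O(eps^-2 log(1/delta)) samples suffice by a Chernoff bound.
    t = math.ceil(eps ** -2 * math.log(2 / delta))
    sample = sorted(rng.choice(stream) for _ in range(t))
    return sample[t // 2]
```

Note the space used is the t samples, independent of the stream length m; sampling with replacement can be done in one pass when m is known in advance.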
◮ Problem: Estimate ∑i g(fi) for some function g with g(0) = 0
◮ Basic Estimator: Sample xJ where J ∈R [m] and compute r = |{j ≥ J : xj = xJ}|, the number of occurrences of xJ from position J onwards. Output X = m(g(r) − g(r − 1)).
◮ Expectation:
E[X] = ∑i ∑r=1..fi (1/m) · m(g(r) − g(r − 1)) = ∑i g(fi)
since for each i the inner sum telescopes to g(fi) − g(0) = g(fi).
◮ For high confidence: Compute t estimators in parallel and average.
7/25
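A minimal sketch of the basic estimator, assuming the stream fits in a list. For clarity it recomputes r by rescanning; a true one-pass version would choose J via reservoir sampling and maintain r as a counter. The name `ams_basic` and the default t are illustrative.

```python
import random

def ams_basic(stream, g, t=4000, seed=0):
    """Average of t AMS basic estimators for sum_i g(f_i), where g(0) = 0.
    Each estimator: pick position J uniformly, let r be the number of
    occurrences of x_J at or after J, and output m * (g(r) - g(r - 1));
    the telescoping sum over r makes the expectation sum_i g(f_i)."""
    rng = random.Random(seed)
    m = len(stream)
    total = 0.0
    for _ in range(t):
        J = rng.randrange(m)
        xJ = stream[J]
        r = sum(1 for x in stream[J:] if x == xJ)  # offline rescan for clarity
        total += m * (g(r) - g(r - 1))
    return total / t
```

With g(r) = r**k this is exactly the frequency-moment estimator of the next slide.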
◮ Frequency Moments: Define Fk = ∑i fi^k
◮ Use AMS estimator with X = m(r^k − (r − 1)^k).
◮ Expectation: E[X] = Fk
◮ Range: 0 ≤ X ≤ km·F∞^(k−1)
◮ Repeat t times and let F̃k be the average of the estimates; by Chebyshev, t = O(ǫ^−2 · k·n^(1−1/k)) repetitions give a (1 + ǫ)-approximation with constant probability, and taking the median of O(log δ^−1) such averages boosts this to probability 1 − δ.
◮ Thm: In Õ(ǫ^−2 n^(1−1/k)) space we can compute a (1 + ǫ)-approximation of Fk with probability at least 1 − δ.
8/25
◮ Many stream algorithms use a random projection Z ∈ R^(w×n), w ≪ n, and maintain the sketch s = Zf.
◮ Updatable: We can maintain the sketch s in Õ(w) space: an increment of fi adds the i-th column of Z to s, and the entries of Z are generated on the fly from a short random seed rather than stored.
◮ Useful: Choose a distribution for zi,j such that the relevant function of f can be estimated from s.
10/25
◮ If zi,j ∈R {−1, 1}, can estimate F2 with w = O(ǫ^−2 log δ^−1).
◮ If zi,j ∼ D where D is p-stable, p ∈ (0, 2], can estimate Fp with w = O(ǫ^−2 log δ^−1).
◮ Note that F0 = (1 ± ǫ)Fp if p = log(1 + ǫ)/ log m.
◮ For the rest of the lecture we'll focus on “hash-based” sketches: given a random hash function h : [n] → [w], each column i of Z has a single nonzero entry, in row h(i).
11/25
◮ Maintain a vector s ∈ N^w via a random hash function h : [n] → [w]
[Figure: each of f[1], f[2], . . . , f[n] is hashed down to one of the counters s[1], s[2], . . . , s[w].]
◮ Update: For each increment of fi, increment s_h(i). Hence s_k = ∑{i : h(i)=k} fi.
◮ Query: Use f̃i = s_h(i) as the estimate for fi.
◮ Lemma: fi ≤ f̃i and P[f̃i ≥ fi + 2m/w] ≤ 1/2.
◮ Thm: Let w = 2/ǫ. Repeat the hashing lg(δ^−1) times in parallel and take the minimum estimate: then fi ≤ f̃i ≤ fi + ǫm with probability at least 1 − δ.
13/25
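A minimal Count-Min sketch along the lines above. Assumptions: salted `hash()` tuples stand in for the pairwise-independent hash functions of the analysis, and the class name and parameters are illustrative.

```python
import random

class CountMin:
    """Count-Min sketch: d rows of w counters, one hash function per row.
    Estimates never underestimate, and taking the min over rows controls
    the overestimate: f_i <= est <= f_i + 2m/w with probability 1 - 2^-d."""

    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.rows = [[0] * w for _ in range(d)]
        # Salts standing in for independent hash functions.
        self.salts = [rng.getrandbits(64) for _ in range(d)]

    def _h(self, row, i):
        return hash((self.salts[row], i)) % self.w

    def update(self, i, c=1):
        for row in range(self.d):
            self.rows[row][self._h(row, i)] += c

    def estimate(self, i):
        return min(self.rows[r][self._h(r, i)] for r in range(self.d))
```

Space is O(wd) counters regardless of n, which is the whole point: w = 2/ǫ and d = lg(δ^−1) give the theorem's guarantee.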
◮ Define E by f̃i = fi + E, i.e., E = ∑{j≠i : h(j)=h(i)} fj.
◮ Since all fj ≥ 0, we have E ≥ 0.
◮ Since P[h(i) = h(j)] = 1/w,
E[E] = ∑{j≠i} fj/w ≤ m/w
◮ By an application of the Markov bound, P[E ≥ 2m/w] ≤ 1/2.
14/25
◮ Range Query: For i, j ∈ [n], estimate f[i,j] = fi + fi+1 + . . . + fj
◮ Dyadic Intervals: Restrict attention to intervals of the form [k·2^ℓ + 1, (k + 1)·2^ℓ]; any range [i, j] is the disjoint union of O(log n) dyadic intervals.
◮ To support dyadic intervals, construct Count-Min sketches at each of the O(log n) scales.
◮ E.g., for intervals of width 2 we have:
[Figure: the width-2 interval frequencies g[1], g[2], . . . , g[n/2], where g[k] = f[2k−1] + f[2k], are themselves hashed to counters s[1], . . . , s[w].]
15/25
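The decomposition step can be made concrete as follows; the helper name `dyadic_decompose` is illustrative and the 1-indexed convention matches the interval form above.

```python
def dyadic_decompose(i, j, n):
    """Split [i, j] (1-indexed, 1 <= i <= j <= n, n a power of two) into
    disjoint dyadic intervals [k*2^l + 1, (k+1)*2^l]. At most ~2*log2(n)
    pieces are produced, so one range query reduces to that many point
    queries on the per-scale Count-Min sketches."""
    pieces = []
    lo = i
    while lo <= j:
        length = 1
        # Grow while lo also starts a dyadic interval of twice the length
        # that still fits inside [lo, j].
        while (lo - 1) % (2 * length) == 0 and lo + 2 * length - 1 <= j:
            length *= 2
        pieces.append((lo, lo + length - 1))
        lo += length
    return pieces
```

For example, [2, 8] inside [1, 8] splits into [2, 2], [3, 4], [5, 8].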
◮ Quantiles: Find j such that f1 + f2 + . . . + fj ≈ φm; follows by binary search over j using range queries f[1,j].
◮ Heavy Hitter Problem: Find a set S ⊂ [n] where {i : fi ≥ φm} ⊆ S ⊆ {i : fi ≥ (φ − ǫ)m}; follows by a top-down search through the tree of dyadic intervals, recursing only into intervals with large estimated weight.
16/25
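For comparison with the sketch-based approach above, here is Misra-Gries, a classic deterministic heavy-hitters summary (not from this slide, included as a self-contained alternative): k − 1 counters guarantee every i with fi > m/k survives, with each kept count underestimating fi by at most m/k.

```python
def misra_gries(stream, k):
    """Misra-Gries summary with k-1 counters. Every element with
    frequency > m/k ends with a positive counter, and each kept counter
    undercounts the true frequency by at most m/k."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # Full and x absent: decrement everything, dropping zeros.
            for y in list(counters):
                counters[y] -= 1
                if counters[y] == 0:
                    del counters[y]
    return counters
```

A second pass over the stream (when available) can then compute exact counts for the surviving candidates.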
◮ Maintain s ∈ Z^w via hash functions h : [n] → [w] and r : [n] → {−1, 1}
[Figure: each fi is hashed to the counter s[h(i)] and added with sign r(i).]
◮ Update: For each increment of fi, s_h(i) ← s_h(i) + r(i). Hence s_k = ∑{i : h(i)=k} r(i)·fi.
◮ Query: Use f̃i = r(i) · s_h(i) as the estimate for fi.
◮ Lemma: E[f̃i] = fi and Var[f̃i] ≤ F2/w.
◮ Thm: Let w = O(1/ǫ^2). Repeating O(lg δ^−1) times in parallel and taking the median gives fi − ǫ√F2 ≤ f̃i ≤ fi + ǫ√F2 with probability at least 1 − δ.
18/25
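A minimal Count-Sketch along the lines above. As with the Count-Min sketch earlier, salted `hash()` tuples stand in for the pairwise-independent hash and sign functions of the analysis; class and parameter names are illustrative.

```python
import random
import statistics

class CountSketch:
    """Count-Sketch: each row hashes items to w counters and assigns a
    random sign; r(i) * s[h(i)] is an unbiased estimate of f_i with
    variance at most F2/w, and the median over d rows concentrates."""

    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.rows = [[0] * w for _ in range(d)]
        self.salts = [rng.getrandbits(64) for _ in range(d)]

    def _h(self, row, i):
        return hash((self.salts[row], 0, i)) % self.w

    def _r(self, row, i):
        return 1 if hash((self.salts[row], 1, i)) & 1 else -1

    def update(self, i, c=1):
        for row in range(self.d):
            self.rows[row][self._h(row, i)] += c * self._r(row, i)

    def estimate(self, i):
        return statistics.median(
            self._r(r, i) * self.rows[r][self._h(r, i)] for r in range(self.d))
```

Unlike Count-Min, the error here scales with ǫ√F2 rather than ǫm and the sketch handles decrements, at the cost of a larger width O(1/ǫ^2).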
◮ Define E by f̃i = fi + E, i.e., E = ∑{j≠i} r(i)r(j)·fj·1[h(i) = h(j)].
◮ Expectation: Since E[r(j)] = 0, E[E] = 0.
◮ Variance: Similarly, the cross terms with j ≠ k (and h(i) = h(j) = h(k)) vanish in expectation, so
Var[E] = ∑{j≠i} fj^2 · P[h(i) = h(j)] ≤ F2/w
19/25
◮ ℓp Sampling: Return random values I ∈ [n] and R ∈ R where P[I = i] ≈ |fi|^p / Fp and R ≈ fI.
◮ Applications:
◮ Will use ℓ2 sampling to get optimal algorithm for Fk, k > 2.
◮ Will use ℓ0 sampling for processing graph streams.
◮ Many other stream problems can be solved via ℓp sampling.
◮ Let's see an algorithm for p = 2. . .
21/25
◮ Weight fi by γi = 1/√ui where ui ∈R [0, 1], i.e., let gi = fi/√ui.
◮ Return (i, fi) if gi^2 ≥ t := F2(f)/ǫ.
◮ Probability (i, fi) is returned:
P[gi^2 ≥ t] = P[ui ≤ fi^2/t] = fi^2/t
◮ Probability some value is returned is ∑i fi^2/t = ǫ, so repeating O(ǫ^−1 log δ^−1) times ensures a sample is returned with probability at least 1 − δ.
◮ Lemma: Using a Count-Sketch of size O(ǫ^−1 log^2 n) ensures a sufficiently accurate estimate of any gi that clears the threshold.
22/25
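The thresholding step can be simulated offline to check the claimed probabilities. This sketch assumes exact access to f and F2(f); in the streaming algorithm the gi are only available through the Count-Sketch. The function name and the tie-breaking (returning the first passing coordinate, which is only approximately correct for small ǫ) are illustrative.

```python
import random

def l2_sample_once(f, eps, rng):
    """One round of the l2-sampling experiment: g_i = f_i / sqrt(u_i) with
    u_i uniform on (0, 1]; coordinate i clears the threshold t = F2/eps
    with probability exactly eps * f_i^2 / F2 (for eps * f_i^2 <= F2)."""
    F2 = sum(x * x for x in f)
    t = F2 / eps
    for i, fi in enumerate(f):
        u = rng.random() or 1e-12  # guard against u == 0
        if fi * fi / u >= t:       # i.e. g_i^2 >= t
            return (i, fi)
    return None                    # no coordinate passed; repeat O(1/eps) times

# Empirically, for small eps the returned index is distributed roughly
# proportionally to f_i^2:
rng = random.Random(0)
f = [3, 4]
hits = [0, 0]
for _ in range(20000):
    out = l2_sample_once(f, eps=0.1, rng=rng)
    if out is not None:
        hits[out[0]] += 1
frac_1 = hits[1] / (hits[0] + hits[1])  # ideal: 16/25 = 0.64
```

Each round succeeds with probability ǫ, matching the ∑i fi^2/t = ǫ calculation above.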
◮ Exercise: P[F2(g)/F2(f) ≤ c log n] ≥ 99/100 for some large c > 0.
◮ Set w = 9cǫ^−1 log n. A Count-Sketch of this width, in O(w log^2 n) space, ensures estimates g̃i with (g̃i − gi)^2 ≤ F2(g)/w.
◮ Then g̃i^2 ≥ F2(f)/ǫ implies
(g̃i − gi)^2 ≤ F2(g)/w ≤ (c log n)·F2(f)/(9cǫ^−1 log n) = F2(f)/(9ǫ^−1) ≤ ǫ^2·g̃i^2/9
so g̃i^2 = (1 ± ǫ/3)^2·gi^2 = (1 ± ǫ)·gi^2 as required.
◮ Under-the-rug: Need to ensure that conditioning on the sketch being accurate doesn't affect the distribution of the returned sample.
23/25
◮ Earlier we used Õ(n^(1−1/k)) space to approximate Fk = ∑i |fi|^k.
◮ Algorithm: Let (I, R) be a (1 + γ)-approximate ℓ2 sample and let F̃2 be a (1 + γ)-approximation of F2. Return T = F̃2 · R^(k−2).
◮ Expectation: Setting γ = ǫ/(4k),
E[T] ≈ ∑i (fi^2/F2) · F2 · fi^(k−2) = ∑i fi^k = Fk
where the (1 + γ) approximation factors contribute only a (1 ± ǫ/2) multiplicative error.
◮ Range: 0 ≤ T ≤ (1 + γ)·F2·F∞^(k−2) = O(n^(1−2/k)·Fk)
◮ Averaging over t = O(ǫ^−2 n^(1−2/k) log δ^−1) parallel repetitions gives a (1 ± ǫ)-approximation with probability at least 1 − δ.
◮ Thm: In Õ(ǫ^−2 n^(1−2/k)) space we can compute a (1 + ǫ)-approximation of Fk with probability at least 1 − δ.
24/25
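The expectation calculation can be checked numerically with exact (γ = 0) ℓ2 samples; this offline simulation is illustrative only, since the streaming algorithm never knows f directly.

```python
import random

def fk_from_l2_samples(f, k, t=4000, seed=0):
    """Estimate Fk by averaging T = F2 * R^(k-2) over t exact l2 samples
    (I, R): here P[I = i] = f_i^2 / F2 and R = f_I, so
    E[T] = sum_i (f_i^2 / F2) * F2 * f_i^(k-2) = Fk."""
    rng = random.Random(seed)
    F2 = sum(x * x for x in f)
    weights = [x * x for x in f]
    total = 0.0
    for _ in range(t):
        i = rng.choices(range(len(f)), weights=weights)[0]
        total += F2 * f[i] ** (k - 2)
    return total / t
```

The number of repetitions needed is governed by the range bound above: T never exceeds F2·F∞^(k−2) = O(n^(1−2/k)·Fk), which is where the n^(1−2/k) in the theorem comes from.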
◮ Basic Sampling: Can do basic sampling where i is selected with probability fi/m; this estimates quantities of the form ∑i g(fi).
◮ Count-Min: fi ≤ f̃i ≤ fi + ǫm.
◮ Count-Sketch: fi − ǫ√F2 ≤ f̃i ≤ fi + ǫ√F2.
◮ ℓp-Sampling: Selecting i with probability ∝ fi^p gives, e.g., an optimal Fk algorithm for k > 2.
25/25