Sublinear Algorithms for Big Data, Part 4: Random Topics
Qin Zhang
Topic 3: Random sampling in distributed data streams
(based on a paper with Cormode, Muthukrishnan and Yi, PODS’10, JACM’12)
Distributed streaming
Motivated by database/networking applications: environment monitoring, network monitoring, sensor networks, cloud computing.

Prior work:
- Adaptive filters [Olston, Jiang, Widom, SIGMOD'03]
- A generic geometric approach [Sharfman et al., SIGMOD'06]
- Prediction models [Cormode, Garofalakis, Muthukrishnan, Rastogi, SIGMOD'05]
Reservoir sampling [Waterman '??; Vitter '85]

Maintain a (uniform) sample (without replacement) of size s from a stream of n items: every subset of size s has equal probability of being the sample. When the i-th item arrives:
- with probability 1 − s/i, throw it away;
- with probability s/i, use it to replace an item in the current sample chosen uniformly at random.
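The update rule above can be sketched in a few lines of Python (`reservoir_sample` is an illustrative name; any iterable stands in for the stream):

```python
import random

def reservoir_sample(stream, s, rng=random):
    """Keep a uniform size-s sample (without replacement) of a stream."""
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= s:
            sample.append(item)              # the first s items fill the reservoir
        elif rng.random() < s / i:           # keep the i-th item w.pr. s/i
            sample[rng.randrange(s)] = item  # evict a uniformly chosen entry
    return sample
```

For example, `reservoir_sample(range(1000), 10)` returns 10 distinct items, with every size-10 subset equally likely; if the stream has fewer than s items, all of them are kept.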
Reservoir sampling from distributed streams

[Figure: sites S1, S2, ..., Sk streaming over time to a coordinator C.]

- When k = 1, reservoir sampling has cost Θ(s log n).
- When k ≥ 2, reservoir sampling has cost O(n), because it is costly to track i (the global item count).
- Tracking i approximately? Then the sampling won't be uniform.

Key observation: we don't have to know the size of the population in order to sample!
Basic idea: binary Bernoulli sampling

[Figure: rows of a binary hierarchy; a "1" marks an item active in that row.]

- Conditioned upon a row having ≥ s active items, we can draw a sample from the active items.
- The coordinator can thus maintain a Bernoulli sample of size between s and O(s).
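A minimal sketch of the idea, assuming each item gets a geometric "level" from fair coin flips so that it is active in row i with probability 2^−i (function names are illustrative):

```python
import random

def assign_level(rng=random):
    """Flip fair coins; the number of heads before the first tails is the level."""
    lvl = 0
    while rng.random() < 0.5:
        lvl += 1
    return lvl

def row_sample(levels, i):
    """Row i's active items: those with level >= i (each active w.pr. 2^-i)."""
    return [x for x, lvl in levels.items() if lvl >= i]
```

With `levels = {x: assign_level() for x in items}`, any row i that has ≥ s active items yields a uniform sample by drawing from `row_sample(levels, i)`.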
Random sampling – Algorithm [with Cormode, Muthukrishnan and Yi, PODS'10, JACM'12]

Setting: k sites S1, ..., Sk stream items to a coordinator C.

Initialize i = 0. In epoch i:
- Sites send in every item with probability 2^−i.
- The coordinator maintains a lower sample and an upper sample; each received item goes to one of them with equal probability. (Hence each item is included in the lower sample with probability 2^−(i+1).)
- When the lower sample reaches size s, the coordinator:
  - broadcasts to the k sites to advance to epoch i ← i + 1,
  - discards the upper sample,
  - randomly splits the lower sample into a new lower sample and a new upper sample.

Correctness:
(1) In epoch i, each item is maintained at C with probability 2^−i.
(2) At least s items are always maintained at C.
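The protocol can be simulated in a single process as a sketch (the real algorithm runs the first branch at the k sites and the rest at the coordinator; `distributed_sample` is an illustrative name):

```python
import random

def distributed_sample(stream, s, rng=random):
    """Epoch-based sampling: in epoch i an item survives w.pr. 2^-i and
    then joins the lower or upper sample by a fair coin flip."""
    i = 0                                    # current epoch
    lower, upper = [], []
    for item in stream:
        if rng.random() >= 2.0 ** -i:        # a site would discard it locally
            continue
        (lower if rng.random() < 0.5 else upper).append(item)
        while len(lower) >= s:               # epoch change at the coordinator
            i += 1
            upper = []                       # discard the upper sample
            kept = []
            for x in lower:                  # re-split the lower sample
                (kept if rng.random() < 0.5 else upper).append(x)
            lower = kept
    return lower + upper                     # every item kept w.pr. 2^-i
```

After the first epoch change, lower + upper always holds at least s items, matching correctness property (2); drawing s of them uniformly gives the final sample.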
A running example

Four sites S1, S2, S3, S4 and a coordinator C; maintain s = 3 samples. In Epoch 0 (p = 1), every arriving item (1, 2, 3, 4, 5) is sent to the coordinator and placed in the lower or upper sample by a fair coin flip. When the lower sample reaches size 3, the coordinator:
- discards the upper sample,
- splits the lower sample into a new lower and upper sample,
- advances to Epoch 1.
A running example (cont.)

In Epoch 1 (p = 1/2), each site forwards an arriving item only with probability 1/2: items 6 and 9 are discarded locally, while items 7, 8, and 10 are sent in and split between the lower and upper samples. When the lower sample again reaches size 3, the coordinator discards the upper sample, splits the lower sample, and advances to Epoch 2 (p = 1/4), where even more items will be discarded locally.

Intuition: maintain a sampling probability p ≈ s/n at each site (n: total # items) without knowing n.
Random sampling – Analysis

Messages sent: O(k + s) messages per epoch, times O(log n) epochs, gives O((k + s) log n) messages in total.
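The count can be written out as a short calculation (a sketch of the argument, not the paper's full proof):

```latex
\text{cost}
\;=\;
\underbrace{O(k+s)}_{\substack{\text{per epoch: } O(s) \text{ expected receipts}\\ \text{plus one broadcast to } k \text{ sites}}}
\;\times\;
\underbrace{O(\log n)}_{\substack{\text{epochs: the send probability } 2^{-i}\\ \text{halves each epoch and stays } \gtrsim s/n}}
\;=\;
O\big((k+s)\log n\big).
```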
Random sampling – Analysis and experiments

The bound can be (1) improved to Θ(k log_{k/s} n + s log n), and (2) extended to sliding-window cases.

Experiments on the real data set from the 1998 World Cup logs, comparing theory and practice:
- Basic case, cost vs. sample size (n = 7,000,000, k = 128): with sample size s = 128, only about 4,000 of the 7,000,000 items are sent.
- Time-based sliding window, cost vs. sample size (n = 320,000, k = 128).
- Time-based sliding window, cost vs. window size (s = 128, k = 128).
[Plots: log2(cost) vs. log2 s and vs. window size.]
Sampling from a (time-based) sliding window

[Figure: timeline with expired windows, the frozen window, and the current window.]

Sample for the sliding window = (1) a subsample of the (unexpired) sample of the frozen window (this needs new ideas), plus (2) a subsample of the sample of the current window (maintained by ISWoR).

(1) and (2) may be sampled at different rates, but as long as both have size ≥ min{s, # live items}, that is fine.

The key issue: how to guarantee that both have size ≥ s, as items in the frozen window keep expiring?

Solution: in the frozen window, find a good sampling rate such that the sample size is ≥ s.
Dealing with the frozen window

Keep all the levels? That would need O(w) communication (w: window size).

Instead: keep the most recent sampled items in each level until s of them are also sampled at the next level. Total size: O(s log w). [Figure: levels for s = 2.]

Guarantee: there is a window at some level (shown in blue in the figure) with ≥ s sampled items that covers the unexpired portion of the frozen window.
Dealing with the frozen window: The algorithm

Each site builds its own level-sampling structure for the current window until it freezes (the figure shows s = 2). This needs O(s log w) space and O(1) time per item.

When the current window freezes: for each level, do a k-way merge to build that level of the global structure at the coordinator. Total communication: O((k + s) log w).
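A single-site sketch of the level-sampling structure, under the assumption that level i holds each item with probability 2^−i via a geometric level per item; timestamps stand in for items, and the O(1)-time update is simplified to a per-item sweep for clarity:

```python
import random
from collections import defaultdict

def build_level_structure(n, s, rng=random):
    """Per-site sketch: item t is stored at levels 0..L(t) for a geometric
    L(t), so level i holds each item w.pr. 2^-i.  Each level is truncated
    to the items at least as recent as the s-th newest item of the next
    level, keeping the total size O(s log w)."""
    levels = defaultdict(list)       # level -> kept timestamps, oldest first
    for t in range(n):
        lvl = 0
        while rng.random() < 0.5:    # geometric level for item t
            lvl += 1
        for i in range(lvl + 1):
            levels[i].append(t)
        for i in sorted(levels):     # truncate each level against the next
            nxt = levels.get(i + 1, [])
            if len(nxt) >= s:
                cutoff = nxt[-s]     # s-th most recent at level i+1
                levels[i] = [x for x in levels[i] if x >= cutoff]
    return dict(levels)
```

A query for the frozen window then picks the lowest level whose kept window still covers the unexpired portion and has ≥ s sampled items.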