Optimal Sampling from Distributed Streams. Graham Cormode, AT&T Labs-Research.



SLIDE 1

Optimal Sampling from Distributed Streams

Graham Cormode AT&T Labs-Research Joint work with S. Muthukrishnan (Rutgers) Ke Yi (HKUST) Qin Zhang (HKUST)

SLIDES 2-4

Reservoir sampling [Waterman ’??; Vitter ’85]

Maintain a (uniform) sample (w/o replacement) of size s from a stream of n items.
When the i-th item arrives:
  with probability 1 − s/i, throw it away;
  with probability s/i, use it to replace an item in the current sample, chosen uniformly at random.
Correctness: intuitive. Every subset of size s has equal probability of being the sample.
Space: O(s), time: O(1).
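The update rule above can be sketched in a few lines of Python (an illustration written for this transcript, not code from the slides; the function and parameter names are ours):

```python
import random

def reservoir_sample(stream, s, rng=random):
    """Maintain a uniform sample of size s, without replacement, in one pass."""
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= s:
            sample.append(item)               # the first s items fill the reservoir
        elif rng.random() < s / i:            # keep the i-th item w.p. s/i
            sample[rng.randrange(s)] = item   # replace a uniformly random slot
    return sample
```

If the stream has fewer than s items, the whole stream is returned.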

SLIDES 5-7

Sampling from a sliding window
[Babcock, Datar, Motwani, SODA’02; Gemulla, Lehner, SIGMOD’08; Braverman, Ostrovsky, Zaniolo, PODS’09]

Window length: W; both time-based and sequence-based windows.
Space: Θ(s log w), time: Θ(log w), where w is the number of items in the sliding window.
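One standard way to realize sliding-window sampling, sketched here for sample size s = 1 using the priority-sampling idea from the cited Babcock, Datar, Motwani paper (this simplified single-stream code and its names are ours, not from the slides):

```python
import random
from collections import deque

def sliding_window_sampler(stream, w, rng=random):
    """Yield, after each arrival, one uniform sample from the last w items.

    Each item gets a random priority; the sample is the minimum-priority
    item in the window. Only items whose priority is smaller than every
    later item's priority need to be kept (a monotonic deque), which has
    O(log w) expected size.
    """
    cand = deque()  # (index, item, priority), priorities strictly increasing
    for i, item in enumerate(stream):
        p = rng.random()
        while cand and cand[-1][2] > p:   # a later, smaller priority dominates
            cand.pop()
        cand.append((i, item, p))
        while cand[0][0] <= i - w:        # drop expired candidates
            cand.popleft()
        yield cand[0][1]                  # min-priority unexpired item
```

The current item is always a candidate, so the deque is never empty.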

SLIDES 8-9

Sampling from distributed streams

Maintain a (uniform) sample (w/o replacement) of size s from k streams with a total of n items. [Figure: sites S1, S2, S3, ..., Sk send messages to a coordinator C.]
Primary goal: communication. Secondary goal: space/time at the coordinator and the sites.
Applications: Internet routers, sensor networks, distributed computing.

SLIDES 10-13

Why existing solutions don’t work

When k = 1, reservoir sampling has communication Θ(s log n).
When k ≥ 2, reservoir sampling has cost O(n), because it is costly to track i, the global item count, across sites.
Tracking i approximately? Then the sample won’t be uniform.
Key observation: we don’t have to know the size of the population in order to sample!

SLIDES 14-16

Previous results on distributed streaming

A lot of heuristics in the database/networking literature.
Threshold monitoring, frequency moments [Cormode, Muthukrishnan, Yi, SODA’08]
Entropy [Arackaparambil, Brody, Chakrabarti, ICALP’08]
Heavy hitters and quantiles [Yi, Zhang, PODS’09]
Basic counting, heavy hitters, quantiles in sliding windows [Chan, Lam, Lee, Ting, STACS’10]
All of them are deterministic algorithms, or use randomized sketches as black boxes.
But random sampling has not been studied, even heuristically.

SLIDES 17-19

Our results on random sampling

window           upper bound        lower bound
infinite         O((k + s) log n)   Ω(k + s log n)
sequence-based   O(ks log(w/s))     Ω(ks log(w/ks))   (per window)
time-based       O((k + s) log w)   Ω(k + s log w)    (per window)

Applications:
Heavy hitters and quantiles can be tracked in Õ(k + 1/ε²); beats the deterministic bound Θ̃(k/ε) for k ≫ 1/ε; also for sliding windows.
ε-approximations in bounded VC dimension: Õ(k + 1/ε²); ε-nets: Õ(k + 1/ε); ...

SLIDES 20-23

The basic idea: Binary Bernoulli sampling

[Figure: each item flips fair coins; the items still active after j flips form a Bernoulli sample with probability 2^-j.]
Conditioned upon a row having ≥ s active items, we can draw a sample from the active items.
The coordinator could maintain a Bernoulli sample of size between s and O(s).
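The coin-flip table can be simulated directly (a minimal sketch for this transcript; the function names are ours):

```python
import random

def bernoulli_levels(stream, max_level, rng=random):
    """Assign each item the number of consecutive heads it flips.

    The items with level >= j then form a Bernoulli(2^-j) sample
    of the stream, and the levels are nested.
    """
    levels = []
    for item in stream:
        j = 0
        while j < max_level and rng.random() < 0.5:
            j += 1
        levels.append((item, j))
    return levels

def sample_at_level(levels, j):
    """Items active in row j, i.e. a Bernoulli(2^-j) sample."""
    return [item for item, lv in levels if lv >= j]
```

Row 0 always contains every item, and each higher row is a subset of the one below it.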

SLIDES 24-26

Sampling from an infinite window

Initialize i = 0. In round i:
Sites send in every item w.p. 2^-i. (This is a Bernoulli sample with prob. 2^-i.)
The coordinator maintains a lower sample and a higher sample: each received item goes to either with equal prob. (The lower sample is a Bernoulli sample with prob. 2^-(i+1).)
When the lower sample reaches size s, the coordinator broadcasts to advance to round i ← i + 1:
  discard the higher sample;
  split the lower sample into a new lower sample and a higher sample.
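The round protocol can be sketched as a single-process simulation (an illustration written for this transcript, with the site and coordinator collapsed into one class; the class and method names are ours):

```python
import random

class InfiniteWindowSampler:
    """Sketch of the protocol: sites forward each item w.p. 2^-i; the
    coordinator splits arrivals into a lower and a higher sample and
    advances the round when the lower sample reaches size s."""

    def __init__(self, s, rng=None):
        self.s = s
        self.i = 0                       # current round / sampling level
        self.lower, self.higher = [], []
        self.rng = rng or random.Random()

    def site_offer(self, item):
        # a site forwards the item with probability 2^-i
        if self.rng.random() < 2.0 ** -self.i:
            self._receive(item)

    def _receive(self, item):
        bucket = self.lower if self.rng.random() < 0.5 else self.higher
        bucket.append(item)
        while len(self.lower) >= self.s:
            self.i += 1                  # broadcast: advance to round i+1
            old = self.lower             # old higher sample is discarded
            self.lower, self.higher = [], []
            for x in old:                # resplit the old lower sample
                dest = self.lower if self.rng.random() < 0.5 else self.higher
                dest.append(x)

    def sample(self):
        # lower + higher together form a Bernoulli(2^-i) sample; any s of
        # them (once available) give a uniform sample without replacement
        pool = self.lower + self.higher
        return self.rng.sample(pool, min(self.s, len(pool)))
```

After the first round ends, the pool never drops below s items, since each round boundary replaces it with the old lower sample of size ≥ s.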

SLIDES 27-31

Sampling from an infinite window: Analysis

Communication cost of round i: O(k + s).
  The coordinator expects to receive O(s) sampled items before the round ends (each goes to the lower or higher sample with equal prob.).
  Broadcast to end the round: O(k).
Number of rounds: O(log(n/s)).
  In round i, Θ(s) items must be sampled to end the round; each item contributes w.p. 2^-i, so Θ(2^i · s) items are needed.
Communication: O((k + s) log n). Lower bound: Ω(k + s log n).
Site space: O(1), time: O(1). Coordinator space: O(s), total time: O((k + s) log n).

SLIDES 32-35

Sampling from a sliding window: Idea

[Figure: timeline partitioned into expired windows, the frozen window, and the current window; the sliding window covers the unexpired part of the frozen window plus the current window.]
Sample for the sliding window = a subsample of the (unexpired) sample of the frozen window + a subsample of the sample of the current window.
Key: as long as either Bernoulli sample has size ≥ s, we can subsample the sample with the larger probability to match up their probabilities.
Current window: run our infinite-window algorithm; this gives a Bernoulli sample with prob. 2^-i of size ≥ s.
Frozen window: need the same guarantee.
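The key step, matching up the probabilities of two Bernoulli samples by thinning the denser one, can be sketched as follows (our names; the slides state only the idea):

```python
import random

def merge_bernoulli(sample_a, p_a, sample_b, p_b, rng=random):
    """Combine a Bernoulli(p_a) sample and a Bernoulli(p_b) sample of two
    disjoint populations into one Bernoulli(min(p_a, p_b)) sample of the
    union, by thinning the sample taken with the larger probability."""
    p = min(p_a, p_b)
    keep_a = [x for x in sample_a if rng.random() < p / p_a]
    keep_b = [x for x in sample_b if rng.random() < p / p_b]
    return keep_a + keep_b, p
```

When the probabilities already agree, no thinning happens and the samples are simply concatenated.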

SLIDES 36-38

Dealing with the frozen window

Keep all the levels? That needs O(w) communication.
Instead, keep the most recent sampled items in a level until s of them are also sampled at the next level. Total size: O(s log w). [Figure illustrates the case s = 2.]
Guaranteed: there is a window (blue in the figure) with ≥ s sampled items that covers the unexpired portion of the frozen window.

SLIDES 39-40

Dealing with the frozen window: The algorithm

Each site builds its own level-sampling structure for the current window until it freezes. [Figure illustrates the case s = 2.]
The level-sampling structure needs O(s log w) space and O(1) time per item; it is necessary unless communication is Ω(w).
When the current window freezes: for each level, do a k-way merge to build that level of the global structure at the coordinator. Total communication: O((k + s) log w).
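A per-site level-sampling structure might look like the sketch below. This is our reading of the retention rule on the previous slide (level j keeps only items at least as recent as the s-th newest item of level j+1); the class and its details are assumptions, not code from the talk:

```python
import random

class LevelSampling:
    """Level j holds a Bernoulli(2^-j) subsample of the window; trimming
    keeps each level's recent suffix, so the total size is O(s log w)."""

    def __init__(self, s, max_level, rng=None):
        self.s = s
        self.rng = rng or random.Random()
        self.levels = [[] for _ in range(max_level + 1)]  # lists of (seq, item)
        self.seq = 0

    def add(self, item):
        self.seq += 1
        j = 0
        while True:
            self.levels[j].append((self.seq, item))
            if j + 1 >= len(self.levels) or self.rng.random() >= 0.5:
                break
            j += 1          # survive to the next level with a fair coin flip
        self._trim()

    def _trim(self):
        # level j may discard items older than the s-th newest item of
        # level j+1: a sparser sample of that older suffix exists above
        for j in range(len(self.levels) - 1):
            above = self.levels[j + 1]
            if len(above) >= self.s:
                cutoff = above[-self.s][0]
                self.levels[j] = [e for e in self.levels[j] if e[0] >= cutoff]
```

At the freeze, the coordinator would merge the k per-site structures level by level, as the slide describes.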

SLIDES 41-43

Future directions

Applications: heavy hitters and quantiles can be tracked in Õ(k + 1/ε²), beating the deterministic bound Θ̃(k/ε) for k ≫ 1/ε (also for sliding windows); ε-approximations in bounded VC dimension: Õ(k + 1/ε²); ε-nets: Õ(k + 1/ε); ...
Is random sampling the best way to solve these problems?
New result: heavy hitters and quantiles can be tracked in Õ(k + √k/ε), using a different sampling method.

SLIDE 44

The End

THANK YOU

Q and A