Optimal Sampling from Distributed Streams



slide-1
SLIDE 1

1-1

Optimal Sampling from Distributed Streams

Qin Zhang

Joint work with Graham Cormode (AT&T), S. Muthukrishnan (Rutgers), and Ke Yi (HKUST)

Sept. 17, 2010, MSRA

slide-5
SLIDE 5

2-4

Reservoir sampling [Waterman ’??; Vitter ’85]

Problem: Maintain a uniform sample (without replacement) of size s from a stream of n items. Every subset of size s has equal probability of being the sample.

Solution: When the i-th item arrives:

  • With probability 1 − s/i, throw it away.

  • With probability s/i, use it to replace an item in the current sample chosen uniformly at random.

Correctness: intuitive.

Cost: O(s) space, O(1) time per item.
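
The rule above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function name `reservoir_sample` and the optional `rng` argument are assumptions made for the example. For i ≤ s the probability s/i is at least 1, so the first s items simply fill the sample.

```python
import random

def reservoir_sample(stream, s, rng=None):
    """Return a uniform sample (without replacement) of size s from an iterable."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= s:
            # s/i >= 1 here, so the first s items always enter the sample.
            sample.append(item)
        elif rng.random() < s / i:
            # With probability s/i, replace a uniformly chosen sample slot.
            sample[rng.randrange(s)] = item
        # Otherwise (probability 1 - s/i) the item is thrown away.
    return sample

picked = reservoir_sample(range(1000), 5, random.Random(1))
```

The sample list never grows beyond s entries and each arrival costs one random draw (two when it is kept), which matches the O(s) space and O(1) time stated on the slide.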

slide-8
SLIDE 8

3-3

Sampling from a sliding window

[Figure: a timeline with a sliding window of length w]

Time-based window and sequence-based window

w: number of items in the sliding window

Space: Θ(s log w)

Time: Θ(log w)

[Babcock, Datar, Motwani, SODA’02; Gemulla, Lehner, SIGMOD’08; Braverman, Ostrovsky, Zaniolo, PODS’09]

slide-10
SLIDE 10

4-2

Sampling from distributed streams

Maintain a uniform sample (without replacement) of size s from k streams with a total of n items.

[Figure: sites S1, S2, S3, ..., Sk each receive a stream over time and communicate with a coordinator C]

Primary goal: communication

Secondary goal: space/time at the coordinator and at each site

Applications:

  • Internet routers

  • Sensor networks

  • Distributed computing

slide-14
SLIDE 14

5-4

Why existing solutions don’t work

[Figure: sites S1, S2, S3, ..., Sk receive streams over time and communicate with a coordinator C]

  • When k = 1, reservoir sampling has communication Θ(s log n).

  • When k ≥ 2, it has cost O(n), because it is costly to track i.

  • Tracking i approximately? Then the sampling won’t be uniform.

Key observation: We don’t have to know the exact size of the population in order to sample!
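
To see where the Θ(s log n) figure for k = 1 comes from, assume (this is an interpretation, not stated on the slide) that the single site runs reservoir sampling locally and sends a message only when an item enters the sample. The i-th item enters with probability min(1, s/i), so the expected number of messages is s + Σ_{i>s} s/i, roughly s + s ln(n/s). The helper names below are hypothetical.

```python
import random

def expected_messages(n, s):
    """Expected number of items that enter the reservoir: sum of min(1, s/i)."""
    return min(n, s) + sum(s / i for i in range(s + 1, n + 1))

def simulated_messages(n, s, rng):
    """Count how many arrivals would be forwarded in one simulated run."""
    return sum(1 for i in range(1, n + 1) if i <= s or rng.random() < s / i)

print(expected_messages(100_000, 10))
print(simulated_messages(100_000, 10, random.Random(0)))
```

With several sites the same rule needs the global count i at every arrival, which is exactly the quantity the slide says is costly to track.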

slide-17
SLIDE 17

6-3

Previous results on distributed streaming

A lot of heuristics in the database/networking literature.

  • Threshold monitoring, frequency moments [Cormode, Muthukrishnan, Yi, SODA’08]

  • Entropy [Arackaparambil, Brody, Chakrabarti, ICALP’08]

  • Heavy hitters and quantiles [Yi, Zhang, PODS’09]

  • Basic counting, heavy hitters, quantiles in sliding windows [Chan, Lam, Lee, Ting, STACS’10]

All of them are deterministic algorithms, or use randomized sketches as black boxes, and the tracking is “approximate”.

But random sampling has not been studied, even heuristically.

slide-20
SLIDE 20

7-3

Our results on random sampling

window           upper bounds                  lower bounds
infinite         O(k log_{k/s} n + s log n)    Ω(k log_{k/s} n + s log n)
sequence-based   O(ks log(w/s))                Ω(ks log(w/ks))
time-based       O((k + s) log w)              Ω(k + s log w)

(The sliding-window bounds are per window.)

Applications

  • Heavy hitters and quantiles can be tracked in Õ(k + 1/ε²). This beats the deterministic bound Θ̃(k/ε) for k ≫ 1/ε. Also for sliding windows.

  • ε-approximations in bounded VC dimensions: Õ(k + 1/ε²)

  • ε-nets: Õ(k + 1/ε)

  • . . .
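
As a rough illustration of why a random sample is enough for the quantile application: a uniform sample of about 1/ε² items estimates any quantile to within roughly ε in rank, up to constant and log factors. The sketch below is an assumption-laden example rather than the tracking algorithm from the talk; `estimate_quantile`, the data set, and the choice ε = 0.05 are all made up for the demonstration.

```python
import math
import random

def estimate_quantile(sample, phi):
    """Return the phi-quantile of the sample as an estimate for the full data."""
    ordered = sorted(sample)
    idx = min(len(ordered) - 1, int(phi * len(ordered)))
    return ordered[idx]

eps = 0.05
data = list(range(10_000))
m = math.ceil(1 / eps ** 2)          # sample size on the order of 1/eps^2
sample = random.Random(7).sample(data, m)
median_estimate = estimate_quantile(sample, 0.5)
print(median_estimate)               # expected to land near the true median of 5000
```

The sample size depends only on ε, not on n or k, which is why the tracking cost adds a 1/ε² term to k instead of multiplying k by 1/ε.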

slide-29
SLIDE 29

8-9

ISWoR

The protocol

Rank: For each arriving item, generate a random number in [0, 1] as its rank.

Site: Always maintains an upper bound u (initialized to 1) and a lower bound l (initialized to 0), and only sends those items with rank in the range [l, u].

Coordinator: Let m = (l + u)/2, and wait until either

  • the number of items received in the range [l, m] becomes ≥ s; then update each site with u = m; or

  • the number of items received in the range [m, u] becomes ≥ s; then update each site with l = m.

Report: Subsample s items from all items in [l, u].

[Figure: an example with s = 4, starting from l = 0, u = 1, m = (l + u)/2; the range [l, u] is halved as items arrive]

Like binary search :)

Communication cost: O((k + s) log n)
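
The protocol can be simulated in a single process to make the message flow concrete. The sketch below is one reading of the slide under stated assumptions: broadcasts take effect instantly, the coordinator checks the two halves once per received item (the "waits until" step), and the names `Coordinator` and `run_protocol` are hypothetical.

```python
import random

class Coordinator:
    def __init__(self, s, rng):
        self.s, self.rng = s, rng
        self.l, self.u = 0.0, 1.0
        self.held = []          # (rank, item) pairs with rank in [l, u]
        self.broadcasts = 0     # number of times [l, u] was halved

    def receive(self, rank, item):
        self.held.append((rank, item))
        m = (self.l + self.u) / 2
        low = [p for p in self.held if p[0] <= m]
        high = [p for p in self.held if p[0] > m]
        if len(low) >= self.s:      # enough items in [l, m]: shrink to it
            self.u, self.held = m, low
            self.broadcasts += 1
        elif len(high) >= self.s:   # enough items in (m, u]: shrink to it
            self.l, self.held = m, high
            self.broadcasts += 1

    def report(self):
        size = min(self.s, len(self.held))
        return [item for _, item in self.rng.sample(self.held, size)]

def run_protocol(arrivals, k, s, seed=0):
    """arrivals: (site, item) pairs in arrival order; returns (coordinator, messages)."""
    rng = random.Random(seed)
    coord = Coordinator(s, rng)
    item_messages = 0
    for _site, item in arrivals:
        rank = rng.random()                 # the site draws the item's rank
        if coord.l <= rank <= coord.u:      # the site's copy of [l, u]
            coord.receive(rank, item)
            item_messages += 1
    # Each halving is broadcast to all k sites.
    return coord, item_messages + k * coord.broadcasts

coord, messages = run_protocol([(i % 5, i) for i in range(10_000)], k=5, s=4, seed=1)
print(coord.report(), messages)
```

In this simulation the coordinator never holds more than about 2s items, and each halving costs k broadcast messages plus at most about 2s forwarded items, which is consistent with the O((k + s) log n) total on the slide.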

slide-33
SLIDE 33

9-4

The basic idea: Binary Bernoulli sampling

[Figure: rows of entries marked 1, indicating the active items in each row]

Conditioned upon a row having ≥ s active items, we can draw a sample from the active items.

The coordinator could maintain a Bernoulli sample of size between s and O(s).
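
One way to realize this idea, given here as an assumption rather than as the talk's exact construction, is to give each item a random level (the number of consecutive heads in fair coin flips) and call it active in row j when its level is at least j, so row j is a Bernoulli sample with rate 2^(-j). The class name `BernoulliSampler` and its methods are hypothetical.

```python
import random

class BernoulliSampler:
    def __init__(self, s, seed=0):
        self.s = s
        self.j = 0                      # current row; sampling rate is 2**-j
        self.rng = random.Random(seed)
        self.active = []                # (level, item) pairs with level >= j

    def _random_level(self):
        level = 0
        while self.rng.random() < 0.5:  # count consecutive heads
            level += 1
        return level

    def offer(self, item):
        """Process one arrival; return True if the item is active in row j."""
        level = self._random_level()
        if level < self.j:
            return False                # inactive: a site would not send it
        self.active.append((level, item))
        # Move to the next row while it still has at least s active items.
        while sum(1 for lv, _ in self.active if lv > self.j) >= self.s:
            self.j += 1
            self.active = [(lv, it) for lv, it in self.active if lv >= self.j]
        return True

    def sample(self):
        size = min(self.s, len(self.active))
        return [it for _, it in self.rng.sample(self.active, size)]

bs = BernoulliSampler(4, seed=2)
for x in range(2000):
    bs.offer(x)
print(bs.j, len(bs.active), bs.sample())
```

Advancing only when the next row already has s active items keeps at least s items on hand at all times; the O(s) upper bound holds in expectation in this sketch, not in the worst case.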

slide-38
SLIDE 38

10-5

Sampling from a sliding window: Idea

[Figure: timeline t divided into expired windows, the frozen window, and the current window; the sliding window covers part of the frozen window and the current window]

Sample for sliding window =

  • (1) a subsample of the (unexpired) sample of the frozen window (needs new ideas)

  • + (2) a subsample of the sample of the current window (by ISWoR)

(1) and (2) may be sampled at different rates, but that is fine as long as both have size ≥ min{s, # live items}.

The key issue: how to guarantee “both have sizes ≥ s” as items in the frozen window are expiring?

Solution: In the frozen window, find a good sample rate such that the sample size is ≥ s.
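
A small sketch of how two samples taken at different rates can be combined, assuming (this is not spelled out on the slide) that each part is a level-based Bernoulli sample in which row j keeps items with level ≥ j. Thinning both parts to the coarser of the two rows leaves a Bernoulli sample of the union at a single rate, from which s items can then be drawn. The function name `merge_bernoulli` is hypothetical.

```python
def merge_bernoulli(frozen, frozen_row, current, current_row):
    """Combine two level-based samples, each a list of (level, item) pairs.

    frozen holds items with level >= frozen_row, current holds items with
    level >= current_row. The result keeps only items active in the coarser row.
    """
    row = max(frozen_row, current_row)
    merged = [(lv, it) for lv, it in frozen + current if lv >= row]
    return merged, row

merged, row = merge_bernoulli(
    [(1, "a"), (3, "b")], 1,
    [(2, "c"), (2, "d"), (5, "e")], 2,
)
print(row, merged)
```

The part that was already at the coarser rate is untouched by the thinning, so if it held at least s items before the merge, the merged sample does too.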

slide-41
SLIDE 41

11-3

Dealing with the frozen window

[Figure: timeline t showing expired windows, the frozen window, the current window, and the sliding window, with the sampling levels drawn for s = 2]

Keep all the levels? That needs O(w) communication.

Instead, keep the most recent sampled items in a level until s of them are also sampled at the next level. Total size: O(s log w).

Guaranteed: There is a blue window with ≥ s sampled items that covers the unexpired portion of the frozen window.
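
The retention rule can be sketched as a per-level queue. This is an illustrative reading under assumptions: an item's level is the number of consecutive heads in fair coin flips, level i stores items with level ≥ i, and the oldest entries of level i are dropped once the newer ones already include s items that also appear at level i + 1. The class name `LevelSampler` is hypothetical.

```python
import random
from collections import deque

class LevelSampler:
    def __init__(self, s, seed=0):
        self.s = s                  # assumed to be at least 1
        self.rng = random.Random(seed)
        self.levels = []            # levels[i]: deque of (time, level, item), level >= i
        self.upper = []             # upper[i]: entries of levels[i] with level > i

    def add(self, time, item):
        lv = 0
        while self.rng.random() < 0.5:
            lv += 1
        while len(self.levels) <= lv:
            self.levels.append(deque())
            self.upper.append(0)
        for i in range(lv + 1):
            q = self.levels[i]
            q.append((time, lv, item))
            if lv > i:
                self.upper[i] += 1
            # Drop the oldest entry while the newer ones still contain
            # s items that are also sampled at level i + 1.
            while True:
                oldest_is_upper = 1 if q[0][1] > i else 0
                if self.upper[i] - oldest_is_upper < self.s:
                    break
                q.popleft()
                self.upper[i] -= oldest_is_upper

ls = LevelSampler(2, seed=3)
for t in range(5000):
    ls.add(t, t)
print([len(q) for q in ls.levels])
```

Each item is inserted into an expected two queues and every queue keeps only a short suffix, which is in line with the O(s log w) size claimed on the slide.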

slide-43
SLIDE 43

12-2

Dealing with the frozen window: The algorithm

Each site builds its own level-sampling structure for the current window until it freezes. This needs O(s log w) space and O(1) time per item. (Figure drawn for s = 2.)

When the current window freezes: for each level, do a k-way merge to build the level of the global structure at the coordinator. Total communication: O((k + s) log w).
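
The per-level k-way merge might look like the sketch below, which assumes each site's level i is a list of (time, level, item) entries sorted by increasing time. It walks all sites' entries from newest to oldest and stops once s of them also belong to level i + 1. The function name `merge_level` is hypothetical; in a real protocol the coordinator would pull entries from the sites lazily instead of receiving the full lists.

```python
import heapq

def merge_level(site_lists, i, s):
    """Build level i of the global structure from the sites' level-i lists."""
    newest_first = heapq.merge(
        *[reversed(entries) for entries in site_lists],
        key=lambda e: e[0],
        reverse=True,
    )
    kept, upper = [], 0
    for entry in newest_first:
        kept.append(entry)
        if entry[1] > i:            # entry is also sampled at level i + 1
            upper += 1
            if upper == s:
                break
    kept.reverse()                  # back to increasing time order
    return kept

site_a = [(1, 0, "a"), (4, 1, "d")]
site_b = [(2, 1, "b"), (3, 0, "c")]
print(merge_level([site_a, site_b], 0, 2))
```

Stopping as soon as s next-level entries are seen is what keeps the per-level traffic small, in the spirit of the O((k + s) log w) total on the slide.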

slide-46
SLIDE 46

13-3

Other results

Similar results hold for sampling with replacement (WR)

There is a simple reduction from sampling WR to sampling WoR, but it needs to know n, so some new ideas are needed.

Processing time per item is another complicated issue for WR. In the end we can get O(1), but the solution is complicated.

Experiments show that our algorithms work well.

slide-50
SLIDE 50

14-4

Future directions

Direct applications

  • Heavy hitters and quantiles can be tracked in Õ(k + 1/ε²). This beats the deterministic bound Θ̃(k/ε) for k ≫ 1/ε. Also for sliding windows.

  • ε-approximations in bounded VC dimensions: Õ(k + 1/ε²)

  • ε-nets: Õ(k + 1/ε)

  • . . .

Is random sampling the best way to solve these problems?

New result: Heavy hitters and quantiles can be tracked in Õ(k + √k/ε), using a different sampling method.

Other problems: range-counting, extent measures, etc.

slide-53
SLIDE 53

15-3

Multiparty private message CC

Previously, multiparty communication complexity was mainly used for other applications, in models such as:

  • Number on the forehead

  • Public message

  • One-way communication

But surprisingly, the most general, natural setting, the “private message model”, has not been studied!

Possible reason: before the “distributed streaming model”, there was no direct application.

Now is the time!

slide-54
SLIDE 54

16-1

The End

THANK YOU

Q and A