Optimal Sampling from Distributed Streams



  1. Optimal Sampling from Distributed Streams. Graham Cormode (AT&T Labs-Research). Joint work with S. Muthukrishnan (Rutgers), Ke Yi (HKUST), and Qin Zhang (HKUST).

  2. Reservoir sampling [Waterman ’??; Vitter ’85]. Maintain a (uniform) sample (without replacement) of size s from a stream of n items, so that every subset of size s has equal probability of being the sample. When the i-th item arrives: with probability s/i, use it to replace an item in the current sample chosen uniformly at random; with probability 1 − s/i, throw it away. Correctness: intuitive. Space: O(s), time: O(1). (A minimal code sketch follows below.)
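For concreteness, here is a minimal Python sketch of the reservoir-sampling update described on this slide; the function name and the use of Python's random module are illustrative choices, not from the presentation.

```python
import random

def reservoir_sample(stream, s, rng=random):
    """Maintain a uniform sample (without replacement) of size s from a stream."""
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= s:
            sample.append(item)              # the first s items fill the reservoir
        elif rng.random() < s / i:           # with probability s/i ...
            sample[rng.randrange(s)] = item  # ... evict a uniformly chosen sample slot
        # otherwise the item is thrown away
    return sample

print(reservoir_sample(range(10_000), s=5))
```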

  3. Sampling from a sliding window [Babcock, Datar, Motwani, SODA’02; Gemulla, Lehner, SIGMOD’08; Braverman, Ostrovsky, Zaniolo, PODS’09]. [Figure: a stream timeline with a sliding window of length W.] Both time-based and sequence-based windows are considered. Space: Θ(s log w), where w is the number of items in the sliding window; time: Θ(log w).

  4. Sampling from distributed streams. Maintain a (uniform) sample (without replacement) of size s from k streams with a total of n items. Primary goal: communication. Secondary goal: space/time at the coordinator and the sites. [Figure: sites S_1, ..., S_k, each observing a stream, communicating with a coordinator C.] Applications: Internet routers, sensor networks, distributed computing.

  5. Why existing solutions don’t work. When k = 1, reservoir sampling has communication Θ(s log n). When k ≥ 2, reservoir sampling has cost O(n), because it is costly to track i (the total number of items seen so far across all sites). Tracking i approximately? Then the sampling won’t be uniform (e.g., an item arriving while the tracked count underestimates i would be kept with probability larger than s/i). Key observation: we don’t have to know the size of the population in order to sample!

  6. Previous results on distributed streaming. There are many heuristics in the database/networking literature, but random sampling has not been studied, even heuristically. Known results: threshold monitoring and frequency moments [Cormode, Muthukrishnan, Yi, SODA’08]; entropy [Arackaparambil, Brody, Chakrabarti, ICALP’08]; heavy hitters and quantiles [Yi, Zhang, PODS’09]; basic counting, heavy hitters, and quantiles in sliding windows [Chan, Lam, Lee, Ting, STACS’10]. All of these are deterministic algorithms, or use randomized sketches as black boxes.

  7. Our results on random sampling (communication bounds):

       window           upper bound         lower bound
       infinite         O((k + s) log n)    Ω(k + s log n)
       sequence-based   O(ks log(w/s))      Ω(ks log(w/ks))
       time-based       O((k + s) log w)    Ω(k + s log w)   (per window)

     Applications: heavy hitters and quantiles can be tracked in Õ(k + 1/ε²) communication, beating the deterministic bound Θ̃(k/ε) when k ≫ 1/ε; the same holds for sliding windows. Also ε-approximations in bounded VC dimension: Õ(k + 1/ε²); ε-nets: Õ(k + 1/ε); ... (a sketch of the heavy-hitter estimate from a sample follows below).
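To illustrate the heavy-hitters application (the reduction is not spelled out on these slides), a uniform sample of size on the order of 1/ε² lets the coordinator estimate every item's relative frequency to within ±ε with good probability; a hypothetical helper could then report heavy hitters directly from the sample.

```python
from collections import Counter

def heavy_hitters_from_sample(sample, phi, eps):
    """Report candidate phi-heavy hitters from a uniform sample.

    Assumes the sample is large enough (on the order of 1/eps^2, plus log
    factors for the failure probability) that each item's sample frequency is
    within +/- eps of its true relative frequency; reporting at threshold
    phi - eps then catches every true phi-heavy hitter and reports nothing of
    true frequency below phi - 2*eps.
    """
    counts = Counter(sample)
    return {x for x, c in counts.items() if c / len(sample) >= phi - eps}
```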

  8. The basic idea: binary Bernoulli sampling. [Figure: rows of random 0/1 bits associated with the stream items.] Conditioned upon a row having ≥ s active items, we can draw a sample from the active items. The coordinator could maintain a Bernoulli sample of size between s and O(s). (A sketch of the row assignment follows below.)
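One way to realize the rows, a sketch assuming (as the next slide states for the protocol) that row i should be a Bernoulli sample with probability 2^-i; the names below are illustrative:

```python
import random

def sample_level(rng=random):
    """Flip fair coins until the first tails; the number of heads is the level.
    An item reaches level >= i with probability 2**-i, so the items with
    level >= i ("row i") form a Bernoulli sample of the stream at rate 2**-i."""
    level = 0
    while rng.random() < 0.5:
        level += 1
    return level

# Build the conceptual rows for a tiny example stream.
stream = list("abcdefgh")
levels = {x: sample_level() for x in stream}
rows = {i: [x for x in stream if levels[x] >= i] for i in range(4)}
```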

  9. Sampling from an infinite window. Initialize i = 0. In round i: sites send in every item with probability 2^-i (so the received items form a Bernoulli sample with probability 2^-i). The coordinator maintains a lower sample and a higher sample; each received item goes to one of them with equal probability (so the lower sample is a Bernoulli sample with probability 2^-(i+1)). When the lower sample reaches size s, the coordinator broadcasts to advance to round i ← i + 1: it discards the higher sample and splits the lower sample into a new lower sample and a new higher sample. (A simulation sketch follows below.)
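A minimal single-process simulation of this protocol, assuming for simplicity that sites read the current round directly from the coordinator (on the slide they learn it through the broadcast); the class and function names are illustrative:

```python
import random

class Coordinator:
    """Maintains a lower and a higher Bernoulli sample, advancing the round
    when the lower sample reaches size s (the rule described on the slide)."""
    def __init__(self, s, rng=random):
        self.s, self.rng = s, rng
        self.round = 0                  # sites forward items with prob. 2**-round
        self.lower, self.higher = [], []

    def receive(self, item):
        # Each forwarded item goes to the lower or higher sample with equal prob.
        (self.lower if self.rng.random() < 0.5 else self.higher).append(item)
        while len(self.lower) >= self.s:
            self.advance_round()

    def advance_round(self):
        # Broadcast i <- i + 1 (not modeled here), discard the higher sample,
        # and split the old lower sample into a new lower and higher sample.
        self.round += 1
        old_lower, self.lower, self.higher = self.lower, [], []
        for item in old_lower:
            (self.lower if self.rng.random() < 0.5 else self.higher).append(item)

def site_observe(item, coord, rng=random):
    """A site forwards each arriving item with probability 2**-round."""
    if rng.random() < 2.0 ** -coord.round:
        coord.receive(item)

# Usage: the union of the lower and higher samples is a Bernoulli sample of
# everything seen so far, with between s and O(s) items in expectation.
coord = Coordinator(s=16)
for t in range(100_000):
    site_observe(t, coord)
print(coord.round, len(coord.lower) + len(coord.higher))
```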

  10. Sampling from an infinite window: Analysis. Communication cost of round i: O(k + s). The coordinator expects to receive O(s) sampled items before the round ends (each received item goes to the lower or higher sample with equal probability), and the broadcast to end the round costs O(k). Number of rounds: O(log(n/s)). In round i, ending the round needs Θ(s) items to be sampled, and each item contributes with probability 2^-i, so Θ(2^i s) items must arrive. Total communication: O((k + s) log n); lower bound: Ω(k + s log n). (The totals are combined in the short derivation below.)
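Combining the two counts on this slide into one expression (a restatement of the argument, not a new bound):

```latex
\underbrace{O\!\left(\log\frac{n}{s}\right)}_{\text{rounds}}
\cdot\Bigl(\underbrace{O(s)}_{\text{items forwarded per round}}
      +\underbrace{O(k)}_{\text{broadcast per round}}\Bigr)
  \;=\; O\!\bigl((k+s)\log\tfrac{n}{s}\bigr)
  \;\subseteq\; O\!\bigl((k+s)\log n\bigr),
```

which matches the stated upper bound; the remaining gap to the lower bound Ω(k + s log n) is whether the k term must also pay the log factor.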
