Optimal Sampling from Distributed Streams



  1. Optimal Sampling from Distributed Streams. Qin Zhang. Joint work with Graham Cormode (AT&T), S. Muthukrishnan (Rutgers), and Ke Yi (HKUST). Sept. 17, 2010, MSRA.

  5. Reservoir sampling [Waterman ’??; Vitter ’85]. Problem: maintain a uniform sample (without replacement) of size s from a stream of n items, so that every subset of size s is equally likely to be the sample. Solution: when the i-th item arrives, with probability s/i use it to replace an item chosen uniformly at random from the current sample; with probability 1 − s/i, throw it away. Correctness: intuitive. Cost: O(s) space, O(1) time per item.
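The reservoir rule above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk; the function name `reservoir_sample` and the optional `rng` parameter are assumptions made for the example. The first s items are always kept, which matches the rule because s/i ≥ 1 for i ≤ s.

```python
import random

def reservoir_sample(stream, s, rng=random):
    """Return a uniform sample (without replacement) of up to s items."""
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= s:
            # The first s items always enter the sample (s/i >= 1).
            sample.append(item)
        elif rng.random() < s / i:
            # With probability s/i, overwrite a uniformly chosen slot.
            sample[rng.randrange(s)] = item
        # Otherwise (probability 1 - s/i) the item is thrown away.
    return sample
```

Calling `reservoir_sample(range(1000), 5)` returns five distinct items from the stream. Note that the rule needs the exact arrival index i.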

  8. Sampling from a sliding window [Babcock, Datar, Motwani, SODA’02; Gemulla, Lehner, SIGMOD’08; Braverman, Ostrovsky, Zaniolo, PODS’09]. Window length: w, the number of items in the sliding window. Two variants: time-based windows and sequence-based windows. Space: Θ(s log w). Time: Θ(log w).

  10. Sampling from distributed streams. Maintain a uniform sample (without replacement) of size s from k streams with a total of n items. Primary goal: communication. Secondary goal: space/time at the coordinator and at each site. Applications: Internet routers, sensor networks, distributed computing. [Figure: sites S_1, S_2, S_3, ..., S_k each observe a stream over time and communicate with a coordinator C.]

  14. Why existing solutions don’t work. When k = 1, reservoir sampling has communication Θ(s log n). When k ≥ 2, it has cost O(n), because it is costly to track i. Tracking i approximately? Then the sampling won’t be uniform. Key observation: we don’t have to know the exact size of the population in order to sample! [Figure: sites S_1, S_2, S_3, ..., S_k and coordinator C, as before.]
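One way to see where the Θ(s log n) figure for k = 1 comes from is to count how often the sample changes. The sketch below assumes the single site forwards an item to the coordinator exactly when reservoir sampling accepts it; that assumption and the helper name `expected_messages` are illustrative, not taken from the slides.

```python
def expected_messages(s, n):
    """Expected number of items forwarded by one site under reservoir sampling.

    The first min(s, n) items are always forwarded; item i > s is forwarded
    with probability s/i.
    """
    always_sent = min(s, n)
    return always_sent + sum(s / i for i in range(s + 1, n + 1))
```

The sum behaves like s · ln(n/s), so the total grows as s log n when s is much smaller than n.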

  17. Previous results on distributed streaming. There are a lot of heuristics in the database/networking literature, but random sampling has not been studied, even heuristically. Threshold monitoring, frequency moments [Cormode, Muthukrishnan, Yi, SODA’08]. Entropy [Arackaparambil, Brody, Chakrabarti, ICALP’08]. Heavy hitters and quantiles [Yi, Zhang, PODS’09]. Basic counting, heavy hitters, quantiles in sliding windows [Chan, Lam, Lee, Ting, STACS’10]. All of them are deterministic algorithms, or use randomized sketches as black boxes, and the tracking is “approximate”.

  20. Our results on random sampling

      window          | upper bounds               | lower bounds
      infinite        | O(k log_{k/s} n + s log n) | Ω(k log_{k/s} n + s log n)
      sequence-based  | O(ks log(w/s))             | Ω(ks log(w/ks))
      time-based      | O((k + s) log w)           | Ω(k + s log w)
      (per window)

      Applications:
      Heavy hitters and quantiles can be tracked in Õ(k + 1/ε²). This beats the deterministic bound Θ̃(k/ε) for k ≫ 1/ε. Also for sliding windows.
      ε-approximations in bounded VC dimensions: Õ(k + 1/ε²).
      ε-nets: Õ(k + 1/ε).
      ...
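To get a feel for when the randomized bound beats the deterministic one, the two expressions can be compared numerically. This is only a rough sketch: constants and logarithmic factors hidden by the Õ and Θ̃ notation are dropped, and the function names are made up for the example.

```python
def randomized_cost(k, eps):
    """k + 1/eps^2: the randomized bound with constants and log factors dropped."""
    return k + 1 / eps**2

def deterministic_cost(k, eps):
    """k/eps: the deterministic bound with constants and log factors dropped."""
    return k / eps

def randomized_wins(k, eps):
    """True when the simplified randomized expression is the smaller one."""
    return randomized_cost(k, eps) < deterministic_cost(k, eps)
```

With ε = 0.01, `randomized_wins(1000, 0.01)` is true while `randomized_wins(10, 0.01)` is false, in line with the condition k ≫ 1/ε.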

  21. ISWoR: the protocol. Site: always maintains an upper bound u (initialized to 1) and a lower bound l (initialized to 0), and only sends items whose rank lies in the range [l, u]. Rank: for each arriving item, generate a random number in [0, 1] as its rank.
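Random ranks are what make it possible to sample without knowing how many items have arrived. The sketch below shows the general idea with a standard technique: keep the s items with the smallest ranks. It is not the ISWoR protocol itself; the function name `sample_by_rank` and the heap-based bookkeeping are assumptions made for the illustration.

```python
import heapq
import random

def sample_by_rank(stream, s, rng=random):
    """Keep the s items with the smallest independent uniform ranks."""
    if s <= 0:
        return []
    # Max-heap on rank, simulated by storing negated ranks.
    # Entries are (-rank, arrival_index, item); the index breaks ties.
    heap = []
    for seq, item in enumerate(stream):
        rank = rng.random()  # rank in [0, 1), drawn independently per item
        entry = (-rank, seq, item)
        if len(heap) < s:
            heapq.heappush(heap, entry)
        elif rank < -heap[0][0]:
            # The new rank beats the largest rank currently kept.
            heapq.heapreplace(heap, entry)
    return [item for _, _, item in heap]
```

Because the ranks are independent and identically distributed, the s smallest belong to a uniformly random subset of the items, and the loop never consults the arrival count to decide what to keep.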

  22. ISWoR: the protocol. Site: always maintains an upper bound u (initialized to 1) and a lower bound l (initialized to 0), and only sends items whose rank lies in the range [l, u]. Coordinator: let m = (l + u)/2 and wait until either • the number of items received in the range [l, m] becomes ≥ s, then update each site with u = m; or • the number of items received in the range [m, u] becomes ≥ s, then update each site with l = m. Report: subsample s items from all items in [l, u].
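The site and coordinator rules listed above can be replayed in a small single-process simulation. This is a rough sketch under several assumptions that the slide does not state: all sites see bound updates instantly, the coordinator checks its two conditions once per received item, the lower-half condition is checked first, and each update costs k messages. The function name `simulate_iswor` and its return shape are hypothetical, and the sketch makes no claim about reproducing the bounds or the statistical guarantees of the full protocol.

```python
import random

def simulate_iswor(arrivals, k, s, rng=random):
    """Replay the slide's rules on an iterable of (site, item) arrivals.

    Returns (sample, (l, u), messages).
    """
    l, u = 0.0, 1.0
    kept = []      # (rank, item) pairs held by the coordinator
    messages = 0   # items sent by sites plus bound updates sent to sites
    for _site, item in arrivals:
        rank = rng.random()          # Rank: random number for each item
        if not (l <= rank <= u):     # Site: only send ranks inside [l, u]
            continue
        messages += 1
        kept.append((rank, item))
        m = (l + u) / 2
        lower = sum(1 for r, _ in kept if r <= m)
        upper = len(kept) - lower
        if lower >= s:               # assumption: lower half checked first
            u = m
        elif upper >= s:
            l = m
        else:
            continue
        messages += k                # assumption: one message per site
        kept = [(r, x) for r, x in kept if l <= r <= u]
    # Report: subsample s items from everything still inside [l, u].
    chosen = rng.sample(kept, min(s, len(kept)))
    return [x for _, x in chosen], (l, u), messages
```

A run such as `simulate_iswor([(i % 3, i) for i in range(200)], k=3, s=4)` returns four items together with the final range and a message count that can be compared against n.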

  23. ISWoR. [Figure: a run of the protocol with s = 4, starting from l = 0 and u = 1, with m = (l + u)/2 marking the midpoint of the current range.]
