Optimal Sampling from Distributed Streams
Graham Cormode AT&T Labs-Research Joint work with S. Muthukrishnan (Rutgers) Ke Yi (HKUST) Qin Zhang (HKUST)
Reservoir sampling [Waterman ??; Vitter 85]
Maintain a (uniform) sample (w/o replacement) of size s from a stream of n items
When the i-th item arrives:
  With probability 1 − s/i, throw it away
  With probability s/i, use it to replace an item in the current sample, chosen uniformly at random
Correctness: intuitive; every subset of size s has equal probability to be the sample
Space: O(s), time: O(1)
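The update rule above can be sketched in a few lines of Python (a minimal illustration; the function name and interface are not from the talk):

```python
import random

def reservoir_sample(stream, s, rng=random):
    """Maintain a uniform sample (w/o replacement) of size s from a stream."""
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= s:
            sample.append(item)               # the first s items fill the reservoir
        elif rng.random() < s / i:            # keep the i-th item w.p. s/i ...
            sample[rng.randrange(s)] = item   # ... evicting a uniformly random slot
    return sample
```

Each arrival touches O(1) state beyond the reservoir itself, which is where the O(s) space and O(1) time bounds come from.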
Sampling from sliding windows
[Babcock, Datar, Motwani, SODA'02; Gemulla, Lehner, SIGMOD'08; Braverman, Ostrovsky, Zaniolo, PODS'09]
[Figure: a stream over time with a sliding window of length W]
Time-based and sequence-based windows; w: number of items in the sliding window
Space: Θ(s log w), time: Θ(log w)
Maintain a (uniform) sample (w/o replacement) of size s from k streams with a total of n items
[Figure: k sites S1, ..., Sk, each observing a stream over time, reporting to a coordinator C]
Primary goal: communication
Secondary goal: space/time at coordinator/sites
Applications: Internet routers, sensor networks, distributed computing
When k = 1, reservoir sampling has communication Θ(s log n)
When k ≥ 2, reservoir sampling costs O(n), because it is costly to track i
Tracking i approximately? Then the sample won't be uniform
Key observation: we don't have to know the size of the population in order to sample!
A lot of heuristics in the database/networking literature
Threshold monitoring, frequency moments [Cormode, Muthukrishnan, Yi, SODA'08]
Entropy [Arackaparambil, Brody, Chakrabarti, ICALP'08]
Heavy hitters and quantiles [Yi, Zhang, PODS'09]
Basic counting, heavy hitters, quantiles in sliding windows [Chan, Lam, Lee, Ting, STACS'10]
All of them are deterministic algorithms, or use randomized sketches as black boxes
But random sampling has not been studied, even heuristically
window           upper bound           lower bound
infinite         O((k + s) log n)      Ω(k + s log n)
sequence-based   O(ks log(w/s))        Ω(ks log(w/ks))
time-based       O((k + s) log w)      Ω(k + s log w)
(sliding-window bounds are per window)

Applications:
Heavy hitters and quantiles can be tracked in Õ(k + 1/ε²)
Beats the deterministic bound Θ̃(k/ε) for k ≫ 1/ε
Also for sliding windows
ε-approximations in bounded VC dimension: Õ(k + 1/ε²); ε-nets: Õ(k + 1/ε); ...
[Figure: each item marked by independent fair coin flips; row i is a Bernoulli sample with probability 2^-i]
Conditioned upon a row having ≥ s active items, we can draw a sample from the active items
The coordinator could maintain a Bernoulli sample of size between s and O(s)
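One way to realize such rows (a sketch of the idea, not necessarily the paper's exact construction): give each item an independent geometric "height" by flipping a fair coin, so the items of height ≥ i form a Bernoulli sample with probability 2^-i, and row i+1 is a fair subsample of row i:

```python
import random

def height(rng=random):
    """Count consecutive heads, so that P(height >= i) = 2**-i."""
    h = 0
    while rng.random() < 0.5:
        h += 1
    return h

def assign_heights(items, rng=random):
    """Flip coins once per item; row i is then {x : heights[x] >= i}."""
    return {x: height(rng) for x in items}

def row(heights, i):
    """The items sampled at level i: a Bernoulli(2**-i) sample."""
    return [x for x, h in heights.items() if h >= i]
```

Because row i+1 is obtained from row i by independent fair coins, the rows are nested, which is exactly what lets the coordinator move between sampling probabilities.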
[Figure: k sites S1, ..., Sk reporting to coordinator C]
Initialize i = 0. In round i:
Sites send in every item w.p. 2^-i (this is a Bernoulli sample with prob. 2^-i)
Coordinator maintains a lower sample and an upper sample: each received item goes to either with equal prob. (the lower sample is a Bernoulli sample with prob. 2^-(i+1))
When the lower sample reaches size s, the coordinator broadcasts to advance to round i ← i + 1:
  Discard the upper sample
  Split the lower sample into a new lower sample and an upper sample
Communication cost of round i: O(k + s)
  Expect to receive O(s) sampled items before the round ends
  Broadcast to end the round: O(k)
Number of rounds: O(log(n/s))
  In round i, need Θ(s) items to be sampled to end the round
  Each item has prob. 2^-i to contribute: need Θ(2^i · s) items
Communication: O((k + s) log n); lower bound: Ω(k + s log n)
Site space: O(1), time: O(1)
Coordinator space: O(s), total time: O((k + s) log n)
[Figure: timeline showing expired windows, the frozen window, and the current window; the sliding window spans the unexpired part of the frozen window plus the current window]
Sample for sliding window = a subsample of the (unexpired) sample of the frozen window + a subsample of the sample of the current window
Key: as long as either Bernoulli sample has size ≥ s, we can subsample the one with the larger probability to match up their probabilities
Current window: run our infinite-window algorithm; a Bernoulli sample with prob. 2^-i such that size ≥ s
Frozen window: need to maintain the same guarantee, even as its items expire
[Figure: levels of sampled items over the frozen window; example with s = 2]
Keep all the levels? Would need O(w) communication
Instead, keep the most recent sampled items in each level, until s of them are also sampled at the next level. Total size: O(s log w)
Guaranteed: some level retains ≥ s sampled items covering the unexpired portion of the frozen window
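A sketch of this pruning rule for one frozen window (assuming distinct items; the function names and the coin-flip construction of levels are illustrative, not the paper's exact data structure):

```python
import random

def coin_height(rng=random):
    """Consecutive heads: an item reaches level i with prob. 2**-i."""
    h = 0
    while rng.random() < 0.5:
        h += 1
    return h

def build_levels(window, s, rng=random):
    """Level i holds the sampled items of height >= i, but only the most
    recent ones: those no older than the s-th most recent item that was
    also sampled at level i+1 (or the whole level if fewer than s exist)."""
    heights = [(x, coin_height(rng)) for x in window]   # window in arrival order
    levels, i = [], 0
    while True:
        level = [x for x, h in heights if h >= i]
        if not level:
            break
        levels.append(level)
        i += 1
    pruned = []
    for i, level in enumerate(levels):
        above = levels[i + 1] if i + 1 < len(levels) else []
        if len(above) >= s:
            cutoff = above[-s]                 # s-th most recent at level i+1
            pruned.append(level[level.index(cutoff):])   # keep a recent suffix
        else:
            pruned.append(level)               # small level: keep everything
    return pruned
```

Each retained level then has O(s) expected size, giving the O(s log w) total above, while still guaranteeing a level with ≥ s sampled items covering any unexpired suffix of the window.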
[Figure: level-sampling structure; example with s = 2]
Each site builds its own level-sampling structure for the current window until it freezes
The level-sampling structure needs O(s log w) space and O(1) time per item; necessary unless communication is Ω(w)
When the current window freezes:
  For each level, do a k-way merge to build the corresponding level of the global structure at the coordinator
  Total communication: O((k + s) log w)
Applications:
Heavy hitters and quantiles can be tracked in Õ(k + 1/ε²)
Beats the deterministic bound Θ̃(k/ε) for k ≫ 1/ε
Also for sliding windows
ε-approximations in bounded VC dimension: Õ(k + 1/ε²); ε-nets: Õ(k + 1/ε); ...
Is random sampling the best way to solve these problems?
New result: heavy hitters and quantiles can be tracked in Õ(k + √k/ε), using a different sampling method