Optimal Sampling from Distributed Streams

Qin Zhang
Joint work with Graham Cormode (AT&T), S. Muthukrishnan (Rutgers), and Ke Yi (HKUST)

Sept. 17, 2010, MSRA
Reservoir sampling [Waterman ??; Vitter 85]

Problem: Maintain a (uniform) sample (w/o replacement) of size s from a stream of n items; every subset of size s has equal probability of being the sample.

Solution: When the i-th item arrives:
- With probability 1 − s/i, throw it away.
- With probability s/i, use it to replace an item in the current sample, chosen uniformly at random.

Correctness: intuitive.
Cost: space O(s), time O(1) per item.
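The update rule above is short in code; here is a minimal single-stream sketch (the function name and the demo stream are illustrative, not from the talk):

```python
import random

def reservoir_sample(stream, s):
    """Maintain a uniform without-replacement sample of size s."""
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= s:
            sample.append(item)                  # the first s items are always kept
        elif random.random() < s / i:            # keep the i-th item with prob. s/i
            sample[random.randrange(s)] = item   # evict a uniformly chosen victim
    return sample

sample = reservoir_sample(range(1000), 5)        # a uniform 5-subset of 0..999
```

Only the current sample and the counter i are stored, which is where the O(s) space and O(1) time per item come from.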
Sampling from sliding windows
[Babcock, Datar, Motwani, SODA'02; Gemulla, Lehner, SIGMOD'08; Braverman, Ostrovsky, Zaniolo, PODS'09]

Both time-based and sequence-based windows have been studied; w denotes the number of items in the sliding window.

Space: Θ(s log w), time: Θ(log w) per item.
The distributed streaming model

Maintain a (uniform) sample (w/o replacement) of size s from k streams S1, ..., Sk with a total of n items, where each stream is observed by a site that communicates with a central coordinator C.

Primary goal: communication. Secondary goal: space/time at the coordinator and sites.

Applications: Internet routers, sensor networks, distributed computing.
When k = 1, reservoir sampling has communication cost Θ(s log n).
When k ≥ 2, it has cost O(n), because it is costly to track the global count i.
Tracking i approximately? Then the sample won't be uniform.

Key observation: We don't have to know the exact size of the population in order to sample!
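One way to realize this observation: attach an independent uniform rank to every item and keep the s smallest-ranked items. By symmetry these form a uniform sample w/o replacement, and the stream length n is never consulted. A sketch (names are illustrative):

```python
import heapq
import random

def rank_sample(stream, s):
    """Keep the s items with the smallest random ranks.
    By symmetry this is a uniform sample w/o replacement,
    and the total stream length is never needed."""
    heap = []                              # max-heap via negated ranks
    for item in stream:
        r = random.random()                # the item's uniform rank in [0, 1]
        if len(heap) < s:
            heapq.heappush(heap, (-r, item))
        elif r < -heap[0][0]:              # beats the largest kept rank
            heapq.heapreplace(heap, (-r, item))
    return [item for _, item in heap]
```

The distributed protocols in this talk build on exactly this rank viewpoint: only the threshold on ranks needs to be coordinated, not the count i.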
Previous work on distributed tracking

A lot of heuristics in the database/networking literature:
- Threshold monitoring, frequency moments [Cormode, Muthukrishnan, Yi, SODA'08]
- Entropy [Arackaparambil, Brody, Chakrabarti, ICALP'08]
- Heavy hitters and quantiles [Yi, Zhang, PODS'09]
- Basic counting, heavy hitters, quantiles in sliding windows [Chan, Lam, Lee, Ting, STACS'10]

All of them are deterministic algorithms, or use randomized sketches as black boxes, and the tracking is "approximate". But random sampling has not been studied, even heuristically.
Our results (communication bounds):

window           upper bound                   lower bound
infinite         O(k log_{k/s} n + s log n)    Ω(k log_{k/s} n + s log n)
sequence-based   O(ks log(w/s))                Ω(ks log(w/ks))
time-based       O((k + s) log w) per window   Ω(k + s log w) per window

Applications:
- Heavy hitters and quantiles can be tracked in Õ(k + 1/ε²), beating the deterministic bound Θ̃(k/ε) for k ≫ 1/ε; also for sliding windows.
- ε-approximations in bounded VC dimension: Õ(k + 1/ε²); ε-nets: Õ(k + 1/ε); ...
The protocol

Rank: for each arriving item, generate a random number in [0, 1] as its rank.

Site: always maintains an upper bound u (initialized to 1) and a lower bound l (initialized to 0), and forwards only items whose ranks are relevant to the interval [l, u].

Coordinator: lets m = (l + u)/2 and waits until it can tell which half of [l, u] matters; it then either updates each site with u = m or updates each site with l = m.

Report: subsample s items from all items with rank in [l, u].

(animation: s = 4; the interval [l, u] is repeatedly halved around m)

Like binary search :)

Communication cost: O((k + s) log n)
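The exact trigger conditions are shown only in the slide's animation, so the sketch below is a simplified one-sided variant: it only shrinks u (never raises l), and it halves once the coordinator holds c·s forwarded items, which is an illustrative choice rather than the paper's rule. It still shows why communication stays polylogarithmic:

```python
import random

def distributed_sample(streams, s, c=4):
    """Simplified one-sided sketch of the binary-search protocol.
    Each site forwards an item iff its random rank is below u; when the
    coordinator holds c*s forwarded items it halves u, discards items
    above the new u, and broadcasts u to all k sites.  (The paper's
    protocol also maintains a lower bound l; this trigger is illustrative.)"""
    k = len(streams)
    u = 1.0
    held = []                  # (rank, item) pairs at the coordinator
    messages = 0
    iters = [iter(st) for st in streams]
    live = list(iters)
    while live:                # round-robin to mimic concurrent arrivals
        nxt = []
        for it in live:
            try:
                item = next(it)
            except StopIteration:
                continue
            nxt.append(it)
            r = random.random()            # the site ranks the item...
            if r < u:                      # ...and forwards it if rank < u
                messages += 1
                held.append((r, item))
                while len(held) >= c * s:  # coordinator halves the bound
                    u /= 2
                    held = [(rk, x) for rk, x in held if rk < u]
                    messages += k          # broadcast the new u
        live = nxt
    held.sort()
    return [x for _, x in held[:s]], messages
```

Each halving roughly halves the forwarding rate, so only O(s) items are forwarded per level and only O(log n) levels occur, matching the O((k + s) log n) flavor of the real bound.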
Another view: levels of Bernoulli samples

(figure: a grid of coin flips; each row is a Bernoulli sample of the stream, with active items marked 1)

Conditioned upon a row having ≥ s active items, we can draw a sample from the active items. The coordinator could maintain a Bernoulli sample of size between s and O(s).
Sampling from a time-based sliding window

(figure: timeline split into expired windows, the frozen window, and the current window)

Sample for the sliding window = (1) a subsample of the (unexpired) sample of the frozen window + (2) a subsample of the sample of the current window (maintained by ISWoR); handling (1) needs new ideas.

(1) and (2) may be sampled at different rates, but that is fine as long as both have size ≥ min{s, # live items}.

The key issue: how to guarantee that both have sizes ≥ s as items in the frozen window keep expiring?

Solution: in the frozen window, find a good sampling rate such that the sample size stays ≥ s.
Level sampling in the frozen window

(figure: timeline with expired windows, the frozen window, and the current window; sampled items shown at geometric levels, s = 2)

Keep all the levels? That would need O(w) communication.

Instead: keep the most recent sampled items in a level until s of them are also sampled at the next level. Total size: O(s log w).

Guarantee: there is a level window (blue in the figure) with ≥ s sampled items that covers the unexpired portion of the frozen window.
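The retention rule can be sketched offline as follows (assumed details: the levels are the nested rank-based Bernoulli samples from earlier; the pruning rule is as stated on the slide):

```python
import random

def level_sampling(n, s):
    """Level j holds the items with rank < 2**-j (nested Bernoulli samples).
    Pruning rule: a level keeps only its most recent items, going back to
    the s-th most recent item that was also sampled at the next level;
    everything older is discarded.  Total size is then O(s log w)."""
    ranks = [random.random() for _ in range(n)]   # item i arrived i-th
    levels = []
    j = 0
    while True:
        lvl = [i for i in range(n) if ranks[i] < 2.0 ** -j]
        if len(lvl) < s:                          # deeper levels too sparse
            break
        promoted = [i for i in lvl if ranks[i] < 2.0 ** -(j + 1)]
        if len(promoted) >= s:
            lvl = [i for i in lvl if i >= promoted[-s]]   # prune old items
        levels.append(lvl)
        j += 1
    return levels
```

Each retained level holds at least s items but, in expectation, only about 2s, so roughly log w levels of O(s) items each give the O(s log w) total.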
Building the structure distributedly

Each site builds its own level-sampling structure for the current window until it freezes (s = 2 in the figure). This needs O(s log w) space and O(1) time per item.

When the current window freezes: for each level, do a k-way merge to build the corresponding level of the global structure at the coordinator. Total communication: O((k + s) log w).
Sampling with replacement

Similar results hold for sampling with replacement (WR). There is a simple reduction from sampling WR to sampling WoR, but it needs to know n, so we need some new ideas.

Processing time per item is another complicated issue for WR; in the end we achieve O(1) time per item (though the algorithm is complicated). Experiments show that our algorithms work well.
Direct applications:
- Heavy hitters and quantiles can be tracked in Õ(k + 1/ε²), beating the deterministic bound Θ̃(k/ε) for k ≫ 1/ε; also for sliding windows.
- ε-approximations in bounded VC dimension: Õ(k + 1/ε²); ε-nets: Õ(k + 1/ε); ...

Is random sampling the best way to solve these problems?

New result: heavy hitters and quantiles can be tracked in Õ(k + √k/ε), using a different sampling method.

Other problems: range counting, extent measures, etc.
Multiparty communication complexity

Before, multiparty communication complexity was mainly used for other applications: the number-on-the-forehead model, the public-message model, and one-way communication.

But surprisingly, the most general and natural setting, the "private-message" model, has not been studied! A possible reason: before the distributed streaming model, it had no direct application.

Now is the time!