On Biased Reservoir Sampling in the Presence of Stream Evolution

Charu C. Aggarwal, T. J. Watson Research Center, IBM Corporation, Hawthorne, NY, USA. On Biased Reservoir Sampling in the Presence of Stream Evolution. VLDB Conference, Seoul, South Korea, 2006.

Synopsis Construction in Data Streams • Synopsis maintenance is an important problem in massive volume applications such as data streams. • Many synopsis methods such as wavelets, histograms and sketches are designed for use with specific applications such as approximate query answering. • An important class of stream synopsis construction methods is that of reservoir sampling (Vitter 1985). • It has great appeal because it generates a sample of the original multi-dimensional data representation. • It can be used with arbitrary data mining applications with few changes to the underlying algorithms.

Reservoir Sampling (Vitter 1985) • In the case of a fixed data set of known size $N$, it is trivial to construct a sample of size $n$, since all points have an inclusion probability of $n/N$. • However, a data stream is a continuous process, and it is not known in advance how many points will arrive before an analyst needs a representative sample. • The base data size $N$ is not known in advance. • A reservoir, or dynamic sample, is maintained by probabilistic insertions and deletions on arrival of new stream points. • Challenge: The probabilistic insertions and deletions must always maintain an unbiased sample.

Reservoir Sampling • The first $n$ points in the data stream are added to the reservoir for initialization. • Subsequently, when the $(t+1)$th point from the data stream is received, it is added to the reservoir with probability $n/(t+1)$. • If it is added, this point replaces a randomly chosen point in the reservoir. • Note: The probability of insertion decreases as the stream progresses. • Property: The reservoir sampling method maintains an unbiased sample of the history of the data stream (proof by induction).
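The insertion rule on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not code from the presentation: the function name `reservoir_sample` and the optional `rng` argument are hypothetical names chosen here.

```python
import random


def reservoir_sample(stream, n, rng=None):
    """Return an unbiased sample of up to n points from an iterable of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for t, point in enumerate(stream):  # t points were seen before this one
        if t < n:
            # Initialization: the first n points always enter the reservoir.
            reservoir.append(point)
        elif rng.random() < n / (t + 1):
            # The (t+1)th point is accepted with probability n/(t+1)
            # and replaces a uniformly chosen resident point.
            reservoir[rng.randrange(n)] = point
    return reservoir
```

For example, `reservoir_sample(range(1_000_000), 100)` keeps 100 points while reading the input once, without knowing its length in advance.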

Observations • In an evolving data stream, only the more recent data may be relevant for many queries. • For example, if an application is queried for statistics over the past hour of stream arrivals, then for a data stream that has been running for over a year, only about 0.01% of an unbiased sample may be relevant. • Imposing range selectivity or other constraints on the query reduces the relevant portion of the sample further. • In many cases, this may return a null or wildly inaccurate result.
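The figure of roughly 0.01% follows from simple arithmetic, sketched below. The sketch assumes a constant arrival rate, so the share of an unbiased sample that falls inside a time horizon equals the share of elapsed time; the helper names and the sample size of 1000 are hypothetical, chosen only for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def relevant_fraction(horizon_hours, stream_age_hours):
    """Fraction of an unbiased sample expected to fall inside the query horizon."""
    return min(1.0, horizon_hours / stream_age_hours)


def expected_relevant_points(sample_size, horizon_hours, stream_age_hours):
    """Expected number of sampled points that lie inside the query horizon."""
    return sample_size * relevant_fraction(horizon_hours, stream_age_hours)


fraction = relevant_fraction(1, HOURS_PER_YEAR)
print(f"fraction of sample in the last hour: {fraction:.4%}")
print("expected relevant points in a sample of 1000:",
      expected_relevant_points(1000, 1, HOURS_PER_YEAR))
```

With a one-year-old stream, a 1000-point unbiased sample is expected to contain fewer than one point from the last hour, which is why such queries can return null results.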

Observations • In general, the quality of the result for the same query will only degrade with progression of the stream, as a smaller and smaller portion of the sample remains relevant with time. • This is also the most important case for stream analytics, since the same query over recent behavior may be repeatedly used with progression of the stream.

Potential Solutions • One solution is to use a sliding window approach to restrict the horizon of the sample. • However, a pure sliding window that samples only the immediately preceding points may represent the other extreme and a rather unstable solution. • This is because one may not wish to lose the entire history of past stream data. • While analytical techniques such as query estimation may be performed more frequently for recent time horizons, distant historical behavior may also be queried periodically.

Biased Reservoir Sampling • A practical solution is to use a temporal bias function in order to regulate the choice of the stream sample. • Such a solution helps in cases where it is desirable to obtain both biased and unbiased results. • In some data mining applications, it may be desirable to bias the result to represent more recent behavior of the stream. • In other applications such as query estimation, while it may be desirable to obtain unbiased query results, it is more critical to obtain accurate results for queries over recent horizons. • The biased sampling method allows us to achieve both goals.

Contributions • In general, it is non-trivial to extend reservoir maintenance algorithms to the biased case. In fact, it is an open problem to determine whether reservoir maintenance can be achieved in one pass with arbitrary bias functions. • We theoretically show that in the case of an important class of memory-less bias functions (exponential bias functions), the reservoir maintenance algorithm reduces to a form which is simple to implement in a one-pass approach. • The inclusion of a bias function imposes a maximum requirement on the sample size. Any sample satisfying the bias requirements will not have size larger than a function of $N$.

Contributions • This function of $N$ defines a maximum requirement on the reservoir size which is significantly less than $N$. • In the case of the memory-less bias functions, we will show that this maximum sample size is independent of $N$ and is therefore bounded above by a constant even for an infinitely long data stream. • We will theoretically analyze the accuracy of the approach on the problem of query estimation. • We test the method on query estimation and data mining problems.

Bias Function • The bias function associated with the $r$th data point at the time of arrival of the $t$th point ($r \le t$) is given by $f(r, t)$. • The probability $p(r, t)$ of the $r$th point belonging to the reservoir at the time of arrival of the $t$th point is proportional to $f(r, t)$. • The function $f(r, t)$ is monotonically decreasing with $t$ (for fixed $r$) and monotonically increasing with $r$ (for fixed $t$). • Therefore, the use of a bias function ensures that recent points have a higher probability of being represented in the sample reservoir.

Biased Sample • Definition: Let $f(r, t)$ be the bias function for the $r$th point at the arrival of the $t$th point. A biased sample $S(t)$ at the time of arrival of the $t$th point in the stream is defined as a sample such that the relative probability $p(r, t)$ of the $r$th point belonging to the sample $S(t)$ (of size $n$) is proportional to $f(r, t)$. • For the case of general functions $f(r, t)$, it is an open problem to determine if maintenance algorithms can be implemented in one pass.

Challenges • In the case of unbiased maintenance algorithms, we only need to perform a single insertion and deletion operation periodically on the reservoir. • In the case of arbitrary bias functions, the entire set of points within the current sample may need to be redistributed in order to reflect the changes in the function $f(r, t)$ over different values of $t$. • For a sample $S(t)$, this requires $\Omega(|S(t)|) = \Omega(n)$ operations for every point in the stream, irrespective of whether or not insertions are made.

Memoryless Bias Functions • The exponential bias function is defined as follows: $f(r, t) = e^{-\lambda (t - r)}$ (1) • The parameter $\lambda$ defines the bias rate and typically takes very small values in the range $[0, 1]$. • A choice of $\lambda = 0$ represents the unbiased case. • The exponential bias function defines the class of memory-less functions in which the future probability of retaining a current point in the reservoir is independent of its past history or arrival time. • Memory-less bias functions are natural, and also allow for an extremely efficient extension of the reservoir sampling method.
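The memory-less property can be checked numerically: under equation (1), the bias of any point shrinks by the same factor $e^{-\lambda}$ at each new arrival, regardless of when the point arrived. The sketch below illustrates this; the function names `exponential_bias` and `one_step_decay` are hypothetical and do not come from the presentation.

```python
import math


def exponential_bias(r, t, lam):
    """Equation (1): f(r, t) = exp(-lam * (t - r)) for a point r that arrived by time t."""
    if r > t:
        raise ValueError("the r-th point must have arrived by time t (r <= t)")
    return math.exp(-lam * (t - r))


def one_step_decay(r, t, lam):
    """Ratio f(r, t+1) / f(r, t): how much the bias of point r shrinks in one step."""
    return exponential_bias(r, t + 1, lam) / exponential_bias(r, t, lam)


lam = 0.05
for r, t in [(1, 10), (5, 10), (10, 10)]:
    # The ratio does not depend on r or t; it always equals exp(-lam).
    print(r, t, one_step_decay(r, t, lam), math.exp(-lam))
```

Setting `lam = 0` gives `exponential_bias(r, t, 0.0) == 1.0` for every point, which is the unbiased case mentioned on the slide.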

Maximum Reservoir Requirements • Result: The maximum reservoir requirement $R(t)$ for a random sample (without duplicates) from a stream of length $t$ which satisfies the bias function $f(r, t)$ is given by: $R(t) \le \sum_{i=1}^{t} f(i, t) / f(t, t)$ (2) • Proof Sketch: – Derive an expression for the probability $p(r, t)$ in terms of the reservoir size $n$ and the bias function $f(r, t)$: $p(r, t) = n \cdot f(r, t) / \sum_{i=1}^{t} f(i, t)$ (3) – Since $p(r, t)$ is a probability, it is at most 1. – Set $r = t$ to obtain the result.
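The proof sketch can be written out step by step as follows. This is a reconstruction under the slide's own definitions, not the author's full proof; the normalizing constant $c(t)$ is a name introduced here for illustration.

```latex
\begin{align*}
p(r, t) &= c(t)\, f(r, t)
  && \text{(definition of a biased sample)} \\
\sum_{r=1}^{t} p(r, t) &= n
  && \text{(the sample holds } n \text{ distinct points)} \\
\Rightarrow\quad c(t) &= \frac{n}{\sum_{i=1}^{t} f(i, t)},
  \qquad p(r, t) = \frac{n \, f(r, t)}{\sum_{i=1}^{t} f(i, t)} \\
p(t, t) \le 1 \quad\Rightarrow\quad
  n &\le \frac{\sum_{i=1}^{t} f(i, t)}{f(t, t)}
\end{align*}
```

Because $f(r, t)$ increases with $r$ for fixed $t$, the most recent point has the largest inclusion probability, so the constraint at $r = t$ is the binding one.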

Maximum Reservoir Requirement for Exponential Bias Functions • The maximum reservoir requirement $R(t)$ for a random sample (without duplicates) from a stream of length $t$ which satisfies the exponential bias function $f(r, t) = e^{-\lambda (t - r)}$ is given by: $R(t) \le (1 - e^{-\lambda t}) / (1 - e^{-\lambda})$ (4) • Proof Sketch: Easy to show by instantiating the result for general bias functions.
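The instantiation amounts to summing a finite geometric series. A worked version, assuming $\lambda > 0$ and substituting the exponential bias function into result (2):

```latex
\begin{align*}
R(t) &\le \frac{\sum_{i=1}^{t} e^{-\lambda (t - i)}}{e^{-\lambda (t - t)}}
      = \sum_{j=0}^{t-1} e^{-\lambda j}
      && (j = t - i,\; f(t, t) = 1) \\
     &= \frac{1 - e^{-\lambda t}}{1 - e^{-\lambda}}
      && (\text{finite geometric series, } \lambda > 0)
\end{align*}
```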

Constant Upper Bound for Exponential Bias Functions • Result: The maximum reservoir requirement $R(t)$ for a random sample from a stream of length $t$ which satisfies the exponential bias function $f(r, t) = e^{-\lambda (t - r)}$ is bounded above by the constant $1 / (1 - e^{-\lambda})$. • Approximation for small values of $\lambda$: The maximum reservoir requirement $R(t)$ for a random sample (without duplicates) from a stream of length $t$ which satisfies the exponential bias function $f(r, t) = e^{-\lambda (t - r)}$ is approximately bounded above by the constant $1 / \lambda$.
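The bounds can be compared numerically. The sketch below evaluates the closed form (4), checks it against a direct summation of the bias function, and compares the constant bound with the approximation $1/\lambda$. All function names are hypothetical, and $\lambda > 0$ is assumed.

```python
import math


def max_reservoir_size(t, lam):
    """Closed-form bound (4): (1 - exp(-lam * t)) / (1 - exp(-lam)), for lam > 0."""
    return (1 - math.exp(-lam * t)) / (1 - math.exp(-lam))


def brute_force_bound(t, lam):
    """Direct evaluation of sum_{i=1..t} f(i, t) / f(t, t), where f(t, t) = 1."""
    return sum(math.exp(-lam * (t - i)) for i in range(1, t + 1))


def constant_bound(lam):
    """Limit of bound (4) as t grows without bound: 1 / (1 - exp(-lam))."""
    return 1 / (1 - math.exp(-lam))


lam = 0.001
for t in (100, 10_000, 1_000_000):
    print(t, max_reservoir_size(t, lam), constant_bound(lam), 1 / lam)
```

For small $\lambda$, the exact constant exceeds $1/\lambda$ by about one half (the series expansion is $1/\lambda + 1/2 + \lambda/12 - \dots$), so $1/\lambda$ is an approximation rather than a strict upper bound.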

Implications of Constant Upper Bound • For unbiased sampling, the reservoir size may be as large as the stream itself; this is no longer necessary for biased sampling. • The constant upper bound shows that the maximum reservoir size is not sensitive to how long points from the stream have been arriving. • It provides an estimate of the maximum sampling requirement. • We can maintain the maximum theoretical reservoir size if sufficient main memory is available.
