Lecture 7 Barna Saha AT&T-Labs Research September 26, 2013

Outline
◮ Sampling
◮ Estimating $F_k$ [AMS'96]
◮ Reservoir Sampling
◮ Priority Sampling

Estimating $F_k$
◮ Suppose you know $m$, the stream length.
◮ Sample an index $p$ uniformly at random from $\{1, \dots, m\}$, so that each index is chosen with probability $\frac{1}{m}$. Suppose $a_p = l$.
◮ Compute $r = |\{q : q \ge p,\ a_q = l\}|$, the number of occurrences of $l$ in the stream starting from $a_p$.
◮ Return $X = m\,(r^k - (r-1)^k)$.
◮ Show $E[X] = F_k$ and $\mathrm{Var}[X] \le k\, n^{1-\frac{1}{k}} (F_k)^2$.
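The steps above can be sketched in Python as follows. This is only a minimal illustration that assumes the whole stream is held in a list, so $m$ is known up front; the function name `ams_estimate` and the `rng` parameter are illustrative choices, not part of the lecture.

```python
import random


def ams_estimate(stream, k, rng=random):
    """Return one basic estimate X = m * (r^k - (r-1)^k) of F_k.

    Assumes `stream` is a non-empty sequence whose length m is known.
    """
    m = len(stream)
    p = rng.randrange(m)          # index chosen with probability 1/m
    l = stream[p]                 # the sampled item a_p
    # r = number of occurrences of l at positions q >= p
    r = sum(1 for q in range(p, m) if stream[q] == l)
    return m * (r ** k - (r - 1) ** k)
```

Two sanity cases follow directly from the formula: for $k = 1$ the estimate is always $m = F_1$, and for a stream of distinct items $r = 1$, so the estimate is $m = F_k$ for every $k$.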

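Estimating $F_k$
◮ Maintain $s_1 = O\!\left(\frac{k\, n^{1-\frac{1}{k}}}{\epsilon^2}\right)$ such estimates $X_1, X_2, \dots, X_{s_1}$. Take the average, $Y = \frac{1}{s_1} \sum_{i=1}^{s_1} X_i$.
◮ Maintain $s_2 = O(\log \frac{1}{\delta})$ of these average estimates, $Y_1, Y_2, \dots, Y_{s_2}$, and take the median.
◮ The median is a $(1 \pm \epsilon)$ approximation of $F_k$ with probability $\ge 1 - \delta$ (Chebyshev's inequality for each average, then a Chernoff bound for the median).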
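The averaging and median steps can be sketched as below. The constants hidden in the $O(\cdot)$ bounds are not given on the slide, so `s1` and `s2` are passed in as plain integers; `median_of_means`, `ams_estimate`, and the `seed` parameter are illustrative names, and the stream is again assumed to be a list of known length.

```python
import random
import statistics


def ams_estimate(stream, k, rng):
    """One basic estimate X = m * (r^k - (r-1)^k), with m known."""
    m = len(stream)
    p = rng.randrange(m)
    l = stream[p]
    r = sum(1 for q in range(p, m) if stream[q] == l)
    return m * (r ** k - (r - 1) ** k)


def median_of_means(stream, k, s1, s2, seed=0):
    """Average s1 basic estimates, repeat s2 times, return the median."""
    rng = random.Random(seed)
    averages = []
    for _ in range(s2):
        total = sum(ams_estimate(stream, k, rng) for _ in range(s1))
        averages.append(total / s1)      # Y = (1/s1) * sum of X_i
    return statistics.median(averages)   # median of Y_1, ..., Y_{s2}
```

For example, `median_of_means(list("aaabbc") * 100, k=2, s1=400, s2=9)` should land near $F_2 = 300^2 + 200^2 + 100^2 = 140000$, although the exact value depends on the seed.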

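Estimating $F_k$
Lemma: $E[X] = F_k$, where $f_i$ is the number of occurrences of item $i$ and $F_k = \sum_{i=1}^{n} f_i^k$.

$$
\begin{aligned}
E[X] &= \sum_{i=1}^{n} \sum_{j=1}^{f_i} \frac{1}{m}\, E\big[X \mid i \text{ is sampled on its } j\text{th occurrence}\big] \\
&= \sum_{i=1}^{n} \sum_{j=1}^{f_i} \frac{1}{m}\, m \big((f_i - j + 1)^k - (f_i - j)^k\big) \\
&= \sum_{i=1}^{n} \Big[1^k + (2^k - 1^k) + (3^k - 2^k) + \dots + \big(f_i^k - (f_i - 1)^k\big)\Big] \\
&= \sum_{i=1}^{n} f_i^k = F_k
\end{aligned}
$$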
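The lemma can be checked exactly on a small stream by enumerating every choice of the sampled index $p$ instead of drawing it at random. The sketch below does this with exact rational arithmetic; `exact_expectation` and `frequency_moment` are illustrative helper names, and the stream is assumed to be a list.

```python
from collections import Counter
from fractions import Fraction


def frequency_moment(stream, k):
    """F_k = sum over distinct items of (frequency)^k."""
    return sum(f ** k for f in Counter(stream).values())


def exact_expectation(stream, k):
    """E[X], computed by summing over all m equally likely indices p."""
    m = len(stream)
    total = Fraction(0)
    for p in range(m):
        r = stream[p:].count(stream[p])   # occurrences of a_p from p onward
        total += Fraction(1, m) * m * (r ** k - (r - 1) ** k)
    return total
```

For `list("abracadabra")` and $k = 3$, both functions give $5^3 + 2^3 + 2^3 + 1 + 1 = 143$.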

Estimating $F_k$
Lemma: $\mathrm{Var}[X] \le k\, n^{1-\frac{1}{k}} (F_k)^2$.

$$
\begin{aligned}
E[X^2] &= \sum_{i=1}^{n} \sum_{j=1}^{f_i} \frac{1}{m}\, E\big[X^2 \mid i \text{ is sampled on its } j\text{th occurrence}\big] \\
&= \sum_{i=1}^{n} \sum_{j=1}^{f_i} \frac{1}{m}\, m^2 \big((f_i - j + 1)^k - (f_i - j)^k\big)^2 \\
&= m \sum_{i=1}^{n} \Big[1^{2k} + (2^k - 1^k)^2 + (3^k - 2^k)^2 + \dots + \big(f_i^k - (f_i - 1)^k\big)^2\Big] \\
&\le m \sum_{i=1}^{n} \Big[k\,1^{2k-1} + k\,2^{k-1}(2^k - 1^k) + \dots + k\,f_i^{k-1}\big(f_i^k - (f_i - 1)^k\big)\Big]
\end{aligned}
$$

Using $a^k - b^k = (a - b)(a^{k-1} + b\,a^{k-2} + \dots + b^{k-1}) \le (a - b)\,k\,a^{k-1}$ for $a \ge b \ge 0$.

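Estimating $F_k$

$$
\begin{aligned}
& m \sum_{i=1}^{n} \Big[k\,1^{2k-1} + k\,2^{k-1}(2^k - 1^k) + \dots + k\,f_i^{k-1}\big(f_i^k - (f_i - 1)^k\big)\Big] \\
&\le m k \sum_{i=1}^{n} f_i^{k-1} \Big[1^k + (2^k - 1^k) + \dots + \big(f_i^k - (f_i - 1)^k\big)\Big] \\
&= m k \sum_{i=1}^{n} f_i^{2k-1} = m k F_{2k-1} \\
&= k F_1 F_{2k-1} \le k\, n^{1-\frac{1}{k}} \Big(\sum_{i=1}^{n} f_i^k\Big)^2 = k\, n^{1-\frac{1}{k}} (F_k)^2
\end{aligned}
$$

The first inequality bounds each $j^{k-1}$ by $f_i^{k-1}$, after which the bracket telescopes to $f_i^k$; the next line uses $m = F_1$. Since $\mathrm{Var}[X] \le E[X^2]$, the lemma follows.

Reference: The space complexity of approximating the frequency moments, by Alon, Matias, Szegedy.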

Uniform Random Sample from Stream Without Replacement
◮ What happens when you do not know $m$?
Check out: Algorithms Every Data Scientist Should Know: Reservoir Sampling
http://blog.cloudera.com/blog/2013/04/hadoop-stratified-randosampling-algorithm/

Reservoir Sampling
◮ Find a uniform sample $s$ from the stream when you do not know $m$.
◮ Initially $s = a_1$.
◮ On seeing the $t$-th element, set $s = a_t$ with probability $\frac{1}{t}$.

$$
\Pr[s = a_i] = \frac{1}{i}\Big(1 - \frac{1}{i+1}\Big)\Big(1 - \frac{1}{i+2}\Big) \cdots \Big(1 - \frac{1}{t}\Big) = \frac{1}{t}
$$

◮ Can you extend the AMS algorithm to a single pass now?
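A minimal sketch of the sampler, followed by one possible answer to the closing question: keep a counter $r$ alongside the reservoir item, reset it to 1 whenever the item is replaced, and increment it whenever the current item reappears. The function names and the `rng` parameter are illustrative, and the single-pass variant is an assumption about the intended extension, not something stated on the slide.

```python
import random


def reservoir_sample(stream, rng):
    """Uniform sample of one element from an iterable of unknown length."""
    s = None
    for t, a in enumerate(stream, start=1):
        if rng.random() < 1.0 / t:   # replace with probability 1/t
            s = a
    return s


def ams_single_pass(stream, k, rng):
    """One AMS estimate of F_k without knowing the stream length."""
    sample, r, t = None, 0, 0
    for a in stream:
        t += 1
        if rng.random() < 1.0 / t:   # a becomes the sampled position
            sample, r = a, 1
        elif a == sample:            # later occurrence of the sampled item
            r += 1
    # t now equals the stream length m
    return t * (r ** k - (r - 1) ** k)
```

Both functions consume the stream once and store only a constant number of values per estimate.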

Reservoir Sampling of size $k$
◮ Find a uniform sample $s$ of size $k$ from the stream when you do not know $m$.
◮ Initially $s = \{a_1, a_2, \dots, a_k\}$.
◮ On seeing the $t$-th element ($t > k$), pick a number $r \in [1, t]$ uniformly at random.
◮ If $r \le k$, replace the $r$th element of $s$ by $a_t$.

$$
\Pr[a_i \in s] = \frac{k}{i}\Big(1 - \frac{1}{i+1}\Big)\Big(1 - \frac{1}{i+2}\Big) \cdots \Big(1 - \frac{1}{t}\Big) = \frac{k}{t} \qquad (i > k)
$$
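The size-$k$ variant can be sketched in the same way. The name `reservoir_sample_k` and the `rng` parameter are illustrative; if the stream has fewer than $k$ elements, this sketch simply returns all of them.

```python
import random


def reservoir_sample_k(stream, k, rng):
    """Uniform sample of k elements from an iterable of unknown length."""
    s = []
    for t, a in enumerate(stream, start=1):
        if t <= k:
            s.append(a)              # the first k elements fill the reservoir
        else:
            r = rng.randint(1, t)    # uniform over {1, ..., t}
            if r <= k:
                s[r - 1] = a         # replace the r-th reservoir slot
    return s
```

Each element survives in the reservoir with probability $\frac{k}{t}$ after $t$ elements have been seen.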

Priority Sampling
◮ Element $i$ has weight $w_i$.
◮ Keep a sample of size $k$ such that any subset sum query can be answered later.
◮ Uniform sampling: may miss the few heavy hitters.
◮ Weighted sampling with replacement: produces duplicates of heavy hitters.
◮ Weighted sampling without replacement: the expressions become very complicated, and it does not work for subset sums.

Priority Sampling
◮ For each item $i = 0, 1, \dots, n-1$, generate a random number $\alpha_i \in [0, 1]$ uniformly at random.
◮ Assign priority $q_i = \frac{w_i}{\alpha_i}$ to the $i$th element.
◮ Select the $k$ highest priority items as the sample $S$.

Priority Sampling
◮ Let $\tau$ be the $(k+1)$th highest priority.
◮ Set $\hat{w}_i = \max(w_i, \tau)$ if $i$ is in the sample and $0$ otherwise.
◮ $E[\hat{w}_i] = w_i$.
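The sampling rule and the estimator can be sketched together as follows. This is a minimal offline illustration that assumes all weights are positive and given as a list; `priority_sample` and `subset_sum_estimate` are illustrative names. It draws $\alpha_i$ from $(0, 1]$ to avoid division by zero, and when there are at most $k$ items it sets $\tau = 0$ so every item keeps its exact weight.

```python
import heapq
import random


def priority_sample(weights, k, rng):
    """Return {index: estimated weight} for the k highest-priority items."""
    # Priority q_i = w_i / alpha_i with alpha_i uniform in (0, 1].
    priorities = [(w / (1.0 - rng.random()), i) for i, w in enumerate(weights)]
    top = heapq.nlargest(k + 1, priorities)      # sorted, highest first
    if len(top) <= k:                            # fewer than k+1 items overall
        tau, chosen = 0.0, top
    else:
        tau, chosen = top[k][0], top[:k]         # tau = (k+1)th highest priority
    return {i: max(weights[i], tau) for _, i in chosen}


def subset_sum_estimate(estimates, subset):
    """Estimate the total weight of `subset` from the sampled estimates."""
    return sum(estimates.get(i, 0.0) for i in subset)
```

A subset sum query is answered by adding the estimates $\hat{w}_i$ of the sampled items that fall in the subset; items outside the sample contribute 0.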

Priority Sampling
◮ $A(\tau')$: the event that $\tau'$ is the $k$th highest priority among all $j \ne i$.
◮ For any value of $\tau'$, $E[\hat{w}_i \mid A(\tau')] = \Pr[i \in S \mid A(\tau')]\, \max(w_i, \tau')$.
◮ $\Pr[i \in S \mid A(\tau')] = \Pr\!\left[\frac{w_i}{\alpha_i} > \tau'\right] = \Pr\!\left[\alpha_i < \frac{w_i}{\tau'}\right] = \min\!\left(1, \frac{w_i}{\tau'}\right)$.
◮ $E[\hat{w}_i \mid A(\tau')] = \max(w_i, \tau')\, \min\!\left(1, \frac{w_i}{\tau'}\right) = w_i$.
◮ This holds for all $\tau'$, hence it holds unconditionally.

Priority Sampling
◮ Near optimality: the variance of the weight estimator is minimal among all $(k+1)$-sparse unbiased estimators.
