

  1. Lecture 7 Barna Saha AT&T-Labs Research September 26, 2013

  2. Outline
     ◮ Sampling
     ◮ Estimating $F_k$ [AMS'96]
     ◮ Reservoir Sampling
     ◮ Priority Sampling

  3. Estimating $F_k$
     ◮ Suppose you know $m$, the stream length.
     ◮ Sample an index $p \in \{1, \dots, m\}$ uniformly at random, i.e., each with probability $\frac{1}{m}$. Suppose $a_p = l$.
     ◮ Compute $r = |\{ q : q \ge p,\ a_q = l \}|$, the number of occurrences of $l$ in the stream starting from $a_p$.
     ◮ Return $X = m\,(r^k - (r-1)^k)$.
     ◮ Show $E[X] = F_k$ and $\mathrm{Var}[X] \le k\, n^{1-\frac{1}{k}} (F_k)^2$. (A runnable sketch of this estimator follows below.)
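A minimal sketch of this basic estimator, assuming the stream length $m$ is known (not from the slides; the function name and the in-memory list standing in for the stream are illustrative assumptions):

```python
import random

def ams_estimate(stream, k):
    """One AMS estimate of F_k for a stream whose length m is known."""
    m = len(stream)
    p = random.randrange(m)            # sample an index uniformly from [0, m)
    l = stream[p]                      # the sampled element a_p
    # r = number of occurrences of l from position p to the end of the stream
    r = sum(1 for q in range(p, m) if stream[q] == l)
    return m * (r ** k - (r - 1) ** k)
```

In a genuine one-pass setting, $p$ would be drawn before the pass and only the identity of $a_p$ and the running count $r$ kept, so each estimate needs only $O(\log n + \log m)$ bits.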

  4. Estimating $F_k$
     ◮ Maintain $s_1 = O\!\left(\frac{k\, n^{1-1/k}}{\epsilon^2}\right)$ such estimates $X_1, X_2, \dots, X_{s_1}$ and take the average, $Y = \frac{1}{s_1} \sum_{i=1}^{s_1} X_i$.
     ◮ Maintain $s_2 = O(\log \frac{1}{\delta})$ of these average estimates, $Y_1, Y_2, \dots, Y_{s_2}$, and take the median (a sketch of this combination follows below).
     ◮ This yields a $(1 \pm \epsilon)$-approximation with probability $\ge 1 - \delta$.
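A hedged sketch of the median-of-means combination, reusing `ams_estimate` from above (the constants hidden in $s_1$ and $s_2$ are illustrative; the slides only fix the asymptotics):

```python
import math
import statistics

def ams_fk(stream, k, eps=0.5, delta=0.1):
    """(1 +- eps)-approximate F_k with probability >= 1 - delta."""
    n = len(set(stream))               # number of distinct elements
    s1 = max(1, math.ceil(k * n ** (1 - 1 / k) / eps ** 2))
    s2 = max(1, math.ceil(math.log(1 / delta)))
    # Average s1 independent estimates to shrink the variance (Chebyshev),
    # then take the median of s2 such averages to boost confidence (Chernoff).
    averages = [
        sum(ams_estimate(stream, k) for _ in range(s1)) / s1
        for _ in range(s2)
    ]
    return statistics.median(averages)
```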

  5. Estimating $F_k$
     Lemma: $E[X] = F_k$.
     \begin{aligned}
     E[X] &= \frac{1}{m} \sum_{i=1}^{n} \sum_{j=1}^{f_i} E[X \mid i \text{ is sampled on its } j\text{-th occurrence}]\\
          &= \frac{1}{m} \sum_{i=1}^{n} \sum_{j=1}^{f_i} m \left( (f_i - j + 1)^k - (f_i - j)^k \right)\\
          &= \sum_{i=1}^{n} \left( 1^k + (2^k - 1^k) + (3^k - 2^k) + \dots + (f_i^k - (f_i - 1)^k) \right)\\
          &= \sum_{i=1}^{n} f_i^k = F_k
     \end{aligned}
     (A small worked check follows below.)
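To make the telescoping sum concrete, here is a small worked check (not in the slides): take the stream $\langle a, b, a \rangle$ with $m = 3$, $k = 2$, so $f_a = 2$, $f_b = 1$, and $F_2 = 2^2 + 1^2 = 5$.

```latex
\begin{aligned}
p = 1 &: a_p = a,\; r = 2,\; X = 3\,(2^2 - 1^2) = 9\\
p = 2 &: a_p = b,\; r = 1,\; X = 3\,(1^2 - 0^2) = 3\\
p = 3 &: a_p = a,\; r = 1,\; X = 3\,(1^2 - 0^2) = 3\\
E[X] &= \tfrac{1}{3}\,(9 + 3 + 3) = 5 = F_2
\end{aligned}
```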

  6. Estimating $F_k$
     Lemma: $\mathrm{Var}[X] \le k\, n^{1-\frac{1}{k}} (F_k)^2$.
     \begin{aligned}
     E[X^2] &= \frac{1}{m} \sum_{i=1}^{n} \sum_{j=1}^{f_i} E[X^2 \mid i \text{ is sampled on its } j\text{-th occurrence}]\\
            &= \frac{1}{m} \sum_{i=1}^{n} \sum_{j=1}^{f_i} m^2 \left( (f_i - j + 1)^k - (f_i - j)^k \right)^2\\
            &= m \sum_{i=1}^{n} \left( 1^{2k} + (2^k - 1^k)^2 + (3^k - 2^k)^2 + \dots + (f_i^k - (f_i - 1)^k)^2 \right)\\
            &\le m \sum_{i=1}^{n} k \left( 1^{2k-1} + 2^{k-1}(2^k - 1^k) + \dots + f_i^{k-1}(f_i^k - (f_i - 1)^k) \right)
     \end{aligned}
     using $a^k - b^k = (a - b)(a^{k-1} + a^{k-2} b + \dots + b^{k-1}) \le (a - b)\, k\, a^{k-1}$.

  7. Estimating $F_k$
     \begin{aligned}
     m \sum_{i=1}^{n} k &\left( 1^{2k-1} + 2^{k-1}(2^k - 1^k) + \dots + f_i^{k-1}(f_i^k - (f_i - 1)^k) \right)\\
     &< m k \sum_{i=1}^{n} \left( 1^{2k-1} + 2^{2k-1} + \dots + f_i^{2k-1} \right) = m k F_{2k-1}\\
     &= k F_1 F_{2k-1} \le k\, n^{1-\frac{1}{k}} \Big( \sum_{i=1}^{n} f_i^k \Big)^2 = k\, n^{1-\frac{1}{k}} (F_k)^2
     \end{aligned}
     using $m = F_1$ and the fact that $F_1 F_{2k-1} \le n^{1-1/k} (F_k)^2$. Since $\mathrm{Var}[X] \le E[X^2]$, the lemma follows.
     Reference: "The space complexity of approximating the frequency moments" by N. Alon, Y. Matias, and M. Szegedy.
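Both lemmas can be sanity-checked by brute force, since enumerating every choice of $p$ gives the exact distribution of $X$. A sketch of such a check (not from the slides; `check_ams_moments` is an illustrative name):

```python
from collections import Counter

def check_ams_moments(stream, k):
    """Enumerate all m choices of p to get the exact distribution of X,
    then verify E[X] = F_k and Var[X] <= k * n^(1 - 1/k) * F_k^2."""
    m = len(stream)
    freqs = Counter(stream)
    n = len(freqs)
    f_k = sum(f ** k for f in freqs.values())
    xs = []
    for p in range(m):
        l = stream[p]
        r = sum(1 for q in range(p, m) if stream[q] == l)
        xs.append(m * (r ** k - (r - 1) ** k))
    mean = sum(xs) / m
    var = sum((x - mean) ** 2 for x in xs) / m
    bound = k * n ** (1 - 1 / k) * f_k ** 2
    print(f"E[X] = {mean}, F_k = {f_k}, Var[X] = {var:.1f}, bound = {bound:.1f}")

check_ams_moments(list("abracadabra"), k=2)
```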

  8. Uniform Random Sample from a Stream Without Replacement
     ◮ What happens when you do not know $m$?
     ◮ Check out "Algorithms Every Data Scientist Should Know: Reservoir Sampling": http://blog.cloudera.com/blog/2013/04/hadoop-stratified-randosampling-algorithm/

  9. Reservoir Sampling
     ◮ Find a uniform sample $s$ from the stream when you do not know $m$.
     ◮ Initially $s = a_1$.
     ◮ On seeing the $t$-th element, set $s = a_t$ with probability $\frac{1}{t}$ (see the sketch below).
     ◮ $\Pr[s = a_i] = \frac{1}{i}\left(1 - \frac{1}{i+1}\right)\left(1 - \frac{1}{i+2}\right) \cdots \left(1 - \frac{1}{t}\right) = \frac{1}{t}$
     ◮ Can you extend the AMS algorithm to a single pass now?
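A minimal one-item reservoir sampler in Python (the function name is illustrative):

```python
import random

def reservoir_sample(stream):
    """Keep one uniform sample from a stream of unknown length."""
    sample = None
    for t, item in enumerate(stream, start=1):
        if random.randrange(t) == 0:   # replace with probability 1/t
            sample = item
    return sample
```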

  10. Reservoir Sampling of Size k
     ◮ Find a uniform sample $s$ of size $k$ from the stream when you do not know $m$.
     ◮ Initially $s = \{a_1, a_2, \dots, a_k\}$.
     ◮ On seeing the $t$-th element, pick a number $r \in [1, t]$ uniformly at random.
     ◮ If $r \le k$, replace the $r$-th element of $s$ with $a_t$ (see the sketch below).
     ◮ $\Pr[a_i \in s] = \frac{k}{i}\left(1 - \frac{1}{i+1}\right)\left(1 - \frac{1}{i+2}\right) \cdots \left(1 - \frac{1}{t}\right) = \frac{k}{t}$
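The size-$k$ variant, again as an illustrative sketch:

```python
import random

def reservoir_sample_k(stream, k):
    """Keep a uniform size-k sample without replacement from a stream
    of unknown length."""
    sample = []
    for t, item in enumerate(stream, start=1):
        if t <= k:
            sample.append(item)        # fill the reservoir with a_1 .. a_k
        else:
            r = random.randrange(t)    # uniform in [0, t)
            if r < k:
                sample[r] = item       # replace slot r with a_t
    return sample
```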

  11. Priority Sampling
     ◮ Element $i$ has weight $w_i$.
     ◮ Keep a sample of size $k$ such that any subset-sum query can be answered later.
     ◮ Uniform sampling: likely to miss the few heavy hitters.
     ◮ Weighted sampling with replacement: duplicates of heavy hitters.
     ◮ Weighted sampling without replacement: very complicated inclusion-probability expressions; does not work well for subset sums.

  12. Priority Sampling
     ◮ For each item $i = 0, 1, \dots, n-1$, generate a random number $\alpha_i \in [0, 1]$ uniformly at random.
     ◮ Assign priority $q_i = \frac{w_i}{\alpha_i}$ to the $i$-th element.
     ◮ Select the $k$ highest-priority items as the sample $S$.

  13. Priority Sampling
     ◮ Let $\tau$ be the $(k+1)$-th highest priority.
     ◮ Set $\hat{w}_i = \max(w_i, \tau)$ if $i$ is in the sample, and $\hat{w}_i = 0$ otherwise.
     ◮ $E[\hat{w}_i] = w_i$ (a sketch combining slides 12 and 13 follows below).
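Putting slides 12 and 13 together, a compact sketch (not from the slides; function names are illustrative, and clamping $\alpha_i$ away from zero is an implementation convenience):

```python
import random

def priority_sample(weights, k):
    """Priority sampling: return {index: w_hat} for the k highest-priority
    items, where w_hat_i = max(w_i, tau) and tau is the (k+1)-th highest
    priority. Summing w_hat over any subset of sampled indices gives an
    unbiased estimate of that subset's true total weight."""
    # Priority q_i = w_i / alpha_i with alpha_i uniform in (0, 1].
    prioritized = sorted(
        ((w / random.uniform(1e-12, 1.0), i, w) for i, w in enumerate(weights)),
        reverse=True,
    )
    tau = prioritized[k][0] if len(prioritized) > k else 0.0
    return {i: max(w, tau) for _, i, w in prioritized[:k]}

def subset_sum_estimate(sample, subset):
    """Estimate the total weight of the indices in `subset`."""
    return sum(w_hat for i, w_hat in sample.items() if i in subset)
```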

  14. Priority Sampling
     ◮ $A(\tau')$: the event that $\tau'$ is the $k$-th highest priority among all $j \ne i$.
     ◮ For any value of $\tau'$: $E[\hat{w}_i \mid A(\tau')] = \Pr[i \in S \mid A(\tau')] \cdot \max(w_i, \tau')$
     ◮ $\Pr[i \in S \mid A(\tau')] = \Pr[q_i > \tau'] = \Pr[\alpha_i < \frac{w_i}{\tau'}] = \min(1, \frac{w_i}{\tau'})$
     ◮ $E[\hat{w}_i \mid A(\tau')] = \min(1, \frac{w_i}{\tau'}) \cdot \max(w_i, \tau') = w_i$
     ◮ This holds for every $\tau'$, hence it holds unconditionally.

  15. Priority Sampling
     ◮ Near optimality: the variance of the weight estimator is minimal among all $(k+1)$-sparse unbiased estimators.
