CountMin and Count Sketches
Lecture 10, February 14, 2019, Chandra

  1. CS 498ABD: Algorithms for Big Data, Spring 2019
     CountMin and Count Sketches
     Lecture 10, February 14, 2019
     Chandra (UIUC)

  2. Heavy Hitters Problem
     Heavy Hitters Problem: Find all items $i$ such that $f_i > m/k$ for some fixed $k$.
     Heavy hitters are very frequent items.
     We saw the deterministic Misra-Gries algorithm, which uses $O(k)$ space and finds the heavy hitters, assuming they exist.
     A two-pass algorithm correctly identifies the heavy hitters.

  3. (Strict) Turnstile Model
     Turnstile model: each update is $(i_j, \Delta_j)$, where $\Delta_j$ can be positive or negative.
     Strict turnstile: need $x_i \ge 0$ at all times for all $i$.
     In terms of frequent items, we want an additive error guarantee for each $x_i$.
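As a small illustration of the update model, the sketch below replays a list of (item, delta) updates into an explicit frequency vector and checks the strict-turnstile condition after every step. The function name `apply_turnstile` and the dictionary representation of $x$ are illustrative choices, not something defined in the lecture.

```python
from collections import defaultdict


def apply_turnstile(updates, strict=True):
    """Replay (item, delta) updates and return the frequency vector x as a dict."""
    x = defaultdict(int)
    for item, delta in updates:
        x[item] += delta
        # Strict turnstile: no coordinate may ever drop below zero.
        if strict and x[item] < 0:
            raise ValueError(f"x[{item}] became negative")
    return dict(x)


# Two insertions of item 1, one insertion of item 2, one deletion of item 1.
print(apply_turnstile([(1, 2), (2, 1), (1, -1)]))
```

Storing $x$ explicitly takes space proportional to the number of distinct items, which is exactly what the sketches in this lecture avoid.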

  4. Basic Hashing/Sampling Idea
     Heavy Hitters Problem: Find all items $i$ such that $f_i > m/k$.
     Let $b_1, b_2, \ldots, b_k$ be the $k$ heavy hitters.
     Suppose we pick $h : [n] \to [ck]$ for some $c > 1$.
     $h$ spreads $b_1, \ldots, b_k$ among the buckets ($k$ balls into $ck$ bins).
     In the ideal situation, each bucket can be used to count a separate heavy hitter.
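To get a feel for the balls-into-bins picture, here is a minimal simulation that estimates how often $k$ heavy hitters land in $k$ distinct buckets out of $ck$. It assumes a fully random $h$ on the heavy hitters, which is a stronger assumption than the lecture needs, and the function name `collision_free_fraction` is hypothetical.

```python
import random


def collision_free_fraction(k, c, trials=2000, seed=0):
    """Estimate Pr[k heavy hitters occupy k distinct buckets among c*k buckets]."""
    rng = random.Random(seed)
    buckets = c * k
    successes = 0
    for _ in range(trials):
        # Assumption: h behaves like a fully random function on the heavy hitters.
        placed = [rng.randrange(buckets) for _ in range(k)]
        if len(set(placed)) == k:
            successes += 1
    return successes / trials


for c in (2, 10, 100):
    print(c, collision_free_fraction(k=10, c=c))
```

A single hash function with a small constant $c$ rarely separates all heavy hitters at once, which is one way to motivate using several hash functions.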

  5. Part I: CountMin Sketch

  6. CountMin Sketch [Cormode-Muthukrishnan]
     CountMin-Sketch($w$, $d$):
       $h_1, h_2, \ldots, h_d$ are pairwise independent hash functions from $[n] \to [w]$.
       While (stream is not empty) do
         $e_t = (i_t, \Delta_t)$ is the current item
         for $\ell = 1$ to $d$ do
           $C[\ell, h_\ell(i_t)] \leftarrow C[\ell, h_\ell(i_t)] + \Delta_t$
       endWhile
       For $i \in [n]$ set $\tilde{x}_i = \min_{\ell=1}^{d} C[\ell, h_\ell(i)]$.
     Counter $C[\ell, j]$ simply counts the sum of all $x_i$ such that $h_\ell(i) = j$. That is,
     $C[\ell, j] = \sum_{i : h_\ell(i) = j} x_i$.
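The pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration, not the lecture's implementation: the class name, the choice of prime, and the hash family $((a \cdot i + b) \bmod p) \bmod w$ are assumptions, and that family is only approximately pairwise independent after the final reduction modulo $w$.

```python
import random


class CountMinSketch:
    """CountMin sketch with d rows of w counters (strict turnstile model)."""

    # Assumption: a Mersenne prime larger than any item id used here.
    PRIME = (1 << 61) - 1

    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        # One (a, b) pair per row defines h_l(i) = ((a*i + b) mod p) mod w.
        self.coeffs = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(d)]
        self.C = [[0] * w for _ in range(d)]

    def _h(self, row, i):
        a, b = self.coeffs[row]
        return ((a * i + b) % self.PRIME) % self.w

    def update(self, i, delta=1):
        """Process stream element (i, delta): add delta to one counter per row."""
        for row in range(self.d):
            self.C[row][self._h(row, i)] += delta

    def estimate(self, i):
        """Return the minimum over rows of the counter that i hashes to."""
        return min(self.C[row][self._h(row, i)] for row in range(self.d))


cm = CountMinSketch(w=20, d=5)
for item, delta in [(7, 3), (9, 1), (7, 2), (9, -1)]:
    cm.update(item, delta)
print(cm.estimate(7))  # never below the true value x_7 = 5
```

Each update touches exactly $d$ counters, and a query reads $d$ counters, so both operations are independent of the stream length.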

  7. Intuition
     Suppose there are $k$ heavy hitters $b_1, b_2, \ldots, b_k$.
     Consider $b_i$: hash function $h_\ell$ sends $b_i$ to bucket $h_\ell(b_i)$.
     $C[\ell, h_\ell(b_i)]$ counts $x_{b_i}$ and also the other items that hash to the same bucket $h_\ell(b_i)$, so we always overcount (since we are in the strict turnstile model).
     Repeating with many hash functions and taking the minimum is the right thing to do: for $b_i$ the goal is to avoid other heavy hitters colliding with it.

  9. Property of CountMin Sketch
     Lemma: Let $d = \Omega(\log \frac{1}{\delta})$ and $w > \frac{2}{\epsilon}$. Then for any fixed $i \in [n]$, $x_i \le \tilde{x}_i$ and $\Pr[\tilde{x}_i \ge x_i + \epsilon \|x\|_1] \le \delta$.
     Unlike Misra-Gries, we have overestimates.
     Actual items are not stored (recovering the heavy hitters requires extra work).
     Works in the strict turnstile model and hence can handle deletions.
     Space usage is $O(\frac{\log(1/\delta)}{\epsilon})$ counters and hence $O(\frac{\log(1/\delta)}{\epsilon} \log m)$ bits.

  15. Analysis
      Fix $\ell$: $h_\ell(i)$ is the bucket that $h_\ell$ hashes $i$ to.
      $Z_\ell = C[\ell, h_\ell(i)]$ is the value of the counter that $i$ is hashed to.
      $\mathrm{E}[Z_\ell] = x_i + \sum_{i' \ne i} \Pr[h_\ell(i') = h_\ell(i)] \, x_{i'}$
      By pairwise independence, $\mathrm{E}[Z_\ell] = x_i + \sum_{i' \ne i} x_{i'}/w \le x_i + \epsilon \|x\|_1 / 2$.
      Via Markov applied to $Z_\ell - x_i$ (we use strict turnstile here), $\Pr[Z_\ell \ge x_i + \epsilon \|x\|_1] \le 1/2$.
      Since the $d$ hash functions are independent, $\Pr[\min_\ell Z_\ell \ge x_i + \epsilon \|x\|_1] \le 1/2^d \le \delta$.
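For readers who want the Markov and independence steps spelled out, one way to write them is below. This restates the slide's argument; it assumes $\|x\|_1 > 0$ and uses the fact that $Z_\ell - x_i \ge 0$ in the strict turnstile model, which is what makes Markov's inequality applicable.

```latex
\begin{align*}
\mathrm{E}[Z_\ell - x_i]
  &= \sum_{i' \ne i} \frac{x_{i'}}{w}
   \le \frac{\|x\|_1}{w}
   \le \frac{\epsilon \|x\|_1}{2}, \\
\Pr\big[Z_\ell - x_i \ge \epsilon \|x\|_1\big]
  &\le \frac{\mathrm{E}[Z_\ell - x_i]}{\epsilon \|x\|_1}
   \le \frac{1}{2}, \\
\Pr\big[\min_{\ell} Z_\ell \ge x_i + \epsilon \|x\|_1\big]
  &= \prod_{\ell=1}^{d} \Pr\big[Z_\ell \ge x_i + \epsilon \|x\|_1\big]
   \le 2^{-d}
   \le \delta
   \quad \text{when } d \ge \log_2 \tfrac{1}{\delta}.
\end{align*}
```

The product in the last line uses that the minimum exceeds the threshold only if every row does, together with independence of the $d$ hash functions.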

  17. Summarizing
      Lemma: Let $d = \Omega(\log \frac{1}{\delta})$ and $w > \frac{2}{\epsilon}$. Then for any fixed $i \in [n]$, $x_i \le \tilde{x}_i$ and $\Pr[\tilde{x}_i \ge x_i + \epsilon \|x\|_1] \le \delta$.
      Choose $d = 2 \log_2 n$ and $w = 2/\epsilon$: we have $\Pr[\tilde{x}_i \ge x_i + \epsilon \|x\|_1] \le 2^{-d} = 1/n^2$.
      By the union bound, with probability at least $1 - 1/n$, for all $i \in [n]$, $\tilde{x}_i \le x_i + \epsilon \|x\|_1$.
      Total space is $O(\frac{1}{\epsilon} \log n)$ counters and hence $O(\frac{1}{\epsilon} \log n \log m)$ bits.
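The parameter choice can be turned into a small helper. This is a sketch under stated assumptions: the function name `countmin_dimensions` is hypothetical, and it uses a base-2 logarithm for $d$ because each row fails with probability at most $1/2$, so $2^{-d} \le 1/n^2$ needs $d \ge 2 \log_2 n$.

```python
import math


def countmin_dimensions(eps, n):
    """Return (w, d) so each of n estimates exceeds x_i + eps*||x||_1 w.p. <= 1/n^2."""
    w = math.ceil(2 / eps)            # width: expected excess at most eps*||x||_1 / 2
    d = math.ceil(2 * math.log2(n))   # depth: per-item failure 2^-d <= 1/n^2
    return w, d


w, d = countmin_dimensions(eps=0.5, n=1024)
print(w, d, w * d)  # width, depth, total number of counters
```

Rounding up with `ceil` only makes the guarantee stronger; the total number of counters is $w \cdot d = O(\frac{1}{\epsilon} \log n)$.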

  19. CountMin as a Linear Sketch
      Question: Why is CountMin a linear sketch?
      Recall that for $1 \le \ell \le d$ and $1 \le s \le w$: $C[\ell, s] = \sum_{i : h_\ell(i) = s} x_i$.
      Thus, once hash function $h_\ell$ is fixed, $C[\ell, s] = \langle u, x \rangle$, where $u$ is a row vector in $\{0,1\}^n$ such that $u_i = 1$ if $h_\ell(i) = s$ and $u_i = 0$ otherwise.
      Thus, once the hash functions are fixed, the counter values can be written as $Mx$, where $M \in \{0,1\}^{wd \times n}$ is the sketch matrix.
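Linearity can be checked numerically by building the $wd \times n$ matrix $M$ explicitly for a tiny universe. The helper names (`make_hashes`, `sketch_matrix`, `apply_sketch`), the 0-based item ids, and the affine hash family are illustrative assumptions; a real sketch never materializes $M$.

```python
import random


def make_hashes(w, d, seed=0, prime=(1 << 61) - 1):
    """Return d hash functions [n] -> [w] from an affine family (an assumption)."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(d)]
    return [lambda i, a=a, b=b: ((a * i + b) % prime) % w for a, b in params]


def sketch_matrix(hashes, w, n):
    """Rows are indexed by (l, s); the entry in column i is 1 iff h_l(i) == s."""
    return [[1 if h(i) == s else 0 for i in range(n)]
            for h in hashes for s in range(w)]


def apply_sketch(M, x):
    """Compute the counter vector M x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


w, d, n = 4, 3, 10
M = sketch_matrix(make_hashes(w, d), w, n)
x = [5, 0, 1, 0, 0, 2, 0, 0, 0, 3]
y = [0, 1, 0, 0, 4, 0, 0, 0, 2, 0]
lhs = apply_sketch(M, [a + b for a, b in zip(x, y)])
rhs = [a + b for a, b in zip(apply_sketch(M, x), apply_sketch(M, y))]
print(lhs == rhs)
```

Because the sketch of a sum is the sum of the sketches, sketches of two streams built with the same hash functions can be merged by adding counters.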

  20. Part II: Count Sketch

  21. Count Sketch [Charikar-Chen-Farach-Colton]
      Count-Sketch($w$, $d$):
        $h_1, h_2, \ldots, h_d$ are pairwise independent hash functions from $[n] \to [w]$.
        $g_1, g_2, \ldots, g_d$ are pairwise independent hash functions from $[n] \to \{-1, 1\}$.
        While (stream is not empty) do
          $e_t = (i_t, \Delta_t)$ is the current item
          for $\ell = 1$ to $d$ do
            $C[\ell, h_\ell(i_t)] \leftarrow C[\ell, h_\ell(i_t)] + g_\ell(i_t) \Delta_t$
        endWhile
        For $i \in [n]$ set $\tilde{x}_i = \mathrm{median}\{ g_1(i) C[1, h_1(i)], \ldots, g_d(i) C[d, h_d(i)] \}$.
      Like CountMin, Count Sketch has $wd$ counters. Now counter values can become negative even if $x$ is nonnegative.
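A minimal Python rendering of the Count Sketch pseudocode might look like this. The class name, the prime, and both hash families are assumptions for illustration: the bucket hash is $((a \cdot i + b) \bmod p) \bmod w$ and the sign hash takes the parity of a second affine hash, and both are only approximately pairwise independent.

```python
import random
import statistics


class CountSketch:
    """Count Sketch with d rows of w signed counters."""

    # Assumption: a Mersenne prime larger than any item id used here.
    PRIME = (1 << 61) - 1

    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        draw = lambda: (rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
        self.h_coeffs = [draw() for _ in range(d)]  # bucket hashes h_l
        self.g_coeffs = [draw() for _ in range(d)]  # sign hashes g_l
        self.C = [[0] * w for _ in range(d)]

    def _h(self, row, i):
        a, b = self.h_coeffs[row]
        return ((a * i + b) % self.PRIME) % self.w

    def _g(self, row, i):
        a, b = self.g_coeffs[row]
        # Map the parity of an affine hash to a sign in {-1, +1}.
        return 1 if ((a * i + b) % self.PRIME) % 2 == 0 else -1

    def update(self, i, delta=1):
        """Process (i, delta): add g_l(i) * delta to one counter per row."""
        for row in range(self.d):
            self.C[row][self._h(row, i)] += self._g(row, i) * delta

    def estimate(self, i):
        """Return the median over rows of g_l(i) * C[l, h_l(i)]."""
        return statistics.median(
            self._g(row, i) * self.C[row][self._h(row, i)] for row in range(self.d)
        )


cs = CountSketch(w=16, d=5)
cs.update(3, 4)
cs.update(3, -1)
print(cs.estimate(3))
```

Using an odd $d$ keeps the median equal to one of the row estimates; with an even $d$, `statistics.median` averages the two middle values.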

  22. Intuition
      Each hash function $h_\ell$ spreads the elements across $w$ buckets.
      The hash function $g_\ell$ induces cancellations (inspired by the $F_2$ estimation algorithm).
      Since the answer may be negative even if $x \ge 0$, we take the median.
      Exercise: Show that Count Sketch is also a linear sketch.

  24. Count Sketch Analysis
      Lemma: Let $d \ge 4 \log \frac{1}{\delta}$ and $w > \frac{3}{\epsilon^2}$. Then for any fixed $i \in [n]$, $\mathrm{E}[\tilde{x}_i] = x_i$ and $\Pr[|\tilde{x}_i - x_i| \ge \epsilon \|x\|_2] \le \delta$.
      Comparison to CountMin:
      The error guarantee is with respect to $\|x\|_2$ instead of $\|x\|_1$. For $x \ge 0$, $\|x\|_2 \le \|x\|_1$, and in some cases $\|x\|_2 \ll \|x\|_1$.
      Space increases to $O(\frac{1}{\epsilon^2} \log n)$ counters from $O(\frac{1}{\epsilon} \log n)$ counters.
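The gap between the two norms is easy to see on two extreme frequency vectors. The vectors `flat` and `spiky` below are made-up examples, not data from the lecture.

```python
import math


def l1_norm(x):
    return sum(abs(v) for v in x)


def l2_norm(x):
    return math.sqrt(sum(v * v for v in x))


flat = [1] * 10_000                # 10,000 distinct items, each seen once
spiky = [10_000] + [0] * 9_999     # one item seen 10,000 times

for name, x in (("flat", flat), ("spiky", spiky)):
    print(name, l1_norm(x), l2_norm(x))
```

On the flat vector the two norms differ by a factor of $\sqrt{n}$, so an $\epsilon \|x\|_2$ error bound is much tighter there; on the spiky vector the norms coincide and the two guarantees are comparable.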

  27. Analysis
      Fix an $i \in [n]$. Let $Z_\ell = g_\ell(i) C[\ell, h_\ell(i)]$.
      For $i' \in [n]$ let $Y_{i'}$ be the indicator random variable that is 1 if $h_\ell(i) = h_\ell(i')$; that is, $i$ and $i'$ collide in $h_\ell$.
      For $i' \ne i$, $\mathrm{E}[Y_{i'}] = \mathrm{E}[Y_{i'}^2] = 1/w$ from pairwise independence of $h_\ell$.
      $Z_\ell = g_\ell(i) C[\ell, h_\ell(i)] = g_\ell(i) \sum_{i'} g_\ell(i') x_{i'} Y_{i'}$
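The extracted slides stop at this identity. A standard way to continue, written here as a sketch and assuming that $g_\ell$ is chosen independently of $h_\ell$ and that $\|x\|_2 > 0$, is to separate the $i' = i$ term and then bound the mean and variance of a single row:

```latex
\begin{align*}
Z_\ell &= x_i + \sum_{i' \ne i} g_\ell(i)\, g_\ell(i')\, x_{i'}\, Y_{i'}
  && \text{since } g_\ell(i)^2 = 1 \text{ and } Y_i = 1, \\
\mathrm{E}[Z_\ell] &= x_i
  && \text{since } \mathrm{E}[g_\ell(i)\, g_\ell(i')] = 0 \text{ for } i' \ne i, \\
\mathrm{Var}[Z_\ell] &= \sum_{i' \ne i} x_{i'}^2\, \mathrm{E}[Y_{i'}^2]
  \le \frac{\|x\|_2^2}{w}, \\
\Pr\big[|Z_\ell - x_i| \ge \epsilon \|x\|_2\big]
  &\le \frac{\mathrm{Var}[Z_\ell]}{\epsilon^2 \|x\|_2^2}
  \le \frac{1}{\epsilon^2 w} < \frac{1}{3}
  && \text{for } w > 3/\epsilon^2.
\end{align*}
```

The cross terms in the variance vanish by pairwise independence of $g_\ell$, and the last line is Chebyshev's inequality. Taking the median over the $d$ independent rows then drives the failure probability of the final estimate down to $\delta$.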
