CountMin and Count Sketches Lecture 10 February 14, 2019 Chandra

CS 498ABD: Algorithms for Big Data, Spring 2019 CountMin and Count Sketches Lecture 10 February 14, 2019 Chandra (UIUC) CS498ABD 1 Spring 2019 1 / 18

Heavy Hitters Problem
Heavy Hitters Problem: find all items $i$ such that $f_i > m/k$ for some fixed $k$.
Heavy hitters are very frequent items.
We saw the deterministic Misra-Gries algorithm, which uses $O(k)$ space and finds the heavy hitters, assuming they exist.
A two-pass algorithm correctly identifies the heavy hitters.
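Misra-Gries is only named on this slide, not spelled out, so the following is a minimal Python sketch of the usual textbook version, assuming $k - 1$ counters and an exact second pass over a stored stream. The function names `misra_gries` and `heavy_hitters` are hypothetical.

```python
def misra_gries(stream, k):
    """First pass: keep at most k - 1 counters; every item with f_i > m/k survives."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all counters and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


def heavy_hitters(stream, k):
    """Second pass: count the surviving candidates exactly and keep those with f_i > m/k."""
    candidates = misra_gries(stream, k)
    m = len(stream)
    exact = {c: 0 for c in candidates}
    for item in stream:
        if item in exact:
            exact[item] += 1
    return {c for c, f in exact.items() if f > m / k}
```

The first pass may keep candidates that are not heavy hitters; the second pass is what removes them.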

(Strict) Turnstile Model
Turnstile model: each update is $(i_j, \Delta_j)$, where $\Delta_j$ can be positive or negative.
Strict turnstile: need $x_i \ge 0$ at all times for all $i$.
In terms of frequent items, we want additive error to $x_i$.
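To make the model concrete, here is a small Python sketch that applies turnstile updates to an explicit frequency vector and optionally enforces the strict condition. It assumes 0-based item indices (the slides use $[n]$), and the function name `apply_turnstile` is hypothetical.

```python
def apply_turnstile(n, updates, strict=True):
    """Apply updates (i, delta) to a length-n vector x, starting from all zeros.

    With strict=True, raise ValueError as soon as some x[i] drops below zero.
    """
    x = [0] * n
    for i, delta in updates:
        x[i] += delta
        if strict and x[i] < 0:
            raise ValueError(f"strict turnstile violated: x[{i}] = {x[i]}")
    return x
```

A streaming algorithm cannot afford to store $x$ explicitly; the vector is kept here only to show what the sketches below are approximating.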

Basic Hashing/Sampling Idea
Heavy Hitters Problem: find all items $i$ such that $f_i > m/k$.
Let $b_1, b_2, \ldots, b_k$ be the $k$ heavy hitters.
Suppose we pick $h : [n] \to [ck]$ for some $c > 1$.
$h$ spreads $b_1, \ldots, b_k$ among the buckets ($k$ balls into $ck$ bins).
In the ideal situation each bucket can be used to count a separate heavy hitter.

Part I: CountMin Sketch

CountMin Sketch [Cormode-Muthukrishnan]
CountMin-Sketch($w$, $d$):
    $h_1, h_2, \ldots, h_d$ are pairwise independent hash functions from $[n] \to [w]$.
    While (stream is not empty) do
        $e_t = (i_t, \Delta_t)$ is the current item
        for $\ell = 1$ to $d$ do
            $C[\ell, h_\ell(i_t)] \leftarrow C[\ell, h_\ell(i_t)] + \Delta_t$
    endWhile
    For $i \in [n]$ set $\tilde{x}_i = \min_{\ell=1}^{d} C[\ell, h_\ell(i)]$.
Counter $C[\ell, j]$ simply counts the sum of all $x_i$ such that $h_\ell(i) = j$. That is,
$$C[\ell, j] = \sum_{i : h_\ell(i) = j} x_i.$$
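The pseudocode above can be sketched in Python as follows. This is an illustration under assumptions, not the lecture's implementation: the hash functions use the common $((a i + b) \bmod p) \bmod w$ construction with the Mersenne prime $p = 2^{61} - 1$, which is only approximately pairwise independent, and the class name, method names, and `seed` parameter are hypothetical.

```python
import random


class CountMinSketch:
    P = (1 << 61) - 1  # assumed prime modulus for the hash family

    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.C = [[0] * w for _ in range(d)]  # d rows of w counters
        # One (a, b) pair per row defines h_l(i) = ((a*i + b) mod P) mod w.
        self.ab = [(rng.randrange(1, self.P), rng.randrange(self.P)) for _ in range(d)]

    def _h(self, l, i):
        a, b = self.ab[l]
        return ((a * i + b) % self.P) % self.w

    def update(self, i, delta=1):
        """Process one stream item (i, delta)."""
        for l in range(self.d):
            self.C[l][self._h(l, i)] += delta

    def estimate(self, i):
        """Return the minimum over rows of the counter that i hashes to."""
        return min(self.C[l][self._h(l, i)] for l in range(self.d))
```

In the strict turnstile model `estimate(i)` never falls below the true $x_i$, because every counter that $i$ hashes to includes $x_i$ plus nonnegative contributions from colliding items.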

Intuition
Suppose there are $k$ heavy hitters $b_1, b_2, \ldots, b_k$.
Consider $b_i$: hash function $h_\ell$ sends $b_i$ to $h_\ell(b_i)$.
$C[\ell, h_\ell(b_i)]$ counts $x_{b_i}$ and also the other items that hash to the same bucket $h_\ell(b_i)$, so we always overcount (since we are in the strict turnstile model).
Repeating with many hash functions and taking the minimum is the right thing to do: for $b_i$ the goal is to avoid other heavy hitters colliding with it.

Property of CountMin Sketch
Lemma: Let $d = \Omega(\log \frac{1}{\delta})$ and $w > \frac{2}{\epsilon}$. Then for any fixed $i \in [n]$, $x_i \le \tilde{x}_i$ and
$$\Pr[\tilde{x}_i \ge x_i + \epsilon \|x\|_1] \le \delta.$$
Unlike Misra-Gries, we have overestimates.
Actual items are not stored (it requires work to recover the heavy hitters).
Works in the strict turnstile model and hence can handle deletions.
Space usage is $O(\frac{\log(1/\delta)}{\epsilon})$ counters and hence $O(\frac{\log(1/\delta)}{\epsilon} \log m)$ bits.

Analysis
Fix $\ell$: $h_\ell(i)$ is the bucket that $h_\ell$ hashes $i$ to.
$Z_\ell = C[\ell, h_\ell(i)]$ is the value of the counter that $i$ is hashed to.
$$E[Z_\ell] = x_i + \sum_{i' \ne i} \Pr[h_\ell(i') = h_\ell(i)] \, x_{i'}$$
By pairwise independence,
$$E[Z_\ell] = x_i + \sum_{i' \ne i} x_{i'} / w \le x_i + \epsilon \|x\|_1 / 2.$$
Via Markov applied to $Z_\ell - x_i$ (we use strict turnstile here, so that $Z_\ell - x_i \ge 0$),
$$\Pr[Z_\ell \ge x_i + \epsilon \|x\|_1] \le 1/2.$$
Since the $d$ hash functions are independent,
$$\Pr[\min_\ell Z_\ell \ge x_i + \epsilon \|x\|_1] \le 1/2^d \le \delta.$$

Summarizing
Lemma: Let $d = \Omega(\log \frac{1}{\delta})$ and $w > \frac{2}{\epsilon}$. Then for any fixed $i \in [n]$, $x_i \le \tilde{x}_i$ and
$$\Pr[\tilde{x}_i \ge x_i + \epsilon \|x\|_1] \le \delta.$$
Choose $d = 2 \log_2 n$ and $w = 2/\epsilon$: we have
$$\Pr[\tilde{x}_i \ge x_i + \epsilon \|x\|_1] \le 1/2^d = 1/n^2.$$
By the union bound, with probability at least $(1 - 1/n)$, for all $i \in [n]$,
$$\tilde{x}_i \le x_i + \epsilon \|x\|_1.$$
Total space is $O(\frac{1}{\epsilon} \log n)$ counters and hence $O(\frac{1}{\epsilon} \log n \log m)$ bits.
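As a worked instance of this parameter choice, the helper below computes the sketch dimensions from $\epsilon$ and $n$. It assumes $w = \lceil 2/\epsilon \rceil$ and takes the depth as $d = \lceil 2 \log_2 n \rceil$, so that the per-item failure bound $2^{-d}$ is at most $1/n^2$. The function name `countmin_parameters` is hypothetical.

```python
import math


def countmin_parameters(eps, n):
    """Return (w, d, number of counters) for additive error eps * ||x||_1 on all n items."""
    w = math.ceil(2 / eps)           # width: expected excess per row is at most eps*||x||_1 / 2
    d = math.ceil(2 * math.log2(n))  # depth: 2^(-d) <= 1/n^2 per item
    return w, d, w * d
```

For example, $\epsilon = 0.25$ and $n = 1024$ give $w = 8$ and $d = 20$, that is, 160 counters.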

CountMin as a Linear Sketch
Question: Why is CountMin a linear sketch?
Recall that for $1 \le \ell \le d$ and $1 \le s \le w$:
$$C[\ell, s] = \sum_{i : h_\ell(i) = s} x_i$$
Thus, once hash function $h_\ell$ is fixed, $C[\ell, s] = \langle u, x \rangle$, where $u$ is a row vector in $\{0, 1\}^n$ such that $u_i = 1$ if $h_\ell(i) = s$ and $u_i = 0$ otherwise.
Thus, once the hash functions are fixed, the counter values can be written as $Mx$, where $M \in \{0, 1\}^{wd \times n}$ is the sketch matrix.
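The matrix view can be checked directly on a small universe. The Python sketch below builds $M$ explicitly and computes $Mx$. It is illustrative only: it assumes 0-based indices, the $((a i + b) \bmod p) \bmod w$ hash construction (only approximately pairwise independent), and hypothetical helper names `make_hashes`, `sketch_matrix`, and `matvec`.

```python
import random

P = (1 << 61) - 1  # assumed prime modulus for the hash family


def make_hashes(d, w, seed=0):
    """Return d hash functions from item indices to buckets 0..w-1."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(d)]
    return [lambda i, a=a, b=b: ((a * i + b) % P) % w for a, b in params]


def sketch_matrix(hashes, w, n):
    """Build the (w*d) x n 0/1 matrix M; row l*w + s corresponds to counter C[l, s]."""
    M = [[0] * n for _ in range(len(hashes) * w)]
    for l, h in enumerate(hashes):
        for i in range(n):
            M[l * w + h(i)][i] = 1
    return M


def matvec(M, x):
    """Compute M x, i.e. all counter values for frequency vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]
```

Because the sketch is the linear map $x \mapsto Mx$, sketches of two streams built with the same hash functions can be added counter by counter to obtain the sketch of the combined stream.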

Part II: Count Sketch

Count Sketch [Charikar-Chen-Farach-Colton]
Count-Sketch($w$, $d$):
    $h_1, h_2, \ldots, h_d$ are pairwise independent hash functions from $[n] \to [w]$.
    $g_1, g_2, \ldots, g_d$ are pairwise independent hash functions from $[n] \to \{-1, 1\}$.
    While (stream is not empty) do
        $e_t = (i_t, \Delta_t)$ is the current item
        for $\ell = 1$ to $d$ do
            $C[\ell, h_\ell(i_t)] \leftarrow C[\ell, h_\ell(i_t)] + g_\ell(i_t) \Delta_t$
    endWhile
    For $i \in [n]$ set $\tilde{x}_i = \mathrm{median}\{ g_1(i) C[1, h_1(i)], \ldots, g_d(i) C[d, h_d(i)] \}$.
Like CountMin, Count Sketch has $wd$ counters. Now counter values can become negative even if $x$ is nonnegative.
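A minimal Python sketch of the pseudocode above, under the same caveats as any illustration: both hash families use the $((a i + b) \bmod p)$ construction reduced mod $w$ (for $h_\ell$) or mod 2 (for $g_\ell$), which is only approximately pairwise independent, and the class name, method names, and `seed` parameter are hypothetical.

```python
import random
from statistics import median


class CountSketch:
    P = (1 << 61) - 1  # assumed prime modulus for both hash families

    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.C = [[0] * w for _ in range(d)]  # d rows of w signed counters
        self.h_ab = [(rng.randrange(1, self.P), rng.randrange(self.P)) for _ in range(d)]
        self.g_ab = [(rng.randrange(1, self.P), rng.randrange(self.P)) for _ in range(d)]

    def _h(self, l, i):
        a, b = self.h_ab[l]
        return ((a * i + b) % self.P) % self.w

    def _g(self, l, i):
        a, b = self.g_ab[l]
        return 1 if ((a * i + b) % self.P) % 2 == 0 else -1

    def update(self, i, delta=1):
        """Process one stream item (i, delta), adding the signed value in every row."""
        for l in range(self.d):
            self.C[l][self._h(l, i)] += self._g(l, i) * delta

    def estimate(self, i):
        """Return the median over rows of g_l(i) * C[l, h_l(i)]."""
        return median(self._g(l, i) * self.C[l][self._h(l, i)] for l in range(self.d))
```

Choosing an odd $d$ keeps the median equal to one of the row estimates; with an even $d$, `statistics.median` averages the two middle values.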

Intuition
Each hash function $h_\ell$ spreads the elements across $w$ buckets.
The hash function $g_\ell$ induces cancellations (inspired by the $F_2$ estimation algorithm).
Since the answer may be negative even if $x \ge 0$, we take the median.
Exercise: Show that Count Sketch is also a linear sketch.

Count Sketch Analysis
Lemma: Let $d \ge 4 \log \frac{1}{\delta}$ and $w > \frac{3}{\epsilon^2}$. Then for any fixed $i \in [n]$, $E[\tilde{x}_i] = x_i$ and
$$\Pr[|\tilde{x}_i - x_i| \ge \epsilon \|x\|_2] \le \delta.$$
Comparison to CountMin:
The error guarantee is with respect to $\|x\|_2$ instead of $\|x\|_1$. For $x \ge 0$, $\|x\|_2 \le \|x\|_1$, and in some cases $\|x\|_2 \ll \|x\|_1$.
Space increases to $O(\frac{1}{\epsilon^2} \log n)$ counters from $O(\frac{1}{\epsilon} \log n)$ counters.
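A quick numeric illustration of the gap between the two norms, using hypothetical helper names `l1` and `l2`: for a flat vector of $n$ ones, $\|x\|_1 = n$ while $\|x\|_2 = \sqrt{n}$, whereas for a vector with a single nonzero entry the two norms coincide.

```python
import math


def l1(x):
    """Sum of absolute values of the entries of x."""
    return sum(abs(v) for v in x)


def l2(x):
    """Euclidean norm of x."""
    return math.sqrt(sum(v * v for v in x))


flat = [1] * 10000            # l1 is 10000, l2 is 100
spike = [100] + [0] * 9999    # l1 and l2 are both 100
```

On the flat vector, an additive error of $\epsilon \|x\|_2$ is a factor of 100 smaller than $\epsilon \|x\|_1$ for the same $\epsilon$.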

Analysis
Fix an $i \in [n]$. Let $Z_\ell = g_\ell(i) C[\ell, h_\ell(i)]$.
For $i' \in [n]$ let $Y_{i'}$ be the indicator random variable that is 1 if $h_\ell(i) = h_\ell(i')$; that is, $i$ and $i'$ collide in $h_\ell$.
For $i' \ne i$, $E[Y_{i'}] = E[Y_{i'}^2] = 1/w$ from pairwise independence of $h_\ell$ (and $Y_i = 1$).
$$Z_\ell = g_\ell(i) C[\ell, h_\ell(i)] = g_\ell(i) \sum_{i'} g_\ell(i') x_{i'} Y_{i'}$$
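The transcript stops at the expression for $Z_\ell$. One standard way to finish the single-row bound, assuming additionally that $g_\ell$ is chosen independently of $h_\ell$, is sketched below; the constants match the lemma's requirement $w > 3/\epsilon^2$.

```latex
% Expectation: the i' = i term contributes g_\ell(i)^2 x_i = x_i, and
% E[g_\ell(i) g_\ell(i')] = 0 for i' \ne i by pairwise independence of g_\ell.
E[Z_\ell] = x_i + \sum_{i' \ne i} E[g_\ell(i) g_\ell(i')] \, x_{i'} \, E[Y_{i'}] = x_i

% Variance: cross terms vanish because E[g_\ell(i') g_\ell(i'')] = 0 for i' \ne i''.
\mathrm{Var}[Z_\ell] = \sum_{i' \ne i} x_{i'}^2 \, E[Y_{i'}^2]
                     = \frac{1}{w} \sum_{i' \ne i} x_{i'}^2
                     \le \frac{\|x\|_2^2}{w}

% Chebyshev, using w > 3/\epsilon^2:
\Pr\big[\, |Z_\ell - x_i| \ge \epsilon \|x\|_2 \,\big]
    \le \frac{\mathrm{Var}[Z_\ell]}{\epsilon^2 \|x\|_2^2}
    \le \frac{1}{\epsilon^2 w} < \frac{1}{3}
```

Each row is therefore accurate with probability greater than $2/3$. The median over $d$ independent rows fails only if at least half of the rows fail, and a Chernoff bound makes that probability at most $\delta$ once $d$ is a sufficiently large multiple of $\log(1/\delta)$.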
