Lecture 9


  1. Sublinear Algorithms, Lecture 9
     Last time
     • Approximate counting
     • Estimation of the 2nd moment
     • Linear sketching
     Today
     • Multipurpose sketches
     • Count-min and count-sketch
     • Range queries, heavy hitters, quantiles
     10/1/2020. Sofya Raskhodnikova; Boston University

  2. Multipurpose Sketches: Problems
     Input: a stream $a_1, a_2, \dots, a_m \in [n]^m$.
     • The frequency vector of the stream is $f = (f_1, \dots, f_n)$, where $f_i$ is the number of times $i$ appears in the stream.
     Goal: to maintain data structures that can answer the following queries.
     • Point Query: For $i \in [n]$, estimate $f_i$.
     • Range Query: For $i, j \in [n]$, estimate $f_i + f_{i+1} + \dots + f_j$.
     • Quantile Query: For $\phi \in [0, 1]$, find $j$ with $f_1 + \dots + f_j \approx \phi m$.
     • Heavy Hitters Query: For $\phi \in [0, 1]$, find all $i$ with $f_i \ge \phi m$.
     Desired accuracy: $\pm \varepsilon m$ with error probability $\delta$.
     Based on Andrew McGregor’s slides: https://people.cs.umass.edu/~mcgregor/711S18/vectors-3.pdf

  3. Initial Solution to Point Queries
     • We could maintain the whole frequency vector $f_1, \dots, f_n$ (exact, but it uses $n$ counters).
     • Then, on query $i$, we can output $f_i$.
     Idea: Group counts for some numbers together.
     [Figure: the frequency values 3 7 2 1 5 4 1 1 4 6 7 1 … grouped into buckets with counters $c_1, c_2, c_3, \dots, c_b$]
     If $i$ falls into bucket $j$, then $f_i \le c_j$.
     Point Query Algorithm (initial version)
     1. Sample a hash function $h : [n] \to [b]$ from a 2-wise independent family.
     2. Initialize counters $c_1, \dots, c_b$ to 0.
     3. For each element $a$, increment $c_{h(a)}$ by 1.
     4. To answer a point query $i$, return $c_{h(i)}$. (The estimate never underestimates $f_i$.)
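The four steps of the initial point query algorithm can be sketched in Python as follows. This is only an illustrative sketch: the helper names (`sample_hash`, `build_counters`, `point_query`) and the prime `P` are assumptions, not part of the slides, and the hash $x \mapsto ((ax + c) \bmod P) \bmod b$ is a common stand-in that is only approximately 2-wise independent after the reduction mod $b$.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime, assumed to exceed the universe size n


def sample_hash(b, rng):
    """Sample h(x) = ((a*x + c) mod P) mod b, mapping integers to buckets 0..b-1."""
    a = rng.randrange(1, P)
    c = rng.randrange(P)
    return lambda x: ((a * x + c) % P) % b


def build_counters(stream, b, rng):
    """Steps 1-3: sample h, zero the b counters, add 1 to c[h(a)] per element a."""
    h = sample_hash(b, rng)
    counters = [0] * b
    for a in stream:
        counters[h(a)] += 1
    return h, counters


def point_query(h, counters, i):
    """Step 4: return the counter of the bucket that i hashes to."""
    return counters[h(i)]


stream = [3, 7, 3, 1, 3, 7, 5]
h, counters = build_counters(stream, b=4, rng=random.Random(0))
estimate = point_query(h, counters, 3)  # at least 3, the true count of item 3
```

Because every occurrence of item 3 lands in the same bucket, the returned counter can only be inflated by colliding items, never reduced.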

  4. Initial Solution to Point Queries: Analysis
     Recall the Point Query Algorithm (initial version): hash $h : [n] \to [b]$ from a 2-wise independent family, counters $c_1, \dots, c_b$, and a point query $i$ returns $c_{h(i)}$, which never underestimates.
     • Fix $i^* \in [n]$.
     • Let $Z = c_{h(i^*)} - f_{i^*}$ be the overestimation error.
     • For all $i \ne i^*$, let $X_i = 1$ if $h(i) = h(i^*)$ and $X_i = 0$ otherwise. By 2-wise independence,
       $\mathbb{E}[X_i] = \Pr[h(i) = h(i^*)] = \frac{1}{b}$.
     • $Z = \sum_{i \ne i^*} X_i \cdot f_i$, so by linearity of expectation,
       $\mathbb{E}[Z] = \sum_{i \ne i^*} \mathbb{E}[X_i] \cdot f_i = \frac{1}{b} \sum_{i \ne i^*} f_i \le \frac{m}{b}$.
     • By Markov’s inequality, if $b = 2/\varepsilon$ then
       $\Pr[Z \ge \varepsilon m] \le \frac{\mathbb{E}[Z]}{\varepsilon m} \le \frac{1}{\varepsilon b} \le \frac{1}{2}$.

  5. Count-Min Sketch [Cormode Muthukrishnan 03]
     Point Query Algorithm
     1. Set $t = \log_2(1/\delta)$ and $b = 2/\varepsilon$.
     2. Sample $t$ hash functions $h_j : [n] \to [b]$ from a 2-wise independent family.
     3. Initialize $tb$ counters $c_{j,k}$ to 0.
     4. For each element $a$ and each $j \in [t]$, increment $c_{j,h_j(a)}$ by 1.
     5. To answer a point query $i$, return $\tilde{f}_i = \min_{j \in [t]} c_{j,h_j(i)}$. (The estimate never underestimates $f_i$.)
     • Correctness:
       $\Pr[f_i \le \tilde{f}_i \le f_i + \varepsilon m] = 1 - \Pr[\text{all } t \text{ hash functions overestimate by more than } \varepsilon m] \ge 1 - \left(\frac{1}{2}\right)^t = 1 - \delta$,
       since the hash functions are chosen independently.
     • Space: $O(t(\log n + \log b))$ for the hash functions + $O(tb \log m)$ for the counters.
       Total: $O\left((\log n + \log m) \cdot \frac{1}{\varepsilon} \log \frac{1}{\delta}\right)$
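A compact Python sketch of the Count-Min point query algorithm above might look like this. The class name `CountMin`, the `seed` parameter, and the prime `P` are illustrative assumptions. The hash family is the usual $((ax + c) \bmod P) \bmod b$ construction, used here as an approximation of a 2-wise independent family, and $t$ and $b$ are rounded up to integers.

```python
import math
import random

P = (1 << 61) - 1  # a Mersenne prime, assumed to exceed the universe size n


class CountMin:
    def __init__(self, eps, delta, seed=0):
        self.t = max(1, math.ceil(math.log2(1 / delta)))  # number of rows
        self.b = math.ceil(2 / eps)                       # counters per row
        rng = random.Random(seed)
        # One independently sampled (a, c) pair per row.
        self.coeffs = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(self.t)]
        self.rows = [[0] * self.b for _ in range(self.t)]

    def _h(self, j, x):
        a, c = self.coeffs[j]
        return ((a * x + c) % P) % self.b

    def update(self, x, count=1):
        # Increment c[j][h_j(x)] in every row j.
        for j in range(self.t):
            self.rows[j][self._h(j, x)] += count

    def query(self, x):
        # Every row overestimates, so the minimum is the tightest estimate.
        return min(self.rows[j][self._h(j, x)] for j in range(self.t))


stream = [4, 9, 4, 4, 2, 9, 17, 4]
cm = CountMin(eps=0.25, delta=0.125, seed=1)
for a in stream:
    cm.update(a)
estimate = cm.query(4)  # never below 4, the true count of item 4
```

With `eps=0.25` and `delta=0.125` the sketch has 3 rows of 8 counters each, matching $t = \log_2(1/\delta)$ and $b = 2/\varepsilon$.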

  6. Multipurpose Sketches: Problems
     Input: a stream $a_1, a_2, \dots, a_m \in [n]^m$.
     • The frequency vector of the stream is $f = (f_1, \dots, f_n)$, where $f_i$ is the number of times $i$ appears in the stream.
     Goal: to maintain data structures that can answer the following queries.
     • Point Query: For $i \in [n]$, estimate $f_i$.
     • Range Query: For $i, j \in [n]$, estimate $f_i + f_{i+1} + \dots + f_j$. Denote this sum by $f_{i,j}$.
     • Quantile Query: For $\phi \in [0, 1]$, find $j$ with $f_1 + \dots + f_j \approx \phi m$.
     • Heavy Hitters Query: For $\phi \in [0, 1]$, find all $i$ with $f_i \ge \phi m$.
     Desired accuracy: $\pm \varepsilon m$ with error probability $\delta$.

  7. Range Queries
     • We could estimate $f_{i,j}$ by $\tilde{f}_i + \tilde{f}_{i+1} + \dots + \tilde{f}_j$.
     • But errors add up: we would need too much space to keep accurate enough estimates.
     Idea: We could estimate counts for some intervals directly by grouping.
     [Figure: the positions 1 2 3 4 5 6 7 8 with a query range $i, \dots, j$ grouped into intervals]
     How many intervals do we need so that every query interval $[i, j]$ is a sum of $O(\log n)$ stored intervals?

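  8. Dyadic Intervals
     [Figure: binary tree of dyadic intervals over $[n]$ with $\lg n + 1$ levels; its rows from top to bottom are listed here]
     Root: $[1, n]$
     Next level: $[1, n/2]$, $[n/2 + 1, n]$
     Next level: $[1, n/4]$, $[n/4 + 1, n/2]$, $[n/2 + 1, 3n/4]$, $[3n/4 + 1, n]$
     …
     Leaves: $1, 2, \dots, n - 1, n$
     • Exercise: Each interval $[i, j]$ is a sum of at most $2 \lg n$ dyadic intervals.
     • Such a representation of an interval is its dyadic decomposition.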
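One way to compute a dyadic decomposition is a greedy left-to-right scan, sketched below in Python. The function name `dyadic_decomposition` and the 1-indexed inclusive interval convention are assumptions chosen to match the slides, and $n$ is assumed to be a power of two.

```python
def dyadic_decomposition(i, j, n):
    """Split [i, j] (1-indexed, inclusive) into disjoint dyadic intervals of [1, n]."""
    assert 1 <= i <= j <= n
    pieces = []
    lo = i
    while lo <= j:
        # Grow the block while the doubled block is still aligned at lo
        # and still fits inside [lo, j].
        size = 1
        while (lo - 1) % (2 * size) == 0 and lo + 2 * size - 1 <= j:
            size *= 2
        pieces.append((lo, lo + size - 1))
        lo += size
    return pieces


parts = dyadic_decomposition(2, 7, 8)  # [(2, 2), (3, 4), (5, 6), (7, 7)]
```

For $n = 8$ the range $[2, 7]$ splits into four dyadic pieces, within the bound $2 \lg 8 = 6$ from the exercise.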

  9. Count-Min Strikes Back
     Range Query Algorithm
     1. Construct $\lg n + 1$ Count-Min sketches, one for each level, such that for all intervals $I$ at that level, our estimate $\tilde{f}_I$ for $f_I$ satisfies
        $\Pr[f_I \le \tilde{f}_I \le f_I + \varepsilon m] \ge 1 - \delta$.
     2. To answer a range query $[i, j]$, let $I_1, \dots, I_k$ be its dyadic decomposition. Return $\tilde{f}_{i,j} = \tilde{f}_{I_1} + \dots + \tilde{f}_{I_k}$.
     • Correctness: $\Pr[f_{i,j} \le \tilde{f}_{i,j} \le f_{i,j} + \varepsilon m (2 \lg n)] \ge 1 - \delta (2 \lg n)$.
     • Space: Multiply the old space complexity by $\log n$ and divide $\varepsilon$ and $\delta$ by $\log n$:
       $O\left(\log^2 n \, (\log n + \log m) \cdot \frac{1}{\varepsilon} \log \frac{\log n}{\delta}\right)$
     • Quantile Query: For $\phi \in [0, 1]$, find $j$ with $f_{1,j} \approx \phi m$.
       Approximate Median: Find $j$ such that $f_{1,j} \ge \frac{m}{2} - \varepsilon m$ and $f_{1,j-1} \le \frac{m}{2} + \varepsilon m$.
       We can approximate the median via binary search over range queries.
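The range query algorithm and the binary search for quantiles can be sketched together in Python as below. The class names `CountMin` and `DyadicRangeSketch`, the direct `t` and `b` parameters, and the hash construction are illustrative assumptions. The universe size $n$ is assumed to be a power of two, and the estimated prefix sums are not guaranteed to be monotone, so the binary search is a heuristic sketch of the idea rather than a tuned implementation.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime, assumed to exceed the universe size n


class CountMin:
    def __init__(self, t, b, rng):
        self.b = b
        self.coeffs = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(t)]
        self.rows = [[0] * b for _ in range(t)]

    def _cells(self, x):
        for row, (a, c) in zip(self.rows, self.coeffs):
            yield row, ((a * x + c) % P) % self.b

    def update(self, x):
        for row, k in self._cells(x):
            row[k] += 1

    def query(self, x):
        return min(row[k] for row, k in self._cells(x))


class DyadicRangeSketch:
    def __init__(self, n, t, b, seed=0):
        assert n >= 1 and n & (n - 1) == 0  # n must be a power of two
        self.n = n
        rng = random.Random(seed)
        # sketches[l] summarizes the dyadic intervals of length 2**l.
        self.sketches = [CountMin(t, b, rng) for _ in range(n.bit_length())]

    def update(self, x):  # x is an item in 1..n
        for level, sk in enumerate(self.sketches):
            sk.update((x - 1) >> level)

    def range_query(self, i, j):
        total, lo = 0, i
        while lo <= j:
            level = 0  # grow the aligned block starting at lo while it fits
            while (lo - 1) % (2 << level) == 0 and lo + (2 << level) - 1 <= j:
                level += 1
            total += self.sketches[level].query((lo - 1) >> level)
            lo += 1 << level
        return total

    def quantile(self, phi, m):
        # Binary search for a j whose estimated prefix count reaches phi * m.
        lo, hi = 1, self.n
        while lo < hi:
            mid = (lo + hi) // 2
            if self.range_query(1, mid) >= phi * m:
                hi = mid
            else:
                lo = mid + 1
        return lo


stream = [1, 2, 2, 3, 5, 5, 5, 8]
sk = DyadicRangeSketch(n=8, t=3, b=16, seed=0)
for a in stream:
    sk.update(a)
estimate = sk.range_query(2, 5)           # never below 6, the true count in [2, 5]
median = sk.quantile(0.5, len(stream))    # approximate median position
```

Each update touches one counter per row in every level, so the update cost grows with $\lg n + 1$ times the cost of a single Count-Min update.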

  10. Count-Min Strikes Back (Part 2)
     Heavy Hitters Query: For $\phi \in (\varepsilon, 1)$, find a set $S$ that
     – includes all $i$ with $f_i \ge \phi m$
     – excludes all $j$ with $f_j \le (\phi - \varepsilon) m$
     Heavy Hitters Algorithm
     1. Construct $\lg n + 1$ Count-Min sketches for the levels of the dyadic tree, as before.
     2. To answer query $\phi$, mark the root. Going level by level from the root, mark children $I$ of marked nodes if $\tilde{f}_I \ge \phi m$.
     3. Return all marked leaves.
     • Correctness: If $f_i \ge \phi m$, then for all ancestors $I$ of the leaf $i$, $\tilde{f}_I \ge f_I \ge \phi m$.
     • If we ensure that $\Pr[\text{point query overestimates by} > \varepsilon m] \le \delta / n$, then, by a union bound, all point queries are correct with probability $\ge 1 - \delta$.
     • There are at most $1/\phi$ indices $i$ with $f_i \ge \phi m$. Thus, $O(\phi^{-1} \log n)$ time suffices for post-processing.
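The level-by-level marking in the heavy hitters algorithm can be sketched in Python as follows. The class names `CountMin` and `DyadicTreeSketch`, the direct `t` and `b` parameters, and the hash construction are assumptions made for illustration, and $n$ is assumed to be a power of two.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime, assumed to exceed the universe size n


class CountMin:
    def __init__(self, t, b, rng):
        self.b = b
        self.coeffs = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(t)]
        self.rows = [[0] * b for _ in range(t)]

    def _cells(self, x):
        for row, (a, c) in zip(self.rows, self.coeffs):
            yield row, ((a * x + c) % P) % self.b

    def update(self, x):
        for row, k in self._cells(x):
            row[k] += 1

    def query(self, x):
        return min(row[k] for row, k in self._cells(x))


class DyadicTreeSketch:
    def __init__(self, n, t, b, seed=0):
        assert n >= 1 and n & (n - 1) == 0  # n must be a power of two
        self.top = n.bit_length() - 1       # lg n, the level of the root
        rng = random.Random(seed)
        # sketches[l] holds counts of the nodes at level l (leaves are level 0).
        self.sketches = [CountMin(t, b, rng) for _ in range(self.top + 1)]
        self.m = 0                          # stream length seen so far

    def update(self, x):  # x is an item in 1..n
        self.m += 1
        for level, sk in enumerate(self.sketches):
            sk.update((x - 1) >> level)

    def heavy_hitters(self, phi):
        threshold = phi * self.m
        marked = [0]  # the root is always marked
        for level in range(self.top - 1, -1, -1):
            sk = self.sketches[level]
            # Keep only children of marked nodes whose estimate passes the threshold.
            marked = [child for node in marked
                      for child in (2 * node, 2 * node + 1)
                      if sk.query(child) >= threshold]
        return [leaf + 1 for leaf in marked]  # back to 1-indexed items


stream = [4] * 5 + [7] * 4 + [1, 2, 3]
tree = DyadicTreeSketch(n=8, t=3, b=16, seed=0)
for a in stream:
    tree.update(a)
heavy = tree.heavy_hitters(phi=0.3)  # always contains 4 and 7
```

Since Count-Min never underestimates, every true heavy hitter survives the descent; items below the threshold can still slip in when hash collisions inflate their estimates.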

  11. CR-Precis: Deterministic Count-Min [Ganguly Majumder 07]
     Use deterministic hash functions: $h_j(a) = a \bmod p_j$, where $p_j$ is the $j$-th prime, for $j \in [t]$.
     Analysis: Fix $i^* \in [n]$. Define $z_1, \dots, z_t$ such that $c_{j,h_j(i^*)} = f_{i^*} + z_j$, that is,
       $z_j = \sum_{i \ne i^* :\, h_j(i) = h_j(i^*)} f_i$.
     • Claim: For each $i \ne i^*$, we have $h_j(i) = h_j(i^*)$ for at most $\log n$ primes $p_j$, by the Chinese Remainder Theorem.
     • Thus,
       $\sum_{j \in [t]} z_j = \sum_{j} \sum_{i \ne i^* :\, h_j(i) = h_j(i^*)} f_i = \sum_{i \ne i^*} \sum_{j :\, h_j(i) = h_j(i^*)} f_i \le \sum_{i} f_i \log n = m \log n$.
     • Hence
       $\tilde{f}_{i^*} = \min_{j \in [t]} c_{j,h_j(i^*)} = \min_{j \in [t]} (f_{i^*} + z_j) = f_{i^*} + \min_{j \in [t]} z_j \le f_{i^*} + \frac{m \log n}{t}$.
     • We set $t = \frac{\log n}{\varepsilon}$ to get $f_i \le \tilde{f}_i \le f_i + \varepsilon m$.
     • Requires keeping at most $t \cdot p_t = \tilde{O}\left(\frac{\log^2 n}{\varepsilon^2}\right)$ counters, since $p_t = O(t \log t)$.
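Because the hash functions here are deterministic, the structure is easy to check by hand. The following Python sketch uses the first $t$ primes as moduli; the names `first_primes` and `CRPrecis` are illustrative assumptions, and $t$ is passed directly rather than derived from $\varepsilon$ and $n$.

```python
def first_primes(t):
    """Return the first t primes by trial division (fine for small t)."""
    primes = []
    candidate = 2
    while len(primes) < t:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


class CRPrecis:
    def __init__(self, t):
        self.primes = first_primes(t)
        # Row j has p_j counters, one per residue class mod p_j.
        self.rows = [[0] * p for p in self.primes]

    def update(self, a, count=1):
        for row, p in zip(self.rows, self.primes):
            row[a % p] += count

    def query(self, i):
        # Same min rule as Count-Min, with h_j(i) = i mod p_j.
        return min(row[i % p] for row, p in zip(self.rows, self.primes))


sketch = CRPrecis(t=3)  # moduli 2, 3, 5
for a in [1, 1, 7, 4]:
    sketch.update(a)
estimate = sketch.query(1)  # rows give 3, 4, 2, so the estimate is 2
```

In this small stream items 1, 7 and 4 all collide mod 3, but modulus 5 separates item 1 from the others, so the minimum recovers its exact count.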

  12. Count-Sketch: Count-Min+AMS combined
     Count-Sketch
     1. In addition to $h_j : [n] \to [b]$, use hash functions $r_j : [n] \to \{-1, 1\}$.
     2. Maintain $tb$ counters $c_{j,k} = \sum_{i :\, h_j(i) = k} r_j(i) f_i$.
     3. To answer a point query $i$, return $\hat{f}_i = \mathrm{median}\left(r_1(i)\, c_{1,h_1(i)}, \dots, r_t(i)\, c_{t,h_t(i)}\right)$.
     Claim. $\mathbb{E}[r_j(i)\, c_{j,h_j(i)}] = f_i$ and $\mathrm{Var}[r_j(i)\, c_{j,h_j(i)}] \le \frac{F_2}{b}$ for all $j \in [t]$.
     • By Chebyshev, for $b = 3/\varepsilon^2$,
       $\Pr\left[\,|f_i - r_j(i)\, c_{j,h_j(i)}| \ge \varepsilon \sqrt{F_2}\,\right] \le \frac{F_2}{\varepsilon^2 b F_2} = \frac{1}{3}$.
     • By Chernoff, for $t = \Theta(\log 1/\delta)$,
       $\Pr\left[\,|f_i - \hat{f}_i| \ge \varepsilon \sqrt{F_2}\,\right] \le \delta$.
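A minimal Python sketch of Count-Sketch, under stated assumptions, is given below. The class name `CountSketch`, the direct `t` and `b` parameters, and both hash constructions are illustrative: the bucket hash is $((a_1 x + c_1) \bmod P) \bmod b$ and the sign hash takes the parity of a second such hash, which only approximates the independence the analysis asks for.

```python
import random
import statistics

P = (1 << 61) - 1  # a Mersenne prime, assumed to exceed the universe size n


class CountSketch:
    def __init__(self, t, b, seed=0):
        rng = random.Random(seed)
        self.b = b
        # Per row: coefficients (a1, c1) for the bucket hash h_j
        # and (a2, c2) for the sign hash r_j.
        self.params = [tuple(rng.randrange(1, P) for _ in range(4)) for _ in range(t)]
        self.rows = [[0] * b for _ in range(t)]

    def _bucket_and_sign(self, j, x):
        a1, c1, a2, c2 = self.params[j]
        bucket = ((a1 * x + c1) % P) % self.b
        sign = 1 if ((a2 * x + c2) % P) % 2 == 0 else -1
        return bucket, sign

    def update(self, x, count=1):
        # Add r_j(x) * count to c[j][h_j(x)]; negative counts act as deletions.
        for j, row in enumerate(self.rows):
            bucket, sign = self._bucket_and_sign(j, x)
            row[bucket] += sign * count

    def query(self, x):
        # Each row gives the estimate r_j(x) * c[j][h_j(x)]; return their median.
        estimates = []
        for j, row in enumerate(self.rows):
            bucket, sign = self._bucket_and_sign(j, x)
            estimates.append(sign * row[bucket])
        return statistics.median(estimates)


sketch = CountSketch(t=5, b=16)
for _ in range(10):
    sketch.update(5)
only_item = sketch.query(5)     # exactly 10: no other item is present to collide
sketch.update(5, -10)           # the sketch is linear, so this undoes the inserts
after_delete = sketch.query(5)  # 0
```

Unlike Count-Min, the per-row estimates can err in both directions, which is why the median is used instead of the minimum.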
