

  1. Sublinear Algorithms, Lecture 7
  Last time:
  • Communication complexity
  • Other models of computation
  Today:
  • Streaming
  Sofya Raskhodnikova, Boston University, 9/24/2020

  2. Data Stream Model [Alon Matias Szegedy 96]
  A streaming algorithm must:
  (1) quickly process each element,
  (2) use limited working memory,
  (3) quickly produce its output.
  Motivation: internet traffic analysis.
  Model the stream as m elements from [n], e.g., a_1, a_2, …, a_m = 3, 5, 3, 7, 5, 4, …
  Goal: Compute a function of the stream, e.g., median, number of distinct elements, longest increasing sequence.
  Based on Andrew McGregor’s slides: http://www.cs.umass.edu/~mcgregor/slides/10-jhu1.pdf

  3. Streaming Puzzle
  A stream contains n − 1 distinct elements from [n], in arbitrary order.
  Problem: Find the missing element, using O(log n) space.
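The slides leave this as a puzzle; one standard solution (not given on the slide) keeps a single running sum:

```python
def find_missing(stream, n):
    """Find the element of {1, ..., n} missing from a stream of the
    other n - 1 distinct elements. The only state is one counter,
    which never exceeds n(n+1)/2, so O(log n) bits suffice."""
    total = n * (n + 1) // 2  # sum of 1 + 2 + ... + n
    for a in stream:
        total -= a
    return total
```

For example, `find_missing([3, 1, 4, 5], 5)` returns 2, since 15 − 13 = 2.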

  4. Sampling from a Stream of Unknown Length
  Warm-up: Find a uniform sample s from a stream a_1, a_2, …, a_m of known length m.

  5. Sampling from a Stream of Unknown Length
  Problem: Find a uniform sample s from a stream a_1, a_2, …, a_m of unknown length m.
  Algorithm (Reservoir Sampling)
  1. Initially, s ← a_1.
  2. On seeing the t-th element, set s ← a_t with probability 1/t.
  Analysis: What is the probability that s = a_j at some time t ≥ j?
  Pr[s = a_j] = (1/j) · (1 − 1/(j+1)) · … · (1 − 1/t)
             = (1/j) · (j/(j+1)) · … · ((t−1)/t)
             = 1/t.
  Space: O(k log n + log m) bits to get k samples.
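The two steps above can be sketched directly (a minimal version keeping a single sample):

```python
import random

def reservoir_sample(stream):
    """Uniform sample from a stream of unknown length: keep the first
    element, then replace the current sample by the t-th element with
    probability 1/t."""
    sample = None
    for t, a in enumerate(stream, start=1):
        if random.randrange(t) == 0:  # happens with probability 1/t
            sample = a
    return sample
```

By the telescoping product in the analysis, each a_j survives as the final sample with probability exactly 1/m.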

  6. Counting Distinct Elements
  Input: a stream a_1, a_2, …, a_m ∈ [n]^m.
  Warm-up: Output the number of distinct elements in the stream.
  Exact solutions:
  • Store n bits, indicating whether each domain element has appeared.
  • Store the stream: O(m log n) bits.
  Known lower bounds:
  • Every deterministic algorithm requires Ω(n) bits (even for a constant-factor approximation).
  • Every exact algorithm (even randomized) requires Ω(n) bits.
  Need to use both randomization and approximation to get polylog(m, n) space.
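The n-bit exact baseline, for instance, is a few lines (a sketch for the warm-up, not a streaming algorithm one would actually deploy):

```python
def distinct_exact(stream, n):
    """Exact count using n bits of state: seen[i - 1] records whether
    domain element i in {1, ..., n} has appeared."""
    seen = [False] * n
    for a in stream:
        seen[a - 1] = True
    return sum(seen)
```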

  7. Counting Distinct Elements
  Input: a stream a_1, a_2, …, a_m ∈ [n]^m.
  Goal: Estimate the number of distinct elements in the stream up to a multiplicative factor (1 + ε), with probability ≥ 2/3.
  • Studied by [Flajolet Martin 83, Alon Matias Szegedy 96, …]
  • Today: an O(ε⁻² log n)-space algorithm [Bar-Yossef Jayram Kumar Sivakumar Trevisan 02]
  • Optimal: an O(ε⁻² + log n)-space algorithm [Kane Nelson Woodruff 10]

  8. Counting Distinct Elements
  Input: a stream a_1, a_2, …, a_m ∈ [n]^m.
  Goal: Estimate the number of distinct elements in the stream up to a multiplicative factor (1 + ε), with probability ≥ 2/3.
  Algorithm
  1. Apply a random hash function h : [n] → [n] to each element.
  2. Compute Y, the t-th smallest hash value seen, where t = 10/ε².
  3. Return s̃ = t · n / Y as an estimate for s, the number of distinct elements.
  Analysis:
  • The algorithm uses O(ε⁻² log n) bits of space (not accounting for storing h).
  • We'll show: the estimate s̃ has good accuracy with reasonable probability.
  Claim. Pr[|s̃ − s| ≤ εs] ≥ 2/3.
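The three steps can be sketched as follows. This is an illustration, not the slides' exact implementation: the idealized random hash h : [n] → [n] is simulated by memoizing fresh random values (a later slide replaces it with a 2-wise independent family), and only the at most t smallest distinct hash values are stored, matching the O(ε⁻² log n) space bound.

```python
import random

def estimate_distinct(stream, n, eps=0.5):
    """t-th-smallest-hash estimator for the number of distinct elements."""
    t = max(1, int(10 / eps ** 2))
    h = {}          # memoized stand-in for a random function [n] -> [n]
    smallest = []   # sorted list of the <= t smallest distinct hash values
    for a in stream:
        v = h.setdefault(a, random.randrange(1, n + 1))
        if v not in smallest:
            smallest = sorted(smallest + [v])[:t]
    if len(smallest) < t:     # fewer than t distinct hash values seen:
        return len(smallest)  # the count is exact (up to hash collisions)
    Y = smallest[-1]          # the t-th smallest hash value
    return t * n / Y
```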

  9. Counting Distinct Elements: Analysis
  Recall: Y is the t-th smallest hashed value, t = 10/ε², s̃ = t · n / Y.
  Claim. Pr[|s̃ − s| ≤ εs] ≥ 2/3.
  Proof: Suppose the distinct elements are e_1, …, e_s.
  • Overestimation:
    Pr[s̃ ≥ (1+ε)s] = Pr[t·n/Y ≥ (1+ε)s] = Pr[Y ≤ t·n/((1+ε)s)].
  • Let Z_j = 𝟙[h(e_j) ≤ t·n/((1+ε)s)] and Z = Σ_{j=1}^{s} Z_j. Then
    E[Z] = s · E[Z_1] = s · t/((1+ε)s) = t/(1+ε),
    Var[Z] = Σ_{j=1}^{s} Var[Z_j] ≤ Σ_{j=1}^{s} E[Z_j²] = Σ_{j=1}^{s} E[Z_j] = E[Z],
    where the last equality uses that each Z_j is an indicator (Z_j² = Z_j).

  10. Counting Distinct Elements: Analysis (continued)
  Recall: Y is the t-th smallest hashed value, t = 10/ε², s̃ = t · n / Y, and Z counts the distinct elements hashed to at most t·n/((1+ε)s), so E[Z] = t/(1+ε) and Var[Z] ≤ E[Z].
  • Pr[Y ≤ t·n/((1+ε)s)] = Pr[Z ≥ t] = Pr[Z ≥ (1+ε)·E[Z]].
  • By Chebyshev's inequality, for ε ≤ 2/3,
    Pr[Z ≥ (1+ε)·E[Z]] ≤ Var[Z]/(ε·E[Z])² ≤ 1/(ε²·E[Z]) = (1+ε)/(ε²·t) = (1+ε)/10 ≤ 1/6.
  • Underestimation: A similar analysis shows Pr[s̃ ≤ (1−ε)s] ≤ 1/6.
  By a union bound, Pr[|s̃ − s| > εs] ≤ 1/6 + 1/6 = 1/3, proving the claim.

  11. Removing the Random Hashing Assumption
  Idea: Use limited independence.
  • A family ℋ = {h : [a] → [b]} of hash functions is k-wise independent if for all distinct x_1, …, x_k ∈ [a] and all y_1, …, y_k ∈ [b],
    Pr_{h∈ℋ}[h(x_1) = y_1, …, h(x_k) = y_k] = 1/b^k.
  Note: a uniformly random function is k-wise independent for all k.
  Observations: For x_1, …, x_k as above,
  1. h(x_1) is uniform over [b].
  2. h(x_1), …, h(x_k) are mutually independent.
  Based on Sepehr Assadi’s lecture notes for CS 514 (Lecture 7, 03/20/20) at Rutgers.

  12. Construction of a k-wise Independent Family
  Recall: a family ℋ = {h : [a] → [b]} of hash functions is k-wise independent if for all distinct x_1, …, x_k ∈ [a] and all y_1, …, y_k ∈ [b],
    Pr_{h∈ℋ}[h(x_1) = y_1, …, h(x_k) = y_k] = 1/b^k.
  Construction of a k-wise independent family of hash functions:
  1. Let p be a prime. Consider the set of polynomials of degree at most k − 1 over 𝔽_p:
  2. ℋ = {h : {0, …, p−1} → {0, …, p−1} | h(x) = c_{k−1}·x^{k−1} + ⋯ + c_1·x + c_0, with c_0, …, c_{k−1} ∈ 𝔽_p}.
  3. To sample h ∈ ℋ, sample c_0, …, c_{k−1} ∈ 𝔽_p uniformly and independently at random.
  • Space to store h is O(k log p).
  • For arbitrary [a], [b], need O(k·(log a + log b)) space.
  Based on Sepehr Assadi’s lecture notes for CS 514 (Lecture 7, 03/20/20) at Rutgers.
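Sampling from this polynomial family and evaluating h takes only a few lines (a sketch; `sample_kwise_hash` is a name chosen here, not from the slides):

```python
import random

def sample_kwise_hash(k, p):
    """Draw h from the k-wise independent family of degree-(k-1)
    polynomials over F_p: h(x) = c_{k-1} x^{k-1} + ... + c_1 x + c_0.
    Storing h means storing the k coefficients: O(k log p) bits."""
    coeffs = [random.randrange(p) for _ in range(k)]  # c_0, ..., c_{k-1}
    def h(x):
        val = 0
        for c in reversed(coeffs):  # Horner's rule, arithmetic mod p
            val = (val * x + c) % p
        return val
    return h
```

With k = 1 the polynomials are constants, so h is a single fixed random value; k = 2 gives the pairwise independent family used in the final distinct-elements algorithm.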

  13. Counting Distinct Elements: Final Algorithm
  Input: a stream a_1, a_2, …, a_m ∈ [n]^m.
  Goal: Estimate the number of distinct elements in the stream up to a multiplicative factor (1 + ε), with probability ≥ 2/3.
  Algorithm
  1. Sample a hash function h : [n] → [n] from a 2-wise independent family and apply h to each element.
  2. Compute Y, the t-th smallest hash value seen, where t = 10/ε².
  3. Return s̃ = t · n / Y as an estimate for s, the number of distinct elements.
  Analysis:
  • The algorithm uses O(ε⁻² log n) bits of space, now including the O(log n) bits to store h.
  • Our correctness analysis still applies: it only used Var[Z] = Σ_j Var[Z_j], which holds under pairwise independence.

  14. Frequency Moments Estimation
  Input: a stream a_1, a_2, …, a_m ∈ [n]^m.
  • The frequency vector of the stream is f = (f_1, …, f_n), where f_i is the number of times i appears in the stream.
  • The p-th frequency moment is F_p = Σ_{i=1}^{n} f_i^p.
    F_0 is the number of nonzero entries of f (the number of distinct elements).
    F_1 = m (the number of elements in the stream).
    F_2 = ‖f‖₂² is a measure of non-uniformity, used e.g. for anomaly detection in network analysis.
    F_∞ = max_i f_i is the frequency of the most frequent element.
  Goal: Estimate F_p up to a multiplicative factor (1 + ε), with probability ≥ 2/3.
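The moments defined above are easy to compute exactly offline, which makes a useful sanity check for streaming estimators:

```python
from collections import Counter

def frequency_moment(stream, p):
    """F_p = sum of f_i^p over the elements i that appear, where f_i
    is the number of occurrences of i. Summing only over appearing
    elements makes F_0 the number of distinct elements."""
    f = Counter(stream)
    return sum(fi ** p for fi in f.values())

stream = [3, 5, 3, 7, 5, 4]   # f_3 = f_5 = 2, f_4 = f_7 = 1
# F_0 = 4 (distinct elements), F_1 = 6 (stream length),
# F_2 = 4 + 4 + 1 + 1 = 10;
# the F_infinity analogue is max(Counter(stream).values()) = 2
```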

  15. Summary
  • Streaming model
  • Reservoir sampling
  • Distinct elements (approximating F_0)
  • k-wise independent hashing
