LECTURE 7 Last time Communication complexity Other models of - - PowerPoint PPT Presentation



SLIDE 1

9/24/2020

Sublinear Algorithms

LECTURE 7

Last time

  • Communication complexity
  • Other models of computation

Today

  • Streaming

Sofya Raskhodnikova; Boston University

SLIDE 2

Data Stream Model [Alon Matias Szegedy 96]

Motivation: internet traffic analysis

Model the stream as $m$ elements from $[n]$, e.g., $a_1, a_2, \dots, a_m = 3, 5, 3, 7, 5, 4, \dots$

Goal: Compute a function of the stream, e.g., median, number of distinct elements, longest increasing sequence.

Streaming Algorithm

  • (1) Quickly process each element
  • (2) Limited working memory
  • (3) Quickly produce output

Based on Andrew McGregor's slides: http://www.cs.umass.edu/~mcgregor/slides/10-jhu1.pdf

SLIDE 3

Streaming Puzzle

A stream contains $n - 1$ distinct elements from $[n]$ in arbitrary order.

Problem: Find the missing element, using $O(\log n)$ space.
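
The slide leaves the puzzle open. One standard solution, sketched here as an illustration rather than as the lecture's intended answer, keeps a single running sum: subtract every stream element from $1 + 2 + \dots + n$, and what remains is the missing element. The function name `find_missing` and the convention $[n] = \{1, \dots, n\}$ are assumptions of this sketch.

```python
def find_missing(stream, n):
    """Return the element of {1, ..., n} absent from a stream of n - 1 distinct elements."""
    # The counter never exceeds n(n+1)/2, so it fits in O(log n) bits.
    remaining = n * (n + 1) // 2  # sum of 1..n
    for a in stream:
        remaining -= a
    return remaining


print(find_missing([3, 1, 5, 2], 5))  # prints 4
```

The same idea works with XOR in place of addition, which avoids any intermediate value larger than $n$.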

SLIDE 4

Sampling from a Stream of Unknown Length

Warm-up: Find a uniform sample $s$ from a stream $a_1, a_2, \dots, a_m$ of known length $m$.

SLIDE 5

Sampling from a Stream of Unknown Length

Problem: Find a uniform sample $s$ from a stream $a_1, a_2, \dots, a_m$ of unknown length $m$.

Algorithm (Reservoir Sampling)

  1. Initially, $s \leftarrow a_1$
  2. On seeing the $t$-th element, $s \leftarrow a_t$ with probability $1/t$

Analysis: What is the probability that $s = a_i$ at some time $t \ge i$?

$$\Pr[s = a_i] = \frac{1}{i} \cdot \left(1 - \frac{1}{i+1}\right) \cdots \left(1 - \frac{1}{t}\right) = \frac{1}{i} \cdot \frac{i}{i+1} \cdots \frac{t-1}{t} = \frac{1}{t}$$

Space: $O(k \log n + \log m)$ bits to get $k$ samples.

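
The two reservoir sampling steps above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function name `reservoir_sample` and the optional `rng` parameter are assumptions introduced here.

```python
import random


def reservoir_sample(stream, rng=random):
    """Return one uniformly random element of an iterable whose length is not known in advance."""
    sample = None
    for t, a in enumerate(stream, start=1):
        # Keep the t-th element with probability 1/t.
        # For t = 1 this always fires, so the sample starts as the first element.
        if rng.randrange(t) == 0:
            sample = a
    return sample


print(reservoir_sample(iter([3, 5, 3, 7, 5, 4])))
```

Only the current sample and the counter `t` are stored, which matches the space bound on the slide for $k = 1$. Running $k$ independent copies gives $k$ samples.
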
SLIDE 6

Counting Distinct Elements

Input: a stream $a_1, a_2, \dots, a_m \in [n]^m$

Warm-up: Output the number of distinct elements in the stream.

Exact solutions:

  • Store $n$ bits, indicating whether each domain element has appeared.
  • Store the stream: $O(m \log n)$ bits.

Known lower bounds:

  • Every deterministic algorithm requires $\Omega(m)$ bits (even for a constant-factor approximation).
  • Every exact algorithm (even randomized) requires $\Omega(n)$ bits.

Need to use both randomization and approximation to get $\mathrm{polylog}(m, n)$ space.
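
As a point of comparison for the approximate algorithms that follow, the first exact solution (one flag per domain element) might look like the sketch below. The function name `count_distinct_exact` is hypothetical, the domain is assumed to be $\{1, \dots, n\}$, and a `bytearray` is used for readability, so each flag costs a byte rather than a bit.

```python
def count_distinct_exact(stream, n):
    """Count distinct elements of a stream over {1, ..., n} using one flag per domain element."""
    seen = bytearray(n + 1)  # index 0 is unused; a real bitmap would pack these into n bits
    count = 0
    for a in stream:
        if not seen[a]:
            seen[a] = 1
            count += 1
    return count


print(count_distinct_exact([3, 5, 3, 7, 5, 4], 7))  # prints 4
```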

SLIDE 7

Counting Distinct Elements

Input: a stream $a_1, a_2, \dots, a_m \in [n]^m$

Goal: Estimate the number of distinct elements in the stream up to a multiplicative factor $(1 + \varepsilon)$ with probability $\ge 2/3$.

  • Studied by [Flajolet Martin 83, Alon Matias Szegedy 96, ...]
  • Today: $O(\varepsilon^{-2} \log n)$ space algorithm [Bar-Yossef Jayram Kumar Sivakumar Trevisan 02]
  • Optimal: $O(\varepsilon^{-2} + \log n)$ space algorithm [Kane Nelson Woodruff 10]

SLIDE 8

Counting Distinct Elements

Input: a stream $a_1, a_2, \dots, a_m \in [n]^m$

Goal: Estimate the number of distinct elements in the stream up to a multiplicative factor $(1 + \varepsilon)$ with probability $\ge 2/3$.

Algorithm

  1. Apply a random hash function $h : [n] \to [n]$ to each element.
  2. Compute $X$, the $t$-th smallest value of the hash seen, where $t = 10/\varepsilon^2$.
  3. Return $\tilde{r} = t \cdot n / X$ as the estimate for $r$, the number of distinct elements.

Analysis:

  • Algorithm uses $O(\varepsilon^{-2} \log n)$ bits of space (not accounting for storing $h$).
  • We'll show: the estimate $\tilde{r}$ has good accuracy with reasonable probability.

Claim. $\Pr[|\tilde{r} - r| \le \varepsilon r] \ge 2/3$
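
The three algorithm steps can be sketched as follows. This is an illustration under stated assumptions, not the lecture's code. The name `estimate_distinct` is hypothetical. The truly random hash function is simulated by memoizing fresh random values in a dictionary, which is convenient for experiments but does not respect the space bound. The sketch tracks distinct hash values, and the fallback for streams with fewer than $t$ distinct hash values is a choice made here, since the slide does not cover that case.

```python
import heapq
import random


def estimate_distinct(stream, n, eps, seed=None):
    """Estimate the number of distinct elements as t * n / X, with X the t-th smallest hash value."""
    rng = random.Random(seed)
    t = max(1, round(10 / eps ** 2))
    hash_of = {}     # idealized random hash h: [n] -> [n]; NOT small space (assumption of this sketch)
    smallest = []    # negated values, so the heap root is the largest of the t smallest hashes
    in_heap = set()  # distinct hash values currently kept
    for a in stream:
        if a not in hash_of:
            hash_of[a] = rng.randint(1, n)
        v = hash_of[a]
        if v in in_heap:
            continue
        if len(smallest) < t:
            heapq.heappush(smallest, -v)
            in_heap.add(v)
        elif v < -smallest[0]:
            evicted = -heapq.heappushpop(smallest, -v)
            in_heap.discard(evicted)
            in_heap.add(v)
    if len(smallest) < t:
        # Fewer than t distinct hash values were seen: report their count (fallback assumed here).
        return len(smallest)
    x = -smallest[0]  # the t-th smallest hash value
    return t * n / x


stream = list(range(1, 5001)) * 2  # 5000 distinct elements, each appearing twice
print(estimate_distinct(stream, n=10**6, eps=0.2, seed=1))
```

The intuition behind step 3: if the $r$ distinct elements hash to roughly evenly spread values in $[n]$, the $t$-th smallest value is near $t \cdot n / r$, so $t \cdot n / X$ is near $r$.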

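SLIDE 9

Counting Distinct Elements: Analysis

Claim. $\Pr[|\tilde{r} - r| \le \varepsilon r] \ge 2/3$

Recall: $X$ is the $t$-th smallest hashed value, $t = 10/\varepsilon^2$, and $\tilde{r} = t \cdot n / X$.

Proof: Suppose the distinct elements are $e_1, \dots, e_r$.

  • Overestimation:

$$\Pr[\tilde{r} \ge (1 + \varepsilon) r] = \Pr\left[\frac{t \cdot n}{X} \ge (1 + \varepsilon) r\right] = \Pr\left[X \le \frac{t \cdot n}{r (1 + \varepsilon)}\right]$$

  • Let $Y_i = \mathbb{1}\left[h(e_i) \le \frac{t \cdot n}{r (1 + \varepsilon)}\right]$ and $Y = \sum_{i=1}^{r} Y_i$.

$$E[Y] = r \cdot E[Y_1] = r \cdot \frac{t}{r (1 + \varepsilon)} = \frac{t}{1 + \varepsilon}$$

$$\mathrm{Var}[Y] = \mathrm{Var}\left[\sum_{i=1}^{r} Y_i\right] = \sum_{i=1}^{r} \mathrm{Var}[Y_i] \le \sum_{i=1}^{r} E[Y_i^2] = \sum_{i=1}^{r} E[Y_i] = E[Y]$$

So far: $E[Y] = \frac{t}{1 + \varepsilon}$ and $\mathrm{Var}[Y] \le E[Y]$.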

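SLIDE 10

Counting Distinct Elements: Analysis

Claim. $\Pr[|\tilde{r} - r| \le \varepsilon r] \ge 2/3$

Recall: $X$ is the $t$-th smallest hashed value, $t = 10/\varepsilon^2$, $\tilde{r} = t \cdot n / X$, $E[Y] = \frac{t}{1 + \varepsilon}$, and $\mathrm{Var}[Y] \le E[Y]$.

Proof: Suppose the distinct elements are $e_1, \dots, e_r$.

  • Overestimation:

$$\Pr[\tilde{r} \ge (1 + \varepsilon) r] = \Pr\left[\frac{t \cdot n}{X} \ge (1 + \varepsilon) r\right] = \Pr\left[X \le \frac{t \cdot n}{r (1 + \varepsilon)}\right]$$

  • Let $Y_i = \mathbb{1}\left[h(e_i) \le \frac{t \cdot n}{r (1 + \varepsilon)}\right]$ and $Y = \sum_{i=1}^{r} Y_i$. The $t$-th smallest hashed value is at most a threshold exactly when at least $t$ hashed values are at most that threshold, so

$$\Pr\left[X \le \frac{t \cdot n}{r (1 + \varepsilon)}\right] = \Pr[Y \ge t] = \Pr[Y \ge (1 + \varepsilon) E[Y]]$$

  • By Chebyshev's inequality, for $\varepsilon \le 2/3$,

$$\Pr[Y \ge (1 + \varepsilon) E[Y]] \le \frac{\mathrm{Var}[Y]}{(\varepsilon \cdot E[Y])^2} \le \frac{1}{\varepsilon^2 E[Y]} = \frac{1 + \varepsilon}{\varepsilon^2 \cdot t} = \frac{1 + \varepsilon}{10} \le \frac{1}{6}$$

  • Underestimation: A similar analysis shows $\Pr[\tilde{r} \le (1 - \varepsilon) r] \le \frac{1}{6}$.

A union bound over the two events gives failure probability at most $1/3$, which proves the claim.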

SLIDE 11

Removing the Random Hashing Assumption

Idea: Use limited independence

  • A family $\mathcal{H} = \{h : [a] \to [b]\}$ of hash functions is $k$-wise independent if for all distinct $x_1, \dots, x_k \in [a]$ and all $y_1, \dots, y_k \in [b]$,

$$\Pr_{h \in \mathcal{H}}[h(x_1) = y_1, \dots, h(x_k) = y_k] = \frac{1}{b^k}$$

    Note: the family of all functions from $[a]$ to $[b]$ (a uniformly random hash function) is $k$-wise independent for all $k$.

  • Observations: For $x_1, \dots, x_k$ as above,

    1. $h(x_1)$ is uniform over $[b]$.
    2. $h(x_1), \dots, h(x_k)$ are mutually independent.

Based on Sepehr Assadi's lecture notes for CS 514 (Lecture 7, 03/20/20) at Rutgers

SLIDE 12

Construction of $k$-wise Independent Family

Idea: Use limited independence

  • A family $\mathcal{H} = \{h : [a] \to [b]\}$ of hash functions is $k$-wise independent if for all distinct $x_1, \dots, x_k \in [a]$ and all $y_1, \dots, y_k \in [b]$,

$$\Pr_{h \in \mathcal{H}}[h(x_1) = y_1, \dots, h(x_k) = y_k] = \frac{1}{b^k}$$

Construction of $k$-wise Independent Family of Hash Functions

  1. Let $p$ be a prime.
  2. Consider the set of polynomials of degree at most $k - 1$ over $\mathbb{F}_p$:

$$\mathcal{H} = \{h : \{0, \dots, p - 1\} \to \{0, \dots, p - 1\} \mid h(x) = c_{k-1} x^{k-1} + \dots + c_1 x + c_0, \text{ with } c_0, \dots, c_{k-1} \in \mathbb{F}_p\}$$

  3. To sample $h \in \mathcal{H}$, sample $c_0, \dots, c_{k-1} \in \mathbb{F}_p$ u.i.r. (uniformly and independently at random).

  • Space to store $h$ is $O(k \log p)$.
  • For arbitrary $a, b$, need $O(k \cdot (\log a + \log b))$ space.
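
The polynomial construction can be sketched in Python as below. The names `eval_poly` and `sample_kwise_hash` and the `seed` parameter are assumptions made for this illustration, and the caller is trusted to pass a prime $p$.

```python
import random


def eval_poly(coeffs, x, p):
    """Evaluate c_0 + c_1 x + ... + c_{k-1} x^{k-1} mod p, where coeffs = [c_0, ..., c_{k-1}]."""
    acc = 0
    for c in reversed(coeffs):  # Horner's rule, highest-degree coefficient first
        acc = (acc * x + c) % p
    return acc


def sample_kwise_hash(k, p, seed=None):
    """Sample h from the family of polynomials of degree at most k - 1 over F_p (p assumed prime)."""
    rng = random.Random(seed)
    coeffs = [rng.randrange(p) for _ in range(k)]  # k coefficients, uniform and independent
    return lambda x: eval_poly(coeffs, x, p)


h = sample_kwise_hash(k=2, p=101, seed=0)
print([h(x) for x in range(5)])
```

Storing `coeffs` takes $k$ numbers below $p$, which is where the $O(k \log p)$ space bound comes from. For small parameters, $k$-wise independence can be checked exhaustively: with $k = 2$ and $p = 5$, each pair of target values for two distinct inputs is hit by exactly one of the 25 coefficient pairs.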

SLIDE 13

Counting Distinct Elements: Final Algorithm

Input: a stream $a_1, a_2, \dots, a_m \in [n]^m$

Goal: Estimate the number of distinct elements in the stream up to a multiplicative factor $(1 + \varepsilon)$ with probability $\ge 2/3$.

Algorithm

  1. Sample a hash function $h : [n] \to [n]$ from a 2-wise independent family and apply $h$ to each element.
  2. Compute $X$, the $t$-th smallest value of the hash seen, where $t = 10/\varepsilon^2$.
  3. Return $\tilde{r} = t \cdot n / X$ as the estimate for $r$, the number of distinct elements.

Analysis:

  • Algorithm uses $O(\varepsilon^{-2} \log n)$ bits of space.
  • Our correctness analysis applies: it needs only pairwise independence of the hash values, used to write $\mathrm{Var}[Y]$ as the sum of the $\mathrm{Var}[Y_i]$.

SLIDE 14

Frequency Moments Estimation

Input: a stream $a_1, a_2, \dots, a_m \in [n]^m$

  • The frequency vector of the stream is $f = (f_1, \dots, f_n)$, where $f_i$ is the number of times $i$ appears in the stream.
  • The $p$-th frequency moment is $F_p = \|f\|_p^p = \sum_{i=1}^{n} f_i^p$.

    $F_0$ is the number of nonzero entries of $f$ (# of distinct elements)
    $F_1 = m$ (# of elements in the stream)
    $F_2 = \|f\|_2^2$ is a measure of non-uniformity, used e.g. for anomaly detection in network analysis
    $F_\infty = \max_i f_i$ is the frequency of the most frequent element

Goal: Estimate $F_p$ up to a multiplicative factor $(1 + \varepsilon)$ with probability $\ge 2/3$.
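
To make the definitions concrete, the sketch below computes frequency moments exactly by storing the whole frequency vector. It is a non-streaming baseline for checking the definitions, not a small-space estimator, and the function name `frequency_moment` is hypothetical. The example stream is the one from the Data Stream Model slide.

```python
from collections import Counter


def frequency_moment(stream, p):
    """Compute F_p exactly from the full frequency vector (linear space, for illustration only)."""
    freq = Counter(stream)  # only the nonzero entries f_i are stored
    if p == 0:
        return len(freq)  # number of nonzero entries
    if p == float("inf"):
        return max(freq.values(), default=0)  # largest frequency
    return sum(f ** p for f in freq.values())


stream = [3, 5, 3, 7, 5, 4]
print(frequency_moment(stream, 0))             # prints 4
print(frequency_moment(stream, 1))             # prints 6
print(frequency_moment(stream, 2))             # prints 10
print(frequency_moment(stream, float("inf")))  # prints 2
```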

SLIDE 15

Summary

Streaming Model

  • Reservoir sampling
  • Distinct Elements (approximating $F_0$)
  • $k$-wise independent hashing