Lecture #2: Advanced hashing and concentration bounds


SLIDE 1

Outline

Lecture #2: Advanced hashing and concentration bounds

  • Bloom filters
  • Cuckoo hashing
  • Load balancing
  • Tail bounds
SLIDE 2

Bloom filters

Idea: For the sake of efficiency, sometimes we allow our data structure to make mistakes.

Bloom filter: A hash table that has only false positives (it may report that a key is present when it is not, but it always reports a key that is present). Very simple and fast.

Example: Google Chrome uses a Bloom filter to maintain its list of potentially malicious web sites.

  • Most queried keys are not in the table
  • If a key is in the table, can check against a slower (errorless) hash table

Many applications in networking (see survey by Broder and Mitzenmacher)

SLIDE 3

Bloom filters

Data structure: Universe U. Parameters k, m β‰₯ 1.
Maintain an array B of m bits; initially B[0] = B[1] = β‹― = B[mβˆ’1] = 0.
Choose k hash functions h_1, h_2, …, h_k : U β†’ [m] (assume completely random functions for the sake of analysis).

SLIDE 4

Bloom filters

Data structure: Universe U. Parameters k, m β‰₯ 1.
Maintain an array B of m bits; initially B[0] = B[1] = β‹― = B[mβˆ’1] = 0.
Choose k hash functions h_1, h_2, …, h_k : U β†’ [m] (assume completely random functions for the sake of analysis).
To add a key x ∈ U to the dictionary T βŠ† U, set the bits B[h_1(x)] ≔ 1, B[h_2(x)] ≔ 1, …, B[h_k(x)] ≔ 1.

SLIDE 5

Bloom filters

Data structure: Universe U. Parameters k, m β‰₯ 1.
Maintain an array B of m bits; initially B[0] = B[1] = β‹― = B[mβˆ’1] = 0.
Choose k hash functions h_1, h_2, …, h_k : U β†’ [m] (assume completely random functions for the sake of analysis).
To add a key x ∈ U to the dictionary T βŠ† U, set the bits B[h_1(x)] ≔ 1, B[h_2(x)] ≔ 1, …, B[h_k(x)] ≔ 1.
To answer a query "x ∈ T?": check whether B[h_j(x)] = 1 for all j = 1, 2, …, k. If yes, answer Yes. If no, answer No.
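A minimal sketch of the data structure in Python (not from the slides; the k "completely random" hash functions of the analysis are simulated here with seeded blake2b digests):

```python
import hashlib

class BloomFilter:
    """Sketch: an m-bit array B with k hash functions h_1, ..., h_k."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = [0] * m                      # B[0] = ... = B[m-1] = 0

    def _h(self, j: int, key: str) -> int:
        # Stand-in for the j-th "completely random" hash function h_j : U -> [m].
        digest = hashlib.blake2b(f"{j}|{key}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.m

    def add(self, key: str) -> None:
        for j in range(self.k):
            self.bits[self._h(j, key)] = 1       # set B[h_j(key)] := 1

    def query(self, key: str) -> bool:
        # Yes iff all k bits are set: false positives possible, false negatives impossible.
        return all(self.bits[self._h(j, key)] == 1 for j in range(self.k))

n = 1000
bf = BloomFilter(m=8 * n, k=7)
for i in range(n):
    bf.add(f"key-{i}")
assert all(bf.query(f"key-{i}") for i in range(n))          # no false negatives
fp = sum(bf.query(f"other-{i}") for i in range(10_000)) / 10_000
print(f"empirical false-positive rate: {fp:.3f}")
```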

SLIDE 6

Bloom filters

No false negatives: Clearly if x ∈ T, we return Yes.
But there is some chance that other keys have caused the bits in positions h_1(x), …, h_k(x) to be set even if x βˆ‰ T.

SLIDE 7

Bloom filters

No false negatives: Clearly if x ∈ T, we return Yes.
But there is some chance that other keys have caused the bits in positions h_1(x), …, h_k(x) to be set even if x βˆ‰ T.

Heuristic analysis: Let us assume that |T| = n. Compute β„™[B[β„“] = 0] for some location β„“ ∈ [m]:

q(k, n) = (1 βˆ’ 1/m)^{kn} β‰ˆ e^{βˆ’kn/m}

(Here we use the approximation (1 βˆ’ 1/m)^m β‰ˆ e^{βˆ’1} for m large enough.)

SLIDE 8

Bloom filters

No false negatives: Clearly if x ∈ T, we return Yes.
But there is some chance that other keys have caused the bits in positions h_1(x), …, h_k(x) to be set even if x βˆ‰ T.

Heuristic analysis: Let us assume that |T| = n. Compute β„™[B[β„“] = 0] for some location β„“ ∈ [m]:

q(k, n) = (1 βˆ’ 1/m)^{kn} β‰ˆ e^{βˆ’kn/m}

(Here we use the approximation (1 βˆ’ 1/m)^m β‰ˆ e^{βˆ’1} for m large enough.)

If each location in B is 0 with probability q(k, n), then a false positive for x βˆ‰ T should happen with probability at most

(1 βˆ’ q(k, n))^k β‰ˆ (1 βˆ’ e^{βˆ’kn/m})^k

SLIDE 9

Bloom filters

Heuristic analysis: If each location in B is 0 with probability q(k, n), then a false positive for x βˆ‰ T should happen with probability at most

(1 βˆ’ q(k, n))^k β‰ˆ (1 βˆ’ e^{βˆ’kn/m})^k

SLIDE 10

Bloom filters

Heuristic analysis: If each location in B is 0 with probability q(k, n), then a false positive for x βˆ‰ T should happen with probability at most

(1 βˆ’ q(k, n))^k β‰ˆ (1 βˆ’ e^{βˆ’kn/m})^k

But the actual fraction of 0's in the hash table is a random variable X_{k,n} with expectation 𝔼[X_{k,n}] = q(k, n).
To get the analysis right, we need a concentration bound: we want to say that X_{k,n} is close to its expected value with high probability. [We will return to this in the 2nd half of the lecture.]

SLIDE 11

Bloom filters

Heuristic analysis: If each location in B is 0 with probability q(k, n), then a false positive for x βˆ‰ T should happen with probability at most

(1 βˆ’ q(k, n))^k β‰ˆ (1 βˆ’ e^{βˆ’kn/m})^k

But the actual fraction of 0's in the hash table is a random variable X_{k,n} with expectation 𝔼[X_{k,n}] = q(k, n).
To get the analysis right, we need a concentration bound: we want to say that X_{k,n} is close to its expected value with high probability. [We will return to this in the 2nd half of the lecture.]

If the heuristic analysis is correct, it gives nice estimates: For instance, if m = 8n, then choosing k = 7 gives a false positive rate of about 2%.

SLIDE 12

Outline

Lecture #2: Advanced hashing and concentration bounds

  • Bloom filters
  • Cuckoo hashing
  • Load balancing
  • Tail bounds

Cuckoo hashing is a hash scheme with worst-case constant lookup time. The name derives from the behavior of some species of cuckoo, where the cuckoo chick pushes the other eggs or young out of the nest when it hatches; analogously, inserting a new key into a cuckoo hashing table may push an older key to a different location in the table.

SLIDE 13

Cuckoo hashing

Idea: Simple hashing without errors.
Lookups are worst-case O(1) time.
Deletions are worst-case O(1) time.
Insertions are expected O(1) time.
Insertion time is O(1) with good probability [will require a concentration bound].

SLIDE 14

Cuckoo hashing

Data structure: Two tables B_1 and B_2, both of size m = O(n).
Two hash functions h_1, h_2 : U β†’ [m] (we will assume the hash functions are fully random).
When an element x ∈ T is inserted, if either B_1[h_1(x)] or B_2[h_2(x)] is empty, store x there.

SLIDE 15

Cuckoo hashing

Data structure: Two tables B_1 and B_2, both of size m = O(n).
Two hash functions h_1, h_2 : U β†’ [m] (we will assume the hash functions are fully random).
When an element x ∈ T is inserted, if either B_1[h_1(x)] or B_2[h_2(x)] is empty, store x there.
If both locations are occupied, then place x in B_1[h_1(x)] and bump the current occupant.
Bump: Whenever an element z is bumped from B_i[h_i(z)], attempt to store it in the other location B_j[h_j(z)] (here (i, j) = (1, 2) or (2, 1)).

SLIDE 16

Cuckoo hashing

Data structure: Two tables B_1 and B_2, both of size m = O(n).
Two hash functions h_1, h_2 : U β†’ [m] (we will assume the hash functions are fully random).
When an element x ∈ T is inserted, if either B_1[h_1(x)] or B_2[h_2(x)] is empty, store x there.
If both locations are occupied, then place x in B_1[h_1(x)] and bump the current occupant.
Bump: Whenever an element z is bumped from B_i[h_i(z)], attempt to store it in the other location B_j[h_j(z)] (here (i, j) = (1, 2) or (2, 1)).
Abort: After 6 log n consecutive bumps, stop the process and build a fresh hash table using new random hash functions h_1, h_2.
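A sketch of the insertion procedure with bumping and the abort-and-rebuild rule (hypothetical Python, not from the slides; the fully random hash functions are simulated with seeded use of Python's hash, and n_hint is an illustrative stand-in for n in the "6 log n" limit):

```python
import math
import random

class CuckooHashTable:
    """Sketch: tables B1, B2 of size m with hash functions h1, h2."""

    def __init__(self, m: int, n_hint: int = 16):
        self.m = m
        self.max_bumps = max(1, int(6 * math.log(max(n_hint, 2))))   # "6 log n" abort rule
        self.tables = [[None] * m, [None] * m]
        self._new_hash_functions()

    def _new_hash_functions(self) -> None:
        # Stand-in for drawing fresh fully random h1, h2 : U -> [m].
        self.seeds = [random.getrandbits(64), random.getrandbits(64)]

    def _h(self, t: int, key) -> int:
        return hash((self.seeds[t], key)) % self.m

    def lookup(self, key) -> bool:
        # Worst-case O(1): exactly two probes.
        return any(self.tables[t][self._h(t, key)] == key for t in (0, 1))

    def insert(self, key) -> None:
        # If either B1[h1(key)] or B2[h2(key)] is empty, store the key there.
        for t in (0, 1):
            pos = self._h(t, key)
            if self.tables[t][pos] is None:
                self.tables[t][pos] = key
                return
        # Otherwise place it in B1[h1(key)]; each bumped element tries its other location.
        cur, t = key, 0
        for _ in range(self.max_bumps):
            pos = self._h(t, cur)
            if self.tables[t][pos] is None:
                self.tables[t][pos] = cur
                return
            cur, self.tables[t][pos] = self.tables[t][pos], cur
            t = 1 - t
        # Abort: after ~6 log n consecutive bumps, rebuild with fresh hash functions.
        keys = [x for table in self.tables for x in table if x is not None] + [cur]
        self.tables = [[None] * self.m, [None] * self.m]
        self._new_hash_functions()
        for x in keys:
            self.insert(x)

n = 500
cht = CuckooHashTable(m=4 * n, n_hint=n)
for i in range(n):
    cht.insert(i)
assert all(cht.lookup(i) for i in range(n))
```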

SLIDE 17

Cuckoo hashing

Arrows represent the alternate location for each key. If we insert an item at the location of B, it will get bumped, thereby bumping C, and then we are done.
Cycles are possible (where the insertion process never completes). What's an example?
Alternatively (as in the picture), we can use a single table with 2m entries and two hash functions h_1, h_2 : U β†’ [2m] (with the same "bumping" algorithm).

SLIDE 18

Cuckoo hashing

Data structure: Two tables B_1 and B_2, both of size m = O(n).
Two hash functions h_1, h_2 : U β†’ [m] (we will assume the hash functions are fully random).

Theorem: The expected time to perform an insert operation is O(1) if m β‰₯ 4n.

SLIDE 19

Cuckoo hashing

Data structure: Two tables B_1 and B_2, both of size m = O(n).
Two hash functions h_1, h_2 : U β†’ [m] (we will assume the hash functions are fully random).

Theorem: The expected time to perform an insert operation is O(1) if m β‰₯ 4n.

Pretty good… but only 25% memory utilization. One can actually get about 50% memory utilization.
Experimentally, with 3 hash functions instead of 2, one can get β‰ˆ 90% utilization, but it is an open question to provide tight analyses for d hash functions when d β‰₯ 3.

SLIDE 20

Outline

Lecture #2: Advanced hashing and concentration bounds

  • Bloom filters
  • Cuckoo hashing
  • Load balancing
  • Tail bounds
SLIDE 21

Load balancing

Suppose we have n jobs to assign to n servers. Clearly we could achieve a load of one job per server, but this might result in an expensive/hard-to-parallelize allocation rule.

SLIDE 22

Load balancing

Suppose we have n jobs to assign to n servers. Clearly we could achieve a load of one job per server, but this might result in an expensive/hard-to-parallelize allocation rule.
We could hash the balls into bins. Let's again consider the case of a uniformly random hash function h : [n] β†’ [n].

SLIDE 23

Load balancing

Suppose we have n jobs to assign to n servers. Clearly we could achieve a load of one job per server, but this might result in an expensive/hard-to-parallelize allocation rule.
We could hash the balls into bins. Let's again consider the case of a uniformly random hash function h : [n] β†’ [n].

Claim: The max-loaded server has < 8 log n / log log n jobs with probability at least 1 βˆ’ 1/n.

SLIDE 24

Load balancing

Suppose we have n jobs to assign to n servers. We hash the balls into bins with a uniformly random hash function h : [n] β†’ [n].

Claim: The max-loaded server has < 8 log n / log log n jobs with probability at least 1 βˆ’ 1/n.

Proof: The probability that a fixed server i ∈ {1, 2, …, n} gets at least k jobs is at most

C(n, k) Β· (1/n)^k ≀ (n^k / k!) Β· (1/n^k) = 1/k! ≀ k^{βˆ’k/2}

SLIDE 25

Load balancing

Suppose we have n jobs to assign to n servers. We hash the balls into bins with a uniformly random hash function h : [n] β†’ [n].

Claim: The max-loaded server has < 8 log n / log log n jobs with probability at least 1 βˆ’ 1/n.

Proof: The probability that a fixed server i ∈ {1, 2, …, n} gets at least k jobs is at most

C(n, k) Β· (1/n)^k ≀ (n^k / k!) Β· (1/n^k) = 1/k! ≀ k^{βˆ’k/2}

If we choose k = 8 log n / log log n, this is at most 1/nΒ².

Explanation: for n large enough, k β‰₯ √(log n), so

k^{k/2} β‰₯ (log n)^{k/4} = (log n)^{2 log n / log log n} = 2^{2 log n} = nΒ²

SLIDE 26

Load balancing

Suppose we have n jobs to assign to n servers. We hash the balls into bins with a uniformly random hash function h : [n] β†’ [n].

Claim: The max-loaded server has < 8 log n / log log n jobs with probability at least 1 βˆ’ 1/n.

Proof: The probability that a fixed server i ∈ {1, 2, …, n} gets at least k jobs is at most

C(n, k) Β· (1/n)^k ≀ (n^k / k!) Β· (1/n^k) = 1/k! ≀ k^{βˆ’k/2}

If we choose k = 8 log n / log log n, this is at most 1/nΒ².

Now a union bound shows that the probability that any server gets at least k jobs is at most 1/n.
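A small simulation sketch (not from the slides) comparing the observed maximum load against the 8 log n / log log n bound:

```python
import math
import random
from collections import Counter

def max_load(n: int) -> int:
    """Assign n jobs to n servers with a uniformly random hash; return the maximum load."""
    loads = Counter(random.randrange(n) for _ in range(n))
    return max(loads.values())

n = 100_000
print("claimed bound 8 log n / log log n =", round(8 * math.log(n) / math.log(math.log(n)), 1))
print("max loads over 20 trials:", [max_load(n) for _ in range(20)])
```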

SLIDE 27

Concentration bounds

Claim: The max-loaded server has < 8 log n / log log n jobs with probability at least 1 βˆ’ 1/n.

Proof: The probability that a fixed server i ∈ {1, 2, …, n} gets at least k jobs is at most

C(n, k) Β· (1/n)^k ≀ (n^k / k!) Β· (1/n^k) = 1/k! ≀ k^{βˆ’k/2}

If we choose k = 8 log n / log log n, this is at most 1/nΒ². Now a union bound shows that the probability that any server gets at least k jobs is at most 1/n.

This is an example of a concentration bound. Let X_i be the number of jobs assigned to the i-th server. By linearity of expectation,

𝔼[X_i] = Ξ£_{j=1}^{n} β„™[job j β†’ server i] = n Β· (1/n) = 1.

SLIDE 28

Concentration bounds

Claim: The max-loaded server has < 8 log n / log log n jobs with probability at least 1 βˆ’ 1/n.

Proof: The probability that a fixed server i ∈ {1, 2, …, n} gets at least k jobs is at most

C(n, k) Β· (1/n)^k ≀ (n^k / k!) Β· (1/n^k) = 1/k! ≀ k^{βˆ’k/2}

If we choose k = 8 log n / log log n, this is at most 1/nΒ². Now a union bound shows that the probability that any server gets at least k jobs is at most 1/n.

This is an example of a concentration bound. Let X_i be the number of jobs assigned to the i-th server. By linearity of expectation,

𝔼[X_i] = Ξ£_{j=1}^{n} β„™[job j β†’ server i] = n Β· (1/n) = 1.

We showed that β„™[X_i β‰₯ 8 log n / log log n] ≀ 1/nΒ² and then took a union bound over all n servers.

SLIDE 29

Concentration bounds

This is an example of a concentration bound. Let X_i be the number of jobs assigned to the i-th server. By linearity of expectation,

𝔼[X_i] = Ξ£_{j=1}^{n} β„™[job j β†’ server i] = n Β· (1/n) = 1.

We showed that β„™[X_i β‰₯ 8 log n / log log n] ≀ 1/nΒ² and then took a union bound over all n servers.

This is a common analysis technique: If a random variable (like X_i) depends in a "smooth" way on the outcome of many independent events, then it is likely not too far from its expectation.

SLIDE 30

Concentration bounds

Let X_i be the number of jobs assigned to the i-th server. By linearity of expectation,

𝔼[X_i] = Ξ£_{j=1}^{n} β„™[job j β†’ server i] = n Β· (1/n) = 1.

We showed that β„™[X_i β‰₯ 8 log n / log log n] ≀ 1/nΒ² and then took a union bound over all n servers.

This is a common analysis technique: If a random variable (like X_i) depends in a "smooth" way on the outcome of many independent events, then it is likely not too far from its expectation. "Smooth" in this case means that the outcome of any single decision (where to put job j) does not affect the value of X_i by too much (only by 1). This is an example of a concentration bound.

SLIDE 31

EXERCISE

Is it concentrated? [Why or why not?]

#1: Choose a uniformly random vector X ∈ ℝ^n with β€–Xβ€– = √(X_1Β² + X_2Β² + β‹― + X_nΒ²) = 1.
What is 𝔼[X_1Β²]?
What is the typical value of the maximum: max(|X_1|, |X_2|, …, |X_n|)?

#2 Rich get richer: Suppose we have n people. Everyone starts with 1 dollar.
We assign nΒ² more dollars in rounds. In the j-th round: if person i already has n_i dollars, we give them the j-th dollar with probability n_i / (j βˆ’ 1), i.e., with probability proportional to the amount of money they already have.
Let X_j be the amount of money person j ends up with.
What is the typical value of X_1? Is X_1 concentrated?
What is the typical value of max(X_1, X_2, …, X_n)? Is it concentrated?

SLIDE 32

Outline

Lecture #2: Advanced hashing and concentration bounds

  • Bloom filters
  • Cuckoo hashing
  • Load balancing
  • Tail bounds
SLIDE 33

Markov’s inequality

The more you know: The more information we have about a random variable, the stronger the concentration we can prove.

SLIDE 34

Markov's inequality

The more you know: The more information we have about a random variable, the stronger the concentration we can prove.
The most basic concentration bound is Markov's inequality. It requires knowing only the expected value: If X is a non-negative random variable, then for any Ξ» > 0,

β„™[X β‰₯ Ξ»] ≀ 𝔼[X] / Ξ»

Proof? (It's written there.)

SLIDE 35

Markov's inequality

The more you know: The more information we have about a random variable, the stronger the concentration we can prove.
The most basic concentration bound is Markov's inequality. It requires knowing only the expected value: If X is a non-negative random variable, then for any Ξ» > 0,

β„™[X β‰₯ Ξ»] ≀ 𝔼[X] / Ξ»

Proof? (It's written there.)

Example: If your expected revenue is $10,000, then the probability that you make at least $100,000 is at most 1/10.
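A tiny empirical illustration (a sketch; the revenue distribution below is made up for the example) showing that Markov's bound holds but can be very loose:

```python
import random

# X = revenue from 100 independent sales, each uniform in [0, 200], so E[X] = 10,000.
samples = [sum(random.uniform(0, 200) for _ in range(100)) for _ in range(20_000)]
mean = sum(samples) / len(samples)
lam = 100_000
empirical = sum(x >= lam for x in samples) / len(samples)
print(f"E[X] ~ {mean:.0f}")
print(f"empirical P[X >= {lam}] = {empirical}  vs  Markov bound E[X]/lambda = {mean / lam:.2f}")
```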

SLIDE 36

EXERCISE

Markov's inequality: If X is a non-negative random variable, then for any Ξ» > 0, β„™[X β‰₯ Ξ»] ≀ 𝔼[X] / Ξ».

A permutation is an invertible mapping Ο€ : {1, 2, …, n} β†’ {1, 2, …, n}. A number i is called a fixed point of Ο€ if Ο€(i) = i.

Exercise: Prove that if Ο€ is a uniformly random permutation, then β„™[Ο€ has more than k fixed points] ≀ 1/k.

SLIDE 37

Chebyshev's inequality

Recall that the variance of a random variable X is the value var(X) = σ² = 𝔼[(X βˆ’ 𝔼X)Β²].

SLIDE 38

Chebyshev's inequality

Recall that the variance of a random variable X is the value var(X) = σ² = 𝔼[(X βˆ’ 𝔼X)Β²].

Chebyshev's inequality: If X is a random variable with var(X) = σ², then for any Ξ» > 0,

β„™[|X βˆ’ 𝔼X| β‰₯ λσ] ≀ 1/λ²

SLIDE 39

Chebyshev's inequality

Recall that the variance of a random variable X is the value var(X) = σ² = 𝔼[(X βˆ’ 𝔼X)Β²].

Chebyshev's inequality: If X is a random variable with var(X) = σ², then for any Ξ» > 0,

β„™[|X βˆ’ 𝔼X| β‰₯ λσ] ≀ 1/λ²

Proof: Apply Markov's inequality to the random variable Z = (X βˆ’ 𝔼X)Β².

SLIDE 40

Chebyshev's inequality

Recall that the variance of a random variable X is the value var(X) = σ² = 𝔼[(X βˆ’ 𝔼X)Β²].

Chebyshev's inequality: If X is a random variable with var(X) = σ², then for any Ξ» > 0,

β„™[|X βˆ’ 𝔼X| β‰₯ λσ] ≀ 1/λ²

Proof: Apply Markov's inequality to the random variable Z = (X βˆ’ 𝔼X)Β².

Application: Suppose we map n balls into n bins using a 2-universal hash family H. Then with probability at least 1/2, the maximum load is at most O(√n).

SLIDE 41

Chebyshev's inequality

Application: Suppose we map n balls into n bins using a 2-universal hash family H. Then with probability at least 1/2, the maximum load is at most O(√n).

Let L_i be the load of bin i. Let X_{ij} be the indicator random variable such that X_{ij} = 1 ⟺ the i-th bin gets the j-th ball. Note that 𝔼[X_{ij}] = 1/n for each i = 1, …, n.

SLIDE 42

Chebyshev's inequality

Application: Suppose we map n balls into n bins using a 2-universal hash family H. Then with probability at least 1/2, the maximum load is at most O(√n).

Let L_i be the load of bin i. Let X_{ij} be the indicator random variable such that X_{ij} = 1 ⟺ the i-th bin gets the j-th ball. Note that 𝔼[X_{ij}] = 1/n for each i = 1, …, n.

Exercise: For any random variable Z, var(Z) = 𝔼[ZΒ²] βˆ’ (𝔼Z)Β².

SLIDE 43

Chebyshev's inequality

Application: Suppose we map n balls into n bins using a 2-universal hash family H. Then with probability at least 1/2, the maximum load is at most O(√n).

Let L_i be the load of bin i. Let X_{ij} be the indicator random variable such that X_{ij} = 1 ⟺ the i-th bin gets the j-th ball. Note that 𝔼[X_{ij}] = 1/n for each i = 1, …, n.

Exercise: For any random variable Z, var(Z) = 𝔼[ZΒ²] βˆ’ (𝔼Z)Β².

So write: var(L_i) = 𝔼[(X_{i1} + β‹― + X_{in})Β²] βˆ’ 1.
We have 𝔼[X_{ij}Β²] = 𝔼[X_{ij}] = 1/n and 𝔼[X_{ij} X_{ij'}] = β„™[h(j) = h(j') = i] ≀ 1/nΒ² using the 2-universal property, so

var(L_i) ≀ n Β· (1/n) + n(n βˆ’ 1)/nΒ² βˆ’ 1 = 1 βˆ’ 1/n ≀ 1

SLIDE 44

Chebyshev's inequality

var(L_i) ≀ n Β· (1/n) + n(n βˆ’ 1)/nΒ² βˆ’ 1 = 1 βˆ’ 1/n ≀ 1

Chebyshev's inequality: If X is a random variable with var(X) = σ², then for any Ξ» > 0, β„™[|X βˆ’ 𝔼X| β‰₯ λσ] ≀ 1/λ².

SLIDE 45

Chebyshev's inequality

var(L_i) ≀ n Β· (1/n) + n(n βˆ’ 1)/nΒ² βˆ’ 1 = 1 βˆ’ 1/n ≀ 1

Chebyshev's inequality: If X is a random variable with var(X) = σ², then for any Ξ» > 0, β„™[|X βˆ’ 𝔼X| β‰₯ λσ] ≀ 1/λ².

Applying Chebyshev's inequality to L_i (with σ ≀ 1) yields β„™[|L_i βˆ’ 1| β‰₯ Ξ»] ≀ 1/λ².

SLIDE 46

Chebyshev's inequality

var(L_i) ≀ n Β· (1/n) + n(n βˆ’ 1)/nΒ² βˆ’ 1 = 1 βˆ’ 1/n ≀ 1

Chebyshev's inequality: If X is a random variable with var(X) = σ², then for any Ξ» > 0, β„™[|X βˆ’ 𝔼X| β‰₯ λσ] ≀ 1/λ².

Applying Chebyshev's inequality to L_i (with σ ≀ 1) yields β„™[|L_i βˆ’ 1| β‰₯ Ξ»] ≀ 1/λ².

Thus β„™[|L_i βˆ’ 1| β‰₯ √(2n)] ≀ 1/(2n), so a union bound yields

β„™[ max(L_1, …, L_n) β‰₯ √(2n) + 1 ] ≀ 1/2
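A sketch of this experiment using the classic Carter–Wegman construction h(x) = ((ax + b) mod p) mod n as a concrete (nearly) 2-universal family; the choice of family is an assumption here, not something specified on the slides:

```python
import math
import random

P = 2_147_483_647            # the prime 2^31 - 1, larger than every key hashed below

def draw_hash(n: int):
    """Draw h(x) = ((a*x + b) mod P) mod n with random a in [1, P), b in [0, P)."""
    a = random.randrange(1, P)
    b = random.randrange(P)
    return lambda x: ((a * x + b) % P) % n

n = 10_000
h = draw_hash(n)
loads = [0] * n
for ball in range(n):        # hash n balls (keys 0 .. n-1) into n bins
    loads[h(ball)] += 1
print("max load:", max(loads), " bound sqrt(2n) + 1 ~", round(math.sqrt(2 * n) + 1, 1))
```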

SLIDE 47

EXERCISE

Chebyshev's inequality: If X is a random variable with var(X) = σ², then for any Ξ» > 0, β„™[|X βˆ’ 𝔼X| β‰₯ λσ] ≀ 1/λ².

Suppose we choose n independent random voters and ask them whether they prefer candidate #1 over candidate #2. We see outcomes X_1, X_2, …, X_n ∈ {0, 1}.
Let p be the actual fraction of the population that prefers candidate #1, and let pΜ‚ = (X_1 + β‹― + X_n)/n denote the empirical mean.

Exercise: Prove that if we want |p βˆ’ pΜ‚| ≀ Ξ΅ to hold with 99% probability, then we need only sample n = O(1/Ρ²) voters.

SLIDE 48

Sums of independent random variables

Hoeffding's inequality: Let X_1, …, X_n be a sequence of independent random variables where, for each 1 ≀ j ≀ n, we have a_j ≀ X_j ≀ b_j. Let X = (X_1 + β‹― + X_n)/n. Then:

β„™[ |X βˆ’ 𝔼X| β‰₯ Ξ» ] ≀ 2 exp( βˆ’2λ²nΒ² / Ξ£_{j=1}^{n} (b_j βˆ’ a_j)Β² )

SLIDE 49

Sums of independent random variables

Hoeffding's inequality: Let X_1, …, X_n be a sequence of independent random variables where, for each 1 ≀ j ≀ n, we have a_j ≀ X_j ≀ b_j. Let X = (X_1 + β‹― + X_n)/n. Then:

β„™[ |X βˆ’ 𝔼X| β‰₯ Ξ» ] ≀ 2 exp( βˆ’2λ²nΒ² / Ξ£_{j=1}^{n} (b_j βˆ’ a_j)Β² )

Suppose we wanted our poll from the previous slide to be correct with probability at least 1 βˆ’ Ξ΄. Chebyshev's inequality would tell us we need at most O(1/(Ρ²δ)) samples.

SLIDE 50

Sums of independent random variables

Hoeffding's inequality: Let X_1, …, X_n be a sequence of independent random variables where, for each 1 ≀ j ≀ n, we have a_j ≀ X_j ≀ b_j. Let X = (X_1 + β‹― + X_n)/n. Then:

β„™[ |X βˆ’ 𝔼X| β‰₯ Ξ» ] ≀ 2 exp( βˆ’2λ²nΒ² / Ξ£_{j=1}^{n} (b_j βˆ’ a_j)Β² )

Suppose we wanted our poll from the previous slide to be correct with probability at least 1 βˆ’ Ξ΄. Chebyshev's inequality would tell us we need at most O(1/(Ρ²δ)) samples.

Setting a_j = 0, b_j = 1, and Ξ» = Ξ΅ in Hoeffding's inequality gives

β„™[ |pΜ‚ βˆ’ p| β‰₯ Ξ΅ ] ≀ 2 e^{βˆ’2Ρ²n},

so we only need n = O( log(1/Ξ΄) / Ρ² ) samples.
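A quick comparison of the two sample-size bounds (a sketch; the Chebyshev count uses the extra fact that a 0/1 vote has variance at most 1/4, a step not spelled out on the slides):

```python
import math

def chebyshev_samples(eps: float, delta: float) -> int:
    # var(X_j) <= 1/4 for 0/1 votes, so var(p_hat) <= 1/(4n);
    # Chebyshev needs 1/(4 n eps^2) <= delta.
    return math.ceil(1 / (4 * eps ** 2 * delta))

def hoeffding_samples(eps: float, delta: float) -> int:
    # 2 exp(-2 eps^2 n) <= delta  <=>  n >= ln(2/delta) / (2 eps^2).
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

eps = 0.02
for delta in (0.01, 1e-6):
    print(delta, chebyshev_samples(eps, delta), hoeffding_samples(eps, delta))
```

The gap grows quickly: the Chebyshev count scales like 1/Ξ΄ while the Hoeffding count scales like log(1/Ξ΄).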

SLIDE 51

Sums of independent random variables

Chernoff bound (multiplicative): Let X_1, …, X_n be a sequence of independent {0,1}-valued random variables. Let p_j = 𝔼[X_j], X = X_1 + X_2 + β‹― + X_n, and ΞΌ = 𝔼[X]. Then for every Ξ² β‰₯ 1:

β„™[ X β‰₯ Ξ²ΞΌ ] ≀ ( e^{Ξ²βˆ’1} / Ξ²^Ξ² )^ΞΌ

β„™[ X ≀ ΞΌ/Ξ² ] ≀ ( e^{1/Ξ² βˆ’ 1} Β· Ξ²^{1/Ξ²} )^ΞΌ

SLIDE 52

Sums of independent random variables

Chernoff bound (multiplicative): Let X_1, …, X_n be a sequence of independent {0,1}-valued random variables. Let p_j = 𝔼[X_j], X = X_1 + X_2 + β‹― + X_n, and ΞΌ = 𝔼[X]. Then for every Ξ² β‰₯ 1:

β„™[ X β‰₯ Ξ²ΞΌ ] ≀ ( e^{Ξ²βˆ’1} / Ξ²^Ξ² )^ΞΌ

β„™[ X ≀ ΞΌ/Ξ² ] ≀ ( e^{1/Ξ² βˆ’ 1} Β· Ξ²^{1/Ξ²} )^ΞΌ

Reproduce balls in bins: n balls are thrown randomly into n bins. Let X_i = 1 if the i-th ball ends up in the first bin and X_i = 0 otherwise. Then X = # of balls in the first bin. As we calculated earlier, 𝔼[X] = 1.
For Ξ² β‰ˆ log n / log log n, the Chernoff bound gives β„™[X β‰₯ Ξ²] ≀ 1/nΒ².

SLIDE 53

Sums of independent random variables

Chernoff bound (multiplicative): Let X_1, …, X_n be a sequence of independent {0,1}-valued random variables. Let p_j = 𝔼[X_j], X = X_1 + X_2 + β‹― + X_n, and ΞΌ = 𝔼[X]. Then for every Ξ² β‰₯ 1:

β„™[ X β‰₯ Ξ²ΞΌ ] ≀ ( e^{Ξ²βˆ’1} / Ξ²^Ξ² )^ΞΌ

β„™[ X ≀ ΞΌ/Ξ² ] ≀ ( e^{1/Ξ² βˆ’ 1} Β· Ξ²^{1/Ξ²} )^ΞΌ

Reproduce balls in bins: n balls are thrown randomly into n bins. Let X_i = 1 if the i-th ball ends up in the first bin and X_i = 0 otherwise. Then X = # of balls in the first bin. As we calculated earlier, 𝔼[X] = 1.
For Ξ² β‰ˆ log n / log log n, the Chernoff bound gives β„™[X β‰₯ Ξ²] ≀ 1/nΒ².
This type of analysis works for much more complicated kinds of events (see homework #2).
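A sketch that simply evaluates the upper-tail bound; at moderate n the constant multiple of log n / log log n matters, so the bound is shown for several multiples:

```python
import math

def chernoff_upper(beta: float, mu: float) -> float:
    """Evaluate the upper-tail bound (e^(beta - 1) / beta^beta)^mu."""
    return (math.exp(beta - 1) / beta ** beta) ** mu

n = 10 ** 6
mu = 1.0                                    # expected number of balls in the first bin
base = math.log(n) / math.log(math.log(n))
for c in (1, 2, 4, 8):
    print(f"beta = {c} * log n / log log n: bound = {chernoff_upper(c * base, mu):.2e}")
print("target 1/n^2 =", 1 / n ** 2)
```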

SLIDE 54

(return to) Bloom filters

Heuristic analysis: If each location in B is 0 with probability q(k, n), then a false positive for x βˆ‰ T should happen with probability at most

(1 βˆ’ q(k, n))^k β‰ˆ (1 βˆ’ e^{βˆ’kn/m})^k

But the actual fraction of 0's in the hash table is a random variable X_{k,n} with expectation 𝔼[X_{k,n}] = q(k, n).
To get the analysis right, we need a concentration bound: we want to say that X_{k,n} is close to its expected value with high probability. Let's analyze!

SLIDE 55

(return to) Bloom filters

We have an array with m bits, and to hash an element x ∈ U we set the bits in positions h_1(x), h_2(x), …, h_k(x) to 1.

SLIDE 56

(return to) Bloom filters

We have an array with m bits, and to hash an element x ∈ U we set the bits in positions h_1(x), h_2(x), …, h_k(x) to 1.
Let X be the # of 0's in the hash table after n elements are hashed.

SLIDE 57

(return to) Bloom filters

We have an array with m bits, and to hash an element x ∈ U we set the bits in positions h_1(x), h_2(x), …, h_k(x) to 1.
Let X be the # of 0's in the hash table after n elements are hashed.
Consider the n elements to hash: x_1, x_2, …, x_n.

SLIDE 58

(return to) Bloom filters

We have an array with m bits, and to hash an element x ∈ U we set the bits in positions h_1(x), h_2(x), …, h_k(x) to 1.
Let X be the # of 0's in the hash table after n elements are hashed.
Consider the n elements to hash: x_1, x_2, …, x_n.
Let I(x_j) = (h_1(x_j), …, h_k(x_j)).

SLIDE 59

(return to) Bloom filters

We have an array with m bits, and to hash an element x ∈ U we set the bits in positions h_1(x), h_2(x), …, h_k(x) to 1.
Let X be the # of 0's in the hash table after n elements are hashed.
Consider the n elements to hash: x_1, x_2, …, x_n.
Let I(x_j) = (h_1(x_j), …, h_k(x_j)).
Define X_i = 𝔼[ X | I(x_1), I(x_2), …, I(x_i) ], the expected # of 0's in the final table given how the first i elements were hashed.

SLIDE 60

(return to) Bloom filters

We have an array with m bits, and to hash an element x ∈ U we set the bits in positions h_1(x), h_2(x), …, h_k(x) to 1.
Let X be the # of 0's in the hash table after n elements are hashed.
Consider the n elements to hash: x_1, x_2, …, x_n.
Let I(x_j) = (h_1(x_j), …, h_k(x_j)).
Define X_i = 𝔼[ X | I(x_1), I(x_2), …, I(x_i) ], the expected # of 0's in the final table given how the first i elements were hashed.
Note that x_1, …, x_n are an arbitrary fixed set of keys; the randomness here is all in the choice of the hash functions h_1, …, h_k.

SLIDE 61

(return to) Bloom filters

Let X be the # of 0's in the hash table after n elements are hashed.
Consider the n elements to hash: x_1, x_2, …, x_n.
I(x_j) = (h_1(x_j), …, h_k(x_j));  X_i = 𝔼[ X | I(x_1), I(x_2), …, I(x_i) ].

We calculated before that X_0 = 𝔼[X] = m (1 βˆ’ 1/m)^{kn}. [why?]

SLIDE 62

(return to) Bloom filters

Let X be the # of 0's in the hash table after n elements are hashed.
Consider the n elements to hash: x_1, x_2, …, x_n.
I(x_j) = (h_1(x_j), …, h_k(x_j));  X_i = 𝔼[ X | I(x_1), I(x_2), …, I(x_i) ].

We calculated before that X_0 = 𝔼[X] = m (1 βˆ’ 1/m)^{kn}. [why?]
Now we want to know the probability that X is much different from its expectation X_0.

SLIDE 63

(return to) Bloom filters

Let X be the # of 0's in the hash table after n elements are hashed.
Consider the n elements to hash: x_1, x_2, …, x_n.
I(x_j) = (h_1(x_j), …, h_k(x_j));  X_i = 𝔼[ X | I(x_1), I(x_2), …, I(x_i) ].

We calculated before that X_0 = 𝔼[X] = m (1 βˆ’ 1/m)^{kn}. [why?]
Now we want to know the probability that X is much different from its expectation X_0.

Claim #1: |X_{i+1} βˆ’ X_i| ≀ k for all i = 0, 1, …, n βˆ’ 1.

SLIDE 64

(return to) Bloom filters

Let X be the # of 0's in the hash table after n elements are hashed.
Consider the n elements to hash: x_1, x_2, …, x_n.
I(x_j) = (h_1(x_j), …, h_k(x_j));  X_i = 𝔼[ X | I(x_1), I(x_2), …, I(x_i) ].

We calculated before that X_0 = 𝔼[X] = m (1 βˆ’ 1/m)^{kn}. [why?]
Now we want to know the probability that X is much different from its expectation X_0.

Claim #1: |X_{i+1} βˆ’ X_i| ≀ k for all i = 0, 1, …, n βˆ’ 1.

Claim #2: 𝔼[ X_{i+1} | I(x_1), …, I(x_i) ] = X_i for all i = 0, 1, …, n βˆ’ 1.

SLIDE 65

(return to) Bloom filters

Let X be the # of 0's in the hash table after n elements are hashed.
Consider the n elements to hash: x_1, x_2, …, x_n.
I(x_j) = (h_1(x_j), …, h_k(x_j));  X_i = 𝔼[ X | I(x_1), I(x_2), …, I(x_i) ].

We calculated before that X_0 = 𝔼[X] = m (1 βˆ’ 1/m)^{kn}. [why?]
Now we want to know the probability that X is much different from its expectation X_0.

Claim #1: |X_{i+1} βˆ’ X_i| ≀ k for all i = 0, 1, …, n βˆ’ 1.

Claim #2: 𝔼[ X_{i+1} | I(x_1), …, I(x_i) ] = X_i for all i = 0, 1, …, n βˆ’ 1.

Such a sequence of random variables is called a martingale.

SLIDE 66

Azuma's inequality

Suppose that {X_0, X_1, …, X_n} is a martingale such that for some constants {c_i}, |X_i βˆ’ X_{iβˆ’1}| ≀ c_i for all i = 1, …, n. Then for any Ξ» > 0,

β„™[ |X_n βˆ’ X_0| β‰₯ Ξ» ] ≀ 2 exp( βˆ’Ξ»Β² / (2(c_1Β² + β‹― + c_nΒ²)) )

SLIDE 67

Azuma's inequality

Suppose that {X_0, X_1, …, X_n} is a martingale such that for some constants {c_i}, |X_i βˆ’ X_{iβˆ’1}| ≀ c_i for all i = 1, …, n. Then for any Ξ» > 0,

β„™[ |X_n βˆ’ X_0| β‰₯ Ξ» ] ≀ 2 exp( βˆ’Ξ»Β² / (2(c_1Β² + β‹― + c_nΒ²)) )

For our problem: c_1 = c_2 = β‹― = c_n = k.
So the probability that the # of 0's differs from its expectation by more than Ξ» is at most 2 exp(βˆ’Ξ»Β² / (2kΒ²n)).
So the deviation is β‰ˆ k√n, and X is tightly concentrated in this window.
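A small simulation sketch comparing the observed spread of the number of zeros with the β‰ˆ k√n window given by Azuma's inequality:

```python
import math
import random
import statistics

def zeros_after_hashing(n: int, m: int, k: int, rng: random.Random) -> int:
    """Hash n keys, each setting k uniformly random bits; return the number of 0 bits."""
    bits = [0] * m
    for _ in range(n * k):
        bits[rng.randrange(m)] = 1
    return bits.count(0)

n, m, k = 1000, 8000, 7
rng = random.Random(1)
samples = [zeros_after_hashing(n, m, k, rng) for _ in range(200)]
expectation = m * (1 - 1 / m) ** (k * n)
print(f"E[X] = {expectation:.1f}, sample mean = {statistics.mean(samples):.1f}, "
      f"sample std = {statistics.stdev(samples):.1f}, Azuma window k*sqrt(n) = {k * math.sqrt(n):.0f}")
```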

SLIDE 68

(take home) EXERCISE

Azuma's inequality: Suppose that {X_0, X_1, …, X_n} is a martingale such that for some constants {c_i}, |X_i βˆ’ X_{iβˆ’1}| ≀ c_i for all i = 1, …, n. Then for any Ξ» > 0,

β„™[ |X_n βˆ’ X_0| β‰₯ Ξ» ] ≀ 2 exp( βˆ’Ξ»Β² / (2(c_1Β² + β‹― + c_nΒ²)) )

For our problem: c_1 = c_2 = β‹― = c_n = k, so the probability that the # of 0's differs from its expectation by more than Ξ» is at most 2 exp(βˆ’Ξ»Β² / (2kΒ²n)), and the deviation is β‰ˆ k√n.

Exercise: Improve the error probability to 2 exp(βˆ’Ξ»Β² / (2kn)) using a different martingale.