Lecture #2: Advanced hashing and concentration bounds


SLIDE 1

Outline

Lecture #2: Advanced hashing and concentration bounds

  • Bloom filters
  • Cuckoo hashing
  • Load balancing
  • Tail bounds
SLIDE 2

Bloom filters

Idea: For the sake of efficiency, sometimes we allow our data structure to make mistakes.

Bloom filter: A hash table that has only false positives (it may report that a key is present when it is not, but it always reports a key that is present). Very simple and fast.

Example: Google Chrome uses a Bloom filter to maintain its list of potentially malicious web sites.

  • Most queried keys are not in the table
  • If a key is in the table, can check against a slower (errorless) hash table

Many applications in networking (see survey by Broder and Mitzenmacher)

SLIDE 3

Bloom filters

Data structure: Universe U. Parameters k, m β‰₯ 1.
Maintain an array B of m bits; initially B[0] = B[1] = β‹― = B[mβˆ’1] = 0.
Choose k hash functions h_1, h_2, …, h_k : U β†’ [m] (assume completely random functions for the sake of analysis).

SLIDE 4

Bloom filters

Data structure: Universe U. Parameters k, m β‰₯ 1.
Maintain an array B of m bits; initially B[0] = B[1] = β‹― = B[mβˆ’1] = 0.
Choose k hash functions h_1, h_2, …, h_k : U β†’ [m] (assume completely random functions for the sake of analysis).
To add a key x ∈ U to the dictionary T βŠ† U, set the bits B[h_1(x)] ≔ 1, B[h_2(x)] ≔ 1, …, B[h_k(x)] ≔ 1.

SLIDE 5

Bloom filters

Data structure: Universe U. Parameters k, m β‰₯ 1.
Maintain an array B of m bits; initially B[0] = B[1] = β‹― = B[mβˆ’1] = 0.
Choose k hash functions h_1, h_2, …, h_k : U β†’ [m] (assume completely random functions for the sake of analysis).
To add a key x ∈ U to the dictionary T βŠ† U, set the bits B[h_1(x)] ≔ 1, B[h_2(x)] ≔ 1, …, B[h_k(x)] ≔ 1.
To answer a query "x ∈ T?": check whether B[h_j(x)] = 1 for all j = 1, 2, …, k. If yes, answer Yes. If no, answer No.
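A minimal sketch of the data structure in Python (not from the slides; the k "completely random" hash functions of the analysis are simulated here with seeded blake2b digests):

```python
import hashlib

class BloomFilter:
    """Sketch: an m-bit array B with k hash functions h_1, ..., h_k."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = [0] * m                      # B[0] = ... = B[m-1] = 0

    def _h(self, j: int, key: str) -> int:
        # Stand-in for the j-th "completely random" hash function h_j : U -> [m].
        digest = hashlib.blake2b(f"{j}|{key}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.m

    def add(self, key: str) -> None:
        for j in range(self.k):
            self.bits[self._h(j, key)] = 1       # set B[h_j(key)] := 1

    def query(self, key: str) -> bool:
        # Yes iff all k bits are set: false positives possible, false negatives impossible.
        return all(self.bits[self._h(j, key)] == 1 for j in range(self.k))

n = 1000
bf = BloomFilter(m=8 * n, k=7)
for i in range(n):
    bf.add(f"key-{i}")
assert all(bf.query(f"key-{i}") for i in range(n))          # no false negatives
fp = sum(bf.query(f"other-{i}") for i in range(10_000)) / 10_000
print(f"empirical false-positive rate: {fp:.3f}")
```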

SLIDE 6

Bloom filters

No false negatives: Clearly if x ∈ T, we return Yes.
But there is some chance that other keys have caused the bits in positions h_1(x), …, h_k(x) to be set even if x βˆ‰ T.

SLIDE 7

Bloom filters

No false negatives: Clearly if x ∈ T, we return Yes.
But there is some chance that other keys have caused the bits in positions h_1(x), …, h_k(x) to be set even if x βˆ‰ T.

Heuristic analysis: Let us assume that |T| = n. Compute β„™[B[β„“] = 0] for some location β„“ ∈ [m]:

q(k, n) = (1 βˆ’ 1/m)^{kn} β‰ˆ e^{βˆ’kn/m}

(Here we use the approximation (1 βˆ’ 1/m)^m β‰ˆ e^{βˆ’1} for m large enough.)

SLIDE 8

Bloom filters

No false negatives: Clearly if x ∈ T, we return Yes.
But there is some chance that other keys have caused the bits in positions h_1(x), …, h_k(x) to be set even if x βˆ‰ T.

Heuristic analysis: Let us assume that |T| = n. Compute β„™[B[β„“] = 0] for some location β„“ ∈ [m]:

q(k, n) = (1 βˆ’ 1/m)^{kn} β‰ˆ e^{βˆ’kn/m}

(Here we use the approximation (1 βˆ’ 1/m)^m β‰ˆ e^{βˆ’1} for m large enough.)

If each location in B is 0 with probability q(k, n), then a false positive for x βˆ‰ T should happen with probability at most

(1 βˆ’ q(k, n))^k β‰ˆ (1 βˆ’ e^{βˆ’kn/m})^k

SLIDE 9

Bloom filters

Heuristic analysis: If each location in B is 0 with probability q(k, n), then a false positive for x βˆ‰ T should happen with probability at most

(1 βˆ’ q(k, n))^k β‰ˆ (1 βˆ’ e^{βˆ’kn/m})^k

SLIDE 10

Bloom filters

Heuristic analysis: If each location in B is 0 with probability q(k, n), then a false positive for x βˆ‰ T should happen with probability at most

(1 βˆ’ q(k, n))^k β‰ˆ (1 βˆ’ e^{βˆ’kn/m})^k

But the actual fraction of 0's in the hash table is a random variable X_{k,n} with expectation 𝔼[X_{k,n}] = q(k, n).
To get the analysis right, we need a concentration bound: we want to say that X_{k,n} is close to its expected value with high probability. [We will return to this in the 2nd half of the lecture.]

SLIDE 11

Bloom filters

Heuristic analysis: If each location in B is 0 with probability q(k, n), then a false positive for x βˆ‰ T should happen with probability at most

(1 βˆ’ q(k, n))^k β‰ˆ (1 βˆ’ e^{βˆ’kn/m})^k

But the actual fraction of 0's in the hash table is a random variable X_{k,n} with expectation 𝔼[X_{k,n}] = q(k, n).
To get the analysis right, we need a concentration bound: we want to say that X_{k,n} is close to its expected value with high probability. [We will return to this in the 2nd half of the lecture.]

If the heuristic analysis is correct, it gives nice estimates: For instance, if m = 8n, then choosing k = 7 gives a false positive rate of about 2%.

SLIDE 12

Outline

Lecture #2: Advanced hashing and concentration bounds

  • Bloom filters
  • Cuckoo hashing
  • Load balancing
  • Tail bounds

Cuckoo hashing is a hash scheme with worst-case constant lookup time. The name derives from the behavior of some species of cuckoo, where the cuckoo chick pushes the other eggs or young out of the nest when it hatches; analogously, inserting a new key into a cuckoo hashing table may push an older key to a different location in the table.

SLIDE 13

Cuckoo hashing

Idea: Simple hashing without errors.
Lookups are worst-case O(1) time.
Deletions are worst-case O(1) time.
Insertions are expected O(1) time.
Insertion time is O(1) with good probability [will require a concentration bound].

SLIDE 14

Cuckoo hashing

Data structure: Two tables B_1 and B_2, both of size m = O(n).
Two hash functions h_1, h_2 : U β†’ [m] (we will assume the hash functions are fully random).
When an element x ∈ T is inserted, if either B_1[h_1(x)] or B_2[h_2(x)] is empty, store x there.

SLIDE 15

Cuckoo hashing

Data structure: Two tables B_1 and B_2, both of size m = O(n).
Two hash functions h_1, h_2 : U β†’ [m] (we will assume the hash functions are fully random).
When an element x ∈ T is inserted, if either B_1[h_1(x)] or B_2[h_2(x)] is empty, store x there.
If both locations are occupied, then place x in B_1[h_1(x)] and bump the current occupant.
Bump: Whenever an element z is bumped from B_i[h_i(z)], attempt to store it in the other location B_j[h_j(z)] (here (i, j) = (1, 2) or (2, 1)).

SLIDE 16

Cuckoo hashing

Data structure: Two tables B_1 and B_2, both of size m = O(n).
Two hash functions h_1, h_2 : U β†’ [m] (we will assume the hash functions are fully random).
When an element x ∈ T is inserted, if either B_1[h_1(x)] or B_2[h_2(x)] is empty, store x there.
If both locations are occupied, then place x in B_1[h_1(x)] and bump the current occupant.
Bump: Whenever an element z is bumped from B_i[h_i(z)], attempt to store it in the other location B_j[h_j(z)] (here (i, j) = (1, 2) or (2, 1)).
Abort: After 6 log n consecutive bumps, stop the process and build a fresh hash table using new random hash functions h_1, h_2.
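A sketch of the insertion procedure with bumping and the abort-and-rebuild rule (hypothetical Python, not from the slides; the fully random hash functions are simulated with seeded use of Python's hash, and n_hint is an illustrative stand-in for n in the "6 log n" limit):

```python
import math
import random

class CuckooHashTable:
    """Sketch: tables B1, B2 of size m with hash functions h1, h2."""

    def __init__(self, m: int, n_hint: int = 16):
        self.m = m
        self.max_bumps = max(1, int(6 * math.log(max(n_hint, 2))))   # "6 log n" abort rule
        self.tables = [[None] * m, [None] * m]
        self._new_hash_functions()

    def _new_hash_functions(self) -> None:
        # Stand-in for drawing fresh fully random h1, h2 : U -> [m].
        self.seeds = [random.getrandbits(64), random.getrandbits(64)]

    def _h(self, t: int, key) -> int:
        return hash((self.seeds[t], key)) % self.m

    def lookup(self, key) -> bool:
        # Worst-case O(1): exactly two probes.
        return any(self.tables[t][self._h(t, key)] == key for t in (0, 1))

    def insert(self, key) -> None:
        # If either B1[h1(key)] or B2[h2(key)] is empty, store the key there.
        for t in (0, 1):
            pos = self._h(t, key)
            if self.tables[t][pos] is None:
                self.tables[t][pos] = key
                return
        # Otherwise place it in B1[h1(key)]; each bumped element tries its other location.
        cur, t = key, 0
        for _ in range(self.max_bumps):
            pos = self._h(t, cur)
            if self.tables[t][pos] is None:
                self.tables[t][pos] = cur
                return
            cur, self.tables[t][pos] = self.tables[t][pos], cur
            t = 1 - t
        # Abort: after ~6 log n consecutive bumps, rebuild with fresh hash functions.
        keys = [x for table in self.tables for x in table if x is not None] + [cur]
        self.tables = [[None] * self.m, [None] * self.m]
        self._new_hash_functions()
        for x in keys:
            self.insert(x)

n = 500
cht = CuckooHashTable(m=4 * n, n_hint=n)
for i in range(n):
    cht.insert(i)
assert all(cht.lookup(i) for i in range(n))
```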

SLIDE 17

Cuckoo hashing

Arrows represent the alternate location for each key. If we insert an item at the location of B, it will get bumped, thereby bumping C, and then we are done.
Cycles are possible (where the insertion process never completes). What's an example?
Alternatively (as in the picture), we can use a single table with 2m entries and two hash functions h_1, h_2 : U β†’ [2m] (with the same "bumping" algorithm).

SLIDE 18

Cuckoo hashing

Data structure: Two tables B_1 and B_2, both of size m = O(n).
Two hash functions h_1, h_2 : U β†’ [m] (we will assume the hash functions are fully random).

Theorem: The expected time to perform an insert operation is O(1) if m β‰₯ 4n.

SLIDE 19

Cuckoo hashing

Data structure: Two tables B_1 and B_2, both of size m = O(n).
Two hash functions h_1, h_2 : U β†’ [m] (we will assume the hash functions are fully random).

Theorem: The expected time to perform an insert operation is O(1) if m β‰₯ 4n.

Pretty good… but only 25% memory utilization. One can actually get about 50% memory utilization.
Experimentally, with 3 hash functions instead of 2, one can get β‰ˆ 90% utilization, but it is an open question to provide tight analyses for d hash functions when d β‰₯ 3.

SLIDE 20

Outline

Lecture #2: Advanced hashing and concentration bounds

  • Bloom filters
  • Cuckoo hashing
  • Load balancing
  • Tail bounds
SLIDE 21

Load balancing

Suppose we have n jobs to assign to n servers. Clearly we could achieve a load of one job per server, but this might result in an expensive/hard-to-parallelize allocation rule.

SLIDE 22

Load balancing

Suppose we have n jobs to assign to n servers. Clearly we could achieve a load of one job per server, but this might result in an expensive/hard-to-parallelize allocation rule.
We could hash the balls into bins. Let's again consider the case of a uniformly random hash function h : [n] β†’ [n].

SLIDE 23

Load balancing

Suppose we have n jobs to assign to n servers. Clearly we could achieve a load of one job per server, but this might result in an expensive/hard-to-parallelize allocation rule.
We could hash the balls into bins. Let's again consider the case of a uniformly random hash function h : [n] β†’ [n].

Claim: The max-loaded server has < 8 log n / log log n jobs with probability at least 1 βˆ’ 1/n.

SLIDE 24

Load balancing

Suppose we have n jobs to assign to n servers. We hash the balls into bins with a uniformly random hash function h : [n] β†’ [n].

Claim: The max-loaded server has < 8 log n / log log n jobs with probability at least 1 βˆ’ 1/n.

Proof: The probability that a fixed server i ∈ {1, 2, …, n} gets at least k jobs is at most

C(n, k) Β· (1/n)^k ≀ (n^k / k!) Β· (1/n^k) = 1/k! ≀ k^{βˆ’k/2}

SLIDE 25

Load balancing

Suppose we have n jobs to assign to n servers. We hash the balls into bins with a uniformly random hash function h : [n] β†’ [n].

Claim: The max-loaded server has < 8 log n / log log n jobs with probability at least 1 βˆ’ 1/n.

Proof: The probability that a fixed server i ∈ {1, 2, …, n} gets at least k jobs is at most

C(n, k) Β· (1/n)^k ≀ (n^k / k!) Β· (1/n^k) = 1/k! ≀ k^{βˆ’k/2}

If we choose k = 8 log n / log log n, this is at most 1/nΒ².

Explanation: for n large enough, k β‰₯ √(log n), so

k^{k/2} β‰₯ (log n)^{k/4} = (log n)^{2 log n / log log n} = 2^{2 log n} = nΒ²

SLIDE 26

Load balancing

Suppose we have n jobs to assign to n servers. We hash the balls into bins with a uniformly random hash function h : [n] β†’ [n].

Claim: The max-loaded server has < 8 log n / log log n jobs with probability at least 1 βˆ’ 1/n.

Proof: The probability that a fixed server i ∈ {1, 2, …, n} gets at least k jobs is at most

C(n, k) Β· (1/n)^k ≀ (n^k / k!) Β· (1/n^k) = 1/k! ≀ k^{βˆ’k/2}

If we choose k = 8 log n / log log n, this is at most 1/nΒ².

Now a union bound shows that the probability that any server gets at least k jobs is at most 1/n.
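A small simulation sketch (not from the slides) comparing the observed maximum load against the 8 log n / log log n bound:

```python
import math
import random
from collections import Counter

def max_load(n: int) -> int:
    """Assign n jobs to n servers with a uniformly random hash; return the maximum load."""
    loads = Counter(random.randrange(n) for _ in range(n))
    return max(loads.values())

n = 100_000
print("claimed bound 8 log n / log log n =", round(8 * math.log(n) / math.log(math.log(n)), 1))
print("max loads over 20 trials:", [max_load(n) for _ in range(20)])
```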

SLIDE 27

Concentration bounds

Claim: The max-loaded server has < 8 log n / log log n jobs with probability at least 1 βˆ’ 1/n.

Proof: The probability that a fixed server i ∈ {1, 2, …, n} gets at least k jobs is at most

C(n, k) Β· (1/n)^k ≀ (n^k / k!) Β· (1/n^k) = 1/k! ≀ k^{βˆ’k/2}

If we choose k = 8 log n / log log n, this is at most 1/nΒ². Now a union bound shows that the probability that any server gets at least k jobs is at most 1/n.

This is an example of a concentration bound. Let X_i be the number of jobs assigned to the i-th server. By linearity of expectation,

𝔼[X_i] = Ξ£_{j=1}^{n} β„™[job j β†’ server i] = n Β· (1/n) = 1.

SLIDE 28

Concentration bounds

Claim: The max-loaded server has < 8 log n / log log n jobs with probability at least 1 βˆ’ 1/n.

Proof: The probability that a fixed server i ∈ {1, 2, …, n} gets at least k jobs is at most

C(n, k) Β· (1/n)^k ≀ (n^k / k!) Β· (1/n^k) = 1/k! ≀ k^{βˆ’k/2}

If we choose k = 8 log n / log log n, this is at most 1/nΒ². Now a union bound shows that the probability that any server gets at least k jobs is at most 1/n.

This is an example of a concentration bound. Let X_i be the number of jobs assigned to the i-th server. By linearity of expectation,

𝔼[X_i] = Ξ£_{j=1}^{n} β„™[job j β†’ server i] = n Β· (1/n) = 1.

We showed that β„™[X_i β‰₯ 8 log n / log log n] ≀ 1/nΒ² and then took a union bound over all n servers.

SLIDE 29

Concentration bounds

This is an example of a concentration bound. Let X_i be the number of jobs assigned to the i-th server. By linearity of expectation,

𝔼[X_i] = Ξ£_{j=1}^{n} β„™[job j β†’ server i] = n Β· (1/n) = 1.

We showed that β„™[X_i β‰₯ 8 log n / log log n] ≀ 1/nΒ² and then took a union bound over all n servers.

This is a common analysis technique: If a random variable (like X_i) depends in a "smooth" way on the outcome of many independent events, then it is likely not too far from its expectation.

SLIDE 30

Concentration bounds

Let X_i be the number of jobs assigned to the i-th server. By linearity of expectation,

𝔼[X_i] = Ξ£_{j=1}^{n} β„™[job j β†’ server i] = n Β· (1/n) = 1.

We showed that β„™[X_i β‰₯ 8 log n / log log n] ≀ 1/nΒ² and then took a union bound over all n servers.

This is a common analysis technique: If a random variable (like X_i) depends in a "smooth" way on the outcome of many independent events, then it is likely not too far from its expectation. "Smooth" in this case means that the outcome of any single decision (where to put job j) does not affect the value of X_i by too much (only by 1). This is an example of a concentration bound.

SLIDE 31

EXERCISE

Is it concentrated? [Why or why not?]

#1: Choose a uniformly random vector X ∈ ℝ^n with β€–Xβ€– = √(X_1Β² + X_2Β² + β‹― + X_nΒ²) = 1.
What is 𝔼[X_1Β²]?
What is the typical value of the maximum: max(|X_1|, |X_2|, …, |X_n|)?

#2 Rich get richer: Suppose we have n people. Everyone starts with 1 dollar.
We assign nΒ² more dollars in rounds. In the j-th round: if person i already has n_i dollars, we give them the j-th dollar with probability n_i / (j βˆ’ 1), i.e., with probability proportional to the amount of money they already have.
Let X_j be the amount of money person j ends up with.
What is the typical value of X_1? Is X_1 concentrated?
What is the typical value of max(X_1, X_2, …, X_n)? Is it concentrated?

SLIDE 32

Outline

Lecture #2: Advanced hashing and concentration bounds

  • Bloom filters
  • Cuckoo hashing
  • Load balancing
  • Tail bounds
SLIDE 33

Markov’s inequality

The more you know: The more information we have about a random variable, the stronger the concentration we can prove.

SLIDE 34

Markov's inequality

The more you know: The more information we have about a random variable, the stronger the concentration we can prove.
The most basic concentration bound is Markov's inequality. It requires knowing only the expected value: If X is a non-negative random variable, then for any Ξ» > 0,

β„™[X β‰₯ Ξ»] ≀ 𝔼[X] / Ξ»

Proof? (It's written there.)

SLIDE 35

Markov's inequality

The more you know: The more information we have about a random variable, the stronger the concentration we can prove.
The most basic concentration bound is Markov's inequality. It requires knowing only the expected value: If X is a non-negative random variable, then for any Ξ» > 0,

β„™[X β‰₯ Ξ»] ≀ 𝔼[X] / Ξ»

Proof? (It's written there.)

Example: If your expected revenue is $10,000, then the probability that you make at least $100,000 is at most 1/10.
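A tiny empirical illustration (a sketch; the revenue distribution below is made up for the example) showing that Markov's bound holds but can be very loose:

```python
import random

# X = revenue from 100 independent sales, each uniform in [0, 200], so E[X] = 10,000.
samples = [sum(random.uniform(0, 200) for _ in range(100)) for _ in range(20_000)]
mean = sum(samples) / len(samples)
lam = 100_000
empirical = sum(x >= lam for x in samples) / len(samples)
print(f"E[X] ~ {mean:.0f}")
print(f"empirical P[X >= {lam}] = {empirical}  vs  Markov bound E[X]/lambda = {mean / lam:.2f}")
```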

SLIDE 36

EXERCISE

Markov's inequality: If X is a non-negative random variable, then for any Ξ» > 0, β„™[X β‰₯ Ξ»] ≀ 𝔼[X] / Ξ».

A permutation is an invertible mapping Ο€ : {1, 2, …, n} β†’ {1, 2, …, n}. A number i is called a fixed point of Ο€ if Ο€(i) = i.

Exercise: Prove that if Ο€ is a uniformly random permutation, then β„™[Ο€ has more than k fixed points] ≀ 1/k.

SLIDE 37

Chebyshev's inequality

Recall that the variance of a random variable X is the value var(X) = σ² = 𝔼[(X βˆ’ 𝔼X)Β²].

SLIDE 38

Chebyshev's inequality

Recall that the variance of a random variable X is the value var(X) = σ² = 𝔼[(X βˆ’ 𝔼X)Β²].

Chebyshev's inequality: If X is a random variable with var(X) = σ², then for any Ξ» > 0,

β„™[|X βˆ’ 𝔼X| β‰₯ λσ] ≀ 1/λ²

SLIDE 39

Chebyshev's inequality

Recall that the variance of a random variable X is the value var(X) = σ² = 𝔼[(X βˆ’ 𝔼X)Β²].

Chebyshev's inequality: If X is a random variable with var(X) = σ², then for any Ξ» > 0,

β„™[|X βˆ’ 𝔼X| β‰₯ λσ] ≀ 1/λ²

Proof: Apply Markov's inequality to the random variable Z = (X βˆ’ 𝔼X)Β².

SLIDE 40

Chebyshev's inequality

Recall that the variance of a random variable X is the value var(X) = σ² = 𝔼[(X βˆ’ 𝔼X)Β²].

Chebyshev's inequality: If X is a random variable with var(X) = σ², then for any Ξ» > 0,

β„™[|X βˆ’ 𝔼X| β‰₯ λσ] ≀ 1/λ²

Proof: Apply Markov's inequality to the random variable Z = (X βˆ’ 𝔼X)Β².

Application: Suppose we map n balls into n bins using a 2-universal hash family H. Then with probability at least 1/2, the maximum load is at most O(√n).

SLIDE 41

Chebyshev's inequality

Application: Suppose we map n balls into n bins using a 2-universal hash family H. Then with probability at least 1/2, the maximum load is at most O(√n).

Let L_i be the load of bin i. Let X_{ij} be the indicator random variable such that X_{ij} = 1 ⟺ the i-th bin gets the j-th ball. Note that 𝔼[X_{ij}] = 1/n for each i = 1, …, n.

SLIDE 42

Chebyshev's inequality

Application: Suppose we map n balls into n bins using a 2-universal hash family H. Then with probability at least 1/2, the maximum load is at most O(√n).

Let L_i be the load of bin i. Let X_{ij} be the indicator random variable such that X_{ij} = 1 ⟺ the i-th bin gets the j-th ball. Note that 𝔼[X_{ij}] = 1/n for each i = 1, …, n.

Exercise: For any random variable Z, var(Z) = 𝔼[ZΒ²] βˆ’ (𝔼Z)Β².

SLIDE 43

Chebyshev's inequality

Application: Suppose we map n balls into n bins using a 2-universal hash family H. Then with probability at least 1/2, the maximum load is at most O(√n).

Let L_i be the load of bin i. Let X_{ij} be the indicator random variable such that X_{ij} = 1 ⟺ the i-th bin gets the j-th ball. Note that 𝔼[X_{ij}] = 1/n for each i = 1, …, n.

Exercise: For any random variable Z, var(Z) = 𝔼[ZΒ²] βˆ’ (𝔼Z)Β².

So write: var(L_i) = 𝔼[(X_{i1} + β‹― + X_{in})Β²] βˆ’ 1.
We have 𝔼[X_{ij}Β²] = 𝔼[X_{ij}] = 1/n and 𝔼[X_{ij} X_{ij'}] = β„™[h(j) = h(j') = i] ≀ 1/nΒ² using the 2-universal property, so

var(L_i) ≀ n Β· (1/n) + n(n βˆ’ 1)/nΒ² βˆ’ 1 = 1 βˆ’ 1/n ≀ 1

SLIDE 44

Chebyshev's inequality

var(L_i) ≀ n Β· (1/n) + n(n βˆ’ 1)/nΒ² βˆ’ 1 = 1 βˆ’ 1/n ≀ 1

Chebyshev's inequality: If X is a random variable with var(X) = σ², then for any Ξ» > 0, β„™[|X βˆ’ 𝔼X| β‰₯ λσ] ≀ 1/λ².

SLIDE 45

Chebyshev's inequality

var(L_i) ≀ n Β· (1/n) + n(n βˆ’ 1)/nΒ² βˆ’ 1 = 1 βˆ’ 1/n ≀ 1

Chebyshev's inequality: If X is a random variable with var(X) = σ², then for any Ξ» > 0, β„™[|X βˆ’ 𝔼X| β‰₯ λσ] ≀ 1/λ².

Applying Chebyshev's inequality to L_i (with σ ≀ 1) yields β„™[|L_i βˆ’ 1| β‰₯ Ξ»] ≀ 1/λ².

SLIDE 46

Chebyshev's inequality

var(L_i) ≀ n Β· (1/n) + n(n βˆ’ 1)/nΒ² βˆ’ 1 = 1 βˆ’ 1/n ≀ 1

Chebyshev's inequality: If X is a random variable with var(X) = σ², then for any Ξ» > 0, β„™[|X βˆ’ 𝔼X| β‰₯ λσ] ≀ 1/λ².

Applying Chebyshev's inequality to L_i (with σ ≀ 1) yields β„™[|L_i βˆ’ 1| β‰₯ Ξ»] ≀ 1/λ².

Thus β„™[|L_i βˆ’ 1| β‰₯ √(2n)] ≀ 1/(2n), so a union bound yields

β„™[ max(L_1, …, L_n) β‰₯ √(2n) + 1 ] ≀ 1/2
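A sketch of this experiment using the classic Carter–Wegman construction h(x) = ((ax + b) mod p) mod n as a concrete (nearly) 2-universal family; the choice of family is an assumption here, not something specified on the slides:

```python
import math
import random

P = 2_147_483_647            # the prime 2^31 - 1, larger than every key hashed below

def draw_hash(n: int):
    """Draw h(x) = ((a*x + b) mod P) mod n with random a in [1, P), b in [0, P)."""
    a = random.randrange(1, P)
    b = random.randrange(P)
    return lambda x: ((a * x + b) % P) % n

n = 10_000
h = draw_hash(n)
loads = [0] * n
for ball in range(n):        # hash n balls (keys 0 .. n-1) into n bins
    loads[h(ball)] += 1
print("max load:", max(loads), " bound sqrt(2n) + 1 ~", round(math.sqrt(2 * n) + 1, 1))
```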

SLIDE 47

EXERCISE

Chebyshev's inequality: If X is a random variable with var(X) = σ², then for any Ξ» > 0, β„™[|X βˆ’ 𝔼X| β‰₯ λσ] ≀ 1/λ².

Suppose we choose n independent random voters and ask them whether they prefer candidate #1 over candidate #2. We see outcomes X_1, X_2, …, X_n ∈ {0, 1}.
Let p be the actual fraction of the population that prefers candidate #1, and let pΜ‚ = (X_1 + β‹― + X_n)/n denote the empirical mean.

Exercise: Prove that if we want |p βˆ’ pΜ‚| ≀ Ξ΅ to hold with 99% probability, then we need only sample n = O(1/Ρ²) voters.

SLIDE 48

Sums of independent random variables

Hoeffding's inequality: Let X_1, …, X_n be a sequence of independent random variables where, for each 1 ≀ j ≀ n, we have a_j ≀ X_j ≀ b_j. Let X = (X_1 + β‹― + X_n)/n. Then:

β„™[ |X βˆ’ 𝔼X| β‰₯ Ξ» ] ≀ 2 exp( βˆ’2λ²nΒ² / Ξ£_{j=1}^{n} (b_j βˆ’ a_j)Β² )

SLIDE 49

Sums of independent random variables

Hoeffding's inequality: Let X_1, …, X_n be a sequence of independent random variables where, for each 1 ≀ j ≀ n, we have a_j ≀ X_j ≀ b_j. Let X = (X_1 + β‹― + X_n)/n. Then:

β„™[ |X βˆ’ 𝔼X| β‰₯ Ξ» ] ≀ 2 exp( βˆ’2λ²nΒ² / Ξ£_{j=1}^{n} (b_j βˆ’ a_j)Β² )

Suppose we wanted our poll from the previous slide to be correct with probability at least 1 βˆ’ Ξ΄. Chebyshev's inequality would tell us we need at most O(1/(Ρ²δ)) samples.

SLIDE 50

Sums of independent random variables

Hoeffding's inequality: Let X_1, …, X_n be a sequence of independent random variables where, for each 1 ≀ j ≀ n, we have a_j ≀ X_j ≀ b_j. Let X = (X_1 + β‹― + X_n)/n. Then:

β„™[ |X βˆ’ 𝔼X| β‰₯ Ξ» ] ≀ 2 exp( βˆ’2λ²nΒ² / Ξ£_{j=1}^{n} (b_j βˆ’ a_j)Β² )

Suppose we wanted our poll from the previous slide to be correct with probability at least 1 βˆ’ Ξ΄. Chebyshev's inequality would tell us we need at most O(1/(Ρ²δ)) samples.

Setting a_j = 0, b_j = 1, and Ξ» = Ξ΅ in Hoeffding's inequality gives

β„™[ |pΜ‚ βˆ’ p| β‰₯ Ξ΅ ] ≀ 2 e^{βˆ’2Ρ²n},

so we only need n = O( log(1/Ξ΄) / Ρ² ) samples.
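A quick comparison of the two sample-size bounds (a sketch; the Chebyshev count uses the extra fact that a 0/1 vote has variance at most 1/4, a step not spelled out on the slides):

```python
import math

def chebyshev_samples(eps: float, delta: float) -> int:
    # var(X_j) <= 1/4 for 0/1 votes, so var(p_hat) <= 1/(4n);
    # Chebyshev needs 1/(4 n eps^2) <= delta.
    return math.ceil(1 / (4 * eps ** 2 * delta))

def hoeffding_samples(eps: float, delta: float) -> int:
    # 2 exp(-2 eps^2 n) <= delta  <=>  n >= ln(2/delta) / (2 eps^2).
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

eps = 0.02
for delta in (0.01, 1e-6):
    print(delta, chebyshev_samples(eps, delta), hoeffding_samples(eps, delta))
```

The gap grows quickly: the Chebyshev count scales like 1/Ξ΄ while the Hoeffding count scales like log(1/Ξ΄).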

SLIDE 51

Sums of independent random variables

Chernoff bound (multiplicative): Let X_1, …, X_n be a sequence of independent {0,1}-valued random variables. Let p_j = 𝔼[X_j], X = X_1 + X_2 + β‹― + X_n, and ΞΌ = 𝔼[X]. Then for every Ξ² β‰₯ 1:

β„™[ X β‰₯ Ξ²ΞΌ ] ≀ ( e^{Ξ²βˆ’1} / Ξ²^Ξ² )^ΞΌ

β„™[ X ≀ ΞΌ/Ξ² ] ≀ ( e^{1/Ξ² βˆ’ 1} Β· Ξ²^{1/Ξ²} )^ΞΌ

SLIDE 52

Sums of independent random variables

Chernoff bound (multiplicative): Let X_1, …, X_n be a sequence of independent {0,1}-valued random variables. Let p_j = 𝔼[X_j], X = X_1 + X_2 + β‹― + X_n, and ΞΌ = 𝔼[X]. Then for every Ξ² β‰₯ 1:

β„™[ X β‰₯ Ξ²ΞΌ ] ≀ ( e^{Ξ²βˆ’1} / Ξ²^Ξ² )^ΞΌ

β„™[ X ≀ ΞΌ/Ξ² ] ≀ ( e^{1/Ξ² βˆ’ 1} Β· Ξ²^{1/Ξ²} )^ΞΌ

Reproduce balls in bins: n balls are thrown randomly into n bins. Let X_i = 1 if the i-th ball ends up in the first bin and X_i = 0 otherwise. Then X = # of balls in the first bin. As we calculated earlier, 𝔼[X] = 1.
For Ξ² β‰ˆ log n / log log n, the Chernoff bound gives β„™[X β‰₯ Ξ²] ≀ 1/nΒ².

SLIDE 53

Sums of independent random variables

Chernoff bound (multiplicative): Let X_1, …, X_n be a sequence of independent {0,1}-valued random variables. Let p_j = 𝔼[X_j], X = X_1 + X_2 + β‹― + X_n, and ΞΌ = 𝔼[X]. Then for every Ξ² β‰₯ 1:

β„™[ X β‰₯ Ξ²ΞΌ ] ≀ ( e^{Ξ²βˆ’1} / Ξ²^Ξ² )^ΞΌ

β„™[ X ≀ ΞΌ/Ξ² ] ≀ ( e^{1/Ξ² βˆ’ 1} Β· Ξ²^{1/Ξ²} )^ΞΌ

Reproduce balls in bins: n balls are thrown randomly into n bins. Let X_i = 1 if the i-th ball ends up in the first bin and X_i = 0 otherwise. Then X = # of balls in the first bin. As we calculated earlier, 𝔼[X] = 1.
For Ξ² β‰ˆ log n / log log n, the Chernoff bound gives β„™[X β‰₯ Ξ²] ≀ 1/nΒ².
This type of analysis works for much more complicated kinds of events (see homework #2).
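A sketch that simply evaluates the upper-tail bound; at moderate n the constant multiple of log n / log log n matters, so the bound is shown for several multiples:

```python
import math

def chernoff_upper(beta: float, mu: float) -> float:
    """Evaluate the upper-tail bound (e^(beta - 1) / beta^beta)^mu."""
    return (math.exp(beta - 1) / beta ** beta) ** mu

n = 10 ** 6
mu = 1.0                                    # expected number of balls in the first bin
base = math.log(n) / math.log(math.log(n))
for c in (1, 2, 4, 8):
    print(f"beta = {c} * log n / log log n: bound = {chernoff_upper(c * base, mu):.2e}")
print("target 1/n^2 =", 1 / n ** 2)
```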

SLIDE 54

(return to) Bloom filters

Heuristic analysis: If each location in B is 0 with probability q(k, n), then a false positive for x βˆ‰ T should happen with probability at most

(1 βˆ’ q(k, n))^k β‰ˆ (1 βˆ’ e^{βˆ’kn/m})^k

But the actual fraction of 0's in the hash table is a random variable X_{k,n} with expectation 𝔼[X_{k,n}] = q(k, n).
To get the analysis right, we need a concentration bound: we want to say that X_{k,n} is close to its expected value with high probability. Let's analyze!

SLIDE 55

(return to) Bloom filters

We have an array with m bits, and to hash an element x ∈ U we set the bits in positions h_1(x), h_2(x), …, h_k(x) to 1.

SLIDE 56

(return to) Bloom filters

We have an array with m bits, and to hash an element x ∈ U we set the bits in positions h_1(x), h_2(x), …, h_k(x) to 1.
Let X be the # of 0's in the hash table after n elements are hashed.

SLIDE 57

(return to) Bloom filters

We have an array with m bits, and to hash an element x ∈ U we set the bits in positions h_1(x), h_2(x), …, h_k(x) to 1.
Let X be the # of 0's in the hash table after n elements are hashed.
Consider the n elements to hash: x_1, x_2, …, x_n.

SLIDE 58

(return to) Bloom filters

We have an array with m bits, and to hash an element x ∈ U we set the bits in positions h_1(x), h_2(x), …, h_k(x) to 1.
Let X be the # of 0's in the hash table after n elements are hashed.
Consider the n elements to hash: x_1, x_2, …, x_n.
Let I(x_j) = (h_1(x_j), …, h_k(x_j)).

SLIDE 59

(return to) Bloom filters

We have an array with m bits, and to hash an element x ∈ U we set the bits in positions h_1(x), h_2(x), …, h_k(x) to 1.
Let X be the # of 0's in the hash table after n elements are hashed.
Consider the n elements to hash: x_1, x_2, …, x_n.
Let I(x_j) = (h_1(x_j), …, h_k(x_j)).
Define X_i = 𝔼[ X | I(x_1), I(x_2), …, I(x_i) ], the expected # of 0's in the final table given how the first i elements were hashed.

SLIDE 60

(return to) Bloom filters

We have an array with m bits, and to hash an element x ∈ U we set the bits in positions h_1(x), h_2(x), …, h_k(x) to 1.
Let X be the # of 0's in the hash table after n elements are hashed.
Consider the n elements to hash: x_1, x_2, …, x_n.
Let I(x_j) = (h_1(x_j), …, h_k(x_j)).
Define X_i = 𝔼[ X | I(x_1), I(x_2), …, I(x_i) ], the expected # of 0's in the final table given how the first i elements were hashed.
Note that x_1, …, x_n are an arbitrary fixed set of keys; the randomness here is all in the choice of the hash functions h_1, …, h_k.

SLIDE 61

(return to) Bloom filters

Let X be the # of 0's in the hash table after n elements are hashed.
Consider the n elements to hash: x_1, x_2, …, x_n.
I(x_j) = (h_1(x_j), …, h_k(x_j));  X_i = 𝔼[ X | I(x_1), I(x_2), …, I(x_i) ].

We calculated before that X_0 = 𝔼[X] = m (1 βˆ’ 1/m)^{kn}. [why?]

SLIDE 62

(return to) Bloom filters

Let X be the # of 0's in the hash table after n elements are hashed.
Consider the n elements to hash: x_1, x_2, …, x_n.
I(x_j) = (h_1(x_j), …, h_k(x_j));  X_i = 𝔼[ X | I(x_1), I(x_2), …, I(x_i) ].

We calculated before that X_0 = 𝔼[X] = m (1 βˆ’ 1/m)^{kn}. [why?]
Now we want to know the probability that X is much different from its expectation X_0.

SLIDE 63

(return to) Bloom filters

Let X be the # of 0's in the hash table after n elements are hashed.
Consider the n elements to hash: x_1, x_2, …, x_n.
I(x_j) = (h_1(x_j), …, h_k(x_j));  X_i = 𝔼[ X | I(x_1), I(x_2), …, I(x_i) ].

We calculated before that X_0 = 𝔼[X] = m (1 βˆ’ 1/m)^{kn}. [why?]
Now we want to know the probability that X is much different from its expectation X_0.

Claim #1: |X_{i+1} βˆ’ X_i| ≀ k for all i = 0, 1, …, n βˆ’ 1.

SLIDE 64

(return to) Bloom filters

Let X be the # of 0's in the hash table after n elements are hashed.
Consider the n elements to hash: x_1, x_2, …, x_n.
I(x_j) = (h_1(x_j), …, h_k(x_j));  X_i = 𝔼[ X | I(x_1), I(x_2), …, I(x_i) ].

We calculated before that X_0 = 𝔼[X] = m (1 βˆ’ 1/m)^{kn}. [why?]
Now we want to know the probability that X is much different from its expectation X_0.

Claim #1: |X_{i+1} βˆ’ X_i| ≀ k for all i = 0, 1, …, n βˆ’ 1.

Claim #2: 𝔼[ X_{i+1} | I(x_1), …, I(x_i) ] = X_i for all i = 0, 1, …, n βˆ’ 1.

SLIDE 65

(return to) Bloom filters

Let X be the # of 0's in the hash table after n elements are hashed.
Consider the n elements to hash: x_1, x_2, …, x_n.
I(x_j) = (h_1(x_j), …, h_k(x_j));  X_i = 𝔼[ X | I(x_1), I(x_2), …, I(x_i) ].

We calculated before that X_0 = 𝔼[X] = m (1 βˆ’ 1/m)^{kn}. [why?]
Now we want to know the probability that X is much different from its expectation X_0.

Claim #1: |X_{i+1} βˆ’ X_i| ≀ k for all i = 0, 1, …, n βˆ’ 1.

Claim #2: 𝔼[ X_{i+1} | I(x_1), …, I(x_i) ] = X_i for all i = 0, 1, …, n βˆ’ 1.

Such a sequence of random variables is called a martingale.

SLIDE 66

Azuma's inequality

Suppose that {X_0, X_1, …, X_n} is a martingale such that for some constants {c_i}, |X_i βˆ’ X_{iβˆ’1}| ≀ c_i for all i = 1, …, n. Then for any Ξ» > 0,

β„™[ |X_n βˆ’ X_0| β‰₯ Ξ» ] ≀ 2 exp( βˆ’Ξ»Β² / (2(c_1Β² + β‹― + c_nΒ²)) )

SLIDE 67

Azuma's inequality

Suppose that {X_0, X_1, …, X_n} is a martingale such that for some constants {c_i}, |X_i βˆ’ X_{iβˆ’1}| ≀ c_i for all i = 1, …, n. Then for any Ξ» > 0,

β„™[ |X_n βˆ’ X_0| β‰₯ Ξ» ] ≀ 2 exp( βˆ’Ξ»Β² / (2(c_1Β² + β‹― + c_nΒ²)) )

For our problem: c_1 = c_2 = β‹― = c_n = k.
So the probability that the # of 0's differs from its expectation by more than Ξ» is at most 2 exp(βˆ’Ξ»Β² / (2kΒ²n)).
So the deviation is β‰ˆ k√n, and X is tightly concentrated in this window.
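A small simulation sketch comparing the observed spread of the number of zeros with the β‰ˆ k√n window given by Azuma's inequality:

```python
import math
import random
import statistics

def zeros_after_hashing(n: int, m: int, k: int, rng: random.Random) -> int:
    """Hash n keys, each setting k uniformly random bits; return the number of 0 bits."""
    bits = [0] * m
    for _ in range(n * k):
        bits[rng.randrange(m)] = 1
    return bits.count(0)

n, m, k = 1000, 8000, 7
rng = random.Random(1)
samples = [zeros_after_hashing(n, m, k, rng) for _ in range(200)]
expectation = m * (1 - 1 / m) ** (k * n)
print(f"E[X] = {expectation:.1f}, sample mean = {statistics.mean(samples):.1f}, "
      f"sample std = {statistics.stdev(samples):.1f}, Azuma window k*sqrt(n) = {k * math.sqrt(n):.0f}")
```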

SLIDE 68

(take home) EXERCISE

Azuma's inequality: Suppose that {X_0, X_1, …, X_n} is a martingale such that for some constants {c_i}, |X_i βˆ’ X_{iβˆ’1}| ≀ c_i for all i = 1, …, n. Then for any Ξ» > 0,

β„™[ |X_n βˆ’ X_0| β‰₯ Ξ» ] ≀ 2 exp( βˆ’Ξ»Β² / (2(c_1Β² + β‹― + c_nΒ²)) )

For our problem: c_1 = c_2 = β‹― = c_n = k, so the probability that the # of 0's differs from its expectation by more than Ξ» is at most 2 exp(βˆ’Ξ»Β² / (2kΒ²n)), and the deviation is β‰ˆ k√n.

Exercise: Improve the error probability to 2 exp(βˆ’Ξ»Β² / (2kn)) using a different martingale.