
Hashing (Application of Probability) - PowerPoint PPT Presentation



Hashing (Application of Probability)
Ashwinee Panda
Final CS 70 Lecture! 9 Aug 2018

Overview
◮ Intro to Hashing
◮ Hashing with Chaining
◮ Hashing Performance
◮ Hash Families
◮ Balls and Bins
◮ Load Balancing
◮ Universal Hashing
◮ Perfect Hashing

What's hashing?
◮ Distribute key/value pairs across bins with a hash function, which maps elements from a large universe U (of size n) to a small set {0, ..., k − 1}
◮ Given a key, a hash function always returns one integer
◮ Hashing the same key returns the same integer: h(x) = h(x)
◮ Hashing two different keys might not always return different integers
◮ Collisions occur when h(x) = h(y) for x ≠ y

What's the point?
Although the name of the class is "Discrete Mathematics and Probability Theory", what you've learned is not just theoretical but has far-reaching applications across multiple fields. Today we'll dive deep into one such application: hashing. You may have heard of SHA256, a special kind of hash function known as a cryptographic hash function.

Hashing with Chaining
In CS 61B you learned one particular use for hashing: hash tables with linked lists. Pseudocode for hashing one key with a given hash function:

def hash_function(x):
    return x % 7                        # map any integer key to one of 7 bins

hash_table = [[] for _ in range(7)]     # one list ("chain") per bin
key = 42                                # example key
hash_table[hash_function(key)].append(key)

◮ Mapping many keys to the same index causes a collision
◮ Resolve collisions with "chaining": each bin holds a linked list of the keys that hash to it
◮ Chaining isn't perfect; we have to search through the list in O(ℓ) time, where ℓ is the length of the linked list
◮ Longer lists mean worse performance, so try to minimize collisions

Hashing Performance

Operation   Average-Case   Worst-Case
Search      O(1)           O(n)
Insert      O(1)           O(n)
Delete      O(1)           O(n)

◮ Hashing has great average-case performance but poor worst-case performance
◮ The worst case is when all keys map to the same bin; performance scales with the maximum number of keys in a bin
◮ An adversary can induce the worst case (an adversarial attack)
◮ For h(x) = x mod 7, suppose our set of keys is all multiples of 7!
◮ Each item will hash to the same bin, so any operation has to walk the entire linked list

Hash Families
◮ If |U| ≥ (n − 1)k + 1, the Pigeonhole Principle says some bucket of the hash function must contain at least n items
◮ So for any single hash function, there are keys that all map to the same bin, and then our hash table has terrible performance!
◮ It seems hard to pick just one hash function that avoids the worst case; instead, develop a randomized algorithm!
◮ Randomized algorithms use randomness to make decisions
◮ Quicksort expects to find the right answer in O(n log n) time but may run for O(n²) time (CS 61B)
◮ We can restart a randomized algorithm as many times as we wish, to make P[fail] arbitrarily low
◮ To guard against an adversary, we generate a hash function h uniformly at random from a hash family H (a sketch of this follows below)
◮ Even if the keys are chosen by an adversary, no adversary can choose bad keys for the entire family simultaneously, so our scheme works with high probability
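What might drawing h at random from a family look like in code? The following is a minimal sketch, assuming the standard textbook family h_{a,b}(x) = ((a·x + b) mod p) mod k (not necessarily the family used in lecture); the prime p and bin count k are arbitrary example values.

import random

def random_hash_function(k, p=2_147_483_647):
    """Draw h uniformly at random from the family
    H = { x -> ((a*x + b) % p) % k : 1 <= a < p, 0 <= b < p }.
    p must be a prime larger than any key (2**31 - 1 is one such prime)."""
    a = random.randrange(1, p)
    b = random.randrange(p)
    return lambda x: ((a * x + b) % p) % k

# The adversary's keys for h(x) = x % 7 (all multiples of 7) are no longer
# guaranteed to collide once h is drawn at random from the family:
h = random_hash_function(k=7)
print([h(x) for x in [0, 7, 14, 21, 28]])   # typically spread over several bins

Because a and b are chosen only after the keys are fixed, no fixed set of keys can be bad for more than a small fraction of the family.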

Balls and Bins
◮ If we want to be really random, we'd see hashing as just balls and bins
◮ Specifically, suppose the random variables h(x), as x ranges over U, are independent
◮ Balls will be the keys to be stored
◮ Bins will be the k locations in the hash table
◮ The hash function maps each key to a uniformly random location
◮ Each key (ball) chooses a bin uniformly and independently
◮ How likely are collisions? The probability that two given balls fall into the same bin is 1/k
◮ Birthday Paradox: 23 balls and 365 bins ⇒ 50% chance of collision!
◮ n ≥ √k ⇒ 1/2 chance of collision

Balls and Bins (expectations)
Let X_i be the indicator random variable that turns on if the i-th ball falls into bin 1, and let X be the number of balls that fall into bin 1.
◮ E[X_i] = P[X_i = 1] = 1/k
◮ E[X] = n/k
Let E_i be the indicator variable that bin i is empty, and let E be the number of empty locations.
◮ Using the complement of X_i, we find P[E_i] = (1 − 1/k)^n
◮ E[E] = k(1 − 1/k)^n
◮ With k = n: E[E] = n(1 − 1/n)^n ≈ n/e and E[X] = n/n = 1 (both checked in the simulation sketch below)
◮ How can we expect 1 item per location (very intuitive with n balls and n bins) and also expect more than a third of the locations to be empty?
Let C be the number of bins with ≥ 2 balls.
◮ E[C] ≤ n − k + E[E] = n − k + k(1 − 1/k)^n, since the k − E nonempty bins hold at least one ball apiece, so C ≤ n − (k − E)

Load Balancing
◮ Distributed computing: evenly distribute a workload
◮ m identical jobs, n identical processors (they may not be identical, but that won't actually matter)
◮ Ideally we should distribute these perfectly evenly, so each processor gets m/n jobs
◮ Centralized systems are capable of this, but centralized systems require a server to exert a degree of control that is often impractical
◮ This is actually similar to balls and bins!
◮ Let's continue using our randomized hashing algorithm
◮ Let's try to derive an upper bound for the maximum load, assuming m = n

Load Balancing (bounding the max load)
◮ M_t: the event that the max list length when hashing n items to n bins is t
◮ M_{i,t}: the event that the max list length is t, and this list is in bin i
◮ H_{i,t}: the event that t keys hash to bin i
◮ P[H_{i,t}] = C(n, t) · (1/n)^t · (1 − 1/n)^(n − t)
◮ Approximation: C(n, t) ≤ n^n / (t^t · (n − t)^(n − t)), by Stirling's formula
◮ Approximation: (1 + 1/x)^x ≤ e for all x > 0, by the limit definition of e
◮ Because (1 − 1/n)^(n − t) ≤ 1 and (1/n)^t = 1/n^t, we can simplify: P[H_{i,t}] ≤ n^(n − t) / (t^t · (n − t)^(n − t))
◮ n^(n − t) / (t^t · (n − t)^(n − t)) = (1/t^t) · (1 + t/(n − t))^(n − t) = (1/t^t) · ((1 + t/(n − t))^((n − t)/t))^t ≤ e^t / t^t = (e/t)^t
◮ P[M_t] = P[∪_{i=1}^n M_{i,t}] ≤ Σ_{i=1}^n P[M_{i,t}] ≤ Σ_{i=1}^n P[H_{i,t}] = n · P[H_{i,t}], since the bin loads are identically distributed
◮ So the probability that the max list length is t is at most n · (e/t)^t

Load Balancing (expected max load)
◮ The expected max load is Σ_{t=1}^n t · P[M_t], where P[M_t] ≤ n · (e/t)^t
◮ Split the sum into two parts and bound each part separately
◮ β = ⌈5 ln n / ln ln n⌉. How did we get this? Take a look at Note 15.
◮ Σ_{t=1}^n t · P[M_t] = Σ_{t=1}^β t · P[M_t] + Σ_{t=β}^n t · P[M_t]
Sum over smaller values:
◮ Replace t with its upper bound β
◮ Σ_{t=1}^β t · P[M_t] ≤ Σ_{t=1}^β β · P[M_t] = β · Σ_{t=1}^β P[M_t] ≤ β, as the sum of disjoint probabilities is bounded by 1
Sum over larger values:
◮ Use our expression for P[H_{i,t}] to see that P[M_t] ≤ 1/n² once t ≥ β; this bound decreases as t grows, and t ≤ n
◮ Σ_{t=β}^n t · P[M_t] ≤ Σ_{t=β}^n n · (1/n²) ≤ Σ_{t=β}^n 1/n ≤ 1
◮ So the expected max load is O(β) = O(ln n / ln ln n); the simulation sketch below compares this bound against an actual run

Universal Hashing
What we've been working with so far is "k-wise independent" hashing, or fully independent hashing.
◮ For any k balls, the probability that they all fall into one given bin of the n bins is 1/n^k
◮ Very strong requirement!
◮ Fully independent hash functions require a large number of bits to store
Do we have to compromise, and make our worst case worse so that we can save space?
◮ Often you do have to sacrifice time for space, or vice versa
◮ But not this time! Let's inspect our worst case:
◮ Collisions only care about two balls colliding
◮ We don't need "k-wise independence"; we only need "2-wise independence" (checked exhaustively in the final sketch below)
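The balls-and-bins expectations and the max-load bound above are easy to sanity-check empirically. The following simulation is an added illustration, not part of the lecture; n = 1,000,000 is an arbitrary example size.

import math
import random
from collections import Counter

def balls_into_bins(n):
    """Throw n balls into n bins uniformly and independently; return the
    fraction of empty bins and the maximum load of any single bin."""
    counts = Counter(random.randrange(n) for _ in range(n))
    empty_bins = n - len(counts)   # bins that received no ball at all
    return empty_bins / n, max(counts.values())

n = 1_000_000
empty_frac, max_load = balls_into_bins(n)
beta = math.ceil(5 * math.log(n) / math.log(math.log(n)))

print(f"fraction of empty bins: {empty_frac:.3f}  (1/e = {1 / math.e:.3f})")
print(f"max load: {max_load}  (beta = {beta})")

For n = 10^6 the empty fraction lands near 1/e ≈ 0.368, and the max load typically comes out well below β = ⌈5 ln n / ln ln n⌉ = 27.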

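The final claim, that 2-wise independence suffices, can be checked exhaustively for a small instance of the same h_{a,b} family sketched earlier: for any fixed pair of distinct keys, the fraction of (a, b) choices that collide should be at most 1/k. Here p = 101, k = 10, and the key pair are arbitrary example values.

# Exhaustive 2-universality check for h_{a,b}(x) = ((a*x + b) % p) % k
p, k = 101, 10     # p a small prime, k the number of bins
x, y = 3, 58       # any fixed pair of distinct keys in {0, ..., p - 1}

collisions = sum(
    ((a * x + b) % p) % k == ((a * y + b) % p) % k
    for a in range(1, p)   # a = 0 is excluded from the family
    for b in range(p)
)
print(f"P[h(x) = h(y)] = {collisions / ((p - 1) * p):.4f}  (1/k = {1 / k:.4f})")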