
Bloom Filters



  1. Bloom Filters (Simon S. Lam), 2/16/2017
     References
     • A. Broder and M. Mitzenmacher, “Network applications of Bloom filters: A survey,” Internet Mathematics, vol. 1, no. 4, pp. 485-509, 2004.
     • Li Fan, Pei Cao, Jussara Almeida, Andrei Broder, “Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol,” IEEE/ACM Transactions on Networking, vol. 8, no. 3, June 2000.
       o Origin of counting Bloom filters

  2. Origin and applications
     • Randomized data structure introduced by Burton Bloom [CACM 1970]
       o It represents a set for membership queries, with false positives
       o The probability of a false positive can be controlled by design parameters
       o When space efficiency is important, a Bloom filter may be used if the effect of false positives can be mitigated
     • First applications in dictionaries and databases

  3. First application in networking: distributed cache (2000)
     [Figure: three cooperating proxies. Proxy 1 holds Cache 1 with Summary 2 and Summary 3; Proxy 2 holds Cache 2 with Summary 1 and Summary 3; Proxy 3 holds Cache 3 with Summary 1 and Summary 2.]
     • Numerous applications in networking since 2000

  4. Standard Bloom Filter
     • A Bloom filter is an array of m bits representing a set $S = \{x_1, x_2, \ldots, x_n\}$ of n elements
       o Array set to 0 initially
     • k independent hash functions $h_1, \ldots, h_k$ with range {1, 2, …, m}
       o Assume that each hash function maps each item in the universe to a random number uniformly over the range {1, 2, …, m}
     • For each element x in S, the bit $h_i(x)$ in the array is set to 1, for 1 ≤ i ≤ k
       o A bit in the array may be set to 1 multiple times for different elements

  5. A Bloom filter example (three hash functions)
     [Figure: insert $X_1$ and $X_2$; check $Y_1$ and $Y_2$.]

  6. Standard Bloom Filter (cont.)
     • To check membership of y in S, check whether the bits $h_i(y)$, 1 ≤ i ≤ k, are all set to 1
       o If not, y is definitely not in S
       o Else, we conclude that y is in S, but sometimes this conclusion is wrong (false positive)
     • For many applications, false positives are acceptable as long as the probability of a false positive is small enough
     • We will assume that kn < m
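
The insert and membership-check steps described in the slides can be sketched as follows. This is a minimal illustration, not the implementation used in the lecture: the class name `BloomFilter`, the salted SHA-256 hash family, and the 0-based bit positions (0 to m-1 rather than 1 to m) are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Array of m bits with k hash functions (illustrative sketch)."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = [0] * m  # array set to 0 initially

    def _positions(self, item):
        # Assumed hash family: k salted SHA-256 digests reduced mod m.
        # Positions are 0-based here, unlike the 1..m range on the slides.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        # Set bit h_i(x) to 1 for every i; a bit may be set more than once.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # If any bit is 0 the item is definitely absent;
        # if all are 1 the item is reported present (possibly a false positive).
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=64, k=3)
bf.add("x1")
bf.add("x2")
print("x1" in bf)  # True: inserted elements are always reported as members
```

A query for an element that was never inserted usually returns False, but it can return True when other insertions happen to have set all k of its bits.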

  7. False positive probability
     • After all members of S have been hashed to a Bloom filter, the probability that a specific bit is still 0 is
       $$p' = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m} = p$$
     • A non-member may be found to be a member of S (all of its k bits are nonzero) with false positive probability
       $$(1 - p')^k \approx (1 - p)^k$$

  8. False positive probability (cont.)
     • Define
       $$f' = (1 - p')^k = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^k$$
       $$f = (1 - p)^k = \left(1 - e^{-kn/m}\right)^k$$
     • Two competing forces as k increases
       o Larger k -> $(1 - p')^k$ is smaller for a fixed p'
       o Larger k -> $p' = (1 - 1/m)^{kn}$ is smaller -> 1 - p' is larger
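
To see the two competing forces numerically, the sketch below evaluates the exact expression f' and the approximation f for a range of k. The function names and the sample values m = 8000 and n = 1000 (8 bits per member) are assumptions chosen for illustration.

```python
import math


def false_positive_exact(m, n, k):
    # f' = (1 - p')^k with p' = (1 - 1/m)^(kn)
    p_prime = (1 - 1 / m) ** (k * n)
    return (1 - p_prime) ** k


def false_positive_approx(m, n, k):
    # f = (1 - p)^k with p = e^(-kn/m)
    p = math.exp(-k * n / m)
    return (1 - p) ** k


m, n = 8000, 1000  # assumed example sizes: m/n = 8 bits per member
for k in range(1, 11):
    print(k, false_positive_exact(m, n, k), false_positive_approx(m, n, k))
```

For these sizes the rate first falls and then rises again as k grows, which is the trade-off the two bullets describe.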

  9. False positive rate vs. k
     [Figure: false positive rate versus k, for m/n = 8 bits per member.]

  10. Optimal number k from derivative
     • Rewrite f as
       $$f = \exp\left(\ln\left(1 - e^{-kn/m}\right)^k\right) = \exp\left(k \ln\left(1 - e^{-kn/m}\right)\right)$$
     • Let $g = k \ln\left(1 - e^{-kn/m}\right)$. Minimizing g will minimize $f = \exp(g)$.
       $$\frac{\partial g}{\partial k} = \ln\left(1 - e^{-kn/m}\right) + \frac{k}{1 - e^{-kn/m}} \cdot \frac{\partial \left(1 - e^{-kn/m}\right)}{\partial k}$$
       $$= \ln\left(1 - e^{-kn/m}\right) + \frac{kn}{m} \cdot \frac{e^{-kn/m}}{1 - e^{-kn/m}}$$
       $$= -\ln(2) + \ln(2) = 0$$
       if we plug in $k = (m/n) \ln 2$, which is optimal. (It is in fact a global optimum.)

  11. Optimal k from symmetry
     • Alternatively, from $p = e^{-kn/m}$ we get
       $$k = -\frac{m}{n} \ln(p)$$
     • From the previous slide, we have
       $$g = k \ln\left(1 - e^{-kn/m}\right) = -\frac{m}{n} \ln(p) \ln(1 - p)$$
     • From the above, symmetry indicates that the minimum value for g occurs when p = 1/2. Thus
       $$k_{opt} = -\frac{m}{n} \ln(1/2) = \frac{m}{n} \ln(2)$$

  12. Optimal k from symmetry using the precise probability of false positive
       $$f' = (1 - p')^k = \exp\left(k \ln(1 - p')\right)$$
     • From $p' = (1 - 1/m)^{kn}$, solving for k,
       $$k = \frac{1}{n \ln(1 - 1/m)} \ln(p')$$
     • Let $g' = k \ln(1 - p')$ (the exponent in the equation for f' above). Then
       $$g' = \frac{1}{n \ln(1 - 1/m)} \ln(p') \ln(1 - p')$$

  13. Using the precise probability of false positive to get optimal k (cont.)
     • From the previous slide
       $$g' = \frac{1}{n \ln(1 - 1/m)} \ln(p') \ln(1 - p')$$
     • By symmetry, g' (also f') is minimized at p' = 1/2
     • Optimal k is
       $$k'_{opt} = \frac{1}{n \ln(1 - 1/m)} \ln(p') = \frac{1}{n \ln(1 - 1/m)} \ln(1/2)$$

  14. Optimal number of hash functions
     • Using $k_{opt} = \frac{m}{n} \ln(2)$, the false positive rate is
       $$(1 - p)^{\frac{m}{n} \ln(2)} = (0.5)^{\frac{m}{n} \ln(2)} \approx (0.6185)^{m/n}, \quad \text{where } \ln(2) = 0.6931$$
     • In practice, k should be an integer. One may choose an integer value smaller than $k_{opt}$ to reduce hashing overhead
     • m/n denotes bits per entry
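
As a worked example of the formulas on this slide, the sketch below computes the optimal k and the resulting false positive rate, then rounds k down to an integer as suggested. The helper names and the sample values m = 8000, n = 1000 are assumptions for illustration only.

```python
import math


def optimal_k(m, n):
    # k_opt = (m/n) ln 2; generally not an integer
    return (m / n) * math.log(2)


def rate_at_optimum(m, n):
    # At k_opt we have p = 1/2, so the rate is 0.5 ** k_opt
    return 0.5 ** optimal_k(m, n)


m, n = 8000, 1000  # assumed example: 8 bits per entry
k_opt = optimal_k(m, n)
k_int = max(1, math.floor(k_opt))  # integer k below k_opt: fewer hashes per operation
rate_int = (1 - math.exp(-k_int * n / m)) ** k_int

print(k_opt, rate_at_optimum(m, n))
print(k_int, rate_int)
```

Rounding down costs a slightly higher false positive rate in exchange for fewer hash computations per insert and lookup.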

  15. False positive rate vs. bits per entry
     [Figure: false positive rate versus m/n, with one curve for 4 hash functions and one curve using the optimal number of hash functions.]

  16. Standard Bloom Filter tricks
     • Two Bloom filters representing sets $S_1$ and $S_2$ with the same number of bits and using the same hash functions
       o A Bloom filter that represents the union of $S_1$ and $S_2$ can be obtained by taking the OR of the bit vectors
     • A Bloom filter can be halved in size. Suppose the size is a power of 2.
       o Just OR the first and second halves of the bit vector
       o When hashing to do a lookup, the highest order bit is masked
     Notation: OR denotes bitwise or
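
Both tricks can be sketched on plain Python lists of bits. The function names are hypothetical, and bit positions are assumed to be 0-based (0 to m-1) so that masking the highest-order bit is a single AND.

```python
def union(bits_a, bits_b):
    # Both filters must have the same size and use the same hash functions.
    assert len(bits_a) == len(bits_b)
    return [a | b for a, b in zip(bits_a, bits_b)]


def halve(bits):
    # OR the first half with the second half; len(bits) must be a power of 2.
    half = len(bits) // 2
    return [bits[i] | bits[i + half] for i in range(half)]


def halved_position(pos, m):
    # Mask the highest-order bit of a position computed for the original size m.
    return pos & (m // 2 - 1)


bits = [0, 1, 0, 0, 0, 0, 1, 0]  # m = 8, bits 1 and 6 set
small = halve(bits)
print(small)                            # [0, 1, 1, 0]
print(small[halved_position(6, 8)])     # 1: position 6 maps to position 2
```

Halving keeps every inserted element findable, at the price of a higher false positive rate, since the same elements now share half as many bits.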

  17. Counting Bloom filters
     • Proposed by Fan et al. [2000] for distributed caching
     • Every entry in a counting Bloom filter is a small counter (rather than a single bit)
       o When an item is inserted into the set, the corresponding counters are each incremented by 1
       o When an item is deleted from the set, the corresponding counters are each decremented by 1
     • To avoid counter overflow, the counter size must be sufficiently large. It was found that 4 bits per counter are enough.
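
A counting Bloom filter along these lines might look like the sketch below. The class name, the salted SHA-256 hash family, and the overflow policy (counters saturate at their maximum and are then left untouched) are assumptions for this example; the slides do not specify how overflow is handled.

```python
import hashlib


class CountingBloomFilter:
    """m small counters instead of m bits (illustrative sketch)."""

    def __init__(self, m, k, counter_bits=4):
        self.m = m
        self.k = k
        self.max_count = (1 << counter_bits) - 1  # 15 for 4-bit counters
        self.counters = [0] * m

    def _positions(self, item):
        # Assumed hash family: k salted SHA-256 digests reduced mod m.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            if self.counters[pos] < self.max_count:  # assumed policy: saturate
                self.counters[pos] += 1

    def remove(self, item):
        # Only call this for items that were previously added.
        for pos in self._positions(item):
            if 0 < self.counters[pos] < self.max_count:  # saturated counters stay put
                self.counters[pos] -= 1

    def __contains__(self, item):
        return all(self.counters[pos] > 0 for pos in self._positions(item))


cbf = CountingBloomFilter(m=64, k=3)
cbf.add("a")
cbf.remove("a")
print("a" in cbf)  # False: every counter is back to 0
```

Deleting an item that was never inserted would corrupt the counters, which is why the caller is expected to track what it has added.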

  18. Counter overflow probability
     • Consider a set of n elements, k hash functions, and m counters
       o c(i) is the count for the i-th counter
       $$P[c(i) = j] = \binom{nk}{j} \left(\frac{1}{m}\right)^j \left(1 - \frac{1}{m}\right)^{nk - j}$$
       $$P[c(i) \ge j] \le \binom{nk}{j} \left(\frac{1}{m}\right)^j \le \left(\frac{e\,nk}{jm}\right)^j \quad \text{(a very loose upper bound)}$$

  19. Counter overflow probability (cont.)
     • Choose k such that $k \le (m/n) \ln 2$. Then
       $$P[c(i) \ge j] \le \left(\frac{e\,nk}{jm}\right)^j \le \left(\frac{e \ln 2}{j}\right)^j$$
       and, for the event that $c(i) \ge j$ for some i,
       $$P\left[\max_{1 \le i \le m} c(i) \ge j\right] \le m \left(\frac{e \ln 2}{j}\right)^j$$
     • Using 4 bits, each counter counts from 0 to 15
       $$P\left[\max_{1 \le i \le m} c(i) \ge 16\right] \le 1.37 \times 10^{-15} \times m$$
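
The 4-bit figure can be checked by evaluating the bound directly. The sketch below is only an arithmetic check of the formula above; the function names and the sample value of m are assumptions.

```python
import math


def per_counter_bound(j):
    # (e ln 2 / j)^j, valid when k <= (m/n) ln 2
    return (math.e * math.log(2) / j) ** j


def overflow_bound(m, j=16):
    # Bound on the probability that any of the m counters reaches j
    return m * per_counter_bound(j)


print(per_counter_bound(16))      # roughly 1.37e-15
print(overflow_bound(m=10**6))    # still tiny for a million counters
```

Because the bound grows only linearly in m, even very large filters keep a negligible chance of any 4-bit counter overflowing.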
