Bloom Filters and their Applications

  1. Bloom Filters and their Applications
     These slides were developed by -- and used with permission from -- Shengquan Wang. CPSC 662

     Introduction
     • Membership query: given a set $S = \{x_1, x_2, \ldots, x_n\}$ on a universe $U$, we want to answer queries of the form: Is $y \in S$?
       – Example: spell check
     • Data structure
       – Space
       – Search time
       – $x_i$ can be a long string; $n$ can be a very large number
     • Hashing is one of the good candidates (randomized)

  2. Hash Function
     • It converts an input from a (typically) large domain into an output in a (typically) smaller range.
     [Figure: hash function H(x) mapping items to table slots 0 to 7, with labels for a collision, a false positive, and a query y → H(y)?]

     Examples of Simple Hash Functions
     • Truncation: If students have a 9-digit identification number, take the last 3 digits as the table position
       – e.g. 925371622 becomes 622
     • Folding: Split a 9-digit number into three 3-digit numbers, and add them
       – e.g. 925371622 becomes 925 + 371 + 622 = 1918
     • Modular arithmetic: If the table size is 1000, the first example always stays within the table range, but the second example does not (it should be reduced mod 1000)
       – e.g. 1918 mod 1000 = 918 (1918 % 1000)
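The three simple hash functions above can be sketched in a few lines of Python. The function names (`truncate`, `fold`, `fold_mod`) are illustrative only and do not come from the slides. For 925371622 the three 3-digit groups are 925, 371, and 622.

```python
def truncate(student_id: int, digits: int = 3) -> int:
    """Truncation: keep only the last `digits` decimal digits."""
    return student_id % 10**digits


def fold(student_id: int, group: int = 3) -> int:
    """Folding: split the decimal digits into groups of `group` digits
    (taken from the right) and add the groups together."""
    total = 0
    while student_id > 0:
        total += student_id % 10**group
        student_id //= 10**group
    return total


def fold_mod(student_id: int, table_size: int = 1000) -> int:
    """Folding followed by modular reduction, so the result fits the table."""
    return fold(student_id) % table_size


print(truncate(925371622))  # 622
print(fold(925371622))      # 1918 (925 + 371 + 622)
print(fold_mod(925371622))  # 918
```

Note that folding alone can exceed the table size, which is why the modular step is applied last.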

  3. Hashing Performance
     • Hash each element of the set to $b$ bits, with $b = 2 \log_2 n$
       – The probability that two elements collide is $1/n^2$.
       – False positive probability is at most $1/n$ (asymptotically vanishing probability of error)
       – Binary search time = $O(\log_2 n)$
       – Space = $\Theta(n \log_2 n)$

     Bloom Filters
     • Generalized randomized data structure
     • Invented by Burton Bloom in 1970
     • Basic idea: use an $m$-bit array to represent a set of $n$ elements with $k$ hash functions
     • A Bloom filter provides an answer in
       – “Constant” search time (time to hash).
       – Small amount of space.
       – But with some probability of being wrong
     B. Bloom, “Space/time tradeoffs in hash coding with allowable errors,” CACM 13 (1970).
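The hashing figures above follow from a short calculation. The sketch below assumes an idealized hash function $h$ that maps each element uniformly and independently to one of $2^b$ values; this assumption is not stated on the slide.

```latex
\[
2^{b} = 2^{2\log_2 n} = n^{2}
\quad\Longrightarrow\quad
\Pr[h(x) = h(y)] = \frac{1}{n^{2}} \quad \text{for } x \neq y,
\]
\[
\Pr[\text{false positive for } y \notin S]
\;\le\; \sum_{i=1}^{n} \Pr[h(x_i) = h(y)]
\;=\; n \cdot \frac{1}{n^{2}}
\;=\; \frac{1}{n}.
\]
```

Storing $n$ hashes of $2\log_2 n$ bits each takes $2n\log_2 n$ bits, which is the $\Theta(n \log_2 n)$ space figure.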

  4. Example
     • Start with an $m$-bit array, filled with 0s
         B: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
     • Hash each item $x_j \in S$ into $[1, \ldots, m]$, $k$ times. If $H_i(x_j) = a \in [1, \ldots, m]$, then set $B[a] = 1$
         B: 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0
     • To check if $y \in S$, check whether all bits $B[H_i(y)]$ are ones
         B: 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0
     • False positive: all bits $B[H_i(y)]$ are ones, but $y$ is not in $S$
         B: 0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0

     Example
     [Figure: items x_1, x_2 and queries y_1, y_2, y_3 hashed by h_1, h_2, h_3 into positions 1 to 12 (= m); one query is labeled as a false positive.]
       x_1 -> {2, 5, 9}
       x_2 -> {5, 7, 11}
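The insert and lookup steps above can be sketched as a small Python class. This is a minimal illustration, not the slides' implementation: the class name `BloomFilter` is hypothetical, the $k$ hash functions are assumed to be SHA-256 salted with the index $i$, and positions are 0-based rather than the slides' $[1, \ldots, m]$.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: an m-bit array and k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = [0] * m  # start with all zeros

    def _positions(self, item: str):
        # Assumption: derive the i-th hash by salting SHA-256 with i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # Set B[a] = 1 for every hashed position a.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: str) -> bool:
        # Report "present" only if all k positions are set.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=16, k=3)
for item in ("x1", "x2"):
    bf.add(item)

print("x1" in bf)  # True: inserted items are never reported as absent
print("y1" in bf)  # usually False, but may be True (a false positive)
```

An item that was added is always found, because its bits can only be set, never cleared. An item that was never added is reported as present only when all of its positions happen to have been set by other items.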

  5. Probabilities
     [Figure: example bit array 1 0 1 0 0 1 1 1 0 1 1 0]
     • Notation:
       – $n$ = number of elements in the set to be represented
       – $m$ = size of the Bloom filter (in bits)
       – $k$ = number of hash functions
     • Probability that a bit is still zero after all elements are hashed into the Bloom filter: $p = (1 - 1/m)^{kn} \approx e^{-kn/m}$
     • Probability of a false positive: $f = (1 - p)^k \approx (1 - e^{-kn/m})^k$

     Determining the value of k
     • Goal: choose the $k$ that minimizes the false positive rate
     • Optimal result: $k = (\ln 2)\, m/n \;\Rightarrow\; f = (1/2)^k \approx (0.6185)^{m/n}$
       – $m$ = number of bits in the Bloom filter
       – $n$ = number of elements in the set
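The optimal value of $k$ can be derived in a few steps. The sketch below uses the standard approximation $p \approx e^{-kn/m}$ for the probability that a bit is still zero and treats $k$ as a real number; both are simplifying assumptions.

```latex
\[
f = (1 - p)^{k}, \qquad p = e^{-kn/m}
\;\Longrightarrow\; k = -\frac{m}{n}\ln p,
\]
\[
\ln f = k \ln(1 - p) = -\frac{m}{n}\,\ln p \,\ln(1 - p).
\]
% The product ln(p) ln(1-p) is symmetric in p and 1-p and is largest at p = 1/2,
% so ln f (and hence f) is smallest there.
\[
p = \tfrac{1}{2}
\;\Longrightarrow\;
k = \frac{m}{n}\ln 2,
\qquad
f = \left(\tfrac{1}{2}\right)^{k}
  = \left(2^{-\ln 2}\right)^{m/n}
  \approx (0.6185)^{m/n}.
\]
```

In other words, the filter is used most efficiently when about half of its bits are set.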

  6. Example
     [Figure: false positive rate (y-axis, 0 to 0.1) versus number of hash functions (x-axis, 0 to 10) for m/n = 8. Opt k = 8 ln 2 = 5.545...]

     Tradeoffs
     • Three parameters.
       – Size $m/n$: bits per item.
       – Time $k$: number of hash functions.
       – Error $f$: false positive probability.
     • With $k$ chosen optimally, the false positive probability decreases exponentially as the space $m/n$, and with it the number of hash functions, increases linearly.
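The curve in the example above can be reproduced numerically. The sketch below assumes the usual approximation $f \approx (1 - e^{-kn/m})^k$; the function name `false_positive_rate` is illustrative only.

```python
import math


def false_positive_rate(bits_per_item: float, k: int) -> float:
    """Approximate false positive rate f = (1 - e^(-k / (m/n)))^k."""
    return (1.0 - math.exp(-k / bits_per_item)) ** k


bits_per_item = 8  # m/n = 8, as in the example
rates = {k: false_positive_rate(bits_per_item, k) for k in range(1, 11)}

for k, f in rates.items():
    print(f"k={k:2d}  f={f:.4f}")

best_k = min(rates, key=rates.get)
print("best integer k:", best_k)                       # 6
print("real-valued optimum:", bits_per_item * math.log(2))
```

Because $k$ must be an integer, the real-valued optimum $8 \ln 2$ is rounded in practice; for $m/n = 8$ the rates at $k = 5$ and $k = 6$ are both close to 0.02.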

  7. Comparison

                            Hashing                   Bloom filters
     Bits per element       $2 \log_2 n$              $m/n$ ($m/n = 8$)
     Space                  $\Theta(n \log_2 n)$      $n \cdot (m/n)$
     False positive rate    $1/n$                     $(1 - e^{-kn/m})^k$ ($\approx 0.02$)
     Lookup time            $O(\log_2 n)$             $O(k)$

     k = 1; tradeoff between m/n and f

     Application: Distributed Caching
     • Send Bloom filters of URLs
     • False positives do not hurt much
       – Get errors from cache changes anyway
     [Figure: Web Cache 1 through Web Cache 6]
     L. Fan, P. Cao, J. Almeida and A.Z. Broder, “Summary Cache: A scalable wide-area Web cache sharing protocol,” IEEE/ACM Transactions on Networking, 2000.
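To make the comparison concrete, the sketch below plugs in numbers for one set size. The choice of $n = 1{,}000{,}000$ and $k = 6$ is an assumption made for illustration, and the helper names are hypothetical.

```python
import math


def hashing_summary(n: int) -> dict:
    """Plain hashing with b = 2 log2(n) bits per element."""
    return {
        "bits_per_element": 2 * math.log2(n),
        "false_positive_rate": 1 / n,
    }


def bloom_summary(bits_per_item: float, k: int) -> dict:
    """Bloom filter with m/n bits per element and k hash functions."""
    return {
        "bits_per_element": bits_per_item,
        "false_positive_rate": (1 - math.exp(-k / bits_per_item)) ** k,
    }


n = 1_000_000            # assumed set size
hashing = hashing_summary(n)
bloom = bloom_summary(bits_per_item=8, k=6)

print(f"hashing: {hashing['bits_per_element']:.1f} bits/element, "
      f"f = {hashing['false_positive_rate']:.1e}")
print(f"bloom:   {bloom['bits_per_element']:.1f} bits/element, "
      f"f = {bloom['false_positive_rate']:.4f}")
```

For this set size, hashing needs roughly five times as many bits per element as the Bloom filter, in exchange for a much smaller error rate.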

  8. Example
     http://www.perl.com/pub/a/2004/04/08/bloom_filters.html
     http://www.cs.wisc.edu/~cao/papers/summary-cache/node8.html
     http://www.flipcode.com/articles/article_bloomfilters.shtml
     http://loaf.cantbedone.org/about.htm
     http://www.cap-lore.com/code/BloomTheory.html
     http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf2.pdf
     http://lemonodor.com/archives/000881.html
     http://citeseer.ist.psu.edu/mitzenmacher01compressed.html

     Application: Set Reconciliation for Content Delivery
     • Suppose two hosts A and B have sets $S_A$ and $S_B$
     • A wants to know $S_A - S_B$ so that it can send B the documents that B does not have
     • B sends A the Bloom filter corresponding to $S_B$
     • A sends those of its documents that are not in that Bloom filter
     • False positives: the result is only approximate, since A skips a document whenever the filter wrongly reports that B already has it
     J. Byers, J. Considine, M. Mitzenmacher, S. Rost, “Informed Content Delivery Across Adaptive Overlay Networks,” SIGCOMM 2002.
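The set reconciliation exchange can be sketched as follows. This is a toy illustration, not the protocol from the cited paper: the document names, the filter size `M_BITS`, the hash count `K_HASHES`, and the salted SHA-256 hashing are all assumptions, and the filter is stored in a Python integer used as a bitmap.

```python
import hashlib

M_BITS = 1024   # filter size in bits (assumed)
K_HASHES = 4    # number of hash functions (assumed)


def positions(item: str) -> list:
    """Bit positions for an item, from SHA-256 salted with the hash index."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big") % M_BITS
        for i in range(K_HASHES)
    ]


def build_filter(items) -> int:
    """Build a Bloom filter as an integer bitmap."""
    bits = 0
    for item in items:
        for pos in positions(item):
            bits |= 1 << pos
    return bits


def might_contain(bits: int, item: str) -> bool:
    return all((bits >> pos) & 1 for pos in positions(item))


s_a = {"doc1", "doc2", "doc3", "doc4"}
s_b = {"doc2", "doc4", "doc5"}

filter_b = build_filter(s_b)  # B sends this filter to A
to_send = {d for d in s_a if not might_contain(filter_b, d)}  # A sends these to B

print(sorted(to_send))
```

A never sends a document that B already holds, because the filter has no false negatives. It may, however, skip a document that B lacks if the filter reports a false positive, which is why the result is only approximate.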

  9. Application: Set Intersection for Keyword Search
     • Let $H_A$, $H_B$ be the hosts responsible for keywords A and B respectively
     • Suppose we want documents having both keywords A and B ⇒ find $S_A \cap S_B$
     • Steps:
       – $H_A$ sends the Bloom filter corresponding to $S_A$ to $H_B$
       – $H_B$ computes an approximate $S_A \cap S_B$ and sends it back to $H_A$
     • False positives: $H_A$ can detect them by checking the returned documents against $S_A$, so no problem
     P. Reynolds and A. Vahdat, “Efficient Peer-to-peer keyword searching”

     Application: Moderate-sized P2P networks
     • Distributed hash tables for scalability
     • For a moderate-sized P2P network
       – Per-node Bloom filter
       – Use 8 or 16 bits per object instead of 64-bit identifiers
       – False positives: not much of a problem
     F. M. Cuenca-Acuna, C. Peery, R. P. Martin, and T. D. Nguyen, “PlanetP: Using gossiping to build content addressable peer-to-peer information sharing communities.”
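The two-step keyword intersection can be sketched as below. The document IDs, filter parameters, and helper names are assumptions for illustration and are not taken from the cited paper.

```python
import hashlib

M_BITS = 256    # filter size in bits (assumed)
K_HASHES = 3    # number of hash functions (assumed)


def positions(item: str) -> list:
    """Bit positions for an item, from SHA-256 salted with the hash index."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big") % M_BITS
        for i in range(K_HASHES)
    ]


def build_filter(items) -> int:
    bits = 0
    for item in items:
        for pos in positions(item):
            bits |= 1 << pos
    return bits


def might_contain(bits: int, item: str) -> bool:
    return all((bits >> pos) & 1 for pos in positions(item))


s_a = {"d1", "d2", "d3", "d7"}   # documents containing keyword A (held by H_A)
s_b = {"d2", "d4", "d7", "d9"}   # documents containing keyword B (held by H_B)

# Step 1: H_A sends a Bloom filter of S_A to H_B.
filter_a = build_filter(s_a)

# Step 2: H_B computes the approximate intersection and sends it back.
candidates = {d for d in s_b if might_contain(filter_a, d)}

# H_A removes any false positives by checking against its own set.
result = candidates & s_a

print(sorted(result))  # ['d2', 'd7']
```

The final answer is exact even if the filter produces false positives, because every candidate is re-checked against $S_A$; false positives only cost some extra bytes on the return trip.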

  10. Application: Resource Routing
     • Network has a tree topology.
     • B has Bloom filters for all of its child sub-trees collectively and also for each child sub-tree individually.
     [Figure: tree topology with nodes A to N, annotated with filters S_b, S_f, S_g, S_h.]
     S. Rhea and J. Kubiatowicz, “Probabilistic Location and Routing,” INFOCOM 2002.

     Application: Multicast
     • Typically routers maintain a list of interfaces for each multicast address
     • An efficient solution: keep a list of addresses for each interface and use a Bloom filter to represent these addresses
       – Parallelizable
     • False positives: not bad, just wastes some resources
     B. Gronvall, “Scalable Multicast Forwarding,” SIGCOMM 2002.

  11. Application: Detecting Routing Loops
     • Current mechanism: TTL
     • Each packet contains a small Bloom filter to track the nodes visited
       – If the filter does not change at a node, then there is a possible loop
     • False positives: problematic
     A. Whitaker and D. Wetherall, “Forwarding without Loops in Icarus,” OPENARCH 2002.

     Application: IP Traceback
     • Use Bloom filters to record the packets seen by each router
     • False positives:
       – Router mistakenly identifies a packet as having been seen
       – Multiple possible paths
     A.C. Snoeren, C. Partridge, L.A. Sanchez, C.E. Jones, F. Tchakountio, S.T. Kent and W.T. Strayer, “Hash-based IP traceback,” SIGCOMM 2001.
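The loop check described above ("if the filter does not change at a node") can be sketched as follows. This is a toy model, not the Icarus implementation: the 64-bit filter size, the three salted SHA-256 hashes, and the node names are assumptions.

```python
import hashlib

FILTER_BITS = 64   # size of the per-packet filter (assumed)
NUM_HASHES = 3     # hash functions per node ID (assumed)


def node_mask(node_id: str) -> int:
    """Bits that a node sets in the packet's Bloom filter."""
    mask = 0
    for i in range(NUM_HASHES):
        digest = hashlib.sha256(f"{i}:{node_id}".encode()).digest()
        mask |= 1 << (int.from_bytes(digest[:8], "big") % FILTER_BITS)
    return mask


def forward(packet_filter: int, node_id: str):
    """OR the node's bits into the filter.

    Returns (new_filter, possible_loop); possible_loop is True when the
    filter did not change, i.e. all of the node's bits were already set.
    """
    updated = packet_filter | node_mask(node_id)
    return updated, updated == packet_filter


path = ["A", "B", "C", "B"]   # the packet returns to B: a loop
packet_filter = 0
flags = []
for node in path:
    packet_filter, possible_loop = forward(packet_filter, node)
    flags.append(possible_loop)
    print(node, possible_loop)
```

A revisited node always triggers the check, since its bits are already set. A node on its first visit can also trigger it when other nodes happened to set the same bits, which is the false positive case the slide calls problematic.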

  12. Summary
     • The Bloom Filter Principle: Wherever a list or set is used, and space is a consideration, a Bloom filter should be considered. When using a Bloom filter, consider the potential effects of false positives.

     References
     • Space/time tradeoffs in hash coding with allowable errors. B. Bloom. CACM 13 (1970).
     • Network Applications of Bloom Filters: A Survey. A. Broder and M. Mitzenmacher. Allerton Conference 2002.
     • Compressed Bloom Filters. M. Mitzenmacher. PODC 2001.
     • Spectral Bloom Filters. S. Cohen and Y. Matias. SIGMOD 2003.
     • The Bloomier Filter: An Efficient Data Structure for Static Support Lookup Tables. B. Chazelle, J. Kilian, R. Rubinfeld, and A. Tal. SODA 2004.
