Bloom Filters and their Applications

These slides were developed by, and used with permission from, Shengquan Wang. CPSC 662

Introduction

  • Membership Query

Given a set S = {x1, x2, …, xn} on a universe U, we want to answer queries of the form: Is y ∈ S?

– Example: spell check

  • Data structure

– Space
– Search time

  • Hashing is one of the good candidates (randomized)

– xi can be a long string
– n can be a very large number


Hash Function

  • It converts an input from a (typically) large domain into an output in a (typically) smaller range

[Figure: inputs from a large domain hashed into a smaller range of slots 1 to 7; a collision, where H(x) = H(y) for x ≠ y, leads to a false positive.]

Examples of Simple Hash Functions

  • Truncation: If students have a 9-digit identification number, take the last 3 digits as the table position

– e.g. 925371622 becomes 622

  • Folding: Split a 9-digit number into three 3-digit numbers, and add them

– e.g. 925371622 becomes 925 + 371 + 622 = 1918

  • Modular arithmetic: If the table size is 1000, the first example always stays within the table range, but the second does not, so the sum should be reduced mod 1000

– e.g. 1918 mod 1000 = 918 (1918 % 1000)
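The three schemes above can be sketched in a few lines of Python. This is only an illustration: the function names and the default table size of 1000 are assumptions made for the example, not part of the slides.

```python
def truncation_hash(student_id: int) -> int:
    """Keep the last 3 digits of the ID as the table position."""
    return student_id % 1000


def folding_hash(student_id: int, table_size: int = 1000) -> int:
    """Split a 9-digit ID into three 3-digit parts, add them, reduce mod table_size."""
    digits = f"{student_id:09d}"  # zero-pad so there are always 9 digits
    parts = [int(digits[i:i + 3]) for i in range(0, 9, 3)]
    return sum(parts) % table_size


print(truncation_hash(925371622))   # 622
print(925 + 371 + 622)              # 1918, outside a 1000-slot table
print(folding_hash(925371622))      # 918, after reducing mod 1000
```

Truncation alone already stays below 1000; folding needs the final modulo because three 3-digit numbers can sum to as much as 2997.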


Hashing Performance

  • Hash each element of the set to b bits, with b = 2 log2 n

– The probability that two given elements collide is 1/n^2
– False positive probability = 1/n (asymptotically vanishing probability of error)
– Binary search time = O(log2 n)
– Space = O(n log2 n) bits
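The two probabilities follow from a short calculation, sketched here under the assumption that the hash function h maps each element uniformly and independently to one of 2^b values (the symbol h is introduced only for this derivation). The second line is a union bound over the n stored elements.

```latex
\begin{align*}
\Pr[h(x) = h(y)] &= 2^{-b} = 2^{-2\log_2 n} = \frac{1}{n^2}
  && \text{for fixed } x \neq y, \\
\Pr[\text{false positive for } y \notin S]
  &\le \sum_{x \in S} \Pr[h(x) = h(y)] = n \cdot \frac{1}{n^2} = \frac{1}{n}.
\end{align*}
```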

Bloom Filters

  • Generalized randomized data structure
  • Invented by Burton Bloom in 1970
  • Basic idea: Use an m-bit array to represent a set of n elements, with k hash functions

  • A Bloom filter provides an answer in

– “Constant” search time (time to hash)
– Small amount of space
– But with some probability of being wrong

  • B. Bloom, “Space/time tradeoffs in hash coding with allowable errors,” CACM 13 (1970).


Example

[Figure: the bit array B shown at several stages, with bits set to 1.]

  • Start with an m-bit array B, filled with 0s
  • Hash each item xj ∈ S into [1,…,m] k times, once with each hash function Hi.

If Hi(xj) = a ∈ [1,…,m], then set B[a] = 1

  • To check whether y ∈ S, check whether all bits B[Hi(y)] are ones
  • False positive: All B[Hi(y)] are ones, but y is not in S
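The steps above can be sketched as a small Python class. This is a minimal illustration under stated assumptions, not the construction used in the slides: the class and method names are hypothetical, the k hash functions are simulated by salting SHA-256 with the index i, and positions run from 0 to m-1 rather than 1 to m.

```python
import hashlib


class BloomFilter:
    def __init__(self, m: int, k: int) -> None:
        self.m = m              # number of bits in the array B
        self.k = k              # number of hash functions
        self.bits = [0] * m     # B starts out filled with 0s

    def _positions(self, item: str):
        # Assumption: H_i(item) is SHA-256 of "i:item", reduced mod m.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for a in self._positions(item):
            self.bits[a] = 1    # set B[a] = 1 for every H_i(item)

    def might_contain(self, item: str) -> bool:
        # True if all k bits are set; this may be a false positive.
        return all(self.bits[a] for a in self._positions(item))


bf = BloomFilter(m=1024, k=3)
for word in ["apple", "banana", "cherry"]:
    bf.add(word)

print(bf.might_contain("apple"))    # True: inserted items are always found
print(bf.might_contain("durian"))   # usually False, but True is possible
```

A `False` answer is always correct, because a bit once set is never cleared; only `True` answers carry a chance of error.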

Example (m = 12, hash functions h1, h2, h3)

x1 -> {2, 5, 9}
x2 -> {5, 7, 11}

[Figure: 12-bit array with X1 and X2 inserted through h1, h2, h3, and queries Y1, Y2, Y3, one of which is a false positive.]
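The example can be replayed by hand. The sketch below takes the positions for x1 and x2 from the slide as given; the two query position sets are hypothetical, chosen only to show one false positive and one correct rejection.

```python
M = 12
inserted = {"x1": {2, 5, 9}, "x2": {5, 7, 11}}

# Index 0 is unused so that positions match the slide's 1..12 numbering.
bits = [0] * (M + 1)
for positions in inserted.values():
    for a in positions:
        bits[a] = 1


def query(positions) -> bool:
    """Report 'possibly in S' only if every probed bit is 1."""
    return all(bits[a] for a in positions)


set_bits = [a for a in range(1, M + 1) if bits[a]]
print(set_bits)               # [2, 5, 7, 9, 11]
print(query({2, 7, 11}))      # True: a false positive, no such element was inserted
print(query({2, 6, 9}))       # False: bit 6 was never set
```

The first query succeeds because its three bits were set by x1 and x2 together, even though neither element hashes to exactly that set of positions.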


Probabilities

  • Notation:

– n = number of elements in the set to be represented
– m = size of the Bloom filter in bits
– k = number of hash functions

  • Probability that a bit is still zero after all elements are hashed into the Bloom filter: p = (1 - 1/m)^(kn) ≈ e^(-kn/m)

  • Probability of a false positive: f = (1 - p)^k ≈ (1 - e^(-kn/m))^k
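As a numerical check, the following sketch compares the exact probability that a bit stays zero, (1 - 1/m)^(kn), with the usual approximation e^(-kn/m), and evaluates the false positive rate (1 - e^(-kn/m))^k. The function names and the sample values n = 1000, m = 8000, k = 6 are assumptions chosen for illustration, and the formulas assume independent, uniform hash functions.

```python
import math


def prob_bit_zero_exact(n: int, m: int, k: int) -> float:
    """Each of the k*n hash evaluations misses a given bit with probability 1 - 1/m."""
    return (1.0 - 1.0 / m) ** (k * n)


def prob_bit_zero_approx(n: int, m: int, k: int) -> float:
    """Standard approximation (1 - 1/m)^(kn) ~ e^(-kn/m) for large m."""
    return math.exp(-k * n / m)


def false_positive_rate(n: int, m: int, k: int) -> float:
    """All k probed bits are 1: f = (1 - e^(-kn/m))^k."""
    return (1.0 - prob_bit_zero_approx(n, m, k)) ** k


n, m, k = 1000, 8000, 6
print(f"exact  p = {prob_bit_zero_exact(n, m, k):.6f}")
print(f"approx p = {prob_bit_zero_approx(n, m, k):.6f}")
print(f"f        = {false_positive_rate(n, m, k):.4f}")
```

For these values the exact and approximate probabilities agree to about four decimal places, which is why the exponential form is normally used in the analysis.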

Determining the value of k

  • Goal: Choose k to minimize the false positive rate
  • Optimal result: k = (ln 2) · m/n, which gives f = (1/2)^k = (0.6185)^(m/n)

– m = number of bits in the Bloom filter
– n = number of elements in the set
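One way to see where the optimum comes from is the derivation sketched below. It starts from the approximation f = (1 - e^(-kn/m))^k and treats k as a real number; the substitution p = e^(-kn/m) is introduced only for this derivation.

```latex
\begin{align*}
f &= \left(1 - e^{-kn/m}\right)^{k}, \qquad p = e^{-kn/m}
  \;\Rightarrow\; k = -\frac{m}{n}\ln p, \\
\ln f &= k \ln(1 - p) = -\frac{m}{n}\,\ln p\,\ln(1 - p), \\
\intertext{which is symmetric in $p$ and $1-p$ and minimized at $p = 1/2$, so}
k &= \frac{m}{n}\ln 2, \qquad
f = \left(\tfrac{1}{2}\right)^{k} = \left(2^{-\ln 2}\right)^{m/n} \approx (0.6185)^{m/n}.
\end{align*}
```

At the optimum each bit of the array is set with probability 1/2, so the filter is half full.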


Example

m/n = 8

[Figure: false positive rate (0 to 0.1) versus number of hash functions (1 to 10).]

Optimal k = 8 ln 2 = 5.545...
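The curve in this example can be tabulated directly. The sketch below assumes the approximation f = (1 - e^(-kn/m))^k with m/n = 8 and sweeps k from 1 to 10; the function name is hypothetical.

```python
import math


def false_positive_rate(bits_per_item: float, k: int) -> float:
    """f = (1 - e^(-k / (m/n)))^k, with bits_per_item = m/n."""
    return (1.0 - math.exp(-k / bits_per_item)) ** k


rates = {k: false_positive_rate(8, k) for k in range(1, 11)}
for k, f in rates.items():
    print(f"k = {k:2d}   f = {f:.4f}")

best_k = min(rates, key=rates.get)
print("real-valued optimum:", 8 * math.log(2))
print("best integer k:", best_k)
```

Since k must be an integer in practice, the real-valued optimum of about 5.545 is rounded; for m/n = 8 the rates at k = 5 and k = 6 are both close to 0.02, with k = 6 marginally lower.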

Tradeoffs

  • Three parameters:

– Size m/n: bits per item
– Time k: number of hash functions
– Error f: false positive probability

  • The false positive probability decreases exponentially with a linear increase in the number of hash functions and the space


Comparison

                          Bloom filters                   Hashing
Bits per element          m/n (e.g. m/n = 8)              2 log2 n
Space                     n * (m/n)                       O(n log2 n)
False positive rate (f)   (1 - e^(-kn/m))^k (≈ 0.02)      1/n
Lookup time               O(k)                            O(log2 n)
Note                      tradeoff between m/n and f      k = 1

Application: Distributed Caching

  • Caches send each other Bloom filters of their cached URLs
  • False positives do not hurt much

– Errors already arise from cache changes anyway

  • L. Fan, P. Cao, J. Almeida and A.Z. Broder, “Summary Cache: A scalable wide-area Web cache sharing protocol,” IEEE/ACM Transactions on Networking, 2000.

[Figure: six web caches, Web Cache 1 to Web Cache 6, sharing summaries with each other.]


Example

http://www.perl.com/pub/a/2004/04/08/bloom_filters.html
http://www.cs.wisc.edu/~cao/papers/summary-cache/node8.html
http://www.flipcode.com/articles/article_bloomfilters.shtml
http://loaf.cantbedone.org/about.htm
http://www.cap-lore.com/code/BloomTheory.html
http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf2.pdf
http://lemonodor.com/archives/000881.html
http://citeseer.ist.psu.edu/mitzenmacher01compressed.html

Application: Set Reconciliation for Content Delivery

  • Suppose two hosts A and B hold the document sets SA and SB
  • A wants to know SA - SB so that it can send B the documents that B does not have
  • B sends A a Bloom filter corresponding to SB
  • A sends the documents that are not in that Bloom filter
  • False positives: the result is approximate, since A skips any document that the filter wrongly reports as present at B
  • J. Byers, J. Considine, M. Mitzenmacher, S. Rost, “Informed Content Delivery Across Adaptive Overlay Networks,” SIGCOMM 2002.
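The exchange between A and B can be sketched as follows. Everything here is illustrative: the `BloomFilter` class, its SHA-256-based hashing, the filter size, and the document names are assumptions, not details from the cited paper.

```python
import hashlib


class BloomFilter:
    def __init__(self, m: int, k: int) -> None:
        self.m, self.k, self.bits = m, k, [0] * m

    def _positions(self, item: str):
        # Assumption: k hash functions simulated by salting SHA-256 with i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for a in self._positions(item):
            self.bits[a] = 1

    def might_contain(self, item: str) -> bool:
        return all(self.bits[a] for a in self._positions(item))


s_a = {"doc1", "doc2", "doc3", "doc4"}   # documents held by host A
s_b = {"doc3", "doc4", "doc5"}           # documents held by host B

# B summarizes its set and ships only the filter to A.
filter_b = BloomFilter(m=256, k=4)
for doc in s_b:
    filter_b.add(doc)

# A sends whatever the filter says B definitely lacks.
to_send = {doc for doc in s_a if not filter_b.might_contain(doc)}
print(sorted(to_send))
```

Because a Bloom filter has no false negatives, A never sends a document B already holds; a false positive only means that some document in SA - SB is left out of this round.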

Application: Set Intersection for Keyword Search

  • Let HA and HB be the hosts responsible for keywords A and B respectively
  • Suppose we want the documents containing both keywords A and B, i.e. find SA ∩ SB
  • Steps:

– HA sends a Bloom filter corresponding to SA to HB
– HB computes an approximate SA ∩ SB and sends it back to HA

  • False positives: HA can detect and discard them, so they are not a problem
  • P. Reynolds and A. Vahdat, “Efficient Peer-to-peer keyword searching”
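A minimal sketch of the two-step keyword intersection, in which host HA filters out false positives at the end. The `BloomFilter` class, its SHA-256-based hashing, the filter size, and the document identifiers are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    def __init__(self, m: int, k: int) -> None:
        self.m, self.k, self.bits = m, k, [0] * m

    def _positions(self, item: str):
        # Assumption: k hash functions simulated by salting SHA-256 with i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for a in self._positions(item):
            self.bits[a] = 1

    def might_contain(self, item: str) -> bool:
        return all(self.bits[a] for a in self._positions(item))


s_a = {"d1", "d2", "d3", "d7"}    # documents with keyword A, stored at HA
s_b = {"d2", "d3", "d5", "d9"}    # documents with keyword B, stored at HB

# Step 1: HA sends a Bloom filter of SA to HB.
filter_a = BloomFilter(m=128, k=3)
for doc in s_a:
    filter_a.add(doc)

# Step 2: HB returns the members of SB that pass the filter (approximate).
candidates = {doc for doc in s_b if filter_a.might_contain(doc)}

# HA checks the candidates against its real set, removing false positives.
result = candidates & s_a
print(sorted(result))             # ['d2', 'd3']
```

The candidate set is always a superset of the true intersection, so the final check at HA yields exactly SA ∩ SB.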

Application: Moderate-sized P2P networks

  • Distributed hash tables for scalability
  • For a moderate-sized P2P network: a per-node Bloom filter

– Use 8 or 16 bits per object instead of 64-bit identifiers
– False positives: not much of a problem

  • F. M. Cuenca-Acuna, C. Peery, R. P. Martin, and T. D. Nguyen, “PlanetP: Using gossiping to build content addressable peer-to-peer information sharing communities.”


Application: Resource Routing

[Figure: tree network with nodes A to N, annotated with the filters Sb, Sf, Sg, Sh.]

  • The network has a tree topology
  • B has Bloom filters for all of its child subtrees collectively, and also for each child subtree individually
  • S. Rhea and J. Kubiatowicz, “Probabilistic Location and Routing,” INFOCOM 2002.

Application: Multicast

  • Typically, routers maintain a list of interfaces for each multicast address
  • An efficient solution: keep a list of addresses for each interface, and use a Bloom filter to represent these addresses

– Parallelizable

  • False positives: not bad, they just waste some resources
  • B. Gronvall, “Scalable Multicast Forwarding,” SIGCOMM 2002.

Application: Detecting Routing Loops

  • Current mechanism: TTL
  • Each packet contains a small Bloom filter to track the nodes visited

– If the filter does not change at a node, a loop is possible

  • False positives: problematic
  • A. Whitaker and D. Wetherall, “Forwarding without Loops in Icarus,” OPENARCH 2002.
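The per-packet check can be sketched as below. This is an illustration of the idea only, not the Icarus design: the 64-bit filter size, the two salted SHA-256 hashes, and the function names are assumptions.

```python
import hashlib

FILTER_BITS = 64   # assumed size of the small filter carried in each packet
NUM_HASHES = 2     # assumed number of hash functions


def node_mask(node_id: str) -> int:
    """Bitmask with the filter positions that this node sets."""
    mask = 0
    for i in range(NUM_HASHES):
        digest = hashlib.sha256(f"{i}:{node_id}".encode("utf-8")).digest()
        mask |= 1 << (int.from_bytes(digest[:8], "big") % FILTER_BITS)
    return mask


def first_possible_loop(path):
    """Return the hop index where the filter stops changing, or None."""
    packet_filter = 0
    for hop, node in enumerate(path):
        updated = packet_filter | node_mask(node)
        if updated == packet_filter:
            return hop          # all of this node's bits were already set
        packet_filter = updated
    return None


print(first_possible_loop(["A", "B", "C", "A"]))
```

A revisited node never changes the filter, so a real loop is always flagged by the time the packet returns. A node whose bits happen to be set already by earlier nodes is flagged too, which is the problematic false positive mentioned above.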

Application: IP Traceback

  • Use Bloom filters to record the packets seen by each router
  • False positives:

– A router mistakenly identifies a packet as having been seen
– Multiple possible paths

  • A.C. Snoeren, C. Partridge, L.A. Sanchez, C.E. Jones, F. Tchakountio, S.T. Kent and W.T. Strayer, “Hash-based IP traceback,” SIGCOMM 2001.


Summary

  • The Bloom Filter Principle:

Wherever a list or set is used, and space is a consideration, a Bloom filter should be considered. When using a Bloom filter, consider the potential effects of false positives.

References

  • Space/time tradeoffs in hash coding with allowable errors. B. Bloom. CACM 13 (1970).
  • Network Applications of Bloom Filters: A Survey. A. Broder and M. Mitzenmacher. Allerton Conference 2002.
  • Compressed Bloom Filters. M. Mitzenmacher. PODC 2001.
  • Spectral Bloom Filters. S. Cohen and Y. Matias. SIGMOD 2003.
  • The Bloomier Filter: An Efficient Data Structure for Static Support Lookup Tables. B. Chazelle, J. Kilian, R. Rubinfeld, and A. Tal. SODA 2004.