Bloom Filters


SLIDE 1

Bloom Filters

References

  • A. Broder and M. Mitzenmacher, “Network applications of Bloom filters: A survey,” Internet Mathematics, vol. 1, no. 4, pp. 485-509, 2004.
  • Li Fan, Pei Cao, Jussara Almeida, and Andrei Broder, “Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol,” IEEE/ACM Transactions on Networking, vol. 8, no. 3, June 2000.
      • Origin of counting Bloom filters

2/16/2017 Bloom Filters (Simon S. Lam) 1

SLIDE 2

Origin and applications

Randomized data structure introduced by Burton Bloom [CACM 1970]

  • It represents a set for membership queries, with false positives
  • Probability of false positive can be controlled by design parameters
  • When space efficiency is important, a Bloom filter may be used if the effect of false positives can be mitigated.

First applications in dictionaries and databases

SLIDE 3

First application in networking: distributed cache (2000)

[Figure: three proxies, each holding its own cache plus summaries of the other two caches. Proxy 1: Cache 1, Summary 2, Summary 3. Proxy 2: Cache 2, Summary 1, Summary 3. Proxy 3: Cache 3, Summary 1, Summary 2.]

Numerous applications in networking since 2000

SLIDE 4

Standard Bloom Filter

A Bloom filter is an array of m bits representing a set S = {x1, x2, …, xn} of n elements

  • Array set to 0 initially

k independent hash functions h1, …, hk with range {1, 2, …, m}

  • Assume that each hash function maps each item in the universe to a random number uniformly over the range {1, 2, …, m}

For each element x in S, the bit hi(x) in the array is set to 1, for 1 ≤ i ≤ k

  • A bit in the array may be set to 1 multiple times for different elements
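The construction above can be sketched in Python. This is only an illustration under stated assumptions: the names `hash_positions` and `insert` are hypothetical, and the k hash functions are simulated by salting SHA-256 with the index i, which only approximates the independent, uniform hashes assumed on this slide. Positions are 0-based here, so the range is {0, …, m-1} rather than {1, …, m}.

```python
import hashlib


def hash_positions(item: str, k: int, m: int) -> list[int]:
    """Return k array positions in [0, m) for item (salted SHA-256, an assumption)."""
    positions = []
    for i in range(k):
        digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
        positions.append(int.from_bytes(digest[:8], "big") % m)
    return positions


def insert(bits: list[int], item: str, k: int) -> None:
    """Set the k bits selected by the hash functions; setting a bit twice is harmless."""
    for pos in hash_positions(item, k, len(bits)):
        bits[pos] = 1


m, k = 64, 3
bits = [0] * m                      # array set to 0 initially
for x in ["x1", "x2", "x3"]:
    insert(bits, x, k)
print(sum(bits), "of", m, "bits set")
```

At most kn = 9 bits can be set here; fewer are set whenever two hash values land on the same position.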

SLIDE 5

A Bloom filter example

(three hash functions)

[Figure: insert X1 and X2, then check Y1 and Y2]

SLIDE 6

Standard Bloom Filter (cont.)

To check membership of y in S, check whether hi(y), 1 ≤ i ≤ k, are all set to 1

  • If not, y is definitely not in S
  • Else, we conclude that y is in S, but sometimes this conclusion is wrong (false positive)

For many applications, false positives are acceptable as long as the probability of a false positive is small enough

We will assume that kn < m
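The membership check can be illustrated with a small Python class. This is a sketch, not a reference implementation: the class name `BloomFilter`, the method names, and the salted SHA-256 hashing are all assumptions made for the example, and the parameters are chosen so that kn < m.

```python
import hashlib


class BloomFilter:
    """Standard Bloom filter over strings; the hashing scheme is an illustrative assumption."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray(m)    # one byte per bit, for clarity rather than compactness

    def _positions(self, item: str):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # False means "definitely not in S"; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=1024, k=4)       # kn = 400 < m = 1024
members = [f"member-{i}" for i in range(100)]
for x in members:
    bf.add(x)

print("all members found:", all(bf.might_contain(x) for x in members))
false_hits = sum(bf.might_contain(f"other-{i}") for i in range(1000))
print("false positives among 1000 non-members:", false_hits)
```

Every inserted member is always reported as present, while a small fraction of the non-members may be reported as present as well.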

SLIDE 7

False positive probability

After all members of S have been hashed to a Bloom filter, the probability that a specific bit is still 0 is

$$p' = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m} = p$$

A non-member may be found to be a member of S (all of its k bits are nonzero) with false positive probability

$$(1 - p')^k \approx (1 - p)^k$$
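To see how close the approximation is, the two expressions can be evaluated numerically. The sketch below uses hypothetical function names and illustrative values of m, n, and k that do not come from the slides.

```python
import math


def prob_bit_zero_exact(m: int, n: int, k: int) -> float:
    """p' = (1 - 1/m)^(kn)."""
    return (1.0 - 1.0 / m) ** (k * n)


def prob_bit_zero_approx(m: int, n: int, k: int) -> float:
    """p = e^(-kn/m)."""
    return math.exp(-k * n / m)


m, n, k = 8000, 1000, 4             # illustrative values only
p_exact = prob_bit_zero_exact(m, n, k)
p_approx = prob_bit_zero_approx(m, n, k)
print(f"p' = {p_exact:.6f}, p = {p_approx:.6f}")
print(f"(1 - p')^k = {(1 - p_exact) ** k:.6f}")
print(f"(1 - p)^k  = {(1 - p_approx) ** k:.6f}")
```

For these values the exact and approximate quantities agree to several decimal places, which is why the later slides work mostly with p.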

SLIDE 8

False positive probability (cont.)

Define

$$f' = (1 - p')^k = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^k$$

$$f = (1 - p)^k = \left(1 - e^{-kn/m}\right)^k$$

Two competing forces as k increases

  • Larger k -> $(1 - p')^k$ is smaller for a fixed p'
  • Larger k -> $p' = (1 - 1/m)^{kn}$ is smaller -> 1 - p' larger

SLIDE 9

False positive rate vs. k

[Figure: false positive rate versus k, for m/n = 8 (number of bits per member)]
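The curve for m/n = 8 can be tabulated with a few lines of Python. This is a sketch using the approximate rate $f = (1 - e^{-kn/m})^k$; the function name and the range of k values are choices made for the example.

```python
import math


def false_positive_rate(bits_per_member: float, k: int) -> float:
    """f = (1 - e^(-kn/m))^k, written in terms of m/n."""
    return (1.0 - math.exp(-k / bits_per_member)) ** k


bits_per_member = 8
rates = {k: false_positive_rate(bits_per_member, k) for k in range(1, 13)}
for k, f in rates.items():
    print(f"k = {k:2d}  f = {f:.5f}")

best_k = min(rates, key=rates.get)
print("integer k with the lowest rate:", best_k)
```

The rate first falls and then rises again as k grows, with the integer minimum at k = 6 and k = 5 only marginally worse.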

SLIDE 10

Optimal number k from derivative

Rewrite f as

$$f = \exp\left(\ln\left(1 - e^{-kn/m}\right)^k\right) = \exp\left(k \ln\left(1 - e^{-kn/m}\right)\right)$$

Let $g = k \ln\left(1 - e^{-kn/m}\right)$. Minimizing g will minimize $f = \exp(g)$.

$$\begin{aligned}
\frac{\partial g}{\partial k} &= \ln\left(1 - e^{-kn/m}\right) + \frac{k}{1 - e^{-kn/m}} \cdot \frac{\partial\left(1 - e^{-kn/m}\right)}{\partial k} \\
&= \ln\left(1 - e^{-kn/m}\right) + \frac{k}{1 - e^{-kn/m}} \cdot \frac{n}{m}\, e^{-kn/m} \\
&= -\ln(2) + \ln(2) = 0
\end{aligned}$$

if we plug in $k = (m/n) \ln 2$, which is optimal (It is in fact a global optimum)
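The stationary point can be checked numerically. The sketch below evaluates the closed-form derivative at k = (m/n) ln 2 and compares it with a central finite difference; the function names and the values of m and n are assumptions for illustration.

```python
import math


def g(k: float, m: int, n: int) -> float:
    """g = k * ln(1 - e^(-kn/m)); minimizing g minimizes f = exp(g)."""
    return k * math.log(1.0 - math.exp(-k * n / m))


def dg_dk(k: float, m: int, n: int) -> float:
    """Closed-form derivative of g with respect to k."""
    e = math.exp(-k * n / m)
    return math.log(1.0 - e) + (k / (1.0 - e)) * (n / m) * e


m, n = 8000, 1000                   # illustrative values only
k_opt = (m / n) * math.log(2)
print("k_opt =", k_opt)
print("dg/dk at k_opt =", dg_dk(k_opt, m, n))

# Central finite difference as an independent check of the closed form.
h = 1e-6
numeric = (g(k_opt + h, m, n) - g(k_opt - h, m, n)) / (2 * h)
print("finite-difference estimate =", numeric)
```

Both values come out at zero up to floating-point error, and g is larger on either side of k_opt.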

SLIDE 11

Optimal k from symmetry

Alternatively, from $p = e^{-kn/m}$ we get

$$k = -\frac{m}{n} \ln(p)$$

From previous slide, we have

$$g = k \ln\left(1 - e^{-kn/m}\right) = -\frac{m}{n} \ln(p) \ln(1 - p)$$

From above, symmetry indicates that the minimum value for g occurs when p = 1/2. Thus

$$k_{opt} = -\frac{m}{n} \ln(1/2) = \frac{m}{n} \ln(2)$$
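The symmetry argument can be illustrated by evaluating g as a function of p on a grid. This is a sketch with a hypothetical function name and an illustrative value of m/n; the grid search stands in for the analytical argument and is not part of the original derivation.

```python
import math


def g_of_p(p: float, bits_per_member: float) -> float:
    """g = -(m/n) * ln(p) * ln(1 - p), with k eliminated via k = -(m/n) ln(p)."""
    return -bits_per_member * math.log(p) * math.log(1.0 - p)


bits_per_member = 8.0               # m/n, illustrative value only
grid = [i / 1000 for i in range(1, 1000)]          # p strictly between 0 and 1
p_best = min(grid, key=lambda p: g_of_p(p, bits_per_member))
print("p minimizing g on the grid:", p_best)

# Symmetry: swapping p and 1 - p leaves g unchanged.
print(g_of_p(0.3, bits_per_member), g_of_p(0.7, bits_per_member))

k_opt = -bits_per_member * math.log(p_best)
print("k_opt =", k_opt)
```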

SLIDE 12

Optimal k from symmetry using the precise probability of false positive

$$f' = (1 - p')^k = \exp\left(k \ln(1 - p')\right)$$

From $p' = (1 - 1/m)^{kn}$, solving for k:

$$k = \frac{1}{n \ln(1 - 1/m)} \ln(p')$$

Let $g' = k \ln(1 - p')$ (in equation for f' above)

$$g' = \frac{1}{n \ln(1 - 1/m)} \ln(p') \ln(1 - p')$$

SLIDE 13

Using the precise probability of false positive to get optimal k (cont.)

From previous slide

$$g' = \frac{1}{n \ln(1 - 1/m)} \ln(p') \ln(1 - p')$$

By symmetry, g' (also f') minimized at p' = 1/2

Optimal k is

$$k_{opt} = \frac{1}{n \ln(1 - 1/m)} \ln(p') = \frac{1}{n \ln(1 - 1/m)} \ln(1/2)$$
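A quick numerical comparison shows how little the precise optimum differs from (m/n) ln 2. The function names and the sizes below are assumptions chosen for illustration.

```python
import math


def k_opt_precise(m: int, n: int) -> float:
    """k_opt = ln(1/2) / (n * ln(1 - 1/m)), from the exact p' = (1 - 1/m)^(kn)."""
    return math.log(0.5) / (n * math.log1p(-1.0 / m))


def k_opt_approx(m: int, n: int) -> float:
    """k_opt = (m/n) * ln(2), from the approximation p = e^(-kn/m)."""
    return (m / n) * math.log(2)


for m, n in [(80, 10), (8000, 1000), (800000, 100000)]:   # illustrative sizes
    print(m, n, k_opt_precise(m, n), k_opt_approx(m, n))
```

The precise value is always slightly smaller than the approximate one, and the gap shrinks as m grows.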

SLIDE 14

Optimal number of hash functions

Using $k_{opt} = \frac{m}{n} \ln(2)$, the false positive rate is

$$(1 - p)^{\frac{m}{n} \ln(2)} = (0.5)^{\frac{m}{n} \ln(2)} \approx (0.6185)^{m/n}, \quad \text{where } \ln(2) \approx 0.6931$$

In practice, k should be an integer. May choose an integer value smaller than k_opt to reduce hashing overhead

m/n denotes bits per entry
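The formula can be turned around to size a filter for a target false positive rate. The sketch below is a rearrangement of the expression above, not something stated on the slide, and the function names and target rates are hypothetical; k is rounded down, in line with the remark about hashing overhead.

```python
import math


def bits_per_entry_for(target_rate: float) -> float:
    """Solve (0.5)^((m/n) ln 2) = target_rate for m/n."""
    return -math.log(target_rate) / (math.log(2) ** 2)


def integer_k(bits_per_entry: float) -> int:
    """Round k_opt = (m/n) ln 2 down to an integer (at least 1)."""
    return max(1, math.floor(bits_per_entry * math.log(2)))


def rate(bits_per_entry: float, k: int) -> float:
    """False positive rate (1 - e^(-kn/m))^k for a given integer k."""
    return (1.0 - math.exp(-k / bits_per_entry)) ** k


print("base of the optimum rate:", 0.5 ** math.log(2))    # about 0.6185
for target in (0.1, 0.01, 0.001):
    b = bits_per_entry_for(target)
    k = integer_k(b)
    print(f"target {target}: m/n = {b:.2f}, k = {k}, achieved rate = {rate(b, k):.5f}")
```

Because k is rounded down from k_opt, the achieved rate is slightly above the target; adding a little to m/n compensates.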

SLIDE 15

False positive rate vs. bits per entry

[Figure: false positive rate versus bits per entry (m/n), comparing 4 hash functions with the optimal number of hash functions]

SLIDE 16

Standard Bloom Filter tricks

Two Bloom filters representing sets S1 and S2 with the same number of bits and using the same hash functions.

  • A Bloom filter that represents the union of S1 and S2 can be obtained by taking the OR of the bit vectors

A Bloom filter can be halved in size. Suppose the size is a power of 2.

  • Just OR the first and second halves of the bit vector
  • When hashing to do a lookup, the highest order bit is masked

Notation: OR denotes bitwise or
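Both tricks can be sketched in Python by storing the bit vector in an integer. The helper names (`positions`, `build`, `contains`), the salted SHA-256 hashing, and the 0-based positions are assumptions made for this example.

```python
import hashlib

K = 3  # number of hash functions (illustrative)


def positions(item: str, m: int) -> list[int]:
    """K positions in [0, m); salted SHA-256 stands in for the hash functions."""
    out = []
    for i in range(K):
        digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
        out.append(int.from_bytes(digest[:8], "big") % m)
    return out


def build(items: list[str], m: int) -> int:
    """Return the filter as a Python int whose bit j is array position j."""
    bits = 0
    for item in items:
        for pos in positions(item, m):
            bits |= 1 << pos
    return bits


def contains(bits: int, item: str, m: int, size: int) -> bool:
    """Look up item in a filter of `size` bits built with hash range m.

    size must be a power of 2 dividing m; `pos & (size - 1)` masks the
    high-order bit(s) of each hash value.
    """
    return all((bits >> (pos & (size - 1))) & 1 for pos in positions(item, m))


m = 256                                   # power of 2
s1, s2 = ["a", "b", "c"], ["c", "d", "e"]
f1, f2 = build(s1, m), build(s2, m)

union = f1 | f2                           # represents the union of S1 and S2
half = m // 2
halved = (union & ((1 << half) - 1)) | (union >> half)   # OR the two halves

print(all(contains(union, x, m, m) for x in s1 + s2))
print(all(contains(halved, x, m, half) for x in s1 + s2))
```

The halved filter still reports every member as present, at the cost of a higher false positive rate because the same number of set bits now shares half the space.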

SLIDE 17

Counting Bloom filters

Proposed by Fan et al. [2000] for distributed caching

Every entry in a counting Bloom filter is a small counter (rather than a single bit).

  • When an item is inserted into the set, the corresponding counters are each incremented by 1
  • When an item is deleted from the set, the corresponding counters are each decremented by 1

To avoid counter overflow, its size must be sufficiently large. It was found that 4 bits per counter are enough.
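A counting Bloom filter with 4-bit counters might look like the sketch below. The class and method names and the salted SHA-256 hashing are assumptions for illustration. The choice to leave a full counter at 15 rather than let it wrap is also an assumption of this sketch, and the counters are kept in a plain list rather than packed four bits each.

```python
import hashlib


class CountingBloomFilter:
    """Counting Bloom filter with 4-bit counters (values 0..15)."""

    MAX_COUNT = 15  # largest value a 4-bit counter can hold

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.counters = [0] * m     # one list slot per counter, not packed

    def _positions(self, item: str) -> list[int]:
        # Salted SHA-256 stands in for the k hash functions (an assumption).
        out = []
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            out.append(int.from_bytes(digest[:8], "big") % self.m)
        return out

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            if self.counters[pos] < self.MAX_COUNT:   # assumption: stay at 15 on overflow
                self.counters[pos] += 1

    def remove(self, item: str) -> None:
        """Delete an item that was previously added."""
        for pos in self._positions(item):
            if self.counters[pos] > 0:
                self.counters[pos] -= 1

    def might_contain(self, item: str) -> bool:
        return all(self.counters[pos] > 0 for pos in self._positions(item))


cbf = CountingBloomFilter(m=256, k=3)
cbf.add("a")
cbf.add("b")
cbf.remove("a")
print("b still present:", cbf.might_contain("b"))
```

Removing an item that was never added would decrement counters belonging to other items, so callers are expected to delete only items they inserted.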

SLIDE 18

Counter overflow probability

Consider a set of n elements, k hash functions, and m counters

  • c(i) is the count for the ith counter

$$P[c(i) = j] = \binom{nk}{j} \left(\frac{1}{m}\right)^j \left(1 - \frac{1}{m}\right)^{nk - j}$$

$$P[c(i) \ge j] \le \binom{nk}{j} \left(\frac{1}{m}\right)^j \le \left(\frac{e\,nk}{jm}\right)^j \quad \text{(a very loose upper bound)}$$

SLIDE 19

Counter overflow probability (cont.)

Choose k such that $k \le (m/n) \ln 2$. Then

$$P[c(i) \ge j] \le \left(\frac{e\,nk}{jm}\right)^j \le \left(\frac{e \ln 2}{j}\right)^j$$

$$P\left[\max_{1 \le i \le m} c(i) \ge j\right] \le m \left(\frac{e \ln 2}{j}\right)^j$$

(the left-hand side is the probability that c(i) ≥ j for some i)

Using 4 bits, each counter counts from 0 to 15

$$P\left[\max_{1 \le i \le m} c(i) \ge 16\right] \le 1.37 \times 10^{-15} \times m$$
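The constant in the last bound can be reproduced directly. The sketch below uses hypothetical function names, and the billion-counter filter is only an illustrative size.

```python
import math


def single_counter_bound(j: int) -> float:
    """Upper bound (e ln 2 / j)^j on P[c(i) >= j], valid when k <= (m/n) ln 2."""
    return (math.e * math.log(2) / j) ** j


def any_counter_bound(j: int, m: int) -> float:
    """Union bound m * (e ln 2 / j)^j on P[max_i c(i) >= j]."""
    return m * single_counter_bound(j)


print(single_counter_bound(16))             # about 1.37e-15
print(any_counter_bound(16, m=10**9))       # a billion counters (illustrative)
```

Even with a billion counters the bound on any counter reaching 16 stays around one in a million, which is consistent with 4 bits per counter being enough.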

SLIDE 20

Counter overflow consequences

When a counter does overflow, it may be left at its maximum value.

This can later cause a false negative only if eventually the counter goes down to 0 when it should have remained nonzero.

The expected time to this event is very large, but it is something we need to keep in mind for any application that does not allow false negatives.

SLIDE 21

Conclusions

Wherever a list or set is used, and space is at a premium, a Bloom filter may be used if the effect of false positives can be mitigated

  • No false negative

With a counting Bloom filter, false negatives are possible, albeit highly unlikely

SLIDE 22

The End
