Simple and Space-Efficient Minimal Perfect Hash Functions



SLIDE 1

Simple and Space-Efficient Minimal Perfect Hash Functions

Fabiano C. Botelho
Department of Computer Science, Federal University of Minas Gerais, Brazil

Rasmus Pagh
Computational Logic and Algorithms Group, IT University of Copenhagen, Denmark

Nivio Ziviani
Department of Computer Science, Federal University of Minas Gerais, Brazil

LATIN - LAboratory for Treating INformation (www.dcc.ufmg.br/latin)

SLIDE 2

What Is The Problem to Solve?

Design, analyze and implement minimal perfect hash functions (MPHFs) that:

  • Use space close to the optimal
  • Are faster to generate than the ones available in the literature
  • Are fast to compute
  • Need little memory during generation

SLIDE 3

Perfect Hash Function

[Figure: a perfect hash function maps the key set S of size n (keys 0, 1, ..., n-1) without collisions to a hash table with slots 0, 1, ..., m-1, where S ⊆ U and |U| = u.]

SLIDE 4

Minimal Perfect Hash Function

[Figure: a minimal perfect hash function maps the key set S of size n (keys 0, 1, ..., n-1) one-to-one onto a hash table with slots 0, 1, ..., n-1, where S ⊆ U and |U| = u.]
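The two definitions can be checked mechanically. The sketch below is only an illustration: the helper names and the two lookup tables are hypothetical stand-ins for real hash functions, not part of the talk.

```python
def is_perfect(h, keys, m):
    """True if h maps every key to a distinct slot in range(m)."""
    slots = [h(k) for k in keys]
    in_range = all(0 <= s < m for s in slots)
    return in_range and len(set(slots)) == len(slots)


def is_minimal_perfect(h, keys):
    """A minimal perfect hash function is perfect with m equal to n."""
    return is_perfect(h, keys, len(keys))


keys = ["jan", "feb", "mar", "apr"]
# Hypothetical lookup tables playing the role of hash functions.
dense = {"mar": 0, "jan": 1, "feb": 2, "apr": 3}   # n = 4 keys, 4 slots
sparse = {"jan": 0, "feb": 2, "mar": 5, "apr": 7}  # n = 4 keys, 8 slots

print(is_minimal_perfect(dense.get, keys))   # True
print(is_perfect(sparse.get, keys, 8))       # True
print(is_minimal_perfect(sparse.get, keys))  # False
```

The sparse table is perfect (no collisions) but not minimal, because it needs more slots than keys.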

SLIDE 5

Lower Bounds For Storage Space

  • PHFs (m ≈ n): $\text{Storage Space} \geq \frac{n^2}{m} \log e$
  • MPHFs (m = n): $\text{Storage Space} \geq n \log e$

$\log e = 1.4427$
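As a quick numeric check of these bounds, the sketch below evaluates both expressions in bits per key. The function names and the value of n are illustrative assumptions; logarithms are base 2.

```python
import math

LOG2_E = math.log2(math.e)  # about 1.4427


def mphf_lower_bound_bits(n):
    """Lower bound for an MPHF (m = n): n log e bits."""
    return n * LOG2_E


def phf_lower_bound_bits(n, m):
    """Lower bound for a PHF with table size m: (n^2 / m) log e bits."""
    return (n * n / m) * LOG2_E


n = 1_000_000  # hypothetical key count
print(round(mphf_lower_bound_bits(n) / n, 4))            # 1.4427
print(round(phf_lower_bound_bits(n, 1.23 * n) / n, 2))   # 1.17
```

With m = 1.23n the PHF bound comes to about 1.17 bits per key.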

SLIDE 6

Related Work

  • Theoretical Results
  • Practical Results
  • Heuristics

SLIDE 7

Theoretical Results

| Work | Gen. Time | Eval. Time | Size (bits) |
|---|---|---|---|
| Mehlhorn (1984) | Expon. | Expon. | O(n) |
| Schmidt and Siegel (1990) | Not analyzed | O(1) | O(n) |
| Hagerup and Tholey (2001) | O(n+log log u) | O(1) | O(n) |

SLIDE 8

Practical Results

| Work | Gen. Time | Eval. Time | Size (bits) |
|---|---|---|---|
| Czech, Havas and Majewski (1992) | O(n) | O(1) | O(n log n) |
| Majewski, Wormald, Havas and Czech (1996) | O(n) | O(1) | O(n log n) |
| Pagh (1999) | O(n) | O(1) | O(n log n) |

SLIDE 9

Heuristics

| Work | Gen. Time | Eval. Time | Size (bits) | Application |
|---|---|---|---|---|
| Fox, Chen and Heath (1992) | Exp. | O(1) | O(n) | Index data in CD-ROM |
| Lefebvre and Hoppe (2006) | O(n) | O(1) | O(n) | Sparse spatial data |
| Chang, Lin and Chou (2005, 2006) | O(n) | O(1) | Not analyzed | Data mining |

SLIDE 10

Our Family of Algorithms

Near-optimal space

Evaluation in constant time

Function generation in linear time

Simple to describe and implement

Algorithms in the literature with near-optimal space either:

Require exponential time for construction and evaluation, or

Use near-optimal space only asymptotically, for large n

Acyclic random hypergraphs

Used before by Majewski et al. (1996): O(n log n) bits

We proceed differently: O(n) bits

(we reduce the space complexity to O(n) bits, close to the theoretical lower bound)

SLIDE 11

Our Family of Algorithms - Remark

Chazelle et al. (SODA 2004) presented a way of constructing PHFs that is equivalent to ours.

It is explained as a modification of the "Bloomier Filter" data structure, but they do not make explicit that a PHF is constructed.

SLIDE 12

Random Hypergraphs (r-graphs)

A 3-graph is induced by three uniform hash functions:

  h0(jan) = 1   h1(jan) = 3   h2(jan) = 5
  h0(feb) = 1   h1(feb) = 2   h2(feb) = 5
  h0(mar) = 0   h1(mar) = 3   h2(mar) = 4

[Figure: the resulting 3-graph on vertices 0 to 5, with one hyperedge per key.]

Our best result uses 3-graphs
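A minimal sketch of how the hash values on this slide induce a hypergraph: each key becomes one hyperedge over the vertices it hashes to. The function name and the dictionary layout are assumptions made for illustration; only the hash values come from the slide.

```python
# Hash values (h0, h1, h2) copied from the slide.
HASH_VALUES = {
    "jan": (1, 3, 5),
    "feb": (1, 2, 5),
    "mar": (0, 3, 4),
}


def build_hypergraph(hash_values):
    """Return (edges, incident): one hyperedge per key, plus the
    list of keys whose hyperedge touches each vertex."""
    edges = {key: tuple(vs) for key, vs in hash_values.items()}
    incident = {}
    for key, vs in edges.items():
        for v in vs:
            incident.setdefault(v, []).append(key)
    return edges, incident


edges, incident = build_hypergraph(HASH_VALUES)
for v in sorted(incident):
    print(v, incident[v])
```

Vertices 1 and 5 are shared by jan and feb, vertex 3 by jan and mar, and vertices 0, 2 and 4 each belong to a single hyperedge.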

SLIDE 13

Acyclic 2-graph

[Figure: 2-graph Gr on vertices 0 to 7, induced by h0 and h1, with one edge per key: jan, feb, mar, apr. Edge list L: Ø]

SLIDE 14

Acyclic 2-graph

[Figure: Gr after one edge removal; remaining edges: jan, feb, apr. Edge list L: {0,5}]

SLIDE 15

Acyclic 2-graph

[Figure: Gr after two edge removals; remaining edges: jan, apr. Edge list L (positions 0 to 1): {0,5}, {2,6}]

SLIDE 16

Acyclic 2-graph

[Figure: Gr after three edge removals; remaining edge: jan. Edge list L (positions 0 to 2): {0,5}, {2,6}, {2,7}]

SLIDE 17

Acyclic 2-graph

[Figure: Gr with all edges removed. Edge list L (positions 0 to 3): {0,5}, {2,6}, {2,7}, {2,5}]

Gr is acyclic
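The edge-removal sequence on these slides can be sketched as a peeling procedure: repeatedly remove an edge that contains a vertex of degree 1; the graph is acyclic exactly when every edge gets removed. The code below is a sketch under assumptions: the function name is hypothetical, the key-to-edge pairing is inferred from the order in which keys disappear from the figures, and each edge is assumed to have distinct vertices.

```python
from collections import defaultdict


def peel(edges):
    """edges: dict mapping key -> tuple of distinct vertices.
    Return the keys in removal order, or None if the graph has a cycle."""
    incident = defaultdict(set)
    for key, vs in edges.items():
        for v in vs:
            incident[v].add(key)

    # Start from every vertex that currently has degree 1.
    stack = [v for v in incident if len(incident[v]) == 1]
    order = []
    while stack:
        v = stack.pop()
        if len(incident[v]) != 1:
            continue  # stale entry: the degree changed since it was pushed
        (key,) = incident[v]
        order.append(key)
        for u in edges[key]:
            incident[u].discard(key)
            if len(incident[u]) == 1:
                stack.append(u)
    return order if len(order) == len(edges) else None


# Edge endpoints as listed in L; the pairing with keys is an inference.
example = {"mar": (0, 5), "feb": (2, 6), "apr": (2, 7), "jan": (2, 5)}
triangle = {"a": (0, 1), "b": (1, 2), "c": (2, 0)}

print(peel(example) is not None)  # True: Gr is acyclic
print(peel(triangle))             # None: a cycle blocks the peeling
```

The removal order produced here may differ from the order shown on the slides; any order that peels all edges is acceptable.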

SLIDE 18

The Family of Algorithms (r = 2)

[Figure: Mapping step. The key set S = {jan, feb, mar, apr} is mapped by h0 and h1 to the edges of the 2-graph Gr on vertices 0 to 7.]

SLIDE 19

The Family of Algorithms (r = 2)

[Figure: Mapping and Assigning steps. Mapping turns S = {jan, feb, mar, apr} into the 2-graph Gr via h0 and h1. Assigning uses the edge list L (positions 0 to 3): {0,5}, {2,6}, {2,7}, {2,5} to fill the array g, indexed by vertices 0 to 7.]

SLIDE 20

The Family of Algorithms (r = 2)

[Figure: Mapping and Assigning steps. Mapping turns S = {jan, feb, mar, apr} into the 2-graph Gr via h0 and h1. Assigning uses the edge list L (positions 0 to 3): {0,5}, {2,6}, {2,7}, {2,5} to fill the array g, indexed by vertices 0 to 7.]

  • Values of g are in the range {0, 1, ..., r}
  • r = 2 or r = 3
  • At most 2 bits for each vertex in g

SLIDE 21

The Family of Algorithms (r = 2)

[Figure: Mapping, Assigning and Ranking steps. Mapping turns S = {jan, feb, mar, apr} into the 2-graph Gr via h0 and h1. Assigning fills the array g, indexed by vertices 0 to 7, and marks four vertices as assigned. Ranking maps the assigned vertices to the hash table slots 0 to 3, which hold mar, jan, feb, apr.]

  i = (g(h0(feb)) + g(h1(feb))) mod r = (g(2) + g(6)) mod 2 = 1
  phf(feb) = hi(feb) = h1(feb) = 6
  mphf(feb) = rank(phf(feb)) = rank(6) = 2
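The Assigning, Ranking and evaluation steps for r = 2 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, a dictionary stands in for the hash functions h0 and h1, only feb's values (2 and 6) are given on the slide, and the (h0, h1) order for the other keys is assumed. Unassigned vertices hold the value r, and rank is computed by a linear scan, whereas a real implementation would use a small rank table for constant-time evaluation.

```python
R = 2
UNASSIGNED = R  # vertices never chosen for an edge keep the value r


def assign(edges, order, m):
    """Process edges in reverse removal order; for each edge pick a vertex
    not yet visited and set g so the edge's g-sum mod R selects it."""
    g = [UNASSIGNED] * m
    visited = [False] * m
    for key in reversed(order):
        vs = edges[key]
        i = next(j for j, v in enumerate(vs) if not visited[v])
        others = sum(g[v] for j, v in enumerate(vs) if j != i)
        g[vs[i]] = (i - others) % R
        for v in vs:
            visited[v] = True
    return g


def phf(key, edges, g):
    vs = edges[key]
    return vs[sum(g[v] for v in vs) % R]


def rank(v, g):
    """Number of assigned vertices before v (linear scan for clarity)."""
    return sum(1 for x in g[:v] if x != UNASSIGNED)


def mphf(key, edges, g):
    return rank(phf(key, edges, g), g)


# (h0, h1) per key: feb is from the slide, the rest is assumed.
edges = {"jan": (2, 5), "feb": (2, 6), "mar": (0, 5), "apr": (2, 7)}
order = ["mar", "feb", "apr", "jan"]  # edge list L, in removal order

g = assign(edges, order, 8)
print(g)                       # [0, 2, 0, 2, 2, 2, 1, 1]
print(phf("feb", edges, g))    # 6
print(mphf("feb", edges, g))   # 2
```

Under these assumptions the four keys land in slots mar = 0, jan = 1, feb = 2, apr = 3, which matches the hash table drawn on the slide.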

SLIDE 22

Use of Acyclic Random Hypergraphs

  • Sufficient condition for the family of algorithms to work (Majewski et al. (1996))
  • Repeatedly select h0, h1, ..., hr-1
  • For r = 2, m = cn and c > 2: $\Pr_a = \sqrt{1 - (2/c)^2}$
      • For c = 2.09, Pra = 0.29
  • For r = 3 and c ≥ 1.23: the probability tends to 1
  • Number of iterations is 1/Pra:
      • r = 2: 3.5 iterations
      • r = 3: 1.0 iteration
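The figures for r = 2 can be reproduced with a few lines of arithmetic. The function name is a hypothetical choice; the formula is the one for Pra with r = 2 and c > 2.

```python
import math


def acyclic_probability(c):
    """Pr_a = sqrt(1 - (2/c)^2), the r = 2 formula, valid for c > 2."""
    if c <= 2:
        raise ValueError("the r = 2 formula requires c > 2")
    return math.sqrt(1 - (2 / c) ** 2)


p = acyclic_probability(2.09)
print(round(p, 2))  # 0.29
print(1 / p)        # expected number of attempts, a little under 3.5
```

The reciprocal evaluates to roughly 3.44 attempts, which the slide reports as 3.5.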

SLIDE 23

Space to Represent the Functions (r = 2)

[Figure: array g indexed by vertices 0 to 7]

  • MPHFs (ranking information required):
      • g: [0,m-1] → {0,1,2}
      • 2m + εm = (2 + ε)cn bits
      • For c = 2.09 and ε = 0.125 → 4.44n bits
  • PHFs (ranking information not required):
      • g: [0,m-1] → {0,1}
      • m = cn bits, c = 2.09 → 2.09n bits
  • Packed MPHFs (Range of size 3):
      • log 3 bits for each entry of g (arithmetic coding)
      • (log 3 + ε)cn bits
      • For c = 2.09 and ε = 0.125 → 3.6n bits

SLIDE 24

Space to Represent the Functions (r = 3)

  • MPHFs (ranking information required):
      • g: [0,m-1] → {0,1,2,3}
      • 2m + εm = (2 + ε)cn bits
      • For c = 1.23 and ε = 0.125 → 2.62n bits
      • Optimal: 1.4427n bits
  • PHFs (ranking information not required):
      • g: [0,m-1] → {0,1,2}
      • 2m = 2cn bits, c = 1.23 → 2.46n bits
  • Packed PHFs (Range of size 3):
      • log 3 bits for each entry of g (arithmetic coding)
      • (log 3)cn bits, c = 1.23 → 1.95n bits
      • Optimal: 1.17n bits

SLIDE 25

Experimental Results

  • Metrics:
      • Generation time
      • Storage space
      • Evaluation time
  • Collection:
      • URLs collected from the web, 64 bytes long on average
  • Experiments:
      • Commodity PC with a cache of 2 Mbytes

SLIDE 26

Related Algorithms

  • Botelho, Kohayakawa, Ziviani (2005) - BKZ
  • Fox, Chen and Heath (1992) - FCH
  • Czech, Havas and Majewski (1992) - CHM
  • Majewski, Wormald, Havas and Czech (1996) - MWHC
  • Pagh (1999) - PAGH

SLIDE 27

Generation Time and Storage Space

n = 3,541,615 keys

| Algorithms | Generation Time (sec) | Storage Space: Bits/Key | Storage Space: Size (MB) |
|---|---|---|---|
| Ours, r = 2 | 19.49 ± 3.750 | 3.60 | 1.52 |
| Ours, r = 3 | 9.80 ± 0.007 | 2.62 | 1.11 |
| BKZ | 16.85 ± 1.85 | 21.76 | 9.19 |
| FCH | 5901.9 ± 1489.6 | 3.66 | 1.55 |
| MWHC | 10.63 ± 0.09 | 26.76 | 11.30 |
| PAGH | 52.55 ± 2.66 | 44.16 | 18.65 |
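The Size (MB) column follows from the Bits/Key column and the number of keys. The sketch below recomputes two entries; the function name is hypothetical, and reading 1 MB as 2^20 bytes is an assumption that happens to reproduce the reported figures.

```python
def size_mb(bits_per_key, n):
    """Total size in MB, assuming 1 MB = 2**20 bytes."""
    return bits_per_key * n / 8 / 2**20


n = 3_541_615
print(round(size_mb(2.62, n), 2))  # 1.11
print(round(size_mb(3.60, n), 2))  # 1.52
```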

SLIDE 28

Evaluation Time

n = 3,541,615 keys

| Algorithms | Evaluation Time (sec) |
|---|---|
| Ours, r = 2 | 2.63 |
| Ours, r = 3 | 2.73 |
| BKZ | 2.81 |
| FCH | 2.14 |
| MWHC | 2.85 |
| PAGH | 2.78 |

SLIDE 29

Comparison of the Resulting PHFs and MPHFs

n = 3,541,615 keys

| r | m | Packed | Generation Time (sec) | Evaluation Time (sec) | Storage Space: Bits/Key | Storage Space: Size (MB) |
|---|---|---|---|---|---|---|
| 2 | 2.09n | No | 19.41 ± 3.736 | 1.83 | 2.09 | 0.88 |
| 2 | n | Yes | 19.49 ± 3.750 | 2.63 | 3.60 | 1.52 |
| 3 | 1.23n | No | 9.73 ± 0.009 | 2.16 | 2.46 | 1.04 |
| 3 | 1.23n | Yes | 9.95 ± 0.009 | 2.14 | 1.95 | 0.82 |
| 3 | n | No | 9.80 ± 0.007 | 2.73 | 2.62 | 1.11 |

SLIDE 30

Conclusions

  • We have presented an efficient family of algorithms
      • Near space-optimal PHFs and MPHFs
  • The algorithms are simpler and have much lower constant factors than existing theoretical results
  • They outperform the main practical general-purpose algorithms found in the literature considering
      • generation time
      • storage space
  • Implementation available at http://cmph.sf.net
      • LGPL free software license

SLIDE 31

Questions?