Linear probing with constant independence (Anna Pagh, Rasmus Pagh, and Milan Ružić)



slide-1
SLIDE 1

Linear probing with constant independence

Anna Pagh, Rasmus Pagh, and Milan Ružić
IT University of Copenhagen

STOC 2007

slide-7
SLIDE 7

Hashing with linear probing

It was settled in the 1960s that this is inferior to, e.g., double hashing. So why care?

slide-10
SLIDE 10

[Image: race car at 389 km/h, golf car at 20 km/h]

slide-12
SLIDE 12

Race car vs golf car

  • Linear probing uses a sequential scan and is thus cache-friendly.
  • On my laptop: 24x speed difference between sequential and random access! (For 4-byte words.)
  • Experimental studies have shown linear probing to be faster than other methods for load factor α in the range 30-70%. (For "small" keys.)
  • But: No theory behind the hash functions used for linear probing in practice.

slide-15
SLIDE 15

History of linear probing

  • First described in 1954.
  • Analyzed in 1962 by D. Knuth, aged 24. Assumes hash function h is truly random.
  • Over 30 papers using this assumption.
  • Siegel and Schmidt (1990) showed that it suffices that h is O(log n)-wise independent.

Our main result: It suffices that h is 5-wise independent.

slide-16
SLIDE 16

This talk

  • Background and motivation
  • Hash functions
  • New analysis of linear probing
  • Lower bound for 2-wise independence
  • XOR probing
slide-17
SLIDE 17

log(n)-wise independence

  • Siegel (1989) showed time-space trade-offs for evaluation of a function from a log(n)-wise independent family:

                     Time                       Space
    Lower bound      log(n) / log(s / log n)    s
    Upper bound 1*   O(log n)                   O(log n)
    Upper bound 2    O(1)                       n^ε

  • Upper bound 2 is theoretically appealing, but has a huge constant factor, and uses many random memory accesses!

slide-19
SLIDE 19

5-wise independence

  • Polynomial hash function, Carter and Wegman (FOCS '79). Already quite fast.

    $h(x) = \left(\left(\sum_{i=0}^{4} a_i x^i\right) \bmod p\right) \bmod r$

  • Tabulation-based hash function, Thorup and Zhang (SODA '04). Within factor 2 of the fastest universal hash functions.

    $h(x_1, x_2) = T_1[x_1] \oplus T_2[x_2] \oplus T_3[x_1 + x_2]$
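
The two constructions on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the prime `P`, the helper names `make_polynomial_hash` and `make_tabulation_hash`, the 16-bit character width, and the requirement that `r` be a power of two for the tabulation variant are all choices made here for the example.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime; assumed to exceed every key (illustrative choice)


def make_polynomial_hash(r, rng=random):
    """Degree-4 polynomial over Z_P with random coefficients, reduced mod r."""
    coeffs = [rng.randrange(P) for _ in range(5)]  # a_0, ..., a_4

    def h(x):
        acc = 0
        for a in reversed(coeffs):  # Horner's rule, starting from a_4
            acc = (acc * x + a) % P
        return acc % r

    return h


def make_tabulation_hash(r, char_bits=16, rng=random):
    """T1[x1] xor T2[x2] xor T3[x1 + x2] for a key split into two characters.

    Assumes r is a power of two (so the XOR stays below r) and that
    keys are smaller than 2 ** (2 * char_bits).
    """
    size = 1 << char_bits
    t1 = [rng.randrange(r) for _ in range(size)]
    t2 = [rng.randrange(r) for _ in range(size)]
    t3 = [rng.randrange(r) for _ in range(2 * size - 1)]  # x1 + x2 <= 2 * (size - 1)

    def h(x):
        x1, x2 = x >> char_bits, x & (size - 1)
        return t1[x1] ^ t2[x2] ^ t3[x1 + x2]

    return h
```

The polynomial variant needs five multiplications modulo `P` per key, while the tabulation variant trades those for three table lookups; which is faster in practice depends on cache behaviour and is not something this sketch measures.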

slide-27
SLIDE 27

Insertion cost upper bound

  • 1. Choose max t so B balls hash to B − t slots, for some B
  • 2. Choose max C such that C balls hash to C + t slots

Lemma: Cost( ) ≤ 1 + C + t

slide-29
SLIDE 29

Proof idea

  • Lemma: If operation on x goes on for more than k steps, then there are "unusually many" keys with hash values in either:
    1) Some interval with h(x) as right endpoint, or
    2) The interval [h(x), h(x)+k]
  • To bound cost, upper bound probability of each event using tail bounds for sums of random variables with limited independence.

[Figure: table positions h(x) and h(x) + k, load factor α]

slide-31
SLIDE 31

Our main result

Theorem 2 Consider any sequence of insertions, deletions, and lookups in a linear probing hash table using a 5-wise independent hash function. Then the expected cost of any operation, performed at load factor α, is $O(1 + (1-\alpha)^{-3})$. As a consequence, the expected average cost of successful lookups is $O(1 + (1-\alpha)^{-2})$.

A factor $(1-\alpha)^{-1}$ from what can be proved using full independence.
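
For comparison, the classical fully random analysis attributed to Knuth gives the following expected probe counts. They are quoted here from the standard textbook treatment, not from these slides, so treat the exact form as background rather than as part of the talk:

```latex
\begin{align*}
\text{unsuccessful search / insertion:}\quad
  & \tfrac{1}{2}\left(1 + \frac{1}{(1-\alpha)^{2}}\right) = O\!\left(1 + (1-\alpha)^{-2}\right),\\
\text{successful search:}\quad
  & \tfrac{1}{2}\left(1 + \frac{1}{1-\alpha}\right) = O\!\left(1 + (1-\alpha)^{-1}\right).
\end{align*}
% Theorem 2 has exponents -3 and -2 instead of -2 and -1,
% i.e. each bound is larger by a factor (1-\alpha)^{-1}.
% Example: at \alpha = 0.9 that factor is 10.
```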

slide-37
SLIDE 37

Cost lower bound

Lemma 2 Suppose that the multiset of hash values for the keys is $\bigcup_j I_j$, where $I_1, I_2, \ldots$ are intervals. Then the total number of steps to perform the insertions is at least $\sum_{j_1 < j_2} |I_{j_1} \cap I_{j_2}|^2 / 2$.
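
Lemma 2 can be checked on small inputs with a short simulation. The sketch below makes its own assumptions: a "step" is counted as one occupied slot skipped during an insertion, intervals are half-open pairs `(lo, hi)`, and the table is sized so that probing never wraps around. The function names are hypothetical.

```python
from itertools import combinations


def insertion_steps(hash_values):
    """Insert keys by linear probing and return the total number of slots skipped."""
    size = max(hash_values) + len(hash_values) + 1  # large enough that probing never wraps
    occupied = [False] * size
    steps = 0
    for h in hash_values:
        pos = h
        while occupied[pos]:  # scan right until a free slot is found
            pos += 1
            steps += 1
        occupied[pos] = True
    return steps


def overlap_bound(intervals):
    """Lower bound of Lemma 2: sum over pairs of |I_j1 ∩ I_j2|^2 / 2."""
    total = 0.0
    for (lo1, hi1), (lo2, hi2) in combinations(intervals, 2):
        overlap = max(0, min(hi1, hi2) - max(lo1, lo2))
        total += overlap ** 2 / 2
    return total


intervals = [(0, 4), (2, 6)]  # two intervals sharing the slots 2 and 3
hashes = [v for lo, hi in intervals for v in range(lo, hi)]
assert insertion_steps(hashes) >= overlap_bound(intervals)
```

Without wrap-around, the set of occupied slots and hence the total displacement do not depend on insertion order, so trying other orders of `hashes` should give the same step count.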

slide-40
SLIDE 40

Bad example: "Linear hashing"

  • First example of pairwise independence.

    $h(x) = ((ax + b) \bmod p) \bmod r$

  • Consider an interval $S_1 = \{z+1, \ldots, z+n\}$.
  • Observation: Let $m = a^{-1} \pmod{p}$. Then $h(S_1)$ is the union of at most $m+1$ intervals (mod r).
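
The observation can be tried out numerically. The following sketch picks a multiplier whose inverse modulo p is small and counts the maximal runs of consecutive values in the image of an interval. It looks only at the inner map x ↦ (ax + b) mod p and leaves out the final reduction mod r; the concrete numbers and the helper `count_runs` are illustrative assumptions.

```python
def count_runs(values):
    """Count maximal runs of consecutive integers in a collection."""
    present = set(values)
    return sum(1 for v in present if v - 1 not in present)


p = 1009                # a prime (illustrative)
m = 5                   # the small multiplicative inverse we want a to have
a = pow(m, -1, p)       # a = m^{-1} mod p (three-argument pow with -1 needs Python 3.8+)
b, z, n = 17, 100, 200  # arbitrary offset and interval S1 = {z+1, ..., z+n}

image = [(a * x + b) % p for x in range(z + 1, z + n + 1)]
runs = count_runs(image)
assert runs <= m + 1    # the image splits into at most m + 1 intervals
```

Each run has length about n/m, which is what makes two such images likely to overlap heavily.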

slide-41
SLIDE 41

Lower bound for n insertions

  • Idea: Let S = union of two random intervals

    ⇒ Expect that the 2 times m+1 intervals have large overlap

    ⇒ Expected cost $\Omega\left(m \left(\frac{n}{m}\right)^2\right) = \Omega(n^2/m)$.

    ⇒ For random m, expected cost $\Omega\left(\frac{1}{p} \sum_{m=1}^{p-1} n^2/m\right) = \Omega\left(\frac{n^2}{p} \log p\right)$.

    ⇒ In the case p = O(n), Ω(n log n) cost!
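
The averaging step on this slide is a harmonic sum. Spelled out, with $H_k$ denoting the k-th harmonic number (notation introduced here, not on the slide):

```latex
\begin{align*}
\frac{1}{p} \sum_{m=1}^{p-1} \frac{n^2}{m}
  &= \frac{n^2}{p}\, H_{p-1}, \qquad H_{p-1} = \sum_{m=1}^{p-1} \frac{1}{m} = \Theta(\log p),\\
\text{so the expected cost is}\quad
  &\Omega\!\left(\frac{n^2}{p} \log p\right).
\end{align*}
% If p \le c\,n for a constant c (and p \ge n, since the n keys come from Z_p),
% then n^2/p \ge n/c and \log p \ge \log n, giving \Omega(n \log n).
```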
slide-42
SLIDE 42

XOR probing

Linear probing: h(x), h(x) + 1, h(x) + 2, . . .
XOR probing: h(x), h(x) ⊕ 1, h(x) ⊕ 2, . . .

  • XOR probing: Probe sequence never leaves the (aligned) memory block before it has been fully traversed.
  • For XOR probing, we can show the same result as in the fully random case, up to a constant factor, using 5-wise independence.
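
The two probe sequences are easy to compare side by side. This is a minimal sketch, assuming a table whose size is a power of two (needed so that XOR with a small index stays inside the table); the generator names are made up for the example.

```python
def linear_probes(h, size):
    """Yield h, h + 1, h + 2, ... wrapping around the table."""
    for i in range(size):
        yield (h + i) % size


def xor_probes(h, size):
    """Yield h, h xor 1, h xor 2, ...; size is assumed to be a power of two."""
    for i in range(size):
        yield h ^ i


print(list(linear_probes(5, 8)))  # [5, 6, 7, 0, 1, 2, 3, 4]
print(list(xor_probes(5, 8)))     # [5, 4, 7, 6, 1, 0, 3, 2]
```

Because the first 2^j indices only touch the low j bits, the first 2^j XOR probes cover exactly the aligned block of size 2^j containing h(x): starting from 5, the first four probes visit 5, 4, 7, 6, which is the block {4, ..., 7}.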

slide-44
SLIDE 44

End remarks

  • Theory and practice of linear probing now (seem) much closer.
  • We can generalize to variable key lengths.
  • Open:
      • Still many hashing schemes where theory does not provide satisfactory methods.
      • Tighter analysis, lower independence?
slide-46
SLIDE 46

THE END

slide-47
SLIDE 47

Why 5?

  • For every key x, the hash values of the other keys are 4-wise independent with respect to h(x).
  • 4-wise independence gives a tail bound that is sufficiently strong.
  • 2-wise independence would give a tail bound that is too weak.
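
To make the difference concrete, here are the standard moment bounds for a sum $X = \sum_i X_i$ of 0/1 variables with mean $\mu = E[X]$. These are the textbook second- and fourth-moment inequalities; the exact lemma and constants used in the paper may differ.

```latex
\begin{align*}
\text{2-wise independent:}\quad
  & \Pr\big[\,|X - \mu| \ge d\,\big] \le \frac{\operatorname{Var}[X]}{d^2} \le \frac{\mu}{d^2}
  && \text{(Chebyshev)},\\
\text{4-wise independent:}\quad
  & \Pr\big[\,|X - \mu| \ge d\,\big] \le \frac{E[(X-\mu)^4]}{d^4} \le \frac{\mu + 3\mu^2}{d^4}
  && \text{(fourth moment)}.
\end{align*}
% For a deviation d = \Theta(\mu) the first bound decays like 1/\mu,
% the second like 1/\mu^2.
```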
