slide-1
SLIDE 1

Succinct Data Structures for Retrieval and Approximate Membership

Martin Dietzfelbinger
Technische Universität Ilmenau
Joint work with Rasmus Pagh
February 18, 2008

Dietzfelbinger + Pagh Dagstuhl, Feb. 18, 2008

slide-2
SLIDE 2

Retrieval

[Figure: a set of names (Peter, Adam, Eve, Alice, Bob, Sue, Alex), each with an associated bit. Animated example queries for "Alice", "Peter", "Godzilla" (not in the set) and "Alex"; the answers shown are 1 for Alice, Godzilla and Alex.]

slide-15
SLIDE 15

Retrieval

$S \subseteq U$. Store $f : S \to R$, where w.l.o.g. $R = \{0,1\}^r$, in a data structure $D_{\mathrm{retr}}$.

[Figure: the names Peter, Adam, Eve, Alice, Bob, Sue and Alex, each with an associated bit.]

On $x \in S$, return $f(x)$. On $x \notin S$, return any element of $R$.

Issues:

  • Space for $D_{\mathrm{retr}}$.
  • Evaluation time $O(1)$ (unless said otherwise).

slide-22
SLIDE 22

Simple solutions – Our results

Simple:

  • Dictionaries: $nr$ bits plus space for the dictionary. These solve in addition the problem “is $x \in S$?”.
  • Minimal perfect hash function for $S$ with range $[n]$: $nr$ bits plus $\Theta(n)$ bits for the hash function. Best possible: $n / \ln 2 \approx 1.44n$ bits [Hagerup, Tholey 2001].

Our results:

  • Space $(1 + e^{-k})nr$, evaluation time $O(k)$.
  • (Optimal) space $n(r + o(1))$, evaluation time $O(\log n)$.

Assumption: Fully random hash functions for free.

slide-23
SLIDE 23

Overview

  • Introduction, Overview
  • Basic approach: Hash-read-add
  • Previous work
  • Results
  • Calkin’s theorem on sparse random matrices
  • Retrieval structure I
  • Approximate membership
  • Retrieval structure II: Construction in linear time
  • Multiple-choice hashing and retrieval structures
  • Summary


slide-27
SLIDE 27

Basic approach: Hash-read-add

$A : U \ni x \mapsto A(x) = \{h_1(x), \dots, h_k(x)\} \subseteq [m]$ (distinct)

Assumption: $A(x)$ is fully random on $S$. Justification: “Split-and-share” ($O(n^{1-\delta})$ extra space, not discussed here).

Data structure: Array $T[0 \,..\, m-1]$, entries from $R = \{0,1\}^r$.

slide-28
SLIDE 28

Basic approach: Hash-read-add

[Figure: a key $x$ is hashed by $h_1, h_2, h_3$ to three cells of the table $T[0 \,..\, m-1]$.]

$$f(x) = \bigoplus_{1 \le \ell \le k} T[h_\ell(x)]$$

Evaluation time: $O(k)$. Nonadaptive reads!

slide-31
SLIDE 31

Basic approach: Hash-read-add

Sufficient condition for this to work: the matrix

$$M_A = (p_{ij})_{1 \le i \le n,\; 0 \le j < m}, \qquad p_{ij} = \begin{cases} 1, & \text{if } j \in A(x_i), \\ 0, & \text{otherwise,} \end{cases}$$

has full row rank.

[Example: a $5 \times 7$ matrix with rows $A(x_1), \dots, A(x_5)$ and columns $1, \dots, 7$, each row containing three 1s.]

slide-37
SLIDE 37

Basic approach: Hash-read-add

If $M_A$ has full row rank, then the system

$$M_A \cdot \begin{pmatrix} a_0 \\ \vdots \\ a_{m-1} \end{pmatrix} = \begin{pmatrix} f(x_1) \\ \vdots \\ f(x_n) \end{pmatrix}$$

has a solution $(a_0, \dots, a_{m-1})^T$ (which goes into $T[0 \,..\, m-1]$).

If $r = 1$: Linear algebra in $\mathbb{Z}_2$.
If $r \ge 2$: Work in parallel in the components.
Alternative: $R \subseteq GF(q)$, any finite field (like $\mathbb{Z}_p$); calculate in $GF(q)$.
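The construction and the lookup can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the SHA-256 based `positions` function is a hypothetical stand-in for the fully random $h_1, \dots, h_k$, the names `build` and `query`, the `slack` factor and the retry loop over seeds are assumptions, and the solver is plain Gaussian elimination over $\mathbb{Z}_2$ with rows stored as integer bitmasks. Values with $r \ge 2$ bits are handled "in parallel in the components" simply by XOR-ing integers.

```python
import hashlib
import random

def positions(key, seed, k, m):
    """A(x): k distinct cells in [0, m). Hypothetical stand-in for h_1..h_k."""
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    return random.Random(digest).sample(range(m), k)

def build(f, k=3, slack=1.3, max_seeds=50):
    """Solve M_A * a = f over GF(2); returns (seed, m, T)."""
    keys = list(f)
    m = max(k, int(slack * len(keys)) + 1)
    for seed in range(max_seeds):
        pivots = {}  # leading column -> (row bitmask, right-hand side)
        full_rank = True
        for key in keys:
            row = 0
            for j in positions(key, seed, k, m):
                row |= 1 << j
            rhs = f[key]
            while row:
                p = row.bit_length() - 1  # leading column of this row
                if p not in pivots:
                    pivots[p] = (row, rhs)
                    break
                prow, prhs = pivots[p]
                row ^= prow
                rhs ^= prhs
            else:
                # Row reduced to zero: rows are dependent, try another seed.
                full_rank = False
                break
        if not full_rank:
            continue
        # Back-substitution; columns without a pivot keep the value 0.
        T = [0] * m
        for p in sorted(pivots):
            row, val = pivots[p]
            rest = row ^ (1 << p)  # remaining columns are all smaller than p
            while rest:
                j = rest.bit_length() - 1
                val ^= T[j]
                rest ^= 1 << j
            T[p] = val
        return seed, m, T
    raise RuntimeError("no seed gave a full-row-rank matrix")

def query(x, seed, m, T, k=3):
    """Hash-read-add: XOR of the k cells in A(x)."""
    out = 0
    for j in positions(x, seed, k, m):
        out ^= T[j]
    return out
```

For a key outside $S$ the same `query` call returns whatever the XOR of its cells happens to be, which matches the "any element of $R$" contract of a retrieval structure.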

slide-41
SLIDE 41

Previous work

[1] [Seiden, Hirschberg 1994] “Ordered perfect hashing” (special case with $f(x_i) = i - 1$). Propose the hash-read-add scheme; experiments, no analysis.

[2] [Majewski, Wormald, Havas, Czech 1996] (Ordered) perfect hashing, $r \ge 1$, $m = O(n)$. Analysis via random (hyper)graph theory.

[3] [Chazelle, Kilian, Rubinfeld, Tal 2004] Implicit in work on the “Bloomier filter”, $r \ge 1$, $m = O(n)$. Ad-hoc analysis.

slide-45
SLIDE 45

Previous work

In [2] + [3]: Use the sufficient condition “hypergraph $G_A$ (with hyperedges $A(x_i)$) is acyclic”.

Equivalent: Each nonempty subset of the $A(x_i)$'s covers at least one node only once.

Equivalent: $M_A$ can be brought into echelon form by row and column exchanges.

slide-48
SLIDE 48

Previous work

[Example: the $5 \times 7$ matrix brought into echelon form, with columns reordered as 7, 3, 1, 6, 2, 4, 5 and rows reordered as $A(x_1), A(x_4), A(x_3), A(x_2), A(x_5)$.]

Thresholds (acyclicity whp if $m \ge (1+\gamma)n > \gamma_k n$):

k      2   3      4      5      6      asympt.
γ_k    2   1.222  1.295  1.425  1.570  ln k

Advantage: Solve the linear system in time $O(nk)$.
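The $O(nk)$ bound comes from solving the system by "peeling" instead of Gaussian elimination. The sketch below is one way to do that under stated assumptions: `positions` is a hypothetical SHA-256 stand-in for the random hash functions, `peel_build` and `query` are illustrative names, and the slack factor 1.4 is chosen comfortably above the $k = 3$ threshold 1.222 so that small inputs succeed after few retries.

```python
import hashlib
import random
from collections import defaultdict

def positions(key, seed, k, m):
    """A(x): k distinct cells in [0, m). Hypothetical stand-in for h_1..h_k."""
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    return random.Random(digest).sample(range(m), k)

def peel_build(f, k=3, slack=1.4, max_seeds=100):
    """Build T by peeling the hypergraph with hyperedges A(x); returns (seed, m, T)."""
    keys = list(f)
    n = len(keys)
    m = int(slack * n) + k
    for seed in range(max_seeds):
        A = {x: positions(x, seed, k, m) for x in keys}
        incident = defaultdict(set)  # cell -> keys still covering it
        for x, cells in A.items():
            for j in cells:
                incident[j].add(x)
        stack = [j for j in incident if len(incident[j]) == 1]
        order = []  # (key, cell that only this key covered when it was peeled)
        while stack:
            j = stack.pop()
            if len(incident[j]) != 1:
                continue  # stale entry: the cell was emptied in the meantime
            (x,) = incident[j]
            order.append((x, j))
            for c in A[x]:
                incident[c].discard(x)
                if len(incident[c]) == 1:
                    stack.append(c)
        if len(order) < n:
            continue  # some keys could not be peeled; try another seed
        T = [0] * m
        # Assign in reverse peeling order so each equation is fixed exactly once.
        for x, j in reversed(order):
            acc = f[x]
            for c in A[x]:
                if c != j:
                    acc ^= T[c]
            T[j] = acc
        return seed, m, T
    raise RuntimeError("no seed gave a peelable hypergraph")

def query(x, seed, m, T, k=3):
    """Hash-read-add lookup: XOR of the k cells in A(x)."""
    out = 0
    for j in positions(x, seed, k, m):
        out ^= T[j]
    return out
```

Each key and each cell is touched a constant number of times per hash function, which is where the linear running time comes from; the price is the larger space factor $\gamma_k$ in the table above.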

slide-52
SLIDE 52

New in this context: Calkin’s theorem

Theorem [Calkin 1997] There are constants $\beta_k < 1$, $k = 3, 4, \dots$, with the following properties:

  • If $(1+\gamma) > \beta_k^{-1}$ and $m \ge (1+\gamma)n$ and $M_A$ is as before, then $M_A$ has full row rank with probability $1 - 1/n^{\varepsilon}$.
  • $\beta_k^{-1} \approx 1 + e^{-k}/\ln 2$, for growing $k$.

Thresholds:

k          2   3       4      5      6       asympt.
β_k^{-1}   2   1.1243  1.034  1.011  1.0038  1 + e^{-k}/ln 2
γ_k        2   1.222   1.295  1.425  1.570   ln k

slide-57
SLIDE 57

New retrieval structures

Theorem 1 A retrieval data structure using the hash-read-add scheme with $k$ accesses per query and space $(1+\gamma)rn$ bits can be built if $1 + \gamma > \beta_k^{-1} \approx 1 + e^{-k}/\ln 2$. Construction time: $O(n^3)$.

Proof: Existence: Calkin. Construction: Solve a linear system.

Theorem 2 . . . same . . . Construction time $O(n^{1+\delta})$.

Proof: Theorem 1 plus “splitting”.

slide-58
SLIDE 58

Construction time $O(n^{1+\delta})$

[Figure: $h_{\mathrm{split}}$ maps each key of $S \subseteq U$ to one of $t = n^{1-\delta/2}$ segments of the table $T$.]

slide-62
SLIDE 62

Construction time $O(n^{1+\delta})$

Construction time $O(n^{1+\delta})$, for $\delta > 0$ constant: “Split”.

Introduce an extra first level of hashing, using a hash function $h_{\mathrm{split}} : U \to [t]$. It splits $S$ into $t = n^{1-\delta/2}$ chunks $S_0, \dots, S_{t-1}$ of size $O(n^{\delta/2})$.

Construct a separate retrieval data structure for each of the chunks: construction time $t \cdot O((n^{\delta/2})^3) = O(n^{1+\delta})$. Extra space: $o(n)$.

Retrieval for $y$: access the retrieval data structure for $S_{h_{\mathrm{split}}(y)}$.
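The construction-time bound follows from a one-line exponent calculation. It is spelled out here as a sketch, assuming the cubic cost of Theorem 1 for each chunk and the chunk size bound $O(n^{\delta/2})$ stated above:

```latex
t \cdot O\!\left( \left(n^{\delta/2}\right)^{3} \right)
  = n^{1 - \delta/2} \cdot O\!\left( n^{3\delta/2} \right)
  = O\!\left( n^{\,1 - \delta/2 + 3\delta/2} \right)
  = O\!\left( n^{1+\delta} \right).
```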

slide-68
SLIDE 68

Approximate membership

$x \in S$ → Answer “yes”
$x \notin S$ → $\Pr(\text{Answer “no”}) \ge 1 - 2^{-r}$

Standard implementation: Bloom filters. Space $\approx nr / \ln 2$.

Can show ([Carter et al. 1978]): Need $\ge nr - O(1)$ bits.

Or (folklore): Use (minimal) perfect hashing to store an $r$-bit fingerprint for each $x \in S$. Space: $nr + O(n)$ bits.

slide-76
SLIDE 76

Approximate membership

General construction: Assume any algorithm for a retrieval structure for range $R = \{0,1\}^r$ plus one fully random hash function $q : U \to R$ (fingerprint).

Given $S$, build a retrieval structure $D_{\mathrm{retr}}$ for $f = q|_S$.

On query $y$: Retrieve $s = D_{\mathrm{retr}}(y)$; answer “yes” if $q(y) = s$, “no” otherwise.

Performance: Error probability $\le 2^{-r}$.

New: Space $(1 + e^{-k})nr$ bits, evaluation time $O(k)$. Construction time $O(n^3)$ resp. $O(n^{1+\delta})$.
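The reduction from approximate membership to retrieval is independent of how the retrieval structure is built, so it can be illustrated with a stand-in. In the sketch below everything except the reduction itself is an assumption: `DictRetrieval` is a deliberately non-succinct placeholder that is correct on $S$ and returns an arbitrary value (0) elsewhere, and `fingerprint` uses SHA-256 in place of the fully random $q$.

```python
import hashlib

R_BITS = 8  # r: fingerprint length in bits (illustrative choice)

def fingerprint(y, r=R_BITS):
    """q: U -> {0,1}^r. SHA-256 stands in for a fully random hash function."""
    digest = hashlib.sha256(str(y).encode()).digest()
    return int.from_bytes(digest[:8], "big") % (1 << r)

class DictRetrieval:
    """Placeholder for D_retr: exact on S, arbitrary (here 0) outside S.

    A real retrieval structure would not store the keys at all; this class
    only mimics the interface so the reduction can be run.
    """
    def __init__(self, f):
        self._f = dict(f)

    def __call__(self, y):
        return self._f.get(y, 0)

def build_filter(S, r=R_BITS):
    """Store f = q restricted to S in the retrieval structure."""
    return DictRetrieval({x: fingerprint(x, r) for x in S})

def might_contain(D, y, r=R_BITS):
    """Answer "yes" iff the retrieved value equals the fingerprint of y."""
    return D(y) == fingerprint(y, r)
```

Members always pass the comparison because $D_{\mathrm{retr}}$ returns exactly $q(x)$ for them. A non-member passes only if its fingerprint happens to equal the arbitrary value returned, which is where the $2^{-r}$ error bound comes from.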

slide-79
SLIDE 79

Retrieval: Construction in linear time

Theoretical construction, works for very large $n$.

Extra level of hashing: $h_{\mathrm{split}} : U \to [t]$.

[Figure: $h_{\mathrm{split}}$ maps each key of $S \subseteq U$ to one of $t = n/b$ segments of the table $T$.]

slide-88
SLIDE 88

Construction in linear time

Chunk size: $b = \frac{1}{2}\sqrt{\log n}$. Number of chunks: $t = n/b$.

$h_{\mathrm{split}}$ splits $S$ into chunks $S_0, \dots, S_{t-1}$ of expected size $b$.

Construct a separate retrieval data structure for each chunk: Allocate space $b' = (1+\gamma)b$ for each chunk, $1 + \gamma > \beta_k^{-1}$.

Construction time $O(b)$! Why? Can arrange that the matrix size $((1+\gamma)b)^2$ is $< \frac{1}{2}\log n$. Use table lookup for the linear algebra.

Total construction time: $O(n)$. Total space: $(1+\gamma)nr + o(n)$ bits.

Done?

slide-91
SLIDE 91

Construction in linear time

No! “Bad chunks”: (1) Overflow (2) Construction fails.

[Figure: $h_{\mathrm{split}}$ maps each key of $S \subseteq U$ to one of $t = n/b$ segments of the table $T$; one segment is marked “bad”.]

slide-94
SLIDE 94

Construction in linear time

Nice: #(keys in bad chunks) $= o(n)$.

The keys from bad chunks are accommodated in a secondary structure with table $T'[0 \,..\, o(n)]$ and hash functions $h'_1, \dots, h'_3$, with $o(n)$ construction time (e.g. [Chazelle et al. 2004]).

Flag: $v[i] = 0$ if chunk $i$ is bad and $v[i] = 1$ otherwise.

slide-97
SLIDE 97

Construction in linear time

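Segment for chunk $S_i$ in $T$: $T[d_i \,..\, d_{i+1} - 1]$.

Lookup operation: $i \leftarrow h_{\mathrm{split}}(x)$, then

$$f(x) = v[i] \cdot \bigoplus_{1 \le \ell \le k} T[h_\ell(x) + d_i] \;\oplus\; \overline{v[i]} \cdot \bigoplus_{1 \le \ell \le 3} T'[h'_\ell(x)],$$

where $\overline{v[i]} = 1 - v[i]$ selects the secondary structure for bad chunks.

Time $O(k)$. Nonadaptive reads!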

slide-104
SLIDE 104

Almost optimal space, logarithmic evaluation time

Use random sets $A(x) = \{h_1(x), \dots, h_{k(x)}(x)\} \subseteq [n]$, with $E(k(x)) = \Theta(\log n)$, binomially distributed.

[Cooper, 1999] ⇒ $\Pr(M_A \text{ is regular}) \ge 0.28$.

$O(\log n)$ trials will find a good $A$ with probability $1 - 1/n^{O(1)}$. (Must store the index of this $A$: $O(\log\log n)$ additional bits.)

Space for table $T$: $nr$ bits (optimal).

Evaluation time: $O(\log n)$.
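The trial count and the cost of storing the index can be checked with a short calculation. This is a sketch assuming independent trials that each succeed with probability at least 0.28; the constant $c$ is a free parameter introduced here for illustration.

```latex
\Pr(\text{all } c \log_2 n \text{ trials fail})
  \le (1 - 0.28)^{c \log_2 n}
  = n^{c \log_2 0.72}
  \approx n^{-0.47\,c},
\qquad
\left\lceil \log_2 (c \log_2 n) \right\rceil = O(\log\log n)
  \text{ bits for the index of the successful trial.}
```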

slide-108
SLIDE 108

Multiple-choice hashing and retrieval structures

Dictionary structure with multiple choices and retrieval:

[Figure: a key $x$ is hashed by $h_1, h_2, h_3$ to three cells of the table $T[0 \,..\, m-1]$; the cells hold keys such as $x$, $y$, $z$.]

$A(x) = \{h_1(x), h_2(x), h_3(x)\}$. $x$ is stored in $T[h_1(x)]$ or $T[h_2(x)]$ or $T[h_3(x)]$.

Similarity to retrieval structure!

Many variants: “. . . cuckoo hashing”, “balanced allocation”.

slide-111
SLIDE 111

Multiple-choice hashing and retrieval structures

Dictionary structure with multiple choices and retrieval:

[Figure: a key $x$ is hashed by $h_1, h_2, h_3$ to three cells of the table $T[0 \,..\, m-1]$; the cells hold keys such as $x$, $y$, $z$.]

Definition: $A$ is suitable for $S = \{x_1, \dots, x_n\}$ if there is a one-to-one mapping $\sigma : \{1, \dots, n\} \to [m]$ with $\sigma(i) \in A(x_i)$.

(. . . if and only if one can store each $x_i$ in a slot in $A(x_i)$, without collision.)

slide-113
SLIDE 113

Multiple-choice hashing and retrieval structures

Observation 1: $M_A$ has full row rank ⇒ $A$ is suitable for $S$.

Reason: $M_A$ has a regular square submatrix $M'_A$.

[Example: a $5 \times 7$ matrix with rows $A(x_1), \dots, A(x_5)$ and columns $1, \dots, 7$, each row containing three 1s.]

$$\det(M'_A) = \sum_{\pi \in S_n} \operatorname{sign}(\pi) \prod_{1 \le i \le n} p_{i\pi(i)} \ne 0.$$

slide-116
SLIDE 116

Multiple-choice hashing and retrieval structures

Consequence: Calkin’s result gives a sufficient condition for dictionary structures to exist: space $(1 + \gamma)n$ with $1 + \gamma > \beta_k^{-1}$, $k$ probes.

Construction time $O(n^{1+\delta})$. (Tiny improvement of the space bound of [Fotakis et al., 2005], d-ary cuckoo hashing. Conjecture: optimal.)


slide-118
SLIDE 118

Multiple-choice hashing and retrieval structures

Observation 2: $A$ suitable for $S$ ⇒

[Example matrix $M_A$ with columns $[m] = \{1, \dots, 7\}$ and rows $A(x_1), \dots, A(x_5)$, each row containing three 1's.]

$M_A$ has a square submatrix $M'_A$ with a nonvanishing term $\operatorname{sign}(\pi) \prod_{1 \le i \le n} p_{i\pi(i)}$ in the determinant …


slide-119
SLIDE 119

Multiple-choice hashing and retrieval structures

Observation 2: $A$ suitable for $S$ ⇒

[Example matrix $M_A(F)$ with columns $[m] = \{1, \dots, 7\}$; nonzero entries per row: $A(x_1)$: 8, 4, 5; $A(x_2)$: 6, 9, 2; $A(x_3)$: 9, 8, 5; $A(x_4)$: 7, 7, 4; $A(x_5)$: 5, 2, 7.]

If we replace the 1’s in $M_A$ with random elements of $F \setminus \{0\}$ (finite field $F$), yielding $M_A(F)$, then $M_A(F)$ has full row rank with probability $\ge 1 - n/|F|$ (Schwartz-Zippel Theorem).
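The effect of randomizing the entries can be tried out numerically. The sketch below is an illustration under stated assumptions: the field is GF(p) for a prime p, slots are 0-based, and the helper names `rank_mod_p` and `random_matrix` are hypothetical. It fills the positions given by the sets A(x_i) with random nonzero field elements and computes the rank by Gaussian elimination.

```python
import random

def rank_mod_p(rows, p):
    """Rank of an integer matrix over GF(p), p prime (needs Python 3.8+ for pow)."""
    rows = [[v % p for v in r] for r in rows]
    ncols = len(rows[0]) if rows else 0
    rank = 0
    for col in range(ncols):
        # Find a row at or below `rank` with a nonzero entry in this column.
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col]), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        inv = pow(rows[rank][col], -1, p)  # modular inverse of the pivot
        rows[rank] = [(v * inv) % p for v in rows[rank]]
        for r in range(len(rows)):
            if r != rank and rows[r][col]:
                factor = rows[r][col]
                rows[r] = [(a - factor * b) % p
                           for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank

def random_matrix(choices, m, p, rng):
    """Matrix with a random element of GF(p) minus {0} at each slot of A(x_i)."""
    M = [[0] * m for _ in choices]
    for i, slots in enumerate(choices):
        for s in slots:
            M[i][s] = rng.randrange(1, p)
    return M
```

With a suitable choice such as `[[0, 1, 2], [1, 3, 4], [2, 4, 5], [0, 5, 6], [3, 4, 6]]` on `m = 7` slots and a large prime, `rank_mod_p(random_matrix(...), p)` is expected to equal 5 for almost every random draw. If A is not suitable, the rank stays below n regardless of the draw.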


slide-122
SLIDE 122

Multiple-choice hashing and retrieval structures

Resulting retrieval function ($g_1(x), \dots, g_k(x)$ are random hash values, calculate in $F$):

$f(x) = \sum_{1 \le \ell \le k} g_\ell(x) \cdot T[h_\ell(x)]$

The splitting trick is applicable if $|R| = |F| \ge n^{\delta}$. Construction time: $O(n^3)$ or $O(n^{1+\delta})$.
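The retrieval function can be sketched end to end. The Python code below is a small illustration, not the authors' implementation. It assumes the prime field GF(257), uses SHA-256 with a seed as a stand-in for the random hash functions h_ℓ and g_ℓ, solves the linear system by plain cubic-time Gaussian elimination (no splitting trick), and retries with a new seed if the system is unsolvable. All function names are hypothetical.

```python
import hashlib

P = 257  # assumed prime field size; stored values must lie in range(P)

def _hash(key, seed, tag, modulus):
    digest = hashlib.sha256(f"{seed}|{tag}|{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % modulus

def probes(key, seed, k, m):
    """Pairs (h_l(key), g_l(key)) with slot in range(m) and nonzero coefficient."""
    return [(_hash(key, seed, f"h{l}", m), 1 + _hash(key, seed, f"g{l}", P - 1))
            for l in range(k)]

def query(T, key, seed, k=3):
    """Evaluate the sum of g_l(key) * T[h_l(key)] in GF(P)."""
    return sum(g * T[h] for h, g in probes(key, seed, k, len(T))) % P

def solve(rows, rhs, m):
    """Solve rows * T = rhs over GF(P); return T, or None if inconsistent."""
    aug = [r + [b] for r, b in zip(rows, rhs)]
    pivots, rank = [], 0
    for col in range(m):
        piv = next((r for r in range(rank, len(aug)) if aug[r][col]), None)
        if piv is None:
            continue
        aug[rank], aug[piv] = aug[piv], aug[rank]
        inv = pow(aug[rank][col], -1, P)
        aug[rank] = [(v * inv) % P for v in aug[rank]]
        for r in range(len(aug)):
            if r != rank and aug[r][col]:
                fac = aug[r][col]
                aug[r] = [(a - fac * b) % P for a, b in zip(aug[r], aug[rank])]
        pivots.append(col)
        rank += 1
    if any(row[m] for row in aug[rank:]):
        return None  # some equation reduced to 0 = nonzero
    T = [0] * m      # free variables are set to 0
    for r, col in enumerate(pivots):
        T[col] = aug[r][m]
    return T

def build(f, m, k=3, max_tries=100):
    """Find a seed and table T with query(T, x, seed, k) == f[x] for all keys x."""
    keys = list(f)
    for seed in range(max_tries):
        rows = []
        for x in keys:
            row = [0] * m
            for h, g in probes(x, seed, k, m):
                row[h] = (row[h] + g) % P
            rows.append(row)
        T = solve(rows, [f[x] % P for x in keys], m)
        if T is not None:
            return T, seed
    raise RuntimeError("no solvable choice of hash functions found")
```

Usage: `T, seed = build({"Adam": 0, "Alice": 1, "Bob": 1}, m=5)` followed by `query(T, "Alice", seed)` returns the stored value for keys in S. For a key outside S the result is an arbitrary field element, which matches the behaviour expected of a retrieval structure.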


slide-125
SLIDE 125

Multiple-choice hashing and retrieval structures

Blocked cuckoo hashing:

[Figure: key x is hashed by h1 and h2 to two blocks of T; block size k = 4.]

Dictionaries that access T only in two intervals yield retrieval structures with cache-friendly access.

[D., Weidling 2007] Read 2 × k locations, space $(1 + e^{-ck})nr$.

[Czumaj et al. 2003] Read 2 × O(log n) locations, space $nr$.
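The access pattern of a blocked scheme can be illustrated with a short Python sketch. It assumes that T is divided into aligned blocks of k consecutive cells and that SHA-256 with a seed stands in for the hash functions h1 and h2. The cited schemes may place their intervals differently, and the name `block_probes` is hypothetical.

```python
import hashlib

def block_probes(key, num_blocks, k, seed=0):
    """Return the two intervals of T (as ranges of cell indices) read for `key`."""
    def h(tag):
        digest = hashlib.sha256(f"{seed}|{tag}|{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % num_blocks

    b1, b2 = h("h1"), h("h2")
    # Each block covers k consecutive cells, so a query reads 2 * k cells
    # located in at most two contiguous memory regions.
    return range(b1 * k, (b1 + 1) * k), range(b2 * k, (b2 + 1) * k)
```

With `k = 4`, a query touches two runs of four adjacent cells, which is the source of the cache-friendliness compared with k scattered probes.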


slide-126
SLIDE 126

Summary

  • Retrieval in time $O(k)$, space $(1 + e^{-k})nr$.
  • Construction time $O(n^{1+\delta})$ (fair) or $O(n)$ (contrived, asymptotic)
  • Close connection retrieval ↔ approximate membership
  • Close connection retrieval ↔ hash table à la cuckoo
  • Cache-friendly retrieval/approximate membership structures with almost optimal space (range size $\ge n^{\delta}$ or error bound $\le 1/n^{\delta}$).


slide-127
SLIDE 127

Thank you!
