Succinct Data Structures for Retrieval and Approximate Membership

Martin Dietzfelbinger, Technische Universität Ilmenau. Joint work with Rasmus Pagh. Dagstuhl, February 18, 2008.


  1. Basic approach: Hash-read-add
     If $M_A$ has full row rank, then the system
     $$M_A \cdot \begin{pmatrix} a_0 \\ \vdots \\ a_{m-1} \end{pmatrix} = \begin{pmatrix} f(x_1) \\ \vdots \\ f(x_n) \end{pmatrix}$$
     has a solution $(a_0, \dots, a_{m-1})^T$ (goes into $T[0..m-1]$).
     If $r = 1$: linear algebra in $\mathbb{Z}_2$.
     If $r \ge 2$: work in parallel in the components.
     Alternative: $R \subseteq \mathrm{GF}(q)$, any finite field (like $\mathbb{Z}_p$); calculate in $\mathrm{GF}(q)$.
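
The hash-read-add idea can be sketched in Python as follows. This is a minimal illustration over $\mathbb{Z}_2$, not the authors' implementation: `positions` is a hypothetical SHA-256-based stand-in for the hash functions that define $A(x)$, and `build` and `query` are illustrative names. Values are $r$-bit integers, so XOR on integers works in parallel in the components.

```python
import hashlib

def positions(key, m, k, seed=0):
    """Hypothetical stand-in for the hash functions: k distinct cells in [0, m)."""
    out, i = [], 0
    while len(out) < k:
        digest = hashlib.sha256(f"{seed}:{i}:{key}".encode()).digest()
        p = int.from_bytes(digest[:8], "big") % m
        if p not in out:
            out.append(p)
        i += 1
    return out

def build(f, m, k, seed=0):
    """Solve M_A * a = f over Z_2 by Gaussian elimination.

    f maps keys to r-bit integers. Returns the table T[0..m-1],
    or None if the system is inconsistent for this seed.
    """
    pivots = {}  # leading column -> (row as bitmask, right-hand side)
    for key, value in f.items():
        row = sum(1 << p for p in positions(key, m, k, seed))
        rhs = value
        while row:
            col = row.bit_length() - 1      # leading column of this row
            if col not in pivots:
                pivots[col] = (row, rhs)
                break
            prow, prhs = pivots[col]
            row ^= prow                     # eliminate the leading column
            rhs ^= prhs
        else:
            if rhs:                         # 0 = nonzero: no solution
                return None
    T = [0] * m                             # free variables stay 0
    for col in sorted(pivots):              # lower columns are final first
        row, val = pivots[col]
        rest = row ^ (1 << col)
        while rest:
            c = rest.bit_length() - 1
            val ^= T[c]
            rest ^= 1 << c
        T[col] = val
    return T

def query(T, key, k, seed=0):
    """Hash, read k cells, add (XOR) them."""
    acc = 0
    for p in positions(key, len(T), k, seed):
        acc ^= T[p]
    return acc
```

If `build` returns `None`, a caller would typically retry with a different seed.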

  5. Previous work
     [1] [Seiden, Hirschberg 1994] “Ordered perfect hashing” (special case with $f(x_i) = i - 1$). Propose the hash-read-add scheme; experiments, no analysis.
     [2] [Majewski, Wormald, Havas, Czech 1996] (Ordered) perfect hashing, $r \ge 1$, $m = O(n)$. Analysis via random (hyper)graph theory.
     [3] [Chazelle, Kilian, Rubinfeld, Tal 2004] Implicit in work on the “Bloomier filter”, $r \ge 1$, $m = O(n)$. Ad-hoc analysis.

  9. Previous work
     In [2] + [3]: Use the sufficient condition “hypergraph $G_A$ (with hyperedges $A(x_i)$) is acyclic”.
     Equivalent: Each nonempty subset of the $A(x_i)$’s covers at least one node only once.
     Equivalent: $M_A$ can be brought into echelon form by row and column exchanges.

  12. Previous work

      [m]       7  3  1  6  0  2  4  5
      A(x_1):   1  0  1  0  0  0  1  0
      A(x_4):   0  1  1  0  0  1  0  0
      A(x_3):   0  0  1  1  0  0  1  0
      A(x_2):   0  0  0  1  1  1  0  0
      A(x_5):   0  0  0  0  1  1  0  1

      Thresholds (acyclicity whp if $m \ge (1+\gamma) n > \gamma_k n$):

      k      2   3      4      5      6      asympt.
      γ_k    2   1.222  1.295  1.425  1.570  ln k

      Advantage: Solve the linear system in time $O(nk)$.
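
The $O(nk)$ solve for an acyclic hypergraph can be sketched as a peeling procedure. This is a hedged illustration, not code from the cited papers: `peel` and `assign` are hypothetical names, and the usual peeling strategy stands in for "bring $M_A$ into echelon form". A key is peeled when one of its nodes is covered by no other remaining key, and the table is then filled in reverse peel order.

```python
from collections import defaultdict

def peel(edges):
    """edges: key -> tuple of distinct nodes (the set A(x)).

    Returns a list of (key, free_node) in peel order, or None if peeling
    gets stuck (some subset of edges covers every one of its nodes twice).
    """
    incident = defaultdict(set)
    for key, nodes in edges.items():
        for v in nodes:
            incident[v].add(key)
    stack = [v for v, ks in incident.items() if len(ks) == 1]
    order = []
    while stack:
        v = stack.pop()
        if len(incident[v]) != 1:       # stale entry, edge already peeled
            continue
        (key,) = incident[v]
        order.append((key, v))
        for u in edges[key]:            # remove the edge from all its nodes
            incident[u].discard(key)
            if len(incident[u]) == 1:
                stack.append(u)
    return order if len(order) == len(edges) else None

def assign(edges, f, m, order):
    """Fill T in reverse peel order so that the XOR over A(x) equals f(x)."""
    T = [0] * m
    for key, free in reversed(order):
        acc = f[key]
        for u in edges[key]:
            if u != free:
                acc ^= T[u]
        T[free] = acc
    return T
```

On the five hyperedges of the example matrix above, `peel` succeeds, and each key touches only its $k$ nodes a constant number of times.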

  16. New in this context: Calkin’s theorem
      Theorem [Calkin 1997]: There are constants $\beta_k < 1$, $k = 3, 4, \dots$, with the following properties:
      • If $(1+\gamma) > \beta_k^{-1}$ and $m \ge (1+\gamma) n$ and $M_A$ is as before, then $M_A$ has full row rank with probability $1 - 1/n^{\varepsilon}$.
      • $\beta_k^{-1} \approx 1 + e^{-k}/\ln 2$, for growing $k$.

      Thresholds:

      k          2   3       4      5      6       asympt.
      β_k^{-1}   2   1.1243  1.034  1.011  1.0038  1 + e^{-k}/ln 2
      γ_k        2   1.222   1.295  1.425  1.570   ln k
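
To see how quickly the asymptotic expression approaches the tabulated thresholds, one can evaluate $1 + e^{-k}/\ln 2$ next to the table values. The snippet below is only a numeric illustration; `beta_inv_approx` is a hypothetical helper name, and the dictionary copies the $\beta_k^{-1}$ row of the table above.

```python
import math

# Values of beta_k^{-1} as listed in the threshold table of the slide.
BETA_INV_TABLE = {3: 1.1243, 4: 1.034, 5: 1.011, 6: 1.0038}

def beta_inv_approx(k):
    """Asymptotic approximation 1 + e^{-k} / ln 2 from Calkin's theorem."""
    return 1 + math.exp(-k) / math.log(2)

def compare():
    """Return (k, table value, approximation) triples for k = 3..6."""
    return [(k, BETA_INV_TABLE[k], beta_inv_approx(k)) for k in sorted(BETA_INV_TABLE)]

if __name__ == "__main__":
    for k, exact, approx in compare():
        print(f"k={k}: table {exact:.4f}, approximation {approx:.4f}")
```

The approximation is stated for growing $k$, so the gap to the table value is largest at $k = 3$ and shrinks for larger $k$.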

  21. New retrieval structures
      Theorem 1: A retrieval data structure using the hash-read-add scheme with $k$ accesses for a query and space $(1+\gamma) r n$ bits can be built if $1 + \gamma > \beta_k^{-1} \approx 1 + e^{-k}/\ln 2$. Construction time: $O(n^3)$.
      Proof: Existence: Calkin. Construction: Solve a linear system.
      Theorem 2: ... same ... Construction time $O(n^{1+\delta})$.
      Proof: Theorem 1 plus “splitting”.

  22. Construction time $O(n^{1+\delta})$
      [Figure: $h_{\mathrm{split}}$ splits $S \subseteq U$ into $t = n^{1-\delta/2}$ chunks $S_0, S_1, S_2, \dots$; table $T$.]

  26. Construction time $O(n^{1+\delta})$
      Construction time $O(n^{1+\delta})$, for $\delta > 0$ constant: “Split”.
      Introduce an extra first level of hashing, using a hash function $h_{\mathrm{split}}: U \to [t]$. It splits $S$ into $t = n^{1-\delta/2}$ chunks $S_0, \dots, S_{t-1}$ of size $O(n^{\delta/2})$.
      Construct a separate retrieval data structure for each of the chunks: construction time $t \cdot O((n^{\delta/2})^3) = O(n^{1+\delta})$. Extra space: $o(n)$.
      Retrieval for $y$: access the retrieval data structure for $S_{h_{\mathrm{split}}(y)}$.
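
The exponent arithmetic behind the splitting bound can be written out as follows, assuming (as the slide states) that every chunk has size $O(n^{\delta/2})$ and that each chunk is solved in cubic time as in Theorem 1:

```latex
\[
  t \cdot O\!\bigl((n^{\delta/2})^{3}\bigr)
  = n^{1-\delta/2} \cdot O\!\bigl(n^{3\delta/2}\bigr)
  = O\!\bigl(n^{\,1 - \delta/2 + 3\delta/2}\bigr)
  = O\!\bigl(n^{1+\delta}\bigr).
\]
```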

  32. Approximate membership
      $x \in S$ → answer “yes”; $x \notin S$ → $\Pr(\text{answer “no”}) \ge 1 - 2^{-r}$.
      Standard implementation: Bloom filters. Space ≈ $nr/\ln 2$.
      Can show ([Carter et al. 1978]): Need ≥ $nr - O(1)$ bits.
      Or (folklore): Use (minimal) perfect hashing to store an $r$-bit fingerprint for $x \in S$. Space: $nr + O(n)$ bits.

  40. Approximate membership
      General construction: Assume any algorithm for a retrieval structure for range $R = \{0,1\}^r$ plus one fully random hash function $q: U \to R$ (fingerprint).
      Given $S$, build the retrieval structure $D_{\mathrm{retr}}$ for $f = q|_S$.
      On query $y$: Retrieve $s = D_{\mathrm{retr}}(y)$; answer “yes” if $q(y) = s$, “no” otherwise.
      Performance: Error probability ≤ $2^{-r}$.
      New: Space $(1 + e^{-k}) n r$ bits, evaluation time $O(k)$. Construction time $O(n^3)$ resp. $O(n^{1+\delta})$.
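
The reduction from approximate membership to retrieval can be sketched as below. Everything named here is an assumption made for illustration: `fingerprint` uses SHA-256 as a stand-in for the fully random function $q$, and `DictRetrieval` is a deliberately non-succinct placeholder for "any retrieval structure" (it is correct on $S$ and returns an arbitrary value, here 0, elsewhere). A space-efficient version would plug in a hash-read-add table instead.

```python
import hashlib

def fingerprint(key, r):
    """Stand-in for the fully random hash function q: U -> {0,1}^r."""
    digest = hashlib.sha256(("fp:" + key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % (1 << r)

class DictRetrieval:
    """Placeholder retrieval structure (NOT succinct, illustration only).

    Returns f(x) for keys in S and an arbitrary value (0) for other keys,
    which is all the reduction requires of a retrieval structure.
    """
    def __init__(self, f):
        self._f = dict(f)

    def __call__(self, key):
        return self._f.get(key, 0)

class ApproxMembership:
    """Approximate membership built from retrieval plus fingerprints."""
    def __init__(self, S, r=8, retrieval=DictRetrieval):
        self.r = r
        # Build D_retr for f = q restricted to S.
        self.d_retr = retrieval({x: fingerprint(x, r) for x in S})

    def __contains__(self, y):
        # Answer "yes" exactly when the retrieved value equals q(y).
        return self.d_retr(y) == fingerprint(y, self.r)
```

Keys in $S$ always answer "yes"; a key outside $S$ answers "yes" only when its fingerprint happens to match the retrieved value.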

  43. Retrieval: Construction in linear time
      Theoretical construction, works for very large $n$.
      Extra level of hashing: $h_{\mathrm{split}}: U \to [t]$.
      [Figure: $h_{\mathrm{split}}$ splits $S \subseteq U$ into $t = n/b$ chunks $S_0, S_1, S_2, \dots$; table $T$.]

  52. Construction in linear time
      Chunk size: $b = \frac{1}{2}\sqrt{\log n}$. Number of chunks: $t = n/b$.
      $h_{\mathrm{split}}$ splits $S$ into chunks $S_0, \dots, S_{t-1}$ of expected size $b$.
      Construct a separate retrieval data structure for each chunk: allocate space $b' = (1+\gamma) b$ for each chunk, $1 + \gamma > \beta_k^{-1}$.
      Construction time $O(b)$! Why? Can arrange that the matrix size $((1+\gamma) b)^2$ is $< \frac{1}{2}\log n$. Use table lookup for the linear algebra.
      Total construction time: $O(n)$. Total space: $(1+\gamma) n r + o(n)$ bits.
      Done?
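
A short check of the matrix-size claim, under two assumptions that the slide does not spell out: logarithms are base 2, and the matrices are binary (the case $r = 1$). With $b = \frac{1}{2}\sqrt{\log n}$, the bound holds whenever $1 + \gamma < \sqrt{2}$, which is compatible with $1 + \gamma > \beta_k^{-1}$ for the tabulated values with $k \ge 3$ (for example $\beta_3^{-1} \approx 1.1243$). The second line counts how many distinct matrices of a fixed shape a lookup table would then have to cover.

```latex
\[
  \bigl((1+\gamma)\,b\bigr)^{2}
  = (1+\gamma)^{2} \cdot \tfrac{1}{4}\log n
  < \tfrac{1}{2}\log n
  \iff (1+\gamma)^{2} < 2
  \iff 1+\gamma < \sqrt{2} \approx 1.414 .
\]
\[
  \#\{\text{binary matrices of a fixed shape with fewer than } \tfrac{1}{2}\log_2 n \text{ entries}\}
  \le 2^{\frac{1}{2}\log_2 n} = \sqrt{n} .
\]
```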

  55. Construction in linear time
      No! “Bad chunks”: (1) overflow, (2) construction fails.
      [Figure: $h_{\mathrm{split}}$ splits $S \subseteq U$ into $t = n/b$ chunks $S_0, S_1, S_2, \dots$; one segment of table $T$ is marked “bad”.]

  58. Construction in linear time
      Nice: #(keys in bad chunks) = $o(n)$.
      The (keys from) bad chunks are accommodated in a secondary structure with table $T'[0..o(n)]$, hash functions $h'_1, \dots, h'_3$, with $o(n)$ construction time (e.g. [Chazelle et al. 2004]).
      Flag: $v[i] = 0$ if chunk $i$ is bad and $v[i] = 1$ otherwise.

  61. Construction in linear time
      Segment for chunk $S_i$ in $T$: $T[d_i .. d_{i+1} - 1]$.
      Lookup operation: $i \leftarrow h_{\mathrm{split}}(x)$, then
      $$f(x) = v[i] \cdot \bigoplus_{1 \le \ell \le k} T[h_\ell(x) + d_i] \;\oplus\; \bar{v}[i] \cdot \bigoplus_{1 \le \ell \le 3} T'[h'_\ell(x)],$$
      where $\bar{v}[i] = 1 - v[i]$ selects the secondary structure for bad chunks.
      Time $O(k)$. Nonadaptive reads!
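
The lookup formula can be sketched in Python as below. The function and parameter names are illustrative assumptions: `hs` and `hs_sec` are lists of callables standing in for $h_1, \dots, h_k$ and $h'_1, \dots, h'_3$, with each $h_\ell$ assumed to return an offset inside the chunk's segment. All cells are read regardless of the flag, which mirrors the nonadaptive reads.

```python
def lookup(x, T, T_sec, v, d, h_split, hs, hs_sec):
    """Evaluate f(x) from the primary table T and the secondary table T_sec.

    v[i] is 1 for a good chunk and 0 for a bad one; d[i] is the start of
    chunk i's segment in T.
    """
    i = h_split(x)
    primary = 0
    for h in hs:                 # k reads inside the segment of chunk i
        primary ^= T[d[i] + h(x)]
    secondary = 0
    for h in hs_sec:             # 3 reads in the secondary table
        secondary ^= T_sec[h(x)]
    # The flag selects which XOR sum is the answer; both were read anyway.
    return (v[i] * primary) ^ ((1 - v[i]) * secondary)
```

Multiplying by the flag instead of branching keeps the set of accessed cells independent of the table contents.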

  64. Almost optimal space, logarithmic evaluation time
      Use random sets $A(x) = \{h_1(x), \dots, h_{k(x)}(x)\} \subseteq [n]$, with $E(k(x)) = \Theta(\log n)$, binomially distributed.
      [Cooper, 1999] ⇒ $\Pr(M_A \text{ is regular}) \ge 0.28$.
