Compressed Indexes for Fast Search of Semantic Data


SLIDE 1

Compressed Indexes for Fast Search of Semantic Data

Raffaele Perego

ISTI-CNR
 Pisa, Italy

Giulio Ermanno Pibiri

ISTI-CNR
 Pisa, Italy

Rossano Venturini

The University of Pisa
 Pisa, Italy

The 10th Italian Information Retrieval Workshop (IIR 2019), 17/09/2019

SLIDE 3

Resource Description Framework (RDF)

“RDF is a standard model for data interchange on the Web.”
Source: https://www.w3.org/RDF

Statements are encoded as triples: Subject (S) - Predicate (P) - Object (O)

<http://example.name#BobSmith12> <http://xmlns.com/foaf/0.1/knows> <http://example.name#JohnDoe34>

“Bob Smith knows John Doe.”
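
As a minimal sketch (not taken from the slides), the statement above can be split into its three components. The name `parse_triple` is hypothetical, and the regular expression assumes that all three terms are URIs in angle brackets; literals and blank nodes are not handled.

```python
import re

# Three <...> terms separated by whitespace, with an optional trailing dot.
TRIPLE_RE = re.compile(r"^\s*<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.?\s*$")

def parse_triple(line):
    """Split a URI-only statement into (subject, predicate, object)."""
    match = TRIPLE_RE.match(line)
    if match is None:
        raise ValueError(f"not a URI-only triple: {line!r}")
    return match.groups()

statement = ("<http://example.name#BobSmith12> "
             "<http://xmlns.com/foaf/0.1/knows> "
             "<http://example.name#JohnDoe34>")
s, p, o = parse_triple(statement)
```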

SLIDE 6

The problem

Huge datasets: billions of triples.
Storage space is an issue: compression is mandatory.

How to support triple selection patterns (with wildcards) efficiently?

Examples:

  • <Bob Smith> <knows> <???>

  • <???> <???> <John Doe>

  • <Bob Smith> <???> <Sara Parker>

Selection patterns by number of wildcards:

  • 0 wildcards: SPO

  • 1 wildcard: SP?, S?O, ?PO

  • 2 wildcards: S??, ?P?, ??O

  • 3 wildcards: ???
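
To make the semantics of the eight patterns concrete, here is a naive baseline sketch that scans every triple. It is an illustration only, not the index presented in this talk: the names `matches` and `select` and the toy data are assumptions, and `None` stands for the wildcard.

```python
def matches(triple, pattern):
    """True if every non-wildcard component of the pattern equals the triple's."""
    return all(want is None or want == got for want, got in zip(pattern, triple))

def select(triples, pattern):
    """Linear scan: correct for all eight patterns, but too slow at scale."""
    return [t for t in triples if matches(t, pattern)]

# Toy data, invented for the example.
triples = [
    ("Bob Smith", "knows", "John Doe"),
    ("Bob Smith", "knows", "Sara Parker"),
    ("Sara Parker", "knows", "John Doe"),
]
sp_wild = select(triples, ("Bob Smith", "knows", None))          # SP?
wild_o = select(triples, (None, None, "John Doe"))               # ??O
s_wild_o = select(triples, ("Bob Smith", None, "Sara Parker"))   # S?O
```

A scan touches every triple for every query, which is what an index over billions of triples has to avoid.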

SLIDE 7

State-of-the-art solutions

Existing solutions are too costly in terms of space. They:

  • Materialize all possible S-P-O permutations (6 separate indexes).

  • Do not use sophisticated compression techniques.

  • Require expensive additional indexes to support retrieval.

SLIDE 12

The Permuted Trie Index: preliminaries

Map URI strings to integers to reduce space requirements: we deal with datasets of integer triples.

Selection patterns, grouped by the permutation whose prefix resolves them:

  • S-P-O order: SPO, SP?, S??, ???

  • P-O-S order: ?PO, ?P?

  • O-S-P order: S?O, ??O

Store an integer trie data structure for each permutation.
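
The two preliminary steps, mapping strings to integers and sorting the triples in each of the three orders, can be sketched as follows. This is a simplified illustration under stated assumptions: one dictionary per component, IDs assigned in sorted string order, and the helper names `build_dictionary`, `encode` and `permute` are all hypothetical.

```python
def build_dictionary(strings):
    """Assign consecutive integer IDs to the distinct strings, in sorted order."""
    return {s: i for i, s in enumerate(sorted(set(strings)))}

def encode(triples):
    """Map string triples to integer triples, one dictionary per component."""
    s_ids = build_dictionary(t[0] for t in triples)
    p_ids = build_dictionary(t[1] for t in triples)
    o_ids = build_dictionary(t[2] for t in triples)
    return [(s_ids[s], p_ids[p], o_ids[o]) for s, p, o in triples]

def permute(int_triples, order):
    """Reorder each (s, p, o) according to `order`, e.g. "pos", then sort."""
    index = {"s": 0, "p": 1, "o": 2}
    return sorted(tuple(t[index[c]] for c in order) for t in int_triples)

# Toy data, invented for the example.
raw = [
    ("bob", "knows", "john"),
    ("bob", "knows", "sara"),
    ("sara", "likes", "john"),
]
ints = encode(raw)
spo = permute(ints, "spo")
pos = permute(ints, "pos")
osp = permute(ints, "osp")
```

Each sorted list is the input from which one trie would be built.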

SLIDE 19

The Permuted Trie Index: organisation

  • Common prefixes are encoded once.

  • Two integer sequences per level (nodes and pointers).

  • Symmetrically support all selection patterns with 1 and 2 wildcards.

  • Cache-friendly memory layout.

Allows effective compression and fast retrieval.

Example: the pattern (1, 2, ?) matches the triples (1, 2, 0) and (1, 2, 1).
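
The level-wise layout can be sketched with plain Python lists. This is an uncompressed illustration, not the authors' C++ implementation: it assumes the input triples are sorted and distinct, and the names `build_trie`, `find` and `select_prefix2` are hypothetical. The toy data is invented, except that it contains the two triples from the example above.

```python
from bisect import bisect_left

def build_trie(sorted_triples):
    """Levels 1 and 2 store (nodes, pointers); level 3 stores only nodes.

    pointers[i] and pointers[i + 1] delimit the children of node i
    in the next level.
    """
    n1, p1, n2, p2, n3 = [], [], [], [], []
    prev = None
    for a, b, c in sorted_triples:
        if prev is None or a != prev[0]:
            n1.append(a)            # new first-level node
            p1.append(len(n2))
        if prev is None or (a, b) != prev[:2]:
            n2.append(b)            # new second-level node (prefix encoded once)
            p2.append(len(n3))
        n3.append(c)
        prev = (a, b, c)
    p1.append(len(n2))              # sentinels closing the last ranges
    p2.append(len(n3))
    return n1, p1, n2, p2, n3

def find(nodes, lo, hi, value):
    """Binary search for value in the sorted range nodes[lo:hi]; -1 if absent."""
    i = bisect_left(nodes, value, lo, hi)
    return i if i < hi and nodes[i] == value else -1

def select_prefix2(trie, a, b):
    """Resolve a pattern (a, b, ?) by walking the first two levels."""
    n1, p1, n2, p2, n3 = trie
    i = find(n1, 0, len(n1), a)
    if i < 0:
        return []
    j = find(n2, p1[i], p1[i + 1], b)
    if j < 0:
        return []
    return [(a, b, c) for c in n3[p2[j]:p2[j + 1]]]

trie = build_trie([(0, 1, 3), (1, 2, 0), (1, 2, 1), (1, 4, 2), (2, 0, 5)])
result = select_prefix2(trie, 1, 2)
```

Because the children of a node are contiguous in the next level, the matches come out of a single sequential read, which is one way to interpret the cache-friendly layout mentioned above.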

SLIDE 20

The Permuted Trie Index: refinements

  • Cross Compression

  • Permutation Elimination

SLIDE 25

Cross Compression

Fact: the same triple appears three times, but in different permutations.

We can represent the subjects in trie 1 by using the subjects in trie 2.

  • Trie 1: P → Oi → S1 … Sj … Sn

  • Trie 2: Oi → S1 … Sj … Sn, with Sj at position p

Represent Sj as its position p.

Why? (Figure: number of children in DBpedia.)
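
A minimal sketch of the idea follows, assuming that the children of Oi in trie 2 are kept sorted and that every subject under the path P → Oi in trie 1 also occurs among them. The function names and the integer IDs are hypothetical. The presumable benefit is that a position is bounded by the number of children of Oi, which may be far smaller than a subject ID and therefore cheaper to encode.

```python
from bisect import bisect_left

def cross_compress(subjects_in_trie1, children_in_trie2):
    """Replace each subject by its position among the sorted children
    of the same object in the other trie."""
    positions = []
    for s in subjects_in_trie1:
        p = bisect_left(children_in_trie2, s)
        if p == len(children_in_trie2) or children_in_trie2[p] != s:
            raise ValueError(f"subject {s} is missing from the other trie")
        positions.append(p)
    return positions

def cross_decompress(positions, children_in_trie2):
    """Recover the original subject IDs by indexing into the other trie."""
    return [children_in_trie2[p] for p in positions]

# Hypothetical IDs: the children of object Oi in trie 2 ...
children_of_oi = [1007, 52340, 98112, 4000321]
# ... and the subset that appears under the path P -> Oi in trie 1.
subjects_under_p_oi = [52340, 4000321]

encoded = cross_compress(subjects_under_p_oi, children_of_oi)
decoded = cross_decompress(encoded, children_of_oi)
```

Decoding costs one extra access into trie 2 per subject, which is the price paid for the smaller integers.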

SLIDE 28

Permutation Elimination

Fact: predicates are few, thus S?O returns only a few matches.

We can pattern match S?O on the SPO trie, instead of the OSP trie.

Given an (s, o) pair: for each child pi of s, check if o is a child of pi. If so, then (s, pi, o) is a match.

Fewer than 6 checks are needed on average!

(Figure: number of children in DBpedia.)
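
The procedure above can be sketched as follows. For brevity the SPO trie is modelled as nested dictionaries (subject → predicate → sorted list of objects) rather than as compressed integer sequences; that representation, the function names and the toy data are assumptions made for the example.

```python
from bisect import bisect_left

def contains(sorted_values, value):
    """Binary search membership test on a sorted list."""
    i = bisect_left(sorted_values, value)
    return i < len(sorted_values) and sorted_values[i] == value

def select_s_wild_o(spo, s, o):
    """Resolve S?O on an SPO structure by trying every predicate child of s."""
    matches = []
    for p, objects in spo.get(s, {}).items():   # one check per child p_i of s
        if contains(objects, o):
            matches.append((s, p, o))
    return matches

# Toy SPO structure: subject -> predicate -> sorted objects.
spo = {
    1: {0: [2, 5], 3: [5, 9]},
    2: {0: [1]},
}
found = select_s_wild_o(spo, 1, 5)
```

The cost is one membership check per predicate of `s`, so the approach pays off exactly when subjects have few distinct predicates.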

SLIDE 32

Permutation Elimination

  • SPO trie: SPO, SP?, S??, S?O, ???

plus one of:

  • OPS trie (object-based retrieval): ?PO, ??O, ?P?

  • POS trie (predicate-based retrieval): ?PO, ??O, ?P?

We can eliminate a permutation, thus saving 1/3 of the space of the index.
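
The assignment of patterns to tries on this slide can be written as a small dispatch function. This is only a sketch of the routing step: `choose_trie` is a hypothetical name, `None` stands for a wildcard, and how each trie then answers its patterns is not shown.

```python
def choose_trie(s, p, o, second_trie="POS"):
    """Return the name of the trie that the slide lists for this pattern.

    A component set to None is a wildcard. second_trie is "POS" or "OPS",
    depending on which permutation is kept alongside SPO.
    """
    if second_trie not in ("POS", "OPS"):
        raise ValueError("second_trie must be 'POS' or 'OPS'")
    if s is not None or (p is None and o is None):
        return "SPO"        # SPO, SP?, S??, S?O and ???
    return second_trie      # ?PO, ??O and ?P?
```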

SLIDE 33

Experiments: setting

Compiler: gcc 7.2.0 (with all optimizations)
Machine: i7-7700 CPU (@3.6 GHz), 64 GB of RAM DDR3 (@2.133 GHz), Linux 4.4.0, 64 bits
Datasets

SLIDE 34

Experiments: C++ code

C++ code at https://github.com/jermp/rdf_indexes

SLIDE 35

Experiments: our solutions

Overall, 2Tp (the SPO trie plus the POS trie) offers the best space/time trade-off.

SLIDE 36

Experiments: overall comparison

Our selected trade-off configuration substantially outperforms the tested competitors in both space and time.

SLIDE 37

Conclusions

The triple indexing problem with pattern matching can be solved efficiently in both time and space. Our solution, the permuted trie index with cross compression and permutation elimination, achieves substantial performance improvements over the best previous solutions.

Paper available at https://arxiv.org/abs/1904.07619
C++ code available at https://github.com/jermp/rdf_indexes

SLIDE 38

Any questions?

Thanks for your attention, time, patience!