Dictionary Compression: Reducing Symbolic Redundancy in RDF


slide-1
SLIDE 1

Dictionary Compression

Reducing Symbolic Redundancy in RDF

Antonio Fariña, Javier D. Fernández and Miguel A. Martínez-Prieto

23RD AUGUST 2017

3rd KEYSTONE Training School Keyword search in Big Linked Data

Image: ALCÁZAR (SEGOVIA, SPAIN)

slide-2
SLIDE 2
  • Introduction
  • What is Dictionary Compression?
  • Compressed String Dictionaries
  • Some Experimental Numbers
  • RDF Dictionaries
  • Foundations
  • RDF Dictionary-based Compression
  • Dictionaries in Practice
  • Conclusions

Agenda

images: zurb.com

slide-3
SLIDE 3
  • What is Dictionary Compression?
  • Compressed String Dictionaries

Introduction

Dictionary Compression

slide-4
SLIDE 4

Dictionary compression is a simple but effective technique which replaces the occurrences of terms by identifiers which are more compact to encode and easier and more efficient to handle.

What is Dictionary Compression?

DICTIONARY COMPRESSION PAGE 4

slide-5
SLIDE 5

Dictionary compression is a simple but effective technique which replaces the occurrences of (long, variable-length) terms by (short) identifiers which are more compact to encode and easier and more efficient to handle.

  • Implementing this class of compression requires an efficient data structure configuration (dictionary) which provides, at least, two basic mapping operations:
  • locate(t) returns i if the term t is the i-th element in the dictionary.
  • extract(i) returns the i-th term (t) in the dictionary.
  • The dictionary organizes all different terms (vocabulary) in the dataset.
  • Dictionary compression has been traditionally applied for natural language processing purposes (e.g. information retrieval).

Dictionary Compression
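The two mapping operations can be illustrated with a minimal, uncompressed sketch. It assumes a sorted list with 1-based IDs; the class name `SortedDictionary` and the convention of returning 0 for an absent term are illustrative choices, not part of the slides.

```python
from bisect import bisect_left


class SortedDictionary:
    """Uncompressed baseline: distinct terms in sorted order, 1-based IDs."""

    def __init__(self, terms):
        self.terms = sorted(set(terms))

    def locate(self, term):
        """Return i if term is the i-th element, or 0 if absent (assumption)."""
        pos = bisect_left(self.terms, term)
        if pos < len(self.terms) and self.terms[pos] == term:
            return pos + 1
        return 0

    def extract(self, i):
        """Return the i-th term; IDs outside [1, n] raise IndexError."""
        if not 1 <= i <= len(self.terms):
            raise IndexError(i)
        return self.terms[i - 1]


text = "la tarara sí la tarara no la tarara niña que la he visto yo"
dictionary = SortedDictionary(text.split())
print(dictionary.locate("tarara"))  # 7
print(dictionary.extract(2))        # la
```

Binary search keeps locate logarithmic in the vocabulary size, but the strings are still stored in full, which is the memory cost that compressed dictionaries aim to reduce.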

slide-6
SLIDE 6

Dictionary Compression

… la tarara sí la tarara no la tarara niña que la he visto yo …

Dictionary (data structure):

ID  String
1   he
2   la
3   niña
4   no
5   que
6   sí
7   tarara
8   visto
9   yo

slide-7
SLIDE 7

Dictionary Compression

… 2 7 6 2 7 4 2 7 3 5 2 1 8 9 …

slide-8
SLIDE 8

Dictionary Compression

… la tarara sí la tarara no la tarara niña que la he visto yo …

The original text takes 59 bytes (59 chars * 1 byte/char).

slide-9
SLIDE 9

Dictionary Compression

… 2 7 6 2 7 4 2 7 3 5 2 1 8 9 …

The original text takes 59 bytes (59 chars * 1 byte/char).

The dictionary-compressed text takes 7 bytes (14 IDs * ⌈log2(9)⌉ bits/ID = 14 * 4 = 56 bits), plus the cost of serializing the data structure.
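The size arithmetic can be checked with a short sketch. It assumes, as the slide does, one byte per character for the original text and a fixed-width code of ⌈log2(9)⌉ bits per ID; the variable names are illustrative.

```python
from math import ceil, log2

text = "la tarara sí la tarara no la tarara niña que la he visto yo"
words = text.split()
vocabulary = sorted(set(words))                 # 9 distinct terms
ids = [vocabulary.index(w) + 1 for w in words]  # 1-based IDs

original_bytes = len(text)                      # assumes 1 byte per character
bits_per_id = ceil(log2(len(vocabulary)))       # fixed-width code for 9 IDs
compressed_bits = len(ids) * bits_per_id
compressed_bytes = ceil(compressed_bits / 8)

print(ids)
print(original_bytes, bits_per_id, compressed_bits, compressed_bytes)
```

The byte count covers only the ID sequence; the dictionary itself must also be stored, so the saving only pays off when terms repeat often enough.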

slide-10
SLIDE 10
  • Dictionary Compression is used for optimizing applications of…
  • Natural Language Processing (e.g. Information Retrieval or Machine Translation).
  • Web Graph Management.
  • Triplestores (e.g. RDF-3X) and other semantic tools (e.g. HDT).
  • NoSQL databases.
  • Bioinformatics search engines.
  • Internet Routing.
  • Geographic Information Systems.
  • …

Dictionary Compression

slide-11
SLIDE 11
  • Dictionaries have been traditionally implemented using well-known data structures:
  • Hash tables or tries for resolving locate queries.
  • Arrays for resolving extract queries.
  • These solutions are efficient, but require too much memory to be used in practical scenarios.

Data Structures

slide-12
SLIDE 12
  • Data sets are increasingly bigger and more varied:
  • Vocabularies are also larger and comprise more heterogeneous terms.
  • The dictionary size is a bottleneck for applications running under restrictions of main memory.
  • The resulting dictionary data structure is very large and does not scale for efficient in-memory management:
  • Dictionary management is becoming a scalability issue by itself and it must be optimized for Big Data scenarios.
  • Preconditions:
  • Dictionaries are static (they are rebuilt from scratch when the vocabulary changes).
  • Dictionaries are cached in main memory.

The Problem…

slide-13
SLIDE 13

Compressed String Dictionaries are a particular class of compact data structures optimized for dealing with string vocabularies from different domains.

Compressed String Dictionaries

slide-14
SLIDE 14
  • Innovative compressed string dictionaries are proposed for managing big vocabularies in main memory:
  • Traditional dictionaries are revisited for optimizing their memory footprint.
  • Existing compact data structures are tuned to perform as dictionaries.
  • New compact data structures have been designed as compressed string dictionaries.
  • All these techniques ensure efficient in-memory query resolution:
  • locate and extract are resolved at microsecond level.
  • New interesting queries are also supported by these techniques:
  • Prefix-based queries retrieve IDs / terms matching a given prefix.
  • Substring-based queries retrieve IDs / terms matching a given substring.

The Solutions…

slide-15
SLIDE 15

Queries

ID  String
1   he
2   la
3   niña
4   no
5   que
6   sí
7   tarara
8   visto
9   yo

  • locate(“tarara”) = 7
  • extract(2) = “la”
  • locatePrefix(“n”) = {3,4}
  • extractPrefix(“n”) = {“niña”,“no”}
  • locateSubstring(“a”) = {2,3,7}
  • extractSubstring(“a”) = {“la”,“niña”,“tarara”}
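The prefix and substring queries can be sketched over a plain sorted list. This is only an illustration of their semantics: the function names are hypothetical, and the linear scan used for substrings stands in for the self-indexes that compressed dictionaries rely on.

```python
from bisect import bisect_left

# Vocabulary in sorted order; the ID of a term is its 1-based position.
terms = ["he", "la", "niña", "no", "que", "sí", "tarara", "visto", "yo"]


def locate_prefix(prefix):
    """IDs of all terms starting with prefix (a contiguous range when sorted)."""
    lo = bisect_left(terms, prefix)
    hi = lo
    while hi < len(terms) and terms[hi].startswith(prefix):
        hi += 1
    return list(range(lo + 1, hi + 1))


def extract_prefix(prefix):
    """Terms starting with prefix."""
    return [terms[i - 1] for i in locate_prefix(prefix)]


def locate_substring(sub):
    """IDs of all terms containing sub (naive linear scan)."""
    return [i for i, term in enumerate(terms, start=1) if sub in term]


def extract_substring(sub):
    """Terms containing sub."""
    return [terms[i - 1] for i in locate_substring(sub)]


print(locate_prefix("n"), extract_prefix("n"))
print(locate_substring("a"), extract_substring("a"))
```

Because the vocabulary is sorted, every prefix query maps to one contiguous ID range, which is why prefix search is cheap on front-coded dictionaries. Substring matches can fall anywhere in the vocabulary.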

slide-16
SLIDE 16
  • Compressed Hash:
  • The hash table is simulated using bitmaps.
  • Strings are stored in compressed form (Huffman/Re-Pair).
  • locate / extract operations are implemented using rank / select.
  • Differential Front-Coding Compression:
  • Front-Coding exploits that consecutive strings (in the vocabulary) are likely to share a common prefix.
  • Plain Front-Coding dictionaries use byte-oriented compression.
  • Compressed Front-Coding dictionaries combine Hu-Tucker and Huffman/Re-Pair compression.
  • Primitive and prefix-based operations are implemented using binary search and efficient sequential decoding.
  • Self-Indexes:
  • The FM-Index is adapted to perform as a dictionary and the XBW introduces a self-indexed trie.
  • All operations are implemented exploiting the BWT features.

Techniques for Compressing Dictionaries

slide-17
SLIDE 17

More Details…

slide-18
SLIDE 18

Compressed String Dictionaries answer queries at the level of microseconds, while compressing vocabularies up to 20 times.

Some Experimental Numbers

slide-19
SLIDE 19
  • We analyze compression effectiveness and retrieval speed:
  • locate, extract.
  • Prefix-based operations (URIs)
  • Substring-based operations (Literals).
  • In practice, extract is the most important query:
  • It is used many times as results are retrieved from the compressed dataset.
  • 26,948,638 URIs from Uniprot:
  • Average length: 51.04 chars per URI.
  • Highly-repetitive.
  • 27,592,013 Literals from DBpedia:
  • Average length: 60.45 chars per Literal.

Experimental Setup

slide-20
SLIDE 20

Locate / Extract Performance (URIs)

PFC is the fastest choice for locate/extract…

  • locate ≈ 1.6 μs/string.
  • extract ≈ 0.3-0.6 μs/ID.
  • …but requires more space:
  • ≈ 9 − 19 % of the original space.

HTFC (compressed Front-Coding) reports the most balanced space/time tradeoffs:

  • locate ≈ 2.2-3 μs/string.
  • extract ≈ 0.7-1.6 μs/ID.
  • ≈ 5 − 13 % of the original space.
slide-21
SLIDE 21

Locate / Extract Performance (Literals)

HTFC reports the best compression ratios, but its performance is less competitive:

  • locate ≈ 2-2.5 μs/string .
  • extract > 2.5 μs/ID.
  • ≈ 12 % of the original space.

HashDAC-rp (compressed Hashing) reports the best tradeoffs:

  • locate ≈ 1.5 μs/string .
  • extract ≈ 1μs/ID.
  • ≈ 15 % of the original space.
slide-22
SLIDE 22

Domain Entity Retrieval (URIs)

PFC is the best choice for prefix-based operations:

  • Although it uses more space than the other approaches.
slide-23
SLIDE 23

Full-Text Search (Literals)

Self-index based dictionaries are the only ones providing full-text search:

  • FMI is the fastest solution (≈ 1 μs/result) when it uses more space than the original vocabulary.
  • XBW is the better choice for this scenario:
  • ≈ 5-6 μs/result.
  • ≈ 40% of the original space.
slide-24
SLIDE 24
  • Foundations
  • RDF Dictionary-based Compression
  • Dictionaries in Practice

RDF Dictionaries

Dictionary Compression

slide-25
SLIDE 25

RDF Dictionaries are a core component of any compression or indexing approach designed for semantic datasets.

Foundations

slide-26
SLIDE 26
  • An RDF dictionary comprises all different terms used in the dataset:
  • Terms are drawn from 3 disjoint vocabularies: URIs, Literals, and blank nodes.
  • URIs are medium-size strings which share long prefixes:

http://example.org/property/age
http://example.org/property/location
http://example.org/person/abe-simpson
http://example.org/person/bart-simpson

  • Literals tend to be large-size strings (with no predictable features), or numbers, or dates…:

“742 Evergreen Terrace”
“Bart Simpson”
“Homer Simpson”
10

  • Blank node serialization is not standardized:
  • “Auto-incremental” strings are usually used → features similar to those of URIs.

Basics

slide-27
SLIDE 27
  • Primitive Operations are used intensively:
  • locate operations are common when the dictionary is used for lookup purposes (e.g. RDF stores, semantic search engines, etc.).
  • extract operations are common when the dictionary is used for data access purposes (e.g. decompression, result retrieval, etc.).
  • Prefix-based operations are most relevant for URIs:
  • Finding all URIs in a given domain: e.g. retrieve all URIs from http://example.org/person/.
  • Substring-based operations are an open challenge for Literals:
  • REGEX SPARQL queries: e.g. look for all literals containing the substring “Simpson”.

Dictionary Queries

slide-28
SLIDE 28
  • URIs and Literals should be compressed and managed independently…
  • Their structure is very different and they are queried in a different way.
  • …but they should also be organized according to their role in the dataset:
  • Literals always play an object role.
  • URIs can be used as subject, predicate, and/or object.

Decisions

slide-29
SLIDE 29

RDF dictionary-based compression manages several dictionaries to optimize the compression of URIs and Literals.

RDF Dictionary-based Compression

slide-30
SLIDE 30
  • A role-based partition is first performed:
  • Subjects are encoded in the range [1,|S|].
  • Predicates are encoded in the range [1,|P|].
  • Objects are encoded in the range [1,|O|].
  • URIs playing as subject and object are encoded once:
  • IDs in [1,|SO|] encode terms playing as subjects and objects.
  • Subject-only terms are encoded in [|SO|+1,|S|].
  • Objects are encoded using two dictionaries:
  • [|SO|+1,|Ox|] encodes URIs which only perform as objects.
  • [|Ox|+1,|O|] encodes Literals.
  • Predicates are encoded in [1,|P|].

Dictionary Organization
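The ID ranges can be read as a lookup rule: given an ID and the role it plays, the range it falls in tells which dictionary section holds the term. The sketch below is one way to express that rule. The function names and section labels are hypothetical, and the sizes in the usage lines are illustrative assumptions (|SO| = 4, |S| = 5, |Ox| = 4, |O| = 12, i.e. no object-only URIs).

```python
def subject_section(i, n_so, n_s):
    """Section holding subject ID i, given |SO| and |S|."""
    if 1 <= i <= n_so:
        return "SO"
    if n_so < i <= n_s:
        return "S"
    raise ValueError(f"subject ID {i} out of range")


def object_section(i, n_so, n_ox, n_o):
    """Section holding object ID i, given |SO|, |Ox| and |O|."""
    if 1 <= i <= n_so:
        return "SO"
    if n_so < i <= n_ox:
        return "O-URI"
    if n_ox < i <= n_o:
        return "Literal"
    raise ValueError(f"object ID {i} out of range")


print(subject_section(3, n_so=4, n_s=5))           # SO
print(subject_section(5, n_so=4, n_s=5))           # S
print(object_section(5, n_so=4, n_ox=4, n_o=12))   # Literal
```

Sharing the [1,|SO|] range between both roles means a subject ID and an object ID in that range denote the same term, so joins between subjects and objects can be checked on IDs alone.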

slide-31
SLIDE 31

RDF Dictionaries in Practice

[Figure: RDF graph of the example dataset: person:* and location:springfield nodes, with literal values, linked by property:name, property:age, property:address, property:location, property:father and property:mother edges.]

slide-32
SLIDE 32

Looking for Subject-Object (SO) terms…

<http://example.org/location/springfield> <http://example.org/property/name> "Springfield" .
<http://example.org/person/abe-simpson> <http://example.org/property/age> 83 .
<http://example.org/person/abe-simpson> <http://example.org/property/name> "Abe Simpson" .
<http://example.org/person/bart-simpson> <http://example.org/property/age> 10 .
<http://example.org/person/bart-simpson> <http://example.org/property/name> "Bart Simpson" .
<http://example.org/person/bart-simpson> <http://example.org/property/father> <http://example.org/person/homer-simpson> .
<http://example.org/person/bart-simpson> <http://example.org/property/mother> <http://example.org/person/marge-simpson> .
<http://example.org/person/homer-simpson> <http://example.org/property/address> "742 Evergreen Terrace" .
<http://example.org/person/homer-simpson> <http://example.org/property/name> "Homer Simpson" .
<http://example.org/person/homer-simpson> <http://example.org/property/location> <http://example.org/location/springfield> .
<http://example.org/person/homer-simpson> <http://example.org/property/father> <http://example.org/person/abe-simpson> .
<http://example.org/person/marge-simpson> <http://example.org/property/address> "742 Evergreen Terrace" .
<http://example.org/person/marge-simpson> <http://example.org/property/name> "Marge Simpson" .
<http://example.org/person/marge-simpson> <http://example.org/property/location> <http://example.org/location/springfield> .

slide-33
SLIDE 33

Building SO Dictionary & Compressing terms

1 <http://example.org/property/name> "Springfield" .
2 <http://example.org/property/age> 83 .
2 <http://example.org/property/name> "Abe Simpson" .
<http://example.org/person/bart-simpson> <http://example.org/property/age> 10 .
<http://example.org/person/bart-simpson> <http://example.org/property/name> "Bart Simpson" .
<http://example.org/person/bart-simpson> <http://example.org/property/father> 3 .
<http://example.org/person/bart-simpson> <http://example.org/property/mother> 4 .
3 <http://example.org/property/address> "742 Evergreen Terrace" .
3 <http://example.org/property/name> "Homer Simpson" .
3 <http://example.org/property/location> 1 .
3 <http://example.org/property/father> 2 .
4 <http://example.org/property/address> "742 Evergreen Terrace" .
4 <http://example.org/property/name> "Marge Simpson" .
4 <http://example.org/property/location> 1 .

SO
ID  RDF Term
1   http://example.org/location/springfield
2   http://example.org/person/abe-simpson
3   http://example.org/person/homer-simpson
4   http://example.org/person/marge-simpson

slide-34
SLIDE 34

Looking for Subject (S) terms…

slide-35
SLIDE 35

Building S Dictionary & Compressing terms

1 <http://example.org/property/name> "Springfield" .
2 <http://example.org/property/age> 83 .
2 <http://example.org/property/name> "Abe Simpson" .
5 <http://example.org/property/age> 10 .
5 <http://example.org/property/name> "Bart Simpson" .
5 <http://example.org/property/father> 3 .
5 <http://example.org/property/mother> 4 .
3 <http://example.org/property/address> "742 Evergreen Terrace" .
3 <http://example.org/property/name> "Homer Simpson" .
3 <http://example.org/property/location> 1 .
3 <http://example.org/property/father> 2 .
4 <http://example.org/property/address> "742 Evergreen Terrace" .
4 <http://example.org/property/name> "Marge Simpson" .
4 <http://example.org/property/location> 1 .

S
ID  RDF Term
5   http://example.org/person/bart-simpson

slide-36
SLIDE 36

Looking for Object (O) terms…

slide-37
SLIDE 37

Building O Dictionary & Compressing Terms

1 <http://example.org/property/name> 10 .
2 <http://example.org/property/age> 12 .
2 <http://example.org/property/name> 6 .
5 <http://example.org/property/age> 11 .
5 <http://example.org/property/name> 7 .
5 <http://example.org/property/father> 3 .
5 <http://example.org/property/mother> 4 .
3 <http://example.org/property/address> 5 .
3 <http://example.org/property/name> 8 .
3 <http://example.org/property/location> 1 .
3 <http://example.org/property/father> 2 .
4 <http://example.org/property/address> 5 .
4 <http://example.org/property/name> 9 .
4 <http://example.org/property/location> 1 .

O
ID  RDF Term
5   "742 Evergreen Terrace"
6   "Abe Simpson"
7   "Bart Simpson"
8   "Homer Simpson"
9   "Marge Simpson"
10  "Springfield"
11  10
12  83

slide-38
SLIDE 38

Looking for Predicate (P) terms…

slide-39
SLIDE 39

Building P Dictionary & Compressing terms

SO
ID  RDF Term
1   http://example.org/location/springfield
2   http://example.org/person/abe-simpson
3   http://example.org/person/homer-simpson
4   http://example.org/person/marge-simpson

S
ID  RDF Term
5   http://example.org/person/bart-simpson

O
ID  RDF Term
5   "742 Evergreen Terrace"
6   "Abe Simpson"
7   "Bart Simpson"
8   "Homer Simpson"
9   "Marge Simpson"
10  "Springfield"
11  10
12  83

P
ID  RDF Term
1   http://example.org/property/address
2   http://example.org/property/age
3   http://example.org/property/father
4   http://example.org/property/location
5   http://example.org/property/mother
6   http://example.org/property/name

ID-triples:

1 6 10 .
2 2 12 .
2 6 6 .
5 2 11 .
5 6 7 .
5 3 3 .
5 5 4 .
3 1 5 .
3 6 8 .
3 4 1 .
3 3 2 .
4 1 5 .
4 6 9 .
4 4 1 .
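The whole walk-through (SO, S, O and P dictionaries, then ID-triples) can be reproduced with a short sketch. It makes two simplifying assumptions: every term is a plain string with literals written as in the slides (quoted strings, bare numbers), and a term counts as a URI when it starts with `http://`. The variable names are illustrative.

```python
P = "http://example.org/property/"
N = "http://example.org/person/"
L = "http://example.org/location/"

triples = [
    (L + "springfield", P + "name", '"Springfield"'),
    (N + "abe-simpson", P + "age", "83"),
    (N + "abe-simpson", P + "name", '"Abe Simpson"'),
    (N + "bart-simpson", P + "age", "10"),
    (N + "bart-simpson", P + "name", '"Bart Simpson"'),
    (N + "bart-simpson", P + "father", N + "homer-simpson"),
    (N + "bart-simpson", P + "mother", N + "marge-simpson"),
    (N + "homer-simpson", P + "address", '"742 Evergreen Terrace"'),
    (N + "homer-simpson", P + "name", '"Homer Simpson"'),
    (N + "homer-simpson", P + "location", L + "springfield"),
    (N + "homer-simpson", P + "father", N + "abe-simpson"),
    (N + "marge-simpson", P + "address", '"742 Evergreen Terrace"'),
    (N + "marge-simpson", P + "name", '"Marge Simpson"'),
    (N + "marge-simpson", P + "location", L + "springfield"),
]

subjects = {s for s, _, _ in triples}
predicates = {p for _, p, _ in triples}
objects = {o for _, _, o in triples}


def is_uri(term):
    # Simplification for this sketch: anything else is treated as a literal.
    return term.startswith("http://")


# Role-based partition, each section in lexicographic order.
so = sorted(subjects & objects)
s_only = sorted(subjects - objects)
o_uris = sorted(t for t in objects - subjects if is_uri(t))
literals = sorted(t for t in objects - subjects if not is_uri(t))
preds = sorted(predicates)

# Shared terms come first, so they get the same ID in both roles.
subject_id = {t: i for i, t in enumerate(so + s_only, start=1)}
object_id = {t: i for i, t in enumerate(so + o_uris + literals, start=1)}
predicate_id = {t: i for i, t in enumerate(preds, start=1)}

id_triples = [(subject_id[s], predicate_id[p], object_id[o]) for s, p, o in triples]
for t in id_triples:
    print(*t)
```

Sorting the quoted literals before the bare numbers is a side effect of plain code-point ordering, which happens to match the O dictionary shown above. A real implementation would need an explicit ordering for typed literals.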

slide-40
SLIDE 40

Dictionary Compression

P
ID  RDF Term
1   http://example.org/property/address
2   http://example.org/property/age
3   http://example.org/property/father
4   http://example.org/property/location
5   http://example.org/property/mother
6   http://example.org/property/name

  • Dictionaries are now compressed.
  • Let’s see how the predicate dictionary is compressed using Plain Front Coding.
  • 1. Terms are concatenated in lexicographic order:

http://example.org/property/address$http://example.org/property/age$http://example.org/property/father$http://example.org/property/location$http://example.org/property/mother$http://example.org/property/name$

  • 2. Terms are then organized into buckets of b strings (e.g. b=3):

B1 = http://example.org/property/address$http://example.org/property/age$http://example.org/property/father$
B2 = http://example.org/property/location$http://example.org/property/mother$http://example.org/property/name$

  • 3. Each bucket is independently compressed:
  • The first term is preserved “as is”.
  • Each internal string is differentially encoded with respect to its predecessor.

slide-41
SLIDE 41

Dictionary Compression

  • 4. The compressed result is managed into a simple byte array (Tpfc):
  • Prefix-numbers are encoded using VByte and suffixes are encoded byte-to-byte (ASCII).
  • 5. An additional integer array (ptrs) is used to store the position of the first byte of each bucket.

B1 = http://example.org/property/address$http://example.org/property/age$http://example.org/property/father$
B2 = http://example.org/property/location$http://example.org/property/mother$http://example.org/property/name$

Bucket 1: http://example.org/property/address$ 29 ge$ 28 father$
Bucket 2: http://example.org/property/location$ 28 mother$ 28 name$

47
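Plain Front Coding can be sketched in a few lines. This is an illustration under stated assumptions, not the libCSD implementation: the function names are hypothetical, `$` is assumed never to occur inside a term, and the VByte variant used here sets the high bit on the last byte of each number (other conventions exist).

```python
def vbyte(n):
    """Encode n >= 0 in 7-bit groups; the high bit marks the final byte."""
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)
    return bytes(out)


def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pfc_encode(terms, bucket_size):
    """terms must be sorted. Returns (Tpfc bytes, ptrs to bucket starts)."""
    data, ptrs = bytearray(), []
    prev = b""
    for i, term in enumerate(terms):
        raw = term.encode("ascii")
        if i % bucket_size == 0:
            ptrs.append(len(data))          # bucket header kept "as is"
            data += raw + b"$"
        else:
            k = common_prefix_len(prev, raw)
            data += vbyte(k) + raw[k:] + b"$"
        prev = raw
    return bytes(data), ptrs


def pfc_extract(data, ptrs, bucket_size, i):
    """Return the i-th term (1-based) by decoding its bucket sequentially."""
    bucket, offset = divmod(i - 1, bucket_size)
    pos = ptrs[bucket]
    end = data.index(b"$", pos)
    term = data[pos:end]
    pos = end + 1
    for _ in range(offset):
        k, shift = 0, 0
        while True:                          # decode one VByte number
            byte = data[pos]
            pos += 1
            k |= (byte & 0x7F) << shift
            shift += 7
            if byte & 0x80:
                break
        end = data.index(b"$", pos)
        term = term[:k] + data[pos:end]      # shared prefix + stored suffix
        pos = end + 1
    return term.decode("ascii")


base = "http://example.org/property/"
predicates = [base + name for name in
              ("address", "age", "father", "location", "mother", "name")]
tpfc, ptrs = pfc_encode(predicates, bucket_size=3)
print(len(tpfc), ptrs)
print(pfc_extract(tpfc, ptrs, 3, 5))
```

For these six predicates the encoded array takes 99 bytes, against 208 bytes for the plain `$`-terminated concatenation. The bucket size trades space for speed: larger buckets store fewer full headers but need more sequential decoding per extract.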

slide-42
SLIDE 42

RDF Dictionaries are used for SPARQL resolution, but they also allow other interesting queries to be efficiently resolved in the Linked Data workflow.

Dictionaries in Practice

slide-43
SLIDE 43

Normative SPARQL

1 6 10 . 2 2 12 . 2 6 6 . 5 2 11 . 5 6 7 . 5 3 3 . 5 5 4 . 3 1 5 . 3 6 8 . 3 4 1 . 3 3 2 . 4 1 5 . 4 6 9 . 4 4 1 .

Retrieve all people living in Springfield.

@prefix …
SELECT ?Who WHERE { ?Who property:location location:springfield }

  • P.locate(http://example.org/property/location) = 4
  • SO.locate(http://example.org/location/springfield) = 1
  • Looking for (?Who 4 1): ?Who = 3, 4
  • SO.extract(3) = http://example.org/person/homer-simpson
  • SO.extract(4) = http://example.org/person/marge-simpson
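The resolution steps for this query can be traced with a small sketch over the ID-triples. The dictionaries are plain Python lists here, and `locate`, `extract_subject` and the linear scan over triples are illustrative stand-ins for the compressed dictionary and the triple index.

```python
so = [
    "http://example.org/location/springfield",
    "http://example.org/person/abe-simpson",
    "http://example.org/person/homer-simpson",
    "http://example.org/person/marge-simpson",
]
s_only = ["http://example.org/person/bart-simpson"]
p = [
    "http://example.org/property/address",
    "http://example.org/property/age",
    "http://example.org/property/father",
    "http://example.org/property/location",
    "http://example.org/property/mother",
    "http://example.org/property/name",
]
id_triples = [
    (1, 6, 10), (2, 2, 12), (2, 6, 6), (5, 2, 11), (5, 6, 7), (5, 3, 3), (5, 5, 4),
    (3, 1, 5), (3, 6, 8), (3, 4, 1), (3, 3, 2), (4, 1, 5), (4, 6, 9), (4, 4, 1),
]


def locate(terms, term):
    """1-based ID of term in a dictionary section."""
    return terms.index(term) + 1


def extract_subject(i):
    """Subject IDs up to |SO| live in SO, the rest in the S section."""
    return so[i - 1] if i <= len(so) else s_only[i - len(so) - 1]


# 1. Translate the constants of the pattern into IDs.
pred = locate(p, "http://example.org/property/location")
obj = locate(so, "http://example.org/location/springfield")

# 2. Match the pattern (?Who pred obj) on IDs only.
who = [s for s, pp, o in id_triples if pp == pred and o == obj]

# 3. Translate the bindings back into terms.
people = [extract_subject(i) for i in who]
print(pred, obj, who)
print(people)
```

Strings are touched only at the two ends of the query: locate for the constants and extract for the results. Everything in between works on integers, which is why extract dominates the cost when result sets are large.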

slide-44
SLIDE 44

Domain Entity Retrieval

Retrieve all people in our domain: http://example.org/person/

  • SO.extractPrefix(http://example.org/person/) = http://example.org/person/abe-simpson, http://example.org/person/homer-simpson, http://example.org/person/marge-simpson
  • S.extractPrefix(http://example.org/person/) = http://example.org/person/bart-simpson
  • O.extractPrefix(http://example.org/person/) = no matching terms

slide-45
SLIDE 45

Full-Text Search

Retrieve all terms which include “Simpson”:

O.extractSubstring(“Simpson”) = “Abe Simpson”, “Bart Simpson”, “Homer Simpson”, “Marge Simpson”

slide-46
SLIDE 46

Conclusions

Dictionary Compression

slide-47
SLIDE 47
  • RDF dictionaries are highly compressible:
  • URIs are very redundant and Literals also show non-negligible symbolic redundancy.
  • This redundancy can be detected and removed within specific data structures for dictionaries:
  • Structures for URIs use up to 20 times less space than the original dictionaries.
  • For Literals, the corresponding structures use 6 − 8 times less space than the original dictionaries.
  • All these structures report data retrieval performance at microsecond level:
  • This functionality includes both simple and advanced operations.

Conclusions

slide-48
SLIDE 48

Conclusions

  • Compressed string dictionaries are available in the libCSD C++ library (beta):
  • We are working on a new release including more techniques and more search functionality (e.g. top K).

https://github.com/migumar2/libCSD

slide-51
SLIDE 51

Triples Compression

Let the lecture continue…

Image: ROYAL MINT & ALCÁZAR (SEGOVIA, SPAIN)