

slide-1
SLIDE 1

Triples compression and Indexing

Antonio Fariña, Javier D. Fernández and Miguel A. Martínez-Prieto

23RD AUGUST 2017

3rd KEYSTONE Training School Keyword search in Big Linked Data

slide-2
SLIDE 2
Agenda

  • RDF management overview
  • K2-Tree data structure
  • K2-Triples
  • Compressed Suffix Array (CSA)
  • RDF-CSA

slide-3
SLIDE 3

RDF management overview

Recall that the strings of an RDF dataset can be mapped to integer IDs through a dictionary, so that a set of RDF triples can be handled as a set of ID-based triples.

slide-4
SLIDE 4
RDF management overview

[Figure: example RDF graph over the nodes SPIRE, London, UK, Finland, A.Gionis, M.Lalmas, R.Raman and inv-speaker]

Dictionary encoding:

  SO (subject and object)   London = 1, SPIRE = 2
  S  (subject only)         A.Gionis = 3, M.Lalmas = 4, R.Raman = 5
  O  (object only)          Finland = 3, inv-speaker = 4, UK = 5
  P  (predicates)           attends = 1, capital of = 2, held on = 3, lives in = 4, position = 5, works in = 6

Original triples                    ID-based triples
  (SPIRE, held on, London)            (2,3,1)
  (London, capital of, UK)            (1,2,5)
  (A.Gionis, attends, SPIRE)          (3,1,2)
  (R.Raman, attends, SPIRE)           (5,1,2)
  (M.Lalmas, attends, SPIRE)          (4,1,2)
  (M.Lalmas, lives in, UK)            (4,4,5)
  (M.Lalmas, works in, London)        (4,6,1)
  (A.Gionis, lives in, Finland)       (3,4,3)
  (R.Raman, lives in, UK)             (5,4,5)
  (R.Raman, position, inv-speaker)    (5,5,4)
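
The dictionary encoding of this slide can be sketched in a few lines of Python. This is only an illustration: the function name `encode` and the case-insensitive ordering of terms are assumptions, chosen because they reproduce the IDs shown on the slide.

```python
def encode(triples):
    """Map string triples to ID triples using SO, S, O and P sections."""
    subjects = {s for s, _, _ in triples}
    objects = {o for _, _, o in triples}
    predicates = {p for _, p, _ in triples}

    def order(terms):
        # Assumption: terms are ordered case-insensitively.
        return sorted(terms, key=str.lower)

    shared = order(subjects & objects)   # SO: terms used as subject and object
    s_only = order(subjects - objects)   # S: subject-only terms
    o_only = order(objects - subjects)   # O: object-only terms

    # Shared terms get the same ID in both roles; the S and O sections continue after them.
    s_ids = {t: i for i, t in enumerate(shared + s_only, start=1)}
    o_ids = {t: i for i, t in enumerate(shared + o_only, start=1)}
    p_ids = {t: i for i, t in enumerate(order(predicates), start=1)}
    return [(s_ids[s], p_ids[p], o_ids[o]) for s, p, o in triples]


TRIPLES = [
    ("SPIRE", "held on", "London"),
    ("London", "capital of", "UK"),
    ("A.Gionis", "attends", "SPIRE"),
    ("R.Raman", "attends", "SPIRE"),
    ("M.Lalmas", "attends", "SPIRE"),
    ("M.Lalmas", "lives in", "UK"),
    ("M.Lalmas", "works in", "London"),
    ("A.Gionis", "lives in", "Finland"),
    ("R.Raman", "lives in", "UK"),
    ("R.Raman", "position", "inv-speaker"),
]

ID_TRIPLES = encode(TRIPLES)
print(ID_TRIPLES)
```

Giving shared subject-object terms the lowest IDs means that the same number identifies the same node in both roles, which is convenient when joining on a subject-object variable.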


slide-6
SLIDE 6

K2-tree data structure

A k2-tree is a compact representation of an adjacency matrix.

slide-7
SLIDE 7
K2-Tree Motivation

  • Structure for representing an adjacency matrix
  • Originally designed for web graphs
  • Simple directed graph

[Figure: a directed graph with 11 nodes and its 11 × 11 adjacency matrix]

slide-8
SLIDE 8
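K2-Tree Construction process

Example with k = 2

[Figure: the adjacency matrix is recursively split into k × k submatrices; a tree node is 1 if its submatrix contains at least one 1, and only 1-nodes are subdivided further]

The bits of all internal levels are stored in T and the bits of the last level in L (shown here in groups of k^2 = 4 bits, one group per node):

T = 1011 1101 0100 1000 1100 1000 0001 0101 1110
L = 0100 0011 0010 0010 1010 1000 0110 0010 0100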
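
The level-wise construction can be sketched as follows. The matrix on the slide is not recoverable from this transcript, so the 4 × 4 matrix `M` below is a hypothetical example, and `build_k2tree` is an illustrative name. The sketch assumes a square matrix whose side is a power of k.

```python
def build_k2tree(matrix, k=2):
    """Return the bit lists (T, L) of a k2-tree for a square 0/1 matrix."""
    n = len(matrix)
    levels = []
    current = [(0, 0, n)]  # submatrices to split at this level: (row, col, size)
    while current:
        sub = current[0][2] // k  # side of each child submatrix
        bits, following = [], []
        for r0, c0, _ in current:
            for i in range(k):        # children are visited row by row
                for j in range(k):
                    r, c = r0 + i * sub, c0 + j * sub
                    nonempty = any(
                        matrix[x][y]
                        for x in range(r, r + sub)
                        for y in range(c, c + sub)
                    )
                    bits.append(1 if nonempty else 0)
                    if nonempty and sub > 1:
                        following.append((r, c, sub))  # only 1-nodes are subdivided
        levels.append(bits)
        current = following
    T = [b for level in levels[:-1] for b in level]  # internal levels
    L = levels[-1]                                   # last level (matrix cells)
    return T, L


# Hypothetical 4 x 4 adjacency matrix (not the one from the slide).
M = [
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
    [1, 0, 1, 0],
]
T, L = build_k2tree(M)
print(T)  # [1, 0, 1, 1]
print(L)  # [0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0]
```

The empty top-right quadrant costs a single 0 bit in T, which is where the compression of sparse matrices comes from.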

slide-9
SLIDE 9
K2-Tree Direct neighbor operation

T = 101111010100100011001000000101011110
L = 010000110010001010101000011000100100

Positions start at 0 and run over the concatenation of T and L (T occupies positions 0 to 35, L starts at position 36). The children of the node at position x begin at position rank1(T, x) · k^2, where rank1(T, x) counts the 1s in T[0..x]:

  children(2)  = rank1(T, 2) · k^2  = 2 · 4  = 8
  children(9)  = rank1(T, 9) · k^2  = 7 · 4  = 28
  children(31) = rank1(T, 31) · k^2 = 14 · 4 = 56   (beyond the end of T, so these children are in L)

[Figure: the tree traversal that retrieves the direct neighbours of one row of the matrix]
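
The rank-based navigation can be checked directly against the bitmap T of the slide. The sketch below uses a linear-time `rank1` for clarity (real implementations answer it in constant time with a small auxiliary structure). The function `direct_neighbours` and the small 4 × 4 tree used to exercise it are illustrative assumptions, not taken from the slides.

```python
def rank1(bits, x):
    """Number of 1s in bits[0..x], both ends included."""
    return sum(bits[: x + 1])


def children(T, x, k=2):
    """Position (in T followed by L) of the first child of the node at position x."""
    return rank1(T, x) * k * k


def direct_neighbours(T, L, n, row, k=2):
    """Columns holding a 1 in the given row of the n x n matrix."""
    bits = T + L
    found = []

    def visit(first_child, size, r, col0):
        sub = size // k
        i = r // sub  # which row of children contains row r
        for j in range(k):
            pos = first_child + i * k + j
            if not bits[pos]:
                continue
            if pos >= len(T):  # last level: a cell of the matrix
                found.append(col0 + j * sub)
            else:
                visit(children(T, pos, k), sub, r % sub, col0 + j * sub)

    visit(0, n, row, 0)
    return found


# Bitmap T from the slide.
SLIDE_T = [int(b) for b in "101111010100100011001000000101011110"]
slide_values = [children(SLIDE_T, x) for x in (2, 9, 31)]
print(slide_values)  # [8, 28, 56]

# Hypothetical 4 x 4 example: rows are 0100 / 0000 / 0001 / 1010.
SMALL_T = [1, 0, 1, 1]
SMALL_L = [0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0]
rows = [direct_neighbours(SMALL_T, SMALL_L, 4, r) for r in range(4)]
print(rows)  # [[1], [], [3], [0, 2]]
```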


slide-11
SLIDE 11

K2-triples

k2-triples partitions an RDF dataset vertically by predicate and represents it with |P| k2-trees, one per predicate. Each k2-tree stores the (subject, object) pairs of the triples that use that predicate.

slide-12
SLIDE 12
K2-Triples Data structure

  • Dictionary encoding
  • Triples as a set of identifiers

[Figure: RDF triples, the dictionary, and the resulting mapped triples]
slide-13
SLIDE 13
K2-Triples Data structure

  • Vertical partitioning (by predicates)
  • One K2-tree per predicate

(S,P,O) triples: (8,5,4) (4,2,3) (4,4,6) (4,1,7) (7,2,3) (3,3,5) (5,2,1) (1,3,5) (6,2,2) (2,3,5)

[Figure: one subject × object binary matrix per predicate, P1 to P5; for example, triple (4,1,7) sets the cell at row S = 4, column O = 7 of the P1 matrix]

slide-14
SLIDE 14
K2-Triples

  • Operations: solving triple patterns

Query: (4,2,3)
Result: (4,2,3)

[Figure: the matrix of predicate P2]

  • SPO → checking a cell
  • SP?
  • ?PO
  • S?O
  • S??
  • ??O
  • ?P?
slide-15
SLIDE 15
K2-Triples

  • Operations: solving triple patterns

Query: (4,2,?)
Result: (4,2,3)

[Figure: the matrix of predicate P2]

  • SPO → checking a cell
  • SP? → direct neighbours
  • ?PO
  • S?O
  • S??
  • ??O
  • ?P?
slide-16
SLIDE 16
K2-Triples

  • Operations: solving triple patterns

Query: (?,2,3)
Result: (4,2,3), (7,2,3)

[Figure: the matrix of predicate P2]

  • SPO → checking a cell
  • SP? → direct neighbours
  • ?PO → reverse neighbours
  • S?O
  • S??
  • ??O
  • ?P?
slide-17
SLIDE 17
K2-Triples

  • Operations: solving triple patterns

Query: (4,?,6)
Result: (4,4,6)

[Figure: the matrices of predicates P1 to P5]

  • SPO → checking a cell
  • SP? → direct neighbours
  • ?PO → reverse neighbours
  • S?O → checking |P| cells
  • S??
  • ??O
  • ?P?

slide-18
SLIDE 18
K2-Triples

  • Operations: solving triple patterns

Query: (4,?,?)
Result: (4,1,7), (4,2,3), (4,4,6)

[Figure: the matrices of predicates P1 to P5]

  • SPO → checking a cell
  • SP? → direct neighbours
  • ?PO → reverse neighbours
  • S?O → checking |P| cells
  • S?? → |P| direct neighbours
  • ??O
  • ?P?
slide-19
SLIDE 19
K2-Triples

  • Operations: solving triple patterns

Query: (?,?,4)
Result: (8,5,4)

[Figure: the matrices of predicates P1 to P5]

  • SPO → checking a cell
  • SP? → direct neighbours
  • ?PO → reverse neighbours
  • S?O → checking |P| cells
  • S?? → |P| direct neighbours
  • ??O → |P| reverse neighbours
  • ?P?

slide-20
SLIDE 20
K2-Triples

  • Operations: solving triple patterns

Query: (?,2,?)
Result: (4,2,3), (5,2,1), (6,2,2), (7,2,3)

[Figure: the matrix of predicate P2]

  • SPO → checking a cell
  • SP? → direct neighbours
  • ?PO → reverse neighbours
  • S?O → checking |P| cells
  • S?? → |P| direct neighbours
  • ??O → |P| reverse neighbours
  • ?P? → full adjacency matrix
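
The queries of the preceding slides can be replayed with a minimal sketch of vertical partitioning. Here each predicate keeps a plain Python set of (subject, object) pairs as a stand-in for its k2-tree, so the scan below only mimics what cell checks and direct or reverse neighbour operations would return. The names `partition` and `solve` are illustrative.

```python
from collections import defaultdict

TRIPLES = [(8, 5, 4), (4, 2, 3), (4, 4, 6), (4, 1, 7), (7, 2, 3),
           (3, 3, 5), (5, 2, 1), (1, 3, 5), (6, 2, 2), (2, 3, 5)]


def partition(triples):
    """Vertical partitioning: one set of (subject, object) pairs per predicate."""
    parts = defaultdict(set)
    for s, p, o in triples:
        parts[p].add((s, o))
    return parts


def solve(parts, s=None, p=None, o=None):
    """Resolve a triple pattern; None plays the role of '?'."""
    # A bound predicate touches one partition, an unbound one touches all |P|.
    predicates = [p] if p is not None else sorted(parts)
    result = []
    for q in predicates:
        for a, b in parts.get(q, ()):
            if (s is None or a == s) and (o is None or b == o):
                result.append((a, q, b))
    return sorted(result)


PARTS = partition(TRIPLES)
print(solve(PARTS, 4, 2, 3))   # SPO
print(solve(PARTS, 4, 2))      # SP?
print(solve(PARTS, o=3, p=2))  # ?PO
print(solve(PARTS, s=4, o=6))  # S?O
print(solve(PARTS, s=4))       # S??
print(solve(PARTS, o=4))       # ??O
print(solve(PARTS, p=2))       # ?P?
```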

slide-21
SLIDE 21
K2-Triples SP & OP indexes

  • Weakness of vertical partitioning → unbounded predicates
  • Patterns (S,?,?), (?,?,O) and (S,?,O)
  • All |P| K2-trees must be checked!
  • The authors proposed two additional indexes, SP and OP

(S,P,O) triples: (8,5,4) (4,2,3) (4,4,6) (4,1,7) (7,2,3) (3,3,5) (5,2,1) (1,3,5) (6,2,2) (2,3,5)

SP INDEX (statistically compressed, direct access with DAC)

  S   Predicates
  1   3
  2   3
  3   3
  4   1,2,4
  5   2
  6   2
  7   2
  8   5

slide-22
SLIDE 22
K2-Triples SP & OP indexes

  • Query (4,?,?)

SP INDEX lookup: subject 4 → predicate list 1, 2, 4

[Figure: the matrices of predicates P1 to P5 next to the SP index]
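
A small sketch of how an SP index narrows the work for query (4,?,?). The index is a plain dictionary here, whereas the slides describe a statistically compressed list with DAC-based direct access; the names `build_sp_index` and `solve_s` are assumptions made for illustration.

```python
from collections import defaultdict

TRIPLES = [(8, 5, 4), (4, 2, 3), (4, 4, 6), (4, 1, 7), (7, 2, 3),
           (3, 3, 5), (5, 2, 1), (1, 3, 5), (6, 2, 2), (2, 3, 5)]


def build_sp_index(triples):
    """For each subject, the sorted list of predicates it appears with."""
    index = defaultdict(set)
    for s, p, _ in triples:
        index[s].add(p)
    return {s: sorted(ps) for s, ps in index.items()}


def solve_s(triples, sp_index, s):
    """Resolve (s,?,?) visiting only the predicates listed for s."""
    # Stand-in for the per-predicate k2-trees: (subject, object) pairs by predicate.
    parts = defaultdict(set)
    for a, p, b in triples:
        parts[p].add((a, b))
    visited = sp_index.get(s, [])
    result = [(a, p, b) for p in visited for a, b in sorted(parts[p]) if a == s]
    return visited, result


SP = build_sp_index(TRIPLES)
visited, result = solve_s(TRIPLES, SP, 4)
print(SP)       # matches the SP INDEX table
print(visited)  # [1, 2, 4]
print(result)   # [(4, 1, 7), (4, 2, 3), (4, 4, 6)]
```

An OP index can be built the same way by keying on the object instead of the subject.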
slide-23
SLIDE 23
K2-Triples Joins

  • They implemented three join strategies
  • Independent join
  • Chain join
  • Interactive join
  • Taking advantage of the K2-triples structure

Query: (8,5,?X) (?X,2,?)

[Figure: join evaluation, labelled merge-join and index-join]

The best strategy depends on the dataset and on the type of join.
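
The slide only names the strategies, so the following is a hedged sketch of two of them under a common reading: an independent join resolves both patterns separately and intersects them on ?X, while a chain join resolves the first pattern and substitutes each binding of ?X into the second. The interactive join, which would traverse both k2-trees together, is not sketched. Sets of pairs stand in for the k2-trees and all function names are illustrative.

```python
from collections import defaultdict

TRIPLES = [(8, 5, 4), (4, 2, 3), (4, 4, 6), (4, 1, 7), (7, 2, 3),
           (3, 3, 5), (5, 2, 1), (1, 3, 5), (6, 2, 2), (2, 3, 5)]

PARTS = defaultdict(set)
for s, p, o in TRIPLES:
    PARTS[p].add((s, o))


def match(s, p, o):
    """Resolve a pattern with a bound predicate; None plays the role of '?'."""
    return sorted((a, p, b) for a, b in PARTS.get(p, ())
                  if s in (None, a) and o in (None, b))


def independent_join(s1, p1, p2):
    """Resolve (s1,p1,?X) and (?X,p2,?) separately, then intersect on ?X."""
    left = match(s1, p1, None)
    right = match(None, p2, None)
    return sorted((l, r) for l in left for r in right if l[2] == r[0])


def chain_join(s1, p1, p2):
    """Resolve (s1,p1,?X), then run (x,p2,?) for every binding x of ?X."""
    pairs = []
    for l in match(s1, p1, None):
        for r in match(l[2], p2, None):
            pairs.append((l, r))
    return sorted(pairs)


independent = independent_join(8, 5, 2)
chained = chain_join(8, 5, 2)
print(independent)  # [((8, 5, 4), (4, 2, 3))]
print(chained)      # [((8, 5, 4), (4, 2, 3))]
```

Both strategies return the same answer; they differ in how much of each k2-tree has to be visited, which is consistent with the remark that the best choice depends on the dataset and the join.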

slide-24
SLIDE 24
K2-Triples Experiments

  • Real datasets from different domains

  Dataset     Size (MB)      #Triples   #Predicates    #Subjects      #Objects
  Jamendo        144.18     1,049,639            28      335,926       440,604
  DBLP             7.58    46,597,620            27    2,840,639    19,639,731
  Geonames    12,347.70   112,235,492            26    8,147,136    41,111,569
  Dbpedia     33,912.71   232,542,405        39,672   18,425,128    65,200,769

  • Space results (Mbytes)

  Dataset     MonetDB     RDF-3X   Hexastore   K2-triples   K2-triples+
  Jamendo        8.76      37.73    1,371.25         0.74          1.28
  DBLP         358.44   1,643.31           -        82.48         99.24
  Geonames     859.66   3,584.80           -       152.20        188.63
  Dbpedia    1,811.74   9,757.58           -       931.44      1,178.38

  (-: no value shown)

slide-25
SLIDE 25

K2-Triples Experiments

  • Triple patterns (DBPEDIA)
slide-26
SLIDE 26

K2-Triples Experiments


slide-28
SLIDE 28

The Compressed Suffix Array

The CSA is a self-index based on the suffix array. It reduces the large space needs of the suffix array and still allows fast binary-search-type lookups.

slide-29
SLIDE 29
Compressed Suffix Array (CSA) Back to Suffix Arrays

  • Binary search for any pattern: “ab”

  i    =  1  2  3  4  5  6  7  8  9 10 11 12
  T[i] =  a  b  r  a  c  a  d  a  b  r  a  $
  A[i] = 12 11  8  1  4  6  9  2  5  7 10  3

P = ab
Noccs = (4 − 3) + 1 = 2
Occs (locations) = A[3] .. A[4] = {8, 1}

Fast: O(m lg n) time to count, O(m lg n + noccs) time to locate
Space: O(4n) + |T|
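
The example can be reproduced with a naive suffix array and a pair of binary searches. This is a sketch for small inputs only (sorting whole suffixes is far from the efficient construction algorithms); positions are 1-based to match the slide, and the function names are illustrative.

```python
def suffix_array(text):
    """1-based starting positions of the suffixes of text, in lexicographic order."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def sa_range(text, sa, pattern):
    """1-based inclusive range [left, right] of A whose suffixes start with pattern."""
    m = len(pattern)

    def prefix(k):
        start = sa[k] - 1
        return text[start:start + m]

    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    left, hi = lo, len(sa)
    while lo < hi:  # first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return left + 1, lo  # an empty range has right < left


TEXT = "abracadabra$"
A = suffix_array(TEXT)
left, right = sa_range(TEXT, A, "ab")
occurrences = A[left - 1:right]
print(A)            # [12, 11, 8, 1, 4, 6, 9, 2, 5, 7, 10, 3]
print(left, right)  # 3 4
print(occurrences)  # [8, 1]
```

Each comparison looks at no more than m characters and there are O(lg n) of them, which gives the O(m lg n) counting time quoted on the slide.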

slide-30
SLIDE 30
Compressed Suffix Array (CSA) CSA Basics

  • Can we reduce the space needs of a Suffix Array?

  i    =  1  2  3  4  5  6  7  8  9 10 11 12
  T[i] =  a  b  r  a  c  a  d  a  b  r  a  $

   i   A[i]   suffix T[A[i]..12]
   1    12    $
   2    11    a$
   3     8    abra$
   4     1    abracadabra$
   5     4    acadabra$
   6     6    adabra$
   7     9    bra$
   8     2    bracadabra$
   9     5    cadabra$
  10     7    dabra$
  11    10    ra$
  12     3    racadabra$

P = ab

slide-31
SLIDE 31
Compressed Suffix Array (CSA) CSA Basics

  • Ψ
  • A[Ψ(i)] = A[i] + 1

[Figure: T = abracadabra$, its suffix array A = 12 11 8 1 4 6 9 2 5 7 10 3 with the sorted suffixes, and the array Ψ being filled in]


slide-33
SLIDE 33
Compressed Suffix Array (CSA) CSA Basics

  • Ψ
  • A[Ψ(10)] = A[3] = A[10] + 1 = 8

[Figure: T = abracadabra$ and its suffix array A = 12 11 8 1 4 6 9 2 5 7 10 3, with A[10] = 7 and Ψ(10) = 3 highlighted]

slide-34
SLIDE 34
Compressed Suffix Array (CSA) CSA Basics

  • Ψ and F
  • Ψ and F are enough to perform binary search (P = “ada…”) and to recover the source data!

  i    =  1  2  3  4  5  6  7  8  9 10 11 12
  T[i] =  a  b  r  a  c  a  d  a  b  r  a  $
  A[i] = 12 11  8  1  4  6  9  2  5  7 10  3
  Ψ[i] =  4  1  7  8  9 10 11 12  6  3  2  5
  F[i] =  $  a  a  a  a  a  b  b  c  d  r  r

F[i] is the first symbol of the i-th suffix in lexicographic order.
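
A short sketch of why Ψ and F suffice to recover the text. Ψ is derived here from an explicit suffix array only to obtain its values; once Ψ and F exist, `recover_text` never looks at A or T. The wrap-around for the suffix at text position n (Ψ(1) points to the suffix that starts at T[1]) and the function names are assumptions of this sketch.

```python
def psi_from_sa(sa):
    """Ψ as a list of 1-based values: A[Ψ(i)] = A[i] + 1, wrapping n back to 1."""
    n = len(sa)
    where = {text_pos: i for i, text_pos in enumerate(sa, start=1)}
    return [where[sa[i] % n + 1] for i in range(n)]


def recover_text(psi, F):
    """Rebuild the text from Ψ and F alone."""
    i = psi[0]  # index 1 holds the suffix '$'; Ψ(1) is the suffix starting at T[1]
    out = []
    for _ in range(len(psi)):
        out.append(F[i - 1])  # first symbol of the current suffix
        i = psi[i - 1]        # move to the suffix that starts one position later
    return "".join(out)


TEXT = "abracadabra$"
A = [12, 11, 8, 1, 4, 6, 9, 2, 5, 7, 10, 3]
PSI = psi_from_sa(A)
F = "".join(sorted(TEXT))
print(PSI)                   # [4, 1, 7, 8, 9, 10, 11, 12, 6, 3, 2, 5]
print(F)                     # $aaaaabbcdrr
print(recover_text(PSI, F))  # abracadabra$
```

The same walk, stopped after m steps, yields the first m symbols of any suffix, which is what the binary search needs to compare against the pattern.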

slide-35
SLIDE 35
Compressed Suffix Array (CSA) CSA Basics

  • Ψ and F (reducing space needs)

  i =  1  2  3  4  5  6  7  8  9 10 11 12
  F =  $  a  a  a  a  a  b  b  c  d  r  r
  D =  1  1  0  0  0  0  1  0  1  1  1  0

  j =  1  2  3  4  5  6
  S =  $  a  b  c  d  r

D is a bitmap with a 1 at each position where a new symbol starts in F. S is the sorted alphabet.

slide-36
SLIDE 36
Compressed Suffix Array (CSA) Representing F

  • Ψ and F (reducing space needs)
  • Example: F[8] = S[rank1(D, 8)] = S[3] = ‘b’

  i =  1  2  3  4  5  6  7  8  9 10 11 12
  F =  $  a  a  a  a  a  b  b  c  d  r  r
  D =  1  1  0  0  0  0  1  0  1  1  1  0

  j =  1  2  3  4  5  6
  S =  $  a  b  c  d  r

D is a bitmap with a 1 at each position where a new symbol starts in F. S is the sorted alphabet.

rank1(D, i): O(1) time, using o(n) extra space
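
The replacement of F by the bitmap D and the sorted alphabet S can be checked with a few lines. The `rank1` below is a linear scan used only for illustration; the constant-time version mentioned on the slide needs an auxiliary structure that is not shown. Indices are 1-based, as on the slide.

```python
F = "$aaaaabbcdrr"

# D[i] = 1 where F changes symbol (the first position always starts a symbol).
D = [1 if i == 0 or F[i] != F[i - 1] else 0 for i in range(len(F))]
S = sorted(set(F))  # sorted alphabet


def rank1(bits, i):
    """Number of 1s in bits[1..i] (1-based, inclusive)."""
    return sum(bits[:i])


def f_at(i):
    """F[i] computed from D and S only."""
    return S[rank1(D, i) - 1]


print(D)        # [1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0]
print(S)        # ['$', 'a', 'b', 'c', 'd', 'r']
print(f_at(8))  # b
```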

slide-37
SLIDE 37
Compressed Suffix Array (CSA) Compressing Ψ

  • Absolute samples (k = sample period)
  • Gap encoding on increasing values: Huffman & run-encoding
  • Huffman with an N-entry dictionary
  • k reserved Huffman codes to encode 1-runs of size s ∈ [1..k−1]
  • 32 + 32 Huffman codes representing the size (in bits) of large values [+ or −]
  • They are followed by that number encoded with log(v) bits
  • The remaining N − k − 32 − 32 entries correspond to the most frequent gap values.

Ψ = 11 6 7 12 1 4 9 10 8 2 3 5

[Figure: the absolute samples sΨ taken from Ψ and the bit stream Δ that encodes the gaps between consecutive values]
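
The sampling plus gap idea can be sketched without the entropy coder. In this simplified version the gaps are kept as plain integers, so the Huffman and run-length codes described above are deliberately left out; the sample period k = 4 and the function names are assumptions for illustration. The Ψ values are the ones listed on the slide.

```python
def compress(psi, k):
    """Keep one absolute sample every k entries and the gaps in between."""
    samples = psi[::k]
    gaps = [psi[i] - psi[i - 1] for i in range(len(psi)) if i % k != 0]
    return samples, gaps


def access(samples, gaps, k, i):
    """Recover psi[i] (0-based i) from the nearest preceding sample."""
    block, offset = divmod(i, k)
    value = samples[block]
    start = block * (k - 1)  # every earlier block contributed k - 1 gaps
    for gap in gaps[start:start + offset]:
        value += gap
    return value


PSI = [11, 6, 7, 12, 1, 4, 9, 10, 8, 2, 3, 5]
K = 4  # hypothetical sample period
SAMPLES, GAPS = compress(PSI, K)
decoded = [access(SAMPLES, GAPS, K, i) for i in range(len(PSI))]
print(SAMPLES)  # [11, 1, 8]
print(GAPS)     # [-5, 1, 5, 3, 5, 1, -6, 1, 2]
print(decoded == PSI)
```

A larger k stores fewer samples but makes each access add up more gaps, which is the space/time trade-off controlled by the sample period.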

slide-38
SLIDE 38
Compressed Suffix Array (CSA) Complete structure

  • Ψ is stored with sampling + gap encoding, combined with one of:
  • delta codes*
  • Huffman-based codes
  • run encoding
  • Ψ (sampled), D, S → bsearch (count)

Parameters: space/time trade-off

  i    =  1  2  3  4  5  6  7  8  9 10 11 12
  T[i] =  a  b  r  a  c  a  d  a  b  r  a  $
  Ψ[i] =  4  1  7  8  9 10 11 12  6  3  2  5
  D[i] =  1  1  0  0  0  0  1  0  1  1  1  0

  S = $ a b c d r


slide-40
SLIDE 40

RDF-CSA

Let us reorganize an ID-based RDF dataset and then build a CSA on it (with a small modification of Ψ). Et voilà: we obtain an RDF self-index!

slide-41
SLIDE 41
RDF-CSA Construction process

  • Step 1 → Integer dictionary encoding of s, p, o
  • Step 2 → Ordered list of n triples (sequence of 3n elements)

We first sort by subject, then by predicate, and finally by object.

slide-42
SLIDE 42
RDF-CSA Construction process

  • Step 3 → Sid is transformed into S, in order to keep disjoint alphabets

Range [1, ns] for subjects
Range [ns + 1, ns + np] for predicates
Range [ns + np + 1, ns + np + no] for objects

Due to this alphabet mapping, every subject is smaller than every predicate, which in turn is smaller than every object!

slide-43
SLIDE 43
RDF-CSA Construction process

  • Step 4 → We build an iCSA on S
  • A has three ranges: each range points to suffixes starting with a subject, a predicate, or an object
  • Ψ cycles around the components of the same triple; that is, the object of triple k does not point to the subject of triple k+1 in S, but to the subject of the same triple → we can start at position A[i], pointing to any place within a triple (s,p,o), and recover the triple by successive applications of Ψ

Querying? → binary search + applying Ψ
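
Steps 2 to 4 can be sketched on the ID triples of the RDF management overview example (5 subjects, 6 predicates). This is a simplification, not the authors' iCSA: the suffix ordering is obtained by sorting each position by the cyclic rotation of its own triple, which is one way to make Ψ cycle inside a triple, and the arrays are kept uncompressed. All function names are illustrative.

```python
ID_TRIPLES = [(2, 3, 1), (1, 2, 5), (3, 1, 2), (5, 1, 2), (4, 1, 2),
              (4, 4, 5), (4, 6, 1), (3, 4, 3), (5, 4, 5), (5, 5, 4)]
NS, NP = 5, 6  # number of distinct subjects and predicates in this example


def build(triples, ns, np_):
    # Steps 2 and 3: sort by (s, p, o) and shift p and o into disjoint ranges.
    mapped = [(s, p + ns, o + ns + np_) for s, p, o in sorted(triples)]
    seq = [symbol for triple in mapped for symbol in triple]

    def rotation(pos):
        # Assumption of this sketch: a position is ranked by its own triple, rotated.
        t, c = divmod(pos, 3)
        return mapped[t][c:] + mapped[t][:c]

    A = sorted(range(len(seq)), key=rotation)
    where = {pos: i for i, pos in enumerate(A)}
    # Cyclic Ψ: the next component of the same triple (object goes back to subject).
    psi = [where[pos - pos % 3 + (pos + 1) % 3] for pos in A]
    return seq, A, psi


def prefix(index, i, m):
    """First m symbols reachable from entry i by following Ψ."""
    seq, A, psi = index
    out = []
    for _ in range(m):
        out.append(seq[A[i]])
        i = psi[i]
    return tuple(out)


def bsearch(index, pattern):
    """Half-open range [left, right) of entries whose prefix equals pattern."""
    m, n = len(pattern), len(index[1])
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(index, mid, m) < pattern:
            lo = mid + 1
        else:
            hi = mid
    left, hi = lo, n
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(index, mid, m) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return left, lo


def triple_at(index, i, ns, np_):
    """Recover the triple that entry i belongs to with three applications of Ψ."""
    s, p, o = sorted(prefix(index, i, 3))  # disjoint ranges: subject < predicate < object
    return (s, p - ns, o - ns - np_)


INDEX = build(ID_TRIPLES, NS, NP)
# Pattern (?, 1, 2), that is (?, attends, SPIRE): search the pair (P, O).
left, right = bsearch(INDEX, (1 + NS, 2 + NS + NP))
matches = [triple_at(INDEX, i, NS, NP) for i in range(left, right)]
print(matches)  # [(3, 1, 2), (4, 1, 2), (5, 1, 2)]
```

Because a triple can be recovered from any of its three entries, the same search routine serves patterns that start at a subject, a predicate or an object.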

slide-44
SLIDE 44
RDF-CSA Solving triple patterns

  • Patterns: (S,P,O), (?,P,O), (S,?,O), (S,P,?), (?,?,O), (S,?,?), (?,P,?), (?,?,?)
  • Pattern (?,?,?) retrieves all the triples, so it can be solved by retrieving every i-th triple, using Ψ
  • For the rest of the patterns: binary iCSA search
  • SPO → bsearch(SPO,3)
  • ?PO → bsearch(PO,2) … S?O → bsearch(OS,2)
  • Optimizations:
  • D-select+forward-check strategy
  • D-select+backward-check strategy

Optimizations are applicable to pattern (S,P,O), and to those with just one unbounded term!

[Figure: bitmap D and the iCSA intervals selected for S = 8, P = 4, O = 261 when solving an SPO pattern]

slide-45
SLIDE 45
RDF-CSA Solving triple patterns

  • (S,P,O) optimizations
  • D-select+forward-check strategy: find the valid intervals in the S, P and O ranges, and check matches with Ψ within those intervals, starting from the shortest one.
  • D-select+backward-check strategy: use binary search to limit the valid intervals, instead of sequentially verifying each position of the shortest interval.

[Figure: the S, P and O intervals for S = 8, P = 4, O = 261, narrowed first through SP and then SPO, and alternatively through PO and then SPO]

slide-46
SLIDE 46
RDF-CSA Experiments Dbpedia

[Figure: space (%) vs. time (µs per occurrence)]

slide-47
SLIDE 47
RDF-CSA Experiments Dbpedia

[Figure: space (%) vs. time (µs per occurrence)]

slide-48
SLIDE 48
RDF-CSA Summary

  • RDFCSA / K2-Triples / MonetDB / RDF-3X
  • SPACE
  • K2-Triples: around 30-40% of the size of the uncompressed triples
  • RDFCSA: around 60%.
  • Similar to MonetDB, 6 times less space than RDF-3X
  • QUERYING SPARQL triple patterns
  • K2-Triples: very fast at SPO.
  • RDFCSA: up to 3 orders of magnitude faster than MonetDB and RDF-3X
  • Faster than K2-Triples (in the most used SPARQL queries, it is up to 2 orders of magnitude faster)
  • Robust and predictable response times (~1-10 µs per recovered triple)

slide-49
SLIDE 49
Bibliography

1. Antonio Fariña, Nieves R. Brisaboa, Gonzalo Navarro, Francisco Claude, Ángeles S. Places, Eduardo Rodríguez: Word-based self-indexes for natural language text. ACM Trans. Inf. Syst. (TOIS) 30(1):1 (2012)
2. Nieves R. Brisaboa, Ana Cerdeira-Pena, Antonio Fariña, Gonzalo Navarro: A Compact RDF Store using Suffix Arrays. In Proc. of the 22nd Int. Symp. on String Processing and Information Retrieval (SPIRE 2015), LNCS 9309, Springer, London (United Kingdom), 2015, pp. 103-115
3. Cathrin Weiss, Panagiotis Karras, Abraham Bernstein: Hexastore: sextuple indexing for semantic web data management. PVLDB 1(1): 1008-1019 (2008)
4. Kunihiko Sadakane: New text indexing functionalities of the compressed suffix arrays. J. Algorithms 48(2): 294-313 (2003)
5. MonetDB (2013), http://www.monetdb.org
6. Nieves R. Brisaboa, Susana Ladra, Gonzalo Navarro: Compact representation of Web graphs with extended functionality. Inf. Syst. 39: 152-174 (2014)
7. Sandra Álvarez-García, Nieves R. Brisaboa, Javier D. Fernández, Miguel A. Martínez-Prieto, Gonzalo Navarro: Compressed vertical partitioning for efficient RDF management. Knowl. Inf. Syst. 44(2): 439-474 (2015)
8. Thomas Neumann, Gerhard Weikum: RDF-Stores und RDF-Query-Engines. Datenbank-Spektrum 11(1): 63-66 (2011)

slide-50
SLIDE 50


(Thanks: slides partially by: Susana Ladra, Sandra Álvarez, Ana Cerdeira-Pena)