Triples compression and Indexing
Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto
23RD AUGUST 2017
3rd KEYSTONE Training School Keyword search in Big Linked Data
Agenda
images: zurb.com
Recall that we can map the strings of an RDF dataset to IDs in a dictionary, and then handle a set of RDF triples as a set of ID-based triples
RDF management overview
BIG (LINKED) SEMANTIC DATA COMPRESSION PAGE 3
RDF management overview
[Figure: RDF graph of the example triples]
Original triples:
(SPIRE, held on, London)
(London, capital of, UK)
(A.Gionis, attends, SPIRE)
(R.Raman, attends, SPIRE)
(M.Lalmas, attends, SPIRE)
(M.Lalmas, lives in, UK)
(M.Lalmas, works in, London)
(A.Gionis, lives in, Finland)
(R.Raman, lives in, UK)
(R.Raman, position, inv-speaker)
Dictionary encoding:
SO (subject and object): London = 1, SPIRE = 2
S (subject only): A.Gionis = 3, M.Lalmas = 4, R.Raman = 5
O (object only): Finland = 3, inv-speaker = 4, UK = 5
P (predicates): attends = 1, capital of = 2, held on = 3, lives in = 4, position = 5, works in = 6
ID-based triples:
(2,3,1) (1,2,5) (3,1,2) (5,1,2) (4,1,2) (4,4,5) (4,6,1) (3,4,3) (5,4,5) (5,5,4)
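The dictionary encoding above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the case-insensitive alphabetical order used to assign IDs is an assumption chosen because it reproduces the IDs shown on the slide.

```python
TRIPLES = [
    ("SPIRE", "held on", "London"), ("London", "capital of", "UK"),
    ("A.Gionis", "attends", "SPIRE"), ("R.Raman", "attends", "SPIRE"),
    ("M.Lalmas", "attends", "SPIRE"), ("M.Lalmas", "lives in", "UK"),
    ("M.Lalmas", "works in", "London"), ("A.Gionis", "lives in", "Finland"),
    ("R.Raman", "lives in", "UK"), ("R.Raman", "position", "inv-speaker"),
]


def build_dictionary(triples):
    """Assign IDs using the SO / S / O / P layout (hypothetical helper)."""
    subjects = {s for s, _, _ in triples}
    objects = {o for _, _, o in triples}

    def order(terms):
        # Assumption: terms are numbered in case-insensitive alphabetical order.
        return sorted(terms, key=str.lower)

    shared = order(subjects & objects)   # SO: terms used as subject and as object
    only_s = order(subjects - objects)   # S: subject only
    only_o = order(objects - subjects)   # O: object only
    preds = order({p for _, p, _ in triples})

    so_ids = {t: i for i, t in enumerate(shared, start=1)}
    s_ids = dict(so_ids)
    s_ids.update((t, i) for i, t in enumerate(only_s, start=len(shared) + 1))
    o_ids = dict(so_ids)
    o_ids.update((t, i) for i, t in enumerate(only_o, start=len(shared) + 1))
    p_ids = {t: i for i, t in enumerate(preds, start=1)}
    return s_ids, p_ids, o_ids


s_ids, p_ids, o_ids = build_dictionary(TRIPLES)
id_triples = [(s_ids[s], p_ids[p], o_ids[o]) for s, p, o in TRIPLES]
print(id_triples)
```

Subjects and objects share the IDs of the SO range, so a term such as London gets the same ID whether it appears as subject or as object.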
A k2-tree permits a compact representation of an adjacency matrix.
K2-tree data structure
K2-Tree Motivation
[Figure: example graph with nodes 1 to 11 and its adjacency matrix, rows and columns labelled 1 to 11]
K2-Tree Construction process
Example with K=2
[Figure: adjacency matrix split recursively into K² = 4 submatrices, and the resulting K²-tree]
T = 1011 1101 0100 1000 1100 1000 0001 0101 1110
L = 0100 0011 0010 0010 1010 1000 0110 0010 0100
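As a rough sketch of the construction process, the following Python function builds the two bit sequences level by level: T receives the bits of the internal levels and L the bits of the last level. The function name and the small 4x4 matrix are made up for illustration, and the matrix side is assumed to be a power of K.

```python
def build_k2tree(matrix, k=2):
    """Return (T, L) for a square 0/1 matrix whose side is a power of k."""
    n = len(matrix)
    T, L = [], []
    level = [(0, 0, n)]  # non-empty submatrices still to split: (row, col, size)
    while level:
        next_level = []
        for r0, c0, size in level:
            child = size // k
            for i in range(k):
                for j in range(k):
                    r, c = r0 + i * child, c0 + j * child
                    # 1 if the child submatrix contains at least one 1
                    bit = int(any(matrix[x][y]
                                  for x in range(r, r + child)
                                  for y in range(c, c + child)))
                    if child == 1:
                        L.append(bit)       # last level: actual cells
                    else:
                        T.append(bit)       # internal level
                        if bit:
                            next_level.append((r, c, child))
        level = next_level
    return T, L


# Hypothetical 4x4 adjacency matrix (not the one from the slides).
M = [
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
T_bits, L_bits = build_k2tree(M)
print(T_bits, L_bits)
```

Empty submatrices cost a single 0 bit and are never expanded, which is where the compression of sparse matrices comes from.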
K2-Tree Direct neighbor operation
[Figure: 16x16 adjacency matrix (rows and columns 0 to 15) and its K²-tree]
T = 1011 1101 0100 1000 1100 1000 0001 0101 1110   (positions 0 to 35)
L = 0100 0011 0010 0010 1010 1000 0110 0010 0100   (positions 36 to 71)
children(2) = rank1(T,2) * K² = 2*4 = 8
children(9) = rank1(T,9) * K² = 7*4 = 28
children(31) = rank1(T,31) * K² = 14*4 = 56
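The navigation rule can be checked with a short Python sketch over the T and L bit strings of the slide. Here rank1 is a naive linear-time count (a real implementation answers it in constant time with small extra structures), and direct_neighbors is a hypothetical helper that assumes the matrix side is 16 and that positions start at 0, as on the slide.

```python
K = 2
T = "101111010100100011001000000101011110"
L = "010000110010001010101000011000100100"


def rank1(bits, i):
    """Number of 1s in bits[0..i], both ends included."""
    return bits[: i + 1].count("1")


def children(bits_t, i, k=K):
    """Position (in T followed by L) of the first child of the node at T[i]."""
    return rank1(bits_t, i) * k * k


def direct_neighbors(bits_t, bits_l, n, row, k=K):
    """Columns that hold a 1 in the given row of the n x n matrix."""
    tl = bits_t + bits_l
    found = []

    def visit(size, local_row, col0, first_child):
        child = size // k
        i = local_row // child              # which row of children to follow
        for j in range(k):
            p = first_child + i * k + j
            if tl[p] == "0":
                continue                    # empty submatrix: prune
            if p >= len(bits_t):            # position falls in L: a real cell
                found.append(col0 + j)
            else:
                visit(child, local_row % child, col0 + j * child,
                      children(bits_t, p, k))

    visit(n, row, 0, 0)
    return found


print(children(T, 2), children(T, 9), children(T, 31))
print(direct_neighbors(T, L, 16, 10))
```

Querying row 10 follows, among others, the path through positions 2, 9 and 31 used in the three children() computations above.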
k2-triples vertically partitions an RDF dataset by predicate. Then |P| k2-trees, one per predicate, represent all the triples: each tree stores the (subject, object) pairs of the triples involving its predicate.
K2-triples
K2-Triples Data structure
[Figure: RDF triples, the dictionary, and the resulting mapped triples]
K2-Triples Data structure
(S,P,O) triples: (8,5,4) (4,2,3) (4,4,6) (4,1,7) (7,2,3) (3,3,5) (5,2,1) (1,3,5) (6,2,2) (2,3,5)
[Figure: one k2-tree per predicate, P1 to P5; a triple (S,P,O) sets the cell at row S and column O of the tree of P, e.g. S=4, O=7 in P1]
K2-Triples
Query: (4,2,3)
k2-tree accessed: P2
Result: (4,2,3)
K2-Triples
Query: (4,2,?)
k2-tree accessed: P2
Result: (4,2,3)
K2-Triples
Query: (?,2,3)
k2-tree accessed: P2
Result: (4,2,3), (7,2,3)
K2-Triples
Query: (4,?,6)
k2-trees accessed: P1 P2 P3 P4 P5
Result: (4,4,6)
K2-Triples
Query: (4,?,?)
k2-trees accessed: P1 P2 P3 P4 P5
Result: (4,1,7), (4,2,3), (4,4,6)
K2-Triples
Query: (?,?,4)
k2-trees accessed: P1 P2 P3 P4 P5
Result: (8,5,4)
K2-Triples
Query: (?,2,?)
k2-tree accessed: P2
Result: (4,2,3), (5,2,1), (6,2,2), (7,2,3)
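The pattern queries above can be reproduced with a small Python sketch of vertical partitioning. As a loud simplification, a plain set of (subject, object) pairs stands in for each k2-tree, so the sketch shows which trees are visited per pattern, not the compact representation itself; all names are hypothetical.

```python
from collections import defaultdict

TRIPLES = [(8, 5, 4), (4, 2, 3), (4, 4, 6), (4, 1, 7), (7, 2, 3),
           (3, 3, 5), (5, 2, 1), (1, 3, 5), (6, 2, 2), (2, 3, 5)]


def vertical_partition(triples):
    """One (subject, object) set per predicate; stand-in for one k2-tree each."""
    trees = defaultdict(set)
    for s, p, o in triples:
        trees[p].add((s, o))
    return trees


def solve(trees, s=None, p=None, o=None):
    """Resolve a triple pattern; None marks an unbound term ('?')."""
    # Bound predicate: a single tree. Unbound predicate: every tree is visited.
    preds = [p] if p is not None else sorted(trees)
    results = []
    for pred in preds:
        # In a k2-tree this scan would be a cell check (S and O bound),
        # a row or column traversal (one bound), or a full range query.
        for ss, oo in sorted(trees.get(pred, ())):
            if (s is None or ss == s) and (o is None or oo == o):
                results.append((ss, pred, oo))
    return results


trees = vertical_partition(TRIPLES)
print(solve(trees, s=4, p=2))   # pattern (4,2,?)
print(solve(trees, s=4))        # pattern (4,?,?)
```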
K2-Triples SP & OP indexes
(S,P,O) triples: (8,5,4) (4,2,3) (4,4,6) (4,1,7) (7,2,3) (3,3,5) (5,2,1) (1,3,5) (6,2,2) (2,3,5)
SP INDEX
S    Predicates
1    3
2    3
3    3
4    1,2,4
5    2
6    2
7    2
8    5
Statistically compressed
Direct access with DAC
K2-Triples SP & OP indexes
[Figure: the k2-trees P1 to P5 next to the SP INDEX]
Subject 4? Predicate list: 1,2,4
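A minimal Python sketch of the SP and OP indexes follows. It keeps plain sorted lists; the statistical compression and the DAC-based direct access mentioned on the slide are not modelled, and the function name is hypothetical.

```python
from collections import defaultdict

TRIPLES = [(8, 5, 4), (4, 2, 3), (4, 4, 6), (4, 1, 7), (7, 2, 3),
           (3, 3, 5), (5, 2, 1), (1, 3, 5), (6, 2, 2), (2, 3, 5)]


def build_sp_op(triples):
    """SP: subject -> predicates it appears with; OP: the same for objects."""
    sp, op = defaultdict(set), defaultdict(set)
    for s, p, o in triples:
        sp[s].add(p)
        op[o].add(p)
    return ({s: sorted(ps) for s, ps in sp.items()},
            {o: sorted(ps) for o, ps in op.items()})


sp_index, op_index = build_sp_op(TRIPLES)

# Pattern (4,?,?): only the trees listed for subject 4 need to be visited.
print(sp_index[4])

# Pattern (4,?,6): intersect both lists before touching any tree.
candidates = sorted(set(sp_index[4]) & set(op_index[6]))
print(candidates)
```

With these lists, patterns with an unbound predicate no longer have to visit all |P| trees.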
K2-Triples Joins
Query: (8,5,?X) (?X,2,?)
Join strategies: merge-join, index-join
Best strategy depends on the dataset and the type of join
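The two strategies named on the slide can be illustrated on the join (8,5,?X) (?X,2,?) with plain Python lists. This is a sketch under simplifying assumptions: match() is a hypothetical linear scan standing in for the k2-tree operations, and only this subject-object join shape is handled.

```python
TRIPLES = [(8, 5, 4), (4, 2, 3), (4, 4, 6), (4, 1, 7), (7, 2, 3),
           (3, 3, 5), (5, 2, 1), (1, 3, 5), (6, 2, 2), (2, 3, 5)]


def match(triples, s=None, p=None, o=None):
    """Triples matching a pattern; None marks an unbound term."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


def index_join(triples, left, right_pred):
    """Solve the left pattern, then bind each ?X as subject of the right one."""
    out = []
    for _, _, x in match(triples, *left):
        out.extend(match(triples, s=x, p=right_pred))
    return out


def merge_join(triples, left, right_pred):
    """Solve both patterns independently and merge the sorted ?X candidates."""
    xs_left = sorted({o for _, _, o in match(triples, *left)})
    right = sorted(match(triples, p=right_pred))
    xs_right = sorted({s for s, _, _ in right})
    i = j = 0
    common = set()
    while i < len(xs_left) and j < len(xs_right):
        if xs_left[i] == xs_right[j]:
            common.add(xs_left[i])
            i += 1
            j += 1
        elif xs_left[i] < xs_right[j]:
            i += 1
        else:
            j += 1
    return [t for t in right if t[0] in common]


by_index = index_join(TRIPLES, (8, 5, None), 2)
by_merge = merge_join(TRIPLES, (8, 5, None), 2)
print(by_index, by_merge)
```

Here the left pattern yields a single binding for ?X, so the index-join touches far fewer triples than the merge-join, which first lists every triple of predicate 2.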
K2-Triples Experiments
Dataset    Size(MB)     #Triples       #Predicates   #Subjects     #Objects
Jamendo    144.18       1,049,639      28            335,926       440,604
DBLP       7.58         46,597,620     27            2,840,639     19,639,731
Geonames   12,347.70    112,235,492    26            8,147,136     41,111,569
Dbpedia    33,912.71    232,542,405    39,672        18,425,128    65,200,769

Dataset    MonetDB     RDF-3X      Hexastore   K2-triples   K2-triples+
Jamendo    8.76        37.73       1,371.25    0.74         1.28
DBLP       358.44      1,643.31    -           82.48        99.24
Geonames   859.66      3,584.80    -           152.20       188.63
Dbpedia    1,811.74    9,757.58    -           931.44       1178.38
CSA is a suffix-array-based self-index. It reduces the huge space needs of the suffix array and still allows fast binary-search-type lookups
The compressed Suffix Array
Compressed Suffix Array (CSA) Back to Suffix Arrays
i =  1  2  3  4  5  6  7  8  9 10 11 12
T =  a  b  r  a  c  a  d  a  b  r  a  $
A = 12 11  8  1  4  6  9  2  5  7 10  3
P = a b
Noccs = (4-3)+1 = 2
Occs (locations) = A[3] .. A[4] = {8, 1}
Fast: O(m lg n) to count, O(m lg n + noccs) to locate
Space: O(4n) + |T|
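The search for P = ab can be sketched in Python with a suffix array built by plain sorting, which is only reasonable for a toy text like this one. Positions are 1-based to match the slide; the function names are hypothetical.

```python
TEXT = "abracadabra$"   # '$' sorts before every letter


def suffix_array(text):
    """1-based suffix array, built by sorting the suffixes directly."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def sa_range(text, sa, pattern):
    """1-based inclusive range [first, last] in A of the suffixes that start
    with pattern, found with two binary searches; None if there is no match."""
    m = len(pattern)

    def prefix(k):
        start = sa[k] - 1
        return text[start:start + m]

    lo, hi = 0, len(sa)
    while lo < hi:                      # first suffix with prefix >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:                      # first suffix with prefix > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    if first == lo:
        return None
    return first + 1, lo


A = suffix_array(TEXT)
first, last = sa_range(TEXT, A, "ab")
occs = A[first - 1:last]
print(A, (first, last), occs)
```

Each binary search performs O(lg n) comparisons of at most m symbols, and listing the occurrences costs one access to A per occurrence.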
Compressed Suffix Array (CSA) CSA Basics
T = a b r a c a d a b r a $
P = a b
 i   A[i]   suffix
 1   12     $
 2   11     a$
 3    8     abra$
 4    1     abracadabra$
 5    4     acadabra$
 6    6     adabra$
 7    9     bra$
 8    2     bracadabra$
 9    5     cadabra$
10    7     dabra$
11   10     ra$
12    3     racadabra$
Compressed Suffix Array (CSA) CSA Basics
i =  1  2  3  4  5  6  7  8  9 10 11 12
T =  a  b  r  a  c  a  d  a  b  r  a  $
A = 12 11  8  1  4  6  9  2  5  7 10  3
Ψ[i] is the position in A of the suffix that follows suffix A[i] in T.
Example: A[10] = 7 and the next suffix, T[8..], is found at A[3] = 8, so Ψ[10] = 3
Compressed Suffix Array (CSA) CSA Basics
… (P=“ada…”) and to recover the source data!!
i =  1  2  3  4  5  6  7  8  9 10 11 12
T =  a  b  r  a  c  a  d  a  b  r  a  $
A = 12 11  8  1  4  6  9  2  5  7 10  3
Ψ =  4  1  7  8  9 10 11 12  6  3  2  5
F =  $  a  a  a  a  a  b  b  c  d  r  r
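A short Python sketch shows how Ψ can be derived from A and how the source text can be recovered from Ψ and F alone. It assumes the cyclic definition in which the suffix "$" is followed by the whole text; the function names are hypothetical and the lists store the 1-based values of the slide.

```python
TEXT = "abracadabra$"


def suffix_array(text):
    """1-based suffix array, built by sorting the suffixes directly."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def build_psi(sa):
    """psi[i-1] = 1-based position in A of the suffix that follows A[i] in T."""
    n = len(sa)
    inverse = [0] * (n + 1)             # inverse[pos] = i such that A[i] = pos
    for i, pos in enumerate(sa, start=1):
        inverse[pos] = i
    # pos % n + 1 is pos + 1, wrapping from the last text position to the first
    return [inverse[pos % n + 1] for pos in sa]


def recover_text(psi, F):
    """Rebuild T without A or T: follow psi and read the first symbols in F."""
    i = psi[0]                          # psi[1] points at the suffix starting at T[1]
    out = []
    for _ in range(len(psi)):
        out.append(F[i - 1])
        i = psi[i - 1]
    return "".join(out)


A = suffix_array(TEXT)
PSI = build_psi(A)
F = sorted(TEXT)                        # first symbol of each sorted suffix
recovered = recover_text(PSI, F)
print(PSI, recovered)
```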
Compressed Suffix Array (CSA) CSA Basics
i =  1  2  3  4  5  6  7  8  9 10 11 12
F =  $  a  a  a  a  a  b  b  c  d  r  r
D =  1  1  0  0  0  0  1  0  1  1  1  0   (bitmap)
S =  $  a  b  c  d  r                     (sorted alphabet, positions 1 to 6)
Compressed Suffix Array (CSA) Representing F
i =  1  2  3  4  5  6  7  8  9 10 11 12
F =  $  a  a  a  a  a  b  b  c  d  r  r
D =  1  1  0  0  0  0  1  0  1  1  1  0   (bitmap)
S =  $  a  b  c  d  r                     (sorted alphabet, positions 1 to 6)
rank1(D,i): time O(1), by using o(n) extra space
Example: rank1(D, 8) = 3, so F[8] = S[3] = b
Compressed Suffix Array (CSA) Compressing Ψ
i  =  1  2  3  4  5  6  7  8  9 10 11 12
Ψ  = 11  6  7 12  1  4  9 10  8  2  3  5
sΨ = 11 1 1 18 8 32
Δ  = 1 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 1 1 1 0 1 1 0 1 0 0 1 0 1 0 1 0 0 1 0 0 1 0 0
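The idea of sampling plus gap encoding can be sketched in Python as follows. This is a simplification under stated assumptions: the sampling period t = 4 is chosen because it yields the samples 11, 1 and 8, the gaps are kept as plain integers instead of being written with a variable-length bit code, and the function names are hypothetical.

```python
PSI = [11, 6, 7, 12, 1, 4, 9, 10, 8, 2, 3, 5]   # values from the slide


def compress_psi(psi, t=4):
    """Keep every t-th value as an absolute sample; store the rest as gaps."""
    samples = psi[::t]
    gaps = [psi[i] - psi[i - 1] for i in range(len(psi)) if i % t != 0]
    return samples, gaps


def access_psi(samples, gaps, i, t=4):
    """Return the i-th value (0-based): start at a sample and add up gaps."""
    block, offset = divmod(i, t)
    value = samples[block]
    start = block * (t - 1)             # each full block stores t - 1 gaps
    for gap in gaps[start:start + offset]:
        value += gap
    return value


samples, gaps = compress_psi(PSI)
decoded = [access_psi(samples, gaps, i) for i in range(len(PSI))]
print(samples, gaps)
```

A larger t stores fewer absolute samples but makes each access add up more gaps, which is the kind of space/time parameter the structure exposes.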
Compressed Suffix Array (CSA) Complete structure
[Figure: complete CSA structure; labels: sampling, sampling, sampling + gap encoding + …]
– Ψ (sampled), D, S: bsearch (count)
i =  1  2  3  4  5  6  7  8  9 10 11 12
T =  a  b  r  a  c  a  d  a  b  r  a  $
Ψ =  4  1  7  8  9 10 11 12  6  3  2  5
D =  1  1  0  0  0  0  1  0  1  1  1  0
S =  $  a  b  c  d  r
Parameters: space/time “trade-off”
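Counting with only Ψ, D and S can be sketched in Python as below. The text is used once to build the three structures and never again; Ψ is kept as an uncompressed list, rank1 is a naive sum, and positions are 0-based internally. All function names are hypothetical.

```python
def build_csa(text):
    """Return (psi, D, S) for a text ending in a unique smallest symbol."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])
    inverse = [0] * n
    for i, pos in enumerate(sa):
        inverse[pos] = i
    psi = [inverse[(pos + 1) % n] for pos in sa]
    F = sorted(text)
    D = [1 if i == 0 or F[i] != F[i - 1] else 0 for i in range(n)]
    S = sorted(set(text))
    return psi, D, S


def symbol(D, S, i):
    """First symbol of the i-th suffix: S[rank1(D, i)], with a naive rank."""
    return S[sum(D[:i + 1]) - 1]


def extract(psi, D, S, i, m):
    """First m symbols of the i-th suffix, read by following psi."""
    out = []
    for _ in range(m):
        out.append(symbol(D, S, i))
        i = psi[i]
    return "".join(out)


def count_range(psi, D, S, pattern):
    """1-based inclusive range of suffixes starting with pattern."""
    m, lo, hi = len(pattern), 0, len(psi)
    while lo < hi:
        mid = (lo + hi) // 2
        if extract(psi, D, S, mid, m) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, len(psi)
    while lo < hi:
        mid = (lo + hi) // 2
        if extract(psi, D, S, mid, m) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return first + 1, lo                # occurrences = lo - first


psi, D, S = build_csa("abracadabra$")
print(count_range(psi, D, S, "ab"), count_range(psi, D, S, "ada"))
```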
Let us reorganize an ID-based RDF dataset. Then, let us build a CSA on it (with a small modification of Ψ), et voilà… we obtain an RDF self-index!
RDF-CSA
RDF-CSA Construction process
We first sort the triples by subject, then by predicate, and finally by object
…
RDF-CSA Construction process
… alphabets
Range [1, ns] for subjects
Range [ns+1, ns+np] for predicates
Range [ns+np+1, ns+np+no] for objects
Due to this alphabet mapping, every subject is smaller than every predicate, which in turn is smaller than every object!
RDF-CSA Construction process
… point to the subject of the triple k+1 in S, but to the subject of the same triple
… we can start at position A[i], pointing to any place within a triple (s,p,o), and recover the triple by successive applications of Ψ
Querying? binary search + applying Ψ
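The construction can be sketched in Python on a small set of ID-based triples. The sketch makes several assumptions: IDs run from 1 to the maximum value seen, Ψ and F are plain uncompressed lists, the suffix array is built by naive sorting, and all names are hypothetical. The one deliberate change to Ψ is that object positions point back to the subject of their own triple.

```python
TRIPLES = [(8, 5, 4), (4, 2, 3), (4, 4, 6), (4, 1, 7), (7, 2, 3),
           (3, 3, 5), (5, 2, 1), (1, 3, 5), (6, 2, 2), (2, 3, 5)]


def build_rdfcsa(triples):
    """Return (psi, F, ns, np_) for the sorted, alphabet-shifted sequence."""
    triples = sorted(triples)                    # by subject, predicate, object
    ns = max(s for s, _, _ in triples)           # assumes subject IDs 1..ns
    np_ = max(p for _, p, _ in triples)          # assumes predicate IDs 1..np
    seq = []
    for s, p, o in triples:
        seq += [s, ns + p, ns + np_ + o]         # disjoint alphabets
    n = len(seq)
    sa = sorted(range(n), key=lambda i: seq[i:])
    inverse = [0] * n
    for i, pos in enumerate(sa):
        inverse[pos] = i
    psi = []
    for pos in sa:
        if pos % 3 == 2:                         # object: back to its own subject
            psi.append(inverse[pos - 2])
        else:                                    # subject or predicate: next term
            psi.append(inverse[pos + 1])
    return psi, sorted(seq), ns, np_


def recover_triple(psi, F, ns, np_, i):
    """Recover the triple reachable from any position i of the index."""
    n = len(psi) // 3
    while i >= n:                                # subjects occupy positions 0..n-1
        i = psi[i]
    s = F[i]
    i = psi[i]
    p = F[i] - ns
    i = psi[i]
    o = F[i] - ns - np_
    return s, p, o


psi, F, ns, np_ = build_rdfcsa(TRIPLES)
print(recover_triple(psi, F, ns, np_, 0), recover_triple(psi, F, ns, np_, 25))
```

Because Ψ now cycles within each triple, three applications starting anywhere return to the starting position.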
RDF-CSA Solving triple patterns
… (S,P,O), (?,P,O), (S,?,O), (S,P,?), (?,?,O), (S,?,?), (?,P,?), (?,?,?)
… triple, using Ψ
Optimizations are applicable to pattern (S,P,O), and to those with just one unbound term!
[Figure: intervals for S=8, P=4, O=261 when solving (S,P,O), with bitmap D]
RDF-CSA Solving triple patterns
… and check matches with Ψ into those intervals, starting from the shortest one.
… instead of sequentially verifying each position of the shortest interval.
[Figure: intervals for S=8, P=4, O=261; panels labelled SP, SPO and SPO, PO]
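The strategy of locating one interval per bound term and checking matches with Ψ from the shortest interval can be sketched in Python. The sketch verifies every position of the shortest interval sequentially, which is the simpler of the two variants mentioned on the slide; Ψ and F are uncompressed lists, IDs are assumed to run from 1 to the maximum value seen, and all names are hypothetical.

```python
from bisect import bisect_left, bisect_right

TRIPLES = [(8, 5, 4), (4, 2, 3), (4, 4, 6), (4, 1, 7), (7, 2, 3),
           (3, 3, 5), (5, 2, 1), (1, 3, 5), (6, 2, 2), (2, 3, 5)]


def build_rdfcsa(triples):
    triples = sorted(triples)
    ns = max(s for s, _, _ in triples)
    np_ = max(p for _, p, _ in triples)
    seq = [x for s, p, o in triples for x in (s, ns + p, ns + np_ + o)]
    sa = sorted(range(len(seq)), key=lambda i: seq[i:])
    inverse = [0] * len(seq)
    for i, pos in enumerate(sa):
        inverse[pos] = i
    # Objects (pos % 3 == 2) point back to the subject of their own triple.
    psi = [inverse[pos - 2] if pos % 3 == 2 else inverse[pos + 1] for pos in sa]
    return psi, sorted(seq), ns, np_


def solve(index, s=None, p=None, o=None):
    """Resolve a triple pattern; None marks an unbound term."""
    psi, F, ns, np_ = index
    bound = []                                   # (role, [lo, hi) interval)
    for role, symbol in ((0, s), (1, p and ns + p), (2, o and ns + np_ + o)):
        if symbol is not None:
            bound.append((role, (bisect_left(F, symbol), bisect_right(F, symbol))))
    # Start from the shortest interval; with no bound term, scan all subjects.
    start_role, (lo, hi) = min(bound, key=lambda b: b[1][1] - b[1][0],
                               default=(0, (0, len(psi) // 3)))
    results = []
    for i in range(lo, hi):
        pos, role, j = [0, 0, 0], start_role, i
        for _ in range(3):                       # walk the cycle of this triple
            pos[role] = j
            j = psi[j]
            role = (role + 1) % 3
        if all(a <= pos[r] < b for r, (a, b) in bound):
            results.append((F[pos[0]], F[pos[1]] - ns, F[pos[2]] - ns - np_))
    return sorted(results)


INDEX = build_rdfcsa(TRIPLES)
print(solve(INDEX, s=4, p=2), solve(INDEX, p=2, o=3))
```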
RDF-CSA Experiments Dbpedia
[Figures: space (%) vs. time (micros/occ)]
RDF-CSA Summary
… magnitude faster)
Bibliography
1. Antonio Fariña, Nieves R. Brisaboa, Gonzalo Navarro, Francisco Claude, Ángeles S. Places, Eduardo Rodríguez: Word-based self-indexes for natural language text. ACM Trans. Inf. Syst. (TOIS) 30(1):1 (2012)
2. Nieves R. Brisaboa, Ana Cerdeira-Pena, Antonio Fariña, Gonzalo Navarro: A Compact RDF Store using Suffix Arrays. In Proc. … London (United Kingdom), 2015, pp. 103-115.
3. Cathrin Weiss, Panagiotis Karras, Abraham Bernstein: Hexastore: sextuple indexing for semantic web data …
4. Kunihiko Sadakane: New text indexing functionalities of the compressed suffix arrays. J. Algorithms 48(2): 294-313 (2003)
5. MonetDB (2013), http://www.monetdb.org
6. Nieves R. Brisaboa, Susana Ladra, Gonzalo Navarro: Compact representation of Web graphs with extended …
7. Sandra Álvarez-García, Nieves R. Brisaboa, Javier D. Fernández, Miguel A. Martínez-Prieto, Gonzalo Navarro: Compressed vertical partitioning for efficient RDF management. Knowl. Inf. Syst. 44(2): 439-474 (2015)
8. Thomas Neumann, Gerhard Weikum: RDF-Stores und RDF-Query-Engines. Datenbank-Spektrum 11(1): 63-66 (2011)
(Thanks: slides partially by: Susana Ladra, Sandra Álvarez, Ana Cerdeira-Pena)