
SLIDE 1

Compressed RDF: Practical Uses & Hands-on

Antonio Fariña, Javier D. Fernández and Miguel A. Martínez-Prieto

23RD AUGUST 2017

3rd KEYSTONE Training School Keyword search in Big Linked Data

SLIDE 2
  • Session I (09:00 - 10:30) "Basics of Compression for Big Linked Data Management"
    • Big (Linked) Semantic Data Compression: motivation & challenges
    • Compact Data Structures
  • Session II (13:30 - 15:00) "RDF Compression"
    • RDF Compression. HDT
    • RDF Dictionaries
    • RDF Triples
  • Session III (15:30 - 17:00) "Compressed RDF: Practical Uses & Hands-on"
    • Practical Uses (LOD-a-lot, RDF Archiving, etc.)
    • Hands on


General agenda


SLIDE 3
  • Practical uses
    • LOD-a-lot: Web-scale queries in your pocket
    • RDF archiving
    • Linked Data markets (Linked Closed Data)
  • Hands on
    • HDT-it!
    • Command line tools
    • HDT and Fuseki
    • HDT and Linked Data Fragments
    • HDT and C++/Java
    • HDT and Jena


Agenda of this session


SLIDE 4

LOD-a-lot

Use case 1

SLIDE 5
  • E.g. retrieve all entities in LOD with the label "Axel Polleres"
  • Options:
    • Crawl and index LOD locally (-no-)
    • Follow-your-nose (where should I start?)
    • Federated querying (as good as the endpoints you query)
    • Use LOD Laundromat as a "good approximation" (still querying 650K datasets)


Still… what about Web-scale queries?

select distinct ?x { ?x rdfs:label "Axel Polleres" }

SLIDE 6


[Diagram: LOD Laundromat crawls the Linked Open Data cloud, cleans each of its 650K datasets (Dataset 1 … Dataset 650K) into N-Triples (zip) dumps, and offers a SPARQL endpoint over the metadata]

SLIDE 7

LOD-a-lot

But what about Web-scale queries?

- flashback -
SLIDE 8

The real motivation

consume

SLIDE 9

The real motivation


Oh man, I’m hungry and I don’t even know if I will like whatever you are cooking

consume


SLIDE 11

LOD-a-lot


But what about Web-scale queries? And one could be really hungry…


SLIDE 12

[Diagram: the 650K cleaned N-Triples (zip) datasets of LOD Laundromat, plus its SPARQL endpoint (metadata), are integrated into a single HDT file: LOD-a-lot]

Kudos: Javier D. Fernández, Wouter Beek, Miguel A. Martínez-Prieto, and Mario Arias

28B triples

SLIDE 13
  • Disk size:
    • HDT: 304 GB
    • HDT-FoQ (additional indexes): 133 GB
  • Memory footprint (to query):
    • 15.7 GB of RAM (3% of the size)
    • 144 seconds loading time
    • 8 cores (2.6 GHz), 32 GB RAM, SATA HDD on Ubuntu 14.04.5 LTS
  • LDF page resolution in milliseconds.


LOD-a-lot (some numbers)

305€ (approximate cost of a machine able to query it)

(LOD-a-lot creation took 64 h & 170GB RAM. HDT-FoQ took 8 h & 250GB RAM)

SLIDE 14


LOD-a-lot

https://datahub.io/dataset/lod-a-lot
http://purl.org/HDT/lod-a-lot

SLIDE 15
  • Query resolution at Web scale
  • Evaluation and Benchmarking
  • No excuse :)
  • RDF metrics and analytics


LOD-a-lot (some use cases)

[Charts: distribution of subjects, predicates and objects]
SLIDE 16


ACKs LOD-a-lot

SLIDE 17

Archiving

Use case 2

SLIDE 18

Andreas Harth, "Stream Reasoning in Mixed Reality Applications", Stream Reasoning Workshop 2015

So far so good... But RDF is evolving

[Chart: number of sources (10^0 to 10^6) vs. update rate (from yearly down to per-second), placing DBpedia, BTC, Dyldo, the Internet of Things and Virtual/Augmented Reality; and where do versions fit? What about LOD-a-lot?]

SLIDE 19


Most semantic Web/Linked Data tools are focused on this "static view" but do not consider versioning/evolution

Linked Data Archives: The missing link in the RDF evolution

Sindice, SWSE, Swoogle, LOD Cache, LOD-Laundromat… so far, no versions!

SLIDE 20
  • Web archives: Common Crawl, Internet Memory, Internet Archive, …


Preservation matters

SLIDE 21


…in the last few years:

  • Research projects: Managing the Evolution and Preservation of the Data Web (FP7), Preserving Linked Data (FP7)
  • Archives and tools: v-RDFCSA, RDF evolution at scale
  • Benchmarking: BEAR (BEnchmark of RDF ARchives)

One of the fundamental problems in the Web of Data.


SLIDE 23


RDF Archiving. Archiving policies

a) Independent Copies/Snapshots (IC): every version is stored in full.
   V1: ex:C1 ex:hasProfessor ex:P1 . ex:S1 ex:study ex:C1 . ex:S2 ex:study ex:C1 .
   V2: ex:C1 ex:hasProfessor ex:P1 . ex:S1 ex:study ex:C1 . ex:S3 ex:study ex:C1 .
   V3: ex:C1 ex:hasProfessor ex:P2 . ex:C1 ex:hasProfessor ex:S2 . ex:S1 ex:study ex:C1 . ex:S3 ex:study ex:C1 .

b) Change-based approach (CB): V1 is stored in full; each subsequent version stores only the triples added and deleted with respect to the previous one, and a RETRIEVAL MEDIATOR replays the deltas to reconstruct a version.

c) Timestamp-based approach (TB): each distinct triple is stored once, annotated with the versions in which it holds (a RETRIEVAL MEDIATOR filters by version):
   ex:C1 ex:hasProfessor ex:P1 [V1,V2]. ex:C1 ex:hasProfessor ex:P2 [V3]. ex:C1 ex:hasProfessor ex:S2 [V3]. ex:S1 ex:study ex:C1 [V1,V2,V3]. ex:S2 ex:study ex:C1 [V1]. ex:S3 ex:study ex:C1 [V2,V3].
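To make the CB policy concrete, here is a toy sketch (not from the slides; triples are plain strings and all names are made up) of how a retrieval mediator materializes a version from the V1 snapshot plus the per-version deltas above:

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CBMediator {

    // Materialize version v (1-based) from the full V1 snapshot
    // by replaying the added/deleted deltas of V2..Vv
    static Set<String> materialize(Set<String> v1,
                                   List<Set<String>> added,
                                   List<Set<String>> deleted,
                                   int v) {
        Set<String> result = new HashSet<>(v1);
        for (int i = 0; i < v - 1; i++) {
            result.addAll(added.get(i));
            result.removeAll(deleted.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> v1 = new HashSet<>(Arrays.asList(
                "ex:C1 ex:hasProfessor ex:P1", "ex:S1 ex:study ex:C1", "ex:S2 ex:study ex:C1"));
        // Deltas V1->V2 and V2->V3, as in the example above
        List<Set<String>> added = Arrays.asList(
                new HashSet<>(Arrays.asList("ex:S3 ex:study ex:C1")),
                new HashSet<>(Arrays.asList("ex:C1 ex:hasProfessor ex:P2", "ex:C1 ex:hasProfessor ex:S2")));
        List<Set<String>> deleted = Arrays.asList(
                new HashSet<>(Arrays.asList("ex:S2 ex:study ex:C1")),
                new HashSet<>(Arrays.asList("ex:C1 ex:hasProfessor ex:P1")));
        System.out.println(materialize(v1, added, deleted, 3)); // prints the V3 triples
    }
}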

SLIDE 24


BEAR

https://aic.ai.wu.ac.at/qadlod/bear.html

SLIDE 25
  • Queries and systems
  • We implemented and evaluated archiving systems on Jena TDB and HDT, based on the IC, CB and TB policies.
  • These serve as an initial baseline to compare archiving systems
  • More info: https://aic.ai.wu.ac.at/qadlod/bear.html


BEAR: Benchmarking the Efficiency of RDF Archiving

SLIDE 26

RDF Archiving. Archiving policies (recap: the IC, CB and TB policies of SLIDE 23)

SLIDE 27
  • Instantiation of archive queries in AnQL [1]
  • Mat(Q,V1): version materialization
  • Diff(Q,V1,V2)
  • Ver(Q)
  • join(Q1,vi,Q2,vj)
  • Change(Q)


Benchmarking: Define the queries

SELECT * WHERE { Q :[v1] }

[1] Antoine Zimmermann, Nuno Lopes, Axel Polleres, and Umberto Straccia. A general framework for representing, reasoning and querying with annotated Semantic Web data. Journal of Web Semantics (JWS), 12:72–95, March 2012.

SLIDE 28
  • Instantiation of archive queries in AnQL
  • Mat(Q,V1)
  • Diff(Q,V1,V2): delta materialization
  • Ver(Q)
  • join(Q1,vi,Q2,vj)
  • Change(Q)


Benchmarking: Define the queries

SELECT * WHERE { { { {Q :[v1]} MINUS {Q :[v2]} } BIND (v1 AS ?V) } UNION { { {Q :[v2]} MINUS {Q :[v1]} } BIND (v2 AS ?V) } }

SLIDE 29
  • Instantiation of archive queries in AnQL
  • Mat(Q,V1)
  • Diff(Q,V1,V2)
  • Ver(Q): results of Q annotated with the version
  • join(Q1,vi,Q2,vj)
  • Change(Q)


Benchmarking: Define the queries

SELECT * WHERE { Q :?V }

SLIDE 30
  • Instantiation of archive queries in AnQL
  • Mat(Q,V1)
  • Diff(Q,V1,V2)
  • Ver(Q)
  • join(Q1,v1,Q2,v2): join the results of Q1 in version v1 with those of Q2 in version v2
  • Change(Q)


Benchmarking: Define the queries

SELECT * WHERE { {Q1 :[v1]} {Q2 :[v2]} }

SLIDE 31
  • Instantiation of archive queries in AnQL
  • Mat(Q,V1)
  • Diff(Q,V1,V2)
  • Ver(Q)
  • join(Q1,vi,Q2,vj)
  • Change(Q): returns the consecutive versions in which the Diff of a query is not empty


Benchmarking: Define the queries

SELECT ?V1 ?V2 WHERE { {{Q :?V1 } MINUS {Q :?V2}} UNION {{Q :?V2 } MINUS {Q :?V1}} FILTER( abs(?V1-?V2) = 1 ) }

An open question remains: what is the right query syntax for archive queries?

SLIDE 32


Time-based access. Queries

Materialize (s,?,? ; version)

SLIDE 33


Time-based access. Queries

diff(?,?,o ; version0 ; version t)

SLIDE 34
  • RDFCSA: Compressed Suffix Array
  • v-RDFCSA [2] is designed as a lightweight TB approach
  • Version information encoding:
    • Any triple can be identified by the position of its subject within the suffix array
    • Let N be the number of distinct versions and n the number of version-oblivious triples
    • Two encoding strategies:
      • tpv (triples per version): N bitsequences Bv_j[1, n] encoding which triples appear in version j
      • vpt (versions per triple): n bitsequences Bt_k[1, N] encoding the versions in which the k-th triple occurs

Self-Indexing RDF Archives: v-RDFCSA

[Figure: tpv stores one bitsequence per version (Bv_1, Bv_2, Bv_3) over the triples 1..5; vpt stores one bitsequence per triple (Bt_1 .. Bt_5) over the versions 1..3]

[2] Ana Cerdeira-Pena, Antonio Fariña, Javier D. Fernández, and Miguel A. Martínez-Prieto. Self-Indexing RDF Archives. Data Compression Conference (DCC), 2016.

v-RDFCSA performs more than one order of magnitude faster than Jena TDB for query resolution
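As a toy illustration of the duality between the two encodings, here is a sketch using plain java.util.BitSet stand-ins over a made-up archive (v-RDFCSA itself uses compressed bitsequences, not BitSets):

import java.util.BitSet;

public class VersionBitmaps {
    public static void main(String[] args) {
        final int n = 5; // version-oblivious triples (0-based ids 0..4)
        final int N = 3; // versions (0-based ids 0..2)

        // tpv: one bitsequence per version; bit k set iff triple k is alive in that version
        BitSet[] tpv = new BitSet[N];
        // vpt: one bitsequence per triple; bit j set iff the triple is alive in version j
        BitSet[] vpt = new BitSet[n];
        for (int j = 0; j < N; j++) tpv[j] = new BitSet(n);
        for (int k = 0; k < n; k++) vpt[k] = new BitSet(N);

        // Made-up archive: triple k enters at version (k % N) and never leaves
        for (int k = 0; k < n; k++) {
            for (int j = k % N; j < N; j++) {
                tpv[j].set(k);
                vpt[k].set(j);
            }
        }

        System.out.println("Triples alive in V2: " + tpv[1]);  // Mat-style question
        System.out.println("Versions of triple 3: " + vpt[2]); // Ver-style question
    }
}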

SLIDE 35

Linked Open/Closed Data (Linked Data markets)

Use case 3

SLIDE 36

So far so good but… Linked Open/Closed Data

[Diagram: the Linked Open Data Cloud (dbpedia, G1a, G2a, G3a, G4a) alongside a Linked Closed Data Cloud (G1b, G2b, G3b, G1c, G2c): the "Deep Semantic Web"]

SLIDE 37

Linked Open/Closed Data

SLIDE 38
  • A) Efficient Exchange: Compression + Encryption (hdtcrypt)


Linked Open/Closed Data

SLIDE 39
  • B) A secure LD Endpoint


Linked Open/Closed Data

Self-Enforcing Access Control for Encrypted RDF Javier D. Fernández, Sabrina Kirrane, Axel Polleres and Simon Steyskal. In ESWC’17

Future work:

SLIDE 40

Hands on!

Find these slides in:

https://aic.ai.wu.ac.at/qadlod/presentations/keystoneHandsOn2017.pdf
https://aic.ai.wu.ac.at/qadlod/presentations/codeKeystone2017

SLIDE 41
  • 1) Desktop tool HDT-it!
  • Thanks to Mario Arias

Consuming HDT

SLIDE 42
  • 1) Desktop tool HDT-it!
  • Download the tool for your OS: http://www.rdfhdt.org/downloads/
  • Get an HDT dataset from the web:
    • http://www.rdfhdt.org/datasets/ OR
    • http://lodlaundromat.org/wardrobe/ OR
    • convert your RDF dataset with the tool.
  • As a suggestion of small datasets: SWDF (242K triples) or the bigger DBLP (55M triples)

Consuming HDT

SLIDE 43
  • 2) Command line Tools (C++ and Java)

Consuming HDT

rdfhdt.org libraries:

Feature                       HDT-C++   HDT-Java
Command line tools            X         X
TP (triple pattern) search    X         X
Full SPARQL                   -         X (with Jena)
Parametrizable compression    X         -
Full text support             X         -
Practical uses                LDF       Jena, Fuseki

SLIDE 44
  • 2) Command line Tools (C++ and Java)
  • For simplicity, in this lecture we will use Java
  • Download the hdt-java library from https://github.com/rdfhdt/hdt-java/
    • git clone https://github.com/rdfhdt/hdt-java.git
    • or download https://github.com/rdfhdt/hdt-java/archive/master.zip
  • Install the library with maven:
    • mvn install
  • Query an HDT file:
    • Go to hdt-java-cli and execute:
    • ./bin/hdtSearch.sh /path/to/your/hdt
    • This will open a simple console where you can query triple patterns
  • Export/Import:
    • $> rdf2hdt file.nt output.hdt
    • $> hdt2rdf file.hdt output.nt

Consuming HDT

SLIDE 45
  • 3) Set up a SPARQL Endpoint with HDT and Fuseki
  • Go to hdt-fuseki and compile, adding the dependencies:
    • mvn package dependency:copy-dependencies
  • Run fuseki:
    • ./bin/hdtEndpoint.sh --hdt path/to/dataset.hdt /mydataset
  • Open your Web browser and go to http://localhost:3030
  • Select Control Panel / Dataset / mydataset and click Select
  • Type your SPARQL query and see the results.
  • Be careful with the number of results: unlike e.g. Virtuoso, there is no built-in limit on the number of results, so use LIMIT:
    • SELECT * WHERE { ?s ?p ?o } LIMIT 400

Consuming HDT
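Once the endpoint is up, you can also query it over the standard SPARQL protocol from the command line; a sketch, assuming the service path /mydataset/sparql that Fuseki derives from the dataset name above:

# Ask for the first 10 triples, results as JSON
curl -G http://localhost:3030/mydataset/sparql \
     --data-urlencode 'query=SELECT * WHERE { ?s ?p ?o } LIMIT 10' \
     -H 'Accept: application/sparql-results+json'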

SLIDE 46
  • 4) Set up a Linked Data Fragments Endpoint with HDT
  • Download the LDF Server (the Node.js one is the best, but we will use the Java one for simplicity of installation):
    • git clone https://github.com/LinkedDataFragments/Server.Java.git
    • or download https://github.com/LinkedDataFragments/Server.Java/archive/master.zip
  • Install the server, skipping the tests (they fail :):
    • mvn install -Dmaven.test.skip=true
  • Open the file config-example.json and modify the settings to point to your HDT (see the sketch after this slide), e.g.:
    • "settings": { "file": "/home/user/myfile.hdt" }
  • Run the server:
    • java -jar target/ldf-server.jar
  • Access http://localhost:8080

Consuming HDT
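A minimal config-example.json might then look like this (a sketch modeled on the repository's example configuration; the datasource name and titles are placeholders):

{
  "title": "My Linked Data Fragments server",
  "datasources": {
    "mydataset": {
      "title": "My HDT dataset",
      "type": "HdtDatasource",
      "settings": { "file": "/home/user/myfile.hdt" }
    }
  }
}

The fragments of that datasource would then be served under http://localhost:8080/mydataset.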

SLIDE 47
  • 5) Access with the HDT C++/Java libraries (again, we restrict ourselves to Java here)
  • JAVADOC:
    • http://purl.org/HDT/javadoc/api
    • http://purl.org/HDT/javadoc/core
  • I will refer to Eclipse and Maven, but you can use your preferred environment

Consuming HDT

SLIDE 48
  • Setting up the environment…
  • Create a new maven project

Consuming HDT / HDT-java library

SLIDE 49
  • Setting up the environment…
  • Create a new maven project
  • Select to create a simple project (skip archetype selection)

Consuming HDT / HDT-java library

SLIDE 50
  • Setting up the environment…
  • Create a new maven project
  • With a simple archetype
  • And any metadata

Consuming HDT / HDT-java library

SLIDE 51
  • Setting up the environment…
  • Include the maven dependency of hdt-java-core in the pom.xml

Consuming HDT / HDT-java library
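The dependency block would look like this (the version shown is illustrative; check Maven Central for the current hdt-java-core release):

<dependency>
  <groupId>org.rdfhdt</groupId>
  <artifactId>hdt-java-core</artifactId>
  <version>2.1.2</version>
</dependency>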

SLIDE 52
  • Setting up the environment…
  • Include the maven dependency of hdt-java-core in the pom.xml
  • Finally, let’s create a new class and query our HDT (a sketch follows at the end of this slide)

Consuming HDT / HDT-java library

  • Test other queries
  • get the S, P, O of the solution
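A minimal sketch of such a class, using the hdt-java API (the file path is a placeholder): it searches the pattern (?, ?, ?) and prints the S, P and O of each solution, which also covers the exercise above:

import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdt.triples.IteratorTripleString;
import org.rdfhdt.hdt.triples.TripleString;

public class HDTQuery {
    public static void main(String[] args) throws Exception {
        // Map the HDT file (and its .index, creating it if needed) into memory
        HDT hdt = HDTManager.mapIndexedHDT("/path/to/your.hdt", null);
        try {
            // Empty strings act as wildcards: this is the triple pattern (?, ?, ?)
            IteratorTripleString it = hdt.search("", "", "");
            while (it.hasNext()) {
                TripleString ts = it.next();
                System.out.println(ts.getSubject() + " " + ts.getPredicate() + " " + ts.getObject());
            }
        } finally {
            hdt.close();
        }
    }
}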
SLIDE 53
  • Let’s access the dictionary of terms in HDT

Consuming HDT / HDT-java library

  • Open two HDT files
  • Use the dictionaries to get the common predicates used in both
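A possible sketch for this exercise, assuming the Dictionary and DictionarySection interfaces of hdt-java (the class name and file paths are made up):

import org.rdfhdt.hdt.dictionary.Dictionary;
import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;

import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

public class CommonPredicates {
    // Collect all predicate terms of one HDT dictionary into a set
    static Set<String> predicates(HDT hdt) {
        Set<String> set = new HashSet<>();
        Dictionary dict = hdt.getDictionary();
        Iterator<? extends CharSequence> it = dict.getPredicates().getSortedEntries();
        while (it.hasNext()) {
            set.add(it.next().toString());
        }
        return set;
    }

    public static void main(String[] args) throws Exception {
        HDT hdt1 = HDTManager.mapHDT("/path/to/first.hdt", null);
        HDT hdt2 = HDTManager.mapHDT("/path/to/second.hdt", null);
        Set<String> common = predicates(hdt1);
        common.retainAll(predicates(hdt2)); // set intersection
        common.forEach(System.out::println);
        hdt1.close();
        hdt2.close();
    }
}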
SLIDE 54
  • Let’s access the terms as IDs

Consuming HDT / HDT-java library

  • Use the estimation of results to count the cardinality of all subjects
  • We can build a histogram and see the distribution (a sketch follows)
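A sketch of the first step (per-subject cardinality estimates, from which the histogram can be binned), assuming the low-level Triples/TripleID API of hdt-java; the file path is a placeholder:

import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdt.triples.IteratorTripleID;
import org.rdfhdt.hdt.triples.TripleID;

public class SubjectCardinalities {
    public static void main(String[] args) throws Exception {
        HDT hdt = HDTManager.mapIndexedHDT("/path/to/your.hdt", null);
        long nSubjects = hdt.getDictionary().getNsubjects();
        // Pattern (s, ?, ?) at the ID level: 0 means wildcard
        for (int s = 1; s <= nSubjects; s++) {
            IteratorTripleID it = hdt.getTriples().search(new TripleID(s, 0, 0));
            System.out.println(s + "\t" + it.estimatedNumResults());
        }
        hdt.close();
    }
}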
SLIDE 55
  • 6) Query full SPARQL with Jena and HDT
  • First, include the hdt-jena dependency in pom.xml

Consuming HDT

SLIDE 56
  • 6) Query full SPARQL with Jena and HDT
  • First, include the hdt-jena dependency in pom.xml
  • Import HDT into a model and query! (a sketch follows)

Consuming HDT

  • Test other queries over your data
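A minimal sketch, assuming hdt-jena's HDTGraph and Jena 3 package names (the file path and query are placeholders):

import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ResultSetFormatter;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdtjena.HDTGraph;

public class SparqlOverHDT {
    public static void main(String[] args) throws Exception {
        // Expose the HDT file as a read-only Jena model
        HDT hdt = HDTManager.mapIndexedHDT("/path/to/your.hdt", null);
        Model model = ModelFactory.createModelForGraph(new HDTGraph(hdt));

        QueryExecution qe = QueryExecutionFactory.create(
                "SELECT * WHERE { ?s ?p ?o } LIMIT 10", model);
        try {
            ResultSetFormatter.out(System.out, qe.execSelect());
        } finally {
            qe.close();
            hdt.close();
        }
    }
}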
SLIDE 57
  • +) Query LOD-a-lot
  • First, get the correct hdt-java branch to deal with really long IDs:
    • git clone -b long-dict-id https://github.com/rdfhdt/hdt-java/
  • Install, skipping the tests:
    • mvn install -Dmaven.test.skip=true
  • Increase the Java heap space:
    • export MAVEN_OPTS="-Xmx25G"
  • In hdt-java-cli:
    • ./bin/hdtSearch.sh /media/javi/data/lod-a-lot/LOD_a_lot_v1.hdt

Consuming HDT

SLIDE 58

Let the lecture… end

SLIDE 59
  • We are currently facing Big Linked Data challenges
    • Generation, publication and consumption
  • Thanks to compression, the Big Linked Data of today will be the "pocket" data of tomorrow
  • Compression is not just about space:
    • Fast exchange
    • Fast processing/management
    • Fast querying
  • Compression democratizes the access to Big Linked Data = cheap, scalable consumers


Take-home messages

SLIDE 60

Thank you!

Let the lecture… end