SLIDE 1

On the Efficient Distributed Evaluation of SPARQL Queries

Damien Graux

Supervisor: Nabil Layaïda
Co-Supervisor: Pierre Genevès
Funded by: Datalyse Project
Université Grenoble Alpes, Inria, LIG
Tyrex Team

<tyrex.inria.fr>

December 15th, 2016

SLIDE 2

RDF & SPARQL Distributed Frameworks SPARQL Evaluators Experiments Distributed Evaluation Conclusion

Context & Objectives driven by an example

A practical use case: What did you miss (touristically) last time you travelled (by plane)?

More specifically: “Is it possible to sightsee at stopovers?”

Complex Problem

Planes    Relational   Thousand   Static
Subways   GTFS         Million    Static
POIs      RDF          Billion    Dynamic
Reviews   Various      Billion    Dynamic

Finally... linking the blocks!

SLIDE 11

Context & Objectives driven by an example

Context:
- Large datasets available
- Heterogeneous data

Objectives:
- Efficiently query these datasets
- Aggregate results to build complex applications

SLIDE 14

My PhD topic

Focuses

1. Evaluating SPARQL queries,
2. On large amounts of RDF data,
3. In a distributed context.

Problem: How to design efficient distributed SPARQL evaluators?

SLIDE 15

Section 1 RDF & SPARQL

SLIDE 17

Resource Description Framework [HM04]

[Figure: the example RDF graph, whose edges are listed as triples below.]

subject        predicate      object
Dutch School   type           Museum
Dutch School   creationDate   2016
Dutch School   use            Louvre
Louvre         type           Museum
Rembrandt      type           Painter
Hals           type           Painter
Vermeer        type           Painter
Van Dyck       type           Painter
Dutch School   mainTopic      Rembrandt
Collection     shows          Rembrandt
Dutch School   shows          Rembrandt
Dutch School   shows          Hals
Dutch School   shows          Vermeer
Dutch School   shows          Van Dyck

SLIDE 18

Resource Description Framework [HM04]

RDF essentials
- RDF is a W3C standard
- RDF is designed to provide, share and exchange datasets
- An RDF graph is a set of RDF triples
- An RDF triple has three components:
  - a subject (s)
  - a predicate (p)
  - an object (o)
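The definitions above can be sketched in a few lines of plain Scala. This is only an illustration: the names `RdfSketch`, `Triple` and `objectsOf` are hypothetical, and terms are kept as plain strings rather than IRIs or literals.

```scala
object RdfSketch {
  // One RDF triple: subject, predicate, object (plain strings for brevity).
  final case class Triple(s: String, p: String, o: String)

  // The example graph of the slides: an RDF graph is a set of triples.
  val graph: Set[Triple] = Set(
    Triple("Dutch School", "type", "Museum"),
    Triple("Dutch School", "creationDate", "2016"),
    Triple("Dutch School", "use", "Louvre"),
    Triple("Louvre", "type", "Museum"),
    Triple("Rembrandt", "type", "Painter"),
    Triple("Hals", "type", "Painter"),
    Triple("Vermeer", "type", "Painter"),
    Triple("Van Dyck", "type", "Painter"),
    Triple("Dutch School", "mainTopic", "Rembrandt"),
    Triple("Collection", "shows", "Rembrandt"),
    Triple("Dutch School", "shows", "Rembrandt"),
    Triple("Dutch School", "shows", "Hals"),
    Triple("Dutch School", "shows", "Vermeer"),
    Triple("Dutch School", "shows", "Van Dyck")
  )

  // All objects linked to subject `s` through predicate `p`.
  def objectsOf(s: String, p: String): Set[String] =
    graph.filter(t => t.s == s && t.p == p).map(_.o)
}
```

For instance, `RdfSketch.objectsOf("Dutch School", "shows")` returns the four painters shown by the exhibition.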

SLIDE 19

SPARQL Protocol and RDF Query Language [G+13]

[Figure: the example RDF graph.]

SELECT ?s ?g WHERE {
  ?s type Museum .
  ?g type Painter .
  ?s shows ?g
}

SLIDE 23

SPARQL Protocol and RDF Query Language [G+13]

Evaluating the query on the example graph, one triple pattern at a time:

?s type Museum     ?s: Dutch School, Louvre
?g type Painter    ?g: Rembrandt, Hals, Vermeer, Van Dyck
?s shows ?g        (?s,?g): (Dutch School,Rembrandt), (Dutch School,Hals), (Dutch School,Vermeer), (Dutch School,Van Dyck), (Collection,Rembrandt)

Solution (?s,?g): (Dutch School,Rembrandt), (Dutch School,Hals), (Dutch School,Vermeer), (Dutch School,Van Dyck)

SLIDE 25

SPARQL Protocol and RDF Query Language [G+13]

Considered SPARQL Fragment
The Basic Graph Pattern (BGP) fragment, composed of conjunctions of Triple Patterns (TPs).

SELECT ?s ?g WHERE {
  ?s type Museum .
  ?g type Painter .
  ?s shows ?g
}

Each line of the WHERE clause is a Triple Pattern (TP): one BGP composed of 3 TPs.

Solutions
A candidate solution satisfies a TP when replacing the variables of the TP with their values yields a triple that appears in the RDF data. A query solution is a candidate solution that satisfies all the TPs of the query.
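The solution semantics above can be made concrete with a naive, single-machine evaluator. This is a sketch under assumptions, not the evaluator presented in this talk: `BgpSketch`, `matchTp` and `solve` are hypothetical names, terms starting with `?` are treated as variables, and no optimization is attempted.

```scala
object BgpSketch {
  type Triple  = (String, String, String)
  type Binding = Map[String, String] // variable name -> value

  val data: Set[Triple] = Set(
    ("Dutch School", "type", "Museum"), ("Dutch School", "creationDate", "2016"),
    ("Dutch School", "use", "Louvre"), ("Louvre", "type", "Museum"),
    ("Rembrandt", "type", "Painter"), ("Hals", "type", "Painter"),
    ("Vermeer", "type", "Painter"), ("Van Dyck", "type", "Painter"),
    ("Dutch School", "mainTopic", "Rembrandt"), ("Collection", "shows", "Rembrandt"),
    ("Dutch School", "shows", "Rembrandt"), ("Dutch School", "shows", "Hals"),
    ("Dutch School", "shows", "Vermeer"), ("Dutch School", "shows", "Van Dyck")
  )

  private def isVar(term: String): Boolean = term.startsWith("?")

  // Extend binding `b` so that `term` matches `value`, if possible.
  private def unify(term: String, value: String, b: Binding): Option[Binding] =
    if (!isVar(term)) {
      if (term == value) Some(b) else None
    } else {
      b.get(term) match {
        case Some(bound) => if (bound == value) Some(b) else None
        case None        => Some(b + (term -> value))
      }
    }

  // All extensions of `b` that satisfy one triple pattern.
  def matchTp(tp: Triple, b: Binding): Set[Binding] =
    data.flatMap { case (s, p, o) =>
      val extended = for {
        b1 <- unify(tp._1, s, b)
        b2 <- unify(tp._2, p, b1)
        b3 <- unify(tp._3, o, b2)
      } yield b3
      extended.toList
    }

  // A query solution satisfies every TP of the BGP.
  def solve(bgp: List[Triple]): Set[Binding] =
    bgp.foldLeft(Set[Binding](Map.empty[String, String])) { (acc, tp) =>
      acc.flatMap(b => matchTp(tp, b))
    }
}
```

Calling `BgpSketch.solve` on the three TPs of the example query yields the four (?s, ?g) pairs in which Dutch School shows a painter; (Collection, Rembrandt) is rejected because Collection is not typed as a Museum.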

SLIDE 26

Section 2 Distributed Frameworks

SLIDE 27

MapReduce Strategy

The paradigm
Parallel processing of massive datasets [DG08]. A job has two separate phases:

1. A Map phase, which takes k/v pairs, performs computations and returns k/v pairs.
2. A Reduce phase, where the k/v pairs from the Map are ingested to return a single set of results.

Intermediate results sometimes need to be shuffled (exchanged and/or merge-sorted) across the network before being reduced. In brief, MapReduce not only treats the dataset as distributed and fragmented across machines, but also structures the computation as small blocks (the Map part) whose results are finally grouped together (the Reduce part).
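The three steps (map, shuffle, reduce) can be imitated on a single machine with plain Scala collections. The sketch below counts triples per predicate; it illustrates the paradigm only and does not use the Hadoop API, and all names (`MapReduceSketch`, `mapPhase`, `shuffle`, `reducePhase`) are hypothetical.

```scala
object MapReduceSketch {
  type Triple = (String, String, String)

  // Map phase: emit one (key, value) pair per input record.
  def mapPhase(records: Seq[Triple]): Seq[(String, Int)] =
    records.map { case (_, p, _) => (p, 1) }

  // Shuffle: bring together all the values that share a key.
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

  // Reduce phase: fold each group of values into a single result.
  def reducePhase(groups: Map[String, Seq[Int]]): Map[String, Int] =
    groups.map { case (k, vs) => k -> vs.sum }

  def countByPredicate(records: Seq[Triple]): Map[String, Int] =
    reducePhase(shuffle(mapPhase(records)))
}
```

In a real cluster the map and reduce functions run on different machines and the shuffle moves data across the network; here everything happens in memory.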

SLIDE 28

Distributed Frameworks

Hadoop
A framework for distributed systems based on MapReduce. It is twofold:
- a distributed file system (including replication)
- a MapReduce library

Cluster Computing Frameworks
- Provide an interface with implicit data parallelism and fault tolerance
- Offer a set of low-level functions, e.g. map, join, collect...
- For instance: PigLatin, Flink, Spark...

SLIDE 29

Apache Spark [ZCD+12]

Spark in a nutshell
- Master/Worker(s) architecture
- Various file system sources supported, e.g. HDFS
- One of the most active Apache projects, e.g. 1000+ contributors

Timeline
2002        MapReduce @ Google
2004        MapReduce Paper
2006        Hadoop @ Yahoo!
2008        Hadoop Summit
2010        Spark Open-Source
May 2014    Apache Spark 1.0
July 2016   Apache Spark 2.0

SLIDE 30

Apache Spark [ZCD+12]

Resilient Distributed Datasets
- Distributed object collections
- Split into partitions stored in RAM or on disk
- Created through deterministic operations
- Fault-tolerant: automatically rebuilt

SLIDE 31

Section 3 SPARQL Evaluators

SLIDE 32

Jumble of Evaluators

4store, CouchBaseRDF, BitMat, YARS, Hexastore, CliqueSquare, RYA, Parliament, Virtuoso, RDF-3X...

SLIDE 33

Jumble of Evaluators

Some Previous Surveys

When   Who               What
2001   Barstow [Bar01]   Focuses on open-source solutions and looks at some of their specificities
2002   Beckett [Bec02]   Updates
2003   Beckett [BG03]    Focuses on the use of relational database management systems to store RDF datasets
2004   Lee [Lee04]       Updates
2012   Faye [FCB12]      Lists the various RDF storage approaches mainly used by single-node systems
2015   Kaoudi [KM15]     Presents a survey focusing only on RDF in the clouds

SLIDE 35

RDF Storage Strategies

- native
  - In-memory: BitMat
  - On Disks
    - Standalone: Virtuoso, RDF-3X, Hexastore
    - Embedded
- non-native
  - Web APIs
  - DBMS-based
    - Schema-Carefree
      - Triple Table: 3store
    - Schema-Aware
      - Vertical Partitioning: swStore
      - Property Table

SLIDE 37

Distributed Evaluation Methods

Distributed RDF Storage Methods: Federation; Horizontal Fragmentation; Graph Partitioning; Key-Value Stores; Triple-based (RYA); Graph-based; Independent (4store, CouchBaseRDF); Distributed File System; Triple Table (PigSPARQL); Vertical Partitioning (S2RDF); Property Table.

SLIDE 40

Distributed SPARQL Evaluator State-of-the-art Summary

Observations

1. Multiple RDF storage strategies
2. Various methods to distribute data and to compute queries

How to pick an efficient evaluator? Experimental Evaluation!

SLIDE 41

Section 4 Multi-Criteria Experimental Ranking

SLIDE 42

Experimental Studies

When   Who                    What
2002   Magkanaraki [MKA+02]   Reviews solutions dealing with ontologies
2009   Stegmaier [SGD+09]     Reviews solutions according to several parameters, such as their licenses and architectures, and compares them using a scalable test dataset
2013   Cudré [CMEF+13]        Carries out an empirical study of distributed SPARQL evaluators (native RDF stores and several NoSQL solutions they adapted)

SLIDE 44

Popular Benchmarks

Name                     SPARQL Fragment
LUBM [GPH05]             BGP
WatDiv [AHÖD14]          BGP
SP2Bench [SHLP09]        BGP + FILTER UNION OPTIONAL + Solution Modifiers + ASK
BowlognaBench [DEW+11]   BGP + aggregator (e.g. COUNT)
BSBM [BS09]              BGP + FILTER UNION OPTIONAL + Solution Modifiers + Logical negation + CONSTRUCT
DBPSB [MLAN11]           Uses queries actually posed against DBpedia
RBench [QÖ15]            Generates queries according to the considered datasets

SLIDE 45

Contrib. 1 – Experimental Comparative Analysis

Considered Benchmarks
- LUBM: generated datasets and 14 queries (Q1-Q14)
- WatDiv: generated datasets and 20 queries

Competitors
- Selection criteria: open source, popular or recent
- Two types of evaluators:
  - Conventional (with preprocessing): 4store, CumulusRDF, CouchBaseRDF, RYA, CliqueSquare and S2RDF
  - Direct: PigSPARQL

SLIDE 47

Contrib. 1 – Obtained Results

With the following RDF datasets:

Dataset    Number of Triples   Original File Size
WatDiv1k   109 million         15 GB
Lubm1k     134 million         23 GB
Lubm10k    1.38 billion        232 GB

Figure: Preprocessing Time. [Plot of the preprocessing time in seconds (log scale, 10^3 to 10^5) on watdiv1k, lubm1k and lubm10k for 4store, CliqueSquare, RYA, S2RDF, CouchBaseRDF and CumulusRDF.]

SLIDE 49

Contrib. 1 – Obtained Results

Figure: Query Response Time with Lubm1k (134 million triples). [Plot of the time in seconds (log scale, 10^0 to 10^4) for queries Q1 to Q3 with 4store, CliqueSquare, CouchBaseRDF, CumulusRDF, PigSPARQL, RYA and S2RDF.]

Q1
SELECT ?X WHERE {
  ?X rdf:type ub:GraduateStudent .
  ?X ub:takesCourse GraduateCourse0
}

Q2
SELECT ?X ?Y ?Z WHERE {
  ?X rdf:type ub:GraduateStudent .
  ?Y rdf:type ub:University .
  ?Z rdf:type ub:Department .
  ?X ub:memberOf ?Z .
  ?Z ub:subOrganizationOf ?Y .
  ?X ub:undergraduateDegreeFrom ?Y
}

Q3
SELECT ?X WHERE {
  ?X rdf:type ub:Publication .
  ?X ub:publicationAuthor AssistantProfessor0
}

SLIDE 51

Contrib. 1 – Obtained Results

Figure: Obtained results with WatDiv1k. [Four log-scale panels, one per evaluator: (a) 4store, (b) S2RDF, (c) RYA, (d) PigSPARQL. The x-axis lists the WatDiv queries C1-C3, F1-F5, L1-L5 and S1-S7; panel (c) has no C2 and C3.]

SLIDE 52

Contrib. 1 – Obtained Results

We learned:

1. Considering the same dataset, loading times are spread over several orders of magnitude.
2. For the same query on the same dataset, elapsed times can differ very significantly.
3. Even with large datasets, most queries are not harmful per se, i.e. queries that incur long running times with some implementations still remain in the “comfort zone” for other implementations.

OK, but... how to rank evaluators?

SLIDE 56

An extended set of metrics

Usual metrics:
- Time (always)
- Disk Footprint (only sometimes)

Our additions:
- Disk Activity (new)
- Network Traffic (new)
- Resources: CPU, RAM, SWAP (new)

SLIDE 61

Contrib. 2 – Multi-Criteria Reading Grid

Criteria List
- Velocity: the fastest possible answers. Metric: Query Time
- Resiliency: avoiding, as much as possible, recomputing everything when a machine fails. Metric: Footprint
- Immediacy: evaluating some SPARQL queries only once. Metric: Preprocessing Time
- Dynamicity: dealing with dynamic data. Metrics: Preprocessing Time & Disk Activity
- Parsimony: minimizing some of the resources. Metrics: CPU, RAM...

SLIDE 65

Contrib. 2 – Ranking

[Figure: ranking of CliqueSquare, CouchBaseRDF, CumulusRDF, RYA, S2RDF, PigSPARQL and 4store along six criteria: Velocity (Lubm1k), Velocity (WatDiv1k), Immediacy, Parsimony, Dynamicity and Resiliency.]

SLIDE 66

Section 5 Efficient Distributed SPARQL Evaluation

SLIDE 68

Contrib. 3 – Efficient Distributed SPARQL evaluation

These evaluators in a nutshell:
- SPARQLGX: a distributed SPARQL evaluator with Apache Spark
- SDE: a direct SPARQL evaluator with Apache Spark
- RDFHive: a direct evaluation of SPARQL with Apache Hive

Available from: <https://github.com/tyrex-team>

SLIDE 69

Contrib. 3 – Efficient Distributed SPARQL evaluation

Considering the reading grid, we have:
- SPARQLGX: velocity, resiliency
- SDE: immediacy, dynamicity, resiliency
- RDFHive: immediacy, dynamicity, resiliency, parsimony

SLIDE 70

Details of SPARQLGX

1. Selected storage model
2. SPARQL translation process
3. Optimization strategies

SLIDE 73

Vertical Partitioning [Abadi et al. 2007]

SPARQLGX Storage Model

RDF predicates carry the semantic information. As a consequence:
- The number of distinct predicates is limited, e.g. a few hundred [Gallego et al. 2011]
- Predicates are rarely variables in queries [Gallego et al. 2011]

Vertical Partitioning
Splitting the dataset by predicate and saving two-column files.

Advantages
- Natural compression and indexing
- Straightforward implementation

SLIDE 74

Vertical Partitioning [Abadi et al. 2007]

SPARQLGX Storage Model

dataset
Dutch School   type           Museum
Dutch School   creationDate   2016
Dutch School   use            Louvre
Louvre         type           Museum
Rembrandt      type           Painter
Hals           type           Painter
Vermeer        type           Painter
Van Dyck       type           Painter
Collection     shows          Rembrandt
Dutch School   mainTopic      Rembrandt
Dutch School   shows          Rembrandt
Dutch School   shows          Hals
Dutch School   shows          Vermeer
Dutch School   shows          Van Dyck

type.txt
Dutch School   Museum
Louvre         Museum
Rembrandt      Painter
Hals           Painter
Vermeer        Painter
Van Dyck       Painter

shows.txt
Collection     Rembrandt
Dutch School   Rembrandt
Dutch School   Hals
Dutch School   Vermeer
Dutch School   Van Dyck

creationDate.txt
Dutch School   2016

use.txt
Dutch School   Louvre

mainTopic.txt
Dutch School   Rembrandt
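The split shown above amounts to grouping triples by predicate and keeping only the (subject, object) columns. A minimal in-memory sketch, assuming plain string triples; the name `VerticalPartitioningSketch` and the choice of returning a map from file name to rows (instead of writing files) are illustrative assumptions, not the SPARQLGX implementation.

```scala
object VerticalPartitioningSketch {
  type Triple = (String, String, String)

  // Group triples by predicate; each group keeps only (subject, object) pairs
  // and is named after the predicate, mimicking one two-column file per predicate.
  def partition(triples: Seq[Triple]): Map[String, Seq[(String, String)]] =
    triples.groupBy(_._2).map { case (p, ts) =>
      s"${p}.txt" -> ts.map { case (s, _, o) => (s, o) }
    }
}
```

Because the predicate is encoded once in the file name rather than on every row, a query whose predicate is a constant only needs to read the matching file.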

SLIDE 77

SPARQL Translation Process

SPARQL→Scala

Dealing with one TP:
- textFile to access the relevant files
- filter to keep the matching triples

?s type Museum .

textFile("type.txt")
  .filter{case (s,o) => o.equals("Museum")}
  .map{case (s,o) => s}

With a conjunction of TPs:
- Translate each TP
- Join them one by one
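The translation of a single TP can be tried without a cluster by applying the same filter and projection to the lines of a two-column file. This is a sketch only: it assumes tab-separated columns (the actual file layout is not specified here), and `TriplePatternScan`, `parse` and `subjectsWithObject` are hypothetical names.

```scala
object TriplePatternScan {
  // Parse one line of a predicate file into (subject, object).
  // Assumption: the two columns are separated by a tab.
  def parse(line: String): Option[(String, String)] =
    line.split("\t") match {
      case Array(s, o) => Some((s, o))
      case _           => None // malformed lines are skipped
    }

  // "?s type Museum": keep the rows of type.txt whose object is "Museum"
  // and project on the subject.
  def subjectsWithObject(lines: Seq[String], obj: String): Seq[String] =
    lines.flatMap(l => parse(l).toList).collect { case (s, o) if o == obj => s }
}
```

With the lines of type.txt as input, `subjectsWithObject(lines, "Museum")` plays the role of the `filter` and `map` steps of the slide.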

SLIDE 82

RDF & SPARQL Distributed Frameworks SPARQL Evaluators Experiments Distributed Evaluation Conclusion

SPARQL Translation Process

SPARQL→Scala ?s type Museum . ?g type Painter . ?s shows ?g tp1=sc.textFile(‘‘type.txt’’) .filter{case(s,o)=>o.equals(‘‘Museum’’)} .map{case(s,o)=>s} .keyBy{case(s)=>s} tp2=sc.textFile(‘‘type.txt’’) .filter{case(g,o)=>o.equals(‘‘Painter’’)} .map{(g,o)=>g} .keyBy{case(g)=>g} tp3=sc.textFile(‘‘shows.txt’’) .keyBy{case(s,g)=>(s,g)} bgp=tp1.cartesian(tp2).values .keyBy{case(s,g)=>(s,g)} .join(tp3).value

SLIDE 84

Join Order

SPARQL→Scala

To minimize the size of intermediate results, we try:

1 Avoiding cartesian products
2 Exploiting statistics on data

Selectivity

The selectivity of an element located at position pos is either its number of occurrences at pos if it is a constant, or the total number of triples if it is a variable.
The selectivity of a TP is the minimum of its element selectivities.
We simply sort the TPs of a BGP in ascending order of their selectivities.
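The selectivity rule above can be sketched in plain Scala (no Spark needed). This is only an illustration under stated assumptions: the names Term, TP, Stats and reorder are hypothetical, the occurrence counts are invented, and a constant missing from the statistics is assumed to have selectivity 0.

```scala
object SelectivitySketch {
  // A TP element is either a variable (?x) or a constant.
  sealed trait Term
  final case class Var(name: String) extends Term
  final case class Const(value: String) extends Term

  // A triple pattern (TP): subject, predicate, object.
  final case class TP(s: Term, p: Term, o: Term)

  // Assumed statistics: total number of triples, plus the number of
  // occurrences of each constant at each position (0 = s, 1 = p, 2 = o).
  final case class Stats(total: Long, occurrences: Map[(Int, String), Long])

  // Constant: its occurrence count at pos; variable: total number of triples.
  def elementSelectivity(pos: Int, t: Term, st: Stats): Long = t match {
    case Var(_)   => st.total
    case Const(c) => st.occurrences.getOrElse((pos, c), 0L) // assumption: unseen constant -> 0
  }

  // Selectivity of a TP: minimum over its three elements.
  def selectivity(tp: TP, st: Stats): Long =
    Seq(tp.s, tp.p, tp.o).zipWithIndex
      .map { case (t, pos) => elementSelectivity(pos, t, st) }
      .min

  // Sort the TPs of a BGP in ascending order of selectivity.
  def reorder(bgp: Seq[TP], st: Stats): Seq[TP] =
    bgp.sortBy(tp => selectivity(tp, st))

  // Toy numbers, invented for this example.
  val stats = Stats(
    total = 110L,
    occurrences = Map(
      (1, "type") -> 100L, (1, "shows") -> 10L,
      (2, "Museum") -> 20L, (2, "Painter") -> 30L))

  val tp1 = TP(Var("s"), Const("type"), Const("Museum"))
  val tp2 = TP(Var("g"), Const("type"), Const("Painter"))
  val tp3 = TP(Var("s"), Const("shows"), Var("g"))

  val reordered = reorder(Seq(tp1, tp2, tp3), stats)
}
```

With these invented counts the selectivities are 20, 30 and 10, so the sorted BGP starts with ?s shows ?g, followed by ?s type Museum and ?g type Painter.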

SLIDE 87

Join Order

SPARQL→Scala

Initial BGP: ?s type Museum . ?g type Painter . ?s shows ?g
New BGP: ?s shows ?g . ?s type Museum . ?g type Painter

Associated Scala code:

tp1=sc.textFile("shows.txt")
  .keyBy{case(s,g)=>s}
tp2=sc.textFile("type.txt")
  .filter{case(s,o)=>o.equals("Museum")}
  .map{case(s,o)=>s}
  .keyBy{case(s)=>s}
tp3=sc.textFile("type.txt")
  .filter{case(g,o)=>o.equals("Painter")}
  .map{case(g,o)=>g}
  .keyBy{case(g)=>g}
bgp=tp1.join(tp2).values
  .keyBy{case(s,g)=>g}
  .join(tp3).values
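To see why starting from ?s shows ?g helps, here is a small sketch using plain Scala collections instead of Spark RDDs. The data and the names (museums, painters, shows, planA, planB) are made up for illustration and are not taken from SPARQLGX; the point is only to compare the size of the intermediate results of the two join orders.

```scala
object JoinOrderSketch {
  // Toy data, invented for this example.
  val museums: Seq[String]  = Seq("Louvre", "Orsay", "Prado")
  val painters: Seq[String] = Seq("Monet", "Goya", "Manet", "Degas")
  val shows: Seq[(String, String)] =
    Seq("Louvre" -> "Monet", "Prado" -> "Goya", "Cinema" -> "Film")

  // Plan A (initial order): the two type patterns share no variable,
  // so combining them first is a cartesian product.
  val cartesian: Seq[(String, String)] =
    for (s <- museums; g <- painters) yield (s, g)
  val planA: Seq[(String, String)] = cartesian.filter(shows.toSet)

  // Plan B (reordered): start from shows, then join on ?s, then on ?g.
  val afterMuseum: Seq[(String, String)] =
    shows.filter { case (s, _) => museums.contains(s) }
  val planB: Seq[(String, String)] =
    afterMuseum.filter { case (_, g) => painters.contains(g) }
}
```

Both plans return the same two pairs, but plan A materializes 12 intermediate pairs (3 museums times 4 painters) while plan B never holds more than the 3 pairs of shows.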

SLIDE 90

Direct SPARQL Evaluation

SDE (SPARQLGX as Direct Evaluator)
  Directly considers the initial RDF dataset
  Designed to evaluate a single query

RDFHive
  Based on Apache Hive (a relational solution on top of HDFS)
  Translates queries into HiveQL
  Offers the possibility of merging relational and RDF datasets

SLIDE 91

Direct SPARQL Evaluation

Figure: Tradeoff between preprocessing and query evaluation times (seconds), linear WatDiv.
Systems plotted: 4store, CliqueSquare, CouchBaseRDF, PigSPARQL, RDFHive, RYA, S2RDF, SDE, SPARQLGX.

SLIDE 92

Section 6 Conclusion & Perspectives

SLIDE 96

Conclusion

Summary of Contributions

1 Update the comparative survey of Cudré et al. (Submitted)
2 Provide a new reading grid (new set of metrics) (Submitted)
3 Develop several distributed SPARQL evaluators:
    SPARQLGX (ISWC 2016)
    SDE (ISWC 2016)
    RDFHive

Reusability: openly available under the CeCILL license from <https://github.com/tyrex-team>

SLIDE 98

Conclusion

Criteria: Velocity (Lubm1k), Velocity (WatDiv1k), Immediacy, Parsimony, Dynamicity, Resiliency.
Systems compared: 4store, CliqueSquare, CouchBaseRDF, CumulusRDF, PigSPARQL, RYA, S2RDF, RDFHive, SDE, SPARQLGX.

SLIDE 99

I – Perspectives: SPARQL Benchmarking

Uniform test-suite for dynamicity (Short-Term)
  Designing a benchmark for the SPARQL UPDATE fragment
Staying up to date (Continuous)
  Adding new evaluators
  Considering other test suites
  Benchmarking on other clusters
Varying the number of nodes (Mid-Term)
  Validating our results on larger clusters
  New kind of limitation?

SLIDE 100

II – Perspectives: SPARQL Evaluators

Improving our evaluators (Ongoing)
  Extending the supported SPARQL fragment
  Improving the storage models
Designing criteria-specific evaluators (Mid-Term)
  Implementing a parsimonious and resilient evaluator
  Developing evaluators for highly dynamic contexts
Storage-adaptive distributed evaluators (Long-Term)
  Adapting the idea of Aluç et al. [AÖD14] to a distributed context
  Considering SPARQL query shapes ⇒ choosing the storage model dynamically!

SLIDE 101

III – Perspectives: Integration in ETL systems

Designing SPARQL pipelines (Mid-Term)
  Using CONSTRUCT to refine existing RDF datasets
  Aggregating several sources into a single one
Creating heterogeneous data pipelines (Mid/Long-Term)
  We provide a prototype for trip planning (ISWC 2016)
  Development of a dedicated language

SLIDE 102

Thanks for your attention!

SLIDE 104

Appendices

Outline: Hadoop, Spark, Cluster

SLIDE 105

Concept

MapReduce (over HDFS): Data blocks → Map tasks → <K,V> pairs → Reduce tasks → Results
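As a rough illustration of this data flow, the following plain-Scala sketch mimics the three steps on in-memory collections with a word count. It is not Hadoop code: the names mapPhase, reducePhase and run are hypothetical, and the shuffle step is approximated with groupBy.

```scala
object MapReduceSketch {
  // Map phase: each input chunk emits <K,V> pairs, here (word, 1).
  def mapPhase(chunk: String): Seq[(String, Int)] =
    chunk.split("\\s+").filter(_.nonEmpty).map(w => (w, 1)).toSeq

  // Shuffle + Reduce: group the pairs by key, then sum the values of each key.
  def reducePhase(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

  // Full pipeline: every chunk is mapped independently, then reduced by key.
  def run(chunks: Seq[String]): Map[String, Int] =
    reducePhase(chunks.flatMap(mapPhase))
}
```

For instance, MapReduceSketch.run(Seq("a b a", "b c")) counts "a" twice, "b" twice and "c" once.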

SLIDE 106

Architecture

Spark

Components: a Driver Program (holding the SparkContext), a Cluster Manager, and Worker Nodes; each Worker Node runs an Executor (with a Cache and Tasks) next to an HDFS Datanode.

1 Resource allocation via the cluster manager, through the master
2 Acquisition of executors on the cluster nodes
3 Code transfer from the application to the executors
4 Task transfer to the executors

SLIDE 107

Technical Details

Cluster of 10 nodes with 17 GB of RAM each

Dataset    Number of Triples   Original File Size
WatDiv1k   109 million         15 GB
Lubm1k     134 million         23 GB
Lubm10k    1.38 billion        232 GB

SLIDE 109

References

Güneş Aluç, Olaf Hartig, M Tamer Özsu, and Khuzaima Daudjee. Diversified stress testing of RDF data management systems. In ISWC, pages 197–212. Springer, 2014.

Güneş Aluç, M Tamer Özsu, and Khuzaima Daudjee. Workload matters: Why RDF databases need a new design. Proceedings of the VLDB Endowment, 7(10):837–840, 2014.

A Barstow. Survey of RDF/triple data stores. World Wide Web Consortium. Retrieved April, 10:2003, 2001.

Dave Beckett. Scalability and storage: Survey of free software/open source RDF storage systems. Latest version is available at http://www.w3.org/2001/sw/Europe/reports/rdf_scalable_storage_report, 2002.

Dave Beckett and Jan Grant. Mapping semantic web data with RDBMSes. W3C Semantic Web Advanced Development for Europe (SWAD-Europe), 2003.

Christian Bizer and Andreas Schultz. The Berlin SPARQL benchmark. IJSWIS, 2009.

Philippe Cudré-Mauroux, Iliya Enchev, Sever Fundatureanu, Paul Groth, Albert Haque, Andreas Harth, Felix Leif Keppmann, Daniel Miranker, Juan F Sequeda, and Marcin Wylot. NoSQL databases for RDF: An empirical evaluation. ISWC, pages 310–325, 2013.

Gianluca Demartini, Iliya Enchev, Marcin Wylot, Joël Gapany, and Philippe Cudré-Mauroux. Bowlognabench – Benchmarking RDF Analytics. In International Symposium on Data-Driven Process Discovery and Analysis, pages 82–102. Springer, 2011.

Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

David C Faye, Olivier Curé, and Guillaume Blin. A survey of RDF storage approaches. Arima Journal, 15:11–35, 2012.

SLIDE 110

References

W3C SPARQL Working Group et al. SPARQL 1.1 overview, 2013. http://www.w3.org/TR/sparql11-overview/.

Yuanbo Guo, Zhengxiang Pan, and Jeff Heflin. LUBM: A benchmark for OWL knowledge base systems. Web Semantics, 2005.

Patrick Hayes and Brian McBride. RDF semantics. W3C recommendation, 10, 2004. www.w3.org/TR/rdf-concepts/.

Zoi Kaoudi and Ioana Manolescu. RDF in the clouds: a survey. The VLDB Journal, 24(1):67–91, 2015.

Ryan Lee. Scalability report on triple store applications. Massachusetts Institute of Technology, 2004.

Aimilia Magkanaraki, Grigoris Karvounarakis, Ta Tuan Anh, Vassilis Christophides, and Dimitris Plexousakis. Ontology storage and querying. ICS-FORTH Technical Report, 308, 2002.

Mohamed Morsey, Jens Lehmann, Sören Auer, and Axel-Cyrille Ngonga Ngomo. DBpedia SPARQL Benchmark – Performance assessment with real queries on real data. ISWC, pages 454–469, 2011.

Shi Qiao and Z Meral Özsoyoğlu. Rbench: Application-specific RDF benchmarking. In SIGMOD, pages 1825–1838. ACM, 2015.

Florian Stegmaier, Udo Gröbner, Mario Döller, Harald Kosch, and Gero Baese. Evaluation of current RDF database solutions. In Proceedings of the 10th International Workshop on Semantic Multimedia Database Technologies (SeMuDaTe), 4th International Conference on Semantics And Digital Media Technologies (SAMT), pages 39–55. Citeseer, 2009.

SLIDE 111

References

Michael Schmidt, Thomas Hornung, Georg Lausen, and Christoph Pinkel. SP2Bench: a SPARQL performance benchmark. ICDE, pages 222–233, 2009.

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J Franklin, Scott Shenker, and Ion Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. NSDI, 2012.