On Data Placement Strategies in Distributed RDF Stores
Int. Workshop on Semantic Big Data (SBD 2017)
Daniel Janke, Steffen Staab, Matthias Thimm
19.05.2017
Institute for Web Science and Technologies · University of Koblenz-Landau, Germany

Distributed RDF Stores
● Requirement for trillion-triple stores arose in recent years
● Scalable RDF stores in the cloud
[Figure: example RDF graph split across Compute Node 1 and Compute Node 2, with resources such as west:WeST, west:martin, west:daniel, gesis:Gesis, gesis:wanja and gesis:bello connected by ex:employs, ex:ownedBy, foaf:knows, foaf:givenname and rdf:type edges]
Challenges:
● Data placement strategies (focus of our research)
● Distributed query processing
● Handling failures of compute nodes

Data Placement Strategies and Scalability
SELECT ?org ?name WHERE {?org ex:employs ?pers . ?pers foaf:givenname ?name}
[Figure: the example RDF graph on Compute Node 1 and Compute Node 2]
Horizontal containment
● Computation of individual query results on local data
● Indicator for robust query processing when scaling horizontally
Vertical parallelization
● Parallel computation of different query results on different compute nodes
● Indicator for query processing that scales with growing result set sizes when scaling horizontally
Commonly held belief: horizontal containment dominates query processing effort (cf. [Huang2011SSQ, Lee2013EDP, Zhang2013ETS, …])
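The two notions can be made concrete with a small sketch. The triples and their assignment to the two compute nodes below are hypothetical (the edges of the slide's figure are not reproduced here), and `evaluate` is an illustrative helper, not part of any RDF store. It answers the query above and records which nodes hold the triples of each result. A result is horizontally contained if both of its triples sit on one node.

```python
# Hypothetical graph chunks: (subject, predicate, object) triples per compute node.
chunks = {
    1: {("west:WeST", "ex:employs", "west:martin"),
        ("west:martin", "foaf:givenname", "Martin"),
        ("west:WeST", "ex:employs", "west:daniel")},
    2: {("west:daniel", "foaf:givenname", "Daniel"),
        ("gesis:Gesis", "ex:employs", "gesis:wanja"),
        ("gesis:wanja", "foaf:givenname", "Wanja")},
}

def evaluate(chunks):
    """Answer {?org ex:employs ?pers . ?pers foaf:givenname ?name}.

    Returns (org, name, nodes) tuples, where nodes is the set of compute
    nodes storing the two triples matched for that result."""
    located = [(t, node) for node, triples in chunks.items() for t in triples]
    results = []
    for (org, p1, pers), n1 in located:
        if p1 != "ex:employs":
            continue
        for (s2, p2, name), n2 in located:
            if p2 == "foaf:givenname" and s2 == pers:
                results.append((org, name, frozenset({n1, n2})))
    return sorted(results, key=lambda r: (r[0], r[1]))

results = evaluate(chunks)
# Horizontally contained results need no data transfer between nodes.
contained = [r for r in results if len(r[2]) == 1]
# Nodes that can each compute some contained result on their own, in parallel.
parallel_nodes = set().union(*(r[2] for r in contained))

print(len(contained), "of", len(results), "results are horizontally contained")
print("nodes producing contained results in parallel:", sorted(parallel_nodes))
```

In this toy placement, the result involving west:daniel spans both nodes, so it is the one that would require variable bindings to be transferred.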

Outline
1) Data placement strategies
2) Benchmark methodology showing the interdependencies of data placement strategies and query processing
3) Analysis indicating that vertical parallelization may dominate horizontal containment
4) Conclusion

Graph Cover
[Figure: the example RDF graph on Compute Node 1 and Compute Node 2]
Graph cover: assignment of each triple to at least one compute node
Graph chunk: set of triples assigned to a single compute node
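A minimal sketch of these two definitions, assuming a graph is modeled as a set of (subject, predicate, object) tuples and a cover as a mapping from compute node id to its chunk. The helper name `is_graph_cover` and the sample triples are illustrative only.

```python
def is_graph_cover(graph, chunks):
    """Check that chunks form a graph cover of graph.

    Every triple must be assigned to at least one compute node, and no
    chunk may contain a triple that is not part of the graph. A triple
    may appear in several chunks (replication is allowed)."""
    assigned = set().union(*chunks.values())
    return assigned == graph

graph = {
    ("west:WeST", "ex:employs", "west:daniel"),
    ("west:daniel", "foaf:givenname", "Daniel"),
    ("gesis:Gesis", "ex:employs", "gesis:wanja"),
}

# The first triple is replicated on both nodes, which is still a valid cover.
cover = {
    1: {("west:WeST", "ex:employs", "west:daniel"),
        ("west:daniel", "foaf:givenname", "Daniel")},
    2: {("west:WeST", "ex:employs", "west:daniel"),
        ("gesis:Gesis", "ex:employs", "gesis:wanja")},
}

# Dropping node 2 leaves one triple unassigned, so this is not a cover.
incomplete = {1: cover[1]}

print(is_graph_cover(graph, cover))
print(is_graph_cover(graph, incomplete))
```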

Common Graph Cover Strategies
Hash cover [e.g. Harth2007YAF]
● Triple placement is based on the subject hash modulo the number of compute nodes
Hierarchical cover [Lee2013SQO]
● Triple placement is based on the hash of subject IRI prefixes
Minimal edge-cut cover [Karypis1998AFA]
● Assign vertices (subjects and objects) to partitions such that
– the number of edges between vertices of different partitions is minimized and
– each partition contains approximately the same number of vertices
[Figure: for each strategy, a small example graph with vertices aa, ab, ac, ba, bb and bc grouped into chunks]
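The two hash-based strategies can be sketched in a few lines. This is a simplified illustration, not the implementation from the cited papers. It assumes SHA-1 as a stable hash (Python's built-in `hash` is salted per process for strings) and takes the IRI prefix to be everything up to the last "/" or "#", which only approximates the prefix hierarchy of the hierarchical cover. The example IRIs are hypothetical. The minimal edge-cut cover is omitted because it requires a graph partitioner.

```python
import hashlib

def stable_hash(text):
    """Deterministic 64-bit hash of a string."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def hash_cover(triples, num_nodes):
    """Place each triple by its subject hash modulo the number of nodes."""
    chunks = {node: set() for node in range(num_nodes)}
    for s, p, o in triples:
        chunks[stable_hash(s) % num_nodes].add((s, p, o))
    return chunks

def iri_prefix(iri):
    """Assumed prefix rule: keep everything up to the last '/' or '#'."""
    cut = max(iri.rfind("/"), iri.rfind("#"))
    return iri[:cut + 1] if cut >= 0 else iri

def hierarchical_cover(triples, num_nodes):
    """Place each triple by the hash of its subject's IRI prefix."""
    chunks = {node: set() for node in range(num_nodes)}
    for s, p, o in triples:
        chunks[stable_hash(iri_prefix(s)) % num_nodes].add((s, p, o))
    return chunks

triples = [
    ("http://example.org/west/martin", "foaf:knows", "http://example.org/west/daniel"),
    ("http://example.org/west/daniel", "foaf:givenname", "Daniel"),
    ("http://example.org/gesis/wanja", "foaf:givenname", "Wanja"),
    ("http://example.org/gesis/bello", "rdf:type", "http://example.org/gesis/Dog"),
]

by_subject = hash_cover(triples, 2)
by_prefix = hierarchical_cover(triples, 2)
for node in range(2):
    print(node, len(by_subject[node]), len(by_prefix[node]))
```

Under the hierarchical cover, all subjects sharing a prefix (here `http://example.org/west/` or `http://example.org/gesis/`) land in the same chunk, whereas the hash cover only guarantees that triples with the same subject stay together.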

Common Evaluation Strategies
1) Evaluations of graph cover strategies using different databases, e.g. [Wu2014SAS, Zeng2013ADG]
=> Other components might bias evaluation results
Analogy: car 1 uses fuel A, car 2 uses fuel B. Does fuel A or B allow for a higher speed?
2) Usage of slow communication means like the Hadoop File System, e.g. [Huang2011SSQ, Lee2013EDP]
=> Increased importance of horizontal containment
Images from https://openclipart.org

Benchmark Methodology
Goal: investigating the effect of the graph cover on scalability
[Diagram: graph cover strategies are fed into the benchmark, which produces results. The benchmark consists of the dataset, the queries, the query execution strategy, a distributed RDF store for arbitrary graph covers, and the evaluation measures.]

Strategy for Generating Queries
Query generator: SPLODGE [Görlitz2012SSG]
● Generates SPARQL queries for arbitrary datasets
● Generates queries based on
– Number of joins
– Join pattern
– Selectivity
– Number of data sources

Query Execution Strategy
● Query optimizers that fit arbitrary graph covers are difficult to build
● Instead, each query is executed with several query execution trees: left-linear, right-linear and bushy
[Figure: left-linear, right-linear and bushy join trees over the triple patterns 1, 2, 3 and 4]
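The three tree shapes can be sketched as nested tuples, where each tuple is a binary join and each leaf is a triple pattern number. This is only an illustration of the shapes. The order in which the slide's figure arranges the patterns is not recoverable, and building the bushy tree by halving the pattern list is an assumption.

```python
def left_linear(patterns):
    """Join patterns one by one, always extending the left subtree."""
    tree = patterns[0]
    for pattern in patterns[1:]:
        tree = (tree, pattern)
    return tree

def right_linear(patterns):
    """Join patterns one by one, always extending the right subtree."""
    tree = patterns[-1]
    for pattern in reversed(patterns[:-1]):
        tree = (pattern, tree)
    return tree

def bushy(patterns):
    """Balanced tree: join the two halves, built recursively (assumed shape)."""
    if len(patterns) == 1:
        return patterns[0]
    mid = len(patterns) // 2
    return (bushy(patterns[:mid]), bushy(patterns[mid:]))

def depth(tree):
    """Number of join levels from the root down to the deepest leaf."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(depth(tree[0]), depth(tree[1]))

patterns = [1, 2, 3, 4]
for build in (left_linear, right_linear, bushy):
    tree = build(patterns)
    print(build.__name__, tree, "depth", depth(tree))
```

For four patterns the linear trees have three join levels while the bushy tree has two, so its two lower joins are independent of each other and can run concurrently.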

Koral
● Graph-cover-independent distributed RDF store
● Inspired by TriAD [GurajadaTheobald2014TAD]
[Architecture diagram]
Master: Dictionary Encoder, Graph Cover Creator, Query Execution Coordinator, Network Manager, Dictionary, Statistics
Slave 1 ... Slave n: Query Executor, Network Manager, Local Triple Indices

Evaluation Measures
Overall performance
● Query execution time
Horizontal containment
● Data transfer: variable bindings transferred between compute nodes
Vertical parallelization (VP)
● Workload entropy: entropy of join comparisons on each compute node
[Figure: example workload distributions over compute nodes, labeled low VP, high VP and low-medium VP]
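The slide does not spell out the entropy formula. A plausible reading, assumed here, is the Shannon entropy of the share of join comparisons performed on each compute node: $H = -\sum_i p_i \log_2 p_i$, where $p_i$ is the fraction of all join comparisons executed on node $i$. The base of the logarithm and the function name `workload_entropy` are assumptions made for this sketch.

```python
import math

def workload_entropy(comparisons_per_node):
    """Shannon entropy (base 2) of the join-comparison distribution.

    Low entropy: the work is concentrated on few nodes (low VP).
    High entropy: the work is spread evenly over the nodes (high VP)."""
    total = sum(comparisons_per_node)
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in comparisons_per_node:
        if count > 0:
            share = count / total
            entropy -= share * math.log2(share)
    return entropy

# Hypothetical join-comparison counts on four compute nodes.
concentrated = workload_entropy([100, 0, 0, 0])   # all work on one node
balanced = workload_entropy([25, 25, 25, 25])     # evenly spread work
skewed = workload_entropy([70, 10, 10, 10])       # somewhere in between

print(concentrated, balanced, skewed)
```

With n compute nodes the value ranges from 0 (a single node does all join comparisons) to log2(n) (perfectly even workload).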

Experimental Setup
Compared graph cover strategies:
● Hash, hierarchical hash and minimal edge-cut cover
Dataset:
● 1 trillion triple subset of BTC2014 [Käfer2014BTC]
Queries:
● Number of joins: 2 and 8 triple patterns
● Join pattern: path-shaped and star-shaped
● Selectivity: 0.001% and 0.01% (1M and 10M triples)
● Number of data sources: 1 and 3
Computer environment:
● 1 master with 4 cores, 8 GB RAM, 1 TB HDD
● 20 slaves, each with 1 core, 2 GB RAM, 300 GB HDD
● 1 Gbit Ethernet

Graph Cover Creation Time
[Figure: bar chart of cover creation time (in h, axis from 0 to 35) for HASH, HIERARCHICAL and MIN EDGE CUT]
● Minimal edge-cut cover requires the most time for creation
● Hash cover is created the fastest

Overall Query Performance
[Figure: execution time of queries Q1 to Q12 for HIERARCHICAL and MIN EDGE CUT, plotted on a log scale as change relative to HASH in %]
● Bushy query execution outperforms the other execution strategies
● Minimal edge-cut causes the slowest query execution in most cases
● Neither of the hash-based covers is faster in general
