On Data Placement Strategies in Distributed RDF Stores
Int. Workshop on Semantic Big Data (SBD 2017)
Daniel Janke, Steffen Staab, Matthias Thimm
19.05.2017
Institute for Web Science and Technologies, University of Koblenz-Landau, Germany


SLIDE 1

Institute for Web Science and Technologies · University of Koblenz-Landau, Germany

On Data Placement Strategies in Distributed RDF Stores

Int. Workshop on Semantic Big Data (SBD 2017)

Daniel Janke, Steffen Staab, Matthias Thimm
19.05.2017

SLIDE 3

Distributed RDF Stores

The need for trillion-triple stores has arisen in recent years

Scalable RDF stores in the cloud

Challenges:

  • Data placement strategies
  • Distributed query processing
  • Handling failures of compute nodes

Focus of our research: data placement strategies

[Figure: example RDF graph distributed over Compute Node 1 and Compute Node 2. Vertices: west:WeST, gesis:Gesis, west:martin, west:daniel, gesis:wanja, gesis:bello, gesis:Dog and the literals "Martin", "Daniel", "Wanja". Edge labels: ex:employs, foaf:givenname, foaf:knows, ex:ownedBy, rdf:type]

SLIDE 5

Data Placement Strategies and Scalability

Horizontal containment

  • Computation of individual query results on local data
  • Indicator for robust query processing when scaling horizontally

Vertical parallelization

  • Parallel computation of different query results on different compute nodes
  • Indicator for query processing scaling with growing result set sizes when scaling horizontally

Example query:

SELECT ?org ?name WHERE {?org ex:employs ?pers . ?pers foaf:givenname ?name}

[Figure: example RDF graph distributed over Compute Node 1 and Compute Node 2]

Commonly held belief: Horizontal containment dominates query processing effort (cf. [Huang2011SSQ, Lee2013EDP, Zhang2013ETS, …])
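
To make the two notions concrete, here is a minimal Python sketch that evaluates the example query over a toy graph cover. The triples, their assignment to compute nodes and the names `cover` and `evaluate` are illustrative assumptions loosely modelled on the example graph; they are not data from the talk.

```python
from collections import Counter

# Hypothetical graph cover: compute node id -> set of (subject, predicate, object).
# No triple is replicated in this toy example.
cover = {
    1: {
        ("west:WeST", "ex:employs", "west:martin"),
        ("west:WeST", "ex:employs", "west:daniel"),
        ("west:martin", "foaf:givenname", "Martin"),
        ("west:daniel", "foaf:givenname", "Daniel"),
    },
    2: {
        ("gesis:Gesis", "ex:employs", "gesis:wanja"),
        ("gesis:wanja", "foaf:givenname", "Wanja"),
    },
}


def evaluate(cover):
    """Evaluate {?org ex:employs ?pers . ?pers foaf:givenname ?name}.

    Returns tuples (org, name, node_of_employs_triple, node_of_givenname_triple).
    """
    employs = [(s, o, n) for n, chunk in cover.items()
               for s, p, o in chunk if p == "ex:employs"]
    names = [(s, o, n) for n, chunk in cover.items()
             for s, p, o in chunk if p == "foaf:givenname"]
    return [(org, name, n1, n2)
            for org, pers, n1 in employs
            for pers2, name, n2 in names
            if pers == pers2]


results = evaluate(cover)
# Horizontal containment: results whose matching triples all sit on one node.
local = [r for r in results if r[2] == r[3]]
# Vertical parallelization: how the results spread over the compute nodes.
per_node = Counter(r[3] for r in results)

print(len(local), "of", len(results), "results need no data transfer")
print(dict(per_node))
```

In this toy cover every result is computed locally (full horizontal containment), while the results are spread unevenly over the two nodes, which is the kind of imbalance vertical parallelization is concerned with.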

SLIDE 6

Outline

1) Data Placement Strategies
2) Benchmark methodology showing the interdependencies of data placement strategies and query processing
3) Analysis indicating that vertical parallelization may dominate horizontal containment
4) Conclusion

SLIDE 7

Graph Cover

Graph cover

  • Assignment of each triple to at least one compute node

Graph chunk

  • Set of triples assigned to a single compute node

[Figure: the example RDF graph split into two graph chunks on Compute Node 1 and Compute Node 2]
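
The two definitions can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names and the toy triples are hypothetical, and a graph cover is modelled as a dictionary from compute node id to graph chunk.

```python
def is_graph_cover(triples, chunks):
    """Return True if every triple of the graph is in at least one chunk.

    triples: set of (subject, predicate, object) tuples (the whole graph).
    chunks:  dict mapping a compute node id to its graph chunk (a set of triples).
    """
    assigned = set().union(*chunks.values()) if chunks else set()
    return set(triples) <= assigned


def replicated_triples(chunks):
    """Triples stored on more than one compute node (the definition allows this)."""
    seen, replicated = set(), set()
    for chunk in chunks.values():
        replicated |= seen & chunk
        seen |= chunk
    return replicated


# Toy data, assumed for illustration only.
graph = {
    ("west:WeST", "ex:employs", "west:daniel"),
    ("west:daniel", "foaf:givenname", "Daniel"),
    ("gesis:Gesis", "ex:employs", "gesis:wanja"),
}
chunks = {
    1: {("west:WeST", "ex:employs", "west:daniel"),
        ("west:daniel", "foaf:givenname", "Daniel")},
    2: {("gesis:Gesis", "ex:employs", "gesis:wanja"),
        ("west:WeST", "ex:employs", "west:daniel")},
}
print(is_graph_cover(graph, chunks))
print(replicated_triples(chunks))
```

Because a cover only requires "at least one" compute node per triple, replication is permitted; the second helper makes the replicated triples visible.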

SLIDE 8

Common Graph Cover Strategies

Hash cover [e.g. Harth2007YAF]

  • Triple placement is based on the subject hash modulo the number of compute nodes

Hierarchical cover [Lee2013SQO]

  • Triple placement is based on the hash of subject IRI prefixes

Minimal edge-cut cover [Karypis1998AFA]

  • Assign vertices (subjects and objects) to partitions such that
    – the number of edges between vertices of different partitions is minimized and
    – each partition contains approximately the same number of vertices

[Figure: placement of the example vertices aa, ab, ac, ba, bb, bc under each of the three strategies]

SLIDE 10

Common Evaluation Strategies

1) Evaluations of graph cover strategies using different databases
   => other components might bias evaluation results, e.g. [Wu2014SAS, Zeng2013ADG]
2) Usage of slow communication means like Hadoop File System
   => increased importance of horizontal containment, e.g. [Huang2011SSQ, Lee2013EDP]

Car 1 using fuel A, car 2 using fuel B: does fuel A or B allow for a higher speed?

Images from https://openclipart.org

SLIDE 11

Benchmark Methodology

Goal: Investigating the effect of the graph cover on scalability

[Figure: benchmark pipeline. Inputs: graph cover strategies, evaluation measures, dataset, queries, query execution strategy, distributed RDF store for arbitrary graph covers. Output of the benchmark: benchmark results]

SLIDE 12

Strategy for Generating Queries

Query generator: SPLODGE [Görlitz2012SSG]

  • Generates SPARQL queries for arbitrary datasets
  • Generates queries based on
    – Number of joins
    – Join pattern
    – Selectivity
    – Number of data sources

SLIDE 13

Query Execution Strategy

Designing query optimizers that suit arbitrary graph covers is difficult

Execution of several query execution trees:

  • Bushy
  • Left-linear
  • Right-linear

[Figure: three join trees over the triple patterns 1 to 4, labelled bushy, left-linear and right-linear]
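
The three tree shapes can be sketched as nested tuples, where a tuple `(left, right)` stands for a join of two subtrees. This is a minimal illustration using the usual textbook definitions; the function names and the leaf order are assumptions and need not match the order shown on the slide.

```python
def left_linear(patterns):
    """((p1 JOIN p2) JOIN p3) JOIN p4 ..."""
    tree = patterns[0]
    for pattern in patterns[1:]:
        tree = (tree, pattern)
    return tree


def right_linear(patterns):
    """p1 JOIN (p2 JOIN (p3 JOIN p4)) ..."""
    tree = patterns[-1]
    for pattern in reversed(patterns[:-1]):
        tree = (pattern, tree)
    return tree


def bushy(patterns):
    """Balanced tree: split the patterns in half and join the two subtrees."""
    if len(patterns) == 1:
        return patterns[0]
    mid = len(patterns) // 2
    return (bushy(patterns[:mid]), bushy(patterns[mid:]))


def depth(tree):
    """Number of join levels on the longest root-to-leaf path."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(depth(tree[0]), depth(tree[1]))


patterns = [1, 2, 3, 4]  # placeholders for four triple patterns
for build in (bushy, left_linear, right_linear):
    tree = build(patterns)
    print(build.__name__, tree, "depth:", depth(tree))
```

For four triple patterns the bushy tree has two join levels, while both linear trees have three; the two joins at the lower level of the bushy tree do not depend on each other.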

SLIDE 14

Koral

  • Graph cover independent distributed RDF store
  • Inspired by TriAD [GurajadaTheobald2014TAD]

[Figure: Koral architecture]

  • Master: Graph Cover Creator, Query Execution Coordinator, Dictionary, Statistics, Dictionary Encoder, Network Manager
  • Slave 1 … Slave n: Local Triple Indices, Query Executor, Network Manager

SLIDE 15

Evaluation Measures

Overall performance

  • Query execution time

Horizontal containment

  • Data transfer: variable bindings transferred between compute nodes

Vertical parallelization (VP)

  • Workload entropy: entropy of join comparisons on each compute node

[Figure: example workload distributions with low to high entropy and the corresponding degree of VP]
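
As a rough illustration of the workload entropy measure, the sketch below computes the Shannon entropy of the share of join comparisons performed on each compute node. The slide only states "entropy of join comparisons on each compute node"; the base-2 logarithm, the normalisation to shares and the function name are assumptions.

```python
import math


def workload_entropy(join_comparisons):
    """Shannon entropy (base 2) of the share of join comparisons per compute node.

    join_comparisons: list with one non-negative count per compute node.
    """
    total = sum(join_comparisons)
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in join_comparisons:
        if count > 0:
            share = count / total
            entropy -= share * math.log2(share)
    return entropy


# Perfectly balanced workload on 4 nodes versus all work on a single node.
print(workload_entropy([5, 5, 5, 5]))
print(workload_entropy([10, 0, 0, 0]))
```

Under these assumptions a balanced workload yields the maximum entropy (log2 of the number of nodes) and a workload concentrated on one node yields zero, so higher entropy indicates better vertical parallelization.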

SLIDE 16

Experimental Setup

Compared graph cover strategies:

  • Hash, hierarchical hash and minimal edge-cut cover

Dataset:

  • 1 billion triple subset of BTC2014 [Käfer2014BTC]

Queries:

  • Number of joins: 2 and 8 triple patterns
  • Join pattern: path-shaped and star-shaped
  • Selectivity: 0.001 and 0.01 (1M and 10M triples)
  • Number of data sources: 1 and 3

Computer environment:

  • 1 master with 4 cores, 8 GB RAM, 1 TB HDD
  • 20 slaves with 1 core, 2 GB RAM, 300 GB HDD each
  • 1 Gbit Ethernet

SLIDE 17

Graph Cover Creation Time

  • Minimal edge-cut cover requires the most time for creation
  • Hash cover is created the fastest

[Figure: bar chart of the cover creation time (in h, axis from 5 to 35) for HASH, HIERARCHICAL and MIN EDGE CUT]

SLIDE 18

Overall Query Performance

  • Bushy query execution outperforms the other execution strategies
  • Minimal edge-cut cover causes the slowest query execution in most cases
  • Neither of the hash-based covers is faster in general

[Figure: execution time of queries Q1 to Q12 for HIERARCHICAL and MIN EDGE CUT, shown as change relative to HASH in % (log scale)]

SLIDE 19

Horizontal Containment

  • Star-shaped queries produce no data transfer
  • Minimal edge-cut cover produces less data transfer
  • Hash-based covers produce similar data transfer

[Figure: data transfer of queries Q1 to Q6 for HIERARCHICAL and MIN EDGE CUT, shown as change relative to HASH in %]

SLIDE 20

Vertical Parallelization

  • Minimal edge-cut cover has the least balanced workload
  • Hash-based covers have similarly balanced workloads

[Figure: workload entropy (axis from 0.0 to 4.0) of queries Q1 to Q12 for HASH, HIERARCHICAL and MIN EDGE CUT]

SLIDE 21

Conclusion

Minimal edge-cut cover

  – Longest cover creation time
  – Lowest data transfer => high horizontal containment
  – Lowest workload balance => low vertical parallelization
  – Overall performance worse than hash-based covers

Hash-based covers have similar performance

Vertical parallelization might be more important than horizontal containment

Future work: Benchmarking of workload-aware graph cover strategies

SLIDE 22

Thank you for your attention!

Contributions:

1) Benchmark methodology showing the interdependencies of graph cover strategies and query processing
2) A flexible open-source platform for performing the benchmark
3) Analysis indicating that vertical parallelization may dominate horizontal containment

SLIDE 23

References

[Görlitz2012SSG] Görlitz, O., Thimm, M., & Staab, S. (2012). Splodge: Systematic generation of SPARQL benchmark queries for linked open data. The Semantic Web–ISWC 2012, 116–132.

[GurajadaTheobald2014TAD] Gurajada, S., Seufert, S., Miliaraki, I., & Theobald, M. (2014). TriAD: A Distributed Shared-nothing RDF Engine Based on Asynchronous Message Passing. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (pp. 289–300). New York, NY, USA: ACM.

[Harth2007YAF] Harth, A., Umbrich, J., Hogan, A., & Decker, S. (2007). YARS2: A Federated Repository for Querying Graph Structured Data from the Web. In K. Aberer, K.-S. Choi, N. Noy, D. Allemang, K.-I. Lee, L. Nixon, … P. Cudré-Mauroux (Eds.), The Semantic Web (Vol. 4825, pp. 211–224). Springer Berlin Heidelberg.

[Huang2011SSQ] Huang, J., Abadi, D. J., & Ren, K. (2011). Scalable SPARQL Querying of Large RDF Graphs. PVLDB, 4(11), 1123–1134.

SLIDE 24

References

[Käfer2014BTC] Käfer, T., & Harth, A. (2014). Billion Triples Challenge data set.

[Karypis1998AFA] Karypis, G., & Kumar, V. (1998). A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs. SIAM J. Sci. Comput., 20(1), 359–392.

[Lee2013EDP] Lee, K., & Liu, L. (2013). Efficient Data Partitioning Model for Heterogeneous Graphs in the Cloud. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (pp. 46:1–46:12). New York, NY, USA: ACM.

[Lee2013SQO] Lee, K., & Liu, L. (2013). Scaling Queries over Big RDF Graphs with Semantic Hash Partitioning. Proc. VLDB Endow., 6(14), 1894–1905.

[Wu2014SAS] Wu, B., Zhou, Y., Yuan, P., Jin, H., & Liu, L. (2014). SemStore: A Semantic-Preserving Distributed RDF Triple Store. In 23rd ACM International Conference on Information and Knowledge Management (CIKM). Shanghai.

SLIDE 25

References

[Zeng2013ADG] Zeng, K., Yang, J., Wang, H., Shao, B., & Wang, Z. (2013). A Distributed Graph Engine for Web Scale RDF Data. Proc. VLDB Endow., 6(4), 265–276.

[Zhang2013ETS] Zhang, X., Chen, L., Tong, Y., & Wang, M. (2013). EAGRE: Towards scalable I/O efficient SPARQL query evaluation on the cloud. In Data Engineering (ICDE), 2013 IEEE 29th International Conference on (pp. 565–576).

SLIDE 26

Hash Cover [e.g. Harth2007YAF]

Triple placement is based on the subject hash modulo the number of compute nodes
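
A minimal sketch of this placement rule in Python follows. The choice of MD5 as hash function and the names `node_for_subject` and `hash_cover` are assumptions made for illustration; the slide does not say which hash function is used.

```python
import hashlib


def node_for_subject(subject, num_nodes):
    """Map a subject to a compute node via hash(subject) mod num_nodes.

    MD5 is an arbitrary choice (assumption). It is used instead of Python's
    built-in hash(), which is randomised per process for strings.
    """
    digest = hashlib.md5(subject.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def hash_cover(triples, num_nodes):
    """Assign every triple to the chunk selected by its subject."""
    chunks = {node: set() for node in range(num_nodes)}
    for triple in triples:
        chunks[node_for_subject(triple[0], num_nodes)].add(triple)
    return chunks


# Toy triples, assumed for illustration only.
triples = [
    ("west:WeST", "ex:employs", "west:daniel"),
    ("west:WeST", "ex:employs", "west:martin"),
    ("west:daniel", "foaf:givenname", "Daniel"),
]
for node, chunk in hash_cover(triples, 2).items():
    print(node, sorted(chunk))
```

All triples with the same subject end up in the same chunk, which is why joins on a shared subject can be answered on a single compute node under this cover.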

SLIDE 27

Hierarchical Hash Cover [Lee2013SQO]

Triple placement is based on prefixes of subject IRIs

IRI: http://www.w3.org/1999/02/22-rdf-syntax-ns#type

Path hierarchy: org/w3/www/1999/02/22-rdf-syntax-ns/type

Determine path hierarchy prefix such that

  – there exist at least […] hierarchy prefixes
  – that are shared by at least […] of all triples
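
The conversion from an IRI to its path hierarchy can be sketched as below. The prefix depth is passed in as a hypothetical parameter `depth`, because the criteria for choosing it are only partly legible above; the MD5 hash and all function names are likewise assumptions.

```python
import hashlib
from urllib.parse import urlsplit


def path_hierarchy(iri):
    """Split an IRI into hierarchy levels: reversed host labels, path segments, fragment."""
    parts = urlsplit(iri)
    host = parts.hostname or ""
    levels = [label for label in reversed(host.split(".")) if label]
    levels += [segment for segment in parts.path.split("/") if segment]
    if parts.fragment:
        levels.append(parts.fragment)
    return levels


def hierarchy_prefix(iri, depth):
    """First `depth` levels of the path hierarchy, joined with '/'."""
    return "/".join(path_hierarchy(iri)[:depth])


def node_for_iri(iri, depth, num_nodes):
    """Hash the hierarchy prefix of a subject IRI to a compute node (MD5 is an assumption)."""
    digest = hashlib.md5(hierarchy_prefix(iri, depth).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


iri = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
print("/".join(path_hierarchy(iri)))
print(hierarchy_prefix(iri, 3))
```

Subjects that share the chosen hierarchy prefix are hashed to the same compute node, so triples about resources from the same namespace tend to be stored together.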

SLIDE 28

Minimal Edge-Cut Cover

Tries to solve the k-way graph partitioning problem [Karypis1998AFA]

1) Assign vertices (subjects and objects) to partitions such that
   – the number of edges between vertices of different partitions is minimized and
   – each partition contains approximately the same number of vertices
2) Assign each triple to the partition its subject is assigned to
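
The sketch below does not implement the multilevel partitioner of [Karypis1998AFA]. It only shows, for a given vertex assignment, how the edge cut being minimized can be counted and how step 2 places triples. The function names and the toy graph are assumptions.

```python
def edge_cut(triples, partition_of):
    """Count triples whose subject and object lie in different partitions."""
    return sum(1 for s, _, o in triples if partition_of[s] != partition_of[o])


def triples_to_chunks(triples, partition_of):
    """Step 2: place each triple in the partition its subject is assigned to."""
    chunks = {}
    for triple in triples:
        chunks.setdefault(partition_of[triple[0]], set()).add(triple)
    return chunks


# Toy path graph a -> b -> c -> d, assumed for illustration only.
triples = [("a", "p", "b"), ("b", "p", "c"), ("c", "p", "d")]
good = {"a": 0, "b": 0, "c": 1, "d": 1}  # balanced, one edge crosses partitions
bad = {"a": 0, "b": 1, "c": 0, "d": 1}   # balanced, every edge crosses partitions

print(edge_cut(triples, good), edge_cut(triples, bad))
print(triples_to_chunks(triples, good))
```

Both assignments are balanced with two vertices per partition, but the first one cuts fewer edges and is therefore the one a minimal edge-cut partitioner would prefer.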

SLIDE 29

Chunk Sizes

  • Minimal edge-cut cover has the most unbalanced chunks
  • Hash-based covers have equally-sized chunks

[Figure: number of triples (×10^8, axis from 0.0 to 1.4) in each of the chunks 0 to 19 for HASH, HIERARCHICAL and MIN EDGE CUT]