SPARQLing Pig: Processing Linked Data with Pig Latin


SLIDE 1

SPARQLing Pig

Processing Linked Data with Pig Latin Stefan Hagedorn, Katja Hose, Kai-Uwe Sattler

BTW 2015

SLIDE 2

  • connect information / datasets
  • triple format
  • query language: SPARQL
  • federated query processing

(subject, predicate, object)

Motivation

<event1> <type> <Concert> .
<event1> <start> "2012-08-17T18:00" .
<event1> <geo:long> "-1.135E2"^^<xsd:double> .
<event1> <geo:lat> "5.353E1"^^<xsd:double> .
<event1> <artist> <Metallica> .
<Metallica> <type> <Band> .
<Metallica> <name> "Metallica" .
<Metallica> <founded> "1981"^^<xsd:integer> .

Linked Data / RDF

Pig

  • data flow language, tuple oriented
  • compiled to MapReduce
  • process large datasets
  • access data in HDFS

a = LOAD "hdfs:///data.csv" AS (id, x: int, y: int);
b = FILTER a BY x > 10 AND y < 50;
c = FOREACH b GENERATE id;
STORE c INTO "hdfs:///data2.csv";

SELECT ?x ?y WHERE {
  ?event <long> ?x .
  ?event <lat> ?y .
  ?event <artist> ?artist .
  ?artist <name> "Metallica" .
}
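For intuition, the tuple-oriented LOAD/FILTER/FOREACH dataflow of the Pig script above can be sketched in plain Python (the rows are illustrative, not the actual HDFS file):

```python
# Sketch of the Pig dataflow: LOAD -> FILTER -> FOREACH/GENERATE -> STORE.
rows = [  # a = LOAD ... AS (id, x: int, y: int)
    ("e1", 12, 40),
    ("e2", 5, 40),
    ("e3", 20, 60),
]
b = [r for r in rows if r[1] > 10 and r[2] < 50]  # FILTER a BY x > 10 AND y < 50
c = [r[0] for r in b]                             # FOREACH b GENERATE id
print(c)                                          # STORE would write this out
```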


SLIDE 3

Motivation

  • Big Data (tuples)
  • Linked Data (triples)
  • combined execution model


existing solutions: SPARQL to Pig, e.g.:

Alexander Schätzle et al., PigSPARQL: Übersetzung von SPARQL nach Pig Latin (translating SPARQL to Pig Latin), BTW 2011

SLIDE 4

Motivation

Problems

  • no BGPs in Pig
  • self joins to reconstruct entities
  • load dataset twice (or more)
  • COGROUP: combination of MapReduce jobs

Contribution

  • Pig Latin language extension: data model, SPARQL-like features
  • not only one dataset: access remote data
  • efficient processing of Linked Data in Pig
  • results as foundation for a cost-based Pig compiler/rewriter


SLIDE 5
Outline

  • 1. Data Model
  • 2. Pig Extensions
    - conversion
    - load/access
    - BGP support
  • 3. Extended Pig Rewriting
  • 4. Evaluation

SLIDE 6

Data Model

RDF:
  • very flexible model
  • represents arbitrary structures and graphs
  • requires self joins in Pig

fixed schema not flexible enough: (s, p1, ..., pn)

Our approach:

for each subject: bag of predicate-object pairs

{ subject: bytearray, stmts: { (predicate: bytearray, object: bytearray) } }

<event1> <artist> <Metallica> .
<event1> <start> "2012-08-17T18:00" .
<event1> <geo:long> "-1.135E2" .
<event1> <geo:lat> "5.353E1" .
<event1> <type> <Concert> .
<Metallica> <type> <Band> .
<Metallica> <name> "Metallica" .
<Metallica> <founded> "1981" .

(<event1>, { (<artist>,<Metallica>), (<start>,"2012-08-17T18:00"), (<geo:long>,"-1.135E2"), (<geo:lat>,"5.353E1"), (<type>,<Concert>) })
(<Metallica>, { (<type>,<Band>), (<name>,"Metallica"), (<founded>,"1981") })
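The grouping step behind this data model can be sketched in Python (triples hard-coded for illustration): each subject maps to a bag of its predicate-object pairs, so an entity is reconstructed without a self join.

```python
from collections import defaultdict

# Group plain (s, p, o) triples into the proposed model:
# for each subject, a bag of (predicate, object) pairs.
triples = [
    ("<event1>", "<artist>", "<Metallica>"),
    ("<event1>", "<type>", "<Concert>"),
    ("<Metallica>", "<type>", "<Band>"),
    ("<Metallica>", "<name>", '"Metallica"'),
]
groups = defaultdict(list)
for s, p, o in triples:
    groups[s].append((p, o))
print(dict(groups))
```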


H. Kim et al., From SPARQL to MapReduce: The Journey Using a Nested TripleGroup Algebra, PVLDB 2011

SLIDE 7

Data Model

RDF:
  • very flexible model
  • represents arbitrary structures and graphs
  • requires self joins in Pig

fixed schema not flexible enough: (s, p1, ..., pn)

Our approach:

{ predicate: bytearray, stmts: { (subject: bytearray, object: bytearray) } }

<event1> <artist> <Metallica> .
<event1> <start> "2012-08-17T18:00" .
<event1> <geo:long> "-1.135E2" .
<event1> <geo:lat> "5.353E1" .
<event1> <type> <Concert> .
<Metallica> <type> <Band> .
<Metallica> <name> "Metallica" .
<Metallica> <founded> "1981" .

(<type>, { (<Metallica>,<Band>), (<event1>,<Concert>) })
(<artist>, { (<event1>,<Metallica>) })
(<start>, { (<event1>,"2012-08-17T18:00") })
(<geo:long>, { (<event1>,"-1.135E2") })
(<geo:lat>, { (<event1>,"5.353E1") })
(<name>, { (<Metallica>,"Metallica") })
(<founded>, { (<Metallica>,"1981") })

for each predicate: bag of subject-object pairs



SLIDE 8

Pig Extensions

TUPLIFY

triple_groups = FOREACH (GROUP triples BY subject) GENERATE group AS subject, triples.(predicate, object) AS stmts;

triple_groups = TUPLIFY triples BY subject;

  • avoid self-joins
  • convert plain triples to triple-bag format using GROUP BY

  • on any component
  • explicitly or implicitly (rewriting rules)


SLIDE 9

Pig Extensions

LOAD - local

  • load plain N3 files: text loading is supported natively by Pig
  • we use a UDF for tokenizing text lines to triples
  • wrapped in the RDFFileLoad macro

DEFINE RDFFileLoad(file) RETURNS T {
  lines = LOAD '$file' AS (txt: chararray);
  $T = FOREACH lines GENERATE FLATTEN(pig.RDFize(txt)) AS (subject, predicate, object);
};

triples = RDFFileLoad("hdfs:///rdf-data.nt");

rdf_tuples = LOAD "rdf-data.dat" USING BinStorage() AS (subject: bytearray, stmts: bag{t: (predicate: bytearray, object: bytearray)});

load TUPLIFIED dataset using BinStorage
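The paper does not show the pig.RDFize UDF itself; a minimal Python stand-in for tokenizing one N-Triples line might look like this (a sketch only: escapes, blank nodes, and comments are not handled):

```python
def rdfize(line):
    """Split one N-Triples line into (subject, predicate, object).

    Minimal sketch: strips the trailing ' .', then splits on the first
    two spaces, so literals containing spaces stay intact in the object.
    """
    line = line.strip()
    if line.endswith("."):
        line = line[:-1].rstrip()
    s, p, o = line.split(" ", 2)
    return (s, p, o)

print(rdfize('<event1> <start> "2012-08-17T18:00" .'))
```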


SLIDE 10

Pig Extensions

LOAD - remote

  • run SPARQL query on endpoints
  • filter remote data beforehand
  • depends on user query

raw = LOAD "http://endpoint.org:8080/sparql" USING SPARQLLoader("SELECT * WHERE { ?s ?p ?o }") AS (subject, predicate, object);

materialize (remote) data:
  • share across queries
  • could be used for frequent intermediate results

("http://endpoint.org:8080/sparql", "SELECT * WHERE {?s ?p ?o}") -> "hdfs:///rdf-data.nt"

raw = RDFFileLoad("hdfs:///rdf-data.nt");
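The materialization idea, reusing a locally stored copy instead of re-querying the endpoint, can be sketched as a cache keyed by (endpoint, query); the fetch and store callables here are hypothetical stand-ins, not the paper's implementation:

```python
# Map (endpoint, query) pairs to a local path, so repeated queries load
# the materialized copy instead of hitting the remote SPARQL endpoint.
cache = {}  # (endpoint, query) -> local path

def load_remote(endpoint, query, fetch_remote, store):
    key = (endpoint, query)
    if key not in cache:
        data = fetch_remote(endpoint, query)  # run SPARQL query remotely
        cache[key] = store(data)              # materialize, e.g. to HDFS
    return cache[key]

calls = []
def fake_fetch(ep, q):
    calls.append(q)                           # record each remote call
    return [("<s>", "<p>", "<o>")]

path = load_remote("http://endpoint.org:8080/sparql",
                   "SELECT * WHERE {?s ?p ?o}",
                   fake_fetch, lambda data: "hdfs:///rdf-data.nt")
path2 = load_remote("http://endpoint.org:8080/sparql",
                    "SELECT * WHERE {?s ?p ?o}",
                    fake_fetch, lambda data: "hdfs:///rdf-data.nt")
print(path, len(calls))  # second call is served from the cache
```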


SLIDE 11

Pig Extensions

BGP Support

result = FILTER triples BY { ?s <geo:lat> ?o1 . ?s <geo:long> ?o2 };

  • extended FILTER operator
  • hides internal details of BGP processing
  • implementation depends on input schema
  • implemented as a language extension: internal operators stay unchanged
  • rewriting step in Pig parser: transformation to native Pig code
  • Pig compiler for optimization


SLIDE 12

Rewriting

Example - FILTER (non-grouping component)

out(s,{(p,o)}) = FILTER in(s,{(p,o)}) BY { ?s ?p 'value' };

==>

tmp(s,{(p,o)},cnt) = FOREACH in(s,{(p,o)}) {
  t = FILTER stmts BY o == 'value';
  GENERATE *, COUNT(t) AS cnt;
};
out(s,{(p,o)},cnt) = FILTER tmp(s,{(p,o)},cnt) BY cnt > 0;

Example - FILTER (grouping component)

out(s,{(p,o)}) = FILTER in(s,{(p,o)}) BY { 'value' ?p ?o };

==>

out(s,{(p,o)}) = FILTER in(s,{(p,o)}) BY s == 'value';
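Assuming subject-grouped input, the two rewrite cases can be sketched in Python: a pattern on the grouping component becomes a direct test on the subject, while a non-grouping component is checked inside each statement bag via a count (data is illustrative):

```python
groups = [
    ("<event1>", [("<artist>", "<Metallica>"), ("<type>", "<Concert>")]),
    ("<Metallica>", [("<type>", "<Band>"), ("<name>", '"Metallica"')]),
]

# Grouping component (subject): plain comparison, no bag inspection needed.
by_subject = [g for g in groups if g[0] == "<Metallica>"]

# Non-grouping component (object): count matching statements per group and
# keep groups where the count is > 0 (mirrors the COUNT/FILTER rewrite).
def match_object(group, value):
    cnt = sum(1 for p, o in group[1] if o == value)
    return cnt > 0

by_object = [g for g in groups if match_object(g, '"Metallica"')]
print(len(by_subject), len(by_object))
```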


SLIDE 13

Rewriting

Example - STAR JOIN (on grouping component)

result = FILTER triples BY { ?s <geo:lat> ?o1 . ?s <geo:long> ?o2 };

tmp = FOREACH triples {
  t1 = FILTER stmts BY predicate == "<geo:lat>";
  t2 = FILTER stmts BY predicate == "<geo:long>";
  GENERATE *, COUNT(t1) AS cnt1, COUNT(t2) AS cnt2;
};
result = FILTER tmp BY cnt1 > 0 AND cnt2 > 0;

out(s,{(p,o)}) = FILTER in(s,{(p,o)}) BY { TP1 . ... TPN . };

==>

tmp(s,{(p,o)},cnt1,...,cntN) = FOREACH in(s,{(p,o)}) {
  t1 = FILTER stmts BY p == 'p1';
  ...
  tN = FILTER stmts BY p == 'pN';
  GENERATE *, COUNT(t1) AS cnt1, ..., COUNT(tN) AS cntN;
};
out(s,{(p,o)},cnt1,...,cntN) = FILTER tmp(s,{(p,o)},cnt1,...,cntN) BY cnt1 > 0 AND ... AND cntN > 0;
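The star-join rewrite generalizes the FILTER case to N triple patterns: per group, count the matches for each required predicate and keep only groups where every count is positive. A sketch with illustrative data:

```python
groups = [
    ("<event1>", [("<geo:lat>", '"5.353E1"'), ("<geo:long>", '"-1.135E2"')]),
    ("<event2>", [("<geo:lat>", '"1.0E0"')]),  # missing <geo:long>
]

def star_join(groups, predicates):
    # For each group, compute cnt_i per required predicate and keep the
    # group only if all counts are > 0, i.e. all patterns matched.
    result = []
    for subj, stmts in groups:
        counts = [sum(1 for p, _ in stmts if p == pred) for pred in predicates]
        if all(c > 0 for c in counts):
            result.append(subj)
    return result

print(star_join(groups, ["<geo:lat>", "<geo:long>"]))
```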


SLIDE 14

Evaluation

  • scripts manually rewritten
  • dataset: 8 GB, 54 million statements
  • Hadoop cluster: 8 nodes, Pig 0.12

triples = RDFLoad("hdfs:///eventful.nt");
result = FILTER triples BY { ?s <geo:lat> ?o1 . ?s <geo:long> ?o2 };

Self-Join


SLIDE 15

Evaluation

FILTER (non-grouping component)

triples = RDFLoad("hdfs:///eventful.nt");
result = FILTER triples BY { ?s ?p "Metallica" };


SLIDE 16

Conclusion

  • native Pig data model not suitable for RDF data: combination of self-joins and filter needed
  • support for BGPs in Pig Latin
  • join with remote data
  • rewriter produces native Pig code: use Pig optimizer
  • allows easier and faster Linked Data processing in Pig
  • foundation for cost-based optimizer
  • materialized (intermediate) results