SPARQLing Pig
Processing Linked Data with Pig Latin
Stefan Hagedorn, Katja Hose, Kai-Uwe Sattler
BTW 2015
Motivation
Linked Data:
connect information / datasets
triple format: (subject, predicate, object)
query language: SPARQL
federated query processing
<event1> <type> <Concert> .
<event1> <start> "2012-08-17T18:00" .
<event1> <geo:long> "-1.135E2"^^<xsd:double> .
<event1> <geo:lat> "5.353E1"^^<xsd:double> .
<event1> <artist> <Metallica> .
<Metallica> <type> <Band> .
<Metallica> <name> "Metallica" .
<Metallica> <founded> "1981"^^<xsd:integer> .
Pig:
data flow language, tuple oriented
compiled to MapReduce
processes large datasets, accesses data in HDFS

a = LOAD 'hdfs:///data.csv' AS (id, x: int, y: int);
b = FILTER a BY x > 10 AND y < 50;
c = FOREACH b GENERATE id;
STORE c INTO 'hdfs:///data2.csv';

SPARQL query over the Linked Data example:
SELECT ?x ?y WHERE {
  ?event <long> ?x .
  ?event <lat> ?y .
  ?event <artist> ?artist .
  ?artist <name> "Metallica" .
}
goal: a combined execution model for Big Data (tuples) and Linked Data (triples)
existing solutions translate SPARQL to Pig, e.g.:
Alexander Schätzle et al., PigSPARQL: Übersetzung von SPARQL nach PigLatin, BTW 2011

problems:
no BGPs in Pig: self joins needed to reconstruct entities (see the sketch below)
the dataset is loaded twice (or more)
COGROUP: combination of MapReduce jobs
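To make the self-join cost concrete, here is a minimal sketch of such a translation for the BGP { ?s <geo:lat> ?o1 . ?s <geo:long> ?o2 } over plain triples; the file path and the whitespace-delimited load are illustrative assumptions, not the actual PigSPARQL output.

-- load the triple file twice, once per triple pattern (illustrative; a real
-- translation would use a proper N-Triples loader)
t1 = LOAD 'hdfs:///rdf-data.nt' USING PigStorage(' ') AS (s, p, o);
t2 = LOAD 'hdfs:///rdf-data.nt' USING PigStorage(' ') AS (s, p, o);
lats  = FILTER t1 BY p == '<geo:lat>';
longs = FILTER t2 BY p == '<geo:long>';
-- self join on the shared subject variable ?s: one MapReduce join per join variable
result = JOIN lats BY s, longs BY s;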
goals:
Pig Latin language extension: data model, SPARQL-like features
not only one dataset: access remote data
efficient processing of Linked Data in Pig
results as foundation for a cost-based Pig compiler/rewriter
building blocks: conversion, load/access, BGP support
RDF is a very flexible model: it represents arbitrary structures and graphs, but requires self joins in Pig
a fixed schema is not flexible enough: (s, p1, ..., pn)
for each subject: bag of predicate-object pairs
{ subject: bytearray, stmts: { (predicate: bytearray, object: bytearray) } }
input triples:
<event1> <artist> <Metallica> .
<event1> <start> "2012-08-17T18:00" .
<event1> <geo:long> "-1.135E2" .
<event1> <geo:lat> "5.353E1" .
<event1> <type> <Concert> .
<Metallica> <type> <Band> .
<Metallica> <name> "Metallica" .
<Metallica> <founded> "1981" .

grouped by subject:
(<event1>, { (<artist>,<Metallica>), (<start>,"2012-08-17T18:00"), (<geo:long>,"-1.135E2"), (<geo:lat>,"5.353E1"), (<type>,<Concert>) })
(<Metallica>, { (<type>,<Band>), (<name>,"Metallica"), (<founded>,"1981") })
From SPARQL to MapReduce: The Journey Using a Nested TripleGroup Algebra, PVLDB 2011
alternative grouping — for each predicate: bag of subject-object pairs

{ predicate: bytearray, stmts: { (subject: bytearray, object: bytearray) } }

same input triples, grouped by predicate:
(<type>, { (<Metallica>,<Band>), (<event1>,<Concert>) })
(<artist>, { (<event1>,<Metallica>) })
(<start>, { (<event1>,"2012-08-17T18:00") })
(<geo:long>, { (<event1>,"-1.135E2") })
(<geo:lat>, { (<event1>,"5.353E1") })
(<name>, { (<Metallica>,"Metallica") })
(<founded>, { (<Metallica>,"1981") })
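A minimal sketch of how this predicate-grouped form could be produced with plain Pig, analogous to the GROUP BY subject shown below; the alias names are illustrative.

triples = RDFFileLoad('hdfs:///rdf-data.nt');   -- macro defined below
-- group the plain triples by predicate instead of by subject
pred_groups = FOREACH (GROUP triples BY predicate) GENERATE
                group AS predicate,
                triples.(subject, object) AS stmts;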
avoid self joins: convert plain triples into the triple-bag format using GROUP BY,
either explicitly or implicitly (via rewriting rules)

triple_groups = FOREACH (GROUP triples BY subject) GENERATE
                  group AS subject,
                  triples.(predicate, object) AS stmts;

proposed shorthand (language extension):
triple_groups = TUPLIFY triples BY subject;
loading plain N3 files: text loading is natively supported by Pig; a UDF tokenizes the text lines into triples (RDFFileLoad macro)
DEFINE RDFFileLoad(file) RETURNS T {
  lines = LOAD '$file' AS (txt: chararray);
  $T = FOREACH lines GENERATE FLATTEN(pig.RDFize(txt)) AS (subject, predicate, object);
};
triples = RDFFileLoad('hdfs:///rdf-data.nt');

load a TUPLIFIED dataset using BinStorage:
rdf_tuples = LOAD 'rdf-data.dat' USING BinStorage()
    AS (subject: bytearray, stmts: bag{t: (predicate: bytearray, object: bytearray)});
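How such a BinStorage file could be produced in the first place — a minimal sketch combining the pieces above; the output path is an assumption.

-- materialize the tuplified form once so later scripts can load it
-- with BinStorage instead of re-grouping the raw triples
triples = RDFFileLoad('hdfs:///rdf-data.nt');
triple_groups = FOREACH (GROUP triples BY subject) GENERATE
                  group AS subject,
                  triples.(predicate, object) AS stmts;
STORE triple_groups INTO 'rdf-data.dat' USING BinStorage();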
materialize (remote) data to share it across queries; could also be used for frequent intermediate results
raw = LOAD 'http://endpoint.org:8080/sparql'
      USING SPARQLLoader('SELECT * WHERE { ?s ?p ?o }')
      AS (subject, predicate, object);

raw = RDFFileLoad('hdfs:///rdf-data.nt');
run a SPARQL query on the endpoints to filter remote data beforehand; which query to push down depends on the user query
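A minimal sketch of such a pushdown, reusing the SPARQLLoader from above; the endpoint URL, the pushed-down pattern, and the output path are illustrative assumptions.

-- only fetch triples about entities named "Metallica" instead of the whole dataset
metallica = LOAD 'http://endpoint.org:8080/sparql'
            USING SPARQLLoader('SELECT * WHERE { ?s <name> "Metallica" . ?s ?p ?o }')
            AS (subject, predicate, object);
-- materialize the filtered remote data so it can be shared across queries
STORE metallica INTO 'hdfs:///metallica-data.nt';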
result = FILTER triples BY { ?s <geo:lat> ?o1 . ?s <geo:long> ?o2 };
extended FILTER operator: hides the internal details of BGP processing; the implementation depends on the input schema
implemented as a language extension: internal operators stay unchanged; a rewriting step in the Pig parser transforms the extended FILTER into native Pig code, so the Pig compiler can still be used for optimization
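Put together, a complete script using the proposed extensions might look like the following sketch; the file paths are assumptions, and TUPLIFY and the BGP FILTER are the language extensions described above, not standard Pig.

-- load plain triples, convert to triple groups, and match the BGP
triples = RDFFileLoad('hdfs:///eventful.nt');
triple_groups = TUPLIFY triples BY subject;
result = FILTER triple_groups BY { ?s <geo:lat> ?o1 . ?s <geo:long> ?o2 };
STORE result INTO 'hdfs:///concert-locations';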
BGP FILTER rewriting: single triple pattern (non-grouping component)
schema of in: (s, {(p, o)})

constant at the object position:
out = FILTER in BY { ?s ?p 'value' };
  ==>
tmp = FOREACH in {
        t = FILTER stmts BY o == 'value';
        GENERATE *, COUNT(t) AS cnt;
      };                                    -- schema of tmp: (s, {(p, o)}, cnt)
out = FILTER tmp BY cnt > 0;

special case — constant at the subject (the grouping component):
out = FILTER in BY { 'value' ?p ?o };
  ==>
out = FILTER in BY s == 'value';            -- schema of out: (s, {(p, o)})
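Applied to the second experiment query below, { ?s ?p "Metallica" }, the object-constant rule would yield roughly the following native Pig code; this is a sketch with illustrative alias names, assuming literals keep their N-Triples quoting just as IRIs keep their angle brackets above.

triples = LOAD 'rdf-data.dat' USING BinStorage()
          AS (subject: bytearray, stmts: bag{t: (predicate: bytearray, object: bytearray)});
-- keep only triple groups whose bag contains a statement with object "Metallica"
tmp = FOREACH triples {
        t = FILTER stmts BY object == '"Metallica"';
        GENERATE *, COUNT(t) AS cnt;
      };
result = FILTER tmp BY cnt > 0;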
BGP FILTER rewriting: multiple triple patterns joined on the grouping component

result = FILTER triples BY { ?s <geo:lat> ?o1 . ?s <geo:long> ?o2 };
  ==>
tmp = FOREACH triples {
        t1 = FILTER stmts BY predicate == '<geo:lat>';
        t2 = FILTER stmts BY predicate == '<geo:long>';
        GENERATE *, COUNT(t1) AS cnt1, COUNT(t2) AS cnt2;
      };
result = FILTER tmp BY cnt1 > 0 AND cnt2 > 0;

general rule for N triple patterns sharing the grouping component (schema of in: (s, {(p, o)})):
out = FILTER in BY { TP1 . ... TPN . };
  ==>
tmp = FOREACH in {
        t1 = FILTER stmts BY p == 'p1';
        ...
        tN = FILTER stmts BY p == 'pN';
        GENERATE *, COUNT(t1) AS cnt1, ..., COUNT(tN) AS cntN;
      };                                    -- schema of tmp: (s, {(p, o)}, cnt1, ..., cntN)
out = FILTER tmp BY cnt1 > 0 AND ... AND cntN > 0;
Experiments: scripts manually rewritten; dataset: 8 GB, 54 million statements; Hadoop cluster: 8 nodes, Pig 0.12
Query 1:
triples = RDFLoad('hdfs:///eventful.nt');
result = FILTER triples BY { ?s <geo:lat> ?o1 . ?s <geo:long> ?o2 };
Query 2:
triples = RDFLoad('hdfs:///eventful.nt');
result = FILTER triples BY { ?s ?p "Metallica" };
Conclusion:
the native Pig data model is not suitable for RDF data: a combination of self joins and filters is needed
support for BGPs in Pig Latin and joins with remote data
the rewriter produces native Pig code and can use the Pig optimizer
allows easier and faster Linked Data processing in Pig
foundation for a cost-based optimizer and materialized (intermediate) results