SPARQLing Pig
Processing Linked Data with Pig Latin
Stefan Hagedorn, Katja Hose, Kai-Uwe Sattler
BTW 2015
Motivation
Linked Data:
connect information / datasets
triple format: (subject, predicate, object)
query language: SPARQL
federated query processing
<event1> <type> <Concert> .
<event1> <start> "2012-08-17T18:00" .
<event1> <geo:long> "-1.135E2"^^<xsd:double> .
<event1> <geo:lat> "5.353E1"^^<xsd:double> .
<event1> <artist> <Metallica> .
<Metallica> <type> <Band> .
<Metallica> <name> "Metallica" .
<Metallica> <founded> "1981"^^<xsd:integer> .
Pig:
data flow language, tuple oriented
compiled to MapReduce
processes large datasets, accesses data in HDFS

a = LOAD 'hdfs:///data.csv' AS (id, x: int, y: int);
b = FILTER a BY x > 10 AND y < 50;
c = FOREACH b GENERATE id;
STORE c INTO 'hdfs:///data2.csv';

SPARQL query over the Linked Data example:
SELECT ?x ?y WHERE {
  ?event <long> ?x .
  ?event <lat> ?y .
  ?event <artist> ?artist .
  ?artist <name> "Metallica" .
}
goal: a combined execution model for Big Data (tuples) and Linked Data (triples)
existing solutions translate SPARQL to Pig, e.g.:
Alexander Schätzle et al., PigSPARQL: Übersetzung von SPARQL nach PigLatin, BTW 2011

problems:
no BGPs in Pig: self joins needed to reconstruct entities (see the sketch below)
the dataset is loaded twice (or more)
COGROUP: combination of MapReduce jobs
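To make the self-join cost concrete, here is a minimal sketch of such a translation for the BGP { ?s <geo:lat> ?o1 . ?s <geo:long> ?o2 } over plain triples; the file path and the whitespace-delimited load are illustrative assumptions, not the actual PigSPARQL output.

-- load the triple file twice, once per triple pattern (illustrative; a real
-- translation would use a proper N-Triples loader)
t1 = LOAD 'hdfs:///rdf-data.nt' USING PigStorage(' ') AS (s, p, o);
t2 = LOAD 'hdfs:///rdf-data.nt' USING PigStorage(' ') AS (s, p, o);
lats  = FILTER t1 BY p == '<geo:lat>';
longs = FILTER t2 BY p == '<geo:long>';
-- self join on the shared subject variable ?s: one MapReduce join per join variable
result = JOIN lats BY s, longs BY s;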
goals:
Pig Latin language extension: data model, SPARQL-like features
not only one dataset: access remote data
efficient processing of Linked Data in Pig
results as foundation for a cost-based Pig compiler/rewriter
building blocks: conversion, load/access, BGP support
RDF is a very flexible model: it represents arbitrary structures and graphs, but requires self joins in Pig
a fixed schema is not flexible enough: (s, p1, ..., pn)
for each subject: bag of predicate-object pairs
{ subject: bytearray, stmts: { (predicate: bytearray, object: bytearray) } }
input triples:
<event1> <artist> <Metallica> .
<event1> <start> "2012-08-17T18:00" .
<event1> <geo:long> "-1.135E2" .
<event1> <geo:lat> "5.353E1" .
<event1> <type> <Concert> .
<Metallica> <type> <Band> .
<Metallica> <name> "Metallica" .
<Metallica> <founded> "1981" .

grouped by subject:
(<event1>, { (<artist>,<Metallica>), (<start>,"2012-08-17T18:00"), (<geo:long>,"-1.135E2"), (<geo:lat>,"5.353E1"), (<type>,<Concert>) })
(<Metallica>, { (<type>,<Band>), (<name>,"Metallica"), (<founded>,"1981") })
From SPARQL to MapReduce: The Journey Using a Nested TripleGroup Algebra, PVLDB 2011
alternative grouping — for each predicate: bag of subject-object pairs

{ predicate: bytearray, stmts: { (subject: bytearray, object: bytearray) } }

same input triples, grouped by predicate:
(<type>, { (<Metallica>,<Band>), (<event1>,<Concert>) })
(<artist>, { (<event1>,<Metallica>) })
(<start>, { (<event1>,"2012-08-17T18:00") })
(<geo:long>, { (<event1>,"-1.135E2") })
(<geo:lat>, { (<event1>,"5.353E1") })
(<name>, { (<Metallica>,"Metallica") })
(<founded>, { (<Metallica>,"1981") })
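A minimal sketch of how this predicate-grouped form could be produced with plain Pig, analogous to the GROUP BY subject shown below; the alias names are illustrative.

triples = RDFFileLoad('hdfs:///rdf-data.nt');   -- macro defined below
-- group the plain triples by predicate instead of by subject
pred_groups = FOREACH (GROUP triples BY predicate) GENERATE
                group AS predicate,
                triples.(subject, object) AS stmts;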
avoid self joins: convert plain triples into the triple-bag format using GROUP BY,
either explicitly or implicitly (via rewriting rules)

triple_groups = FOREACH (GROUP triples BY subject) GENERATE
                  group AS subject,
                  triples.(predicate, object) AS stmts;

proposed shorthand (language extension):
triple_groups = TUPLIFY triples BY subject;
loading plain N3 files: text loading is natively supported by Pig; a UDF tokenizes the text lines into triples (RDFFileLoad macro)
DEFINE RDFFileLoad(file) RETURNS T {
  lines = LOAD '$file' AS (txt: chararray);
  $T = FOREACH lines GENERATE FLATTEN(pig.RDFize(txt)) AS (subject, predicate, object);
};
triples = RDFFileLoad('hdfs:///rdf-data.nt');

load a TUPLIFIED dataset using BinStorage:
rdf_tuples = LOAD 'rdf-data.dat' USING BinStorage()
    AS (subject: bytearray, stmts: bag{t: (predicate: bytearray, object: bytearray)});
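How such a BinStorage file could be produced in the first place — a minimal sketch combining the pieces above; the output path is an assumption.

-- materialize the tuplified form once so later scripts can load it
-- with BinStorage instead of re-grouping the raw triples
triples = RDFFileLoad('hdfs:///rdf-data.nt');
triple_groups = FOREACH (GROUP triples BY subject) GENERATE
                  group AS subject,
                  triples.(predicate, object) AS stmts;
STORE triple_groups INTO 'rdf-data.dat' USING BinStorage();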
materialize (remote) data to share it across queries; could also be used for frequent intermediate results
raw = LOAD 'http://endpoint.org:8080/sparql'
      USING SPARQLLoader('SELECT * WHERE { ?s ?p ?o }')
      AS (subject, predicate, object);

raw = RDFFileLoad('hdfs:///rdf-data.nt');
run a SPARQL query on the endpoints to filter remote data beforehand; which query to push down depends on the user query
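A minimal sketch of such a pushdown, reusing the SPARQLLoader from above; the endpoint URL, the pushed-down pattern, and the output path are illustrative assumptions.

-- only fetch triples about entities named "Metallica" instead of the whole dataset
metallica = LOAD 'http://endpoint.org:8080/sparql'
            USING SPARQLLoader('SELECT * WHERE { ?s <name> "Metallica" . ?s ?p ?o }')
            AS (subject, predicate, object);
-- materialize the filtered remote data so it can be shared across queries
STORE metallica INTO 'hdfs:///metallica-data.nt';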
result = FILTER triples BY { ?s <geo:lat> ?o1 . ?s <geo:long> ?o2 };
extended FILTER operator: hides the internal details of BGP processing; the implementation depends on the input schema
implemented as a language extension: internal operators stay unchanged; a rewriting step in the Pig parser transforms the extended FILTER into native Pig code, so the Pig compiler can still be used for optimization
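Put together, a complete script using the proposed extensions might look like the following sketch; the file paths are assumptions, and TUPLIFY and the BGP FILTER are the language extensions described above, not standard Pig.

-- load plain triples, convert to triple groups, and match the BGP
triples = RDFFileLoad('hdfs:///eventful.nt');
triple_groups = TUPLIFY triples BY subject;
result = FILTER triple_groups BY { ?s <geo:lat> ?o1 . ?s <geo:long> ?o2 };
STORE result INTO 'hdfs:///concert-locations';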
BGP FILTER rewriting: single triple pattern (non-grouping component)
schema of in: (s, {(p, o)})

constant at the object position:
out = FILTER in BY { ?s ?p 'value' };
  ==>
tmp = FOREACH in {
        t = FILTER stmts BY o == 'value';
        GENERATE *, COUNT(t) AS cnt;
      };                                    -- schema of tmp: (s, {(p, o)}, cnt)
out = FILTER tmp BY cnt > 0;

special case — constant at the subject (the grouping component):
out = FILTER in BY { 'value' ?p ?o };
  ==>
out = FILTER in BY s == 'value';            -- schema of out: (s, {(p, o)})
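Applied to the second experiment query below, { ?s ?p "Metallica" }, the object-constant rule would yield roughly the following native Pig code; this is a sketch with illustrative alias names, assuming literals keep their N-Triples quoting just as IRIs keep their angle brackets above.

triples = LOAD 'rdf-data.dat' USING BinStorage()
          AS (subject: bytearray, stmts: bag{t: (predicate: bytearray, object: bytearray)});
-- keep only triple groups whose bag contains a statement with object "Metallica"
tmp = FOREACH triples {
        t = FILTER stmts BY object == '"Metallica"';
        GENERATE *, COUNT(t) AS cnt;
      };
result = FILTER tmp BY cnt > 0;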
BGP FILTER rewriting: multiple triple patterns joined on the grouping component

result = FILTER triples BY { ?s <geo:lat> ?o1 . ?s <geo:long> ?o2 };
  ==>
tmp = FOREACH triples {
        t1 = FILTER stmts BY predicate == '<geo:lat>';
        t2 = FILTER stmts BY predicate == '<geo:long>';
        GENERATE *, COUNT(t1) AS cnt1, COUNT(t2) AS cnt2;
      };
result = FILTER tmp BY cnt1 > 0 AND cnt2 > 0;

general rule for N triple patterns sharing the grouping component (schema of in: (s, {(p, o)})):
out = FILTER in BY { TP1 . ... TPN . };
  ==>
tmp = FOREACH in {
        t1 = FILTER stmts BY p == 'p1';
        ...
        tN = FILTER stmts BY p == 'pN';
        GENERATE *, COUNT(t1) AS cnt1, ..., COUNT(tN) AS cntN;
      };                                    -- schema of tmp: (s, {(p, o)}, cnt1, ..., cntN)
out = FILTER tmp BY cnt1 > 0 AND ... AND cntN > 0;
Experiments: scripts manually rewritten; dataset: 8 GB, 54 million statements; Hadoop cluster: 8 nodes, Pig 0.12
Query 1:
triples = RDFLoad('hdfs:///eventful.nt');
result = FILTER triples BY { ?s <geo:lat> ?o1 . ?s <geo:long> ?o2 };
Query 2:
triples = RDFLoad('hdfs:///eventful.nt');
result = FILTER triples BY { ?s ?p "Metallica" };
Conclusion:
the native Pig data model is not suitable for RDF data: a combination of self joins and filters is needed
support for BGPs in Pig Latin and joins with remote data
the rewriter produces native Pig code and can use the Pig optimizer
allows easier and faster Linked Data processing in Pig
foundation for a cost-based optimizer and materialized (intermediate) results