Graph Data Processing
M. Tamer Özsu
Outline
◮ Introduction
◮ RDF Graph Querying
◮ General Graph Processing
  ◮ Offline analytics
  ◮ Online querying

Graph Data are Very Common
◮ Internet
◮ Social
Figure: Linking Open Data cloud diagram (as of September 2011), by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
Figure: example RDF graph. Vertices and their literal-valued properties, as far as they can be read from the figure:
◮ mdb:film/2014: movie:initial_release_date “1980-05-23”; rdfs:label “The Shining”
◮ bm:books/0743424425: rev:rating 4.7
◮ bm:offers/0743424425amazonOffer
◮ geo:2635167: gn:name “United Kingdom”; gn:population 62348447
◮ mdb:actor/29704: movie:actor_name “Jack Nicholson”
◮ mdb:film/3418: rdfs:label “The Passenger”
◮ mdb:film/1267: rdfs:label “The Last Tycoon”
◮ mdb:director/8476: movie:director_name “Stanley Kubrick”
◮ mdb:film/2685: rdfs:label “A Clockwork Orange”
◮ mdb:film/424: rdfs:label “Spartacus”
◮ mdb:actor/30013
Edges between entity vertices (endpoints are not recoverable from the extracted text): movie:relatedBook, scam:hasOffer, foaf:based_near, movie:actor (4 edges), movie:director (3 edges).
Figure: the same example as a property graph. Vertices and their attributes:
◮ film 2014: (initial release date, “1980-05-23”), (label, “The Shining”)
◮ books 0743424425: (rating, 4.7)
◮ geo 2635167: (name, “United Kingdom”), (population, 62348447)
◮ actor 29704: (actor name, “Jack Nicholson”)
◮ film 3418: (label, “The Passenger”)
◮ film 1267: (label, “The Last Tycoon”)
◮ director 8476: (director name, “Stanley Kubrick”)
◮ film 2685: (label, “A Clockwork Orange”)
◮ film 424: (label, “Spartacus”)
◮ actor 30013
Edge labels (endpoints are not recoverable from the extracted text): relatedBook, hasOffer, based near, actor (4 edges), director (3 edges).
RDF graph (same example as above)
◮ Workload: SPARQL
◮ Query execution: subgraph matching

Property graph (same example as above)
◮ Workload: Online queries
◮ Query execution: Much
◮ Answering a SPARQL query ≡ subgraph matching
◮ gStore, chameleon-db
Figure: SPARQL query graph, matched against the example RDF graph shown earlier. Triple patterns as read from the figure:
◮ ?m movie:director ?d
◮ ?m rdfs:label ?name
◮ ?m movie:relatedBook ?b
◮ ?d movie:director_name “Stanley Kubrick”
◮ ?b rev:rating ?r
◮ FILTER(?r > 4.0)
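The equivalence between SPARQL answering and subgraph matching can be sketched with a small backtracking matcher. This is only an illustration, not how gStore or chameleon-db evaluate queries. The entity-to-entity edges in `TRIPLES` are assumptions (the extracted figure does not say which vertices each edge connects), and the underscored property names are assumed spellings.

```python
# Minimal sketch: answer a basic graph pattern by backtracking over triples.
# Edges marked "assumed" are illustrative; the figure text does not give
# the endpoints of entity-to-entity edges.
TRIPLES = [
    ("mdb:film/2014", "rdfs:label", "The Shining"),
    ("mdb:film/2014", "movie:director", "mdb:director/8476"),       # assumed
    ("mdb:film/2014", "movie:relatedBook", "bm:books/0743424425"),  # assumed
    ("mdb:director/8476", "movie:director_name", "Stanley Kubrick"),
    ("bm:books/0743424425", "rev:rating", 4.7),
    ("mdb:film/2685", "rdfs:label", "A Clockwork Orange"),
    ("mdb:film/2685", "movie:director", "mdb:director/8476"),       # assumed
]

PATTERNS = [
    ("?m", "rdfs:label", "?name"),
    ("?m", "movie:director", "?d"),
    ("?d", "movie:director_name", "Stanley Kubrick"),
    ("?m", "movie:relatedBook", "?b"),
    ("?b", "rev:rating", "?r"),
]


def is_var(term):
    """Query variables are strings that start with '?'."""
    return isinstance(term, str) and term.startswith("?")


def unify(pattern, triple, binding):
    """Extend binding so that pattern equals triple, or return None."""
    new = dict(binding)
    for p, t in zip(pattern, triple):
        if is_var(p):
            if p in new and new[p] != t:
                return None
            new[p] = t
        elif p != t:
            return None
    return new


def match(patterns, triples, binding=None):
    """Yield every variable binding that embeds all patterns in the data."""
    binding = {} if binding is None else binding
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        extended = unify(first, triple, binding)
        if extended is not None:
            yield from match(rest, triples, extended)


def run_query():
    """Apply FILTER(?r > 4.0) to complete matches and project ?name."""
    return sorted(b["?name"] for b in match(PATTERNS, TRIPLES) if b["?r"] > 4.0)
```

Each binding returned by `match` is one embedding of the query graph in the data graph. The film without a related book is dropped because one query edge has no counterpart.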
◮ Work directly on the RDF graph and the SPARQL query graph
◮ Use a signature-based encoding of each entity and class vertex
◮ Filter-and-evaluate
◮ Use a false positive algorithm to prune nodes and obtain a set
◮ Use an index (VS∗-tree) over the data signature graph (has
Figure: data signature graph G∗ for the example. Each of the 11 vertices carries an 8-bit signature (0010 1000, 0100 0001, 1000 0001, 0000 0100, 0000 1000, 0000 0010, 0000 1001, 0001 0001, 0100 1000, 1001 1000, 0001 0100) and each of the 10 edges carries a 5-bit label signature (00001, 00010, 00100, 10000 three times, 01000 four times).
◮ Find matches of Q∗ over the signature graph G∗
◮ Verify each match in the RDF graph G
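The filter-and-evaluate idea can be sketched for a single query vertex as follows. This is a simplified illustration under stated assumptions, not gStore's actual encoding: the signature width `SIG_BITS`, the CRC-based hash in `bit_for`, and the sample data are all hypothetical. The filter step may return false positives, but never drops a true match; the verify step removes the false positives.

```python
import zlib

SIG_BITS = 16  # hypothetical signature width, kept small for illustration


def bit_for(text):
    """Map a string to one bit position with a deterministic hash."""
    return 1 << (zlib.crc32(text.encode("utf-8")) % SIG_BITS)


def signature(edges):
    """OR together bits for each (edge label, neighbour) pair of a vertex.

    A neighbour of None stands for a query variable: only the label is encoded.
    """
    sig = 0
    for label, neighbour in edges:
        sig |= bit_for("label:" + label)
        if neighbour is not None:
            sig |= bit_for("value:" + str(neighbour))
    return sig


def candidates(query_edges, data):
    """Filter step: keep vertices whose signature covers the query signature."""
    q = signature(query_edges)
    return [v for v, edges in data.items() if (signature(edges) & q) == q]


def verify(query_edges, vertex_edges):
    """Evaluate step: check the real adjacency list of one candidate."""
    for q_label, q_neighbour in query_edges:
        found = any(
            label == q_label and (q_neighbour is None or neighbour == q_neighbour)
            for label, neighbour in vertex_edges
        )
        if not found:
            return False
    return True


def filter_and_evaluate(query_edges, data):
    """Return (candidate list, verified answers)."""
    cands = candidates(query_edges, data)
    return cands, [v for v in cands if verify(query_edges, data[v])]


# Hypothetical sample data taken from the literal properties of the example.
DATA = {
    "mdb:director/8476": [("movie:director_name", "Stanley Kubrick")],
    "mdb:actor/29704": [("movie:actor_name", "Jack Nicholson")],
    "geo:2635167": [("gn:name", "United Kingdom"), ("gn:population", 62348447)],
}
QUERY = [("movie:director_name", "Stanley Kubrick")]
```

Because every bit set by the query is also set by any vertex that truly matches, the bitwise test `(d & q) == q` cannot produce false negatives. Hash collisions are the source of false positives.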
◮ Two step process: (1) find matches of Q∗ over the signature graph G∗; (2) verify each match in the RDF graph G
◮ Alternatives:
  ◮ Sequential scan of G∗
    ◮ Both steps are inefficient
  ◮ Use S-trees
    ◮ Height-balanced tree over signatures
    ◮ Run an inclusion query for each node of Q∗ and get lists of
    ◮ Does not support the second step (expensive)
  ◮ VS-tree (and VS∗-tree)
    ◮ Multi-resolution summary graph based on S-tree
    ◮ Supports both steps efficiently
    ◮ Grouping by vertices
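The S-tree alternative can be sketched as a height-balanced tree in which every internal node stores the bitwise OR of the signatures below it, so an inclusion query can prune whole subtrees. This is a minimal sketch under assumptions: the bottom-up `build` groups entries in input order (a real S-tree clusters similar signatures on insertion), and the vertex names `v1` to `v4` and their pairing with signatures are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    sig: int                                   # OR of all signatures below
    children: List["Node"] = field(default_factory=list)
    vertex: Optional[str] = None               # set only on leaf entries


def or_all(nodes):
    """Bitwise OR of the signatures of a group of nodes."""
    sig = 0
    for n in nodes:
        sig |= n.sig
    return sig


def build(entries, fanout=2):
    """Build a height-balanced tree bottom-up from (vertex, signature) pairs.

    Assumes entries is non-empty.
    """
    level = [Node(sig=s, vertex=v) for v, s in entries]
    while len(level) > 1:
        groups = [level[i:i + fanout] for i in range(0, len(level), fanout)]
        level = [Node(sig=or_all(g), children=g) for g in groups]
    return level[0]


def inclusion_query(node, q):
    """Return vertices whose signature has every bit of q set."""
    if (node.sig & q) != q:
        return []                 # no descendant can contain q: prune subtree
    if node.vertex is not None:
        return [node.vertex]
    found = []
    for child in node.children:
        found.extend(inclusion_query(child, q))
    return found


# Hypothetical entries; the 8-bit values are borrowed from the signature figure.
ENTRIES = [
    ("v1", 0b0010_1000),
    ("v2", 0b0100_0001),
    ("v3", 0b1000_0001),
    ("v4", 0b0000_1001),
]
```

The tree returns one candidate list per query vertex. It keeps no information about edges between candidates, which is why the second (verification/join) step stays expensive with a plain S-tree.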
Figure (built up over several slides): VS-tree over the data signature graph. The leaf entries are vertices 001 to 011; the tree nodes are labelled d(1,1); d(2,1), d(2,2); and d(3,1) to d(3,4), where d(i,j) denotes the j-th node at level i.
◮ Focus here is on the dynamism of the graphs: whether or not they change and how they change.
◮ Focus here is on how algorithms behave as their input changes.
◮ The types of workloads that the approaches are designed to handle.
◮ Graphs do not change, or we are not interested in their changes; only a snapshot is considered.
◮ Graphs change and we are interested in their changes.
◮ Dynamic graphs with high-velocity changes: it is not possible to see the entire graph at once.
◮ Dynamic graphs with unknown changes: requires rediscovery of the graph (e.g., LOD).
◮ Computation accesses a portion of the graph and the results are computed for a subset; e.g., point-to-point shortest path, subgraph matching, reachability, SPARQL.
◮ Computation accesses the entire graph and may require multiple iterations; e.g., PageRank, clustering, graph colouring, all pairs shortest path.
◮ Sees the entire input in advance.
◮ Sees the input piecemeal as it executes.
◮ One-pass online algorithm with limited memory.
◮ Online algorithm with some information about forthcoming input.
◮ Sees the entire input in advance, which may change; answers are computed as changes occur.
◮ Similar to dynamic, but computation happens in batches of changes.
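The difference between seeing the whole input in advance and seeing it piecemeal can be sketched with connected components. The example below is an illustration chosen here, not one taken from the slides: `offline_components` needs the complete edge list before it starts, while `OnlineComponents` updates its answer after every arriving edge using union-find. It handles edge insertions only; deletions would need a different structure.

```python
class OnlineComponents:
    """Maintains the number of connected components as edges arrive."""

    def __init__(self):
        self.parent = {}
        self.count = 0

    def _find(self, v):
        """Return the representative of v, registering v if it is new."""
        if v not in self.parent:
            self.parent[v] = v
            self.count += 1
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]  # path halving
            v = self.parent[v]
        return v

    def add_edge(self, u, v):
        """Process one edge and return the current component count."""
        ru, rv = self._find(u), self._find(v)
        if ru != rv:
            self.parent[ru] = rv
            self.count -= 1
        return self.count


def offline_components(edges):
    """Offline variant: build the whole graph first, then traverse it."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, count = set(), 0
    for start in adj:
        if start in seen:
            continue
        count += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return count


EDGES = [("a", "b"), ("c", "d"), ("b", "c"), ("e", "f")]
```

Feeding `EDGES` one at a time to `OnlineComponents.add_edge` yields an answer after each change, whereas the offline function produces a single answer for the final snapshot.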
Offline analytics
◮ PageRank
◮ Clustering
◮ Strongly connected components
◮ Diameter finding
◮ Graph colouring
◮ All pairs shortest path
◮ Graph pattern mining
◮ Machine learning

Online queries
◮ Reachability
◮ Single source shortest-path
◮ Subgraph matching
◮ SPARQL queries
Page   Iteration 0      Iteration 1       Iteration 2        Rank at iteration 2
P1     r0(P1) = 1/6     r1(P1) = 1/18     r2(P1) = 1/36      5
P2     r0(P2) = 1/6     r1(P2) = 5/36     r2(P2) = 1/18      4
P3     r0(P3) = 1/6     r1(P3) = 1/12     r2(P3) = 1/36      5
P4     r0(P4) = 1/6     r1(P4) = 1/4      r2(P4) = 17/72     1
P5     r0(P5) = 1/6     r1(P5) = 5/36     r2(P5) = 11/72     3
P6     r0(P6) = 1/6     r1(P6) = 1/6      r2(P6) = 14/72     2
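The iterations in the table can be reproduced with the basic update r_{k+1}(Pi) = Σ r_k(Pj) / |Pj| over the pages Pj that link to Pi, where |Pj| is the number of out-links of Pj. The link structure in `LINKS` is an assumption: the graph itself did not survive in the extracted text, and this six-page graph was chosen because it reproduces the tabulated values. No damping factor is used, and the dangling page P2 simply loses its rank mass.

```python
from fractions import Fraction

# Assumed link structure (page -> pages it links to). Not part of the
# extracted slides; chosen because it reproduces the table values.
LINKS = {
    "P1": ["P2", "P3"],
    "P2": [],
    "P3": ["P1", "P2", "P5"],
    "P4": ["P5", "P6"],
    "P5": ["P4", "P6"],
    "P6": ["P4"],
}


def pagerank_iterations(links, iterations):
    """Return the rank vectors r0..rk of the undamped PageRank iteration."""
    n = len(links)
    ranks = {p: Fraction(1, n) for p in links}
    history = [ranks]
    for _ in range(iterations):
        nxt = {p: Fraction(0) for p in links}
        for page, targets in links.items():
            for t in targets:
                # Each page splits its current rank evenly over its out-links.
                nxt[t] += ranks[page] / len(targets)
        ranks = nxt
        history.append(ranks)
    return history
```

Exact fractions are used so that the output can be compared with the table directly. Every iteration touches every vertex and edge, which is what makes PageRank an offline analytics workload.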
◮ MapReduce
  ◮ map and reduce functions
  ◮ Not suitable for iterative processing due to data movement at
  ◮ Need to save in storage system intermediate results of each
◮ Vertex-centric paradigm
  ◮ Specify (a) the computation to be performed at each vertex,
  ◮ Designed specifically for iterative graph processing
  ◮ Synchronous (e.g., Pregel, Giraph)
  ◮ Asynchronous (e.g., GraphLab)
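The synchronous vertex-centric model can be sketched as a loop of supersteps: in each one, every vertex with pending messages runs its computation and sends messages along its out-edges, and nothing is delivered until the superstep ends. The sketch below uses single-source shortest paths as the per-vertex computation. It is a single-process simulation written for illustration, not the Pregel or Giraph API, and the graph `GRAPH` is hypothetical.

```python
import math


def pregel_sssp(edges, source):
    """Synchronous vertex-centric shortest paths.

    edges maps every vertex to a list of (neighbour, weight) pairs.
    Returns (distances, number of supersteps executed).
    """
    dist = {v: math.inf for v in edges}
    inbox = {source: [0]}            # messages to be read in the next superstep
    supersteps = 0
    while inbox:                     # halt when no vertex has pending messages
        outbox = {}
        for v, messages in inbox.items():      # per-vertex computation
            best = min(messages)
            if best < dist[v]:
                dist[v] = best
                for nbr, weight in edges[v]:   # send along out-edges
                    outbox.setdefault(nbr, []).append(best + weight)
        inbox = outbox               # barrier: messages become visible only now
        supersteps += 1
    return dist, supersteps


# Hypothetical weighted graph used only for this example.
GRAPH = {
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 2)],
    "c": [("d", 1)],
    "d": [],
}
```

With the barrier, vertex "c" first adopts the distance 4 sent directly by "a" and corrects it to 3 only one superstep later, once the message routed through "b" has been delivered.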
◮ “Think like a vertex”:
◮ No communication barriers. ✓
◮ Uses the most recent vertex values. ✓

Figure: execution timelines across Machine 1, Machine 2 and Machine 3.
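Assuming the two checked properties describe the asynchronous execution model (the GraphLab-style variant listed earlier), the contrast with supersteps can be sketched with a work queue. A vertex is processed as soon as it is scheduled and always reads the latest values, including ones written moments earlier. This is a single-threaded simulation for illustration only; it does not use the GraphLab API or model its consistency guarantees, and `GRAPH` is hypothetical.

```python
import math
from collections import deque


def async_sssp(edges, source):
    """Asynchronous-style shortest paths driven by a work queue.

    edges maps every vertex to a list of (neighbour, weight) pairs.
    """
    dist = {v: math.inf for v in edges}
    dist[source] = 0
    work = deque([source])
    while work:
        v = work.popleft()
        for nbr, weight in edges[v]:
            # dist[v] is read as it is right now; there is no superstep
            # boundary that delays the visibility of earlier updates.
            if dist[v] + weight < dist[nbr]:
                dist[nbr] = dist[v] + weight
                work.append(nbr)     # reschedule the neighbour
    return dist


# Hypothetical weighted graph used only for this example.
GRAPH = {
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 2)],
    "c": [("d", 1)],
    "d": [],
}
```

Because updates become visible immediately, the improved distance of "c" is already in place when "c" is first processed, so "d" never sees the stale value.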
◮ Lei Zou, Jinghui Mo, Lei Chen, M. Tamer Özsu
◮ Lei Zou, M. Tamer Özsu
◮ G¨
◮ Olaf Hartig and M. Tamer Özsu
◮ More organized slides from my talks are available at http://www.
◮ Arijit Khan and Sameh Elnikety. Systems for big-graphs. Proc. VLDB
◮ Slides available at http://people.inf.ethz.ch/khana/Papers/2014_