Graph Data Processing M. Tamer Ozsu 1 / 75 Outline Introduction - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Graph Data Processing

M. Tamer Özsu

1 / 75

slide-2
SLIDE 2

Outline

Introduction
RDF Graph Querying
General Graph Processing
  Offline analytics
  Online querying

2 / 75

slide-3
SLIDE 3

Graph Data are Very Common

Internet

3 / 75

slide-4
SLIDE 4

Graph Data are Very Common

Social networks

4 / 75

slide-5
SLIDE 5

Graph Data are Very Common

Trade volumes and connections

5 / 75

slide-6
SLIDE 6

Graph Data are Very Common

Biological networks

6 / 75

slide-7
SLIDE 7

Graph Data are Very Common

[Figure: Linking Open Data cloud diagram as of September 2011. Interlinked datasets (for example DBpedia, Freebase, GeoNames, MusicBrainz, DBLP, UniProt, data.gov.uk) are grouped by domain: Media, Geographic, Publications, Government, Cross-domain, Life sciences, User-generated content.]

Linked data

7 / 75 Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

slide-8
SLIDE 8

Graph Types

RDF graph

[Figure: example RDF graph.]

Vertices and their literal-valued properties:
  mdb:film/2014: movie:initial_release_date "1980-05-23"; rdfs:label "The Shining"
  bm:books/0743424425: rev:rating 4.7
  bm:offers/0743424425amazonOffer
  geo:2635167: gn:name "United Kingdom"; gn:population 62348447
  mdb:actor/29704: movie:actor_name "Jack Nicholson"
  mdb:film/3418: rdfs:label "The Passenger"
  mdb:film/1267: rdfs:label "The Last Tycoon"
  mdb:director/8476: movie:director_name "Stanley Kubrick"
  mdb:film/2685: rdfs:label "A Clockwork Orange"
  mdb:film/424: rdfs:label "Spartacus"
  mdb:actor/30013

Edge labels between vertices: movie:relatedBook, scam:hasOffer, foaf:based_near, movie:actor (×4), movie:director (×3).

8 / 75

slide-9
SLIDE 9

Graph Types

Property graph

[Figure: the same data as a property graph.]

Vertices and their (key, value) properties:
  film 2014: (initial_release_date, "1980-05-23"), (label, "The Shining")
  books 0743424425: (rating, 4.7)
  offers 0743424425amazonOffer
  geo 2635167: (name, "United Kingdom"), (population, 62348447)
  actor 29704: (actor_name, "Jack Nicholson")
  film 3418: (label, "The Passenger")
  film 1267: (label, "The Last Tycoon")
  director 8476: (director_name, "Stanley Kubrick")
  film 2685: (label, "A Clockwork Orange")
  film 424: (label, "Spartacus")
  actor 30013

Edge labels: (relatedBook), (hasOffer), (based_near), (actor) ×4, (director) ×3.

9 / 75

slide-10
SLIDE 10

Graph Types

RDF graph

[Figure: the example RDF graph from slide 8.]

◮ Workload: SPARQL queries
◮ Query execution: subgraph matching by homomorphism

Property graph

[Figure: the example property graph from slide 9.]

◮ Workload: online queries and analytic workloads
◮ Query execution: much more varied

10 / 75

slide-11
SLIDE 11

Outline

Introduction
RDF Graph Querying
General Graph Processing
  Offline analytics
  Online querying

11 / 75

slide-12
SLIDE 12

RDF Graph

[Figure: the example RDF graph from slide 8.]

12 / 75

slide-13
SLIDE 13

SPARQL Queries

SELECT ?name
WHERE {
  ?m rdfs:label ?name .
  ?m movie:director ?d .
  ?d movie:director_name "Stanley Kubrick" .
  ?m movie:relatedBook ?b .
  ?b rev:rating ?r .
  FILTER(?r > 4.0)
}

[Figure: the query graph. ?m --rdfs:label--> ?name; ?m --movie:director--> ?d; ?d --movie:director_name--> "Stanley Kubrick"; ?m --movie:relatedBook--> ?b; ?b --rev:rating--> ?r, with FILTER(?r > 4.0).]

13 / 75
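The query can be read as a small pattern graph that has to be mapped into the data graph. The following is a minimal Python sketch of that idea using naive backtracking over triple patterns. The triple list is a hypothetical fragment of the example graph (the edges between vertices are assumed for illustration), and `unify` and `match` are illustrative helpers, not the algorithm of any particular engine.

```python
# Hypothetical fragment of the example RDF graph; the edges between
# vertices are assumptions made for illustration.
TRIPLES = [
    ("mdb:film/2014", "rdfs:label", "The Shining"),
    ("mdb:film/2014", "movie:director", "mdb:director/8476"),
    ("mdb:film/2014", "movie:relatedBook", "bm:books/0743424425"),
    ("mdb:film/2685", "rdfs:label", "A Clockwork Orange"),
    ("mdb:film/2685", "movie:director", "mdb:director/8476"),
    ("mdb:director/8476", "movie:director_name", "Stanley Kubrick"),
    ("bm:books/0743424425", "rev:rating", 4.7),
]

# The five triple patterns of the query; strings starting with "?" are variables.
PATTERNS = [
    ("?m", "rdfs:label", "?name"),
    ("?m", "movie:director", "?d"),
    ("?d", "movie:director_name", "Stanley Kubrick"),
    ("?m", "movie:relatedBook", "?b"),
    ("?b", "rev:rating", "?r"),
]


def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def unify(pattern, triple, binding):
    """Extend binding so that pattern maps onto triple, or return None."""
    new = dict(binding)
    for p, t in zip(pattern, triple):
        if is_var(p):
            if new.setdefault(p, t) != t:
                return None  # variable already bound to a different term
        elif p != t:
            return None      # constant does not match
    return new


def match(patterns, triples, binding=None):
    """Yield every variable binding that maps all patterns into the data."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    for triple in triples:
        extended = unify(patterns[0], triple, binding)
        if extended is not None:
            yield from match(patterns[1:], triples, extended)


# Apply the FILTER after matching the basic graph pattern.
names = [b["?name"] for b in match(PATTERNS, TRIPLES) if b["?r"] > 4.0]
print(names)
```

Two different variables are allowed to bind to the same vertex here, which is what makes the mapping a homomorphism rather than an isomorphism.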

slide-14
SLIDE 14

Graph-based Approach

◮ Answering a SPARQL query ≡ subgraph matching
◮ gStore, chameleon-db

[Figure: the query graph from slide 13 matched against the example RDF graph by subgraph matching.]

14 / 75

slide-15
SLIDE 15

gStore

General Approach:

◮ Work directly on the RDF graph and the SPARQL query graph
◮ Use a signature-based encoding of each entity and class vertex to speed up matching
◮ Filter-and-evaluate
  ◮ Use a pruning algorithm that may return false positives (but misses no true match) to obtain a set of candidates; then do a more detailed evaluation on those candidates
◮ Use an index (VS∗-tree) over the data signature graph (has light maintenance load) for efficient pruning

15 / 75

slide-16
SLIDE 16
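1. Encode Q and G to Get Signature Graphs

[Figure: query signature graph Q∗ and data signature graph G∗; every vertex and edge label is replaced by a bit-string signature.]

Query signature graph Q∗
  Vertex signatures: 0100 0000, 1000 0000, 0000 0100
  Edge signatures: 00010, 10000

Data signature graph G∗
  Vertex signatures: 0010 1000, 0100 0001, 1000 0001, 0000 0100, 0000 1000, 0000 0010, 0000 1001, 0001 0001, 0100 1000, 1001 1000, 0001 0100
  Edge signatures: 00001, 00010, 10000 (×3), 00100, 01000 (×4)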

16 / 75

slide-17
SLIDE 17
2. Filter-and-Evaluate

[Figure: query signature graph Q∗ and data signature graph G∗, as on slide 16.]

◮ Find matches of Q∗ over the data signature graph G∗
◮ Verify each match in the RDF graph G

17 / 75

slide-22
SLIDE 22

How to Generate Candidate List

◮ Two step process:
  1. For each node of Q∗, get the list of nodes in G∗ that include that node.
  2. Do a multi-way join to get the candidate list.
◮ Alternatives:
  ◮ Sequential scan of G∗
    ◮ Both steps are inefficient
  ◮ Use S-trees
    ◮ Height-balanced tree over signatures
    ◮ Run an inclusion query for each node of Q∗ and get lists of nodes in G∗ that include that node.
      • Given a query signature q and a set of data signatures S, find all data signatures s_i ∈ S where q & s_i = q
    ◮ Does not support the second step, so the join remains expensive
  ◮ VS-tree (and VS∗-tree)
    ◮ Multi-resolution summary graph based on S-tree
    ◮ Supports both steps efficiently
    ◮ Grouping by vertices

22 / 75
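The two-step process with the sequential-scan alternative can be sketched in Python as follows. The vertex signatures are the ones shown in the signature-graph and S-tree figures; the edge sets `DATA_EDGES` and `QUERY_EDGES` are hypothetical, since the figure's edges are not reproduced in the text, and edge signatures are ignored to keep the sketch short.

```python
from itertools import product

# Data vertex signatures of G*, keyed by vertex id (from the slides' figures).
DATA_SIG = {
    "001": 0b00101000, "002": 0b01000001, "003": 0b10000001,
    "004": 0b00000100, "005": 0b00001000, "006": 0b00000010,
    "007": 0b00001001, "008": 0b10011000, "009": 0b00010100,
    "010": 0b00010001, "011": 0b01001000,
}
# Hypothetical directed edges of G* (assumed for illustration only).
DATA_EDGES = {("002", "003"), ("002", "004"), ("011", "008")}

# Query signature graph Q*: vertex signatures from the slides, edges assumed.
QUERY_SIG = {"q1": 0b01000000, "q2": 0b10000000, "q3": 0b00000100}
QUERY_EDGES = [("q1", "q2"), ("q1", "q3")]


def includes(data_sig, query_sig):
    """Inclusion test from the slide: q & s == q."""
    return query_sig & data_sig == query_sig


def candidate_lists(query_sig, data_sig):
    """Step 1: sequential scan of G* for every query vertex."""
    return {
        q: sorted(v for v, s in data_sig.items() if includes(s, sig))
        for q, sig in query_sig.items()
    }


def join(cands, query_edges, data_edges):
    """Step 2: multi-way join; keep assignments whose edges exist in G*."""
    qs = list(cands)
    results = []
    for combo in product(*(cands[q] for q in qs)):
        assign = dict(zip(qs, combo))
        if all((assign[a], assign[b]) in data_edges for a, b in query_edges):
            results.append(assign)
    return results


cands = candidate_lists(QUERY_SIG, DATA_SIG)
matches = join(cands, QUERY_EDGES, DATA_EDGES)
print(cands)
print(matches)
```

The join enumerates the full cross product of the candidate lists, which is why the slide calls both steps inefficient under a sequential scan.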

slide-29
SLIDE 29

S-tree Solution

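[Figure: S-tree over the data signatures; each internal node's signature is the bitwise OR of its children's signatures.]

Level G1 (root): d1_1 = 1111 1111
Level G2: d2_1 = 0110 1111, d2_2 = 1101 1101
Level G3: d3_1 = 0000 1110, d3_2 = 0110 1001, d3_3 = 1100 1001, d3_4 = 1001 1101
Leaves (data vertices):
  under d3_1: 005 = 0000 1000, 004 = 0000 0100, 006 = 0000 0010
  under d3_2: 001 = 0010 1000, 002 = 0100 0001
  under d3_3: 003 = 1000 0001, 007 = 0000 1001, 011 = 0100 1000
  under d3_4: 008 = 1001 1000, 009 = 0001 0100, 010 = 0001 0001

Query signature graph Q∗: vertices 1000 0000, 0100 0000, 0000 0100; edges 00010, 10000

Candidate lists returned by the inclusion queries:
  0100 0000: {002, 011}
  1000 0000: {003, 008}
  0000 0100: {004, 009}

The candidate lists are then joined (⋈ ⋈). Possibly large join space!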

29 / 75
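The S-tree lookup can be sketched in a few lines of Python. The tree below is built from the signatures in the figure; the `Node` class, `leaf` helper, and `inclusion_query` function are illustrative names, and the sketch ignores balancing and insertion, which a real S-tree implementation would need.

```python
class Node:
    """S-tree node: a leaf holds a data vertex id, an internal node holds children."""

    def __init__(self, children=None, vertex=None, sig=0):
        self.children = children or []
        self.vertex = vertex
        self.sig = sig
        # An internal node's signature is the OR of its children's signatures.
        for child in self.children:
            self.sig |= child.sig


def leaf(vertex, bits):
    return Node(vertex=vertex, sig=int(bits.replace(" ", ""), 2))


d3 = [
    Node([leaf("005", "0000 1000"), leaf("004", "0000 0100"),
          leaf("006", "0000 0010")]),
    Node([leaf("001", "0010 1000"), leaf("002", "0100 0001")]),
    Node([leaf("003", "1000 0001"), leaf("007", "0000 1001"),
          leaf("011", "0100 1000")]),
    Node([leaf("008", "1001 1000"), leaf("009", "0001 0100"),
          leaf("010", "0001 0001")]),
]
root = Node([Node(d3[:2]), Node(d3[2:])])


def inclusion_query(node, q):
    """Return ids of leaves whose signature s satisfies q & s == q."""
    if q & node.sig != q:
        return []                 # prune: no descendant can include q
    if not node.children:
        return [node.vertex]
    hits = []
    for child in node.children:
        hits.extend(inclusion_query(child, q))
    return hits


for bits in ("0100 0000", "1000 0000", "0000 0100"):
    q = int(bits.replace(" ", ""), 2)
    print(bits, inclusion_query(root, q))
```

Pruning is safe because a bit missing from an internal node's OR-ed signature is missing from every leaf below it. The tree only produces the per-vertex candidate lists; the join across lists still has to be done separately.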

slide-30
SLIDE 30

Outline

Introduction
RDF Graph Querying
General Graph Processing
  Offline analytics
  Online querying

30 / 75

slide-34
SLIDE 34

Classification

◮ Graph Dynamism: Static Graphs, Dynamic Graphs, Streaming Graphs, Evolving Graphs
  The focus here is on the dynamism of the graphs: whether or not they change, and how they change.
◮ Algorithm Types: Offline, Online, Streaming, Incremental, Dynamic, Batch Dynamic
  The focus here is on how algorithms behave as their input changes.
◮ Workload Types: Online Queries, Analytics Workloads
  The types of workloads that the approaches are designed to handle.

34 / 75

slide-39
SLIDE 39

Classification

Graph Dynamism
◮ Static Graphs: Graphs do not change, or we are not interested in their changes; only a snapshot is considered.
◮ Dynamic Graphs: Graphs change and we are interested in their changes.
◮ Streaming Graphs: Dynamic graphs with high-velocity changes; it is not possible to see the entire graph at once.
◮ Evolving Graphs: Dynamic graphs with unknown changes; requires rediscovery of the graph (e.g., LOD).

Algorithm Types: Offline, Online, Streaming, Incremental, Dynamic, Batch Dynamic
Workload Types: Online Queries, Analytics Workloads

39 / 75

slide-42
SLIDE 42

Classification

Graph Dynamism: Static Graphs, Dynamic Graphs, Streaming Graphs, Evolving Graphs
Algorithm Types: Offline, Online, Streaming, Incremental, Dynamic, Batch Dynamic

Workload Types
◮ Online Queries: Computation accesses a portion of the graph and the results are computed for a subset of vertices; e.g., point-to-point shortest path, subgraph matching, reachability, SPARQL.
◮ Analytics Workloads: Computation accesses the entire graph and may require multiple iterations; e.g., PageRank, clustering, graph colouring, all pairs shortest path.

42 / 75

slide-49
SLIDE 49

Classification

Graph Dynamism: Static Graphs, Dynamic Graphs, Streaming Graphs, Evolving Graphs

Algorithm Types
◮ Offline: Sees the entire input in advance.
◮ Online: Sees the input piecemeal as it executes.
◮ Streaming: One-pass online algorithm with limited memory.
◮ Incremental: Online algorithm with some information about forthcoming input.
◮ Dynamic: Sees the entire input in advance, which may change; answers are computed as changes occur.
◮ Batch Dynamic: Similar to dynamic, but computation happens in batches of changes.

Workload Types: Online Queries, Analytics Workloads

49 / 75

slide-50
SLIDE 50

Example Design Points – Not all alternatives make sense

Graph Dynamism: Static Graphs, Dynamic Graphs, Streaming Graphs, Evolving Graphs
Algorithm Types: Offline, Online, Streaming, Incremental, Dynamic, Batch Dynamic
Workload Types: Online Queries, Analytics Workloads

Dynamic (or batch-dynamic) algorithms do not make sense for static graphs.

50 / 75

slide-51
SLIDE 51

Graph Workloads

Offline graph analytics

◮ PageRank
◮ Clustering
◮ Strongly connected components
◮ Diameter finding
◮ Graph colouring
◮ All pairs shortest path
◮ Graph pattern mining
◮ Machine learning algorithms (belief propagation, Gaussian non-negative matrix factorization)

Online graph querying

◮ Reachability
◮ Single source shortest-path
◮ Subgraph matching
◮ SPARQL queries

51 / 75

slide-52
SLIDE 52

PageRank Computation

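A web page is important if it is pointed to by other important pages.

[Figure: example web graph with pages P1 to P6.]

$$r(P_i) = \sum_{P_j \in B_{P_i}} \frac{r(P_j)}{|F_{P_j}|}$$

For example:

$$r(P_2) = \frac{r(P_1)}{2} + \frac{r(P_3)}{3}$$

Iterative form:

$$r_{k+1}(P_i) = \sum_{P_j \in B_{P_i}} \frac{r_k(P_j)}{|F_{P_j}|}$$

$B_{P_i}$: in-neighbours of $P_i$
$F_{P_i}$: out-neighbours of $P_i$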

52 / 75

slide-53
SLIDE 53

PageRank Computation

A web page is important if it is pointed to by other important pages.

[Figure: example web graph with pages P1 to P6.]

$$r_{k+1}(P_i) = \sum_{P_j \in B_{P_i}} \frac{r_k(P_j)}{|F_{P_j}|}$$

Page   Iteration 0 (r_0)   Iteration 1 (r_1)   Iteration 2 (r_2)   Rank at Iter. 2
P1     1/6                 1/18                1/36                5
P2     1/6                 5/36                1/18                4
P3     1/6                 1/12                1/36                5
P4     1/6                 1/4                 17/72               1
P5     1/6                 5/36                11/72               3
P6     1/6                 1/6                 14/72               2

Iterative processing.

53 / 75
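The iteration can be reproduced with exact fractions. In the sketch below the out-link lists in `OUT_LINKS` are an assumption: the figure is not reproduced in the text, so the edges were chosen to be consistent with the example $r(P_2) = r(P_1)/2 + r(P_3)/3$ and with the iteration table. The update is the plain formula from the slide, with no damping factor and no special handling of pages without out-links.

```python
from fractions import Fraction

# Assumed out-links for the six-page example (chosen to agree with the
# iteration table; the original figure is not available in the text).
OUT_LINKS = {
    "P1": ["P2", "P3"],
    "P2": [],
    "P3": ["P1", "P2", "P5"],
    "P4": ["P5", "P6"],
    "P5": ["P4", "P6"],
    "P6": ["P4"],
}


def pagerank_step(ranks, out_links):
    """One iteration: each page splits its rank evenly over its out-links."""
    nxt = {page: Fraction(0) for page in out_links}
    for page, targets in out_links.items():
        for target in targets:
            nxt[target] += ranks[page] / len(targets)
    return nxt


r0 = {page: Fraction(1, len(OUT_LINKS)) for page in OUT_LINKS}
r1 = pagerank_step(r0, OUT_LINKS)
r2 = pagerank_step(r1, OUT_LINKS)
for page in OUT_LINKS:
    print(page, r0[page], r1[page], r2[page])
```

`Fraction` prints values in lowest terms, so 14/72 appears as 7/36. Under these assumed links P2 has no out-links, so the total rank shrinks from one iteration to the next, which matches the column sums in the table.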

slide-55
SLIDE 55

Some Alternative Computational Models for Offline Analytics

◮ MapReduce
  ◮ map and reduce functions
  ◮ Not suitable for iterative processing due to data movement at each stage
  ◮ Need to save the intermediate results of each iteration in the storage system
◮ Vertex-centric paradigm
  ◮ Specify (a) the computation to be performed at each vertex, and (b) its communication with neighbour vertices
  ◮ Designed specifically for iterative graph processing
  ◮ Synchronous (e.g., Pregel, Giraph)
  ◮ Asynchronous (e.g., GraphLab)

55 / 75

slide-61
SLIDE 61

Pregel-like Graph Processing Systems

Pregel-like systems are BSP, vertex-centric programs.

[Figure: Supersteps 1, 2, and 3 each run on Machines 1, 2, and 3; consecutive supersteps are separated by a communication barrier.]

61 / 75

slide-62
SLIDE 62

Pregel-like Graph Processing Systems

Pregel-like systems are BSP, vertex-centric programs.

◮ “Think like a vertex”:

?

62 / 75
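A single-process Python sketch can make the superstep and barrier structure concrete. It simulates BSP execution of a vertex program that propagates the maximum value through the graph. The names `run_bsp` and `propagate_max` and the small path graph are hypothetical; real Pregel-like systems distribute vertices over machines and expose a different API.

```python
def run_bsp(graph, values, compute, max_supersteps=50):
    """Simulate BSP supersteps: compute on active vertices, then a barrier."""
    inbox = {v: [] for v in graph}
    active = set(graph)               # every vertex is active in superstep 0
    superstep = 0
    while active and superstep < max_supersteps:
        outbox = {v: [] for v in graph}
        for v in sorted(active):
            values[v], out_msgs = compute(
                superstep, v, values[v], inbox[v], graph[v])
            for target, msg in out_msgs:
                outbox[target].append(msg)
        # Barrier: messages sent in this superstep are delivered in the next.
        inbox = outbox
        # A halted vertex is reactivated only when it receives a message.
        active = {v for v in graph if inbox[v]}
        superstep += 1
    return values, superstep


def propagate_max(superstep, vertex, value, messages, neighbours):
    """Vertex program: adopt the largest value seen and forward it on change."""
    new_value = max([value, *messages])
    if superstep == 0 or new_value > value:
        return new_value, [(n, new_value) for n in neighbours]
    return new_value, []              # nothing changed: vote to halt


# Hypothetical undirected path a - b - c - d.
GRAPH = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
final, supersteps = run_bsp(GRAPH, {"a": 3, "b": 6, "c": 2, "d": 1}, propagate_max)
print(final, supersteps)
```

The vertex program only sees its own value, its incoming messages, and its neighbour list, which is the sense in which the programmer "thinks like a vertex".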

slide-63
SLIDE 63

GraphLab (Asynchronous)

GraphLab features asynchronous execution:

◮ No communication barriers. ✓
◮ Uses the most recent vertex values. ✓

[Figure: Machines 1, 2, and 3 executing without barriers between them.]

63 / 75
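The effect of using the most recent vertex values can be illustrated with a sequential stand-in. The sketch below runs minimum-label propagation on a hypothetical five-vertex path, once with synchronous rounds and once with in-place updates. It does not model GraphLab's distributed execution or locking; `sync_rounds` and `async_sweeps` are illustrative names.

```python
def sync_rounds(graph, labels):
    """Synchronous: every vertex reads the values of the previous round."""
    labels = dict(labels)
    rounds = 0
    while True:
        new = {v: min([labels[v], *(labels[n] for n in graph[v])])
               for v in graph}
        if new == labels:
            return labels, rounds
        labels, rounds = new, rounds + 1


def async_sweeps(graph, labels):
    """Asynchronous stand-in: updates are applied in place, so vertices later
    in a sweep already see the newest values of earlier ones."""
    labels = dict(labels)
    sweeps = 0
    while True:
        changed = False
        for v in graph:
            best = min([labels[v], *(labels[n] for n in graph[v])])
            if best != labels[v]:
                labels[v], changed = best, True
        if not changed:
            return labels, sweeps
        sweeps += 1


# Hypothetical path 0 - 1 - 2 - 3 - 4; each vertex starts with its own id.
N = 5
PATH = {i: [j for j in (i - 1, i + 1) if 0 <= j < N] for i in range(N)}
START = {i: i for i in range(N)}

sync_labels, n_rounds = sync_rounds(PATH, START)
async_labels, n_sweeps = async_sweeps(PATH, START)
print(n_rounds, n_sweeps)
```

On this input the in-place version needs fewer passes because the smallest label travels along the whole path within one sweep. The gain depends on the update order, so this is an illustration rather than a general guarantee.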

slide-64
SLIDE 64

GraphLab (Asynchronous)

Implemented via distributed locking: v0 v1 v2 v3 v4

64 / 75

slide-70
SLIDE 70

Reachability Queries

[Figure: the example property graph from slide 9.]

Can you reach film 1267 from film 2014?

70 / 75

slide-71
SLIDE 71

Reachability Queries

[Figure: the example property graph from slide 9.]

Is there a book whose rating is > 4.0 associated with a film that was directed by Stanley Kubrick?

71 / 75

slide-72
SLIDE 72

Reachability Queries

Think of Facebook graph and finding friends of friends.

72 / 75
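A reachability query of this kind is commonly answered with a breadth-first search. The Python sketch below uses a hypothetical friendship graph; `within_hops` and `reachable` are illustrative names, and systems that serve such queries at scale typically rely on indexes rather than a fresh traversal per query.

```python
from collections import deque


def within_hops(graph, source, max_hops=float("inf")):
    """Breadth-first search returning {vertex: hop distance} up to max_hops."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == max_hops:
            continue                  # do not expand beyond the hop limit
        for v in graph.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def reachable(graph, source, target):
    """True if some path leads from source to target."""
    return target in within_hops(graph, source)


# Hypothetical friendship graph (adjacency lists, both directions listed).
FRIENDS = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann"],
    "dan": ["bob", "eve"],
    "eve": ["dan"],
    "zed": [],
}

friends_of_friends = {
    v for v, d in within_hops(FRIENDS, "ann", max_hops=2).items() if d == 2
}
print(friends_of_friends)
print(reachable(FRIENDS, "ann", "eve"), reachable(FRIENDS, "ann", "zed"))
```

Bounding the number of hops keeps the traversal local, which is why friends-of-friends touches only a portion of the graph.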

slide-73
SLIDE 73

Subgraph Matching

[Figure: the query graph from slide 13 matched against the example RDF graph by subgraph matching.]

73 / 75

slide-74
SLIDE 74

For more information I

Graph-based SPARQL processing

◮ Lei Zou, Jinghui Mo, Lei Chen, M. Tamer Özsu, and Dongyan Zhao. gStore: answering SPARQL queries via subgraph matching. Proc. VLDB Endowment, 4(8):482–493, 2011

◮ Lei Zou, M. Tamer Özsu, Lei Chen, Xuchuan Shen, Ruizhe Huang, and Dongyan Zhao. gStore: A graph-based SPARQL query engine. VLDB J., 23(4):565–590, 2014

◮ Güneş Aluç, M. Tamer Özsu, Khuzaima Daudjee, and Olaf Hartig. chameleon-db: a workload-aware robust RDF data management system. Technical Report CS-2013-10, University of Waterloo, 2013. Available at https://cs.uwaterloo.ca/sites/ca.computer-science/files/uploads/files/CS-2013-10.pdf

74 / 75

slide-75
SLIDE 75

For more information II

More on RDF graph management

◮ Olaf Hartig and M. Tamer Özsu. Linked data query processing. In Proc. 30th Int. Conf. on Data Engineering, pages 1286–1289, 2014. Tutorial description

◮ More organized slides from my talks are available at http://www.slideshare.net/MTamerOzsu/web-data-management-with-rdf

General graph processing

◮ Arijit Khan and Sameh Elnikety. Systems for big-graphs. Proc. VLDB Endowment, 7(13):1709–1710, 2014

◮ Slides available at http://people.inf.ethz.ch/khana/Papers/2014_VLDB_GraphSystemsTutorial.pptx

75 / 75