[PPT] - GraphGen : Adaptive Graph Processing using Relational Databases PowerPoint Presentation

SLIDE 1

GraphGen: Adaptive Graph Processing using Relational Databases

Department of Computer Science University of Maryland

SLIDE 2

Graph Analytics / Querying

Graph datasets can provide value in many domains

Social Networks Email Networks Protein Interaction Networks Stock Trading Networks

Many different types of ways to manage graph data

Graph Databases (neo4j, orientDB, RDF stores)
Distributed Batch Analytics systems (Giraph, GraphX, GraphLab)
In-Memory systems (Ligra, Green-Marl, X-Stream)
Many research prototypes / custom indexes.

SLIDE 3

RDBMS-based Graph Systems vs GraphGen

DECLARATIVE

SLIDE 4

Example: TPC-H

rder_key

customer_key

Orders

1

c1

2

c2

3

c3

rder_key

part_key

LineItem

1

p1

1

p2

2

p1

2

p3

3

p1

3

p2

3

p2 c_key name

Customer

c_key p_key c1 p1 c1 p2 c3 p2 c4 p1 c6 p1

Orders LineItem

On order_key

Which customer bought which product?

On p_key

Which customers bought the same item?

c1 c4 cust1 cust2 c1 c6 c1 c3

c1

c4 c6

c4 c3 c6

c_1 John c_2 Jane

SLIDE 5

Example: TPC-H

rder_key

customer_key

Orders

1

c1

2

c2

3

c3

rder_ke

y part_ke y

LineItem

1

p1

1

p2

2

p1

2

p3

3

p1

3

p2

3

p2 c_ke y name

Customer

c_key p_key c1 p1 c1 p2 c3 p2 c4 p1 c6 p1

Orders LineItem

On order_key On p_key

Which customers bought the same item?

c1 c4 cust1 cust2 c1 c6 c1 c3

c1

c4 c6

c4 c3 c6

c_1 John c_2 Jane

Many other graphs of potential interest:

Suppliers that sell a common item
Employees working under the same manager
Parts that were ordered together
Bipartite graph between Part and Supplier
...

Which customer bought which product?

SLIDE 6

GraphGen

Backend Relational DBMS

Java Program

Graph Definition Queries

DSL Parser + Optimizer

In-Memory Engine SQL Queries

Graph Analysis Queries

Results Direct Graph Access

Vertex- Centric Directly over Graph

GraphGen

SLIDE 7

Definition of a GraphView over the database

○ User specifies how to construct the Nodes and Edges

GraphGenDL - Definition Language

CREATE GRAPHVIEW CoAuthors AS Nodes(ID, name) :- Author(ID, name). Edges(ID1, ID2, wt=$COUNT(pub)) :- AuthorPub(ID1, pub), AuthorPub(ID2, pub).

Definition of a collection of graphs (Multi-Graph View) over the database

○ Can enable many optimizations

Edge Property: number of publications

CREATE GRAPHVIEW AuthorEgoNetworks(X) WHERE Author(X) AS Nodes(X, name) :- Author(X, name). Nodes(ID, name) :- AuthorPub(X,pub), AuthorPub(ID,pub), Author(ID, name). Edges(ID1, ID2) :- AuthorPub(ID1, pub), AuthorPub(ID2, pub).

Extract all

ego-graphs

SLIDE 8

Specifying Graph Queries over GraphViews
Support for subgraph pattern matching languages like

SPARQL, Cypher, PGQL etc.

Datalog is a natural fit for expressing recursive

computation over the Edges VIEW

USING GRAPHVIEW CoAuthors Triangle(X, Y, Z) :- Nodes(X, _, “ML” ),Nodes(Y, _, “DB”), Nodes(Z, _, “AL” ),Edges(X, Y),Edges(Y, Z),Edges(X, Z).

GraphGenQL - Query Language

Find triangles of authors whose areas follow: “ML” -> “DB” -> “AL”

SLIDE 9

GraphGen

Backend Relational DBMS

Java Program

Graph Definition Queries

DSL Parser + Optimizer

In-Memory Engine SQL Queries

Graph Analysis Queries

Results Direct Graph Access

Vertex- Centric Directly over Graph

GraphGen

SLIDE 10

GraphGen

Backend Relational DBMS

Java Program

Graph Definition Queries

DSL Parser + Optimizer

In-Memory Engine SQL Queries

Graph Analysis Queries

Results Direct Graph Access

Vertex- Centric Directly over Graph

GraphGen

Goal: We want to adapt the execution

based on the query/analysis.

What are some of the challenges here??

SLIDE 11

1. Where to execute Queries/ Tasks
Depends on workload, rate of updates, rate of queries…

In-memory execution In-database execution

Dataset In-memory ETL

MySQL PosgreSQL

Small 0.001 s 2.05 s

0.8 s 0.1 s

Large 0.015 s 17.52 s

4.26 0.704 s

Triangle Pattern Matching

Dataset DBS1 DBS2 Small 0.899 s 0.22 Large 4.25 s NA

Key Challenge: Develop accurate cost models, tools,
techniques. Decide what to compute where
Other issues: Large-output joins [SIGMOD ‘17], and selectivity

estimation errors associated with them.

SLIDE 12

2. Query Rewriting
Assume the execution is to be pushed to the database
Many different ways to construct equivalent SQL queries
Auto-generated SQL can be verbose → Challenging to optimize

With Nodes as (...) With Edges as (...) (SQL for answering query) Create View Edges as (...) Create View Nodes as (...) (SQL for answering query)

DISTINCT DISTINCT

1) With vs VIEW 2) Duplicate Elimination (DISTINCT)

The costly duplicate removal might even be unnecessary if

the query / analysis doesn’t care about them!

SLIDE 13

2. Query Rewriting
Assume the execution is to be pushed to the database
Many different ways to construct equivalent SQL queries
Auto-generated SQL can be verbose → Challenging to optimize

With Nodes as (...) With Edges as (...) (SQL for answering query) Create View Edges as (...) Create View Nodes as (...) (SQL for answering query)

DISTINCT DISTINCT

1) With vs VIEW 2) Duplicate Elimination (DISTINCT)

The costly duplicate removal might even be unnecessary if

the query / analysis doesn’t care about them!

Time for query to finish in seconds

SLIDE 14

3. Optimizing Multi-Graph Views

Snapshots

CREATE GRAPHVIEW CoAuthorsSnapshot(X) WHERE X IN RANGE (1950 , 2017 , 1) Nodes(ID,name) :- Author(ID,name). Edges(ID1,ID2) :- AuthorPub(ID1, pub), AuthorPub(ID2, pub), Publication(pub, _, Y), Y <= X.

Key Challenge: Develop a systematic approach to optimizing the extraction of and execution against such multi-graph views.

E.g. Ego-Graph Analysis

Naive: Generate a separate SQL query for each distinct graph.
Result-Tagging: We can extract all graphs with a single query!

Please see full paper

Ego Graph Analysis, Graph snapshot analysis
Ability to refer to each graph independently → significant

savings

Opportunity: Overlap computation and storage over

collections of graphs

SLIDE 15

Result-Tagging

aid1 a1 a1 a6 a1 aid2 a2 a5 a6 a7 a7 a8 a5 a3 a3 a4 a2 a3 tag a1 a1 a6 a1 a7 a5 a3 a2

Tagged Edges Table

e1.aid2 = e2.aid1

aid1 a1 a1 a6 a1 aid2 a2 a5 a6 a7 a7 a8 a5 a3 a3 a4 a2 a3 tags[] [a1] [a1] [a1,a6] [a1] [a6,a7] [a5,a1] [a2,a3,a5] [a1,a2]

Tag Aggregation

Find the edges 1-hop away for the source (tag) & Union the result with the initial Tagged Edges table aid1 a2 a5 a6 aid2 a3 a3 a7 a7 a8 a3 a4 a3 a4 tag a1 a1 a1 a6 a5 a2 a1 a1 a6 a1 a2 a5 a6 a7 a7 a8 a5 a3 a3 a4 a2 a3 a1 a1 a6 a1 a7 a5 a3 a2 Tags show which ego-graphs involve the edge

SLIDE 16

Take Aways

Need for a unified framework for extraction and analysis of

graphs stored implicitly in a structured data store.

We propose declarative a Datalog-based DSL for specifying:

○ GraphViews over relational schemas ○ Declarative Graph queries

Expose a series of APIs for defining complex graph analytics over

GraphViews There is a variety of challenges & opportunities here in terms of:

Deciding where to execute graph queries
Handling large-output joins and inaccuracies of query optimizers
Rewriting SQL queries pushed to the database
Optimizing across collections of graphs (Multi-Graph Views)

GraphGen : Adaptive Graph Processing using Relational Databases - - PowerPoint PPT Presentation

GraphGen: Adaptive Graph Processing using Relational Databases

Graph Analytics / Querying

RDBMS-based Graph Systems vs GraphGen

Example: TPC-H

Example: TPC-H

GraphGen

GraphGenDL - Definition Language

GraphGenQL - Query Language

GraphGen

GraphGen

Result-Tagging

Take Aways

Thank you! Questions?