GraphGen : Adaptive Graph Processing using Relational Databases - - PowerPoint PPT Presentation

graphgen adaptive graph processing using relational
SMART_READER_LITE
LIVE PREVIEW

GraphGen : Adaptive Graph Processing using Relational Databases - - PowerPoint PPT Presentation

GraphGen : Adaptive Graph Processing using Relational Databases Department of Computer Science University of Maryland Graph Analytics / Querying Graph datasets can provide value in many domains Protein Interaction Email Networks Social


slide-1
SLIDE 1

GraphGen: Adaptive Graph Processing using Relational Databases

Department of Computer Science University of Maryland

slide-2
SLIDE 2

Graph Analytics / Querying

Graph datasets can provide value in many domains

Social Networks Email Networks Protein Interaction Networks Stock Trading Networks

Many different types of ways to manage graph data

  • Graph Databases (neo4j, orientDB, RDF stores)
  • Distributed Batch Analytics systems (Giraph, GraphX, GraphLab)
  • In-Memory systems (Ligra, Green-Marl, X-Stream)
  • Many research prototypes / custom indexes.
slide-3
SLIDE 3

RDBMS-based Graph Systems vs GraphGen

DECLARATIVE

slide-4
SLIDE 4

Example: TPC-H

  • rder_key

customer_key

Orders

  • 1

c1

  • 2

c2

  • 3

c3

  • rder_key

part_key

LineItem

  • 1

p1

  • 1

p2

  • 2

p1

  • 2

p3

  • 3

p1

  • 3

p2

  • 3

p2 c_key name

Customer

c_key p_key c1 p1 c1 p2 c3 p2 c4 p1 c6 p1

Orders LineItem

On order_key

Which customer bought which product?

On p_key

Which customers bought the same item?

c1 c4 cust1 cust2 c1 c6 c1 c3

c1

c4 c6

c4 c3 c6

c_1 John c_2 Jane

slide-5
SLIDE 5

Example: TPC-H

  • rder_key

customer_key

Orders

  • 1

c1

  • 2

c2

  • 3

c3

  • rder_ke

y part_ke y

LineItem

  • 1

p1

  • 1

p2

  • 2

p1

  • 2

p3

  • 3

p1

  • 3

p2

  • 3

p2 c_ke y name

Customer

c_key p_key c1 p1 c1 p2 c3 p2 c4 p1 c6 p1

Orders LineItem

On order_key On p_key

Which customers bought the same item?

c1 c4 cust1 cust2 c1 c6 c1 c3

c1

c4 c6

c4 c3 c6

c_1 John c_2 Jane

Many other graphs of potential interest:

  • Suppliers that sell a common item
  • Employees working under the same manager
  • Parts that were ordered together
  • Bipartite graph between Part and Supplier
  • ...

Which customer bought which product?

slide-6
SLIDE 6

GraphGen

Backend Relational DBMS

Java Program

Graph Definition Queries

DSL Parser + Optimizer

In-Memory Engine SQL Queries

Graph Analysis Queries

Results Direct Graph Access

Vertex- Centric Directly over Graph

GraphGen

slide-7
SLIDE 7
  • Definition of a GraphView over the database

○ User specifies how to construct the Nodes and Edges

GraphGenDL - Definition Language

CREATE GRAPHVIEW CoAuthors AS Nodes(ID, name) :- Author(ID, name). Edges(ID1, ID2, wt=$COUNT(pub)) :- AuthorPub(ID1, pub), AuthorPub(ID2, pub).

  • Definition of a collection of graphs (Multi-Graph View) over the database

○ Can enable many optimizations

Edge Property: number of publications

CREATE GRAPHVIEW AuthorEgoNetworks(X) WHERE Author(X) AS Nodes(X, name) :- Author(X, name). Nodes(ID, name) :- AuthorPub(X,pub), AuthorPub(ID,pub), Author(ID, name). Edges(ID1, ID2) :- AuthorPub(ID1, pub), AuthorPub(ID2, pub).

Extract all

ego-graphs

slide-8
SLIDE 8
  • Specifying Graph Queries over GraphViews
  • Support for subgraph pattern matching languages like

SPARQL, Cypher, PGQL etc.

  • Datalog is a natural fit for expressing recursive

computation over the Edges VIEW

USING GRAPHVIEW CoAuthors Triangle(X, Y, Z) :- Nodes(X, _, “ML” ),Nodes(Y, _, “DB”), Nodes(Z, _, “AL” ),Edges(X, Y),Edges(Y, Z),Edges(X, Z).

GraphGenQL - Query Language

Find triangles of authors whose areas follow: “ML” -> “DB” -> “AL”

slide-9
SLIDE 9

GraphGen

Backend Relational DBMS

Java Program

Graph Definition Queries

DSL Parser + Optimizer

In-Memory Engine SQL Queries

Graph Analysis Queries

Results Direct Graph Access

Vertex- Centric Directly over Graph

GraphGen

slide-10
SLIDE 10

GraphGen

Backend Relational DBMS

Java Program

Graph Definition Queries

DSL Parser + Optimizer

In-Memory Engine SQL Queries

Graph Analysis Queries

Results Direct Graph Access

Vertex- Centric Directly over Graph

GraphGen

  • Goal: We want to adapt the execution

based on the query/analysis.

  • What are some of the challenges here??
slide-11
SLIDE 11
  • 1. Where to execute Queries/ Tasks
  • Depends on workload, rate of updates, rate of queries…

In-memory execution In-database execution

Dataset In-memory ETL

MySQL PosgreSQL

Small 0.001 s 2.05 s

0.8 s 0.1 s

Large 0.015 s 17.52 s

4.26 0.704 s

Triangle Pattern Matching

Dataset DBS1 DBS2 Small 0.899 s 0.22 Large 4.25 s NA

  • Key Challenge: Develop accurate cost models, tools,
  • techniques. Decide what to compute where
  • Other issues: Large-output joins [SIGMOD ‘17], and selectivity

estimation errors associated with them.

slide-12
SLIDE 12
  • 2. Query Rewriting
  • Assume the execution is to be pushed to the database
  • Many different ways to construct equivalent SQL queries
  • Auto-generated SQL can be verbose → Challenging to optimize

With Nodes as (...) With Edges as (...) (SQL for answering query) Create View Edges as (...) Create View Nodes as (...) (SQL for answering query)

DISTINCT DISTINCT

1) With vs VIEW 2) Duplicate Elimination (DISTINCT)

  • The costly duplicate removal might even be unnecessary if

the query / analysis doesn’t care about them!

slide-13
SLIDE 13
  • 2. Query Rewriting
  • Assume the execution is to be pushed to the database
  • Many different ways to construct equivalent SQL queries
  • Auto-generated SQL can be verbose → Challenging to optimize

With Nodes as (...) With Edges as (...) (SQL for answering query) Create View Edges as (...) Create View Nodes as (...) (SQL for answering query)

DISTINCT DISTINCT

1) With vs VIEW 2) Duplicate Elimination (DISTINCT)

  • The costly duplicate removal might even be unnecessary if

the query / analysis doesn’t care about them!

Time for query to finish in seconds

slide-14
SLIDE 14
  • 3. Optimizing Multi-Graph Views

Snapshots

CREATE GRAPHVIEW CoAuthorsSnapshot(X) WHERE X IN RANGE (1950 , 2017 , 1) Nodes(ID,name) :- Author(ID,name). Edges(ID1,ID2) :- AuthorPub(ID1, pub), AuthorPub(ID2, pub), Publication(pub, _, Y), Y <= X.

Key Challenge: Develop a systematic approach to optimizing the extraction of and execution against such multi-graph views.

E.g. Ego-Graph Analysis

  • Naive: Generate a separate SQL query for each distinct graph.
  • Result-Tagging: We can extract all graphs with a single query!

Please see full paper

  • Ego Graph Analysis, Graph snapshot analysis
  • Ability to refer to each graph independently → significant

savings

  • Opportunity: Overlap computation and storage over

collections of graphs

slide-15
SLIDE 15

Result-Tagging

aid1 a1 a1 a6 a1 aid2 a2 a5 a6 a7 a7 a8 a5 a3 a3 a4 a2 a3 tag a1 a1 a6 a1 a7 a5 a3 a2

Tagged Edges Table

e1.aid2 = e2.aid1

aid1 a1 a1 a6 a1 aid2 a2 a5 a6 a7 a7 a8 a5 a3 a3 a4 a2 a3 tags[] [a1] [a1] [a1,a6] [a1] [a6,a7] [a5,a1] [a2,a3,a5] [a1,a2]

Tag Aggregation

Find the edges 1-hop away for the source (tag) & Union the result with the initial Tagged Edges table aid1 a2 a5 a6 aid2 a3 a3 a7 a7 a8 a3 a4 a3 a4 tag a1 a1 a1 a6 a5 a2 a1 a1 a6 a1 a2 a5 a6 a7 a7 a8 a5 a3 a3 a4 a2 a3 a1 a1 a6 a1 a7 a5 a3 a2 Tags show which ego-graphs involve the edge

slide-16
SLIDE 16

Take Aways

  • Need for a unified framework for extraction and analysis of

graphs stored implicitly in a structured data store.

  • We propose declarative a Datalog-based DSL for specifying:

○ GraphViews over relational schemas ○ Declarative Graph queries

  • Expose a series of APIs for defining complex graph analytics over

GraphViews There is a variety of challenges & opportunities here in terms of:

  • Deciding where to execute graph queries
  • Handling large-output joins and inaccuracies of query optimizers
  • Rewriting SQL queries pushed to the database
  • Optimizing across collections of graphs (Multi-Graph Views)

Thank you! Questions?