Extending In-Memory Relational Database Engines with Native Graph - - PowerPoint PPT Presentation

extending in memory relational database engines with
SMART_READER_LITE
LIVE PREVIEW

Extending In-Memory Relational Database Engines with Native Graph - - PowerPoint PPT Presentation

Extending In-Memory Relational Database Engines with Native Graph Support Mohamed S. Hassan 1 Tatiana Kuznetsova 1 Hyun Chai Jeong 1 Walid G. Aref 1 Mohammad Sadoghi 2 2 University of California Davis, CA, USA 1 Purdue University West


slide-1
SLIDE 1

Extending In-Memory Relational Database Engines with Native Graph Support

Mohamed S. Hassan1 Tatiana Kuznetsova1 Hyun Chai Jeong1 Walid G. Aref1 Mohammad Sadoghi2

1Purdue University – West Lafayette, IN, USA

EDBT’18

2University of California – Davis, CA, USA

slide-2
SLIDE 2

Graphs are Ubiquitous

2

Road Network Biological Network Datacenter Network Social Network

slide-3
SLIDE 3

Specialized Graph Database Systems

3

¨ Specialized graph databases can handle graph query-workloads

¤ Vital queries include shortest-path and reachability queries

slide-4
SLIDE 4

Why Use Relational Database Systems to Support Graphs?

4

¨ RDBMS technology is very mature and widely-adopted ¨ Relational data can have latent graphs ¨ Can easily represent graphs using relational tables ¨ Many applications involve graph queries

¤ Queries that involve both relational and graph predicates

n E.g., for each Patient P in a selected area, find the nearest hospital to P ¨ How can an RDBMS effectively and efficiently handle graph query

workloads?

slide-5
SLIDE 5

Graph Support in RDBMSs

5

¨ Why is it challenging?

¤ There is an impedance mismatch between the relational model and the

graph model

¨ Two common approaches for supporting graphs:

¤ Native Relational-Core ¤ Native Graph-Core

¨ Native G+R Core (The proposed GRFusion system)

slide-6
SLIDE 6

Native Relational-Core

6

¨ Use a vanilla RDBMS ¨ Encode graphs in relational schemas ¨ Support limited graph queries ¨ Translate the supported graph queries

into SQL or procedural SQL

¨ E.g., SQLGraph [SIGMOD’15],

Grail [CIDR’15]

¨ Pros:

¤ Use of very mature RDBMS technology

¨ Cons:

¤ Several graph queries are inefficient

to evaluate using pure SQL

¤ Graphs are encoded in complex schemas

Relational Database

Relational Data

Relational Queries (SQL) Results

Graph Encoded into Relational Tables

Graph Queries SQL Translation Layer

slide-7
SLIDE 7

¨ Build on top of an RDBMS ¨ Extract graphs from the RDBMS ¨ Store graphs and process queries

  • utside the realm of the RDBMS

¨ E.g., Ringo [SIGMOD’15],

GraphGen [VLDB’15, SIGMOD’17]

¨ Pros:

¤ Native processing of graph operations

¨ Cons:

¤ Graph updates require

re-extracting the graphs

¤ Queries cannot reference any

non-extracted relational data

Native Graph-Core

7

Relational Database

Relational Data

Graph Extraction Queries (SQL) Graph Extraction and Materialization Engine Extracted Graphs Results Graph Queries

slide-8
SLIDE 8

The Relational Model vs. the Graph Model

8

¨ Graph-core approach

¤ +ve: Queries involving graph traversals are efficiently handled in

the graph model (e.g., shortest paths)

¤ -ve: Not as pervasive and mature as RDBMSs

¨ Relational-core approach

¤ +ve: Mature and pervasive ¤ -ve: Either many temporary inserts/deletes/updates, or too many

joins to traverse a graph

n Intermediate-result size and cardinality estimation ¨ Can the best of the two worlds be combined?

¤ Support native graph processing inside an RDBMS

slide-9
SLIDE 9

Proposed Approach: Native G+R Core

9

¨ Assume that graphs have relational schemas ¨ Relational schemas describe the edges/nodes ¨ Enables graphs to be defined

as native database objects

¨ Store graphs in non-relational structures

  • ptimized for graph operations

¨ Extend the SQL language

¤ Queries can compose relational and

graph operations

¨ Cross-Data-Model QEPs (Query Evaluation

Plans)

¨ Graph updates are supported

Relational Database

Graph Views (Topology + Tuple Pointers) Relational Data

Graph Construction ⋈ σ

GraphOp

π Graph and Relational Operators in the Same QEP Graph-Relational Queries (SQL) Results

slide-10
SLIDE 10

¨ We realize the G+R approach inside

VoltDB

¤ An open-source in-memory RDBMS ¤ GRFusion: Our realization of the G+R

approach into VoltDB

¤ A demo of GRFusion will appear in

SIGMOD 2018

GRFusion: Realizing the G+R Approach

10

In-Memory Relational Database Declarative Graph-Relational Queries

Graph Views Relational Data

Graph-Relational Query Engine

Query Parser Query Optimizer Plan Executor

slide-11
SLIDE 11

Create Graph View

11

¨ Create-Graph-View statement

¤ Create a named graph database object that can be referenced in queries ¤ Define the relational sources of the graph’s vertexes/edges ¤ Materialize the topology of the graph in main-memory as a singleton graph

structure

slide-12
SLIDE 12

Graph-View of a Social Network

12

slide-13
SLIDE 13

Graph-View Structure [Traversal Index]

13

slide-14
SLIDE 14

Graph-View Structure [Traversal Index]

14

slide-15
SLIDE 15

The VERTEXES Construct

15

¨ Appears in the FROM clause and references a graph view

¤ Select … From MyGraphView.VERTEXES v

¨ VERTEXES represents the vertexes of a graph view ¨ A vertex is a tuple with the following properties:

¤ Id ¤ FanIn ¤ FanOut ¤ Property for each vertex attribute

slide-16
SLIDE 16

The EDGES Construct

16

¨ Appears in the FROM clause and references a graph view

¤ Select … From MyGraphView.EDGES v

¨ EDGES represents the edges of a graph view ¨ An edge is a tuple with the following properties:

¤ Id ¤ StartVertexId ¤ EndVertexId ¤ Property for each edge attribute

slide-17
SLIDE 17

The PATHS Construct – Extended SQL

17

¨ Appears in the FROM clause and references a graph view

¤ Select … From MyGraphView.PATHS P

¨ PATHS represents a set of lazily-evaluated paths ¨ A path is a set of consecutive edges ¨ Each edge has two endpoint vertexes

¤ E.g., (V:attributes) –(:E:attributes)à(V:attributes) …..

¨ A path is a tuple with the following properties:

¤ Length ¤ StartVertex ¤ EndVertex ¤ Vertexes ¤ Edges

slide-18
SLIDE 18

Declarative Graph-Relational Queries

18

slide-19
SLIDE 19

The PathScan Operator

19

¨ PathScan is a logical operator that acts on a graph-view

¤ Has three corresponding physical operators: BFScan, DFScan, SPScan

¨ The output of PathScan is a tuple

¤ Extends the standard relational tuple ¤ PathScan output can be ingested by other relational operators in the QEP

¨ PathScan accepts the id of the vertex to start the traversal from

¤ Otherwise, all the vertexes will be considered as start vertexes

¨ Filters can be pushed as Hints into the PathScan operator

¤ E.g., P.PathLength = 2

slide-20
SLIDE 20

Friends-of-Friends Query Example

20

¨ For all the users working as lawyers, retrieve the last name of their

friends of friends, where the friendships happened after 1/1/2000

slide-21
SLIDE 21

QEP of the Friends-of-Friends Query

21

slide-22
SLIDE 22

Reachability Query Example

22

¨ Check if Protein X interacts directly (i.e., by an edge) or indirectly (i.e.,

by a path) with Protein Y through either a covalent or a stable interaction type.

slide-23
SLIDE 23

Shortest-Path Queries with Relational Predicates

23

slide-24
SLIDE 24

Evaluating The Native G+R Approach

24

¨ Realized a certralized version of GRFusion (Native G+R Core

approach) inside VoltDB Version 6.7

¨ Single node running Linux kernel Version 3.17.7 n 32 cores of Intel Xeon 2.90 GHz n 384 GB of RAM ¨ Comparing against:

¤ Native Relational-Core:

n SQLGraph [SIGMOD’15], Grail [CIDR’15]

¤ Natice Graph-Core Systems:

n Neo4j [neo4j.com] and Titan [thinkaurelius.github.io/titan]

slide-25
SLIDE 25

Experimental Setup

25

¨ Native relational-core approach

¤ SQLGraph [SIGMOD’15]

n Represent path traversal using recursive relational joins n Commercial system (code not available) n Implemented the techniques in VoltDB in-memory

¤ Grail [CIDR’15]

n Implemented Grail in VoltDB n Also evaluated Grail in Hekaton n Got similar conclusions (Do not report the Hekaton results here)

slide-26
SLIDE 26

Experimental Setup (Cont’d)

26

¨ Native Graph Approch

¤ Neo4j [neo4j.com] and Titan [thinkaurelius.github.io/titan]

n Native graph-cores (specialized graph systems) n Disk-based systems n Titan: configured to use the in-memory storage configuration n Neo4j: Run on RamDisk to mitigate the disk IO cost

¤ GRFusion uses simple graph algorithms (single-source-shortest-path -

Dijkstra’s algorithm)

n Want to investigate performance gains, if any, of the G+R approach in contrast to

the native relational-core

slide-27
SLIDE 27

Evaluating GRFusion

27

¨ Graph queries

¤ Reachability queries ¤ Reachability queries with filtering predicates ¤ Shortest path queries ¤ Subgraph queries (e.g., count triangles)

¨ Datasets

slide-28
SLIDE 28

Reachability Queries (DBLP Dataset)

28

¨ Performance of GRFusion, Neo4j, Titan more

stable in contrast to SQLGraph

¤ Avoid overheads of recursive relational joins

¨ GRFusion performs better than Neo4J &Titan

¤ VoltDB is optimized for main-memory ¤ Disk-based Titan/Neo4j (although runs on

RamDisk) are not optimized for main-memory

¤ Graph views in GRFusion are more compact n Encode only the topology within the graph n No vertex/edge attributes in the topology n Thus, GRFusion makes better use of caching ¤ GRFusion/VoltDB are C++-based ¤ Neo4j and Titan are Java-based n Overheads from the automatic memory

management of Java

slide-29
SLIDE 29

Reachability Queries (String Dataset)

29

¨ String dataset: ~ 0.5B edges >> DBLP ¨ SQLGraph (based on VoltDB): ¤ Materialize the join results at each

intermediate stage

¤ Explosion in size of intermediate results

(perform more than 11 joins)

¨ SQLGraph and GRFusion follow BFS

evaluation

¨ GRFusion follows as iterative model: ¤ Evaluate one path at a time ¤ Also, only the vertex Ids are stored in BFS

queue

¤ More efficient storage-wise than storing

the tuples of the relational joins as intermediate results

slide-30
SLIDE 30

Reachability Queries (Twitter Dataset)

30

¨ Twitter dataset: 1.4B edges

dataset

¨ Fan-out is also a factor in the

performance

¤ But we did not study effect of fan-

  • ut

¤ Would require synthetic datasets ¤ Current study focus on real

datasets

slide-31
SLIDE 31

Constrained-Reachability Queries (String Dataset)

31

¨ Select a sub-graph then perform the

query

¨ Use synthesized edge attributes to

control selectivity

¨ Vary the selectivity from 5% to 50% ¨ Limit path length of the results of the

generated queries to 20

¤ To emphasize the effect of the selectivity

  • f the sub-graph to operate on

¨ GRFusion uses the relational engine to

evaluate the filtering predicates on the edges

¤ Utilize pointers from edges to

corresponding tuples

slide-32
SLIDE 32

SSSP Queries – Tiger Dataset

32

¨ Grail implemented in-

memory using VoltDB

¨ GRFusion: Run Dijkstra’s

algorithm

slide-33
SLIDE 33

A Note on the Performance Gains of GRFusion

33

¨ Table scan or index scan/seek

¤ Direct pointers are more efficient

¨ Relational joins

¤ Large intermediate results ¤ Inaccurate cardinality estimation

σ eTable T1 σ eTable T2 σ eTable T3

slide-34
SLIDE 34

Conclusions

34

¨ GRFusion (G+R Core approach)

¤ Execute both relational and graph operations in native mode

¨ Write declarative path-queries with relational predicates ¨ Outperform the state-of-the-art by up to 4 orders-of-magnitude in

query-time speedup

¨ Avoid large intermediate join results ¨ Avoid inaccurate cardinality estimation that may lead to non-optimal

join-algorithm selection

slide-35
SLIDE 35

Thank You!

35

slide-36
SLIDE 36

Vertex Query Example

36

¨ Retrige the Birthdate and the number of friends of each user in the

social network with last name = ‘Smith’