Extending In-Memory Relational Database Engines with Native Graph Support


  1. Extending In-Memory Relational Database Engines with Native Graph Support
     Mohamed S. Hassan (1), Tatiana Kuznetsova (1), Hyun Chai Jeong (1), Walid G. Aref (1), Mohammad Sadoghi (2)
     (1) Purdue University, West Lafayette, IN, USA
     (2) University of California, Davis, CA, USA
     EDBT’18

  2. Graphs are Ubiquitous
     [Figure labels: Biological Network, Road Network, Social Network, Datacenter Network]

  3. Specialized Graph Database Systems
     - Specialized graph databases can handle graph query workloads
       - Vital queries include shortest-path and reachability queries

  4. Why Use Relational Database Systems to Support Graphs?
     - RDBMS technology is very mature and widely adopted
     - Relational data can have latent graphs
     - Can easily represent graphs using relational tables
     - Many applications involve graph queries
       - Queries that involve both relational and graph predicates
         - E.g., for each Patient P in a selected area, find the nearest hospital to P
     - How can an RDBMS effectively and efficiently handle graph query workloads?

  5. Graph Support in RDBMSs
     - Why is it challenging?
       - There is an impedance mismatch between the relational model and the graph model
     - Two common approaches for supporting graphs:
       - Native Relational-Core
       - Native Graph-Core
     - Native G+R Core (the proposed GRFusion system)

  6. Native Relational-Core
     - Use a vanilla RDBMS
     - Encode graphs in relational schemas
     - Support limited graph queries
     - Translate the supported graph queries into SQL or procedural SQL
     - E.g., SQLGraph [SIGMOD’15], (SQL) Grail [CIDR’15]
     - Pros:
       - Use of very mature RDBMS technology
     - Cons:
       - Several graph queries are inefficient to evaluate using pure SQL
       - Graphs are encoded in complex schemas
     [Figure labels: Graph Queries, Relational Queries, SQL Translation Layer, Relational Data, Graph Encoded into Relational Tables, Relational Database, Results]
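To make the relational-core idea concrete, here is a minimal sketch of a reachability query expressed as a recursive join over an edge table. It uses Python's standard `sqlite3` module purely for illustration; the table name `edges` and the function `reachable` are assumptions, and this is not the translation that SQLGraph or Grail actually produce.

```python
import sqlite3

# Hypothetical edge table: the graph is encoded as plain relational tuples.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (src INTEGER, dst INTEGER)")
conn.executemany(
    "INSERT INTO edges VALUES (?, ?)",
    [(1, 2), (2, 3), (3, 4), (5, 6)],
)

def reachable(conn, start):
    """Return every vertex reachable from `start` via a recursive join."""
    rows = conn.execute(
        """
        WITH RECURSIVE reach(v) AS (
            SELECT ?
            UNION  -- UNION (not UNION ALL) removes duplicates, so cycles terminate
            SELECT e.dst FROM edges e JOIN reach r ON e.src = r.v
        )
        SELECT v FROM reach
        """,
        (start,),
    ).fetchall()
    return {row[0] for row in rows}

print(reachable(conn, 1))
```

Each level of the traversal is one more join against `edges`, which is the cost the slide's "Cons" bullets point at.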

  7. Native Graph-Core
     - Build on top of an RDBMS
     - Extract graphs from the RDBMS
     - Store graphs and process queries outside the realm of the RDBMS
     - E.g., Ringo [SIGMOD’15], GraphGen [VLDB’15, SIGMOD’17]
     - Pros:
       - Native processing of graph operations
     - Cons:
       - Graph updates require re-extracting the graphs
       - Queries cannot reference any non-extracted relational data
     [Figure labels: Graph Queries, Graph Extraction and Materialization Engine, Graph Extraction Queries (SQL), Extracted Graphs, Relational Data, Relational Database, Results]

  8. The Relational Model vs. the Graph Model
     - Graph-core approach
       - +ve: Queries involving graph traversals are efficiently handled in the graph model (e.g., shortest paths)
       - -ve: Not as pervasive and mature as RDBMSs
     - Relational-core approach
       - +ve: Mature and pervasive
       - -ve: Either many temporary inserts/deletes/updates, or too many joins to traverse a graph
         - Intermediate-result size and cardinality estimation
     - Can the best of the two worlds be combined?
       - Support native graph processing inside an RDBMS

  9. Proposed Approach: Native G+R Core
     - Assume that graphs have relational schemas
     - Relational schemas describe the edges/nodes
     - Enables graphs to be defined as native database objects
     - Store graphs in non-relational structures optimized for graph operations
     - Extend the SQL language
       - Queries can compose relational and graph operations
     - Cross-Data-Model QEPs (Query Evaluation Plans)
     - Graph updates are supported
     [Figure labels: Graph-Relational Queries (SQL), Graph and Relational Operators in the Same QEP (π, ⋈, σ, GraphOp), Graph Views (Topology + Tuple Pointers), Relational Data, Graph Construction, Relational Database, Results]

  10. GRFusion: Realizing the G+R Approach
     - We realize the G+R approach inside VoltDB
       - An open-source in-memory RDBMS
       - GRFusion: our realization of the G+R approach in VoltDB
       - A demo of GRFusion will appear in SIGMOD 2018
     [Figure labels: Declarative Graph-Relational Queries, Query Parser, Query Optimizer, Plan Executor, Graph-Relational Query Engine, Relational Data, Graph Views, In-Memory Relational Database]

  11. Create Graph View
     - Create-Graph-View statement
       - Create a named graph database object that can be referenced in queries
       - Define the relational sources of the graph’s vertexes/edges
       - Materialize the topology of the graph in main memory as a singleton graph structure
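As a rough illustration of what materializing the topology could look like, the sketch below builds an adjacency structure from two relational sources and keeps only ids plus positions of the source tuples (standing in for tuple pointers). All names here (`GraphView`, `create_graph_view`, the `users` and `friendships` rows) are hypothetical; the slides do not show GRFusion's actual data structures or statement syntax.

```python
from dataclasses import dataclass, field

@dataclass
class GraphView:
    """Topology only: ids, adjacency, and positions of the source tuples."""
    vertex_row: dict = field(default_factory=dict)  # vertex id -> index into vertex table
    edge_row: dict = field(default_factory=dict)    # edge id -> index into edge table
    edge_ends: dict = field(default_factory=dict)   # edge id -> (start id, end id)
    out_edges: dict = field(default_factory=dict)   # vertex id -> list of edge ids
    in_edges: dict = field(default_factory=dict)    # vertex id -> list of edge ids

    def fan_out(self, vertex_id):
        return len(self.out_edges[vertex_id])

    def fan_in(self, vertex_id):
        return len(self.in_edges[vertex_id])

def create_graph_view(vertex_rows, edge_rows):
    """Build the topology from relational rows; attributes stay in the tables."""
    view = GraphView()
    for i, row in enumerate(vertex_rows):
        view.vertex_row[row["id"]] = i
        view.out_edges[row["id"]] = []
        view.in_edges[row["id"]] = []
    for i, row in enumerate(edge_rows):
        view.edge_row[row["id"]] = i
        view.edge_ends[row["id"]] = (row["src"], row["dst"])
        view.out_edges[row["src"]].append(row["id"])
        view.in_edges[row["dst"]].append(row["id"])
    return view

# Hypothetical relational sources.
users = [
    {"id": 1, "last_name": "Adams"},
    {"id": 2, "last_name": "Baker"},
    {"id": 3, "last_name": "Clark"},
]
friendships = [
    {"id": 10, "src": 1, "dst": 2},
    {"id": 11, "src": 1, "dst": 3},
    {"id": 12, "src": 2, "dst": 3},
]
view = create_graph_view(users, friendships)
# Attributes are fetched through the stored position, not copied into the view.
print(users[view.vertex_row[2]]["last_name"], view.fan_out(1), view.fan_in(3))
```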

  12. Graph-View of a Social Network

  13. Graph-View Structure [Traversal Index]

  14. Graph-View Structure [Traversal Index]

  15. The VERTEXES Construct
     - Appears in the FROM clause and references a graph view
       - Select … From MyGraphView.VERTEXES v
     - VERTEXES represents the vertexes of a graph view
     - A vertex is a tuple with the following properties:
       - Id
       - FanIn
       - FanOut
       - Property for each vertex attribute

  16. The EDGES Construct
     - Appears in the FROM clause and references a graph view
       - Select … From MyGraphView.EDGES v
     - EDGES represents the edges of a graph view
     - An edge is a tuple with the following properties:
       - Id
       - StartVertexId
       - EndVertexId
       - Property for each edge attribute

  17. The PATHS Construct (Extended SQL)
     - Appears in the FROM clause and references a graph view
       - Select … From MyGraphView.PATHS P
     - PATHS represents a set of lazily-evaluated paths
     - A path is a set of consecutive edges
     - Each edge has two endpoint vertexes
       - E.g., (V:attributes) –(:E:attributes)→ (V:attributes) …
     - A path is a tuple with the following properties:
       - Length
       - StartVertex
       - EndVertex
       - Vertexes
       - Edges
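Lazy evaluation of paths can be pictured as a generator that produces one path tuple at a time, so a consumer that needs only a few paths never pays for the rest. The sketch below is an assumption-laden illustration: the `Path` tuple mirrors the properties listed above, while the function `paths`, the depth-first order, the restriction to simple paths, and the `max_length` bound are choices made here, not details from the slides.

```python
from collections import namedtuple

# Mirrors the listed path properties: Length, StartVertex, EndVertex, Vertexes, Edges.
Path = namedtuple("Path", "length start_vertex end_vertex vertexes edges")

def paths(out_edges, start, max_length):
    """Lazily yield simple paths from `start` with 1..max_length edges.

    out_edges maps a vertex id to a list of (edge id, neighbour id) pairs.
    """
    stack = [((start,), ())]
    while stack:
        vertexes, edges = stack.pop()
        if edges:
            yield Path(len(edges), vertexes[0], vertexes[-1], vertexes, edges)
        if len(edges) < max_length:
            for edge_id, nxt in out_edges.get(vertexes[-1], []):
                if nxt not in vertexes:  # keep paths simple (no repeated vertex)
                    stack.append((vertexes + (nxt,), edges + (edge_id,)))

out_edges = {1: [("e1", 2), ("e2", 3)], 2: [("e3", 3)]}
lazy = paths(out_edges, 1, 2)
first = next(lazy)  # only one path has been computed at this point
print(first)
```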

  18. Declarative Graph-Relational Queries

  19. The PathScan Operator
     - PathScan is a logical operator that acts on a graph view
       - Has three corresponding physical operators: BFScan, DFScan, SPScan
     - The output of PathScan is a tuple
       - Extends the standard relational tuple
       - PathScan output can be ingested by other relational operators in the QEP
     - PathScan accepts the id of the vertex to start the traversal from
       - Otherwise, all the vertexes will be considered as start vertexes
     - Filters can be pushed as hints into the PathScan operator
       - E.g., P.PathLength = 2
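A breadth-first scan with a pushed-down length hint might look roughly like the following sketch. The function name `bf_scan`, its parameters, and the decision to require the hint (to keep enumeration finite) and to emit only simple paths are assumptions for illustration; the slides do not describe how BFScan is implemented in GRFusion.

```python
from collections import deque

def bf_scan(adj, path_length, start_id=None):
    """Yield (start, end, length) tuples for simple paths of exactly `path_length` edges.

    adj maps a vertex id to a list of neighbour ids.
    start_id: vertex to start from; None means every vertex is a start vertex.
    path_length: the hint, applied inside the scan so that the traversal
                 never expands a path beyond the requested length.
    """
    starts = sorted(adj) if start_id is None else [start_id]
    for s in starts:
        queue = deque([(s,)])
        while queue:
            path = queue.popleft()
            depth = len(path) - 1
            if depth == path_length:
                yield (path[0], path[-1], depth)
                continue  # hint reached: do not expand further
            for w in adj.get(path[-1], []):
                if w not in path:
                    queue.append(path + (w,))

adj = {1: [2, 3], 2: [3], 3: [1]}
# The yielded tuples can be consumed by any downstream filter or join.
print(list(bf_scan(adj, 2, start_id=1)))
```

Applying the length condition inside the scan, rather than filtering afterwards, is what keeps the traversal from exploring paths that a later operator would discard.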

  20. Friends-of-Friends Query Example
     - For all the users working as lawyers, retrieve the last name of their friends of friends, where the friendships happened after 1/1/2000
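The query above mixes a relational selection (job is lawyer), an edge predicate (friendship date), and a two-hop traversal. The sketch below evaluates that combination over made-up in-memory rows; the data, the function `friends_of_friends`, and the treatment of friendships as directed edges are all assumptions, and the code does not reproduce GRFusion's SQL syntax or query plan.

```python
from datetime import date

def friends_of_friends(users, friendships, job, after):
    """Return (user last name, friend-of-friend last name) pairs."""
    adj = {}
    for src, dst, since in friendships:
        if since > after:  # edge predicate applied while building the traversal
            adj.setdefault(src, []).append(dst)
    result = []
    for uid, user in users.items():
        if user["job"] != job:  # relational selection on the start vertex
            continue
        for friend in adj.get(uid, []):
            for fof in adj.get(friend, []):
                if fof != uid:  # a user is not their own friend of friend
                    result.append((user["last_name"], users[fof]["last_name"]))
    return result

# Hypothetical data.
users = {
    1: {"last_name": "Adams", "job": "lawyer"},
    2: {"last_name": "Baker", "job": "doctor"},
    3: {"last_name": "Clark", "job": "engineer"},
    4: {"last_name": "Davis", "job": "lawyer"},
}
friendships = [
    (1, 2, date(2001, 5, 1)),
    (2, 3, date(2003, 2, 1)),
    (2, 4, date(1998, 7, 1)),
    (1, 4, date(1999, 1, 1)),
    (4, 3, date(2005, 1, 1)),
]
print(friends_of_friends(users, friendships, "lawyer", date(2000, 1, 1)))
```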

  21. QEP of the Friends-of-Friends Query

  22. Reachability Query Example
     - Check if Protein X interacts directly (i.e., by an edge) or indirectly (i.e., by a path) with Protein Y through either a covalent or a stable interaction type.
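A reachability check restricted to certain edge types can be sketched as a breadth-first search that ignores edges failing the predicate. The edge list, the function name `reachable`, and the treatment of interactions as directed edges are assumptions made for this example only.

```python
from collections import deque

def reachable(edges, source, target, allowed_types):
    """True if `target` can be reached from `source` using only allowed edge types."""
    adj = {}
    for u, v, kind in edges:
        if kind in allowed_types:  # edge predicate: skip other interaction types
            adj.setdefault(u, []).append(v)
    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v == target:
                return True
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return False

# Hypothetical protein-interaction edges: (from, to, interaction type).
interactions = [
    ("X", "A", "covalent"),
    ("A", "Y", "stable"),
    ("X", "B", "transient"),
    ("B", "Z", "covalent"),
]
print(reachable(interactions, "X", "Y", {"covalent", "stable"}))
```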

  23. Shortest-Path Queries with Relational Predicates

  24. Evaluating the Native G+R Approach
     - Realized a centralized version of GRFusion (Native G+R Core approach) inside VoltDB Version 6.7
     - Single node running Linux kernel Version 3.17.7
       - 32 cores of Intel Xeon 2.90 GHz
       - 384 GB of RAM
     - Comparing against:
       - Native Relational-Core:
         - SQLGraph [SIGMOD’15], Grail [CIDR’15]
       - Native Graph-Core Systems:
         - Neo4j [neo4j.com] and Titan [thinkaurelius.github.io/titan]

  25. Experimental Setup
     - Native relational-core approach
       - SQLGraph [SIGMOD’15]
         - Represent path traversal using recursive relational joins
         - Commercial system (code not available)
         - Implemented the techniques in VoltDB in-memory
       - Grail [CIDR’15]
         - Implemented Grail in VoltDB
         - Also evaluated Grail in Hekaton
         - Got similar conclusions (the Hekaton results are not reported here)

  26. Experimental Setup (Cont’d)
     - Native Graph Approach
       - Neo4j [neo4j.com] and Titan [thinkaurelius.github.io/titan]
         - Native graph-cores (specialized graph systems)
         - Disk-based systems
         - Titan: configured to use the in-memory storage configuration
         - Neo4j: run on RamDisk to mitigate the disk I/O cost
       - GRFusion uses simple graph algorithms (single-source shortest path with Dijkstra’s algorithm)
         - Want to investigate performance gains, if any, of the G+R approach in contrast to the native relational-core
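For reference, a textbook version of single-source shortest paths with Dijkstra's algorithm is sketched below using Python's `heapq`. This is the standard algorithm, not GRFusion's code; the function name `dijkstra` and the sample graph are assumptions, and edge weights must be non-negative.

```python
import heapq

def dijkstra(adj, source):
    """Return shortest distances from `source`.

    adj maps a vertex to a list of (neighbour, non-negative weight) pairs.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry: a shorter path to u was already found
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

roads = {"a": [("b", 4), ("c", 1)], "c": [("b", 2)], "b": [("d", 5)]}
print(dijkstra(roads, "a"))
```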

  27. Evaluating GRFusion
     - Graph queries
       - Reachability queries
       - Reachability queries with filtering predicates
       - Shortest-path queries
       - Subgraph queries (e.g., count triangles)
     - Datasets
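As an example of the subgraph-query category, triangle counting on an undirected graph can be sketched by intersecting neighbour sets. The function `count_triangles` and the edge list are illustrative assumptions; the slides do not say how GRFusion evaluates this query.

```python
def count_triangles(edges):
    """Count triangles in an undirected graph given as (u, v) pairs."""
    nbrs = {}
    for u, v in edges:
        if u != v:  # ignore self-loops
            nbrs.setdefault(u, set()).add(v)
            nbrs.setdefault(v, set()).add(u)
    count = 0
    for u in nbrs:
        for v in nbrs[u]:
            if u < v:
                # Count each triangle once by requiring u < v < w.
                count += sum(1 for w in nbrs[u] & nbrs[v] if v < w)
    return count

print(count_triangles([(1, 2), (2, 3), (1, 3), (3, 4), (2, 4)]))
```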

  28. Reachability Queries (DBLP Dataset)
     - Performance of GRFusion, Neo4j, and Titan is more stable than that of SQLGraph
       - They avoid the overheads of recursive relational joins
     - GRFusion performs better than Neo4j and Titan
       - VoltDB is optimized for main memory
       - Disk-based Titan/Neo4j (although run on RamDisk) are not optimized for main memory
       - Graph views in GRFusion are more compact
         - Encode only the topology within the graph
         - No vertex/edge attributes in the topology
         - Thus, GRFusion makes better use of caching
       - GRFusion/VoltDB are C++-based
       - Neo4j and Titan are Java-based
         - Overheads from the automatic memory management of Java

  29. Reachability Queries (String Dataset)
     - String dataset: ~0.5B edges >> DBLP
     - SQLGraph (based on VoltDB):
       - Materializes the join results at each intermediate stage
       - Explosion in size of intermediate results (performs more than 11 joins)
     - SQLGraph and GRFusion follow BFS evaluation
     - GRFusion follows an iterative model:
       - Evaluate one path at a time
       - Also, only the vertex Ids are stored in the BFS queue
       - More efficient storage-wise than storing the tuples of the relational joins as intermediate results

  30. Reachability Queries (Twitter Dataset)
     - Twitter dataset: 1.4B edges
     - Fan-out is also a factor in the performance
       - But we did not study the effect of fan-out
       - Would require synthetic datasets
       - The current study focuses on real datasets
