Graph Analytics using Vertica Relational Database Meichun Hsu - PowerPoint PPT Presentation

Graph Analytics using Vertica Relational Database Meichun Hsu Alekh Jindal* Samuel Madden Malú Castellanos Microsoft MIT Vertica Vertica * work done while at MIT

Motivation for graphs on DB • Data anyways in a DB   - avoid expensive copying   - end-to-end data analysis   - leverage other DB features • Processing involves full scans and joins   - relational engines could run them efficiently   - particularly suited for column stores • Relational algebra/SQL offers powerful declarative syntax   - in fact, we could express Giraph as an operator DAG   - can even express more complex graph analytics

5-point Agenda • From graph queries to SQL : how do we make the translation? • Graph query optimization : can we leverage decades of relational wisdom? • Column store backends : why are they a good choice? • Comparison with specialized graph systems : how do the numbers look? • Extending column stores : can we do better?

1. From Graph to SQL

Vertex-centric Graph Queries • Popular language for graph analytics 1 • Vertex programs that run in 3 supersets and communicate via 2 5 message passing 4

Vertex-centric Graph Queries • Popular language for graph 0 analytics 1 1 inf • Vertex programs that run in 1 3 inf supersets and communicate via 2 5 inf message passing 4 inf

Vertex-centric Graph Queries • Popular language for graph 0 analytics 1 1 • Vertex programs that run in 3 inf supersets and communicate via 2 2 5 1 message passing 2 2 4 inf

Vertex-centric Graph Queries • Popular language for graph 0 analytics 1 1 • Vertex programs that run in 3 2 supersets and communicate via 2 5 1 message passing 3 4 2

Vertex-centric Graph Queries • Popular language for graph 0 analytics 1 1 • Vertex programs that run in 3 2 supersets and communicate via 2 5 1 message passing 4 2

Vertex-centric Graph Queries • Popular language for graph 0 analytics 1 1 • Vertex programs that run in 3 2 supersets and communicate via 2 5 1 message passing 4 • Programmer only specifies a vertex 2 program • System takes care of running it in parallel

The Giraph Plan • Giraph: a popular, open-source graph analytics system on Hadoop

HDFS G=(V,E) The Giraph Plan Split Input Superstep Scan W 1 W 2 W 3 W 4 … … … RecRead Shuffle Server Data W 1 W 2 W 3 W 4 partition store edge store … … … • Giraph: a popular, open-source message store Master synchronize graph analytics system on Hadoop vertexCompute Superstep 1 Shuffle Server Data W 1 W 2 W 3 W 4 • The Giraph physical plan: hard partition store edge store … … … message store coded physical execution pipeline Master synchronize ……. vertexCompute Shuffle … Server Data W 1 W 2 W 3 W 4 partition store edge store … … … message store Output Superstep Master synchronize cleanup store HDFS G’=(V’,E’)

The Giraph Plan Modified New • Giraph: a popular, open-source Vertices Messages V’ U M’ graph analytics system on Hadoop vertexCompute • The Giraph physical plan: hard γ coded physical execution pipeline V V.id=M.to • Giraph logical query plan using V.id=E.from M relational operators Messages E V Edges Vertices

Rewriting Logical Giraph Plan Giraph logical   Pushing down the   Replacing M 1 2 3 query plan vertexCompute UDF by V E V’ U M’ M’ V’ V’ U M’ V.id=E.from vertexCompute vertexCompute vertexCompute V’ E γ γ γ V 1 V vertexCompute V V.id=M.to γ V.id=M.to V 1 .id=E.to V.id=E.from V V.id=E.from M M V 2 .id=E.from V 1 V.id=M.to E V V M V 2 E E V

Rewriting Logical Giraph Plan Giraph logical   Pushing down the   Replacing M 1 2 3 query plan vertexCompute UDF by V E V’ U M’ V’ M’ V’ V’ U M’ V.id=E.from σ d’<V 1 .d vertexCompute vertexCompute vertexCompute vertexCompute V’ E γ γ γ V 1 V vertexCompute Γ V d’=min(V 2 .d+1) V.id=M.to γ V.id=M.to V 1 .id=E.to γ V.id=E.from V V.id=E.from M M V 2 .id=E.from V 1 V 1 V.id=M.to E V V M V 2 E V 1 .id=E.to E V V 2 .id=E.from V 1 V 2 E Single Source Shortest Path

Rewriting Logical Giraph Plan Giraph logical   Pushing down the   Replacing M 1 2 3 query plan vertexCompute UDF by V E V’ U M’ M’ V’ V’ U M’ V’ V.id=E.from σ vertexCompute vertexCompute vertexCompute vertexCompute V’ E γ cc’<V 1 .cc γ γ V 1 V vertexCompute Γ V cc’=min(V 2 .id) V.id=M.to γ V.id=M.to V 1 .id=E.to γ V.id=E.from V V.id=E.from M M V 2 .id=E.from V 1 V 1 V.id=M.to E V V M V 2 E V 1 .id=E.to E V V 2 .id=E.from V 1 V 2 E Connected Components

Rewriting Logical Giraph Plan Giraph logical   Pushing down the   Replacing M 1 2 3 query plan vertexCompute UDF by V E V’ U M’ M’ V’ V’ U M’ V’ V.id=E.from vertexCompute vertexCompute Γ vertexCompute vertexCompute V’ E V 1 .r=0.15/n+0.85*   γ γ γ V 1 sum(V 2 .r/V 2 .outD) γ V vertexCompute V V.id=M.to γ V 1 V.id=M.to V 1 .id=E.to V.id=E.from V 1 .id=E.to V V.id=E.from M M V 2 .id=E.from V 1 V.id=M.to V 2 .id=E.from V 1 E V V M V 2 E E V V 2 E PageRank

Rewriting Logical Giraph Plan Giraph logical   Pushing down the   Replacing M vertexCompute UDF Replacing join 1 2 2 3 3 query plan vertexCompute UDF by V E as Table UDF with union V’ U M’ V’ U M’ M’ V’ V’ U M’ V’ V’ U M’ V.id=E.from Table UDF vertexCompute vertexCompute Γ Table UDF vertexCompute vertexCompute V’ vertexCompute E V 1 .r=0.15/n+0.85*   γ vertexCompute γ γ V 1 sum(V 2 .r/V 2 .outD) γ V sort vertexCompute V sort γ V.pid V.id=M.to γ V 1 V.id=M.to V 1 .id=E.to γ V.pid .id=M.to V.id=E.from V 1 .id=E.to V V.id=E.from M M V 2 .id=E.from V.id=M.to V 1 V.id=M.to V 2 .id=E.from U V 1 V.id=E.from E V V M V 2 E E M V V M E V 2 E E V Unmodified Vertex Compute Optimized Unmodified Program, e.g. SGD Vertex Compute Program

2. Graph Query Optimization

Leveraging Relational Query Optimizers • Multiple rule- or cost-based query rewriting possible; pick the best one using an optimizer • No hard-coded physical execution plan • Several new optimizations proposed:   - update vs replace   - incremental evaluation   - join elimination   - alternate direction graph exploration

Inner Join Update 0 Updated Input Output 1 1 1 Node Value Node Value 1 3 inf 1 0 2 1 SSSP 2 5 2 1 3 1 1 3 1 Inner Join 4 4 inf inf 5 inf Good for small number of updates!

Outer Join Replace 0 Input Output New Input 1 1 1 Node Value Node Value Node Value 1 3 Outer Join SSSP inf 1 0 2 1 1 0 2 5 2 inf 3 1 2 1 1 3 inf 3 1 4 4 inf 4 inf inf 5 inf 5 inf Good for bulk updates!

Incremental Computation New Inc. Input Inc. Input Node Value 0 Output Node Value 2 1 1 SSSP 1 1 0 3 1 Node Value 1 1 2 1 Input 3 inf New Input 3 1 2 5 1 Node Value Node Value 1 0 Outer Join 1 0 4 2 inf 2 1 inf 3 inf 3 1 4 inf 4 inf 5 inf 5 inf

Incremental Computation New Inc. Input Inc. Input Node Value 0 Node Value Output 4 2 1 2 1 SSSP 5 2 Node Value 1 3 1 4 2 3 2 New Input 2 Input 5 2 2 5 1 Node Value 2 Node Value 2 Outer Join 1 0 1 0 4 2 1 2 1 2 3 1 3 1 4 2 4 inf 5 2 5 inf Faster Iteration Runtime!

3. Column Store Backends

    Why columns stores could be a good choice? • Modern column stores provide several features   - physical design   - join optimizations   - query pipelining   - intra-query parallelism • For more details, pick your favorite column store papers:   - MonetDB   [Database Architecture Evolution: Mammals Flourished long before Dinosaurs became Extinct, Peter A. Boncz et. al., PVLDB 2009.]   - C-Store   [C-Store: A Column-oriented DBMS, Mike Stonebraker et. al., VLDB 2005.]   - Vertica   [The Vertica Analytic Database: C-Store 7 Years Later, Andrew Lamb et. al., VLDB 2012.]

node[3]=node3 (executor) Up Root OutBlk=[UncTuple(2)] Illustration: Vertica NewEENode OutBlk=[UncTuple(2)] Query Plan for SSSP ExprEval: e.to_node, <SVAR> Recv from: node0,node1,node2,node3 Send to: node0 • Early filtering using FilterStep: (<SVAR> < <SVAR>) GroupByPipe: 1 keys sideways information Aggs: min((n1.value + 1)), min(n2.value) passing StorageMergeStep: twitter_edge; 1 sorted GroupByPipe: 1 keys Aggs: min((n1.value + 1)), min(n2.value) • Fully pipelined query ExprEval: e.to_node, (n1.value + 1), n2.value execution Join: Merge-Join: using previous join and twitter_node_b0 Join: Hash-Join: StorageMergeStep: twitter_node_b0; 1 sorted using twitter_edge and twitter_node_b0 • Picks the right join ScanStep: twitter_edge SIP2(HashJoin): e.from_node ScanStep: twitter_node_b0 Recv from: node0,node1,node2,node3 SIP1(MergeJoin): e.to_node id, value strategies,   to_node (not emitted),from_node e.g. broadcast Send to: node0,node1,node2,node3 StorageUnionStep: twitter_node_b0 ScanStep: twitter_node_b0 id, value

4. Comparison with Specialized Graph Systems

Graph Analytics using Vertica Relational Database Meichun Hsu - PowerPoint PPT Presentation

Graph Analytics using Vertica Relational Database Meichun Hsu Alekh Jindal* Samuel Madden Mal Castellanos Microsoft MIT Vertica Vertica * work done while at MIT Motivation for graphs on DB Data anyways in a DB - avoid expensive

Graph Analytics using Vertica Relational Database Alekh Jindal - Samuel Madden, Mal

Chapter 3: Relational Model Structure of Relational Databases Relational Algebra Tuple

Chapter 2: Relational Model Chapter 2: Relational Model Structure of Relational Databases

Relational Algebra Relational Query Languages Recall: Query = Retrieval Program Language

Relational Algebra 1 / 39 Relational Algebra Relational model specifies stuctures and

Relational Query Languages (2) SQL and QBE Walid G. Aref Query Languages For The Relational

Chapter 8 Evaluation of Relational Operators Implementing the Relational Algebra Relational

This Lecture The Relational Model Relational data structures Relations and Relational

Chapter 7: Relational Database Design Pitfalls in Relational Database Design Decomposition

CSE 154 LECTURE 13:RELATIONAL DATABASES AND SQL Relational databases relational database : A

CSC 337 LECTURE 20: RELATIONAL DATABASES AND SQL Relational databases relational database : A

Extended RA Database Systems: The Complete Book Ch 5.1-5.2, 15.4 1 Relational Algebra A Set of

CSE 154 LECTURE 22:RELATIONAL DATABASES AND SQL Relational databases relational database : A

Analytics and Data Summit 2020 Analytics and Data Summit 2020 Analytics and Data Summit 2020

Real Time Analytics Vertica A SQL analytic engine Built for Speed, Scale and Efficiency

Relational Calculus More declarative than relational algebra Foundation for query

FabScalar RISC-V Rangeen Basu Roy Chowdhury Anil Kumar Kannepalli Eric Rotenberg FabScalar

Complementarity-based Dynamic Simulation for Kinodynamic Motion Planning Nilanjan Chakraborty

Mass.gov: A Guide to Data-Informed Content Optimization Nathan James and Julia Gutierrez

Detecting Task Phases from Power Traces Joseph Granados, Jake Probst , Nick Armour, Jeffrey

Algorithms for Natural Language Processing Lecture 7: Lexical Semantics Three Ways of Looking

Lets Fix OpenGL Adrian Sampson, Cornell Commands Pixels CPU GPU Display CPU Display

Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 2 Jan-Willem van de Meent (

Optimizing Black-box Metrics with Adaptive Surrogates Qijia Jiang 1 , Olaoluwa (Oliver) Adigun 2 ,