
Graph Analytics using Vertica Relational Database



  1. Graph Analytics using Vertica Relational Database
 • Alekh Jindal* (Microsoft)
 • Samuel Madden (MIT)
 • Malú Castellanos (Vertica)
 • Meichun Hsu (Vertica)
 * work done while at MIT

  2. Motivation for graphs on DB
 • Data is already in a DB
   - avoid expensive copying
   - end-to-end data analysis
   - leverage other DB features
 • Processing involves full scans and joins
   - relational engines could run them efficiently
   - particularly suited for column stores
 • Relational algebra/SQL offers powerful declarative syntax
   - in fact, we could express Giraph as an operator DAG
   - can even express more complex graph analytics

  3. 5-point Agenda
 • From graph queries to SQL: how do we make the translation?
 • Graph query optimization: can we leverage decades of relational wisdom?
 • Column store backends: why are they a good choice?
 • Comparison with specialized graph systems: how do the numbers look?
 • Extending column stores: can we do better?

  4. 1. From Graph to SQL

  5-10. Vertex-centric Graph Queries
 • Popular language for graph analytics
 • Vertex programs that run in supersteps and communicate via message passing
 • Programmer only specifies a vertex program
 • System takes care of running it in parallel
 [Figure, built up over slides 5-10: a five-vertex example graph. The source vertex is labeled 0 and the other vertices inf; the labels are replaced by finite values step by step as messages propagate.]
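The vertex-centric model on these slides can be illustrated with a minimal Python sketch of single-source shortest path. This is not Giraph's API: the function names `vertex_compute` and `run_supersteps`, the seed message to the source, and the edge list `EDGES` are all assumptions chosen for illustration (the edge list is a hypothetical five-vertex graph, not necessarily the one drawn on the slide).

```python
import math
from collections import defaultdict


def vertex_compute(value, messages):
    """Hypothetical SSSP vertex program: keep the smallest distance seen."""
    best = min(messages, default=math.inf)
    return best if best < value else value


def run_supersteps(edges, source):
    """Run the vertex program in supersteps until no messages remain."""
    vertices = {v for edge in edges for v in edge}
    out_neighbors = defaultdict(list)
    for src, dst in edges:
        out_neighbors[src].append(dst)

    values = {v: math.inf for v in vertices}
    inbox = {source: [0]}  # assumed seed message that starts the source at 0

    while inbox:  # one loop iteration corresponds to one superstep
        outbox = defaultdict(list)
        for v, messages in inbox.items():
            new_value = vertex_compute(values[v], messages)
            if new_value < values[v]:
                values[v] = new_value
                # Message passing: tell each out-neighbor its candidate distance.
                for n in out_neighbors[v]:
                    outbox[n].append(new_value + 1)
        inbox = outbox
    return values


# Hypothetical example graph (directed edges), not taken from the slide.
EDGES = [(1, 2), (1, 3), (2, 4), (3, 4), (3, 5)]
distances = run_supersteps(EDGES, source=1)
```

The programmer-facing part is only `vertex_compute`; the loop in `run_supersteps` stands in for what the system would do, here sequentially rather than in parallel.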

  11. The Giraph Plan • Giraph: a popular, open-source graph analytics system on Hadoop

  12. The Giraph Plan
 • Giraph: a popular, open-source graph analytics system on Hadoop
 • The Giraph physical plan: hard-coded physical execution pipeline
 [Figure: Giraph physical plan. Input superstep: split the input G=(V,E) from HDFS, scan, RecRead, and shuffle to workers W1..W4; each worker's server data holds a partition store, an edge store, and a message store. Supersteps: vertexCompute, shuffle, and master synchronize, repeated. Output superstep: cleanup and store G'=(V',E') to HDFS.]

  13. The Giraph Plan
 • Giraph: a popular, open-source graph analytics system on Hadoop
 • The Giraph physical plan: hard-coded physical execution pipeline
 • Giraph logical query plan using relational operators
 [Figure: logical plan over Vertices V, Edges E, and Messages M: joins on V.id=E.from and V.id=M.to, a group-by (γ), then vertexCompute, producing V' ∪ M' (modified vertices and new messages).]

  14. Rewriting Logical Giraph Plan
 • Plan 1: Giraph logical query plan
 • Plan 2: Pushing down the vertexCompute UDF
 • Plan 3: Replacing M by V ⋈ E
 [Figure: the three plans side by side. Plan 1 joins V, E, and M on V.id=E.from and V.id=M.to, groups (γ), and applies vertexCompute to produce V' ∪ M'. Plan 3 reads no message table and instead joins V1, E, and V2 on V1.id=E.to and V2.id=E.from.]

  15. Rewriting Logical Giraph Plan
 • Plan 1: Giraph logical query plan
 • Plan 2: Pushing down the vertexCompute UDF
 • Plan 3: Replacing M by V ⋈ E
 • Single Source Shortest Path as relational operators:
   - join V1, E, V2 on V1.id=E.to and V2.id=E.from
   - aggregate Γ: d' = min(V2.d + 1)
   - select σ: d' < V1.d
 [Figure: the three plans from slide 14, plus the SSSP plan in which vertexCompute is replaced by the operators above.]
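One way to read the SSSP plan on this slide is as a single SQL query that is re-run until it returns no rows. The sketch below uses Python's `sqlite3` rather than Vertica, so it only illustrates the relational formulation, not the system in the talk. The table names `vertex` and `edge`, the integer sentinel `INF` standing in for inf, and the example edge list are assumptions; the column names `from_node` and `to_node` follow those visible in the query plan on slide 27.

```python
import sqlite3

INF = 10**9  # assumed sentinel standing in for "inf"

# join V1, E, V2; aggregate d' = min(V2.d + 1); keep rows with d' < V1.d
SSSP_STEP = """
SELECT v1.id, MIN(v2.d + 1) AS d_new
FROM vertex AS v1
JOIN edge AS e ON v1.id = e.to_node
JOIN vertex AS v2 ON v2.id = e.from_node
GROUP BY v1.id, v1.d
HAVING MIN(v2.d + 1) < v1.d
"""


def sssp_sql(edges, source):
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE vertex (id INTEGER PRIMARY KEY, d INTEGER)")
    con.execute("CREATE TABLE edge (from_node INTEGER, to_node INTEGER)")
    con.executemany("INSERT INTO edge VALUES (?, ?)", edges)
    ids = sorted({v for edge in edges for v in edge})
    con.executemany(
        "INSERT INTO vertex VALUES (?, ?)",
        [(v, 0 if v == source else INF) for v in ids],
    )
    while True:
        # All improved distances are computed before any of them is applied.
        updates = con.execute(SSSP_STEP).fetchall()
        if not updates:
            break
        con.executemany(
            "UPDATE vertex SET d = ? WHERE id = ?",
            [(d_new, vid) for vid, d_new in updates],
        )
    result = dict(con.execute("SELECT id, d FROM vertex"))
    con.close()
    return result


# Hypothetical example graph (directed edges).
sssp_distances = sssp_sql([(1, 2), (1, 3), (2, 4), (3, 4), (3, 5)], source=1)
```

Each pass through the loop plays the role of one superstep; the driver loop lives outside SQL here purely to keep the sketch short.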

  16. Rewriting Logical Giraph Plan
 • Plan 1: Giraph logical query plan
 • Plan 2: Pushing down the vertexCompute UDF
 • Plan 3: Replacing M by V ⋈ E
 • Connected Components as relational operators:
   - join V1, E, V2 on V1.id=E.to and V2.id=E.from
   - aggregate Γ: cc' = min(V2.id)
   - select σ: cc' < V1.cc
 [Figure: the three plans from slide 14, plus the Connected Components plan in which vertexCompute is replaced by the operators above.]
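The Connected Components plan can be sketched the same way with Python's `sqlite3`. Two assumptions differ from the slide and are worth flagging: the sketch aggregates the neighbor's current label `v2.cc` rather than `V2.id` (the two coincide in the first iteration, and using the label lets it travel more than one hop), and each undirected edge is stored in both directions. Table names, column names, and the example graph are likewise hypothetical.

```python
import sqlite3

# join V1, E, V2; aggregate cc' = min(neighbor label); keep rows with cc' < V1.cc
CC_STEP = """
SELECT v1.id, MIN(v2.cc) AS cc_new
FROM vertex AS v1
JOIN edge AS e ON v1.id = e.to_node
JOIN vertex AS v2 ON v2.id = e.from_node
GROUP BY v1.id, v1.cc
HAVING MIN(v2.cc) < v1.cc
"""


def connected_components_sql(undirected_edges):
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE vertex (id INTEGER PRIMARY KEY, cc INTEGER)")
    con.execute("CREATE TABLE edge (from_node INTEGER, to_node INTEGER)")
    # Assumption: store both directions so labels can flow either way.
    both_ways = undirected_edges + [(b, a) for a, b in undirected_edges]
    con.executemany("INSERT INTO edge VALUES (?, ?)", both_ways)
    ids = sorted({v for edge in undirected_edges for v in edge})
    # Every vertex starts in its own component, labeled by its id.
    con.executemany("INSERT INTO vertex VALUES (?, ?)", [(v, v) for v in ids])
    while True:
        updates = con.execute(CC_STEP).fetchall()
        if not updates:
            break
        con.executemany(
            "UPDATE vertex SET cc = ? WHERE id = ?",
            [(cc_new, vid) for vid, cc_new in updates],
        )
    result = dict(con.execute("SELECT id, cc FROM vertex"))
    con.close()
    return result


# Hypothetical example: two components, {1, 2, 3} and {4, 5}.
components = connected_components_sql([(1, 2), (2, 3), (4, 5)])
```

When the loop ends, every vertex carries the smallest vertex id of its component as its label.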

  17. Rewriting Logical Giraph Plan
 • Plan 1: Giraph logical query plan
 • Plan 2: Pushing down the vertexCompute UDF
 • Plan 3: Replacing M by V ⋈ E
 • PageRank as relational operators:
   - join V1, E, V2 on V1.id=E.to and V2.id=E.from
   - aggregate Γ: V1.r = 0.15/n + 0.85 * sum(V2.r / V2.outD)
 [Figure: the three plans from slide 14, plus the PageRank plan in which vertexCompute is replaced by the operators above.]
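The PageRank aggregate on this slide, V1.r = 0.15/n + 0.85 * sum(V2.r / V2.outD), maps directly onto a grouped join. The following `sqlite3` sketch is an illustration under assumptions: the table and column names are hypothetical, the iteration count is fixed rather than convergence-driven, and the example graph is chosen so that every vertex has at least one incoming and one outgoing edge (the slide's formula does not say how vertices without in-edges are handled, and this sketch does not handle them).

```python
import sqlite3

# join V1, E, V2; aggregate r' = 0.15/n + 0.85 * sum(V2.r / V2.outD)
PAGERANK_STEP = """
SELECT v1.id, 0.15 / :n + 0.85 * SUM(v2.r / v2.outd) AS r_new
FROM vertex AS v1
JOIN edge AS e ON v1.id = e.to_node
JOIN vertex AS v2 ON v2.id = e.from_node
GROUP BY v1.id
"""


def pagerank_sql(edges, iterations=20):
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE vertex (id INTEGER PRIMARY KEY, r REAL, outd INTEGER)")
    con.execute("CREATE TABLE edge (from_node INTEGER, to_node INTEGER)")
    con.executemany("INSERT INTO edge VALUES (?, ?)", edges)
    ids = sorted({v for edge in edges for v in edge})
    n = len(ids)
    out_degree = {v: 0 for v in ids}
    for src, _ in edges:
        out_degree[src] += 1
    con.executemany(
        "INSERT INTO vertex VALUES (?, ?, ?)",
        [(v, 1.0 / n, out_degree[v]) for v in ids],
    )
    for _ in range(iterations):
        # Compute every new rank from the old ranks, then write them back.
        new_ranks = con.execute(PAGERANK_STEP, {"n": n}).fetchall()
        con.executemany(
            "UPDATE vertex SET r = ? WHERE id = ?",
            [(r_new, vid) for vid, r_new in new_ranks],
        )
    result = dict(con.execute("SELECT id, r FROM vertex"))
    con.close()
    return result


# Hypothetical example: every vertex has in-edges and out-edges.
ranks = pagerank_sql([(1, 2), (1, 3), (2, 3), (3, 1)])
```

Because every vertex is rewritten in each iteration, this workload is a natural fit for replacing the whole vertex table instead of updating it row by row.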

  18. Rewriting Logical Giraph Plan
 • Giraph logical query plan
 • Pushing down the vertexCompute UDF
 • Replacing M by V ⋈ E
 • vertexCompute UDF as Table UDF
 • Replacing join with union
 • Labels under the plans: "Unmodified Vertex Compute Program, e.g. SGD" and "Optimized Unmodified Vertex Compute Program"
 [Figure: the plans from slide 17 (including the PageRank aggregate V1.r = 0.15/n + 0.85 * sum(V2.r / V2.outD)) next to two further variants. In the Table UDF variants, vertexCompute runs as a Table UDF over input sorted on V.pid; in the last variant the join of V, E, and M is replaced by a union (∪).]

  19. 2. Graph Query Optimization

  20. Leveraging Relational Query Optimizers
 • Multiple rule- or cost-based query rewritings are possible; pick the best one using an optimizer
 • No hard-coded physical execution plan
 • Several new optimizations proposed:
   - update vs replace
   - incremental evaluation
   - join elimination
   - alternate direction graph exploration

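  21. Inner Join Update
 • Output (Node, Value): (2, 1), (3, 1)
 • Updated Input (Node, Value): (1, 0), (2, 1), (3, 1), (4, inf), (5, inf)
 • Good for small number of updates!
 [Figure: the five-vertex example graph; the SSSP output is inner-joined with the input to update it in place.]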

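  22. Outer Join Replace
 • Input (Node, Value): (1, 0), (2, inf), (3, inf), (4, inf), (5, inf)
 • Output (Node, Value): (2, 1), (3, 1)
 • New Input (Node, Value): (1, 0), (2, 1), (3, 1), (4, inf), (5, inf)
 • Good for bulk updates!
 [Figure: the five-vertex example graph; the SSSP output is outer-joined with the input to produce a new input table.]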
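The two strategies on slides 21 and 22 can be contrasted on the tables shown on slide 22. The sketch below uses Python's `sqlite3`; the table names `vertex_input`, `step_output`, and `new_input`, the integer sentinel `INF` for inf, and the exact SQL are assumptions made for illustration, not the statements Vertica would run.

```python
import sqlite3

INF = 10**9  # assumed sentinel standing in for "inf"

# Input and Output tables as shown on slide 22.
INPUT_ROWS = [(1, 0), (2, INF), (3, INF), (4, INF), (5, INF)]
OUTPUT_ROWS = [(2, 1), (3, 1)]


def load(con):
    con.execute("CREATE TABLE vertex_input (id INTEGER PRIMARY KEY, value INTEGER)")
    con.execute("CREATE TABLE step_output (id INTEGER PRIMARY KEY, value INTEGER)")
    con.executemany("INSERT INTO vertex_input VALUES (?, ?)", INPUT_ROWS)
    con.executemany("INSERT INTO step_output VALUES (?, ?)", OUTPUT_ROWS)


def update_in_place():
    """Update: touch only the rows that also appear in the output."""
    con = sqlite3.connect(":memory:")
    load(con)
    con.execute("""
        UPDATE vertex_input
        SET value = (SELECT o.value FROM step_output AS o
                     WHERE o.id = vertex_input.id)
        WHERE id IN (SELECT id FROM step_output)
    """)
    rows = con.execute("SELECT id, value FROM vertex_input ORDER BY id").fetchall()
    con.close()
    return rows


def replace_table():
    """Replace: build a whole new input table with a left outer join."""
    con = sqlite3.connect(":memory:")
    load(con)
    con.execute("""
        CREATE TABLE new_input AS
        SELECT i.id AS id, COALESCE(o.value, i.value) AS value
        FROM vertex_input AS i
        LEFT OUTER JOIN step_output AS o ON i.id = o.id
    """)
    rows = con.execute("SELECT id, value FROM new_input ORDER BY id").fetchall()
    con.close()
    return rows


updated = update_in_place()
replaced = replace_table()
```

Both functions produce the same New Input rows; they differ in how much data is rewritten, which is the trade-off the two slides point at (few changed rows versus bulk changes).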

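  23. Incremental Computation
 • Inc. Input (Node, Value): (1, 0)
 • Output (Node, Value): (2, 1), (3, 1)
 • New Inc. Input (Node, Value): (2, 1), (3, 1)
 • Input (Node, Value): (1, 0), (2, inf), (3, inf), (4, inf), (5, inf)
 • New Input (Node, Value): (1, 0), (2, 1), (3, 1), (4, inf), (5, inf)
 [Figure: the five-vertex example graph; SSSP runs on the incremental input only, its output becomes the new incremental input, and an outer join merges the output into the full input.]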

  24. Incremental Computation
 • Inc. Input (Node, Value): (2, 1), (3, 1)
 • Output (Node, Value): (4, 2), (5, 2)
 • New Inc. Input (Node, Value): (4, 2), (5, 2)
 • Input (Node, Value): (1, 0), (2, 1), (3, 1), (4, inf), (5, inf)
 • New Input (Node, Value): (1, 0), (2, 1), (3, 1), (4, 2), (5, 2)
 • Faster Iteration Runtime!
 [Figure: the same pipeline as slide 23, one iteration later.]
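The incremental evaluation on slides 23 and 24 can be sketched by joining only the rows that changed in the previous iteration (the "Inc. Input") with the edge table. The `sqlite3` sketch below is an illustration under assumptions: the table name `frontier`, the sentinel `INF`, and the edge list are hypothetical (the edges were chosen so that the per-iteration tables match the ones on the slides), and the merge into the full table uses a plain UPDATE for brevity where the slides show an outer join.

```python
import sqlite3

INF = 10**9  # assumed sentinel standing in for "inf"

# Only vertices that changed last iteration (the frontier) send distances.
INC_STEP = """
SELECT v1.id, MIN(f.value + 1) AS value_new
FROM frontier AS f
JOIN edge AS e ON f.id = e.from_node
JOIN vertex AS v1 ON v1.id = e.to_node
GROUP BY v1.id, v1.value
HAVING MIN(f.value + 1) < v1.value
"""


def incremental_sssp(edges, source):
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE vertex (id INTEGER PRIMARY KEY, value INTEGER)")
    con.execute("CREATE TABLE frontier (id INTEGER PRIMARY KEY, value INTEGER)")
    con.execute("CREATE TABLE edge (from_node INTEGER, to_node INTEGER)")
    con.executemany("INSERT INTO edge VALUES (?, ?)", edges)
    ids = sorted({v for edge in edges for v in edge})
    con.executemany(
        "INSERT INTO vertex VALUES (?, ?)",
        [(v, 0 if v == source else INF) for v in ids],
    )
    frontier = [(source, 0)]
    history = []  # the incremental input used in each iteration
    while frontier:
        history.append(frontier)
        con.execute("DELETE FROM frontier")
        con.executemany("INSERT INTO frontier VALUES (?, ?)", frontier)
        output = sorted(con.execute(INC_STEP).fetchall())
        con.executemany(
            "UPDATE vertex SET value = ? WHERE id = ?",
            [(value, vid) for vid, value in output],
        )
        frontier = output  # the output becomes the next incremental input
    values = dict(con.execute("SELECT id, value FROM vertex"))
    con.close()
    return values, history


# Hypothetical edges chosen to reproduce the tables on slides 23 and 24.
inc_values, inc_history = incremental_sssp(
    [(1, 2), (1, 3), (2, 4), (3, 4), (3, 5)], source=1
)
```

Each iteration scans the small `frontier` table instead of the full vertex table on the sending side, which is where the shorter iteration runtime would come from.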

  25. 3. Column Store Backends

  26. Why could column stores be a good choice?
 • Modern column stores provide several features
   - physical design
   - join optimizations
   - query pipelining
   - intra-query parallelism
 • For more details, pick your favorite column store papers:
   - MonetDB [Database Architecture Evolution: Mammals Flourished long before Dinosaurs became Extinct, Peter A. Boncz et al., PVLDB 2009.]
   - C-Store [C-Store: A Column-oriented DBMS, Mike Stonebraker et al., VLDB 2005.]
   - Vertica [The Vertica Analytic Database: C-Store 7 Years Later, Andrew Lamb et al., VLDB 2012.]

  27. Illustration: Vertica Query Plan for SSSP
 • Early filtering using sideways information passing
 • Fully pipelined query execution
 • Picks the right join strategies, e.g. broadcast
 [Figure: Vertica query plan for SSSP on executor node3, over twitter_edge (from_node, to_node) and twitter_node_b0 (id, value). Visible operators include ScanStep and StorageUnionStep on twitter_node_b0, ScanStep on twitter_edge with SIP filters (SIP1(MergeJoin): e.to_node, SIP2(HashJoin): e.from_node), a Hash-Join using twitter_edge and twitter_node_b0, a Merge-Join using the previous join and twitter_node_b0, ExprEval of e.to_node, (n1.value + 1), n2.value, GroupByPipe with aggregates min((n1.value + 1)) and min(n2.value), StorageMergeStep, FilterStep (<SVAR> < <SVAR>), and Send/Recv exchanges among node0..node3.]

  28. 4. Comparison with Specialized Graph Systems
