Graph Analytics using Vertica Relational Database
Alekh Jindal* (Microsoft), Samuel Madden (MIT), Malú Castellanos (Vertica), Meichun Hsu (Vertica)
* work done while at MIT
Motivation for graphs on a DB: the data is anyway in a DB, so we avoid an expensive translation step. But how do the numbers look?
Vertex-centric graph analytics: vertices compute in supersteps and communicate via message passing.

[Figure: SSSP animation on a 5-node example graph (nodes 1-5). Distances start at inf, the source's neighbors drop to 1 in the first superstep, their neighbors drop to 2 in the next, and then the values converge.]
Users write a vertex compute program; the system runs it in parallel over the vertices.
Giraph: a graph analytics system on Hadoop, with a hard-coded physical execution pipeline.
[Figure: Giraph's physical execution pipeline. The input graph G=(V,E) is read from HDFS, split, scanned (Scan, RecRead), and shuffled across workers W1..W4. In each superstep, every worker runs vertexCompute over its partition store, edge store, and message store, shuffles the resulting messages, and synchronizes with the Master. After the final superstep, the output G'=(V',E') is written back to HDFS.]
Expressing vertexCompute with relational operators:
1. Giraph logical query plan: run vertexCompute over V joined with E (on V.id = E.from) and with M (on V.id = M.to), producing V' ∪ M' (modified vertices and new messages).
2. Pushing down the vertexCompute UDF below the joins.
3. Replacing M by V join E: instead of materializing messages, recompute them from neighbor state via a self-join of V with itself through E (V1.id = E.to, V2.id = E.from), followed by an aggregate Γ.

Single Source Shortest Path: the aggregate becomes d' = min(V2.d + 1).
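As a minimal SQL sketch of that plan, assuming tables node(id, value) and edge(from_node, to_node) (names as in the physical plan shown later in the deck), one SSSP superstep is roughly:

    -- One SSSP superstep: each vertex takes the minimum of its
    -- in-neighbors' distances plus one (unit edge weights assumed).
    SELECT e.to_node AS id, MIN(n.value + 1) AS value
    FROM node n
    JOIN edge e ON n.id = e.from_node
    GROUP BY e.to_node;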
Connected Components: the same plan shape, with the aggregate cc' = min(V2.id), applying the update only where cc' < V1.cc.
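A corresponding sketch (component labels kept in node.cc and initialized to the vertex id; all table and column names are assumptions):

    -- One connected-components superstep: take the smallest component
    -- label among in-neighbors, keeping only genuine improvements.
    SELECT v1.id, u.cc AS new_cc
    FROM node v1
    JOIN (SELECT e.to_node AS id, MIN(v2.cc) AS cc
          FROM node v2
          JOIN edge e ON v2.id = e.from_node
          GROUP BY e.to_node) u
      ON v1.id = u.id
    WHERE u.cc < v1.cc;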
V E M V’ U M’
vertexCompute
γ
V V.id=E.from V.id=M.to vertexCompute
V M V’ U M’
vertexCompute
V
V.id=E.from V.id=M.to
E E M
vertexCompute
V
V.id=M.to
M’ V’ V
V.id=E.from
E
vertexCompute
V1.id=E.to
V’ V1 V2
V2.id=E.from
1 2 3
Giraph logical query plan Pushing down the vertexCompute UDF Replacing M by V E
V’ E
V1
V1.id=E.to
V1 V2
V2.id=E.from
Γ
V1.r=0.15/n+0.85* sum(V2.r/V2.outD)
PageRank
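A sketch of one PageRank iteration under the same assumed schema, with node(id, r, outdegree); vertices with no in-edges are omitted for brevity:

    -- One PageRank superstep with damping factor 0.85.
    SELECT e.to_node AS id,
           0.15 / (SELECT COUNT(*) FROM node)
             + 0.85 * SUM(v2.r / v2.outdegree) AS r
    FROM node v2
    JOIN edge e ON v2.id = e.from_node
    GROUP BY e.to_node;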
V E M V’ U M’
vertexCompute
γ
V V.id=E.from V.id=M.to vertexCompute
V M V’ U M’
vertexCompute
V
V.id=E.from V.id=M.to
E E M
vertexCompute
V
V.id=M.to
M’ V’ V
V.id=E.from
E
vertexCompute
V1.id=E.to
V’ V1 V2
V2.id=E.from
1 2 3
Giraph logical query plan Pushing down the vertexCompute UDF Replacing M by V E
V’ E
V1
V1.id=E.to
V1 V2
V2.id=E.from
Γ
V1.r=0.15/n+0.85* sum(V2.r/V2.outD)
V E M V’ U M’
vertexCompute
γ
V V.id=E.from V.id=M.to vertexCompute
Two further rewrites: (2) running the vertexCompute UDF as a Table UDF, and (3) replacing the three-way join with a union: E ∪ M is sorted and streamed into the Table UDF, which emits V' ∪ M'. The Table UDF can host an unmodified vertex compute program (e.g. SGD) or an optimized one. Many physical plans are possible; pick the best one using an optimizer.
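To make rewrite (2) concrete: in Vertica, a Table UDF is a user-defined transform invoked with an OVER clause. A hypothetical invocation is sketched below; the vertexCompute function and all table and column names are assumptions:

    -- Stream the union of edges and messages into the transform,
    -- partitioned so each vertex's input arrives together.
    SELECT vertexCompute(id, payload)
           OVER (PARTITION BY id)
    FROM (SELECT e.from_node AS id, e.to_node AS payload FROM edge e
          UNION ALL
          SELECT m.to_node AS id, m.value AS payload FROM message m) t;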
Applying an iteration's updates to the vertex table (SSSP on the example graph, source node 1):

Inner Join: join the output with the input and rewrite only the matching rows. Good for a small number of updates!
  Input  (Node: Value): 1: 0, 2: inf, 3: inf, 4: inf, 5: inf
  Output (Node: Value): 2: 1, 3: 1
  Updated Input: 1: 0, 2: 1, 3: 1, 4: inf, 5: inf

Outer Join: outer-join the input with the output and write out a whole new input table. Good for bulk updates!
  New Input (Node: Value): 1: 0, 2: 1, 3: 1, 4: inf, 5: inf

Incremental input: besides the new input, keep the delta as New Inc. Input (2: 1, 3: 1) and feed only the changed vertices into the next iteration, which emits 4: 2 and 5: 2 and yields New Input 1: 0, 2: 1, 3: 1, 4: 2, 5: 2. Faster iteration runtime!
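A minimal SQL sketch of the outer-join variant; the table names node and updates are assumptions, and inf is encoded as a large sentinel value:

    -- Build the next vertex table: keep the smaller of the
    -- old and new distance for every vertex.
    CREATE TABLE node_next AS
    SELECT COALESCE(n.id, u.id) AS id,
           LEAST(COALESCE(u.value, n.value),
                 COALESCE(n.value, u.value)) AS value
    FROM node n
    FULL OUTER JOIN updates u ON n.id = u.id;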
[Database Architecture Evolution: Mammals Flourished long before Dinosaurs became Extinct, Peter Boncz et al., VLDB 2009.]
[C-Store: A Column-oriented DBMS, Mike Stonebraker et al., VLDB 2005.]
[The Vertica Analytic Database: C-Store 7 Years Later, Andrew Lamb et al., VLDB 2012.]
Vertica's optimizer brings sideways information passing (SIP) and distributed execution strategies, e.g. broadcast.
Vertica physical plan for one SSSP iteration (executor node3):

  Root (OutBlk=[UncTuple(2)])
  NewEENode (OutBlk=[UncTuple(2)])
    ExprEval: e.to_node, <SVAR>
    Recv from: node0,node1,node2,node3 / Send to: node0
    FilterStep: (<SVAR> < <SVAR>)
    GroupByPipe: 1 keys; Aggs: min((n1.value + 1)), min(n2.value)
    StorageMergeStep: twitter_edge; 1 sorted
    GroupByPipe: 1 keys; Aggs: min((n1.value + 1)), min(n2.value)
      ExprEval: e.to_node, (n1.value + 1), n2.value
      Join: Merge-Join, using previous join and twitter_node_b0
        Join: Hash-Join, using twitter_edge and twitter_node_b0
          ScanStep: twitter_edge (to_node (not emitted), from_node)
            SIP2(HashJoin): e.from_node; SIP1(MergeJoin): e.to_node
          Recv from: node0,node1,node2,node3 / Send to: node0,node1,node2,node3
            StorageUnionStep: twitter_node_b0
              ScanStep: twitter_node_b0 (id, value)
        StorageMergeStep: twitter_node_b0; 1 sorted
          ScanStep: twitter_node_b0 (id, value)
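A query of roughly the following shape would yield such a plan; the exact SQL is an assumption, reconstructed from the plan's joins, aggregates, and filter:

    -- SSSP step over the Twitter tables: propose new distances via n1,
    -- compare against current distances via n2, keep only improvements.
    SELECT e.to_node,
           MIN(n1.value + 1) AS new_value
    FROM twitter_edge e
    JOIN twitter_node_b0 n1 ON n1.id = e.from_node
    JOIN twitter_node_b0 n2 ON n2.id = e.to_node
    GROUP BY e.to_node
    HAVING MIN(n1.value + 1) < MIN(n2.value);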
End-to-end runtime in seconds, Twitter graph: 1.4 billion edges, 41.6 million nodes:

  System      PR    SSSP   CC
  GraphLab    378   279    644
  Giraph      438   304    1,041
  Vertica     463   392    494
Table IV. 1-hop neighborhood analyses, time in seconds:

  Query           Dataset             Vertica    Giraph
  Strong Overlap  Youtube             259.56     230.01
  Strong Overlap  LiveJournal-undir   381.05
  Weak Ties       Youtube             746.14
  Weak Ties       LiveJournal-undir   1,475.99
Twitter graph with synthetic metadata, time in seconds:

  Query Type                        Algorithm  Vertica  Giraph    SpeedUp
  Sub-graph Projection & Selection  PR         55.6     954.6     17.2
  Sub-graph Projection & Selection  SSSP       101.3    405.5     4.0
  Graph Analysis Aggregation        PR         643.9    1,089.7   1.7
  Graph Analysis Aggregation        SSSP       279.8    349.9     1.3
  Graph Joins                       PR+SSSP    927.0    1,435.9   1.5
[Figure: runtime breakdown into Load/Store and Iterations for PR, SSSP, and CC under GraphLab, Giraph, and Vertica. Twitter graph: 1.4 billion edges, 41.6 million nodes.]
[Figure: memory footprint in GB (scale 16/32/48) for Giraph, GraphLab, and Vertica. Twitter graph: 1.4 billion edges, 41.6 million nodes.]
[Figure: disk I/O in GB (Read, Write, Total) for GraphLab, Giraph, and Vertica; reported values include 13.3, 8.6, 1.2, 24.8, 0.7, and 24.4 GB. Twitter graph: 1.4 billion edges, 41.6 million nodes.]
Running vertexCompute as a Table UDF, two ways:
1. Disk-based iterations: join V and E into VE, sort by V.pid, union in the messages, and stream each partition through the Vertex Compute Table UDF, which emits V' ∪ M'; a synchronization step then applies the updates to V'.
2. In-memory iterations, described next.
In-memory iterations: keep the graph in memory, so each iteration incurs no additional I/Os; run everything in a single transaction to reduce overheads such as logging, locking, and buffer lookups; the table UDFs communicate via shared memory.
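For contrast with the in-memory path, a minimal sketch of one disk-based superstep driven from plain SQL (SSSP again; table names are assumptions):

    -- Compute the next distance table, then swap it in for the old one.
    CREATE TABLE node_next AS
    SELECT id, MIN(value) AS value
    FROM (SELECT id, value FROM node
          UNION ALL
          SELECT e.to_node AS id, n.value + 1 AS value
          FROM node n JOIN edge e ON n.id = e.from_node) t
    GROUP BY id;
    DROP TABLE node;
    ALTER TABLE node_next RENAME TO node;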
[Figure: PageRank and Shortest Path runtimes in seconds for Vertica (UDF + Disk), Vertica (SQL + Disk), and Vertica (UDF + Shared Memory); reported values: 16.50, 29.63, 54.42, 98.00, 146.29, 1,072.33. LiveJournal graph: 69 million edges, 4.8 million nodes.]
Runtime breakdown in seconds, LiveJournal graph: 69 million edges, 4.8 million nodes:

                         PR     SSSP   CC
  GraphLab algorithm     7.0    1.6    14.2
  Vertica algorithm      6.6    1.4    11.5
  GraphLab load/store    16.5   14.8   15.4
  Vertica load/store     49.0   50.7   49.1
Time in seconds, Twitter graph: 1.4 billion edges, 41.6 million nodes:

  System                   Load/Store   Algorithm
  GraphLab (4 nodes)       44.2         277.0
  Giraph (4 nodes)         101.5        290.9
  Vertica SQL (4 nodes)    25.5         161.8
  Vertica Mem (1 node)                  366.6
Takeaways: graph analytics performs well in a relational database such as Vertica. Like other vertical applications, e.g., XML, RDF, Spatial, Array, etc., graph analytics can live in relational databases!