SLIDE 1

Graph Analytics using Vertica Relational Database

Alekh Jindal* (Microsoft), Samuel Madden (MIT), Malú Castellanos (Vertica), Meichun Hsu (Vertica)

* work done while at MIT

SLIDE 2

Motivation for graphs on DB

  • Data is already in a DB
    • avoid expensive copying
    • enable end-to-end data analysis
    • leverage other DB features
  • Processing involves full scans and joins
    • relational engines can run them efficiently
    • particularly well suited to column stores
  • Relational algebra/SQL offers a powerful declarative syntax
    • in fact, we can express Giraph as an operator DAG
    • we can even express more complex graph analytics
SLIDE 3

5-point Agenda

  • From graph queries to SQL: how do we make the translation?
  • Graph query optimization: can we leverage decades of relational wisdom?
  • Column store backends: why are they a good choice?
  • Comparison with specialized graph systems: how do the numbers look?
  • Extending column stores: can we do better?
SLIDE 4
  • 1. From Graph to SQL
SLIDE 5

Vertex-centric Graph Queries

  • Popular language for graph analytics
  • Vertex programs that run in supersteps and communicate via message passing

[Figure: example graph of five vertices (1-5) used to illustrate single-source shortest paths.]

SLIDE 6
SLIDE 7
SLIDE 8
SLIDE 9

(Slides 6-9 animate the example above: in each superstep, vertices receive distance values from their neighbors as messages, keep the minimum, and propagate updates until the distances stop changing.)

SLIDE 10

Vertex-centric Graph Queries

  • Popular language for graph analytics
  • Vertex programs that run in supersteps and communicate via message passing
  • Programmer only specifies a vertex program
  • System takes care of running it in parallel

SLIDE 11

The Giraph Plan

  • Giraph: a popular, open-source graph analytics system on Hadoop

SLIDE 12

The Giraph Plan

  • Giraph: a popular, open-source graph analytics system on Hadoop
  • The Giraph physical plan: a hard-coded physical execution pipeline

[Figure: the Giraph physical execution pipeline. An input superstep splits G=(V,E) from HDFS across workers W1-W4 (scan, record read, shuffle) into per-worker partition, vertex, and edge stores. Each superstep then runs vertexCompute on every worker, shuffles messages into the message stores, and synchronizes with the master. A final output superstep cleans up and writes G'=(V',E') back to HDFS.]

SLIDE 13

The Giraph Plan

  • Giraph: a popular, open-source graph analytics system on Hadoop
  • The Giraph physical plan: a hard-coded physical execution pipeline
  • The Giraph logical query plan can be expressed with relational operators

[Figure: the Giraph logical query plan. Vertices V, edges E, and messages M are joined (V.id = E.from, V.id = M.to), grouped per vertex (γ), and fed to the vertexCompute UDF, which emits the modified vertices V' and the new messages M' (V' ∪ M').]

SLIDE 14

V M V’ U M’

vertexCompute

γ

V

V.id=E.from V.id=M.to

E E M

vertexCompute

γ

V

V.id=M.to

M’ V’ V

V.id=E.from

E

vertexCompute

γV1

V1.id=E.to

V’ V1 V2

V2.id=E.from

1 2 3

Giraph logical
 query plan Pushing down the 
 vertexCompute UDF Replacing M by V E

Rewriting Logical Giraph Plan

V E M V’ U M’

vertexCompute

γ

V V.id=E.from V.id=M.to

slide-15
SLIDE 15

V M V’ U M’

vertexCompute

γ

V

V.id=E.from V.id=M.to

E E M

vertexCompute

γ

V

V.id=M.to

M’ V’ V

V.id=E.from

E

vertexCompute

γV1

V1.id=E.to

V’ V1 V2

V2.id=E.from

1 2 3

Giraph logical
 query plan Pushing down the 
 vertexCompute UDF Replacing M by V E

Rewriting Logical Giraph Plan

V E M V’ U M’

vertexCompute

γ

V V.id=E.from V.id=M.to

V’ E

γ

V1

V1.id=E.to

V1 V2

V2.id=E.from

Γ

d’=min(V2.d+1)

σd’<V1.d

Single Source Shortest Path

vertexCompute

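The rewritten plan above translates almost one-to-one into SQL. Below is a minimal sketch of one SSSP iteration, assuming a vertex table node(id, value) that holds the current distances and an edge table edge(from_node, to_node); the table and column names are illustrative (they mirror the Vertica plan shown later in the deck), not the paper's exact schema:

    -- One SSSP superstep as join + aggregation, following the rewritten plan:
    -- join destination vertices (v1) and source vertices (v2) through the edges,
    -- compute the candidate distance d' = min(v2.value + 1) per destination,
    -- and keep only the vertices whose distance actually improves.
    CREATE TABLE node_update AS
    SELECT e.to_node         AS id,
           MIN(v2.value + 1) AS value
    FROM edge e
    JOIN node v2 ON v2.id = e.from_node   -- "message senders"
    JOIN node v1 ON v1.id = e.to_node     -- "message receivers"
    GROUP BY e.to_node, v1.value
    HAVING MIN(v2.value + 1) < v1.value;  -- the filter d' < V1.d

The update-vs-replace slides in the optimization section show two ways of folding node_update back into node for the next iteration.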
SLIDE 16

V M V’ U M’

vertexCompute

γ

V

V.id=E.from V.id=M.to

E E M

vertexCompute

γ

V

V.id=M.to

M’ V’ V

V.id=E.from

E

vertexCompute

γV1

V1.id=E.to

V’ V1 V2

V2.id=E.from

1 2 3

Giraph logical
 query plan Pushing down the 
 vertexCompute UDF Replacing M by V E

V’ E

γ

V1

V1.id=E.to

V1 V2

V2.id=E.from

Γ

cc’=min(V2.id)

σ

cc’<V1.cc

Connected Components

Rewriting Logical Giraph Plan

V E M V’ U M’

vertexCompute

γ

V V.id=E.from V.id=M.to vertexCompute

slide-17
SLIDE 17

V M V’ U M’

vertexCompute

γ

V

V.id=E.from V.id=M.to

E E M

vertexCompute

γ

V

V.id=M.to

M’ V’ V

V.id=E.from

E

vertexCompute

γV1

V1.id=E.to

V’ V1 V2

V2.id=E.from

1 2 3

Giraph logical
 query plan Pushing down the 
 vertexCompute UDF Replacing M by V E

V’ E

γ

V1

V1.id=E.to

V1 V2

V2.id=E.from

Γ

V1.r=0.15/n+0.85*
 sum(V2.r/V2.outD)

PageRank

Rewriting Logical Giraph Plan

V E M V’ U M’

vertexCompute

γ

V V.id=E.from V.id=M.to vertexCompute

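For comparison, here is a minimal SQL sketch of one PageRank iteration under the same assumed schema, with node carrying the current rank r and the out-degree outdegree of each vertex; again an illustration of the rewritten plan rather than the paper's exact code:

    -- One PageRank iteration: each vertex gathers r/outdegree from its
    -- in-neighbors and combines the contributions with the damping factor 0.85.
    CREATE TABLE node_next AS
    SELECT v1.id,
           0.15 / (SELECT COUNT(*) FROM node)            -- 0.15 / n
             + 0.85 * SUM(v2.r / v2.outdegree) AS r
    FROM edge e
    JOIN node v2 ON v2.id = e.from_node   -- contributing neighbors
    JOIN node v1 ON v1.id = e.to_node     -- vertex being updated
    GROUP BY v1.id;

Vertices with no incoming edges drop out of this inner-join formulation; the outer-join replace strategy discussed in the next section is one way to keep them.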
SLIDE 18

V M V’ U M’

vertexCompute

γ

V

V.id=E.from V.id=M.to

E E M

vertexCompute

γ

V

V.id=M.to

M’ V’ V

V.id=E.from

E

vertexCompute

γV1

V1.id=E.to

V’ V1 V2

V2.id=E.from

1 2 3

Giraph logical
 query plan Pushing down the 
 vertexCompute UDF Replacing M by V E

V’ E

γ

V1

V1.id=E.to

V1 V2

V2.id=E.from

Γ

V1.r=0.15/n+0.85*
 sum(V2.r/V2.outD)

V E M V’ U M’

vertexCompute

γ

V V.id=E.from V.id=M.to vertexCompute

Rewriting Logical Giraph Plan

2 3

vertexCompute UDF as Table UDF Replacing join with union

.id=M.to

V M V’ U M’

γV.pid

V.id=E.from V.id=M.to

E

sort

vertexCompute

V M V’ U M’

γV.pid

E

sort

vertexCompute

U

Table UDF Table UDF

Unmodified Vertex Compute Program, e.g. SGD Optimized Unmodified Vertex Compute Program

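For vertex programs that cannot be rewritten into plain SQL (such as SGD), the plan above keeps the program inside a table UDF. The sketch below shows how such a transform function could be invoked in Vertica-style SQL; vertexCompute, vem_union, and the column list are assumptions for illustration, not an actual API:

    -- The union of vertex, edge, and message tuples (here a hypothetical
    -- view vem_union) is partitioned by the vertex partition id and sorted,
    -- and each partition is handed to one invocation of the table UDF,
    -- which runs the unmodified vertex program and emits V' ∪ M'.
    SELECT vertexCompute(rec_type, id, value, dst, msg)
           OVER (PARTITION BY pid ORDER BY id, rec_type)
    FROM vem_union;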
SLIDE 19
  • 2. Graph Query Optimization
SLIDE 20

Leveraging Relational Query Optimizers

  • Multiple rule- or cost-based query rewritings are possible; pick the best one using an optimizer
  • No hard-coded physical execution plan
  • Several new optimizations proposed:
    • update vs. replace
    • incremental evaluation
    • join elimination
    • alternate-direction graph exploration
SLIDE 21

Inner Join Update

[Figure: SSSP example. The output of an iteration (e.g. nodes 2 and 3 now at distance 1) is inner-joined with the input vertex table so that only the changed vertices are updated in place.]

Good for a small number of updates! (A SQL sketch follows below.)

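A minimal sketch of the update strategy, assuming the node table and the per-iteration node_update table from the SSSP sketch earlier; the joined-UPDATE syntax varies slightly across engines:

    -- Update strategy: write back only the vertices that changed, by joining
    -- the existing vertex table with the iteration output.
    UPDATE node
    SET value = node_update.value
    FROM node_update
    WHERE node.id = node_update.id;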
SLIDE 22

Outer Join Replace

[Figure: SSSP example. The output of an iteration is left-outer-joined with the input vertex table to build a completely new vertex table (unchanged vertices keep their old values), which replaces the input for the next iteration.]

Good for bulk updates! (A SQL sketch follows below.)

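The corresponding sketch of the replace strategy, under the same assumed tables:

    -- Replace strategy: build a brand-new vertex table with a left outer join,
    -- taking the updated value where one exists and the old value otherwise.
    CREATE TABLE node_next AS
    SELECT n.id,
           COALESCE(u.value, n.value) AS value
    FROM node n
    LEFT OUTER JOIN node_update u ON u.id = n.id;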
SLIDE 23

Incremental Computation

[Figure: SSSP example. In addition to the full input, the system keeps an incremental input holding only the vertices that changed in the previous iteration (here nodes 2 and 3 at distance 1). The outer-join replace produces both the new full input and the new incremental input.]

SLIDE 24

Incremental Computation

[Figure: the next iteration of the same example. Only the incremental input is joined with the edges, producing just the newly changed vertices (here nodes 4 and 5 at distance 2), which again yield the new full input and the new incremental input.]

Faster iteration runtime! A SQL sketch of an incremental iteration follows below.

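A minimal sketch of an incremental SSSP iteration, assuming a delta table node_delta(id, value) that holds only the vertices changed in the previous iteration (illustrative, not the paper's exact code):

    -- Expand only from the vertices that changed last time, and keep the
    -- improvements as the next delta.
    CREATE TABLE node_delta_next AS
    SELECT e.to_node        AS id,
           MIN(d.value + 1) AS value
    FROM node_delta d
    JOIN edge e  ON e.from_node = d.id
    JOIN node v1 ON v1.id = e.to_node
    GROUP BY e.to_node, v1.value
    HAVING MIN(d.value + 1) < v1.value;
    -- node_delta_next is then folded into node with the outer-join replace
    -- (or the inner-join update) and becomes node_delta for the next round.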
SLIDE 25
  • 3. Column Store Backends
SLIDE 26

Why could column stores be a good choice?

  • Modern column stores provide several useful features:
    • physical design
    • join optimizations
    • query pipelining
    • intra-query parallelism
  • For more details, pick your favorite column store papers:
    • MonetDB [Database Architecture Evolution: Mammals Flourished long before Dinosaurs became Extinct, Peter A. Boncz et al., PVLDB 2009]
    • C-Store [C-Store: A Column-oriented DBMS, Mike Stonebraker et al., VLDB 2005]
    • Vertica [The Vertica Analytic Database: C-Store 7 Years Later, Andrew Lamb et al., VLDB 2012]
SLIDE 27

Illustration: Vertica Query Plan for SSSP

  • Early filtering using sideways information passing (SIP)
  • Fully pipelined query execution
  • Picks the right join strategies, e.g. broadcast

[Figure: Vertica execution plan for one SSSP iteration. The edge table (twitter_edge) is hash-joined with the vertex table (twitter_node_b0) on e.from_node and merge-joined with it again on e.to_node, with SIP filters pushed into the edge scan. A pipelined group-by computes min(n1.value + 1) and min(n2.value) per e.to_node, followed by a filter that keeps only improved values, with partial results exchanged across the four nodes.]

SLIDE 28
  • 4. Comparison with Specialized Graph Systems
SLIDE 29

Setup

  • Systems:
    • Vertica
    • Giraph
    • GraphLab
  • Datasets:
    • directed (Twitter, LiveJournal)
    • undirected (Youtube, LiveJournal)
  • Machines:
    • 4 machines (12 cores, 48GB memory, 1.4TB disk)
  • Data preparation (see the loading sketch below):
    • upload time [Vertica: 916s; GraphLab: 472s; Giraph: 268s]
    • disk usage [Vertica: 10GB; GraphLab/Giraph: 73GB]
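A hedged sketch of how the graph could be laid out and bulk-loaded in Vertica, consistent with the twitter_node/twitter_edge tables that appear in the query plan on slide 27; the file path, sort orders, and segmentation are illustrative choices, not necessarily the exact setup used in the experiments:

    -- Vertex and edge tables, sorted and segmented so joins on vertex id can be
    -- merge-friendly and local to a node (illustrative physical design).
    CREATE TABLE twitter_node (id INT, value FLOAT)
        ORDER BY id
        SEGMENTED BY HASH(id) ALL NODES;

    CREATE TABLE twitter_edge (from_node INT, to_node INT)
        ORDER BY to_node, from_node
        SEGMENTED BY HASH(to_node) ALL NODES;

    -- Bulk-load the edge list from a delimited file.
    COPY twitter_edge FROM '/data/twitter_edges.csv' DELIMITER ',' DIRECT;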
SLIDE 30

Typical Graph Analytics

[Bar chart: runtime in seconds of PageRank (PR), SSSP, and Connected Components (CC) on GraphLab, Giraph, and Vertica.]

Twitter graph: 1.4 billion edges, 41.6 million nodes

SLIDE 31

Advanced Graph Analytics

  • Mixing graph and relational queries
  • Multi-hop neighborhood queries (see the SQL sketch below)

1-hop neighborhood analytics, runtime in seconds:

  Query           Dataset            Vertica    Giraph
  Strong Overlap  Youtube            259.56     230.01
  Strong Overlap  LiveJournal-undir  381.05     out of memory
  Weak Ties       Youtube            746.14     out of memory
  Weak Ties       LiveJournal-undir  1,475.99   out of memory

Mixed graph + relational queries, runtime in seconds:

  Query                             Type      Vertica   Giraph    Speedup
  Sub-graph Projection & Selection  PR        55.6      954.6     17.2
  Sub-graph Projection & Selection  SSSP      101.3     405.5     4.0
  Graph Analysis Aggregation        PR        643.9     1,089.7   1.7
  Graph Analysis Aggregation        SSSP      279.8     349.9     1.3
  Graph Joins                       PR+SSSP   927.0     1,435.9   1.5

Twitter graph with synthetic metadata

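As referenced above, a minimal sketch of a 1-hop neighborhood query that mixes graph traversal with a relational predicate; the node_meta table and its age column stand in for the synthetic metadata and are purely illustrative:

    -- For each vertex, count its direct in-neighbors that satisfy a
    -- relational filter on the (synthetic) vertex metadata.
    SELECT e.to_node AS id,
           COUNT(*)  AS matching_neighbors
    FROM twitter_edge e
    JOIN node_meta m ON m.id = e.from_node   -- hypothetical metadata table
    WHERE m.age < 25                         -- illustrative predicate
    GROUP BY e.to_node;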
SLIDE 32

Detailed Analysis: Cost Breakdown

[Bar chart: runtime in seconds of PR, SSSP, and CC on GraphLab, Giraph, and Vertica, broken down into load/store time and iteration time.]

Twitter graph: 1.4 billion edges, 41.6 million nodes

SLIDE 33

Detailed Analysis: Memory Footprint

[Chart: memory footprint over time, in GB (0-48), for Giraph, GraphLab, and Vertica.]

Twitter graph: 1.4 billion edges, 41.6 million nodes

SLIDE 34

Detailed Analysis: I/O Footprint

[Chart: read, write, and total I/O in GB for GraphLab, Giraph, and Vertica.]

Twitter graph: 1.4 billion edges, 41.6 million nodes

SLIDE 35

Problem: significantly high I/O

Can we do better?

SLIDE 36
  • 5. Extending Column Stores
SLIDE 37

Rewriting Graph Query Plan (Yet again!)

[Figure: two variants of the table-UDF plan.
 1. Disk-based iterations: each iteration groups the union of V, E, and M by partition (γ V.pid), sorts it, and calls the vertexCompute table UDF, writing V' ∪ M' back to disk.
 2. In-memory iterations: a single table UDF invocation receives V and E once (grouped by V.pid and sorted), runs the vertex compute, synchronization, and updates for all iterations internally, and only writes out the final V'.]

SLIDE 38

Trading Memory for I/O

[Figure: the in-memory-iterations plan from the previous slide.]

In-memory iterations:

  • Load the data once and keep it in main memory: no additional I/O for each iteration
  • Run all iterations as a single transaction: reduces overheads such as logging, locking, and buffer lookups
  • Run unmodified vertex programs via table UDFs
  • Communicate (message passing) via shared memory

SLIDE 39

Comparing Different Implementations in Vertica

[Bar chart: runtime in seconds of PageRank and Shortest Path for Vertica (UDF + Disk), Vertica (SQL + Disk), and Vertica (UDF + Shared Memory).]

LiveJournal graph: 69 million edges, 4.8 million nodes

SLIDE 40

Comparison with GraphLab

[Bar chart: runtime in seconds of PR, SSSP, and CC for GraphLab and Vertica, broken down into load/store time and algorithm time.]

LiveJournal graph: 69 million edges, 4.8 million nodes

SLIDE 41

Scaling to larger graphs

[Bar chart: runtime in seconds, broken down into load/store time and algorithm time, for GraphLab (4 nodes), Giraph (4 nodes), Vertica SQL (4 nodes), and Vertica in-memory (1 node).]

Twitter graph: 1.4 billion edges, 41.6 million nodes

SLIDE 42

Conclusion

  • Efficient graph analytics is possible within column stores such as Vertica
    • graph queries can be mapped to SQL
    • several query optimizations can be applied
    • column stores serve as efficient backends
    • column stores can be extended to trade memory for I/O
  • The curious case of relational database re-discovery
    • relational databases have repeatedly re-emerged as the backend for new data models and applications, e.g., XML, RDF, Spatial, Array
    • cycles of branch-innovate-merge-commit
  • Next time you have a big data problem, try a relational database!