Graphs On Databases, by Alekh Jindal, Sam Madden, and Mike Stonebraker (PowerPoint presentation)

slide-1
SLIDE 1

Graphs On Databases

Alekh Jindal

CSAIL, MIT

Sam Madden Mike Stonebraker

slide-2
SLIDE 2
slide-3
SLIDE 3

+

slide-4
SLIDE 4
slide-5
SLIDE 5

=

slide-6
SLIDE 6
slide-7
SLIDE 7

AllegroGraph, Jena, Neo4j, HyperGraphDB, TAO, FlockDB, Trinity, Twister, HaLoop, PrIter, Pegasus, Pregel, Giraph, GraphLab, Titan, DEX, GraphBase

slide-8
SLIDE 8

?

SQL

slide-9
SLIDE 9
slide-10
SLIDE 10

?

slide-11
SLIDE 11

Can we have efficient graph analytics within a SQL Database?

slide-12
SLIDE 12

Graph Analytics on SQL Databases

  • Graph: set of nodes, set of edges
  • Node table: nodeId and associated metadata
  • Edge table: (from,to) nodeIds and associated metadata
  • Undirected graph: two tuples per edge
  • Node/Edge access: selection, projection on node and edge tables
  • Graph traversal: successive joins between node and edge tables
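The mapping on this slide can be sketched with SQLite driven from Python. This is a minimal illustration, not the authors' schema: the table and column names (node, edge, from_node, to_node), the sample graph, and the helper functions neighbors and two_hop are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (id INTEGER PRIMARY KEY, name TEXT);   -- nodeId + metadata
    CREATE TABLE edge (from_node INTEGER, to_node INTEGER);  -- (from, to) nodeIds
""")
conn.executemany("INSERT INTO node VALUES (?, ?)",
                 [(1, "a"), (2, "b"), (3, "c"), (4, "d")])

# Undirected graph: each edge is stored as two tuples, one per direction.
undirected = [(1, 2), (2, 3), (3, 4)]
conn.executemany("INSERT INTO edge VALUES (?, ?)",
                 undirected + [(v, u) for u, v in undirected])


def neighbors(node_id):
    """Edge access: a selection and projection on the edge table."""
    rows = conn.execute(
        "SELECT to_node FROM edge WHERE from_node = ? ORDER BY to_node",
        (node_id,))
    return [r[0] for r in rows]


def two_hop(node_id):
    """Graph traversal: successive joins through the edge table, then the node table."""
    rows = conn.execute("""
        SELECT DISTINCT n.name
        FROM edge AS e1
        JOIN edge AS e2 ON e2.from_node = e1.to_node
        JOIN node AS n  ON n.id = e2.to_node
        WHERE e1.from_node = ? AND e2.to_node <> e1.from_node
        ORDER BY n.name""", (node_id,))
    return [r[0] for r in rows]


print(neighbors(2))  # nodes adjacent to node 2
print(two_hop(1))    # names of nodes two hops away from node 1
```

Each additional hop adds one more self-join on the edge table, which is why the number of joins is the first cost the next slide targets.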

slide-13
SLIDE 13

Optimizations

  • Number of joins: parallel graph exploration
  • Number of queries (round trips): nested queries; the database handles the optimizations
  • Data movement between server and client: UDFs, stored procedures
  • Database connections: keep connections alive between iterations
  • SQL query performance: sort orders, indexes
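Two of these points, fewer round trips through nested queries and faster lookups through indexes, can be sketched as follows. This is an illustration under assumptions: the edge table layout, the index name edge_from, and both helper functions are hypothetical, and SQLite stands in for whichever SQL database is used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a single connection, reused for every query
conn.execute("CREATE TABLE edge (from_node INTEGER, to_node INTEGER)")
conn.executemany("INSERT INTO edge VALUES (?, ?)",
                 [(1, 2), (1, 3), (2, 4), (3, 4), (3, 5)])
# Index on the join column so each hop can be answered by an index lookup.
conn.execute("CREATE INDEX edge_from ON edge (from_node, to_node)")

HOP = "SELECT to_node FROM edge WHERE from_node = ?"


def two_hop_round_trips(start):
    """Client-driven traversal: one query for the start node, then one per neighbour."""
    queries = 1
    found = set()
    for (mid,) in conn.execute(HOP, (start,)).fetchall():
        found.update(dst for (dst,) in conn.execute(HOP, (mid,)))
        queries += 1
    return sorted(found), queries


def two_hop_nested(start):
    """Same answer from one nested query; the database plans both hops itself."""
    rows = conn.execute("""
        SELECT DISTINCT to_node FROM edge
        WHERE from_node IN (SELECT to_node FROM edge WHERE from_node = ?)
        ORDER BY to_node""", (start,))
    return [r[0] for r in rows], 1


print(two_hop_round_trips(1))  # (nodes, number of queries issued)
print(two_hop_nested(1))
```

Both functions return the same nodes; the nested form issues a single statement regardless of how many neighbours the start node has.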

slide-14
SLIDE 14

What does the performance look like?

slide-15
SLIDE 15

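PageRank

[Figure: bar chart of PageRank running time in seconds (log scale, 1 to 10,000) on four graphs, comparing Graph Database, SQL: Main-memory Store, SQL: Row Store, and SQL: Column Store.]

Datasets (nodes / edges): Facebook 4K / 88K, Twitter 76K / 2.4M, GPlus 107K / 30M, LiveJournal 4.8M / 69M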

slide-16
SLIDE 16

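Shortest Paths

[Figure: bar chart of shortest-paths running time in seconds (log scale, 1 to 100,000) on four graphs, comparing Graph Database, SQL: Main-memory Store, SQL: Row Store, and SQL: Column Store.]

Datasets (nodes / edges): Facebook 4K / 88K, Twitter 76K / 2.4M, GPlus 107K / 30M, LiveJournal 4.8M / 69M

Bar labels (seconds), in extraction order: 212.7, 20.2, 21.3, 4.4, 18,702.2, 1,231.7, 168.1, 8.7, 428.4, 4.7, 395.6, 3.2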

slide-17
SLIDE 17

What about the query interface? Is SQL the right choice for graph queries?

slide-18
SLIDE 18

Shortest Path in SQL

UPDATE NNodes AS nnode
SET Estimate = new_nnode.Estimate,
    Predecessor = new_nnode.Predecessor
FROM (
    SELECT temp.Id, temp.Estimate, edge.from_node AS Predecessor
    FROM NNodes AS nn, edge,
         (SELECT e.to_node AS Id, min(n1.Estimate+1) AS Estimate
          FROM NNodes AS n1, edge AS e, NNodes AS n2
          WHERE n1.Id=e.from_node AND n2.Id=e.to_node
          GROUP BY e.to_node, n2.Estimate
          HAVING min(n1.Estimate+1) < n2.Estimate
         ) AS temp
    WHERE nn.Id=edge.from_node
      AND edge.to_node=temp.Id
      AND nn.estimate=temp.estimate-1
) AS new_nnode
WHERE nnode.Id = new_nnode.Id;

Tables !!!
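A statement like the one on this slide performs a single relaxation round, so a driver has to repeat it until no estimate changes. The sketch below shows that loop with SQLite from Python. It is not the slide's query: it swaps the UPDATE ... FROM form for correlated subqueries, and the sentinel INF, the sample graph, and the function shortest_paths are assumptions for illustration.

```python
import sqlite3

INF = 10**9  # hypothetical sentinel meaning "not reached yet"

# One relaxation round: a node whose estimate can be improved through an incoming
# edge takes the smallest neighbour estimate + 1 and records that neighbour.
RELAX = """
UPDATE NNodes
SET Predecessor = (SELECT e.from_node
                   FROM edge AS e JOIN NNodes AS n1 ON n1.Id = e.from_node
                   WHERE e.to_node = NNodes.Id
                   ORDER BY n1.Estimate, e.from_node LIMIT 1),
    Estimate    = (SELECT min(n1.Estimate) + 1
                   FROM edge AS e JOIN NNodes AS n1 ON n1.Id = e.from_node
                   WHERE e.to_node = NNodes.Id)
WHERE Estimate > (SELECT min(n1.Estimate) + 1
                  FROM edge AS e JOIN NNodes AS n1 ON n1.Id = e.from_node
                  WHERE e.to_node = NNodes.Id)
"""


def shortest_paths(conn):
    """Repeat the relaxation until no row changes; return {Id: (Estimate, Predecessor)}."""
    while conn.execute(RELAX).rowcount > 0:
        pass
    rows = conn.execute("SELECT Id, Estimate, Predecessor FROM NNodes ORDER BY Id")
    return {node: (est, pred) for node, est, pred in rows}


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edge (from_node INTEGER, to_node INTEGER);
    CREATE TABLE NNodes (Id INTEGER PRIMARY KEY, Estimate INTEGER, Predecessor INTEGER);
""")
conn.executemany("INSERT INTO edge VALUES (?, ?)",
                 [(1, 2), (2, 3), (1, 3), (3, 4), (6, 4)])
# Node 1 is the source (estimate 0); node 6 has no incoming edges and stays unreached.
conn.executemany("INSERT INTO NNodes VALUES (?, ?, NULL)",
                 [(1, 0), (2, INF), (3, INF), (4, INF), (6, INF)])

result = shortest_paths(conn)
print(result)
```

The loop stops when a round updates zero rows, which is the relational counterpart of "no vertex found a shorter distance".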

slide-19
SLIDE 19

Shortest Path in Pregel

void compute(vector<vfloat> messages){
    // get the minimum distance
    vfloat mindist = id==START_NODE ? 0 : DBL_MAX;
    for(vector<vfloat>::iterator it = messages.begin(); it != messages.end(); ++it)
        mindist = min(mindist,*it);

    // send messages to all edges if new minimum is found
    vfloat vvalue = getVertexValue();
    if(mindist < vvalue){
        modifyVertexValue(mindist);
        vector<vint> edges = getOutEdges();
        for(vector<vint>::iterator it = edges.begin(); it != edges.end(); ++it)
            sendMessage(*it, mindist+1);
    }

    // halt
    voteToHalt();
}

Graph !!!

slide-20
SLIDE 20

What about having a vertex-centric interface in a SQL Database?

slide-21
SLIDE 21

Vertex-centric Interface in SQL Databases

  • Idea: Map vertex-centric program execution to SQL queries in a SQL database
  • The programmer specifies what happens on each vertex in the graph
  • Vertices are executed as long as they are in the active state or have an incoming message
  • Exact same API as in Pregel, e.g. getting incoming messages, vertex value, vertex edges, etc.

slide-22
SLIDE 22

Implementation Details

  • Vertex (V), Edge (E), Message (M)
  • The vertex programs are executed in parallel (super step) as UDFs in the SQL database
  • Vertex Input: (V,E,M) for the vertex
  • Vertex Output: outgoing M from the vertex
  • A coordinator synchronizes between super steps, i.e. redistributes the messages from one super step to the next

  • The coordinator stops when there are no more messages
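The coordinator loop described on this slide can be sketched as below. This is a simplified stand-in, not the system's implementation: a plain Python function plays the role of the UDF, vertices run sequentially rather than in parallel, and the table names (vertex, edge, message) and function names (sssp_compute, run_supersteps) are assumptions.

```python
import sqlite3

INF = 10**9  # hypothetical sentinel for "no distance known yet"


def sssp_compute(value, out_edges, messages):
    """Vertex program: input is (V, E, M) for one vertex, output is (new V, outgoing M)."""
    best = min(messages)
    if best < value:
        return best, [(dst, best + 1) for dst in out_edges]
    return value, []


def run_supersteps(conn, compute):
    """Coordinator: run vertices that have messages, then hand the new messages
    to the next super step; stop when no messages remain."""
    supersteps = 0
    while True:
        pending = conn.execute("SELECT dst, value FROM message ORDER BY dst").fetchall()
        if not pending:
            return supersteps
        inbox = {}
        for dst, val in pending:
            inbox.setdefault(dst, []).append(val)
        updates, outgoing = [], []
        for vid, msgs in inbox.items():
            (value,) = conn.execute(
                "SELECT value FROM vertex WHERE id = ?", (vid,)).fetchone()
            edges = [r[0] for r in conn.execute(
                "SELECT to_node FROM edge WHERE from_node = ?", (vid,))]
            new_value, sent = compute(value, edges, msgs)
            updates.append((new_value, vid))
            outgoing.extend(sent)
        conn.executemany("UPDATE vertex SET value = ? WHERE id = ?", updates)
        conn.execute("DELETE FROM message")  # old messages are consumed
        conn.executemany("INSERT INTO message VALUES (?, ?)", outgoing)
        supersteps += 1


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vertex  (id INTEGER PRIMARY KEY, value INTEGER);
    CREATE TABLE edge    (from_node INTEGER, to_node INTEGER);
    CREATE TABLE message (dst INTEGER, value INTEGER);
""")
conn.executemany("INSERT INTO vertex VALUES (?, ?)", [(i, INF) for i in range(1, 6)])
conn.executemany("INSERT INTO edge VALUES (?, ?)", [(1, 2), (2, 3), (1, 3), (3, 4)])
conn.execute("INSERT INTO message VALUES (1, 0)")  # seed: the start node is at distance 0

supersteps = run_supersteps(conn, sssp_compute)
distances = dict(conn.execute("SELECT id, value FROM vertex ORDER BY id"))
print(supersteps, distances)
```

Only vertices with incoming messages are run in each super step, and vertex 5, which never receives one, keeps its initial value.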
slide-23
SLIDE 23

Optimizations

  • 3-way join, instead of 2-way: table unions in place of joins
  • UDF call overheads: batching several vertices in each UDF call
  • Too many new messages in each super step: replace messages table, no in-place updates
  • SQL query performance: sort orders, indexes
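The first point, table unions in place of joins, can be illustrated with a small comparison. The sketch assumes hypothetical vertex, edge, and message tables and a helper vertex_inputs. It shows one plausible reading of the slide: joining V, E, and M repeats every edge for every message, while a tagged UNION ALL sorted by vertex id carries each tuple once.

```python
import sqlite3
from itertools import groupby

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vertex  (id INTEGER PRIMARY KEY, value INTEGER);
    CREATE TABLE edge    (from_node INTEGER, to_node INTEGER);
    CREATE TABLE message (dst INTEGER, value INTEGER);
""")
conn.execute("INSERT INTO vertex VALUES (1, 10)")
conn.executemany("INSERT INTO edge VALUES (1, ?)", [(2,), (3,), (4,)])
conn.executemany("INSERT INTO message VALUES (1, ?)", [(5,), (6,), (7,), (8,)])

# 3-way join: one row per (edge, message) combination of the vertex.
join_rows = conn.execute("""
    SELECT v.id, v.value, e.to_node, m.value
    FROM vertex AS v
    JOIN edge AS e ON e.from_node = v.id
    JOIN message AS m ON m.dst = v.id
""").fetchall()

# Union: every tuple appears once, tagged with the table it came from,
# and sorted so that all rows of a vertex are adjacent.
union_rows = conn.execute("""
    SELECT id AS vid, 'V' AS kind, value AS payload FROM vertex
    UNION ALL
    SELECT from_node, 'E', to_node FROM edge
    UNION ALL
    SELECT dst, 'M', value FROM message
    ORDER BY vid, kind, payload
""").fetchall()


def vertex_inputs(rows):
    """Regroup the sorted union into one (V, E, M) bundle per vertex."""
    inputs = {}
    for vid, group in groupby(rows, key=lambda r: r[0]):
        bundle = {"V": [], "E": [], "M": []}
        for _, kind, payload in group:
            bundle[kind].append(payload)
        inputs[vid] = bundle
    return inputs


inputs = vertex_inputs(union_rows)
print(len(join_rows), len(union_rows), inputs[1])
```

For a vertex with 3 edges and 4 messages the join yields 3 × 4 rows, while the union yields 1 + 3 + 4, and the gap widens with vertex degree and message volume.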

slide-24
SLIDE 24

What does the performance look like?

slide-25
SLIDE 25

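Shortest Paths

[Figure: bar chart of shortest-paths running time in seconds (log scale, 1 to 1000) on four graphs, comparing SQL: Column Store, Vertex-centric: Column Store, and Vertex-centric: Apache Giraph.]

Datasets (nodes / edges): Facebook 4K / 88K, Twitter 76K / 2.4M, GPlus 107K / 30M, LiveJournal 4.8M / 69M

Bar labels (seconds), in extraction order: 100.9, 47.0, 35.0, 28.2, 439.8, 92.5, 18.4, 5.7, 212.7, 20.2, 21.3, 4.4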

slide-26
SLIDE 26

Vertex-centric interface allows…

  • Connected Components
  • Random Walks with Restart
  • Stochastic Gradient Descent
  • Message Passing Algorithms
  • Or, any other vertex-centric algorithm

… right within the SQL database system!

slide-27
SLIDE 27

Summary

  • Graph analytics can be mapped to relational queries (plus UDFs)
  • SQL systems can offer very good performance over relational queries
  • We can extend SQL systems to provide more graph-natural query interfaces

slide-28
SLIDE 28

+

slide-29
SLIDE 29

=

slide-30
SLIDE 30

Team Members

slide-31
SLIDE 31

Non-faculty Members

slide-32
SLIDE 32

Non-faculty, non-postdoc …

slide-33
SLIDE 33

Graphs On Databases

Alekh Jindal

CSAIL, MIT

Sam Madden Mike Stonebraker