Processing Massive Graphs
Amir H. Payberah
amir.payberah@cs.ox.ac.uk
University of Oxford
Amir H. Payberah (Oxford) Processing Massive Graphs April 17, 2017 1 / 78
What's the Problem?
◮ A large graph either cannot fit into the memory of a single computer or
◮ Scale up or scale vertically.
◮ Scale out or scale horizontally.
◮ Count the number of times each distinct word appears in the file
◮ If the file fits in memory: words(doc.txt) | sort | uniq -c
◮ If not?
◮ Partition the data and process the partitions in parallel.
◮ Data-parallel processing.
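A minimal Python sketch of the data-parallel idea for the word-count example: the input is split into partitions, each partition is counted independently, and the partial counts are merged. The function names and the round-robin split are illustrative assumptions, not part of MapReduce or Spark.

```python
from collections import Counter
from functools import reduce


def map_partition(lines):
    """Map step: count the words of one partition, independently of the others."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def merge(a, b):
    """Reduce step: combine two partial counts."""
    return a + b


def word_count(lines, num_partitions=3):
    # Round-robin split; each partition could live on a different machine.
    partitions = [lines[i::num_partitions] for i in range(num_partitions)]
    partials = [map_partition(p) for p in partitions]  # independent tasks
    return reduce(merge, partials, Counter())


doc = ["a b a", "c b", "a"]
counts = word_count(doc)
```

The map step needs no communication between partitions, which is what makes this workload easy to scale out.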
◮ MapReduce
Can we use platforms like MapReduce or Spark, which are based on the data-parallel model, for large-scale graph processing?
◮ Difficult to extract parallelism based on partitioning of the data.
◮ Difficult to express parallelism based on partitioning of computation.
◮ No locality between computations and data access patterns.
Graph-Parallel Processing
◮ Computation typically depends on the neighbors.
◮ Restricts the types of computation.
◮ New techniques to partition and distribute graphs.
◮ Exploit graph structure.
◮ Vertex-centric processing model
◮ Edge-centric processing model
◮ Vertex-centric Programming model
◮ Vertex operations:
◮ Iterates over vertices

// the scatter phase
for all vertices v
  for all outgoing edges from v:
    update = f(v.value)

// the gather phase
for all vertices v
  for all incoming edges to v:
    v.value = g(v.value, update)
Until convergence {
  // the gather phase
  for all vertices v
    for all incoming edges to v:
      v.value = g(v.value, update)

  // the scatter phase
  for all vertices v
    for all outgoing edges from v:
      update = f(v.value)
}
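The vertex-centric loop can be sketched in Python as follows. This is an illustration under stated assumptions: updates are stored per edge in a dictionary, f and g are passed in by the caller, and convergence is taken to mean that neither a vertex value nor an update changed during an iteration. None of these choices come from a particular system.

```python
def vertex_centric(values, out_edges, f, g):
    """values: vertex -> value; out_edges: vertex -> list of destination vertices."""
    values = dict(values)
    in_edges = {v: [] for v in values}
    for src, dsts in out_edges.items():
        for dst in dsts:
            in_edges[dst].append(src)

    update = {}  # (src, dst) -> update on that edge; empty before the first scatter
    while True:
        changed = False
        # the gather phase: each vertex folds in the updates on its incoming edges
        for v in values:
            for src in in_edges[v]:
                if (src, v) in update:
                    new_value = g(values[v], update[(src, v)])
                    if new_value != values[v]:
                        values[v] = new_value
                        changed = True
        # the scatter phase: each vertex writes an update on its outgoing edges
        new_update = {}
        for v in values:
            for dst in out_edges.get(v, []):
                new_update[(v, dst)] = f(values[v])
        # assumed convergence criterion: no value and no update changed
        if not changed and new_update == update:
            return values
        update = new_update


ring = {"a": ["b"], "b": ["c"], "c": ["a"]}
result = vertex_centric({"a": 3, "b": 6, "c": 1}, ring, f=lambda x: x, g=max)
```

With f as the identity and g as max, every vertex of the ring ends up holding the largest initial value.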
Vertex-centric

Until convergence {
  // the scatter phase
  for all vertices v
    for all outgoing edges from v:
      update = f(v.value)

  // the gather phase
  for all vertices v
    for all incoming edges to v:
      v.value = g(v.value, update)
}

Edge-centric

Until convergence {
  // the scatter phase
  for all edges e
    u = new update
    u.dst = e.dst
    u.value = f(e.src.value)

  // the gather phase
  for all edges e
    e.dst.value = g(e.dst.value, u.value)
}
Until convergence {
  // the gather phase
  for all edges e
    e.dst.value = g(e.dst.value, u.value)

  // the scatter phase
  for all edges e
    u = new update
    u.dst = e.dst
    u.value = f(e.src.value)
}
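A possible Python rendering of the edge-centric loop, shown here computing hop distances from a source. The edge list is streamed in order and one update is produced per edge; the list-of-tuples representation, the choice of f and g, and the convergence test are assumptions made for this sketch.

```python
def edge_centric(values, edges, f, g):
    """values: vertex -> value; edges: list of (src, dst) pairs, streamed in order."""
    values = dict(values)
    updates = []  # list of (dst, value); empty before the first scatter
    while True:
        changed = False
        # the gather phase: apply each update to its destination vertex
        for dst, value in updates:
            new_value = g(values[dst], value)
            if new_value != values[dst]:
                values[dst] = new_value
                changed = True
        # the scatter phase: one sequential pass over the edge list
        new_updates = [(dst, f(values[src])) for src, dst in edges]
        # assumed convergence criterion: no value and no update changed
        if not changed and new_updates == updates:
            return values
        updates = new_updates


INF = float("inf")
edges = [("s", "a"), ("a", "b"), ("s", "b"), ("b", "c")]
dist = edge_centric({"s": 0, "a": INF, "b": INF, "c": INF}, edges,
                    f=lambda d: d + 1, g=min)
```

Note that the loop never looks up the edges of a particular vertex; it only scans the edge and update lists from start to end.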
◮ Pregel: a large-scale graph-parallel processing platform developed at Google.
◮ Inspired by the bulk synchronous parallel (BSP) model.
◮ Vertex-centric programming: think like a vertex.
◮ Each vertex computes its value individually, in parallel.
◮ Each vertex can see its local context and update its value.
◮ Applications run in a sequence of iterations: supersteps.
◮ A vertex in superstep S can:
◮ Superstep 0: all vertices are in the active state.
◮ A vertex deactivates itself by voting to halt: no further work to do.
◮ A halted vertex becomes active again if it receives a message.
◮ The whole algorithm terminates when:
i_val := val
for each message m
  if m > val then val := m
if i_val == val then
  vote_to_halt
else
  for each neighbor v
    send_message(v, val)
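The maximum-value computation above can be simulated on one machine with a short Python sketch. It is a toy model, not the Pregel API: inboxes and outboxes are plain dictionaries, the barrier is the swap at the end of each loop iteration, and the run is assumed to end once no vertex is active, that is, once no messages are pending. One further assumption is made explicit in the code: in superstep 0 every vertex sends its value, since otherwise no message would ever be produced.

```python
def pregel_max(values, neighbors):
    """values: vertex -> initial value; neighbors: vertex -> list of out-neighbors."""
    values = dict(values)
    inbox = {v: [] for v in values}
    active = set(values)  # superstep 0: all vertices are active
    superstep = 0
    while active:
        outbox = {v: [] for v in values}
        for v in active:
            i_val = values[v]
            for m in inbox[v]:
                if m > values[v]:
                    values[v] = m
            # Assumption: in superstep 0 every vertex announces its value.
            if superstep == 0 or values[v] != i_val:
                for n in neighbors[v]:
                    outbox[n].append(values[v])
            # Otherwise the vertex votes to halt: it sends nothing.
        # Barrier: messages sent in superstep S are visible in superstep S + 1.
        inbox = outbox
        # A halted vertex is reactivated only by an incoming message.
        active = {v for v, msgs in inbox.items() if msgs}
        superstep += 1
    return values, superstep


path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
final_values, steps = pregel_max({"a": 3, "b": 6, "c": 2, "d": 1}, path)
```

On this four-vertex path the largest value, 6, spreads outwards one hop per superstep until no vertex has anything new to send.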
◮ Update ranks in parallel.
◮ Iterate until convergence.

$$R[i] = 0.15 + \sum_{j \in \text{in\_neighbors}(i)} w_{ji} \, R[j]$$
Pregel_PageRank(i, messages):
  // receive all the messages
  total = 0
  foreach(msg in messages):
    total = total + msg

  // update the rank of this vertex
  R[i] = 0.15 + total

  // send new messages to neighbors
  foreach(j in out_neighbors[i]):
    sendmsg(R[i] * wij) to vertex j
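A runnable Python approximation of the message-passing PageRank above, with supersteps simulated by swapping an inbox and an outbox. Two things are assumptions of this sketch rather than statements from the slides: the edge weight is taken as w_ij = 0.85 / out_degree(i), and the loop runs for a fixed number of supersteps instead of testing convergence.

```python
def pregel_pagerank(out_neighbors, supersteps=50, damping=0.85):
    """out_neighbors: vertex -> list of out-neighbors (every vertex is a key)."""
    vertices = list(out_neighbors)
    # Assumption: w_ij = damping / out_degree(i); the slides leave w unspecified.
    weight = {i: damping / len(out_neighbors[i])
              for i in vertices if out_neighbors[i]}
    rank = {i: 1.0 for i in vertices}

    # Superstep 0: every vertex sends its initial contribution.
    inbox = {i: [] for i in vertices}
    for i in vertices:
        for j in out_neighbors[i]:
            inbox[j].append(rank[i] * weight[i])

    for _ in range(supersteps):
        outbox = {i: [] for i in vertices}
        for i in vertices:
            total = sum(inbox[i])        # receive all the messages
            rank[i] = 0.15 + total       # update the rank of this vertex
            for j in out_neighbors[i]:   # send new messages to neighbors
                outbox[j].append(rank[i] * weight[i])
        inbox = outbox                   # barrier between supersteps
    return rank


links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pregel_pagerank(links, supersteps=100)
```

Because messages are only delivered at the barrier, every vertex in a superstep sees ranks from the previous superstep, never a mix of old and new values.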
◮ Vertices are assigned to partitions based on their vertex-ID.
◮ E.g., hash(ID)
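A small Python sketch of hash-based vertex placement. Taking the hash modulo the number of partitions and using CRC32 as the hash function are assumptions for illustration; the helper names are hypothetical. The sketch also counts how many edges end up crossing partitions, since those are the edges that cause network messages.

```python
import zlib


def partition_of(vertex_id, num_partitions):
    # Assumption: hash(ID) mod N with a stable hash. Python's built-in hash()
    # is salted per process for strings, so CRC32 is used instead.
    return zlib.crc32(str(vertex_id).encode("utf-8")) % num_partitions


def partition_graph(edges, num_partitions):
    """Return (partition -> set of vertices, number of edges crossing partitions)."""
    parts = {p: set() for p in range(num_partitions)}
    cut_edges = 0
    for src, dst in edges:
        p_src = partition_of(src, num_partitions)
        p_dst = partition_of(dst, num_partitions)
        parts[p_src].add(src)
        parts[p_dst].add(dst)
        if p_src != p_dst:
            cut_edges += 1
    return parts, cut_edges


graph_edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5)]
parts, cut_edges = partition_graph(graph_edges, num_partitions=2)
```

Placement needs no lookup table, because any worker can recompute the owner of a vertex from its ID alone.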
◮ Inefficient if different regions of the graph converge at different speeds.
◮ Can suffer if one task is more expensive than the others.
◮ Runtime of each phase is determined by the slowest machine.
◮ GraphLab allows asynchronous iterative computation.
◮ Vertex scope of vertex v: the data stored in v and in all adjacent vertices and edges.
◮ Vertex-centric programming
◮ A vertex can read and modify any of the data in its scope.
GraphLab_PageRank(i)
  // compute sum over neighbors
  total = 0
  foreach(j in in_neighbors(i)):
    total = total + R[j] * wji

  // update the PageRank
  R[i] = 0.15 + total

  // trigger neighbors to run again
  foreach(j in out_neighbors(i)):
    signal vertex-program on j
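The pull-and-signal style can be sketched in Python with a work queue standing in for the scheduler. This is a single-threaded illustration and not the GraphLab API. Two assumptions are made: w_ji = 0.85 / out_degree(j), and neighbors are signalled only when the rank moved by more than a tolerance, so that the queue eventually drains.

```python
from collections import deque


def graphlab_pagerank(out_neighbors, tolerance=1e-6, damping=0.85):
    """out_neighbors: vertex -> list of out-neighbors (every vertex is a key)."""
    vertices = list(out_neighbors)
    in_neighbors = {i: [] for i in vertices}
    for i in vertices:
        for j in out_neighbors[i]:
            in_neighbors[j].append(i)

    rank = {i: 1.0 for i in vertices}
    queue = deque(vertices)      # scheduler: vertices waiting to run
    scheduled = set(vertices)
    while queue:
        i = queue.popleft()
        scheduled.discard(i)
        # compute sum over neighbors, reading their current ranks directly
        # (assumption: w_ji = damping / out_degree(j))
        total = sum(rank[j] * damping / len(out_neighbors[j])
                    for j in in_neighbors[i])
        new_rank = 0.15 + total
        changed = abs(new_rank - rank[i]) > tolerance
        rank[i] = new_rank       # update the PageRank, visible immediately
        if changed:              # trigger neighbors to run again
            for j in out_neighbors[i]:
                if j not in scheduled:
                    scheduled.add(j)
                    queue.append(j)
    return rank


links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = graphlab_pagerank(links)
```

Unlike the superstep version, there is no barrier: a vertex that runs later in the queue already sees the new ranks of vertices that ran before it.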
◮ Overlapped scopes: race condition in simultaneous execution of two update functions.
◮ Full consistency: during the execution of f(v), no other function reads
◮ Edge consistency: during the execution of f(v), no other function reads or modifies any of the data on v or any of the edges adjacent to v.
◮ Vertex consistency: during the execution of f(v), no other function will be applied to v.
Consistency vs. Parallelism
[Low, Y., GraphLab: A Distributed Abstraction for Large Scale Machine Learning (Doctoral dissertation, University of California), 2013.]
◮ Convert the input graph to a meta-graph.
◮ Meta-graph is very small.
◮ A fast balanced partition.
◮ Factorizes the update function into the Gather, Apply and Scatter phases.
◮ Vertex-cut partitioning.
◮ Gather-Apply-Scatter (GAS)
◮ Gather: accumulate information about the neighborhood through a generalized sum.
◮ Apply: apply the accumulated value to the center vertex.
◮ Scatter: update adjacent edges and vertices.
◮ Initially all vertices are active.
◮ It executes the vertex-program on the active vertices until none remain.
inactive until it is reactivated.
◮ PowerGraph can execute both synchronously and asynchronously.
◮ Synchronous scheduling like Pregel.
each step.
◮ Asynchronous scheduling like GraphLab.
functions are immediately committed to the graph.
PowerGraph_PageRank(i):
  Gather(j -> i):
    return wji * R[j]

  sum(a, b):
    return a + b

  // total: Gather and sum
  Apply(i, total):
    R[i] = 0.15 + total

  Scatter(i -> j):
    if R[i] changed then activate(j)
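A Python sketch of the Gather, Apply, Scatter decomposition with a synchronous engine loop. It is an illustration, not the PowerGraph API. The assumptions are w_ji = 0.85 / out_degree(j) and "changed" meaning a change larger than a tolerance.

```python
from functools import reduce


def powergraph_pagerank(out_neighbors, tolerance=1e-6, damping=0.85):
    """out_neighbors: vertex -> list of out-neighbors (every vertex is a key)."""
    vertices = list(out_neighbors)
    in_neighbors = {i: [] for i in vertices}
    for i in vertices:
        for j in out_neighbors[i]:
            in_neighbors[j].append(i)
    rank = {i: 1.0 for i in vertices}

    def gather(j, i):
        # contribution of in-edge j -> i (assumption: w_ji = damping / out_degree(j))
        return damping / len(out_neighbors[j]) * rank[j]

    def gather_sum(a, b):
        return a + b

    active = set(vertices)  # initially all vertices are active
    while active:
        # Gather: combine per-edge contributions with the commutative sum.
        totals = {i: reduce(gather_sum,
                            (gather(j, i) for j in in_neighbors[i]), 0.0)
                  for i in active}
        # Apply: write the new value of each active vertex.
        changed = set()
        for i, total in totals.items():
            new_rank = 0.15 + total
            if abs(new_rank - rank[i]) > tolerance:
                changed.add(i)
            rank[i] = new_rank
        # Scatter: if R[i] changed, activate the out-neighbors of i.
        active = {j for i in changed for j in out_neighbors[i]}
    return rank


links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = powergraph_pagerank(links)
```

Because gather works per edge and the sum is associative, the contributions to a high-degree vertex could be computed on several machines and combined afterwards.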
◮ Natural graphs: skewed power-law degree distribution.
◮ Edge-cut algorithms perform poorly on power-law graphs.
Vertex-Cut partitioning
◮ Random vertex-cuts
◮ Greedy vertex-cuts
become minimum.
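To make the two strategies concrete, here is a hedged Python sketch. In a vertex-cut, edges are assigned to machines and a vertex is replicated on every machine that holds one of its edges. The random variant places each edge uniformly at random. The greedy variant shown here is a simplified heuristic written for this example (prefer machines that already hold both endpoints, then either endpoint, then the least loaded machine); it is not the exact rule of any particular system. The replication factor, the average number of machines per vertex, is used as the quality measure.

```python
import random


def replication_factor(assignment):
    """assignment: edge (u, v) -> machine. Average number of machines per vertex."""
    replicas = {}
    for (u, v), m in assignment.items():
        replicas.setdefault(u, set()).add(m)
        replicas.setdefault(v, set()).add(m)
    return sum(len(ms) for ms in replicas.values()) / len(replicas)


def random_vertex_cut(edges, num_machines, seed=0):
    rng = random.Random(seed)
    return {e: rng.randrange(num_machines) for e in edges}


def greedy_vertex_cut(edges, num_machines):
    # Simplified heuristic (an assumption of this sketch, see the lead-in).
    replicas, load, assignment = {}, [0] * num_machines, {}
    for u, v in edges:
        a_u, a_v = replicas.get(u, set()), replicas.get(v, set())
        candidates = (a_u & a_v) or (a_u | a_v) or set(range(num_machines))
        m = min(candidates, key=lambda p: (load[p], p))  # least loaded candidate
        assignment[(u, v)] = m
        load[m] += 1
        replicas.setdefault(u, set()).add(m)
        replicas.setdefault(v, set()).add(m)
    return assignment


star = [(0, leaf) for leaf in range(1, 9)]  # one hub with eight leaves
rf_random = replication_factor(random_vertex_cut(star, num_machines=4))
rf_greedy = replication_factor(greedy_vertex_cut(star, num_machines=4))
```

This simplified greedy rule uses load only to break ties, so on the star it puts every edge on one machine; a practical partitioner would also have to bound the imbalance.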
Could we compute Big Graphs on a single machine?
◮ Disk-based processing
Eiko Y., and Roy A., “Scale-up Graph Processing: A Storage-centric View”, 2013.
Vertex-centric Edge-centric
◮ Vertex-centric scatter-gather: $\frac{\text{EdgeData}}{\text{RandomAccessBandwidth}}$
◮ Edge-centric scatter-gather: $\frac{\text{Scatters} \times \text{EdgeData}}{\text{SequentialAccessBandwidth}}$
◮ Sequential Access Bandwidth ≫ Random Access Bandwidth.
◮ Few scatter-gather iterations for real-world graphs.
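A worked example of the two cost estimates in Python. All numbers are hypothetical and chosen only to make the arithmetic easy; Scatters is read here as the number of scatter passes over the edge data.

```python
def vertex_centric_time(edge_data_bytes, random_bw):
    """EdgeData / RandomAccessBandwidth, in seconds."""
    return edge_data_bytes / random_bw


def edge_centric_time(scatters, edge_data_bytes, sequential_bw):
    """Scatters x EdgeData / SequentialAccessBandwidth, in seconds."""
    return scatters * edge_data_bytes / sequential_bw


# Hypothetical figures, not measurements.
EDGE_DATA = 64 * 10**9        # 64 GB of edge data
RANDOM_BW = 50 * 10**6        # 50 MB/s random access
SEQUENTIAL_BW = 500 * 10**6   # 500 MB/s sequential access
SCATTERS = 5

t_vertex = vertex_centric_time(EDGE_DATA, RANDOM_BW)            # 1280 seconds
t_edge = edge_centric_time(SCATTERS, EDGE_DATA, SEQUENTIAL_BW)  # 640 seconds

# Edge-centric is cheaper while Scatters < SequentialBW / RandomBW.
break_even_scatters = SEQUENTIAL_BW / RANDOM_BW                 # 10 passes
```

With these figures the edge-centric model stays ahead as long as fewer than 10 scatter passes are needed, which is why the small iteration counts of real-world graphs matter.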
◮ Problem: still have random access to vertex set.
◮ Solution: partition the graph into streaming partitions.
Random access for free.
// the scatter phase
for each streaming_partition p {
  load Vertices(p)
  for each unprocessed e in Edges(p)
    u = new update
    u.dst = e.dst
    u.value = f(e.src.value)
    add u to Update(partition(u.dst))
}

// the gather phase
for each streaming_partition p {
  load Vertices(p)
  for each unprocessed u in Update(p)
    u.dst.value = g(u.dst.value, u.value)
  delete Update(p)
}
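The two streaming phases can be mimicked in memory with the Python sketch below, again computing hop distances. It is a toy model, not X-Stream code. Its assumptions: vertices are integers, partition(v) = v mod N, each partition owns the edges whose source lies in it, and the outer loop stops once a gather phase changes nothing.

```python
def streaming_scatter_gather(values, edges, num_partitions, f, g):
    """values: int vertex -> value; edges: list of (src, dst) pairs."""
    def partition(v):
        # Assumption: vertices are integers, placed by v mod N.
        return v % num_partitions

    values = dict(values)
    # Edges(p): the edges whose source vertex belongs to partition p.
    edge_lists = [[] for _ in range(num_partitions)]
    for src, dst in edges:
        edge_lists[partition(src)].append((src, dst))

    while True:
        # the scatter phase
        update_lists = [[] for _ in range(num_partitions)]
        for p in range(num_partitions):
            # "load Vertices(p)": only sources in p are read while streaming Edges(p)
            for src, dst in edge_lists[p]:
                update_lists[partition(dst)].append((dst, f(values[src])))
        # the gather phase
        changed = False
        for p in range(num_partitions):
            # only destinations in p are written while streaming Update(p)
            for dst, value in update_lists[p]:
                new_value = g(values[dst], value)
                if new_value != values[dst]:
                    values[dst] = new_value
                    changed = True
            update_lists[p] = []  # delete Update(p)
        if not changed:
            return values


INF = float("inf")
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
dist = streaming_scatter_gather({0: 0, 1: INF, 2: INF, 3: INF}, edges,
                                num_partitions=2, f=lambda d: d + 1, g=min)
```

While a partition is being processed, every vertex access falls inside that partition, so a vertex set that fits in fast memory makes those accesses cheap while edges and updates are read sequentially.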
◮ Data-parallel vs. graph-parallel processing
◮ Graph-parallel: vertex-centric vs. edge-centric
◮ Vertex-centric: Pregel and GraphLab
◮ Edge-centric: X-Stream
Acknowledgement: some slides were derived from the slides of Amitabha Roy (EPFL).