SLIDE 1

One Trillion Edges: Graph Processing at Facebook-Scale

GraphHPC 2015, Moscow

Avery Ching, Sergey Edunov, Maja Kabiljo, Dionysios Logothetis, Sambavi Muthukrishnan (Facebook)

SLIDE 2

Social Graph

SLIDE 3

Example Question: Are Jay and Sambavi friends?

Social Graph

SLIDE 4
SLIDE 5-7

Ranking Features, Recommendations, Data Partitioning

[Figure: example ranking feature scores 7.6, 9.3, 6.4, 8.2]

SLIDE 8-9

Benchmark Graphs vs. Social Graphs

Clueweb 09, Twitter research, Friendster, Yahoo! web, 2015 Twitter (approx.), 2015 Facebook (approx.)

[Chart: edge and vertex counts for each graph]

70x larger than benchmarks!

SLIDE 10-11

Requirements

  • Efficient iterative computing model
  • Easy to program and debug graph-based API
  • Scale to real-world Facebook graph sizes (1B+ nodes and hundreds of billions of edges)
  • Easily interoperable with existing data (Hive)
  • Run multiple jobs in a multi-tenant environment

SLIDE 12-16

Apache Giraph

  • Highly scalable graph processing engine loosely based on Pregel
  • Combiners are used to aggregate message values
  • Aggregators are global data generated on every superstep

Maximum Vertex Example

[Figure: two processors exchange vertex values over successive supersteps until every vertex converges to the maximum value (5)]
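The maximum-vertex example can be written as a short Giraph computation. The following is a minimal sketch of that idea, assuming a Giraph 1.1-style BasicComputation with long ids and double values; the class name and type choices are illustrative assumptions, not code from the deck. Each vertex keeps the largest value it has seen and forwards it to its neighbors until nothing changes.

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

public class MaxValueComputation extends BasicComputation<
    LongWritable, DoubleWritable, NullWritable, DoubleWritable> {
  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, NullWritable> vertex,
      Iterable<DoubleWritable> messages) {
    // Adopt the largest value seen so far, either our own or a neighbor's.
    boolean changed = getSuperstep() == 0;  // always announce our value once
    for (DoubleWritable message : messages) {
      if (message.get() > vertex.getValue().get()) {
        vertex.getValue().set(message.get());
        changed = true;
      }
    }
    if (changed) {
      // Propagate the current maximum to all neighbors.
      sendMessageToAllEdges(vertex, vertex.getValue());
    }
    vertex.voteToHalt();  // woken up again only if a new message arrives
  }
}

A message combiner that keeps only the largest incoming value would cut traffic further, which is what the combiner bullet above refers to.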

SLIDE 17
SLIDE 18

Pipelines

Data pipelines framework:

  Applications: Core Analytics
  Execution framework: MapReduce (scheduler)
  Storage: HDFS

SLIDE 19

Architecture

SLIDE 20-22

Architecture

Loading the graph: the master assigns input splits (Split 0-3) to workers (Worker 0, Worker 1); each worker loads its splits and sends graph data to the worker that owns it.

Compute / Iterate: each worker holds its partitions of the in-memory graph (Part 0-3); workers compute and send messages, then send stats to the master, which starts the next iteration.

Storing the graph: each worker writes its partitions (Part 0-3) through the output format.

SLIDE 23-25

Parallelization Model

Worker parallelization: rely on the scheduling framework for parallelism (e.g., more mappers). Pros: simple.

Multithreaded parallelization: multicore machines can run up to (partitions / worker) compute threads. Pros: fewer network connections, better memory usage (e.g., shared message buffering).
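As a rough illustration of the multithreaded model, the sketch below runs one compute task per partition on a bounded thread pool. It is a conceptual sketch only; Partition and computePartition are hypothetical stand-ins, not Giraph classes.

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class WorkerSuperstepRunner {
  /** Hypothetical partition handle holding a worker's share of the graph. */
  static class Partition {}

  public void runSuperstep(List<Partition> partitions, int numThreads)
      throws InterruptedException {
    // At most min(numThreads, #partitions) partitions are computed concurrently.
    ExecutorService pool =
        Executors.newFixedThreadPool(Math.min(numThreads, partitions.size()));
    for (Partition partition : partitions) {
      pool.submit(() -> computePartition(partition));
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);  // barrier: superstep ends when all partitions finish
  }

  private void computePartition(Partition partition) {
    // Placeholder: run compute() on every active vertex and buffer outgoing messages.
  }
}

Because the threads live in one JVM, they can share message buffers and outbound connections, which is the "fewer connections, better memory usage" advantage noted on the slide.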

SLIDE 26

Efficient Java Object Support

/**
 * Interface for data structures that store out-edges for a vertex.
 *
 * @param <I> Vertex id
 * @param <E> Edge value
 */
 public interface OutEdges<I extends WritableComparable, E extends Writable> extends Iterable<Edge<I, E>>, Writable {
 
 void initialize(Iterable<Edge<I, E>> edges);
 
 void initialize(int capacity);
 
 void initialize();
 
 void add(Edge<I, E> edge);
 
 void remove(I targetVertexId);
 
 int size();
 }

  • Edges >> vertices (by more than 2 orders of magnitude)
  • Allow custom out-edge implementations
  • Example: Java primitive arrays, FastUtil libraries
  • Serialize messages into large byte arrays
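To make the primitive-array point concrete, here is a minimal sketch of the underlying idea, not an actual Giraph OutEdges implementation: neighbor ids live in a single long[], so there is no per-edge object or boxing overhead.

import java.util.Arrays;

/** Simplified illustration of a primitive-array edge store. */
public class LongArrayOutEdges {
  private long[] targetIds = new long[4];  // neighbor vertex ids
  private int size = 0;

  public void add(long targetVertexId) {
    if (size == targetIds.length) {
      targetIds = Arrays.copyOf(targetIds, size * 2);  // amortized growth
    }
    targetIds[size++] = targetVertexId;
  }

  public void remove(long targetVertexId) {
    for (int i = 0; i < size; i++) {
      if (targetIds[i] == targetVertexId) {
        targetIds[i] = targetIds[--size];  // swap with last entry; order is not kept
        return;
      }
    }
  }

  public int size() {
    return size;
  }

  public long get(int index) {
    return targetIds[index];
  }
}

Compared with an ArrayList of boxed Long ids, this stores eight bytes per edge plus a small constant, which matters when edges outnumber vertices by more than two orders of magnitude.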

SLIDE 27-31

Page Rank

MapReduce (Hadoop) vs. Giraph

Giraph:

public class PageRankComputation extends BasicComputation<
    LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {
  @Override
  public void compute(
      Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
      Iterable<DoubleWritable> messages) {
    // Calculate new page rank value
    if (getSuperstep() >= 1) {
      double sum = 0;
      for (DoubleWritable message : messages) {
        sum += message.get();
      }
      vertex.getValue().set(0.15d / getTotalNumVertices() + 0.85d * sum);
    }
    // Send page rank value to neighbors
    if (getSuperstep() < 30) {
      sendMessageToAllEdges(vertex,
          new DoubleWritable(vertex.getValue().get() / vertex.getNumEdges()));
    } else {
      vertex.voteToHalt();
    }
  }
}
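Because PageRank only needs the sum of its incoming messages, it pairs naturally with the combiner bullet from the Apache Giraph slides. Below is a hedged sketch of such a sum combiner; the MessageCombiner signature is assumed from Giraph's public API and is not code from the deck.

import org.apache.giraph.combiner.MessageCombiner;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;

public class DoubleSumCombiner
    implements MessageCombiner<LongWritable, DoubleWritable> {
  @Override
  public void combine(LongWritable vertexId, DoubleWritable original,
      DoubleWritable toCombine) {
    // Fold the new message into the accumulated one so only a single sum is delivered.
    original.set(original.get() + toCombine.get());
  }

  @Override
  public DoubleWritable createInitialMessage() {
    return new DoubleWritable(0);  // identity element for summation
  }
}

On a later slide, MasterCompute.setMessageCombiner is the hook that would install such a combiner for the workers.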

SLIDE 32

Pregel Extensions

SLIDE 33-36

Pregel Model Limitations

  • Difficult to construct “multi-stage” graph applications
  • Hard to reuse code

SLIDE 37-41

Extensions

  • Make Computation a first-class object
  • Define computation on the master, the workers, and each vertex
  • Master computation is executed centrally to set the computation and combiner for the workers
  • All computations are now composable and reusable

SLIDE 42-44

First Class Computation

Pregel (C++):

class Vertex {
 public:
  virtual void Compute(MessageIterator* msgs) = 0;
  …
};

Giraph (Java):

public interface Computation<I extends WritableComparable,
    V extends Writable, E extends Writable,
    M1 extends Writable, M2 extends Writable> {
  void compute(Vertex<I, V, E> vertex, Iterable<M1> messages);
  …
}
SLIDE 45

Defining Computation for Master/Worker

public abstract class WorkerContext {
  public abstract void preApplication();
  public abstract void postApplication();
  public abstract void preSuperstep();
  public abstract void postSuperstep();
  …
}

public abstract class MasterCompute {
  public abstract void compute();
  public abstract void haltComputation();
  public final void setComputation(
      Class<? extends Computation> computationClass);
  public final void setMessageCombiner(
      Class<? extends MessageCombiner> combinerClass);
  public final <A extends Writable> void setAggregatedValue(
      String name, A value);
  …
}
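These hooks are what make the multi-stage applications on the next slides possible: the master can swap the Computation class between supersteps. The sketch below is a hypothetical illustration of that pattern; DefaultMasterCompute is assumed as the convenience base class, and the two stage computations are invented stand-ins rather than the deck's code.

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.giraph.master.DefaultMasterCompute;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

/** Hypothetical first stage, e.g. "compute candidates to move". */
class ComputeCandidatesStage extends BasicComputation<
    LongWritable, DoubleWritable, NullWritable, DoubleWritable> {
  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, NullWritable> vertex,
      Iterable<DoubleWritable> messages) {
    // Stage 1 logic would go here.
  }
}

/** Hypothetical second stage, e.g. "probabilistically move vertices". */
class MoveVerticesStage extends BasicComputation<
    LongWritable, DoubleWritable, NullWritable, DoubleWritable> {
  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, NullWritable> vertex,
      Iterable<DoubleWritable> messages) {
    // Stage 2 logic would go here.
  }
}

/** Master that alternates the two stages and halts after a fixed budget. */
public class TwoStageMaster extends DefaultMasterCompute {
  private static final int MAX_SUPERSTEPS = 30;

  @Override
  public void compute() {
    if (getSuperstep() >= MAX_SUPERSTEPS) {
      haltComputation();
      return;
    }
    if (getSuperstep() % 2 == 0) {
      setComputation(ComputeCandidatesStage.class);  // even supersteps: stage 1
    } else {
      setComputation(MoveVerticesStage.class);       // odd supersteps: stage 2
    }
  }
}

The same pattern extends to the balanced label propagation and affinity propagation flows that follow: the master tracks which phase it is in and installs the matching computation each superstep.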

SLIDE 46-48

Example Multi-Stage Applications

Balanced Label Propagation: compute candidates to move to partitions, then probabilistically move vertices; continue if the halting condition is not met (e.g., fewer than n vertices moved?). Every 10 cycles of BLP, start and complete an edge-cut metrics calculation.

Affinity Propagation: calculate and send responsibilities, calculate and send availabilities, then update exemplars; continue if the halting condition is not met (e.g., fewer than n vertices changed exemplars?). Every 5 cycles of AP, start and complete an edge-cut metrics calculation.
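The "continue if the halting condition is not met" steps are typically implemented with an aggregator: each vertex that moves adds 1 to a shared counter, and the master reads the total at the start of the next superstep. Below is a hedged sketch of the master side, assuming Giraph's aggregator API (getAggregatedValue, registration in initialize()) and that vertices call aggregate("moved", new LongWritable(1)) whenever they move; names and threshold are illustrative, not the deck's code.

import org.apache.giraph.master.DefaultMasterCompute;
import org.apache.hadoop.io.LongWritable;

/** Master-side halting check; assumes a LongSumAggregator named "moved"
 *  was registered in initialize() (not shown). */
public class HaltWhenStableMaster extends DefaultMasterCompute {
  private static final long MIN_MOVES = 1000;  // illustrative threshold "n"

  @Override
  public void compute() {
    if (getSuperstep() == 0) {
      return;  // nothing has been aggregated yet
    }
    LongWritable moved = getAggregatedValue("moved");
    if (moved.get() < MIN_MOVES) {
      haltComputation();  // halting condition met: fewer than n vertices moved
    }
  }
}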

SLIDE 49-53

Large/Imbalanced Supersteps

  • Example: mutual friends calculation between neighbors
  • Send your friends a list of your friends
  • Intersect with your own friend list

[Figure: vertices A-E exchanging friend-list messages, e.g. A:{D}, D:{A,E}, E:{D}, B:{}]

Messages memory = 1.23B monthly active users (as of 1/2014) x 200+ average friends (2011 S1) x 8-byte ids (longs) = 394 TB of memory required for messages. Assuming 100 GB machines, that is roughly 3,940 machines (not including the graph).
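A minimal sketch of this two-superstep pattern, assuming Giraph-style Java with long ids; the message type and class names here are illustrative assumptions, not the production application. Superstep 0 sends each vertex's neighbor list to all of its neighbors; superstep 1 intersects each received list with the local neighbor set.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.giraph.edge.Edge;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Writable;

/** Message carrying the sender's neighbor ids (illustrative type). */
class FriendListWritable implements Writable {
  long[] friendIds = new long[0];

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(friendIds.length);
    for (long id : friendIds) {
      out.writeLong(id);
    }
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    friendIds = new long[in.readInt()];
    for (int i = 0; i < friendIds.length; i++) {
      friendIds[i] = in.readLong();
    }
  }
}

public class MutualFriendsComputation extends BasicComputation<
    LongWritable, LongWritable, NullWritable, FriendListWritable> {
  @Override
  public void compute(Vertex<LongWritable, LongWritable, NullWritable> vertex,
      Iterable<FriendListWritable> messages) {
    if (getSuperstep() == 0) {
      // Superstep 0: send my neighbor list to every neighbor.
      FriendListWritable msg = new FriendListWritable();
      msg.friendIds = new long[vertex.getNumEdges()];
      int i = 0;
      for (Edge<LongWritable, NullWritable> edge : vertex.getEdges()) {
        msg.friendIds[i++] = edge.getTargetVertexId().get();
      }
      sendMessageToAllEdges(vertex, msg);
    } else {
      // Superstep 1: intersect each received list with my own neighbor set.
      Set<Long> myFriends = new HashSet<>();
      for (Edge<LongWritable, NullWritable> edge : vertex.getEdges()) {
        myFriends.add(edge.getTargetVertexId().get());
      }
      long totalMutual = 0;
      for (FriendListWritable msg : messages) {
        for (long id : msg.friendIds) {
          if (myFriends.contains(id)) {
            totalMutual++;
          }
        }
      }
      // A real application would emit per-neighbor counts; here we just store a total.
      vertex.getValue().set(totalMutual);
      vertex.voteToHalt();
    }
  }
}

Each vertex buffers a copy of its neighbors' lists, which is what drives the memory estimate above and motivates the superstep splitting on the next slides.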

SLIDE 54-57

Superstep Splitting

  • Split a superstep into multiple iterations
  • Partition message sources/destinations into groups for activation (e.g., hash the vertex id by the iteration number)

[Figure: one logical superstep run as four iterations over vertex groups A and B; each iteration enables a single (source group, destination group) pair, so only a fraction of the messages is in flight at once]
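A minimal sketch of the activation test, assuming two groups per side as in the figure; this is illustrative Java rather than Giraph internals. With two source groups and two destination groups, one logical superstep becomes four sub-iterations, and a vertex sends (or receives) only when its hashed group matches the sub-iteration.

/** Illustrative only: which vertices send/receive in each sub-iteration of a split superstep. */
public final class SuperstepSplitting {
  private static final int NUM_GROUPS = 2;  // groups "A" (0) and "B" (1) from the figure

  /** Group of a vertex id, derived by hashing (here a simple modulo). */
  static int group(long vertexId) {
    return (int) Math.floorMod(vertexId, NUM_GROUPS);
  }

  /** May this vertex send messages during the given sub-iteration? */
  static boolean isSourceActive(long vertexId, int subIteration) {
    return group(vertexId) == subIteration / NUM_GROUPS;
  }

  /** Are messages addressed to this vertex delivered during the given sub-iteration? */
  static boolean isDestinationActive(long vertexId, int subIteration) {
    return group(vertexId) == subIteration % NUM_GROUPS;
  }

  public static void main(String[] args) {
    // Vertex 5 falls in group B, vertex 4 in group A: 5 sends only in sub-iterations
    // 2 and 3, and messages to 4 are delivered only in sub-iterations 0 and 2.
    for (int it = 0; it < NUM_GROUPS * NUM_GROUPS; it++) {
      System.out.printf("iteration %d: 5 sends=%b, 4 receives=%b%n",
          it, isSourceActive(5, it), isDestinationActive(4, it));
    }
  }
}

The commutativity and associativity requirement on the next slide follows directly: a vertex now applies its incoming messages in several partial batches rather than all at once.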

SLIDE 58

Superstep Splitting Limitations

  • The message-based compute update must be commutative and associative when destination splitting is enabled
  • No single message can overflow the memory buffer of a single vertex
  • In extreme cases, rely on graph partitioning to balance load

SLIDE 59

Benchmarks/Performance

SLIDE 60

Scalability

[Chart: scalability of workers (200B edges): runtime in seconds vs. number of workers (50 to 300), Giraph vs. ideal scaling]

[Chart: scalability of edges (50 workers): runtime in seconds vs. number of edges (1E+09 to 2E+11)]

SLIDE 61-62

“A billion edges isn’t cool… you know what’s cool? A TRILLION EDGES.”

SLIDE 63

Giraph runs page rank on 1,000,000,000,000+ edges in under 3 minutes per iteration, using 200 machines

SLIDE 64

Giraph vs. Hive (Graph Applications)

Application                                  Hive                      Giraph                   Speedup
Page rank (single iteration), 400B+ edges    Total CPU 16.5M secs      Total CPU 0.6M secs      26x
                                             Elapsed time 600 mins     Elapsed time 19 mins     120x
Friends-of-friends score, 71B+ edges         Total CPU 255M secs       Total CPU 18M secs       14x
                                             Elapsed time 7200 mins    Elapsed time 110 mins    65x

SLIDE 65

Giraph vs. Hive (Hive Applications)

Application                                               Hive                      Giraph                   Speedup
Double join query (450B connections, 2.5B+ unique ids)    Total CPU 211 days        Total CPU 43 days        5x
                                                          Elapsed time 425 mins     Elapsed time 50 mins     8.5x
Count distinct query (620B actions, 110M objects)         Total CPU 485 days        Total CPU 78 days        6.2x
                                                          Elapsed time 510 mins     Elapsed time 45 mins     11.3x

SLIDE 66

Operational Experience

  • Runs in Corona (Facebook's MapReduce implementation)
  • One map task per machine, leveraging multithreaded parallelism
  • Strict FIFO queue to avoid deadlock
  • Hive data model: prepare the graph in Hive and/or use multiple Giraph input formats for filtering, pre-processing, and transformation

SLIDE 67-70

Recent Giraph Work

  • Recommendations (http://www.tinyurl.com/fb-mf-cf)
  • Blocks and Pieces (http://giraph.apache.org/blocks.html)
  • Adaptive out-of-core (most of the work already committed to open source)

SLIDE 71

Acknowledgements

Alessandro Presta, Nitay Joffe, Greg Malewicz, Pavan Kumar

SLIDE 72-77

Future Work

  • Scalability
  • Applications
  • Performance

SLIDE 78

Thank You