One Trillion Edges: Graph Processing at Facebook-Scale
GraphHPC 2015, Moscow
Avery Ching
Sambavi Muthukrishnan
Sergey Edunov
Maja Kabiljo
Dionysios Logothetis
Social Graph
Ranking
Features
Recommendations
Data Partitioning
[Chart: vertex and edge counts of public graph datasets (ClueWeb09, Twitter research, Friendster, Yahoo! web) compared with approximate 2015 Twitter and 2015 Facebook graphs; the Facebook social graph is orders of magnitude larger than the public datasets.]
Maximum Vertex Example
[Figure, shown as a superstep-by-superstep build: two processors run the classic maximum-value propagation. Vertices start with values 5, 2, and 1; in each superstep a vertex adopts the largest value it has received and forwards it to its neighbors, until every vertex holds the maximum value, 5. A Giraph-style code sketch follows.]
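A minimal sketch of the same algorithm in Giraph's Java API (the class name and value types are illustrative assumptions; the BasicComputation API itself appears later in this deck):

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

public class MaxValueComputation extends
    BasicComputation<LongWritable, LongWritable, NullWritable, LongWritable> {
  @Override
  public void compute(Vertex<LongWritable, LongWritable, NullWritable> vertex,
      Iterable<LongWritable> messages) {
    // In the first superstep every vertex broadcasts its own value.
    boolean changed = getSuperstep() == 0;
    // Afterwards, a vertex only speaks when a message raises its value.
    for (LongWritable message : messages) {
      if (message.get() > vertex.getValue().get()) {
        vertex.getValue().set(message.get());
        changed = true;
      }
    }
    if (changed) {
      sendMessageToAllEdges(vertex, vertex.getValue());
    }
    vertex.voteToHalt();  // reactivated automatically if a message arrives
  }
}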
[Stack diagram: Giraph sits in the data pipelines framework of the core analytics stack, scheduled as a MapReduce job over HDFS.]
Giraph execution flow:
[Diagram, repeated across several build slides: (1) Loading: the master assigns InputFormat splits (Split 0-3) to workers; Worker 0 and Worker 1 load the graph and send vertices to their owners, building the in-memory graph partitions (Part 0-3). (2) Computing: in each superstep the workers compute their partitions and send messages, then send stats to the master and iterate if the master so decides. (3) Storing: workers write the partitions (Part 0-3) out through the OutputFormat. A job-setup sketch follows.]
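A minimal sketch of driving this flow from Java, assuming a job with a custom computation class (input/output format classes are omitted because they depend on the data source; all names here are illustrative):

import org.apache.giraph.conf.GiraphConfiguration;
import org.apache.giraph.job.GiraphJob;

public class ExampleJobRunner {
  public static void main(String[] args) throws Exception {
    GiraphConfiguration conf = new GiraphConfiguration();
    // The per-vertex program to run each superstep (defined later in this deck).
    conf.setComputationClass(PageRankComputation.class);
    // Vertex input/output formats would be set here to define splits and parts.
    conf.setWorkerConfiguration(2 /* min workers */, 2 /* max workers */, 100.0f);
    GiraphJob job = new GiraphJob(conf, "example-giraph-job");
    job.run(true);  // submit and block until the job finishes
  }
}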
Worker parallelization: rely on the scheduling framework for parallelization (e.g., more mappers). Pros: simple.
Multithreaded parallelization: multicore machines leverage up to (partitions / worker) threads. Pros: fewer connections, better memory usage (e.g., shared message buffering). A configuration sketch follows.
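A sketch of turning on multithreaded computation (the option key comes from Giraph's configuration constants; treat the exact key and value as assumptions to adapt):

import org.apache.giraph.conf.GiraphConfiguration;

public class MultithreadingConfig {
  public static GiraphConfiguration configure() {
    GiraphConfiguration conf = new GiraphConfiguration();
    // Use several compute threads per worker; only up to
    // (partitions / worker) threads can do useful work.
    conf.setInt("giraph.numComputeThreads", 8);
    return conf;
  }
}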
/**
 * Interface for data structures that store out-edges for a vertex.
 *
 * @param <I> Vertex id
 * @param <E> Edge value
 */
public interface OutEdges<I extends WritableComparable, E extends Writable>
    extends Iterable<Edge<I, E>>, Writable {
  void initialize(Iterable<Edge<I, E>> edges);
  void initialize(int capacity);
  void initialize();
  void add(Edge<I, E> edge);
  void remove(I targetVertexId);
  int size();
}
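For illustration, a minimal list-backed implementation might look like the sketch below (not one of Giraph's built-in classes such as ByteArrayEdges; serialization is elided for brevity, so it is not usable as-is in a real job):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.giraph.edge.Edge;
import org.apache.giraph.edge.OutEdges;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;

public class ArrayListOutEdges<I extends WritableComparable, E extends Writable>
    implements OutEdges<I, E> {
  private List<Edge<I, E>> edgeList;

  @Override
  public void initialize(Iterable<Edge<I, E>> edges) {
    initialize();
    for (Edge<I, E> edge : edges) {
      add(edge);
    }
  }

  @Override
  public void initialize(int capacity) {
    edgeList = new ArrayList<>(capacity);
  }

  @Override
  public void initialize() {
    edgeList = new ArrayList<>();
  }

  @Override
  public void add(Edge<I, E> edge) {
    edgeList.add(edge);
  }

  @Override
  public void remove(I targetVertexId) {
    // Remove every edge pointing at the given target.
    Iterator<Edge<I, E>> it = edgeList.iterator();
    while (it.hasNext()) {
      if (it.next().getTargetVertexId().equals(targetVertexId)) {
        it.remove();
      }
    }
  }

  @Override
  public int size() {
    return edgeList.size();
  }

  @Override
  public Iterator<Edge<I, E>> iterator() {
    return edgeList.iterator();
  }

  @Override
  public void write(DataOutput out) throws IOException {
    // Serialization elided in this sketch.
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    // Deserialization elided in this sketch.
  }
}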
public class PageRankComputation extends BasicComputation<LongWritable,
    DoubleWritable, FloatWritable, DoubleWritable> {
  @Override
  public void compute(
      Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
      Iterable<DoubleWritable> messages) {
    // Calculate new page rank value
    if (getSuperstep() >= 1) {
      double sum = 0;
      for (DoubleWritable message : messages) {
        sum += message.get();
      }
      vertex.getValue().set(0.15d / getTotalNumVertices() + 0.85d * sum);
    }
    // Send page rank value to neighbors
    if (getSuperstep() < 30) {
      sendMessageToAllEdges(vertex,
          new DoubleWritable(vertex.getValue().get() / vertex.getNumEdges()));
    } else {
      vertex.voteToHalt();
    }
  }
}
class Vertex {
 public:
  virtual void Compute(MessageIterator* msgs) = 0;
  …
};
Pregel (C++):
class Vertex {
 public:
  …
};

Giraph (Java):
public interface Computation {
  void compute(Vertex<I, V, E> vertex, Iterable<M1> messages);
  …
}
public abstract class WorkerContext {
  public abstract void preApplication();
  public abstract void postApplication();
  public abstract void preSuperstep();
  public abstract void postSuperstep();
  …
}
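A minimal sketch of a WorkerContext that times each superstep on every worker (the class name and logging are illustrative assumptions):

import org.apache.giraph.worker.WorkerContext;

public class TimingWorkerContext extends WorkerContext {
  private long superstepStartMs;

  @Override
  public void preApplication() { /* once per worker, before superstep 0 */ }

  @Override
  public void postApplication() { /* once per worker, after the last superstep */ }

  @Override
  public void preSuperstep() {
    superstepStartMs = System.currentTimeMillis();
  }

  @Override
  public void postSuperstep() {
    System.out.println("Superstep " + getSuperstep() + " took "
        + (System.currentTimeMillis() - superstepStartMs) + " ms on this worker");
  }
}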
public abstract class MasterCompute {
  public abstract void compute();
  public abstract void haltComputation();
  public final void setComputation(
      Class<? extends Computation> computationClass);
  public final void setMessageCombiner(
      Class<? extends MessageCombiner> combinerClass);
  public final <A extends Writable> void setAggregatedValue(
      String name, A value);
  …
}
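A minimal sketch of a MasterCompute that registers a shared aggregator and bounds the number of supersteps (it extends DefaultMasterCompute, Giraph's no-op base class; the limit and aggregator name are illustrative assumptions):

import org.apache.giraph.aggregators.LongSumAggregator;
import org.apache.giraph.master.DefaultMasterCompute;

public class BoundedMasterCompute extends DefaultMasterCompute {
  private static final int MAX_SUPERSTEPS = 30;  // illustrative limit

  @Override
  public void initialize() throws InstantiationException, IllegalAccessException {
    // Runs once before the first superstep; register shared aggregators here.
    registerAggregator("sum", LongSumAggregator.class);
  }

  @Override
  public void compute() {
    // Runs on the master between supersteps.
    if (getSuperstep() >= MAX_SUPERSTEPS) {
      haltComputation();
    }
  }
}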
Balanced Label Propagation
[Pipeline: compute candidates to move to partitions → probabilistically move vertices; continue if the halting condition is not met (i.e., fewer than n vertices moved?); start/complete edge cut metrics, with the edge cut calculation run every 10 cycles of BLP.]

Affinity Propagation
[Pipeline: calculate and send responsibilities → calculate and send availabilities → update exemplars; continue if the halting condition is not met (i.e., fewer than n vertices changed exemplars?); the edge cut calculation runs every 5 cycles of AP. A simplified sketch of the label propagation step follows.]
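A heavily simplified, per-vertex sketch of the label propagation step (the real BLP adds balance constraints and applies moves probabilistically in a separate phase; all names and types here are illustrative assumptions):

import java.util.HashMap;
import java.util.Map;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

public class LabelPropagationSketch extends
    BasicComputation<LongWritable, LongWritable, NullWritable, LongWritable> {
  @Override
  public void compute(Vertex<LongWritable, LongWritable, NullWritable> vertex,
      Iterable<LongWritable> messages) {
    if (getSuperstep() > 0) {
      // Count neighbors per partition label and pick the most common one
      // as this vertex's candidate partition.
      Map<Long, Integer> counts = new HashMap<>();
      for (LongWritable label : messages) {
        counts.merge(label.get(), 1, Integer::sum);
      }
      long bestLabel = vertex.getValue().get();
      int bestCount = 0;
      for (Map.Entry<Long, Integer> entry : counts.entrySet()) {
        if (entry.getValue() > bestCount) {
          bestLabel = entry.getKey();
          bestCount = entry.getValue();
        }
      }
      // BLP would treat this as a candidate and move only a balanced,
      // probabilistically chosen subset; this sketch just applies it.
      vertex.getValue().set(bestLabel);
    }
    // Advertise the (possibly new) label to all neighbors; halting
    // (e.g., via a MasterCompute) is omitted in this sketch.
    sendMessageToAllEdges(vertex, vertex.getValue());
  }
}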
[Figure, shown as a multi-step build: an example graph over vertices A-E with the neighbor lists exchanged as messages, e.g. A:{D}, D:{A,E}, E:{D}, B:{}, C:{D}, D:{C}; A:{C}, C:{A,E}, E:{C}; C:{D}, D:{C}, E:{}.]
Messages memory = (1.23B monthly active people (as of 1/2014)) x (200+ average friends (2011 S1), each receiving the sender's full friend list) x (200+ ids per friend list) x (8-byte ids (longs)) = 394 TB of memory required for messages. Assuming 100 GB machines, this is 3,940 machines (not including the graph).
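Working through the arithmetic: 1.23 × 10^9 senders × 200 recipients × (200 × 8 B) per friend list ≈ 3.94 × 10^14 B ≈ 394 TB; at 100 GB of usable memory per machine, 394 TB ÷ 100 GB ≈ 3,940 machines.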
[Figure: superstep splitting. Vertices 0-5 are split into sets A = {1, 3, 5} and B = {0, 2, 4}, and one logical superstep runs as a sequence of fragments that each toggle one source set and one destination set on: sources B → destinations A, sources A → destinations B, sources A → destinations A, sources B → destinations B. Only the active source and destination sets exchange messages in each fragment, bounding the message memory in flight. A minimal code sketch follows.]
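A minimal sketch of the idea (the fragment scheme and constants are assumptions; real superstep splitting also aggregates partial messages per fragment):

import org.apache.giraph.edge.Edge;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

public class SuperstepSplittingSketch extends
    BasicComputation<LongWritable, LongWritable, NullWritable, LongWritable> {
  // Assumption: one logical superstep runs as 4 machine supersteps.
  private static final long FRAGMENTS = 4;

  @Override
  public void compute(Vertex<LongWritable, LongWritable, NullWritable> vertex,
      Iterable<LongWritable> messages) {
    long fragment = getSuperstep() % FRAGMENTS;
    // ... aggregate the partial messages received for this fragment here ...
    for (Edge<LongWritable, NullWritable> edge : vertex.getEdges()) {
      // Send only to the destinations that belong to the current fragment,
      // so only 1/FRAGMENTS of the messages are in flight at once.
      if (Math.floorMod(edge.getTargetVertexId().get(), FRAGMENTS) == fragment) {
        sendMessage(edge.getTargetVertexId(), vertex.getValue());
      }
    }
  }
}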
[Chart: Scalability of workers (200B edges); runtime in seconds versus number of workers (50-300), Giraph versus ideal scaling.]
[Chart: Scalability of edges (50 workers); runtime in seconds versus number of edges (10^9 up to 2 × 10^11), Giraph versus ideal scaling.]
Application | Hive | Giraph | Speedup (derived from the figures shown)
Page rank (single iteration), 400B+ edges | 16.5M CPU secs, 600 min elapsed | 0.6M CPU secs, 19 min elapsed | ~27x CPU, ~32x elapsed
Friends-of-friends score, 71B+ edges | 255M CPU secs, 7,200 min elapsed | 18M CPU secs, 110 min elapsed | ~14x CPU, ~65x elapsed
Application | Hive | Giraph | Speedup (derived from the figures shown)
Double join query (450B connections, 2.5B+ unique ids) | 211 CPU days, 425 min elapsed | 43 CPU days, 50 min elapsed | ~4.9x CPU, ~8.5x elapsed
Count distinct query (620B actions, 110M objects) | 485 CPU days, 510 min elapsed | 78 CPU days, 45 min elapsed | ~6.2x CPU, ~11x elapsed
Alessandro Presta, Nitay Joffe, Greg Malewicz, Pavan Kumar