Giraph: Production-grade graph processing infrastructure for trillion edge graphs
6/22/2014, GRADES
Avery Ching

Motivation

Apache Giraph
• Inspired by Google’s Pregel but runs on Hadoop
• “Think like a vertex”
• Maximum value vertex example
[Figure: maximum value vertex example on Processor 1 and Processor 2; vertex values converge to 5 over time]
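The maximum value example can be sketched in plain Java without any Giraph classes. This is only an illustration of the "think like a vertex" model: the class name, the three-vertex chain, and its starting values are assumptions, not taken from the slide.

```java
import java.util.Arrays;

// Plain-Java simulation of the maximum-value example; no Giraph classes are used.
class MaxValueSketch {
    /**
     * Runs synchronous supersteps until no vertex changes its value.
     * neighbors[v] lists the vertices that v sends messages to.
     */
    static int[] propagateMax(int[] initial, int[][] neighbors) {
        int[] value = initial.clone();
        boolean[] active = new boolean[value.length];
        Arrays.fill(active, true);           // superstep 0: every vertex sends its value
        boolean anyActive = true;
        while (anyActive) {
            int[] next = value.clone();
            // Message delivery: each active vertex offers its value to its neighbors.
            for (int v = 0; v < value.length; v++) {
                if (!active[v]) continue;
                for (int n : neighbors[v]) {
                    next[n] = Math.max(next[n], value[v]);
                }
            }
            // A vertex stays active only if its value grew; otherwise it votes to halt.
            anyActive = false;
            for (int v = 0; v < value.length; v++) {
                active[v] = next[v] > value[v];
                anyActive |= active[v];
            }
            value = next;
        }
        return value;
    }

    public static void main(String[] args) {
        int[] initial = {1, 5, 2};                 // hypothetical 3-vertex chain 0 - 1 - 2
        int[][] neighbors = {{1}, {0, 2}, {1}};
        System.out.println(Arrays.toString(propagateMax(initial, neighbors)));
    }
}
```

Each pass of the while loop plays the role of one superstep: only vertices whose value changed send messages again, and the run ends once every vertex has voted to halt.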

Giraph on Hadoop / Yarn
[Figure: Giraph layered on MapReduce and YARN, over Hadoop 0.20.x, 0.20.203, 1.x, and 2.0.x]

Page rank in Giraph

    import java.io.IOException;
    import org.apache.giraph.graph.BasicComputation;
    import org.apache.giraph.graph.Vertex;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.FloatWritable;
    import org.apache.hadoop.io.LongWritable;

    public class PageRankComputation extends
        BasicComputation<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {
      @Override
      public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
          Iterable<DoubleWritable> messages) throws IOException {
        if (getSuperstep() >= 1) {
          // Calculate updated page rank value from neighbors
          double sum = 0;
          for (DoubleWritable message : messages) {
            sum += message.get();
          }
          vertex.setValue(new DoubleWritable((0.15d / getTotalNumVertices()) + 0.85d * sum));
        }
        if (getSuperstep() < 30) {
          // Send page rank value to neighbors for 30 iterations
          sendMessageToAllEdges(vertex,
              new DoubleWritable(vertex.getValue().get() / vertex.getNumEdges()));
        } else {
          vertex.voteToHalt();
        }
      }
    }

Apache Giraph data flow
[Figure: three phases across Worker 0, Worker 1, and the master]
• Loading the graph: workers read splits from the input format (Split 0 to Split 3), then load and send graph partitions (Part 0 to Part 3)
• Compute / Iterate: workers compute on the in-memory graph partitions and send messages, then send stats to the master, which iterates
• Storing the graph: workers write their partitions through the output format

Pipelined computation
Master “computes”
• Sets computation, in/out message, combiner for next super step
• Can set/modify aggregator values
[Figure: timeline of the master, Worker 0, and Worker 1 moving through phases 1a, 1b, 2, and 3]

Use case

Affinity propagation
Frey and Dueck, “Clustering by passing messages between data points”, Science 2007
Organically discover exemplars based on similarity
[Figure: three snapshots of the clustering: Initialization, Intermediate, Convergence]

3 stages
Responsibility r(i,k)
• How well suited is k to be an exemplar for i?
Availability a(i,k)
• How appropriate for point i to choose point k as an exemplar given all of k’s responsibilities?
Update exemplars
• Based on known responsibilities/availabilities, which vertex should be my exemplar?
* Dampen responsibility, availability
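The three stages can be written down as small functions using the update rules from the Frey and Dueck paper. This is a dense-matrix sketch for illustration, not the message-passing Giraph implementation described in the talk; the class name, method names, and the dampening factor lambda are assumptions.

```java
// Dense-matrix sketch of the affinity propagation update rules (Frey and Dueck, 2007).
// s[i][k] is similarity, r[i][k] responsibility, a[i][k] availability.
class AffinityUpdates {
    /** r(i,k) = s(i,k) - max over k' != k of (a(i,k') + s(i,k')). */
    static double responsibility(double[][] s, double[][] a, int i, int k) {
        double best = Double.NEGATIVE_INFINITY;
        for (int other = 0; other < s.length; other++) {
            if (other != k) best = Math.max(best, a[i][other] + s[i][other]);
        }
        return s[i][k] - best;
    }

    /**
     * a(i,k) = min(0, r(k,k) + sum of positive r(i',k) for i' not in {i,k}) when i != k;
     * a(k,k) = sum of positive r(i',k) for i' != k.
     */
    static double availability(double[][] r, int i, int k) {
        double positives = 0;
        for (int other = 0; other < r.length; other++) {
            if (other != i && other != k) positives += Math.max(0, r[other][k]);
        }
        return i == k ? positives : Math.min(0, r[k][k] + positives);
    }

    /** Dampening: blend the previous value with the newly computed one. */
    static double dampen(double previous, double updated, double lambda) {
        return lambda * previous + (1 - lambda) * updated;
    }

    /** Exemplar for i: the k that maximizes a(i,k) + r(i,k). */
    static int exemplar(double[][] r, double[][] a, int i) {
        int best = 0;
        for (int k = 1; k < r.length; k++) {
            if (a[i][k] + r[i][k] > a[i][best] + r[i][best]) best = k;
        }
        return best;
    }
}
```

In a vertex-centric setting each vertex would hold only its own row of these matrices and receive the rest as messages, which is what the responsibility and availability messages on the following slides carry.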

Responsibility
Every vertex i with an edge to k maintains responsibility of k for i
Sends responsibility to k in ResponsibilityMessage (senderid, responsibility(i,k))
[Figure: B, C, and D send r(b,a), r(c,a), and r(d,a) to A; B sends r(b,d) to D]

Availability
Vertex sums positive messages
Sends availability to i in AvailabilityMessage (senderid, availability(i,k))
[Figure: A sends a(b,a), a(c,a), and a(d,a) to B, C, and D; D sends a(b,d) to B]

Update exemplars
Dampens availabilities and scans edges to find exemplar k
Updates self-exemplar
[Figure: vertices A, B, C, and D each update their exemplar; three end with exemplar=a and one with exemplar=d]

Master logic
[Figure: state cycle: initial state, calculate responsibility, calculate availability, update exemplars, halt]
if (exemplars agree they are exemplars && changed exemplars < ∆) then halt, otherwise continue
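One possible reading of the halting condition is sketched below. The array representation, the method name, and the parameters are assumptions made for illustration; the slide does not show how the master actually gathers this information from the workers.

```java
class ExemplarHaltCheck {
    /**
     * exemplar[i] is the exemplar chosen by vertex i. Halting requires that every chosen
     * exemplar also chose itself, and that fewer than delta vertices changed exemplar.
     */
    static boolean shouldHalt(int[] exemplar, int changedExemplars, int delta) {
        for (int chosen : exemplar) {
            if (exemplar[chosen] != chosen) return false;   // chosen exemplar disagrees
        }
        return changedExemplars < delta;
    }
}
```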

Performance & Scalability

Example graph sizes
[Figure: two bar charts of edge counts in billions. Graphs used in research publications (axis 0 to 7): Clueweb 09, Twitter dataset, Friendster, Yahoo! web. Rough social network scale* (axis 0 to 300): Twitter Est*, Facebook Est*]
* Twitter: 255M MAU (https://about.twitter.com/company), 208 average followers (Beevolve 2012) → Estimated >53B edges
* Facebook: 1.28B MAU (Q1/2014 report), 200+ average friends (2011 S1) → Estimated >256B edges

Faster than Hive?
Application                    Graph Size    CPU Time Speedup   Elapsed Time Speedup
Page rank (single iteration)   400B+ edges   26x                120x
Friends of friends score       71B+ edges    12.5x              48x

Apache Giraph scalability
[Figure: two plots of seconds (axis 0 to 500), Giraph vs. ideal. Scalability of workers (200B edges): # of workers from 50 to 300. Scalability of edges (50 workers): # of edges from 1E+09 to 2E+11]

Trillion social edges page rank
[Figure: minutes per iteration (axis 0 to 4), 6/30/2013 vs. 6/2/2014]
Improvements
• GIRAPH-840 - Netty 4 upgrade
• G1 Collector / tuning

Graph partitioning

Why balanced partitioning
Random partitioning == good balance BUT ignores entity affinity
[Figure: vertices 0 to 11 scattered across partitions]
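A small sketch can make the affinity point concrete by measuring the fraction of edges that stay inside one partition. The 12-vertex ring, the modulo assignment standing in for hash-style partitioning, and all names below are assumptions for illustration only.

```java
class PartitionLocality {
    /** Fraction of edges whose two endpoints are assigned to the same partition. */
    static double localEdgeFraction(int[][] edges, int[] partitionOf) {
        int local = 0;
        for (int[] e : edges) {
            if (partitionOf[e[0]] == partitionOf[e[1]]) local++;
        }
        return (double) local / edges.length;
    }

    public static void main(String[] args) {
        int n = 12, parts = 3;
        int[][] ring = new int[n][];
        int[] byModulo = new int[n];   // stand-in for hash-style partitioning
        int[] byRange = new int[n];    // affinity-aware: ring neighbors kept together
        for (int v = 0; v < n; v++) {
            ring[v] = new int[] {v, (v + 1) % n};
            byModulo[v] = v % parts;
            byRange[v] = v / (n / parts);
        }
        // Both assignments put 4 vertices in each partition, so both are balanced.
        System.out.println(localEdgeFraction(ring, byModulo));
        System.out.println(localEdgeFraction(ring, byRange));
    }
}
```

On this toy ring the modulo assignment keeps no edge local while the range assignment keeps 9 of 12, even though both are perfectly balanced.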

Balanced partitioning application
Results from one service: cache hit rate grew from 70% to 85%, bandwidth cut in 1/2
[Figure: vertices grouped as {0, 3, 6, 9}, {1, 4, 7, 10}, {2, 5, 8, 11}]

Balanced label propagation results
* Loosely based on Ugander and Backstrom. Balanced label propagation for partitioning massive graphs, WSDM '13

Leveraging partitioning
Explicit remapping
Native remapping
• Transparent
• Embedded

Explicit remapping

Original graph
Id         Edges
San Jose   (Chicago, 4), (New York, 6)
Chicago    (San Jose, 4), (New York, 3)
New York   (San Jose, 6), (Chicago, 3)

Partitioning mapping
Id         Alt Id
San Jose   0
Chicago    1
New York   2

Remapped graph (join of the original graph with the partitioning mapping)
Id   Edges
0    (1, 4), (2, 6)
1    (0, 4), (2, 3)
2    (0, 6), (1, 3)

Compute - shortest paths from 0

Compute output
Id   Distance
0    0
1    4
2    6

Reverse partition mapping
Alt Id   Id
0        San Jose
1        Chicago
2        New York

Final compute output (join of the compute output with the reverse partition mapping)
Id         Distance
San Jose   0
Chicago    4
New York   6

Native transparent remapping

Original graph
Id         Edges
San Jose   (Chicago, 4), (New York, 6)
Chicago    (San Jose, 4), (New York, 3)
New York   (San Jose, 6), (Chicago, 3)

Partitioning mapping
Id         Group
San Jose   0
Chicago    1
New York   2

Original graph with group information
Id         Group   Edges
San Jose   0       (Chicago, 4), (New York, 6)
Chicago    1       (San Jose, 4), (New York, 3)
New York   2       (San Jose, 6), (Chicago, 3)

Compute - shortest paths from “San Jose”

Final compute output
Id         Distance
San Jose   0
Chicago    4
New York   6

Native embedded remapping

Original graph
Id   Edges
0    (1, 4), (2, 6)
1    (0, 4), (2, 3)
2    (0, 6), (1, 3)

Partitioning mapping
Id   Mach
0    0
1    1
2    0

Original graph with mapping embedded in Id
Top bits machine, Id   Edges
0, 0                   (Chicago, 4), (New York, 6)
1, 1                   (San Jose, 4), (New York, 3)
0, 2                   (San Jose, 6), (Chicago, 3)

Compute - shortest paths from “San Jose”

Final compute output
Id   Distance
0    0
1    4
2    6

Not all graphs can leverage this technique; Facebook can, since ids are longs with unused bits.
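Embedding the machine in the top bits of a long id could look like the sketch below. The 16-bit machine width, the class name, and the method names are hypothetical; the slide only says that the top bits carry the machine and that the ids have unused bits.

```java
class EmbeddedIdSketch {
    static final int MACHINE_BITS = 16;                    // hypothetical width
    static final int ID_BITS = Long.SIZE - MACHINE_BITS;   // 48 low bits keep the original id
    static final long ID_MASK = (1L << ID_BITS) - 1;

    /** Packs the machine number into the top bits of the vertex id. */
    static long embed(int machine, long id) {
        if (machine < 0 || machine >= (1 << MACHINE_BITS) || (id & ~ID_MASK) != 0) {
            throw new IllegalArgumentException("machine or id does not fit");
        }
        return ((long) machine << ID_BITS) | id;
    }

    /** Recovers the machine number; unsigned shift so the sign bit is not smeared. */
    static int machineOf(long embedded) {
        return (int) (embedded >>> ID_BITS);
    }

    /** Recovers the original vertex id from the low bits. */
    static long originalId(long embedded) {
        return embedded & ID_MASK;
    }
}
```

With this layout a worker can route a message from the id alone, for example `machineOf(embed(1, 1))` yields machine 1 without consulting a separate mapping table.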

Remapping comparison

Explicit
Pros
• Can also add id compression
Cons
• Application aware of id remapping
• Workflow complexity
• Pre and post joins overhead

Native Transparent
Pros
• No application change, just additional input parameters
Cons
• Additional memory usage
• Group information uses more memory

Native Embedded
Pros
• Utilize unused bits
Cons
• Application changes on input type

Partitioning experiments
[Figure: 345B edge page rank, seconds per iteration (axis 0 to 160) for Random, 47% Local, and 60% Local partitioning]

Message explosion

Avoiding out-of-core
Example: Mutual friends calculation between neighbors
1. Send your friends a list of your friends
2. Intersect with your friend list
[Figure: example graph with vertices A, B, C, D, and E exchanging friend lists and recording mutual friends per neighbor]
1.23B (as of 1/2014)
200+ average friends (2011 S1)
8-byte ids (longs)
= 394 TB / 100 GB machines
3,940 machines (not including the graph)
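The 394 TB figure follows from multiplying the numbers on the slide, as the sketch below shows. It assumes exactly 200 friends per user and decimal units (1 GB = 10^9 bytes); the class and method names are illustrative.

```java
class MutualFriendsEstimate {
    static final long USERS = 1_230_000_000L;   // 1.23B, from the slide
    static final long AVG_FRIENDS = 200;        // "200+ average friends", taken as 200
    static final long ID_BYTES = 8;             // ids are longs

    /** Each user sends each friend a list of all of the user's friends. */
    static long messageBytes() {
        return USERS * AVG_FRIENDS * AVG_FRIENDS * ID_BYTES;
    }

    /** Machines needed to hold the messages, assuming 1 GB = 10^9 bytes. */
    static long machines(long gigabytesPerMachine) {
        long bytesPerMachine = gigabytesPerMachine * 1_000_000_000L;
        return (messageBytes() + bytesPerMachine - 1) / bytesPerMachine;   // round up
    }

    public static void main(String[] args) {
        System.out.println(messageBytes() / 1e12 + " TB");   // 393.6, which the slide rounds to 394
        System.out.println(machines(100) + " machines");     // 3936, which the slide rounds to 3,940
    }
}
```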

Superstep splitting
Subsets of sources/destinations edges per superstep
* Currently manual - future work automatic!
[Figure: four sub-supersteps over vertices A and B]
1. Sources: A (on), B (off). Destinations: A (on), B (off)
2. Sources: A (on), B (off). Destinations: A (off), B (on)
3. Sources: A (off), B (on). Destinations: A (on), B (off)
4. Sources: A (off), B (on). Destinations: A (off), B (on)
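The on/off pattern above can be sketched as a filter over edges: each sub-superstep enables one subset of sources and one subset of destinations, so only a fraction of the messages is in flight at a time. Assigning vertices to subsets by id modulo the number of splits, and every name below, is an assumption for illustration; the talk notes the splitting is currently configured manually.

```java
import java.util.ArrayList;
import java.util.List;

class SuperstepSplitSketch {
    /**
     * Lists the (source, destination) edges whose messages are sent in one sub-superstep.
     * A vertex belongs to subset (id % splits); the split index selects one source subset
     * and one destination subset, so splits * splits sub-supersteps cover every edge once.
     * Vertex ids are assumed to be non-negative.
     */
    static List<int[]> edgesForSplit(int[][] edges, int splits, int splitIndex) {
        int sourceSubset = splitIndex / splits;
        int destinationSubset = splitIndex % splits;
        List<int[]> selected = new ArrayList<>();
        for (int[] e : edges) {
            if (e[0] % splits == sourceSubset && e[1] % splits == destinationSubset) {
                selected.add(e);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Hypothetical 4-vertex graph; even ids form one subset, odd ids the other.
        int[][] edges = {{0, 1}, {0, 2}, {1, 2}, {3, 0}, {2, 3}, {1, 3}};
        for (int split = 0; split < 4; split++) {
            for (int[] e : edgesForSplit(edges, 2, split)) {
                System.out.println("sub-superstep " + split + ": " + e[0] + " -> " + e[1]);
            }
        }
    }
}
```

The trade-off is more sub-supersteps in exchange for a lower peak message volume, which is what lets the mutual friends job stay in memory.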

Giraph in production
Over 1.5 years in production
Hundreds of production Giraph jobs processed a week
• Lots of untracked experiments
30+ applications in our internal application repository
Sample production job - 700B+ edges
Job times range from minutes to hours

GiraphicJam demo

Giraph related projects
Graft: The distributed Giraph debugger

Giraph roadmap
• 2/12 - 0.1
• 5/13 - 1.0
• 6/14 - 1.1

The future

Scheduling jobs
Snapshot automatically after a time period and restart at end of queue
[Figure: two job timelines over time]
