Design Patterns for Efficient Graph Algorithms in MapReduce


SLIDE 1

Design Patterns for Efficient Graph Algorithms in MapReduce

Jimmy Lin and Michael Schatz
University of Maryland
Tuesday, June 29, 2010

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

SLIDE 2

@lintool

SLIDE 3

Talk Outline

Graph algorithms
Graph algorithms in MapReduce

Making it efficient
Experimental results

SLIDE 4

What’s a graph?

G = (V, E), where

V represents the set of vertices (nodes)
E represents the set of edges (links)
Both vertices and edges may contain additional information

Graphs are everywhere:

E.g., hyperlink structure of the web, interstate highway system, social networks, etc.

Graph problems are everywhere:

E.g., random walks, shortest paths, MST, max flow, bipartite matching, clustering, etc.

SLIDE 5

Source: Wikipedia (Königsberg)

SLIDE 6

Graph Representation

G = (V, E)
Typically represented as adjacency lists:

Each node is associated with its neighbors (via outgoing edges)

[Figure: a four-node directed graph and its adjacency lists]

1: 2, 4
2: 1, 3, 4
3: 1
4: 1, 3
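
As a concrete illustration (a minimal Java sketch, not from the talk; the class and the text format parser are assumptions), a node record in this adjacency-list format might look like:

    import java.util.ArrayList;
    import java.util.List;

    // A node id plus the ids of its neighbors (via outgoing edges).
    public class AdjacencyList {
        public final int nodeId;
        public final List<Integer> neighbors = new ArrayList<Integer>();

        public AdjacencyList(int nodeId) {
            this.nodeId = nodeId;
        }

        // Parses a line in the "1: 2, 4" format shown above.
        public static AdjacencyList parse(String line) {
            String[] parts = line.split(":");
            AdjacencyList a = new AdjacencyList(Integer.parseInt(parts[0].trim()));
            for (String n : parts[1].split(",")) {
                a.neighbors.add(Integer.parseInt(n.trim()));
            }
            return a;
        }
    }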

SLIDE 7

“Message Passing” Graph Algorithms

Large class of iterative algorithms on sparse, directed graphs

At each iteration:

Computations at each vertex
Partial results (“messages”) passed (usually) along directed edges
Computations at each vertex: messages aggregate to alter state

Iterate until convergence

SLIDE 8

A Few Examples…

Parallel breadth-first search (SSSP)

Messages are distances from source
Each node emits current distance + 1
Aggregation = MIN
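
A minimal Hadoop sketch of one such BFS iteration (a hypothetical reconstruction, not the authors' code; the "distance|neighbors" record format is an assumption, and re-emitting the graph structure is omitted):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // One BFS iteration. Input records: nodeId -> "distance|n1,n2,...".
    public class BfsIteration {

      public static class BfsMapper
          extends Mapper<IntWritable, Text, IntWritable, IntWritable> {
        @Override
        protected void map(IntWritable id, Text value, Context context)
            throws IOException, InterruptedException {
          String[] parts = value.toString().split("\\|");
          int distance = Integer.parseInt(parts[0]);
          // Message: each neighbor is reachable in distance + 1 steps.
          for (String neighbor : parts[1].split(",")) {
            context.write(new IntWritable(Integer.parseInt(neighbor)),
                          new IntWritable(distance + 1));
          }
        }
      }

      public static class BfsReducer
          extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
        @Override
        protected void reduce(IntWritable id, Iterable<IntWritable> messages,
            Context context) throws IOException, InterruptedException {
          // Aggregation = MIN: keep the shortest distance received.
          int min = Integer.MAX_VALUE;
          for (IntWritable d : messages) {
            min = Math.min(min, d.get());
          }
          context.write(id, new IntWritable(min));
        }
      }
    }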

PageRank

Messages are partial PageRank mass
Each node evenly distributes mass to neighbors
Aggregation = SUM

DNA Sequence assembly

Michael Schatz’s dissertation

SLIDE 9

PageRank in a nutshell….

Random surfer model:

User starts at a random Web page
User randomly clicks on links, surfing from page to page
With some probability, user randomly jumps around

PageRank…

Characterizes the amount of time spent on any given page
Mathematically, a probability distribution over pages

SLIDE 10

PageRank: Defined

Given page x with inlinks t1…tn, where:

C(t) is the out-degree of t
α is the probability of a random jump
N is the total number of nodes in the graph

PR(x) = \alpha \left( \frac{1}{N} \right) + (1 - \alpha) \sum_{i=1}^{n} \frac{PR(t_i)}{C(t_i)}

[Figure: page x with inlinks t1, t2, …, tn]

SLIDE 11

Sample PageRank Iteration (1)

[Figure: PageRank iteration 1 on a five-node graph; values before → after: n1: 0.2 → 0.066, n2: 0.2 → 0.166, n3: 0.2 → 0.166, n4: 0.2 → 0.3, n5: 0.2 → 0.3]
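
As a check (a worked instance of the formula, assuming α = 0, which is what these example numbers imply), n4's new value in iteration 1 comes from its in-links n1 (out-degree 2) and n3 (out-degree 1):

    PR(n_4) = \frac{PR(n_1)}{C(n_1)} + \frac{PR(n_3)}{C(n_3)} = \frac{0.2}{2} + \frac{0.2}{1} = 0.3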

SLIDE 12

Sample PageRank Iteration (2)

[Figure: PageRank iteration 2; values before → after: n1: 0.066 → 0.1, n2: 0.166 → 0.133, n3: 0.166 → 0.183, n4: 0.3 → 0.2, n5: 0.3 → 0.383]

SLIDE 13

PageRank in MapReduce

[Figure: one PageRank iteration as a MapReduce job. Input adjacency lists n1 [n2, n4], n2 [n3, n5], n3 [n4], n4 [n5], n5 [n1, n2, n3]; mappers emit PageRank mass to each node's neighbors, and reducers aggregate the incoming mass per node and write out the updated graph]

SLIDE 14

PageRank Pseudo-Code
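
The pseudo-code image on this slide did not survive extraction. Below is a minimal sketch of the map/reduce pair it describes, simplified to omit the random-jump term, dangling-node handling, and re-emission of the graph structure; the "mass|neighbors" record format is an assumption:

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // One PageRank iteration. Input records: nodeId -> "mass|n1,n2,...".
    public class PageRankIteration {

      public static class PageRankMapper
          extends Mapper<IntWritable, Text, IntWritable, DoubleWritable> {
        @Override
        protected void map(IntWritable id, Text value, Context context)
            throws IOException, InterruptedException {
          String[] parts = value.toString().split("\\|");
          double mass = Double.parseDouble(parts[0]);
          String[] neighbors = parts[1].split(",");
          // Distribute this node's PageRank mass evenly over its out-links.
          DoubleWritable share = new DoubleWritable(mass / neighbors.length);
          for (String neighbor : neighbors) {
            context.write(new IntWritable(Integer.parseInt(neighbor)), share);
          }
        }
      }

      public static class PageRankReducer
          extends Reducer<IntWritable, DoubleWritable, IntWritable, DoubleWritable> {
        @Override
        protected void reduce(IntWritable id, Iterable<DoubleWritable> masses,
            Context context) throws IOException, InterruptedException {
          // Aggregation = SUM: new mass is the sum of incoming messages.
          double sum = 0.0;
          for (DoubleWritable m : masses) {
            sum += m.get();
          }
          context.write(id, new DoubleWritable(sum));
        }
      }
    }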

SLIDE 15

Why don’t distributed algorithms scale?

SLIDE 16

Source: http://www.flickr.com/photos/fusedforces/4324320625/

SLIDE 17

Three Design Patterns

In-mapper combining: efficient local aggregation
Smarter partitioning: create more opportunities for local aggregation
Schimmy: avoid shuffling the graph

SLIDE 18

In-Mapper Combining

Use combiners

Perform local aggregation on map output
Downside: intermediate data is still materialized

Better: in-mapper combining

Preserve state across multiple map calls, aggregate messages in a buffer, emit buffer contents at the end

Downside: requires memory management

[Figure: mapper lifecycle with an in-memory buffer: configure → map → close]
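
A sketch of the pattern using the configure/map/close lifecycle named above (old Hadoop 0.20 API; hypothetical, with the same "mass|neighbors" record format as the earlier PageRank sketch; a production version would also flush the buffer whenever it grows too large, which is the memory-management downside):

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // PageRank mapper with in-mapper combining: messages are summed in an
    // in-memory buffer across map() calls and emitted once, in close().
    public class InMapperCombiningMapper extends MapReduceBase
        implements Mapper<IntWritable, Text, IntWritable, DoubleWritable> {

      private Map<Integer, Double> buffer;
      private OutputCollector<IntWritable, DoubleWritable> out;

      @Override
      public void configure(JobConf job) {
        buffer = new HashMap<Integer, Double>();  // state across map() calls
      }

      public void map(IntWritable id, Text value,
          OutputCollector<IntWritable, DoubleWritable> output, Reporter reporter)
          throws IOException {
        out = output;  // keep a reference for close()
        String[] parts = value.toString().split("\\|");
        double mass = Double.parseDouble(parts[0]);
        String[] neighbors = parts[1].split(",");
        for (String neighbor : neighbors) {
          int n = Integer.parseInt(neighbor);
          Double sum = buffer.get(n);
          // Aggregate locally instead of emitting one message per edge.
          buffer.put(n, (sum == null ? 0.0 : sum) + mass / neighbors.length);
        }
      }

      @Override
      public void close() throws IOException {
        // Emit the aggregated messages once, at the end of the map task.
        for (Map.Entry<Integer, Double> e : buffer.entrySet()) {
          out.collect(new IntWritable(e.getKey()), new DoubleWritable(e.getValue()));
        }
      }
    }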

SLIDE 19

Better Partitioning

Default: hash partitioning

Randomly assign nodes to partitions

Observation: many graphs exhibit local structure

E.g., communities in social networks
Better partitioning creates more opportunities for local aggregation

Unfortunately… partitioning is hard!

Sometimes a chicken-and-egg problem
But in some domains (e.g., webgraphs), we can take advantage of cheap heuristics

For webgraphs: range partition on domain-sorted URLs
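
A hypothetical sketch of such a partitioner, assuming node ids 0..N-1 were assigned in domain-sorted URL order so that contiguous id ranges correspond to the same domains (in practice N would come from the job configuration):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Range partitioner: contiguous blocks of node ids go to the same
    // reducer, so pages from the same domain land in the same partition.
    public class RangePartitioner extends Partitioner<IntWritable, Writable> {
      // Hypothetical graph size (roughly the 50.2m pages in the experiments
      // later in the talk); hard-coded here only for illustration.
      private static final int NUM_NODES = 50000000;

      @Override
      public int getPartition(IntWritable nodeId, Writable value, int numPartitions) {
        // long arithmetic avoids overflow for large id * numPartitions.
        int p = (int) (((long) nodeId.get() * numPartitions) / NUM_NODES);
        return Math.min(p, numPartitions - 1);
      }
    }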

SLIDE 20

Schimmy Design Pattern

Basic implementation contains two dataflows:

Messages (actual computations) Graph structure (“bookkeeping”)

Schimmy: separate the two dataflows, shuffle only the messages

Basic idea: merge join between graph structure and messages

Both relations consistently partitioned and sorted by the join key

[Figure: relations S and T divided into consistent partitions S1/T1, S2/T2, S3/T3 for a parallel merge join]

SLIDE 21

Do the Schimmy!

Schimmy = reduce-side parallel merge join between graph structure and messages

Consistent partitioning between input and intermediate data
Mappers emit only messages (actual computation)
Reducers read graph structure directly from HDFS

[Figure: three reducers, each merge-joining its partition of intermediate data (messages) with the corresponding graph-structure partition (S1/T1, S2/T2, S3/T3) read directly from HDFS]
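
A sketch of what such a reducer might look like (hypothetical, not the authors' code; the "graph.path" setting, part-file naming, and "id|neighbors" record format are assumptions, and nodes that receive no messages are not handled):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mappers shuffle only messages; each reducer merge-joins its message
    // stream with the matching graph partition read directly from HDFS.
    public class SchimmyReducer
        extends Reducer<IntWritable, DoubleWritable, IntWritable, Text> {

      private BufferedReader graph;  // this reducer's graph partition, sorted by node id

      @Override
      protected void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        // Consistent partitioning: reducer i reads graph partition i.
        int partition = context.getTaskAttemptID().getTaskID().getId();
        Path path = new Path(conf.get("graph.path"),
            String.format("part-%05d", partition));
        graph = new BufferedReader(
            new InputStreamReader(FileSystem.get(conf).open(path)));
      }

      @Override
      protected void reduce(IntWritable id, Iterable<DoubleWritable> messages,
          Context context) throws IOException, InterruptedException {
        double sum = 0.0;
        for (DoubleWritable m : messages) {
          sum += m.get();  // aggregation = SUM, as in PageRank
        }
        // Merge join: both streams are sorted by node id, so advancing the
        // graph reader to the current key is sequential I/O, never a shuffle.
        String node;
        while ((node = graph.readLine()) != null) {
          String[] parts = node.split("\\|", 2);  // records: "id|n1,n2,..."
          if (Integer.parseInt(parts[0]) == id.get()) {
            // Reattach the adjacency list to the updated PageRank mass.
            context.write(id, new Text(sum + "|" + parts[1]));
            break;
          }
          // (Graph nodes that received no messages would be emitted here.)
        }
      }

      @Override
      protected void cleanup(Context context) throws IOException {
        graph.close();
      }
    }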

SLIDE 22

Experiments

Cluster setup:

10 workers, each with 2 cores (3.2 GHz Xeon), 4 GB RAM, 367 GB disk
Hadoop 0.20.0 on RHELS 5.3

Dataset:

First English segment of ClueWeb09 collection
50.2m web pages (1.53 TB uncompressed, 247 GB compressed)
Extracted webgraph: 1.4 billion links, 7.0 GB
Dataset arranged in crawl order

Setup:

Measured per-iteration running time (5 iterations)
100 partitions

SLIDE 23

Results

[Figure: per-iteration running time of the “best practices” baseline]

SLIDE 24

Results

[Figure: running-time chart; annotation: +18%; intermediate messages: 1.4b → 674m]

SLIDE 25

Results

[Figure: running-time chart; annotations: +18%, -15%; intermediate messages: 1.4b → 674m]

SLIDE 26

Results

[Figure: running-time chart; annotations: +18%, -15%, -60%; intermediate messages: 1.4b → 674m → 86m]

SLIDE 27

Results

[Figure: running-time chart; annotations: +18%, -15%, -60%, -69%; intermediate messages: 1.4b → 674m → 86m]

SLIDE 28

Take-Away Messages

Lots of interesting graph problems!

Social network analysis
Bioinformatics

Reducing intermediate data is key

Local aggregation
Better partitioning
Less bookkeeping

SLIDE 29

Complete details in Jimmy Lin and Michael Schatz. Design Patterns for Efficient Graph Algorithms in MapReduce. Proceedings of the 2010 Workshop on Mining and Learning with Graphs (MLG-2010), July 2010, Washington, D.C.

http://mapreduce.me/
Source code available in Cloud9: http://cloud9lib.org/

@lintool