Data-Intensive Distributed Computing CS 431/631 451/651 (Fall 2020)

Part 5: Analyzing Graphs (1/2)
Ali Abedi
These slides are available at https://www.student.cs.uwaterloo.ca/~cs451/
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

Structure of the Course
• “Core” framework features and algorithm design
• Analyzing Text
• Analyzing Graphs
• Analyzing Relational Data
• Data Mining

What’s a graph?
• G = (V, E), where
  • V represents the set of vertices (nodes)
  • E represents the set of edges (links)
• Edges may be directed or undirected
• Both vertices and edges may contain additional information
• Terminology: the outgoing (outbound) edges of a vertex are its outlinks, and their count is its out-degree; the incoming (inbound) edges are its inlinks, and their count is its in-degree

Examples of Graphs
• Social networks
• Hyperlink structure of the web
• Computers on the Internet
We’re mostly interested in sparse graphs!

Representing Graphs
• Adjacency matrices
• Adjacency lists
• Edge lists

Adjacency Matrices
Represent a graph as an n x n square matrix M
• n = |V|
• $M_{ij} = 1$ iff there is an edge from vertex i to vertex j

Example (4 vertices; rows are source vertices, columns are target vertices):

       1  2  3  4
   1   0  1  0  1
   2   1  0  1  1
   3   1  0  0  0
   4   1  0  1  0
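As a minimal sketch of the matrix representation, the example graph can be stored as a two-dimensional array. The object and method names below (`AdjacencyMatrixExample`, `hasEdge`, `outDegree`, `inDegree`) are illustrative assumptions, not part of the slides.

```scala
object AdjacencyMatrixExample {
  // The 4-vertex example from the slide: m(i)(j) == 1 iff there is an edge i -> j.
  // The slide numbers vertices 1..4, so array index 0 corresponds to vertex 1.
  val m: Array[Array[Int]] = Array(
    Array(0, 1, 0, 1),
    Array(1, 0, 1, 1),
    Array(1, 0, 0, 0),
    Array(1, 0, 1, 0)
  )

  // Edge lookup is a single array access.
  def hasEdge(from: Int, to: Int): Boolean = m(from - 1)(to - 1) == 1

  // Summing a row counts outlinks; summing a column counts inlinks.
  def outDegree(v: Int): Int = m(v - 1).sum
  def inDegree(v: Int): Int = m.map(row => row(v - 1)).sum

  def main(args: Array[String]): Unit = {
    println(hasEdge(1, 2)) // true
    println(outDegree(2))  // 3
    println(inDegree(1))   // 3
  }
}
```

Rows and columns give outlinks and inlinks equally cheaply, but the array always holds n x n entries regardless of how many edges exist.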

Adjacency Matrices: Critique
Advantages
• Amenable to mathematical manipulation
• Intuitive iteration over rows and columns
Disadvantages
• Lots of wasted space (for sparse matrices)

Adjacency Lists
Take the adjacency matrix… and throw away all the zeros

       1  2  3  4
   1   0  1  0  1          1: 2, 4
   2   1  0  1  1          2: 1, 3, 4
   3   1  0  0  0          3: 1
   4   1  0  1  0          4: 1, 3

We have seen this in posting lists.
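The "throw away all the zeros" step can be sketched as follows. The names `AdjacencyListExample` and `toAdjacencyLists` are hypothetical, and the matrix is the slide's 4-vertex example.

```scala
object AdjacencyListExample {
  // Same example matrix as on the slide (index 0 corresponds to vertex 1).
  val matrix: Vector[Vector[Int]] = Vector(
    Vector(0, 1, 0, 1),
    Vector(1, 0, 1, 1),
    Vector(1, 0, 0, 0),
    Vector(1, 0, 1, 0)
  )

  // For each row, keep only the column positions that hold a 1.
  def toAdjacencyLists(m: Vector[Vector[Int]]): Map[Int, List[Int]] =
    m.zipWithIndex.map { case (row, i) =>
      val neighbours = row.zipWithIndex.collect { case (1, j) => j + 1 }.toList
      (i + 1) -> neighbours
    }.toMap

  def main(args: Array[String]): Unit =
    toAdjacencyLists(matrix).toList.sortBy(_._1).foreach { case (v, ns) =>
      println(s"$v: ${ns.mkString(", ")}")
    }
}
```

Storage is now proportional to the number of edges rather than to n x n, which is what makes this form attractive for sparse graphs.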

Adjacency Lists: Critique
Advantages
• Much more compact representation (compress!)
• Easy to compute over outlinks
Disadvantages
• Difficult to compute over inlinks
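To see why inlinks are harder, note that finding who links to a vertex requires scanning every adjacency list. One possible sketch inverts the whole structure once; `InlinkExample` and `invert` are illustrative names, not from the slides.

```scala
object InlinkExample {
  // Outlink adjacency lists of the slide's example graph.
  val outlinks: Map[Int, List[Int]] =
    Map(1 -> List(2, 4), 2 -> List(1, 3, 4), 3 -> List(1), 4 -> List(1, 3))

  // Reverse every edge (src -> dst becomes dst -> src), then group by the new key.
  // This touches all edges, unlike outlink access, which is a single lookup.
  // Vertices with no inlinks do not appear in the result.
  def invert(out: Map[Int, List[Int]]): Map[Int, List[Int]] = {
    val reversed = for ((src, dsts) <- out.toList; dst <- dsts) yield (dst, src)
    reversed.groupBy(_._1).map { case (dst, pairs) => dst -> pairs.map(_._2).sorted }
  }

  def main(args: Array[String]): Unit =
    println(invert(outlinks).toList.sortBy(_._1))
}
```

The reverse-then-group shape is the same one a MapReduce job would use: the mapper emits (dst, src) and the shuffle groups by dst.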

Edge Lists
Explicitly enumerate all edges. For the same 4-vertex example graph:

   (1, 2)
   (1, 4)
   (2, 1)
   (2, 3)
   (2, 4)
   (3, 1)
   (4, 1)
   (4, 3)
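A small sketch of producing this edge list from adjacency lists, assuming the hypothetical names `EdgeListExample` and `toEdgeList`:

```scala
object EdgeListExample {
  // Adjacency lists of the slide's example graph.
  val adjacency: Map[Int, List[Int]] =
    Map(1 -> List(2, 4), 2 -> List(1, 3, 4), 3 -> List(1), 4 -> List(1, 3))

  // Emit one (source, target) pair per edge, ordered by source vertex.
  def toEdgeList(adj: Map[Int, List[Int]]): List[(Int, Int)] =
    adj.toList.sortBy(_._1).flatMap { case (src, dsts) => dsts.map(dst => (src, dst)) }

  def main(args: Array[String]): Unit =
    toEdgeList(adjacency).foreach(println)
}
```

Each pair is independent of the others, so adding an edge is just appending one more pair; the cost is that the source vertex is repeated for every edge.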

Edge Lists: Critique
Advantages
• Easily support edge insertions
Disadvantages
• Wastes space (the source vertex is repeated for every edge)

Some Graph Problems
• Finding shortest paths: routing Internet traffic and UPS trucks
• Finding minimum spanning trees: telco laying down fiber
• Finding max flow: airline scheduling
• Identifying “special” nodes and communities: halting the spread of avian flu
• Bipartite matching: match.com
• Web ranking: PageRank

What does the web look like?
Analysis of a large webgraph from the Common Crawl: 3.5 billion pages, 129 billion links
Meusel et al. Graph Structure in the Web — Revisited. WWW 2014.

What does the web look like?
Very roughly, a scale-free network
Fraction P(k) of nodes having k connections: $P(k) \sim k^{-\gamma}$
(i.e., the degree distribution follows a power law)
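The empirical counterpart of P(k) is simply the fraction of nodes with each degree. The sketch below computes it for out-degrees; the names `DegreeDistributionExample` and `degreeDistribution` are assumptions, and the tiny 4-vertex graph only illustrates the computation (it is far too small to exhibit a power law).

```scala
object DegreeDistributionExample {
  // Adjacency lists of the small example graph used earlier in the slides.
  val adjacency: Map[Int, List[Int]] =
    Map(1 -> List(2, 4), 2 -> List(1, 3, 4), 3 -> List(1), 4 -> List(1, 3))

  // For each out-degree k, the fraction of nodes that have exactly k outlinks.
  def degreeDistribution(adj: Map[Int, List[Int]]): Map[Int, Double] = {
    val n = adj.size.toDouble
    adj.values.toList
      .map(_.size)          // out-degree of every node
      .groupBy(identity)    // bucket nodes by degree
      .map { case (k, nodes) => k -> nodes.size / n }
  }

  def main(args: Array[String]): Unit =
    degreeDistribution(adjacency).toList.sortBy(_._1).foreach { case (k, p) =>
      println(s"P($k) = $p")
    }
}
```

On a real webgraph one would plot these fractions on log-log axes, where a power law appears as a roughly straight line.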

(Figure labels: Ali’s webpage, Google)

How do we extract the webgraph?
The webgraph … is big?!
Webgraph from the Common Crawl: 3.5 billion pages, 129 billion links
Meusel et al. Graph Structure in the Web — Revisited. WWW 2014.

Graphs and MapReduce (and Spark)
A large class of graph algorithms involve:
• Local computations at each node
• Propagating results: “traversing” the graph
Key questions:
• How do you represent graph data in MapReduce (and Spark)?
• How do you traverse a graph in MapReduce (and Spark)?

Single-Source Shortest Path
• Problem: find the shortest path from a source node to one or more target nodes
• Shortest might also mean lowest weight or cost
First, a refresher: Dijkstra’s Algorithm…

Dijkstra’s Algorithm Example (example from CLR)
(Figure sequence: a weighted directed graph with edge weights 10, 5, 1, 2, 3, 9, 2, 4, 6 and 7. The source has distance 0. The tentative distances of the other four nodes, read as top row then bottom row of the figure, are:)
• Initially: ∞, ∞, ∞, ∞
• After relaxing the source’s edges: 10, ∞, 5, ∞
• After the next step shown: 8, 14, 5, 7
• Final: 8, 9, 5, 7
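As a refresher in code, here is a minimal sketch of Dijkstra’s algorithm run on this example. The vertex names s, t, x, y, z and the exact edge list are an assumption taken from the CLRS textbook figure the slide credits, since the extracted slide text preserves only the weights and distance labels. The sketch selects the closest unsettled vertex by a linear scan rather than a priority queue, to keep it short.

```scala
import scala.annotation.tailrec

object DijkstraExample {
  type Graph = Map[String, List[(String, Int)]]

  // Assumed edge list (CLRS example): vertex -> list of (neighbour, weight).
  val graph: Graph = Map(
    "s" -> List("t" -> 10, "y" -> 5),
    "t" -> List("x" -> 1, "y" -> 2),
    "x" -> List("z" -> 4),
    "y" -> List("t" -> 3, "x" -> 9, "z" -> 2),
    "z" -> List("s" -> 7, "x" -> 6)
  )

  def dijkstra(g: Graph, source: String): Map[String, Int] = {
    @tailrec
    def loop(dist: Map[String, Int], settled: Set[String]): Map[String, Int] = {
      val frontier = dist.filter { case (v, _) => !settled(v) }
      if (frontier.isEmpty) dist
      else {
        // Global state: always expand the unsettled vertex with the smallest distance.
        val (u, du) = frontier.minBy(_._2)
        // Relax every outgoing edge of u.
        val relaxed = g.getOrElse(u, Nil).foldLeft(dist) { case (acc, (v, w)) =>
          if (du + w < acc.getOrElse(v, Int.MaxValue)) acc.updated(v, du + w) else acc
        }
        loop(relaxed, settled + u)
      }
    }
    loop(Map(source -> 0), Set.empty)
  }

  def main(args: Array[String]): Unit =
    println(dijkstra(graph, "s").toList.sortBy(_._1))
}
```

Under the assumed edge list the final distances are t = 8, x = 9, y = 5, z = 7, matching the last step shown on the slides.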

Single-Source Shortest Path
• Problem: find the shortest path from a source node to one or more target nodes
• Shortest might also mean lowest weight or cost
• Single processor machine: Dijkstra’s Algorithm
• MapReduce: parallel breadth-first search (BFS)

Finding the Shortest Path
Consider the simple case of equal edge weights
The solution to the problem can be defined inductively:
• Define: b is reachable from a if b is on the adjacency list of a
• DistanceTo(s) = 0
• For all nodes p reachable from s, DistanceTo(p) = 1
• For all nodes n reachable from some other set of nodes M, DistanceTo(n) = 1 + min(DistanceTo(m), m ∈ M)
(Figure: node n is reached from the source s through nodes m1, m2, m3, which lie at distances d1, d2, d3 from s.)

Visualizing Parallel BFS
(Figure: a graph with nodes n0 through n9.)

From Intuition to Algorithm
• Data representation:
  • Key: node n
  • Value: d (distance from start), adjacency list
  • Initialization: for all nodes except for the start node, d = ∞
• Mapper:
  • ∀ m ∈ adjacency list: emit (m, d + 1)
• Sort/Shuffle:
  • Groups distances by reachable nodes
• Reducer:
  • Selects the minimum distance path for each reachable node
  • Additional bookkeeping needed to keep track of the actual path

Multiple Iterations Needed
• Each MapReduce iteration advances the “frontier” by one hop
  • Subsequent iterations include more reachable nodes as the frontier expands
  • Multiple iterations are needed to explore the entire graph
• Preserving graph structure:
  • Problem: Where did the adjacency list go?
  • Solution: mapper emits (n, adjacency list) as well

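BFS Pseudo-Code

class Mapper {
  def map(id: Long, n: Node) = {
    emit(id, n)  // emit graph structure so the adjacency list survives the iteration
    val d = n.distance
    for (m <- n.adjacencyList) {
      emit(m, d + 1)  // candidate distance for each neighbour
    }
  }
}

class Reducer {
  def reduce(id: Long, objects: Iterable[Object]) = {
    var min = infinity
    var m = null
    for (d <- objects) {
      if (isNode(d)) m = d        // recover the node structure
      else if (d < min) min = d   // track the smallest candidate distance
    }
    m.distance = math.min(m.distance, min)  // keep the node's own distance if it is already smaller
    emit(id, m)
  }
}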
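The pseudo-code can be exercised without a cluster by simulating one map, shuffle and reduce round on ordinary Scala collections. This is only a sketch under stated assumptions: `Node`, `Message`, `mapper`, `reducer` and `iterate` are hypothetical names, `groupBy` stands in for the shuffle, and `Int.MaxValue` stands in for infinity.

```scala
object ParallelBfsIteration {
  // Hypothetical stand-in for the Node type of the pseudo-code.
  final case class Node(distance: Int, adjacencyList: List[Int])
  // A value is either the graph structure (Left) or a candidate distance (Right).
  type Message = Either[Node, Int]
  val Infinity: Int = Int.MaxValue

  def mapper(id: Int, n: Node): List[(Int, Message)] = {
    val structure: (Int, Message) = (id, Left(n)) // pass the graph structure along
    val distances: List[(Int, Message)] =
      if (n.distance == Infinity) Nil // unreached nodes have nothing to propagate
      else n.adjacencyList.map(m => (m, Right(n.distance + 1)))
    structure :: distances
  }

  def reducer(id: Int, values: List[Message]): (Int, Node) = {
    // Every node emits its own structure, so exactly one Left is present per key.
    val node = values.collectFirst { case Left(n) => n }.get
    val best = values.foldLeft(node.distance) {
      case (min, Right(d)) => math.min(min, d)
      case (min, Left(_))  => min
    }
    (id, node.copy(distance = best))
  }

  // One iteration: map every node, group by key (the shuffle), reduce every group.
  def iterate(graph: Map[Int, Node]): Map[Int, Node] =
    graph.toList
      .flatMap { case (id, n) => mapper(id, n) }
      .groupBy(_._1)
      .map { case (id, kvs) => reducer(id, kvs.map(_._2)) }

  // The 4-vertex example graph from earlier slides, with vertex 1 as the source.
  val start: Map[Int, Node] = Map(
    1 -> Node(0, List(2, 4)),
    2 -> Node(Infinity, List(1, 3, 4)),
    3 -> Node(Infinity, List(1)),
    4 -> Node(Infinity, List(1, 3))
  )

  def main(args: Array[String]): Unit = {
    val once = iterate(start)
    val twice = iterate(once)
    println(once.map { case (id, n) => id -> n.distance })
    println(twice.map { case (id, n) => id -> n.distance })
  }
}
```

After one iteration vertices 2 and 4 are at distance 1 and vertex 3 is still unreached; the second iteration brings vertex 3 to distance 2, which illustrates the frontier advancing one hop per round.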

Stopping Criterion (equal edge weight)
• How many iterations are needed in parallel BFS?
• Convince yourself: when a node is first “discovered”, we’ve found the shortest path
• What does it have to do with six degrees of separation?
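One way to explore the iteration count is a driver loop that repeats the BFS round until no distance changes. The sketch below collapses map, shuffle and reduce into a single local `step` function; `BfsDriver`, `step` and `run` are illustrative names, and the chain graph is a made-up test input.

```scala
import scala.annotation.tailrec

object BfsDriver {
  val Infinity: Int = Int.MaxValue
  type Graph = Map[Int, List[Int]]

  // One round: every reached node offers (distance + 1) to each neighbour,
  // and each node keeps the minimum of its current distance and all offers.
  def step(g: Graph, dist: Map[Int, Int]): Map[Int, Int] = {
    val offers = for {
      (n, d) <- dist.toList
      if d != Infinity
      m <- g.getOrElse(n, Nil)
    } yield (m, d + 1)
    offers.foldLeft(dist) { case (acc, (m, d)) =>
      if (d < acc.getOrElse(m, Infinity)) acc.updated(m, d) else acc
    }
  }

  // Returns the final distances and the number of rounds executed,
  // counting the last round that only detects that nothing changed.
  def run(g: Graph, source: Int): (Map[Int, Int], Int) = {
    @tailrec
    def loop(dist: Map[Int, Int], rounds: Int): (Map[Int, Int], Int) = {
      val next = step(g, dist)
      if (next == dist) (dist, rounds + 1) else loop(next, rounds + 1)
    }
    loop(g.keys.map(n => n -> (if (n == source) 0 else Infinity)).toMap, 0)
  }

  // Hypothetical input: a chain 1 -> 2 -> 3 -> 4.
  val chain: Graph = Map(1 -> List(2), 2 -> List(3), 3 -> List(4), 4 -> Nil)

  def main(args: Array[String]): Unit =
    println(run(chain, 1))
}
```

On the chain the farthest node is three hops away, so three rounds discover everything and a fourth confirms convergence. With equal edge weights, the number of productive rounds equals the largest hop distance from the source.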

Frontier size during BFS traversal (figure)

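Implementation Practicalities
(Figure: each iteration reads the graph from HDFS, runs map and reduce, and writes the result back to HDFS; a convergence check decides whether to run another iteration.)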

Comparison to Dijkstra
• Dijkstra’s algorithm is more efficient
  • At each step, it only pursues edges from the minimum-cost path inside the frontier
• MapReduce explores all paths in parallel
  • Lots of “waste”
  • Useful work is only done at the “frontier”
• Why can’t we do better using MapReduce?
  • We can’t do better because we cannot keep a global state like Dijkstra does.

Single Source: Weighted Edges
• Now add positive weights to the edges
• Simple change: add weight w for each edge in the adjacency list
  • In the mapper, emit (m, d + w_p) instead of (m, d + 1) for each node m
• That’s it?

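Stopping Criterion (positive edge weight)
• How many iterations are needed in parallel BFS?
• Is it still true that when a node is first “discovered”, we’ve found the shortest path? Not necessarily: a path with more hops can have a lower total weight, so a node’s distance may still decrease in later iterations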

Additional Complexities
(Figure: a search frontier around the source s, with nodes p, q, r and n1 through n9; the edges shown have weight 1 except for one edge of weight 10.)
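The complication can be reproduced on a tiny made-up graph: a direct edge of weight 10 reaches r in one hop, while a three-hop path of weight-1 edges reaches it more cheaply. The sketch below is an assumption-laden illustration (the graph, the node name a and b, and the names `WeightedParallelBfs` and `step` are hypothetical); it uses the weighted rule of emitting d + w instead of d + 1.

```scala
object WeightedParallelBfs {
  val Infinity: Int = Int.MaxValue
  type Graph = Map[String, List[(String, Int)]]

  // One round of the weighted variant: offer (distance + edge weight) to each neighbour.
  def step(g: Graph, dist: Map[String, Int]): Map[String, Int] = {
    val offers = for {
      (n, d) <- dist.toList
      if d != Infinity
      (m, w) <- g.getOrElse(n, Nil)
    } yield (m, d + w)
    offers.foldLeft(dist) { case (acc, (m, d)) =>
      if (d < acc.getOrElse(m, Infinity)) acc.updated(m, d) else acc
    }
  }

  // Hypothetical graph: s -> r costs 10 directly, but s -> a -> b -> r costs 3.
  val graph: Graph = Map(
    "s" -> List("r" -> 10, "a" -> 1),
    "a" -> List("b" -> 1),
    "b" -> List("r" -> 1),
    "r" -> Nil
  )
  val start: Map[String, Int] =
    Map("s" -> 0, "a" -> Infinity, "b" -> Infinity, "r" -> Infinity)

  def main(args: Array[String]): Unit = {
    val r1 = step(graph, start)
    val r2 = step(graph, r1)
    val r3 = step(graph, r2)
    println(List(r1("r"), r2("r"), r3("r")))
  }
}
```

Node r is discovered in the first round with distance 10 and only drops to 3 in the third round, so "first discovered" no longer means "shortest"; iterating until no distance changes is the safe stopping rule.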
