SLIDE 1

Graph Processing

Marco Serafini

COMPSCI 532 Lecture 9

SLIDE 2

Graph Analytics

Marco Serafini
SLIDE 3

Scaling Graph Algorithms

  • Type of algorithms
      • PageRank
      • Shortest path
      • Clustering
      • Connected components
  • Requirements
      • Support for in-memory iterative computation
      • Scaling to large graphs in a distributed system
      • Fault tolerance
SLIDE 4

SLIDE 5

Why Pregel

  • Existing solutions unsuitable for large graphs
      • Custom distributed system → hard to implement right
      • MapReduce → poor support for iterative computation
      • Single-node libraries → don't scale
      • Distributed libraries → not fault tolerant
SLIDE 6

“Think Like a Vertex”

  • Vertex in input graph = stateful worker thread
  • Each vertex executes the same UDF
  • Vertices send messages to other vertices
      • Typically neighbors in the input graph, but not necessarily
  • Easy to scale to large graphs: partition by vertex
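A minimal sketch of what such a stateful vertex worker could look like. Pregel's real API is C++; the class and method names below (Vertex, compute, send_message, vote_to_halt) are illustrative assumptions that the later examples reuse:

    class Vertex:
        # Hypothetical vertex worker, loosely modeled on the "think like a
        # vertex" idea; these names are illustrative, not Pregel's real API.
        def __init__(self, vertex_id, value, out_edges):
            self.id = vertex_id
            self.value = value          # per-vertex state, kept across supersteps
            self.out_edges = out_edges  # list of (target vertex ID, edge value)
            self.active = True
            self.outbox = []            # messages produced during this superstep

        def send_message(self, target_id, message):
            self.outbox.append((target_id, message))

        def vote_to_halt(self):
            self.active = False         # reactivated if a message arrives later

        def compute(self, messages):
            # The user-defined function: called once per superstep on every
            # active vertex, with the messages sent to it in the previous superstep.
            raise NotImplementedError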
SLIDE 7

Complexities of Graph Processing

  • Poor locality of memory access
  • Little work per vertex
  • Changing degree of parallelism during execution
SLIDE 8

Bulk Synchronous Parallel Model

  • Computation is a sequence of supersteps
  • At each superstep, each process (vertex):
      • Consumes its input messages using the UDF
      • Updates its state
      • Changes the topology (if needed)
      • Sends output messages (typically to neighbors)
SLIDE 9

Termination

  • Vertices can vote to halt and deactivate themselves
  • A vertex is re-activated when it receives a message
  • Termination: no more active vertices
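Putting the superstep and termination rules together, a toy single-process driver loop over the hypothetical Vertex class sketched earlier (not how Pregel's distributed runtime is actually structured):

    from collections import defaultdict

    def run_supersteps(vertices):
        # vertices: dict mapping vertex ID -> Vertex instance.
        # Runs until no vertex is active and no messages are in flight,
        # which is the Pregel termination condition.
        inbox = defaultdict(list)
        while any(v.active for v in vertices.values()) or inbox:
            outbox = defaultdict(list)
            for v in vertices.values():
                msgs = inbox.pop(v.id, [])
                if msgs:
                    v.active = True      # an incoming message reactivates a halted vertex
                if v.active:
                    v.outbox = []
                    v.compute(msgs)      # the user-defined per-superstep logic
                    for target, message in v.outbox:
                        outbox[target].append(message)
            inbox = outbox               # barrier: deliver messages in the next superstep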
SLIDE 10

Exercise: Connected Components

  • (Strongly) connected component
      • Each vertex can reach every other vertex
  • How to implement it in Pregel?
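One common answer, sketched below, is min-label propagation, which computes the connected components of an undirected graph (strongly connected components of a directed graph need a more involved algorithm). It assumes the hypothetical Vertex class from the earlier sketch:

    class CCVertex(Vertex):
        # Construct as CCVertex(vid, vid, out_edges): the value starts as the
        # vertex's own ID and converges to the smallest ID in its component.
        # Assumes an undirected graph, i.e. every edge appears in both directions.
        def compute(self, messages):
            first_superstep = not messages and self.value == self.id
            candidate = min(messages) if messages else self.value
            if candidate < self.value or first_superstep:
                self.value = min(self.value, candidate)
                for target, _ in self.out_edges:
                    self.send_message(target, self.value)
            self.vote_to_halt()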
SLIDE 11

Exercise: SSSP

  • Single-Source Shortest Path (SSSP)
      • Given one source vertex
      • Find the shortest path from the source to every other vertex
      • Distance: weighted edges (positive weights)
  • How to implement it in Pregel?
SLIDE 12

SSSP

  • Input: graph (weighted edges), source vertex
  • Output: minimum distance between the source and all other vertices
  • TLV implementation (vertex code):
      • Receive distances from neighbors, extract the minimum
      • If the minimum is smaller than the current distance
          • Replace the current distance with the minimum
          • For each edge, send current distance + edge weight
      • Halt
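A Python rendering of that vertex code, again built on the hypothetical Vertex class (the constructor and attribute names are assumptions, not Pregel's API):

    import math

    class SSSPVertex(Vertex):
        # Single-source shortest paths with positive edge weights. Every vertex
        # starts with value = infinity; is_source marks the source vertex.
        def __init__(self, vertex_id, out_edges, is_source=False):
            super().__init__(vertex_id, math.inf, out_edges)
            self.is_source = is_source

        def compute(self, messages):
            # Best distance known this superstep: 0 for the source, otherwise
            # the smallest distance proposed by any in-neighbor.
            min_dist = 0 if self.is_source else math.inf
            min_dist = min([min_dist] + list(messages))
            if min_dist < self.value:
                self.value = min_dist
                for target, weight in self.out_edges:
                    self.send_message(target, min_dist + weight)
            self.vote_to_halt()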

SLIDE 13

Example of TLV Run

[Figure: example SSSP run in the TLV model, superstep by superstep]
  • Superstep 0: message values = 2 and 4
  • Superstep 1: message values = 4, 3, and 8
  • Superstep 2: message values = 6 and 7
  • Superstep 3: complete, no new messages


SLIDE 14

Matrix-Vector Multiplication in TLV

[Figure: one matrix-vector step in the TLV model. In superstep i, vertices 2 and 3 send their importance values i2 and i3 to vertex 1 (their links point to v1); in superstep i+1, vertex 1 sums its inputs, a12 * i2 + a13 * i3, where the a's are entries of the transposed adjacency matrix, and this sum becomes its new state; in superstep i+2 it sends the new state to its neighbors. In matrix form: new importance = transposed adjacency matrix * old importance.]
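A small NumPy check of the equivalence the figure illustrates, with made-up matrix values: one superstep of message passing and summation is one matrix-vector product.

    import numpy as np

    A = np.array([[0.0, 0.5, 0.2],   # A[v][u]: weight of the link u -> v
                  [0.3, 0.0, 0.8],   # (the transposed adjacency matrix)
                  [0.7, 0.5, 0.0]])
    importance = np.array([1.0, 1.0, 1.0]) / 3

    # Vertex-centric view: each vertex u sends A[v][u] * importance[u] to v,
    # and v sums the messages it receives.
    vertex_centric = np.array([
        sum(A[v][u] * importance[u] for u in range(3) if A[v][u] != 0.0)
        for v in range(3)
    ])

    # Matrix view: the same update as a single matrix-vector product.
    matrix_view = A @ importance

    assert np.allclose(vertex_centric, matrix_view)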

  • PageRank has a similar structure
      • But can use non-linear functions (UDFs)
SLIDE 15

Advantages over MapReduce

  • Pregel has stateful workers
  • MapReduce does not
  • How would you implement the previous algorithms using MapReduce?

SLIDE 16

Pregel System

  • Input partitioning: vertices → partitions → workers
      • Custom partitioning allowed
      • Multiple partitions per worker for load balance
  • Master controls
      • Global execution flow and barriers
      • Checkpointing and recovery
  • Message passing
      • Local: updates to shared memory
      • Distributed: asynchronous message passing
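A minimal sketch of the default vertex-to-partition and partition-to-worker assignment (the modulo scheme below is an assumption consistent with hash partitioning, not a quote of Pregel's code):

    def partition_of(vertex_id, num_partitions):
        # Default scheme: hash the vertex ID modulo the number of partitions.
        # A custom function (e.g. co-locating pages of one web site) can replace this.
        return hash(vertex_id) % num_partitions

    def worker_of(partition_id, num_workers):
        # Several partitions can be assigned to one worker for load balance.
        return partition_id % num_workers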
SLIDE 17

State Management

  • Accessing state from a Worker
      • State encapsulated in a VertexValue object
      • Explicit methods to get and modify the value
  • Q: Why this design?
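One possible reading, sketched below with illustrative names: hiding the value behind explicit accessors lets the runtime observe every modification, e.g. to know what changed for checkpointing.

    class VertexValue:
        # Illustrative wrapper: the vertex state is reachable only through
        # explicit accessors, so the runtime can observe every modification.
        def __init__(self, value):
            self._value = value
            self._dirty = False   # e.g. tells the system what needs checkpointing

        def get_value(self):
            return self._value

        def set_value(self, new_value):
            self._dirty = True
            self._value = new_value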
SLIDE 18

Combiners and Aggregators

  • Combiners
      • Similar to MapReduce
      • Aggregate multiple messages to the same recipient from the same server into a single message
      • Also executed at the receiver side to save space
  • Aggregators
      • Master collects data from vertices at the end of a superstep
      • Workers aggregate locally and use a tree-based structure to aggregate up to the master
      • The result is broadcast to all vertices before the next superstep
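For example, hedged sketches of a min-combiner for the SSSP messages above and of a sum aggregator (the function names are illustrative, not Pregel's API):

    def sssp_combiner(messages_for_one_vertex):
        # For shortest paths only the smallest proposed distance matters, so all
        # messages headed to the same recipient collapse into one message.
        return [min(messages_for_one_vertex)]

    def sum_aggregator(values):
        # Each worker reduces its local contributions; partial results are then
        # combined up a tree to the master, which broadcasts the global value
        # to all vertices before the next superstep.
        return sum(values)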
SLIDE 19

Topology Mutations

  • Need to guarantee determinism
      • But mutations might be conflicting
  • Criteria
      • Mutations arbitrated by the interested vertex
      • Partial ordering among mutations
      • User-defined arbitration
SLIDE 20

Fault Tolerance

  • Option 1: Checkpointing and rollback
  • Option 2: Confined recovery
      • Log messages
      • Does not require global rollback
SLIDE 21

Beyond Pregel

SLIDE 22

Problem: Graphs are Skewed!

  • A few vertices have very high degree; for those vertices:
      • Long time to process all incoming messages
      • Lots of output messages
      • Lots of edge metadata to keep
SLIDE 23

Gather-Apply-Scatter (PowerGraph)

  • Replicate high degree vertices
  • Gather, Apply, Scatter (GAS)
  • Edge-centric: updates computed per edge

[Figure: a high-degree vertex split between Machine 1 and Machine 2. (1) Gather runs on both the master and the mirror copy; (2) the mirror sends its accumulator (partial sum) to the master; (3) Apply updates the vertex data once; (4) the updated vertex data is sent back to the mirror; (5) Scatter runs on both machines.]
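A hedged sketch of the GAS functions for a PageRank-like update (the signatures and names are illustrative, not PowerGraph's actual API; vertex data is assumed to be a (rank, out_degree) pair):

    DAMPING = 0.85   # assumed damping factor, not taken from the slide

    def gather(src_data, edge_data, dst_data):
        # Runs once per in-edge, possibly on several machines in parallel;
        # because results are combined with gather_sum, each machine only has
        # to ship a partial sum for a high-degree vertex.
        rank, out_degree = src_data
        return rank / max(out_degree, 1)

    def gather_sum(a, b):
        # Commutative, associative combiner for partial gather results.
        return a + b

    def apply(dst_data, gathered):
        # Runs once on the vertex's master copy; the updated vertex data is
        # then synchronized to its mirrors.
        _, out_degree = dst_data
        return ((1 - DAMPING) + DAMPING * gathered, out_degree)

    def scatter(src_data, edge_data, dst_data):
        # Runs once per out-edge, e.g. to activate a neighbor whose input
        # changed; returning None leaves the edge data unchanged.
        return None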

SLIDE 24

SLIDE 25

Graph Processing on Top of Spark

  • Unified approach to different types of analytics
  • No data transfers required
  • Single, homogeneous execution environment
  • Similar argument as SparkSQL
SLIDE 26

Graph as RDDs

  • Vertex collection
      • (vertex ID, properties)
  • Edge collection
      • (source vertex ID, destination vertex ID, properties)
  • Composable with other collections
      • Different vertex collections for the same graph (edges)
      • Vertex and edge collections can be used for further analysis
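A toy sketch of the two collections; plain Python lists stand in for the RDDs (GraphX itself exposes these through Scala, and the property names are made up):

    # The graph as two plain collections; in Spark these would be RDDs.
    vertices = [                 # (vertex ID, properties)
        (1, {"name": "a"}),
        (2, {"name": "b"}),
        (3, {"name": "c"}),
    ]
    edges = [                    # (source ID, destination ID, properties)
        (1, 2, {"weight": 2.0}),
        (1, 3, {"weight": 4.0}),
        (3, 2, {"weight": 1.0}),
    ]
    # Because both are ordinary collections, they compose with any other
    # dataset, e.g. joining the vertex collection against an external table.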
SLIDE 27

Basic Graph Computation Stages

  • Join stage: build (source, edge, destination) triplets
      • Used to calculate outgoing messages
  • Group-by stage: gather messages by destination
      • Used to update the destination vertex state
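Continuing in the same toy style, a sketch of the two stages for one round of message passing (the data and the min-based update are made up for illustration):

    from collections import defaultdict

    # Same toy vertex and edge collections as in the previous sketch.
    vertices = [(1, {}), (2, {}), (3, {})]
    edges = [(1, 2, {"weight": 2.0}), (1, 3, {"weight": 4.0}), (3, 2, {"weight": 1.0})]

    vertex_props = dict(vertices)

    # Join stage: attach both endpoints' properties to every edge, producing
    # (source, edge, destination) triplets.
    triplets = [((src, vertex_props[src]), eprops, (dst, vertex_props[dst]))
                for src, dst, eprops in edges]

    # Each triplet yields an outgoing message for its destination; the message
    # value here (the edge weight) is just an illustrative choice.
    messages = [(dst_id, eprops["weight"])
                for (_, _), eprops, (dst_id, _) in triplets]

    # Group-by stage: gather all messages per destination vertex, then reduce
    # them to update that vertex's state (minimum is used as an example).
    gathered = defaultdict(list)
    for dst_id, value in messages:
        gathered[dst_id].append(value)
    new_vertex_state = {dst_id: min(values) for dst_id, values in gathered.items()}
    # new_vertex_state == {2: 1.0, 3: 4.0}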
SLIDE 28

GraphX Operators

SLIDE 29

Pregel on GraphX

  • mrTriplets
      • join to get triplets
      • map + groupBy
          • generate messages from each triplet
          • gather them by destination
  • leftJoinV
      • join by source ID
  • mapV
      • apply a function to all vertices
      • generate output messages


SLIDE 30

Compressed Sparse Row (CSR)

  • Compact representation of graph data
      • Also used for sparse matrices
      • Read-only
  • Two sequential arrays
      • Vertex array: indexed by source vertex ID; stores the offset into the edge array where that vertex's destinations start
      • Edge array: list of destination vertex IDs
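A small Python sketch of building and querying a CSR structure, assuming vertex IDs are dense integers starting at 0:

    def build_csr(num_vertices, edges):
        # offsets[v] is where v's out-neighbors start in `destinations`,
        # and offsets[v + 1] is where they end. `edges` is a list of
        # (source, destination) pairs.
        counts = [0] * num_vertices
        for src, _ in edges:
            counts[src] += 1
        offsets = [0] * (num_vertices + 1)
        for v in range(num_vertices):
            offsets[v + 1] = offsets[v] + counts[v]
        destinations = [0] * len(edges)
        cursor = offsets[:-1].copy()        # next free slot per source vertex
        for src, dst in edges:
            destinations[cursor[src]] = dst
            cursor[src] += 1
        return offsets, destinations

    def neighbors(offsets, destinations, v):
        # All out-neighbors of v are stored contiguously in the edge array.
        return destinations[offsets[v]:offsets[v + 1]]

    # Example with vertex IDs 0..3.
    offsets, destinations = build_csr(4, [(0, 1), (0, 2), (2, 3), (3, 0)])
    assert neighbors(offsets, destinations, 0) == [1, 2]
    assert neighbors(offsets, destinations, 1) == []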
SLIDE 31

Distributed Graph Representation

  • Edge partition gathers all incident vertices into triplets
  • Vertex mirroring (GAS): vertex data replicated
  • Routing table: co-partitioned with vertices
      • For each vertex: set of edge partitions with adjacent edges
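A minimal sketch of such a routing table, assuming edges have already been assigned to partitions by some (here arbitrary) edge-partitioning function:

    from collections import defaultdict

    def build_routing_table(edges, num_edge_partitions):
        # For each vertex, record the set of edge partitions that contain at
        # least one adjacent edge, so vertex data is shipped only to those
        # partitions. `edges` is a list of (source, destination) pairs.
        routing_table = defaultdict(set)
        for src, dst in edges:
            # Arbitrary illustrative edge-partitioning function.
            partition = hash((src, dst)) % num_edge_partitions
            routing_table[src].add(partition)
            routing_table[dst].add(partition)
        return routing_table

    # Example: three edges spread over two edge partitions.
    table = build_routing_table([(1, 2), (1, 3), (3, 2)], num_edge_partitions=2)
    # table[1] now holds every edge partition that needs vertex 1's data.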