Parallel high-performance graph processing
  1. Parallel high-performance graph processing
  CHERNOSKUTOV MIKHAIL
  IMM UB RAS, IMCS URFU, YEKATERINBURG
  E-MAIL: MACH@IMM.URAN.RU

  2. Graph algorithms
  Application areas:
  ◦ Bioinformatics
  ◦ Social network analysis
  ◦ Business analytics
  ◦ Data mining
  ◦ City planning
  ◦ and others

  3. Graph algorithms
  Breadth-first search:
  ◦ Easy to understand
  ◦ Widespread
  ◦ Many ways to parallelize
  Graph500:
  ◦ www.graph500.org
  ◦ R. Murphy, K. Wheeler, B. Barrett, and J. Ang. Introducing the Graph 500. In Cray User's Group (CUG), 2010
  ◦ Parallel breadth-first search
  ◦ MPI and OpenMP implementations
  ◦ Designed for graphs with relatively small diameter and skewed degree distribution ("scale-free" graphs)

  4. Parallel breadth-first search
  Level synchronous algorithms:
  ◦ Processing of level N+1 begins only when processing of level N is over
  [Figure: example graph traversed level by level from the root vertex]

  5. Obstacles for efficient parallel implementation
  ◦ Data transfer problem
  ◦ Graph marking problem

  6. Data transfer problem
  Problem description:
  ◦ Real-world graphs have an irregular memory access pattern
  ◦ For graphs with small diameter, data transfer over the interconnect network causes large overheads
  Suggested solution:
  ◦ Combine different types of level synchronous algorithms

  7. Level synchronous algorithms
  Top-down traversal:
  ◦ The traditional way to implement breadth-first search
  ◦ Each active vertex checks all of its neighbors
  Bottom-up traversal:
  ◦ Inactive vertices look for active vertices in their neighbor lists
  ◦ S. Beamer, K. Asanovic, D. A. Patterson. Direction-optimizing breadth-first search. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2012

  8. Top-down breadth-first search
  for all u in dist
      dist[u] ← -1
  dist[s] ← 0
  level ← 0
  do
      parallel for each vert in V.this_node
          if dist[vert] = level
              for each neighb in vert.neighbors
                  if neighb in V.this_node
                      if dist[neighb] = -1
                          dist[neighb] ← level + 1
                          pred[neighb] ← vert
                  else
                      vert_batch_to_send.push(neighb)
      send(vert_batch_to_send)
      receive(vert_batch_to_receive)
      parallel for each vert in vert_batch_to_receive
          if dist[vert] = -1
              dist[vert] ← level + 1
              pred[vert] ← vert.pred
      level++
  while (!check_end())
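For concreteness, here is a minimal shared-memory sketch of one top-down level step in C++ with OpenMP, using CSR arrays like those on slide 17. The distributed send/receive part of the pseudocode above is omitted, and all names (top_down_step, row_ptr, col_ids) are illustrative assumptions, not the author's implementation.

    #include <cstdint>
    #include <vector>

    // One top-down level step: every frontier vertex (dist == level) scans
    // its neighbors and claims the unvisited ones for the next level.
    // Single-node sketch; the MPI batching above is left out.
    bool top_down_step(const std::vector<int64_t>& row_ptr,
                       const std::vector<int64_t>& col_ids,
                       std::vector<int64_t>& dist,
                       std::vector<int64_t>& pred,
                       int64_t level) {
        bool updated = false;
        #pragma omp parallel for reduction(||:updated)
        for (int64_t v = 0; v < static_cast<int64_t>(dist.size()); ++v) {
            if (dist[v] != level) continue;          // not on the frontier
            for (int64_t e = row_ptr[v]; e < row_ptr[v + 1]; ++e) {
                int64_t n = col_ids[e];
                if (dist[n] == -1) {                 // benign race: concurrent
                    dist[n] = level + 1;             // writers store the same level
                    pred[n] = v;
                    updated = true;
                }
            }
        }
        return updated;                              // false => traversal finished
    }

The boolean return value plays the role of check_end() in the pseudocode: the level loop stops once a step claims no new vertices.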

  9. Top-down breadth-first search
  Data transfers are described by a data transfer matrix:
  ◦ b_jk – amount of data (in bytes) to transfer from process j to process k
  [Figure: example data transfer matrix for four processes]
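A direct consequence of the definition of b_jk: the traffic injected by process j in one iteration is a row sum of the matrix, and the total per-iteration traffic is the sum over all entries:

    V_j = Σ_k b_jk,        V_total = Σ_j Σ_k b_jk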

  10. Top-down breadth-first search
  Data transfer time for every iteration of the algorithm:
  ◦ Graph with ~1 million vertices, parallelized over 4 nodes
  ◦ Total time spent on data transfer: 0.83 sec.
  [Chart: data transfer time (sec.) per iteration, iterations 1–7]

  11. Bottom-up breadth-first search
  for all u in dist
      dist[u] ← -1
  dist[s] ← 0
  level ← 0
  do
      parallel for each vert in V.this_node
          if dist[vert] = -1
              for each neighb in vert.neighbors
                  if bitmap_current[neighb] = 1
                      dist[vert] ← level + 1
                      pred[vert] ← neighb
                      bitmap_next[vert] ← 1
                      break
      all_gather(bitmap_next)
      swap(bitmap_current, bitmap_next)
      level++
  while (!check_end())
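The same step as a shared-memory C++/OpenMP sketch, with byte-per-vertex arrays standing in for the frontier bitmaps. As with the top-down sketch, all names are illustrative assumptions; the all_gather/swap synchronization stays outside the kernel.

    #include <cstdint>
    #include <vector>

    // One bottom-up level step: every unvisited vertex searches its neighbor
    // list for a vertex already in the frontier (bitmap_current) and stops
    // at the first hit. Each vertex writes only its own entries, so the
    // parallel loop is race-free.
    bool bottom_up_step(const std::vector<int64_t>& row_ptr,
                        const std::vector<int64_t>& col_ids,
                        std::vector<int64_t>& dist,
                        std::vector<int64_t>& pred,
                        const std::vector<uint8_t>& bitmap_current,
                        std::vector<uint8_t>& bitmap_next,
                        int64_t level) {
        bool updated = false;
        #pragma omp parallel for reduction(||:updated)
        for (int64_t v = 0; v < static_cast<int64_t>(dist.size()); ++v) {
            if (dist[v] != -1) continue;             // already visited
            for (int64_t e = row_ptr[v]; e < row_ptr[v + 1]; ++e) {
                if (bitmap_current[col_ids[e]]) {    // neighbor is on the frontier
                    dist[v] = level + 1;
                    pred[v] = col_ids[e];
                    bitmap_next[v] = 1;
                    updated = true;
                    break;                           // one parent suffices
                }
            }
        }
        return updated;
    }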

  12. Bottom-up breadth-first search
  Data synchronization through collective communications
  [Figure: each of nodes 0–3 holds its fragment of the bitmap; after all_gather, every node holds the complete bitmap]
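A minimal sketch of that synchronization step with MPI, assuming the bitmap is stored one byte per vertex and vertices are evenly block-distributed, so every process contributes a fragment of the same size. Function and variable names are illustrative.

    #include <mpi.h>
    #include <cstdint>
    #include <vector>

    // Exchange frontier fragments: after MPI_Allgather every process holds
    // the complete bitmap, matching the all_gather(bitmap_next) call in the
    // pseudocode on slide 11.
    void sync_frontier(const std::vector<uint8_t>& local_fragment,
                       std::vector<uint8_t>& full_bitmap,  // size = fragment size * nprocs
                       MPI_Comm comm) {
        int count = static_cast<int>(local_fragment.size());
        MPI_Allgather(local_fragment.data(), count, MPI_UINT8_T,
                      full_bitmap.data(), count, MPI_UINT8_T, comm);
    }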

  13. Bottom-up breadth-first search
  Data transfer time for every iteration of the algorithm:
  ◦ Graph with ~1 million vertices, parallelized over 4 nodes
  ◦ Total time spent on data transfer: 0.001 sec.
  [Chart: data transfer time (sec.) per iteration, iterations 1–7]

  14. Data transfer problem
  Suggested solution: hybrid graph traversal (sketched below)
  ◦ First two iterations: top-down
  ◦ Next three iterations: bottom-up
  ◦ All remaining iterations: top-down
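A minimal sketch of this fixed schedule (cutoffs 2 and 5 mirror the plan above, counting iterations from zero). A more general implementation would switch direction based on the measured frontier size, as in Beamer et al., but the fixed schedule is what this slide describes.

    // Direction of each BFS iteration under the fixed hybrid schedule:
    // two top-down iterations, then three bottom-up, then top-down again.
    enum class Direction { TopDown, BottomUp };

    Direction choose_direction(int iteration) {       // counted from 0
        if (iteration < 2) return Direction::TopDown; // small initial frontier
        if (iteration < 5) return Direction::BottomUp;// frontier at its largest
        return Direction::TopDown;                    // frontier shrinks again
    }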

  15. Hybrid graph traversal
  Data transfer time for every iteration of the algorithm:
  ◦ Graph with ~1 million vertices, parallelized over 4 nodes
  ◦ Total time spent on data transfer: 0.0005 sec.
  [Chart: data transfer time (sec.) per iteration, iterations 1–7]

  16. Graph marking problem
  Skewed degree distribution:
  ◦ A few vertices with a large number of in-/outgoing edges
  ◦ Many vertices with a small number of in-/outgoing edges

  17. Graph marking problem
  CSR is one of the most popular formats for storing graph data.
  Example graph with vertices 0–4:
  ◦ Row pointers: 0, 3, 6, 7, 10, 12
  ◦ Column ids: 1, 2, 3, 0, 3, 4, 0, 0, 1, 2, 1, 3
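The slide's example written out as the two CSR arrays: the neighbor list of vertex v occupies col_ids[row_ptr[v] .. row_ptr[v+1]). Array names are illustrative.

    #include <cstdint>
    #include <vector>

    // CSR encoding of the 5-vertex example graph from this slide.
    const std::vector<int64_t> row_ptr = {0, 3, 6, 7, 10, 12};
    const std::vector<int64_t> col_ids = {1, 2, 3,    // neighbors of vertex 0
                                          0, 3, 4,    // neighbors of vertex 1
                                          0,          // neighbors of vertex 2
                                          0, 1, 2,    // neighbors of vertex 3
                                          1, 3};      // neighbors of vertex 4

    // The degree of a vertex is one subtraction away:
    inline int64_t degree(int64_t v) { return row_ptr[v + 1] - row_ptr[v]; }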

  18. Graph marking problem
  Traversing a CSR graph means iterating over the Row pointers array:
  ◦ The number of edges incident to a given vertex (i.e., how many elements of the Column ids array must be traversed) is not known in advance
  The performance of every iteration of a level synchronous algorithm is dominated by the processing time of the most "heavy-weight" vertex:
  ◦ Workload imbalance

  19. Graph marking problem
  Suggested solution: a workload balancing method
  ◦ Divide the Column ids array into equal parts, each consisting of max_edges elements
  ◦ Map every part onto the Row pointers and Column ids arrays using an additional array, Part column
  ◦ Every thread then processes a number of edges known in advance, determined by the corresponding elements of the Part column array

  20. Graph marking problem
  [Figure: graph processing without workload balancing]

  21. Graph marking problem
  [Figure: graph processing with workload balancing, max_edges = 4]

  22. Graph marking problem
  Filling of the Part column array:
  parallel for i in V.this_node
      first ← V.this_node[i]
      last ← V.this_node[i+1]
      index ← round_up(first / max_edges)
      current ← index * max_edges
      while (current < last)
          part_column[index] ← i
          current += max_edges
          index++
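A hedged C++/OpenMP reading of this pseudocode, taking V.this_node to be the CSR Row pointers array: for every max_edges-sized chunk of the Column ids array, part_column records the vertex whose adjacency list the chunk starts in. Names are illustrative.

    #include <cstdint>
    #include <vector>

    // Build part_column: part_column[k] is the vertex whose adjacency list
    // contains edge k * max_edges, i.e. the vertex where chunk k starts.
    std::vector<int64_t> fill_part_column(const std::vector<int64_t>& row_ptr,
                                          int64_t max_edges) {
        int64_t num_edges = row_ptr.back();
        std::vector<int64_t> part_column((num_edges + max_edges - 1) / max_edges);
        #pragma omp parallel for
        for (int64_t i = 0; i + 1 < static_cast<int64_t>(row_ptr.size()); ++i) {
            int64_t first = row_ptr[i];                      // vertex i's edge range
            int64_t last  = row_ptr[i + 1];
            int64_t index = (first + max_edges - 1) / max_edges; // round_up
            int64_t current = index * max_edges;             // first boundary >= first
            while (current < last) {                         // boundary inside range
                part_column[index] = i;
                current += max_edges;
                ++index;
            }
        }
        return part_column;
    }

With the CSR example from slide 17 and max_edges = 4, this produces part_column = {0, 1, 3}: chunk 0 (edges 0–3) starts in vertex 0's list, chunk 1 (edges 4–7) in vertex 1's, and chunk 2 (edges 8–11) in vertex 3's.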

  23. Top-down breadth-first search
  Main loop of level synchronous breadth-first search, modified for workload balancing:
  // preparation...
  parallel for i in part_column
      first_edge ← i * max_edges
      last_edge ← (i+1) * max_edges
      curr_vert ← part_column[i]
      for each edge ∈ [first_edge; last_edge)
          if neighbors of curr_vert ∈ [first_edge; last_edge)
              if dist[curr_vert] = level
                  for each k ∈ neighbors of curr_vert
                      if dist[k] = -1
                          dist[k] ← level + 1
                          pred[k] ← curr_vert
              curr_vert++
  // data synchronization...
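As a shared-memory sketch (same illustrative CSR names as before): the balanced kernel parallelizes over fixed-size edge chunks instead of vertices and advances curr_vert whenever the chunk crosses an adjacency-list boundary.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Balanced top-down step: each task owns exactly one max_edges-sized
    // chunk of col_ids, so per-task work is bounded regardless of degrees.
    void balanced_top_down_step(const std::vector<int64_t>& row_ptr,
                                const std::vector<int64_t>& col_ids,
                                const std::vector<int64_t>& part_column,
                                std::vector<int64_t>& dist,
                                std::vector<int64_t>& pred,
                                int64_t max_edges, int64_t level) {
        #pragma omp parallel for
        for (int64_t i = 0; i < static_cast<int64_t>(part_column.size()); ++i) {
            int64_t first_edge = i * max_edges;
            int64_t last_edge  = std::min<int64_t>((i + 1) * max_edges,
                                                   col_ids.size());
            int64_t curr_vert  = part_column[i];     // vertex owning first_edge
            for (int64_t e = first_edge; e < last_edge; ++e) {
                while (e >= row_ptr[curr_vert + 1])  // crossed a list boundary:
                    ++curr_vert;                     // move to the next vertex
                if (dist[curr_vert] == level) {      // edge leaves the frontier
                    int64_t n = col_ids[e];
                    if (dist[n] == -1) {             // benign race, as before
                        dist[n] = level + 1;
                        pred[n] = curr_vert;
                    }
                }
            }
        }
    }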

  24. Top-down breadth-first search
  Time spent on graph traversal with and without workload balancing:
  ◦ Graph with ~1 million vertices, parallelized over 4 nodes
  [Chart: time (sec.) per iteration, iterations 1–7, with and without balancing]

  25. Bottom-up breadth-first search
  Main loop of level synchronous breadth-first search, modified for workload balancing:
  // preparation...
  parallel for i in part_column
      first_edge ← i * max_edges
      last_edge ← (i+1) * max_edges
      curr_vert ← part_column[i]
      for each edge ∈ [first_edge; last_edge)
          if neighbors of curr_vert ∈ [first_edge; last_edge)
              if dist[curr_vert] = -1
                  for each k ∈ neighbors of curr_vert
                      if bitmap_current[k] = 1
                          dist[curr_vert] ← level + 1
                          pred[curr_vert] ← k
                          bitmap_next[curr_vert] ← 1
                          break
              curr_vert++
  // data synchronization...

  26. Bottom-up breadth-first search
  Time spent on graph traversal with and without workload balancing:
  ◦ Graph with ~1 million vertices, parallelized over 4 nodes
  [Chart: time (sec.) per iteration, iterations 1–7, with and without balancing]

  27. Combining methods
  The two methods can be used together to maximize the performance of the breadth-first search algorithm:
  ◦ Workload balancing method: reduces the time spent on graph processing in each iteration
  ◦ Hybrid traversal: reduces data transfer overheads in the data synchronization step of every iteration

  28. Benchmarking
  All methods are integrated into a custom implementation of the Graph500 benchmark.
  Measure the performance of the custom implementation for various numbers of nodes:
  ◦ 1, 2, 4, and 8 nodes of the "Uran" supercomputer
  ◦ CPU: Intel Xeon X5675, 192 GB DRAM
  ◦ "Scale" varies from 20 to 25
  Compare the custom implementation with the reference Graph500 implementations:
  ◦ Simple implementation
  ◦ Replicated implementation
  Performance metric: speed of graph traversal
  ◦ Measured in Traversed Edges Per Second (TEPS)
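The metric itself is straightforward: the number of edges traversed by one BFS divided by its wall-clock time (Graph500 reports statistics over many BFS runs from random roots). A one-liner, with illustrative names:

    #include <cstdint>

    // Traversal speed in millions of traversed edges per second (MTEPS).
    double mteps(int64_t edges_traversed, double seconds) {
        return static_cast<double>(edges_traversed) / seconds / 1e6;
    }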

  29. Results (1 node)
  [Chart: traversal speed (MTEPS) vs. Scale 20–25 for the custom, replicated, and simple implementations]

  30. Results (2 nodes)
  [Chart: traversal speed (MTEPS) vs. Scale 20–25 for the custom, replicated, and simple implementations]

  31. Results (4 nodes)
  [Chart: traversal speed (MTEPS) vs. Scale 20–25 for the custom, replicated, and simple implementations]

  32. Results (8 nodes)
  [Chart: traversal speed (MTEPS) vs. Scale 20–25 for the custom, replicated, and simple implementations]

  33. Results
  Combining the workload balancing and traversal hybridization methods achieves a performance improvement for parallel breadth-first search.
  The custom implementation has potential for further parallelization.

  34. Conclusion
  The workload balancing method helps reduce overheads connected with graph processing.
  The traversal hybridization method helps reduce overheads connected with data transfer on every iteration.
  Future plans:
  ◦ Investigate the scalability of the developed algorithm
  ◦ Adapt the custom implementation to use performance accelerators and coprocessors

  35. Questions?
