

SLIDE 1

Truss Decomposition on Shared-Memory Parallel Systems

Shaden Smith¹,², Xing Liu², Nesreen K. Ahmed², Ancy Sarah Tom¹, Fabrizio Petrini², and George Karypis¹

¹Department of Computer Science & Engineering, University of Minnesota
²Intel Parallel Computing Lab

shaden@cs.umn.edu

GraphChallenge Finalist, HPEC 2017

1 / 8

SLIDE 2

Truss decomposition

We are interested in computing the complete truss decomposition of a graph on shared-memory parallel systems.

Notation:

◮ A k-truss is a subgraph in which each edge is contained in at least (k − 2) triangles in the same subgraph.

◮ The truss number of an edge, Γ(e), is the maximum k such that some k-truss contains e.
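As a toy check of these definitions, consider the complete graph K4 (a sketch in Python; the paper's code is in C, and this helper is purely illustrative). Every edge of K4 closes a triangle with each of the two remaining vertices:

```python
# Illustrative check of the k-truss definition on K4 (hypothetical helper
# code, not from the paper): each edge of K4 lies in exactly 2 triangles,
# so the whole graph is a 4-truss and Γ(e) = 4 for every edge.
from itertools import combinations
from collections import defaultdict

adj = defaultdict(set)
edges = list(combinations(range(4), 2))  # the 6 edges of K4
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

# The support of edge (u, v) is |N(u) ∩ N(v)|: one triangle per common neighbor.
support = {(u, v): len(adj[u] & adj[v]) for u, v in edges}
print(support)  # every edge has support 2 = (k − 2) with k = 4
```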

2 / 8

SLIDE 3

Serial peeling algorithm

Peeling builds the truss decomposition bottom-up.

1:  Compute initial supports and store in sup(·)
2:  k ← 3
3:  while |E| > 0 do
4:      for each edge e not in current k-truss do
5:          for each edge e′ ∈ ∆e do
6:              sup(e′) ← sup(e′) − 1
7:          end for
8:          Γ(e) ← k − 1
9:          Remove e from E
10:     end for
11:     k ← k + 1
12: end while
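The listing above can be sketched as a runnable serial routine. This is an illustrative Python translation under the deck's definitions, not the authors' C implementation; ∆e is realized as the common-neighbor set of e's endpoints.

```python
from collections import defaultdict

def truss_numbers(edges):
    """Peel a simple undirected graph bottom-up; return Γ(e) per edge."""
    adj = defaultdict(set)
    E = set()
    for u, v in edges:
        E.add((min(u, v), max(u, v)))
        adj[u].add(v)
        adj[v].add(u)
    # sup(e) = number of triangles containing e = |N(u) ∩ N(v)|
    sup = {(u, v): len(adj[u] & adj[v]) for u, v in E}
    gamma = {}
    k = 3
    while E:
        # edges not in the current k-truss: support below k - 2
        frontier = [e for e in E if sup[e] < k - 2]
        while frontier:
            nxt = []
            for u, v in frontier:
                if (u, v) not in E:
                    continue  # already peeled this round
                for w in adj[u] & adj[v]:  # e' ∈ ∆e: the other two triangle edges
                    for e2 in ((min(u, w), max(u, w)), (min(v, w), max(v, w))):
                        sup[e2] -= 1
                        if sup[e2] < k - 2:
                            nxt.append(e2)  # e2 falls out of the k-truss too
                E.remove((u, v))
                adj[u].discard(v)
                adj[v].discard(u)
                gamma[(u, v)] = k - 1  # Γ(e) ← k − 1
            frontier = nxt
        k += 1
    return gamma
```

For example, every edge of K4 gets truss number 4, a lone triangle's edges get 3, and triangle-free edges get 2.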

3 / 8

SLIDE 4

Multi-Stage Peeling (MSP)

We break the peeling process into several bulk-synchronous substeps. High-level idea:

◮ Store the graph as an adjacency list for each vertex (i.e., CSR).

◮ Do a 1D decomposition on the vertices.

◮ Operations which modify graph state (e.g., edge deletion and support updates) are grouped by source vertex.

◮ Batching localizes updates to a specific adjacency list and eliminates race conditions.
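The batching idea can be sketched as follows (a hypothetical Python illustration; names like `apply_updates` are not from the paper). Because every update is routed to the vertex that owns the adjacency list, the per-owner loops touch disjoint state and could run as a parallel-for without atomics:

```python
from collections import defaultdict

def apply_updates(updates, sup):
    """Apply a batch of support decrements, grouped by owning source vertex.

    updates: iterable of edges (u, v) whose support must drop by 1,
             where u is the source vertex owning the adjacency list.
    """
    # Stage 1: bucket the updates by owning vertex (the 1D decomposition).
    batches = defaultdict(list)
    for u, v in updates:
        batches[u].append((u, v))
    # Stage 2: each owner drains only its own batch. No two batches write
    # the same owner's data, so running this outer loop in parallel over
    # owners would involve no races, atomics, or mutexes.
    for owner, batch in batches.items():
        for e in batch:
            sup[e] -= 1
    return sup
```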

4 / 8

SLIDE 5

Multi-Stage Peeling (MSP)

Step 1: frontier generation


SLIDE 6

Multi-Stage Peeling (MSP)

Step 2: triangle enumeration


SLIDE 7

Multi-Stage Peeling (MSP)

Step 3: support updates


SLIDE 8

Multi-Stage Peeling (MSP)

Step 4: edge deletion


SLIDE 9

Experimental Setup

Software:

◮ Parallel baseline: asynchronous nucleus decomposition (AND)¹, written in C and parallelized with OpenMP

◮ MSP is written in C and parallelized with OpenMP

◮ Compiled with icc v17.0

Hardware:

◮ 56-core shared-memory system (2× 28-core Skylake Xeon)

◮ 192GB DDR4 memory

¹A. E. Sariyuce, C. Seshadhri, and A. Pinar, "Parallel local algorithms for core, truss, and nucleus decompositions," arXiv preprint arXiv:1704.00386, 2017.

5 / 8

SLIDE 10

Graphs

More datasets in paper.

Graph         |V|     |E|      |∆|    kmax
cit-Patents   3.8M    16.5M    7.5M     36
soc-Orkut     3.0M   106.3M  524.6M     75
twitter      41.7M     1.2B   34.8B   1998
rmat22        2.4M    64.1M    2.1B    485
rmat23        4.5M   129.3M    4.5B    625
rmat24        8.9M   260.3M    9.9B    791
rmat25       17.0M   523.5M   21.6B    996

K, M, and B denote thousands, millions, and billions, respectively. The first group of graphs is taken from real-world datasets, and the second group is synthetic.

6 / 8

SLIDE 11

Strong scaling

[Figure: parallel scalability. Speedup vs. core count (1–56) for cit-Patents, soc_orkut, rmat22, rmat23, and rmat24, with the ideal-scaling line for reference.]


SLIDE 12

Parallel baseline comparison

MSP is up to 28× faster than AND and 20× faster than the serial peeling algorithm.

Graph         Peeling      AND                MSP
cit-Patents      2.89     0.23 (12.6×)      0.58  (5.0×)
soc-Orkut      228.06    64.31  (3.5×)     11.30 (20.2×)
twitter             —        —           1566.72
rmat22         403.59   398.46  (1.0×)     42.22  (9.6×)
rmat23         980.68  1083.66  (0.9×)     85.14 (11.5×)
rmat24        2370.54  4945.70  (0.5×)    175.29 (13.5×)
rmat25        5580.47        —            352.37 (15.8×)

Values are runtimes, in seconds, of the full truss decomposition; parenthesized speedups are relative to Peeling. Peeling is the optimized serial implementation. AND and MSP are executed on 56 cores. Dashes mark entries not reported on the slide.

7 / 8

SLIDE 13

Wrapping up

Multi-stage peeling (MSP):

◮ processes graph mutations in batches to avoid race conditions

◮ resulting algorithm is free of atomics and mutexes

◮ can decompose a billion-scale graph on a single node in minutes

Relative to the state-of-the-art:

◮ Up to 28× speedup over the state-of-the-art parallel algorithm

◮ Serial optimizations achieve over 1400× speedup over the provided Matlab benchmark (in paper).

shaden@cs.umn.edu

8 / 8

SLIDE 14

Backup


SLIDE 15

Peeling algorithm

1:  Compute initial supports and store in sup
2:  k ← 3
3:  while |E| > 0 do
4:      Fk ← {e ∈ E : sup(e) < k − 2}
5:      while |Fk| > 0 do
6:          for e ∈ Fk do
7:              for e′ ∈ ∆e do
8:                  sup(e′) ← sup(e′) − 1
9:              end for
10:             E ← E \ {e}
11:             Γ(e) ← k − 1
12:             Fk ← {e ∈ E : sup(e) < k − 2}
13:         end for
14:     end while
15:     k ← k + 1
16: end while
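Lines 4 and 12 of the listing recompute the frontier Fk. As a small illustrative helper (hypothetical name, not from the paper):

```python
def frontier(E, sup, k):
    """F_k: remaining edges whose support has fallen below k - 2."""
    return {e for e in E if sup[e] < k - 2}
```

For k = 3 this first selects only edges in no triangle (support 0); as k grows, the threshold rises and more edges become peelable.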

8 / 8

SLIDE 16

Parallelization challenges

A natural first approach is to peel edges concurrently. There are several challenges when parallelizing:

◮ the graph data structure is dynamic

◮ supports must be decremented safely

◮ triangles may be counted multiple times


SLIDE 17

Serial benchmark comparison

The optimized peeling implementation achieves 1400× speedup over the GraphChallenge benchmark (both serial).

Graph             Octave   Peeling  Speedup
soc-Slashdot0811  169.23      0.22   769.1×
cit-HepTh         448.23      0.40  1120.6×
soc-Epinions1     675.03      0.46  1467.4×
loc-gowalla       787.95      0.79   997.4×
cit-Patents       972.66      4.03   241.4×

Values are runtime in seconds. Octave is the serial Octave benchmark provided by the GraphChallenge specification. Peeling is the proposed serial implementation of the peeling algorithm. Speedup is measured relative to Octave.


SLIDE 18

Serial breakdown

[Figure: fraction of total computation time spent in SUPPORT-UPDATES, FRONTIER, and INITIAL-SUPPORTS for soc-Slashdot0811, cit-HepTh, soc-Epinions1, loc-gowalla_edges, cit-Patents, soc_orkut, twitter, and rmat22–rmat25.]


SLIDE 19

Parallel breakdown

[Figure: fraction of total computation time spent in SUPPORT-UPDATES, FRONTIER, and INITIAL-SUPPORTS for soc-Slashdot0811, cit-HepTh, soc-Epinions1, loc-gowalla_edges, cit-Patents, soc_orkut, twitter, and rmat22–rmat25, parallel execution.]


SLIDE 20

Cost per truss

The time per k-truss on soc-orkut is unsurprising.

[Figure: time (s) to peel each level k on soc-orkut (k up to ~80), overlaid with the size of each k-truss in edges (scale 1e8).]


SLIDE 21

Cost per truss

rmat25 is more challenging.

[Figure: time (s) to peel each level k on rmat25 (k up to ~1000), overlaid with the size of each k-truss in edges (scale 1e8).]
