SLIDE 1

Parallel Triangle Counting and K-Truss Identification Using Graph-Centric Methods

Chad Voegele, Yi-Shan Lu, Sreepathi Pai, Keshav Pingali
The University of Texas at Austin
09/13/2017


SLIDE 2

Graph-Centric vs. Matrix-Centric Abstractions

Graph-centric abstraction:
  • Active element: a node/edge where computation is needed
  • Operator: the computation at an active element
  • Neighborhood: the set of nodes/edges read/written by the update
  • Parallelism: disjoint updates; read-only operators (e.g. triangle counting) need no synchronization

Matrix-centric abstraction:
  • Bulk operations: matrix-matrix/matrix-vector multiplication, element-wise manipulation, reduction
  • Parallelism: inside individual operations

[Figure: an example graph with an active node and its neighborhood highlighted, next to the matrix-centric view of triangle counting as a sparse matrix-matrix product.]
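To make the graph-centric abstraction concrete, the sketch below applies a read-only triangle-counting operator at each node; because the operator only reads its neighborhood, every application could run in parallel with no synchronization. The `Graph` type and the serial loop are simplified stand-ins (a plain adjacency-list vector and a plain `for`, not the actual Galois API).

```cpp
#include <algorithm>
#include <vector>

// Simplified undirected graph: one sorted adjacency list per node
// (a stand-in for Galois' graph types).
using Graph = std::vector<std::vector<int>>;

// Operator applied at an active node u. Its neighborhood is u's adjacency
// list plus the lists of u's neighbors; it is read-only, so a parallel loop
// (e.g. galois::do_all) could apply it to all nodes with no conflicts.
long trianglesAt(const Graph& g, int u) {
  long count = 0;
  for (int v : g[u]) {
    if (v <= u) continue;              // count each triangle once: u < v < w
    for (int w : g[v]) {
      if (w <= v) continue;
      if (std::binary_search(g[u].begin(), g[u].end(), w)) ++count;
    }
  }
  return count;
}

long countTriangles(const Graph& g) {
  long total = 0;
  for (int u = 0; u < (int)g.size(); ++u)  // serial stand-in for a do_all
    total += trianglesAt(g, u);
  return total;
}
```

On a 4-clique this counts each of the C(4,3) = 4 triangles exactly once, at its smallest vertex.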

SLIDE 3

Galois: Graph-Centric Programming Framework

Shared-Memory Galois [1] (C++ Library)

  • Parallel data structures
  • Graphs, bags, etc.
  • Parallel loops over active elements
  • for_each, do_all, etc.
  • Support for
  • Load balancing
  • Scheduling
  • Dynamic work

IrGL [2] (Compiler)

  • Translates Galois programs to CUDA
  • Applies GPU-specific optimizations
  • Iteration outlining
  • Cooperative conversion
  • Nested parallelism


[1] D. Nguyen, A. Lenharth and K. Pingali, "A lightweight infrastructure for graph analytics," in SOSP 2013.
[2] S. Pai and K. Pingali, "A compiler for throughput optimization of graph algorithms on GPUs," in OOPSLA 2016.

SLIDE 4

Advantages of Graph-Centric Approach


SLIDE 5

Eliminating Barriers in a Round

The k-truss loop: enumerate triangles → count the number of triangles (the support) for each edge → remove edges with insufficient support → check whether all remaining edges have sufficient support; if yes, k-truss is done, otherwise start the next round.

Matrix-centric methods: one matrix operation per step — triangle enumeration, counting triangles per edge, removing selected edges, and a reduction to check for edges with insufficient support — with a barrier inside the round after each step, plus a barrier between rounds.

Graph-centric methods: one operator per edge (e1, e2, e3, ..., en), with no barriers inside a round; only the barrier between rounds remains.
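The k-truss loop on this slide can be sketched as a per-edge operator with on-the-spot removal, where the only synchronization point is between rounds. Below is a minimal serial sketch using adjacency sets rather than the paper's CSR representation; it illustrates the control flow, not the actual Galois implementation.

```cpp
#include <set>
#include <vector>

using Graph = std::vector<std::set<int>>;  // undirected graph as adjacency sets

// Support of edge (u, v): number of triangles it participates in,
// i.e. common neighbors of u and v still present in the graph.
static int support(const Graph& g, int u, int v) {
  int s = 0;
  for (int w : g[u])
    if (g[v].count(w)) ++s;
  return s;
}

// k-truss: keep removing edges whose support is below k - 2 until a round
// makes no change. Removals take effect immediately ("on the spot"), so
// later operator applications in the same round already see them.
Graph kTruss(Graph g, int k) {
  bool changed = true;
  while (changed) {                    // barrier only between rounds
    changed = false;
    for (int u = 0; u < (int)g.size(); ++u) {
      std::vector<int> nbrs(g[u].begin(), g[u].end());  // snapshot of u's list
      for (int v : nbrs) {
        if (v < u) continue;           // visit each undirected edge once
        if (support(g, u, v) < k - 2) {
          g[u].erase(v);               // remove edge immediately
          g[v].erase(u);
          changed = true;
        }
      }
    }
  }
  return g;
}
```

For example, a 4-clique with one pendant edge attached is its own 4-truss: the pendant edge has support 0 and is pruned in the first round, while every clique edge keeps support 2 = k - 2.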

SLIDE 6

Exploiting Domain Knowledge in Operators

The graph is stored as Compressed Sparse Row (CSR), with an EdgeRange offset array per node, an EdgeDst destination array per edge, and an EdgeRemoved flag per edge.

  • Sorted edge lists speed up edge-list intersection from O(deg(u)*deg(v)) to O(deg(u)+deg(v)).
  • Sorted edge lists allow edges to be located by binary search when removing edges.
  • Early termination once an edge's support reaches k - 2.
  • Edge removals may become visible in the current round, reducing the number of rounds.

[Figure: a 5-node example graph with its CSR arrays — EdgeRange: 3 6 10 15 18 20; EdgeDst: 1 2 3 2 3 1 3 4 1 2 4 5 2 3 5 3 4 — alongside the k-truss loop (enumerate triangles, count triangles per edge, remove edges with insufficient support, check support).]
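Two of the optimizations above — the merge-style intersection of sorted edge lists and the early-termination cutoff at k - 2 — can be sketched for a single edge (u, v) as follows. This is a standalone illustration; the function name and `cap` parameter are ours, not Galois'.

```cpp
#include <cstddef>
#include <vector>

// Count common elements of two sorted edge lists with a single merge-style
// scan: O(deg(u) + deg(v)) instead of the O(deg(u) * deg(v)) nested scan.
// `cap` implements early termination: when computing the support of an edge
// for k-truss, we may stop as soon as the count reaches k - 2.
int sortedIntersectCount(const std::vector<int>& a,
                         const std::vector<int>& b,
                         int cap) {
  int count = 0;
  std::size_t i = 0, j = 0;
  while (i < a.size() && j < b.size()) {
    if (a[i] < b[j]) {
      ++i;
    } else if (a[i] > b[j]) {
      ++j;
    } else {                 // common neighbor found
      ++count;
      ++i;
      ++j;
      if (count >= cap) break;  // early termination at the cap
    }
  }
  return count;
}
```

With a large `cap` this returns the full support; with `cap = k - 2` it stops scanning as soon as the edge is known to survive the round.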

SLIDE 7

Avoiding Runtime Memory Management

Graph-centric methods: load the graph once into fixed CSR arrays — NodeData, EdgeRange, EdgeDst, EdgeData — and update node/edge data in place. The storage is fixed after the graph is loaded.

Matrix-centric methods: construct matrices at runtime — the adjacency matrix, the incidence matrix, and their product — which requires runtime memory management.

[Figure: the CSR arrays of the 5-node example graph contrasted with the adjacency, incidence, and product matrices built by the matrix-centric formulation.]
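A fixed-footprint CSR graph in the spirit of the arrays above can be sketched as follows. The field names follow the slide; the class itself is an illustrative stand-in, not Galois' actual graph type. All arrays are sized once at load time, and "removing" an edge only flips a flag, so the k-truss rounds never invoke an allocator.

```cpp
#include <utility>
#include <vector>

// CSR graph whose storage is fixed after loading, mirroring the
// EdgeRange / EdgeDst / EdgeRemoved arrays on the slide.
struct CsrGraph {
  std::vector<int> edgeRange;     // edges of node u: [edgeRange[u], edgeRange[u+1])
  std::vector<int> edgeDst;       // destination per edge, sorted within each node
  std::vector<char> edgeRemoved;  // 1 if the edge has been logically removed

  CsrGraph(std::vector<int> range, std::vector<int> dst)
      : edgeRange(std::move(range)),
        edgeDst(std::move(dst)),
        edgeRemoved(edgeDst.size(), 0) {}  // allocated once, never resized

  // Locate edge (u, v) by binary search over u's sorted list; -1 if absent.
  int findEdge(int u, int v) const {
    int lo = edgeRange[u], hi = edgeRange[u + 1];
    while (lo < hi) {
      int mid = lo + (hi - lo) / 2;
      if (edgeDst[mid] < v) lo = mid + 1; else hi = mid;
    }
    return (lo < edgeRange[u + 1] && edgeDst[lo] == v) ? lo : -1;
  }

  // "Remove" an edge by setting its flag: no memory is freed or reallocated.
  void removeEdge(int u, int v) {
    int e = findEdge(u, v);
    if (e >= 0) edgeRemoved[e] = 1;
  }
};
```

This is the design point the slide contrasts with the matrix-centric approach, which must build new sparse matrices (and hence allocate) in every round.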

SLIDE 8

Advantages of Graph-Centric Approach

  • Eliminates barriers in a round
  • Exploits domain knowledge in operators
  • Avoids runtime memory management


SLIDE 9

Experimental Setup


Platform

  • CPU
  • Broadwell-EP Xeon E5-2650 v4 @ 2.2 GHz
  • 30 MB LLC, 192 GB RAM
  • g++ 4.9
  • 1, 12 or 24 threads
  • GPU
  • Pascal-based NVIDIA GTX 1080
  • 8 GB RAM
  • NVCC 8.0

Baseline from IEEE HPEC static graph challenge [3]

  • Triangle counting: serial miniTri in C++
  • K-truss computation: reference implementation in Julia 0.6

Parameter

  • Compute kmax-truss for each graph.
  • kmax: the maximum k for a graph to return non-empty truss.

[3] S. Samsi et al., "Static graph challenge: subgraph isomorphism," in IEEE HPEC, 2017.
[4] J. Leskovec and A. Krevl, SNAP datasets: Stanford large network dataset collection. Retrieved from http://snap.stanford.edu/data, June 2014.

[Table: input graphs from the SNAP collection [4], ordered from smallest to largest.]

SLIDE 10

Runtime


SLIDE 14

K-Truss Runtime

[Chart: per-graph end-to-end runtimes, capped at 4,800 (timeout); lower is better.]

Speedup over Julia (geometric mean across graphs):

  Variant   Geo Mean
  Julia         1.00
  Cpu-01      428.87
  Cpu-24      623.62
  Gpu       2,213.14

End-to-end runtime, measured after the graph is loaded and before the results are printed. Maximum speedup of cpu-24 over cpu-01: 14.30x (~117M edges).

SLIDE 15

Triangles Runtime

[Chart: per-graph end-to-end runtimes, capped at 4,800 (timeout); lower is better.]

Speedup over MiniTri (geometric mean across graphs):

  Variant   Geo Mean
  MiniTri       1.00
  Cpu-01      163.23
  Cpu-24      380.57
  Gpu       1,760.47

End-to-end runtime, measured after the graph is loaded and before the results are printed. Maximum speedup of cpu-24 over cpu-01: 17.22x (~15.7M edges).

SLIDE 16

Memory Usage


SLIDE 20

K-Truss Memory Usage

[Chart: per-graph memory usage as % of Julia; lower is better. Total CPU memory: 192 GB; total GPU memory: 8 GB.]

Memory usage relative to Julia (geometric mean across graphs):

  Variant   Geo Mean (% of Julia)
  Julia       100.00
  Cpu-01        0.54
  Cpu-24       11.05
  Gpu           1.09

Measurement: Julia via @time; CPU via Galois' internal allocator; GPU via cudaMemGetInfo.

SLIDE 21

Triangles Memory Usage

[Chart: per-graph memory usage as % of MiniTri; lower is better. Total CPU memory: 192 GB; total GPU memory: 8 GB.]

Memory usage relative to MiniTri (geometric mean across graphs):

  Variant   Geo Mean (% of MiniTri)
  MiniTri     100.00
  Cpu-01       94.31
  Cpu-24      791.64
  Gpu          50.14

Measurement: CPU via Galois' internal allocator; GPU via cudaMemGetInfo.

SLIDE 22

Energy Usage


SLIDE 23

K-Truss Energy Usage

[Chart: per-graph energy usage as % of Julia; lower is better.]

Energy usage relative to Julia (geometric mean across graphs):

  Variant   Geo Mean (% of Julia)
  Julia       100.00
  Cpu-01        2.27
  Cpu-24        2.03
  Gpu           0.48

Measurement: Julia and CPU via Intel RAPL counters; GPU via nvprof.

SLIDE 24

Triangles Energy Usage

[Chart: per-graph energy usage as % of MiniTri; lower is better.]

Energy usage relative to MiniTri (geometric mean across graphs):

  Variant   Geo Mean (% of MiniTri)
  MiniTri     100.00
  Cpu-01       12.95
  Cpu-24       12.07
  Gpu           2.55

Measurement: CPU via Intel RAPL counters; GPU via nvprof.

SLIDE 25

Conclusions

  • Graph-centric methods deliver two to three orders of magnitude improvement over the matrix-centric IEEE HPEC static graph challenge reference implementations.
  • Advantages of graph-centric methods over matrix-centric methods:
  • Eliminate barriers within a round.
  • Exploit domain knowledge in operators:
  • Early operator termination
  • On-the-spot edge removals
  • Sorted edge lists for faster edge-list intersections and edge removals
  • Avoid runtime memory management.

SLIDE 26

Thank you!

Questions? Comments?
