SLIDE 1

Parallel Triangle Counting and K-Truss Identification Using Graph-Centric Methods

Chad Voegele, Yi-Shan Lu, Sreepathi Pai, Keshav Pingali
The University of Texas at Austin
09/13/2017


SLIDE 2

Graph-Centric vs. Matrix-Centric Abstractions

Graph-centric abstraction:
  • Active element: a node/edge where computation is needed
  • Operator: the computation at an active element
  • Neighborhood: the set of nodes/edges read/written by the update
  • Parallelism: disjoint updates; read-only operators (e.g. triangle counting) need no synchronization

Matrix-centric abstraction:
  • Bulk operations: matrix-matrix/matrix-vector multiplication, element-wise manipulation, reduction
  • Parallelism: inside individual operations

[Figure: an example graph with an active node and its neighborhood highlighted, next to the matrix-centric view of triangle counting as a sparse matrix-matrix product.]
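To make the graph-centric abstraction concrete, the sketch below applies a read-only triangle-counting operator at each node; because the operator only reads its neighborhood, every application could run in parallel with no synchronization. The `Graph` type and the serial loop are simplified stand-ins (a plain adjacency-list vector and a plain `for`, not the actual Galois API).

```cpp
#include <algorithm>
#include <vector>

// Simplified undirected graph: one sorted adjacency list per node
// (a stand-in for Galois' graph types).
using Graph = std::vector<std::vector<int>>;

// Operator applied at an active node u. Its neighborhood is u's adjacency
// list plus the lists of u's neighbors; it is read-only, so a parallel loop
// (e.g. galois::do_all) could apply it to all nodes with no conflicts.
long trianglesAt(const Graph& g, int u) {
  long count = 0;
  for (int v : g[u]) {
    if (v <= u) continue;              // count each triangle once: u < v < w
    for (int w : g[v]) {
      if (w <= v) continue;
      if (std::binary_search(g[u].begin(), g[u].end(), w)) ++count;
    }
  }
  return count;
}

long countTriangles(const Graph& g) {
  long total = 0;
  for (int u = 0; u < (int)g.size(); ++u)  // serial stand-in for a do_all
    total += trianglesAt(g, u);
  return total;
}
```

On a 4-clique this counts each of the C(4,3) = 4 triangles exactly once, at its smallest vertex.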

SLIDE 3

Galois: Graph-Centric Programming Framework

Shared-Memory Galois [1] (C++ Library)

  • Parallel data structures
  • Graphs, bags, etc.
  • Parallel loops over active elements
  • for_each, do_all, etc.
  • Support for
  • Load balancing
  • Scheduling
  • Dynamic work

IrGL [2] (Compiler)

  • Translates Galois programs to CUDA
  • Applies GPU-specific optimizations
  • Iteration outlining
  • Cooperative conversion
  • Nested parallelism


[1] D. Nguyen, A. Lenharth and K. Pingali, "A lightweight infrastructure for graph analytics," in SOSP 2013.
[2] S. Pai and K. Pingali, "A compiler for throughput optimization of graph algorithms on GPUs," in OOPSLA 2016.

SLIDE 4

Advantages of Graph-Centric Approach


SLIDE 5

Eliminating Barriers in a Round

The k-truss loop: enumerate triangles → count the number of triangles (the support) for each edge → remove edges with insufficient support → check whether all remaining edges have sufficient support; if yes, k-truss is done, otherwise start the next round.

Matrix-centric methods: one matrix operation per step — triangle enumeration, counting triangles per edge, removing selected edges, and a reduction to check for edges with insufficient support — with a barrier inside the round after each step, plus a barrier between rounds.

Graph-centric methods: one operator per edge (e1, e2, e3, ..., en), with no barriers inside a round; only the barrier between rounds remains.
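The k-truss loop on this slide can be sketched as a per-edge operator with on-the-spot removal, where the only synchronization point is between rounds. Below is a minimal serial sketch using adjacency sets rather than the paper's CSR representation; it illustrates the control flow, not the actual Galois implementation.

```cpp
#include <set>
#include <vector>

using Graph = std::vector<std::set<int>>;  // undirected graph as adjacency sets

// Support of edge (u, v): number of triangles it participates in,
// i.e. common neighbors of u and v still present in the graph.
static int support(const Graph& g, int u, int v) {
  int s = 0;
  for (int w : g[u])
    if (g[v].count(w)) ++s;
  return s;
}

// k-truss: keep removing edges whose support is below k - 2 until a round
// makes no change. Removals take effect immediately ("on the spot"), so
// later operator applications in the same round already see them.
Graph kTruss(Graph g, int k) {
  bool changed = true;
  while (changed) {                    // barrier only between rounds
    changed = false;
    for (int u = 0; u < (int)g.size(); ++u) {
      std::vector<int> nbrs(g[u].begin(), g[u].end());  // snapshot of u's list
      for (int v : nbrs) {
        if (v < u) continue;           // visit each undirected edge once
        if (support(g, u, v) < k - 2) {
          g[u].erase(v);               // remove edge immediately
          g[v].erase(u);
          changed = true;
        }
      }
    }
  }
  return g;
}
```

For example, a 4-clique with one pendant edge attached is its own 4-truss: the pendant edge has support 0 and is pruned in the first round, while every clique edge keeps support 2 = k - 2.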

SLIDE 6

Exploiting Domain Knowledge in Operators

The graph is stored as Compressed Sparse Row (CSR), with an EdgeRange offset array per node, an EdgeDst destination array per edge, and an EdgeRemoved flag per edge.

  • Sorted edge lists speed up edge-list intersection from O(deg(u)*deg(v)) to O(deg(u)+deg(v)).
  • Sorted edge lists allow edges to be located by binary search when removing edges.
  • Early termination once an edge's support reaches k - 2.
  • Edge removals may become visible in the current round, reducing the number of rounds.

[Figure: a 5-node example graph with its CSR arrays — EdgeRange: 3 6 10 15 18 20; EdgeDst: 1 2 3 2 3 1 3 4 1 2 4 5 2 3 5 3 4 — alongside the k-truss loop (enumerate triangles, count triangles per edge, remove edges with insufficient support, check support).]
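Two of the optimizations above — the merge-style intersection of sorted edge lists and the early-termination cutoff at k - 2 — can be sketched for a single edge (u, v) as follows. This is a standalone illustration; the function name and `cap` parameter are ours, not Galois'.

```cpp
#include <cstddef>
#include <vector>

// Count common elements of two sorted edge lists with a single merge-style
// scan: O(deg(u) + deg(v)) instead of the O(deg(u) * deg(v)) nested scan.
// `cap` implements early termination: when computing the support of an edge
// for k-truss, we may stop as soon as the count reaches k - 2.
int sortedIntersectCount(const std::vector<int>& a,
                         const std::vector<int>& b,
                         int cap) {
  int count = 0;
  std::size_t i = 0, j = 0;
  while (i < a.size() && j < b.size()) {
    if (a[i] < b[j]) {
      ++i;
    } else if (a[i] > b[j]) {
      ++j;
    } else {                 // common neighbor found
      ++count;
      ++i;
      ++j;
      if (count >= cap) break;  // early termination at the cap
    }
  }
  return count;
}
```

With a large `cap` this returns the full support; with `cap = k - 2` it stops scanning as soon as the edge is known to survive the round.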

SLIDE 7

Avoiding Runtime Memory Management

Graph-centric methods: load the graph once into fixed CSR arrays — NodeData, EdgeRange, EdgeDst, EdgeData — and update node/edge data in place. The storage is fixed after the graph is loaded.

Matrix-centric methods: construct matrices at runtime — the adjacency matrix, the incidence matrix, and their product — which requires runtime memory management.

[Figure: the CSR arrays of the 5-node example graph contrasted with the adjacency, incidence, and product matrices built by the matrix-centric formulation.]
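A fixed-footprint CSR graph in the spirit of the arrays above can be sketched as follows. The field names follow the slide; the class itself is an illustrative stand-in, not Galois' actual graph type. All arrays are sized once at load time, and "removing" an edge only flips a flag, so the k-truss rounds never invoke an allocator.

```cpp
#include <utility>
#include <vector>

// CSR graph whose storage is fixed after loading, mirroring the
// EdgeRange / EdgeDst / EdgeRemoved arrays on the slide.
struct CsrGraph {
  std::vector<int> edgeRange;     // edges of node u: [edgeRange[u], edgeRange[u+1])
  std::vector<int> edgeDst;       // destination per edge, sorted within each node
  std::vector<char> edgeRemoved;  // 1 if the edge has been logically removed

  CsrGraph(std::vector<int> range, std::vector<int> dst)
      : edgeRange(std::move(range)),
        edgeDst(std::move(dst)),
        edgeRemoved(edgeDst.size(), 0) {}  // allocated once, never resized

  // Locate edge (u, v) by binary search over u's sorted list; -1 if absent.
  int findEdge(int u, int v) const {
    int lo = edgeRange[u], hi = edgeRange[u + 1];
    while (lo < hi) {
      int mid = lo + (hi - lo) / 2;
      if (edgeDst[mid] < v) lo = mid + 1; else hi = mid;
    }
    return (lo < edgeRange[u + 1] && edgeDst[lo] == v) ? lo : -1;
  }

  // "Remove" an edge by setting its flag: no memory is freed or reallocated.
  void removeEdge(int u, int v) {
    int e = findEdge(u, v);
    if (e >= 0) edgeRemoved[e] = 1;
  }
};
```

This is the design point the slide contrasts with the matrix-centric approach, which must build new sparse matrices (and hence allocate) in every round.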

SLIDE 8

Advantages of Graph-Centric Approach

  • Eliminates barriers in a round
  • Exploits domain knowledge in operators
  • Avoids runtime memory management


SLIDE 9

Experimental Setup


Platform

  • CPU
  • Broadwell-EP Xeon E5-2650 v4 @ 2.2 GHz
  • 30 MB LLC, 192 GB RAM
  • g++ 4.9
  • 1, 12 or 24 threads
  • GPU
  • Pascal-based NVIDIA GTX 1080
  • 8 GB RAM
  • NVCC 8.0

Baseline from IEEE HPEC static graph challenge [3]

  • Triangle counting: serial miniTri in C++
  • K-truss computation: reference implementation in Julia 0.6

Parameter

  • Compute kmax-truss for each graph.
  • kmax: the maximum k for a graph to return non-empty truss.

[3] S. Samsi et al., "Static graph challenge: subgraph isomorphism," in IEEE HPEC, 2017.
[4] J. Leskovec and A. Krevl, SNAP datasets: Stanford large network dataset collection. Retrieved from http://snap.stanford.edu/data, June 2014.

[Table: input graphs from the SNAP collection [4], ordered from smallest to largest.]

SLIDE 10

Runtime


SLIDE 14

K-Truss Runtime

[Chart: per-graph end-to-end runtimes, capped at 4,800 (timeout); lower is better.]

Speedup over Julia (geometric mean across graphs):

  Variant   Geo Mean
  Julia         1.00
  Cpu-01      428.87
  Cpu-24      623.62
  Gpu       2,213.14

End-to-end runtime, measured after the graph is loaded and before the results are printed. Maximum speedup of cpu-24 over cpu-01: 14.30x (~117M edges).

SLIDE 15

Triangles Runtime

[Chart: per-graph end-to-end runtimes, capped at 4,800 (timeout); lower is better.]

Speedup over MiniTri (geometric mean across graphs):

  Variant   Geo Mean
  MiniTri       1.00
  Cpu-01      163.23
  Cpu-24      380.57
  Gpu       1,760.47

End-to-end runtime, measured after the graph is loaded and before the results are printed. Maximum speedup of cpu-24 over cpu-01: 17.22x (~15.7M edges).

SLIDE 16

Memory Usage


SLIDE 20

K-Truss Memory Usage

[Chart: per-graph memory usage as % of Julia; lower is better. Total CPU memory: 192 GB; total GPU memory: 8 GB.]

Memory usage relative to Julia (geometric mean across graphs):

  Variant   Geo Mean (% of Julia)
  Julia       100.00
  Cpu-01        0.54
  Cpu-24       11.05
  Gpu           1.09

Measurement: Julia via @time; CPU via Galois' internal allocator; GPU via cudaMemGetInfo.

SLIDE 21

Triangles Memory Usage

[Chart: per-graph memory usage as % of MiniTri; lower is better. Total CPU memory: 192 GB; total GPU memory: 8 GB.]

Memory usage relative to MiniTri (geometric mean across graphs):

  Variant   Geo Mean (% of MiniTri)
  MiniTri     100.00
  Cpu-01       94.31
  Cpu-24      791.64
  Gpu          50.14

Measurement: CPU via Galois' internal allocator; GPU via cudaMemGetInfo.

SLIDE 22

Energy Usage


SLIDE 23

K-Truss Energy Usage

[Chart: per-graph energy usage as % of Julia; lower is better.]

Energy usage relative to Julia (geometric mean across graphs):

  Variant   Geo Mean (% of Julia)
  Julia       100.00
  Cpu-01        2.27
  Cpu-24        2.03
  Gpu           0.48

Measurement: Julia and CPU via Intel RAPL counters; GPU via nvprof.

SLIDE 24

Triangles Energy Usage

[Chart: per-graph energy usage as % of MiniTri; lower is better.]

Energy usage relative to MiniTri (geometric mean across graphs):

  Variant   Geo Mean (% of MiniTri)
  MiniTri     100.00
  Cpu-01       12.95
  Cpu-24       12.07
  Gpu           2.55

Measurement: CPU via Intel RAPL counters; GPU via nvprof.

SLIDE 25

Conclusions

  • Graph-centric methods deliver two to three orders of magnitude improvement over the matrix-centric IEEE HPEC static graph challenge reference implementations.
  • Advantages of graph-centric methods over matrix-centric methods:
  • Eliminate barriers within a round.
  • Exploit domain knowledge in operators:
  • Early operator termination
  • On-the-spot edge removals
  • Sorted edge lists for faster edge-list intersections and edge removals
  • Avoid runtime memory management.

SLIDE 26

Thank you!

Questions? Comments?
