
SLIDE 1

cuSTINGER – A Sparse Dynamic Graph and Matrix Data Structure

Oded Green

SLIDE 2

This talk covers two things:

A scalable and dynamic data structure for graph algorithms and linear-algebra-based problems
A framework for static and dynamic analytics

NVIDIA GTC cuSTINGER Oded Green, 2017


SLIDE 3

Upfront Summary of Results

Can support up to 90 million updates per second
Low overhead in comparison with CSR

– Initializing is also relatively inexpensive: 20%-200%
– Equal performance

Great performance for static graph algorithms
Simple to use


SLIDE 4

Big Data problems need Graph Analysis

Communication networks:

Worldwide connectivity
High-velocity changes
Different types of extracted data:

– Physical communication network.
– Person-to-person communication network.

Healthcare networks:

Various players.
Pattern matching and epidemic monitoring.
Problem sizes have doubled in the last 5 years.

Financial networks:

Transactions between players.
Different transaction types (property graph)

Graphs are a unifying motif for data analytics.

SLIDE 5

STINGER

STINGER: Spatio-Temporal Interaction Networks and Graphs (STING) Extensible Representation
Enables algorithm designers to implement dynamic and streaming graph algorithms with ease.
Portable semantics for various platforms

– Linked list of edge blocks is not ideal for the GPU

Good performance for all types of graph problems and algorithms, static and dynamic.
Assumes globally addressable memory access


SLIDE 6

STINGER and cuSTINGER Properties

✓ A simple programming model
✓ Millions of updates per second to the graph
✓ Updates are not bottlenecks for analytics
✓ Advanced memory manager
✓ Transfers data between host and device automatically
✓ Reduces initialization time
✓ Allows for simple update processes

STINGER papers: [Bader et al.; 2007; Tech Report], [Ediger et al.; HPEC; 2012], [McColl et al.; PPAA; 2014]
cuSTINGER paper: [Green & Bader; HPEC; 2016], cuSTINGER: Supporting dynamic graph algorithms for GPUs


SLIDE 7

Definitions

Dynamic graphs (matrices):

– Graph can change over time.
– Changes can be to topology, edges, or vertices. For example, a new edge between two vertices.
– Changes to edge or vertex weights.

Streaming graphs:

– Graphs changing at high rates.
– 100s of thousands of updates per second.


SLIDE 8

Dynamic graph example

Only a subset of the entire graph…

Dynamic:

– At time t: v and w become friends. insert_edge(v, w)
– At time t̂: u and v are no longer friends. delete_edge(u, v)

Additional operations include vertex insertions and deletions
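
The edge insertion and deletion in the friendship example above can be sketched with a small in-memory undirected graph. This is only an illustration and not the cuSTINGER API: the class name `DynamicGraph` and its methods are assumptions made for this sketch.

```python
from collections import defaultdict


class DynamicGraph:
    """Toy undirected dynamic graph (hypothetical, for illustration only)."""

    def __init__(self):
        self.adj = defaultdict(set)  # vertex -> set of neighbours

    def insert_edge(self, a, b):
        # Friendship is symmetric, so store both directions.
        self.adj[a].add(b)
        self.adj[b].add(a)

    def delete_edge(self, a, b):
        # discard() is a no-op if the edge is already absent.
        self.adj[a].discard(b)
        self.adj[b].discard(a)

    def has_edge(self, a, b):
        return b in self.adj[a]


g = DynamicGraph()
g.insert_edge("u", "v")  # initial state: u and v are friends
g.insert_edge("v", "w")  # first update: v and w become friends
g.delete_edge("u", "v")  # later update: u and v are no longer friends
```

Each update touches only the two endpoints, which is the property a dynamic structure needs to keep as the update rate grows.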


SLIDE 9

“Separation of powers”

The dynamic graph data structure and the dynamic graph algorithms are in two different repositories

– Easy to integrate with external libraries
– Can also be used with matrices

The first part of today’s talk covers the dynamic data structure


SLIDE 10

Part 1 – Data Structure

cuSTINGER Version 2.0

Improved initialization time

– 100s of times faster than Version 1.0

New memory manager

– Reduces fragmentation
– Enables memory reclamation
– Offers good memory bounds

Scalable data structure

– Can easily grow 1000X its initial size without needing to be reinitialized

Faster updates
Coming soon… (probably late May)


SLIDE 11

Restrictions of existing static graph representations

Name                       Pros                                        Cons
Adjacency matrix           Flexible                                    Limited utilization for sparse data
Linked lists               Flexible                                    Poor locality; allocation time is costly
COO (edge list), unsorted  Has some flexibility; updates are simple    Poor locality; stores both the source and destination
CSR                        Uses exact amount of memory; good locality  Inflexible

SLIDE 12

Compressed Sparse Row (CSR)

Pros:

Uses precise storage requirements
Great locality

– Good for GPUs

Handful of arrays

– Simple to use and manage

Cons:

Inflexible
Network growth unsupported
Topology changes unsupported
Property graphs not supported

[Figure: example graph in CSR form, showing the Src/Row, Offset, Dest./Col., and Value arrays]
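
To make the trade-off concrete, here is a minimal sketch of building CSR arrays from an edge list. The function names and the tiny 3-vertex input are assumptions chosen for illustration; vertex ids are 0-based here, unlike the 1-based ids in the slides.

```python
def build_csr(num_vertices, edges):
    """Build CSR arrays from (src, dst, value) triples with 0-based vertex ids."""
    counts = [0] * num_vertices
    for src, _, _ in edges:
        counts[src] += 1
    # offsets[v] .. offsets[v + 1] delimits the adjacency of vertex v.
    offsets = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        offsets[v + 1] = offsets[v] + counts[v]
    dest = [0] * len(edges)
    value = [0] * len(edges)
    cursor = offsets[:-1].copy()  # next free slot per row
    for src, dst, val in edges:
        dest[cursor[src]] = dst
        value[cursor[src]] = val
        cursor[src] += 1
    return offsets, dest, value


def neighbors(offsets, dest, v):
    """Adjacency of v is one contiguous slice, hence the good locality."""
    return dest[offsets[v]:offsets[v + 1]]


edges = [(0, 1, 5), (0, 2, 1), (1, 2, 7), (2, 0, 3)]
offsets, dest, value = build_csr(3, edges)
```

Adding a single edge to vertex 0 would shift every later entry of `dest` and `value` and change every later offset, which is why CSR is listed as inflexible.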

SLIDE 13

Part 1: cuSTINGER – A High Level View

Supports updates

– Supports edge insertion and deletion.
– Supports vertex insertion and deletion.

[Figure: cuSTINGER user-interface view. A per-vertex table with Vertex Id, Used, BSize, and Pointer columns; each pointer refers to an edge block holding Dest./Col. and Value entries, with overallocated space at the end of some blocks]

SLIDE 14

cuSTINGER – Property Graph Support

[Figure: the same user-interface view (Vertex Id, Used, BSize, Pointer), with each edge block holding the fields Dest./Col., Weight, Type, Time, User 1, User 2, …]

The fields other than Dest./Col. are optional

SLIDE 15

Challenges

Memory allocations are costly.
It may seem that we are suggesting O(V) allocations

– Absolutely not.
– Our first implementation did this… ouch…


SLIDE 16

Memory Manager

Made up of three parts:

1. Vectorized Bit Trees
2. BlockArrays
3. B+Trees of BlockArrays

THIS IS AN INTERNAL REPRESENTATION (HIDDEN FROM USERS)


SLIDE 17

BlockArrays

Definition: an array made up of equal-size blocks.
Each block can contain an equal number of edges

[Figure: a BlockArray with 4 blocks; each block holds 2 edges]
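
A BlockArray can be pictured as one flat allocation carved into equal-size blocks. The sketch below is an assumption-laden illustration, not library code: the class `BlockArray`, its method names, and the simple free list (used here in place of the bit trees described later) are all hypothetical.

```python
class BlockArray:
    """Fixed pool of equal-size edge blocks (illustrative sketch, not library code)."""

    def __init__(self, num_blocks, edges_per_block):
        self.edges_per_block = edges_per_block
        # One flat array; block i occupies slots [i * epb, (i + 1) * epb).
        self.storage = [None] * (num_blocks * edges_per_block)
        # Reversed so that pop() hands out block 0 first.
        self.free_blocks = list(range(num_blocks - 1, -1, -1))

    def allocate(self):
        """Return the index of a free block, or None if the BlockArray is full."""
        return self.free_blocks.pop() if self.free_blocks else None

    def release(self, block):
        self.free_blocks.append(block)

    def block_slice(self, block):
        start = block * self.edges_per_block
        return slice(start, start + self.edges_per_block)


ba = BlockArray(num_blocks=4, edges_per_block=2)
b0 = ba.allocate()
ba.storage[ba.block_slice(b0)] = [(2, 5), (4, 1)]  # (dest, value) pairs
```

Because all blocks in one BlockArray have the same size, a single large allocation serves many vertices, which avoids one allocation per vertex.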

SLIDE 18

cuSTINGER – BlockArray allocations

[Figure: the per-vertex table (Vertex Id, Used, BSize, Pointer) pointing into the BlockArrays BA_{1,2}, BA_{2,2}, BA_{2,3}, and BA_{3,2}, with available space and overallocated space marked; edge blocks hold Dest./Col. and Value]

A relatively small number of BlockArrays is needed

– Exact number not known at compile time (or even at runtime, given updates)

INTERNAL REPRESENTATION. HIDDEN FROM USERS

SLIDE 19

Memory Manager

Made up of three parts:

1. Vectorized Bit Trees

– Help determine which blocks are empty
– Key component for efficient memory reclamation

2. BlockArrays
3. B+Trees of BlockArrays


SLIDE 20

Vec-Trees

Each block is either full (0) or empty (1).

[Figure: Vec-Tree representation and implementation (packed into machine words) for BlockArray BA_{1,1}: (a) full BlockArray, (b) partially empty BlockArray, with the next available position marked]

SLIDE 21

Vec-Trees Complexity

Given a BlockArray with |BA| blocks:
Storage complexity is O(|BA|) bits. In practice this is close to 2 ⋅ |BA| bits

– Relatively small overhead.

Vec-Tree updates require O(log |BA|) operations
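
The storage and update bounds can be seen in a small bit-tree sketch. This is an assumed, simplified layout (a complete binary tree stored in a list, one entry per bit), not the vectorized machine-word implementation from the talk; the class name `BitTree` is hypothetical.

```python
class BitTree:
    """Binary tree of bits over n blocks (n a power of two); 1 = empty, 0 = full.

    Hypothetical sketch: a real Vec-Tree packs these bits into machine words.
    """

    def __init__(self, n):
        assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
        self.n = n
        # Nodes 1..n-1 are internal, nodes n..2n-1 are leaves: about 2n bits.
        self.bits = [1] * (2 * n)
        self.bits[0] = 0  # index 0 is unused

    def set(self, block, empty):
        node = self.n + block
        self.bits[node] = 1 if empty else 0
        node //= 2
        while node >= 1:  # O(log n) walk to the root
            self.bits[node] = self.bits[2 * node] | self.bits[2 * node + 1]
            node //= 2

    def find_empty(self):
        """Return the lowest-numbered empty block, or None if all are full."""
        if not self.bits[1]:
            return None
        node = 1
        while node < self.n:  # O(log n) descent, preferring the left child
            node = 2 * node if self.bits[2 * node] else 2 * node + 1
        return node - self.n


tree = BitTree(4)
tree.set(0, False)  # block 0 becomes full
tree.set(1, False)  # block 1 becomes full
first_empty = tree.find_empty()
```

An internal node is 1 whenever any block below it is empty, so both marking a block and finding the next empty one follow a single root-to-leaf path.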


SLIDE 22

Memory Manager

Made up of three parts:

1. Vectorized Bit Trees
2. BlockArrays
3. B+Trees of BlockArrays

– Responsible for deciding when more memory needs to be allocated


SLIDE 23

B+Trees of BlockArrays

Each block size has its own tree.
The KEY of the B+Tree is the number of empty blocks

[Figure: a B+Tree array indexed by the log of the block size (1, 2, ..., 31), with one B+Tree per block size (BlockArrays with 1, 2, or 4 edges in a block); a B+ node for BA_{2,2} records 1 available block]

SLIDE 24

B+Trees of BlockArrays

Currently supports adjacency lists with up to 2^31 edges
Can easily support up to 2^63 edge blocks!!!

[Figure: the B+Tree array indexed by the log of the block size (1, 2, ..., 31), with one B+Tree per block size; B+ nodes record 1 available block for BA_{2,2} and 0 available blocks for BA_{2,1}]

SLIDE 25

B+Trees Properties

A new BlockArray is allocated when all existing BlockArrays are full.
Great for memory utilization.
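
The allocation rule above can be sketched for a single block size. Everything here is an assumption for illustration: a plain Python list stands in for the B+Tree, the class `BlockArrayPool` is hypothetical, and the choice to reuse the BlockArray with the fewest remaining empty blocks is one plausible policy rather than something the slides state.

```python
class BlockArrayPool:
    """Pool of BlockArrays for one block size (illustrative stand-in for a B+Tree)."""

    def __init__(self, blocks_per_array):
        self.blocks_per_array = blocks_per_array
        self.free_counts = []  # free_counts[i] = empty blocks left in BlockArray i

    def allocate_block(self):
        """Return the index of the BlockArray that supplies the block."""
        # Assumed policy: prefer the BlockArray with the fewest (non-zero)
        # empty blocks, which keeps nearly full arrays packed.
        candidates = [(free, i) for i, free in enumerate(self.free_counts) if free > 0]
        if candidates:
            _, index = min(candidates)
        else:
            # Every existing BlockArray is full: only now allocate a new one.
            self.free_counts.append(self.blocks_per_array)
            index = len(self.free_counts) - 1
        self.free_counts[index] -= 1
        return index

    def release_block(self, index):
        self.free_counts[index] += 1


pool = BlockArrayPool(blocks_per_array=2)
order = [pool.allocate_block() for _ in range(3)]
```

A search structure keyed by the number of empty blocks would let the same decision be made without scanning every BlockArray, which is the role the slides assign to the B+Tree.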


SLIDE 26

cuSTINGER – Update Process for Edge Insertions

1. Count the number of insertions for each Source
2. Check edge availability for each Source. If there are not enough edges:

   1. Memory manager: get a larger block that can store all edges
   2. Copy existing edges from the old block to the new block
   3. Memory manager: the old block is reclaimed

3. Insert new edges (while avoiding duplicates)

[Figure: the per-vertex table (Vertex Id, Used, BSize, Pointer) in the user interface, and an update batch with Source, Destination, and Value columns]
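
The three steps of the insertion process can be sketched sequentially as follows. This is a simplified illustration under stated assumptions: the dictionary layout, the function names, and the power-of-two block sizing are hypothetical, and Python lists stand in for blocks handed out by the memory manager.

```python
def next_block_size(needed):
    """Smallest power of two that can hold `needed` edges (assumed sizing policy)."""
    size = 1
    while size < needed:
        size *= 2
    return size


def insert_batch(graph, batch):
    """graph: dict vertex -> {'used': int, 'bsize': int, 'block': list of (dst, val)}."""
    # Step 1: count the insertions per source vertex.
    counts = {}
    for src, _, _ in batch:
        counts[src] = counts.get(src, 0) + 1
    # Step 2: grow the block of any source that might run out of space.
    for src, extra in counts.items():
        entry = graph[src]
        if entry['used'] + extra > entry['bsize']:
            new_size = next_block_size(entry['used'] + extra)
            new_block = [None] * new_size  # get a larger block
            new_block[:entry['used']] = entry['block'][:entry['used']]  # copy old edges
            entry['block'], entry['bsize'] = new_block, new_size  # old block is dropped
    # Step 3: insert new edges while skipping duplicates.
    for src, dst, val in batch:
        entry = graph[src]
        existing = {d for d, _ in entry['block'][:entry['used']]}
        if dst not in existing:
            entry['block'][entry['used']] = (dst, val)
            entry['used'] += 1


graph = {
    1: {'used': 2, 'bsize': 2, 'block': [(2, 1), (5, 1)]},
    4: {'used': 1, 'bsize': 2, 'block': [(1, 1), None]},
}
insert_batch(graph, [(1, 4, 1), (4, 1, 1), (1, 3, 1)])
```

Vertex 1 outgrows its 2-edge block and moves to a 4-edge block, while the duplicate edge (4, 1) is skipped and vertex 4 keeps its block.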

SLIDE 27

cuSTINGER – Full View (after update)

[Figure: full view after the update. The user-interface table (Vertex Id, Used, BSize, Pointer), the update batch (Source, Destination, Value), and the internal BlockArrays BA_{1,2}, BA_{2,2}, BA_{2,3}, BA_{3,2}, and BA_{3,3} with their Vec-Tree bit status; edge blocks hold Dest./Col. and Value]

INTERNAL REPRESENTATION. HIDDEN FROM USERS

SLIDE 28

cuSTINGER – Go Home with this View

[Figure: the user-interface view after the update: the per-vertex table (Vertex Id, Used, BSize, Pointer), the edge blocks (Dest./Col., Value), and the update batch (Source, Destination, Value)]

SLIDE 29

cuSTINGER – Data Structure

https://github.com/cuStinger/cuStinger
Build instructions

– git clone --recursive https://github.com/cuStinger/cuStinger.git
– mkdir build && cd build
– cmake ..
– make -j8


SLIDE 30

Performance Analysis

Initialization Overhead
Memory Utilization
Update rate

– Number of sustainable updates per second

CSR Vs. cuSTINGER

– SpMV


SLIDE 31

Experimental Setup

For the K80, we use only one of its two GPUs
Results are reported for the K80 unless noted

GPU   μArch   SMs   SPs     Memory (GB)  Memory Type
K40   Kepler  15    2880    12           GDDR5
K80   Kepler  2x13  2x2496  2x12         GDDR5
P100  Pascal  56    3584    16           HBM2

SLIDE 32

Input Graphs

DIMACS 10 Graph Implementation Challenge
SNAP – Stanford Network Analysis Project
Florida Matrix Collection

The following is only a subset of these graphs:

Name           Type           |V|     |E|*   Source
coAuthorsDBLP  Collaboration  299k    1.95M  DIMACS
as-skitter     Trace route    1.69M   11.1M  SNAP
kron_21        Random         2M      201M   DIMACS
cit-patents    Citation       3.77M   16.5M  SNAP
cage15         Matrix         5.15M   94M    DIMACS
uk-2002        Webcrawl       18.52M  523M   DIMACS

SLIDE 33

Memory Utilization - Edges

Utilization = Used / Allocated

70% average utilization

[Figure: bar chart of edge utilization (space efficiency, 0% to 100%)]
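
As a worked example of the utilization ratio, the sketch below sums used and allocated edge slots over a small per-vertex table. The (used, bsize) pairs are hypothetical values chosen for illustration, not measurements from the talk.

```python
def edge_utilization(vertices):
    """vertices: list of (used, bsize) pairs, one per vertex."""
    used = sum(u for u, _ in vertices)
    allocated = sum(b for _, b in vertices)
    return used / allocated


# Hypothetical per-vertex (used, bsize) pairs with power-of-two block sizes.
table = [(2, 2), (3, 4), (3, 4), (2, 2), (4, 4), (2, 2), (1, 1)]
utilization = edge_utilization(table)  # 17 used slots out of 19 allocated
```

Only the two vertices with 3 edges in a 4-edge block waste space here; in general the waste comes from rounding each adjacency list up to the next block size.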

SLIDE 34

Memory Utilization - Blocks

Utilization = Used / Allocated

90% average utilization

[Figure: bar chart of block utilization (space efficiency, 0% to 100%)]

SLIDE 35

Memory Utilization - Overall

70% average utilization
30% overhead in comparison to CSR

[Figure: bar chart of overall utilization (space efficiency, 0% to 100%)]

SLIDE 36

Initialization Time

Copying from the CPU to the GPU is costly
Initializing cuSTINGER does not add much overhead
We still need to optimize this process for the P100
Over 100x faster than cuSTINGER Version 1.0

[Figure: initialization overhead relative to a CSR memcpy (scale 1 to 4), split into CSR copy from host to device and initialization on the device]

SLIDE 37

Insertion Rates

Supports up to 90M updates per second
Version 2.0:

– Is 4X-10X faster than Version 1.0
– Does not have a performance dip like Version 1.0

Scalable growth in update rate

[Figure: insertion rates, Version 1.0 vs. Version 2.0]

SLIDE 38

SpMV: CSR Vs. cuSTINGER

Simply replace CSR accesses with cuSTINGER

– A real “apples to apples” comparison

[Figure: SpMV performance in MFLOPS (log scale, 1 to 100,000), CSR vs. cuSTINGER]
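
To show what replacing CSR accesses means for SpMV, here is a sequential sketch of the same product over two layouts. Both functions and the block-per-row layout (a `used` count plus a possibly overallocated list of slots) are assumptions for illustration, not the GPU kernels that were benchmarked.

```python
def spmv_csr(offsets, dest, value, x):
    """y = A * x with A stored in CSR."""
    y = [0.0] * (len(offsets) - 1)
    for row in range(len(y)):
        for k in range(offsets[row], offsets[row + 1]):
            y[row] += value[k] * x[dest[k]]
    return y


def spmv_blocks(rows, x):
    """Same product; rows[r] = (used, block) with block a list of (col, val) slots."""
    y = [0.0] * len(rows)
    for row, (used, block) in enumerate(rows):
        for col, val in block[:used]:  # ignore the overallocated tail
            y[row] += val * x[col]
    return y


x = [1.0, 2.0, 3.0]
csr = ([0, 2, 3, 4], [1, 2, 2, 0], [5.0, 1.0, 7.0, 3.0])
rows = [(2, [(1, 5.0), (2, 1.0)]), (1, [(2, 7.0), None]), (1, [(0, 3.0)])]
y_csr = spmv_csr(*csr, x)
y_blocks = spmv_blocks(rows, x)
```

The inner loop is identical in both versions; only the way a row's start and length are looked up changes, which is why the comparison is like for like.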

SLIDE 39

Part 2: Algorithms for cuSTINGER

Goal: support static and dynamic graph algorithms
Already showed that the graph update process is efficient
All algorithms are implemented using the same set of operations

– We show that these operators are efficient for static graph algorithms


SLIDE 40

cuSTINGER Programming Model

“Keep things simple”

– Limit the amount of GPU programming users need to do.

Uses vertex and edge frontiers

– Similar to Gunrock and Ligra
– Necessary for good utilization
– Still requires good load balancing
– Edge frontiers are created implicitly from vertex frontiers.
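
The frontier idea can be illustrated with a sequential top-down BFS. This is a generic sketch, not the cuSTINGER operator API: the function name and the dictionary-based adjacency are assumptions.

```python
def bfs_top_down(adj, source):
    """adj: dict vertex -> list of neighbours. Returns dict vertex -> BFS level."""
    level = {source: 0}
    frontier = [source]  # vertex frontier
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        # The edge frontier is implicit: every edge leaving the vertex frontier.
        for v in frontier:
            for u in adj[v]:
                if u not in level:
                    level[u] = depth
                    next_frontier.append(u)
        frontier = next_frontier
    return level


adj = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2], 5: []}
levels = bfs_top_down(adj, 1)
```

On a GPU the two nested loops are where load balancing matters, since vertices in the same frontier can have very different numbers of outgoing edges.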


SLIDE 41

Vertex and Edge Frontiers

Ligra

– CPU HPC graph framework by Julian Shun
– Two backends: Cilk and OpenMP
– Edge frontiers created implicitly from vertex frontiers
– Typically one phase

Gunrock

– Highly tuned GPU graph library from Prof. John Owens
– Supports multi-GPU analytics (same shared node)
– Each operation consists of two phases
– Edge frontiers created explicitly by programmer.


SLIDE 42

Case Study 1: Label Propagation Connected Components

Connected component algorithms such as Shiloach-Vishkin
Initially every vertex is in its own component
Vertices move to the component with the smallest ID

– Vertices can swap components multiple times


SLIDE 43

One Iteration of Connected Components Algorithm

// Label propagation: traverse all edges, O(E)
For all v ∈ V
    For all u ∈ adj(v)
        if CC[u] < CC[v]
            CC[v] ← CC[u]

// Shortcutting: traverse all vertices, O(V)
For all v ∈ V
    CC[v] ← CC[CC[v]]
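
One way to run the label-propagation and shortcutting steps to convergence is sketched below, written sequentially for clarity. The function name and the dictionary-based graph are assumptions, and the smaller label wins, so that each vertex ends up with the smallest vertex ID in its component.

```python
def connected_components(adj):
    """adj: dict vertex -> list of neighbours (undirected). Returns dict vertex -> label."""
    cc = {v: v for v in adj}  # every vertex starts in its own component
    changed = True
    while changed:
        changed = False
        # Label propagation: traverse all edges.
        for v in adj:
            for u in adj[v]:
                if cc[u] < cc[v]:
                    cc[v] = cc[u]
                    changed = True
        # Shortcutting: traverse all vertices.
        for v in adj:
            cc[v] = cc[cc[v]]
    return cc


adj = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4], 6: []}
labels = connected_components(adj)
```

The loop stops once a full pass over the edges changes no label, at which point every edge joins two vertices with the same label.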

SLIDE 44

Revisiting the algorithm in parallel*

Notice: no triple angle brackets (<<<>>>), the CUDA kernel launch syntax

* We have more optimized versions in the library (they require about 4 more lines of code…)

SLIDE 45

Case Study 2: Katz Centrality

Given pseudocode for a variant of Katz Centrality
It took 5-6 hours to port from the CPU to the GPU.

– NVIDIA P100 GPU, initial speedup: 70X
– The CPU version had some additional optimizations. Within an additional two hours the speedup was 100X

This was feasible because of the pre-existing load-balanced primitives in the library


SLIDE 46

cuSTINGER Algorithms

https://github.com/cuStinger/cuStingerAlg
Build instructions

– git clone --recursive https://github.com/cuStinger/cuStingerAlg.git
– mkdir build && cd build
– cmake ..
– make -j8

By default, cuSTINGER will also be cloned

– Though you will need to build both repos


SLIDE 47

cuSTINGERv2 Algorithms

https://github.com/cuStinger/cuStingerAlg
Build instructions

– git clone --recursive https://github.com/cuStinger/cuStingerAlg.git
– cd cuStingerAlg/build
– cmake ..
– make -j


SLIDE 48

Performance Analysis

Connected Components
Breadth First Search
Triangle Counting

– Static
– Dynamic

SLIDE 49

Connected Components – NVIDIA P100

Using label propagation
Within 25% for most cases.

Name           Gunrock (msec.)  cuSTINGER (msec.)  Speedup
coAuthorsDBLP  1.72             2.17               0.79X
as-skitter     3.68             17.4               0.21X
kron_21        86.84            66.4               1.33X
cit-patents    38.85            40.8               0.95X
cage15         46.1             56.1               0.82X
uk-2002        407              489                0.83X

SLIDE 50

BFS – Classic Top-Down – NVIDIA P100

Uses an algorithm similar to the one in Gunrock

– Gunrock has additional optimizations that can make it faster than cuSTINGER

Name           Gunrock (msec.)  cuSTINGER (msec.)  Speedup
coAuthorsDBLP  2.74             2.44               1.12X
as-skitter     7.74             10.6               0.73X
kron_21        45.4             25.7               1.76X
cit-patents    16.5             23.3               0.71X
cage15         29.1             43.2               0.67X
uk-2002        39.9             81.6               0.49X

SLIDE 51

Triangle Counting: CSR Vs. cuSTINGER

Triangle counting algorithm taken from [Green et al.; IA3; 2014]
Simply replace CSR accesses with cuSTINGER
Executed on a K40

Name           |V|     |E|    Time CSR (sec.)  Time cuSTINGER (sec.)  Execution Difference
coAuthorsDBLP  299k    1.95M  0.218            0.242                  +10%
as-skitter     1.69M   11.1M  57.14            59.37                  +3.8%
kron_21        2M      201M   2992             2996                   +0.14%
cit-patents    3.77M   16.5M  0.814            0.830                  +2%
cage15         5.15M   94M    6.544            7.204                  +10%
uk-2002        18.52M  523M   424.9            431.4                  +1.6%

SLIDE 52

Summary


SLIDE 53

Library Overview

Completed and ongoing algorithms. Of course, many more algorithms to come…

Algorithm               Static  Dynamic  Reference
Breadth first search    ✓
Triangle counting       ✓       ✓        Static: [Green et al.; IA3; 2014]. Dynamic: new algorithm [Makkar; 2017, submitted]
Connected components    ✓       Ongoing  [McColl; HiPC 2013]
Betweenness centrality  ✓       Ongoing  [Green; SocialCom 2012]
PageRank                ✓       Ongoing  New algorithm (non linear algebra based)
Katz centrality         ✓                New algorithm (non linear algebra based)

SLIDE 54

Upcoming projects using cuSTINGER

Extend dynamic triangle counting to Jaccard indices
Scalable pattern and motif detection on the GPU


SLIDE 55

Take away

Dynamic data structure for sparse data sets
Supports high update rates
Scalable in both data size and in performance


SLIDE 56

Collaborators

Prof. David Bader (Georgia Tech)
Prof. Jimeng Sun (Georgia Tech)
Dr. Jason Riedy (Georgia Tech)
Federico Busato, Visiting PhD student (Universita di Verona)
James Fox, PhD student (Georgia Tech)
Euna Kim, PhD student (Georgia Tech)
Muhammad Osama Sakhi, BSc student (Georgia Tech)
Alok Tripathy, BSc student (Georgia Tech)
Manas George, BSc student (Georgia Tech)
Graduates (GT)

– Devavret Makkar, MSc. (Tower Research)


SLIDE 57

Thank you

Email: ogreen@gatech.edu
Data structure:

– https://github.com/cuStinger/cuStinger

Algorithms:

– https://github.com/cuStinger/cuStingerAlg

Version 2.0, coming soon to a GPU near you
