

slide-1
SLIDE 1

cuSTINGER – A Sparse Dynamic Graph and Matrix Data Structure

Oded Green

slide-2
SLIDE 2

This talk covers two things:

  • A scalable, dynamic data structure for graph algorithms and linear-algebra-based problems
  • A framework for static and dynamic analytics

NVIDIA GTC - cuSTINGER - Oded Green, 2017

slide-3
SLIDE 3

Upfront Summary of Results

  • Can support up to 90 million updates per second
  • Low overhead in comparison with CSR
    – Initialization is also relatively inexpensive: 20%-200%
    – Equal performance
  • Great performance for static graph algorithms
  • Simple to use

slide-4
SLIDE 4

Big Data problems need Graph Analysis

Communication networks:
  • World-wide connectivity
  • High-velocity changes
  • Different types of extracted data:
    – Physical communication network.
    – Person-to-person communication network.

Health-care networks:
  • Various players.
  • Pattern matching and epidemic monitoring.
  • Problem sizes have doubled in the last 5 years.

Financial networks:
  • Transactions between players.
  • Different transaction types (property graph)

Graphs are a unifying motif for data analytics.

slide-5
SLIDE 5

STINGER

  • STINGER: Spatio-Temporal Interaction Networks and Graphs (STING) Extensible Representation
  • Enables algorithm designers to implement dynamic and streaming graph algorithms with ease.
  • Portable semantics for various platforms
    – Linked list of edge blocks is not ideal for the GPU
  • Good performance for all types of graph problems and algorithms, static and dynamic.
  • Assumes globally addressable memory access

slide-6
SLIDE 6

STINGER and cuSTINGER Properties

✓ A simple programming model
✓ Millions of updates per second to the graph
✓ Updates are not bottlenecks for analytics.
✓ Advanced memory manager
✓ Transfers data between host and device automatically
✓ Reduces initialization time
✓ Allows for simple update processes

STINGER papers: [Bader et al.; 2007; Tech Report], [Ediger et al.; HPEC; 2012], [McColl et al.; PPAA; 2014]
cuSTINGER paper: [Green & Bader; HPEC; 2016]: cuSTINGER: Supporting dynamic graph algorithms for GPUs

slide-7
SLIDE 7

Definitions

  • Dynamic graphs (matrices)
    – Graph can change over time.
    – Changes can be to topology, edges, or vertices.
      • For example, new edges between two vertices.
    – Changes to edge or vertex weights
  • Streaming graphs:
    – Graphs changing at high rates.
    – 100s of thousands of updates per second.

slide-8
SLIDE 8

Dynamic graph example

  • Only a subset of the entire graph…
  • Dynamic:
    – At time t:
      • v and w become friends.
      • insert_edge(v, w)
    – At time t̂:
      • u and v are no longer friends
      • delete_edge(u, v)
  • Additional operations include vertex insertions & deletions

[Figure: example graph with vertices u, v and w]
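The two operations on this slide can be sketched on a toy adjacency structure. This is a minimal illustration, assuming an undirected friendship graph stored as Python sets; the class name `DynamicGraph` and its layout are hypothetical and are not the cuSTINGER API.

```python
from collections import defaultdict


class DynamicGraph:
    """Toy undirected dynamic graph (illustration only, not cuSTINGER)."""

    def __init__(self):
        self.adj = defaultdict(set)  # vertex -> set of neighbours

    def insert_edge(self, a, b):
        # A friendship is undirected, so store both directions.
        self.adj[a].add(b)
        self.adj[b].add(a)

    def delete_edge(self, a, b):
        # discard() does nothing if the edge is already absent.
        self.adj[a].discard(b)
        self.adj[b].discard(a)


g = DynamicGraph()
g.insert_edge("u", "v")  # initial state: u and v are friends
g.insert_edge("v", "w")  # at time t: v and w become friends
g.delete_edge("u", "v")  # at the later time: u and v are no longer friends
```

After these three calls, v has w as its only neighbour and u has none.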

slide-9
SLIDE 9

“Separation of powers”

  • The dynamic graph data structure and the dynamic graph algorithms are in two different repositories
    – Easy to integrate with external libraries
    – Can also be used with matrices
  • The first part of today's talk covers the dynamic data structure

slide-10
SLIDE 10

Part 1 – Data Structure

cuSTINGER Version 2.0

  • Improved initialization time
    – 100s of times faster than Version 1.0
  • New memory manager
    – Reduces fragmentation
    – Enables memory reclamation
    – Offers good memory bounds
  • Scalable data structure
    – Can easily grow to 1000X its initial size without needing to be re-initialized
  • Faster updates

Coming soon… (probably late May)

slide-11
SLIDE 11

Restrictions of existing static graph representations

Name                         Pros                                          Cons
Adjacency matrix             Flexible                                      Limited utilization for sparse data
Linked lists                 Flexible                                      Poor locality; allocation time is costly
COO (edge list), unsorted    Has some flexibility; updates are simple      Poor locality; stores both the source and destination
CSR                          Uses exact amount of memory; good locality    Inflexible

slide-12
SLIDE 12

Compressed Sparse Row (CSR)

Pros:
  • Uses precise storage requirements
  • Great locality
    – Good for GPUs
  • Handful of arrays
    – Simple to use and manage

Cons:
  • Inflexible.
  • Network growth unsupported
  • Topology changes unsupported
  • Property graphs not supported

[Figure: CSR example showing the Src/Row Offset array and the Dest./Col. and Value arrays]
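As a rough illustration of why CSR is compact but inflexible, the sketch below builds the three CSR arrays from an edge list. The function name `build_csr` and the small example graph are assumptions made for illustration; they do not come from the talk.

```python
def build_csr(num_vertices, edges):
    """Build CSR arrays (offsets, cols, vals) from (src, dst, value) triples."""
    offsets = [0] * (num_vertices + 1)
    for src, _, _ in edges:
        offsets[src + 1] += 1              # count edges per source row
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]       # prefix sum gives row start positions
    cols = [0] * len(edges)
    vals = [0] * len(edges)
    cursor = offsets[:-1].copy()           # next free slot in each row
    for src, dst, val in edges:
        cols[cursor[src]] = dst
        vals[cursor[src]] = val
        cursor[src] += 1
    return offsets, cols, vals


edges = [(0, 1, 2), (0, 2, 5), (1, 2, 4), (2, 0, 1)]
offsets, cols, vals = build_csr(3, edges)
# Neighbours of vertex 0 occupy cols[offsets[0]:offsets[1]].
row0 = cols[offsets[0]:offsets[1]]
```

Every row is packed directly against the next one, so adding a single edge to vertex 0 would mean shifting all later entries of `cols` and `vals` and bumping every later offset. That is the inflexibility listed under "Cons".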

slide-13
SLIDE 13

Part 1: cuSTINGER – A High Level View

  • Supports updates
    – Supports edge insertion and deletion.
    – Supports vertex insertion and deletion.

[Figure: user-interface arrays indexed by Vertex Id (Used, BSize, Pointer); each pointer references an array of Dest./Col. and Value entries that includes over-allocated space]
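The per-vertex arrays in the figure (Used, BSize, Pointer) can be mimicked with plain Python lists. This is only a sketch: the class `VertexArrays`, the helper `next_pow2`, and the choice to round each block up to a power of two are assumptions for illustration, not the library's actual layout.

```python
def next_pow2(n):
    """Smallest power of two that is >= max(n, 1)."""
    return 1 << max(n - 1, 0).bit_length()


class VertexArrays:
    """Used / BSize / Pointer arrays with over-allocated adjacency blocks."""

    def __init__(self, adjacency):
        # adjacency[v] is a list of (dest, value) pairs for vertex v.
        self.used = [len(nbrs) for nbrs in adjacency]
        self.bsize = [next_pow2(len(nbrs)) for nbrs in adjacency]
        # "Pointer" is modelled as a list padded with None up to bsize.
        self.pointer = [
            list(nbrs) + [None] * (b - len(nbrs))
            for nbrs, b in zip(adjacency, self.bsize)
        ]

    def neighbours(self, v):
        # Only the first used[v] slots hold real edges.
        return self.pointer[v][: self.used[v]]

    def spare_slots(self, v):
        return self.bsize[v] - self.used[v]


va = VertexArrays([[(1, 2), (2, 5)], [(2, 4)], [(0, 1), (1, 3), (3, 7)], []])
```

Vertex 2 here stores three edges in a block of four, so one insertion can be absorbed without any reallocation.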

slide-14
SLIDE 14

cuSTINGER – Property Graph Support

[Figure: the same user-interface arrays (Vertex Id, Used, BSize, Pointer), with each edge carrying the fields Dest./Col., Weight, Type, Time 1, User 1, User 2, …]

  • These are optional fields
slide-15
SLIDE 15

Challenges

  • Memory allocations are costly.
  • It may seem like we are suggesting that we need O(V) allocations
    – Absolutely not.
    – Our first implementation did this… ouch…

slide-16
SLIDE 16

Memory Manager

Made up of three parts:

  • 1. Vectorized Bit Trees
  • 2. BlockArrays
  • 3. B+Trees of BlockArrays
  • THIS IS AN INTERNAL REPRESENTATION (HIDDEN FROM USERS)

slide-17
SLIDE 17
BlockArrays

  • Definition: an array made up of equal-size blocks.
  • Each block can contain an equal number of edges

[Figure: a BlockArray with 4 blocks, each block holding 2 edges]
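A BlockArray as defined above can be sketched as one flat list cut into equal-size blocks. The class and method names below are hypothetical and chosen only to mirror the figure (4 blocks of 2 edges each).

```python
class BlockArray:
    """A flat array split into equal-size blocks of edges (illustrative only)."""

    def __init__(self, num_blocks, edges_per_block):
        self.num_blocks = num_blocks
        self.edges_per_block = edges_per_block
        self.storage = [None] * (num_blocks * edges_per_block)

    def block_range(self, block_id):
        # Block i occupies a contiguous slice of the flat storage.
        start = block_id * self.edges_per_block
        return start, start + self.edges_per_block

    def write_block(self, block_id, edges):
        if len(edges) > self.edges_per_block:
            raise ValueError("too many edges for one block")
        start, _ = self.block_range(block_id)
        self.storage[start:start + len(edges)] = edges

    def read_block(self, block_id):
        start, end = self.block_range(block_id)
        return self.storage[start:end]


ba = BlockArray(num_blocks=4, edges_per_block=2)
ba.write_block(1, [(5, 0), (2, 7)])  # (dest, value) pairs
```

Because all blocks in one BlockArray have the same size, a block is identified by a single index and a single large allocation serves many vertices.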

slide-18
SLIDE 18

cuSTINGER – BlockArray allocations

[Figure: user-interface arrays (Vertex Id, Used, BSize, Pointer) pointing into BlockArrays BA_{1,2}, BA_{2,2}, BA_{2,3} and BA_{3,2}, which hold Dest./Col. and Value entries together with available space and over-allocated space]

  • A relatively small number of BlockArrays is needed
    – Exact number not known at compile time (or even at runtime, given updates)

INTERNAL REPRESENTATION. HIDDEN FROM USERS

slide-19
SLIDE 19

Memory Manager

Made up of three parts:

  • 1. Vectorized Bit Trees
    – Help determine which blocks are empty
    – Key component for efficient memory reclamation
  • 2. BlockArrays
  • 3. B+Trees of BlockArrays

slide-20
SLIDE 20
Vec-Trees

  • Each block is either full (0) or empty (1).

[Figure: Vec-Tree representation and its machine-word implementation for BlockArray BA_{1,1}: (a) full BlockArray; (b) partially empty BlockArray, with the next available position marked]

slide-21
SLIDE 21

Vec­Trees Complexity

  • Given a BlockArray with BA blocks:
  • Storage complexity is O(BA) bits. In practice this is close to 2·BA bits
    – Relatively small overhead.
  • Vec-Tree updates require O(log BA) operations
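The complexity figures above are what a complete binary bit tree over the blocks gives. The sketch below is an assumption about the general idea, not the library's implementation: it stores one int per bit instead of packing bits into machine words, and the names `VecTree`, `allocate` and `release` are hypothetical.

```python
class VecTree:
    """Binary bit tree over blocks: 1 = empty (available), 0 = full."""

    def __init__(self, num_blocks):
        if num_blocks <= 0 or num_blocks & (num_blocks - 1):
            raise ValueError("num_blocks must be a power of two")
        self.n = num_blocks
        # Heap layout: node i has children 2i and 2i+1; leaves start at index n.
        # Index 0 is unused, so about 2 * num_blocks bits are stored.
        self.bits = [1] * (2 * num_blocks)

    def _update(self, leaf, value):
        i = self.n + leaf
        self.bits[i] = value
        i //= 2
        while i >= 1:  # walk up to the root: O(log num_blocks) steps
            self.bits[i] = self.bits[2 * i] | self.bits[2 * i + 1]
            i //= 2

    def allocate(self):
        """Mark the lowest-numbered empty block as full and return it, or None."""
        if self.bits[1] == 0:
            return None  # root is 0: every block is full
        i = 1
        while i < self.n:  # descend towards an empty leaf
            i = 2 * i if self.bits[2 * i] else 2 * i + 1
        self._update(i - self.n, 0)
        return i - self.n

    def release(self, block):
        self._update(block, 1)


tree = VecTree(4)
first_four = [tree.allocate() for _ in range(4)]
exhausted = tree.allocate()
tree.release(1)
reused = tree.allocate()
```

An internal node is the OR of its children, so a 0 at the root says the whole BlockArray is full without inspecting any leaf.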

slide-22
SLIDE 22

Memory Manager

Made up of three parts:

  • 1. Vectorized Bit Trees
  • 2. BlockArrays
  • 3. B+Trees of BlockArrays
    – Responsible for deciding when more memory needs to be allocated

slide-23
SLIDE 23

B+Trees of BlockArrays

  • Each block size has a different tree.
  • The KEY of the B+Trees is the number of empty blocks

[Figure: B+Tree array indexed by the log of the block size (entries 1, 2, ..., 31), with separate B+Trees for BlockArrays with 1, 2 and 4 edges in a block; a B+ node for BA_{2,2} shows 1 available block]
slide-24
SLIDE 24

B+Trees of BlockArrays

  • Currently supports adjacency lists with up to 2^31 edges
  • Can easily support up to 2^63 edge blocks!!!

[Figure: the same B+Tree array, indexed by the log of the block size (entries 1, 2, ..., 31); B+ nodes show BA_{2,2} with 1 available block and BA_{2,1} with 0 available blocks]

slide-25
SLIDE 25

B+Trees Properties

  • A new BlockArray is allocated when all existing BlockArrays are full.
  • Great for memory utilization.
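The allocation rule above can be sketched with a much simpler container than a B+Tree. In the sketch below a plain Python list per block size stands in for each B+Tree, and the policy of preferring the BlockArray with the fewest empty blocks is an assumption; the talk only states that the key is the number of empty blocks. All names (`MemoryManager`, `BlockArrayInfo`, `blocks_per_array`) are hypothetical.

```python
class BlockArrayInfo:
    """Tracks which blocks of one BlockArray are still empty."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))


class MemoryManager:
    def __init__(self, blocks_per_array=4):
        self.blocks_per_array = blocks_per_array
        # log2(block size) -> list of BlockArrayInfo (stand-in for a B+Tree)
        self.by_log_size = {}

    def allocate(self, num_edges):
        """Return a handle (log_size, array_index, block_id) for num_edges edges."""
        log_size = max(num_edges - 1, 0).bit_length()  # round up to a power of two
        arrays = self.by_log_size.setdefault(log_size, [])
        candidates = [i for i, ba in enumerate(arrays) if ba.free]
        if not candidates:
            # All existing BlockArrays of this size are full: allocate a new one.
            arrays.append(BlockArrayInfo(self.blocks_per_array))
            candidates = [len(arrays) - 1]
        # Assumed policy: fill the BlockArray with the fewest empty blocks first.
        idx = min(candidates, key=lambda i: len(arrays[i].free))
        return log_size, idx, arrays[idx].free.pop(0)

    def release(self, handle):
        log_size, idx, block = handle
        self.by_log_size[log_size][idx].free.append(block)


mm = MemoryManager(blocks_per_array=2)
h1 = mm.allocate(3)  # needs a block of 4 edges
h2 = mm.allocate(4)
h3 = mm.allocate(3)  # first BlockArray is full, so a second one is created
mm.release(h1)
h4 = mm.allocate(4)  # reuses the reclaimed block
```

Keeping the containers ordered by the number of empty blocks, as the slides describe, would replace the linear scan here with a tree lookup.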

slide-26
SLIDE 26

cuSTINGER – Update Process for Edge Insertions

1. Count the number of insertions for each Source.
2. Check edge availability for each Source. If there are not enough edges:
   1. Memory manager: get a larger block that can store all edges
   2. Copy existing edges from the old block to the new block
   3. Memory manager: the old block is reclaimed
3. Insert new edges (while avoiding duplicates)

[Figure: user-interface arrays (Vertex Id, Used, BSize, Pointer) and an update batch with Source, Destination and Value columns]
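The three steps of the update process can be traced in a sequential sketch. The real process runs on the GPU; the function `insert_batch`, the helper `next_pow2`, the power-of-two growth rule, and the list-based blocks below are illustrative assumptions.

```python
from collections import Counter


def next_pow2(n):
    """Smallest power of two that is >= max(n, 1)."""
    return 1 << max(n - 1, 0).bit_length()


def insert_batch(used, bsize, blocks, batch):
    """Apply a batch of (src, dst, value) insertions in place.

    used[v]   : number of edges currently stored for vertex v
    bsize[v]  : capacity of the block of vertex v
    blocks[v] : list of (dst, value) slots, padded with None
    """
    # Step 1: count the insertions for each source vertex.
    counts = Counter(src for src, _, _ in batch)
    # Step 2: grow every block that cannot hold its new edges.
    for v, extra in counts.items():
        needed = used[v] + extra
        if needed > bsize[v]:
            new_size = next_pow2(needed)
            new_block = blocks[v][:used[v]] + [None] * (new_size - used[v])
            blocks[v], bsize[v] = new_block, new_size  # old block is dropped
    # Step 3: insert the new edges, skipping duplicates.
    for src, dst, value in batch:
        existing = {d for d, _ in blocks[src][:used[src]]}
        if dst not in existing:
            blocks[src][used[src]] = (dst, value)
            used[src] += 1


used = [1, 2]
bsize = [1, 2]
blocks = [[(1, 5)], [(0, 5), (2, 7)]]
insert_batch(used, bsize, blocks, [(0, 2, 3), (0, 1, 9), (1, 2, 1), (0, 3, 4)])
```

Step 2 sizes each block for the worst case in which no insertion is a duplicate, so a block can end up larger than strictly needed, as vertex 1 does here.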

slide-27
SLIDE 27

cuSTINGER – Full View (after update)

[Figure: full view after the update. User interface: Vertex Id, Used, BSize and Pointer arrays. Internal representation (hidden from users): BlockArrays BA_{1,2}, BA_{2,2}, BA_{2,3}, BA_{3,2} and BA_{3,3} holding Dest./Col. and Value entries, each with its Vec-Tree bit status. Also shown: the update batch (Source, Destination, Value)]

slide-28
SLIDE 28

cuSTINGER – Go Home with this View

[Figure: the user-facing view after the update: Vertex Id, Used, BSize and Pointer arrays, the adjacency arrays of Dest./Col. and Value entries, and the update batch (Source, Destination, Value)]

slide-29
SLIDE 29

cuSTINGER – Data Structure

  • https://github.com/cuStinger/cuStinger
  • Build instructions
    – git clone --recursive https://github.com/cuStinger/cuStinger.git
    – cd cuStinger
    – mkdir build && cd build
    – cmake ..
    – make -j8

slide-30
SLIDE 30

Performance Analysis

  • Initialization Overhead
  • Memory Utilization
  • Update rate

– Number of sustainable updates per second

  • CSR Vs. cuSTINGER

– SpMV


slide-31
SLIDE 31
Experimental Setup

  • For the K80, we use only one of its two GPUs
  • We report results for the K80, unless noted otherwise

GPU    μArch    SMs     SPs       Memory (GB)   Memory Type
K40    Kepler   15      2880      12            GDDR5
K80    Kepler   2x13    2x2496    2x12          GDDR5
P100   Pascal   56      3584      16            HBM2

slide-32
SLIDE 32

Input Graphs

  • DIMACS 10 Graph Implementation Challenge
  • SNAP: Stanford Network Analysis Project
  • Florida Matrix Collection

The following is only a subset of these graphs:

Name            Type            |V|      |E|*     Source
coAuthorsDBLP   Collaboration   299k     1.95M    DIMACS
as-skitter      Trace route     1.69M    11.1M    SNAP
kron_21         Random          2M       201M     DIMACS
cit-patents     Citation        3.77M    16.5M    SNAP
cage15          Matrix          5.15M    94M      DIMACS
uk-2002         Webcrawl        18.52M   523M     DIMACS

slide-33
SLIDE 33

Memory Utilization ­ Edges

  • Utilization = Used / Allocated
  • 70% average utilization

[Figure: space efficiency (0% to 100%) per graph, edge utilization]

slide-34
SLIDE 34

Memory Utilization ­ Blocks

  • Utilization = Used / Allocated
  • 90% average utilization

[Figure: space efficiency (0% to 100%) per graph, block utilization]

slide-35
SLIDE 35

Memory Utilization ­ Overall

  • 70% average utilization
  • 30% overhead in comparison to CSR

[Figure: space efficiency (0% to 100%) per graph, overall utilization]

slide-36
SLIDE 36

Initialization Time

  • Copying from the CPU to the GPU is costly
  • Initializing cuSTINGER does not add much overhead
  • We still need to optimize this process for the P100
  • Over 100x faster than cuSTINGER Version 1.0

[Figure: overhead vs. CSR memcpy (axis 1 to 4), comparing CSR, copy host to device, and initialization on device]

slide-37
SLIDE 37

Insertion Rates

  • Supports up to 90M updates per second
  • Version 2.0
    – Is 4X-10X faster than Version 1.0
    – Does not have a performance dip like Version 1.0
  • Scalable growth in update rate

[Figure: insertion rates for Version 1.0 and Version 2.0]

slide-38
SLIDE 38

SpMV: CSR Vs. cuSTINGER

  • Simply replace CSR accesses with cuSTINGER accesses
    – A real "apples to apples" comparison

[Figure: SpMV performance in MFLOPS on a log scale (1 to 100,000), CSR vs. cuSTINGER]
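To make the "replace CSR accesses" idea concrete, here are two sequential SpMV loops that differ only in how a row's non-zeros are located. The function names and the list-based layout with a `used` count per row are assumptions for illustration, not the benchmark code from the talk.

```python
def spmv_csr(offsets, cols, vals, x):
    """y = A * x with A stored in CSR form."""
    y = [0.0] * (len(offsets) - 1)
    for row in range(len(y)):
        for k in range(offsets[row], offsets[row + 1]):
            y[row] += vals[k] * x[cols[k]]
    return y


def spmv_blocks(used, blocks, x):
    """Same product, but each row reads its own block and stops at used[row]."""
    y = [0.0] * len(used)
    for row in range(len(y)):
        for col, val in blocks[row][:used[row]]:
            y[row] += val * x[col]
    return y


# The matrix [[0, 2, 5], [0, 0, 4], [1, 0, 0]] in both layouts.
offsets, cols, vals = [0, 2, 3, 4], [1, 2, 2, 0], [2.0, 5.0, 4.0, 1.0]
used = [2, 1, 1]
blocks = [[(1, 2.0), (2, 5.0)], [(2, 4.0)], [(0, 1.0), None]]
x = [1.0, 2.0, 3.0]
y_csr = spmv_csr(offsets, cols, vals, x)
y_blocks = spmv_blocks(used, blocks, x)
```

The inner arithmetic is identical; only the row lookup changes, which is why the comparison isolates the cost of the data structure.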

slide-39
SLIDE 39

Part 2: Algorithms for cuSTINGER

  • Goal: support static and dynamic graph algorithms
  • Already showed that the graph update process is efficient
  • All algorithms are implemented using the same set of operations
    – We show that these operators are efficient for static graph algorithms

slide-40
SLIDE 40

cuSTINGER Programming Model

  • "Keep things simple"
    – Limit the amount of GPU programming users need to do.
  • Uses vertex and edge frontiers
    – Similar to Gunrock & Ligra
    – Necessary for good utilization
    – Still requires good load balancing
    – Edge frontiers are created implicitly from vertex frontiers.
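A frontier-based traversal can be sketched with a level-synchronous BFS. This is a sequential illustration of the vertex-frontier idea only; the function name and the list-of-lists graph are assumptions, and none of the library's operators or load balancing appear here.

```python
def bfs_top_down(adj, source):
    """Level-synchronous top-down BFS; adj[v] is the neighbour list of v."""
    dist = [-1] * len(adj)
    dist[source] = 0
    frontier = [source]  # current vertex frontier
    level = 0
    while frontier:
        level += 1
        next_frontier = []
        # The edge frontier is implicit: all edges leaving frontier vertices.
        for v in frontier:
            for u in adj[v]:
                if dist[u] == -1:  # first visit
                    dist[u] = level
                    next_frontier.append(u)
        frontier = next_frontier
    return dist


adj = [[1, 2], [0, 3], [0, 3], [1, 2], []]
levels = bfs_top_down(adj, 0)
```

The user-visible state is just the vertex frontier; the edges to expand are derived from it at each level rather than being built explicitly.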

slide-41
SLIDE 41

Vertex and Edge Frontiers

  • Ligra
    – CPU HPC graph framework by Julian Shun
    – Two backends: Cilk and OpenMP
    – Edge frontiers created implicitly from vertex frontiers
      • Typically one phase
  • Gunrock
    – Highly tuned GPU graph library from Prof. John Owens
    – Supports multi-GPU analytics (same shared node)
    – Each operation consists of two phases
    – Edge frontiers created explicitly by the programmer.

slide-42
SLIDE 42

Case Study 1: Label Propagation Connected Components

  • Connected component algorithms such as Shiloach-Vishkin
  • Initially every vertex is in its own component
  • Vertices move to the component with the smallest ID
    – Vertices can swap components multiple times

[Figure: a 4-vertex example graph (vertices 1 to 4), drawn five times]

slide-43
SLIDE 43

One Iteration of Connected Components Algorithm

// Label propagation: traverse all edges, O(E)
For all v ∈ V
    For all u ∈ adj(v)
        if CC[u] < CC[v]
            CC[v] ← CC[u]

// Shortcutting: traverse all vertices, O(V)
For all v ∈ V
    CC[v] ← CC[CC[v]]
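A sequential sketch of the iteration, repeated until no label changes, is given below. It assumes the intended rule is that the smaller label wins, as the previous slide states; the function name and the list-of-lists graph are illustrative, and the library's parallel operators are not shown.

```python
def connected_components(adj):
    """Label propagation with shortcutting; adj[v] is the neighbour list of v."""
    cc = list(range(len(adj)))  # every vertex starts in its own component
    changed = True
    while changed:
        changed = False
        # Label propagation: traverse all edges, O(E).
        for v in range(len(adj)):
            for u in adj[v]:
                if cc[u] < cc[v]:
                    cc[v] = cc[u]
                    changed = True
        # Shortcutting: traverse all vertices, O(V).
        for v in range(len(adj)):
            cc[v] = cc[cc[v]]
    return cc


# Two components: {0, 1, 2} and {3, 4}.
labels = connected_components([[1], [0, 2], [1], [4], [3]])
```

Each vertex ends up labelled with the smallest vertex ID in its component.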

slide-44
SLIDE 44

Revisiting the algorithm in parallel*

  • Notice: no triple angle brackets <<<>>>

* We have more optimized versions in the library (they require about 4 more lines of code…)

slide-45
SLIDE 45

Case Study 2: Katz Centrality

  • Given pseudocode for a variant of Katz Centrality
  • It took 5-6 hours to port from the CPU to the GPU.
    – NVIDIA P100 GPU, initial speedup: 70X
    – The CPU version had some additional optimizations. Within two additional hours the speedup was 100X
  • This was feasible because of the pre-existing load-balanced primitives in the library
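For readers unfamiliar with the metric, the textbook iterative form of Katz Centrality is x ← α·A·x + β. The sketch below implements that standard formulation only; it is not the variant ported in the talk, and the function name, parameter defaults and fixed iteration count are assumptions.

```python
def katz_centrality(adj, alpha=0.1, beta=1.0, iterations=100):
    """Iterate x <- alpha * A * x + beta on an undirected graph.

    Converges when alpha is smaller than 1 / (largest eigenvalue of A).
    """
    n = len(adj)
    x = [0.0] * n
    for _ in range(iterations):
        # Each vertex sums the current scores of its neighbours.
        x = [alpha * sum(x[u] for u in adj[v]) + beta for v in range(n)]
    return x


# Path graph 0 - 1 - 2: the middle vertex should score highest.
scores = katz_centrality([[1], [0, 2], [1]])
```

Each iteration is one pass over all edges, the same traversal pattern as the other algorithms in this part of the talk.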

slide-46
SLIDE 46

cuSTINGER Algorithms

  • https://github.com/cuStinger/cuStingerAlg
  • Build instructions
    – git clone --recursive https://github.com/cuStinger/cuStingerAlg.git
    – cd cuStingerAlg
    – mkdir build && cd build
    – cmake ..
    – make -j8
  • By default, cuSTINGER will also be cloned
    – Though you will need to build both repos

slide-47
SLIDE 47

cuSTINGERv2 Algorithms

  • https://github.com/cuStinger/cuStingerAlg
  • Build instructions
    – git clone --recursive https://github.com/cuStinger/cuStingerAlg.git
    – cd cuStingerAlg/build
    – cmake ..
    – make -j

slide-48
SLIDE 48

Performance Analysis

  • Connected Components
  • Breadth First Search
  • Triangle Counting
    – Static
    – Dynamic

slide-49
SLIDE 49

Connected Components – NVIDIA P100

  • Using label propagation
  • cuSTINGER is within 25% of Gunrock in most cases.

Name            Gunrock (msec.)   cuSTINGER (msec.)   Speedup
coAuthorsDBLP   1.72              2.17                0.79X
as-skitter      3.68              17.4                0.21X
kron_21         86.84             66.4                1.33X
cit-patents     38.85             40.8                0.95X
cage15          46.1              56.1                0.82X
uk-2002         407               489                 0.83X

slide-50
SLIDE 50

BFS – Classic Top­Down – NVIDIA P100

  • Using an algorithm similar to the one in Gunrock
    – Gunrock has additional optimizations that can make it faster than cuSTINGER

Name            Gunrock (msec.)   cuSTINGER (msec.)   Speedup
coAuthorsDBLP   2.74              2.44                1.12X
as-skitter      7.74              10.6                0.73X
kron_21         45.4              25.7                1.76X
cit-patents     16.5              23.3                0.71X
cage15          29.1              43.2                0.67X
uk-2002         39.9              81.6                0.49X

slide-51
SLIDE 51

Triangle Counting: CSR Vs. cuSTINGER

  • Triangle counting algorithm taken from [Green et al; IA3; 2014]
  • Simply replace CSR accesses with cuSTINGER accesses
  • Executed on a K40

Name            |V|      |E|      Time-CSR (sec.)   Time-cuSTINGER (sec.)   Execution Difference
coAuthorsDBLP   299k     1.95M    0.218             0.242                   +10%
as-skitter      1.69M    11.1M    57.14             59.37                   +3.8%
kron_21         2M       201M     2992              2996                    +0.14%
cit-patents     3.77M    16.5M    0.814             0.830                   +2%
cage15          5.15M    94M      6.544             7.204                   +10%
uk-2002         18.52M   523M     424.9             431.4                   +1.6%
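For context, triangle counting is commonly done by intersecting the sorted adjacency lists of the two endpoints of each edge. The sketch below is a plain sequential version of that idea, not the cited algorithm or its GPU implementation; the function name and input format are assumptions.

```python
def count_triangles(adj):
    """Count triangles in an undirected graph without self-loops.

    adj[v] must be a sorted list of the neighbours of v.
    """
    total = 0
    for v in range(len(adj)):
        for u in adj[v]:
            if u <= v:
                continue  # handle each edge once, with v < u
            # Merge-style intersection of the two sorted lists.
            a, b = adj[v], adj[u]
            i = j = 0
            while i < len(a) and j < len(b):
                if a[i] < b[j]:
                    i += 1
                elif a[i] > b[j]:
                    j += 1
                else:
                    if a[i] > u:  # count only v < u < w, once per triangle
                        total += 1
                    i += 1
                    j += 1
    return total


k4 = [[1, 2, 3], [0, 2, 3], [0, 1, 3], [0, 1, 2]]
triangles_in_k4 = count_triangles(k4)
```

The only data-structure-dependent step is reading `adj[v]` and `adj[u]`, which is the access that gets swapped between CSR and cuSTINGER in the comparison above.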

slide-52
SLIDE 52

Summary


slide-53
SLIDE 53

Library Overview

Completed and on-going algorithms. Of course, many more algorithms to come…

Algorithm: status (the slide has Static and Dynamic columns); reference
  • Breadth first search
  • Triangle Counting: ✓ ✓; Static: [Green et al; IA3 2014]; Dynamic: new algorithm [Makkar; 2017 submitted]
  • Connected components: on-going; [McColl; HiPC 2013]
  • Betweenness Centrality: on-going; [Green; SocialCom 2012]
  • Page Rank: on-going; new algorithm (non linear algebra based)
  • Katz Centrality: new algorithm (non linear algebra based)

slide-54
SLIDE 54

Upcoming projects using cuSTINGER

  • Extend dynamic triangle counting to Jaccard Indices
  • Scalable pattern and motif detection on the GPU

slide-55
SLIDE 55

Take away

  • Dynamic data structure for sparse data sets
  • Supports high update rates
  • Scalable in both data size and in performance


slide-56
SLIDE 56

Collaborators

  • Prof. David Bader (Georgia Tech)
  • Prof. Jimeng Sun (Georgia Tech)
  • Dr. Jason Riedy (Georgia Tech)
  • Federico Busato, Visiting PhD student (Università di Verona)
  • James Fox, PhD student (Georgia Tech)
  • Euna Kim, PhD student (Georgia Tech)
  • Muhammad Osama Sakhi, BSc student (Georgia Tech)
  • Alok Tripathy, BSc student (Georgia Tech)
  • Manas George, BSc student (Georgia Tech)
  • Graduates (GT)
    – Devavret Makkar, MSc. (Tower Research)

slide-57
SLIDE 57

Thank you

  • Email: ogreen@gatech.edu
  • Data structure:
    – https://github.com/cuStinger/cuStinger
  • Algorithms:
    – https://github.com/cuStinger/cuStingerAlg
  • Version 2.0, coming soon to a GPU near you