MICRO 50

SLIDE 1

WIREFRAME: Supporting Data-dependent Parallelism through Dependency Graph Execution in GPUs

AmirAli Abdolrashidi†, Devashree Tripathy†, Mehmet E. Belviranli‡, Laxmi N. Bhuyan†, Daniel Wong†

†University of California Riverside

‡Oak Ridge National Laboratory

SLIDE 2

Introduction

SLIDE 4

Motivation

  • Despite their support for massive parallelism, GPUs lack support for data-dependent parallelism.

SLIDE 5

Example: Wavefront Pattern

[Figure: a grid of thread blocks processed in diagonal waves (levels 1, 2, 3, …), with a global barrier between successive waves; the pattern repeats until the application ends.]

SLIDE 8

Example

Global Barriers (Original)

for i = 1 to nWaves:

  • Kernel Launch
  • Synchronize

Enormous host-side kernel launch overhead! Thread blocks also wait on non-parent thread blocks.

SLIDE 14

Example

CDP (Nested)

RUN:

  • Parent Kernel Launch
  • Synchronize

Parent Kernel: for i = 1 to nWaves:

  • Child Kernel Launch
  • Synchronize

  • No more host-side kernel launches
  • Device-side kernel launches still have significant overhead
  • NO multi-parent dependency support
  • Still NO general dependency support!
SLIDE 24

Motivation

  • There is a need for generalized support for finer-grain inter-block data dependency, for more performance and efficiency.

[Figure: synchronization granularities, from intra-block (threads within a thread block) through global barriers to inter-block dependencies.]

SLIDE 25

Motivation

  • Current limitations:
  • High device-side kernel launch overhead
  • No general inter-block data dependency support
SLIDE 26

Wireframe Overview

Host (CPU) → Device (GPU)

Programming Model:

    #define parent1 dim3(blockIdx.x-1, blockIdx.y, blockIdx.z)
    #define parent2 dim3(blockIdx.x, blockIdx.y-1, blockIdx.z)

    void* DepLink() {
        if (blockIdx.x > 0) WF::AddDependency(parent1);
        if (blockIdx.y > 0) WF::AddDependency(parent2);
    }

    int main() {
        kernel<<<GridSize, BlockSize, DepLink>>>(0, args);
    }

    __WF__ void kernel(args) {
        processWave();
    }

Flow: Programming Model → Dependency Graph → Convert to CSR (Node Array, Edge Array) → Global Memory (Global Node Array, Global Edge Array, Pending Update Buffer) → DATS Hardware (Dependency Graph Buffer: Local Node Array, Local Edge Array, Node Insertion Buffer) → TB Scheduler

SLIDE 33

Programming Model

  • New functions are needed to support dependencies in CUDA:
  • Add dependency
  • Policy settings
  • We propose the DepLinks model:
  • Assigns a dependency graph generation function to a kernel
  • Easy to learn and use
SLIDE 34

Wireframe Pseudo-code

parent1 := (X-1, Y)
parent2 := (X, Y-1)

RUN:

  • Kernel Launch (DepLinks)

DepLinks @BLOCK (X,Y):

  • Add Dependency (parent1)
  • Add Dependency (parent2)

One kernel launch!

[Figure: a 2D grid of thread blocks numbered in wavefront order from (0,0); block (X,Y), e.g. block 6, depends on parent1 = (X-1,Y) and parent2 = (X,Y-1).]
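The DepLinks pseudo-code above can be simulated on the host. The sketch below is an illustrative plain-Python rendering (the function name is hypothetical, not Wireframe's API) of the same wavefront rule: block (X, Y) depends on (X-1, Y) and (X, Y-1).

```python
def build_wavefront_deps(width, height):
    """Mimic DepLinks: each block (x, y) registers its wavefront parents."""
    deps = {}  # block -> list of parent blocks
    for y in range(height):
        for x in range(width):
            parents = []
            if x > 0:
                parents.append((x - 1, y))  # parent1 := (X-1, Y)
            if y > 0:
                parents.append((x, y - 1))  # parent2 := (X, Y-1)
            deps[(x, y)] = parents
    return deps

deps = build_wavefront_deps(4, 4)
# Corner block (0, 0) has no parents; interior blocks have two.
```

Because the dependency function is evaluated per block, a single kernel launch carries the whole graph.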

SLIDE 37

Dependency Graph

  • Parent count and level of every node are determined at runtime
  • Sent to the GPU's global memory
SLIDE 38

Node Renaming

  • Nodes are renamed in level order (Level 0, Level 1, …, Level 5 in the example) to minimize the range of levels resident in the buffers at any time.
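Node renaming can be sketched as relabeling blocks in topological level order. A minimal Python illustration (the helper names are mine; level here means longest path from a source node):

```python
def compute_levels(deps):
    """Level of a node = 1 + max level of its parents (0 if no parents)."""
    levels = {}
    def level(n):
        if n not in levels:
            ps = deps[n]
            levels[n] = 0 if not ps else 1 + max(level(p) for p in ps)
        return levels[n]
    for n in deps:
        level(n)
    return levels

def rename_by_level(deps):
    """Assign new consecutive IDs so same-level nodes get adjacent IDs."""
    levels = compute_levels(deps)
    ordered = sorted(deps, key=lambda n: (levels[n], n))
    return {node: new_id for new_id, node in enumerate(ordered)}

# 2x2 wavefront: (0,0) is level 0; (1,0) and (0,1) level 1; (1,1) level 2
deps = {(0, 0): [], (1, 0): [(0, 0)], (0, 1): [(0, 0)], (1, 1): [(1, 0), (0, 1)]}
ids = rename_by_level(deps)
```

With consecutive IDs per level, a small contiguous window of the node array covers all blocks of the levels currently in flight.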

SLIDE 39

Dependency-Aware TB Scheduler (DATS)

  • Thread block scheduler
  • Issues the relevant thread blocks for execution based on the dependency graph
  • Dependency Graph Buffer (DGB)
  • Caches dependency data from global memory
  • Challenge: efficient caching and data utilization
SLIDE 40

Dependency-Aware TB Scheduler (DATS)

  • Data stored in compressed sparse row (CSR) format to reduce memory usage
  • Thread blocks → Node Array
  • Dependencies → Edge Array
  • Linear (O(|V| + |E|)) space complexity
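The CSR packing described above can be sketched in plain Python. This is an illustrative layout, not the paper's exact bit-level encoding: each node array entry stores the end offset of that node's parents in the edge array.

```python
def to_csr(parents_per_node):
    """Pack a dependency graph into CSR-style node/edge arrays.

    node_array[i] holds the end offset of node i's parents in edge_array,
    so node i's parents live in edge_array[node_array[i-1]:node_array[i]].
    """
    node_array, edge_array = [], []
    for parents in parents_per_node:
        edge_array.extend(parents)
        node_array.append(len(edge_array))
    return node_array, edge_array

# Nodes 0..3; node 3 depends on nodes 1 and 2, which both depend on node 0.
node_array, edge_array = to_csr([[], [0], [0], [1, 2]])
```

The two flat arrays need one entry per thread block plus one per dependency, which is where the linear space bound comes from.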
SLIDE 41

DATS Overview

GLOBAL MEMORY:

  • Global Node Array (per-node fields: Edge Start, Global Node ID, Parent Counter, Level; 32 + 16 + 16 + 16 bits, plus a Base Pointer)
  • Global Edge Array
  • Pending Update Buffer

DEPENDENCY GRAPH BUFFER (DGB):

  • Local Node Array (a circular buffer with head/tail pointers)
  • Local Edge Array
  • Node Insertion Buffer

Translation of a Global Edge Start (GES) to a Local Edge Start (LES):

LES_i = GES_i mod |Local Edge Array|

[Figure: example global node/edge arrays and the window of nodes and edges currently cached into the local arrays.]
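The global-to-local edge-start translation above is a simple modulo mapping into the circular local array. A one-line Python sketch (the 256-entry size is the one chosen later in the evaluation):

```python
LOCAL_EDGE_ARRAY_SIZE = 256  # size chosen in the evaluation

def local_edge_start(global_edge_start, size=LOCAL_EDGE_ARRAY_SIZE):
    """LES_i = GES_i mod |Local Edge Array| (circular placement)."""
    return global_edge_start % size

# Global edge starts simply wrap around the 256-entry local array.
```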

SLIDE 45

Node State Table

States: Wait (W), Ready (R), Processing (P), Done (D)

[Figure: the Local Node Array augmented with per-node State, Parent Count, Level, and Global Node ID fields, bounded by head (H) and tail (T) pointers, alongside the Local Edge Array holding each node's parent edges.]

SLIDE 47

Example: Child Node Execution

[Figure: animation over the Node State Table (states: Wait, Ready, Processing, Done). A node that finishes execution is marked Done; the parent counters of its resident children are decremented, and a child whose counter reaches zero transitions from Wait to Ready.]
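The child-node-execution walkthrough can be modeled as a small state machine over the node state table. The Python sketch below follows the states and transitions shown on the slides, with the table fields simplified (class and method names are mine):

```python
WAIT, READY, PROCESSING, DONE = "W", "R", "P", "D"

class NodeStateTable:
    def __init__(self, parent_counts):
        self.count = dict(parent_counts)
        self.state = {n: (READY if c == 0 else WAIT)
                      for n, c in parent_counts.items()}

    def finish(self, node, children):
        """Mark a node Done and decrement its children's parent counters."""
        self.state[node] = DONE
        for c in children:
            self.count[c] -= 1
            if self.count[c] == 0:
                self.state[c] = READY  # all parents done: schedulable

# Diamond graph: node 0 -> {1, 2} -> 3
t = NodeStateTable({0: 0, 1: 1, 2: 1, 3: 2})
t.finish(0, [1, 2])   # 1 and 2 become Ready
t.finish(1, [3])      # 3 still Waits on node 2
t.finish(2, [3])      # 3 becomes Ready
```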

SLIDE 52

Example: Update Buffer Store

[Figure: a completing node has children (#4, #5) that are not resident in the Local Node Array; their counter updates are stored in the Pending Update Buffer instead of being applied directly.]

SLIDE 56

Example: Invalidation…

[Figure: once the resident nodes of the lowest level are all Done, their entries are invalidated and evicted from the Local Node Array, leaving enough space to load the next nodes into the DGB.]

SLIDE 62

Example: …Reloading Data

[Figure: the freed entries of the Local Node Array and Local Edge Array are refilled with the next nodes and edges from the global arrays ("Load complete!").]

SLIDE 65

Example: Update Buffer Load

[Figure: as newly loaded nodes (#4, #5) enter the DGB, their matching entries are drained from the Pending Update Buffer and applied, decrementing parent counts so that satisfied nodes become Ready.]

SLIDE 68

Challenges

  • Minimizing global memory usage
  • Used CSR format
  • Minimizing the buffer size
  • Limit the level range
  • Local Node/Edge Array sizes
SLIDE 69

Level Range

  • Unbalanced execution may result when using the baseline TB scheduling policy (LRR).

Sample benchmark (HEAT2D) w/ LRR scheduler

SLIDE 70

Level Range

  • An unbounded level range means:
  • A larger DGB is required
  • TB execution is limited

Key challenge: efficient scheduling

SLIDE 71

Level-bound Scheduling (LVL)

  • Prioritizes lower-level thread blocks in the graph
  • More ready nodes → more parallelism
  • Minimizes buffering operations
  • Limits the level range to avoid serialization
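Level-bound scheduling can be sketched as a ready queue that only issues blocks whose level lies within a bounded window above the lowest unfinished level. An illustrative Python sketch, assuming the LVL = 3 bound used later in the evaluation (the function name is mine):

```python
LVL = 3  # level-range bound used in the evaluation

def schedulable(ready_nodes, levels, done):
    """Return ready nodes whose level stays within LVL of the lowest
    unfinished level, keeping the DGB's resident level range bounded."""
    unfinished = [levels[n] for n in levels if n not in done]
    if not unfinished:
        return []
    floor = min(unfinished)
    return [n for n in ready_nodes if levels[n] - floor < LVL]

levels = {"a": 0, "b": 1, "c": 3, "d": 4}
# While "a" (level 0) is unfinished, only levels 0..2 may issue.
picks = schedulable(["a", "c", "d"], levels, done=set())
```

Once the low levels drain, the window slides forward and higher-level blocks become eligible, which is what keeps the buffers small.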
SLIDE 72

Local Node Array Size

  • Empirical estimation used: reduce the size until performance suffers
  • Size chosen: 128 entries
  • LVL saves 64% of the Pending Update Buffer (PUB) size

[Figure: IPC and maximum Pending Update Buffer size vs. Local Node Array size (64–640 entries) for LRR and LVL.]

SLIDE 78

Local Edge Array Size

  • Empirical estimation used
  • LVL requires 75% less storage
  • Size chosen: 256 entries
SLIDE 82

Evaluation

  • Evaluation platform: GPGPU-Sim v3.2.2 (GTX480)
  • Six data dependency-heavy benchmarks
  • Cases:
  • Global, CDP
  • DepLinks primitives with LRR and LVL schedulers
  • LVL = 3
SLIDE 83

Performance Breakdown

[Figure: normalized speedup (0.9–1.4) at 4K graph size for DTW, HEAT2D, HIST, INT_IMG, SOR, SW, and the average, comparing Global, CDP, DepLinks, LRR, and LVL.]

  • Global: not accounting for data transfer time
  • CDP: theoretical value based on [1]
  • DepLinks: barriers enforced by DepLinks instead; kernel launch overhead removed
  • LRR: barriers removed, so nodes can now run ahead
  • LVL: level-bound TB scheduling

[1] Jin Wang, Norm Rubin, Albert Sidelnik, and Sudhakar Yalamanchili. 2016. Dynamic thread block launch: A lightweight execution mechanism to support irregular applications on GPUs.

SLIDE 89

Overall Speedup (LVL)

  • Speedup across different graph sizes (1K, 4K, 9K) for DTW, HEAT2D, HIST, INT_IMG, SOR, SW, and the geometric mean
  • +45% on average, up to +65%

SLIDE 91

Evaluation Summary

  • 2KB area overhead
  • No significant impact on L2 miss rate
  • Low global memory request overhead: 0.13% on average
SLIDE 92

Conclusion

  • Presented Wireframe, hardware support for data dependencies on GPUs
  • Supports generalized inter-block dependencies through hardware
  • Minimizes buffering through level-bound TB scheduling
  • 45% average speedup over the baseline
SLIDE 93

Thank you!

Questions?

SLIDE 94

Computations vs. Launch Overhead

  • With a constant data size, kernel launches increase with graph size
  • The computation-to-launch ratio is still sizable at 9K nodes
  • … times on average

[Figure: Comp/Launch Ratio (up to ~14) for DTW, HIST, SOR, and the geometric mean at 1K, 4K, and 9K graph sizes.]

SLIDE 95

Performance

  • Impact on L2 miss rate: ~0.5%
SLIDE 96

Performance (IPC)

SLIDE 97

Thank you!

Questions?