MICRO 50

SLIDE 1

WIREFRAME: Supporting Data-dependent Parallelism through Dependency Graph Execution in GPUs

AmirAli Abdolrashidi†, Devashree Tripathy†, Mehmet E. Belviranli‡, Laxmi N. Bhuyan†, Daniel Wong†

†University of California Riverside

‡Oak Ridge National Laboratory

SLIDE 2

Introduction

SLIDE 4

Motivation

  • Despite their support for massive parallelism, GPUs lack support for data-dependent parallelism.

SLIDE 5

Example: Wavefront Pattern

[Figure: a grid of thread blocks processed in diagonal waves (levels 1, 2, 3, …), with a global barrier between successive waves; the pattern repeats until the application ends.]

SLIDE 8

Example

Global Barriers (Original)

for i = 1 to nWaves:

  • Kernel Launch
  • Synchronize

Enormous host-side kernel launch overhead! Thread blocks also wait on non-parent thread blocks.

SLIDE 14

Example

CDP (Nested)

RUN:

  • Parent Kernel Launch
  • Synchronize

Parent Kernel: for i = 1 to nWaves:

  • Child Kernel Launch
  • Synchronize

  • No more host-side kernel launches
  • Device-side kernel launches still have significant overhead
  • NO multi-parent dependency support
  • Still NO general dependency support!
SLIDE 24

Motivation

  • There is a need for generalized support for finer-grain inter-block data dependency, for more performance and efficiency.

[Figure: synchronization granularities, from intra-block (threads within a thread block) through global barriers to inter-block dependencies.]

SLIDE 25

Motivation

  • Current limitations:
  • High device-side kernel launch overhead
  • No general inter-block data dependency support
SLIDE 26

Wireframe Overview

Host (CPU) → Device (GPU)

Programming Model:

    #define parent1 dim3(blockIdx.x-1, blockIdx.y, blockIdx.z)
    #define parent2 dim3(blockIdx.x, blockIdx.y-1, blockIdx.z)

    void* DepLink() {
        if (blockIdx.x > 0) WF::AddDependency(parent1);
        if (blockIdx.y > 0) WF::AddDependency(parent2);
    }

    int main() {
        kernel<<<GridSize, BlockSize, DepLink>>>(0, args);
    }

    __WF__ void kernel(args) {
        processWave();
    }

Flow: Programming Model → Dependency Graph → Convert to CSR (Node Array, Edge Array) → Global Memory (Global Node Array, Global Edge Array, Pending Update Buffer) → DATS Hardware (Dependency Graph Buffer: Local Node Array, Local Edge Array, Node Insertion Buffer) → TB Scheduler

SLIDE 33

Programming Model

  • New functions are needed to support dependencies in CUDA:
  • Add dependency
  • Policy settings
  • We propose the DepLinks model:
  • Assigns a dependency graph generation function to a kernel
  • Easy to learn and use
SLIDE 34

Wireframe Pseudo-code

parent1 := (X-1, Y)
parent2 := (X, Y-1)

RUN:

  • Kernel Launch (DepLinks)

DepLinks @BLOCK (X,Y):

  • Add Dependency (parent1)
  • Add Dependency (parent2)

One kernel launch!

[Figure: a 2D grid of thread blocks numbered in wavefront order from (0,0); block (X,Y), e.g. block 6, depends on parent1 = (X-1,Y) and parent2 = (X,Y-1).]
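The DepLinks pseudo-code above can be simulated on the host. The sketch below is an illustrative plain-Python rendering (the function name is hypothetical, not Wireframe's API) of the same wavefront rule: block (X, Y) depends on (X-1, Y) and (X, Y-1).

```python
def build_wavefront_deps(width, height):
    """Mimic DepLinks: each block (x, y) registers its wavefront parents."""
    deps = {}  # block -> list of parent blocks
    for y in range(height):
        for x in range(width):
            parents = []
            if x > 0:
                parents.append((x - 1, y))  # parent1 := (X-1, Y)
            if y > 0:
                parents.append((x, y - 1))  # parent2 := (X, Y-1)
            deps[(x, y)] = parents
    return deps

deps = build_wavefront_deps(4, 4)
# Corner block (0, 0) has no parents; interior blocks have two.
```

Because the dependency function is evaluated per block, a single kernel launch carries the whole graph.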

SLIDE 37

Dependency Graph

  • Parent count and level of every node are determined at runtime
  • Sent to the GPU's global memory
SLIDE 38

Node Renaming

  • Nodes are renamed in level order (Level 0, Level 1, …, Level 5 in the example) to minimize the range of levels resident in the buffers at any time.
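Node renaming can be sketched as relabeling blocks in topological level order. A minimal Python illustration (the helper names are mine; level here means longest path from a source node):

```python
def compute_levels(deps):
    """Level of a node = 1 + max level of its parents (0 if no parents)."""
    levels = {}
    def level(n):
        if n not in levels:
            ps = deps[n]
            levels[n] = 0 if not ps else 1 + max(level(p) for p in ps)
        return levels[n]
    for n in deps:
        level(n)
    return levels

def rename_by_level(deps):
    """Assign new consecutive IDs so same-level nodes get adjacent IDs."""
    levels = compute_levels(deps)
    ordered = sorted(deps, key=lambda n: (levels[n], n))
    return {node: new_id for new_id, node in enumerate(ordered)}

# 2x2 wavefront: (0,0) is level 0; (1,0) and (0,1) level 1; (1,1) level 2
deps = {(0, 0): [], (1, 0): [(0, 0)], (0, 1): [(0, 0)], (1, 1): [(1, 0), (0, 1)]}
ids = rename_by_level(deps)
```

With consecutive IDs per level, a small contiguous window of the node array covers all blocks of the levels currently in flight.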

SLIDE 39

Dependency-Aware TB Scheduler (DATS)

  • Thread block scheduler
  • Issues the relevant thread blocks for execution based on the dependency graph
  • Dependency Graph Buffer (DGB)
  • Caches dependency data from global memory
  • Challenge: efficient caching and data utilization
SLIDE 40

Dependency-Aware TB Scheduler (DATS)

  • Data stored in compressed sparse row (CSR) format to reduce memory usage
  • Thread blocks → Node Array
  • Dependencies → Edge Array
  • Linear (O(|V| + |E|)) space complexity
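The CSR packing described above can be sketched in plain Python. This is an illustrative layout, not the paper's exact bit-level encoding: each node array entry stores the end offset of that node's parents in the edge array.

```python
def to_csr(parents_per_node):
    """Pack a dependency graph into CSR-style node/edge arrays.

    node_array[i] holds the end offset of node i's parents in edge_array,
    so node i's parents live in edge_array[node_array[i-1]:node_array[i]].
    """
    node_array, edge_array = [], []
    for parents in parents_per_node:
        edge_array.extend(parents)
        node_array.append(len(edge_array))
    return node_array, edge_array

# Nodes 0..3; node 3 depends on nodes 1 and 2, which both depend on node 0.
node_array, edge_array = to_csr([[], [0], [0], [1, 2]])
```

The two flat arrays need one entry per thread block plus one per dependency, which is where the linear space bound comes from.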
SLIDE 41

DATS Overview

GLOBAL MEMORY:

  • Global Node Array (per-node fields: Edge Start, Global Node ID, Parent Counter, Level; 32 + 16 + 16 + 16 bits, plus a Base Pointer)
  • Global Edge Array
  • Pending Update Buffer

DEPENDENCY GRAPH BUFFER (DGB):

  • Local Node Array (a circular buffer with head/tail pointers)
  • Local Edge Array
  • Node Insertion Buffer

Translation of a Global Edge Start (GES) to a Local Edge Start (LES):

LES_i = GES_i mod |Local Edge Array|

[Figure: example global node/edge arrays and the window of nodes and edges currently cached into the local arrays.]
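The global-to-local edge-start translation above is a simple modulo mapping into the circular local array. A one-line Python sketch (the 256-entry size is the one chosen later in the evaluation):

```python
LOCAL_EDGE_ARRAY_SIZE = 256  # size chosen in the evaluation

def local_edge_start(global_edge_start, size=LOCAL_EDGE_ARRAY_SIZE):
    """LES_i = GES_i mod |Local Edge Array| (circular placement)."""
    return global_edge_start % size

# Global edge starts simply wrap around the 256-entry local array.
```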

SLIDE 45

Node State Table

States: Wait (W), Ready (R), Processing (P), Done (D)

[Figure: the Local Node Array augmented with per-node State, Parent Count, Level, and Global Node ID fields, bounded by head (H) and tail (T) pointers, alongside the Local Edge Array holding each node's parent edges.]

SLIDE 47

Example: Child Node Execution

[Figure: animation over the Node State Table (states: Wait, Ready, Processing, Done). A node that finishes execution is marked Done; the parent counters of its resident children are decremented, and a child whose counter reaches zero transitions from Wait to Ready.]
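The child-node-execution walkthrough can be modeled as a small state machine over the node state table. The Python sketch below follows the states and transitions shown on the slides, with the table fields simplified (class and method names are mine):

```python
WAIT, READY, PROCESSING, DONE = "W", "R", "P", "D"

class NodeStateTable:
    def __init__(self, parent_counts):
        self.count = dict(parent_counts)
        self.state = {n: (READY if c == 0 else WAIT)
                      for n, c in parent_counts.items()}

    def finish(self, node, children):
        """Mark a node Done and decrement its children's parent counters."""
        self.state[node] = DONE
        for c in children:
            self.count[c] -= 1
            if self.count[c] == 0:
                self.state[c] = READY  # all parents done: schedulable

# Diamond graph: node 0 -> {1, 2} -> 3
t = NodeStateTable({0: 0, 1: 1, 2: 1, 3: 2})
t.finish(0, [1, 2])   # 1 and 2 become Ready
t.finish(1, [3])      # 3 still Waits on node 2
t.finish(2, [3])      # 3 becomes Ready
```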

SLIDE 52

Example: Update Buffer Store

[Figure: a completing node has children (#4, #5) that are not resident in the Local Node Array; their counter updates are stored in the Pending Update Buffer instead of being applied directly.]

SLIDE 56

Example: Invalidation…

[Figure: once the resident nodes of the lowest level are all Done, their entries are invalidated and evicted from the Local Node Array, leaving enough space to load the next nodes into the DGB.]

SLIDE 62

Example: …Reloading Data

[Figure: the freed entries of the Local Node Array and Local Edge Array are refilled with the next nodes and edges from the global arrays ("Load complete!").]

SLIDE 65

Example: Update Buffer Load

[Figure: as newly loaded nodes (#4, #5) enter the DGB, their matching entries are drained from the Pending Update Buffer and applied, decrementing parent counts so that satisfied nodes become Ready.]

SLIDE 68

Challenges

  • Minimizing global memory usage
  • Used CSR format
  • Minimizing the buffer size
  • Limit the level range
  • Local Node/Edge Array sizes
SLIDE 69

Level Range

  • Unbalanced execution may result when using the baseline TB scheduling policy (LRR).

Sample benchmark (HEAT2D) w/ LRR scheduler

SLIDE 70

Level Range

  • An unbounded level range means:
  • A larger DGB is required
  • TB execution is limited

Key challenge: efficient scheduling

SLIDE 71

Level-bound Scheduling (LVL)

  • Prioritizes lower-level thread blocks in the graph
  • More ready nodes → more parallelism
  • Minimizes buffering operations
  • Limits the level range to avoid serialization
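Level-bound scheduling can be sketched as a ready queue that only issues blocks whose level lies within a bounded window above the lowest unfinished level. An illustrative Python sketch, assuming the LVL = 3 bound used later in the evaluation (the function name is mine):

```python
LVL = 3  # level-range bound used in the evaluation

def schedulable(ready_nodes, levels, done):
    """Return ready nodes whose level stays within LVL of the lowest
    unfinished level, keeping the DGB's resident level range bounded."""
    unfinished = [levels[n] for n in levels if n not in done]
    if not unfinished:
        return []
    floor = min(unfinished)
    return [n for n in ready_nodes if levels[n] - floor < LVL]

levels = {"a": 0, "b": 1, "c": 3, "d": 4}
# While "a" (level 0) is unfinished, only levels 0..2 may issue.
picks = schedulable(["a", "c", "d"], levels, done=set())
```

Once the low levels drain, the window slides forward and higher-level blocks become eligible, which is what keeps the buffers small.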
SLIDE 72

Local Node Array Size

  • Empirical estimation used: reduce the size until performance suffers
  • Size chosen: 128 entries
  • LVL saves 64% of the Pending Update Buffer (PUB) size

[Figure: IPC and maximum Pending Update Buffer size vs. Local Node Array size (64–640 entries) for LRR and LVL.]

SLIDE 78

Local Edge Array Size

  • Empirical estimation used
  • LVL requires 75% less storage
  • Size chosen: 256 entries
SLIDE 82

Evaluation

  • Evaluation platform: GPGPU-Sim v3.2.2 (GTX480)
  • Six data dependency-heavy benchmarks
  • Cases:
  • Global, CDP
  • DepLinks primitives with LRR and LVL schedulers
  • LVL = 3
SLIDE 83

Performance Breakdown

[Figure: normalized speedup (0.9–1.4) at 4K graph size for DTW, HEAT2D, HIST, INT_IMG, SOR, SW, and the average, comparing Global, CDP, DepLinks, LRR, and LVL.]

  • Global: not accounting for data transfer time
  • CDP: theoretical value based on [1]
  • DepLinks: barriers enforced by DepLinks instead; kernel launch overhead removed
  • LRR: barriers removed, so nodes can now run ahead
  • LVL: level-bound TB scheduling

[1] Jin Wang, Norm Rubin, Albert Sidelnik, and Sudhakar Yalamanchili. 2016. Dynamic thread block launch: A lightweight execution mechanism to support irregular applications on GPUs.

SLIDE 89

Overall Speedup (LVL)

  • Speedup across different graph sizes (1K, 4K, 9K) for DTW, HEAT2D, HIST, INT_IMG, SOR, SW, and the geometric mean
  • +45% on average, up to +65%

SLIDE 91

Evaluation Summary

  • 2KB area overhead
  • No significant impact on L2 miss rate
  • Low global memory request overhead: 0.13% on average
SLIDE 92

Conclusion

  • Presented Wireframe, hardware support for data dependencies on GPUs
  • Supports generalized inter-block dependencies through hardware
  • Minimizes buffering through level-bound TB scheduling
  • 45% average speedup over the baseline
SLIDE 93

Thank you!

Questions?

SLIDE 94

Computations vs. Launch Overhead

  • With a constant data size, kernel launches increase with graph size
  • The computation-to-launch ratio is still sizable at 9K nodes
  • … times on average

[Figure: Comp/Launch Ratio (up to ~14) for DTW, HIST, SOR, and the geometric mean at 1K, 4K, and 9K graph sizes.]

SLIDE 95

Performance

  • Impact on L2 miss rate: ~0.5%
SLIDE 96

Performance (IPC)

SLIDE 97

Thank you!

Questions?