SLIDE 1

Efficiently Enforcing Strong Memory Ordering in GPUs

Abhayendra Singh*, Shaizeen Aga, Satish Narayanasamy Google, University of Michigan, Ann Arbor Dec 9, 2015

University of Michigan Electrical Engineering and Computer Science

*author performed the work at the University of Michigan, Ann Arbor

SLIDE 2

Increasing communication between threads in GPGPU applications

More irregular applications run on GPUs: data-dependent, with higher communication.

Example: TreeBuilding kernel in barneshut (Burtscher et al., IISWC'12)

SLIDE 4

Heterogeneous systems will have more fine-grained communication

Fine-grain communication between CPU and GPU:
– Unified virtual memory
– Cache coherence [Power et al., MICRO'13]
– OpenCL supports fine-grain sharing
– More irregularity in applications

[Diagram: CPU, GPU, and other accelerators sharing memory]

SLIDE 5

Memory Consistency Model

Defines rules that a programmer can use to reason about a parallel execution

SLIDE 7

Memory Consistency Model

Defines rules that a programmer can use to reason about a parallel execution.

Sequential Consistency (SC): "program-order" + "atomic memory"

Initially: ptr = NULL; done = false

  Producer              Consumer
  a: ptr = alloc()      c: if (done)
  b: done = true        d:   r1 = ptr->x
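Under SC, the only legal executions are interleavings that respect each thread's program order. A short check (our illustration, not part of the talk) can enumerate all such interleavings of the four operations above and confirm that the consumer never dereferences NULL:

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Enumerate every SC interleaving of a,b (producer) and c,d (consumer) that
// preserves program order, and check that d never reads through a NULL ptr.
bool sc_is_safe() {
    std::string ops = "abcd";  // already sorted: next_permutation covers all 24
    do {
        // keep only interleavings respecting program order: a<b and c<d
        if (ops.find('a') > ops.find('b')) continue;
        if (ops.find('c') > ops.find('d')) continue;
        bool ptr_set = false, done = false, took_branch = false;
        for (char op : ops) {
            if (op == 'a') ptr_set = true;          // ptr = alloc()
            if (op == 'b') done = true;             // done = true
            if (op == 'c') took_branch = done;      // if (done)
            if (op == 'd' && took_branch && !ptr_set)
                return false;                       // would be a NULL dereference
        }
    } while (std::next_permutation(ops.begin(), ops.end()));
    return true;
}
```

The check succeeds because program order forces a before b, so any interleaving in which c observes done == true must already have executed a.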

SLIDE 10

Data-race-free-0 (DRF-0) Memory Model

C++, Java; OpenCL, CUDA
Heterogeneous-race-free (HRF) (Hower et al., ASPLOS'14)

SC if data-race-free:
– Programmers annotate synchronization variables
– Compiler and runtime guarantee a total order on synchronization operations

Initially: ptr = NULL; atomic done = false

  Producer              Consumer
  a: ptr = alloc()      c: if (done)
  b: done = true        d:   r1 = ptr->x
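In C++'s DRF-0-style model, the annotation on the slide corresponds to declaring done as std::atomic. A minimal single-file sketch (the Node type and the value 42 are our illustration; the single if is replaced by a spin loop so the example terminates deterministically):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

struct Node { int x; };

Node*             ptr = nullptr;   // ordinary (data) variable
std::atomic<bool> done{false};     // annotated synchronization variable

void producer() {
    ptr = new Node{42};            // a: data store
    done.store(true);              // b: synchronization store (seq_cst by default)
}

int consumer() {
    while (!done.load()) {}        // c: spin until the synchronization load sees true
    return ptr->x;                 // d: guaranteed to observe the producer's store
}
```

Because done is atomic with the default seq_cst ordering, neither the compiler nor the hardware may reorder a past b or d before c, so the NULL dereference from the racy version cannot occur.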

SLIDE 11

Data-race-free-0 (DRF-0) Memory Model

Without the annotation, done is an ordinary variable and the program has a data-race:

Initially: ptr = NULL; done = false

  Producer              Consumer
  a: ptr = alloc()      c: if (done)
  b: done = true        d:   r1 = ptr->x

Reordering could lead to ptr being NULL.

SLIDE 12

Data-race-free-0 (DRF-0) Memory Model

Undefined semantics for programs with a data-race.

SLIDE 13

Documented data-races in GPGPU programs

Bug: a data-race in code for dynamic load balancing [Tyler Sorensen, MS thesis, 2014]; image source: [Alglave et al., ASPLOS 2015]

Other data-races:
– N-body simulation [Betts et al., OOPSLA 2012]
– RadixSort [Li et al., PPoPP 2012]
– Efficient Synchronization Primitives for GPUs [Tyler Sorensen, MS thesis, 2014]

SLIDE 14

Is there a motivation for DRF-0 over SC?

Is the performance of DRF-0 better than SC? Very little benefit for CPUs [IEEE Computer'98, PACT'02, ISCA'12].

Is there a performance justification for DRF-0 (or TSO) over SC in GPUs?

SLIDE 15

Goals

– Identify sources of SC violation in GPUs
– Understand the overhead of various memory ordering constraints in GPUs: DRF-0, TSO, SC
– Bridge the gap between SC and DRF-0: access-type aware GPU architecture

SLIDE 23

How can a GPU violate SC?

Instructions are executed in-order, but can complete out-of-order:
– Caching at L1
– Reordering in the interconnect
– Partitioned address space
⟹ Can violate SC

Initially: ptr = NULL; done = false

  Producer              Consumer
  1: ptr = alloc()      3: if (done)
  2: done = true        4:   r1 = ptr->x

If store 2 hits in L1 while store 1 misses, store 2 completes first; the consumer can then observe done == true while ptr is still NULL: an SC violation.
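The failure above can be mimicked with a toy, fully sequential model (purely illustrative; no real GPU timing or cache state), in which the cache-hit store becomes visible before the cache-miss store:

```cpp
#include <cassert>

// Toy model: the producer's store to 'done' (L1 hit) becomes globally visible
// before its store to 'ptr' (L1 miss), and the consumer runs in between.
struct Memory { int* ptr = nullptr; bool done = false; };

// Returns the value the consumer reads, or -1 to model the NULL dereference.
int simulate(int& value) {
    Memory mem;
    // Producer issues store 1 (ptr) then store 2 (done); store 2 completes first.
    mem.done = true;                               // store 2: hit, completes early
    // Consumer executes now, between the two completions:
    int* observed = mem.done ? mem.ptr : nullptr;  // sees done==true but ptr==NULL
    mem.ptr = &value;                              // store 1: miss, completes too late
    return observed ? *observed : -1;
}
```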

SLIDE 24

Roadmap

– Identify sources of SC violation
– Understand the overhead of various memory ordering constraints in GPUs: DRF-0, TSO, SC
– Bridge the gap between SC and DRF-0: access-type aware GPU architecture

SLIDE 25

Fences for various memory models

DRF-0: fences only for synchronization operations
SC: any shared or global access behaves like a fence
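In C++ terms (an analogy on our part, not the GPU ISA), DRF-0 needs ordering only at synchronization points, which can be expressed with a single explicit fence; SC conceptually attaches the same ordering to every shared or global access:

```cpp
#include <atomic>
#include <cassert>

int data = 0;                       // ordinary shared data
std::atomic<int> flag{0};           // synchronization variable

void drf0_publish() {
    data = 42;                                            // may be reordered freely...
    std::atomic_thread_fence(std::memory_order_release);  // ...until this one fence
    flag.store(1, std::memory_order_relaxed);             // synchronization store
}

// Under SC, conceptually every shared/global access carries such a fence,
// which is why naively enforcing SC must wait on each access's completion.
```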

SLIDE 27

Naïvely Enforcing Fence Constraints

Delay a warp until all non-local memory accesses preceding a fence are complete.

GPU extension: two counters per warp track its pending global loads and stores. No need to track pending shared memory accesses.

  warp id   pending loads   pending stores
  w0        1               …
  …         …               …

[Legend: fence, shared memory access, global memory access]
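The two-counter scheme can be sketched as a small software model of the hardware idea (the names below are ours, not the paper's):

```cpp
#include <cassert>

// Per-warp fence bookkeeping: two counters count pending global loads and
// stores; a fence may retire only once both counters have drained to zero.
struct WarpFenceState {
    int pending_loads  = 0;
    int pending_stores = 0;

    void issue_global_load()     { ++pending_loads; }
    void issue_global_store()    { ++pending_stores; }
    void complete_global_load()  { --pending_loads; }
    void complete_global_store() { --pending_stores; }

    // Shared-memory accesses are not tracked: they complete in order locally.
    bool fence_can_proceed() const {
        return pending_loads == 0 && pending_stores == 0;
    }
};
```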

SLIDE 28

Experimental Methodology

Simulator: GPGPU-sim v3.2.1
– extended with Ruby memory hierarchy
– 16 SMs, crossbar interconnect

L1 cache coherence protocol
– MESI for write-back
– Valid/Invalid for write-through

Benchmarks
– applications from the Rodinia and Polybench benchmark suites
– applications used in GPU coherence work [Singh et al., HPCA'13]

SLIDE 29

18 out of 22 applications incur insignificant SC overhead

[Chart: average execution time under SC, normalized to DRF-0, across the applications]

SLIDE 33

Warp-level-parallelism (WLP) masks SC overhead

[Timeline: warps w0–w3 overlapping cache misses and cache hits]

SC can exploit inter-warp MLP. Adequate WLP => low SC overhead.

SLIDE 34

Warp-level-parallelism (WLP) masks SC overhead

[Chart: execution time normalized to DRF-0 (benchmark: gaussian), DRF-0 vs. SC, with 8 thread blocks/SM and 1 thread block/SM; labeled values: 1.97, 2.2]

SC can exploit inter-warp MLP. Adequate WLP => low SC overhead.

SLIDE 38

Higher SC overhead in apps where intra-warp MLP is important

Need for intra-warp MLP:
– App has fewer warps
– Want fewer warps to avoid cache thrashing

In-order execution limits the ability to exploit intra-warp MLP in DRF-0.

[Timeline: Warp-1 and Warp-2; cache misses vs. cache hits]

Unlike DRF-0, SC cannot exploit intra-warp MLP.

SLIDE 39

4 out of 22 applications exhibit significant SC overhead

[Chart: execution time normalized to DRF-0 for 3mm, fdtd-2d, gemm, gramschm under DRF-0 and SC]

Reason: unlike DRF-0, SC cannot exploit intra-warp MLP.

SLIDE 40

TSO is not suitable for GPUs

[Charts: execution time normalized to baseline (DRF-0) under DRF-0, TSO, and SC; one chart for applications with significant performance overhead (3mm, fdtd-2d, gemm, gramschm), one for applications with small performance overhead]

TSO does not offer much performance or programmability advantage over SC.
SLIDE 41

Roadmap

– How GPU optimizations can violate memory ordering constraints
– Understand the overhead of various memory ordering constraints in GPUs: DRF-0, TSO, SC
– Bridge the gap between SC and DRF-0: access-type aware GPU architecture

SLIDE 42

Access-type Aware Optimization for GPU

Relax ordering constraints for safe accesses.

Accesses to thread-private or read-only locations are safe (Shasha & Snir, TOPLAS'88; Singh et al., ISCA'12).

Thread-level classification is prohibitively expensive, so classify accesses as unsafe or SM-safe.
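One way to picture the classification criterion (read-only or SM-private ⟹ SM-safe) is this hypothetical software model; the paper's actual hardware mechanism (Type-Cache, MACs) is more involved, and the names here are ours:

```cpp
#include <cassert>

enum class AccessType { SMSafe, Unsafe };

// Per-location metadata a classifier might track.
struct LocationInfo {
    bool written = false;            // has any SM written this location?
    int  owner_sm = -1;              // first SM to touch it (-1: none yet)
    bool shared_across_sms = false;  // touched by more than one SM?
};

// A location that is read-only, or private to a single SM, stays SM-safe;
// once it is both written and shared across SMs, its accesses are unsafe.
AccessType classify(LocationInfo& info, int sm_id, bool is_write) {
    if (info.owner_sm == -1) info.owner_sm = sm_id;
    if (info.owner_sm != sm_id) info.shared_across_sms = true;
    if (is_write) info.written = true;
    return (!info.written || !info.shared_across_sms) ? AccessType::SMSafe
                                                      : AccessType::Unsafe;
}
```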

SLIDE 47

Type-aware SC Extensions to Baseline GPU

[Diagram: SM 1 pipeline (frontend, shared memory, coalescing unit, Type-Cache, Type-wait Queue (TQ), L1D) and memory access controllers MAC 0–MAC 7, each containing a Memory Ordering Buffer (MoB) with safe and unsafe queues, connected via the interconnect to L2 modules 1–7]

1. Classify memory accesses
2. Relax ordering constraints for SM-safe accesses

SLIDE 48

Type-aware SC Extensions to Baseline GPU

1. Classify memory accesses
2. Relax ordering constraints for SM-safe accesses

Problem: ensuring ordering among conflicting accesses to an SM-safe location within an SM (see details in the paper).

SLIDE 49

Type-aware SC incurs only small performance overhead

[Chart: execution time normalized to DRF-0 for 3mm, fdtd-2d, gemm, gramschm (the applications with significant SC overhead) under DRF-0, SC, and Type-aware SC]

The proposed design is able to exploit intra-warp MLP for SM-safe accesses.

SLIDE 50

Future research directions

– Build an SC-preserving GPU compiler
– Overhead of an SC-preserving LLVM compiler for C++: ~2% [Marino et al., PLDI'11]

[Diagram: language-level memory model → GPU compiler → GPU hardware]

SLIDE 51

Conclusion

– Quantified the performance overhead of various memory models in GPUs
– TSO is unattractive for GPUs: no performance or programmability benefits over SC
– The performance gap between SC and DRF-0 is insignificant for most applications
– Access-type aware optimization bridges the gap in the remaining applications

SLIDE 52

Questions?