SLIDE 1

VOLTA Architecture and Performance Optimization

Guillaume Thomas-Collignon, Paulius Micikevicius

SLIDE 2

AGENDA

  • Little’s Law
  • Control flow, threads are threads
  • Instructions, Tensor Cores
  • Memory architecture, L1, shared memory

SLIDE 3

VOLTA V100

Per Streaming Multiprocessor:

  • 64 FP32 lanes
  • 32 FP64 lanes
  • 64 INT32 lanes
  • 16 SFU lanes (transcendentals)
  • 32 LD/ST lanes (Gmem/Lmem/Smem)
  • 8 Tensor Cores
  • 4 TEX lanes

[Diagram: 80 SMs, each with its own L1, sharing the L2 cache and DRAM]

SLIDE 4

Little’s Law

For escalators. Our escalator parameters:

  • 1 person per step
  • A step arrives every 2 seconds → bandwidth: 0.5 person/s
  • 20 steps tall → latency = 40 seconds

SLIDE 5

Little’s Law

For escalators, with the same parameters (a step arrives every 2 seconds, bandwidth 0.5 person/s, 20 steps tall, latency = 40 seconds):

  • With only one person in flight: achieved bandwidth = 1 person / 40 s = 0.025 person/s
  • To saturate bandwidth, one person must arrive with every step, so we need 20 persons in flight
  • In general: need Bandwidth x Latency items in flight
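
Stated as a formula, this is just Little's Law with the slide's escalator numbers plugged in:

\[
\text{items in flight} = \text{bandwidth} \times \text{latency} = 0.5\ \text{person/s} \times 40\ \text{s} = 20\ \text{persons}
\]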

SLIDE 6

Little’s Law

For GPUs, the optimization goals are:

  • 1. Saturate the compute units: get close to peak performance
  • 2. Saturate memory bandwidth: if the compute density is too low to saturate the compute units

Either way, latencies must be hidden to achieve this.

SLIDE 7

Memory Bandwidth

[Plot: % of peak bandwidth vs. bytes in flight per SM, measured on V100]

Volta reaches 90% of peak bandwidth with ~6KB of data in flight per SM

SLIDE 8

CUDA basics

Blocks of threads, warps

  • Single Instruction Multiple Threads (SIMT) model
  • CUDA hierarchy: Grid -> Blocks -> Threads
  • One warp = 32 threads.
  • Why does it matter? Many optimizations are based on behavior at the warp level.

SLIDE 9

CUDA basics

  • Thread blocks can be 1D, 2D, 3D

Only for convenience. HW “looks” at threads in 1D

  • Consecutive 32 threads belong to the same warp

Mapping threads. Example: a block of 80 threads, 40 threads in X and 2 rows in Y.

SLIDE 10

CUDA basics

  • Thread blocks can be 1D, 2D, 3D

Only for convenience. HW “looks” at threads in 1D

  • Consecutive 32 threads belong to the same warp

Mapping threads. Same example: a block of 80 threads, 40 threads in X and 2 rows in Y.

The hardware covers the block with 3 warps (96 thread slots); the 16 trailing threads of the 3rd warp are inactive.
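
A minimal sketch (hypothetical kernel and launch, not from the slides) that prints how such a 40x2 block is linearized into warps:

    #include <cstdio>

    // The hardware linearizes a block as tid = threadIdx.x + threadIdx.y * blockDim.x,
    // and consecutive groups of 32 linear ids form a warp.
    __global__ void show_warp_mapping()
    {
        int linear = threadIdx.x + threadIdx.y * blockDim.x;  // 0..79 for a 40x2 block
        int warp   = linear / 32;                             // warp index 0, 1 or 2
        int lane   = linear % 32;                             // position inside the warp
        if (lane == 0)
            printf("thread (%d,%d) starts warp %d\n", threadIdx.x, threadIdx.y, warp);
    }

    // Launch: show_warp_mapping<<<1, dim3(40, 2)>>>();
    // 80 threads are covered by 3 warps; the last warp has only 16 active lanes.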

SLIDE 11

CUDA basics

  • Different warps can execute different code

No impact on performance: each warp maintains its own program counter.

  • Different code paths inside the same warp?

Threads that don’t participate are masked out, but the whole warp executes both sides of the branch

Control Flow

SLIDE 12

Control Flow

Example: the 40x2 block from the previous slides (warps 1-3), executing

    A;
    if (threadIdx.y == 0)
        B;
    else
        C;
    D;

[Diagram: per-warp instruction timelines (instructions vs. time) for warps 1-3]

SLIDE 13

Control Flow

[Animation build: warp 1 issues A]

SLIDE 14

Control Flow

[Animation build: warp 1 issues B (every thread of warp 1 has threadIdx.y == 0)]

SLIDE 15

Control Flow

[Animation build: warp 1 issues D; warp 1 never executes C]

SLIDE 16

Control Flow

[Animation build: warp 2 issues A]

SLIDE 17

Control Flow

[Animation build: warp 2 issues B, with its threadIdx.y == 1 threads masked out]

SLIDE 18

Control Flow

[Animation build: warp 2 issues C, with its threadIdx.y == 0 threads masked out]

SLIDE 19

Control Flow

[Animation build: warp 2 issues D; the diverged warp executed both sides of the branch]

SLIDE 20

Control Flow

[Animation build: warp 3 issues A]

SLIDE 21

Control Flow

[Animation build: warp 3 issues C (no thread of warp 3 has threadIdx.y == 0)]

SLIDE 22

Control Flow

[Animation build: warp 3 issues D]

SLIDE 23

Control Flow

Takeaways

  • Minimize thread divergence inside a warp (a contrast is sketched below)
  • Divergence between warps is fine
  • Maximize “useful” cycles for each thread
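
A minimal sketch (hypothetical kernels; fast_path and slow_path are placeholder device functions) contrasting a branch that diverges inside every warp with one that is uniform per warp:

    __device__ float fast_path(float v) { return v * 2.0f; }   // placeholder work
    __device__ float slow_path(float v) { return v * 0.5f; }   // placeholder work

    __global__ void divergent(float *x)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadIdx.x % 2 == 0)         // even and odd lanes of every warp branch
            x[i] = fast_path(x[i]);       // differently: the warp executes both paths
        else
            x[i] = slow_path(x[i]);
    }

    __global__ void warp_uniform(float *x)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if ((threadIdx.x / 32) % 2 == 0)  // all 32 threads of a warp take the same path
            x[i] = fast_path(x[i]);       // divergence only between warps, which is fine
        else
            x[i] = slow_path(x[i]);
    }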

SLIDE 24

Threads Are Threads

New in Volta

  • Program counter: before Volta, per warp; on Volta, per thread
  • Volta guarantees forward progress for diverged threads in a warp
  • Allows data exchange between diverged threads in a warp, e.g. mutexes among warp threads
  • Allows writing natural code that would have deadlocked before

SLIDE 25

Threads Are Threads

Example

    lock = 0;
    while (lock == 0)
        lock = tryGetLock();
    doSomething();
    releaseLock();

These device functions could be implemented with atomics, or with volatile pointers.

Pre-Volta: this code might deadlock in the loop if the thread that holds the lock cannot make forward progress and release it.
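
A minimal sketch of how tryGetLock() / releaseLock() could be built from atomics, as the slide suggests (the global mutex variable and the exact signatures are assumptions, not from the deck):

    __device__ int mutex = 0;            // 0 = free, 1 = taken (assumed convention)

    __device__ int tryGetLock()          // returns 1 if the lock was acquired
    {
        return atomicCAS(&mutex, 0, 1) == 0;
    }

    __device__ void releaseLock()
    {
        atomicExch(&mutex, 0);           // atomically mark the lock as free again
    }

On Volta, the per-thread program counter lets a diverged thread that holds the lock keep making progress and eventually release it.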

SLIDE 26

Threads are Threads

Thread re-convergence

  • Don’t assume the threads in a warp are re-converged or executing in lock-step mode. Use __syncwarp() to synchronize the threads in a warp.
  • Shuffle and warp vote functions are deprecated. Use the new equivalent “_sync” functions: the extra parameter tells the compiler/HW which threads are expected to participate, because they might not all reach the call at the same time. E.g. __shfl_up(value, 1) becomes __shfl_up_sync(0xffffffff, value, 1).
  • Full efficiency only when all 32 threads of a warp are converged!
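
A minimal sketch (assuming one warp per block, i.e. blockDim.x == 32) of exchanging data between lanes through shared memory with an explicit __syncwarp():

    __global__ void rotate_within_warp(int *out)
    {
        __shared__ int buf[32];
        int lane = threadIdx.x;            // 0..31, one warp per block assumed

        buf[lane] = lane * lane;           // each lane publishes a value
        __syncwarp();                      // do not assume lock-step execution on Volta
        out[lane] = buf[(lane + 1) % 32];  // read the neighbouring lane's value
    }

The same exchange could also be written with __shfl_sync(0xffffffff, value, (lane + 1) % 32), which avoids the shared-memory buffer entirely.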

SLIDE 27

Threads Are Threads

How to deal with warp-synchronous code?

  • Update/fix the code!
  • Use Cooperative Groups (GTC 2017 talk s7622)
  • Compile for an older architecture (disables forward progress):
    -gencode arch=compute_60,code=sm_70 (binary), or -arch=compute_60 (PTX JIT)

SLIDE 28

SM Resources

Each thread block needs:

  • Registers (#registers/thread x #threads)
  • Shared memory (0 to 96 KB)

Volta limits per SM:

  • 256 KB of registers
  • 96 KB of shared memory
  • 2048 threads max (64 warps)
  • 32 thread blocks max

Any resident warp can be scheduled without a context switch.

SLIDE 29

SM Resources

Same resources and limits as the previous slide, viewed on the SM.

[Diagram: SM with 128 KB Smem/L1, 256 KB of registers, compute units, and warp schedulers]

SLIDE 30

Occupancy

Occupancy = (achieved number of threads per SM) / (maximum number of threads per SM)

(Use the occupancy calculator spreadsheet (XLS) in the CUDA Toolkit.)

  • Higher occupancy can help to hide latency: the SM has more warp candidates to schedule while other warps are waiting for their instructions to complete
  • Achieved occupancy vs. theoretical occupancy
  • Need to run enough thread blocks to fill all the SMs!
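
A worked example with illustrative numbers (not from the slides), using the per-SM limits above for a kernel that needs 64 registers per thread:

\[
\frac{256\ \text{KB of registers}}{64 \times 4\ \text{B per thread}} = 1024\ \text{resident threads}
\quad\Rightarrow\quad
\frac{1024}{2048} = 50\%\ \text{theoretical occupancy}
\]

(Register allocation granularity can lower this slightly; the occupancy calculator accounts for it.)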

SLIDE 31

Increasing In-Flight instructions

Two ways to increase parallelism:

  • Improve occupancy: more threads → more instructions in flight
  • Improve instruction-level parallelism (ILP): more independent instructions per thread

SLIDE 32

Instruction Issue

Instructions are issued in order. If an instruction is not eligible, it stalls the warp.

An instruction is eligible for issue if both are true:

  • A pipeline is available for execution (some pipelines need multiple cycles to issue a warp)
  • All of its arguments are ready (an argument isn’t ready if a previous instruction hasn’t produced it yet)

SLIDE 33

Instruction Issue Example

    __global__ void kernel(float *a, float *b, float *c)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        c[i] += a[i] * b[i];
    }

Compiles to (SASS):

    LDG.E R2, [R2];
    LDG.E R4, [R4];
    LDG.E R9, [R6];
    FFMA R9, R2, R4, R9;
    STG.E [R6], R9;

stall! stall! (12 B / thread in flight)

SLIDE 34

Computing 2 values per thread

    __global__ void kernel(float2 *a, float2 *b, float2 *c)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        c[i].x += a[i].x * b[i].x;
        c[i].y += a[i].y * b[i].y;
    }

Compiles to (SASS):

    LDG.E.64 R2, [R2];
    LDG.E.64 R4, [R4];
    LDG.E.64 R6, [R8];
    FFMA R7, R3, R5, R7;
    FFMA R6, R2, R4, R6;
    STG.E.64 [R8], R6;

stall! stall! (24 B / thread in flight, 2 independent FFMA instructions)
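
Extending the same idea to 4 values per thread (an illustrative variant, not shown in the deck): float4 loads put 48 bytes per thread in flight and expose 4 independent FFMAs.

    __global__ void kernel(float4 *a, float4 *b, float4 *c)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float4 va = a[i], vb = b[i], vc = c[i];   // three 16-byte loads in flight
        vc.x += va.x * vb.x;                      // 4 independent FFMAs
        vc.y += va.y * vb.y;
        vc.z += va.z * vb.z;
        vc.w += va.w * vb.w;
        c[i] = vc;
    }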

SLIDE 35

Fast Math intrinsics

Fast but less accurate math intrinsics are available. Two ways to use them:

  • Whole file: compile with --use_fast_math
  • Individual calls, e.g. __sinf(x), __logf(x), __fdividef(x, y)
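
For instance, a minimal sketch (hypothetical helper) replacing the accurate calls with their fast counterparts:

    __device__ float fast_ratio(float x)
    {
        // Reduced-accuracy version of sinf(x) / logf(x)
        return __fdividef(__sinf(x), __logf(x));
    }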

SLIDE 36

Tensor Cores

New in Volta

  • 125 Tflop/s peak
  • Matrix multiplication pipeline, half-precision inputs
  • Used in cuBLAS, cuDNN, CUTLASS*; optimized libraries can reach ~90% of peak
  • Exposed in CUDA 9.1 (preview feature)

*(more info on CUTLASS in GTC 2018 session S8854)

SLIDE 37

Tensor Cores

Using Tensor Cores in your CUDA code

  • WMMA warp-wide macro-instructions; all 32 threads in the warp must be active!
  • Performs matrix multiplication on 16x16 tiles (8x32x16 and 32x8x16 tiles also available)
  • D = A x B + C
  • A and B: FP16 only. C and D: same type, either FP16 or FP32.

[Diagram: 16x16 tiles, D = A x B + C]
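
A minimal sketch of the WMMA API for a single warp computing one 16x16x16 tile (assumes leading dimensions of 16 and the layouts shown; real code loops over the K tiles as on the next slides):

    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    // One warp computes D = A x B + 0 for a single 16x16 output tile.
    __global__ void wmma_tile(const half *a, const half *b, float *d)
    {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

        wmma::fill_fragment(acc, 0.0f);            // C = 0
        wmma::load_matrix_sync(a_frag, a, 16);     // leading dimension 16
        wmma::load_matrix_sync(b_frag, b, 16);
        wmma::mma_sync(acc, a_frag, b_frag, acc);  // acc = A x B + acc
        wmma::store_matrix_sync(d, acc, 16, wmma::mem_row_major);
    }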

SLIDE 38

Tensor Cores

Each warp processes a 16x16 output tile

Typical use. Each warp:

  • Loops over all input tiles Ak and Bk: C = C + Ak x Bk
  • Writes the output tile

[Diagram: output tile C accumulated from tiles of A and B]

SLIDE 39

Tensor Cores

Each warp processes a 16x16 output tile

Typical use: same per-warp loop as the previous slide.

Several tiles can be computed per thread block, with the inputs staged in shared memory.

SLIDE 40

Volta’s Memory System

V100

[Diagram: 80 SMs with per-SM L1, a shared L2, and DRAM]

  • 80 Streaming Multiprocessors
  • 256 KB register file per SM (20 MB total)
  • Unified shared memory / L1 cache: 128 KB per SM, variable split (~10 MB total, 14 TB/s)
  • 6 MB L2 cache (2.5 TB/s read, 1.6 TB/s write)
  • 16 / 32 GB HBM2 (900 GB/s), “free” ECC
  • PCIe

SLIDE 41

Cache Lines & Sectors

Moving data between L1, L2, and DRAM:

  • Memory access granularity = 32 bytes = 1 sector (32 B for Maxwell, Pascal, Volta; Kepler and before: variable, 32 B or 128 B, depending on architecture, access type, and caching / non-caching options)
  • A cache line is 128 bytes, made of 4 sectors
  • Cache “management” granularity = 1 cache line

[Diagram: a 128-byte-aligned, 128-byte cache line made of sectors 0-3]

SLIDE 42

Memory Reads

Getting data from Global Memory

[Diagram: SMs with their L1s, the shared L2, and DRAM]

  • Check whether the data is in L1 (if not, check L2)
  • Check whether the data is in L2 (if not, get it from DRAM)
  • Unit of data moved: sectors

SLIDE 43

Memory Writes

[Diagram: SMs with their L1s, the shared L2, and DRAM]

  • Before Volta: writes were not cached in L1. New in Volta: L1 caches writes.
  • L1 is write-through: writes go to L1 AND L2.
  • L2 is write-back: data is flushed to DRAM only when needed.
  • Partial writes are supported (masked portion of a sector, but the behavior can change with ECC on/off).
  • Instruction modifiers can influence cache behavior (inline PTX only).

SLIDE 44

L1, L2 Caches

Why do GPUs have caches? In general, not for cache blocking:

  • 100s to 1000s of threads run per SM, and tens of thousands of threads share the L2 cache
  • L1 and L2 are small per thread: at 2048 threads/SM with 80 SMs, that is 64 bytes of L1 and 38 bytes of L2 per thread
  • Running at lower occupancy increases the bytes of cache per thread
  • Shared memory is usually a better option for caching data explicitly: user-managed, no evictions out of your control

SLIDE 45

L1, L2 Caches

Why do GPUs have caches? They are useful for:

  • “Smoothing” irregular, unaligned access patterns
  • Caching common data accessed by many threads
  • Faster register spills, local memory
  • Fast atomics
  • Codes that don’t use shared memory (naïve code, OpenACC, …)

SLIDE 46

Access Patterns

Warps and Sectors

For each warp: how many sectors are needed? It depends on the addresses, the active threads, and the access size. Natural element sizes: 1 B, 2 B, 4 B, 8 B, 16 B.

[Diagram: a warp of 32 threads reading consecutive 4-byte elements → 4 sectors]

SLIDE 47

Access Patterns

Warps and Sectors

[Diagram: a warp of 32 threads reading consecutive 8-byte elements → 8 sectors. Examples of 8-byte elements: long long, int2, double, float2]

SLIDE 48

Access Patterns

Warps and Sectors

[Diagram: a warp of 32 threads accessing 4-byte elements within the same 128 bytes → 4 sectors]

SLIDE 49

Access Patterns

Warps and Sectors

[Diagram: a warp of 32 threads reading 4-byte elements, unaligned → 5 sectors. 128 bytes requested, 160 bytes read (80% efficiency)]

SLIDE 50

Access Patterns

Warps and Sectors

[Diagram: the same unaligned 4-byte access, 5 sectors. With more than 1 warp per block, the extra sector might be found in L1 or L2 by the next warp]

SLIDE 51

Access Patterns

Warps and Sectors

[Diagram: all 32 threads of a warp reading the same address → 1 sector]

SLIDE 52

Access Patterns

Warps and Sectors

[Diagram: a warp of 32 threads with a 4-byte strided access → 32 sectors. 128 bytes requested, 1024 bytes transferred! Only a few bytes of each sector are used, wasting lots of bandwidth]

SLIDE 53

Access Patterns

Takeaways

  • Know your access patterns (the coalesced vs. strided contrast is sketched below)
  • Use the profiler (metrics, counters) to check how many sectors are moved. Is that what you expect? Is it optimal?
  • Using the largest type possible (e.g. float4) maximizes the number of sectors moved per instruction
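
An illustrative pair of kernels (hypothetical, not from the deck) showing the two extremes above: consecutive 4-byte accesses need 4 sectors per warp, while strided accesses can need up to 32.

    __global__ void coalesced_copy(const float *in, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];     // a warp touches 128 consecutive bytes: 4 sectors
    }

    __global__ void strided_copy(const float *in, float *out, int stride)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        out[i] = in[i];     // with stride >= 8 floats, each thread hits its own sector: 32 sectors
    }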

SLIDE 54

Shared Memory

  • Scratch-pad memory on each SM: user-managed cache, the HW does not evict data
  • Data written to SMEM stays there until the user overwrites it
  • Useful for: storing frequently-accessed data to reduce DRAM accesses, and communication among threads of a thread block
  • Performance benefits compared to DRAM: 20-40x lower latency, ~15x higher bandwidth
  • Accessed at 4-byte granularity (GMEM granularity is 32 B)

SLIDE 55

Volta Shared Memory

  • Default 48 KB per thread block; opt in to get 96 KB
  • 32 banks, each 4 bytes wide
  • Bandwidth: 4 bytes per bank per clock per SM = 128 bytes per clock per SM
  • V100: ~14 TB/s aggregate across 80 SMs
  • Mapping addresses to banks: successive 4-byte words go to successive banks
  • Bank index = (4-byte word index) % 32 = ((byte address) / 4) % 32
  • An 8-byte word spans two successive banks
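
The bank computation from the last bullet, written out as an illustrative helper (not part of any API):

    __device__ int smem_bank_of(int byte_address)
    {
        return (byte_address / 4) % 32;   // successive 4-byte words map to successive banks
    }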

SLIDE 56

Logical View Of SMEM Banks

[Diagram: byte addresses 0, 4, 8, … mapped onto banks 0-31 for 4-byte data; word N lives in bank N % 32, so words 0, 32, 64, … all fall into bank 0]

SLIDE 57

Shared Memory Instruction Operation

  • Threads in a warp provide addresses
  • HW determines which 4-byte words the addresses fall into
  • Reads (LDS): fetch the data and distribute the requested bytes among the threads; multi-cast capable
  • Writes (STS): if multiple threads write the same address, one “wins”

SLIDE 58

Shared Memory Bank Conflicts

A bank conflict occurs when, inside a warp, 2 or more threads access different 4-byte words in the same bank. Think: 2 or more threads access different “rows” of the same bank.

N-way bank conflict: N threads of a warp conflict

  • Increases latency
  • Worst case: 32-way conflict → 31 replays
  • Each replay adds a few cycles of latency

There is no bank conflict if:

  • Several threads access the same 4-byte word
  • Several threads access different bytes of the same 4-byte word

SLIDE 59

No Bank Conflicts

[Diagram: threads T-0 … T-31 each access a 4-byte word in a distinct bank → no conflict]

SLIDE 60

No Bank Conflicts

[Diagram: another access pattern in which every thread falls into a distinct bank → no conflict]

SLIDE 61

No Bank Conflicts

[Diagram: another conflict-free access pattern]

SLIDE 62

No Bank Conflicts

[Diagram: another conflict-free access pattern]

SLIDE 63

2-way Bank Conflict

[Diagram: two threads access different 4-byte words in the same bank → 2-way conflict, 1 replay]

SLIDE 64

2-way Bank Conflict

[Diagram: another 2-way bank conflict example]

SLIDE 65

3-way Bank Conflict

[Diagram: three threads access different 4-byte words in the same bank → 3-way conflict, 2 replays]

SLIDE 66

Bank Conflict Resolution

4B or smaller words:

  • Process addresses of all threads in a warp in a single phase

8B words are accessed in 2 phases:

  • Process addresses of the first 16 threads in a warp
  • Process addresses of the second 16 threads in a warp

16B words are accessed in 4 phases:

  • Each phase processes a quarter of a warp

Bank conflicts occur only between threads in the same phase

SLIDE 67

8B words, No Conflicts

[Diagram: a warp accessing 8-byte words; phase 1 covers threads T-0 … T-15 and phase 2 covers T-16 … T-31, each phase conflict-free]

SLIDE 68

8B words, 2-way Conflict

[Diagram: same layout; phase 1 (threads T-0 … T-15) has a 2-way conflict, phase 2 (threads T-16 … T-31) has no conflict]

SLIDE 69

Case Study: Matrix Transpose

Staged through SMEM to coalesce GMEM addresses:

  • 32x32 thread block, single-precision values
  • 32x32 array in shared memory

Initial implementation:

  • A warp reads a row from GMEM and writes it to a row of SMEM
  • Synchronize the threads in the block
  • A warp reads a column from SMEM and writes it to a row in GMEM

SLIDE 70

Case Study: Matrix Transpose

32x32 SMEM array (e.g. __shared__ float sm[32][32]):

  • A warp accessing a row: no conflict
  • A warp accessing a column: 32-way conflict

[Diagram: the 32x32 array mapped onto banks 0-31; the number identifies which warp accesses the data, the color indicates the bank the data resides in. Every element of a column falls into the same bank]

SLIDE 71

Case Study: Matrix Transpose

Solution: add a column of padding, 32x33 (e.g. __shared__ float sm[32][33]):

  • A warp accessing a row or a column: no conflict

[Diagram: with the padding column, consecutive elements of a column land in different banks; number = accessing warp, color = bank]

Speedup: 1.3x. (A sketch of the padded kernel follows below.)
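
A minimal sketch of the padded-SMEM transpose (assumes matrix dimensions that are multiples of 32; the variable names are assumptions, not the presenters' code):

    // in is an ny x nx matrix, out is the nx x ny transpose; launch with dim3(32, 32) blocks.
    __global__ void transpose32(const float *in, float *out, int nx, int ny)
    {
        __shared__ float tile[32][33];                      // 33 columns: the padding removes conflicts

        int x = blockIdx.x * 32 + threadIdx.x;
        int y = blockIdx.y * 32 + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * nx + x];    // coalesced read, write a row of SMEM

        __syncthreads();

        x = blockIdx.y * 32 + threadIdx.x;                  // transposed block origin
        y = blockIdx.x * 32 + threadIdx.y;
        out[y * ny + x] = tile[threadIdx.x][threadIdx.y];   // read a column of SMEM, coalesced write
    }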

SLIDE 72

Summary: Shared Memory

Shared memory is a tremendous resource:

  • Very high bandwidth (14 TB/s)
  • Much lower latency than global memory
  • Programmer-managed data, no evictions by hardware
  • Volta: up to 96 KB of shared memory per thread block

Performance issues to look out for:

  • Bank conflicts add latency and reduce throughput
  • Use the profiling tools to identify bank conflicts

SLIDE 73

Volta’s L1 Cache

Pascal vs. Volta:

  • Pascal: 24 KB, achievable BW = 2.6 TB/s
  • Volta: variable size, 32 KB to 128 KB, achievable BW = 14.4 TB/s, lower latency!

L1 caches: global memory, texture, local memory (including register spills)

SLIDE 74

Volta’s Unified L1

[Diagram: Pascal has a separate L1/Tex cache and shared memory; Volta unifies them into one 128 KB block per SM]

6 possible Smem / L1 splits:

  • 96 KB / 32 KB
  • 64 KB / 64 KB
  • 32 KB / 96 KB
  • 16 KB / 112 KB
  • 8 KB / 120 KB
  • 0 KB / 128 KB

How to specify the L1 / Smem split on Volta:

cudaFuncSetAttribute (MyKernel, cudaFuncAttributePreferredSharedMemoryCarveout, carveout);

The driver usually does a pretty good job at choosing the right split.
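
For example (illustrative carveout value; MyKernel is the placeholder name from the slide), asking for roughly half of the 128 KB as shared memory:

    // The carveout is a hint, expressed as a percentage of the unified 128 KB storage.
    cudaFuncSetAttribute(MyKernel, cudaFuncAttributePreferredSharedMemoryCarveout, 50);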

SLIDE 75

2D Stencil Experiment

With and without shared memory:

    index = iy * nx + ix;
    res = coef[0] * in[index];
    for (i = 1; i <= RADIUS; i++)
        res += coef[i] * (in[index - i] + in[index + i]
                        + in[index - i * nx] + in[index + i * nx]);
    out[index] = res;

With shared memory (see the sketch after this list):

  • Load the input array and the halos into shared memory
  • __syncthreads()
  • Compute the stencil from shared memory
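
A minimal sketch of the shared-memory variant (BLOCK_X, BLOCK_Y, and RADIUS are assumed compile-time constants; boundary handling and coefficient setup are omitted, so this only illustrates the staging pattern described above):

    #define RADIUS  4
    #define BLOCK_X 32
    #define BLOCK_Y 8

    __global__ void stencil2d_smem(const float *in, float *out, const float *coef, int nx)
    {
        __shared__ float tile[BLOCK_Y + 2 * RADIUS][BLOCK_X + 2 * RADIUS];

        int ix = blockIdx.x * BLOCK_X + threadIdx.x;
        int iy = blockIdx.y * BLOCK_Y + threadIdx.y;
        int tx = threadIdx.x + RADIUS, ty = threadIdx.y + RADIUS;

        tile[ty][tx] = in[iy * nx + ix];                          // interior element
        if (threadIdx.x < RADIUS) {                               // left / right halos
            tile[ty][tx - RADIUS]  = in[iy * nx + ix - RADIUS];
            tile[ty][tx + BLOCK_X] = in[iy * nx + ix + BLOCK_X];
        }
        if (threadIdx.y < RADIUS) {                               // top / bottom halos
            tile[ty - RADIUS][tx]  = in[(iy - RADIUS) * nx + ix];
            tile[ty + BLOCK_Y][tx] = in[(iy + BLOCK_Y) * nx + ix];
        }
        __syncthreads();

        float res = coef[0] * tile[ty][tx];
        for (int i = 1; i <= RADIUS; i++)
            res += coef[i] * (tile[ty][tx - i] + tile[ty][tx + i]
                            + tile[ty - i][tx] + tile[ty + i][tx]);
        out[iy * nx + ix] = res;
    }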

SLIDE 76

2D – Small stencils

Relative speed of L1 implementation versus Smem implementation

  • RADIUS=1: Volta 103%, Pascal 78%
  • RADIUS=2: Volta 102%, Pascal 55%

L1 implementation is faster than Shared Memory on Volta!

SLIDE 77

2D – Larger Stencils

Relative speed of L1 implementation versus Smem implementation

  • RADIUS=4: Volta 94%, Pascal 40%
  • RADIUS=8: Volta 95%, Pascal 32%
  • RADIUS=16: Volta 79%, Pascal 29%

Shared Memory implementation is always faster for larger stencils

SLIDE 78

Constant Memory

  • Globally-scoped arrays qualified with __constant__
  • Total constant data size limited to 64 KB
  • Throughput = 4 B per clock per SM (ideal when the entire warp reads the same address)
  • Can be used directly in arithmetic instructions (saving registers)
  • Example use: stencil coefficients (see the sketch below)
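
For instance, a minimal sketch of stencil coefficients in constant memory (names and sizes are illustrative):

    #include <cuda_runtime.h>

    #define RADIUS 4

    __constant__ float coef[RADIUS + 1];   // visible to all kernels in this file, cached per SM

    void upload_coefficients(const float *host_coef)
    {
        // Copy the coefficients once from the host before launching the kernels.
        cudaMemcpyToSymbol(coef, host_coef, (RADIUS + 1) * sizeof(float));
    }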

SLIDE 79

Running Faster

Solving the bottlenecks. A piece of code can be:

  • Compute bound (saturating the compute units). Solution: reduce the number of instructions executed, using vector types, intrinsics, Tensor Cores, FMAs
  • Bandwidth bound (saturating memory bandwidth). Solution: reduce the amount of data transferred, using optimal access patterns and lower precision
  • Latency bound. Solution: increase the number of instructions / memory accesses in flight

SLIDE 80