VOLTA Architecture and Performance Optimization
Guillaume Thomas-Collignon, Paulius Micikevicius
AGENDA
- Little's law
- Control flow, Threads are Threads
- Instructions, Tensor Cores
- Memory Architecture, L1, Smem
VOLTA V100
Per Streaming Multiprocessor:
- 64 FP32 lanes
- 32 FP64 lanes
- 64 INT32 lanes
- 16 SFU lanes (transcendentals)
- 32 LD/ST lanes (Gmem/Lmem/Smem)
- 8 Tensor Cores
- 4 TEX lanes
[Diagram: 80 SMs, each with its own L1, sharing the L2 cache and DRAM]
Little’s Law
For Escalators
Our escalator parameters:
- 1 person per step
- A step arrives every 2 seconds -> bandwidth: 0.5 person/s
- 20 steps tall -> latency = 40 seconds
Little's Law
For Escalators
- One person in flight? Achieved bandwidth = 1 person / 40 s = 0.025 person/s
- To saturate bandwidth: one person must arrive with every step, so we need 20 persons in flight
- In general: need Bandwidth x Latency persons in flight (0.5 person/s x 40 s = 20)
Little's Law
For GPUs
Optimization goals:
1. Saturate the compute units: get close to peak performance
2. Saturate memory bandwidth: when compute density is too low to saturate the compute units
Either way, we need to hide the latencies to achieve this.
Memory Bandwidth
[Plot: % of peak bandwidth vs. bytes in flight per SM, on V100]
Volta reaches 90% of peak bandwidth with ~6KB of data in flight per SM
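As a sanity check of Little's Law against this curve: bytes in flight = bandwidth x latency. A minimal host-side sketch, assuming a round ~400 ns DRAM latency (an assumption, not a measured figure):

    // Little's Law estimate: how many bytes must be in flight to saturate
    // V100's 900 GB/s HBM2, assuming ~400 ns memory latency.
    #include <cstdio>

    int main() {
        const double bandwidth = 900e9;    // bytes/s, V100 HBM2 peak
        const double latency   = 400e-9;   // seconds, assumed DRAM latency
        const int    numSMs    = 80;

        double inflight = bandwidth * latency;   // ~360 KB chip-wide
        printf("Bytes in flight: %.0f KB total, %.1f KB per SM\n",
               inflight / 1024.0, inflight / 1024.0 / numSMs);
        return 0;
    }

This lands at roughly 4.4 KB per SM, the same order of magnitude as the ~6 KB the plot shows for 90% of peak.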
CUDA basics
Blocks of threads, warps
- Single Instruction Multiple Threads (SIMT) model
- CUDA hierarchy: Grid -> Blocks -> Threads
- One warp = 32 threads
- Why does it matter? Many optimizations are based on behavior at the warp level
CUDA basics
Mapping threads
- Thread blocks can be 1D, 2D, 3D. Only for convenience: the HW "looks" at threads in 1D.
- Consecutive 32 threads belong to the same warp.
Example: an 80-thread block, 40 threads in X, 2 rows in Y.
[Diagram: the 40x2 block maps to 3 warps (96 thread slots); the 3rd warp has 16 inactive threads]
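A minimal sketch of the 1D view the HW uses (the helper names are mine, not CUDA built-ins):

    // Linear thread index inside a block, x varying fastest.
    __device__ int linearThreadId() {
        return threadIdx.x
             + threadIdx.y * blockDim.x
             + threadIdx.z * blockDim.x * blockDim.y;
    }

    __device__ int warpIdInBlock() { return linearThreadId() / 32; }
    __device__ int laneId()        { return linearThreadId() % 32; }

In the 40x2 example, thread (x=8, y=1) has linear id 8 + 1*40 = 48, so it is lane 16 of warp 1 (counting from 0).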
CUDA basics
Control Flow
- Different warps can execute different code: no impact on performance, since each warp maintains its own Program Counter.
- Different code paths inside the same warp? Threads that don't participate are masked out, but the whole warp executes both sides of the branch.
Control Flow
Example, using the 40x2 thread block (3 warps) from the previous slides:

    A;
    if (threadIdx.y == 0)
        B;
    else
        C;
    D;

[Animated diagram: instructions executed by each warp over time]
- Warp 1 (threads 0-31 of row y=0, all take the branch): executes A, B, D. No divergence.
- Warp 2 (the last 8 threads of row 0 and the first 24 threads of row 1): executes A, then B with the row-0 threads active, then C with the row-1 threads active, then D. The whole warp runs both sides of the branch.
- Warp 3 (the last 16 threads of row 1): executes A, C, D. No divergence.
Control Flow
Takeaways
- Minimize thread divergence inside a warp
- Divergence between warps is fine
- Maximize "useful" cycles for each thread
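One way to apply the first takeaway, sketched under the assumption that the work can be partitioned by warp: key the branch off a value that is uniform across the warp, so no thread is ever masked out.

    // Sketch: warp-uniform branching. All 32 threads of a warp compute the
    // same warpId (1D block), so each warp takes exactly one side.
    __global__ void uniformBranch(float *data) {
        int tid    = blockIdx.x * blockDim.x + threadIdx.x;
        int warpId = threadIdx.x / 32;
        if (warpId % 2 == 0)
            data[tid] *= 2.0f;   // whole warps execute this side
        else
            data[tid] += 1.0f;   // whole warps execute the other side
    }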
Threads Are Threads
New in Volta
- Program counter: before Volta, one per warp; on Volta, one per thread
- Volta guarantees forward progress for diverged threads in a warp
- Allows exchanging data between diverged threads in a warp, e.g. mutexes among warp threads
- Allows writing natural code that would deadlock before Volta
Threads Are Threads
Example

    lock = 0;
    while (lock == 0)
        lock = tryGetLock();
    doSomething();
    releaseLock();

These device functions could be implemented with atomics, or volatile pointers.
Pre-Volta: the code might deadlock in the loop if the thread that gets the lock cannot make forward progress and release it.
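One way tryGetLock() / releaseLock() could be built on atomics; a sketch, not the talk's actual code (mutex is a global int initialized to 0):

    __device__ int tryGetLock(int *mutex) {
        // atomicCAS returns the old value: 0 means the lock was free and is now ours.
        return atomicCAS(mutex, 0, 1) == 0;
    }

    __device__ void releaseLock(int *mutex) {
        __threadfence();        // make critical-section writes visible first
        atomicExch(mutex, 0);
    }

    __global__ void kernel(int *mutex, int *counter) {
        int lock = 0;
        while (lock == 0)
            lock = tryGetLock(mutex);   // pre-Volta: could spin forever
        *counter += 1;                  // doSomething(): the critical section
        releaseLock(mutex);
    }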
Threads Are Threads
Thread re-convergence
- Don't assume the threads in a warp are re-converged or executing in lock-step. Use __syncwarp() to synchronize the threads in a warp.
- Shuffle and warp vote functions are deprecated: use the new equivalent "_sync" functions. The extra parameter tells the compiler/HW which threads are expected to participate, because they might not all reach the instruction at the same time. E.g. __shfl_up(value, 1) becomes __shfl_up_sync(0xffffffff, value, 1).
- Full efficiency only when all 32 threads of a warp are converged!
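A minimal sketch of the _sync variants in use: a warp-wide sum. The full 0xffffffff mask is an assumption that all 32 lanes reach the calls (no divergence around them):

    __device__ float warpSum(float v) {
        for (int offset = 16; offset > 0; offset /= 2)
            v += __shfl_down_sync(0xffffffff, v, offset);
        return v;   // lane 0 ends up holding the sum of all 32 lanes
    }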
Threads Are Threads
How to deal with warp-synchronous code?
- Update/fix the code!
- Use Cooperative Groups (GTC 2017 talk s7622)
- Compile for an older architecture, disabling per-thread forward progress:
    -arch=compute_60 -code=sm_70 (binary)
    -arch=compute_60 (PTX JIT)
SM Resources
Each thread block needs:
- Registers (#registers/thread x #threads)
- Shared memory (0 ~ 96 KB)
Volta limits per SM:
- 256KB registers
- 96KB shared memory
- 2048 threads max (64 warps)
- 32 thread blocks max
The SM can schedule any resident warp without a context switch.
[Diagram: SM with 128 KB Smem/L1, 256KB registers, compute units, schedulers]
Occupancy

    Occupancy = (achieved number of threads per SM) / (maximum number of threads per SM)

(Use the occupancy calculator XLS in the CUDA Toolkit.)
Higher occupancy can help hide latency: the SM has more warp candidates to schedule while other warps are waiting for their instructions to complete.
Achieved occupancy vs. theoretical occupancy: you also need to run enough thread blocks to fill all the SMs!
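The theoretical number can also be queried at runtime with the CUDA occupancy API; a sketch with a placeholder kernel and block size:

    #include <cstdio>

    __global__ void myKernel(float *p) { p[threadIdx.x] *= 2.0f; }

    int main() {
        int blocksPerSM = 0, blockSize = 256;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                                      blockSize, 0 /* dynamic smem */);
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        double occ = (double)(blocksPerSM * blockSize) / prop.maxThreadsPerMultiProcessor;
        printf("Theoretical occupancy: %.0f%%\n", occ * 100.0);
        return 0;
    }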
Increasing In-Flight Instructions
Two ways to improve parallelism:
- Improve occupancy: more threads -> more instructions in flight
- Improve instruction-level parallelism (ILP): more independent instructions per thread
Instruction Issue
Instructions are issued in order; if an instruction is not eligible, it stalls the warp.
An instruction is eligible for issue if both are true:
- A pipeline is available for execution (some pipelines need multiple cycles to issue a warp)
- All the arguments are ready (an argument isn't ready if a previous instruction hasn't yet produced it)
Instruction Issue Example

    __global__ void kernel(float *a, float *b, float *c)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        c[i] += a[i] * b[i];
    }

Generated SASS:

    LDG.E R2, [R2];
    LDG.E R4, [R4];
    LDG.E R9, [R6];
    FFMA R9, R2, R4, R9;   // stall! waits for all three loads
    STG.E [R6], R9;        // stall! waits for the FFMA

12B / thread in flight
Computing 2 values per thread

    __global__ void kernel(float2 *a, float2 *b, float2 *c)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        c[i].x += a[i].x * b[i].x;
        c[i].y += a[i].y * b[i].y;
    }

Generated SASS:

    LDG.E.64 R2, [R2];
    LDG.E.64 R4, [R4];
    LDG.E.64 R6, [R8];
    FFMA R7, R3, R5, R7;   // stall! waits for the loads
    FFMA R6, R2, R4, R6;   // 2 independent instructions
    STG.E.64 [R8], R6;     // stall! waits for both FFMAs

24B / thread in flight
Fast Math intrinsics
Fast but less accurate math intrinsics are available. Two ways to use them:
- Whole file: compile with --use_fast_math
- Individual calls, e.g. __sinf(x), __logf(x), __fdividef(x,y)
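A sketch of the per-call route (the kernel around the intrinsics is mine):

    // Reduced-accuracy intrinsics, opted into call by call.
    __global__ void fastMathKernel(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            x[i] = __fdividef(__sinf(x[i]), __logf(x[i] + 2.0f));
    }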
Tensor Cores
New in Volta
- 125 Tflops peak: matrix-multiplication pipeline with half-precision inputs
- Used in CUBLAS, CUDNN, CUTLASS*; optimized libraries can reach ~90% of peak
- Exposed in CUDA 9.1 (preview feature)
*(more info on CUTLASS at GTC 2018 session S8854)
Tensor Cores
Using Tensor Cores in your CUDA code
- WMMA warp-wide macro-instructions; all threads in the warp must be active!
- Performs matrix multiplication on 16x16 tiles (8x32x16 and 32x8x16 tiles also available)
- D = A x B + C
- A and B: FP16 only. C and D: same type as each other, either FP16 or FP32.
[Diagram: 16x16 tiles A, B, C, D]
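A minimal single-tile sketch of the WMMA API (one warp computing D = A x B + C on one 16x16x16 tile; the layouts and leading dimensions are assumptions for the example):

    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    __global__ void wmmaTile(const half *a, const half *b, const float *c, float *d) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aF;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bF;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> cF;

        wmma::load_matrix_sync(aF, a, 16);                       // leading dim = 16
        wmma::load_matrix_sync(bF, b, 16);
        wmma::load_matrix_sync(cF, c, 16, wmma::mem_row_major);
        wmma::mma_sync(cF, aF, bF, cF);                          // C += A x B
        wmma::store_matrix_sync(d, cF, 16, wmma::mem_row_major);
    }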
Tensor Cores
Typical use
Each warp processes a 16x16 output tile. Each warp:
- Loops over all input tiles Ak and Bk: C = C + Ak x Bk
- Writes the output tile
[Diagram: C tile accumulated from a row of A tiles and a column of B tiles]
Can compute several tiles per thread block, with inputs staged in shared memory.
Volta's Memory System
V100
[Diagram: 80 SMs with per-SM L1s, shared L2, DRAM]
- 80 Streaming Multiprocessors, 256KB register file each (20 MB total)
- Unified shared memory / L1 cache: 128KB per SM, variable split (~10MB total, 14 TB/s)
- 6 MB L2 cache (2.5TB/s read, 1.6TB/s write)
- 16/32 GB HBM2 (900 GB/s), "free" ECC
- PCIe
Cache Lines & Sectors
Moving data between L1, L2, DRAM
- Memory access granularity = 32 bytes = 1 sector (32B for Maxwell, Pascal, Volta; Kepler and before: variable, 32B or 128B, depending on architecture, access type, and caching / non-caching options)
- A cache line is 128 bytes, made of 4 sectors
- Cache "management" granularity = 1 cache line
[Diagram: 128-byte cache line = sectors 0-3, 128-byte alignment]
Memory Reads
Getting data from global memory:
- Check whether the data is in L1 (if not, check L2)
- Check whether the data is in L2 (if not, get it from DRAM)
- Unit of data moved: sectors
Memory Writes
- Before Volta: writes were not cached in L1. New in Volta: L1 caches writes.
- L1 is write-through: writes go to L1 AND L2.
- L2 is write-back: it flushes data to DRAM only when needed.
- Partial writes are supported (masked portion of a sector, but behavior can change with ECC on/off).
- Instruction modifiers can influence cache behavior (inline PTX only).
L1, L2 Caches
Why do GPUs have caches?
In general, not for cache blocking:
- 100s ~ 1000s of threads running per SM; tens of thousands of threads sharing the L2 cache
- L1 and L2 are small per thread. E.g. at 2048 threads/SM, with 80 SMs: 64 bytes of L1 and 38 bytes of L2 per thread. Running at lower occupancy increases the bytes of cache per thread.
- Shared memory is usually a better option to cache data explicitly: user-managed, no evictions out of your control.
L1, L2 Caches
Why do GPUs have caches?
Caches on GPUs are useful for:
- "Smoothing" irregular, unaligned access patterns
- Caching common data accessed by many threads
- Faster register spills, local memory
- Fast atomics
- Codes that don't use shared memory (naïve code, OpenACC, ...)
Access Patterns
Warps and Sectors
For each warp: how many sectors are needed? Depends on the addresses, the active threads, and the access size. Natural element sizes = 1B, 2B, 4B, 8B, 16B.
[Diagram: warp of 32 threads reading consecutive aligned 4-byte elements: 4 sectors]
Access Patterns
Warps and Sectors
[Diagrams, one access pattern per slide:]
- 8-byte element access (e.g. long long, int2, double, float2), consecutive: 8 sectors
- 4-byte access, addresses shuffled within the same 128 bytes: still 4 sectors
- 4-byte access, unaligned: 5 sectors. 128 bytes requested, 160 bytes read (80% efficiency).
- 4-byte access, unaligned, next warp: with >1 warp per block, the extra sector might be found in L1 or L2
- All threads access the same address: 1 sector
- 4-byte strided access: 32 sectors. 128 bytes requested, 1024 bytes transferred! Using only a few bytes per sector, wasting lots of BW!
Access Patterns
Takeaways
- Know your access patterns
- Use the profiler (metrics, counters) to check how many sectors are moved. Is it what you expect? Is it optimal?
- Using the largest type possible (e.g. float4) maximizes the number of sectors moved per instruction
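To make the takeaways concrete, a sketch contrasting a strided pattern with a coalesced one (the struct and kernels are illustrative):

    struct Particle { float fields[32]; };   // 128 B per element: stride-32 floats

    __global__ void strided(const Particle *p, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = p[i].fields[0];  // 4 useful bytes per 32B sector
    }

    __global__ void coalesced(const float4 *in, float4 *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];           // 16B per thread, whole sectors used
    }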
Shared Memory
- Scratch-pad memory on each SM: a user-managed cache; the HW does not evict data. Data written to SMEM stays there until the user overwrites it.
- Useful for: storing frequently-accessed data to reduce DRAM accesses; communication among threads of a thread block.
- Performance benefits compared to DRAM: 20-40x lower latency, ~15x higher bandwidth.
- Accessed at 4-byte granularity (GMEM granularity is 32B).
Volta Shared Memory
- Default 48KB per thread block, opt in to get 96KB
- 32 banks, 4 bytes wide
- Bandwidth: 4 bytes per bank per clock per SM = 128 bytes per clock per SM
- V100: ~14 TB/s aggregate across 80 SMs
- Mapping addresses to banks: successive 4-byte words go to successive banks
- Bank index computation examples:
    bank = (4B word index) % 32
    bank = ((byte address) / 4) % 32
  An 8B word spans two successive banks.
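The mapping as a two-line sketch (the helper names are mine):

    // 32 banks, 4 bytes wide: successive 4-byte words hit successive banks.
    __device__ int bankOfWord(int wordIndex) { return wordIndex % 32; }
    __device__ int bankOfByte(int byteAddr)  { return (byteAddr / 4) % 32; }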
Logical View of SMEM Banks
[Diagram: 4-byte words laid out across banks 0-31; byte addresses 0-124 fill the first row, 128-252 the second, and so on, so word N lives in bank N % 32]
Shared Memory Instruction Operation
- Threads in a warp provide addresses
- HW determines into which 4-byte words the addresses fall
- Reads (LDS): fetch the data, distribute the requested bytes among the threads. Multi-cast capable.
- Writes (STS): when multiple threads write the same address, one "wins"
Shared Memory Bank Conflicts
A bank conflict occurs when, inside a warp, 2 or more threads access different 4B words in the same bank. Think: 2 or more threads access different "rows" in the same bank.
N-way bank conflict: N threads in a warp conflict.
- Increases latency: each replay adds a few cycles
- Worst case: 32-way conflict -> 31 replays
There is no bank conflict if:
- Several threads access the same 4-byte word
- Several threads access different bytes of the same 4-byte word
No Bank Conflicts
[Diagrams: each thread accesses a 4-byte word in a distinct bank, or several threads read the same word (multi-cast): a single phase, no replays]
2-way Bank Conflict
[Diagram: two threads access different words in the same bank: one replay]
3-way Bank Conflict
[Diagram: three threads access different words in the same bank: two replays]
Bank Conflict Resolution
- 4B or smaller words: the addresses of all threads in a warp are processed in a single phase
- 8B words are accessed in 2 phases: first the addresses of threads 0-15, then threads 16-31
- 16B words are accessed in 4 phases: each phase processes a quarter of a warp
Bank conflicts occur only between threads in the same phase.
8B Words, No Conflicts
[Diagram: each 8B word spans two banks; in each 16-thread phase, every bank pair is touched once: no conflict]
8B Words, 2-way Conflict
[Diagram: phase 1 (threads 0-15) touches the same bank pair twice: 2-way conflict; phase 2 (threads 16-31): no conflict]
Case Study: Matrix Transpose
Staged via SMEM to coalesce GMEM addresses
- 32x32 thread block, single-precision values
- 32x32 array in shared memory
Initial implementation:
- A warp reads a row from GMEM, writes it to a row of SMEM
- Synchronize the threads in the block
- A warp reads a column from SMEM, writes it to a row in GMEM
Case Study: Matrix Transpose
- 32x32 SMEM array (e.g. __shared__ float sm[32][32])
- Warp accesses a row: no conflict
- Warp accesses a column: 32-way conflict, since every element of a column falls in the same bank
[Diagram: the number identifies which warp is accessing data; the color indicates in which bank the data resides]
Case Study: Matrix Transpose
Solution: add a column of padding: 32x33 (e.g. __shared__ float sm[32][33])
- Warp accesses a row or a column: no conflict, because the padding shifts each row by one bank
[Diagram: same layout with the padding column; the number identifies the accessing warp, the color the bank]
Speedup: 1.3x
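A sketch of the padded version (assumes a 32x32 thread block and n a multiple of 32):

    #define TILE 32

    __global__ void transpose(float *out, const float *in, int n) {
        __shared__ float tile[TILE][TILE + 1];            // +1 column of padding

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced GMEM read

        __syncthreads();

        x = blockIdx.y * TILE + threadIdx.x;              // swap block coordinates
        y = blockIdx.x * TILE + threadIdx.y;
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // column read, no conflict
    }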
Summary: Shared Memory
Shared memory is a tremendous resource:
- Very high bandwidth (14 TB/s aggregate)
- Much lower latency than global memory
- Data is programmer-managed; no evictions by hardware
- Volta: up to 96KB of shared memory per thread block
Performance issues to look out for:
- Bank conflicts add latency and reduce throughput
- Use profiling tools to identify bank conflicts
Volta's L1 Cache
Pascal vs. Volta
- Pascal: 24KB, achievable BW = 2.6 TB/s
- Volta: variable size, 32KB ~ 128KB, achievable BW = 14.4 TB/s, and lower latency!
L1 caching covers: global memory, texture, local memory (incl. register spills)
Volta's Unified L1
[Diagram: Pascal has a separate L1/Tex cache and shared memory; Volta unifies them in one 128KB array]
6 possible Smem / L1 splits: 96KB/32KB, 64KB/64KB, 32KB/96KB, 16KB/112KB, 8KB/120KB, 0KB/128KB
How to specify the L1 / Smem split on Volta:

    cudaFuncSetAttribute(MyKernel, cudaFuncAttributePreferredSharedMemoryCarveout, carveout);

The driver usually does a pretty good job at choosing the right split.
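A usage sketch (the kernel and the 0% choice are illustrative; the carveout is a preferred percentage of the 128KB devoted to shared memory, and the driver treats it as a hint):

    __global__ void MyKernel(float *p) { p[threadIdx.x] *= 2.0f; }

    void configure() {
        // This kernel uses no shared memory: ask for 0% Smem, i.e. maximum L1.
        cudaFuncSetAttribute(MyKernel,
                             cudaFuncAttributePreferredSharedMemoryCarveout, 0);
    }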
2D Stencil Experiment
With and without shared memory

    index = iy * nx + ix;
    res = coef[0] * in[index];
    for (i = 1; i <= RADIUS; i++)
        res += coef[i] * (in[index - i] + in[index + i] +
                          in[index - i * nx] + in[index + i * nx]);
    out[index] = res;

With shared memory:
- Load the input array and halos into shared memory
- __syncthreads()
- Compute the stencil from shared memory
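For reference, a 1D simplification of the shared-memory variant (the real 2D code stages halos in y the same way; boundary checks omitted; BLOCK and the names are mine):

    #define RADIUS 4
    #define BLOCK  256

    __global__ void stencilSmem(const float *in, float *out, const float *coef) {
        __shared__ float tile[BLOCK + 2 * RADIUS];
        int g = blockIdx.x * blockDim.x + threadIdx.x;   // global index
        int l = threadIdx.x + RADIUS;                    // local index in tile

        tile[l] = in[g];
        if (threadIdx.x < RADIUS) {                      // load left/right halos
            tile[l - RADIUS] = in[g - RADIUS];
            tile[l + BLOCK]  = in[g + BLOCK];
        }
        __syncthreads();

        float res = coef[0] * tile[l];
        for (int i = 1; i <= RADIUS; i++)
            res += coef[i] * (tile[l - i] + tile[l + i]);
        out[g] = res;
    }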
2D – Small Stencils
Relative speed of the L1 implementation versus the Smem implementation:

    RADIUS=1:  Volta 103%, Pascal 78%
    RADIUS=2:  Volta 102%, Pascal 55%

The L1 implementation is faster than the shared memory one on Volta!
2D – Larger Stencils
Relative speed of the L1 implementation versus the Smem implementation:

    RADIUS=4:   Volta 94%, Pascal 40%
    RADIUS=8:   Volta 95%, Pascal 32%
    RADIUS=16:  Volta 79%, Pascal 29%

The shared memory implementation is always faster for larger stencils.
Constant Memory
- Globally-scoped arrays qualified with __constant__
- Total constant data size limited to 64 KB
- Throughput = 4B per clock per SM (ideal when the entire warp reads the same address)
- Can be used directly in arithmetic instructions (saving registers)
- Example use: stencil coefficients
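The stencil-coefficients use case as a sketch (the host-side copy uses the standard cudaMemcpyToSymbol call):

    #define RADIUS 4
    __constant__ float coef[RADIUS + 1];   // visible to all kernels in this file

    void setCoefficients(const float *h_coef) {
        // Copy once from the host; afterwards every warp reads the same
        // coef[i] per iteration, the ideal broadcast pattern.
        cudaMemcpyToSymbol(coef, h_coef, (RADIUS + 1) * sizeof(float));
    }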
Running Faster
Solving the bottlenecks. A piece of code can be:
- Compute bound (saturating the compute units). Solution: reduce the number of instructions executed, using vector types, intrinsics, Tensor Cores, FMAs.
- Bandwidth bound (saturating memory bandwidth). Solution: reduce the amount of data transferred, with optimal access patterns and lower precision.
- Latency bound. Solution: increase the number of instructions / memory accesses in flight.