COMP 633 - Parallel Computing, Lecture 13, September 24, 2020
Computational Accelerators: CUDA GPU programming (Prins)

Sample midterm problem: all-pairs n-body calculation. In the PA1(a) sequential n-body simulation, we
– GPU application areas include compression and computational photography
CPUs: latency-oriented design (low-latency cores)
– Reduced operation latency
– Large caches convert long-latency memory accesses into short-latency cache accesses
– Instruction dependency analysis and superscalar operation
– Branch prediction for reduced branch latency
– Data forwarding for reduced data latency
[Diagram: CPU with large control logic and cache, a few ALUs, and DRAM]

GPUs: throughput-oriented design (high-throughput cores)
– High-bandwidth main memory
– No branch prediction, no data forwarding
– Many high-latency ALUs, heavily pipelined for high throughput
– Threading logic and thread state (registers) to keep the ALUs busy
[Diagram: GPU with many small ALUs, per-thread registers, and DRAM]
Computational accelerators: generic term for compute-intensive attached devices.
Traditional drawbacks:
– not general purpose, only good for some problems
– difficult to program
– interface to host system can be a bottleneck
– low-precision arithmetic (this is now a feature!)
GPUs as accelerators:
– cheap
– increasingly general-purpose and simpler to program
– improving host interfaces and performance
– IEEE double precision
– very high compute and local memory performance
Accelerators in top supercomputers:
– NSC China Tianhe-2: 48,000 Intel Xeon Phi
– ORNL USA Summit: 27,600 Nvidia Tesla V100
Recent trends:
– simplified access from host
– improved integration of multiple GPUs
– low- and mixed-precision FP arithmetic
[Diagram: accelerators — an Nvidia Tesla V100 and an Intel Xeon Phi 5110P — attached to a dual-socket Intel Xeon E5 v3 host over a 16 GB/s bidirectional interface]
[Diagram: a device contains streaming multiprocessors SM 0 … SM N-1; each SM has one instruction unit shared by 32 processors, Proc 0, Proc 1, …, Proc 31]
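One instruction unit driving 32 processors is why the warp (32 threads) is the unit of SIMT execution: all threads of a warp issue the same instruction together. A minimal sketch of how the warp shows up in device code (the kernel and array names are illustrative, not from the slides):

```cuda
// Each thread records which warp it belongs to. warpSize is a
// built-in device constant (32 on current hardware); threads whose
// linear ids fall in the same group of 32 share one instruction stream.
__global__ void warp_of_thread(int *warp_id, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        warp_id[tid] = tid / warpSize;   // threads 0..31 -> warp 0, 32..63 -> warp 1, ...
}
```

Divergent branches within a warp serialize on this shared instruction unit, which is why the slides stress that GPUs omit branch prediction.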
[Diagram: CUDA memory hierarchy — each SM has per-processor registers, shared memory with L1 cache, a constant cache, and a texture cache; all SMs (SM 0 … SM N-1) access common device memory; host memory is separate from device memory]
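Because host memory and device memory are separate address spaces, data must be allocated on the device and copied across explicitly via the CUDA runtime API. A minimal sketch (the function name `roundtrip` is illustrative):

```cuda
#include <cuda_runtime.h>

// Host and device memory are distinct: allocate on the device,
// copy input across the host interface, compute, copy results back.
void roundtrip(float *h_data, int n) {
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));            // device-memory allocation
    cudaMemcpy(d_data, h_data, n * sizeof(float),
               cudaMemcpyHostToDevice);                // host -> device
    // ... launch kernels that operate on d_data ...
    cudaMemcpy(h_data, d_data, n * sizeof(float),
               cudaMemcpyDeviceToHost);                // device -> host
    cudaFree(d_data);
}
```

These transfers cross the host interface shown earlier (16 GB/s bidirectional in the example system), so minimizing them is a first-order performance concern.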
[Diagram: the host launches Kernel 1 on Grid 1, a 3×2 array of thread blocks Block (0,0) … Block (2,1), then Kernel 2 on Grid 2; Block (1,1) is expanded to show its 5×3 array of threads, Thread (0,0) … Thread (4,2)]
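The grid/block/thread hierarchy in the diagram maps directly onto the kernel launch syntax: the host chooses a grid of blocks and a block of threads, and each thread recovers its global position from the built-in coordinates. A minimal sketch (the kernel name `scale` and sizes are illustrative):

```cuda
#include <cuda_runtime.h>

// One thread per array element: blockIdx, blockDim, and threadIdx
// are the built-in coordinates from the grid/block/thread hierarchy.
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard: the grid may overshoot n
        x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    // Launch configuration: enough 256-thread blocks to cover n elements.
    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);
    scale<<<grid, block>>>(d_x, 2.0f, n);

    cudaDeviceSynchronize();   // kernel launches are asynchronous
    cudaFree(d_x);
    return 0;
}
```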
Declaration                                 Memory     Scope    Lifetime
__device__ __local__    int LocalVar;       local      thread   thread
__device__ __shared__   int SharedVar;      shared     block    block
__device__              int GlobalVar;      global     grid     application
__device__ __constant__ int ConstantVar;    constant   grid     application
adapted from: David Kirk/NVIDIA and Wen-mei W. Hwu, Fall 2007 ECE 498AL1
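A sketch showing the qualifiers from the table in use; the kernel and variable names are illustrative, and the block size is assumed to be 256 threads:

```cuda
// Constant memory: grid scope, application lifetime, cached on each SM.
__constant__ float coeff;

__global__ void apply(float *out, const float *in, int n) {
    __shared__ float tile[256];        // shared memory: one copy per block, block lifetime
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = in[i];     // stage global-memory data in shared memory
        __syncthreads();               // make the tile visible to the whole block
        // Automatic scalars such as i live in per-thread registers.
        out[i] = tile[threadIdx.x] * coeff;
    }
}
```

Global-memory arrays such as `in` and `out` are allocated with `cudaMalloc`, and `coeff` would be set from the host with `cudaMemcpyToSymbol`.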