Introduction to GPU Programming
Mubashir Adnan Qureshi

References:
http://www.ncsa.illinois.edu/People/kindr/projects/hpca/files/singapore_p1.pdf
http://developer.download.nvidia.com/CUDA/training/NVIDIA_GPU_Computing_Webinars_CUDA_Memory_Optimization.pdf
[Graph courtesy of NVIDIA]
[Figure courtesy of NVIDIA]
[Figure: one texture/processor cluster (TPC): a geometry controller and SMC feeding two SMs, each SM with an I-cache, MT issue, C-cache, 8 SPs, 2 SFUs, and shared memory; texture units with a texture L1 cache; a 128-bit interconnect to L2/ROP partitions and DRAM]
[Figure: a full GPU: TPC 1 through TPC 10, each with a geometry controller, SMC, and three SMs (each SM with I-cache, MT issue, C-cache, 8 SPs, 2 SFUs, and shared memory), plus texture units with a texture L1; ROP/L2 partitions on a 512-bit memory interconnect to eight DRAM channels]
[Figure: multi-GPU server: four Tesla GPUs, each with 4 GB of GDDR3 SDRAM, attached through two NVIDIA switches to two PCIe x16 links; with power supply, thermal management, and system monitoring; graph courtesy of NVIDIA]
#include <stdlib.h>

// Computational kernel
void vecAdd(int N, float* A, float* B, float* C)
{
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}

int main(int argc, char **argv)
{
    int N = 16384; // default vector size

    // Memory allocation
    float *A = (float*)malloc(N * sizeof(float));
    float *B = (float*)malloc(N * sizeof(float));
    float *C = (float*)malloc(N * sizeof(float));

    // Kernel invocation
    vecAdd(N, A, B, C); // call compute kernel

    // Memory de-allocation
    free(A);
    free(B);
    free(C);
}
[Figure: host CPU with host memory holding vectors A, B, C; GPU card with device memory holding copies gA, gB, gC]
– Data is moved across the PCIe bus
– Use functions to allocate/set/copy memory on the GPU:
  – cudaMalloc(void** pointer, size_t nbytes)
  – cudaFree(void* pointer)
  – cudaMemcpy(void* dst, void* src, size_t nbytes, enum cudaMemcpyKind direction)
– enum cudaMemcpyKind: cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice
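Each of these runtime calls returns a cudaError_t. The slide code ignores it for brevity; a minimal sketch of the usual check (shown here as an assumed pattern, inside a host function) is:

#include <stdio.h>
#include <cuda_runtime.h>

// cudaError_t and cudaGetErrorString are standard CUDA runtime API.
float *devPtrA;
cudaError_t err = cudaMalloc((void**)&devPtrA, 16384 * sizeof(float));
if (err != cudaSuccess)
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));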
int main(int argc, char **argv)
{
    int N = 16384; // default vector size
    float *A = (float*)malloc(N * sizeof(float));
    float *B = (float*)malloc(N * sizeof(float));
    float *C = (float*)malloc(N * sizeof(float));

    // Memory allocation on the GPU (device)
    float *devPtrA, *devPtrB, *devPtrC;
    cudaMalloc((void**)&devPtrA, N * sizeof(float));
    cudaMalloc((void**)&devPtrB, N * sizeof(float));
    cudaMalloc((void**)&devPtrC, N * sizeof(float));

    // Copy data from the CPU (host) memory to the GPU (device) memory
    cudaMemcpy(devPtrA, A, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(devPtrB, B, N * sizeof(float), cudaMemcpyHostToDevice);
    // Kernel invocation
    vecAdd<<<N/512, 512>>>(devPtrA, devPtrB, devPtrC);

    // Copy results from the device memory to the host memory
    cudaMemcpy(C, devPtrC, N * sizeof(float), cudaMemcpyDeviceToHost);

    // Device memory de-allocation
    cudaFree(devPtrA);
    cudaFree(devPtrB);
    cudaFree(devPtrC);

    free(A);
    free(B);
    free(C);
}
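The extracted slides do not show the device-side vecAdd that this launch calls; a minimal sketch matching the launch configuration above (N/512 blocks of 512 threads, replacing the CPU loop) would be:

// Minimal sketch of the GPU kernel assumed by the launch above:
// each thread computes one element of the result vector.
// Assumes N is a multiple of the block size (16384/512 = 32 blocks here).
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
    C[i] = A[i] + B[i];
}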
…
float x = input[threadID];
float y = func(x);
…
[Figure: each thread is identified by a threadID; a grid consists of thread blocks 0, 1, …, N-1, each block with its own shared memory]
grid & thread block dimensionality
thread ID within a thread block number of threads per block block ID within a grid
19
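For illustration (an assumed fragment, with myKernel and devPtr as hypothetical names), the same built-ins extend to 2-D launches:

// Host side: a 2-D launch configuration
dim3 grid(16, 16);   // gridDim.x = 16, gridDim.y = 16
dim3 block(8, 8);    // blockDim.x = 8,  blockDim.y = 8
myKernel<<<grid, block>>>(devPtr);

// Device side: combining the built-ins into global coordinates
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;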
Thread blocks are assigned to SMs:
– A block of threads executes on one SM and does not migrate
– Several blocks can reside concurrently on one SM
– Any possible interleaving of blocks should be valid
– Blocks may coordinate but not synchronize
– Thread blocks can run in any order (see the sketch below)
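Because blocks cannot synchronize with each other, grid-wide synchronization is expressed by splitting work into separate kernel launches (step1, step2, numBlocks, threadsPerBlock, and devData are hypothetical names):

// __syncthreads() synchronizes only the threads within one block.
// To synchronize across all blocks, end the kernel: launches issued
// to the same (default) stream execute in order.
step1<<<numBlocks, threadsPerBlock>>>(devData); // all blocks finish...
step2<<<numBlocks, threadsPerBlock>>>(devData); // ...before this starts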
[Figure: a kernel grid of thread blocks 0–7 scheduled onto two devices with different numbers of SMs; each block can execute in any order relative to the others as time advances; slide courtesy of NVIDIA]
[Figure: host memory and per-device memories (Device 0, Device 1), with data moved between them via cudaMemcpy()]
[Figure: Kernel 0 and Kernel 1, each consisting of thread blocks 0, 1, …, N-1, both accessing the same per-device global memory]
– Each thread has its own local storage
  – Mostly registers (managed by the compiler)
  – Data lifetime = thread lifetime
– Each thread block has its own shared memory
  – Accessible by all threads within that block
  – Data lifetime = block lifetime
[Figure: a thread block with per-block shared memory and per-thread local memory]
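A minimal sketch of per-block shared memory in use (an assumed example, not from the slides; reverseBlock is a hypothetical name):

// Each block stages its elements in shared memory, synchronizes,
// then writes them back in reverse order within the block.
__global__ void reverseBlock(float* d_in, float* d_out)
{
    __shared__ float tile[512];            // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = d_in[i];           // each thread loads one element
    __syncthreads();                       // wait for the whole block to load

    d_out[i] = tile[blockDim.x - 1 - threadIdx.x];
}
// Assumes a launch with 512 threads per block, e.g.:
// reverseBlock<<<N/512, 512>>>(devIn, devOut);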
[Figure: host side: CPU, chipset, and DRAM; device side: device DRAM holding local, global, constant, and texture memory; on the GPU, each multiprocessor has registers, shared memory, and constant/texture caches]
Memory    Location  Cached  Access  Scope                   Lifetime
Register  On-chip   N/A     R/W     One thread              Thread
Local     Off-chip  No      R/W     One thread              Thread
Shared    On-chip   N/A     R/W     All threads in a block  Block
Global    Off-chip  No      R/W     All threads + host      Application
Constant  Off-chip  Yes     R       All threads + host      Application
Texture   Off-chip  Yes     R       All threads + host      Application
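To make the table concrete, a sketch of how the main spaces are declared in CUDA C (an assumed example; coeffs, scale, and gData are illustrative names):

__constant__ float coeffs[16];       // constant memory: cached, read-only in kernels

__global__ void scale(float* gData)  // gData points to global memory
{
    __shared__ float tile[256];      // shared memory: one copy per block
    int i = threadIdx.x;             // scalar locals usually live in registers
    tile[i] = gData[i] * coeffs[0];
    gData[i] = tile[i];
    // large per-thread arrays and register spills go to off-chip local memory
}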