SLIDE 1

Introduction to GPU Programming

http://www.ncsa.illinois.edu/People/kindr/projects/hpca/files/singapore_p1.pdf
http://developer.download.nvidia.com/CUDA/training/NVIDIA_GPU_Computing_Webinars_CUDA_Memory_Optimization.pdf

Mubashir Adnan Qureshi

SLIDE 2

Tutorial Goals

  • NVIDIA GPU architecture
  • NVIDIA GPU application development flow
  • Write and run simple NVIDIA GPU kernels in CUDA
  • Be aware of performance-limiting factors and understand performance tuning strategies

SLIDE 3

Introduction

  • Why use Graphics Processing Units (GPUs) for general-purpose computing
  • Modern GPU architecture
    – NVIDIA
  • GPU programming overview
    – CUDA C
    – OpenCL

SLIDE 4

GPU vs. CPU Silicon Use

Graph is courtesy of NVIDIA

SLIDE 5

NVIDIA GPU Architecture

  • N multiprocessors called SMs
  • Each has M cores called SPs
  • SIMD
  • Same instruction executed on the SPs
  • Device memory shared across all SMs
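
As a rough sketch (not part of the original slides), the number of SMs and the device memory size can be queried at runtime with the CUDA runtime API; device index 0 is an assumption:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0 assumed
    printf("Multiprocessors (SMs): %d\n", prop.multiProcessorCount);
    printf("Device memory:         %zu MB\n", prop.totalGlobalMem >> 20);
    return 0;
}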

Figure is courtesy of NVIDIA

SLIDE 6

NVIDIA GeForce 9400M G GPU

  • 16 streaming processors arranged as 2 streaming multiprocessors
  • At 0.8 GHz this provides
    – 54 GFLOPS in single-precision (SP)
  • 128-bit interface to off-chip GDDR3 memory
    – 21 GB/s bandwidth

[Block diagram: a TPC with a geometry controller, SMC, and two SMs; each SM has shared memory, two SFUs, eight SPs, a constant cache, MT issue, and an instruction cache; texture units and texture L1 connect through a 128-bit interconnect to L2, ROPs, and DRAM]

SLIDE 7

NVIDIA Tesla C1060 GPU

  • 240 streaming processors arranged as 30 streaming multiprocessors
  • At 1.3 GHz this provides
    – 1 TFLOPS SP
    – 86.4 GFLOPS DP
  • 512-bit interface to off-chip GDDR3 memory
    – 102 GB/s bandwidth

[Block diagram: TPCs 1 through 10, each with a geometry controller, SMC, and three SMs (shared memory, two SFUs, eight SPs, constant cache, MT issue, instruction cache) plus texture units and texture L1; ROPs and L2 connect through a 512-bit memory interconnect to eight DRAM partitions]

SLIDE 8

NVIDIA Tesla S1070 Computing Server

  • 4 T10 GPUs

[Diagram: four Tesla GPUs, each with 4 GB of GDDR3 SDRAM, connected through two NVIDIA switches to two PCIe x16 host links, plus power supply, thermal management, and system monitoring]

Graph is courtesy of NVIDIA

SLIDE 9

GPU Use/Programming

  • GPU libraries
    – NVIDIA's CUDA BLAS and FFT libraries
    – Many 3rd-party libraries
  • Low-abstraction, lightweight GPU programming toolkits
    – CUDA C
    – OpenCL

SLIDE 10

nvcc

  • Any source file containing CUDA C language extensions must be compiled with nvcc
  • nvcc is a compiler driver that invokes many other tools to accomplish the job
  • Basic nvcc usage
    – nvcc <filename>.cu [-o <executable>]
      • Builds in release mode
    – nvcc -deviceemu <filename>.cu
      • Builds in device emulation mode (all code runs on the CPU)
    – nvprof <executable>
      • Profiles the code

SLIDE 11

Anatomy of a GPU Application

  • Host side
  • Device side

SLIDE 12

Reference CPU Version

void vecAdd(int N, float* A, float* B, float* C)
{
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}

int main(int argc, char **argv)
{
    int N = 16384;                               // default vector size

    float *A = (float*)malloc(N * sizeof(float));
    float *B = (float*)malloc(N * sizeof(float));
    float *C = (float*)malloc(N * sizeof(float));

    vecAdd(N, A, B, C);                          // call compute kernel

    free(A);
    free(B);
    free(C);
}

(Callouts: computational kernel, memory allocation, kernel invocation, memory de-allocation)


SLIDE 13

Adding GPU support

[Diagram: host CPU with host memory holding A, B, C; GPU card with device memory holding gA, gB, gC]

SLIDE 14

Memory Spaces

  • CPU and GPU have separate memory spaces
    – Data is moved across the PCIe bus
    – Use functions to allocate/set/copy memory on the GPU
  • Host (CPU) manages device (GPU) memory
    – cudaMalloc(void** pointer, size_t nbytes)
    – cudaFree(void* pointer)
    – cudaMemcpy(void* dst, void* src, size_t nbytes, enum cudaMemcpyKind direction)
      • returns after the copy is complete
      • blocks the CPU thread until all bytes have been copied
      • does not start copying until previous CUDA calls complete
    – enum cudaMemcpyKind
      • cudaMemcpyHostToDevice
      • cudaMemcpyDeviceToHost
      • cudaMemcpyDeviceToDevice
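
As a small sketch (not on the original slides), each of these calls returns a cudaError_t that is worth checking on the host; N and A here stand for the vector size and host array used in the example on the following slides:

float *devPtrA = NULL;
cudaError_t err = cudaMalloc((void**)&devPtrA, N * sizeof(float));
if (err != cudaSuccess)
    printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));

err = cudaMemcpy(devPtrA, A, N * sizeof(float), cudaMemcpyHostToDevice);
if (err != cudaSuccess)
    printf("cudaMemcpy failed: %s\n", cudaGetErrorString(err));

cudaFree(devPtrA);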

SLIDE 15

Adding GPU support

int main(int argc, char **argv)
{
    int N = 16384;                               // default vector size

    float *A = (float*)malloc(N * sizeof(float));
    float *B = (float*)malloc(N * sizeof(float));
    float *C = (float*)malloc(N * sizeof(float));

    float *devPtrA, *devPtrB, *devPtrC;
    cudaMalloc((void**)&devPtrA, N * sizeof(float));
    cudaMalloc((void**)&devPtrB, N * sizeof(float));
    cudaMalloc((void**)&devPtrC, N * sizeof(float));

    cudaMemcpy(devPtrA, A, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(devPtrB, B, N * sizeof(float), cudaMemcpyHostToDevice);

(Callouts: memory allocation on the GPU card; copy data from the CPU (host) memory to the GPU (device) memory)


SLIDE 16

Adding GPU support

    vecAdd<<<N/512, 512>>>(devPtrA, devPtrB, devPtrC);

    cudaMemcpy(C, devPtrC, N * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(devPtrA);
    cudaFree(devPtrB);
    cudaFree(devPtrC);

    free(A);
    free(B);
    free(C);
}

(Callouts: kernel invocation; copy results from device memory to host memory; device memory de-allocation)


SLIDE 17

GPU Kernel

  • CPU version

void vecAdd(int N, float* A, float* B, float* C)
{
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}

  • GPU version

__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    C[i] = A[i] + B[i];
}
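
The GPU version above assumes the launch creates exactly N threads. A common variant (a sketch, not the slide's code) passes N and guards the access so the last block may contain extra threads:

__global__ void vecAdd(int N, float* A, float* B, float* C)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)                      // extra threads in the last block do nothing
        C[i] = A[i] + B[i];
}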

SLIDE 18

CUDA Programming Model

  • A CUDA kernel is executed by an array of threads
    – All threads run the same code (SIMD)
    – Each thread has an ID that it uses to compute memory addresses and make control decisions
  • Threads are arranged as a grid of thread blocks
    – Threads within a block have access to a segment of shared memory

…
float x = input[threadID];
float y = func(x);
output[threadID] = y;

[Diagram: a grid of thread blocks 0, 1, …, N-1, each with its own shared memory; threadID identifies each thread]

SLIDE 19

Kernel Invocation Syntax

grid & thread block dimensionality

vecAdd<<<32, 512>>>(devPtrA, devPtrB, devPtrC); int i = blockIdx.x * blockDim.x + threadIdx.x;

thread ID within a thread block number of threads per block block ID within a grid
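
When N is not an exact multiple of the block size, the grid size is usually rounded up; a minimal sketch, assuming the bounds-checked kernel variant shown earlier:

int blockSize = 512;                               // threads per block
int gridSize  = (N + blockSize - 1) / blockSize;   // ceiling division: enough blocks to cover N
vecAdd<<<gridSize, blockSize>>>(N, devPtrA, devPtrB, devPtrC);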


[Diagram: a grid of thread blocks 0, 1, …, N-1, each with its own shared memory]

SLIDE 20

Mapping Threads to the Hardware

  • Blocks of threads are transparently assigned to SMs
    – A block of threads executes on one SM and does not migrate
    – Several blocks can reside concurrently on one SM
  • Blocks must be independent
    – Any possible interleaving of blocks should be valid
    – Blocks may coordinate but not synchronize
    – Thread blocks can run in any order

[Diagram: a kernel grid of eight blocks (Block 0–Block 7) scheduled over time onto two different devices; each block can execute in any order relative to other blocks]

Slide is courtesy of NVIDIA

SLIDE 21

GPU Memory Hierarchy

  • Global (device) memory
    – Accessible by all threads as well as the host (CPU)
    – Data lifetime is from allocation to deallocation

[Diagram: host memory and the memories of device 0 and device 1, connected by cudaMemcpy()]


SLIDE 22

GPU Memory Hierarchy

  • Global (device) memory

[Diagram: thread blocks 0 through N-1 of Kernel 0 and Kernel 1 all accessing the per-device global memory]


SLIDE 23

GPU Memory Hierarchy

  • Local storage
    – Each thread has its own local storage
    – Mostly registers (managed by the compiler)
    – Data lifetime = thread lifetime
  • Shared memory
    – Each thread block has its own shared memory
      • Accessible only by threads within that block
    – Data lifetime = block lifetime
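
A minimal sketch (not from the slides) of declaring and using per-block shared memory; the kernel name is illustrative, and it assumes the kernel is launched with exactly TILE threads per block and a grid that covers the array:

#define TILE 256

__global__ void reverseTiles(float* data)
{
    __shared__ float tile[TILE];              // one copy per thread block, block lifetime
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = data[i];              // each thread stages one element
    __syncthreads();                          // wait until the whole block has written

    data[i] = tile[TILE - 1 - threadIdx.x];   // read an element written by another thread
}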

[Diagram: a thread block with per-block shared memory and per-thread local memory]


SLIDE 24

GPU Memory Hierarchy

[Diagram: host (CPU, chipset, DRAM) connected to the device; device DRAM holds local, global, constant, and texture memory; each GPU multiprocessor has registers, shared memory, and constant/texture caches]

Memory     Location   Cached   Access   Scope                    Lifetime
Register   On-chip    N/A      R/W      One thread               Thread
Local      Off-chip   No       R/W      One thread               Thread
Shared     On-chip    N/A      R/W      All threads in a block   Block
Global     Off-chip   No       R/W      All threads + host       Application
Constant   Off-chip   Yes      R        All threads + host       Application
Texture    Off-chip   Yes      R        All threads + host       Application
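
As a sketch of how these spaces show up in CUDA C source (the variable and kernel names are illustrative, and the launch is assumed to use at most 128 threads per block with a grid that exactly covers the array):

__constant__ float coeff[16];                 // constant memory: read-only in kernels, cached
__device__   float scale = 2.0f;              // global (device) memory, application lifetime

__global__ void scaleKernel(float* out)       // out points to global memory from cudaMalloc
{
    __shared__ float buf[128];                // shared memory: one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // i typically lives in a register

    buf[threadIdx.x] = out[i] * scale + coeff[0];
    __syncthreads();
    out[i] = buf[threadIdx.x];
}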

SLIDE 25

GPU Kernel

  • CPU version

void vecAdd(int N, float* A, float* B, float* C)
{
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}

  • GPU version

__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    C[i] = A[i] + B[i];
}

SLIDE 26

Optimizing Algorithms for GPUs

  • Maximize independent parallelism
  • Maximize arithmetic intensity (math/bandwidth)
  • Sometimes it's better to recompute than to cache
    – The GPU spends its transistors on ALUs, not memory
  • Do more computation on the GPU to avoid costly data transfers
    – Even low-parallelism computations can sometimes be faster than transferring back and forth to the host

SLIDE 27

Optimize Memory Access

  • Coalesced vs. non-coalesced = order of magnitude
    – Global/local device memory
    – Contiguous threads accessing contiguous memory
  • Shared memory
    – Hundreds of times faster than global memory
    – Threads can cooperate via shared memory
    – Use it to avoid non-coalesced access (see the sketch below)
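
A sketch of the access-pattern difference (kernel names are illustrative; the grid is assumed to cover the arrays): in the coalesced case, consecutive threads touch consecutive addresses, while the strided case spreads neighbouring threads across distant addresses.

// Coalesced: thread i reads element i, so neighbouring threads hit neighbouring addresses
__global__ void copyCoalesced(float* out, const float* in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// Non-coalesced: neighbouring threads read addresses that are `stride` elements apart
__global__ void copyStrided(float* out, const float* in, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * stride];
}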