

slide-1
SLIDE 1

Multi-GPU: A Hands-on Exercise

Justin Luitjens

NVIDIA - Developer Technologies

slide-2
SLIDE 2

Connection instructions

  • Navigate to nvlabs.qwiklab.com
  • Login or create a new account
  • Select the “Instructor-Led Hands-on Labs” class
  • Find the lab called “Scaling to Multiple GPUs” and click Start
  • After a short wait, lab instance connection information will be shown
  • Please ask the Lab Assistants for help!
slide-3
SLIDE 3

Why Should You Use Multiple GPUs?

  • Compute Faster

— More GPUs = faster time to solution

  • Compute Larger

— More GPUs = more memory for larger problems

  • Compute Cheaper

— More GPUs per node = less overhead in cost, power, and space

slide-4
SLIDE 4

What You Will Learn About Today

  • During today’s lab

— Scalability metrics
— Managing multiple devices
— Communicating between devices

  • Homework

— Communication hiding
— Synchronization

slide-5
SLIDE 5

What Is Not Covered Today

  • CUDA Basics
  • Kernel Optimization
  • MPI Parallelism
  • Multi-GPU debugging
slide-6
SLIDE 6

Scalability Metrics For Success

  • Serial time: $T_s$
  • Parallel time: $T_p$
  • Number of processors: $P$
  • Speedup: how much faster? $S = T_s / T_p$ (ideal: $P$)
  • Efficiency: how efficiently are the processors used? $E = S / P$ (ideal: $1$)
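For example (illustrative numbers, not lab results): if one GPU solves a problem in $T_s = 80$ s and eight GPUs solve it in $T_p = 11$ s, then $S = 80/11 \approx 7.3$ and $E = 7.3/8 \approx 0.91$, i.e. 91% efficiency.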

slide-7
SLIDE 7

Case Study: 2D Laplace Solver

  • Given a 2D grid

— Set every vertex equal to the average of neighboring vertices

  • Repeat until converged
  • Common algorithmic pattern

— Electromagnetism, Astrophysics, Fluid Dynamics, Iterative Solvers

$$A^{(l+1)}(j,k) = \frac{A^{(l)}(j-1,k) + A^{(l)}(j+1,k) + A^{(l)}(j,k-1) + A^{(l)}(j,k+1)}{4}$$

[Figure: 5-point stencil — A(i,j) updated from its four neighbors A(i-1,j), A(i+1,j), A(i,j-1), A(i,j+1)]
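The deck launches simpleLaplaceIter_kernel throughout but never shows its body. Below is a minimal sketch of what such a kernel might look like, assuming row-major storage with a one-cell halo, a compile-time leading dimension LDA, and the IDX macro the later slides use; the placeholder COLS value and the exact indexing are assumptions, not the lab's actual code.

// Hypothetical sketch — the lab's real kernel is not reproduced in this deck
#define COLS 2048                 // grid width (placeholder value)
#define LDA  (COLS + 2)           // leading dimension: width plus two halo cells
#define IDX(row, col, lda) ((size_t)(row) * (lda) + (col))

__global__ void simpleLaplaceIter_kernel(int rows, int cols,
                                         const double *Ain, double *Aout)
{
  int k = blockIdx.x * blockDim.x + threadIdx.x + 1;  // interior column, 1..cols
  int j = blockIdx.y * blockDim.y + threadIdx.y + 1;  // interior row, 1..rows
  if (j <= rows && k <= cols) {
    // Average the four neighbors, per the update rule above
    Aout[IDX(j, k, LDA)] = 0.25 * (Ain[IDX(j - 1, k, LDA)] +
                                   Ain[IDX(j + 1, k, LDA)] +
                                   Ain[IDX(j, k - 1, LDA)] +
                                   Ain[IDX(j, k + 1, LDA)]);
  }
}

Launched with a 1D grid (blockDim.y == 1), the same kernel updates a single row, which is how the Task 3 exterior launches later in the deck use it.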

slide-8
SLIDE 8

Domain Decomposition: 3 Options

  • Tiles: minimizes the surface area/volume ratio

— Communicate less data
— Optimal for bandwidth-bound communication

[Figure: tiled decomposition with halo regions]

slide-9
SLIDE 9

Domain Decomposition: 3 Options

  • Vertical stripes: minimizes the number of neighbors

— Communicate with fewer neighbors
— Optimal for latency-bound communication

  • Contiguous if data is column-major

[Figure: vertical-stripe decomposition with halo regions]

slide-10
SLIDE 10

Domain Decomposition: 3 Options

  • Horizontal stripes: minimizes the number of neighbors

— Communicate with fewer neighbors
— Optimal for latency-bound communication

  • Contiguous if data is row-major

[Figure: horizontal-stripe decomposition with halo regions]

slide-11
SLIDE 11

Grid Indexing

[Figure: grid indexing — rows 0 and ROWS+1 hold the boundary condition & halo region, interior rows run 1..ROWS, and the leading dimension is LDA]

slide-12
SLIDE 12

CUDA multi-GPU APIs

  • All memory allocations and kernel launches occur on the currently active device
  • Simple APIs to get and set the device:

— cudaSetDevice(int d)
— cudaGetDevice(int *d)

  • Memory can be transferred directly between devices

— cudaMemcpyPeer(…)
— cudaMemcpyPeerAsync(…)

  • For direct memory copies you must enable peer access

— cudaDeviceCanAccessPeer(&access,i,j)
— cudaDeviceEnablePeerAccess(…)
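A minimal sketch (not from the lab code) of how the device-management calls fit together:

int numDev = 0;
cudaGetDeviceCount(&numDev);   // number of visible GPUs
for (int d = 0; d < numDev; d++) {
  cudaSetDevice(d);            // subsequent allocations and launches target device d
  int cur;
  cudaGetDevice(&cur);         // confirms the active device: cur == d
}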

slide-13
SLIDE 13

Enabling Peer Access

for(int i = 0; i < GPUS; i++) {
  cudaSetDevice(i);
  for(int j = 0; j < GPUS; j++) {
    if(i != j) {
      int access;
      cudaDeviceCanAccessPeer(&access, i, j);
      if(access) {
        cudaDeviceEnablePeerAccess(j, 0);
      }
    }
  }
}
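Note that peer access is directional: with device i current, cudaDeviceEnablePeerAccess(j, 0) lets GPU i access GPU j's memory, so the double loop enables access for every ordered pair of devices.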

slide-14
SLIDE 14

Hands On Exercise

  • Progressive exercise

— 5 tasks
— We will go until we run out of time
— Each step begins where the last left off
— Feel free to work ahead

slide-15
SLIDE 15

Task 1

  • Replicate the computation on all devices

— CUDA calls apply to the current device

for(int d = 0; d < numDev; d++) {
  cudaSetDevice(d);
  ...
}

  • Use asynchronous memory copies

— Required for devices to operate in parallel

  • Otherwise the host would block until transfers were complete

— Memory must be pinned (see the sketch after this list)
— For now, use stream 0

  • cudaMemcpyAsync(…, 0);
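The host-side allocation is not shown in the deck. A sketch of what pinning the host array might look like, assuming the lab allocates A with cudaMallocHost and that the full grid is (ROWS+2) x LDA doubles (both assumptions; the notebook may use cudaHostRegister on an existing allocation instead):

double *A;                                             // host copy of the grid
size_t hostBytes = (ROWS + 2) * LDA * sizeof(double);  // assumed full-grid size
cudaMallocHost(&A, hostBytes);   // pinned (page-locked) — required for truly asynchronous copies
...
cudaFreeHost(A);                 // release the pinned allocation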
slide-16
SLIDE 16

Allocate & Free Memory on Each Device

double *d_Ain[MAX_DEVICES];    // one array per device
double *d_Aout[MAX_DEVICES];

//allocate device memory
for(int d = 0; d < numDev; d++) {   // loop over each device
  cudaSetDevice(d);
  cudaMalloc(&d_Ain[d], bytes);
  cudaMalloc(&d_Aout[d], bytes);
  cudaCheckError();
}

slide-18
SLIDE 18

Copy Data To Each Device Asynchronously

//copy initial conditions to both buffers
for(int d = 0; d < numDev; d++) {   // loop over each device
  cudaSetDevice(d);
  // asynchronous memory copies
  cudaMemcpyAsync(d_Ain[d], A, bytes, cudaMemcpyHostToDevice, 0);
  cudaMemcpyAsync(d_Aout[d], d_Ain[d], bytes, cudaMemcpyDeviceToDevice, 0);
  cudaCheckError();
}

slide-20
SLIDE 20

Launch the Kernel on Each Device

for(int d = 0; d < numDev; d++) {   // loop over each device
  cudaSetDevice(d);
  simpleLaplaceIter_kernel<<<gridSize,blockSize>>>(ROWS, COLS, d_Ain[d], d_Aout[d]);
  cudaCheckError();
}
// swap input and output arrays
for(int d = 0; d < numDev; d++) {
  std::swap(d_Ain[d], d_Aout[d]);
}
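Swapping the device pointers on the host costs nothing on the GPU: no data moves, and the next iteration simply reads the array that was just written.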

slide-22
SLIDE 22

Copy Data From Each Device Asynchronously

//copy results back to host
for(int d = 0; d < numDev; d++) {   // loop over each device
  cudaSetDevice(d);
  cudaMemcpyAsync(A, d_Ain[d], bytes, cudaMemcpyDeviceToHost, 0);   // copy asynchronously
  cudaCheckError();
}
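Because these copies are asynchronous, the host must synchronize before reading A. The deck elides that step; a sketch of what it presumably looks like:

for (int d = 0; d < numDev; d++) {
  cudaSetDevice(d);
  cudaDeviceSynchronize();   // wait for device d's outstanding work, including the copy
}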

slide-24
SLIDE 24

Free Memory on Each Device

//free device memory
for(int d = 0; d < numDev; d++) {   // loop over each device
  cudaFree(d_Ain[d]);
  cudaFree(d_Aout[d]);
  cudaCheckError();
}

slide-26
SLIDE 26

Task 1: Results

  • Test machine: Dual Socket Xeon X5675 with 8 M2090s
  • No slowdown from adding more devices
  • All GPUs are solving the same problem

— Not a very efficient use of resources

Size        Speedup   Efficiency
2048x2048   1         12.5%
4096x4096   1         12.5%
8192x8192   1         12.5%

slide-27
SLIDE 27

Profiler Supports Multi-GPU

  • Option 1: Run directly in Nsight or NVVP

— CUDA 5.5 and earlier requires an X display

  • Option 2:

— Collect profiles using nvprof:

nvprof -o profile.nvprof ./laplace

— Import profile into NVVP

  • Since we are using IPython we cannot run the profiler now.
slide-28
SLIDE 28

Task 2

  • Assign unique work to each GPU

— Horizontal stripes decomposition
— Copy a subset of the memory to each device
— Copy data back to the appropriate host location

  • Exchange halo regions at each iteration

slide-29
SLIDE 29

Assign Unique Work To Each Device

int rows = ROWS / numDev;   // compute local rows per device
...
// adjust the launch dimensions
dim3 gridSize(ceil((double)(COLS) / blockSize.x),
              ceil((double)(rows) / blockSize.y));
...
simpleLaplaceIter_kernel<<<gridSize,blockSize>>>(rows, COLS, d_Ain[d], d_Aout[d]);
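Note that rows = ROWS / numDev is integer division, so this partitioning assumes ROWS is evenly divisible by the device count; a remainder would leave rows unassigned.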

slide-31
SLIDE 31

Copy Unique Work to the Devices

size_t bytes = (rows + 2) * (LDA) * sizeof(double);   // compute transfer size
...
//copy from host to device
cudaMemcpyAsync(d_Ain[d] + IDX(0, 0, LDA),    // destination index on device d
                A + IDX(d*rows, 0, LDA),      // source index on host
                bytes, cudaMemcpyHostToDevice, 0);

slide-33
SLIDE 33

Copy Results to the Host

int bytes_int = rows * LDA * sizeof(double);   // compute transfer size (interior rows only)
...
//copy from device to host
cudaMemcpyAsync(A + IDX(d*rows + 1, 0, LDA),   // destination index on host
                d_Ain[d] + IDX(1, 0, LDA),     // source index on device d
                bytes_int, cudaMemcpyDeviceToHost, 0);

slide-35
SLIDE 35

Exchange Halos After Kernel Launch

for(int d = 0; d < numDev; d++) {
  simpleLaplaceIter_kernel<<<gridSize,blockSize>>>(...);
}
for(int d = 0; d < numDev; d++) {
  //Grab lower boundary
  //Grab upper boundary
  cudaCheckError();
}

slide-36
SLIDE 36

Copy Lower Boundary

//Grab lower boundary
if(d > 0)
  cudaMemcpyPeerAsync(d_Aout[d]+IDX(0,0,LDA), d,        // destination: halo row 0 on GPU d
                      d_Aout[d-1]+IDX(rows,0,LDA), d-1, // source: row rows on GPU d-1
                      LDA*sizeof(double), 0);
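For reference, the signature here is cudaMemcpyPeerAsync(dst, dstDevice, src, srcDevice, count, stream): each call moves one full row of LDA doubles between neighboring devices.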

slide-38
SLIDE 38

Copy Upper Boundary

//Grab upper boundary
if(d < numDev-1)
  cudaMemcpyPeerAsync(d_Aout[d]+IDX(rows+1,0,LDA), d,  // destination: halo row rows+1 on GPU d
                      d_Aout[d+1]+IDX(1,0,LDA), d+1,   // source: row 1 on GPU d+1
                      LDA*sizeof(double), 0);

slide-40
SLIDE 40

Task 2: Results

  • Test machine: Dual Socket Xeon X5675 with 8 M2090s
  • Decent speedups have been achieved
  • Efficiency increases with problem size

Size        Speedup   Efficiency
2048x2048   3.30      41%
4096x4096   5.67      71%
8192x8192   7.28      91%

slide-41
SLIDE 41

Advanced Multi-GPU

  • Communication Hiding

— Overlap communication & computation
— Tasks 3–5
— Due to time, this part won’t be interactive

  • The notebook has the steps in it if you want to try keeping up
slide-42
SLIDE 42

Current Implementation

  • Compute Laplace
  • Exchange Halos

[Figure: execution timeline — on each device, the kernel and the halo copy run back to back, serialized in stream 0]

slide-43
SLIDE 43

Communication Hiding

  • Compute Exterior Async
  • Compute Interior Async
  • Exchange Halos Async

[Figure: execution timeline — the halo copies overlap with the interior kernel on each device]

slide-44
SLIDE 44

Task 3

  • Separate the computation into interior & exterior

— Use a 1D kernel for the boundary rows:

dim3 blockSize_e(128);
dim3 gridSize_e(ceil((double)(COLS) / blockSize_e.x));

— Call the existing kernel three times:

  • Once for the lower boundary
  • Once for the upper boundary
  • Once for the interior region
slide-45
SLIDE 45

Add Exterior Low Kernel

//Exterior Low
simpleLaplaceIter_kernel<<<gridSize_e,blockSize_e>>>
    (1, COLS, d_Ain[d], d_Aout[d]);

[Figure: the exterior-low launch updates local row 1 (row width LDA)]

slide-46
SLIDE 46

Add Exterior High Kernel

//Exterior High
simpleLaplaceIter_kernel<<<gridSize_e,blockSize_e>>>
    (1, COLS, d_Ain[d]+IDX(rows-1,0,LDA), d_Aout[d]+IDX(rows-1,0,LDA));

[Figure: the exterior-high launch updates local row rows, via a base offset of rows-1]

slide-48
SLIDE 48

Add Interior Kernel

//Interior
simpleLaplaceIter_kernel<<<gridSize,blockSize>>>(
    rows-2, COLS, d_Ain[d]+IDX(1,0,LDA), d_Aout[d]+IDX(1,0,LDA));

[Figure: the interior launch updates local rows 2..rows-1, via a base offset of 1]

slide-50
SLIDE 50

Task 3: Results

  • Correct results
  • About the same speed as Task 2
slide-51
SLIDE 51

Task 4

  • Create 2 streams

— One for kernels
— One for transfers

  • Place kernels in one stream
  • Place transfers in the other stream
slide-52
SLIDE 52

Create Streams

cudaStream_t sTransfer[MAX_DEVICES], sKernels[MAX_DEVICES];
...
cudaStreamCreate(&sTransfer[d]);
cudaStreamCreate(&sKernels[d]);
...
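A stream is created on whichever device is current at creation time, so these cudaStreamCreate calls presumably sit inside the per-device cudaSetDevice(d) loop elided above.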

slide-53
SLIDE 53

Destroy Streams

...
cudaStreamDestroy(sTransfer[d]);
cudaStreamDestroy(sKernels[d]);
...

slide-54
SLIDE 54

Launch Kernels Into Stream

//Exterior Low
simpleLaplaceIter_kernel<<<..., 0, sKernels[d]>>>(...);
//Exterior High
simpleLaplaceIter_kernel<<<..., 0, sKernels[d]>>>(...);
//Interior
simpleLaplaceIter_kernel<<<..., 0, sKernels[d]>>>(...);
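In the launch configuration <<<grid, block, 0, sKernels[d]>>>, the third argument is the dynamic shared-memory size (0 bytes here) and the fourth is the stream the kernel is enqueued into.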

slide-56
SLIDE 56

Launch Memory Copies Into Stream

for(int d = 0; d < numDev; d++) {
  //Grab lower bc
  if(d > 0) {
    cudaMemcpyPeerAsync(..., sTransfer[d]);
  }
  //Grab upper bc
  if(d < numDev-1) {
    cudaMemcpyPeerAsync(..., sTransfer[d]);
  }
  cudaCheckError();
}

slide-58
SLIDE 58

Task 4: Results

  • Slightly faster than before
  • Incorrect results

Why? Race conditions related to memory copies

slide-59
SLIDE 59

Task 5

  • Add synchronization to fix race conditions

— Can be done with 1 event
— 2 race conditions:

  • Sending data before exterior region is completed
  • Computing next iteration before communication has completed
slide-60
SLIDE 60

Create and Destroy Event

cudaEvent_t event[MAX_DEVICES];
...
cudaEventCreateWithFlags(&event[d], cudaEventDisableTiming);
...
cudaEventDestroy(event[d]);
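cudaEventDisableTiming skips timestamp recording, making the event cheaper when it is used purely for synchronization, as here.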

slide-61
SLIDE 61

Fix First Race Condition

for(int d = 0; d < numDev; d++) {
  cudaSetDevice(d);
  //Exterior Low
  simpleLaplaceIter_kernel<<<..., sKernels[d]>>>(...);
  //Exterior High
  simpleLaplaceIter_kernel<<<..., sKernels[d]>>>(...);
  cudaEventRecord(event[d], sKernels[d]);   // event completes once both exterior kernels finish
  //Interior
  simpleLaplaceIter_kernel<<<..., sKernels[d]>>>(...);
  cudaCheckError();
}
for(int d = 0; d < numDev; d++) {
  cudaEventSynchronize(event[d]);
}
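The event is recorded after the exterior kernels but before the interior one, so cudaEventSynchronize returns as soon as the halo rows are ready: the peer copies can then start while the much larger interior kernel is still running.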

slide-62
SLIDE 62

Fix Second Race Condition

//Grab lower bc
if(d > 0) {
  cudaMemcpyPeerAsync(...);
  cudaEventRecord(event[d], sTransfer[d]);
  cudaStreamWaitEvent(sKernels[d], event[d], 0);
}
//Grab upper bc
if(d < numDev-1) {
  cudaMemcpyPeerAsync(d_Aout[d]+IDX(rows+1,0,LDA), d,
                      d_Aout[d+1]+IDX(1,0,LDA), d+1,
                      bc_size*sizeof(double), sTransfer[d]);
  cudaEventRecord(event[d], sTransfer[d]);
  cudaStreamWaitEvent(sKernels[d], event[d], 0);
}
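cudaStreamWaitEvent makes all work subsequently enqueued in sKernels[d] wait for the recorded copy without blocking the host, so the next iteration's kernels cannot start until the halo transfer into d_Aout[d] has completed.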

slide-63
SLIDE 63

Task 5: Final Results

  • Test machine: Dual Socket Xeon X5675 with 8 M2090s
  • Efficiency increased
  • Perfect scaling at large problem sizes!

Size        Speedup   Efficiency
2048x2048   3.78      47%
4096x4096   7.11      96%
8192x8192   8.29      104%

slide-64
SLIDE 64

Start Using multi-GPU Today

  • Use multi-GPU to

— Compute Faster, Larger, and Cheaper

GPU Test Drive

http://www.nvidia.com/GPUTestDrive

Feel free to download the IPython notebook