Multi-GPU: A Hands-on Exercise
Justin Luitjens, NVIDIA - Developer Technologies
Connection Instructions
Navigate to nvlabs.qwiklab.com
Login or create a new account
Select the “Instructor-Led Hands-on Labs” class
Find the lab called “Scaling to Multiple GPUs” and click Start
After a short wait, lab instance connection information will be shown
Please ask the Lab Assistants for help!
Why Should You Use Multiple GPUs?
Compute faster — more GPUs means a faster time to solution
Compute larger — more GPUs means more memory for larger problems
Compute cheaper — more GPUs per node means less overhead in cost, power, and space
What You Will Learn Today
During today’s lab:
— Scalability metrics
— Managing multiple devices
— Communicating between devices
Homework:
— Communication hiding
— Synchronization
What Is Not Covered Today
— CUDA basics
— Kernel optimization
— MPI parallelism
— Multi-GPU debugging
Scalability Metrics For Success
Serial time: T_s
Parallel time: T_p
Number of processors: P
Speedup — how much faster? S = T_s / T_p (ideal: P)
Efficiency — how efficiently are the processors used? E = S / P (ideal: 1)
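For example (illustrative numbers, not measurements from this lab): if a solve takes T_s = 80 s serially and T_p = 20 s on P = 8 processors, the speedup is S = 80 / 20 = 4 and the efficiency is E = 4 / 8 = 50%.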
Case Study: 2D Laplace Solver
A_{k+1}(i,j) = ( A_k(i-1,j) + A_k(i+1,j) + A_k(i,j-1) + A_k(i,j+1) ) / 4
[Figure: five-point stencil showing A(i,j) averaged with its neighbors A(i-1,j), A(i+1,j), A(i,j-1), A(i,j+1)]
Given a 2D grid
— Set every vertex equal to the average of neighboring vertices
— Repeat until converged
Common algorithmic pattern
— Electromagnetism, Astrophysics, Fluid Dynamics, Iterative Solvers
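The lab provides simpleLaplaceIter_kernel in the exercise code; purely for orientation, here is a minimal sketch of one Jacobi update over the interior points (the real kernel’s signature and indexing may differ; this assumes the padded, row-major layout with leading dimension lda described on the grid-indexing slide below):

  // Hypothetical sketch: one Jacobi iteration on the interior of a
  // (rows+2) x (cols+2) padded grid stored with leading dimension lda.
  __global__ void laplaceIterSketch(int rows, int cols, int lda,
                                    const double *in, double *out)
  {
      int col = blockIdx.x * blockDim.x + threadIdx.x + 1; // skip boundary column 0
      int row = blockIdx.y * blockDim.y + threadIdx.y + 1; // skip boundary row 0
      if (row <= rows && col <= cols) {
          out[row*lda + col] = 0.25 * ( in[(row-1)*lda + col]
                                      + in[(row+1)*lda + col]
                                      + in[row*lda + (col-1)]
                                      + in[row*lda + (col+1)] );
      }
  }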
Domain Decomposition: 3 Options
Option 1: Tiles (with halo region)
— Minimizes the surface-area-to-volume ratio
— Communicates less data
— Optimal for bandwidth-bound communication
Option 2: Vertical Stripes (with halo region)
— Minimizes the number of neighbors
— Communicates with fewer neighbors
— Optimal for latency-bound communication
— Halo regions are contiguous if data is column-major
Option 3: Horizontal Stripes (with halo region)
— Minimizes the number of neighbors
— Communicates with fewer neighbors
— Optimal for latency-bound communication
— Halo regions are contiguous if data is row-major
Grid Indexing
[Figure: memory layout of the grid. Each row is LDA elements wide; rows 0 through ROWS+1 are stored consecutively, with row 0 and row ROWS+1 holding the boundary condition and halo region.]
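Later slides index this layout through an IDX helper whose definition is not shown; a plausible definition for the row-major, LDA-wide layout above (an assumption, not the lab’s actual macro) is:

  // hypothetical: map (row, col) to a 1D offset in an array
  // whose rows are lda elements apart
  #define IDX(row, col, lda) ((size_t)(row) * (lda) + (col))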
CUDA Multi-GPU APIs
All memory allocations and kernel launches occur on the currently active device
Simple APIs to get and set the device:
— cudaSetDevice(int d)
— cudaGetDevice(int *d)
Memory can be transferred directly between devices:
— cudaMemcpyPeer(...)
— cudaMemcpyPeerAsync(...)
For direct memory copies you must enable peer access:
— cudaDeviceCanAccessPeer(&access, i, j)
— cudaDeviceEnablePeerAccess(...)
Enabling Peer Access
  // Enable direct peer-to-peer access between every pair of devices
  // that supports it. Access is directional, so both (i, j) and (j, i)
  // are attempted.
  for(int i = 0; i < GPUS; i++) {
    cudaSetDevice(i);                           // calls below apply to device i
    for(int j = 0; j < GPUS; j++) {
      if(i != j) {
        int access;
        cudaDeviceCanAccessPeer(&access, i, j); // can device i access device j?
        if(access) {
          cudaDeviceEnablePeerAccess(j, 0);     // enable i -> j access
        }
      }
    }
  }
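Once peer access is enabled, cudaMemcpyPeerAsync(dst, dstDevice, src, srcDevice, count, stream) copies directly between the two devices’ memories without staging through the host; the halo exchange later in the lab relies on this.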
Hands-On Exercise
Progressive exercise
— 5 tasks
— We will go until we run out of time
— Each step begins where the last left off
— Feel free to work ahead
Task 1
Replicate the computation on all devices
— CUDA calls apply to the current device:
    for(int d = 0; d < numDev; d++) {
      cudaSetDevice(d);
      ...
Use asynchronous memory copies
— Required for devices to operate in parallel; otherwise the host would block until each transfer completes
— Memory must be pinned (see the sketch below)
— For now, use stream 0: cudaMemcpyAsync(..., 0);
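Pinning the host buffer is not shown on the slides; a minimal sketch, assuming the host array A of (ROWS+2) * LDA doubles used elsewhere in the exercise (the lab code may instead use cudaHostAlloc or cudaHostRegister):

  double *A;
  // cudaMallocHost allocates page-locked (pinned) host memory,
  // which cudaMemcpyAsync needs in order to be truly asynchronous
  cudaMallocHost(&A, (ROWS+2) * LDA * sizeof(double));
  // ... initialize A, run the solver ...
  cudaFreeHost(A);  // pinned host memory is freed with cudaFreeHost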
Allocate & Free Memory on Each Device
  double *d_Ain[MAX_DEVICES];   // one input array per device
  double *d_Aout[MAX_DEVICES];  // one output array per device

  //allocate device memory
  for(int d = 0; d < numDev; d++) {  // loop over each device
    cudaSetDevice(d);
    cudaMalloc(&d_Ain[d], bytes);
    cudaMalloc(&d_Aout[d], bytes);
    cudaCheckError();
  }
Copy Data To Each Device Asynchronously
  //copy initial conditions to both buffers
  for(int d = 0; d < numDev; d++) {  // loop over each device
    cudaSetDevice(d);
    // asynchronous memory copies on stream 0
    cudaMemcpyAsync(d_Ain[d], A, bytes, cudaMemcpyHostToDevice, 0);
    cudaMemcpyAsync(d_Aout[d], d_Ain[d], bytes, cudaMemcpyDeviceToDevice, 0);
    cudaCheckError();
  }
Launch the Kernel on Each Device
  for(int d = 0; d < numDev; d++) {  // loop over each device
    cudaSetDevice(d);
    simpleLaplaceIter_kernel<<<gridSize,blockSize>>>(ROWS, COLS,
                                                     d_Ain[d], d_Aout[d]);
    cudaCheckError();
  }
  // swap input and output arrays (ping-pong buffering): the output of
  // this iteration becomes the input of the next, avoiding an extra copy
  for(int d = 0; d < numDev; d++) {
    std::swap(d_Ain[d], d_Aout[d]);
  }
Copy Data From Each Device Asynchronously
  //copy results back to host
  for(int d = 0; d < numDev; d++) {  // loop over each device
    cudaSetDevice(d);
    // asynchronous copy; in Task 1 every device computes the same
    // solution, so each copy overwrites A with identical data
    cudaMemcpyAsync(A, d_Ain[d], bytes, cudaMemcpyDeviceToHost, 0);
    cudaCheckError();
  }
Free Memory on Each Device
  //free device memory
  for(int d = 0; d < numDev; d++) {  // loop over each device
    cudaFree(d_Ain[d]);
    cudaFree(d_Aout[d]);
    cudaCheckError();
  }
Task 1: Results
Test machine: dual-socket Xeon X5675 with 8 M2090s

  Size        Speedup  Efficiency
  2048x2048   1        12.5%
  4096x4096   1        12.5%
  8192x8192   1        12.5%

No slowdown from adding more devices, but all GPUs are solving the same problem: the speedup stays at 1, so efficiency is 1/8 = 12.5%. Not a very efficient use of resources.
Profiler Supports Multi-GPU
Option 1: Run directly in Nsight or NVVP
— CUDA 5.5 and earlier requires an X display
Option 2: Collect profiles using nvprof, then import the profile into NVVP:
  nvprof -o profile.nvprof ./laplace
Since we are using IPython we cannot run the profiler now.
Task 2
Assign unique work to each GPU, using the horizontal-stripes decomposition:
— Assign unique work to each device
— Copy a subset of the memory to each device
— Copy data back to the appropriate host location
— Exchange halo regions at each iteration
Assign Unique Work To Each Device
  // compute the number of local rows per device
  int rows = ROWS / numDev;
  ...
  // adjust the launch dimensions to the local problem size
  dim3 gridSize( ceil((double)(COLS)/blockSize.x),
                 ceil((double)(rows)/blockSize.y) );
  ...
  simpleLaplaceIter_kernel<<<gridSize,blockSize>>>(rows, COLS,
                                                   d_Ain[d], d_Aout[d]);
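Note that rows = ROWS / numDev assumes ROWS divides evenly by the device count, which holds for the lab’s grid sizes. A hedged sketch of handling a remainder (illustrative; not part of the lab code):

  // give the first (ROWS % numDev) devices one extra row each
  int base     = ROWS / numDev;
  int rem      = ROWS % numDev;
  int rows     = base + (d < rem ? 1 : 0);        // local rows on device d
  int firstRow = d * base + (d < rem ? d : rem);  // global index of device d's first interior row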
Copy Unique Work to the Devices
  // compute the transfer size: rows interior rows plus one
  // boundary/halo row above and one below
  size_t bytes = (rows+2) * LDA * sizeof(double);
  ...
  //copy from host to device
  // device row 0 corresponds to host row d*rows, so each device
  // receives its stripe plus the surrounding boundary/halo rows
  cudaMemcpyAsync(d_Ain[d]+IDX(0,0,LDA), A+IDX(d*rows,0,LDA),
                  bytes, cudaMemcpyHostToDevice, 0);
Copy Results to the Host
  // compute the transfer size: only the rows interior rows,
  // not the boundary/halo rows
  size_t bytes_int = rows * LDA * sizeof(double);
  ...
  //copy from device to host
  // device row 1 (the first interior row) maps to host row d*rows+1
  cudaMemcpyAsync(A+IDX(d*rows+1,0,LDA), d_Ain[d]+IDX(1,0,LDA),
                  bytes_int, cudaMemcpyDeviceToHost, 0);
Exchange Halos After Kernel Launch
  for(int d = 0; d < numDev; d++) {
    cudaSetDevice(d);  // launch on each device in turn
    simpleLaplaceIter_kernel<<<gridSize,blockSize>>>(...);
  }
  for(int d = 0; d < numDev; d++) {
    //Grab lower boundary
    //Grab upper boundary
    cudaCheckError();
  }
Copy Lower Boundary
[Figure: halo row 0 of GPU d (destination, LDA elements wide) is filled from row rows of GPU d-1 (source).]
  //Grab lower boundary: device d's halo row 0 comes from
  //device d-1's last interior row (row rows)
  if(d > 0)
    cudaMemcpyPeerAsync(d_Aout[d]+IDX(0,0,LDA), d,
                        d_Aout[d-1]+IDX(rows,0,LDA), d-1,
                        LDA*sizeof(double), 0);
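The matching upper-boundary copy is part of the exercise; a sketch following the same pattern (assuming the same rows/LDA layout and numDev devices) might look like:

  //Grab upper boundary: device d's halo row rows+1 comes from
  //device d+1's first interior row (row 1)
  if(d < numDev-1)
    cudaMemcpyPeerAsync(d_Aout[d]+IDX(rows+1,0,LDA), d,
                        d_Aout[d+1]+IDX(1,0,LDA), d+1,
                        LDA*sizeof(double), 0);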