  1. Multi-GPU: A Hands-on Exercise  Justin Luitjens, NVIDIA Developer Technologies

  2. Connection Instructions  Navigate to nvlabs.qwiklab.com  Log in or create a new account  Select the “Instructor-Led Hands-on Labs” class  Find the lab called “Scaling to Multiple GPUs” and click Start  After a short wait, lab instance connection information will be shown  Please ask the Lab Assistants for help!

  3. Why Should You Use Multiple GPUs?  Compute Faster — more GPUs = faster time to solution  Compute Larger — more GPUs = more memory for larger problems  Compute Cheaper — more GPUs per node = less overhead in cost, power, and space

  4. What You Will Learn About Today  During today’s lab — Scalability metrics — Managing multiple devices — Communicating between devices  Homework — Communication Hiding — Synchronization

  5. What is not Covered Today  CUDA Basics  Kernel Optimization  MPI Parallelism  Multi-GPU debugging

  6. Scalability Metrics For Success  Serial time: T_s  Parallel time: T_p  Number of processors: P  Speedup (how much faster?): S = T_s / T_p, ideal: P  Efficiency (how efficiently are the processors used?): E = S / P, ideal: 1
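As a quick illustration with made-up numbers: if the serial solver takes T_s = 80 s and the multi-GPU run takes T_p = 12 s on P = 8 GPUs, then the speedup is S = T_s / T_p = 80 / 12 ≈ 6.7 and the efficiency is E = S / P ≈ 6.7 / 8 ≈ 0.83, i.e. about 83% of ideal scaling.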

  7. Case Study: 2D Laplace Solver  Update rule (Jacobi iteration): A_{k+1}(i,j) = ( A_k(i-1,j) + A_k(i+1,j) + A_k(i,j-1) + A_k(i,j+1) ) / 4  [Figure: 5-point stencil showing A(i,j) and its four neighbors A(i-1,j), A(i+1,j), A(i,j-1), A(i,j+1)]  Given a 2D grid — Set every vertex equal to the average of its neighboring vertices  Repeat until converged  Common algorithmic pattern — Electromagnetism, Astrophysics, Fluid Dynamics, Iterative Solvers
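For reference, a single-GPU kernel for this update could look like the sketch below. This is an illustrative reconstruction, not the deck's simpleLaplaceIter_kernel (which takes only rows, cols, and the two arrays); the lda parameter and boundary layout are assumptions based on the grid-indexing slide that follows.

    // One thread per interior grid point: out(i,j) = average of in's four neighbors.
    // Assumes a (rows+2) x lda row-major array with boundary/halo in row 0,
    // row rows+1, column 0, and column cols+1.
    __global__ void laplaceIterSketch(int rows, int cols, int lda,
                                      const double *in, double *out)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x + 1;   // column index (skip boundary)
        int i = blockIdx.y * blockDim.y + threadIdx.y + 1;   // row index (skip boundary)

        if (i <= rows && j <= cols) {
            out[i * lda + j] = 0.25 * ( in[(i - 1) * lda + j] + in[(i + 1) * lda + j]
                                      + in[i * lda + j - 1]   + in[i * lda + j + 1] );
        }
    }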

  8. Domain Decomposition (Option 1 of 3): Tiles  Minimizes the surface-area-to-volume ratio — Communicate less data — Optimal for bandwidth-bound communication  [Figure: tiled decomposition with halo regions]

  9. Domain Decomposition (Option 2 of 3): Vertical Stripes  Minimizes the number of neighbors — Communicate with fewer neighbors — Optimal for latency-bound communication  Contiguous if data is column-major  [Figure: vertical-stripe decomposition with halo regions]

  10. Domain Decomposition (Option 3 of 3): Horizontal Stripes  Minimizes the number of neighbors — Communicate with fewer neighbors — Optimal for latency-bound communication  Contiguous if data is row-major  [Figure: horizontal-stripe decomposition with halo regions]

  11. Grid Indexing  [Figure: grid layout with leading dimension LDA; rows 0 and ROWS+1 hold the boundary condition and halo region, interior rows are 1 through ROWS]
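Later slides use an IDX(…) helper that is never defined in this capture. A plausible row-major definition matching the layout above (an assumption, not code from the deck):

    // Assumed row-major index helper: rows 0 and ROWS+1 are boundary/halo,
    // interior rows are 1..ROWS, LDA is the padded row length (leading dimension).
    #define IDX(row, col, lda) ((row) * (lda) + (col))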

  12. CUDA Multi-GPU APIs  All memory allocations and kernel launches apply to the currently active device  Simple APIs to get and set the device: — cudaSetDevice(int d) — cudaGetDevice(int *d)  Memory can be transferred directly between devices: — cudaMemcpyPeer(…) — cudaMemcpyPeerAsync(…)  For direct peer-to-peer copies you must first enable peer access: — cudaDeviceCanAccessPeer(&access, i, j) — cudaDeviceEnablePeerAccess(…)
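For orientation, a minimal device-management sketch (illustrative, not from the slides) that enumerates the visible GPUs and makes each one current in turn:

    int numDev = 0;
    cudaGetDeviceCount(&numDev);   // how many GPUs are visible to this process

    for (int d = 0; d < numDev; d++) {
        cudaSetDevice(d);          // subsequent allocations, copies, and launches target device d
        // ... per-device allocations, kernel launches, memory copies ...
    }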

  13. Enabling Peer Access
  for(int i=0;i<GPUS;i++) {
    cudaSetDevice(i);                          // peer access is enabled from the current device
    for(int j=0;j<GPUS;j++) {
      if(i!=j) {
        int access;
        cudaDeviceCanAccessPeer(&access,i,j);  // can device i access device j?
        if(access) {
          cudaDeviceEnablePeerAccess(j,0);     // enable access from i to j
        }
      }
    }
  }

  14. Hands On Exercise  Progressive exercise — 5 Tasks — We will go until we run out of time — Each step begins where the last left off — Feel free to work ahead

  15. Task 1  Replicate the computation on all devices — CUDA calls apply to the current device: for(int d=0;d<numDev;d++) { cudaSetDevice(d); ... }  Use asynchronous memory copies — Required for the devices to operate in parallel; otherwise the host would block until each transfer completed — Host memory must be pinned — For now, use stream 0: cudaMemcpyAsync(…, 0);
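The lab's host-side setup is not shown in these slides. Pinned host memory is typically obtained with cudaMallocHost (or cudaHostAlloc); a minimal sketch under that assumption, with the array size following the grid-layout convention above:

    double *A = NULL;
    size_t bytes = (ROWS + 2) * LDA * sizeof(double);

    // Pinned (page-locked) host allocation: needed so cudaMemcpyAsync transfers
    // are truly asynchronous and can overlap with work on the other devices.
    cudaMallocHost((void **)&A, bytes);

    // ... initialize A, run the multi-GPU solver ...

    cudaFreeHost(A);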

  16-17. Allocate & Free Memory on Each Device
  double *d_Ain[MAX_DEVICES];    // one array per device
  double *d_Aout[MAX_DEVICES];

  //allocate device memory
  for(int d=0;d<numDev;d++) {    // loop over each device
    cudaSetDevice(d);
    cudaMalloc(&d_Ain[d],bytes);
    cudaMalloc(&d_Aout[d],bytes);
    cudaCheckError();
  }

  18-19. Copy Data To Each Device Asynchronously
  //copy initial conditions to both buffers
  for(int d=0;d<numDev;d++) {    // loop over each device
    cudaSetDevice(d);
    // asynchronous memory copies (host buffer A must be pinned)
    cudaMemcpyAsync(d_Ain[d],A,bytes,cudaMemcpyHostToDevice,0);
    cudaMemcpyAsync(d_Aout[d],d_Ain[d],bytes,cudaMemcpyDeviceToDevice,0);
    cudaCheckError();
  }

  20-21. Launch the Kernel on Each Device
  for(int d=0;d<numDev;d++) {    // loop over each device
    cudaSetDevice(d);
    simpleLaplaceIter_kernel<<<gridSize,blockSize>>>(ROWS,COLS,d_Ain[d],d_Aout[d]);
    cudaCheckError();
  }
  for(int d=0;d<numDev;d++) {
    std::swap(d_Ain[d],d_Aout[d]);   // swap input and output arrays
  }

  22-23. Copy Data From Each Device Asynchronously
  //copy results back to host
  for(int d=0;d<numDev;d++) {    // loop over each device
    cudaSetDevice(d);
    cudaMemcpyAsync(A,d_Ain[d],bytes,cudaMemcpyDeviceToHost,0);   // copy asynchronously
    cudaCheckError();
  }

  24-25. Free Memory on Each Device
  //free device memory
  for(int d=0;d<numDev;d++) {    // loop over each device
    cudaFree(d_Ain[d]);
    cudaFree(d_Aout[d]);
    cudaCheckError();
  }

  26. Task 1: Results  Test machine: dual-socket Xeon X5675 with 8× M2090

  Size        Speedup  Efficiency
  2048x2048   1        12.5%
  4096x4096   1        12.5%
  8192x8192   1        12.5%

  No slowdown from adding more devices  All GPUs are solving the same problem — Not a very efficient use of resources

  27. Profiler Supports Multi-GPU  Option 1: Run directly in Nsight or NVVP — CUDA 5.5 and earlier requires an X display  Option 2: Collect a profile with nvprof: nvprof -o profile.nvprof ./laplace — then import the profile into NVVP  Since we are using IPython we cannot run the profiler now.

  28. Task 2  Assign unique work to each GPU — Horizontal-stripes decomposition — Copy a subset of the grid to each device — Copy data back to the appropriate host location — Exchange halo regions at each iteration
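Before reading the code in the next slides, it may help to spell out the stripe bookkeeping they rely on. This is an illustrative sketch of the assumed layout, not code from the deck:

    int rows = ROWS / numDev;   // interior rows owned by each device
    // Device d owns global interior rows d*rows+1 .. (d+1)*rows.
    // Locally it stores them in rows 1..rows of a (rows+2) x LDA array;
    // local rows 0 and rows+1 are halo rows filled from the neighboring devices
    // (or from the boundary condition on the first and last device).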

  29-30. Assign Unique Work To Each Device
  int rows=ROWS/numDev;   // compute local rows per device
  ...
  // adjust launch dimensions to the local stripe
  dim3 gridSize( ceil((double)(COLS)/blockSize.x),
                 ceil((double)(rows)/blockSize.y));
  ...
  simpleLaplaceIter_kernel<<<gridSize,blockSize>>>(rows,COLS,d_Ain[d],d_Aout[d]);

  31-32. Copy Unique Work to the Devices
  size_t bytes=(rows+2)*(LDA)*sizeof(double);   // transfer size: local stripe plus two halo rows
  ...
  //copy from host to device (device d's stripe starts at global row d*rows)
  cudaMemcpyAsync(d_Ain[d]+IDX(0,0,LDA),A+IDX(d*rows,0,LDA),
                  bytes,cudaMemcpyHostToDevice,0);

  33-34. Copy Results to the Host
  int bytes_int=rows*LDA*sizeof(double);   // transfer size: interior rows only, no halos
  ...
  //copy from device to host (skip local halo row 0; place at global row d*rows+1)
  cudaMemcpyAsync(A+IDX(d*rows+1,0,LDA),d_Ain[d]+IDX(1,0,LDA),
                  bytes_int,cudaMemcpyDeviceToHost,0);

  35. Exchange Halos After Kernel Launch
  for(int d=0;d<numDev;d++) {
    simpleLaplaceIter_kernel<<<gridSize,blockSize>>>(...);
  }
  for(int d=0;d<numDev;d++) {
    //Grab lower boundary
    //Grab upper boundary
    cudaCheckError();
  }

  36. Copy Lower Boundary  [Figure: destination is local row 0 on GPU d; source is local row rows on GPU d-1; each row is LDA elements wide]
  //Grab lower boundary
  if(d>0)
    cudaMemcpyPeerAsync(d_Aout[d]+IDX(0,0,LDA),d,
                        d_Aout[d-1]+IDX(rows,0,LDA),d-1,
                        LDA*sizeof(double),0);
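The capture ends here. For completeness, the matching upper-boundary copy would plausibly mirror the code above; the sketch below is illustrative and not taken from the original slides:

    //Grab upper boundary: local row rows+1 on GPU d comes from local row 1 on GPU d+1
    if(d<numDev-1)
      cudaMemcpyPeerAsync(d_Aout[d]+IDX(rows+1,0,LDA),d,
                          d_Aout[d+1]+IDX(1,0,LDA),d+1,
                          LDA*sizeof(double),0);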
