  1. Multi-GPU: A Hands-on Exercise Justin Luitjens NVIDIA - Developer Technologies

  2. Connection instructions  Navigate to nvlabs.qwiklab.com  Log in or create a new account  Select the “Instructor-Led Hands-on Labs” class  Find the lab called “Scaling to Multiple GPUs” and click Start  After a short wait, lab instance connection information will be shown  Please ask Lab Assistants for help!

  3. Why Should You Use Multiple GPUs?  Compute Faster — More GPUs = faster time to solution  Compute Larger — More GPUs = more memory for larger problems  Compute Cheaper — More GPUs per node = less overhead in $, power, and space

  4. What You Will Learn About Today  During today’s lab — Scalability metrics — Managing multiple devices — Communicating between devices  Homework — Communication Hiding — Synchronization

  5. What is not Covered Today  CUDA Basics  Kernel Optimization  MPI Parallelism  Multi-GPU debugging

  6. Scalability Metrics For Success  Serial time: T_s  Parallel time: T_p  # of processors: P  Speedup: how much faster? S = T_s / T_p — Ideal: P  Efficiency: how efficiently are the processors used? E = S / P — Ideal: 1
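
  As a worked example (the numbers here are illustrative, not from the deck): a problem that takes T_s = 80 s serially and T_p = 16 s on P = 8 GPUs has speedup S = 80 / 16 = 5 and efficiency E = 5 / 8 = 62.5%. The Task 1 results later in the deck correspond to S = 1 on 8 GPUs, i.e. E = 1/8 = 12.5%.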

  7. Case Study: 2D Laplace Solver A(i+1,j) 𝐵 𝑙+1 𝑗, 𝑘 = 𝐵 𝑙 (𝑗 − 1, 𝑘) + 𝐵 𝑙 𝑗 + 1, 𝑘 + 𝐵 𝑙 𝑗, 𝑘 − 1 + 𝐵 𝑙 𝑗, 𝑘 + 1 4 A(i-1,j) A(i,j) A(i+1,j) A(i,j-1)  Given a 2D grid — Set every vertex equal to the average of neighboring vertices  Repeat until converged  Common algorithmic pattern — Electromagnetism, Astrophysics, Fluid Dynamics, Iterative Solvers
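
  The deck never shows the body of its simpleLaplaceIter_kernel; a minimal single-device Jacobi kernel consistent with the call signature used later (rows, cols, input, output) might look like the following sketch. The name and the assumption of a (rows+2) x (cols+2) layout with a one-cell boundary/halo ring are mine.

  __global__ void laplace_jacobi_kernel(int rows, int cols,
                                        const double *Ain, double *Aout)
  {
      const int lda = cols + 2;                            // assumed row pitch: interior columns + 2 boundary columns
      int j = blockIdx.x * blockDim.x + threadIdx.x + 1;   // column index, 1..cols
      int i = blockIdx.y * blockDim.y + threadIdx.y + 1;   // row index,    1..rows
      if (i <= rows && j <= cols) {
          Aout[i * lda + j] = 0.25 * (Ain[(i - 1) * lda + j] + Ain[(i + 1) * lda + j] +
                                      Ain[i * lda + j - 1]  + Ain[i * lda + j + 1]);
      }
  }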

  8. Domain Decomposition: 3 Options  Minimizes the surface-area-to-volume ratio — Communicate less data — Optimal for bandwidth-bound communication  [Figure: Tiles decomposition with halo regions]

  9. Domain Decomposition: 3 Options  Minimizes the number of neighbors — Communicate with fewer neighbors — Optimal for latency-bound communication  Contiguous if data is column-major  [Figure: Vertical Stripes decomposition with halo regions]

  10. Domain Decomposition: 3 Options  Minimizes the number of neighbors — Communicate with fewer neighbors — Optimal for latency-bound communication  Contiguous if data is row-major  [Figure: Horizontal Stripes decomposition with halo regions]

  11. Grid Indexing  [Figure: grid storage layout — each row is LDA elements wide; row 0 and row ROWS+1 hold the boundary condition & halo region, rows 1..ROWS are the interior]
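
  The IDX(row, col, LDA) helper used in the later snippets is not defined in the deck; a row-major indexing macro consistent with that usage (my assumption) would be:

  // Sketch (assumption): row-major indexing with leading dimension LDA,
  // i.e. the padded row length in elements.
  #define IDX(row, col, lda) ((size_t)(row) * (lda) + (col))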

  12. CUDA multi-GPU APIs  All memory allocations and kernel launches occur on the currently active device  Simple APIs to get and set the device: — cudaSetDevice(int d) — cudaGetDevice(int *d)  Memory can be transferred directly between devices — cudaMemcpyPeer(…) — cudaMemcpyPeerAsync(…)  For direct memory copies you must enable peer access — cudaDeviceCanAccessPeer(&access,i,j) — cudaDeviceEnablePeerAccess(…)
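
  The later snippets loop over numDev devices without showing where that count comes from; one plausible way to obtain it (my sketch, not from the deck) is cudaGetDeviceCount:

  int numDev = 0;
  cudaGetDeviceCount(&numDev);   // number of visible CUDA devices
  cudaSetDevice(0);              // subsequent CUDA calls target device 0 until changed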

  13. Enabling Peer Access
  for(int i=0;i<GPUS;i++) {
    cudaSetDevice(i);                          // peer access is enabled from the current device
    for(int j=0;j<GPUS;j++) {
      if(i!=j) {
        int access;
        cudaDeviceCanAccessPeer(&access,i,j);  // can device i access device j's memory?
        if(access) {
          cudaDeviceEnablePeerAccess(j,0);     // enable i -> j access (flags argument must be 0)
        }
      }
    }
  }

  14. Hands On Exercise  Progressive exercise — 5 Tasks — We will go until we run out of time — Each step begins where the last left off — Feel free to work ahead

  15. Task 1  Replicate the computation on all devices — CUDA calls apply to the current device: for(int d=0;d<numDev;d++) { cudaSetDevice(d); ... }  Use asynchronous memory copies — Required for devices to operate in parallel; otherwise the host would block until transfers were complete — Memory must be pinned (see the allocation sketch below) — For now use stream 0: cudaMemcpyAsync(…, 0);
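
  A minimal pinned-allocation sketch for the host grid A (my assumption about the layout, taken from the grid-indexing slide: ROWS+2 rows of LDA doubles):

  double *A = NULL;
  size_t hostBytes = (size_t)(ROWS + 2) * LDA * sizeof(double);
  cudaMallocHost(&A, hostBytes);   // pinned (page-locked) allocation; release with cudaFreeHost(A)
  // An existing pageable allocation could instead be pinned in place with
  // cudaHostRegister(A, hostBytes, cudaHostRegisterDefault) and later cudaHostUnregister(A).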

  16. Allocate & Free Memory on Each Device
  double *d_Ain[MAX_DEVICES];    // one array per device
  double *d_Aout[MAX_DEVICES];
  //allocate device memory
  for(int d=0;d<numDev;d++) {    // loop over each device
    cudaSetDevice(d);
    cudaMalloc(&d_Ain[d],bytes);
    cudaMalloc(&d_Aout[d],bytes);
    cudaCheckError();
  }

  18. Copy Data To Each Device Asynchronously
  //copy initial conditions to both buffers
  for(int d=0;d<numDev;d++) {    // loop over each device
    cudaSetDevice(d);
    cudaMemcpyAsync(d_Ain[d],A,bytes,cudaMemcpyHostToDevice,0);           // asynchronous memory copies
    cudaMemcpyAsync(d_Aout[d],d_Ain[d],bytes,cudaMemcpyDeviceToDevice,0);
    cudaCheckError();
  }

  20. Launch the Kernel on Each Device
  for(int d=0;d<numDev;d++) {    // loop over each device
    cudaSetDevice(d);
    simpleLaplaceIter_kernel<<<gridSize,blockSize>>>(ROWS,COLS,d_Ain[d],d_Aout[d]);
    cudaCheckError();
  }
  for(int d=0;d<numDev;d++) {
    std::swap(d_Ain[d],d_Aout[d]);   // swap input and output arrays
  }

  22. Copy Data From Each Device Asynchronously
  //copy results back to host
  for(int d=0;d<numDev;d++) {    // loop over each device
    cudaSetDevice(d);
    cudaMemcpyAsync(A,d_Ain[d],bytes,cudaMemcpyDeviceToHost,0);   // copy asynchronously
    cudaCheckError();
  }
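
  Synchronization is deferred to the homework, but it is worth noting (my addition, not from the deck) that the host must wait for these asynchronous copies before reading A:

  for(int d=0;d<numDev;d++) {
    cudaSetDevice(d);
    cudaDeviceSynchronize();   // block until all outstanding work on device d, including the copy, has finished
  }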

  24. Free Memory on Each Device
  //free device memory
  for(int d=0;d<numDev;d++) {    // loop over each device
    cudaFree(d_Ain[d]);
    cudaFree(d_Aout[d]);
    cudaCheckError();
  }

  26. Task 1: Results  Test machine: Dual Socket Xeon X5675 with 8 M2090s

    Size        Speedup   Efficiency
    2048x2048   1         12.5%
    4096x4096   1         12.5%
    8192x8192   1         12.5%

   No slowdown from adding more devices  All GPUs are solving the same problem — Not a very efficient use of resources

  27. Profiler Supports Multi-GPU  Option 1: Run directly in Nsight or NVVP — CUDA 5.5 and earlier requires an X display  Option 2: — Collect profiles using nvprof: nvprof -o profile.nvprof ./laplace — Import the profile into NVVP  Since we are using IPython we cannot run the profiler now.

  28. Task 2  Assign unique work to each GPU — Horizontal stripes decomposition — Assign unique work to each device — Copy a subset of the memory to each device — Copy data back to the appropriate host location  Exchange halo regions at each iteration

  29. Assign Unique Work To Each Device
  int rows=ROWS/numDev;   // compute local rows per device
  ...
  dim3 gridSize( ceil((double)(COLS)/blockSize.x),
                 ceil((double)(rows)/blockSize.y) );   // adjust launch dimensions
  ...
  simpleLaplaceIter_kernel<<<gridSize,blockSize>>>(rows,COLS,d_Ain[d],d_Aout[d]);

  31. Copy Unique Work to the Devices
  size_t bytes=(rows+2)*LDA*sizeof(double);   // transfer size: rows interior rows plus two boundary/halo rows
  ...
  //copy from host to device: device d's block starts at host row d*rows
  cudaMemcpyAsync(d_Ain[d]+IDX(0,0,LDA),A+IDX(d*rows,0,LDA),
                  bytes,cudaMemcpyHostToDevice,0);

  33. Copy Results to the Host
  int bytes_int=rows*LDA*sizeof(double);   // transfer size: interior rows only
  ...
  //copy from device to host: skip the halo row at the top of the device block
  cudaMemcpyAsync(A+IDX(d*rows+1,0,LDA),d_Ain[d]+IDX(1,0,LDA),
                  bytes_int,cudaMemcpyDeviceToHost,0);

  35. Exchange Halos After Kernel Launch
  for(int d=0;d<numDev;d++) {
    simpleLaplaceIter_kernel<<<gridSize,blockSize>>>(...);
  }
  for(int d=0;d<numDev;d++) {
    //Grab lower boundary
    //Grab upper boundary
    cudaCheckError();
  }

  36. Copy Lower Boundary  [Figure: row 0 of GPU d (destination, halo) is filled from row rows of GPU d-1 (source); each row is LDA elements wide]
  //Grab lower boundary
  if(d>0)
    cudaMemcpyPeerAsync(d_Aout[d]+IDX(0,0,LDA),d,         // dst: top halo row of GPU d
                        d_Aout[d-1]+IDX(rows,0,LDA),d-1,  // src: last interior row of GPU d-1
                        LDA*sizeof(double),0);
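
  The deck stops here; the matching upper-boundary exchange is left as a sketch (mine, not shown in the slides), mirroring the lower-boundary copy above:

  //Grab upper boundary (sketch: bottom halo row of GPU d comes from the first interior row of GPU d+1)
  if(d<numDev-1)
    cudaMemcpyPeerAsync(d_Aout[d]+IDX(rows+1,0,LDA),d,    // dst: bottom halo row of GPU d
                        d_Aout[d+1]+IDX(1,0,LDA),d+1,     // src: first interior row of GPU d+1
                        LDA*sizeof(double),0);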
