Multi-GPU: A Hands-on Exercise
Justin Luitjens, NVIDIA Developer Technologies
Connection instructions
- Navigate to nvlabs.qwiklab.com
- Log in or create a new account
- Select the “Instructor-Led Hands-on Labs” class
- Find the lab called “Scaling to Multiple GPUs” and click Start
- After a short wait, lab instance connection information will be shown
- Please ask the Lab Assistants for help!
Why Should You Use Multiple GPUs?
- Compute Faster
— More GPUs = faster time to solution
- Compute Larger
— More GPUs = more memory for larger problems
- Compute Cheaper
— More GPUs per node = less overhead in cost, power, and space
What You Will Learn About Today
- During today’s lab
— Scalability metrics
— Managing multiple devices
— Communicating between devices
- Homework
— Communication Hiding
— Synchronization
What Is Not Covered Today
- CUDA Basics
- Kernel Optimization
- MPI Parallelism
- Multi-GPU debugging
Scalability Metrics For Success
- Serial Time: $T_s$
- Parallel Time: $T_p$
- Number of Processors: $P$
- Speedup: how much faster is the parallel run?
$$S = \frac{T_s}{T_p} \qquad \text{Ideal: } P$$
- Efficiency: how efficiently are the processors used?
$$E = \frac{S}{P} \qquad \text{Ideal: } 1$$
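A quick worked example (hypothetical numbers, not from the lab): suppose a solver takes $T_s = 80$ s serially and $T_p = 25$ s on $P = 4$ GPUs. Then
$$S = \frac{80}{25} = 3.2 \qquad E = \frac{3.2}{4} = 0.8$$
i.e. 80% efficiency, against the ideals of $S = 4$ and $E = 1$.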
Case Study: 2D Laplace Solver
- Given a 2D grid
— Set every vertex equal to the average of neighboring vertices
- Repeat until converged
- Common algorithmic pattern
— Electromagnetism, Astrophysics, Fluid Dynamics, Iterative Solvers
$$A^{n+1}(i,j) = \frac{A^n(i-1,j) + A^n(i+1,j) + A^n(i,j-1) + A^n(i,j+1)}{4}$$
[Figure: 5-point stencil — A(i,j) and its four neighbors A(i-1,j), A(i+1,j), A(i,j-1), A(i,j+1)]
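For reference, a minimal sketch of what the lab’s simpleLaplaceIter_kernel might look like. The lab provides the actual kernel; the IDX macro, the derived leading dimension, and the halo layout below are assumptions inferred from how the kernel is called in later slides.

// Hypothetical reconstruction, not the lab's actual source.
// IDX maps (row, col) into a row-major array with row stride lda.
#define IDX(row, col, lda) ((row)*(lda) + (col))

__global__ void simpleLaplaceIter_kernel(int rows, int cols,
                                         const double *in, double *out)
{
    int j = blockIdx.x*blockDim.x + threadIdx.x + 1;  // column, 1..cols
    int i = blockIdx.y*blockDim.y + threadIdx.y + 1;  // row, 1..rows
    int lda = cols + 2;  // assumed: one boundary/halo column on each side

    // row 0, row rows+1, column 0, and column cols+1 hold boundary/halo data
    if (i <= rows && j <= cols) {
        out[IDX(i,j,lda)] = 0.25 * (in[IDX(i-1,j,lda)] + in[IDX(i+1,j,lda)] +
                                    in[IDX(i,j-1,lda)] + in[IDX(i,j+1,lda)]);
    }
}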
Domain Decomposition: 3 Options
- Minimizes surface area/volume ratio
— Communicate less data
— Optimal for bandwidth-bound communication
Tiles
Halo Region
Domain Decomposition: 3 Options
- Minimizes number of neighbors
— Communicate with fewer neighbors
— Optimal for latency-bound communication
- Contiguous if data is column-major
Vertical Stripes
Halo Region
Domain Decomposition: 3 Options
- Minimizes number of neighbors
— Communicate with fewer neighbors
— Optimal for latency-bound communication
- Contiguous if data is row-major
Horizontal Stripes
Halo Region
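To make the tradeoff concrete, a back-of-the-envelope comparison (not from the slides): decomposing an $N \times N$ grid across $P$ GPUs, each interior GPU exchanges per iteration roughly
$$V_{\text{stripes}} = 2N \text{ values with 2 neighbors} \qquad V_{\text{tiles}} = \frac{4N}{\sqrt{P}} \text{ values with 4 neighbors}$$
For $N = 8192$ and $P = 16$, stripes move 16,384 values in 2 transfers while tiles move 8,192 values in 4 transfers: tiles win when bandwidth dominates, stripes when per-transfer latency dominates.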
Grid Indexing
Boundary Condition & Halo Region
[Figure: grid layout — rows 0 and ROWS+1 hold boundary/halo data, rows 1..ROWS are the interior, and LDA is the leading dimension (row stride)]
CUDA multi-GPU APIs
- All memory allocations and kernel launches occur on the currently active device
- Simple APIs to get and set the device:
— cudaSetDevice(int d)
— cudaGetDevice(int *d)
- Memory can be transferred directly between devices
— cudaMemcpyPeer(…)
— cudaMemcpyPeerAsync(…)
- For direct memory copies you must enable peer access
— cudaDeviceCanAccessPeer(&access,i,j)
— cudaDeviceEnablePeerAccess(…)
Enabling Peer Access
for(int i=0;i<GPUS;i++) {
  cudaSetDevice(i);  // calls below apply to device i
  for(int j=0;j<GPUS;j++) {
    if(i!=j) {
      int access;
      cudaDeviceCanAccessPeer(&access,i,j);  // can device i access device j?
      if(access) {
        // peer access is one-directional, so each ordered pair (i,j) is enabled separately
        cudaDeviceEnablePeerAccess(j,0);
      }
    }
  }
}
Hands On Exercise
- Progressive exercise
— 5 Tasks
— We will go until we run out of time
— Each step begins where the last left off
— Feel free to work ahead
Task 1
- Replicate the computation on all devices
— CUDA calls apply to the current device
for(int d=0;d<numDev;d++) {
  cudaSetDevice(d);
  ...
- Use asynchronous memory copies
— Required for devices to operate in parallel
- Otherwise the host would block until each transfer completes
— Memory must be pinned (see the sketch below)
— For now, use stream 0
- cudaMemcpyAsync(…, 0);
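The slides use the host array A without showing its allocation; a minimal sketch, assuming the (ROWS+2) × LDA layout used by the rest of the lab (this exact allocation code is not in the slides):

double *A;
size_t bytes = (ROWS+2)*LDA*sizeof(double);  // assumed layout: grid plus boundary/halo rows
// cudaMallocHost allocates page-locked (pinned) host memory;
// cudaMemcpyAsync needs pinned memory to be truly asynchronous.
cudaMallocHost(&A, bytes);
...
cudaFreeHost(A);  // pinned memory must be freed with cudaFreeHost, not free()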
Allocate & Free Memory on Each Device
double *d_Ain[MAX_DEVICES];   // one array per device
double *d_Aout[MAX_DEVICES];

//allocate device memory
for(int d=0;d<numDev;d++) {   // loop over each device
  cudaSetDevice(d);
  cudaMalloc(&d_Ain[d],bytes);
  cudaMalloc(&d_Aout[d],bytes);
  cudaCheckError();
}
Copy Data To Each Device Asynchronously
//copy initial conditions to both buffers
for(int d=0;d<numDev;d++) {   // loop over each device
  cudaSetDevice(d);
  // asynchronous memory copies
  cudaMemcpyAsync(d_Ain[d],A,bytes,cudaMemcpyHostToDevice,0);
  cudaMemcpyAsync(d_Aout[d],d_Ain[d],bytes,cudaMemcpyDeviceToDevice,0);
  cudaCheckError();
}
Launch the Kernel on Each Device
for(int d=0;d<numDev;d++) {   // loop over each device
  cudaSetDevice(d);
  simpleLaplaceIter_kernel<<<gridSize,blockSize>>>(ROWS,COLS,d_Ain[d],d_Aout[d]);
  cudaCheckError();
}
// swap input and output arrays
for(int d=0;d<numDev;d++) {
  std::swap(d_Ain[d],d_Aout[d]);
}
Copy Data From Each Device Asynchronously
//copy results back to host
for(int d=0;d<numDev;d++) {   // loop over each device
  cudaSetDevice(d);
  // copy asynchronously
  cudaMemcpyAsync(A,d_Ain[d],bytes,cudaMemcpyDeviceToHost,0);
  cudaCheckError();
}
Free Memory on Each Device
//free device memory
for(int d=0;d<numDev;d++) {   // loop over each device
  cudaFree(d_Ain[d]);
  cudaFree(d_Aout[d]);
  cudaCheckError();
}
Task 1: Results
- Test machine: Dual Socket Xeon X5675 with 8 M2090s
- No slowdown from adding more devices
- All GPUs are solving the same problem
— Not a very efficient use of resources
Size        Speedup  Efficiency
2048x2048   1        12.5%
4096x4096   1        12.5%
8192x8192   1        12.5%
Profiler Supports Multi-GPU
- Option 1: Run directly in Nsight or NVVP
— CUDA 5.5 and earlier requires an X display
- Option 2:
— Collect profiles using nvprof:
nvprof -o profile.nvprof ./laplace
— Import the profile into NVVP
- Since we are using IPython, we cannot run the profiler now.
Task 2
- Assign unique work to each GPU
— Horizontal stripes decomposition
- Copy a subset of the memory to each device
- Copy data back to the appropriate host location
- Exchange halo regions at each iteration
Assign Unique Work To Each Device
int rows=ROWS/numDev;   // compute local rows per device
...
// adjust launch dimensions to the local problem size
dim3 gridSize(ceil((double)(COLS)/blockSize.x),
              ceil((double)(rows)/blockSize.y));
...
simpleLaplaceIter_kernel<<<gridSize,blockSize>>>(rows,COLS,d_Ain[d],d_Aout[d]);
Copy Unique Work to the Devices
// compute transfer size: local rows plus two boundary/halo rows
size_t bytes=(rows+2)*LDA*sizeof(double);
...
//copy from host to device (device d owns global rows starting at d*rows)
cudaMemcpyAsync(d_Ain[d]+IDX(0,0,LDA),A+IDX(d*rows,0,LDA),
                bytes,cudaMemcpyHostToDevice,0);
Copy Results to the Host
// compute transfer size: interior rows only
int bytes_int=rows*LDA*sizeof(double);
...
//copy from device to host (device row 1 maps to global row d*rows+1)
cudaMemcpyAsync(A+IDX(d*rows+1,0,LDA),d_Ain[d]+IDX(1,0,LDA),
                bytes_int,cudaMemcpyDeviceToHost,0);
Exchange Halos After Kernel Launch
for(int d=0;d<numDev;d++) {
  cudaSetDevice(d);
  simpleLaplaceIter_kernel<<<gridSize,blockSize>>>(...);
}
for(int d=0;d<numDev;d++) {
  //Grab lower boundary
  //Grab upper boundary
  cudaCheckError();
}
Copy Lower Boundary
//Grab lower boundary: row 0 on GPU d comes from row `rows` on GPU d-1
if(d>0)
  cudaMemcpyPeerAsync(d_Aout[d]+IDX(0,0,LDA),d,        // destination on GPU d
                      d_Aout[d-1]+IDX(rows,0,LDA),d-1, // source on GPU d-1
                      LDA*sizeof(double),0);           // one full row of LDA values
Copy Upper Boundary
//Grab upper boundary: row rows+1 on GPU d comes from row 1 on GPU d+1
if(d<numDev-1)
  cudaMemcpyPeerAsync(d_Aout[d]+IDX(rows+1,0,LDA),d, // destination on GPU d
                      d_Aout[d+1]+IDX(1,0,LDA),d+1,  // source on GPU d+1
                      LDA*sizeof(double),0);
Task 2: Results
- Test machine: Dual Socket Xeon X5675 with 8 M2090s
- Decent speedups have been achieved
- Efficiency increases with problem size
Size        Speedup  Efficiency
2048x2048   3.30     41%
4096x4096   5.67     71%
8192x8192   7.28     91%
Advanced Multi-GPU
- Communication Hiding
— Overlap communication & computation
— Tasks 3-5
— Due to time, this part won't be interactive
- The notebook contains the steps if you want to follow along
Current Implementation
- Compute Laplace
- Exchange Halos
[Figure: timeline — the kernel runs to completion, then the halo copies; nothing overlaps]
Communication Hiding
- Compute Exterior Async
- Compute Interior Async
- Exchange Halos Async
[Figure: timeline — the interior kernel overlaps with the halo copies]
Task 3
- Separate computation into interior & exterior
— Use 1D kernel for boundary
dim3 blockSize_e(128);
dim3 gridSize_e(ceil((double)(COLS)/blockSize_e.x));
— Call existing kernel three times
- Once for the lower boundary
- Once for the upper boundary
- Once for the interior region
Add Exterior Low Kernel
//Exterior Low: compute device row 1 only
simpleLaplaceIter_kernel<<<gridSize_e,blockSize_e>>>(1,COLS,
    d_Ain[d],d_Aout[d]);
Add Exterior High Kernel
//Exterior High: compute device row `rows` only
//(offsetting the pointers by rows-1 rows makes the kernel's row 1 land on row `rows`)
simpleLaplaceIter_kernel<<<gridSize_e,blockSize_e>>>(1,COLS,
    d_Ain[d]+IDX(rows-1,0,LDA),d_Aout[d]+IDX(rows-1,0,LDA));
Add Interior Kernel
//Interior: compute device rows 2..rows-1
simpleLaplaceIter_kernel<<<gridSize,blockSize>>>(rows-2,COLS,
    d_Ain[d]+IDX(1,0,LDA),d_Aout[d]+IDX(1,0,LDA));
Task 3: Results
- Correct results
- About the same speed as Task 2
— Expected: kernels and copies still share stream 0, so nothing overlaps yet
Task 4
- Create 2 streams
— One for kernels
— One for transfers
- Place kernels in one stream
- Place transfers in the other stream
Create Streams
cudaStream_t sTransfer[MAX_DEVICES], sKernels[MAX_DEVICES];
...
cudaStreamCreate(&sTransfer[d]);
cudaStreamCreate(&sKernels[d]);
...
Destroy Streams
...
cudaStreamDestroy(sTransfer[d]);
cudaStreamDestroy(sKernels[d]);
...
Launch Kernels Into Stream
//Exterior Low
simpleLaplaceIter_kernel<<<...,0,sKernels[d]>>>(...);
//Exterior High
simpleLaplaceIter_kernel<<<...,0,sKernels[d]>>>(...);
//Interior
simpleLaplaceIter_kernel<<<...,0,sKernels[d]>>>(...);
Launch Memory Copies Into Stream
for(int d=0;d<numDev;d++) {
  //Grab lower bc
  if(d>0) {
    cudaMemcpyPeerAsync(...,sTransfer[d]);
  }
  //Grab upper bc
  if(d<numDev-1) {
    cudaMemcpyPeerAsync(...,sTransfer[d]);
  }
  cudaCheckError();
}
Task 4: Results
- Slightly faster than before
- Incorrect results
— Why? Race conditions related to the memory copies
Task 5
- Add synchronization to fix race conditions
— Can be done with a single event per device
— 2 race conditions:
- Sending halo data before the exterior region has been computed
- Computing the next iteration before communication has completed
Create and Destroy Event
cudaEvent_t event[MAX_DEVICES];
...
// timing is disabled: the event is used only for synchronization
cudaEventCreateWithFlags(&event[d],cudaEventDisableTiming);
...
cudaEventDestroy(event[d]);
Fix First Race Condition
for(int d=0;d<numDev;d++) {
  cudaSetDevice(d);
  //Exterior Low
  simpleLaplaceIter_kernel<<<...,sKernels[d]>>>(...);
  //Exterior High
  simpleLaplaceIter_kernel<<<...,sKernels[d]>>>(...);
  // record after both exterior kernels: the event completes once the halo rows are ready
  cudaEventRecord(event[d],sKernels[d]);
  //Interior
  simpleLaplaceIter_kernel<<<...,sKernels[d]>>>(...);
  cudaCheckError();
}
// wait until every device has finished its exterior rows before issuing the peer copies
for(int d=0;d<numDev;d++) {
  cudaEventSynchronize(event[d]);
}
Fix Second Race Condition
//Grab lower bc
if(d>0) {
  cudaMemcpyPeerAsync(...);
  cudaEventRecord(event[d],sTransfer[d]);
  // kernels in sKernels[d] must wait until the halo copy completes
  cudaStreamWaitEvent(sKernels[d],event[d],0);
}
//Grab upper bc
if(d<numDev-1) {
  cudaMemcpyPeerAsync(d_Aout[d]+IDX(rows+1,0,LDA),d,
                      d_Aout[d+1]+IDX(1,0,LDA),d+1,
                      bc_size*sizeof(double),sTransfer[d]);
  cudaEventRecord(event[d],sTransfer[d]);
  cudaStreamWaitEvent(sKernels[d],event[d],0);
}
Task 5: Final Results
- Test machine: Dual Socket Xeon X5675 with 8 M2090s
- Efficiency increased
- Perfect scaling at large problem sizes!
Size        Speedup  Efficiency
2048x2048   3.78     47%
4096x4096   7.11     96%
8192x8192   8.29     104%
Start Using Multi-GPU Today
- Use multi-GPU to
— Compute Faster, Larger, and Cheaper
GPU Test Drive
http://www.nvidia.com/GPUTestDrive