
Acceleration of HPC applications on hybrid CPU-GPU systems: When can Multi-Process Service (MPS) help?
GTC 2018, March 28, 2018
Olga Pearce (Lawrence Livermore National Laboratory), http://people.llnl.gov/olga
Max Katz (NVIDIA), Leopold Grinberg (IBM)


  1. Acceleration of HPC applications on hybrid CPU-GPU systems: When can Multi-Process Service (MPS) help?
      GTC 2018, March 28, 2018
      Olga Pearce (Lawrence Livermore National Laboratory), http://people.llnl.gov/olga
      Max Katz (NVIDIA), Leopold Grinberg (IBM)
      This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC. LLNL-PRES-746880

  2. Multi-Process Service (MPS): Allows kernels launched from different (MPI) processes to be processed concurrently on the same GPU
      ◮ Utilize inactive SMs when the work is small (share the GPU ‘in space’)
      [Figure: GPU schedule, SMs vs. time on GPU]

  3. Multi-Process Service (MPS): Allows kernels launched from different (MPI) processes to be processed concurrently on the same GPU
      ◮ Utilize inactive SMs when the work is small (share the GPU ‘in space’)
      ◮ Processes take turns if every SM is occupied (share the GPU ‘in time’)
      [Figures: GPU schedules, SMs vs. time on GPU, for sharing in space and sharing in time]
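
      A minimal MPI + CUDA sketch of the idea (ours, not from the slides; the axpy kernel, problem size, and single shared device are illustrative assumptions): every rank targets the same GPU and launches a modest kernel, and the MPS control daemon, started outside the application, is what allows those per-rank kernels to occupy the GPU's SMs concurrently instead of being time-sliced.

          // Sketch: several MPI ranks sharing one GPU. With MPS running, the
          // per-rank kernels below may execute concurrently on idle SMs;
          // without MPS they are time-sliced. Names and sizes are illustrative.
          #include <mpi.h>
          #include <cuda_runtime.h>

          __global__ void axpy(double a, const double* x, double* y, int n) {
              int i = blockIdx.x * blockDim.x + threadIdx.x;
              if (i < n) y[i] += a * x[i];
          }

          int main(int argc, char** argv) {
              MPI_Init(&argc, &argv);
              int rank = 0;
              MPI_Comm_rank(MPI_COMM_WORLD, &rank);

              cudaSetDevice(0);                 // every rank targets the same GPU
              const int n = 1 << 20;
              double *x = nullptr, *y = nullptr;
              cudaMalloc(&x, n * sizeof(double));
              cudaMalloc(&y, n * sizeof(double));
              cudaMemset(x, 0, n * sizeof(double));
              cudaMemset(y, 0, n * sizeof(double));

              // A modest kernel per rank: small enough to leave SMs idle
              // unless MPS overlaps the ranks' work on the device.
              axpy<<<(n + 255) / 256, 256>>>(2.0, x, y, n);
              cudaDeviceSynchronize();

              cudaFree(x);
              cudaFree(y);
              MPI_Finalize();
              return 0;
          }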

  4. Sierra system architecture finalized and currently under deployment at LLNL
      Compute System: 4,320 nodes, 1.29 PB memory, 240 compute racks, 125 PFLOPS, ≈ 12 MW

  5. Sierra system architecture finalized and currently under deployment at LLNL
      Compute System: 4,320 nodes, 1.29 PB memory, 240 compute racks, 125 PFLOPS, ≈ 12 MW
      Compute Node: 2 IBM POWER9 CPUs, 4 NVIDIA Volta GPUs, 256 GiB DDR4, NVMe-compatible PCIe 1.6 TB SSD, 16 GiB globally addressable HBM2 associated with each GPU, coherent shared memory

  6. Sierra system architecture finalized and currently under deployment at LLNL
      Compute System: 4,320 nodes, 1.29 PB memory, 240 compute racks, 125 PFLOPS, ≈ 12 MW
      Compute Node: 2 IBM POWER9 CPUs, 4 NVIDIA Volta GPUs, 256 GiB DDR4, NVMe-compatible PCIe 1.6 TB SSD, 16 GiB globally addressable HBM2 associated with each GPU, coherent shared memory
      [Figure annotation: CPUs ≈ 5% of node FLOPS, GPUs ≈ 95% of node FLOPS]

  7. Ways to utilize a node of Sierra (showing one socket)
      [Figure: one CPU socket with its two GPUs]

  8. Ways to utilize a node of Sierra (showing one socket)
      ◮ MPI process/core
      [Figure: one CPU socket with its two GPUs; one MPI process on each CPU core]

  9. Ways to utilize a node of Sierra (showing one socket)
      ◮ MPI process/core
      ◮ MPI process/GPU
      [Figures: one MPI process per CPU core vs. one MPI process per GPU]

  10. Ways to utilize a node of Sierra (showing one socket)
      ◮ MPI process/core
      ◮ MPI process/GPU
      ◮ MPI process/core, MPS for GPU
      [Figures: one MPI process per CPU core; one MPI process per GPU; one MPI process per core sharing the GPUs through MPS]

  11. Ways to utilize a node of Sierra (showing one socket)
      ◮ MPI process/core
      ◮ MPI process/GPU
      ◮ MPI process/core, MPS for GPU
      [Figures: one MPI process per CPU core; one MPI process per GPU; one MPI process per core sharing the GPUs through MPS]
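
      A hedged sketch of how the last two configurations differ inside the application (the node-local rank logic and the modulo mapping are our assumptions, not from the slides): with one MPI process per GPU every rank owns a device, while with one MPI process per core several node-local ranks select the same device and count on MPS to share it.

          #include <mpi.h>
          #include <cuda_runtime.h>
          #include <cstdio>

          int main(int argc, char** argv) {
              MPI_Init(&argc, &argv);
              int rank = 0;
              MPI_Comm_rank(MPI_COMM_WORLD, &rank);

              // Node-local rank via a shared-memory communicator.
              MPI_Comm node_comm;
              MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                                  MPI_INFO_NULL, &node_comm);
              int local_rank = 0;
              MPI_Comm_rank(node_comm, &local_rank);

              int num_gpus = 0;
              cudaGetDeviceCount(&num_gpus);
              if (num_gpus > 0) {
                  // "MPI process/GPU": as many local ranks as GPUs, one device each.
                  // "MPI process/core + MPS": more local ranks than GPUs, so several
                  // ranks land on the same device and MPS shares it between them.
                  cudaSetDevice(local_rank % num_gpus);
                  printf("rank %d (node-local %d) -> GPU %d\n",
                         rank, local_rank, local_rank % num_gpus);
              }

              MPI_Comm_free(&node_comm);
              MPI_Finalize();
              return 0;
          }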

  12. Parallel performance of multiphysics simulations
      Decide how to run each phase of the multiphysics simulation:
      ◮ On GPU vs. on CPU
      ◮ How many MPI processes (one per CPU core or one per GPU)?
      ◮ If some phases use one MPI process per CPU core, can we use MPS for the accelerated phases?

  13. Parallel performance of multiphysics simulations
      Decide how to run each phase of the multiphysics simulation:
      ◮ On GPU vs. on CPU
      ◮ How many MPI processes (one per CPU core or one per GPU)?
      ◮ If some phases use one MPI process per CPU core, can we use MPS for the accelerated phases?
      Outline:
      ◮ Tools used for measurement
      ◮ Multiphysics application
      ◮ How the application is accelerated
      ◮ Results: MPI process/GPU vs. 4 MPI processes/GPU + MPS
      ◮ Impact on kernel performance
      ◮ Impact on communication

  14. Tool: Caliper [SC’16], https://github.com/LLNL/Caliper
      ◮ Performance analysis toolbox that leverages existing tools
      ◮ Developed at LLNL; the Caliper team is responsive to our needs
      1. Annotate: begin/end API similar to timer libraries
         ◮ Annotations of libraries (e.g., SAMRAI, hypre) are combined seamlessly
      2. Collect: runtime parameters instruct Caliper what to measure:
         ◮ MPI function calls
         ◮ Linux perf_event sampling (Libpfm)
         ◮ CUDA driver/runtime calls (using CUPTI)
      3. Analyze:
         ◮ Using the JSON output format
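
      As a rough illustration of the begin/end annotation style (a sketch based on Caliper's public C/C++ macros; the function and region names here are made up, and macro names may vary across Caliper versions):

          #include <caliper/cali.h>     // Caliper annotation API

          // Hypothetical physics phase annotated with nested Caliper regions.
          void hydro_step() {
              CALI_CXX_MARK_FUNCTION;            // time the whole function as a region

              CALI_MARK_BEGIN("halo_exchange");  // illustrative region name
              // ... MPI communication ...
              CALI_MARK_END("halo_exchange");

              CALI_MARK_BEGIN("kernels");        // illustrative region name
              // ... RAJA/CUDA kernels ...
              CALI_MARK_END("kernels");
          }

      What gets measured against these regions (MPI calls, perf_event samples, CUPTI traces of CUDA calls) is then selected at run time through Caliper's configuration, as the slide describes.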

  15. Application: ARES is a massively parallel, multi-dimensional, multi-physics code at LLNL
      Physics Capabilities:
      ◮ ALE-AMR Hydrodynamics
      ◮ High-order Eulerian Hydrodynamics
      ◮ Elastic-Plastic flow
      ◮ 3T plasma physics
      ◮ High-Explosive modeling
      ◮ Diffusion, S_N Radiation
      ◮ Particulate flow
      ◮ Laser ray-tracing
      ◮ Magnetohydrodynamics (MHD)
      ◮ Dynamic mixing
      ◮ Non-LTE opacities
      Applications:
      ◮ Inertial Confinement Fusion (ICF)
      ◮ Pulsed power
      ◮ National Ignition Facility debris
      ◮ High-Explosive experiments

  16. ARES
      ◮ 800k lines of C/C++ with MPI
      ◮ 22 years old, used daily on our current supercomputers
      ◮ Single code base effectively utilizes all HPC platforms

  17. ARES uses RAJA, https://github.com/LLNL/RAJA
      ◮ 800k lines of C/C++ with MPI
      ◮ 22 years old, used daily on our current supercomputers
      ◮ Single code base effectively utilizes all HPC platforms
      ◮ Use RAJA as an abstraction layer for on-node parallelization
      ◮ RAJA is a collection of C++ software abstractions
      ◮ Separation of concerns
      C-style for-loop:
          double* x; double* y;
          double a;
          for (int i = begin; i < end; ++i) {
            y[i] += a * x[i];
          }
      RAJA-style loop:
          double* x; double* y;
          double a;
          RAJA::forall<exec_policy>(begin, end, [=] (int i) {
            y[i] += a * x[i];
          });
      ◮ Use different RAJA backends (CUDA, OpenMP)
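
      Switching backends then comes down to the execution policy; a minimal sketch assuming RAJA's documented CUDA and OpenMP policies (the block size, the USE_CUDA guard, and the daxpy function name are our choices; newer RAJA versions also take a RAJA::RangeSegment instead of the bare (begin, end) pair shown on the slide):

          #include <RAJA/RAJA.hpp>

          // Choose the backend by changing only the execution policy.
          #if defined(USE_CUDA)
          using exec_policy = RAJA::cuda_exec<256>;          // CUDA, 256-thread blocks
          #else
          using exec_policy = RAJA::omp_parallel_for_exec;   // OpenMP threads on the CPU
          #endif

          // For the CUDA policy, x and y must be device-accessible
          // (e.g., CUDA managed memory).
          void daxpy(double a, const double* x, double* y, int begin, int end) {
              RAJA::forall<exec_policy>(RAJA::RangeSegment(begin, end),
                                        [=] RAJA_HOST_DEVICE (int i) {
                                            y[i] += a * x[i];
                                        });
          }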

  18. Results: 3D Sedov blast wave problem
      ◮ Hydrodynamics calculation
      ◮ ≈ 80 kernels

  19. Results: 3D Sedov blast wave problem
      ◮ Hydrodynamics calculation
      ◮ ≈ 80 kernels
      ◮ Pre-Sierra machine (rzmanta) with Minsky nodes:
        ◮ 2x IBM POWER8+ CPUs (20 cores)
        ◮ 4x NVIDIA P100 (Pascal) GPUs with 16 GB memory each
        ◮ NVLink 1.0
      ◮ (*) Some results were generated with pre-release versions of compilers; performance improvements are expected in future releases
      ◮ All results shown use 4 Minsky nodes (16 GPUs)

  20. Domain decomposition with and without MPS
      [Figure: domain decomposition with 1 MPI process per GPU vs. 4 MPI processes per GPU + MPS]
      Differences:
      ◮ Computation: work per MPI process
      ◮ Communication: neighbors and surface-to-volume ratio
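
      To make the surface-to-volume point concrete, a back-of-the-envelope sketch (the cubic block, the 2x2x1 split, and the 200-zone side length are our assumptions, not values from the slide): splitting one GPU's block among four ranks divides the volume per rank by four but divides each rank's halo surface by much less, so relatively more zones sit on communication boundaries and each rank also gains neighbors.

          #include <cstdio>

          // Hedged illustration (cubic-block assumption is ours): compare the
          // halo-surface-to-volume ratio of one block per GPU with the same
          // block split 2x2x1 among four MPI ranks.
          int main() {
              const double n = 200.0;              // zones per side of one GPU's block

              double vol1  = n * n * n;            // one rank per GPU
              double surf1 = 6.0 * n * n;

              double vol4  = (n / 2) * (n / 2) * n;                 // one of four sub-blocks
              double surf4 = 2 * (n / 2) * (n / 2) + 4 * (n / 2) * n;

              printf("1 rank/GPU : surface/volume = %.4f\n", surf1 / vol1);   // 6/n
              printf("4 ranks/GPU: surface/volume = %.4f\n", surf4 / vol4);   // 10/n
              return 0;
          }

      With these numbers the ratio grows from 6/n to 10/n, roughly 1.7x more halo surface per zone for the 4-ranks-per-GPU decomposition.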

  21. Overall runtime with and without MPS
      [Plot: time (sec) vs. problem size (zones³); MPI process/GPU vs. 4 MPI processes/GPU + MPS]

  22. Overall runtime with and without MPS
      [Plot: time (sec) vs. problem size (zones³); MPI process/GPU vs. 4 MPI processes/GPU + MPS]
      Differences:
      ◮ Computation
      ◮ Communication
      ◮ Memory

  23. Computation time: Small kernels
      [Plot: time (sec) vs. problem size (zones³); MPI process/GPU vs. 4 MPI processes/GPU + MPS]

  24. Computation time: Small kernels
      [Plot: time (sec) vs. problem size (zones³); MPI process/GPU vs. 4 MPI processes/GPU + MPS]
      ◮ Few zones, small amount of work per zone
      ◮ Dominated by kernel launch overhead
      ◮ MPS may be slightly slower

  25. Computation time: Large kernels
      [Plot: time (sec) vs. problem size (zones³); MPI process/GPU vs. 4 MPI processes/GPU + MPS]

  26. Computation time: Large kernels
      [Plot: time (sec) vs. problem size (zones³); MPI process/GPU vs. 4 MPI processes/GPU + MPS]
      MPS is faster, especially when the problem size is large:
      ◮ Utilizing the GPU better?
        ◮ GPU utilization?
        ◮ GPU occupancy?
      ◮ Utilizing the CPU better?
        ◮ More parallelization?
        ◮ Better utilization of CPU memory bandwidth?

  27. Waiting on the GPU: cudaDeviceSynchronize
      [Plot: time (sec) vs. problem size (zones³); MPI process/GPU vs. 4 MPI processes/GPU + MPS]

  28. Waiting on the GPU: cudaDeviceSynchronize
      [Plot: time (sec) vs. problem size (zones³); MPI process/GPU vs. 4 MPI processes/GPU + MPS]
      ◮ Appear to be waiting on the GPU longer without MPS
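
      The quantity plotted here can be pictured with a hedged sketch (the hand-written timer placement is ours; in the talk this time was presumably gathered through Caliper's CUDA runtime measurement rather than code like this): kernel launches return immediately, and the host then blocks inside cudaDeviceSynchronize until the GPU drains its queue, so the blocked time shrinks when MPS keeps the device busier.

          #include <chrono>
          #include <cstdio>
          #include <cuda_runtime.h>

          __global__ void busy_kernel(double* y, int n) {
              int i = blockIdx.x * blockDim.x + threadIdx.x;
              if (i < n) y[i] = y[i] * 2.0 + 1.0;
          }

          int main() {
              const int n = 1 << 22;
              double* y = nullptr;
              cudaMalloc(&y, n * sizeof(double));
              cudaMemset(y, 0, n * sizeof(double));

              busy_kernel<<<(n + 255) / 256, 256>>>(y, n);   // returns immediately

              auto t0 = std::chrono::steady_clock::now();
              cudaDeviceSynchronize();                       // host blocks here
              auto t1 = std::chrono::steady_clock::now();

              printf("time spent waiting on the GPU: %f s\n",
                     std::chrono::duration<double>(t1 - t0).count());
              cudaFree(y);
              return 0;
          }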

  29. Domain decomposition with and without MPS
      [Figure: domain decomposition with 1 MPI process per GPU vs. 4 MPI processes per GPU + MPS]

  30. Domain decomposition with and without MPS
      [Figure: domain decomposition with 1 MPI process per GPU vs. 4 MPI processes per GPU + MPS]

  31. Domain decomposition with and without MPS
      [Figure: domain decomposition with 1 MPI process per GPU vs. 4 MPI processes per GPU + MPS]
      Differences in communication:
      ◮ Number of neighbors in the halo exchange
      ◮ Surface-to-volume ratio
      ◮ Processor mapping
      ◮ Other
