GPU ACCELERATED COMPUTING IN HPC AND IN THE DATA CENTER
Peter Messmer, DATE 2019, March 27 2019

RISE OF GPU COMPUTING
[Figure: performance versus year on a log scale (10^2 to 10^7), 1980 to 2020. Curve labels: single-threaded perf, 1.5X per year, then 1.1X per year; GPU-computing perf, 1.5X per year, 1000X by 2025. Stack labels: APPLICATIONS, ALGORITHMS, SYSTEMS, CUDA, ARCHITECTURE.]
Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2015 by K. Rupp.

NVIDIA POWERS WORLD’S FASTEST SUPERCOMPUTERS
48% More Systems | 22 of Top 25 Greenest
System        Rank                  GPUs     Performance
ORNL Summit   World’s Fastest       27,648   144 PF
LLNL Sierra   World’s 2nd Fastest   17,280   95 PF
Piz Daint     Europe’s Fastest      5,704    21 PF
ABCI          Japan’s Fastest       4,352    20 PF
ENI HPC4      Fastest Industrial    3,200    12 PF

THE NEW HPC MARKET
SIMULATION | MACHINE LEARNING | DEEP LEARNING

NVIDIA POWERS 5 OF 6 GORDON BELL NOMINATIONS
GPU Acceleration Critical To HPC At Scale Today
Genomics (Prize Winner): 2.36 ExaOps
Weather (Prize Winner): 1.13 ExaOps
Seismic: 1st Soil & Structure Simulation
Material Science: 300X Higher Performance
Quantum Chromodynamics: <1% of Uncertainty Margin

TESLA UNIVERSAL ACCELERATION PLATFORM
Single Platform To Drive Utilization and Productivity
CUSTOMER USECASES
  CONSUMER INTERNET: Speech, Translate, Recommender
  INDUSTRIAL APPLICATIONS: Healthcare, Manufacturing, Finance
  SUPERCOMPUTING: Molecular Simulations, Weather Forecasting, Seismic Mapping
APPS & FRAMEWORKS: Amber, NAMD, +550 Applications
NVIDIA SDK & LIBRARIES
  MACHINE LEARNING | RAPIDS: cuDF, cuML, cuGRAPH
  DEEP LEARNING: cuDNN, cuBLAS, CUTLASS, NCCL, TensorRT
  SUPERCOMPUTING: cuBLAS, cuFFT, OpenACC
CUDA
TESLA GPUs & SYSTEMS: TESLA GPU, VIRTUAL GPU, NVIDIA DGX FAMILY, NVIDIA HGX, SYSTEM OEM, CLOUD

EXPANDING VALUE FOR HPC CUSTOMERS
Partnering With HPC Development Community
MORE PERFORMANCE WITH SAME GPU
[Figure: bar charts for GROMACS (Chemistry), CRYOSPARC (Cryo) and FUN3D (CFD), 2018 and 2019; bar labels 40X, 25X, 22X.]
ADDING NEW AND IMPROVED TOP APPLICATIONS
[Figure: application list with domain labels Microscopy, Genomics and Weather: AMBER, CHROMA, CRYOSPARC, GTC, FUN3D, LAMMPS, GROMACS, MILC, MICROVOLUTION, NAMD, PARABRICKS, WRF, QUANTUM ESP, SPECFEM3D; speedup labels 24x, 24x, 7x, 48x, 22x, 8x.]
CPU Server: Dual Xeon Gold 6140@2.30GHz. GPU Servers: same CPU server w/ 4 NVIDIA V100 PCIe or SXM2 GPUs.

CUDA DEVELOPMENT ECOSYSTEM
Audience labels: GPU Users, Problem Domain Specialists, Algorithm Specialists, New Developers and Optimization Experts
Tools, from Ease of use to Specialized Performance: Applications, Frameworks, Libraries, Directives and Standard Languages, Extended Standard Languages (CUDA-C++, CUDA Fortran)
CUDA: Programming Model, GPU Architecture, System Architecture

NEW PROGRAMMING MODEL FEATURES
Feature areas: Execution Efficiency, Interop, Turing, Precision
Features: Asynchronous Task Graphs, Lightweight Graphics Interop, Multi-Precision Tensor Cores, FP16 Operations, NVCC Enhancements
FP16 Operations:
    atomicAdd(&h, (half)1.15f);
    half2 hvec(0.94f, -2.13f);
    atomicAdd(&h2, hvec);
IEEE 754-2008 FP16 Specification: sign bit, exponent (5 bits), mantissa (10 bits)
    0 | 01110 | 0110101000 = 0.707031
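The FP16 bit pattern on this slide can be checked by hand. With a 5-bit exponent biased by 15 and a 10-bit mantissa, 0 01110 0110101000 encodes 2^(14-15) x (1 + 424/1024) = 0.70703125, which is the 0.707031 shown. The host-side C sketch below decodes such a pattern; the names half_bits_to_float, scale_pow2 and check_slide_example are illustrative only and are not part of CUDA.

```c
#include <assert.h>
#include <math.h>    /* INFINITY and NAN macros only */
#include <stdint.h>

/* Multiply x by 2^e using repeated doubling or halving (exact here). */
static float scale_pow2(float x, int e)
{
    while (e > 0) { x *= 2.0f; --e; }
    while (e < 0) { x *= 0.5f; ++e; }
    return x;
}

/* Decode an IEEE 754-2008 binary16 bit pattern into a float.
   Hypothetical helper for illustration; not a CUDA API. */
static float half_bits_to_float(uint16_t bits)
{
    int sign = (bits >> 15) & 0x1;       /* 1 sign bit */
    int exponent = (bits >> 10) & 0x1F;  /* 5 exponent bits, bias 15 */
    int mantissa = bits & 0x3FF;         /* 10 mantissa bits */
    float value;

    if (exponent == 0x1F) {
        value = mantissa ? NAN : INFINITY;
    } else if (exponent == 0) {
        /* Zero or subnormal: mantissa/1024 * 2^-14, no implicit leading 1. */
        value = scale_pow2((float)mantissa, -24);
    } else {
        /* Normal: (1 + mantissa/1024) * 2^(exponent - 15). */
        value = scale_pow2((float)(1024 + mantissa), exponent - 25);
    }
    return sign ? -value : value;
}

/* The slide's pattern 0 01110 0110101000 is 0x39A8. */
static void check_slide_example(void)
{
    assert(half_bits_to_float(0x39A8) == 0.70703125f);
}
```

On the device, the half type and its conversion intrinsics make this decoding unnecessary; the sketch is only meant to make the sign/exponent/mantissa split concrete.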

INDEPENDENT THREAD SCHEDULING
Communicating Algorithms
Pascal: Lock-Free Algorithms. Threads cannot wait for messages.
Volta/Turing: Starvation Free Algorithms. Threads may wait for messages.

ASYNCHRONOUS TASK GRAPHS
Execution Optimization When Workflow is Known Up-Front
Deep Neural Network Training
DL Inference
Loop & Function offload
Linear Algebra
HPC Simulation

DEFINITION OF A CUDA GRAPH
Graph Nodes Are Not Just Kernel Launches
Sequence of operations, connected by dependencies.
Operations are one of:
  Kernel Launch: CUDA kernel running on GPU
  CPU Function Call: Callback function on CPU
  Memcopy/Memset: GPU data management
  Sub-Graph: Graphs are hierarchical
[Figure: example graph with nodes A, B, X, C, D, E, Y and End.]
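The definition above (operations connected by dependencies) can be modelled on the host without any GPU. The C sketch below is such a model under stated assumptions: it is not the CUDA Graph API, the type and function names (task_graph, add_node, add_dependency, execution_order) are hypothetical, and any edges a caller adds are illustrative because the slide's own edges did not survive extraction. It derives one valid execution order with Kahn's algorithm.

```c
#include <assert.h>

#define MAX_NODES 16

/* The four operation kinds listed on the slide. */
typedef enum { OP_KERNEL, OP_CPU_FUNCTION, OP_MEMOP, OP_SUBGRAPH } op_kind;

typedef struct {
    int count;
    op_kind kind[MAX_NODES];
    int edge[MAX_NODES][MAX_NODES]; /* edge[a][b] != 0: b depends on a */
} task_graph;

/* Append a node and return its index. */
static int add_node(task_graph *g, op_kind kind)
{
    assert(g->count < MAX_NODES);
    g->kind[g->count] = kind;
    return g->count++;
}

/* Record that `after` may only run once `before` has finished. */
static void add_dependency(task_graph *g, int before, int after)
{
    assert(before >= 0 && before < g->count);
    assert(after >= 0 && after < g->count);
    g->edge[before][after] = 1;
}

/* Kahn's algorithm: repeatedly pick a node with no unfinished
   dependencies. Returns the number of nodes placed in `order`;
   a result smaller than g->count means the graph has a cycle. */
static int execution_order(const task_graph *g, int order[MAX_NODES])
{
    int pending[MAX_NODES] = {0};
    int done[MAX_NODES] = {0};
    int n = 0;

    for (int a = 0; a < g->count; ++a)
        for (int b = 0; b < g->count; ++b)
            if (g->edge[a][b]) ++pending[b];

    for (;;) {
        int next = -1;
        for (int i = 0; i < g->count; ++i)
            if (!done[i] && pending[i] == 0) { next = i; break; }
        if (next < 0) break;
        done[next] = 1;
        order[n++] = next;
        for (int b = 0; b < g->count; ++b)
            if (g->edge[next][b]) --pending[b];
    }
    return n;
}
```

In CUDA itself the corresponding steps are performed by the runtime: a graph is built with calls such as cudaGraphCreate and cudaGraphAddKernelNode, then instantiated and launched with cudaGraphInstantiate and cudaGraphLaunch, so the dependency analysis is paid once rather than on every launch.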

WHAT IS OPENACC
Directives-based programming model for parallel computing
Designed for performance and portability on CPUs and GPUs
Open Specification Developed by OpenACC.org Consortium
Add Simple Compiler Directive:
    main()
    {
      <serial code>
      #pragma acc kernels
      {
        <parallel code>
      }
    }
SIMPLE | POWERFUL & PORTABLE
Read more at www.openacc.org/about
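As a concrete instance of the directive pattern on this slide, the C sketch below applies #pragma acc kernels to a SAXPY loop. The function name saxpy and the data clauses (copyin, copy) are choices made for this example, not taken from the slide; the clauses are there so the compiler knows the array extents behind the pointers.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] = a * x[i] + y[i] for i in [0, n). Hypothetical example routine. */
static void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    assert(n == 0 || (x != NULL && y != NULL));

    /* Ask the compiler to parallelize the enclosed region.
       copyin: x is only read on the device; copy: y is read and written. */
    #pragma acc kernels copyin(x[0:n]) copy(y[0:n])
    {
        for (size_t i = 0; i < n; ++i) {
            y[i] = a * x[i] + y[i];
        }
    }
}
```

Built with an OpenACC-enabled compiler (for example pgcc -acc or gcc -fopenacc), the loop may be offloaded to a GPU or parallelized on a multicore CPU. A compiler without OpenACC support ignores the pragma and runs the same source serially, which is the portability argument the slide makes.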

WHO OPENACC IS FOR
The Main Focus
Domain Scientists
  1. Want to do more science & less programming
  2. Believe that GPUs are hard
  3. Need help in learning how to get started easily with GPUs
  4. Mostly don’t have a computer science degree
Application Developers
  Looking for:
  1. easy code maintenance,
  2. better efficiency,
  3. portability
  Mostly computer scientists

OPENACC GROWING MOMENTUM
Wide Adoption Across Key HPC Codes
Over 100 Apps* Using OpenACC: ANSYS Fluent, Gaussian, VASP, LSDalton, MPAS, GAMERA, GTC, XGC, ACME, FLASH, COSMO, Numeca
VASP: Top Quantum Chemistry and Material Science Code
"For VASP, OpenACC is the way forward for GPU acceleration. Performance is similar to CUDA, and OpenACC dramatically decreases GPU development and maintenance efforts. We’re excited to collaborate with NVIDIA and PGI as an early adopter of Unified Memory."
Prof. Georg Kresse, Computational Materials Physics, University of Vienna
[Figure label: silica IFPEN, RMM-DIIS on P100]
* Applications in production and development

SINGLE CODE FOR MULTIPLE PLATFORMS
OpenACC - Performance Portable Programming Model for HPC
Platforms: OpenPOWER, Sunway, x86 CPU, x86 Xeon Phi, NVIDIA GPU, AMD GPU, PEZY-SC
AWE Hydrodynamics CloverLeaf mini-App, bm32 data set, http://uk-mac.github.io/CloverLeaf
[Figure: bar chart of speedup vs. a single Haswell core (axis 0 to 160) for PGI 18.1 OpenACC and Intel 2018 OpenMP. Categories: Multicore Haswell, Multicore Broadwell, Multicore Skylake, Kepler, Pascal, Volta V100; further labels 1x, 2x, 4x. Bar labels: 7.6x, 7.9x, 10x, 10x, 11x, 14.8x, 15x, 40x, 67x, 109x, 142x.]
Systems: Haswell: 2x16 core Haswell server, four K80s, CentOS 7.2 (perf-hsw10); Broadwell: 2x20 core Broadwell server, eight P100s (dgx1-prd-01); Broadwell server, eight V100s (dgx07); Skylake: 2x20 core Xeon Gold server (sky-4).
Compilers: Intel 2018.0.128, PGI 18.1
Benchmark: CloverLeaf v1.3 downloaded from http://uk-mac.github.io/CloverLeaf the week of November 7 2016; CloverLeaf_Serial; CloverLeaf_ref (MPI+OpenMP); CloverLeaf_OpenACC (MPI+OpenACC)
Data compiled by PGI February 2018.

NSIGHT SYSTEMS
System-wide Performance Analysis
Observe Application Behavior: CPU threads, GPU traces, Memory Bandwidth and more
Locate Optimization Opportunities: CUDA & OpenGL APIs, Unified Memory transfers, User Annotations using NVTX
Ready for Big Data: Fast GUI capable of visualizing in excess of 10 million events on laptops, Container support, Minimum user privileges
https://developer.nvidia.com/nsight-systems

[Figure: timeline screenshot annotated with: Thread/core migration; Processes and threads; Thread state; CUDA and OpenGL API trace; cuDNN and cuBLAS trace; Kernel and memory transfer activities; Multi-GPU]

CONTAINERS: SIMPLIFYING WORKFLOWS
WHY CONTAINERS
Simplifies Deployments - Eliminates complex, time-consuming builds and installs
Get started in minutes - Simply Pull & Run the app
Portable - Deploy across various environments, from test to production with minimal changes

NGC CONTAINERS: ACCELERATING WORKFLOWS
WHY CONTAINERS
Simplifies Deployments - Eliminates complex, time-consuming builds and installs
Get started in minutes - Simply Pull & Run the app
Portable - Deploy across various environments, from test to production with minimal changes
WHY NGC CONTAINERS
Optimized for Performance - Monthly DL container releases offer latest features and superior performance on NVIDIA GPUs
Scalable Performance - Supports multi-GPU & multi-node systems for scale-up & scale-out environments
Designed for Enterprise & HPC environments - Supports Docker & Singularity runtimes
Run Anywhere - Pascal/Volta/Turing-powered NVIDIA DGX, PCs, workstations, servers and top cloud platforms

THE NEW NGC
GPU-optimized Software Hub. Simplifying DL, ML and HPC Workflows
50+ Containers: DL, ML, HPC
50+ Pre-trained Models: NLP, Classification, Object Detection & more
10+ Model Training Scripts: NLP, Image Classification, Object Detection & more
Industry Workflows: Medical Imaging, Intelligent Video Analytics
Simplify Deployments | Innovate Faster | Deploy Anywhere
ngc.nvidia.com

NGC-READY ECOSYSTEM
Now Over 50 GPU-Optimized Containers
DEEP LEARNING | MACHINE LEARNING | HPC | VISUALIZATION

RE-IMAGINING DATA SCIENCE WORKFLOW
Open Source, End-to-end GPU-accelerated Workflow Built On CUDA
[Figure: pipeline from data to insights in three stages.]
cuDF: Data preparation / wrangling
cuML: Optimized ML model training
Visualization: Data visualization libraries

RAPIDS: OPEN GPU DATA SCIENCE
Software Stack Python
Workflow stages: Data Preparation (cuDF), Model Training (cuML, cuGRAPH), Visualization
[Figure: stack diagram with layers PYTHON; DASK; RAPIDS (CUDF, CUML, CUGRAPH); DEEP LEARNING FRAMEWORKS; CUDNN; CUDA; APACHE ARROW on GPU Memory.]

ACCELERATING MACHINE LEARNING
The RAPIDS Ecosystem
Open Source Community | Enterprise Data Science Platforms | Deep Learning Integration | Startups | GPU Servers | Storage Partners
