Perlmutter - A 2020 Pre-Exascale GPU-accelerated System for NERSC: Architecture and Application Performance Optimization
Nicholas J. Wright, Perlmutter Chief Architect
GPU Technology Conference, San Jose, March 21, 2019
NERSC is the mission High Performance Computing facility for the DOE SC
7,000 users, 800 projects, 700 codes, ~2,000 NERSC citations per year
Simulations at scale; data analysis support for DOE's experimental and observational facilities
Photo Credit: CAMERA
NERSC has a dual mission to advance science and the state-of-the-art in supercomputing
- We collaborate with computer companies years before a system's delivery to deploy advanced systems with new capabilities at large scale
- We provide a highly customized software and programming environment for science applications
- We are tightly coupled with the workflows of DOE's experimental and observational facilities, ingesting tens of terabytes of data each day
- Our staff provide advanced application and system performance expertise to users
Perlmutter is a Pre-Exascale System

[Roadmap figure: DOE pre-exascale and exascale systems by year]
Pre-Exascale Systems:
- 2013: Mira (Argonne, IBM BG/Q), Titan (ORNL, Cray/NVIDIA K20), Sequoia (LLNL, IBM BG/Q)
- 2016: Cori (LBNL, Cray/Intel Xeon/KNL), Theta (Argonne, Intel/Cray KNL), Trinity (LANL/SNL, Cray/Intel Xeon/KNL)
- 2018: Summit (ORNL, IBM/NVIDIA P9/Volta), Sierra (LLNL, IBM/NVIDIA P9/Volta)
- 2020: Perlmutter (LBNL, Cray/NVIDIA/AMD)
Exascale Systems:
- 2021: A21 (Argonne, Intel/Cray)
- 2021-2023: Frontier (ORNL, TBD), Crossroads (LANL/SNL, TBD), and an LLNL system (TBD)
NERSC Systems Roadmap
- NERSC-7 (2013): Edison, multicore CPU
- NERSC-8 (2016): Cori, manycore CPU; NESAP launched to transition applications to advanced architectures
- NERSC-9 (2020): Perlmutter, CPU and GPU nodes; continued transition of applications and support for complex workflows
- NERSC-10 (2024): Exa system
- NERSC-11 (2028): Beyond Moore

Increasing need for energy-efficient architectures
Cori: A pre-exascale supercomputer for the Office of Science workload
- Cray XC40 system with 9,600+ Intel Knights Landing compute nodes
  ○ 68 cores / 96 GB DRAM / 16 GB HBM per node
  ○ Support the entire Office of Science research community
  ○ Begin to transition the workload to energy-efficient architectures
- 1,600 Haswell processor nodes
- NVRAM Burst Buffer: 1.5 PB, 1.5 TB/s
- 30 PB of disk, >700 GB/s I/O bandwidth
- Integrated with Cori Haswell nodes on the Aries network for data / simulation / analysis on one system
Perlmutter: A System Optimized for Science
- GPU-accelerated and CPU-only nodes meet the needs of large-scale simulation and data analysis from experimental facilities
- Cray "Slingshot": high-performance, scalable, low-latency Ethernet-compatible network
- Single-tier, all-flash, Lustre-based HPC file system with 6x Cori's bandwidth
- Dedicated login and high-memory nodes to support complex workflows
GPU nodes (2-3x Cori):
- 4x NVIDIA "Volta-next" GPUs
  ○ >7 TF
  ○ >32 GiB HBM2
  ○ NVLink
- 1x AMD CPU
- GPUdirect and Unified Virtual Memory (UVM) support
- 4 Slingshot connections (4x25 GB/s)

CPU nodes (~1x Cori):
- AMD "Milan" CPU
  ○ ~64 "Zen 3" cores, 7nm+
  ○ AVX2 SIMD (256-bit)
- 8 channels of DDR memory, >=256 GiB total per node
- 1 Slingshot connection (1x25 GB/s)
How do we optimize the size of each partition?
NERSC System Utilization (Aug '17 - Jul '18)
- 3 codes account for >25% of the workload
- 10 codes account for >50% of the workload
- 30 codes account for >75% of the workload
- Over 600 codes comprise the remaining 25% of the workload
GPU Readiness Among NERSC Codes (Aug’17 - Jul’18)
GPU status, with description and fraction of NERSC hours:
- Enabled (32%): most features are ported and performant
- Kernels (10%): ports of some kernels have been documented
- Proxy (19%): kernels in related codes have been ported
- Unlikely (14%): a GPU port would require major effort
- Unknown (25%): GPU readiness cannot be assessed at this time

A number of applications in the NERSC workload are GPU-enabled already. We will leverage existing GPU codes from CAAR and the community.
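Read as a weighted breakdown of hours, the table implies that roughly 61% of the workload already has at least a partial path to the GPU. A minimal sketch of that arithmetic (illustrative only, not code from the talk):

    #include <stdio.h>

    int main(void) {
        /* Fractions of NERSC hours by GPU-readiness category (table above). */
        double enabled = 0.32, kernels = 0.10, proxy = 0.19;
        double unlikely = 0.14, unknown = 0.25;

        /* Categories with at least a documented or related GPU port. */
        printf("Some GPU path:       %.0f%%\n", 100.0 * (enabled + kernels + proxy));
        printf("Unlikely or unknown: %.0f%%\n", 100.0 * (unlikely + unknown));
        return 0;
    }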
How many GPU nodes to buy: Benchmark Suite Construction & Scalable System Improvement

Select codes to represent the anticipated workload:
- Include key applications from the current workload
- Add apps that are expected to contribute significantly to the future workload

Scalable System Improvement (SSI) measures the aggregate performance of an HPC machine:
- How many more copies of the benchmark can be run relative to the reference machine
- Performance relative to the reference machine

Benchmark applications:
- Quantum Espresso: materials code using DFT
- MILC: QCD code using staggered quarks
- StarLord: compressible radiation hydrodynamics
- DeepCAM: deep learning on weather / Community Atmospheric Model 5 data
- GTC: fusion PIC code
- "CPU Only" (3 total): representative of applications that cannot be ported to GPUs
!!" = #%&'() × +&,)-.( × /(01_3(0_4&'( #%&'()567 × +&,)-.(567× /(01_3(0_4&'(567
Hetero system design & price sensitivity: Budget for GPUs increases as GPU price drops
The chart explores an isocost design space:
- Vary the budget allocated to GPUs
- Assume GPU-enabled applications have a 10x per-node performance advantage; 3 of 8 apps are still CPU-only
- Examine the GPU/CPU node cost ratio

Circles: 50% CPU nodes + 50% GPU nodes. Stars: optimal system configuration.

GPU/CPU $ per node vs. SSI increase over CPU-only (@ budget %):
- 8:1: none; no justification for GPUs
- 6:1: 1.05x @ 25%; slight justification for up to 50% of budget on GPUs
- 4:1: 1.23x @ 50%; GPUs cost-effective up to the full system budget, but optimum at 50%
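The shape of this search is easy to reproduce. Below is a minimal sketch of the sweep under my own simplifying assumptions (normalized budget, 5 of 8 apps with a per-node GPU advantage, GPU-enabled apps may also use CPU nodes, geometric-mean SSI); the exact values in the table come from the authors' more complete analysis:

    #include <math.h>
    #include <stdio.h>

    /* Hypothetical helper: sweep the fraction of a fixed budget spent on
     * GPU nodes and report the split with the best SSI. */
    void best_split(double cost_ratio, double gpu_speedup) {
        const double budget = 1.0, cpu_cost = 1.0;   /* normalized units  */
        const int n_gpu = 5, n_cpu = 3;              /* app mix (5 of 8)  */
        const double ref_nodes = budget / cpu_cost;  /* all-CPU reference */

        double best_ssi = 0.0, best_f = 0.0;
        for (double f = 0.0; f <= 1.0; f += 0.05) {
            double gpu_nodes = f * budget / (cost_ratio * cpu_cost);
            double cpu_nodes = (1.0 - f) * budget / cpu_cost;
            /* GPU apps use both partitions; CPU-only apps use CPU nodes. */
            double s_gpu = (gpu_speedup * gpu_nodes + cpu_nodes) / ref_nodes;
            double s_cpu = cpu_nodes / ref_nodes;
            double ssi = exp((n_gpu * log(s_gpu) + n_cpu * log(s_cpu))
                             / (n_gpu + n_cpu));
            if (ssi > best_ssi) { best_ssi = ssi; best_f = f; }
        }
        printf("%.0f:1 cost, %.0fx speedup -> %2.0f%% of budget on GPUs, SSI %.2fx\n",
               cost_ratio, gpu_speedup, 100.0 * best_f, best_ssi);
    }

    int main(void) {
        best_split(8.0, 10.0);   /* the 8:1, 6:1, 4:1 rows above */
        best_split(6.0, 10.0);
        best_split(4.0, 10.0);
        return 0;
    }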
B. Austin, C. Daley, D. Doerfler, J. Deslippe, B. Cook, B. Friesen, T. Kurth, C. Yang, N. J. Wright, "A Metric for Evaluating Supercomputer Performance in the Era of Extreme Heterogeneity", 9th IEEE International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS18), November 12, 2018.
Application readiness efforts justify larger GPU partitions.
Explore an isocost design space:
- Assume an 8:1 GPU/CPU node cost ratio
- Vary the budget allocated to GPUs
- Examine GPU/CPU performance gains such as those obtained by software optimization and tuning; 5 of 8 codes have 10x, 20x, or 30x speedup

Circles: 50% CPU nodes + 50% GPU nodes. Stars: optimal system configuration.

GPU/CPU perf. per node vs. SSI increase over CPU-only (@ budget %):
- 10x: none; no justification for GPUs
- 20x: 1.15x @ 45%; compare to 1.23x for 10x at a 4:1 GPU/CPU cost ratio
- 30x: 1.40x @ 60%; compare to the 3x from NESAP for KNL
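The same sketch covers this experiment: pin the node-cost ratio at 8:1 and sweep the per-node speedup instead. Reusing the hypothetical best_split() from the earlier sketch:

    best_split(8.0, 10.0);   /* the 10x, 20x, 30x rows above */
    best_split(8.0, 20.0);
    best_split(8.0, 30.0);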
How do we enable NERSC's diverse community of 7,000 users, 750 projects, and 700 codes to run on advanced architectures like Perlmutter and beyond?
- NERSC Exascale Science Application Program (NESAP)
- Engage ~25 applications
- Up to 17 postdoctoral fellows
- Deep partnerships with every SC Office area
- Leverage vendor expertise and hack-a-thons
- Knowledge transfer through documentation and training for all users
- Optimize codes with improvements relevant to multiple architectures
Application Readiness Strategy for Perlmutter
NESAP for Perlmutter will extend activities from NESAP for Cori:
- 1. Identifying and exploiting on-node parallelism
- 2. Understanding and improving data locality within the memory hierarchy

What's new for NERSC users?
- 1. Heterogeneous compute elements
- 2. Identification and exploitation of even more parallelism
- 3. Emphasis on a performance-portable programming approach: continuity from Cori through future NERSC systems
GPU Transition Path for Apps
OpenMP is the most popular non-MPI parallel programming technique
- Results from the ERCAP 2017 user survey
  ○ Question answered by 328 of 658 survey respondents
- The total exceeds 100% because some applications use multiple techniques
OpenMP meets the needs of the NERSC workload
- Supports C, C++, and Fortran
  ○ The NERSC workload consists of ~700 applications with a relatively equal mix of C, C++, and Fortran
- Provides portability to different architectures at other DOE labs
- Works well with MPI: the hybrid MPI+OpenMP approach is successfully used in many NERSC apps
- The recent OpenMP 5.0 specification is the third version providing features for accelerators
  ○ Many refinements over this five-year period
Ensuring OpenMP is ready for Perlmutter CPU+GPU nodes
- NERSC will collaborate with NVIDIA to enable OpenMP GPU acceleration with PGI compilers
  ○ NERSC application requirements will help prioritize OpenMP and base-language features on the GPU
  ○ Co-design of NESAP-2 applications to enable effective use of OpenMP on GPUs and guide the PGI optimization effort
- We want to hear from the larger community
  ○ Tell us your experience, including what OpenMP techniques worked / failed on the GPU
  ○ Share your OpenMP applications targeting GPUs
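For concreteness, here is a minimal sketch of the kind of OpenMP GPU offload being discussed: a DAXPY loop offloaded with OpenMP target directives (OpenMP 4.5-style; the compiler flags that enable GPU offload vary by vendor):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        const int n = 1 << 20;
        const double a = 2.0;
        double *x = malloc(n * sizeof *x);
        double *y = malloc(n * sizeof *y);
        for (int i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

        /* Offload the loop to the device; map clauses move the arrays
         * to and from GPU memory. Runs on the host if no device exists. */
        #pragma omp target teams distribute parallel for \
            map(to: x[0:n]) map(tofrom: y[0:n])
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];

        printf("y[0] = %.1f (expect 4.0)\n", y[0]);
        free(x); free(y);
        return 0;
    }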
Breaking news: NERSC is partnering with NVIDIA/PGI to enable OpenMP GPU acceleration.
Engaging around Performance Portability
- NERSC is a member of the OpenMP ARB
- NERSC is working with PGI/NVIDIA to enable OpenMP GPU acceleration
- Doug Doerfler led the Performance Portability Workshop at SC18 and the 2019 DOE COE Performance Portability Meeting
- NERSC is leading development of performanceportability.org
- NERSC hosted a past C++ Summit and an ISO C++ meeting on HPC
Slingshot Network
- High-performance, scalable interconnect
  – Low latency, high bandwidth, MPI performance enhancements
  – At most 3 hops between any pair of nodes
  – Sophisticated congestion control and adaptive routing to minimize tail latency
- Ethernet compatible
  – Blurs the line between the inside and the outside of the machine
  – Allows for seamless external communication
  – Direct interface to storage
Perlmutter has an All-Flash Filesystem

[System diagram: all-flash Lustre storage delivering 4.0 TB/s to Lustre and >10 TB/s overall, serving CPU + GPU nodes, logins, DTNs, and workflows; Community FS at ~200 PB, ~500 GB/s; terabits/sec to ESnet, ALS, ...]

- Fast across many dimensions
  – 4 TB/s sustained bandwidth
  – 7,000,000 IOPS
  – 3,200,000 file creates/sec
- Usable for NERSC users
  – 30 PB usable capacity
  – Familiar Lustre interfaces
  – New data movement capabilities
- Optimized for NERSC data workloads
  – NEW small-file I/O improvements
  – NEW features for high-IOPS, non-sequential I/O
NERSC-9 will be named after Saul Perlmutter
- Winner of the 2011 Nobel Prize in Physics for the discovery of the accelerating expansion of the universe
- The Supernova Cosmology Project, led by Perlmutter, was a pioneer in using NERSC supercomputers to combine large-scale simulations with experimental data analysis
- Login: "saul.nersc.gov"
Perlmutter: A System Optimized for Science
- Cray Shasta system providing 3-4x the capability of the Cori system
- First NERSC system designed to meet the needs of both large-scale simulation and data analysis from experimental facilities
  ○ Includes both NVIDIA GPU-accelerated and AMD CPU-only nodes
  ○ Cray Slingshot high-performance network will support Terabit-rate connections to the system
  ○ Optimized data software stack enabling analytics and ML at scale
  ○ All-flash filesystem for I/O acceleration
- Robust readiness program for simulation, data, and learning applications and complex workflows
- Delivery in late 2020