HACC: Extreme Scaling and Performance Across Diverse Architectures


SLIDE 1

HACC: Extreme Scaling and Performance Across Diverse Architectures

[Survey images: DES, LSST]

Salman Habib, HEP and MCS Divisions, Argonne National Laboratory
Vitali Morozov, Nicholas Frontiere, Hal Finkel, Adrian Pope, Katrin Heitmann, Kalyan Kumaran, Venkatram Vishwanath, Tom Peterka, Joe Insley, Argonne National Laboratory
David Daniel, Patricia Fasel, Los Alamos National Laboratory
George Zagaris, Kitware
Zarija Lukic, Lawrence Berkeley National Laboratory

HACC (Hardware/Hybrid Accelerated Cosmology Code) Framework

Justin Luitjens, NVIDIA

ASCR, HEP; 100M on Mira, 100M on Titan

SLIDE 2
Motivating HPC: The Computational Ecosystem

• Motivations for large HPC campaigns: 1) quantitative predictions; 2) scientific discovery, exposing mechanisms; 3) system-scale simulations ('impossible experiments'); 4) inverse problems and optimization
• Driven by a wide variety of data sources, computational cosmology must address ALL of the above
• Role of scalability/performance: 1) very large simulations are necessary, but it is not just a matter of running a few large simulations; 2) high throughput is essential; 3) optimal design of simulation campaigns; 4) analysis pipelines and associated infrastructure

SLIDE 3

Data ‘Overload’: Observations of Cosmic Structure

[Figure: CMB temperature anisotropy, theory meets observations (SPT); the same signal in the galaxy distribution (SDSS BOSS)]

• Cosmology = Physics + Statistics
• Mapping the sky with large-area surveys across multiple wave-bands, at remarkably low levels of statistical error

[Figure: galaxies in a moon-sized patch (Deep Lens Survey); LSST will cover 50,000 times this size (~400 PB of data)]

SLIDE 4

Large Scale Structure: Vlasov-Poisson Equation

Cosmological Vlasov-Poisson Equation

• Properties of the cosmological Vlasov-Poisson equation:
  • A 6-D PDE with long-range interactions, no shielding; all scales matter; models gravity-only, collisionless evolution
  • Extreme dynamic range in space and mass (in many applications, a million to one, 'everywhere')
  • Jeans instability drives structure formation at all scales from smooth Gaussian random field initial conditions
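The equation itself did not survive extraction. For reference, a standard comoving-coordinate form of the system (my reconstruction of textbook notation, not copied from the slide) is:

```latex
\frac{\partial f}{\partial t}
  + \frac{\mathbf{p}}{m a^{2}} \cdot \nabla_{\mathbf{x}} f
  - m \, \nabla_{\mathbf{x}} \phi \cdot \nabla_{\mathbf{p}} f = 0,
\qquad
\nabla_{\mathbf{x}}^{2} \phi = 4 \pi G a^{2} \left[ \rho(\mathbf{x}, t) - \rho_{b}(t) \right]
```

Here f(x, p, t) is the matter phase-space distribution, a(t) the scale factor, phi the peculiar gravitational potential, and rho_b the background matter density.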

SLIDE 5

Large Scale Structure Simulation Requirements

• Force and Mass Resolution:
  • Galaxy halos are ~100 kpc, hence force resolution has to be ~kpc; with Gpc box sizes, a dynamic range of a million to one
  • Ratio of largest object mass to lightest is ~10,000:1
• Physics:
  • Gravity dominates at scales greater than ~Mpc
  • Small scales: galaxy modeling, semi-analytic methods to incorporate gas physics/feedback/star formation
• Computing 'Boundary Conditions':
  • Total memory in the PB+ class
  • Performance in the 10 PFlops+ class
  • Wall-clock of ~days/week, with in situ analysis

Can the Universe be run as a short computational ‘experiment’?

[Figure: gravitational Jeans instability over time, shown at 1000 Mpc, 100 Mpc, 20 Mpc, and 2 Mpc scales]

SLIDE 6

Combating Architectural Diversity with HACC

• Architecture-independent performance/scalability: 'universal' top layer + 'plug-in' node-level components; minimize data-structure complexity and data motion
• Programming model: 'C++/MPI + X' where X = OpenMP, Cell SDK, OpenCL, CUDA, ...
• Algorithm Co-Design: multiple algorithm options; stresses accuracy, low memory overhead, no external libraries in the simulation path
• Analysis tools: major analysis framework; tools deployed in stand-alone and in situ modes

Roadrunner, Hopper, Mira/Sequoia, Titan, Edison

[Figure: power spectra ratios across different implementations, with the GPU P3M version as reference (RCB TreePM on BG/Q, RCB TreePM on Hopper, Cell P3M, Gadget-2); P(k) ratios stay within 0.997 to 1.003 over k = 0.1 to 1 h/Mpc]

SLIDE 7

Architectural Challenges

[Images: Mira/Sequoia; Roadrunner, the prototype for modern accelerated architectures]

Architectural 'Features'

• Complex heterogeneous nodes
• Simpler cores, lower memory/core (will weak scaling continue?)
• Skewed compute/communication balance
• Programming models?
• I/O? File systems?
SLIDE 8

Accelerated Systems: Specific Issues


Imbalances and Bottlenecks

• Memory is primarily host-side (32 GB vs. 6 GB, against Roadrunner's 16 GB vs. 16 GB); an important thing to think about (in the case of HACC, the grid/particle balance)
• PCIe is a key bottleneck; overall interconnect bandwidth does not match flops (not even close)
• There's no point in 'sharing' work between the CPU and the GPU; performance gains will be minimal, the GPU must dominate
• The only reason to write a code for such a system is if you can truly exploit its power (2x CPU is a waste of effort!)

Strategies for Success

• It's (still) all about understanding and controlling data motion
• Rethink your code and even your approach to the problem
• Isolate hotspots, and design for portability around them (modular programming)
• Like it or not, pragmas will never be the full answer

SLIDE 9

‘HACC In Pictures’

[Figure: comparison of the Newtonian force, the noisy CIC PM force, the 6th-order sinc-Gaussian spectrally filtered PM force, and the two-particle force]

HACC Top Layer: 3-D domain decomposition with particle replication at boundaries ('overloading') for the spectral PM algorithm (long-range force).

HACC 'Nodal' Layer: short-range solvers employing a combination of flexible chaining mesh and RCB tree-based force evaluations (see the sketch below). [Figure annotations: RCB tree levels, ~50 Mpc, ~1 Mpc]

Host-side GPU: two options, P3M vs. TreePM
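To make the long/short-range hand-over concrete, here is a minimal host-side sketch (my illustration under assumed conventions, not HACC source) of a short-range pair force: the softened Newtonian term minus a polynomial approximation to the spectrally filtered PM force, cut off at the hand-over scale. The polynomial coefficients are placeholders; in practice they would come from a fit to the filtered grid force.

```cpp
#include <cmath>

// Placeholder polynomial approximating the filtered PM force over [0, r_cut);
// real coefficients would come from fitting the 6th-order sinc-Gaussian
// filtered grid force shown in the figure above.
inline float pm_force_poly(float r2) {
    const float c0 = 1.0f, c1 = -0.5f, c2 = 0.1f;  // illustrative values only
    return c0 + r2 * (c1 + r2 * c2);
}

// Short-range force magnitude divided by r, so the caller multiplies by the
// component separations (dx, dy, dz). eps2 is a Plummer-style softening.
inline float f_short_over_r(float r2, float r_cut2, float eps2) {
    if (r2 >= r_cut2) return 0.0f;                 // beyond hand-over: PM only
    float s2 = r2 + eps2;
    float newton = 1.0f / (s2 * std::sqrt(s2));    // softened 1/r^3
    return newton - pm_force_poly(r2);             // subtract long-range part
}
```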

SLIDE 10

HACC: Algorithmic Features

• Fully Spectral Particle-Mesh Solver: 6th-order Green function, 4th-order Super-Lanczos derivatives, high-order spectral filtering, high-accuracy polynomial for short-range forces
• Custom Parallel FFT: pencil-decomposed, high-performance FFT (up to 15K^3)
• Particle Overloading: particle replication at 'node' boundaries to reduce/delay communication (intermittent refreshes); important for accelerated systems
• Flexible Chaining Mesh: used to optimize tree and P3M methods
• Optimal Splitting of Gravitational Forces: spectral particle-mesh melded with direct and RCB ('fat leaf') tree force solvers (PPTPM); short hand-over scale (dynamic range splitting ~10,000 x 100); pseudo-particle method for multipole expansions
• Mixed Precision: optimizes memory and performance (GPU-friendly!)
• Optimized Force Kernels: high performance without assembly
• Adaptive Symplectic Time-Stepping: symplectic sub-cycling of short-range force timesteps; adaptivity from an automatic density estimate via the RCB tree (see the sketch after this list)
• Custom Parallel I/O: topology-aware parallel I/O with lossless compression (factor of 2); a 1.5-trillion-particle checkpoint in 4 minutes at ~160 GB/sec on Mira
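A minimal sketch of the sub-cycled symplectic integrator named in the time-stepping bullet (my illustration; structure and names are assumptions, and the accelerations are assumed to be filled in by the PM and tree/P3M solvers elsewhere):

```cpp
#include <vector>

struct Particle { float x[3], v[3], a_long[3], a_short[3]; };

// Apply a velocity kick from either the long-range (PM) or short-range force.
static void kick(std::vector<Particle>& p, float dt, bool long_range) {
    for (auto& q : p)
        for (int d = 0; d < 3; ++d)
            q.v[d] += dt * (long_range ? q.a_long[d] : q.a_short[d]);
}

// Advance positions with the current velocities.
static void drift(std::vector<Particle>& p, float dt) {
    for (auto& q : p)
        for (int d = 0; d < 3; ++d)
            q.x[d] += dt * q.v[d];
}

// One long time-step: a PM half-kick wraps n_sub short-range kick-drift-kick
// sub-cycles, keeping the whole map symplectic.
void long_step(std::vector<Particle>& p, float dt_long, int n_sub) {
    kick(p, 0.5f * dt_long, /*long_range=*/true);
    const float dt_s = dt_long / n_sub;
    for (int s = 0; s < n_sub; ++s) {
        kick(p, 0.5f * dt_s, /*long_range=*/false);
        drift(p, dt_s);
        kick(p, 0.5f * dt_s, /*long_range=*/false);
        // (in a real code the short-range force is re-evaluated each sub-cycle)
    }
    kick(p, 0.5f * dt_long, /*long_range=*/true);
}
```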

SLIDE 11

HACC on Titan: GPU Implementation (Schematic)

[Figure: spatial blocks in grid units pushed to the GPU; chaining mesh]

P3M Implementation (OpenCL):

• Spatial data is pushed to the GPU in large blocks; data is sub-partitioned into chaining-mesh cubes
• Compute forces between particles in a cube and neighboring cubes (see the CUDA sketch at the end of this slide)
• Natural parallelism and simplicity lead to high performance
• Typical push size ~2 GB; a large push size ensures computation time exceeds memory-transfer latency by a large factor
• More MPI tasks/node preferred over threaded single MPI tasks (better host code performance)

New Implementations (OpenCL and CUDA):

• P3M with data pushed only once per long time-step, completely eliminating memory-transfer latencies (orders of magnitude less); uses a 'soft boundary' chaining mesh rather than rebuilding it every sub-cycle
• TreePM analog of the BG/Q code, written in CUDA, also produces high performance
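The cube-and-neighbors pattern above maps naturally onto a GPU kernel. A minimal CUDA sketch (my illustration; the cell-sorted layout, names, and launch shape are assumptions, not the HACC implementation):

```cuda
// Each block handles one chaining-mesh cell; its threads stride over the
// cell's particles and accumulate forces from the 27 neighboring cells.
// Assumed layout: particles sorted by cell, with cell_start/cell_end ranges;
// pos.w carries the particle mass. Launch with one block per cell.
__global__ void cell_forces(const float4* pos, float4* acc,
                            const int* cell_start, const int* cell_end,
                            int3 dims, float r_cut2, float eps2) {
    int c  = blockIdx.x;                      // this block's cell id
    int cz = c / (dims.x * dims.y);
    int cy = (c / dims.x) % dims.y;
    int cx = c % dims.x;
    for (int i = cell_start[c] + threadIdx.x; i < cell_end[c]; i += blockDim.x) {
        float4 pi = pos[i];
        float3 a = {0.f, 0.f, 0.f};
        for (int dz = -1; dz <= 1; ++dz)      // visit the 27 neighbor cells
        for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            int nx = cx + dx, ny = cy + dy, nz = cz + dz;
            if (nx < 0 || ny < 0 || nz < 0 ||
                nx >= dims.x || ny >= dims.y || nz >= dims.z) continue;
            int n = (nz * dims.y + ny) * dims.x + nx;
            for (int j = cell_start[n]; j < cell_end[n]; ++j) {
                float4 pj = pos[j];
                float rx = pj.x - pi.x, ry = pj.y - pi.y, rz = pj.z - pi.z;
                float r2 = rx * rx + ry * ry + rz * rz;
                if (r2 >= r_cut2 || r2 == 0.f) continue;   // cutoff and self
                float w = pj.w * rsqrtf(r2 + eps2) / (r2 + eps2); // ~ m/r^3
                a.x += w * rx; a.y += w * ry; a.z += w * rz;
            }
        }
        acc[i].x += a.x; acc[i].y += a.y; acc[i].z += a.z;
    }
}
```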

SLIDE 12

HACC on Titan: GPU Implementation Performance

• P3M kernel runs at 1.6 TFlops/node, 40.3% of peak (73% of algorithmic peak)
• TreePM kernel was run on 77% of Titan at 20.54 PFlops, with almost identical performance on the card
• Because of less overhead, the P3M code is (currently) faster by a factor of two in time to solution

[Figure: time (nsec) per substep/particle vs. number of nodes, showing ideal scaling, initial strong scaling, initial weak scaling, improved weak scaling, and TreePM weak scaling; 99.2% parallel efficiency]

SLIDE 13

HACC Science

Simulations with 6 orders of magnitude of dynamic range, exploiting all supercomputing architectures -- advancing science

The Outer Rim Simulation

[Figure: Outer Rim science products: CMB SZ sky map, strong lensing, synthetic catalog, large-scale structure, merger trees, scientific inference of cosmological parameters]

SLIDE 14

Performance Measurement

Flops were measured using the Hardware Performance Monitor (HPM) library/API to access hardware performance counters on BG/Q, as well as manual flop counts for the short-range force kernel on Titan:

• Analyzing the kernel PTX assembly, we are able to calculate the number of floating-point operations per particle interaction (32 Flop/interaction in our case).
• We then, on the host side, count the total number of particle-particle interactions without actually performing the O(N^2) force calculation, a simple calculation with the CM data structure. Multiplying by 32, we get the total number of operations per sub-cycle.
• Dividing by the kernel execution time yields the peak FLOP rate. Using the total execution time instead results in the sustained FLOP rate; an underestimate, yet, as all other tasks are orders of magnitude smaller in terms of FLOPs, it proves sufficient to yield the desired performance metric.
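The arithmetic reduces to a few lines. A toy sketch (the 32 Flop/interaction constant comes from the PTX analysis above; the interaction count and timings are placeholder values):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    const double   flop_per_interaction = 32.0;           // from PTX analysis
    const uint64_t n_interactions = 5000000000000ULL;     // placeholder: counted on host
    const double   t_kernel = 8.0;                        // placeholder kernel seconds
    const double   t_total  = 10.0;                       // placeholder total seconds

    const double total_flop = flop_per_interaction * (double)n_interactions;
    std::printf("peak:      %.2f TFlop/s\n", total_flop / t_kernel / 1e12);
    std::printf("sustained: %.2f TFlop/s\n", total_flop / t_total  / 1e12);
    return 0;
}
```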
