SLIDE 1

Parallelization and Performance of the NIM Weather Model on CPU, GPU and MIC Architectures

Mark Govett
NOAA Earth System Research Laboratory

SLIDE 2

We Need Better Numerical Weather Prediction

"A European forecast that closely predicted Hurricane Sandy's onslaught days ahead of U.S. and other models is raising complaints in the meteorological community." "The U.S. does not lead the world; we are not No. 1 in weather forecasting, I'm very sorry to say that," says AccuWeather's Mike Smith.

October 28, 2012, Hurricane Sandy. Source: USA Today, October 30, 2012

Congressional Response:

  • High Impact Weather Prediction Program (HIWPP)
  • Next Generation Global Prediction System (NGGPS)

Superstorm Sandy

  • Second most destructive in U.S. history
  • $75B in damages
  • Over 200 deaths
SLIDE 3

Three Years Later… Hurricane Joaquin

Some improvement:

  • NOAA's Hurricane Weather Research & Forecast Model intensity forecasts were accurate
  • U.S. research models had 20" precipitation forecasts in South Carolina 36 hours in advance (verified)

But…

  • European models predicted Joaquin would not make landfall (verified)
    – All U.S. models incorrectly predicted landfall
  • The National Hurricane Center correctly never issued any hurricane watches or warnings for the mainland
    – Forecasters relied on the European model for guidance

October 2, 2015

NY Times: "Why U.S. weather model has fallen behind"
Washington Post: "Why the forecast cone of uncertainty is inadequate"

SLIDE 4

Weather Prediction: Forecast Process

  • Operational weather prediction models at NWS are required to run in about 1 percent of real-time
    – A one hour forecast produced in 8.5 minutes
    – Data assimilation, post-processing are similarly constrained

[Process flow: Data Assimilation → NWP → Post-Processing → Forecaster → Stakeholders, with assimilation and NWP running on HPC]

"Accelerators" can speed up data assimilation and Numerical Weather Prediction (NWP)

SLIDE 5

Why Does NWP Need Accelerators?

  • Increasing computer power has provided linear forecast improvement for decades
  • CPU clock speeds have stalled
    – Increased number of processing cores: MIC, GPU
    – Lower energy requirements

[Chart: NCEP Operational Forecast Skill — 36- and 72-hour forecasts at 500 mb over North America, 1955–2015, scored with the 100 * (1 - S1/70) method; the 72-hour forecast now reaches the skill the 36-hour forecast had roughly 15 years earlier. Annotated with the operational computers of each era (IBM 701, IBM 704, IBM 7090, IBM 7094, CDC 6600, IBM 360/195, CYBER 205, CRAY Y-MP, CRAY C90, IBM SP, IBM P655+, IBM P690, IBM Power 6). Source: NCEP Central Operations, January 2015]
SLIDE 6

Resolution Matters: Large Scale

Ocean-Land-Atmosphere Interactions

  • Global operational weather models: 13 km
SLIDE 7

Resolution Matters: Fine-Scale

Simulation of a Tornado-Producing Supercell Storm

[Two panels: 4-km and 1-km simulations with GFDL's variable-resolution FV3, non-hydrostatic (aka cloud-permitting) model. Courtesy of Lin and Harris (2015 manuscript). At 1 km, more intense updrafts produce a tornado.]

SLIDE 8

Better Data Assimilation = Better Forecasts

Hurricane Joaquin

[Map: Hurricane Joaquin track forecasts from 00Z October 1, 2015 — European model, US model with old data assimilation, US model with new data assimilation — compared against the actual track (through 03Z 07 October)]

Source: Corey Guastini, EMC's Model Evaluation Group

SLIDE 9

Formula to Radically Improve U.S. Weather Prediction (and be #1)

  • Increase resolution of global models to 3 km or finer
    – Capture moisture, storm-scale features
    – Coupling atmosphere, ocean, chemistry, land surface
  • Improve data assimilation
    – Use ensemble and time-based variational methods
    – Massive increase in number of observations handled
    – Increase scalability to thousands of cores
  • Increase in computing
    – 100 – 1000 times more than current models use

SLIDE 10

Non-hydrostatic Icosahedral Model (NIM)

  • Experimental global weather forecast model; development began in 2008
  • Uniform icosahedral grid
  • Designed for GPU, MIC
    – Run on 10K GPUs, 600 MICs, 250K CPU cores
    – Tested at 3 km resolution
  • Single source code (Fortran)
    – Serial, parallel execution on CPU, GPU, MIC
  • Parallelization directives
    – GPU: OpenACC, F2C-ACC
    – CPU: OpenMP
    – MIC: OpenMP
    – MPI: SMS
  • Useful for evaluating compilers, GPU & MIC hardware

Fine-Grained Parallelism

  • GPU
    – "Blocks" in horizontal
    – "Threads" in vertical
  • CPU, MIC
    – "Threading" in horizontal
    – "Vectorization" of vertical

(A single-source directive sketch of this mapping follows below.)
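Below is a minimal single-source sketch of that mapping, assuming a generic prognostic update; the routine and variable names (update_field, u, tend, ipn, k) are illustrative and not taken from NIM. The same loop nest carries OpenACC directives for the GPU and OpenMP directives for CPU/MIC, with only one directive family enabled at compile time.

    ! Hypothetical sketch (not NIM source): one Fortran loop nest carrying both
    ! OpenACC (GPU) and OpenMP (CPU/MIC) directives. The horizontal column index
    ! ipn maps to GPU blocks / CPU-MIC threads; the vertical index k is innermost
    ! and maps to GPU threads / CPU-MIC vector lanes.
    subroutine update_field(nz, nip, u, tend, dt)
      implicit none
      integer, intent(in)    :: nz, nip          ! vertical levels, horizontal columns
      real,    intent(inout) :: u(nz, nip)       ! prognostic field, vertical innermost
      real,    intent(in)    :: tend(nz, nip)    ! precomputed tendency
      real,    intent(in)    :: dt
      integer :: ipn, k

    !$acc parallel loop gang copy(u) copyin(tend)   ! GPU: blocks over columns
    !$omp parallel do private(k)                    ! CPU/MIC: threads over columns
      do ipn = 1, nip
    !$acc loop vector                               ! GPU: threads over the vertical
    !$omp simd                                      ! CPU/MIC: vectorize the vertical
        do k = 1, nz
          u(k, ipn) = u(k, ipn) + dt * tend(k, ipn)
        end do
      end do
    end subroutine update_field

Keeping the vertical loop innermost lets the GPU map it to threads within a block and lets the CPU/MIC compilers vectorize it, matching the mapping listed above.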
SLIDE 11

Hardware Comparisons

  • Performance comparisons in literature and presentations can be misleading
  • Ideally want:
    – Same source code
    – Optimized for all architectures
    – Standard, high-volume chips
    – Comparisons in terms of:
      • Device
      • Single node
      • Multi-node
    – Cost-benefit
    – Programmability
SLIDE 12

Device Performance

NIM dynamics, 110 km resolution, 96 vertical levels

[Chart: runtime in seconds by year for Intel CPU, NVIDIA GPU and Intel MIC; reported values include 49.8, 26.8, 20, 15.9, 23.6, 15.1, 13.9, 7.8 and 16.4 sec]

  Year     Intel CPU (cores)   NVIDIA GPU (cores)    Intel MIC (cores)
  2010/11  Westmere (12)       Fermi (448)           —
  2012     SandyBridge (16)    Kepler K20x (2688)    —
  2013     IvyBridge (20)      Kepler K40 (2880)     Knights Corner (61)
  2014     Haswell (24)        Kepler K80 (4992)     —

SLIDE 13
Symmetric Mode Execution: Single Node Performance

120 km resolution, 40,962 columns, 96 vertical levels, 100 time steps
CPU, MIC and GPU run-times shown; GPU runs use F2C-ACC. Numeric values are node run-times (sec) for each configuration.

  Node configuration   Run-time (sec)
  IB20 only            81
  IB24 only            74
  MIC only             73
  GPU only             58
  IB24 + MIC           42
  IB20 + GPU           46
  IB20 + 2 GPU         33

Results from NOAA / ESRL, August 2014
  – IB20: Intel IvyBridge, 20 cores, 3.0 GHz
  – IB24: Intel IvyBridge, 24 cores, 2.70 GHz
  – GPU: Kepler K40, 2880 cores, 745 MHz
  – MIC: KNC 7120, 61 cores, 1.23 GHz

SLIDE 14

Single Node Performance

  • Strong scaling: Intel IvyBridge host with up to 4 NVIDIA K80s (each K80 contains two GPUs)
  • As the work per GPU decreases:
    – inter-GPU communications increase slightly
    – efficiency decreases
  • At least 10,000 columns per GPU is best

[Chart: NIM single-node performance, 40,962 columns, 100 timesteps — runtime and communication time (seconds) versus number of GPUs (2, 4, 6, 8), with columns per GPU ranging from 40,962 down to 5,120 and parallel efficiencies of 0.95, 0.90, 0.77 and 0.71]

SLIDE 15

CPU – GPU Cost-Benefit

  • Dynamics only
  • Different CPU and GPU configurations
    – 40 Haswell CPUs, 20 K80 GPUs
    – incorporate off-node MPI communications
  • All runs executed in the same time
    – Meets a ~1% operational time constraint for a 3 km resolution model
    – 20K columns / GPU used, which equates to 95% GPU strong-scaling efficiency

SLIDE 16

Cost-Benefit – NIM Dynamics

  • 30 km resolution runs in the same execution time with:
    – 40 Intel Haswell CPU nodes (list price: $6,500 each)
    – 20 NVIDIA K80 GPUs (list price: $5,000 each)
  • Execution time represents ~1.5% of real-time for 3 km resolution
    – ~2.75% of real-time when model physics is included

CPU versus GPU Cost-Benefit, NIM 30 km resolution

  numCPUs   K80s per CPU     Cost (thousands $)
  40        0 (CPU only)     260
  20        1                230
  10        2                165
  7         3                145.5
  5         4                132.5
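As a rough check on the chart, assuming the list prices above apply per CPU node and per K80, and that each accelerated configuration uses 20 K80s in total: 40 CPU nodes alone cost 40 × $6,500 = $260K, while 5 CPU nodes with 4 K80s each cost 5 × $6,500 + 20 × $5,000 = $132.5K, roughly half the cost for the same execution time.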

SLIDE 17

Lessons Learned: Code Design

  • Avoid language constructs that are less well supported or difficult for compilers to optimize
    – Pointers, derived types

  • Separate routines for fine-grain (GPU, MIC) and coarse-grain (CPU) parallelism

  • Avoid single-loop kernels
    – High cost of kernel startup, synchronization
  • Avoid large kernels (GPU)
    – Limited fast register, cache / shared memory
  • Use scientific formulations that are highly parallel

(A sketch contrasting per-loop kernels with a fused kernel follows below.)
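The sketch below, with hypothetical routine and array names (not NIM code), contrasts the two patterns: each !$acc parallel region is a separate kernel launch, so fusing small element-wise loops into one region removes repeated launch and synchronization overhead.

    ! Hypothetical illustration (not NIM source): every !$acc parallel region is
    ! a separate GPU kernel launch, so a chain of tiny single-loop kernels pays
    ! the launch and synchronization cost repeatedly.
    subroutine many_small_kernels(nz, a, b, c)
      implicit none
      integer, intent(in)    :: nz
      real,    intent(inout) :: a(nz), c(nz)
      real,    intent(in)    :: b(nz)
      integer :: k
    !$acc parallel loop copy(a) copyin(b)   ! kernel launch #1
      do k = 1, nz
        a(k) = a(k) + b(k)
      end do
    !$acc parallel loop copy(c) copyin(a)   ! kernel launch #2
      do k = 1, nz
        c(k) = c(k) * a(k)
      end do
    end subroutine many_small_kernels

    ! Preferred: fuse the same element-wise work into a single kernel launch.
    subroutine fused_kernel(nz, a, b, c)
      implicit none
      integer, intent(in)    :: nz
      real,    intent(inout) :: a(nz), c(nz)
      real,    intent(in)    :: b(nz)
      integer :: k
    !$acc parallel loop copy(a, c) copyin(b)
      do k = 1, nz
        a(k) = a(k) + b(k)
        c(k) = c(k) * a(k)
      end do
    end subroutine fused_kernel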
SLIDE 18

Lessons Learned: Inter-Process Communications

  • Use of the icosahedral grid gave flexibility in how columns could be distributed among MPI ranks
    – MPI regions should be square to minimize the number of points to be communicated
    – Spiral ordering to eliminate MPI message packing and unpacking helped CPU, GPU, MIC
  • GPUDirect gave a 30% performance improvement
  • CUDA Multi-Process Service (MPS) sped up NIM by 35% on Titan
    – Not reflected in the results shown

(A sketch of the halo exchange enabled by spiral ordering follows below.)
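A minimal sketch of the idea behind spiral ordering is shown below; the routine and arguments are hypothetical and not taken from NIM or SMS. If the columns a neighbor needs are stored contiguously, the halo exchange can pass array slices directly to MPI, with no pack/unpack loops or staging buffers.

    ! Hypothetical sketch (not NIM/SMS source) of the benefit of spiral point
    ! ordering: the columns exchanged with a neighbor occupy contiguous index
    ! ranges, so the exchange operates on array slices in place.
    subroutine halo_exchange(field, nz, send_start, send_count, &
                             recv_start, recv_count, neighbor, comm)
      use mpi
      implicit none
      integer, intent(in)    :: nz, send_start, send_count
      integer, intent(in)    :: recv_start, recv_count, neighbor, comm
      real,    intent(inout) :: field(nz, *)   ! columns ordered so halo data is contiguous
      integer :: req(2), ierr

      ! No pack/unpack loops: contiguous column ranges are communicated directly.
      call MPI_Irecv(field(1, recv_start), nz*recv_count, MPI_REAL, neighbor, 0, comm, req(1), ierr)
      call MPI_Isend(field(1, send_start), nz*send_count, MPI_REAL, neighbor, 0, comm, req(2), ierr)
      call MPI_Waitall(2, req, MPI_STATUSES_IGNORE, ierr)
    end subroutine halo_exchange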

SLIDE 19

Lessons Learned: Fine-Grain

  • Choice of innermost dimension is important
    – Vectorization on CPU, MIC
    – SIMD, coalesced memory access on GPU
    – For NIM, the vertical dimension is used for dynamics
      • Horizontal dimension for physics
  • Innermost dimension should be a multiple of 32 for GPU; bigger is better
    – Multiple of 8 is sufficient for MIC
  • Minimize branching
    – Very few special cases in NIM

(An array-layout sketch follows below.)
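The following sketch (illustrative names, not NIM source) shows the layout lesson: the vertical index is first and therefore contiguous in Fortran's column-major order, and the inner extent is padded to a multiple of 32 for the GPU (a multiple of 8 would suffice for MIC).

    ! Hypothetical sketch (not NIM source) of the dimension-ordering lesson.
    ! Dynamics arrays keep the vertical index k innermost so GPU threads get
    ! coalesced accesses and CPU/MIC loops vectorize; nzp is an illustrative
    ! way to round the inner dimension up to a multiple of 32.
    module layout_sketch
      implicit none
      integer, parameter :: nz  = 96                      ! vertical levels
      integer, parameter :: nzp = ((nz + 31) / 32) * 32   ! padded to a multiple of 32
    contains
      subroutine dynamics_step(nip, u, tend, dt)
        integer, intent(in)    :: nip
        real,    intent(inout) :: u(nzp, nip)             ! vertical innermost
        real,    intent(in)    :: tend(nzp, nip), dt
        integer :: ipn, k
    !$acc parallel loop gang copy(u) copyin(tend)
        do ipn = 1, nip
    !$acc loop vector
          do k = 1, nz                                    ! contiguous, coalesced accesses
            u(k, ipn) = u(k, ipn) + dt * tend(k, ipn)
          end do
        end do
      end subroutine dynamics_step
    end module layout_sketch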

SLIDE 20

Improved OpenACC Compilers

  • Performance of PGI nearly matches F2C-ACC
    – Was 2.1X slower in 2014
  • Cray was 1.7X slower
  • PGI does a good job with analysis, data movement
    – Use !$acc kernels to get the application running
      • 800-line MPAS kernel running on GPU in 10 minutes
    – Use !$acc parallel to optimize performance
    – Use !$acc data to handle data movement
    – Diagnostic output to guide parallelization, optimization
  • Cray, IBM comparisons planned

(A sketch of this kernels → parallel → data progression follows below.)
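A sketch of that progression is shown below on a generic time-stepping routine (not MPAS or NIM code): !$acc kernels to get a correct GPU version quickly, !$acc parallel loop to tune the hot loop nest, and an enclosing !$acc data region so arrays stay resident on the device between kernels.

    ! Hypothetical example (not MPAS or NIM code) of the porting progression:
    ! start with !$acc kernels for a correct GPU version, refine hot loops with
    ! !$acc parallel loop, and wrap the time loop in an !$acc data region so
    ! arrays stay on the device between kernel launches.
    subroutine advance(nsteps, nz, nip, u, tend, dt)
      implicit none
      integer, intent(in)    :: nsteps, nz, nip
      real,    intent(inout) :: u(nz, nip)
      real,    intent(in)    :: tend(nz, nip), dt
      integer :: istep, ipn, k

    !$acc data copy(u) copyin(tend)          ! step 3: keep data resident on the device
      do istep = 1, nsteps

        ! Step 1: let the compiler analyze and generate kernels automatically.
    !$acc kernels default(present)
        u = u + dt * tend
    !$acc end kernels

        ! Step 2: explicitly map the hot loop nest for best performance.
    !$acc parallel loop gang vector collapse(2) default(present)
        do ipn = 1, nip
          do k = 1, nz
            u(k, ipn) = max(u(k, ipn), 0.0)  ! illustrative limiter
          end do
        end do

      end do
    !$acc end data
    end subroutine advance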
SLIDE 21

Summary

  • Goal to radically improve U.S. weather prediction (and be #1)
    – Develop and run NGGPS model at 3 km
      • MPAS or FV3 selection in May 2016
      • Lessons learned with NIM will guide parallelization
    – Significant improvement in data assimilation
      • Algorithms, techniques must be scalable to tens of thousands of compute cores
    – Fine-grain computing
      • OpenACC, OpenMP compilers