SLIDE 1

Parallelization and Performance of the NIM Weather Model on CPU, GPU and MIC Architectures

Mark Govett
NOAA Earth System Research Laboratory

SLIDE 2

We Need Better Numerical Weather Prediction

"A European forecast that closely predicted Hurricane Sandy's onslaught days ahead of U.S. and other models is raising complaints in the meteorological community." "The U.S. does not lead the world; we are not No. 1 in weather forecasting, I'm very sorry to say that," says AccuWeather's Mike Smith.

October 28, 2012, Hurricane Sandy. Source: USA Today, October 30, 2012

Congressional Response:

  • High Impact Weather Prediction Program (HIWPP)
  • Next Generation Global Prediction System (NGGPS)

Superstorm Sandy

  • Second most destructive in U.S. history
  • $75B in damages
  • Over 200 deaths
SLIDE 3

Three Years Later… Hurricane Joaquin

Some improvement:

  • NOAA's Hurricane Weather Research & Forecast Model intensity forecasts were accurate
  • U.S. research models had 20" precipitation forecasts in South Carolina 36 hours in advance (verified)

But…

  • European models predicted Joaquin would not make landfall (verified)
    – All U.S. models incorrectly predicted landfall
  • The National Hurricane Center correctly never issued any hurricane watches or warnings for the mainland
    – Forecasters relied on the European model for guidance

October 2, 2015

NY Times: "Why U.S. weather model has fallen behind"
Washington Post: "Why the forecast cone of uncertainty is inadequate"

SLIDE 4

Weather Prediction: Forecast Process

  • Operational weather prediction models at NWS are required to run in about 1 percent of real-time
    – A one hour forecast produced in 8.5 minutes
    – Data assimilation, post-processing are similarly constrained

[Process flow: Data Assimilation → NWP → Post-Processing → Forecaster → Stakeholders, with assimilation and NWP running on HPC]

"Accelerators" can speed up data assimilation and Numerical Weather Prediction (NWP)

SLIDE 5

Why Does NWP Need Accelerators?

  • Increasing computer power has provided linear forecast improvement for decades
  • CPU clock speeds have stalled
    – Increased number of processing cores: MIC, GPU
    – Lower energy requirements

[Chart: NCEP Operational Forecast Skill — 36- and 72-hour forecasts at 500 mb over North America, 1955–2015, scored with the 100 * (1 - S1/70) method; the 72-hour forecast now reaches the skill the 36-hour forecast had roughly 15 years earlier. Annotated with the operational computers of each era (IBM 701, IBM 704, IBM 7090, IBM 7094, CDC 6600, IBM 360/195, CYBER 205, CRAY Y-MP, CRAY C90, IBM SP, IBM P655+, IBM P690, IBM Power 6). Source: NCEP Central Operations, January 2015]
SLIDE 6

Resolution Matters: Large Scale

Ocean-Land-Atmosphere Interactions

  • Global operational weather models: 13 km
SLIDE 7

Resolution Matters: Fine-Scale

Simulation of a Tornado-Producing Supercell Storm

[Two panels: 4-km and 1-km simulations with GFDL's variable-resolution FV3, non-hydrostatic (aka cloud-permitting) model. Courtesy of Lin and Harris (2015 manuscript). At 1 km, more intense updrafts produce a tornado.]

SLIDE 8

Better Data Assimilation = Better Forecasts

Hurricane Joaquin

[Map: Hurricane Joaquin track forecasts from 00Z October 1, 2015 — European model, US model with old data assimilation, US model with new data assimilation — compared against the actual track (through 03Z 07 October)]

Source: Corey Guastini, EMC's Model Evaluation Group

SLIDE 9

Formula to Radically Improve U.S. Weather Prediction (and be #1)

  • Increase resolution of global models to 3 km or finer
    – Capture moisture, storm-scale features
    – Coupling atmosphere, ocean, chemistry, land surface
  • Improve data assimilation
    – Use ensemble and time-based variational methods
    – Massive increase in number of observations handled
    – Increase scalability to thousands of cores
  • Increase in computing
    – 100 – 1000 times more than current models use

SLIDE 10

Non-hydrostatic Icosahedral Model (NIM)

  • Experimental global weather forecast model; development began in 2008
  • Uniform icosahedral grid
  • Designed for GPU, MIC
    – Run on 10K GPUs, 600 MICs, 250K CPU cores
    – Tested at 3 km resolution
  • Single source code (Fortran)
    – Serial, parallel execution on CPU, GPU, MIC
  • Parallelization directives
    – GPU: OpenACC, F2C-ACC
    – CPU: OpenMP
    – MIC: OpenMP
    – MPI: SMS
  • Useful for evaluating compilers, GPU & MIC hardware

Fine-Grained Parallelism

  • GPU
    – "Blocks" in horizontal
    – "Threads" in vertical
  • CPU, MIC
    – "Threading" in horizontal
    – "Vectorization" of vertical

(A single-source directive sketch of this mapping follows below.)
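Below is a minimal single-source sketch of that mapping, assuming a generic prognostic update; the routine and variable names (update_field, u, tend, ipn, k) are illustrative and not taken from NIM. The same loop nest carries OpenACC directives for the GPU and OpenMP directives for CPU/MIC, with only one directive family enabled at compile time.

    ! Hypothetical sketch (not NIM source): one Fortran loop nest carrying both
    ! OpenACC (GPU) and OpenMP (CPU/MIC) directives. The horizontal column index
    ! ipn maps to GPU blocks / CPU-MIC threads; the vertical index k is innermost
    ! and maps to GPU threads / CPU-MIC vector lanes.
    subroutine update_field(nz, nip, u, tend, dt)
      implicit none
      integer, intent(in)    :: nz, nip          ! vertical levels, horizontal columns
      real,    intent(inout) :: u(nz, nip)       ! prognostic field, vertical innermost
      real,    intent(in)    :: tend(nz, nip)    ! precomputed tendency
      real,    intent(in)    :: dt
      integer :: ipn, k

    !$acc parallel loop gang copy(u) copyin(tend)   ! GPU: blocks over columns
    !$omp parallel do private(k)                    ! CPU/MIC: threads over columns
      do ipn = 1, nip
    !$acc loop vector                               ! GPU: threads over the vertical
    !$omp simd                                      ! CPU/MIC: vectorize the vertical
        do k = 1, nz
          u(k, ipn) = u(k, ipn) + dt * tend(k, ipn)
        end do
      end do
    end subroutine update_field

Keeping the vertical loop innermost lets the GPU map it to threads within a block and lets the CPU/MIC compilers vectorize it, matching the mapping listed above.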
SLIDE 11

Hardware Comparisons

  • Performance comparisons in literature and presentations can be misleading
  • Ideally want:
    – Same source code
    – Optimized for all architectures
    – Standard, high-volume chips
    – Comparisons in terms of:
      • Device
      • Single node
      • Multi-node
    – Cost-benefit
    – Programmability
SLIDE 12

Device Performance

NIM dynamics, 110 km resolution, 96 vertical levels

[Chart: runtime in seconds by year for Intel CPU, NVIDIA GPU and Intel MIC; reported values include 49.8, 26.8, 20, 15.9, 23.6, 15.1, 13.9, 7.8 and 16.4 sec]

  Year     Intel CPU (cores)   NVIDIA GPU (cores)    Intel MIC (cores)
  2010/11  Westmere (12)       Fermi (448)           —
  2012     SandyBridge (16)    Kepler K20x (2688)    —
  2013     IvyBridge (20)      Kepler K40 (2880)     Knights Corner (61)
  2014     Haswell (24)        Kepler K80 (4992)     —

SLIDE 13
Symmetric Mode Execution: Single Node Performance

120 km resolution, 40,962 columns, 96 vertical levels, 100 time steps
CPU, MIC and GPU run-times shown; GPU runs use F2C-ACC. Numeric values are node run-times (sec) for each configuration.

  Node configuration   Run-time (sec)
  IB20 only            81
  IB24 only            74
  MIC only             73
  GPU only             58
  IB24 + MIC           42
  IB20 + GPU           46
  IB20 + 2 GPU         33

Results from NOAA / ESRL, August 2014
  – IB20: Intel IvyBridge, 20 cores, 3.0 GHz
  – IB24: Intel IvyBridge, 24 cores, 2.70 GHz
  – GPU: Kepler K40, 2880 cores, 745 MHz
  – MIC: KNC 7120, 61 cores, 1.23 GHz

SLIDE 14

Single Node Performance

  • Strong scaling: Intel IvyBridge host with up to 4 NVIDIA K80s (each K80 contains two GPUs)
  • As the work per GPU decreases:
    – inter-GPU communications increase slightly
    – efficiency decreases
  • At least 10,000 columns per GPU is best

[Chart: NIM single-node performance, 40,962 columns, 100 timesteps — runtime and communication time (seconds) versus number of GPUs (2, 4, 6, 8), with columns per GPU ranging from 40,962 down to 5,120 and parallel efficiencies of 0.95, 0.90, 0.77 and 0.71]

SLIDE 15

CPU – GPU Cost-Benefit

  • Dynamics only
  • Different CPU and GPU configurations
    – 40 Haswell CPUs, 20 K80 GPUs
    – incorporate off-node MPI communications
  • All runs executed in the same time
    – Meets a ~1% operational time constraint for a 3 km resolution model
    – 20K columns / GPU used, which equates to 95% GPU strong-scaling efficiency

SLIDE 16

Cost-Benefit – NIM Dynamics

  • 30 km resolution runs in the same execution time with:
    – 40 Intel Haswell CPU nodes (list price: $6,500 each)
    – 20 NVIDIA K80 GPUs (list price: $5,000 each)
  • Execution time represents ~1.5% of real-time for 3 km resolution
    – ~2.75% of real-time when model physics is included

CPU versus GPU Cost-Benefit, NIM 30 km resolution

  numCPUs   K80s per CPU     Cost (thousands $)
  40        0 (CPU only)     260
  20        1                230
  10        2                165
  7         3                145.5
  5         4                132.5
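As a rough check on the chart, assuming the list prices above apply per CPU node and per K80, and that each accelerated configuration uses 20 K80s in total: 40 CPU nodes alone cost 40 × $6,500 = $260K, while 5 CPU nodes with 4 K80s each cost 5 × $6,500 + 20 × $5,000 = $132.5K, roughly half the cost for the same execution time.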

SLIDE 17

Lessons Learned: Code Design

  • Avoid language constructs that are less well supported or difficult for compilers to optimize
    – Pointers, derived types

  • Separate routines for fine-grain (GPU, MIC) and coarse-grain (CPU) parallelism

  • Avoid single-loop kernels
    – High cost of kernel startup, synchronization
  • Avoid large kernels (GPU)
    – Limited fast register, cache / shared memory
  • Use scientific formulations that are highly parallel

(A sketch contrasting per-loop kernels with a fused kernel follows below.)
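The sketch below, with hypothetical routine and array names (not NIM code), contrasts the two patterns: each !$acc parallel region is a separate kernel launch, so fusing small element-wise loops into one region removes repeated launch and synchronization overhead.

    ! Hypothetical illustration (not NIM source): every !$acc parallel region is
    ! a separate GPU kernel launch, so a chain of tiny single-loop kernels pays
    ! the launch and synchronization cost repeatedly.
    subroutine many_small_kernels(nz, a, b, c)
      implicit none
      integer, intent(in)    :: nz
      real,    intent(inout) :: a(nz), c(nz)
      real,    intent(in)    :: b(nz)
      integer :: k
    !$acc parallel loop copy(a) copyin(b)   ! kernel launch #1
      do k = 1, nz
        a(k) = a(k) + b(k)
      end do
    !$acc parallel loop copy(c) copyin(a)   ! kernel launch #2
      do k = 1, nz
        c(k) = c(k) * a(k)
      end do
    end subroutine many_small_kernels

    ! Preferred: fuse the same element-wise work into a single kernel launch.
    subroutine fused_kernel(nz, a, b, c)
      implicit none
      integer, intent(in)    :: nz
      real,    intent(inout) :: a(nz), c(nz)
      real,    intent(in)    :: b(nz)
      integer :: k
    !$acc parallel loop copy(a, c) copyin(b)
      do k = 1, nz
        a(k) = a(k) + b(k)
        c(k) = c(k) * a(k)
      end do
    end subroutine fused_kernel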
SLIDE 18

Lessons Learned: Inter-Process Communications

  • Use of the icosahedral grid gave flexibility in how columns could be distributed among MPI ranks
    – MPI regions should be square to minimize the number of points to be communicated
    – Spiral ordering to eliminate MPI message packing and unpacking helped CPU, GPU, MIC
  • GPUDirect gave a 30% performance improvement
  • CUDA Multi-Process Service (MPS) sped up NIM by 35% on Titan
    – Not reflected in the results shown

(A sketch of the halo exchange enabled by spiral ordering follows below.)
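A minimal sketch of the idea behind spiral ordering is shown below; the routine and arguments are hypothetical and not taken from NIM or SMS. If the columns a neighbor needs are stored contiguously, the halo exchange can pass array slices directly to MPI, with no pack/unpack loops or staging buffers.

    ! Hypothetical sketch (not NIM/SMS source) of the benefit of spiral point
    ! ordering: the columns exchanged with a neighbor occupy contiguous index
    ! ranges, so the exchange operates on array slices in place.
    subroutine halo_exchange(field, nz, send_start, send_count, &
                             recv_start, recv_count, neighbor, comm)
      use mpi
      implicit none
      integer, intent(in)    :: nz, send_start, send_count
      integer, intent(in)    :: recv_start, recv_count, neighbor, comm
      real,    intent(inout) :: field(nz, *)   ! columns ordered so halo data is contiguous
      integer :: req(2), ierr

      ! No pack/unpack loops: contiguous column ranges are communicated directly.
      call MPI_Irecv(field(1, recv_start), nz*recv_count, MPI_REAL, neighbor, 0, comm, req(1), ierr)
      call MPI_Isend(field(1, send_start), nz*send_count, MPI_REAL, neighbor, 0, comm, req(2), ierr)
      call MPI_Waitall(2, req, MPI_STATUSES_IGNORE, ierr)
    end subroutine halo_exchange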

SLIDE 19

Lessons Learned: Fine-Grain

  • Choice of innermost dimension is important
    – Vectorization on CPU, MIC
    – SIMD, coalesced memory access on GPU
    – For NIM, the vertical dimension is used for dynamics
      • Horizontal dimension for physics
  • Innermost dimension should be a multiple of 32 for GPU; bigger is better
    – Multiple of 8 is sufficient for MIC
  • Minimize branching
    – Very few special cases in NIM

(An array-layout sketch follows below.)
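The following sketch (illustrative names, not NIM source) shows the layout lesson: the vertical index is first and therefore contiguous in Fortran's column-major order, and the inner extent is padded to a multiple of 32 for the GPU (a multiple of 8 would suffice for MIC).

    ! Hypothetical sketch (not NIM source) of the dimension-ordering lesson.
    ! Dynamics arrays keep the vertical index k innermost so GPU threads get
    ! coalesced accesses and CPU/MIC loops vectorize; nzp is an illustrative
    ! way to round the inner dimension up to a multiple of 32.
    module layout_sketch
      implicit none
      integer, parameter :: nz  = 96                      ! vertical levels
      integer, parameter :: nzp = ((nz + 31) / 32) * 32   ! padded to a multiple of 32
    contains
      subroutine dynamics_step(nip, u, tend, dt)
        integer, intent(in)    :: nip
        real,    intent(inout) :: u(nzp, nip)             ! vertical innermost
        real,    intent(in)    :: tend(nzp, nip), dt
        integer :: ipn, k
    !$acc parallel loop gang copy(u) copyin(tend)
        do ipn = 1, nip
    !$acc loop vector
          do k = 1, nz                                    ! contiguous, coalesced accesses
            u(k, ipn) = u(k, ipn) + dt * tend(k, ipn)
          end do
        end do
      end subroutine dynamics_step
    end module layout_sketch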

SLIDE 20

Improved OpenACC Compilers

  • Performance of PGI nearly matches F2C-ACC
    – Was 2.1X slower in 2014
  • Cray was 1.7X slower
  • PGI does a good job with analysis, data movement
    – Use !$acc kernels to get the application running
      • 800-line MPAS kernel running on GPU in 10 minutes
    – Use !$acc parallel to optimize performance
    – Use !$acc data to handle data movement
    – Diagnostic output to guide parallelization, optimization
  • Cray, IBM comparisons planned

(A sketch of this kernels → parallel → data progression follows below.)
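A sketch of that progression is shown below on a generic time-stepping routine (not MPAS or NIM code): !$acc kernels to get a correct GPU version quickly, !$acc parallel loop to tune the hot loop nest, and an enclosing !$acc data region so arrays stay resident on the device between kernels.

    ! Hypothetical example (not MPAS or NIM code) of the porting progression:
    ! start with !$acc kernels for a correct GPU version, refine hot loops with
    ! !$acc parallel loop, and wrap the time loop in an !$acc data region so
    ! arrays stay on the device between kernel launches.
    subroutine advance(nsteps, nz, nip, u, tend, dt)
      implicit none
      integer, intent(in)    :: nsteps, nz, nip
      real,    intent(inout) :: u(nz, nip)
      real,    intent(in)    :: tend(nz, nip), dt
      integer :: istep, ipn, k

    !$acc data copy(u) copyin(tend)          ! step 3: keep data resident on the device
      do istep = 1, nsteps

        ! Step 1: let the compiler analyze and generate kernels automatically.
    !$acc kernels default(present)
        u = u + dt * tend
    !$acc end kernels

        ! Step 2: explicitly map the hot loop nest for best performance.
    !$acc parallel loop gang vector collapse(2) default(present)
        do ipn = 1, nip
          do k = 1, nz
            u(k, ipn) = max(u(k, ipn), 0.0)  ! illustrative limiter
          end do
        end do

      end do
    !$acc end data
    end subroutine advance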
SLIDE 21

Summary

  • Goal to radically improve U.S. weather prediction (and be #1)
    – Develop and run NGGPS model at 3 km
      • MPAS or FV3 selection in May 2016
      • Lessons learned with NIM will guide parallelization
    – Significant improvement in data assimilation
      • Algorithms, techniques must be scalable to tens of thousands of compute cores
    – Fine-grain computing
      • OpenACC, OpenMP compilers