

SLIDE 1

CUDA Implementation of the Weather Research and Forecasting (WRF) Model

Bormin Huang
Space Science and Engineering Center, University of Wisconsin-Madison
SC13, NVIDIA Booth #613, Colorado Convention Center

SLIDE 2

Outline

Numerical weather prediction (NWP) Weather Research and Forecasting (WRF) Model GPU WSM5 Optimization Benchmarks Validation of the results Conclusions

Image: Wielicki, Bruce A., and Coauthors, 2013: Achieving Climate Change Absolute Accuracy in Orbit. Bull. Amer. Meteor. Soc., 94, 1519–1539.

SLIDE 3

What is numerical weather prediction (NWP)?

Weather models use systems of differential equations based on the laws of physics, fluid motion, and chemistry, and use a coordinate system that divides the planet into a 3D grid. Winds, heat transfer, solar radiation, relative humidity, and surface hydrology are calculated within each grid cell, and interactions with neighboring cells are used to calculate atmospheric properties at future times.
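Purely as an illustration of this neighbor-interaction idea (not WRF code; the kernel name, field, and coefficient below are made up), one explicit time step of a toy one-dimensional diffusion equation looks like this in CUDA C:

    // Purely illustrative, not WRF code: one explicit time step of a toy
    // 1D diffusion equation, where each cell's next value is computed
    // from its own value and its two neighbors' values.
    __global__ void toy_step(const float *cur, float *next, int n, float k)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i > 0 && i < n - 1) {
            // new state = old state + flux exchanged with the neighbors
            next[i] = cur[i] + k * (cur[i - 1] - 2.0f * cur[i] + cur[i + 1]);
        }
    }

A real model advances many coupled 3D fields this way, one time step after another.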

Wikimedia Commons

  • Numerical weather prediction uses mathematical models of the atmosphere and oceans to predict the weather based on current weather conditions.
  • First attempted in the 1920s
  • Computer simulation in the 1950s -> NWP produced realistic results
  • Advances in NWP are linked with advances in computer science
  • A major application in the HPC business

Atmospheric model schematic

SLIDE 4

Grid spacing (resolution)

Wikimedia Commons: NASA satellite photograph of the Hawaiian Islands

  • Grid spacing (resolution) defines the scale of features you can simulate with the model
  • "Global" vs. "regional": regional = higher resolution over a smaller domain

SLIDE 5

WRF Overview

WRF simulation of Hurricane Rita (2005) tracks

  • WRF is a mesoscale and global Weather Research and Forecasting model
  • Designed for both operational forecasters and atmospheric researchers
  • WRF is currently in operational use at numerous weather centers around the world
  • WRF is suitable for a broad spectrum of applications across domain scales ranging from meters to hundreds of kilometers
  • Increases in computational power enable:
    • Increased vertical as well as horizontal resolution
    • More timely delivery of forecasts
    • Probabilistic forecasts based on ensemble methods
  • Why accelerators?
    • Cost performance
    • Need for strong scaling

Wikimedia Commons Image: Welcome Remarks, 14th Annual WRF Users’ Workshop.

SLIDE 6

WRF system components

  • The WRF physics categories are microphysics, cumulus parametrization, planetary boundary layer (PBL), land-surface model, and radiation.

Jimy Dudhia: WRF physics options

SLIDE 7

Performance Profile of WRF

Figure (data from *): WSM5's share of total WRF runtime is 25% (vs. 75% for all other code) in the Jan. 2000 30-km workload and 9% (vs. 91%) in the CONUS 12-km workload, even though WSM5 is only 1,553 lines of f90 code against 511,557 lines for the rest of WRF.

* John Michalakes, "Code restructuring to improve performance in WRF model physics on Intel Xeon Phi", Workshop on Programming weather, climate, and earth-system models on heterogeneous multi-core platforms, September 20, 2013

SLIDE 8

WRF Microphysics

  • Microphysics provides atmospheric heat and moisture tendencies.
  • Microphysics includes explicitly resolved water vapor, cloud, and precipitation processes.
  • Surface snowfall and rainfall are computed by microphysical schemes.
  • Several bulk water microphysics schemes are available within WRF, with different numbers of simulated hydrometeor classes and different methods for estimating their size distributions, fall speeds, and densities.

Figure: microphysics processes in the WSM5 scheme, linking five water classes (water vapor, cloud water, cloud ice, rain, snow) through the processes Pcond, Pidep, Praut, Prevp, Pracw, Psmlt, Psaci, Psacw, Psdep, Pigen, Psaut, and Psevp.

SLIDE 9

Analyzing the WSM5 on CONUS 12 km domain

  • Arithmetic intensity (= FLOPs / byte)
    • high arithmetic intensity -> computation bound
    • low arithmetic intensity -> memory bound
  • WSM5 CONUS 12-km workload: 24.25 billion instructions
    • 7.30 billion memory reads
    • 3.18 billion memory writes
    • -> 0.83 instructions / byte

A Tesla K20 delivers up to 3519 GFLOPS against 208 GB/s of memory bandwidth, i.e., ~16.9 FLOPs/byte. WSM5's arithmetic intensity is relatively low -> memory bound -> reduce memory accesses.
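A roofline-style estimate (in the spirit of the Patterson and Hennessy arithmetic-intensity figure cited below) makes this concrete: attainable throughput is the lesser of the compute peak and arithmetic intensity times memory bandwidth. A minimal host-side sketch using the K20 numbers above:

    #include <stdio.h>

    /* Roofline-style bound: attainable GFLOPS is the lesser of the compute
       peak and what the memory system can feed at a given arithmetic
       intensity (FLOPs per byte). */
    static double attainable_gflops(double peak_gflops, double bw_gbs,
                                    double flops_per_byte)
    {
        double memory_bound = bw_gbs * flops_per_byte;
        return memory_bound < peak_gflops ? memory_bound : peak_gflops;
    }

    int main(void)
    {
        /* Tesla K20 figures from the slide: 3519 GFLOPS, 208 GB/s. */
        printf("AI = 1  FLOP/byte:  %6.0f GFLOPS (memory bound)\n",
               attainable_gflops(3519.0, 208.0, 1.0));
        printf("AI = 20 FLOPs/byte: %6.0f GFLOPS (compute bound)\n",
               attainable_gflops(3519.0, 208.0, 20.0));
        return 0;
    }

Below ~16.9 FLOPs/byte the K20 cannot reach its compute peak, which is why reducing memory traffic is the main lever for WSM5.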

Measured using cachegrind (valgrind)

Figure: arithmetic intensity across kernel classes, from O(1) (BLAS 1, BLAS 2) through O(log N) (FFT) to O(N) (dense linear algebra / BLAS 3, N-body particle methods).

Computer Organization and Design: The Hardware/software Interface By David A. Patterson, John L. Hennessy

SLIDE 10

Parallelization of the computational domain

  • The WRF domain is a 2D grid parallel to the ground
  • Multiple levels correspond to the vertical heights in the atmosphere
  • There are vertical dependencies, but columns are independent
  • Parallelizable in the horizontal: two dimensions of parallelism to work with
  • Each thread computes one column at a grid point (see the kernel sketch below)

12-km resolution case: grid dimensions X=433, Y=308, Z=35; each column is executed by one thread.

Figure: the 3D computational grid (X, Y horizontal; Z vertical) alongside the WSM5 process diagram from SLIDE 8.
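A minimal sketch of this column-per-thread mapping, with hypothetical kernel and field names (not the actual WSM5 source): each thread owns one (i, j) column and walks the vertical levels serially, since the dependencies run along Z.

    // Hypothetical sketch (not the actual WSM5 source): one thread per
    // (i, j) column; the vertical loop stays serial because of the
    // dependencies along Z.
    __global__ void column_kernel(float *field, int nx, int ny, int nz)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // X grid index
        int j = blockIdx.y * blockDim.y + threadIdx.y;  // Y grid index
        if (i >= nx || j >= ny) return;

        for (int k = 0; k < nz; k++) {                  // walk the column
            int idx = (k * ny + j) * nx + i;            // assumed X-fastest layout
            field[idx] += 1.0f;                         // placeholder physics
        }
    }

    // Launch over the 12-km case quoted above (X=433, Y=308, Z=35):
    //   dim3 block(32, 4);
    //   dim3 grid((433 + 31) / 32, (308 + 3) / 4);
    //   column_kernel<<<grid, block>>>(d_field, 433, 308, 35);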

SLIDE 11

Additional optimizations for CUDA C

These optimizations decrease processing time from 29.6 ms to 25.4 ms on the K20:

1. Seven additional temporaries were eliminated
2. Four additional loop fusions were performed
3. Several global arrays were prefetched from global memory into registers, and results were written back at the end of the loop
4. Dead code was eliminated
5. Redundant computation of the same array three times was replaced by a single computation
6. After a loop inversion, three loops were fused (2x)
7. const __restrict__ pointers were used to utilize the read-only cache

(Items 3 and 7 are illustrated in the sketch below.)
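A minimal sketch of items 3 and 7, with made-up field names and physics (not the actual WSM5 code): a per-column output is prefetched into a register before the vertical loop, updated there, and written back once, while read-only inputs are marked const __restrict__ so the compiler can route their loads through Kepler's read-only (texture) cache.

    // Made-up names and physics, illustrating items 3 and 7 only.
    __global__ void wsm5_like(const float * __restrict__ qin,   // read-only input
                              float * __restrict__ rain2d,      // per-column output
                              int nx, int ny, int nz)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i >= nx || j >= ny) return;

        int col = j * nx + i;
        float rain = rain2d[col];       // prefetch global value into a register

        for (int k = 0; k < nz; k++) {
            int idx = (k * ny + j) * nx + i;
            rain += 0.001f * __ldg(&qin[idx]);  // load via read-only cache (sm_35+)
        }

        rain2d[col] = rain;             // single write-back at the end of the loop
    }

Marking a pointer const __restrict__ promises no aliasing, which is what lets the Kepler compiler use the read-only data cache (forced explicitly here with __ldg).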

Mielikainen, J., B. Huang, H. A. Huang, and M. D. Goldberg, "Improved GPU/CUDA Based Parallel Weather and Research Forecast (WRF) Single Moment 5-Class (WSM5) Cloud Microphysics," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, Vol. 5, No. 4, pp. 1256-1265, 2012.
SLIDE 12

Analysis of WSM5 on Tesla K20

Metric | Old WSM5 | New WSM5 | Note
Processing time | 29.6 ms | 25.4 ms | 14% faster
GFLOPS | 220.5 | 257.0 |
Registers per thread | 56 | 62 | additional registers used for data prefetching / temporary removal
Stack frame | 0 bytes | 8 bytes |
Spill stores | 0 bytes | 4 bytes |
Constant memory | 840 bytes | 784 bytes | 7x 64-bit pointers were removed
Achieved occupancy | 0.47 | 0.47 | increase in register usage didn't reduce occupancy
Executed IPC | 1.17 | 1.30 | increased by loop fusion
L2 hit rate | 46.18% | 57.31% | increased by temporary elimination
Texture cache hit rate | 53.30% | 59.74% |
Global load transactions | 25,283,839 | 24,217,376 | reduced by temporary elimination
Global store transactions | 12,078,815 | 8,802,572 | reduced by temporary elimination
Global load throughput | 93.9 GB/s | 103.8 GB/s |

SLIDE 13

Limiting factors

Different types of instructions are executed on different functional units within each SM, so performance can be limited if one functional unit is overused. Achieved compute throughput and memory bandwidth both below 60% indicate latency issues.

Finding: kernel performance is bound by instruction and memory latency.

SLIDE 14

Benchmarking GPUs

GPU | Core clock | CUDA cores | Peak single precision | Peak double precision | Memory bandwidth (ECC off) | Total memory size
Tesla K20 (Nov. 2012) | 705 MHz (758 MHz *) | 2496 | 3519 GFLOPS | 1173 GFLOPS | 208 GB/s | 5 GB
Tesla K40 (Nov. 2013) | 745 MHz (875 MHz *) | 2880 | 3837 GFLOPS | 1279 GFLOPS | 288 GB/s | 12 GB

* boost clock (see GPU Boost below)

  • NVIDIA GPU Boost is a feature that makes use of the power headroom to run the SM clock at a higher frequency.
  • The default clock is set to the base clock, which is necessary for some applications that are demanding on power (e.g., DGEMM); many application workloads are less demanding and can take advantage of a higher boost clock setting for added performance.
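For reference, the SM clock a device reports can be queried from the CUDA runtime; a minimal sketch (the boost clock itself is selected outside the application, e.g., via nvidia-smi application-clock settings):

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        int dev = 0, clock_khz = 0;
        cudaDeviceProp prop;

        cudaGetDeviceProperties(&prop, dev);
        // Peak SM clock (kHz) as reported by the CUDA runtime.
        cudaDeviceGetAttribute(&clock_khz, cudaDevAttrClockRate, dev);

        printf("%s: SM clock %.0f MHz\n", prop.name, clock_khz / 1000.0);
        return 0;
    }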

SLIDE 15

Figure: memory bandwidth and utilization, K40 base mode vs. K40 boost mode.

SLIDE 16

Nvidia K40 vs. Xeon Phi

* Xeon Phi optimization: John Michalakes, NOAA Additional Optimization: I. Gokhale, L. Meadows, R. Sasanka, Intel Corp.

Metric | Xeon Phi * | Tesla K40
Processing time | 29.7 ms | 16.5 ms
Concurrent threads | 3840 (60 cores, 4 HT, 16 SIMD) | 14336 (28 warps/MP, 16 MPs)
Vector instructions | 49.73% | 100%
DRAM write throughput | 33.5 GB/s | 57.7 GB/s
DRAM read throughput | 19.0 GB/s | 93.3 GB/s

  • Xeon Phi vectorized 1/2 of WSM5; the other half utilizes only multiple cores
  • Xeon Phi, with a higher cache-size-to-thread-count ratio, can serve more memory requests from caches than the K40
  • The K40 is able to hide latency better even with higher global-memory usage than the Xeon Phi
    • a larger number of concurrent threads allows for better latency hiding
SLIDE 17

SLIDE 18

Code Validation

  • Fused multiply-add was turned off (--fmad=false)
  • The GNU C math library was used on the GPU, i.e., powf(), expf(), sqrt(), and logf() were replaced by library routines from the GNU C library
  • -> bit-exact output
  • Small output differences remain when fast math is enabled
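A minimal sketch of this style of validation, with hypothetical array names (the kernels themselves compiled with --fmad=false as above): copy the GPU field back to the host and compare it against the CPU reference both element-wise and bit-for-bit.

    #include <stdio.h>
    #include <string.h>
    #include <math.h>

    /* Compare a GPU output field (already copied back to the host)
       against the CPU reference, element-wise and bit-for-bit. */
    static int validate(const float *cpu, const float *gpu, size_t n)
    {
        double max_abs_diff = 0.0;
        for (size_t i = 0; i < n; i++) {
            double d = fabs((double)cpu[i] - (double)gpu[i]);
            if (d > max_abs_diff) max_abs_diff = d;
        }
        int bit_exact = (memcmp(cpu, gpu, n * sizeof(float)) == 0);
        printf("max |CPU-GPU| = %g, bit-exact: %s\n",
               max_abs_diff, bit_exact ? "yes" : "no");
        return bit_exact;
    }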

Figure: difference between CPU and GPU outputs (potential temperature).

SLIDE 19

GPU-accelerated WRF modules

WRF module | Speedup
Single moment 6-class microphysics | 500x
Eta microphysics | 272x
Purdue Lin microphysics | 692x
Stony-Brook University 5-class microphysics | 896x
Betts-Miller-Janjic convection | 105x
Kessler microphysics | 816x
New Goddard shortwave radiance | 134x
Single moment 3-class microphysics | 331x
New Thompson microphysics | 153x
Double moment 6-class microphysics | 206x
Dudhia shortwave radiance | 409x
Goddard microphysics | 1311x
Double moment 5-class microphysics | 206x
Total Energy Mass Flux surface layer | 214x
Mellor-Yamada Nakanishi Niino surface layer | 113x
Single moment 5-class microphysics | 350x
Pleim-Xiu surface layer | 665x

SLIDE 20

Conclusions

  • Great interest in the community in accelerators
  • Continuing work on accelerating other WRF modules using CUDA C (~20 modules finished)
  • Lessons learned during the CUDA C implementation of WSM5 could be applied to OpenACC/OpenMP 4.0 optimization of WRF modules
SLIDE 21

Acknowledgement

Co-authors, Sponsors and Participants:

  • Jarno Mielikainen, SSEC, University of Wisconsin-Madison
  • Melin Huang, SSEC, University of Wisconsin-Madison
  • Allen Huang, SSEC, University of Wisconsin-Madison
  • Mitchell Goldberg, NOAA
  • Ajay Mehta, NOAA
  • Stan Posey, NVIDIA