Allen Huang, Ph.D. (allen@tempoquest.com)
CTO, Tempo Quest Inc.
GTC 2016, San Jose, CA, 5 April 2016
GPU Acceleration of Weather Forecasting and Meteorological Satellite Data Assimilation, Processing and Applications
http://www.tempoquest.com
– Model is not perfect yet: scientific understanding & algorithm development are still evolving
– Data is not always accurate: actual and accurate initial data are expensive to collect & process
– High-performance computing (HPC) is expensive: only limited resources can be afforded to deploy & operate HPC
– Same forecasts faster, much faster
– Better forecasts take much more computation
– Hyperspectral Data Retrieval
– Hyperspectral Data Compression
100,000 to 200,000 CPU cores required for:
– NIM @ 2 km resolution, 2x/day
– North American (NA) domain HRRR @ <1 km, hourly
– HRRR @ 3 km NA, 100 members, hourly
Reference: 250,000 CPUs cost ~$100M, use 7,000 kW, and incur an energy bill of ~$8M/year
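As a rough sanity check on the reference figures above, the quoted power draw and annual bill imply an electricity price. The sketch below assumes the system draws the full 7,000 kW continuously for 8,760 hours a year, which the slide does not state; all variable names are illustrative.

```python
# Figures quoted on the slide
power_kw = 7_000            # quoted power draw in kW
annual_bill_usd = 8e6       # quoted energy bill per year in USD

# Assumption (not from the slide): full power draw, 24 hours a day, 365 days a year
hours_per_year = 24 * 365

annual_energy_kwh = power_kw * hours_per_year
implied_price_usd_per_kwh = annual_bill_usd / annual_energy_kwh

print(f"Annual energy: {annual_energy_kwh / 1e6:.1f} GWh")
print(f"Implied price: ${implied_price_usd_per_kwh:.3f}/kWh")
```

Under that assumption the two quoted figures are mutually consistent with a price of roughly $0.13 per kWh.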
Why are the Weather Forecast Models not accurate enough?
Operational (T574, ~27 km); Experiment (T1500, ~13 km)
Note: Last 24h of the high resolution experiment track based
1 Zflops = 10^21 flops, i.e. 1,000 exaflops
1 exaflops = 10^18 flops = 1 million trillion (1 billion billion) floating-point operations per second
Our experiments ran on an Intel i7 970 CPU at 3.20 GHz and on a single GPU of the two on an NVIDIA GTX 590.

| Implementation | Time [ms] |
|---|---|
| Original Fortran code on CPU | 16928 |
| CUDA C with I/O on GPU | 83.6 |
| CUDA C without I/O on GPU | 48.3 |
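The speedups implied by these timings follow directly from the ratio of CPU time to GPU time. The short sketch below only restates that arithmetic with the three timings quoted above; the function and variable names are illustrative.

```python
# Timings quoted on the slide, in milliseconds
cpu_fortran_ms = 16928.0
gpu_with_io_ms = 83.6
gpu_without_io_ms = 48.3


def speedup(baseline_ms, accelerated_ms):
    """Ratio of baseline run time to accelerated run time."""
    return baseline_ms / accelerated_ms


speedup_with_io = speedup(cpu_fortran_ms, gpu_with_io_ms)
speedup_without_io = speedup(cpu_fortran_ms, gpu_without_io_ms)

print(f"With I/O:    {speedup_with_io:.1f}x")
print(f"Without I/O: {speedup_without_io:.1f}x")
```

The ratios come out at roughly 202x with I/O and 350x without.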
developed to take advantage of GPU’s massive parallelism capability.
To compute one day's worth of IASI spectra (1,296,000), the original RTM (compiled with -O2 optimization) takes ~10 days on a 3.0 GHz CPU core; the single-input GPU-RTM takes ~10 minutes (1455x speedup), whereas the multi-input GPU-RTM takes ~5 minutes (3024x speedup).
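The run times quoted above can be cross-checked against the stated speedups. The sketch below takes the approximate 10-day CPU figure at face value, so its outputs are only as precise as that estimate; the dictionary layout and names are illustrative.

```python
spectra_per_day = 1_296_000          # one day of IASI spectra, from the text
cpu_minutes = 10 * 24 * 60           # "~10 days" on one CPU core, taken as exact here

# Speedups quoted in the text
speedups = {"single-input": 1455, "multi-input": 3024}

results = {}
for name, factor in speedups.items():
    gpu_minutes = cpu_minutes / factor
    spectra_per_second = spectra_per_day / (gpu_minutes * 60)
    results[name] = {"minutes": gpu_minutes, "spectra_per_second": spectra_per_second}
    print(f"{name}: {gpu_minutes:.1f} min, {spectra_per_second:.0f} spectra/s")

# Rate needed to keep up with the instrument: one day's spectra in one day
realtime_rate = spectra_per_day / (24 * 60 * 60)
print(f"Real-time rate: {realtime_rate:.0f} spectra/s")
```

The results, about 9.9 and 4.8 minutes, agree with the rounded "~10 minutes" and "~5 minutes" in the text.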
– Same forecasts faster, much faster
– Acceleration of Weather Research and Forecasting (WRF) Model
levels.
capturing the development of a strong baroclinic cyclone and a frontal boundary that extends from north to south across the entire U.S.
| Scheme | GPU speedup (with I/O / without I/O) | MIC improvement factor | Reference |
|---|---|---|---|
| RRTMG LW | 123x / 127x | | (GPU) JSTARS, 7, 3660-3667, 2014 |
| RRTMG SW | 202x / 207x | | (GPU) JSTARS, PP, 1-11, 2015 |
| Goddard SW | 92x / 134x | | (GPU) JSTARS, 5, 555-562, 2012 |
| Dudhia SW | 19x / 409x | | |
| MYNN SL | 6x / 113x | | |
| TEMF SL | 5x / 214x | | |
| Thermal Diffusion LS | 10x / 311x | [2.1x] | (GPU) JSTARS, 8, 2249-2259, 2015 |
| YSU PBL | 34x / 193x | [2.4x] | (GPU) GMD, 8, 2977-2990, 2015 |
| TEMF PBL | | [14.8x] | (MIC) SPIE: doi:10.1117/12.2055040 |
| Betts-Miller-Janjic (BMJ) convection | 55x / 105x | | |

Categories shown on the slide: Radiation, Surface, PBL, CU

GPU speedup: speedup with I/O / speedup without I/O
MIC improvement factor in [ ]: relative to the first-version multi-threaded code, before any improvement
Cloud Microphysics

| Scheme | GPU speedup (with I/O / without I/O) | MIC improvement factor | Reference |
|---|---|---|---|
| Kessler MP | 70x / 816x | | |
| Purdue-Lin MP | 156x / 692x | [4.2x] | (GPU) SPIE: doi:10.1117/12.901825 |
| WSM 3-class MP | 150x / 331x | | |
| WSM 5-class MP | 202x / 350x | | (GPU) JSTARS, 5, 1256-1265, 2012 |
| Eta MP | 37x / 272x | | SPIE: doi:10.1117/12.976908 |
| WSM 6-class MP | 165x / 216x | | (GPU) J. Comp. & GeoSci., 83, 17-26, 2015 |
| Goddard GCE MP | 348x / 361x | [4.7x] | (GPU) JSTARS, 8, 2260-2272, 2015 |
| Thompson MP | 76x / 153x | [2.3x] | (MIC) SPIE: doi:10.1117/12.2055038 |
| SBU 5-class MP | 213x / 896x | | JSTARS, 5, 625-633, 2012 |
| WDM 5-class MP | 147x / 206x | | |
| WDM 6-class MP | 150x / 206x | | |
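Each pair of GPU speedups (with I/O / without I/O) also indicates how much of the GPU run time goes to I/O. The sketch below assumes both speedups share the same CPU baseline and that the "with I/O" time is simply compute time plus I/O time; the slides do not spell this out, and the helper `io_share` is hypothetical. Under that assumption the I/O share of GPU time is 1 - (speedup with I/O) / (speedup without I/O).

```python
def io_share(speedup_with_io, speedup_without_io):
    """Fraction of GPU run time spent on I/O.

    Assumes (not stated in the slides) a common CPU baseline T, so that
    time_with_io = T / speedup_with_io and time_without_io = T / speedup_without_io.
    """
    return 1.0 - speedup_with_io / speedup_without_io


# A few (with I/O, without I/O) pairs taken from the Cloud Microphysics results
schemes = {
    "Kessler MP": (70, 816),
    "SBU 5-class MP": (213, 896),
    "Goddard GCE MP": (348, 361),
}

for name, (with_io, without_io) in schemes.items():
    print(f"{name}: {io_share(with_io, without_io):.1%} of GPU time is I/O")
```

Read this way, a light scheme such as Kessler MP is dominated by I/O (about 91%), while a compute-heavy scheme such as Goddard GCE MP spends under 4% of its GPU time on I/O.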