This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 671627
Exascale challenges for Numerical Weather Prediction: the ESCAPE project
Olivier Marsden
ECMWF: an independent intergovernmental organisation established in 1975, with 19 Member States and 15 Co-operating States.
May be one of the best medium-range forecasts of all time!
[Figure: Hurricane Sandy, 28 Oct 2012. Mean sea-level pressure: analysis for 30 Oct against 5-day forecasts at T3999, T1279, and T639. Precipitation: NEXRAD for 27 Oct against 4-day forecasts at T639, T1279, and T3999. Also 3-day forecasts of wave height, mean sea-level pressure, and 10 m wind speed.]
Data volumes, today and tomorrow:

                Today                                   Tomorrow
Observations    20 million = 2 x 10^7 per day           200 million = 2 x 10^8 per day
                98% from 60 satellite instruments       98% from 80 satellite instruments
Models          5 million grid points, 100 levels,      500 million grid points, 200 levels,
                10 prognostic variables = 5 x 10^9      100 prognostic variables = 1 x 10^13
                physical parameters of atmosphere,      physical and chemical parameters of atmosphere,
                waves, ...                              waves, ocean, ice, vegetation

Growth: a factor of 10 per day in observations (2 x 10^8 / 2 x 10^7), and a factor of 2000 per time step in model data (1 x 10^13 / 5 x 10^9).
[Figure: 13 km case, speed normalized to the operational threshold (8.5 minutes per day) vs number of Edison cores (Cray XC-30), 8,192 to 139,264 cores. Models: IFS, NMM-UJ, FV3 (single precision), FV3 (double precision), NIM, MPAS, NEPTUNE; 13 km operational threshold marked.]
[Michalakes et al. 2015: AVEC Report: NGGPS level-1 benchmarks and software evaluation]
[Figure: 3 km case, speed normalized to the operational threshold (8.5 minutes per day) vs number of Edison cores (Cray XC-30), 8,192 to 139,264 cores; fraction of operational threshold 0.0 to 1.1. Models: IFS, NMM-UJ, FV3 (single precision), FV3 (double precision), NIM, NIM with improved MPI comms, MPAS, NEPTUNE; 3 km operational threshold marked.]
The Advanced Computing Evaluation Committee (AVEC) was formed to evaluate the HPC performance of five Next Generation Global Prediction System (NGGPS) candidates against operational forecast requirements at the National Weather Service through 2025-30.
IFS = Integrated Forecasting System
October 29, 2014
2 MW vs 6 MW (for a single HRES forecast)
Two XC-30 clusters, each with 85K cores
Operational requirement: ECMWF requires system capacity for 10 to 20 simultaneous HRES forecasts.
ESCAPE*: Energy efficient SCalable Algorithms for weather Prediction at Exascale
*Funded by the EC H2020 framework, Future and Emerging Technologies – High-Performance Computing. Partners: ECMWF, Météo-France, RMI, DMI, Meteo Swiss, DWD, U Loughborough, PSNC, ICHEC, Bull, NVIDIA, Optalysys
Grid-point space ↔ Spectral space: each time step transforms fields between the two.
Time-stepping loop in dwarf1-atlas.F90:

DO JSTEP=1,ITERS
  call trans%invtrans(spfields,gpfields)   ! inverse transform: spectral -> grid-point
  call trans%dirtrans(gpfields,spfields)   ! direct transform: grid-point -> spectral
ENDDO
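The per-routine timings that follow split each of these calls into its Legendre and Fourier stages: invtrans covers LTINV_CTL then FTINV_CTL, and dirtrans covers FTDIR_CTL then LTDIR_CTL. A minimal sketch of how such per-call timings could be gathered, with hypothetical stub routines standing in for the real transforms:

PROGRAM dwarf1_timing_sketch
  ! Sketch only: invtrans_stub/dirtrans_stub are hypothetical placeholders
  ! for the real spectral transforms; only the timing pattern is shown.
  IMPLICIT NONE
  INTEGER, PARAMETER :: ITERS = 10
  INTEGER :: JSTEP, ic0, ic1, irate
  REAL :: tinv = 0.0, tdir = 0.0
  DO JSTEP = 1, ITERS
    CALL SYSTEM_CLOCK(ic0, irate)
    CALL invtrans_stub()                     ! spectral -> grid-point (LTINV + FTINV)
    CALL SYSTEM_CLOCK(ic1)
    tinv = tinv + REAL(ic1-ic0)/REAL(irate)
    CALL SYSTEM_CLOCK(ic0)
    CALL dirtrans_stub()                     ! grid-point -> spectral (FTDIR + LTDIR)
    CALL SYSTEM_CLOCK(ic1)
    tdir = tdir + REAL(ic1-ic0)/REAL(irate)
  END DO
  PRINT '(A,F10.3,A)', 'invtrans: ', 1000.0*tinv/ITERS, ' msec per time-step'
  PRINT '(A,F10.3,A)', 'dirtrans: ', 1000.0*tdir/ITERS, ' msec per time-step'
CONTAINS
  SUBROUTINE invtrans_stub()   ! placeholder: real code calls trans%invtrans
  END SUBROUTINE
  SUBROUTINE dirtrans_stub()   ! placeholder: real code calls trans%dirtrans
  END SUBROUTINE
END PROGRAM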
Work carried out by George Mozdzynski, ECMWF
Transform timings (msec per time-step):

            XC-30    TITAN
LTINV_CTL   230.6    109.1
LTDIR_CTL   230.7     86.1
FTDIR_CTL   207.5    238.9
FTINV_CTL   218.4    239.5
Transform timings (msec per time-step):

            XC-30    TITAN
LTINV_CTL   646.2    189.3
LTDIR_CTL   645.1    152.9
FTDIR_CTL   345.3    281.7
FTINV_CTL   351.1    281.9
Tc3999 transform timings (msec per time-step):

            XC-30    TITAN
LTINV_CTL  1024.9    324.3
LTDIR_CTL  1178.6    279.8
FTDIR_CTL   428.3    342.3
FTINV_CTL   424.6    341.8
[Figure: heat map of performance relative to a 24-core Ivybridge XC-30 node, across spectral resolutions T95 to T7999; shading bands from 0.00 to 1.60.]
K20X GPU performance is up to 1.4 times faster than a 24-core Ivybridge XC-30 node.
[Figure: FFT time vs FFT length (number of latitude points, 1,000 to 9,000), with notable lengths 3700, 7400, and 4100 marked.]
Tc3999 timings (msec per time-step):

              XC-30     TITAN   XC-30+GPU prediction
LTINV_CTL    1024.9     324.3     324.3
LTDIR_CTL    1178.6     279.8     279.8
FTDIR_CTL     428.3     342.3     342.3
FTINV_CTL     424.6     341.8     341.8
MTOL          752.5    4763.0     752.5
LTOM          407.9    4782.9     407.9
LTOG         1225.9    1541.9    1225.9
GTOL          401.5    1658.4     401.5
HOST2GPU**      0.0     655.4     655.4
GPU2HOST**      0.0     650.0     650.0
Total        5844.2   14034.4    5381.4
** included in the communication (red) times
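The prediction column takes the better of the two measured values for each row: TITAN's GPU compute times for the four transform routines (324.3 + 279.8 + 342.3 + 341.8 = 1288.2 msec) combined with the XC-30's communication times (752.5 + 407.9 + 1225.9 + 401.5 = 2787.8 msec) and the host-device transfers (655.4 + 650.0 = 1305.4 msec), giving 1288.2 + 2787.8 + 1305.4 = 5381.4 msec per time-step.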
Adaptation to new architectures as part of the ECMWF Scalability Programme
REAL(kind=8) :: array(NPROMA, NLEV, NGPBLKS)
!$OMP PARALLEL PRIVATE(JKGLO,IBL,ICEND)
!$OMP DO SCHEDULE(DYNAMIC,1)
DO JKGLO=1,NGPTOT,NPROMA              ! So-called NPROMA-loop
  IBL=(JKGLO-1)/NPROMA+1              ! Current block number
  ICEND=MIN(NPROMA,NGPTOT-JKGLO+1)    ! Block length <= NPROMA
  CALL CLOUDSC( 1, ICEND, NPROMA, KLEV, &
   &  array(1,1,IBL), &               ! ~65 arrays like this
   &  )
END DO
!$OMP END DO
!$OMP END PARALLEL
Typical values for NPROMA in the OpenMP implementation: 10–100
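To make the blocking arithmetic concrete, here is a small self-contained sketch (the NGPTOT and NPROMA values are illustrative, not from the talk) that prints the block decomposition the NPROMA loop produces:

PROGRAM nproma_blocks
  ! Sketch with illustrative sizes only.
  IMPLICIT NONE
  INTEGER, PARAMETER :: NGPTOT = 250   ! total grid points (illustrative)
  INTEGER, PARAMETER :: NPROMA = 64    ! block length, in the 10-100 OpenMP range
  INTEGER :: JKGLO, IBL, ICEND
  DO JKGLO = 1, NGPTOT, NPROMA
    IBL   = (JKGLO-1)/NPROMA + 1          ! current block number
    ICEND = MIN(NPROMA, NGPTOT-JKGLO+1)   ! block length <= NPROMA
    PRINT '(A,I2,A,I4,A,I3)', 'block ', IBL, ' starts at ', JKGLO, ', length ', ICEND
  END DO
  ! Prints three full blocks of length 64 (starting at 1, 65, 129)
  ! and a final partial block of length 58 (starting at 193).
END PROGRAM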
OpenACC PRESENT / CREATE data clauses added to CLOUDSC
!$OMP PARALLEL PRIVATE(JKGLO,IBL,ICEND) &
!$OMP& PRIVATE(tid, idgpu) num_threads(NumGPUs)
tid = omp_get_thread_num()                   ! OpenMP thread number
idgpu = mod(tid, NumGPUs)                    ! Effective GPU# for this thread
CALL acc_set_device_num(idgpu, acc_get_device_type())
!$OMP DO SCHEDULE(STATIC)
DO JKGLO=1,NGPTOT,NPROMA                     ! NPROMA-loop
  IBL=(JKGLO-1)/NPROMA+1                     ! Current block number
  ICEND=MIN(NPROMA,NGPTOT-JKGLO+1)           ! Block length <= NPROMA
  !$acc data copyout(array(:,:,IBL), ...) &  ! ~22 : GPU to Host
  !$acc&     copyin(array(:,:,IBL))          ! ~43 : Host to GPU
  CALL CLOUDSC (... array(1,1,IBL) ...)      ! Runs on GPU#<idgpu>
  !$acc end data
END DO
!$OMP END DO
!$OMP END PARALLEL
Typical values for NPROMA in the OpenACC implementation: > 10,000 (much larger blocks are needed to expose enough parallelism to fill the GPU)
!$ACC KERNELS LOOP COLLAPSE(2) PRIVATE(ZTMP_Q,ZTMP)
DO JK=1,KLEV
  DO JL=KIDIA,KFDIA
    ztmp_q = 0.0_JPRB
    ztmp   = 0.0_JPRB
    !$ACC LOOP PRIVATE(ZQADJ) REDUCTION(+:ztmp_q,ztmp)
    DO JM=1,NCLV-1
      IF (ZQX(JL,JK,JM)<RLMIN) THEN
        ZLNEG(JL,JK,JM) = ZLNEG(JL,JK,JM)+ZQX(JL,JK,JM)
        ZQADJ  = ZQX(JL,JK,JM)*ZQTMST
        ztmp_q = ztmp_q + ZQADJ
        ztmp   = ztmp + ZQX(JL,JK,JM)
        ZQX(JL,JK,JM) = 0.0_JPRB
      ENDIF
    ENDDO
    PSTATE_q_loc(JL,JK) = PSTATE_q_loc(JL,JK) + ztmp_q
    ZQX(JL,JK,NCLDQV)   = ZQX(JL,JK,NCLDQV) + ztmp
  ENDDO
ENDDO
!$ACC END KERNELS ASYNC(IBL)
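The scalar temporaries ztmp_q and ztmp turn what would otherwise be repeated updates to shared array elements inside the JM loop into private reductions; the shared arrays PSTATE_q_loc and ZQX are then updated once per (JL,JK) point, which is what lets the inner loop run in parallel on the GPU.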
The ASYNC clause removes CUDA thread synchronizations between kernel launches.
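As a minimal illustration of the idea (not the CLOUDSC code itself): kernels launched on async queues return immediately to the host, and a single !$acc wait at the end replaces per-kernel synchronization:

PROGRAM async_sketch
  ! Sketch: two independent kernels on separate async queues; the host
  ! does not block until the final wait.
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 1000000
  INTEGER :: I
  REAL :: A(N), B(N)
  A = 1.0
  B = 2.0
  !$acc kernels async(1)
  DO I = 1, N
    A(I) = 2.0*A(I)
  END DO
  !$acc end kernels
  !$acc kernels async(2)   ! enqueued without waiting for queue 1
  DO I = 1, N
    B(I) = B(I) + 1.0
  END DO
  !$acc end kernels
  !$acc wait               ! single synchronization point
  PRINT *, A(1), B(1)      ! expect 2.0 and 3.0
END PROGRAM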
[Figure: CLOUDSC performance for 1 GPU and 2 GPUs as a function of NPROMA (100 to 80,000).]
[Figure: CLOUDSC time breakdown (computation, communication, other overhead) vs NPROMA (10 to 80,000), compared with Haswell.]
!$OMP PARALLEL PRIVATE(JKGLO,IBL,ICEND) &
!$OMP& PRIVATE(tid, idgpu) num_threads(NumGPUs * 4)
tid = omp_get_thread_num()                   ! OpenMP thread number
idgpu = mod(tid, NumGPUs)                    ! Effective GPU# for this thread
CALL acc_set_device_num(idgpu, acc_get_device_type())
!$OMP DO SCHEDULE(STATIC)
DO JKGLO=1,NGPTOT,NPROMA                     ! NPROMA-loop
  IBL=(JKGLO-1)/NPROMA+1                     ! Current block number
  ICEND=MIN(NPROMA,NGPTOT-JKGLO+1)           ! Block length <= NPROMA
  !$acc data copyout(array(:,:,IBL), ...) &  ! ~22 : GPU to Host
  !$acc&     copyin(array(:,:,IBL))          ! ~43 : Host to GPU
  CALL CLOUDSC (... array(1,1,IBL) ...)      ! Runs on GPU#<idgpu>
  !$acc end data
END DO
!$OMP END DO
!$OMP END PARALLEL
More threads here: four OpenMP threads per GPU, i.e. each GPU is 4-way time-shared in our CLOUDSC case.
[Figure: CLOUDSC performance vs number of copies (1, 2, 4) for 1 GPU and 2 GPUs.]
Comparison: GPU 4-way time-shared vs GPU fed with work by only one OpenMP thread.
[Figure: CLOUDSC time breakdown (computation, communication, other overhead) vs NPROMA (10, 20,000, 80,000), compared with Haswell.]
[Figure: CLOUDSC performance for Haswell, 1 GPU, and 2 GPUs, each with and without time-sharing; comparison of GPU 4-way time-shared vs GPU not time-shared.]
T/S = GPUs time-shared
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 671627
www.hpc-escape.eu