Reflecting on the Goal and Baseline of Exascale Computing
Thomas C. Schulthess


SLIDE 1 — Title
Reflecting on the Goal and Baseline of Exascale Computing
Thomas C. Schulthess

SLIDES 2–11 — Tracking supercomputer performance over time?

The Linpack benchmark solves a dense linear system: Ax = b.

1,000-fold performance improvement per decade, from the first application sustaining > 1 TFLOP/s to the first application sustaining > 1 PFLOP/s.

Milestone codes: KKR-CPA (MST), LSMS (MST), WL-LSMS (MST).

The 1,000x performance improvement per decade seems to hold for multiple-scattering-theory (MST) based electronic structure codes in materials science.
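
A minimal sketch (my own illustration, not from the slides) of what the Linpack benchmark measures: the rate at which a dense system Ax = b is solved, with the problem size below a hypothetical stand-in:

```python
import time
import numpy as np

# Sketch of the Linpack measurement: solve a dense Ax = b and report FLOP/s
# using the leading-order ~(2/3)n^3 operation count of LU factorization.
n = 4000                                  # hypothetical problem size
A = np.random.rand(n, n)
b = np.random.rand(n)

t0 = time.perf_counter()
x = np.linalg.solve(A, b)                 # LU factorization + triangular solves
elapsed = time.perf_counter() - t0

flops = (2.0 / 3.0) * n**3
print(f"~{flops / elapsed / 1e9:.1f} GFLOP/s")

# A 1,000-fold improvement per decade is roughly a doubling every year:
print(f"implied annual growth: {1000 ** 0.1:.2f}x")
```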

SLIDE 12 — “Only” 100-fold performance improvement in climate codes

Source: Peter Bauer, ECMWF

SLIDE 13 — Has the efficiency of weather & climate codes dropped 10-fold every decade?

SLIDES 14–21 — Floating-point efficiency dropped from 50% on the Cray Y-MP to 5% on today's Cray XC (10x in two decades)

Source: Peter Bauer, ECMWF

Systems and their power footprints, with the MST codes (KKR-CPA, LSMS, WL-LSMS) overlaid for comparison: Cray Y-MP @ 300 kW, IBM P5 @ 400 kW, IBM P6 @ 1.3 MW, Cray XT5 @ 1.8 MW, Cray XT5 @ 7 MW.

System size (in energy footprint) grew much faster on “Top500” systems.
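
A quick check of that claim (my arithmetic, not on the slide): the efficiency loss is one order of magnitude, while the power footprints listed above span more than a 20-fold increase over the same period:

```python
# Power footprints from the slide (kW); floating-point efficiency fell 10x
# over the same two decades.
systems = [("Cray Y-MP", 300), ("IBM P5", 400), ("IBM P6", 1_300),
           ("Cray XT5", 1_800), ("Cray XT5", 7_000)]
base = systems[0][1]
for name, kw in systems:
    print(f"{name:10s} {kw:6,d} kW  ({kw / base:4.1f}x vs. Y-MP)")
print(f"efficiency drop: {0.50 / 0.05:.0f}x")
```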

SLIDES 22–24 — Can the delivery of a 1 km-scale capability be pulled in by a decade?

Source: Christoph Schär, ETH Zurich, & Nils Wedi, ECMWF

SLIDES 25–26 — Leadership in weather and climate

The European model may be the best, but it is still far from sufficient accuracy and reliability!

Peter Bauer, ECMWF

SLIDE 27 — The impact of resolution: simulated tropical cyclones

[Figure: simulated tropical cyclones at 130 km, 60 km, and 25 km resolution, compared with observations; HADGEM3, PRACE UPSCALE, P.L. Vidale (NCAS) and M. Roberts (MO/HC).]

SLIDE 28 — Resolving convective clouds (convergence?)

Bulk convergence: area-averaged bulk effects upon the ambient flow, e.g., heating and moistening of the cloud layer.
Structural convergence: statistics of the cloud ensemble, e.g., spacing and size of convective clouds.

Source: Christoph Schär, ETH Zurich

SLIDE 29 — Structural and bulk convergence

[Figure: relative frequency distributions of cloud area (km²) and of convective mass flux (kg m⁻² s⁻¹) for up- and downdrafts, at grid spacings of 8 km, 4 km, 2 km, 1 km, and 500 m; the fraction of grid-scale clouds is 71, 64, 54, 47, and 43%, respectively.]

Statistics of cloud area and of up- and downdrafts show no structural convergence; bulk statistics of updrafts converge (factor 4; Panosetti et al. 2018).

Source: Christoph Schär, ETH Zurich

SLIDE 30 — What resolution is needed?

  • There are threshold scales in the atmosphere and ocean: going from 100 km to 10 km is incremental; 10 km to 1 km is a leap. At 1 km:
  • it is no longer necessary to parametrise precipitating convection, ocean eddies, or orographic wave drag and its effect on extratropical storms;
  • ocean bathymetry, overflows and mixing, as well as regional orographic circulation in the atmosphere, become resolved;
  • the connections between the remaining parametrisations are now on a physical footing.
  • We have spent the last five decades in a paradigm of incremental advances, in which we incrementally improved the resolution of models from 200 to 20 km.
  • Exascale allows us to make the leap to 1 km. This fundamentally changes the structure of our models: we move from crude parametric representations to an explicit, physics-based description of essential processes.
  • The last such step change was fifty years ago, when, in the late 1960s, climate scientists first introduced global climate models, distinguished by their ability to explicitly represent extratropical storms, ocean gyres and boundary currents.

Bjorn Stevens, MPI-M

SLIDES 31–32 — Simulation throughput: simulated years per day (SYPD)

                          NWP     Climate in production   Climate spin-up
Simulation                10 d    100 y                   5,000 y
Desired wall-clock time   0.1 d   0.1 y                   0.5 y
Ratio                     100     1,000                   10,000
SYPD                      0.27    2.7                     27

Minimal throughput 1 SYPD, preferred 5 SYPD.
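
The SYPD row follows directly from the time-compression ratio (a small sketch of my own): a ratio R means R simulated days per wall-clock day, i.e. R/365 simulated years per day.

```python
# SYPD from the time-compression ratio: R simulated days per wall-clock day
# equals R/365 simulated years per wall-clock day.
for name, ratio in [("NWP", 100), ("Climate in production", 1_000),
                    ("Climate spin-up", 10_000)]:
    print(f"{name:22s} ratio {ratio:6,d} -> {ratio / 365:5.2f} SYPD")
```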

SLIDE 33 — Summary of intermediate goal (reach by 2021?)

Horizontal resolution   1 km (globally quasi-uniform)
Vertical resolution     180 levels (surface to ~100 km)
Time resolution         less than 1 minute
Coupled                 land surface / ocean / ocean waves / sea ice
Atmosphere              non-hydrostatic
Precision               single (32-bit) or mixed precision
Compute rate            1 SYPD (simulated years per wall-clock day)

SLIDE 34 — Running COSMO 5.0 at global scale on Piz Daint

Scaling to full system size: ~5,300 GPU-accelerated nodes available. Running a near-global (±80°, covering 97% of Earth's surface) COSMO 5.0 simulation & IFS:
> either on the host processors: Intel Xeon E5-2690 v3 (Haswell, 12 cores);
> or on the GPU accelerator: the PCIe version of the NVIDIA GP100 (Pascal) GPU.

SLIDE 35 — “Piz Kesch”

“Today's Outlook: GPU-accelerated Weather Forecasting”, John Russell, September 15, 2015

SLIDES 36–46 — Where the factor-40 improvement came from

The requirements from MeteoSwiss, at a constant budget for investments and operations — grid refinement from 2.2 km to 1.1 km (24x), ensembles with multiple forecasts and data assimilation (10x), and a further 6x — meant we needed a 40x improvement between 2012 and 2015 at constant cost.

Where the factor 40 came from:
  • 1.7x from software refactoring (old vs. new implementation on x86)
  • 2.8x from mathematical improvements (resource utilisation, precision)
  • 2.8x from Moore's law & architectural improvements on x86
  • 2.3x from the change in architecture (CPU → GPU)
  • 1.3x from additional processors

Investment in software enabled the mathematical improvements and the change in architecture. Bonus: a reduction in power! There is no silver bullet.
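
The achieved factors multiply out to roughly the required 40x (my arithmetic, not on the slide):

```python
# Product of the individual improvement factors:
print(1.7 * 2.8 * 2.8 * 2.3 * 1.3)  # ~39.85, i.e. the required 40x
```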

SLIDE 47 — Near-global climate simulation at 1 km resolution: establishing a performance baseline on 4888 GPUs with COSMO 5.0

Fuhrer et al., Geosci. Model Dev. Discuss., https://doi.org/10.5194/gmd-2017-230, published 2018

SLIDES 48–49 — Near-global climate simulation at 1 km resolution: establishing a performance baseline on 4888 GPUs with COSMO 5.0

Metric: simulated years per wall-clock day (SYPD).

[Figure: SYPD vs. number of nodes (10 to 4,888) for Δx = 19 km (P100 and Haswell), Δx = 3.7 km (P100 and Haswell), Δx = 1.9 km (P100), and Δx = 930 m (P100).]

Time compression (SYPD) and energy cost (MWh/SY) for three moist simulations; the 930 m result is from a full 10-day simulation, 1.9 km from 1,000 steps, and 47 km from 100 steps:

⟨Δx⟩     #nodes   Δt [s]   SYPD    MWh/SY   gridpoints
930 m    4,888    6        0.043   596      3.46 × 10¹⁰
1.9 km   4,888    12       0.23    97.8     8.64 × 10⁹
47 km    18       300      9.6     0.099    1.39 × 10⁷

2.5x faster than Yang et al.'s 2016 Gordon Bell-winning run on TaihuLight!

Fuhrer et al., Geosci. Model Dev. Discuss., https://doi.org/10.5194/gmd-2017-230, published 2018
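
A back-of-envelope check of the table (my arithmetic, not from the paper): at 0.043 SYPD, one simulated year takes about 23 wall-clock days, so 596 MWh/SY implies an average draw of roughly 1 MW across the 4,888 nodes.

```python
# Implied average power draw of the 930 m run: energy per simulated year
# divided by wall-clock hours per simulated year.
sypd = 0.043                      # simulated years per wall-clock day
mwh_per_sy = 596.0                # energy cost per simulated year
hours_per_sy = 24 / sypd          # ~558 wall-clock hours per simulated year
print(f"~{mwh_per_sy / hours_per_sy * 1000:.0f} kW")  # ~1068 kW
```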

SLIDE 50 — The baseline for COSMO-global and IFS

                               Near-global COSMO [Fuh2018]            Global IFS [Wed2009]
                               Value                     Shortfall    Value                     Shortfall
Horizontal resolution          0.93 km (non-uniform)     0.9x         1.25 km                   1.25x
Vertical resolution            60 levels (to 25 km)      3x           62 levels (to 40 km)      3x
Time resolution                6 s (split-explicit       —            120 s (semi-implicit)     4x
                               with sub-stepping)
Coupled                        no                        1.2x         no                        1.2x
Atmosphere                     non-hydrostatic           —            non-hydrostatic           —
Precision                      double                    0.6x         single                    —
Compute rate                   0.043 SYPD                23x          0.088 SYPD                11x
Other (I/O, full physics, …)   limited I/O,              1.5x         full physics, no I/O      —
                               only microphysics
Total shortfall                                          65x                                    198x
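
The totals are consistent with the product of the listed per-row shortfalls (my check; rows without a listed factor are taken as 1x):

```python
from math import prod

# Per-row shortfall factors from the baseline table.
cosmo = [0.9, 3, 1.2, 0.6, 23, 1.5]   # near-global COSMO
ifs   = [1.25, 3, 4, 1.2, 11]         # global IFS
print(f"COSMO: {prod(cosmo):.0f}x")   # ~67x, quoted as 65x on the slide
print(f"IFS:   {prod(ifs):.0f}x")     # 198x
```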

SLIDES 51–60 — Memory use efficiency

MUE = I/O efficiency · BW efficiency = (Q / D) · (B / B̂)

where Q is the necessary data transfer volume, D the actual data transfer volume, B the achieved memory bandwidth, and B̂ the maximum achievable memory bandwidth.

For the COSMO baseline: I/O efficiency ≈ 0.88 and BW efficiency ≈ 0.76, so MUE ≈ 0.88 · 0.76 = 0.67.

[Figure: measured memory bandwidth (GB/s) vs. data size (MB) for several kernels — COPY (double) a[i] = b[i]; GPU STREAM (double) a[i] = b[i]; 1D AVG i-stride (float) a[i] = b[i-1] + b[i+1]; 5-POINT (float) a[i] = b[i] + b[i+1] + b[i-1] + b[i+jstride] + b[i-jstride]; COPY (float) a[i] = b[i]. The stencil-like kernels achieve about 2x lower bandwidth than peak.]

Fuhrer et al., Geosci. Model Dev. Discuss., https://doi.org/10.5194/gmd-2017-230, published 2018
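
A rough sketch of my own (the paper's measurements are on the P100 GPU; this runs on a CPU with numpy) of how such kernels' effective bandwidth is estimated, by timing the kernel and counting the bytes it must move:

```python
import time
import numpy as np

def bandwidth_gbs(kernel, bytes_moved, repeats=10):
    """Time a kernel and return its effective memory bandwidth in GB/s."""
    kernel()                                    # warm-up
    t0 = time.perf_counter()
    for _ in range(repeats):
        kernel()
    dt = (time.perf_counter() - t0) / repeats
    return bytes_moved / dt / 1e9

n = 50_000_000
a = np.empty(n, dtype=np.float32)
b = np.random.rand(n).astype(np.float32)

copy = lambda: np.copyto(a, b)                    # COPY: a[i] = b[i]
avg = lambda: np.add(b[:-2], b[2:], out=a[1:-1])  # AVG: a[i] = b[i-1] + b[i+1]

# Both kernels move ~8 bytes per element (neighbouring reads assumed cached).
print(f"COPY: {bandwidth_gbs(copy, 8 * n):.1f} GB/s")
print(f"AVG : {bandwidth_gbs(avg, 8 * n):.1f} GB/s")
```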

SLIDES 61–68 — How realistic is it to overcome the 65-fold shortfall of a grid-based implementation like COSMO-global?

  • 1. Icosahedral grid (ICON) vs. lat-long/Cartesian grid (COSMO): 2x fewer grid columns, and a time step of 10 ms instead of 5 ms → 4x
  • 2. Improving BW efficiency: improve bandwidth efficiency and peak bandwidth (results on Volta show this is realistic) → 2x
  • 3. Weak scaling: 4x is possible in COSMO, but we reduced the available parallelism by a factor of 2 → 2x
  • 4. Remaining reduction in shortfall: numerical algorithms (larger time steps) and further improved processors/memory → 4x

But we don't want to increase the footprint of the 2021 system beyond “Piz Daint”.
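
Combined (my arithmetic, not on the slide), the four factors cover the shortfall:

```python
# Product of the four projected improvement factors:
print(4 * 2 * 2 * 4)  # 64, approximately the 65x shortfall
```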

SLIDES 69–70 — The importance of ensembles

Peter Bauer, ECMWF

SLIDE 71 — Remaining goals beyond 2021 (by 2024?)

  • 1. Improve the throughput to 5 SYPD
  • 2. Reduce the footprint of a single simulation by up to a factor of 10

SLIDE 72 — Data challenge: what if all models run at 1 km scale?

With the current workflow used by the climate community, an IPCC-class campaign at 1 km scale would on average produce 50 exabytes of data! The only way out would be to analyse the model while the simulation is running.

New workflow:
  • 1. A first set of runs generates model trajectories and checkpoints
  • 2. Reconstruct model trajectories from the checkpoints and analyse (as often as necessary)

SLIDES 73–77 — Summary and Conclusions

  • While flop/s may be good for comparing performance across the history of computing, it is not a good metric for designing the systems of the future
  • Given today's challenges, use memory use efficiency (MUE) instead: MUE = “I/O efficiency” × “bandwidth efficiency”
  • Convection-resolving weather and climate simulations at 1 km horizontal resolution:
  • represent a big leap in the quality of weather and climate simulations;
  • the aim is to pull in this milestone by a decade, to the early 2020s rather than the 2030s;
  • desired throughput is 1 SYPD by 2021 and 5 SYPD by the mid-2020s;
  • a 50-200x performance improvement is needed to reach the 2021 goal of 1 SYPD
  • System improvements needed: improve “bandwidth efficiency” for regular but non-stride-1 memory access
  • Application software improvements needed: further improve scalability; improve algorithms (time integration has potential)

SLIDE 78 — Collaborators

Tim Palmer (U. of Oxford), Christoph Schär (ETH Zurich), Oliver Fuhrer (MeteoSwiss), Peter Bauer (ECMWF), Bjorn Stevens (MPI-M), Torsten Hoefler (ETH Zurich), Nils Wedi (ECMWF)