A highly scalable Met Office NERC Cloud model, EASC 2015

SLIDE 1

Nick Brown (EPCC), Michele Weiland (EPCC), Adrian Hill (Met Office), Ben Shipway (Met Office) and Chris Maynard (Met Office) nick.brown@ed.ac.uk

A highly scalable Met Office NERC Cloud model

EASC 2015

SLIDE 2

A highly scalable Met Office NERC Cloud model

  • The existing Large Eddy Model (LEM)
  • The replacement Met Office NERC Cloud model (MONC)
  • Performance and scalability

SLIDE 3

Background

  • The Met Office’s Large Eddy Model (LEM) is used for large eddy simulation and cloud resolving modelling
    – Primarily models clouds and atmospheric flows
    – The results of these simulations inform science in their own right and help develop the parameterisations for the Met Office Unified Model (UM).
  • The desire is to do very high resolution (<1m) and/or real time modelling
SLIDE 4

Background

  • However, the LEM was developed in the late 1980s
    – Designed for scalar machines
    – A mixture of FORTRAN 90, 77 and earlier
  • Parallelised in the mid 1990s and initially targeted the T3E (430 GFLOPS)
    – Some perfective maintenance performed since then to enable use on later generation machines, but still using the same basic assumptions.
SLIDE 5

Background – scalability issues

  • The 3D space is decomposed into 2D slices
    – One of the largest runs has been x=y=384, z=150 (22 million grid points) over 192 processes.
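As a rough check of these figures, the grid size and the share held by each process can be worked out directly. The slices-per-process line assumes the slices are taken along one horizontal dimension, which the slide does not state.

```latex
\begin{align*}
N_{\text{grid}} &= 384 \times 384 \times 150 = 22\,118\,400 \approx 22 \text{ million} \\
N_{\text{per process}} &= \frac{22\,118\,400}{192} = 115\,200 \\
\text{slices per process} &= \frac{384}{192} = 2
  \quad (\text{assuming decomposition along } x \text{ or } y)
\end{align*}
```

Under that assumption a domain 384 points wide could not be spread over more than 384 processes, which is one way a slice decomposition caps scalability.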

  • Parallel calls go to MPI through GCOM
    – Generations of users have misunderstood the semantics of these communications (such as blocking) and added lots of superfluous synchronisation.
SLIDE 6

Background – code issues

  • Uses an archaic system for managing the code
  • Global variables
  • Gotos
  • Equivalences
  • Different styles adopted in the same files/procedures
  • No unit tests.
  • Nobody knows the workings of some areas of the code

SLIDE 7

MONC

  • We elected for a complete rewrite of the code, using modern software engineering and parallelism techniques
    – Written in Fortran 2003 with MPI
    – Using Fruit for unit testing and Doxygen for documentation
    – Designed to be a community model which programmers who are not HPC experts can modify, and which scales and performs well.
  • Met Office to get a Cray XC40 machine.
    – This, along with ARCHER, is the initial target for the model.
SLIDE 8

MONC – code architecture

  • Architected as plugins called components
    – All independent of each other
    – Follow a specific standard format
    – Can be enabled/disabled at runtime via configuration files
    – Trivial to create new components
    – Managed via a registry
  • Components contain optional callbacks
    – At initialisation of MONC
    – Per timestep
    – At finalisation of the model
SLIDE 9

MONC – Component example

    ! Returns the descriptor that tells the registry about this component
    type(component_descriptor_type) function test_get_descriptor()
      test_get_descriptor%name="test_component"
      test_get_descriptor%version=0.1
      ! Optional callbacks: only the ones set here are invoked
      test_get_descriptor%initialisation=>initialisation_callback
      test_get_descriptor%timestep=>timestep_callback
    end function test_get_descriptor

    ! Called once at initialisation of MONC
    subroutine initialisation_callback(current_state)
      type(model_state_type), target, intent(inout) :: current_state
      ………………
    end subroutine initialisation_callback

    ! Called on every timestep
    subroutine timestep_callback(current_state)
      type(model_state_type), target, intent(inout) :: current_state
      ………………
    end subroutine timestep_callback

Configuration entry enabling the component:

    test_component_enabled=.true.
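The slides do not show how the registry stores descriptors or invokes the callbacks. The following is a minimal, self-contained sketch of one way this could work in Fortran 2003, using procedure pointers that default to null so that callbacks stay optional. The module names, `register_component`, `run_timestep_callbacks`, the fixed-size `components` array and the contents of `model_state_type` are all hypothetical and not taken from MONC.

```fortran
module registry_sketch
  implicit none

  ! Hypothetical stand-in for the model state handed to every callback.
  type :: model_state_type
     integer :: timestep = 0
  end type model_state_type

  abstract interface
     subroutine callback_interface(current_state)
       import :: model_state_type
       type(model_state_type), target, intent(inout) :: current_state
     end subroutine callback_interface
  end interface

  ! Callbacks default to null, so a component only sets the ones it needs.
  type :: component_descriptor_type
     character(len=32) :: name = ""
     real :: version = 0.0
     procedure(callback_interface), pointer, nopass :: initialisation => null()
     procedure(callback_interface), pointer, nopass :: timestep => null()
     procedure(callback_interface), pointer, nopass :: finalisation => null()
  end type component_descriptor_type

  integer, parameter :: max_components = 16
  type(component_descriptor_type), save :: components(max_components)
  integer, save :: n_components = 0

contains

  ! Store a descriptor so that its callbacks can be invoked later.
  subroutine register_component(descriptor)
    type(component_descriptor_type), intent(in) :: descriptor
    if (n_components >= max_components) stop "registry full"
    n_components = n_components + 1
    components(n_components) = descriptor
  end subroutine register_component

  ! Invoke the timestep callback of every component that provides one.
  subroutine run_timestep_callbacks(current_state)
    type(model_state_type), target, intent(inout) :: current_state
    integer :: i
    do i = 1, n_components
       if (associated(components(i)%timestep)) call components(i)%timestep(current_state)
    end do
  end subroutine run_timestep_callbacks

end module registry_sketch

module test_component_sketch
  use registry_sketch
  implicit none

contains

  function test_get_descriptor() result(descriptor)
    type(component_descriptor_type) :: descriptor
    descriptor%name = "test_component"
    descriptor%version = 0.1
    descriptor%timestep => timestep_callback
  end function test_get_descriptor

  subroutine timestep_callback(current_state)
    type(model_state_type), target, intent(inout) :: current_state
    print *, "test_component called at timestep", current_state%timestep
  end subroutine timestep_callback

end module test_component_sketch

program registry_demo
  use registry_sketch
  use test_component_sketch
  implicit none
  type(model_state_type), target :: state
  ! Assumption: in a real run this flag would be read from the configuration file.
  logical :: test_component_enabled = .true.
  integer :: step

  if (test_component_enabled) call register_component(test_get_descriptor())

  do step = 1, 3
     state%timestep = step
     call run_timestep_callbacks(state)
  end do
end program registry_demo
```

Because the model runner only ever talks to the registry, disabling a component in this sketch amounts to not registering it. No other code needs to change.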

SLIDE 10

MONC - Components

[Diagram of the MONC architecture, with the following labelled groups]

Components: Viscosity, Diffusion, TVD advection, PW advection, Buoyancy, Coriolis, Damping, Forcing, Micro physics, Radiation, Lower BC, Smagorinsky, Mean profiles, Diverr, FFT, Iterative

Model Core: Logging, data collections, data conversions, scientific constants, options database, maths utilities, grid interpolation, definitions

Registry, Model runner

Halo swapping, Decomposition, Check pointer, Termination check, Debugger
SLIDE 11

MONC – IO Server

  • In addition to the model functionality (working on prognostics), data analysis needs to be done to produce diagnostic data
    – Such as the average temperature at each vertical level
    – In the LEM this is done for each timestep from within the model
  • In MONC a separate IO server is used
    – The MONC model can fire and forget required data at any point to the IO server
    – This means that the model can continue to run and not be impacted by IO related latencies.
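The slides do not describe how the data is handed over to the IO server. One common way to get fire and forget behaviour with MPI is a non-blocking send, sketched below. This is an illustration under stated assumptions rather than MONC's implementation: the last rank playing the IO server, the buffer sizes, the tag and the printed mean are all hypothetical. It needs at least two MPI processes.

```fortran
! Illustrative sketch only, not MONC's IO server code.
! Assumption: the last MPI rank acts as the IO server.
program fire_and_forget_sketch
  use mpi
  implicit none

  integer, parameter :: n_levels = 4   ! hypothetical number of vertical levels
  integer, parameter :: n_steps = 3    ! hypothetical number of timesteps
  integer, parameter :: data_tag = 1
  integer :: ierr, rank, n_ranks, io_rank, step, message, request
  integer :: status(MPI_STATUS_SIZE)
  double precision :: send_buffer(n_levels), recv_buffer(n_levels)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, n_ranks, ierr)
  io_rank = n_ranks - 1

  if (n_ranks < 2) then
     print *, "Run with at least two MPI processes"
  else if (rank == io_rank) then
     ! IO server: receive data from any model rank and compute a diagnostic.
     do message = 1, n_steps * (n_ranks - 1)
        call MPI_Recv(recv_buffer, n_levels, MPI_DOUBLE_PRECISION, MPI_ANY_SOURCE, &
                      data_tag, MPI_COMM_WORLD, status, ierr)
        print *, "IO server: rank", status(MPI_SOURCE), "mean =", sum(recv_buffer) / n_levels
     end do
  else
     request = MPI_REQUEST_NULL
     do step = 1, n_steps
        ! The send buffer may only be reused once the previous send has completed.
        ! Waiting on MPI_REQUEST_NULL (first iteration) returns immediately.
        call MPI_Wait(request, status, ierr)
        send_buffer = dble(100 * rank + step)   ! stand-in for real field data
        call MPI_Isend(send_buffer, n_levels, MPI_DOUBLE_PRECISION, io_rank, &
                       data_tag, MPI_COMM_WORLD, request, ierr)
        ! ... the model would carry on with its timestep here without waiting ...
     end do
     call MPI_Wait(request, status, ierr)
  end if

  call MPI_Finalize(ierr)
end program fire_and_forget_sketch
```

The model rank only waits when it is about to overwrite a buffer that is still being sent. With one buffer per outstanding message, even that wait would rarely block.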

[Diagram: MONC Model and IO Server]
SLIDE 12

MONC – IO Server

  • Have many MONC processes and a number of IO servers
    – Typically one core per processor is dedicated to IO, serving the other cores running the model
    – Our own IO server implementation provides a framework where diagnostics can be configured via XML and/or code.
  • Can use any IO server, including XIOS
    – It is just a component in the model which connects to them

[Diagram: two groups of 15 model cores (M), each group served by one IO core (IO)]
SLIDE 13

Performance & scalability - strong

  • Using the dry boundary layer test case which is wind at a specific level in the vertical
  • Strong scaling, 536 million grid points, modelled for 10000 simulation seconds
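In a strong scaling run the global problem is fixed, so the work per process shrinks as processes are added. The back-of-envelope figures below assume that "536 million" denotes $2^{29} = 536\,870\,912$ grid points (the slide gives only the rounded figure) and use the smallest and largest process counts on the plot axis, 2048 and 32768.

```latex
\begin{align*}
\frac{2^{29}}{2048}  &= 2^{18} = 262\,144 \text{ grid points per process} \\
\frac{2^{29}}{32768} &= 2^{14} = 16\,384 \text{ grid points per process}
\end{align*}
```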

[Plot: time (s) against number of MONC processes (2048 to 32768)]
SLIDE 14

Performance & scalability - weak

  • Weak scaling, 65536 grid points per process, modelled for 10000 simulation seconds

[Plot: time (s) against number of MONC processes (1024 to 32768). Annotated global grid sizes: 134 million, 268 million, 536 million, 1.07 billion and 2.1 billion grid points]
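The annotated grid sizes follow directly from the fixed 65536 grid points per process. Pairing each size with a process count is an inference from this arithmetic, not something the slide states explicitly.

```latex
\begin{align*}
N_{\text{grid}}(P) &= 65\,536 \times P = 2^{16} P \\
N_{\text{grid}}(2048)  &= 2^{27} = 134\,217\,728 \approx 134 \text{ million} \\
N_{\text{grid}}(4096)  &= 2^{28} = 268\,435\,456 \approx 268 \text{ million} \\
N_{\text{grid}}(8192)  &= 2^{29} = 536\,870\,912 \approx 536 \text{ million} \\
N_{\text{grid}}(16384) &= 2^{30} = 1\,073\,741\,824 \approx 1.07 \text{ billion} \\
N_{\text{grid}}(32768) &= 2^{31} = 2\,147\,483\,648 \approx 2.1 \text{ billion}
\end{align*}
```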

SLIDE 15

Improving scalability - Iterative solver

  • The Poisson equation is solved for pressure terms
    – The LEM uses an FFT method with a tridiagonal solver. Working in Fourier space, this solves an ordinary differential equation in the vertical but requires forward and backward global FFTs.
    – A similar version has been implemented in MONC, using a pencil decomposition and FFTW for the actual FFT kernel.
    – Regardless, an FFT based approach requires lots of all-to-all communication and won’t scale.
  • An iterative solver (component) has been implemented which replaces the FFT solver (component) and should scale better
    – A matrix-free implementation of ILU preconditioned BiCGStab
    – CG also provided as an option
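To illustrate what matrix-free means here, the sketch below solves a small model problem with unpreconditioned CG, the simpler of the two methods named above. It is not the ILU preconditioned BiCGStab solver and is not MONC code. The operator is applied through a stencil routine and never stored as a matrix. The 1D problem, grid size, right-hand side and `apply_laplacian` are illustrative assumptions. The 1e-4 tolerance mirrors the figure quoted on the solver plots, although the slides do not say how that tolerance is defined; a relative residual is assumed.

```fortran
! Illustrative sketch only: matrix-free conjugate gradient for a 1D
! Poisson-type problem A x = b, where A is the (-1, 2, -1) stencil.
program matrix_free_cg_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 64, max_iterations = 200
  real(dp), parameter :: tolerance = 1.0e-4_dp   ! assumed relative residual
  real(dp) :: x(n), b(n), r(n), p(n), ap(n)
  real(dp) :: alpha, beta, rr_old, rr_new, b_norm
  integer :: iteration

  b = 1.0_dp                 ! hypothetical right-hand side
  x = 0.0_dp                 ! initial guess
  r = b                      ! residual b - A x for x = 0
  p = r                      ! first search direction
  rr_old = dot_product(r, r)
  rr_new = rr_old
  b_norm = sqrt(dot_product(b, b))

  do iteration = 1, max_iterations
     call apply_laplacian(p, ap)          ! only the action of A is needed
     alpha = rr_old / dot_product(p, ap)
     x = x + alpha * p
     r = r - alpha * ap
     rr_new = dot_product(r, r)
     if (sqrt(rr_new) <= tolerance * b_norm) exit
     beta = rr_new / rr_old
     p = r + beta * p
     rr_old = rr_new
  end do

  print *, "iterations:", min(iteration, max_iterations)
  print *, "relative residual:", sqrt(rr_new) / b_norm

contains

  ! Computes av = A v from the stencil, with zero values outside the domain.
  subroutine apply_laplacian(v, av)
    real(dp), intent(in) :: v(:)
    real(dp), intent(out) :: av(:)
    integer :: i, m
    m = size(v)
    do i = 1, m
       av(i) = 2.0_dp * v(i)
       if (i > 1) av(i) = av(i) - v(i - 1)
       if (i < m) av(i) = av(i) - v(i + 1)
    end do
  end subroutine apply_laplacian

end program matrix_free_cg_sketch
```

In a distributed setting, a stencil application like this needs only neighbour (halo) exchanges, and the dot products need global reductions. Neither requires the all-to-all communication of a global FFT.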

SLIDE 16

Iterative vs FFT solver

  • Weak scaling, 65536 grid points per process, modelled for 10000 simulation seconds

[Plot: time (s) against number of MONC processes (1024 to 32768) for the FFT solver and the iterative solver (1e-4)]
SLIDE 17

Precision - single vs double

  • Weak scaling, 65536 grid points per process, modelled for 10000 simulation seconds

[Plot: time (s) against number of MONC processes (1024 to 16384) for FFT single, Iterative single (1e-4), FFT double and Iterative double (1e-4)]
SLIDE 18

Conclusions and further work

  • MONC is a highly scalable and configurable community model
  • Demonstrated model runs and core counts well beyond what the current model can handle
  • GPU version of the advection schemes (to be tested on Piz Daint)
  • The scientific community are starting to use current versions of MONC
  • Scalability aspects to be further tuned