REXI: breaking the time step constraint



SLIDE 1

REXI: breaking the time step constraint

David Acreman, Jemma Shipton, Colin Cotter and Beth Wingate

SLIDE 2

Why REXI?

  • Trends in processor design are towards an increasing number of cores
  • Strong scaling of domain decomposition is limited
  • The time step limits weak scaling
  • We need to find parallelism elsewhere

https://www.karlrupp.net/2015/06/40-years-of-microprocessor-trend-data/

SLIDE 3

Rational approximation of exponential integrator (REXI)

Apply n forward Euler time steps

Approximate the exponential

  • αk and βk are pre-computed complex numbers
  • Terms in the summation can be calculated in parallel

Schreiber et al, 2017, Beyond spatial scalability limitations with a massively parallel method for linear oscillatory problems, International Journal of High Performance Computing Applications
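The structure described on this slide, a sum of independent solves weighted by pre-computed complex numbers αk and βk, can be sketched on a 2×2 oscillator. This is not the coefficient set of Schreiber et al: as a stand-in, the poles and weights below come from a trapezoidal-rule discretisation of the Cauchy integral for the exponential on a circle. The function names, the circle radius, the number of terms and the use of a thread pool are all illustrative assumptions.

```python
import cmath
import math
from concurrent.futures import ThreadPoolExecutor


def solve_2x2(a, b, c, d, rhs):
    """Solve [[a, b], [c, d]] x = rhs by Cramer's rule."""
    det = a * d - b * c
    return ((d * rhs[0] - b * rhs[1]) / det, (a * rhs[1] - c * rhs[0]) / det)


def contour_coefficients(n_terms, radius):
    """Poles alpha_k on a circle and trapezoidal-rule weights beta_k (stand-in values)."""
    alphas = [radius * cmath.exp(2j * math.pi * k / n_terms) for k in range(n_terms)]
    betas = [alpha * cmath.exp(alpha) / n_terms for alpha in alphas]
    return alphas, betas


def rational_exp_apply(omega, t, u0, n_terms=64, radius=10.0):
    """Approximate exp(t L) u0 for L = [[0, -omega], [omega, 0]].

    Uses exp(t L) ~ sum_k beta_k (alpha_k I - t L)^{-1}, which requires
    |omega * t| < radius so that the spectrum of t L lies inside the circle.
    """
    a = omega * t
    alphas, betas = contour_coefficients(n_terms, radius)

    def one_term(pair):
        alpha, beta = pair
        # Each term is one independent linear solve: (alpha I - t L) x = u0
        x = solve_2x2(alpha, a, -a, alpha, u0)
        return (beta * x[0], beta * x[1])

    # The terms do not depend on each other, so they can be mapped in parallel
    with ThreadPoolExecutor() as pool:
        terms = list(pool.map(one_term, zip(alphas, betas)))
    return (sum(term[0] for term in terms), sum(term[1] for term in terms))


if __name__ == "__main__":
    approx = rational_exp_apply(omega=1.0, t=3.0, u0=(1.0, 0.0))
    exact = (math.cos(3.0), math.sin(3.0))  # exp(t L) is a rotation by omega * t
    print(abs(approx[0] - exact[0]), abs(approx[1] - exact[1]))
```

The point of the sketch is the shape of the computation: one large step in t costs many solves, but every solve is independent of the others.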

SLIDE 4

Rational approximation of exponential integrator (REXI)

Approximate the exponential using Gaussian basis functions

Approximate Gaussians as a sum of rational terms

  • al and μ are pre-computed constants (Haut et al, 2015)
  • Terms in the sum over M can be calculated in parallel

Haut et al, 2015, A high-order time-parallel scheme for solving wave propagation problems via the direct construction of an approximate time-evolution operator, IMA Journal of Numerical Analysis (2016) 36, 688–716

hM > |tλMAX|, where M is the no. of Gaussians and h is the width of a Gaussian
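The slide's equations are not reproduced in this text. As a sketch only, the two approximation stages can be written in a form commonly used for this construction. The normalisation of the Gaussian, the symmetric index ranges and the upper limit L of the rational sum are assumptions here and should be checked against the cited papers.

```latex
% Stage 1: the exponential as a sum of shifted Gaussians of width h
e^{ix} \approx \sum_{m=-M}^{M} b_m \, \psi_h(x + m h),
\qquad
\psi_h(x) = \frac{1}{\sqrt{4\pi}} \, e^{-x^2/(4h^2)},
\qquad
b_m = e^{h^2} e^{-i m h}

% Stage 2: each Gaussian as a sum of rational terms with constants a_l and \mu
\psi_h(x) \approx \operatorname{Re} \sum_{l=-L}^{L} \frac{a_l}{\, i x / h + \mu + i l \,}
```

In stage 1 the Gaussians are centred at x = −mh for m = −M, …, M, so together they only cover roughly |x| < hM. With x = tλ and λ ranging over the spectrum of the operator, this is one way to see where the requirement hM > |tλMAX| comes from.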

SLIDE 5

REXI study

  • REXI results are presented in Schreiber et al (2017) for benchmark problems applied to the shallow water equations
  • We will also solve the shallow water equations but with some significant differences:
  • Finite difference or spectral → finite elements (Firedrake)
  • Regular unit square → icosahedral sphere in physical co-ordinates
  • Looking for speed up over conventional time stepping

Schreiber et al, 2017, Beyond spatial scalability limitations with a massively parallel method for linear oscillatory problems, International Journal of High Performance Computing Applications

SLIDE 6

Convergence tests

  • Initial conditions: polar wave
  • Run REXI with varying number of terms (M) with h=0.2 (width of Gaussian)
  • Check L2 error norm vs reference solution (implicit midpoint method with 25s time step)
  • Increase REXI time step (t) and determine the number of terms (M) required to achieve convergence
  • Expect: hM > |tλMAX|
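A scalar analogue of this test can be run without a model. The sketch below approximates exp(ix) at x = tλMAX by a truncated sum of shifted Gaussians and searches for the smallest M that reaches a tolerance. The Gaussian normalisation and weights, the tolerance, the value λMAX = 0.0015 (borrowed from the estimate quoted later in the slides) and all function names are assumptions. It only exercises the Gaussian stage, not the rational stage or the shallow water solver.

```python
import cmath
import math


def gaussian_sum(x, h, M):
    """Sum of 2M+1 shifted Gaussians of width h that approximates exp(i*x)."""
    total = 0j
    for m in range(-M, M + 1):
        weight = math.exp(h * h) * cmath.exp(-1j * m * h)
        gaussian = math.exp(-((x + m * h) ** 2) / (4.0 * h * h)) / math.sqrt(4.0 * math.pi)
        total += weight * gaussian
    return total


def min_terms(x, h, tol=1e-6, M_limit=2000):
    """Smallest M whose error at x (the edge of the spectrum) is below tol."""
    for M in range(1, M_limit + 1):
        if abs(gaussian_sum(x, h, M) - cmath.exp(1j * x)) < tol:
            return M
    raise ValueError("no M up to M_limit reached the tolerance")


LAMBDA_MAX = 0.0015  # assumed value, not measured here
H = 0.2              # Gaussian width used in the convergence tests

for t in (7500.0, 15000.0, 30000.0):
    M = min_terms(t * LAMBDA_MAX, H)
    print(f"t={t:.0f}s  M={M}  hM={H * M:.1f}  t*lambda_max={t * LAMBDA_MAX:.2f}")
```

In this toy setting the required M grows with t and hM stays above tλMAX, which is the behaviour the full tests are expected to show.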
SLIDE 7

hM > |tλMAX|

h=0.2, refinement level=3

[Figure: U L2 error norm against number of REXI terms (M), for t=7 500s, t=15 000s, t=30 000s, t=60 000s and t=120 000s]

t/ks    M     λMAX
7.5     64    0.0017
15      112   0.0015
30      224   0.0015
60      432   0.0014
120     864   0.0016

Increasing t requires larger M (linear) ✅
Increasing t increases error ✅
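The arithmetic behind the table can be sketched as follows, assuming λMAX is estimated as hM/t, i.e. that the bound hM > |tλMAX| is only just met at the minimum M. The tabulated values may have been obtained differently, so individual rows need not agree exactly with this estimate. The function name is illustrative.

```python
H = 0.2  # Gaussian width used for this table

# (t in seconds, minimum M for convergence), copied from the table above
rows = [(7500, 64), (15000, 112), (30000, 224), (60000, 432), (120000, 864)]


def lambda_max_estimate(h, M, t):
    """Estimate of lambda_max if the bound hM > |t * lambda_max| is just met."""
    return h * M / t


for t, M in rows:
    # A roughly constant M per 1000 s indicates that M grows linearly with t
    print(f"t={t:>6}s  M={M:>3}  M per 1000s={1000 * M / t:.2f}  "
          f"hM/t={lambda_max_estimate(H, M, t):.5f}")
```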

SLIDE 8

hM > |tλMAX| ≈ 45 ⇒ λMAX ≈ 0.0015

t=30 000s, refinement level=3

[Figure: U L2 error norm against number of REXI terms (M), for h=0.1, h=0.2, h=0.4, h=0.8, h=1.6, h=2.4 and h=3.2]

h       M     h×M
0.2     224   44.8
0.4     112   44.8
0.8     64    51.2
1.6     32    51.2

hM is constrained but what about h on its own?

SLIDE 9

[Figure: four panels (t=30 000s, t=60 000s, t=120 000s, t=240 000s), each showing U L2 error norm against number of REXI terms (M) for h=0.1, h=0.2, h=0.4, h=0.8 and h=1.6]

Can we use h=1.6 with a larger t?

SLIDE 10

[Figure: four panels (refinement level=2, refinement level=3, refinement level=4, refinement level=5), each showing U L2 error norm against number of REXI terms (M) for h=0.1, h=0.2, h=0.4, h=0.8 and h=1.6]

What about resolution (λMAX)?

SLIDE 11

Scaling tests

  • Measure time for a single REXI step using a PyOP2 timed stage (average over three runs, no I/O in timed region)
  • h=0.2 and 1.6, minimum M for convergence, refinement level 3
  • Single node scaling on Archer: 24 cores per node (2x12)
  • Specify placement to ensure MPI processes are distributed evenly between sockets
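The measurement pattern in the first bullet can be sketched with the standard library alone. This uses `time.perf_counter` as a stand-in for the PyOP2 timed stage, and `dummy_step` is a hypothetical placeholder for one REXI step, so it illustrates the averaging only, not the actual instrumentation.

```python
import statistics
import time


def time_step(step, n_runs=3):
    """Return the mean wallclock time of calling step() over n_runs runs."""
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        step()  # only the step itself is timed; keep any I/O outside this call
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)


def dummy_step():
    """Hypothetical placeholder workload standing in for one REXI step."""
    return sum(i * i for i in range(100_000))


mean_seconds = time_step(dummy_step)
print(f"mean wallclock time over 3 runs: {mean_seconds:.6f} s")
```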

SLIDE 12

h=0.2, refinement level=3
Reference solution: 115 (1 proc) → 1300 (24 procs)

[Figure: model time / wallclock time against no. of processors (4 to 24), for t=7500, M=64; t=15000, M=112; t=30000, M=224; t=60000, M=432; t=120000, M=864]
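For orientation, the reference solution figures can be turned into a speed-up and a parallel efficiency. The sketch assumes the two reference numbers are in the same model time / wallclock time units as the plot axis. The function names are illustrative.

```python
def speed_up(rate_many, rate_one):
    """Ratio of throughput on many processes to throughput on one process."""
    return rate_many / rate_one


def parallel_efficiency(rate_many, rate_one, n_procs):
    """Speed-up divided by the number of processes (1.0 would be ideal scaling)."""
    return speed_up(rate_many, rate_one) / n_procs


REFERENCE_1_PROC = 115.0     # model time / wallclock time on 1 process
REFERENCE_24_PROCS = 1300.0  # model time / wallclock time on 24 processes

s = speed_up(REFERENCE_24_PROCS, REFERENCE_1_PROC)
e = parallel_efficiency(REFERENCE_24_PROCS, REFERENCE_1_PROC, 24)
print(f"speed-up on 24 processes: {s:.1f}, parallel efficiency: {e:.2f}")
```

The same two functions can be applied to any curve read off the plot to compare REXI against conventional time stepping at a given process count.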

SLIDE 13

h=1.6, refinement level=3
Reference solution: 115 (1 proc) → 1300 (24 procs)

[Figure: model time / wallclock time against no. of processors (4 to 24), for t=30000, M=32; t=60000, M=64; t=120000, M=112; t=240000, M=224]

SLIDE 14

Future work

  • What value of h to use? Does this depend on the initial conditions (or other factors)?
  • How to trade off speed and accuracy?
  • For a given spatial resolution (affects λMAX) and t
  • Determine maximum h and minimum M for convergence (hM > |tλMAX|)
  • Measure error vs reference solution and time to solution
  • Improve time to solution by reducing MPI overhead: examine in more detail with profiler (e.g. determine load balance)

SLIDE 15

Build with Intel toolchain and run DG advection example under MPI profiler:

[Figure: MPI profiler timeline. Each line is an MPI process; annotations mark time in MPI_Bcast and communication between processes]