SLIDE 1

Algorithmic Differentiation of Structured Mesh Applications

Gábor Dániel Balogh
Supervisor: Dr. István Reguly

Pázmány Péter Catholic University, Faculty of Information Technology and Bionics

October 20, 2020

  • G. D. Balogh (PPCU – ITK)

October 20, 2020 1 / 18

SLIDE 2

Algorithmic Differentiation - AD

Algorithmic Differentiation (AD) evaluates derivatives of a function defined by a computer program.

SLIDE 3

Algorithmic Differentiation - AD

Algorithmic Differentiation (AD) evaluates derivatives of a function defined by a computer program. Why AD? Numerical methods require derivatives. There are three main ways of automating the computation of derivatives:

Finite differences: slow in high dimensions, lower accuracy
Symbolic differentiation: cannot handle some programming constructs
Algorithmic Differentiation: exact and fast

SLIDE 4

Algorithmic Differentiation - AD

Our program computes f : R^n → R^m: from some input u ∈ R^n it generates some output w ∈ R^m.

SLIDE 5

Algorithmic Differentiation - AD

Our program computes f : R^n → R^m: from some input u ∈ R^n it generates some output w ∈ R^m. Our goal is to get the Jacobian J (or a part of it): J_ij = ∂f_i/∂x_j

SLIDE 6

Algorithmic Differentiation - AD

Our program computes f : R^n → R^m: from some input u ∈ R^n it generates some output w ∈ R^m. Our goal is to get the Jacobian J (or a part of it): J_ij = ∂f_i/∂x_j. But how do we get there?

SLIDE 7

Algorithmic Differentiation - AD

Assume that we can write f as a composite of K functions: f = f_K ∘ f_{K−1} ∘ … ∘ f_1

SLIDE 8

Algorithmic Differentiation - AD

Assume that we can write f as a composite of K functions: f = f_K ∘ f_{K−1} ∘ … ∘ f_1. Then, by the chain rule, we can write the Jacobian as a product: J = J_K · J_{K−1} · … · J_1

SLIDE 9

Algorithmic Differentiation - AD

Assume that we can write f as a composite of K functions: f = f_K ∘ f_{K−1} ∘ … ∘ f_1. Then we can write the Jacobian as a product: J = J_K · J_{K−1} · … · J_1. There are two modes of AD:

Forward (tangent) mode computes J · u = J_K · J_{K−1} · … · J_1 · u, for a seed u ∈ R^n
Backward (adjoint) mode computes J^T · w = J_1^T · J_2^T · … · J_K^T · w, for a seed w ∈ R^m

SLIDE 10

Adjoint mode AD

Backward (adjoint) mode computes J^T · w = J_1^T · J_2^T · … · J_K^T · w, for w ∈ R^m

Use a seed w whose ith element is 1 and all others are 0: J^T · w then produces the ith row of J.

SLIDE 11

Adjoint mode AD

Backward (adjoint) mode computes J^T · w = J_1^T · J_2^T · … · J_K^T · w, for w ∈ R^m

Use a seed w whose ith element is 1 and all others are 0: J^T · w then produces the ith row of J.
Evaluate the product one step at a time: each step only needs the derivative of one function in the chain.
If we choose the f_i decomposition carefully, we can implement AD efficiently.

SLIDE 12

Adjoint mode AD

Backward (adjoint) mode computes J^T · w = J_1^T · J_2^T · … · J_K^T · w, for w ∈ R^m

Use a seed w whose ith element is 1 and all others are 0: J^T · w then produces the ith row of J.
Evaluate the product one step at a time: each step only needs the derivative of one function in the chain.
If we choose the f_i decomposition carefully, we can implement AD efficiently.
But to get J_i we need the state of the program at the evaluation of f_i.

SLIDE 13

Adjoint mode AD

Backward (adjoint) mode computes J^T · w = J_1^T · J_2^T · … · J_K^T · w, for w ∈ R^m

Commonly implemented with operator overloading: each f_i is an elementary operation, so J_i^T · w′ is easy to compute.
But this produces enormous chains, and we need to store every state of every variable.

SLIDE 14

DSLs for future-proof HPC applications

Fast-changing hardware: it is infeasible to maintain code supporting all of it.

Embedded Domain Specific Languages (eDSLs) can hide platform-specific details: future-proof, performance-portable solutions for application developers.

SLIDE 15

Oxford Parallel library for Structured mesh solvers

OPS (Oxford Parallel library for Structured mesh solvers): a C++ library with high-level API calls for structured mesh applications. High-level concepts:

grids
data defined on grids
loops over a subgrid, accessing data through stencils

OPS generates parallel implementations of the loops for all hardware.

SLIDE 16

OPS + AD

Each loop that performs computation must be a call to ops_par_loop, which

takes the loop body as a function
takes descriptors of the datasets: access type and stencil of the access

OPS generates the implementation for every ops_par_loop.

SLIDE 17

OPS + AD

Each loop that performs computation must be a call to ops_par_loop, which

takes the loop body as a function
takes descriptors of the datasets: access type and stencil of the access

OPS generates the implementation for every ops_par_loop. If we have all this information, can we do AD?

SLIDE 18

OPS + AD

If we have all this information, can we do AD? OPS already has control over the sequence of parallel loops. If we have the derivatives of these loops instead of the elementary operations, we can evaluate the chain rule at the loop level.

We assume that either the user supplies the derivative of a loop, or we can obtain it via source transformation.

SLIDE 19

OPS + AD

If we have all this information, can we do AD? OPS already has control over the sequence of parallel loops. If we have the derivatives of these loops instead of the elementary operations, we can evaluate the chain rule at the loop level.

We assume that either the user supplies the derivative of a loop, or we can obtain it via source transformation.

Features missing to perform the backward sweep:

storing intermediate states
reversing the data flow inside loops

SLIDE 20

Intermediate state storage - tape

The generated code registers loops and copies overwritten data into a data structure (the tape). In the backward sweep we execute the adjoint of each loop and load the intermediate states of the datasets.

[Figure: the forward and backward sweeps around the OPS tape. Forward sweep: initialise OPS and the variables, then run ops_par_loop_1 through ops_par_loop_N, writing checkpoints to the tape. Backward sweep: initialise the adjoints, then run the adjoint loops ops_par_loop_N_a1s back to ops_par_loop_1_a1s, loading checkpoints from the tape, interpret the adjoints, and finish.]

SLIDE 21

Reversing data flow

The second problem is that we need to parallelise the adjoints of the stencil loops as well.

Figure 1: Example computational step in OPS given by the user: (a) to compute the results and (b) to compute the derivatives backwards.

a: Compute the mean of the neighbours for each grid point

    inline void mean_kernel(const OPS_ACC<double> &u,
                            OPS_ACC<double> &u_2) {
      u_2(0, 0) = (u(-1, 0) + u(1, 0)
                   + u(0, -1) + u(0, 1)) * 0.25;
    }

b: The corresponding adjoint kernel

    inline void mean_kernel_adjoint(const OPS_ACC<double> &u,
                                    OPS_ACC<double> &u_a1s,
                                    const OPS_ACC<double> &u_2,
                                    OPS_ACC<double> &u_2_a1s) {
      u_a1s(-1, 0) += 0.25 * u_2_a1s(0, 0);
      u_a1s(1, 0)  += 0.25 * u_2_a1s(0, 0);
      u_a1s(0, -1) += 0.25 * u_2_a1s(0, 0);
      u_a1s(0, 1)  += 0.25 * u_2_a1s(0, 0);
    }

SLIDE 22

Reversing data flow

The writing pattern of the forward code allows no race conditions. The adjoint kernels use the reversed stencil, which introduces race conditions on the increments of each adjoint.
SLIDE 23

Avoiding race conditions

CPU: red-black colouring, creating alternating red and black stripes for each thread (Thr 0, Thr 1, …)

CUDA: overloading operators so that the adjoint increments use atomics

SLIDE 24

Performance: CPU

On a representative code from finance, our best implementation produces the derivatives in less than 6× the runtime of the primal code, which is close to the theoretical optimum.

SLIDE 25

Performance: V100

A naive GPU implementation of the adjoint loops takes 10-25× the runtime of the primal on typical problem sizes.

SLIDE 26

Current solution: Memory

Another problem with the current implementation is that checkpointing requires too much memory.

                            Memory (GB)
                                     With checkpointing, by iteration count
    Grid size       primal        10        100       200
    512 × 256        0.025     0.292      1.788     3.792
    1024 × 512       0.094     0.892      6.902     12.90
    1024 × 1024      0.188     1.946      13.04     26.04

SLIDE 27

Conclusions

We extended the OPS library with an adjoint-aware API.
We successfully parallelised the computation of the adjoints for CPUs and GPUs.
We showed promising runtime performance, but the current implementation requires too much memory.

We are currently working on an implementation that recomputes loops instead of storing checkpoints.

Supported by the ÚNKP-19-3-I New National Excellence Program of the Ministry for Innovation and Technology.
