SLIDE 1

Automatic Differentiation for Computational Engineering

Kailai Xu and Eric Darve

SLIDE 2

Outline

1. Overview
2. Computational Graph
3. Forward Mode
4. Reverse Mode
5. AD for Physical Simulation
6. AD Through Implicit Operators
7. Conclusion

SLIDE 3

Overview

Gradients are useful in many applications

Mathematical Optimization: min_{x ∈ R^n} f(x), solved with the gradient descent method x_{n+1} = x_n − α_n ∇f(x_n).
Sensitivity Analysis: f(x + ∆x) ≈ f(x) + f′(x)∆x.
Machine Learning: training a neural network using automatic differentiation (back-propagation).
Solving Nonlinear Equations: solving a nonlinear equation f(x) = 0 using Newton's method x_{n+1} = x_n − f(x_n)/f′(x_n).
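As a tiny illustration of the first and last items, here is a Python sketch of gradient descent and Newton's method with hand-coded derivatives; the functions, step size, and iteration counts are arbitrary choices, not from the slides.

```python
# Gradient descent on f(x) = (x - 3)^2, with f'(x) = 2(x - 3) coded by hand.
x, alpha = 0.0, 0.1
for _ in range(100):
    x = x - alpha * 2 * (x - 3)
print(x)          # converges to the minimizer x = 3

# Newton's method for the nonlinear equation f(x) = x^2 - 2 = 0, f'(x) = 2x.
x = 1.0
for _ in range(10):
    x = x - (x**2 - 2) / (2 * x)
print(x)          # converges to sqrt(2)
```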

SLIDE 4

Terminology

Deriving and implementing gradients by hand is a challenging and time-consuming process. Automatic differentiation: a set of techniques to numerically evaluate the derivative of a function specified by a computer program (Wikipedia). It also bears other names such as autodiff, algorithmic differentiation, computational differentiation, and back-propagation. There are many AD software packages:

1. TensorFlow and PyTorch: deep learning frameworks in Python
2. Adept-2: combined array and automatic differentiation library in C++
3. autograd: efficient computation of derivatives of NumPy code
4. ForwardDiff.jl, Zygote.jl: Julia differentiable programming packages
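For instance, autograd (item 3 above) differentiates ordinary NumPy code; a minimal sketch, assuming autograd is installed:

```python
import autograd.numpy as np   # thin wrapper around NumPy that records operations
from autograd import grad

def f(x):
    return np.sin(x) * x**2

df = grad(f)     # df is a function that returns f'(x)
print(df(0.1))   # compare with the analytic 0.1**2 * cos(0.1) + 2 * 0.1 * sin(0.1)
```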

This lecture: how to compute gradients using automatic differentiation (AD)

Forward mode, reverse mode, and AD for implicit solvers

SLIDE 5

AD Software

https://github.com/microsoft/ADBench

SLIDE 6

Finite Differences

f′(x) ≈ (f(x + h) − f(x))/h,   f′(x) ≈ (f(x + h) − f(x − h))/(2h)

Derived from the definition of the derivative f′(x) = lim_{h→0} (f(x + h) − f(x))/h. Conceptually simple. Curse of dimensionality: to compute the gradient of f : R^m → R, you need at least O(m) function evaluations. Significant numerical error: truncation and roundoff errors.

SLIDE 7

Finite Difference

Example: f(x) = sin(x), f′(x) = cos(x), evaluated at x0 = 0.1. [Figure omitted]
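A short Python sketch of the kind of study behind this slide: approximate f′(x0) with the forward and central differences above and watch the error as h shrinks (roundoff eventually dominates). The range of step sizes is an arbitrary choice.

```python
import math

x0 = 0.1
exact = math.cos(x0)
for k in range(1, 13):
    h = 10.0 ** (-k)
    fwd = (math.sin(x0 + h) - math.sin(x0)) / h             # forward difference
    ctr = (math.sin(x0 + h) - math.sin(x0 - h)) / (2 * h)   # central difference
    print(f"h=1e-{k:02d}  forward error={abs(fwd - exact):.2e}  central error={abs(ctr - exact):.2e}")
```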

SLIDE 8

Finite Difference

Baydin, A. G., Pearlmutter, B. A., Radul, A. A., & Siskind, J. M. (2017). Automatic differentiation in machine learning: a survey. The Journal of Machine Learning Research, 18(1), 5595-5637.

SLIDE 9

Symbolic Differentiation

Symbolic differentiation computes exact derivatives (gradients): there is no approximation error. It works by recursively applying simple rules to symbols:

d/dx(c) = 0,  d/dx(x) = 1,  d/dx(u + v) = d/dx(u) + d/dx(v),  d/dx(uv) = v d/dx(u) + u d/dx(v),  ...

Here c is a variable independent of x, and u, v are variables dependent on x. There may not exist convenient expressions for the analytical gradients of some functions, for example, a blackbox function from a third-party library.
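SymPy applies exactly these kinds of rules; a minimal sketch, assuming sympy is installed:

```python
import sympy as sp

x = sp.symbols('x')
expr = sp.sin(x) * x**2
print(sp.diff(expr, x))   # x**2*cos(x) + 2*x*sin(x)
```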

SLIDE 10

Symbolic Differentiation

Symbolic differentiation can lead to complex and redundant expressions

SLIDE 11

Automatic Differentiation

AD is neither finite difference nor symbolic differentiation. It works by recursively applying simple rules to values:

d/dx(c) = 0,  d/dx(x) = 1,  d/dx(u + v) = d/dx(u) + d/dx(v),  d/dx(uv) = v d/dx(u) + u d/dx(v),  ...

Here c is a variable independent of x, and u, v are variables dependent on x. AD numerically evaluates the gradients of “function units” using symbolic differentiation, and chains the computed gradients using the chain rule

df(g(x))/dx = f′(g(x))g′(x)

It is efficient (linear in the cost of computing the function itself) and numerically stable.

SLIDE 12

Outline

1. Overview
2. Computational Graph
3. Forward Mode
4. Reverse Mode
5. AD for Physical Simulation
6. AD Through Implicit Operators
7. Conclusion

SLIDE 13

Computational Graph

The “language” of automatic differentiation is the computational graph.

The computational graph is a directed acyclic graph (DAG). Each edge represents data: a scalar, a vector, a matrix, or a high dimensional tensor. Each node is a function that consumes several incoming edges and outputs some values. For example:

J = f4(u1, u2, u3, u4), u2 = f1(u1, θ), u3 = f2(u2, θ), u4 = f3(u3, θ).

[Figure: computational graph with nodes f1, f2, f3, f4 and data u1, u2, u3, u4, J, θ]

Let’s build a computational graph for computing z = sin(x1 + x2) + x2^2 x3

SLIDE 14

Building a Computational Graph

z = sin(x1 + x2) + x2^2 x3

SLIDE 15

Building a Computational Graph

z = sin(x1 + x2) + x2^2 x3

SLIDE 16

Building a Computational Graph

z = sin(x1 + x2) + x2^2 x3
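One possible decomposition of z into primitive operations, i.e., the nodes of the graph built on the preceding slides; the intermediate names u, v, w, t below are our own labels, not from the slides.

```python
import math

x1, x2, x3 = 1.0, 2.0, 3.0   # arbitrary input values

u = x1 + x2        # addition node
v = math.sin(u)    # sine node
w = x2 ** 2        # square node
t = w * x3         # multiplication node
z = v + t          # addition node -> output z = sin(x1 + x2) + x2^2 * x3
```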

SLIDE 17

Computing Gradients from a Computational Graph

Automatic differentiation works by propagating gradients in the computational graph. Two basic modes: forward-mode and backward-mode. Forward-mode propagates gradients in the same direction as forward computation. Backward-mode propagates gradients in the reverse direction of forward computation.

SLIDE 18

Computing Gradients from a Computational Graph

Different computational graph topologies call for different modes of automatic differentiation.

One-to-many: forward propagation ⇒ forward-mode AD. Many-to-one: back-propagation ⇒ reverse-mode AD.

SLIDE 19

Outline

1. Overview
2. Computational Graph
3. Forward Mode
4. Reverse Mode
5. AD for Physical Simulation
6. AD Through Implicit Operators
7. Conclusion

SLIDE 20

Automatic Differentiation: Forward Mode AD

Forward-mode automatic differentiation uses the chain rule to propagate the gradients:

∂(f ∘ g)(x)/∂x = f′(g(x))g′(x)

It computes in the same order as the function evaluation. Each node in the computational graph aggregates all the gradients from upstream nodes and forwards the gradient to downstream nodes.
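A minimal dual-number sketch of this idea: each quantity carries a (value, derivative) pair, and every primitive operation propagates both in the same pass as the function evaluation. Only +, *, and sin are implemented here; this is an illustration, not a full AD system.

```python
import math

class Dual:
    """A value together with its derivative with respect to the input."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def sin(d):
    return Dual(math.sin(d.val), math.cos(d.val) * d.dot)

x = Dual(0.5, 1.0)    # seed the input derivative dx/dx = 1
y = x * x + sin(x)    # forward pass carries values and derivatives together
print(y.val, y.dot)   # y.dot == 2*0.5 + cos(0.5)
```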

SLIDE 21

Example: Forward Mode AD

Let’s consider a specific way for computing f(x) = [x^4, x^2 + sin(x), −sin(x)]^T

SLIDE 22

Example: Forward Mode AD

Let’s consider a specific way for computing f(x) = [x^4, x^2 + sin(x), −sin(x)]^T

Intermediate variables: y1 = x^2, y2 = sin x, y3 = y1^2, y4 = y1 + y2, y5 = −y2

(y1, y1′) = (x^2, 2x)
(y2, y2′) = (sin x, cos x)

SLIDE 23

Example: Forward Mode AD

Let’s consider a specific way for computing f(x) = [x^4, x^2 + sin(x), −sin(x)]^T

Intermediate variables: y1 = x^2, y2 = sin x, y3 = y1^2, y4 = y1 + y2, y5 = −y2

(y1, y1′) = (x^2, 2x)
(y2, y2′) = (sin x, cos x)
(y3, y3′) = (y1^2, 2 y1 y1′) = (x^4, 4x^3)

SLIDE 24

Example: Forward Mode AD

Let’s consider a specific way for computing f(x) = [x^4, x^2 + sin(x), −sin(x)]^T

Intermediate variables: y1 = x^2, y2 = sin x, y3 = y1^2, y4 = y1 + y2, y5 = −y2

(y1, y1′) = (x^2, 2x)
(y2, y2′) = (sin x, cos x)
(y3, y3′) = (y1^2, 2 y1 y1′) = (x^4, 4x^3)
(y4, y4′) = (y1 + y2, y1′ + y2′) = (x^2 + sin x, 2x + cos x)

SLIDE 25

Example: Forward Mode AD

Let’s consider a specific way for computing f(x) = [x^4, x^2 + sin(x), −sin(x)]^T

Intermediate variables: y1 = x^2, y2 = sin x, y3 = y1^2, y4 = y1 + y2, y5 = −y2

(y1, y1′) = (x^2, 2x)
(y2, y2′) = (sin x, cos x)
(y3, y3′) = (y1^2, 2 y1 y1′) = (x^4, 4x^3)
(y4, y4′) = (y1 + y2, y1′ + y2′) = (x^2 + sin x, 2x + cos x)
(y5, y5′) = (−y2, −y2′) = (−sin x, −cos x)
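The trace can be checked against a forward-mode AD implementation, for example JAX's forward-mode Jacobian; this snippet assumes jax is installed and is not part of the original slides.

```python
import jax.numpy as jnp
from jax import jacfwd

def f(x):
    return jnp.array([x**4, x**2 + jnp.sin(x), -jnp.sin(x)])

x0 = 0.3
print(jacfwd(f)(x0))                                            # forward-mode AD
print(jnp.array([4*x0**3, 2*x0 + jnp.cos(x0), -jnp.cos(x0)]))   # analytic derivative
```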

SLIDE 26

Summary

Forward mode AD reuses gradients from upstream nodes. Therefore, this mode is useful for few-to-many mappings f : R^n → R^m, n ≪ m. Applications: sensitivity analysis, uncertainty quantification, etc.

Consider a physical model f : R^n → R^m, and let x ∈ R^n be the quantity of interest (usually a low dimensional physical parameter). Uncertainty propagation computes the perturbation of the model output (usually a high dimensional quantity, i.e., m ≫ 1):

f(x + ∆x) ≈ f(x) + f′(x)∆x
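In code, the perturbation f′(x)∆x is exactly one Jacobian-vector product, which forward mode computes in a single pass. A sketch with jax.jvp; the model f below is a made-up stand-in for a physical model, and the sizes are arbitrary.

```python
import jax.numpy as jnp
from jax import jvp

def f(x):
    # few inputs (n = 2) -> many outputs (m = 1000)
    return jnp.sin(jnp.linspace(0.0, 1.0, 1000) * x[0]) * x[1]

x  = jnp.array([1.0, 2.0])     # nominal parameters
dx = jnp.array([1e-3, 0.0])    # perturbation of interest
fx, dfx = jvp(f, (x,), (dx,))  # f(x) and f'(x) dx in one forward pass
```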

SLIDE 27

Outline

1. Overview
2. Computational Graph
3. Forward Mode
4. Reverse Mode
5. AD for Physical Simulation
6. AD Through Implicit Operators
7. Conclusion

SLIDE 28

Reverse Mode AD

df(g(x))/dx = f′(g(x))g′(x)

Reverse mode computes in the reverse order of the forward computation. Each node in the computational graph aggregates all the gradients from downstream nodes and back-propagates the gradient to upstream nodes.

SLIDE 29

Example: Reverse Mode AD

z = sin(x1 + x2) + x2^2 x3

SLIDE 30

Example: Reverse Mode AD

z = sin(x1 + x2) + x2^2 x3

SLIDE 31

Example: Reverse Mode AD

z = sin(x1 + x2) + x2^2 x3

SLIDE 32

Example: Reverse Mode AD

z = sin(x1 + x2) + x2^2 x3
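To make the propagation order concrete, here is a hand-written reverse sweep for z = sin(x1 + x2) + x2^2 x3, using the same primitive decomposition (and the same made-up intermediate names u, v, w, t) as in the forward-graph sketch earlier.

```python
import math

x1, x2, x3 = 1.0, 2.0, 3.0   # arbitrary input values

# forward pass through the primitives
u = x1 + x2
v = math.sin(u)
w = x2 ** 2
t = w * x3
z = v + t

# reverse pass: start from dz/dz = 1 and walk the graph backwards
dz = 1.0
dv, dt = dz, dz              # z = v + t
dw = dt * x3                 # t = w * x3
dx3 = dt * w
du = dv * math.cos(u)        # v = sin(u)
dx2 = dw * 2 * x2            # w = x2^2
dx1 = du                     # u = x1 + x2
dx2 += du                    # x2 also feeds u, so its contributions are summed

print(dx1, dx2, dx3)         # cos(x1+x2), cos(x1+x2) + 2*x2*x3, x2^2
```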

SLIDE 33

Summary

Reverse mode AD reuses gradients from downstream nodes. Therefore, this mode is useful for many-to-few mappings f : R^n → R^m, n ≫ m. Typical applications:

Deep learning: n = total number of weights and biases of the neural network, m = 1 (the loss function). Mathematical optimization: usually there is only a single objective function.
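A small illustration of the many-to-few case with reverse mode (JAX again; the toy loss and parameter count are arbitrary): one scalar output, many inputs, and a single reverse pass produces every component of the gradient.

```python
import jax.numpy as jnp
from jax import grad

def loss(w):
    return jnp.sum((jnp.tanh(w) - 1.0) ** 2)   # toy scalar loss of many parameters

w = jnp.zeros(10_000)
g = grad(loss)(w)   # all 10,000 partial derivatives from one backward sweep
print(g.shape)
```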

SLIDE 34

Summary

In general, for a function f : R^n → R^m:

Mode      Suitable for   Complexity¹        Application
Forward   m ≫ n          ≤ 2.5 OPS(f(x))    UQ
Reverse   m ≪ n          ≤ 4 OPS(f(x))      Inverse Modeling

There are also many other interesting topics:

Mixed mode AD: many-to-many mappings. Computing sparse Jacobian matrices using AD by exploiting sparse structures.

Margossian CC. A review of automatic differentiation and its efficient implementation. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 2019 Jul;9(4):e1305.

¹ OPS is a metric for complexity in terms of fused-multiply adds.

SLIDE 35

Outline

1. Overview
2. Computational Graph
3. Forward Mode
4. Reverse Mode
5. AD for Physical Simulation
6. AD Through Implicit Operators
7. Conclusion

SLIDE 36

The Demand for Gradients in Physical Simulation

Solving nonlinear equations
Uncertainty quantification / sensitivity analysis
Inverse problems

Image source: https://mirams.wordpress.com/2016/11/23/uncertainty-in-risk-prediction/, http://fourier.eng.hmc.edu/e176/lectures/ch2/node5.html

SLIDE 37

Inverse Problem and Mathematical Optimization

Consider a bar under heating with a source term f(x, t). The right hand side has a fixed temperature and the left hand side is insulated. The governing equation for the temperature u(x, t) is

∂u(x, t)/∂t = κ(x)∆u(x, t) + f(x, t),  t ∈ (0, T), x ∈ Ω
u(1, t) = 0,  t > 0
κ(0) ∂u(0, t)/∂x = 0,  t > 0

The diffusivity coefficient is given by κ(x) = a + bx, where a and b are unknown parameters.

SLIDE 38

Inverse Problem and Mathematical Optimization

Goal: calibrate a and b from the observation u0(t) = u(0, t), where κ(x) = a + bx.

SLIDE 39

Inverse Problem and Mathematical Optimization

This problem is a standard inverse problem. We can formulate it as a PDE-constrained optimization problem:

min_{a,b} ∫_0^T (u(0, t) − u0(t))^2 dt
s.t. ∂u(x, t)/∂t = κ(x)∆u(x, t) + f(x, t),  t ∈ (0, T), x ∈ (0, 1)
     −κ(0) ∂u(0, t)/∂x = 0,  t > 0
     u(1, t) = 0,  t > 0
     u(x, 0) = 0,  x ∈ [0, 1]
     κ(x) = a + bx

SLIDE 40

Numerical Partial Differential Equation

As with many physical modeling techniques, we discretize the PDE using numerical schemes. Here is a finite difference scheme for the PDE (k = 1, 2, ..., m, i = 1, 2, ..., n):

(u_i^{k+1} − u_i^k)/∆t = κ_i (u_{i+1}^{k+1} + u_{i−1}^{k+1} − 2u_i^{k+1})/∆x^2 + f_i^{k+1}

For the initial and boundary conditions, we have

−κ_1 (u_2^{k+1} − u_0^{k+1})/(2∆x) = 0
u_{n+1}^{k+1} = 0
u_i^0 = 0

SLIDE 41

Numerical Partial Differential Equation

Rewriting the equation as a linear system, we have

A(a, b) U^{k+1} = U^k + F^{k+1},   U^k = [u_1^k, u_2^k, ..., u_n^k]^T

Here λ_i = κ_i ∆t/∆x^2 and

A(a, b) =
| 2λ_1 + 1   −2λ_1                                      |
| −λ_2       2λ_2 + 1   −λ_2                            |
|            −λ_3       2λ_3 + 1   −λ_3                 |
|                        ...        ...       −λ_{n−1}  |
|                                   −λ_n      2λ_n + 1  |

F^k = ∆t [f_1^k, f_2^k, ..., f_n^k]^T

SLIDE 42

Computational Graph for Numerical Schemes

The discretized optimization problem is

min_{a,b} Σ_{k=1}^m (u_1^k − u0((k − 1)∆t))^2
s.t. A(a, b) U^{k+1} = U^k + F^{k+1},  k = 1, 2, ..., m
     U^0 = 0

The computational graph for the forward computation (evaluating the loss function) is:

[Figure omitted: the states U^1, ..., U^m are produced by repeated solves with A(a, b), and the u_1^k components feed the loss]

SLIDE 43

Implementation using an AD system
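The implementation shown on the original slide (in ADCME) is not reproduced here. As a rough sketch of what such an implementation can look like, the following uses JAX so that reverse-mode AD gives dJ/da and dJ/db through the time stepping; the grid sizes, source term, and observations are hypothetical placeholders.

```python
import jax.numpy as jnp
from jax import grad

# Hypothetical sizes, grid, source term, and observations (placeholders, not from the slides).
n, m = 50, 100
dx, dt = 1.0 / n, 0.01
x = jnp.linspace(dx / 2, 1.0 - dx / 2, n)
f = jnp.ones((m + 1, n))          # source term f(x, t) sampled on the grid
u0_obs = jnp.zeros(m)             # observed u(0, t)

def loss(params):
    a, b = params
    kappa = a + b * x
    lam = kappa * dt / dx**2
    # Assemble the tridiagonal A(a, b); the first row encodes the insulated boundary,
    # the last row the Dirichlet boundary.
    A = (jnp.diag(2 * lam + 1)
         - jnp.diag(lam[1:], -1)
         - jnp.diag(lam[:-1], 1))
    A = A.at[0, 1].set(-2 * lam[0])
    U = jnp.zeros(n)
    J = 0.0
    for k in range(m):
        U = jnp.linalg.solve(A, U + dt * f[k + 1])   # one implicit time step
        J = J + (U[0] - u0_obs[k]) ** 2 * dt         # accumulate the data misfit
    return J

g = grad(loss)(jnp.array([1.0, 1.0]))   # dJ/da and dJ/db from reverse-mode AD
print(g)
```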

SLIDE 44

Outline

1. Overview
2. Computational Graph
3. Forward Mode
4. Reverse Mode
5. AD for Physical Simulation
6. AD Through Implicit Operators
7. Conclusion

SLIDE 45

Challenges in AD

Most AD frameworks only deal with explicit operators, i.e., the functions that have analytical derivatives, or composition of these functions. Many scientific computing algorithms are iterative or implicit in nature.

Linear/Nonlinear   Explicit/Implicit   Expression
Linear             Explicit            y = Ax
Nonlinear          Explicit            y = F(x)
Linear             Implicit            Ay = x
Nonlinear          Implicit            F(x, y) = 0

SLIDE 46

Example

Consider a function f : x → y, which is implicitly defined by

F(x, y) = x^3 − (y^3 + y) = 0

The forward computation may consist of iterative algorithms, such as Newton's method or the bisection method:

Newton's method:
  y_0 ← 0, k ← 0
  while |F(x, y_k)| > ε do
    δ_k ← F(x, y_k)/F_y(x, y_k)
    y_{k+1} ← y_k − δ_k
    k ← k + 1
  end while
  return y_k

Bisection method:
  a ← −M, b ← M, m ← 0
  while |F(x, m)| > ε do
    m ← (a + b)/2
    if F(x, m) > 0 then b ← m else a ← m end if
  end while
  return m

SLIDE 47

Example

An efficient way is to apply the implicit function theorem. For our example, F(x, y) = x^3 − (y^3 + y) = 0: treat y as a function of x and take the derivative on both sides,

3x^2 − 3y(x)^2 y′(x) − y′(x) = 0  ⇒  y′(x) = 3x^2 / (3y(x)^2 + 1)

The above gradient is exact.
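A small Python sketch combining the two slides: solve F(x, y) = 0 for y with Newton's method, then get dy/dx from the implicit function theorem, and compare against a finite-difference check (the point x0 and the tolerances are arbitrary choices).

```python
def F(x, y):
    return x**3 - (y**3 + y)

def solve_y(x, tol=1e-12):
    """Newton's method for F(x, y) = 0 in y, starting from y = 0."""
    y = 0.0
    while abs(F(x, y)) > tol:
        y = y - F(x, y) / (-(3 * y**2 + 1))   # F_y(x, y) = -(3 y^2 + 1)
    return y

def dy_dx(x):
    y = solve_y(x)
    return 3 * x**2 / (3 * y**2 + 1)          # implicit function theorem

x0 = 2.0
print(dy_dx(x0))
h = 1e-6
print((solve_y(x0 + h) - solve_y(x0 - h)) / (2 * h))   # finite-difference check
```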

SLIDE 48

Implicit Operators in Physical Modeling

Return to our bar problem: what if the material property is more complex and has a temperature-dependent governing equation?

∂u(x, t)/∂t = κ(u)∆u(x, t) + f(x, t),  t ∈ (0, T), x ∈ Ω

An implicit scheme is usually a nonlinear equation and requires an iterative solver (e.g., the Newton-Raphson algorithm) to solve:

(u_i^{k+1} − u_i^k)/∆t = κ(u_i^{k+1}) (u_{i+1}^{k+1} + u_{i−1}^{k+1} − 2u_i^{k+1})/∆x^2 + f_i^{k+1}

Typical AD frameworks cannot handle this operator. We need to differentiate through implicit operators. This topic will be covered in a future lecture: physics constrained learning.

SLIDE 49

Outline

1. Overview
2. Computational Graph
3. Forward Mode
4. Reverse Mode
5. AD for Physical Simulation
6. AD Through Implicit Operators
7. Conclusion

SLIDE 50

Conclusion

What’s covered in this lecture

Reverse mode automatic differentiation;
Forward mode automatic differentiation;
Using AD to solve inverse problems in physical modeling;
Automatic differentiation through implicit operators.

SLIDE 51

What’s Next

Physics constrained learning: inverse modeling using automatic differentiation through implicit operators;
Neural networks and numerical schemes: substitute the unknown component in a physical system with a neural network and learn the neural network with AD;
Implementation of inverse modeling algorithms in ADCME.
