SLIDE 1

Deep Neural Networks for PDEs

Philipp Grohs

DL and Vis, September 2018

SLIDE 2

Short Reading List

1. Ian Goodfellow, Yoshua Bengio, and Aaron Courville: Deep Learning; MIT Press, 2016
2. Aurélien Géron: Hands-On Machine Learning with Scikit-Learn and TensorFlow; O'Reilly, 2017
3. Brian Steele, John Chandler, and Swarna Reddy: Algorithms for Data Science; Springer, 2017
4. Alan Jeffrey: Applied Partial Differential Equations – An Introduction; Academic Press, 2002

SLIDE 3

Syllabus

1. PDEs and the Curse of Dimensionality
2. A Crash Course in Statistical Learning Theory (including a Detour to Variational Autoencoders)
3. PDEs as Learning Problems
4. Solving Linear Kolmogorov Equations by Means of Neural Network Based Learning

SLIDE 4

PDEs and the Curse of Dimensionality

SLIDE 5

PDEs

A PDE for the function u(x₁, …, x_d) is an equation of the form

F(x₁, …, x_d, u, ∂u/∂x₁, …, ∂u/∂x_d, ∂²u/∂x₁∂x₁, …, ∂²u/∂x₁∂x_d, …) = 0,

together with suitable boundary conditions.

SLIDE 6

Heat Equation

∂u/∂t(t, x) = ∂²u/∂x₁² + ∂²u/∂x₂² + ∂²u/∂x₃² + g(t, x), u(0, x) = φ(x), t ∈ (0, ∞), x ∈ ℝ³; d = 4.

SLIDES 7–8

Explicit Solution of the Heat Equation if g = 0

Let u(t, x) satisfy

∂u/∂t(t, x) = ∂²u/∂x₁² + ∂²u/∂x₂² + ∂²u/∂x₃², u(0, x) = φ(x), t ∈ (0, ∞), x ∈ ℝ³; d = 4.

Then

u(t, x) = (4πt)^(−3/2) ∫_{ℝ³} φ(y) exp(−|x − y|²/4t) dy.
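A quick numerical sanity check (a sketch, not part of the original deck; all variable names are hypothetical): the kernel above is the density of a Gaussian with mean x and per-coordinate variance 2t, so the integral equals E[φ(Z)] with Z ∼ N(x, 2t·I) and can be estimated by Monte Carlo. For φ(y) = |y|² the exact solution is u(t, x) = |x|² + 6t, since Δ|x|² = 6 in three dimensions.

```python
# Hypothetical check of the heat-kernel formula (not from the slides).
# (4*pi*t)^(-3/2) * exp(-|x-y|^2/(4t)) is the N(x, 2t*I) density, so the
# integral is E[phi(Z)] with Z ~ N(x, 2t*I).
import numpy as np

rng = np.random.default_rng(0)
t, x = 0.5, np.array([1.0, -2.0, 0.5])

z = x + np.sqrt(2 * t) * rng.standard_normal((10**6, 3))  # samples of Z ~ N(x, 2t*I)
mc_estimate = np.mean(np.sum(z**2, axis=1))               # Monte Carlo estimate of E[|Z|^2]
exact = np.sum(x**2) + 6 * t                              # u(t, x) = |x|^2 + 6t

print(mc_estimate, exact)  # the two values should agree to a few decimal places
```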
SLIDE 9

Fluid Dynamics

∂u/∂t(t, x, v) + v · ∇u(t, x, v) = Qu(t, x, v), t ∈ (0, ∞), x, v ∈ ℝ³; d = 7.

SLIDE 10

Schrödinger Equation

Wave function of a non-relativistic quantum mechanical system of N electrons in a field of K nuclei of charge Z_ν at fixed positions R_ν ∈ ℝ³:

i ∂Ψ/∂t(r₁, …, r_N; t) = −(1/2) ∑_{ξ=1}^{N} Δ_ξ Ψ(r₁, …, r_N; t)
  − ∑_{ξ=1}^{N} ∑_{ν=1}^{K} (Z_ν / |r_ξ − R_ν|) Ψ(r₁, …, r_N; t)
  + (1/2) ∑_{ξ=1}^{N} ∑_{η=1}^{N} ((1 − δ_{ξ,η}) / |r_ξ − r_η|) Ψ(r₁, …, r_N; t),

t ∈ (0, ∞), r₁, …, r_N ∈ ℝ³; d = 3N + 1.

SLIDE 11

Black-Scholes Equation

Pricing a portfolio of N financial derivatives:

∂u/∂t(t, x) = (1/2) ∑_{i,j=1}^{N} x_i x_j β_i β_j ⟨ς_i, ς_j⟩_{ℝ^N} ∂²u/∂x_i∂x_j(t, x) + ∑_{i=1}^{N} µ_i x_i ∂u/∂x_i(t, x),

u(0, x) = max{K − ∑_{i=1}^{N} c_i x_i, 0}, t ∈ (0, ∞), x ∈ ℝ^N; d = N + 1.

SLIDE 12

Learning the PDE [Rudy et al. (2017)]

SLIDES 13–16

Finite Difference Approach

Want to approximate u(x) for x ∈ [0, 1]^d. Let

u_{i₁,…,i_d} ∼ u(i₁ε, …, i_dε), (i₁, …, i_d) ∈ {0, …, ⌊ε⁻¹⌋}^d,

(u_{i₁,…,i_l+1,…,i_d} − u_{i₁,…,i_l,…,i_d}) / ε ∼ (∂/∂x_l) u(i₁ε, …, i_dε), (i₁, …, i_d) ∈ {0, …, ⌊ε⁻¹⌋}^d,

and so on, and solve the discrete system

F(i₁ε, …, i_dε, u_{i₁,…,i_d}, (u_{i₁+1,…,i_d} − u_{i₁,…,i_d})/ε, …) = 0, (i₁, …, i_d) ∈ {0, …, ⌊ε⁻¹⌋}^d.
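As an illustration (a minimal sketch, not from the deck; all names and parameter values are hypothetical), the snippet below sets up this discretization for the 1-d Poisson problem −u'' = f on [0, 1] and then prints how the number of grid unknowns, (⌊ε⁻¹⌋ + 1)^d, explodes with the dimension d:

```python
# Minimal finite-difference sketch (hypothetical example, not from the slides).
import numpy as np

eps = 0.05
n = int(1 / eps) + 1                      # grid points per coordinate: {0, ..., floor(1/eps)}
grid = np.linspace(0.0, 1.0, n)

# Solve -u'' = f on [0, 1] with u(0) = u(1) = 0 and f = pi^2 sin(pi x),
# whose exact solution is u(x) = sin(pi x).
f = np.pi**2 * np.sin(np.pi * grid[1:-1])
A = (2 * np.eye(n - 2) - np.eye(n - 2, k=1) - np.eye(n - 2, k=-1)) / eps**2
u_inner = np.linalg.solve(A, f)
print("max FD error:", np.max(np.abs(u_inner - np.sin(np.pi * grid[1:-1]))))

# The same construction in d dimensions has n**d unknowns:
for d in (1, 2, 3, 10, 100):
    print(f"d = {d:3d}: number of unknowns n^d = {float(n**d):.3e}")
```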

SLIDES 17–19

Curse of Dimensionality

The system

F(i₁ε, …, i_dε, u_{i₁,…,i_d}, (u_{i₁+1,…,i_d} − u_{i₁,…,i_d})/ε, …) = 0, (i₁, …, i_d) ∈ {0, …, ⌊ε⁻¹⌋}^d,

requires us to solve an equation in u_{i₁,…,i_d} for every (i₁, …, i_d) ∈ {0, …, ⌊ε⁻¹⌋}^d.

Exponential Dependence on the Dimension: Let ε = 1/2 (take two samples in each coordinate). Then these are 2^d unknowns: intractable for high-dimensional problems!
SLIDES 20–22

Curse of Dimensionality

The complexity of approximating a general d-dimensional function scales exponentially in d.

Suppose we have a problem where we aim to approximate a d-dimensional function. An algorithm to solve the problem suffers from the curse of dimensionality if its computational complexity depends exponentially on the dimension d.

SLIDES 23–26

Black-Scholes Equation

Pricing a portfolio of N financial derivatives:

∂u/∂t(t, x) = (1/2) ∑_{i,j=1}^{N} x_i x_j β_i β_j ⟨ς_i, ς_j⟩_{ℝ^N} ∂²u/∂x_i∂x_j(t, x) + ∑_{i=1}^{N} µ_i x_i ∂u/∂x_i(t, x),

u(0, x) = max{K − ∑_{i=1}^{N} c_i x_i, 0}, t ∈ (0, ∞), x ∈ ℝ^N; d = N + 1.

Realistic values: d = 100–1000.

Complexity of the finite difference method: 2^100 – 2^1000.

Number of atoms in the universe: roughly 2^250.

SLIDES 27–29

Black-Scholes Equation

Option pricing is extremely relevant and has to be done every day in the financial industry. All algorithms for the solution of the Black-Scholes equation suffer from the curse of dimensionality!

SLIDES 30–34

MNIST

MNIST database for handwritten digit recognition: http://yann.lecun.com/exdb/mnist/

Every image is given as a 28 × 28 matrix x ∈ ℝ^{28×28} ∼ ℝ^784. Every label is given as a 10-dimensional vector y ∈ ℝ^10 describing the 'probability' of each digit.
SLIDES 35–39

MNIST

[Figure: handwritten digit image → ConvNet → predicted label "5"]

This is a 784-dimensional function. Apparently, deep learning does not suffer from the curse of dimensionality for certain classification problems! Can this also be used for the solution of PDEs?

SLIDE 40

A Crash Course in Statistical Learning Theory

SLIDES 41–44

Data Generating Distribution

Suppose that there exists a probability distribution on ℝ^784 that randomly generates handwritten digits.

Variational Autoencoder Demo

SLIDES 45–51

A New Look

Suppose that our training data consists of samples drawn according to a given data distribution (X, Y). If we knew the data distribution (X, Y), the best functional relation between X and Y would simply be E[Y | X = x]! But we only have samples and do not know the distribution (X, Y).

A mathematical learning problem seeks to infer the regression function E[Y | X = x] from random samples (x_i, y_i)_{i=1}^m of (X, Y).

SLIDES 52–55

Mathematical Formulation

Let (Ω, F, P) be a probability space and let X : Ω → ℝ^d and Y : Ω → ℝ^n be random vectors. Find the best functional relationship Û : ℝ^d → ℝ^n between these vectors in the sense that

Û = argmin_{U : ℝ^d → ℝ^n} ∫ |U(X(ω)) − Y(ω)|² dP(ω) = argmin_{U : ℝ^d → ℝ^n} E[|U(X) − Y|²].

We have Û(x) = E[Y | X = x]. Û is called the regression function.
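A small numerical illustration of this fact (a hypothetical sketch, not from the deck): among all functions of X, the conditional expectation minimizes the mean squared error. Discretizing X into bins makes "functions of X" just one value per bin; with skewed noise (so that mean and median differ) the per-bin mean beats another natural candidate, the per-bin median.

```python
# Hypothetical illustration (not from the slides): E[Y | X = x] minimizes E|U(X) - Y|^2.
import numpy as np

rng = np.random.default_rng(1)
m = 200_000
x = rng.uniform(0.0, 1.0, m)
# Skewed, centered noise: exponential minus its mean, so median != mean.
y = np.sin(2 * np.pi * x) + rng.exponential(0.5, m) - 0.5   # E[Y | X = x] = sin(2*pi*x)

bins = np.linspace(0.0, 1.0, 51)
idx = np.digitize(x, bins) - 1          # bin index of each sample, 0..49

# Candidate 1: per-bin sample mean (empirical conditional expectation).
bin_mean = np.array([y[idx == b].mean() for b in range(50)])
mse_mean = np.mean((bin_mean[idx] - y) ** 2)

# Candidate 2: per-bin median, a different function of X.
bin_med = np.array([np.median(y[idx == b]) for b in range(50)])
mse_med = np.mean((bin_med[idx] - y) ** 2)

print(mse_mean, mse_med)  # mse_mean is strictly smaller (up to sampling noise)
```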

SLIDES 56–60

Statistical Learning Theory

Let z = ((x₁, y₁), …, (x_m, y_m)) be m realizations of samples independently drawn according to (X, Y). For a function U : ℝ^d → ℝ^k define the empirical risk of U by

E_z(U) = (1/m) ∑_{i=1}^{m} |U(x_i) − y_i|².

Empirical Risk Minimization (ERM) picks a hypothesis class H ⊂ C(ℝ^d, ℝ^k) and computes the empirical regression function

Û_{H,z} ∈ argmin_{U ∈ H} E_z(U).

Example: H = {polynomials of degree ≤ p}.
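For the polynomial example, ERM is just least-squares fitting. Here is a hypothetical sketch (not from the deck; target function and parameter values invented for illustration) that previews the under-/overfitting behaviour of the next slides:

```python
# Hypothetical ERM sketch: least-squares polynomial fitting is exactly ERM over
# H = {polynomials of degree <= p}. (numpy may warn that the degree-25 fit is
# poorly conditioned; that is part of the point.)
import numpy as np

rng = np.random.default_rng(2)
m = 30
x = np.sort(rng.uniform(-1.0, 1.0, m))
y = np.cos(2.0 * x) + 0.1 * rng.standard_normal(m)   # noisy samples of a smooth target

x_test = np.linspace(-1.0, 1.0, 1000)
y_test = np.cos(2.0 * x_test)

for p in (1, 4, 25):
    coeffs = np.polyfit(x, y, deg=p)                 # ERM over degree-<=p polynomials
    emp_risk = np.mean((np.polyval(coeffs, x) - y) ** 2)
    true_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"p = {p:2d}: empirical risk {emp_risk:.4f}, error on fresh points {true_err:.4f}")
# Low degree underfits (both errors large); very high degree overfits
# (empirical risk tiny, error on fresh points large).
```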

SLIDES 61–62

Degree too low: underfitting. Degree too high: overfitting!

SLIDES 63–64

Figure: Error vs. Polynomial Degree

Bias-Variance Problem: the "capacity" of the hypothesis space has to be adapted to the complexity of the target function and the sample size!
SLIDES 65–68

Bias-Variance Decomposition

Let (X, Y) be the data-generating random variables and Û the regression function. Let z = (x_i, y_i)_{i=1}^m be i.i.d. samples, H a hypothesis class, and Û_{H,z} the empirical regression function. We seek to understand the error

ε := E(Û_{H,z}) − E(Û) = E|Û_{H,z}(X) − Û(X)|².

Bias-Variance Decomposition: Let U_H := argmin_{U ∈ H} E|U(X) − Û(X)|², let ε_approx := E|U_H(X) − Û(X)|² be the approximation error and ε_generalize := E(Û_{H,z}) − E(U_H) the generalization error. Then ε = ε_approx + ε_generalize.

Main Theorem [e.g., Cucker-Zhou (2007)]: If m ≳ ln(N(H, c·η))/η² (and very strong conditions hold), then ε_generalize ≤ η w.h.p., where N(H, s) is the s-covering number of H w.r.t. L∞.

Problems for Data Science Applications:
- The assumption that the data is i.i.d. is debatable.
- Deep learning operates in a different asymptotic regime (where often #DOFs >> #training samples).
- Without knowing P_(X,Y) it is impossible to control the approximation error.

SLIDE 69

PDEs as Learning Problems

SLIDES 70–76

Explicit Solution of the Heat Equation if g = 0

Let u(t, x) satisfy

∂u/∂t(t, x) = ∂²u/∂x₁² + ∂²u/∂x₂² + ∂²u/∂x₃², u(0, x) = φ(x), t ∈ (0, ∞), x ∈ ℝ³; d = 4.

Then

u(t, x) = ∫_{ℝ³} φ(y) (4πt)^(−3/2) exp(−|x − y|²/4t) dy.

In other words, u(t, x) = E[φ(Z_t^x)], where Z_t^x ∼ N(x, 2t·I) (the Gaussian with mean x and covariance 2t·I, matching the kernel above).

In other words, for x ∈ [u, v]³, X ∼ U[u, v]³ and Y = φ(Z_t^X), we have u(t, x) = E[Y | X = x].

The solution u(t, x) of the PDE can be interpreted as the solution to the learning problem with data distribution (X, Y), where X ∼ U[u, v]³, Y = φ(Z_t^X), and Z_t^x ∼ N(x, 2t·I)!

Contrary to conventional ML problems, the data distribution is now explicitly known: we can simulate as much training data as we want!

We will see in a minute that similar properties hold for a much more general class of PDEs!
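To make "simulate as much training data as we want" concrete, here is a hypothetical data-generation sketch (not from the deck; box, time horizon, and initial condition are invented), reading the heat kernel as the N(X, 2t·I) distribution as above:

```python
# Hypothetical training-data generation for the heat-equation learning problem.
import numpy as np

rng = np.random.default_rng(3)
t = 0.25
lo, hi = 0.0, 1.0                       # the box [u, v]^3, renamed to avoid clashing with u
phi = lambda y: np.sum(y**2, axis=1)    # initial condition phi(y) = |y|^2

m = 1_000_000
X = rng.uniform(lo, hi, (m, 3))                       # features: X ~ U[lo, hi]^3
Z = X + np.sqrt(2 * t) * rng.standard_normal((m, 3))  # Z ~ N(X, 2t*I)
Y = phi(Z)                                            # labels: Y = phi(Z)

# Regression-function check at one point: E[Y | X = x0] should equal |x0|^2 + 6t.
x0 = np.array([0.5, 0.5, 0.5])
near = np.all(np.abs(X - x0) < 0.05, axis=1)          # crude local average around x0
print(Y[near].mean(), np.sum(x0**2) + 6 * t)          # agreement to a few percent
```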

SLIDES 77–79

Linear Kolmogorov Equations

Given Σ : ℝ^d → ℝ^{d×d}, µ : ℝ^d → ℝ^d and an initial value φ : ℝ^d → ℝ, find u : ℝ₊ × ℝ^d → ℝ with

∂u/∂t(t, x) = (1/2) Trace(Σ(x) Σᵀ(x) Hessₓ u(t, x)) + µ(x) · ∇ₓ u(t, x), (t, x) ∈ [0, T] × ℝ^d, u(0, x) = φ(x).

Examples include convection-diffusion equations and the Black-Scholes equation.

Standard methods such as sparse grid methods, sparse tensor product methods, spectral methods, finite element methods or finite difference methods are incapable of solving such equations in high dimensions (d = 100)!

SLIDES 80–84

Special Case: Pricing of Financial Derivatives

Given a portfolio consisting of d assets with values (x_i(t))_{i=1}^d.

European Max Option: At time T, exercise the option and receive

G(x) := max{ max_{i=1,…,d} (x_i − K_i), 0 }.

(Black-Scholes (1973)): in the absence of correlations the portfolio value u(t, x) satisfies

∂u/∂t(t, x) + (µ/2) ∑_{i=1}^{d} x_i ∂u/∂x_i(t, x) + (σ²/2) ∑_{i=1}^{d} |x_i|² ∂²u/∂x_i²(t, x) = 0,

u(T, x) = G(x).

Pricing Problem: u(0, x) = ??
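A Monte Carlo sketch of this pricing problem (hypothetical, not from the deck; constant µ, σ, strikes, and initial values are invented, and discounting is omitted for simplicity). In the uncorrelated model each asset follows a geometric Brownian motion, which can be sampled exactly: x_i(T) = x_i(0) · exp((µ − σ²/2)T + σ√T ξ_i) with ξ_i standard normal.

```python
# Hypothetical Monte Carlo pricing sketch for the European max option.
import numpy as np

rng = np.random.default_rng(4)
d, T, mu, sigma = 100, 1.0, 0.05, 0.2
K = np.full(d, 100.0)           # strikes K_i (invented values)
x0 = np.full(d, 100.0)          # initial asset values (invented)

m = 50_000
xi = rng.standard_normal((m, d))
xT = x0 * np.exp((mu - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * xi)  # exact GBM samples
payoff = np.maximum(np.max(xT - K, axis=1), 0.0)  # G(x) = max{max_i (x_i - K_i), 0}

print("Monte Carlo estimate of E[G(x(T))]:", payoff.mean())
```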

SLIDES 85–87

Kolmogorov PDEs as Learning Problems

For x ∈ ℝ^d and t ∈ ℝ₊ let

Z_t^x := x + ∫₀ᵗ µ(Z_s^x) ds + ∫₀ᵗ Σ(Z_s^x) dW_s.

Then (Feynman-Kac) u(T, x) = E[φ(Z_T^x)].

Lemma (Beck-Becker-G-Jaafari-Jentzen (2018)): Let X ∼ U[a, b]^d and let Y = φ(Z_T^X). The solution Û of the mathematical learning problem with data distribution (X, Y) is given by Û(x) = u(T, x), x ∈ [a, b]^d, where u solves the corresponding Kolmogorov equation.
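The SDE above generally has no closed form, but the training pairs (X, Y) can be generated approximately with the Euler-Maruyama scheme. A hypothetical sketch (not from the deck; the drift, diffusion, payoff, and all parameter values are invented stand-ins, here resembling an uncorrelated Black-Scholes model):

```python
# Hypothetical Euler-Maruyama generation of training pairs (X, Y), Y = phi(Z_T^X).
import numpy as np

rng = np.random.default_rng(5)
d, T, n_steps = 100, 1.0, 100
a, b = 90.0, 110.0
dt = T / n_steps

mu = lambda z: 0.05 * z            # drift mu(z), affine in z
sigma = lambda z: 0.2 * z          # diagonal diffusion: Sigma(z) = 0.2 * diag(z)
phi = lambda z: np.maximum(np.max(z - 100.0, axis=1), 0.0)  # payoff / initial condition

m = 10_000
X = rng.uniform(a, b, (m, d))      # features: X ~ U[a, b]^d

Z = X.copy()
for _ in range(n_steps):           # Euler-Maruyama: Z += mu(Z) dt + Sigma(Z) dW
    dW = np.sqrt(dt) * rng.standard_normal((m, d))
    Z = Z + mu(Z) * dt + sigma(Z) * dW

Y = phi(Z)                         # labels: Y ~ phi(Z_T^X)
print(X.shape, Y.shape, Y.mean())
```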

SLIDE 88

Solving linear Kolmogorov Equations by means of Neural Network Based Learning

SLIDES 89–95

The Vanilla DL Paradigm

Every image is given as a 28 × 28 matrix x ∈ ℝ^{28×28} ∼ ℝ^784. Every label is given as a 10-dimensional vector y ∈ ℝ^10 describing the 'probability' of each digit.

Given labeled training data (x_i, y_i)_{i=1}^m ⊂ ℝ^784 × ℝ^10.

Fix a network architecture, e.g., the number of layers (for example L = 3) and the numbers of neurons (N₁ = 30, N₂ = 30).

The learning goal is to find the empirical regression function f_z ∈ H^σ_{(784,30,30,10)}.

Typically solved by stochastic first-order optimization methods.
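In modern Keras this paradigm is a few lines. A hypothetical sketch (not from the deck; the activation, optimizer, and the random stand-in data are assumptions) of the architecture H^σ_{(784,30,30,10)}:

```python
# Hypothetical Keras sketch of the (784, 30, 30, 10) architecture, trained by a
# stochastic first-order method (SGD) on the empirical squared risk.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(30, activation="relu", input_shape=(784,)),  # N1 = 30
    tf.keras.layers.Dense(30, activation="relu"),                      # N2 = 30
    tf.keras.layers.Dense(10),                                         # output, NL = 10
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss="mse")   # empirical risk: mean squared error

# Random stand-in data with MNIST shapes; real training would use the pairs (x_i, y_i).
x_train = np.random.rand(1000, 784).astype("float32")
y_train = np.random.rand(1000, 10).astype("float32")
model.fit(x_train, y_train, batch_size=32, epochs=2, verbose=0)
print(model.predict(x_train[:1], verbose=0).shape)  # (1, 10)
```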
SLIDE 96

Description of Image Content

ImageNet Challenge

SLIDES 97–100

Deep Learning Algorithm

1. Generate training data z = (x_i, y_i)_{i=1}^m i.i.d. ∼ (X, φ(Z_T^X)) by simulating Z_T^X with the Euler-Maruyama scheme.
2. Apply the Deep Learning Paradigm to this training data, meaning that
   (i) we pick a network architecture (N₀ = d, N₁, …, N_L = 1) and let H = H^σ_{(N₀,…,N_L)}, and
   (ii) attempt to approximately compute Û_{H,z} = argmin_{U ∈ H} (1/m) ∑_{i=1}^{m} (U(x_i) − y_i)² in TensorFlow.

A minimal end-to-end sketch combining both steps follows below.
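The following sketch is hypothetical (not from the deck): it instantiates both steps for the d-dimensional heat equation ∂u/∂t = Δu with φ(x) = |x|², where µ = 0, Σ = √2·I, and the exact solution u(T, x) = |x|² + 2dT is available for checking. Architecture, optimizer, and all parameter values are invented.

```python
# Hypothetical end-to-end sketch of the two-step algorithm:
# (1) simulate Y = phi(Z_T^X) with Euler-Maruyama, (2) fit a network by ERM.
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(6)
d, T, n_steps, m = 10, 1.0, 50, 20_000
a, b = 0.0, 1.0
dt = T / n_steps
phi = lambda z: np.sum(z**2, axis=1, keepdims=True)  # initial condition |x|^2

# Step 1: training data. For the heat equation u_t = Laplacian(u): mu = 0, Sigma = sqrt(2)*I.
X = rng.uniform(a, b, (m, d)).astype("float32")
Z = X.copy()
for _ in range(n_steps):
    Z = Z + np.sqrt(2.0 * dt) * rng.standard_normal((m, d)).astype("float32")
Y = phi(Z).astype("float32")

# Step 2: ERM over a small ReLU network H^sigma_(d, 50, 50, 1).
model = tf.keras.Sequential([
    tf.keras.layers.Dense(50, activation="relu", input_shape=(d,)),
    tf.keras.layers.Dense(50, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, Y, batch_size=256, epochs=10, verbose=0)

# Check against the exact solution u(T, x) = |x|^2 + 2dT.
x_test = rng.uniform(a, b, (5, d)).astype("float32")
pred = model.predict(x_test, verbose=0).ravel()
exact = np.sum(x_test**2, axis=1) + 2 * d * T
print(np.abs(pred - exact) / exact)  # relative errors; should shrink with more training
```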

SLIDE 101

[Plot: estimated relative L¹-, L²-, and L∞-errors on [0, 1]^d versus the number of iterations]

Steps     Rel. L¹ error   Rel. L² error   Rel. L∞ error   Runtime (s)
–         0.998253        0.998254        1.003524        0.5
10000     0.957464        0.957536        0.993083        44.6
50000     0.786743        0.786806        0.828184        220.8
100000    0.574013        0.574060        0.605283        440.8
150000    0.361564        0.361594        0.384105        661.0
200000    0.001419        0.001784        0.010423        880.8
500000    0.001419        0.001784        0.010423        2200.7
750000    0.001419        0.001784        0.010423        3300.6

Figure: Estimated errors (relative L¹-, L²-, and L∞-errors w.r.t. λ_{[0,1]^d}) associated to the solution u(1, ·) of the 100-dimensional parabolic PDE ∂u/∂t(t, x) = Δₓu(t, x), u(0, x) = |x|², x ∈ [0, 1]^100.

SLIDE 102

Steps     Rel. L¹ error   Rel. L² error   Rel. L∞ error   Runtime (s)
–         1.004285        1.004286        1.009524        1
25000     0.842938        0.843021        0.87884         110.2
50000     0.684955        0.685021        0.719826        219.5
100000    0.371515        0.371551        0.387978        437.9
150000    0.064605        0.064628        0.072259        656.2
250000    0.001220        0.001538        0.010039        1092.6
500000    0.000949        0.001187        0.005105        2183.8
750000    0.000902        0.001129        0.006028        3275.1

Figure: Estimated errors (relative L¹-, L²-, and L∞-errors w.r.t. λ_{[90,110]^d}) associated to the solution u(T, ·) of the 100-dimensional uncorrelated Black-Scholes PDE

∂u/∂t(t, x) = (1/2) ∑_{i=1}^{d} |σ_i x_i|² ∂²u/∂x_i²(t, x) + ∑_{i=1}^{d} µ_i x_i ∂u/∂x_i(t, x),

u(0, x) = exp(−rT) max{ max_{i∈{1,…,d}} x_i − 100, 0 }, x ∈ [90, 110]^100.
SLIDES 103–104

Steps     Rel. L¹ error   Rel. L² error   Rel. L∞ error   Runtime (s)
–         1.003383        1.003385        1.011662        0.8
25000     0.631420        0.631429        0.640633        112.1
50000     0.269053        0.269058        0.275114        223.3
100000    0.000752        0.000948        0.00553         445.8
150000    0.000694        0.00087         0.004662        668.2
250000    0.000604        0.000758        0.006483        1119.3
500000    0.000493        0.000615        0.002774        2292.8
750000    0.000471        0.00059         0.002862        3466.8

Figure: Estimated errors (relative L¹-, L²-, and L∞-errors w.r.t. λ_{[90,110]^d}) associated to the solution u(T, ·) of the 100-dimensional correlated Black-Scholes PDE

∂u/∂t(t, x) = (1/2) ∑_{i,j=1}^{d} x_i x_j β_i β_j ⟨ς_i, ς_j⟩_{ℝ^d} ∂²u/∂x_i∂x_j(t, x) + ∑_{i=1}^{d} µ_i x_i ∂u/∂x_i(t, x),

u(0, x) = exp(−µT) max{ 110 − min_{i∈{1,…,d}} x_i, 0 }, x ∈ [90, 110]^100.

All computations were performed in single precision (float32) on an NVIDIA GeForce GTX 1080 GPU with 1974 MHz core clock and 8 GB GDDR5X memory with 1809.5 MHz clock rate. The underlying system consisted of an Intel Core i7-6800K CPU with 64 GB DDR4-2133 memory running TensorFlow 1.5 on Ubuntu 16.04.

SLIDE 105

Some Theoretical Results

SLIDES 106–108

Linear Affine Kolmogorov Equations

Given affine Σ : ℝ^d → ℝ^{d×d} and µ : ℝ^d → ℝ^d and an initial value φ : ℝ^d → ℝ, find u : ℝ₊ × ℝ^d → ℝ with

∂u/∂t(t, x) = (1/2) Trace(Σ(x) Σᵀ(x) Hessₓ u(t, x)) + µ(x) · ∇ₓ u(t, x), (t, x) ∈ [0, T] × ℝ^d, u(0, x) = φ(x).

Includes the Black-Scholes equation with correlations!

Theorem [G-Hornung-Jentzen-von Wurstemberger (2018)], simplified version: Suppose that φ ∈ H^σ_{(N₀,…,N_L)} (or can be well approximated by NNs). Then for all ε > 0 there is Φ_ε with size(Φ_ε) ≲ size(φ) · ε⁻² and

sup_{x ∈ [a,b]^d} |u(T, x) − R_σ(Φ_ε)(x)| ≤ ε.

The implicit constant depends at most polynomially on the dimension d = N₀.

SLIDES 109–110

Option Pricing without Curse of Dimensionality

Theorem [Berner-G-Jentzen (2018)], very special case: Let φ(x) = min{max{max_i (x_i − K_i), 0}, R} or φ(x) = min{max{∑_{i=1}^{d} x_i − K, 0}, R} (or any typical option). Then for all ε > 0 there is Φ_ε ∈ H^{ReLU}_{(N₀,…,N_L)} with size(Φ_ε) = O(ε⁻²) and

(1/(b − a)^{d/2}) ( ∫_{[a,b]^d} |u(T, x) − R_σ(Φ_ε)(x)|² dx )^{1/2} ≤ ε.

Such networks can be found by solving the ERM problem with m ∼ ε⁻⁴ samples. The implicit constants depend at most polynomially on the dimension d = N₀!

Due to the compositional structure of NNs, all results also hold for options operating on options...
SLIDES 111–115

Wrap Up

Several PDEs can be reformulated as learning problems.

Neural network based numerical solution of high-dimensional PDEs is extremely promising both empirically and mathematically, and it is possible to prove real theorems!

Specifically, we can prove that these methods are capable of overcoming the curse of dimensionality for an important class of PDEs arising in computational finance.

We can observe these properties in simulations.

SLIDE 116

Thank You!

Questions?

SLIDE 117

Literature

Beck, Becker, G, Jaafari, Jentzen. Solving Stochastic Differential Equations and Kolmogorov Equations by Means of Deep Learning. arXiv:1806.00421.

Elbrächter, G, Jentzen, Schwab. DNN Expression Rate Analysis of High Dimensional PDEs: Applications in Option Pricing. arXiv:1806.xxxxx.

Perekrestenko, G, Elbrächter, Bölcskei. The universal approximation power of finite-width deep ReLU networks. arXiv:1806.01528.

G, Hornung, Jentzen, von Wurstemberger. A proof that artificial neural networks overcome the curse of dimensionality in the numerical approximation of Black-Scholes partial differential equations. arXiv:1809.xxxxx.

Berner, G, Jentzen. Empirical risk minimization over deep neural network hypothesis classes breaks the curse of dimensionality for the numerical approximation of Black-Scholes partial differential equations. arXiv:1809.xxxxx.