SLIDE 1 Deep Neural Networks for PDEs
Philipp Grohs
DL and Vis, September 2018
SLIDE 2 Short Reading List
1 Ian Goodfellow, Yoshua Bengio and Aaron Courville: Deep Learning; MIT Press, 2016
2 Aurélien Géron: Hands-On Machine Learning with Scikit-Learn and TensorFlow; O'Reilly, 2017
3 Brian Steele, John Chandler and Swarna Reddy: Algorithms for Data Science; Springer, 2017
4 Alan Jeffrey: Applied Partial Differential Equations – An Introduction; Academic Press, 2002
SLIDE 3 Syllabus
1 PDEs and the Curse of Dimensionality
2 A Crash Course in Statistical Learning Theory (including a Detour to Variational Autoencoders)
3 PDEs as Learning Problems
4 Solving Linear Kolmogorov Equations by Means of Neural Network Based Learning
SLIDE 4
PDEs and the Curse of Dimensionality
SLIDE 5 PDEs
A PDE for the function u(x_1, ..., x_d) is an equation of the form
$$F\Big(x_1,\dots,x_d,\ u,\ \frac{\partial u}{\partial x_1},\dots,\frac{\partial u}{\partial x_d},\ \frac{\partial^2 u}{\partial x_1\partial x_1},\dots,\frac{\partial^2 u}{\partial x_1\partial x_d},\dots\Big)=0,$$
together with suitable boundary conditions.
SLIDE 6
Heat Equation
$$\frac{\partial u}{\partial t}(t,x)=\frac{\partial^2 u}{\partial x_1\partial x_1}(t,x)+\frac{\partial^2 u}{\partial x_2\partial x_2}(t,x)+\frac{\partial^2 u}{\partial x_3\partial x_3}(t,x)+g(t,x),\qquad u(0,x)=\varphi(x),$$
$t\in(0,\infty)$, $x\in\mathbb{R}^3$; $d=4$.
SLIDE 7
Explicit Solution of Heat Equation if g = 0
Let u(t, x) satisfy
$$\frac{\partial u}{\partial t}(t,x)=\frac{\partial^2 u}{\partial x_1\partial x_1}(t,x)+\frac{\partial^2 u}{\partial x_2\partial x_2}(t,x)+\frac{\partial^2 u}{\partial x_3\partial x_3}(t,x),\qquad u(0,x)=\varphi(x),$$
$t\in(0,\infty)$, $x\in\mathbb{R}^3$; $d=4$.
SLIDE 8 Explicit Solution of Heat Equation if g = 0
Let u(t, x) satisfy
$$\frac{\partial u}{\partial t}(t,x)=\frac{\partial^2 u}{\partial x_1\partial x_1}(t,x)+\frac{\partial^2 u}{\partial x_2\partial x_2}(t,x)+\frac{\partial^2 u}{\partial x_3\partial x_3}(t,x),\qquad u(0,x)=\varphi(x),$$
$t\in(0,\infty)$, $x\in\mathbb{R}^3$; $d=4$. Then
$$u(t,x)=\frac{1}{(4\pi t)^{3/2}}\int_{\mathbb{R}^3}\varphi(y)\exp\big(-|x-y|^2/4t\big)\,dy.$$
SLIDE 9
Fluid Dynamics
$$\frac{\partial u}{\partial t}(t,x,v)+v\cdot\nabla u(t,x,v)=Qu(t,x,v),\qquad t\in(0,\infty),\ x,v\in\mathbb{R}^3;\ d=7.$$
SLIDE 10 Schrödinger Equation
Wave function of a non-relativistic quantum mechanical system of N electrons in a field of K nuclei of charges Z_ν and fixed positions R_ν ∈ R³:
$$i\frac{\partial}{\partial t}\Psi(r_1,\dots,r_N;t)=-\frac{1}{2}\sum_{\xi=1}^{N}\Delta_{r_\xi}\Psi(r_1,\dots,r_N;t)-\sum_{\xi=1}^{N}\sum_{\nu=1}^{K}\frac{Z_\nu}{|r_\xi-R_\nu|}\Psi(r_1,\dots,r_N;t)+\frac{1}{2}\sum_{\xi=1}^{N}\sum_{\eta=1}^{N}\frac{1-\delta_{\xi,\eta}}{|r_\xi-r_\eta|}\Psi(r_1,\dots,r_N;t),$$
$t\in(0,\infty)$, $r_1,\dots,r_N\in\mathbb{R}^3$; $d=3N+1$.
SLIDE 11 Black-Scholes Equation
Pricing a portfolio of N financial derivatives:
$$\frac{\partial u}{\partial t}(t,x)=\frac{1}{2}\sum_{i,j=1}^{N}x_ix_j\beta_i\beta_j\langle\varsigma_i,\varsigma_j\rangle_{\mathbb{R}^N}\Big(\frac{\partial^2 u}{\partial x_i\partial x_j}\Big)(t,x)+\sum_{i=1}^{N}\mu_ix_i\Big(\frac{\partial u}{\partial x_i}\Big)(t,x),$$
$$u(0,x)=\max\Big\{K-\sum_{i=1}^{N}c_ix_i,\,0\Big\},\qquad t\in(0,\infty),\ x\in\mathbb{R}^N;\ d=N+1.$$
SLIDE 12
Learning the PDE [Rudy et al. (2017)]
SLIDE 13
Finite Difference Approach
Want to approximate u(x) for x ∈ [0, 1]^d.
SLIDE 14
Finite Difference Approach
Want to approximate u(x) for x ∈ [0, 1]^d. Let
$$u_{i_1,\dots,i_d}\approx u(i_1\varepsilon,\dots,i_d\varepsilon),\qquad(i_1,\dots,i_d)\in\{0,\dots,\lfloor\varepsilon^{-1}\rfloor\}^d,$$
SLIDE 15
Finite Difference Approach
Want to approximate u(x) for x ∈ [0, 1]^d. Let
$$u_{i_1,\dots,i_d}\approx u(i_1\varepsilon,\dots,i_d\varepsilon),\qquad(i_1,\dots,i_d)\in\{0,\dots,\lfloor\varepsilon^{-1}\rfloor\}^d,$$
$$\frac{u_{i_1,\dots,i_l+1,\dots,i_d}-u_{i_1,\dots,i_l,\dots,i_d}}{\varepsilon}\approx\frac{\partial u}{\partial x_l}(i_1\varepsilon,\dots,i_d\varepsilon),\qquad(i_1,\dots,i_d)\in\{0,\dots,\lfloor\varepsilon^{-1}\rfloor\}^d,$$
and so on,
SLIDE 16 Finite Difference Approach
Want to approximate u(x) for x ∈ [0, 1]^d. Let
$$u_{i_1,\dots,i_d}\approx u(i_1\varepsilon,\dots,i_d\varepsilon),\qquad(i_1,\dots,i_d)\in\{0,\dots,\lfloor\varepsilon^{-1}\rfloor\}^d,$$
$$\frac{u_{i_1,\dots,i_l+1,\dots,i_d}-u_{i_1,\dots,i_l,\dots,i_d}}{\varepsilon}\approx\frac{\partial u}{\partial x_l}(i_1\varepsilon,\dots,i_d\varepsilon),\qquad(i_1,\dots,i_d)\in\{0,\dots,\lfloor\varepsilon^{-1}\rfloor\}^d,$$
and so on, and solve the discrete system
$$F\Big(i_1\varepsilon,\dots,i_d\varepsilon,\ u_{i_1,\dots,i_d},\ \frac{u_{i_1+1,\dots,i_d}-u_{i_1,\dots,i_d}}{\varepsilon},\,\dots\Big)=0,\qquad(i_1,\dots,i_d)\in\{0,\dots,\lfloor\varepsilon^{-1}\rfloor\}^d.$$
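To get a feeling for the size of this discrete system, here is a minimal sketch (illustrative, not from the slides) that counts the unknowns u_{i_1,...,i_d} as a function of the number of samples per coordinate and the dimension d:

```python
def num_unknowns(samples_per_axis: int, d: int) -> int:
    """Number of grid values u_{i_1,...,i_d} on a tensor-product grid
    with samples_per_axis points in each of the d coordinates."""
    return samples_per_axis ** d

# Two samples per coordinate (epsilon = 1/2 on the next slide) already explode:
for d in (1, 3, 10, 100):
    print(f"d = {d:3d}: {num_unknowns(2, d):.3e} unknowns")
# d =   1: 2.000e+00 unknowns
# d =   3: 8.000e+00 unknowns
# d =  10: 1.024e+03 unknowns
# d = 100: 1.268e+30 unknowns  -- the curse of dimensionality
```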
SLIDE 17 Curse of Dimensionality
The system
$$F\Big(i_1\varepsilon,\dots,i_d\varepsilon,\ u_{i_1,\dots,i_d},\ \frac{u_{i_1+1,\dots,i_d}-u_{i_1,\dots,i_d}}{\varepsilon},\,\dots\Big)=0,\qquad(i_1,\dots,i_d)\in\{0,\dots,\lfloor\varepsilon^{-1}\rfloor\}^d,$$
requires us to solve an equation in $u_{i_1,\dots,i_d}$ for every $(i_1,\dots,i_d)\in\{0,\dots,\lfloor\varepsilon^{-1}\rfloor\}^d$.
SLIDE 18 Curse of Dimensionality
The system
$$F\Big(i_1\varepsilon,\dots,i_d\varepsilon,\ u_{i_1,\dots,i_d},\ \frac{u_{i_1+1,\dots,i_d}-u_{i_1,\dots,i_d}}{\varepsilon},\,\dots\Big)=0,\qquad(i_1,\dots,i_d)\in\{0,\dots,\lfloor\varepsilon^{-1}\rfloor\}^d,$$
requires us to solve an equation in $u_{i_1,\dots,i_d}$ for every $(i_1,\dots,i_d)\in\{0,\dots,\lfloor\varepsilon^{-1}\rfloor\}^d$.
Exponential Dependence on the Dimension
Let ε = 1/2 (take two samples in each coordinate). Then there are 2^d unknowns.
SLIDE 19 Curse of Dimensionality
The system
$$F\Big(i_1\varepsilon,\dots,i_d\varepsilon,\ u_{i_1,\dots,i_d},\ \frac{u_{i_1+1,\dots,i_d}-u_{i_1,\dots,i_d}}{\varepsilon},\,\dots\Big)=0,\qquad(i_1,\dots,i_d)\in\{0,\dots,\lfloor\varepsilon^{-1}\rfloor\}^d,$$
requires us to solve an equation in $u_{i_1,\dots,i_d}$ for every $(i_1,\dots,i_d)\in\{0,\dots,\lfloor\varepsilon^{-1}\rfloor\}^d$.
Exponential Dependence on the Dimension
Let ε = 1/2 (take two samples in each coordinate). Then there are 2^d unknowns – intractable for high-dimensional problems!
SLIDE 20
Curse of Dimensionality
SLIDE 21
Curse of Dimensionality
The complexity of approximating a general d-dimensional function scales exponentially in d.
SLIDE 22
Curse of Dimensionality
The complexity of approximating a general d-dimensional function scales exponentially in d. Suppose we have a problem where we aim to approximate a d-dimensional function. An algorithm to solve the problem suffers from the curse of dimensionality if its computational complexity depends exponentially on the dimension d.
SLIDE 23 Black-Scholes Equation
Pricing a portfolio of N financial derivatives:
$$\frac{\partial u}{\partial t}(t,x)=\frac{1}{2}\sum_{i,j=1}^{N}x_ix_j\beta_i\beta_j\langle\varsigma_i,\varsigma_j\rangle_{\mathbb{R}^N}\Big(\frac{\partial^2 u}{\partial x_i\partial x_j}\Big)(t,x)+\sum_{i=1}^{N}\mu_ix_i\Big(\frac{\partial u}{\partial x_i}\Big)(t,x),$$
$$u(0,x)=\max\Big\{K-\sum_{i=1}^{N}c_ix_i,\,0\Big\},\qquad t\in(0,\infty),\ x\in\mathbb{R}^N;\ d=N+1.$$
SLIDE 24 Black-Scholes Equation
Pricing a portfolio of N financial derivatives:
$$\frac{\partial u}{\partial t}(t,x)=\frac{1}{2}\sum_{i,j=1}^{N}x_ix_j\beta_i\beta_j\langle\varsigma_i,\varsigma_j\rangle_{\mathbb{R}^N}\Big(\frac{\partial^2 u}{\partial x_i\partial x_j}\Big)(t,x)+\sum_{i=1}^{N}\mu_ix_i\Big(\frac{\partial u}{\partial x_i}\Big)(t,x),$$
$$u(0,x)=\max\Big\{K-\sum_{i=1}^{N}c_ix_i,\,0\Big\},\qquad t\in(0,\infty),\ x\in\mathbb{R}^N;\ d=N+1.$$
Realistic values: d = 100–1000.
SLIDE 25 Black-Scholes Equation
Pricing a portfolio of N financial derivatives:
$$\frac{\partial u}{\partial t}(t,x)=\frac{1}{2}\sum_{i,j=1}^{N}x_ix_j\beta_i\beta_j\langle\varsigma_i,\varsigma_j\rangle_{\mathbb{R}^N}\Big(\frac{\partial^2 u}{\partial x_i\partial x_j}\Big)(t,x)+\sum_{i=1}^{N}\mu_ix_i\Big(\frac{\partial u}{\partial x_i}\Big)(t,x),$$
$$u(0,x)=\max\Big\{K-\sum_{i=1}^{N}c_ix_i,\,0\Big\},\qquad t\in(0,\infty),\ x\in\mathbb{R}^N;\ d=N+1.$$
Realistic values: d = 100–1000. Complexity of the finite difference method: 2^100–2^1000.
SLIDE 26 Black-Scholes Equation
Pricing a portfolio of N financial derivatives:
$$\frac{\partial u}{\partial t}(t,x)=\frac{1}{2}\sum_{i,j=1}^{N}x_ix_j\beta_i\beta_j\langle\varsigma_i,\varsigma_j\rangle_{\mathbb{R}^N}\Big(\frac{\partial^2 u}{\partial x_i\partial x_j}\Big)(t,x)+\sum_{i=1}^{N}\mu_ix_i\Big(\frac{\partial u}{\partial x_i}\Big)(t,x),$$
$$u(0,x)=\max\Big\{K-\sum_{i=1}^{N}c_ix_i,\,0\Big\},\qquad t\in(0,\infty),\ x\in\mathbb{R}^N;\ d=N+1.$$
Realistic values: d = 100–1000. Complexity of the finite difference method: 2^100–2^1000. Number of atoms in the universe: ≈ 2^250.
SLIDE 27
Black-Scholes Equation
SLIDE 28
Black-Scholes Equation
Option pricing is extremely relevant and has to be done every day in the financial industry.
SLIDE 29
Black-Scholes Equation
Option pricing is extremely relevant and has to be done every day in the financial industry. All algorithms for the solution of the Black-Scholes equation suffer from the curse of dimensionality!
SLIDE 30
MNIST
MNIST database for handwritten digit recognition: http://yann.lecun.com/exdb/mnist/
SLIDE 31
MNIST
MNIST database for handwritten digit recognition: http://yann.lecun.com/exdb/mnist/ Every image is given as a 28 × 28 matrix x ∈ R^{28×28} ≅ R^{784}.
SLIDE 32
MNIST
MNIST database for handwritten digit recognition: http://yann.lecun.com/exdb/mnist/ Every image is given as a 28 × 28 matrix x ∈ R^{28×28} ≅ R^{784}.
SLIDE 33 MNIST
MNIST database for handwritten digit recognition: http://yann.lecun.com/exdb/mnist/ Every image is given as a 28 × 28 matrix x ∈ R^{28×28} ≅ R^{784}. Every label is given as a 10-dimensional vector y ∈ R^{10} describing the 'probability' of each digit.
SLIDE 34 MNIST
MNIST database for handwritten digit recognition: http://yann.lecun.com/exdb/mnist/ Every image is given as a 28 × 28 matrix x ∈ R^{28×28} ≅ R^{784}. Every label is given as a 10-dimensional vector y ∈ R^{10} describing the 'probability' of each digit.
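As a concrete illustration of this data format, here is a minimal sketch (assuming the tf.keras.datasets API, which is not part of the slides) that loads MNIST, flattens the images into R^784 and one-hot encodes the labels into R^10:

```python
import numpy as np
from tensorflow import keras

# Each MNIST image is a 28 x 28 grayscale matrix, each label a digit 0..9.
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
print(x_train.shape, y_train.shape)        # (60000, 28, 28) (60000,)

# Flatten images to vectors in R^784 and rescale to [0, 1].
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

# One-hot encode labels: y in R^10, the 'probability' of each digit.
y_train = np.eye(10, dtype="float32")[y_train]
print(y_train[0])   # e.g. [0 0 0 0 0 1 0 0 0 0] if the first digit is a 5
```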
SLIDE 35
MNIST
SLIDE 36
MNIST
[Figure: a handwritten digit '5' is mapped by a ConvNet to its label]
SLIDE 37
MNIST
[Figure: a handwritten digit '5' is mapped by a ConvNet to its label]
This is a 784-dimensional function
SLIDE 38
MNIST
[Figure: a handwritten digit '5' is mapped by a ConvNet to its label]
This is a 784-dimensional function Apparently, deep learning does not suffer from the curse of dimensionality for certain classification problems!
SLIDE 39
MNIST
[Figure: a handwritten digit '5' is mapped by a ConvNet to its label]
This is a 784-dimensional function Apparently, deep learning does not suffer from the curse of dimensionality for certain classification problems! Can this also be used for the solution of PDEs?
SLIDE 40
A Crash Course in Statistical Learning Theory
SLIDE 41
Data Generating Distribution
Suppose that there exists a probability distribution on R^784 that randomly generates handwritten digits.
SLIDE 42
Data Generating Distribution
Suppose that there exists a probability distribution on R^784 that randomly generates handwritten digits.
SLIDE 43
Data Generating Distribution
Suppose that there exists a probability distribution on R^784 that randomly generates handwritten digits.
SLIDE 44
Data Generating Distribution
Suppose that there exists a probability distribution on R^784 that randomly generates handwritten digits.
Variational Autoencoder Demo
SLIDE 45
A New Look
Suppose that our training data consists of samples drawn according to a given data distribution (X, Y).
SLIDE 46
A New Look
Suppose that our training data consists of samples drawn according to a given data distribution (X, Y).
SLIDE 47
A New Look
If we knew the data distribution (X, Y), the best functional relation between X and Y would simply be E[Y | X = x]!
SLIDE 48
A New Look
If we knew the data distribution (X, Y), the best functional relation between X and Y would simply be E[Y | X = x]!
SLIDE 49
A New Look
But we only have samples and do not know the distribution (X, Y).
SLIDE 50
A New Look
But we only have samples and do not know the distribution (X, Y).
SLIDE 51 A New Look
But we only have samples and do not know the distribution (X, Y).
A mathematical learning problem seeks to infer the regression function E[Y | X = x] from random samples $(x_i, y_i)_{i=1}^m$ of (X, Y).
SLIDE 52
Mathematical Formulation
SLIDE 53 Mathematical Formulation
Let (Ω, F, P) be a probability space and let X: Ω → R^d and Y: Ω → R^n be random vectors. Find the best functional relationship Û: R^d → R^n between these vectors in the sense that
$$\hat U=\underset{U:\mathbb{R}^d\to\mathbb{R}^n}{\operatorname{argmin}}\int_\Omega|U(X(\omega))-Y(\omega)|^2\,dP(\omega)=\underset{U:\mathbb{R}^d\to\mathbb{R}^n}{\operatorname{argmin}}\ \mathbb{E}\big[|U(X)-Y|^2\big].$$
SLIDE 54 Mathematical Formulation
Let (Ω, F, P) be a probability space and let X: Ω → R^d and Y: Ω → R^n be random vectors. Find the best functional relationship Û: R^d → R^n between these vectors in the sense that
$$\hat U=\underset{U:\mathbb{R}^d\to\mathbb{R}^n}{\operatorname{argmin}}\int_\Omega|U(X(\omega))-Y(\omega)|^2\,dP(\omega)=\underset{U:\mathbb{R}^d\to\mathbb{R}^n}{\operatorname{argmin}}\ \mathbb{E}\big[|U(X)-Y|^2\big].$$
We have $\hat U(x)=\mathbb{E}[Y\,|\,X=x]$.
SLIDE 55 Mathematical Formulation
Let (Ω, F, P) be a probability space and let X: Ω → R^d and Y: Ω → R^n be random vectors. Find the best functional relationship Û: R^d → R^n between these vectors in the sense that
$$\hat U=\underset{U:\mathbb{R}^d\to\mathbb{R}^n}{\operatorname{argmin}}\int_\Omega|U(X(\omega))-Y(\omega)|^2\,dP(\omega)=\underset{U:\mathbb{R}^d\to\mathbb{R}^n}{\operatorname{argmin}}\ \mathbb{E}\big[|U(X)-Y|^2\big].$$
We have $\hat U(x)=\mathbb{E}[Y\,|\,X=x]$. Û is called the regression function.
SLIDE 56
SLIDE 57
SLIDE 58 Statistical Learning Theory
Let z = ((x_1, y_1), ..., (x_m, y_m)) be m realizations of samples independently drawn according to (X, Y). For a function U: R^d → R^k define the empirical risk of U by
$$\mathcal{E}_z(U)=\frac{1}{m}\sum_{i=1}^{m}|U(x_i)-y_i|^2.$$
SLIDE 59 Statistical Learning Theory
Let z = ((x_1, y_1), ..., (x_m, y_m)) be m realizations of samples independently drawn according to (X, Y). For a function U: R^d → R^k define the empirical risk of U by
$$\mathcal{E}_z(U)=\frac{1}{m}\sum_{i=1}^{m}|U(x_i)-y_i|^2.$$
Empirical Risk Minimization (ERM) picks a hypothesis class H ⊂ C(R^d, R^k) and computes the empirical regression function
$$\hat U_{H,z}\in\underset{U\in H}{\operatorname{argmin}}\ \mathcal{E}_z(U).$$
SLIDE 60 Statistical Learning Theory
Let z = ((x_1, y_1), ..., (x_m, y_m)) be m realizations of samples independently drawn according to (X, Y). For a function U: R^d → R^k define the empirical risk of U by
$$\mathcal{E}_z(U)=\frac{1}{m}\sum_{i=1}^{m}|U(x_i)-y_i|^2.$$
Empirical Risk Minimization (ERM) picks a hypothesis class H ⊂ C(R^d, R^k) and computes the empirical regression function
$$\hat U_{H,z}\in\underset{U\in H}{\operatorname{argmin}}\ \mathcal{E}_z(U).$$
Example: H = {polynomials of degree ≤ p}.
SLIDE 61
SLIDE 62
Degree too low: underfitting. Degree too high: overfitting!
SLIDE 63 Figure: Error with Polynomial Degree
SLIDE 64 Figure: Error with Polynomial Degree
Bias-Variance Problem: The "capacity" of the hypothesis space has to be adapted to the complexity of the target function and to the sample size!
SLIDE 65 Bias-Variance Decomposition
Let (X, Y) be the data generating random variables and Û the regression function. Let $z=(x_i,y_i)_{i=1}^m$ be i.i.d. samples, H a hypothesis class and $\hat U_{H,z}$ the empirical regression function. We seek to understand the error
$$\epsilon:=\mathcal{E}(\hat U_{H,z})-\mathcal{E}(\hat U)=\mathbb{E}\,|\hat U_{H,z}(X)-\hat U(X)|^2.$$
SLIDE 66 Bias-Variance Decomposition
Let (X, Y) be the data generating random variables and Û the regression function. Let $z=(x_i,y_i)_{i=1}^m$ be i.i.d. samples, H a hypothesis class and $\hat U_{H,z}$ the empirical regression function. We seek to understand the error
$$\epsilon:=\mathcal{E}(\hat U_{H,z})-\mathcal{E}(\hat U)=\mathbb{E}\,|\hat U_{H,z}(X)-\hat U(X)|^2.$$
Bias-Variance Decomposition
Let $U_H:=\operatorname{argmin}_{U\in H}\mathbb{E}|U(X)-\hat U(X)|^2$, let $\epsilon_{\mathrm{approx}}:=\mathbb{E}|U_H(X)-\hat U(X)|^2$ be the approximation error and $\epsilon_{\mathrm{generalize}}:=\mathcal{E}(\hat U_{H,z})-\mathcal{E}(U_H)$ the generalization error. Then
$$\epsilon=\epsilon_{\mathrm{approx}}+\epsilon_{\mathrm{generalize}}.$$
SLIDE 67 Bias-Variance Decomposition
Let (X, Y) be the data generating random variables and Û the regression function. Let $z=(x_i,y_i)_{i=1}^m$ be i.i.d. samples, H a hypothesis class and $\hat U_{H,z}$ the empirical regression function. We seek to understand the error
$$\epsilon:=\mathcal{E}(\hat U_{H,z})-\mathcal{E}(\hat U)=\mathbb{E}\,|\hat U_{H,z}(X)-\hat U(X)|^2.$$
Bias-Variance Decomposition
Let $U_H:=\operatorname{argmin}_{U\in H}\mathbb{E}|U(X)-\hat U(X)|^2$, let $\epsilon_{\mathrm{approx}}:=\mathbb{E}|U_H(X)-\hat U(X)|^2$ be the approximation error and $\epsilon_{\mathrm{generalize}}:=\mathcal{E}(\hat U_{H,z})-\mathcal{E}(U_H)$ the generalization error. Then
$$\epsilon=\epsilon_{\mathrm{approx}}+\epsilon_{\mathrm{generalize}}.$$
Main Theorem [e.g., Cucker-Zhou (2007)]
If $m\gtrsim\frac{\ln(N(H,c\cdot\eta))}{\eta^2}$ (and very strong conditions hold), then $\epsilon_{\mathrm{generalize}}\le\eta$ with high probability, where N(H, s) is the s-covering number of H w.r.t. the L^∞ norm.
SLIDE 68 Bias-Variance Decomposition
Let (X, Y) be the data generating random variables and Û the regression function. Let $z=(x_i,y_i)_{i=1}^m$ be i.i.d. samples, H a hypothesis class and $\hat U_{H,z}$ the empirical regression function. We seek to understand the error
$$\epsilon:=\mathcal{E}(\hat U_{H,z})-\mathcal{E}(\hat U)=\mathbb{E}\,|\hat U_{H,z}(X)-\hat U(X)|^2.$$
Bias-Variance Decomposition
Let $U_H:=\operatorname{argmin}_{U\in H}\mathbb{E}|U(X)-\hat U(X)|^2$, let $\epsilon_{\mathrm{approx}}:=\mathbb{E}|U_H(X)-\hat U(X)|^2$ be the approximation error and $\epsilon_{\mathrm{generalize}}:=\mathcal{E}(\hat U_{H,z})-\mathcal{E}(U_H)$ the generalization error. Then
$$\epsilon=\epsilon_{\mathrm{approx}}+\epsilon_{\mathrm{generalize}}.$$
Main Theorem [e.g., Cucker-Zhou (2007)]
If $m\gtrsim\frac{\ln(N(H,c\cdot\eta))}{\eta^2}$ (and very strong conditions hold), then $\epsilon_{\mathrm{generalize}}\le\eta$ with high probability, where N(H, s) is the s-covering number of H w.r.t. the L^∞ norm.
Problems for data science applications:
- The assumption that the data is i.i.d. is debatable.
- Deep learning operates in a different asymptotic regime (often #DOFs ≫ #training samples).
- Without knowing P_{(X,Y)} it is impossible to control the approximation error.
SLIDE 69
PDEs as Learning Problems
SLIDE 70
Explicit Solution of Heat Equation if g = 0
Let u(t, x) satisfy
$$\frac{\partial u}{\partial t}(t,x)=\frac{\partial^2 u}{\partial x_1\partial x_1}(t,x)+\frac{\partial^2 u}{\partial x_2\partial x_2}(t,x)+\frac{\partial^2 u}{\partial x_3\partial x_3}(t,x),\qquad u(0,x)=\varphi(x),$$
$t\in(0,\infty)$, $x\in\mathbb{R}^3$; $d=4$.
SLIDE 71 Explicit Solution of Heat Equation if g = 0
Let u(t, x) satisfy
$$\frac{\partial u}{\partial t}(t,x)=\frac{\partial^2 u}{\partial x_1\partial x_1}(t,x)+\frac{\partial^2 u}{\partial x_2\partial x_2}(t,x)+\frac{\partial^2 u}{\partial x_3\partial x_3}(t,x),\qquad u(0,x)=\varphi(x),$$
$t\in(0,\infty)$, $x\in\mathbb{R}^3$; $d=4$. Then
$$u(t,x)=\frac{1}{(4\pi t)^{3/2}}\int_{\mathbb{R}^3}\varphi(y)\exp\big(-|x-y|^2/4t\big)\,dy.$$
SLIDE 72 Explicit Solution of Heat Equation if g = 0
Let u(t, x) satisfy
$$\frac{\partial u}{\partial t}(t,x)=\frac{\partial^2 u}{\partial x_1\partial x_1}(t,x)+\frac{\partial^2 u}{\partial x_2\partial x_2}(t,x)+\frac{\partial^2 u}{\partial x_3\partial x_3}(t,x),\qquad u(0,x)=\varphi(x),$$
$t\in(0,\infty)$, $x\in\mathbb{R}^3$; $d=4$. Then
$$u(t,x)=\frac{1}{(4\pi t)^{3/2}}\int_{\mathbb{R}^3}\varphi(y)\exp\big(-|x-y|^2/4t\big)\,dy.$$
In other words,
$$u(t,x)=\mathbb{E}\big[\varphi(Z_t^x)\big],\qquad Z_t^x\sim\mathcal{N}(x,\,t^{1/2}I).$$
SLIDE 73 Explicit Solution of Heat Equation if g = 0
Let u(t, x) satisfy
$$\frac{\partial u}{\partial t}(t,x)=\frac{\partial^2 u}{\partial x_1\partial x_1}(t,x)+\frac{\partial^2 u}{\partial x_2\partial x_2}(t,x)+\frac{\partial^2 u}{\partial x_3\partial x_3}(t,x),\qquad u(0,x)=\varphi(x),$$
$t\in(0,\infty)$, $x\in\mathbb{R}^3$; $d=4$. Then
$$u(t,x)=\frac{1}{(4\pi t)^{3/2}}\int_{\mathbb{R}^3}\varphi(y)\exp\big(-|x-y|^2/4t\big)\,dy.$$
In other words,
$$u(t,x)=\mathbb{E}\big[\varphi(Z_t^x)\big],\qquad Z_t^x\sim\mathcal{N}(x,\,t^{1/2}I).$$
In other words, for $x\in[u,v]^3$, $X\sim\mathcal{U}([u,v]^3)$ and $Y=\varphi(Z_t^X)$ we have
$$u(t,x)=\mathbb{E}[Y\,|\,X=x].$$
SLIDE 74 Explicit Solution of Heat Equation if g = 0
Let u(t, x) satisfy
$$\frac{\partial u}{\partial t}(t,x)=\frac{\partial^2 u}{\partial x_1\partial x_1}(t,x)+\frac{\partial^2 u}{\partial x_2\partial x_2}(t,x)+\frac{\partial^2 u}{\partial x_3\partial x_3}(t,x),\qquad u(0,x)=\varphi(x),$$
$t\in(0,\infty)$, $x\in\mathbb{R}^3$; $d=4$. Then
$$u(t,x)=\frac{1}{(4\pi t)^{3/2}}\int_{\mathbb{R}^3}\varphi(y)\exp\big(-|x-y|^2/4t\big)\,dy.$$
In other words,
$$u(t,x)=\mathbb{E}\big[\varphi(Z_t^x)\big],\qquad Z_t^x\sim\mathcal{N}(x,\,t^{1/2}I).$$
In other words, for $x\in[u,v]^3$, $X\sim\mathcal{U}([u,v]^3)$ and $Y=\varphi(Z_t^X)$ we have
$$u(t,x)=\mathbb{E}[Y\,|\,X=x].$$
The solution u(t, x) of the PDE can be interpreted as the solution to the learning problem with data distribution (X, Y), where $X\sim\mathcal{U}([u,v]^3)$, $Y=\varphi(Z_t^X)$ and $Z_t^x\sim\mathcal{N}(x,\,t^{1/2}I)$!
SLIDE 75 Explicit Solution of Heat Equation if g = 0
Let u(t, x) satisfy
$$\frac{\partial u}{\partial t}(t,x)=\frac{\partial^2 u}{\partial x_1\partial x_1}(t,x)+\frac{\partial^2 u}{\partial x_2\partial x_2}(t,x)+\frac{\partial^2 u}{\partial x_3\partial x_3}(t,x),\qquad u(0,x)=\varphi(x),$$
$t\in(0,\infty)$, $x\in\mathbb{R}^3$; $d=4$. Then
$$u(t,x)=\frac{1}{(4\pi t)^{3/2}}\int_{\mathbb{R}^3}\varphi(y)\exp\big(-|x-y|^2/4t\big)\,dy.$$
In other words,
$$u(t,x)=\mathbb{E}\big[\varphi(Z_t^x)\big],\qquad Z_t^x\sim\mathcal{N}(x,\,t^{1/2}I).$$
In other words, for $x\in[u,v]^3$, $X\sim\mathcal{U}([u,v]^3)$ and $Y=\varphi(Z_t^X)$ we have
$$u(t,x)=\mathbb{E}[Y\,|\,X=x].$$
The solution u(t, x) of the PDE can be interpreted as the solution to the learning problem with data distribution (X, Y), where $X\sim\mathcal{U}([u,v]^3)$, $Y=\varphi(Z_t^X)$ and $Z_t^x\sim\mathcal{N}(x,\,t^{1/2}I)$!
Contrary to conventional ML problems, the data distribution is now explicitly known – we can simulate as much training data as we want!
SLIDE 76 Explicit Solution of Heat Equation if g = 0
Let u(t, x) satisfy
$$\frac{\partial u}{\partial t}(t,x)=\frac{\partial^2 u}{\partial x_1\partial x_1}(t,x)+\frac{\partial^2 u}{\partial x_2\partial x_2}(t,x)+\frac{\partial^2 u}{\partial x_3\partial x_3}(t,x),\qquad u(0,x)=\varphi(x),$$
$t\in(0,\infty)$, $x\in\mathbb{R}^3$; $d=4$. Then
$$u(t,x)=\frac{1}{(4\pi t)^{3/2}}\int_{\mathbb{R}^3}\varphi(y)\exp\big(-|x-y|^2/4t\big)\,dy.$$
In other words,
$$u(t,x)=\mathbb{E}\big[\varphi(Z_t^x)\big],\qquad Z_t^x\sim\mathcal{N}(x,\,t^{1/2}I).$$
In other words, for $x\in[u,v]^3$, $X\sim\mathcal{U}([u,v]^3)$ and $Y=\varphi(Z_t^X)$ we have
$$u(t,x)=\mathbb{E}[Y\,|\,X=x].$$
The solution u(t, x) of the PDE can be interpreted as the solution to the learning problem with data distribution (X, Y), where $X\sim\mathcal{U}([u,v]^3)$, $Y=\varphi(Z_t^X)$ and $Z_t^x\sim\mathcal{N}(x,\,t^{1/2}I)$!
Contrary to conventional ML problems, the data distribution is now explicitly known – we can simulate as much training data as we want!
We will see in a minute that similar properties hold for a much more general class of PDEs!
SLIDE 77 Linear Kolmogorov Equations
Given $\Sigma:\mathbb{R}^d\to\mathbb{R}^{d\times d}$, $\mu:\mathbb{R}^d\to\mathbb{R}^d$ and an initial value $\varphi:\mathbb{R}^d\to\mathbb{R}$, find $u:\mathbb{R}_+\times\mathbb{R}^d\to\mathbb{R}$ with
$$\frac{\partial u}{\partial t}(t,x)=\frac{1}{2}\operatorname{Trace}\big(\Sigma(x)\Sigma^{T}(x)\,\mathrm{Hess}_x\,u(t,x)\big)+\mu(x)\cdot\nabla_x u(t,x),\qquad(t,x)\in[0,T]\times\mathbb{R}^d,$$
$$u(0,x)=\varphi(x).$$
SLIDE 78 Linear Kolmogorov Equations
Given $\Sigma:\mathbb{R}^d\to\mathbb{R}^{d\times d}$, $\mu:\mathbb{R}^d\to\mathbb{R}^d$ and an initial value $\varphi:\mathbb{R}^d\to\mathbb{R}$, find $u:\mathbb{R}_+\times\mathbb{R}^d\to\mathbb{R}$ with
$$\frac{\partial u}{\partial t}(t,x)=\frac{1}{2}\operatorname{Trace}\big(\Sigma(x)\Sigma^{T}(x)\,\mathrm{Hess}_x\,u(t,x)\big)+\mu(x)\cdot\nabla_x u(t,x),\qquad(t,x)\in[0,T]\times\mathbb{R}^d,$$
$$u(0,x)=\varphi(x).$$
Examples include convection-diffusion equations and the Black-Scholes equation.
SLIDE 79 Linear Kolmogorov Equations
Given $\Sigma:\mathbb{R}^d\to\mathbb{R}^{d\times d}$, $\mu:\mathbb{R}^d\to\mathbb{R}^d$ and an initial value $\varphi:\mathbb{R}^d\to\mathbb{R}$, find $u:\mathbb{R}_+\times\mathbb{R}^d\to\mathbb{R}$ with
$$\frac{\partial u}{\partial t}(t,x)=\frac{1}{2}\operatorname{Trace}\big(\Sigma(x)\Sigma^{T}(x)\,\mathrm{Hess}_x\,u(t,x)\big)+\mu(x)\cdot\nabla_x u(t,x),\qquad(t,x)\in[0,T]\times\mathbb{R}^d,$$
$$u(0,x)=\varphi(x).$$
Examples include convection-diffusion equations and the Black-Scholes equation.
Standard methods such as sparse grid methods, sparse tensor product methods, spectral methods, finite element methods or finite difference methods are incapable of solving such equations in high dimensions (d = 100)!
SLIDE 80
Special Case: Pricing of Financial Derivatives
SLIDE 81 Special Case: Pricing of Financial Derivatives
Given a portfolio consisting of d assets with values $(x_i(t))_{i=1}^d$.
SLIDE 82 Special Case: Pricing of Financial Derivatives
Given a portfolio consisting of d assets with values $(x_i(t))_{i=1}^d$.
European Max Option: At time T, exercise the option and receive
$$G(x):=\max\Big\{\max_{i=1,\dots,d}(x_i-K_i),\,0\Big\}.$$
SLIDE 83 Special Case: Pricing of Financial Derivatives
Given a portfolio consisting of d assets with values $(x_i(t))_{i=1}^d$.
European Max Option: At time T, exercise the option and receive
$$G(x):=\max\Big\{\max_{i=1,\dots,d}(x_i-K_i),\,0\Big\}.$$
(Black-Scholes (1973)): in the absence of correlations the portfolio value u(t, x) satisfies
$$\frac{\partial u}{\partial t}(t,x)+\sum_{i=1}^{d}\mu_ix_i\frac{\partial}{\partial x_i}u(t,x)+\frac{1}{2}\sum_{i=1}^{d}\sigma_i^2|x_i|^2\frac{\partial^2}{\partial x_i^2}u(t,x)=0,\qquad u(T,x)=G(x).$$
SLIDE 84 Special Case: Pricing of Financial Derivatives
Given a portfolio consisting of d assets with values $(x_i(t))_{i=1}^d$.
European Max Option: At time T, exercise the option and receive
$$G(x):=\max\Big\{\max_{i=1,\dots,d}(x_i-K_i),\,0\Big\}.$$
(Black-Scholes (1973)): in the absence of correlations the portfolio value u(t, x) satisfies
$$\frac{\partial u}{\partial t}(t,x)+\sum_{i=1}^{d}\mu_ix_i\frac{\partial}{\partial x_i}u(t,x)+\frac{1}{2}\sum_{i=1}^{d}\sigma_i^2|x_i|^2\frac{\partial^2}{\partial x_i^2}u(t,x)=0,\qquad u(T,x)=G(x).$$
Pricing Problem: u(0, x) = ??
SLIDE 85
Kolmogorov PDEs as Learning Problems
SLIDE 86 Kolmogorov PDEs as Learning Problems
For $x\in\mathbb{R}^d$ and $t\in\mathbb{R}_+$ let
$$Z_t^x:=x+\int_0^t\mu(Z_s^x)\,ds+\int_0^t\Sigma(Z_s^x)\,dW_s.$$
Then (Feynman-Kac)
$$u(T,x)=\mathbb{E}\big[\varphi(Z_T^x)\big].$$
SLIDE 87 Kolmogorov PDEs as Learning Problems
For $x\in\mathbb{R}^d$ and $t\in\mathbb{R}_+$ let
$$Z_t^x:=x+\int_0^t\mu(Z_s^x)\,ds+\int_0^t\Sigma(Z_s^x)\,dW_s.$$
Then (Feynman-Kac)
$$u(T,x)=\mathbb{E}\big[\varphi(Z_T^x)\big].$$
Lemma (Beck-Becker-G-Jaafari-Jentzen (2018))
Let $X\sim\mathcal{U}([a,b]^d)$ and let $Y=\varphi(Z_T^X)$. The solution Û of the mathematical learning problem with data distribution (X, Y) is given by
$$\hat U(x)=u(T,x),\qquad x\in[a,b]^d,$$
where u solves the corresponding Kolmogorov equation.
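A minimal Euler-Maruyama sketch for simulating Z_T^x (the coefficients μ, Σ below are placeholder Black-Scholes-type assumptions with a diagonal Σ, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)

def euler_maruyama(x0, mu, sigma, T=1.0, n_steps=100):
    """Approximate Z_T^x = x + int_0^T mu(Z_s) ds + int_0^T Sigma(Z_s) dW_s.
    x0 has shape (m, d); sigma returns the diagonal of Sigma for simplicity."""
    dt = T / n_steps
    Z = x0.copy()
    for _ in range(n_steps):
        dW = np.sqrt(dt) * rng.standard_normal(Z.shape)
        Z = Z + mu(Z) * dt + sigma(Z) * dW
    return Z

# Placeholder geometric-Brownian-motion coefficients: mu(z) = r z, sigma(z) = s z.
r, s = 0.05, 0.2
Z_T = euler_maruyama(np.full((100_000, 100), 100.0),
                     mu=lambda z: r * z, sigma=lambda z: s * z)
print(Z_T.mean())      # ~100 * exp(r T), i.e. about 105.1
```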
SLIDE 88
Solving linear Kolmogorov Equations by means of Neural Network Based Learning
SLIDE 89
The Vanilla DL Paradigm
SLIDE 90
The Vanilla DL Paradigm
Every image is given as a 28 × 28 matrix x ∈ R^{28×28} ≅ R^{784}.
SLIDE 91 The Vanilla DL Paradigm
Every image is given as a 28 × 28 matrix x ∈ R^{28×28} ≅ R^{784}. Every label is given as a 10-dimensional vector y ∈ R^{10} describing the 'probability' of each digit.
SLIDE 92 The Vanilla DL Paradigm
Every image is given as a 28 × 28 matrix x ∈ R^{28×28} ≅ R^{784}. Every label is given as a 10-dimensional vector y ∈ R^{10} describing the 'probability' of each digit.
Given labeled training data $(x_i,y_i)_{i=1}^m\subset\mathbb{R}^{784}\times\mathbb{R}^{10}$.
SLIDE 93 The Vanilla DL Paradigm
Every image is given as a 28 × 28 matrix x ∈ R^{28×28} ≅ R^{784}. Every label is given as a 10-dimensional vector y ∈ R^{10} describing the 'probability' of each digit.
Given labeled training data $(x_i,y_i)_{i=1}^m\subset\mathbb{R}^{784}\times\mathbb{R}^{10}$.
Fix a network architecture, e.g., the number of layers (for example L = 3) and the numbers of neurons ($N_1 = 30$, $N_2 = 30$).
SLIDE 94 The Vanilla DL Paradigm
Every image is given as a 28 × 28 matrix x ∈ R^{28×28} ≅ R^{784}. Every label is given as a 10-dimensional vector y ∈ R^{10} describing the 'probability' of each digit.
Given labeled training data $(x_i,y_i)_{i=1}^m\subset\mathbb{R}^{784}\times\mathbb{R}^{10}$.
Fix a network architecture, e.g., the number of layers (for example L = 3) and the numbers of neurons ($N_1 = 30$, $N_2 = 30$). The learning goal is to find the empirical regression function $f_z\in H^\sigma_{(784,30,30,10)}$.
SLIDE 95 The Vanilla DL Paradigm
Every image is given as a 28 × 28 matrix x ∈ R^{28×28} ≅ R^{784}. Every label is given as a 10-dimensional vector y ∈ R^{10} describing the 'probability' of each digit.
Given labeled training data $(x_i,y_i)_{i=1}^m\subset\mathbb{R}^{784}\times\mathbb{R}^{10}$.
Fix a network architecture, e.g., the number of layers (for example L = 3) and the numbers of neurons ($N_1 = 30$, $N_2 = 30$). The learning goal is to find the empirical regression function $f_z\in H^\sigma_{(784,30,30,10)}$.
Typically solved by stochastic first-order optimization methods.
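A sketch of this (784, 30, 30, 10) architecture in Keras (activation, optimizer and loss below are illustrative assumptions; the slides fix only the layer sizes, and the input dimension 784 is inferred from the data at fit time):

```python
from tensorflow import keras

# Network class H^sigma_(784, 30, 30, 10): two hidden layers of 30 neurons.
model = keras.Sequential([
    keras.layers.Dense(30, activation="relu"),
    keras.layers.Dense(30, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])

# ERM with a stochastic first-order method over the training samples.
model.compile(optimizer="sgd", loss="mse")
# model.fit(x_train, y_train, batch_size=32, epochs=10)   # data as above
```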
SLIDE 96 Description of Image Content
ImageNet Challenge
SLIDE 97
Deep Learning Algorithm
SLIDE 98 Deep Learning Algorithm
1. Generate training data $z=(x_i,y_i)_{i=1}^m\overset{\text{iid}}{\sim}(X,\varphi(Z_T^X))$ by simulating $Z_T^X$ with the Euler-Maruyama scheme.
SLIDE 99 Deep Learning Algorithm
1. Generate training data $z=(x_i,y_i)_{i=1}^m\overset{\text{iid}}{\sim}(X,\varphi(Z_T^X))$ by simulating $Z_T^X$ with the Euler-Maruyama scheme.
2. Apply the deep learning paradigm to this training data.
SLIDE 100 Deep Learning Algorithm
1. Generate training data $z=(x_i,y_i)_{i=1}^m\overset{\text{iid}}{\sim}(X,\varphi(Z_T^X))$ by simulating $Z_T^X$ with the Euler-Maruyama scheme.
2. Apply the deep learning paradigm to this training data.
...meaning that (i) we pick a network architecture $(N_0=d,N_1,\dots,N_L=1)$ and let $H=H^\sigma_{(N_0,\dots,N_L)}$, and (ii) attempt to approximately compute
$$\hat U_{H,z}=\underset{U\in H}{\operatorname{argmin}}\ \frac{1}{m}\sum_{i=1}^{m}(U(x_i)-y_i)^2$$
in TensorFlow.
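Putting steps 1 and 2 together, a condensed end-to-end sketch for the 100-dimensional heat equation (architecture, optimizer and all hyperparameters are assumptions; for the heat equation, Z_T^X can even be sampled exactly instead of via Euler-Maruyama):

```python
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(4)
d, T, m = 100, 1.0, 200_000

# Step 1: training data (X, phi(Z_T^X)) for the heat equation with phi(x) = |x|^2;
# here Z_T^X = X + sqrt(2 T) * N(0, I) is sampled exactly.
X = rng.uniform(0.0, 1.0, size=(m, d)).astype("float32")
Z = X + np.sqrt(2 * T) * rng.standard_normal((m, d)).astype("float32")
Y = np.sum(Z**2, axis=1, keepdims=True)

# Step 2: ERM over a network class H^sigma_(d, N1, N2, 1).
model = keras.Sequential([
    keras.layers.Dense(200, activation="relu"),
    keras.layers.Dense(200, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(optimizer=keras.optimizers.Adam(1e-3), loss="mse")
model.fit(X, Y, batch_size=1024, epochs=10, verbose=0)

# Exact solution for comparison: u(T, x) = |x|^2 + 2 d T.
x_test = np.full((1, d), 0.5, dtype="float32")
print(model(x_test).numpy().item())     # should be close to the target below
print(0.25 * d + 2 * d * T)             # = 225.0
```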
SLIDE 101
[Plot: estimated relative L1, L2 and L∞ errors on [0,1]^d vs. number of iterations]

Iterations | Rel. L1(λ_{[0,1]^d}; R) error | Rel. L2(λ_{[0,1]^d}; R) error | Rel. L∞(λ_{[0,1]^d}; R) error | Runtime (s)
0      | 0.998253 | 0.998254 | 1.003524 | 0.5
10000  | 0.957464 | 0.957536 | 0.993083 | 44.6
50000  | 0.786743 | 0.786806 | 0.828184 | 220.8
100000 | 0.574013 | 0.574060 | 0.605283 | 440.8
150000 | 0.361564 | 0.361594 | 0.384105 | 661.0
200000 | 0.001419 | 0.001784 | 0.010423 | 880.8
500000 | 0.001419 | 0.001784 | 0.010423 | 2200.7
750000 | 0.001419 | 0.001784 | 0.010423 | 3300.6

Figure: Estimated errors associated to the solution u(1, ·) of the 100-dimensional parabolic PDE
$$\frac{\partial u}{\partial t}(t,x)=\Delta_x u(t,x),\qquad u(0,x)=|x|^2,\qquad x\in[0,1]^{100}.$$
SLIDE 102
Iterations | Rel. L1(λ_{[90,110]^d}; R) error | Rel. L2(λ_{[90,110]^d}; R) error | Rel. L∞(λ_{[90,110]^d}; R) error | Runtime (s)
0      | 1.004285 | 1.004286 | 1.009524 | 1
25000  | 0.842938 | 0.843021 | 0.87884  | 110.2
50000  | 0.684955 | 0.685021 | 0.719826 | 219.5
100000 | 0.371515 | 0.371551 | 0.387978 | 437.9
150000 | 0.064605 | 0.064628 | 0.072259 | 656.2
250000 | 0.001220 | 0.001538 | 0.010039 | 1092.6
500000 | 0.000949 | 0.001187 | 0.005105 | 2183.8
750000 | 0.000902 | 0.001129 | 0.006028 | 3275.1

Figure: Estimated errors associated to the solution u(T, ·) of the 100-dimensional uncorrelated Black-Scholes PDE
$$\frac{\partial u}{\partial t}(t,x)=\frac{1}{2}\sum_{i=1}^{d}|\sigma_ix_i|^2\Big(\frac{\partial^2 u}{\partial x_i^2}\Big)(t,x)+\sum_{i=1}^{d}\mu_ix_i\Big(\frac{\partial u}{\partial x_i}\Big)(t,x),$$
$$u(0,x)=\exp(-rT)\max\Big\{\max_{i\in\{1,\dots,d\}}x_i-100,\,0\Big\},\qquad x\in[90,110]^{100}.$$
SLIDE 103
Iterations | Rel. L1(λ_{[90,110]^d}; R) error | Rel. L2(λ_{[90,110]^d}; R) error | Rel. L∞(λ_{[90,110]^d}; R) error | Runtime (s)
0      | 1.003383 | 1.003385 | 1.011662 | 0.8
25000  | 0.631420 | 0.631429 | 0.640633 | 112.1
50000  | 0.269053 | 0.269058 | 0.275114 | 223.3
100000 | 0.000752 | 0.000948 | 0.00553  | 445.8
150000 | 0.000694 | 0.00087  | 0.004662 | 668.2
250000 | 0.000604 | 0.000758 | 0.006483 | 1119.3
500000 | 0.000493 | 0.000615 | 0.002774 | 2292.8
750000 | 0.000471 | 0.00059  | 0.002862 | 3466.8

Figure: Estimated errors associated to the solution u(T, ·) of the 100-dimensional correlated Black-Scholes PDE
$$\frac{\partial u}{\partial t}(t,x)=\frac{1}{2}\sum_{i,j=1}^{d}x_ix_j\beta_i\beta_j\langle\varsigma_i,\varsigma_j\rangle_{\mathbb{R}^d}\Big(\frac{\partial^2 u}{\partial x_i\partial x_j}\Big)(t,x)+\sum_{i=1}^{d}\mu_ix_i\Big(\frac{\partial u}{\partial x_i}\Big)(t,x),$$
$$u(0,x)=\exp(-\mu T)\max\Big\{110-\min_{i\in\{1,\dots,d\}}x_i,\,0\Big\},\qquad x\in[90,110]^{100}.$$
SLIDE 104
Iterations | Rel. L1(λ_{[90,110]^d}; R) error | Rel. L2(λ_{[90,110]^d}; R) error | Rel. L∞(λ_{[90,110]^d}; R) error | Runtime (s)
0      | 1.003383 | 1.003385 | 1.011662 | 0.8
25000  | 0.631420 | 0.631429 | 0.640633 | 112.1
50000  | 0.269053 | 0.269058 | 0.275114 | 223.3
100000 | 0.000752 | 0.000948 | 0.00553  | 445.8
150000 | 0.000694 | 0.00087  | 0.004662 | 668.2
250000 | 0.000604 | 0.000758 | 0.006483 | 1119.3
500000 | 0.000493 | 0.000615 | 0.002774 | 2292.8
750000 | 0.000471 | 0.00059  | 0.002862 | 3466.8

Figure: Estimated errors associated to the solution u(T, ·) of the 100-dimensional correlated Black-Scholes PDE
$$\frac{\partial u}{\partial t}(t,x)=\frac{1}{2}\sum_{i,j=1}^{d}x_ix_j\beta_i\beta_j\langle\varsigma_i,\varsigma_j\rangle_{\mathbb{R}^d}\Big(\frac{\partial^2 u}{\partial x_i\partial x_j}\Big)(t,x)+\sum_{i=1}^{d}\mu_ix_i\Big(\frac{\partial u}{\partial x_i}\Big)(t,x),$$
$$u(0,x)=\exp(-\mu T)\max\Big\{110-\min_{i\in\{1,\dots,d\}}x_i,\,0\Big\},\qquad x\in[90,110]^{100}.$$
All computations were performed in single precision (float32) on an NVIDIA GeForce GTX 1080 GPU with 1974 MHz core clock and 8 GB GDDR5X memory with 1809.5 MHz clock rate. The underlying system consisted of an Intel Core i7-6800K CPU with 64 GB DDR4-2133 memory running TensorFlow 1.5 on Ubuntu 16.04.
SLIDE 105
Some Theoretical Results
SLIDE 106 Linear Affine Kolmogorov Equations
Given affine $\Sigma:\mathbb{R}^d\to\mathbb{R}^{d\times d}$ and $\mu:\mathbb{R}^d\to\mathbb{R}^d$ and an initial value $\varphi:\mathbb{R}^d\to\mathbb{R}$, find $u:\mathbb{R}_+\times\mathbb{R}^d\to\mathbb{R}$ with
$$\frac{\partial u}{\partial t}(t,x)=\frac{1}{2}\operatorname{Trace}\big(\Sigma(x)\Sigma^{T}(x)\,\mathrm{Hess}_x\,u(t,x)\big)+\mu(x)\cdot\nabla_x u(t,x),\qquad(t,x)\in[0,T]\times\mathbb{R}^d,$$
$$u(0,x)=\varphi(x).$$
SLIDE 107 Linear Affine Kolmogorov Equations
Given affine $\Sigma:\mathbb{R}^d\to\mathbb{R}^{d\times d}$ and $\mu:\mathbb{R}^d\to\mathbb{R}^d$ and an initial value $\varphi:\mathbb{R}^d\to\mathbb{R}$, find $u:\mathbb{R}_+\times\mathbb{R}^d\to\mathbb{R}$ with
$$\frac{\partial u}{\partial t}(t,x)=\frac{1}{2}\operatorname{Trace}\big(\Sigma(x)\Sigma^{T}(x)\,\mathrm{Hess}_x\,u(t,x)\big)+\mu(x)\cdot\nabla_x u(t,x),\qquad(t,x)\in[0,T]\times\mathbb{R}^d,$$
$$u(0,x)=\varphi(x).$$
Includes the Black-Scholes equation with correlations!
SLIDE 108 Linear Affine Kolmogorov Equations
Given affine $\Sigma:\mathbb{R}^d\to\mathbb{R}^{d\times d}$ and $\mu:\mathbb{R}^d\to\mathbb{R}^d$ and an initial value $\varphi:\mathbb{R}^d\to\mathbb{R}$, find $u:\mathbb{R}_+\times\mathbb{R}^d\to\mathbb{R}$ with
$$\frac{\partial u}{\partial t}(t,x)=\frac{1}{2}\operatorname{Trace}\big(\Sigma(x)\Sigma^{T}(x)\,\mathrm{Hess}_x\,u(t,x)\big)+\mu(x)\cdot\nabla_x u(t,x),\qquad(t,x)\in[0,T]\times\mathbb{R}^d,$$
$$u(0,x)=\varphi(x).$$
Includes the Black-Scholes equation with correlations!
Theorem [G-Hornung-Jentzen-von Wurstemberger (2018)], simplified version
Suppose that $\varphi\in H^\sigma_{(N_0,\dots,N_L)}$ (or can be well approximated by NNs). Then for all $\epsilon>0$ there is $\Phi_\epsilon$ with $\mathrm{size}(\Phi_\epsilon)\lesssim\mathrm{size}(\varphi)\cdot\epsilon^{-2}$ and
$$\sup_{x\in[a,b]^d}|u(T,x)-R_\sigma(\Phi_\epsilon)(x)|\le\epsilon.$$
The implicit constant depends at most polynomially on the dimension $d=N_0$.
SLIDE 109 Option Pricing without Curse of Dimensionality
Theorem [Berner-G-Jentzen (2018)], very special case
Let $\varphi(x)=\min\{\max\{\max_{i}(x_i-K_i),0\},R\}$ or $\varphi(x)=\min\{\max\{\sum_{i=1}^{d}x_i-K,0\},R\}$ (or any typical option payoff). Then for all $\epsilon>0$ there is $\Phi_\epsilon\in H^{\mathrm{ReLU}}_{(N_0,\dots,N_L)}$ with $\mathrm{size}(\Phi_\epsilon)=O(\epsilon^{-2})$ and
$$\frac{1}{(b-a)^{d/2}}\Big(\int_{[a,b]^d}|u(T,x)-R_\sigma(\Phi_\epsilon)(x)|^2\,dx\Big)^{1/2}\le\epsilon.$$
Such networks can be found by solving the ERM problem with $m\sim\epsilon^{-4}$ samples. The implicit constants depend at most polynomially on the dimension $d=N_0$!
SLIDE 110 Option Pricing without Curse of Dimensionality
Theorem [Berner-G-Jentzen (2018)], very special case
Let $\varphi(x)=\min\{\max\{\max_{i}(x_i-K_i),0\},R\}$ or $\varphi(x)=\min\{\max\{\sum_{i=1}^{d}x_i-K,0\},R\}$ (or any typical option payoff). Then for all $\epsilon>0$ there is $\Phi_\epsilon\in H^{\mathrm{ReLU}}_{(N_0,\dots,N_L)}$ with $\mathrm{size}(\Phi_\epsilon)=O(\epsilon^{-2})$ and
$$\frac{1}{(b-a)^{d/2}}\Big(\int_{[a,b]^d}|u(T,x)-R_\sigma(\Phi_\epsilon)(x)|^2\,dx\Big)^{1/2}\le\epsilon.$$
Such networks can be found by solving the ERM problem with $m\sim\epsilon^{-4}$ samples. The implicit constants depend at most polynomially on the dimension $d=N_0$!
Due to the compositional structure of NNs, all results also hold for options operating on options...
SLIDE 111
Wrap Up
SLIDE 112
Wrap Up
Several PDEs can be reformulated as learning problems.
SLIDE 113
Wrap Up
Several PDEs can be reformulated as learning problems. Neural network based numerical solution of high-dimensional PDEs is extremely promising both empirically and mathematically – and it is possible to prove real theorems!
SLIDE 114 Wrap Up
Several PDEs can be reformulated as learning problems. Neural network based numerical solution of high-dimensional PDEs is extremely promising both empirically and mathematically – and it is possible to prove real theorems! Specifically, we can prove that these methods are capable of overcoming the curse of dimensionality for an important class of PDEs arising in computational finance.
SLIDE 115 Wrap Up
Several PDEs can be reformulated as learning problems. Neural network based numerical solution of high-dimensional PDEs is extremely promising both empirically and mathematically – and it is possible to prove real theorems! Specifically, we can prove that these methods are capable of overcoming the curse of dimensionality for an important class of PDEs arising in computational finance. We can observe these properties in simulations.
SLIDE 116
Thank You!
Questions?
SLIDE 117 Literature
Beck, Becker, G, Jaafari, Jentzen. Solving Stochastic Differential Equations and Kolmogorov Equations by Means of Deep Learning. arXiv:1806.00421.
Elbrächter, G, Jentzen, Schwab. DNN Expression Rate Analysis of High-Dimensional PDEs: Applications in Option Pricing. arXiv:1806.xxxxx.
Perekrestenko, G, Elbrächter, Bölcskei. The universal approximation power of finite-width deep ReLU networks. arXiv:1806.01528.
G, Hornung, Jentzen, von Wurstemberger. A proof that artificial neural networks overcome the curse of dimensionality in the numerical approximation of Black-Scholes partial differential equations. arXiv:1809.xxxxx.
Berner, G, Jentzen. Empirical risk minimization over deep neural network hypothesis classes breaks the curse of dimensionality for the numerical approximation of Black-Scholes partial differential equations. arXiv:1809.xxxxx.