Robust pricing and hedging via neural SDEs
Lukasz Szpruch (University of Edinburgh; The Alan Turing Institute, London). Joint work with David Siska and Marc Sabate-Vilades (UoE), and Zan Zuric and Antoine Jacquier (ICL).
Outline:
◮ Robust pricing and hedging with neural SDEs [Gierjatowicz et al., 2020].
◮ Unbiased approximation of parametric path-dependent PDEs [Vidales et al., 2018].
◮ Neural ODEs via relaxed optimal control: a perspective on deep recurrent neural networks [Jabir et al., 2019].
◮ Data: (ξ_i, ζ_i)_{i=1}^n ∼ P ∈ P(R × R^d), with P unknown.
◮ Goal: find f : R^d → R that minimises the population risk.
◮ Population risk: fix ℓ : R × R → R_+ and set R(f) := E[ℓ(ζ, f(ξ))].
◮ Fix F := {f(·, θ) : θ ∈ Θ ⊂ R^p}.
◮ Empirical risk minimisation:
min_{f∈F} R_n(f) := (1/n) Σ_{i=1}^n ℓ(ζ_i, f(ξ_i)).
◮ Uniform convergence:
sup_{f∈F} |R_n(f) − R(f)| ≤ ε(n, F).
◮ Convex optimisation: take F and ℓ so that the ERM problem is convex.
◮ ... none of these works for deep learning ...
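For concreteness, a minimal PyTorch sketch of this ERM setup with the squared loss; the data-generating process and architecture are illustrative, not from the talk.

```python
# Minimal sketch of empirical risk minimisation; data and architecture
# below are illustrative placeholders.
import torch

n, d = 1000, 5
xi = torch.randn(n, d)                              # inputs (xi_i)
zeta = torch.sin(xi.sum(dim=1, keepdim=True))       # outputs (zeta_i); P "unknown"

f = torch.nn.Sequential(torch.nn.Linear(d, 32), torch.nn.ReLU(),
                        torch.nn.Linear(32, 1))     # the family F = {f(., theta)}
opt = torch.optim.Adam(f.parameters(), lr=1e-3)

for step in range(2000):
    opt.zero_grad()
    loss = ((f(xi) - zeta) ** 2).mean()             # R_n(f) with l(z, y) = |z - y|^2
    loss.backward()
    opt.step()
```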
From Belkin et al. [Belkin et al., 2018]:
◮ Need for a new theory of the generalisation error: classical Vapnik–Chervonenkis dimension and Rademacher complexity bounds do not help.
◮ Overparametrised models can be optimal in the high signal-to-noise-ratio regime, Montanari et al. [Mei and Montanari, 2019].
◮ Implicit regularisation [Heiss et al., 2019, Neyshabur et al., 2017].
Deep nets enhance numerical approximation:
◮ [Hernandez, 2016] used neural networks to learn the calibration map from market data directly to model parameters.
◮ We can solve (a reasonable class of) high-dimensional PDEs [Ruf and Wang, 2020, Jacquier and Oumgari, 2019, Vidales et al., 2018].
◮ [Horvath et al., 2019] showed that a good approximation of the parametric pricing operator is all that is needed to calibrate.
◮ Deep hedging [Buehler et al., 2019].
... and can be used to build novel (econometric) models:
◮ Market generators via (conditional) generative modelling [Kondratyev and Schwarz, 2019, Buehler et al., 2020, Henry-Labordere, 2019, Xu et al., 2020, Ni et al., 2020, Wiese et al., 2020].
◮ Stochastic local volatility models without Dupire's local vol [Cuchiero et al., 2020].
Until recently, models in finance and economics were mostly conceived in a three-step fashion:
◮ gathering statistical properties of the underlying time series, the so-called stylized facts;
◮ handcrafting a parsimonious model which best captures the desired market characteristics without adding any needless complexity; and
◮ calibrating and validating the handcrafted model.
... model complexity was undesirable.
Pros:
◮ Interpretable parameters
◮ Relatively easy to calibrate with a relatively small amount of data
◮ Several decades of underpinning research
Cons:
◮ Lack of a systematic framework for model selection
◮ Knightian uncertainty (unknown unknowns)
◮ Limited expressivity
◮ Generative models such as GANs or VAEs have demonstrated great success in seemingly high-dimensional setups.
◮ Input: source distribution µ and target distribution ν, i.e. input-output data.
◮ A generative model is a transport map T from µ to ν, i.e. T is a map that "pushes µ onto ν". We write T#µ = ν.
◮ Parametrise the transport map T(θ), θ ∈ R^p, e.g. as some network architecture or as the Heston model.
◮ Seek θ s.t. T(θ)#µ ≈ ν.
◮ Need to make a choice of the metric, e.g.
D(T(θ)#µ, ν) := sup_{f∈K} | E^{T(θ)#µ}[f] − E^{ν}[f] | .
◮ K could be the set of options we want to calibrate to, or could be a class of neural networks.
◮ The modelling choices are thus the parametrisation of T and the test class K (and hence the metric D).
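A sketch of how such a metric D can be estimated when K is a finite family of call payoffs; the strikes and sample sizes below are illustrative assumptions.

```python
# Sketch: D(T(theta)#mu, nu) = sup_{f in K} |E f(gen) - E f(data)|,
# with K a finite set of call payoffs f_K(x) = max(x - K, 0).
import torch

def ipm_calibration_distance(gen_samples, data_samples, strikes):
    gaps = []
    for K in strikes:
        f_gen = torch.clamp(gen_samples - K, min=0.0).mean()
        f_dat = torch.clamp(data_samples - K, min=0.0).mean()
        gaps.append((f_gen - f_dat).abs())
    return torch.stack(gaps).max()

# usage with placeholder samples of the terminal asset price
gen = 1.0 + 0.20 * torch.randn(10_000)
dat = 1.0 + 0.25 * torch.randn(10_000)
print(ipm_calibration_distance(gen, dat, torch.linspace(0.8, 1.2, 9)))
```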
Pros:
◮ Expressive and work in high dimensions
◮ By design data-driven, adaptable to changes in the environment
◮ Provide a new perspective on classical problems in finance
Cons:
◮ Parameters are not interpretable (a black-box approach)
◮ Training algorithms are data-hungry
◮ Models might be hard to work with, e.g. how does one go from Q to P?
◮ The field is largely empirical; standardised benchmarks and theoretical guarantees are lacking
Classical calibration:
◮ Pick a parametric model (S_t(θ))_{t∈[0,T]} (e.g. an Itô process) with parameters θ ∈ R^p.
◮ The parametric model induces a martingale measure Q(θ).
◮ Input data: prices of traded derivatives (p(Φ_i))_{i=0}^M with corresponding payoffs (Φ_i)_{i=0}^M.
◮ Output: θ* such that p(Φ_i) ≈ E^{Q(θ*)}[Φ_i] for all i.
◮ There are infinitely many models that are consistent with the market.
◮ M: the set of all martingale measures that are calibrated to the data.
◮ Compute conservative bounds for the price:
sup_{Q∈M} E^Q[Ψ] and inf_{Q∈M} E^Q[Ψ] .
◮ Use duality theory to deduce a (semi-static) hedging strategy.
◮ The obtained bounds are typically too wide to be of practical value.
◮ Challenges:
a) incorporate prior information to restrict the search space M;
b) design efficient algorithms for computing price bounds and the corresponding hedges.
[Diagram: classical risk models, generative models, and robust finance combine into neural SDEs.]
◮ We build an Itô process (X^θ_t)_{t∈[0,T]} with parameters θ ∈ R^p:
dS^θ_t = r S^θ_t dt + σ_S(t, X^θ_t, θ) dW_t ,
dV^θ_t = b_V(t, X^θ_t, θ) dt + σ_V(t, X^θ_t, θ) dW_t ,
X^θ_t = (S^θ_t, V^θ_t) ,
where σ_S, b_V, σ_V are given by neural networks (and can be path-dependent).
◮ The model induces a martingale probability measure Q(θ).
◮ The solution map is an instance of causal transport.
◮ See [Cuchiero et al., 2020] for neural SDEs with a prior on the vol process.
◮ See [Arribas et al., 2020] for Sig-SDEs (a neural SDE in a signature feature space).
◮ Neural SDEs are easy to work with, e.g. they admit a consistent change from Q to P.
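A minimal sketch of how such a neural SDE can be simulated with an Euler scheme in PyTorch; the network sizes, the single driving noise, and the time grid are illustrative assumptions.

```python
import torch

class NeuralSDE(torch.nn.Module):
    # sigma_S, b_V, sigma_V are small feed-forward nets in (t, S, V)
    def __init__(self, hidden=32):
        super().__init__()
        def net():
            return torch.nn.Sequential(
                torch.nn.Linear(3, hidden), torch.nn.Tanh(),
                torch.nn.Linear(hidden, 1))
        self.sigma_S, self.b_V, self.sigma_V = net(), net(), net()

    def simulate(self, s0, v0, r, T, n_steps, n_paths):
        dt = T / n_steps
        S = torch.full((n_paths, 1), s0)
        V = torch.full((n_paths, 1), v0)
        for k in range(n_steps):
            t = torch.full((n_paths, 1), k * dt)
            x = torch.cat([t, S, V], dim=1)
            dW = torch.randn(n_paths, 1) * dt ** 0.5   # same W drives S and V here
            S = S + r * S * dt + self.sigma_S(x) * dW  # dS = r S dt + sigma_S dW
            V = V + self.b_V(x) * dt + self.sigma_V(x) * dW
        return S

model = NeuralSDE()
ST = model.simulate(s0=1.0, v0=0.04, r=0.0, T=1.0, n_steps=100, n_paths=10_000)
```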
i) Calibration to market prices. Find model parameters θ* such that model prices match market prices:
θ* ∈ argmin_{θ∈Θ} Σ_{i=0}^M ℓ(E^{Q(θ)}[Φ_i], p(Φ_i)) .
ii) Robust pricing. Find model parameters θ^{l,*} and θ^{u,*} which provide robust arbitrage-free price bounds for an illiquid derivative, subject to available market data:
θ^{l,*} ∈ argmin_{θ∈Θ} E^{Q(θ)}[Ψ] , subject to Σ_{i=0}^M ℓ(E^{Q(θ)}[Φ_i], p(Φ_i)) = 0 ,
θ^{u,*} ∈ argmax_{θ∈Θ} E^{Q(θ)}[Ψ] , subject to Σ_{i=0}^M ℓ(E^{Q(θ)}[Φ_i], p(Φ_i)) = 0 ,
where ℓ : R × R → [0, ∞) is a convex loss function such that min_{x∈R,y∈R} ℓ(x, y) = 0.
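In practice the constrained problems in ii) can be relaxed with a penalty; a sketch under that assumption, reusing the NeuralSDE sketch above (the penalty weight lam, the Monte Carlo sizes, and the restriction to payoffs of the terminal price are illustrative simplifications, not the paper's exact scheme):

```python
import torch

def robust_bound_loss(model, payoff_exotic, payoffs_liquid, market_prices,
                      lam=100.0, upper=False):
    """Penalised surrogate: +/- E[Psi] + lam * sum_i l(E[Phi_i], p(Phi_i))."""
    ST = model.simulate(s0=1.0, v0=0.04, r=0.0, T=1.0,
                        n_steps=100, n_paths=5_000)      # terminal prices
    exotic = payoff_exotic(ST).mean()                    # E^{Q(theta)}[Psi]
    calib = sum((phi(ST).mean() - p) ** 2                # squared loss l
                for phi, p in zip(payoffs_liquid, market_prices))
    sign = -1.0 if upper else 1.0                        # maximise for theta^{u,*}
    return sign * exotic + lam * calib

# usage: minimising this with Adam over model parameters drives the model
# towards the lower (or upper) price bound while staying calibrated
```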
Let M = 1 and write the loss as h(θ) = ℓ( E^{Q(θ)}[Φ(X^θ)], p(Φ) ). Then in the gradient step update we have
∂_θ h(θ) = ∂_x ℓ( E^{Q(θ)}[Φ(X^θ)], p(Φ) ) · ∂_θ E^{Q(θ)}[Φ(X^θ)] .
Since ℓ is typically not an identity function, the mini-batch estimator of ∂_θ h(θ), obtained by replacing Q with the empirical measure Q^N,
∂_θ h^N(θ) := ∂_x ℓ( E^{Q^N}[Φ(X^θ)], p(Φ) ) · E^{Q^N}[∂_θ Φ(X^θ)] ,
is a biased estimator of ∂_θ h.
For ℓ(x, y) = |x − y|², we have
| E[∂_θ h^N(θ)] − ∂_θ h(θ) | ≤ (2/N) Var^Q[Φ(X^θ)]^{1/2} Var^Q[∂_θ Φ(X^θ)]^{1/2} .
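A quick numerical illustration of this bias on a toy model Φ(X^θ) = (θZ)^+ with Z standard normal; all constants below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, p, N, n_rep = 1.0, 0.5, 8, 200_000

# Phi = max(theta * Z, 0), d/dtheta Phi = Z * 1{theta Z > 0};
# h(theta) = (E[Phi] - p)^2 with E[Phi] = theta / sqrt(2 pi)
exact_grad = 2 * (theta / np.sqrt(2 * np.pi) - p) * (1 / np.sqrt(2 * np.pi))

Z = rng.standard_normal((n_rep, N))
phi = np.maximum(theta * Z, 0.0)
dphi = Z * (theta * Z > 0)
grad_N = 2 * (phi.mean(axis=1) - p) * dphi.mean(axis=1)  # mini-batch gradient
print(grad_N.mean() - exact_grad)  # clearly nonzero: the estimator is O(1/N) biased
```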
Let dX^β_t = σ(t, (X^β_{s∧t})_{s∈[0,T]}, β) dW_t, and define the martingale
F^β_t := F^β(t, (X^β_{s∧t})_{s∈[0,T]}) = E[ Φ((X^β_s)_{s∈[0,T]}) | (X^β_{s∧t})_{s∈[0,T]} ] .
By the martingale representation theorem,
F^β_t = Φ((X^β_s)_{s∈[0,T]}) − ∫_t^T ∇_ω F^β(s, (X^β_{r∧s})_{r∈[0,T]}) dX^β_s ,
so that the conditional variance of the hedged payoff vanishes:
Var[ Φ((X^β_s)_{s∈[0,T]}) − ∫_t^T ∇_ω F^β(s, (X^β_{r∧s})_{r∈[0,T]}) dX^β_s | (X^β_{s∧t})_{s∈[0,T]} ] = 0 .
◮ Can learn (parametric) path-dependent PDEs.
◮ We obtain an unbiased approximation to the PDE solution by hybrid Monte Carlo/deep learning, see [Vidales et al., 2018].
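Sketch of the resulting control variate estimator: a network h approximates ∇_ω F^β, and the Monte Carlo estimator of E[Φ] remains unbiased for any h (the stochastic integral has zero mean) while its variance shrinks as h improves. The interfaces below are assumptions.

```python
import torch

def cv_price_estimate(paths, t_grid, payoff, h_net):
    """Unbiased estimator: mean over paths of Phi(X) - sum_k h(t_k, X_tk) dX_tk.

    paths: (n_paths, n_steps + 1) tensor simulated under the pricing measure."""
    phi = payoff(paths)
    cv = torch.zeros_like(phi)
    for k in range(len(t_grid) - 1):
        tk = t_grid[k].expand(paths.shape[0], 1)
        xk = paths[:, k:k + 1]
        dX = paths[:, k + 1:k + 2] - xk
        cv = cv + (h_net(torch.cat([tk, xk], dim=1)) * dX).squeeze(1)
    est = phi - cv
    return est.mean(), est.var()   # same mean as plain MC, smaller variance
```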
Input: π = {t_0, t_1, ..., t_{N_steps}}, the time grid for the numerical scheme.
Input: payoffs (Φ_i)_{i=1}^{N_prices}.
Input: market option prices p(Φ_j), j = 1, ..., N_prices.
for epoch = 1 : N_epochs do
  Generate N_trn paths (x^{π,θ,i}_{t_n})_{n=0}^{N_steps} := (s^{π,θ,i}_{t_n}, v^{π,θ,i}_{t_n})_{n=0}^{N_steps}, i = 1, ..., N_trn, using the Euler scheme.
  During one epoch: freeze ξ and use Adam to update θ, where
  θ = argmin_θ Σ_{j=1}^{N_prices} ( E^{N_trn}[ Φ_j − Σ_{k=0}^{N_steps−1} h(t_k, X^{π,θ}_{t_k}, ξ_j) ΔS^{π,θ}_{t_k} ] − p(Φ_j) )² .
  During one epoch: freeze θ and use Adam to update ξ by minimising the sample variance:
  ξ = argmin_ξ Σ_{j=1}^{N_prices} Var^{N_trn}[ Φ_j − Σ_{k=0}^{N_steps−1} h(t_k, X^{π,θ}_{t_k}, ξ_j) ΔS^{π,θ}_{t_k} ] .
end for
return θ and ξ_j for all payoffs (Φ_j)_{j=1}^{N_prices}.
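A condensed PyTorch version of this alternating scheme, reusing the NeuralSDE class from the earlier sketch; the single shared hedging network (in place of per-price strategies ξ_j), batch sizes, learning rates, and the placeholder market prices are all assumptions.

```python
import torch

model, n_paths = NeuralSDE(), 5_000
h_net = torch.nn.Sequential(torch.nn.Linear(2, 32), torch.nn.Tanh(),
                            torch.nn.Linear(32, 1))     # hedge h(t, S; xi)
opt_theta = torch.optim.Adam(model.parameters(), lr=1e-3)
opt_xi = torch.optim.Adam(h_net.parameters(), lr=1e-3)
strikes = torch.linspace(0.8, 1.2, 9)
market = 0.08 * torch.ones(9)                           # placeholder p(Phi_j)
t_grid = torch.linspace(0.0, 1.0, 101)

def simulate_paths(n_paths):
    # Euler scheme as in the model definition, keeping the whole S path (r = 0)
    S, V = torch.ones(n_paths, 1), 0.04 * torch.ones(n_paths, 1)
    out = [S]
    for k in range(len(t_grid) - 1):
        dt = t_grid[k + 1] - t_grid[k]
        x = torch.cat([t_grid[k].expand(n_paths, 1), S, V], dim=1)
        dW = torch.randn(n_paths, 1) * dt.sqrt()
        S = S + model.sigma_S(x) * dW
        V = V + model.b_V(x) * dt + model.sigma_V(x) * dW
        out.append(S)
    return torch.cat(out, dim=1)

def hedged(S_path):
    # Phi_j - sum_k h(t_k, S_tk) dS_tk for each strike j
    payoff = torch.clamp(S_path[:, -1:] - strikes, min=0.0)
    cv = torch.zeros_like(payoff)
    for k in range(S_path.shape[1] - 1):
        x = torch.cat([t_grid[k].expand(S_path.shape[0], 1),
                       S_path[:, k:k + 1]], dim=1)
        cv = cv + h_net(x) * (S_path[:, k + 1:k + 2] - S_path[:, k:k + 1])
    return payoff - cv

for epoch in range(100):
    H = hedged(simulate_paths(n_paths))                 # freeze xi, update theta
    loss_theta = ((H.mean(dim=0) - market) ** 2).sum()
    opt_theta.zero_grad(); loss_theta.backward(); opt_theta.step()
    H = hedged(simulate_paths(n_paths).detach())        # freeze theta, update xi
    opt_xi.zero_grad(); H.var(dim=0).sum().backward(); opt_xi.step()
```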
We calibrate a (local) stochastic volatility model
dS_t = r S_t dt + σ_S(t, S_t, V_t, ν) S_t dB^S_t , S_0 = 1 ,
dV_t = b_V(V_t, φ) dt + σ_V(V_t, ϕ) dB^V_t , V_0 = v_0 , d⟨B^S, B^V⟩_t = ρ dt ,
to European option prices
p(Φ) := E^{Q(θ)}[Φ] = e^{−rT} E^{Q(θ)}[ (S_T − K)^+ | S_0 = 1 ] ,
with strikes K in [0.8, 1.2].
As an example of an illiquid derivative for which we wish to find robust bounds, we take the lookback option
p(Ψ) := E^{Q(θ)}[Ψ] = e^{−rT} E^{Q(θ)}[ max_{t∈[0,T]} S_t − S_T | S_0 = 1 ] .
We generate synthetic data using the Heston model.
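For reproducibility, a sketch of how such synthetic Heston targets can be generated by Monte Carlo; the parameter values are illustrative, not those used in the experiments.

```python
import torch

def heston_call_prices(strikes, T=1.0, n_steps=200, n_paths=200_000,
                       kappa=2.0, theta_v=0.04, xi=0.3, rho=-0.7, v0=0.04):
    """Monte Carlo Heston call prices with full-truncation Euler, r = 0."""
    dt = T / n_steps
    S = torch.ones(n_paths)
    V = torch.full((n_paths,), v0)
    for _ in range(n_steps):
        Z1, Z2 = torch.randn(n_paths), torch.randn(n_paths)
        dW_S = Z1 * dt ** 0.5
        dW_V = (rho * Z1 + (1 - rho ** 2) ** 0.5 * Z2) * dt ** 0.5
        Vp = V.clamp(min=0.0)                        # full truncation
        S = S * (1 + Vp.sqrt() * dW_S)               # dS = S sqrt(V) dB^S
        V = V + kappa * (theta_v - Vp) * dt + xi * Vp.sqrt() * dW_V
    return torch.stack([(S - K).clamp(min=0.0).mean() for K in strikes])

print(heston_call_prices(torch.linspace(0.8, 1.2, 9)))
```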
Figure: Vanilla option prices and implied volatility curves of the 10 calibrated neural SDEs vs. the market data, for different maturities.
Figure: Exotic option prices are in blue; the calibration error is in grey. The three box plots in each group arise, respectively, from aiming for the lower bound, an ad hoc price, and the upper bound of the illiquid derivative. Each box plot comes from 10 different runs of the neural SDE calibration.
Figure: Root mean squared error of the calibration to vanilla option prices, with and without the hedging strategy parametrisation.
Consider the neural SDE
dS^θ_t = S^θ_t σ(t, V^θ_t; θ) dW_t ,
dV^θ_t = a(t, V^θ_t; θ) dt + b(t, V^θ_t; θ) dB_t ,
ρ dt = d⟨W, B⟩_t .
It can be shown that the VIX dynamics at time t ∈ [0, T] can be expressed as
VIX²_t := (1/Δτ) E[ ∫_t^{t+Δτ} σ²_s ds | F_t ] = −(2/Δτ) E[ log(S_{t+Δτ}/S_t) | F_t ] , Δτ = 30/365 .
The VIX future with maturity T is then given by F^VIX_{t,T} := E[VIX_T | F_t], and VIX call and put options are defined as
C^VIX_t(T, K) := E[ (VIX_T − K)^+ | F_t ] , P^VIX_t(T, K) := E[ (K − VIX_T)^+ | F_t ] .
Joint work with Antoine Jacquier, Marc Sabate Vidales, David Siska, and Zan Zuric.
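A sketch of how VIX_T can be estimated inside a Monte Carlo calibration via the log-contract identity above, using nested (inner) simulation; the model interface (callables sigma, a, b) and the zero correlation in the inner loop are assumptions made for brevity.

```python
import torch

def vix_squared(ST, VT, model, dtau=30 / 365, n_inner=1_000, n_steps=30):
    """VIX_T^2 = -(2/dtau) E[log(S_{T+dtau}/S_T) | F_T], by inner simulation.

    ST, VT: (n_outer,) states at time T; model supplies sigma, a, b nets."""
    dt = dtau / n_steps
    S = ST.repeat_interleave(n_inner)
    V = VT.repeat_interleave(n_inner)
    for k in range(n_steps):
        t = torch.full_like(S, k * dt)
        dW = torch.randn_like(S) * dt ** 0.5
        dB = torch.randn_like(S) * dt ** 0.5          # rho = 0 here for brevity
        S = S + S * model.sigma(t, V) * dW
        V = V + model.a(t, V) * dt + model.b(t, V) * dB
    logret = torch.log(S / ST.repeat_interleave(n_inner))
    return -(2 / dtau) * logret.view(-1, n_inner).mean(dim=1)
```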
Figure: Calibration to market data (data source: OptionMetrics) containing SPX options, VIX options and VIX futures, for T = 1, ..., 6 months.
Figure: Calibrated neural SDE errors on SPX options and VIX options. Hatches correspond to maturity/strike combinations for which no market data was available.
◮ Neural SDE model under the real-world measure P(θ):
◮ Let ζ : [0, T] × R^d × R^p → R^n be another parametric function.
◮ Let
b_{S,P}(t, X^θ_t, θ) := r S^θ_t + σ_S(t, X^θ_t, θ) ζ(t, X^θ_t, θ) ,
b_{V,P}(t, X^θ_t, θ) := b_V(t, X^θ_t, θ) + σ_V(t, X^θ_t, θ) ζ(t, X^θ_t, θ) .
◮ We now define the real-world measure P(θ) via the Radon–Nikodym derivative
dP(θ)/dQ(θ) := exp( ∫_0^T ζ(t, X^θ_t, θ) dW_t − (1/2) ∫_0^T |ζ(t, X^θ_t, θ)|² dt ) .
◮ Under appropriate assumptions on ζ (e.g. boundedness) the measure P(θ) is a probability measure, and by the Girsanov theorem we can find a Brownian motion (W^{P(θ)}_t)_{t∈[0,T]} such that
dS^θ_t = b_{S,P}(t, X^θ_t, θ) dt + σ_S(t, X^θ_t, θ) dW^{P(θ)}_t ,
dV^θ_t = b_{V,P}(t, X^θ_t, θ) dt + σ_V(t, X^θ_t, θ) dW^{P(θ)}_t .
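A sketch of evaluating this Radon–Nikodym density path-wise along discretised paths; zeta_net is an assumed network stub, and dW must be the same Brownian increments that drove the Euler simulation.

```python
import torch

def radon_nikodym(S_path, V_path, dW, t_grid, zeta_net):
    """dP(theta)/dQ(theta) per path: exp(sum_k zeta_k dW_k - 0.5 |zeta_k|^2 dt).

    dW: (n_paths, n_steps) increments reused from the simulation itself."""
    logZ = torch.zeros(S_path.shape[0])
    for k in range(len(t_grid) - 1):
        dt = t_grid[k + 1] - t_grid[k]
        x = torch.cat([t_grid[k].expand(S_path.shape[0], 1),
                       S_path[:, k:k + 1], V_path[:, k:k + 1]], dim=1)
        zeta = zeta_net(x).squeeze(1)
        logZ = logZ + zeta * dW[:, k] - 0.5 * zeta ** 2 * dt
    return logZ.exp()   # reweighting Q-samples by this gives P-expectations
```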
◮ We can incorporate additional market information, e.g. a bound on the realised variance, by adding an additional constraint during training.
◮ We can use neural SDEs to adversarially train hedging strategies, using ideas from distributionally robust optimisation.
◮ We can simplify the learned models, using ideas from explainable machine learning.
Pros:
◮ Expressive yet consistent with the classical framework
◮ By design data-driven, adaptable to changes in the environment
◮ Provide consistent models for calibrating under Q and P
◮ Provide a systematic framework for model selection
◮ Ability to learn in a low-data regime (due to a good prior)
◮ (Some) theoretical guarantees for the generalisation error
Cons:
◮ Parameters are not interpretable, but the models are.
◮ Computationally more intense than classical models, but because we train with gradient descent, recalibration is (typically) cheap.
◮ Recurrent neural networks can be written as
X^{l+1} = X^l + φ(X^l, θ^l) , X^0 = ξ ∈ R^d .
◮ Infinite-depth network (useful when fitting time-series data):
dX^ξ_t(θ) = φ(X^ξ_t(θ), θ_t) dt , t ∈ [0, 1] , X_0 = ξ ∈ R^d .
◮ Take input-output data (ξ, ζ) ∼ M. Our objective is to minimise
J(θ) := ∫ |ζ − X^ξ_T(θ)|² M(dξ, dζ) .
◮ Goal: find θ̃ such that, for all θ,
(d/dε) J(θ̃ + ε(θ − θ̃)) |_{ε=0} ≥ 0 .
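The first two bullets are related by the explicit Euler scheme (with the step size absorbed into φ in the residual form); a tiny sketch, with dimensions and the choice of φ purely illustrative:

```python
import torch

d, L = 4, 50                      # state dimension, number of layers
dt = 1.0 / L                      # layer index l <-> time t = l * dt

# per-layer parameters theta_l of phi(x, theta_l) = W_l tanh(x) (illustrative)
W = [torch.randn(d, d) * 0.1 for _ in range(L)]

def phi(x, Wl):
    return torch.tanh(x) @ Wl.T

x = torch.randn(1, d)             # X^0 = xi
for l in range(L):                # residual step = Euler step of the ODE
    x = x + dt * phi(x, W[l])     # dX_t = phi(X_t, theta_t) dt
```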
◮ Mean-field perspective on neural networks:
(1/n) Σ_{i=1}^n β^{n,i} ϕ(α^{n,i} · z + ρ^{n,i} · ζ) = ∫ φ(z, a, ζ) ν^n(da) , with ν^n := (1/n) Σ_{i=1}^n δ_{a^{n,i}} , a = (α, β, ρ) .
◮ Let φ(z, a, ζ) = β ϕ(α · z + ρ · ζ), and consider
X^{ν,ξ,ζ}_t = ξ + ∫_0^t ∫ φ(X^{ν,ξ,ζ}_r, a, ζ) ν_r(da) dr .
◮ Regularised objective:
J^{σ,M}(ν) := ∫ [ ∫_0^T ∫ f_t(X^{ν,ξ,ζ}_t, a, ζ) ν_t(da) dt + g(X^{ν,ξ,ζ}_T, ζ) ] M(dξ, dζ) + (σ²/2) ∫_0^T Ent(ν_t) dt .
◮ Relative entropy: Ent(m) := ∫ m(x) log( m(x)/g(x) ) dx if m is a.c. w.r.t. the Lebesgue measure, and ∞ otherwise, with Gibbs reference measure g: g(x) = e^{−U(x)} for a suitable potential U.
◮ See work by Weinan E [Weinan, 2017]; Cuchiero, Larsson, Teichmann [Cuchiero et al., 2019]; Hu, Kazeykina, Ren [Hu et al., 2019].
◮ The goal is to find, for each t ∈ [0, T], a vector-field flow (b_{s,t})_{s≥0} such that the measure flow (ν_{s,t})_{s≥0} given by
∂_s ν_{s,t} = div(ν_{s,t} b_{s,t}) , s ≥ 0 , ν_{0,t} = ν^0_t ∈ P_2(R^p) ,
satisfies that s ↦ J^σ(ν_{s,·}) is decreasing.
◮ Relaxed Hamiltonian:
H^σ_t(x, p, m, ζ) := ∫ h_t(x, p, a, ζ) m(da) + (σ²/2) Ent(m) , where h_t(x, p, a, ζ) := φ_t(x, a, ζ) · p + f_t(x, a, ζ) .
◮ The adjoint process:
P^{ξ,ζ}_T(ν) = (∇_x g)(X^{ξ,ζ}_T(ν), ζ) , dP^{ξ,ζ}_t(ν) = −(∇_x H_t)(X^{ξ,ζ}_t(ν), P^{ξ,ζ}_t(ν), ν_t) dt .
If ν ∈ V_2 is (locally) optimal then it must solve the following system:
ν_t = argmin_{µ∈P_2(R^p)} ∫ H^σ_t(X^{ξ,ζ}_t, P^{ξ,ζ}_t, µ, ζ) M(dξ, dζ) ,
dX^{ξ,ζ}_t = Φ_t(X^{ξ,ζ}_t, ν_t, ζ) dt , X^{ξ,ζ}_0 = ξ ∈ R^d ,
dP^{ξ,ζ}_t = −(∇_x H_t)(X^{ξ,ζ}_t, P^{ξ,ζ}_t, ν_t, ζ) dt , P^{ξ,ζ}_T = (∇_x g)(X^{ξ,ζ}_T, ζ) .
The parameters then evolve according to the mean-field Langevin dynamics
dθ_{s,t} = −( ∫ (∇_a h_t)(X^{ξ,ζ}_{s,t}, P^{ξ,ζ}_{s,t}, θ_{s,t}, ζ) M(dξ, dζ) + (σ²/2) (∇_a U)(θ_{s,t}) ) ds + σ dB_s ,
where, for t ∈ [0, T],
ν_{s,t} = L(θ_{s,t}) ,
X^{ξ,ζ}_{s,t} = ξ + ∫_0^t Φ_r(X^{ξ,ζ}_{s,r}, ν_{s,r}, ζ) dr ,
P^{ξ,ζ}_{s,t} = (∇_x g)(X^{ξ,ζ}_{s,T}, ζ) + ∫_t^T (∇_x H_r)(X^{ξ,ζ}_{s,r}, P^{ξ,ζ}_{s,r}, ν_{s,r}, ζ) dr .
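In the particle approximation this is just noisy gradient descent; a sketch of the Euler–Maruyama discretisation with a toy objective standing in for the mean-field drift (all constants illustrative):

```python
import torch

# Langevin dynamics d theta = -grad(J + (sigma^2 / 2) U)(theta) ds + sigma dB_s,
# discretised with step gamma; U(a) = |a|^2 / 2 so that grad U(a) = a.
sigma, gamma, n_particles, p = 0.5, 1e-2, 1_000, 2
theta = torch.randn(n_particles, p)           # particle approximation of nu_s

def gradJ(a):                                 # toy gradient, standing in for
    return a - torch.tensor([1.0, -1.0])      # int (nabla_a h_t) M(dxi, dzeta)

for s in range(5_000):
    drift = gradJ(theta) + 0.5 * sigma ** 2 * theta
    theta = theta - gamma * drift + sigma * gamma ** 0.5 * torch.randn_like(theta)
# the empirical law of the particles approximates the invariant (Gibbs) measure
```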
Assume that σ > 0. Then:
i) if ν* ∈ argmin_{ν∈V_2} J^σ(ν), then ν* is an invariant measure of the dynamics, given (up to normalisation) by
ν*_t(a) = e^{−(2/σ²) h_t(a, ν*, M)} g(a) ;
ii) if σ²κ − 4L > 0, then ν* is unique and, for all s ≥ 0 and any initial law L(θ_{0,·}),
W^T_2(L(θ_{s,·}), ν*)² ≤ e^{−λs} W^T_2(L(θ_{0,·}), ν*)² .
Here:
◮ W^T_q(µ, ν) := ( ∫_0^T W_q(µ_t, ν_t)^q dt )^{1/q} ;
◮ h_t(a, µ, M) := ∫ h_t(X^{ξ,ζ}_t(µ), P^{ξ,ζ}_t(µ), a, ζ) M(dξ, dζ) .
◮ Recall the cost function
J^{σ,M}(ν) := ∫ [ ∫_0^T ∫ f_t(X^{ν,ξ,ζ}_t, a, ζ) ν_t(da) dt + g(X^{ν,ξ,ζ}_T, ζ) ] M(dξ, dζ) + (σ²/2) ∫_0^T Ent(ν_t) dt .
◮ In practice one does not have access to the population distribution M, but works with a finite sample M^{N_1} := (1/N_1) Σ_{j_1=1}^{N_1} δ_{(ξ^{j_1}, ζ^{j_1})}.
◮ Practitioners use J^{M^{N_1}}(ν), and NOT J^{σ,M^{N_1}}(ν), to set the stopping criteria for learning.
◮ The entropy term can therefore be viewed as an implicit regularisation.
Let σ²κ ≫ 0. There is c > 0, independent of λ, S, N_1, N_2, d and p, such that
E[ ε(ν^{σ,N_1,N_2,Δs}_{S,·}) ] ≤ c ( e^{−λS} + N_1^{−1/2} + N_2^{−1} + h ) .
The generalisation error is given by
ε(ν^{σ,N_1,N_2,Δs}_{S,·}) := J^M(ν^{σ,N_1,N_2,Δs}_{S,·}) − J^M(ν^{*,σ}) = J^M(ν^{σ,N_1,N_2,Δs}_{S,·}) + (σ²/2) ∫_0^T Ent(ν^{*,σ}_t) dt − min_{µ∈V_2} J^{σ,M}(µ) .
◮ N_1: size of the training data.
◮ N_2: proxy for the number of parameters.
◮ γ: learning rate.
◮ S/γ: proxy for the training time.
◮ By discretising the ODEs we can also get estimates on the number of layers.
Fix ε > 0 and N_1 > 0. Assume that for all M^{N_1} we have J^{M^{N_1}}(ν^{*,σ,N_1}) ≤ ε. There is c > 0, independent of λ, S, N_1, N_2, d and p, such that
E[ ε(ν^{σ,N_1,N_2,Δs}_{S,·}) ] ≤ 2ε + c ( e^{−λS} + N_1^{−1/2} + N_2^{−1} + h ) .
We have a full analysis of the convergence of a regularised gradient-descent algorithm for deep networks modelled by ODEs.
Key messages:
◮ Training of neural nets should be viewed as sampling rather than optimisation.
◮ The Wasserstein gradient flow provides a framework to study convergence.
◮ Probabilistic numerical analysis provides quantitative bounds that do not suffer from the curse of dimensionality.
References

[Arribas et al., 2020] Arribas, I. P., Salvi, C., and Szpruch, L. (2020). Sig-SDEs model for quantitative finance. arXiv preprint arXiv:2006.00218.
[Belkin et al., 2018] Belkin, M., Hsu, D., Ma, S., and Mandal, S. (2018). Reconciling modern machine learning and the bias-variance trade-off. arXiv preprint arXiv:1812.11118.
[Buehler et al., 2019] Buehler, H., Gonon, L., Teichmann, J., and Wood, B. (2019). Deep hedging. Quantitative Finance, 19(8):1271–1291.
[Buehler et al., 2020] Buehler, H., Horvath, B., Lyons, T., Perez Arribas, I., and Wood, B. (2020). Generating financial markets with signatures. Available at SSRN.
[Cuchiero et al., 2020] Cuchiero, C., Khosrawi, W., and Teichmann, J. (2020). A generative adversarial network approach to calibration of local stochastic volatility models. arXiv preprint arXiv:2005.02505.
[Cuchiero et al., 2019] Cuchiero, C., Larsson, M., and Teichmann, J. (2019). Deep neural networks, generic universal interpolation, and controlled ODEs. arXiv preprint arXiv:1908.07838.
[Gierjatowicz et al., 2020] Gierjatowicz, P., Sabate-Vidales, M., Siska, D., Szpruch, L., and Zuric, Z. (2020). Robust pricing and hedging via neural SDEs. arXiv preprint arXiv:2007.04154.
[Heiss et al., 2019] Heiss, J., Teichmann, J., and Wutte, H. (2019). How implicit regularization of neural networks affects the learned function, part I. arXiv preprint arXiv:1911.02903.
[Henry-Labordere, 2019] Henry-Labordere, P. (2019). Generative models for financial data. Available at SSRN 3408007.
[Hernandez, 2016] Hernandez, A. (2016). Model calibration with neural networks. Available at SSRN 2812140.
[Horvath et al., 2019] Horvath, B., Muguruza, A., and Tomas, M. (2019). Deep learning volatility. Available at SSRN 3322085.
[Hu et al., 2019] Hu, K., Kazeykina, A., and Ren, Z. (2019). Mean-field Langevin system, optimal control and deep neural networks. arXiv preprint arXiv:1909.07278.
[Jabir et al., 2019] Jabir, J.-F., Šiška, D., and Szpruch, L. (2019). Mean-field neural ODEs via relaxed optimal control. arXiv preprint arXiv:1912.05475.
[Jacquier and Oumgari, 2019] Jacquier, A. J. and Oumgari, M. (2019). Deep PPDEs for rough local stochastic volatility. Available at SSRN 3400035.
[Kondratyev and Schwarz, 2019] Kondratyev, A. and Schwarz, C. (2019). The market generator. Available at SSRN 3384948.
[Mei and Montanari, 2019] Mei, S. and Montanari, A. (2019). The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv preprint arXiv:1908.05355.
[Neyshabur et al., 2017] Neyshabur, B., Tomioka, R., Salakhutdinov, R., and Srebro, N. (2017). Geometry of optimization and implicit regularization in deep learning. arXiv preprint arXiv:1705.03071.
[Ni et al., 2020] Ni, H., Szpruch, L., Wiese, M., Liao, S., and Xiao, B. (2020). Conditional Sig-Wasserstein GANs for time series generation. arXiv preprint arXiv:2006.05421.
[Ruf and Wang, 2020] Ruf, J. and Wang, W. (2020). Neural networks for option pricing and hedging: a literature review. Journal of Computational Finance, forthcoming.
[Vidales et al., 2018] Vidales, M. S., Šiška, D., and Szpruch, L. (2018). Martingale functional control variates via deep learning. arXiv preprint arXiv:1810.05094.
[Weinan, 2017] Weinan, E. (2017). A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1–11.
[Wiese et al., 2020] Wiese, M., Knobloch, R., Korn, R., and Kretschmer, P. (2020). Quant GANs: Deep generation of financial time series. Quantitative Finance, pages 1–22.
[Xu et al., 2020] Xu, T., Wenliang, L. K., Munn, M., and Acciaio, B. (2020). COT-GAN: Generating sequential data via causal optimal transport. arXiv preprint arXiv:2006.08571.