SLIDE 1

Step by step implementation and optimization of simulations in quantitative finance

Lokman Abbas-Turki

UPMC, LPMA

10-12 October 2017

SLIDE 2

Plan

• Introduction
• Local volatility challenges GPUs
• Monte Carlo and OpenACC parallelization
• Monte Carlo and CUDA parallelization
• PDE formulation & Crank-Nicolson scheme
• PDE simulation and CUDA parallelization

SLIDE 3

Introduction

Plan

• Introduction
• Local volatility challenges GPUs
• Monte Carlo and OpenACC parallelization
• Monte Carlo and CUDA parallelization
• PDE formulation & Crank-Nicolson scheme
• PDE simulation and CUDA parallelization

SLIDE 4

Introduction

What makes bankers change their minds?

CVA (Credit Valuation Adjustment) and, more generally, XVA (X = C, D, F, K, M Valuation Adjustment) are tipping-point applications.

FRTB (Fundamental Review of the Trading Book) is the other important application with a deadline in 2019.

Electronic trading and deep learning.

Lloyd Blankfein declared about Goldman Sachs: “We are a technology firm”.

HPC in banks

From distribution to parallelization.

From small to big nodes.

The efficiency of GPUs becomes undeniable.

Use of .NET, C, C++ and C#.

Remaining challenges and fears

Code management.

Possible conflicts within quant teams.

Can we extend the results for toy models to more general models?

SLIDE 5

Local volatility challenges GPUs

Plan

• Introduction
• Local volatility challenges GPUs
• Monte Carlo and OpenACC parallelization
• Monte Carlo and CUDA parallelization
• PDE formulation & Crank-Nicolson scheme
• PDE simulation and CUDA parallelization

SLIDE 6

Local volatility challenges GPUs

Local volatility from implied volatility

First array: rg

$$dS_t = S_t\,r_g(t)\,dt + S_t\,\sigma_{loc}(S_t,t)\,dW_t, \qquad S_0 = x_0.$$

S is the stock price process and x0 is the spot price

W is a Brownian motion with W0 = 0

rg is the risk-free rate, assumed piecewise constant

$\sigma_{loc}(x,t)$ is a local volatility function: $\mathbb{R}_+^* \times \mathbb{R}_+ \to \mathbb{R}_+^*$

Dupire Equation

Given a family $C(K,T)_{K,T}$ of call prices with strike K and maturity T,

$$\sigma^2_{loc}(K,T) = \frac{2\left(\dfrac{\partial C}{\partial T} + K r_g(T)\,\dfrac{\partial C}{\partial K}\right)}{K^2\,\dfrac{\partial^2 C}{\partial K^2}}$$

From implied to local

Using the Black & Scholes implied volatility σimp(x, t), Andersen and Brotherton-Ratcliffe (1997) showed that

$$\sigma^2_{loc}(x,t) = \frac{2\,\dfrac{\partial \sigma_{imp}}{\partial t} + \dfrac{\sigma_{imp}}{t} + 2xr_g(t)\,\dfrac{\partial \sigma_{imp}}{\partial x}}{x^2\left[\dfrac{\partial^2 \sigma_{imp}}{\partial x^2} - d_+\sqrt{t}\left(\dfrac{\partial \sigma_{imp}}{\partial x}\right)^2 + \dfrac{1}{\sigma_{imp}}\left(\dfrac{1}{x\sqrt{t}} + d_+\,\dfrac{\partial \sigma_{imp}}{\partial x}\right)^2\right]}$$

with

$$d_+ = \frac{\sigma_{imp}\sqrt{t}}{2} + \frac{\log(x_0/x) + \int_0^t r_g(u)\,du}{\sigma_{imp}\sqrt{t}}$$

SLIDE 7

Local volatility challenges GPUs

SVI to simulate implied volatility

+2 arrays: Tg, Kg

For (x, t) ∈ Kg × Tg, we assume that the observed implied volatility is

$$\sigma_{imp}(x,t) = \sqrt{\frac{w(x,t)}{t}}$$

where the cumulative variance is parameterized by

$$w(x,t) = \frac{\theta_t}{2}\left[1 + \rho\,\varphi(\theta_t)\,k + \sqrt{\left(\varphi(\theta_t)\,k + \rho\right)^2 + (1-\rho^2)}\right] \tag{1}$$

with $k = \log(x/x_0) - \int_0^t r_g(u)\,du$, $\varphi(\theta_t) = \dfrac{\eta}{\theta_t^{\gamma}(1+\theta_t)^{1-\gamma}}$ and $\theta_t = a^2 t + b\left(1 - e^{-\lambda t}\right)$.

All parameters of (1) are discussed in Gatheral & Jacquier (2013)

Interpolation on ]0, max(Kg)]× ]0, max(Tg)]

We compute the values of σimp(x, t) on the grid Kg × Tg. Then, we interpolate these values to obtain σ̄(x, t) defined on ]0, max(Kg)] × ]0, max(Tg)], and we assume σimp(x, t) ≈ σ̄(x, t) for (x, t) ∈ ]0, max(Kg)] × ]0, max(Tg)].
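A plain-C sketch of (1), assuming the SSVI parameters a, b, λ, ρ, η, γ and the integral ∫₀ᵗ rg(u)du are given; the function and parameter names are illustrative, not the lab's API:

#include <math.h>

double ssvi_sigma_imp(double x, double t, double x0, double intRg,
                      double a, double b, double lam,
                      double rho, double eta, double gam) {
    double th  = a * a * t + b * (1.0 - exp(-lam * t));          /* theta_t */
    double phi = eta / (pow(th, gam) * pow(1.0 + th, 1.0 - gam)); /* phi(theta_t) */
    double k   = log(x / x0) - intRg;       /* forward log-moneyness term of (1) */
    double w   = 0.5 * th * (1.0 + rho * phi * k
               + sqrt((phi * k + rho) * (phi * k + rho) + 1.0 - rho * rho));
    return sqrt(w / t);                     /* sigma_imp = sqrt(w(x,t)/t) */
}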

SLIDE 8

Local volatility challenges GPUs

Bicubic interpolation for implied volatility

Let k, q be such that (x, t) ∈ ]Kg[k], Kg[k+1]] × ]Tg[q], Tg[q+1]]. Then

$$\bar\sigma(x,t) = \sum_{i=0}^{3}\sum_{j=0}^{3} C_g(k,q,i,j)\,l^i u^j \quad\text{where}\quad l = \frac{t - T_g[q]}{T_g[q+1] - T_g[q]} \quad\text{and}\quad u = \frac{x - K_g[k]}{K_g[k+1] - K_g[k]}$$

Fourth array: Cg

Cg(k, q, i, j) = Cg[k * (nt − 1) * 16 + q * 16 + i * 4 + j] with (k, q, i, j) ∈ {0, ..., nk − 1} × {0, ..., nt − 1} × {0, ..., 3} × {0, ..., 3}, where nk and nt are the sizes of Kg and Tg.
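A minimal device-side sketch of this flattened indexing together with the double sum above, assuming (as on slide 17) that Kg, Tg and Cg sit in constant memory; the array sizes are illustrative placeholders:

__constant__ float Kg[32], Tg[32];       // grid nodes (sizes illustrative)
__constant__ float Cg[31 * 31 * 16];     // flattened bicubic coefficients

__device__ float sigma_bar(float x, float t, int k, int q, int nt) {
    float l = (t - Tg[q]) / (Tg[q + 1] - Tg[q]);
    float u = (x - Kg[k]) / (Kg[k + 1] - Kg[k]);
    float s = 0.0f, li = 1.0f;
    for (int i = 0; i < 4; i++) {        // accumulate sum_{i,j} Cg(k,q,i,j) l^i u^j
        float uj = 1.0f;
        for (int j = 0; j < 4; j++) {
            s += Cg[k * (nt - 1) * 16 + q * 16 + i * 4 + j] * li * uj;
            uj *= u;
        }
        li *= l;
    }
    return s;
}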

Approximated local volatility

$$\sigma^2_{loc}(x,t) \approx \min\!\left(\max\!\left(\sigma^2(x,t),\,0.0001\right),\,0.5\right)$$

where

$$\sigma^2(x,t) = \frac{2\,\dfrac{\partial \bar\sigma}{\partial t} + \dfrac{\bar\sigma}{t} + 2xr_g(t)\,\dfrac{\partial \bar\sigma}{\partial x}}{x^2\left[\dfrac{\partial^2 \bar\sigma}{\partial x^2} - d_+\sqrt{t}\left(\dfrac{\partial \bar\sigma}{\partial x}\right)^2 + \dfrac{1}{\bar\sigma}\left(\dfrac{1}{x\sqrt{t}} + d_+\,\dfrac{\partial \bar\sigma}{\partial x}\right)^2\right]}$$

with

$$d_+ = \frac{\bar\sigma\sqrt{t}}{2} + \frac{\log(x_0/x) + \int_0^t r_g(u)\,du}{\bar\sigma\sqrt{t}}$$
SLIDE 9

Monte Carlo and OpenACC parallelization

Plan

• Introduction
• Local volatility challenges GPUs
• Monte Carlo and OpenACC parallelization
• Monte Carlo and CUDA parallelization
• PDE formulation & Crank-Nicolson scheme
• PDE simulation and CUDA parallelization

SLIDE 10

Monte Carlo and OpenACC parallelization

Pricing bullet option with Monte Carlo (MC)

Price for t ∈ [0, T)

$$F_t = e^{-\int_t^T r_g(u)\,du}\,\mathbb{E}\left[(S_T - K)_+\,\mathbf{1}_{\{I_T \in [P_1,P_2]\}} \,\middle|\, \mathcal{F}_t\right] \quad\text{with}\quad I_t = \sum_{T_i \le t} \mathbf{1}_{\{S_{T_i} < B\}}$$

K, T are respectively the contract's strike and maturity

0 < T1 < ... < TM < T is a predetermined schedule

The barrier B should be crossed I_T times, with {P1, ..., P2} ⊂ {0, ..., M}

(St, It) is Markov, so

$$F(t,x,j) = e^{-\int_t^T r_g(u)\,du}\,\mathbb{E}\left(X \,\middle|\, S_t = x,\ I_t = j\right), \qquad X = (S_T - K)_+\,\mathbf{1}_{\{I_T \in [P_1,P_2]\}}$$

MC procedure

Simulate F(0, x0, 0) = E(X) using a family {Xi}i≤n of i.i.d. random variables distributed as X

Strong law of large numbers:

$$\mathbb{P}\left(\lim_{n\to+\infty} \frac{X_1 + X_2 + \dots + X_n}{n} = \mathbb{E}(X)\right) = 1$$

Central limit theorem: denoting $\epsilon_n = \mathbb{E}(X) - \dfrac{X_1 + X_2 + \dots + X_n}{n}$,

$$\frac{\sqrt{n}}{\sigma}\,\epsilon_n \to G \sim \mathcal{N}(0,1), \qquad \sigma \text{ being the standard deviation of } X$$

There is a 95% chance of having: $|\epsilon_n| \le 1.96\,\dfrac{\sigma}{\sqrt{n}}$
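A small host-side sketch of this interval, assuming the sums Σ Xi and Σ Xi² have already been accumulated (for instance by the GPU reduction of slide 17):

#include <math.h>

/* Half-width of the 95% confidence interval |eps_n| <= 1.96 sigma / sqrt(n) */
float mc_ci95(float sumX, float sumX2, int n) {
    float mean = sumX / n;                 /* Monte Carlo estimate of E(X) */
    float var  = sumX2 / n - mean * mean;  /* sigma^2 estimate (biased, fine for large n) */
    return 1.96f * sqrtf(var / n);
}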

SLIDE 11

Monte Carlo and OpenACC parallelization

Discretization grid 0 = t0 < t1 < ... < tNt = T, finer than 0 < T1 < ... < TM < T, with δt = tk+1 − tk

Iterating for a path-dependent contract (x0 = 50)

For each tk, k = 0, ..., Nt − 1:

1. Random number generation (RNG) of independent normal variables Gi
2. Stock price actualization: $S^i_{t_{k+1}} = S^i_{t_k}\left(1 + r_g(t_k)\,\delta t + \sigma_{loc}(S^i_{t_k}, t_k)\sqrt{\delta t}\,G_i\right)$
3. If tk+1 = Tl with l ∈ {1, ..., M}: $I^i_{T_l} = I^i_{T_{l-1}} + \mathbf{1}_{\{S^i_{T_l} < B\}}$

At tNt

Compute the payoff X^i, then average.
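A per-path sketch of the three steps above as one GPU thread would run them; the RNG, rate and local volatility helpers are simplified stand-ins for the lab's actual device functions, and a uniform schedule Tl = l·T/M is assumed for brevity:

// Toy placeholder RNG (LCG + Box-Muller); the lab uses L'Ecuyer's CMRG instead
__device__ float rng_gaussian(unsigned int *s) {
    *s = *s * 1664525u + 1013904223u;
    float u1 = ((*s >> 8) + 0.5f) * 5.9604645e-08f;      // uniform in (0,1)
    *s = *s * 1664525u + 1013904223u;
    float u2 = ((*s >> 8) + 0.5f) * 5.9604645e-08f;
    return sqrtf(-2.0f * logf(u1)) * cospif(2.0f * u2);  // Box-Muller transform
}

__device__ float rg_of(float t) { return 0.02f; }             // placeholder rate
__device__ float sigma_loc(float s, float t) { return 0.2f; } // placeholder local vol

__device__ float one_path(unsigned int *state, float dt, int Nt, int M,
                          float B, float K, int P1, int P2) {
    float S = 50.0f;   // x0 = 50
    int   I = 0;       // I_t: number of schedule dates with S_{T_l} < B
    for (int k = 0; k < Nt; k++) {
        float tk = k * dt;
        float G  = rng_gaussian(state);                               // step 1
        S *= 1.0f + rg_of(tk) * dt + sigma_loc(S, tk) * sqrtf(dt) * G; // step 2
        if ((k + 1) % (Nt / M) == 0)     // step 3: t_{k+1} = T_l (uniform schedule)
            I += (S < B);
    }
    return fmaxf(S - K, 0.0f) * (I >= P1 && I <= P2);  // payoff X^i at t_Nt
}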

SLIDE 12

Monte Carlo and OpenACC parallelization

P. L’Ecuyer’s CMRG on GPU to generate uniformly distributed random variables

General Form of linear RNGs

Without loss of generality:

$$X_n = (A X_{n-1} + C) \bmod m = (A : C)\begin{pmatrix} X_{n-1} \\ 1 \end{pmatrix} \bmod m \tag{3}$$

Parallel-RNG from Period Splitting of One RNG

* Pierre L’Ecuyer proposed a very efficient RNG (1996), a CMRG on 32 bits: the combination of two Multiple Recursive Generators (MRG) with lag = 3 for each MRG.
* Very long period ∼ 2^185, each MRG following

$$x_n = (a_1 x_{n-1} + a_2 x_{n-2} + a_3 x_{n-3}) \bmod m$$

Pre-computations

We launch as many parallel RNGs as the number of paths

Use

We prefer local variables to store the RNG’s state vector
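A generic sketch of one lag-3 MRG step with the state kept in local variables; the coefficients and modulus are left symbolic since the concrete L'Ecuyer (1996) values come with the lab code (they are small enough for the products below to fit in 64 bits):

__device__ long long mrg_step(long long x[3], long long a1, long long a2,
                              long long a3, long long m) {
    // x_n = (a1 x_{n-1} + a2 x_{n-2} + a3 x_{n-3}) mod m, then shift the state
    long long xn = (a1 * x[2] + a2 * x[1] + a3 * x[0]) % m;
    if (xn < 0) xn += m;              // C's % may return a negative residue
    x[0] = x[1]; x[1] = x[2]; x[2] = xn;
    return xn;                        // the CMRG combines two such outputs mod m1
}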

SLIDE 13

Monte Carlo and OpenACC parallelization

Some OpenACC pragmas: #pragma acc clause

global variables

clause = declare device_resident, applied to rg, Kg, Tg and Cg. These arrays are known by all GPU functions

copying to GPU

clause = enter data copyin copies the values to the GPU

embarrassingly parallel

clause = parallel loop present (+ reduction(operation, arguments)) parallelizes the execution of the loop without data movement; reduction(operation, arguments) reduces all the private copies of arguments into one final result using operation
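A minimal sketch combining the three clauses on a toy averaging loop (array names, sizes and the per-element stand-in are illustrative, not the lab's code):

float rg[20], Kg[32], Tg[32], Cg[31 * 31 * 16];      // sizes illustrative
#pragma acc declare device_resident(rg, Kg, Tg, Cg)  // known by all GPU functions

float price(const float *payoff, int n) {
    #pragma acc enter data copyin(payoff[0:n])       // copy the values to the GPU
    float sumX = 0.0f;
    // Embarrassingly parallel loop, no data movement, '+' reduction
    #pragma acc parallel loop present(payoff) reduction(+:sumX)
    for (int i = 0; i < n; i++)
        sumX += payoff[i];            // stands in for simulating one path
    #pragma acc exit data delete(payoff[0:n])
    return sumX / n;
}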

SLIDE 14

Monte Carlo and CUDA parallelization

Plan

• Introduction
• Local volatility challenges GPUs
• Monte Carlo and OpenACC parallelization
• Monte Carlo and CUDA parallelization
• PDE formulation & Crank-Nicolson scheme
• PDE simulation and CUDA parallelization

SLIDE 15

Monte Carlo and CUDA parallelization

GPU architecture

[Figure: GPU architecture. Several streaming multiprocessors, each with registers, an L1 cache and constant, texture and shared memory (memory clocked at the processing rate); a common L2 cache (cached memory); and global memory (GPU RAM).]

Hardware/software equivalence

Streaming processor: Executes threads

Streaming multiprocessor: Executes blocks

Built-in variables

Known within functions executed on the GPU: threadIdx.x, blockIdx.x, blockDim.x, gridDim.x

Global memory

GPU RAM, allocated with cudaMalloc. Values are transferred between CPU and GPU with cudaMemcpy

Constant memory

Read only, declared globally. Data transfer with cudaMemcpyToSymbol

Shared

Declared in the kernel
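A short sketch touching the three memory types just listed (names and sizes are illustrative):

__constant__ float rg_c[20];            // constant memory: read-only, global scope

__global__ void useShared(float *out) {
    __shared__ float buf[256];          // shared memory: declared in the kernel
    buf[threadIdx.x] = rg_c[0];
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = buf[threadIdx.x];
}

int main(void) {
    float h_rg[20] = {0.02f}, h_out[256], *d_out;
    cudaMemcpyToSymbol(rg_c, h_rg, sizeof(h_rg));  // fill constant memory
    cudaMalloc(&d_out, 256 * sizeof(float));       // global memory (GPU RAM)
    useShared<<<1, 256>>>(d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost); // GPU -> CPU
    cudaFree(d_out);
    return 0;
}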

SLIDE 16

Monte Carlo and CUDA parallelization

Function declaration and calling

Standard C functions

The same as for C or C++ programming

Device functions

Called by the GPU and executed on the GPU

Declared as

__device__ void myDivFun (...) { ...; }
__device__ float myDivFun (...) { ...; }

Called simply by myDivFun(...), but only within other device functions or kernels

Kernel functions

Called by the CPU and executed on the GPU

Declared as __global__ void myKernel (...) { ...; }

Called in the standard way by myKernel<<<numBlocks, threadsPerBlock>>>(...); where

• numBlocks should take into account the number of multiprocessors
• threadsPerBlock should be equal to 128, 256, 512 or 1024

Dynamic parallelism: kernels can be called within kernels by the GPU and executed on the GPU
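A minimal sketch of these declaration and call conventions (the payoff helper and launch parameters are illustrative):

__device__ float payoff(float s, float K) { return fmaxf(s - K, 0.0f); }

__global__ void myKernel(float *out, float K, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
    if (i < n) out[i] = payoff(50.0f + i, K);        // device call from a kernel
}

// Host-side call, e.g. with threadsPerBlock = 256:
// myKernel<<<(n + 255) / 256, 256>>>(d_out, 40.0f, n);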

SLIDE 17

Monte Carlo and CUDA parallelization

Adaptation steps

Extension

Change it from MC.cpp and rng.cpp to MC.cu and rng.cu

__constant__

Declare Tg, Kg, rg and Cg using constant memory

Copying

Copy the values Tg, Kg, rg, Cg and CMRG on the GPU

__device__

Except main, MC, VarMalloc, FreeVar and parameters, define all other functions in MC.cu using __device__

Parallel

Declare the MC function as a kernel using __global__, and replace the Monte Carlo loop using threadIdx.x, blockDim.x, blockIdx.x in the MC kernel

Reduction

Perform the reduction using __shared__ memory in the MC kernel and the atomicAdd function
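A sketch of that reduction pattern with a stand-in payoff in place of the lab's Monte Carlo path; blockDim.x is assumed to be a power of two, at most 256:

__device__ float simPayoff(int i) { return 1.0f; }  // stand-in for one simulated X^i

__global__ void MC(float *sum, int n) {
    __shared__ float part[256];                     // one partial sum per thread
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    part[threadIdx.x] = (i < n) ? simPayoff(i) : 0.0f;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // intra-block tree reduction
        if (threadIdx.x < s) part[threadIdx.x] += part[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) atomicAdd(sum, part[0]);  // one global add per block
}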

SLIDE 18

Monte Carlo and CUDA parallelization

GPU timer

float Tim;
cudaEvent_t start, stop;

cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);

/*******************************************************
To compute the execution time of this part of the code
*******************************************************/

cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&Tim, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);

SLIDE 19

PDE formulation & Crank-Nicolson scheme

Plan

• Introduction
• Local volatility challenges GPUs
• Monte Carlo and OpenACC parallelization
• Monte Carlo and CUDA parallelization
• PDE formulation & Crank-Nicolson scheme
• PDE simulation and CUDA parallelization

SLIDE 20

PDE formulation & Crank-Nicolson scheme

PDE, u(t, x, j) with respect to x, Black & Scholes PDE

The price

$$F(t,x,j) = e^{-\int_t^T r_g(u)\,du}\,\mathbb{E}\left(X \,\middle|\, S_t = x,\ I_t = j\right), \qquad X = (S_T - K)_+\,\mathbf{1}_{\{I_T\in[P_1,P_2]\}}$$

For $u(t,x,j) = e^{\int_t^T r_g(u)\,du}\,F(t, e^x, j)$, we equivalently solve the PDE on u given by

$$\frac{1}{2}\,\sigma^2_{loc}(x,t)\,\frac{\partial^2 u}{\partial x^2}(t,x,j) + \mu(x,t)\,\frac{\partial u}{\partial x}(t,x,j) = -\frac{\partial u}{\partial t}(t,x,j) \tag{5}$$

$$\mu(x,t) = r_g(t) - \frac{\sigma^2_{loc}(x,t)}{2}, \qquad (x,t,j) \in\, ]0, \max(K_g)] \times ]T_k, T_{k+1}[ \times [0, P_2]$$

with T0 = 0 and TM+1 = T.

Final and boundary conditions

u(T, x, j) = max(e^x − K, 0) for any (x, j) and, heuristically, u(t, log[min(Kg)], j) = pmin = 0 and u(t, log[max(Kg)], j) = pmax = max(Kg) − K

Crank-Nicolson scheme

Denoting uk,i,j = u(tk, xi, j), σ = σloc(xi, tk) and μ = μ(xi, tk):

$$q_u u_{k,i+1} + q_m u_{k,i} + q_d u_{k,i-1} = p_u u_{k+1,i+1} + p_m u_{k+1,i} + p_d u_{k+1,i-1}$$

$$q_u = -\frac{\sigma^2\Delta t}{4\Delta x^2} - \frac{\mu\Delta t}{4\Delta x}, \quad q_m = 1 + \frac{\sigma^2\Delta t}{2\Delta x^2}, \quad q_d = -\frac{\sigma^2\Delta t}{4\Delta x^2} + \frac{\mu\Delta t}{4\Delta x}$$

$$p_u = \frac{\sigma^2\Delta t}{4\Delta x^2} + \frac{\mu\Delta t}{4\Delta x}, \quad p_m = 1 - \frac{\sigma^2\Delta t}{2\Delta x^2}, \quad p_d = \frac{\sigma^2\Delta t}{4\Delta x^2} - \frac{\mu\Delta t}{4\Delta x}$$
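These coefficients transcribe directly into C; a minimal sketch with sig = σ, mu = μ, dt = Δt and dx = Δx:

void cn_coeffs(float sig, float mu, float dt, float dx,
               float *qu, float *qm, float *qd,
               float *pu, float *pm, float *pd) {
    float A = sig * sig * dt / (4.0f * dx * dx);  /* diffusion weight */
    float B = mu * dt / (4.0f * dx);              /* drift weight */
    *qu = -A - B;  *qm = 1.0f + 2.0f * A;  *qd = -A + B;   /* implicit side */
    *pu =  A + B;  *pm = 1.0f - 2.0f * A;  *pd =  A - B;   /* explicit side */
}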

SLIDE 21

PDE formulation & Crank-Nicolson scheme

PDE, u(t, x, j) with respect to j, barrier condition

$$u_t(x,j) = \mathbb{E}\left[(S_T-K)_+\,\mathbf{1}_{\left\{\sum_{i=1}^M \mathbf{1}_{\{S_{T_i}<B\}}\in[P_1,P_2]\right\}} \,\middle|\, S_t = x,\ \sum_{T_i\le t}\mathbf{1}_{\{S_{T_i}<B\}} = j\right]$$

For t ∈ [TM, T[:

$$u_t(x,j) = \mathbb{E}\left[(S_T-K)_+ \,\middle|\, S_t = x\right]$$

For t ∈ [TM−1, TM]:

$$u_t(x,j) = \mathbb{E}\left[(S_T-K)_+\mathbf{1}_{\{S_{T_M}\ge B\}}\,\middle|\,S_t=x\right]\mathbf{1}_{\{j=P_2\}} + \mathbb{E}\left[(S_T-K)_+\mathbf{1}_{\{S_{T_M}<B\}}\,\middle|\,S_t=x\right]\mathbf{1}_{\{j=P_1-1\}} + \mathbb{E}\left[(S_T-K)_+\,\middle|\,S_t=x\right]\mathbf{1}_{\{j\in[P_1,P_2-1]\}}$$

For t ∈ [TM−k−1, TM−k], k = M−1, ..., 1 (T0 = 0):

$$u_t(x,j) = \mathbb{E}\left[u_{T_{M-k}}(S_{T_{M-k}}, P_2)\,\mathbf{1}_{\{S_{T_{M-k}}\ge B\}}\,\middle|\,S_t=x\right]\mathbf{1}_{\{j=P_2\}} + \mathbb{E}\left[u_{T_{M-k}}(S_{T_{M-k}}, p^1_k)\,\mathbf{1}_{\{S_{T_{M-k}}<B\}}\,\middle|\,S_t=x\right]\mathbf{1}_{\{j=p^1_k-1\}}$$
$$+\ \mathbb{E}\left[u_{T_{M-k}}(S_{T_{M-k}}, j)\,\mathbf{1}_{\{S_{T_{M-k}}\ge B\}} + u_{T_{M-k}}(S_{T_{M-k}}, j+1)\,\mathbf{1}_{\{S_{T_{M-k}}<B\}}\,\middle|\,S_t=x\right]\mathbf{1}_{\{j\in[p^1_k,\,P_2-1]\}}$$

with $p^1_k = \max(P_1 - k, 0)$.

SLIDE 22

PDE formulation & Crank-Nicolson scheme

[Figure-only slide.]

SLIDE 23

PDE simulation and CUDA parallelization

Plan

• Introduction
• Local volatility challenges GPUs
• Monte Carlo and OpenACC parallelization
• Monte Carlo and CUDA parallelization
• PDE formulation & Crank-Nicolson scheme
• PDE simulation and CUDA parallelization

SLIDE 24

PDE simulation and CUDA parallelization

Solve tridiagonal system for the implicit part

$$T = \begin{pmatrix} d_1 & c_1 & & & \\ a_2 & d_2 & c_2 & & \\ & a_3 & d_3 & \ddots & \\ & & \ddots & \ddots & c_{n-1} \\ & & & a_n & d_n \end{pmatrix}$$

Cyclic Reduction

[Figure: cyclic reduction on an 8-unknown tridiagonal system z1, ..., z8]

Step 1: forward reduction to a 4-unknown system involving z2, z4, z6 and z8
Step 2: forward reduction to a 2-unknown system involving z4 and z8
Step 3: solve the 2-unknown system
Step 4: backward substitution to recover the remaining 2 unknowns
Step 5: backward substitution to recover the remaining 4 unknowns
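A serial sketch of one forward-reduction step on T z = e (1-based arrays with a[1] = c[n] = 0, index 0 unused); the lab's GPU version performs the same elimination for all even i in parallel:

void cr_forward_step(float *a, float *d, float *c, float *e, int n) {
    for (int i = 2; i <= n - 1; i += 2) {   /* each even equation absorbs its neighbours */
        float k1 = a[i] / d[i - 1];         /* eliminates z_{i-1} via equation i-1 */
        float k2 = c[i] / d[i + 1];         /* eliminates z_{i+1} via equation i+1 */
        d[i] -= k1 * c[i - 1] + k2 * a[i + 1];
        e[i] -= k1 * e[i - 1] + k2 * e[i + 1];
        a[i]  = -k1 * a[i - 1];             /* new coupling with z_{i-2} */
        c[i]  = -k2 * c[i + 1];             /* new coupling with z_{i+2} */
    }
}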

SLIDE 25

PDE simulation and CUDA parallelization

Starting from a 7-unknown tridiagonal system, each reduction step (R) couples every unknown to its neighbours at distance 2, and each permutation step (P) regroups odd- and even-indexed unknowns into independent smaller tridiagonal blocks:

$$\begin{pmatrix} d_1 & c_1 & & & & & \\ c_1 & d_2 & c_2 & & & & \\ & c_2 & d_3 & c_3 & & & \\ & & c_3 & d_4 & c_4 & & \\ & & & c_4 & d_5 & c_5 & \\ & & & & c_5 & d_6 & c_6 \\ & & & & & c_6 & d_7 \end{pmatrix} \xrightarrow{(R)} \begin{pmatrix} d'_1 & & c'_2 & & & & \\ & d'_2 & & c'_3 & & & \\ c'_2 & & d'_3 & & c'_4 & & \\ & c'_3 & & d'_4 & & c'_5 & \\ & & c'_4 & & d'_5 & & c'_6 \\ & & & c'_5 & & d'_6 & \\ & & & & c'_6 & & d'_7 \end{pmatrix} \xrightarrow{(P)} \begin{pmatrix} d'_1 & c'_2 & & & & & \\ c'_2 & d'_3 & c'_4 & & & & \\ & c'_4 & d'_5 & c'_6 & & & \\ & & c'_6 & d'_7 & & & \\ & & & & d'_2 & c'_3 & \\ & & & & c'_3 & d'_4 & c'_5 \\ & & & & & c'_5 & d'_6 \end{pmatrix}$$

$$\xrightarrow{(R)} \begin{pmatrix} d''_1 & & c''_2 & & & & \\ & d''_3 & & c''_4 & & & \\ c''_2 & & d''_5 & & & & \\ & c''_4 & & d''_7 & & & \\ & & & & d''_2 & & c''_3 \\ & & & & & d''_4 & \\ & & & & c''_3 & & d''_6 \end{pmatrix} \xrightarrow{(P)} \begin{pmatrix} d''_1 & c''_2 & & & & & \\ c''_2 & d''_5 & & & & & \\ & & d''_3 & c''_4 & & & \\ & & c''_4 & d''_7 & & & \\ & & & & d''_2 & c''_3 & \\ & & & & c''_3 & d''_6 & \\ & & & & & & d''_4 \end{pmatrix}$$

SLIDE 26

PDE simulation and CUDA parallelization

Adaptation steps

Extension

Change it from PDE.cpp to PDE.cu

__constant__

Declare Tg, Kg, rg and Cg using constant memory

Copying

Copy the values Tg, Kg, rg, Cg on the GPU

__device__

Except for main, PDE_Diff, Out2In, VarMalloc, FreeVar and parameters, define all other functions in PDE.cu using __device__

Registers

Compare the CPU version of PCR to the GPU version given to you. Note the heavy use of shared memory and registers.

Parallel

Declare the PDE_Diff and Out2In functions as kernels using __global__, and replace loops over St by threadIdx.x and loops over It by blockIdx.x in the PDE_Diff and Out2In kernels as well as in the Exp device function.

SLIDE 27

PDE simulation and CUDA parallelization

GPU timer

float Tim;
cudaEvent_t start, stop;

cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);

/*******************************************************
To compute the execution time of this part of the code
*******************************************************/

cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&Tim, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);

SLIDE 28

PDE simulation and CUDA parallelization

References

• L. A. Abbas-Turki, S. Vialle, B. Lapeyre, and P. Mercier, “Pricing derivatives on graphics processing units using Monte Carlo simulation”, Concurrency and Computation: Practice and Experience, vol. 26(9), pp. 1679–1697, 2014.

• L. A. Abbas-Turki and S. Graillat, “Resolving small random symmetric linear systems on graphics processing units”, The Journal of Supercomputing, vol. 73(4), pp. 1360–1386, 2017.

• L. B. G. Andersen and R. Brotherton-Ratcliffe, “The equity option volatility smile: an implicit finite difference approach”, Journal of Computational Finance, vol. 1(2), pp. 5–37, 1997.

• J. Gatheral and A. Jacquier, “Arbitrage-free SVI volatility surfaces”, Quantitative Finance, vol. 14(1), pp. 59–71, 2013.

• P. L’Ecuyer, “Combined multiple recursive random number generators”, Operations Research, vol. 44(5), pp. 816–822, 1996.
