Step by step implementation and optimization of simulations in quantitative finance
Lokman Abbas-Turki
UPMC, LPMA
10-12 October 2017
Lokman (UPMC, LPMA) GTC Europe Lab 1 / 28
Plan
Introduction
Local volatility challenges GPUs
Monte Carlo and OpenACC parallelization
Monte Carlo and CUDA parallelization
PDE formulation & Crank-Nicolson scheme
PDE simulation and CUDA parallelization
Lokman (UPMC, LPMA) GTC Europe Lab 2 / 28
Introduction
Plan
Introduction
Local volatility challenges GPUs
Monte Carlo and OpenACC parallelization
Monte Carlo and CUDA parallelization
PDE formulation & Crank-Nicolson scheme
PDE simulation and CUDA parallelization
Lokman (UPMC, LPMA) GTC Europe Lab 3 / 28
Introduction
What makes bankers change their minds?
◮ The CVA (Credit Valuation Adjustment) or XVA (X = C, D, F, K, M Valuation Adjustment) are tipping-point applications.
◮ FRTB (Fundamental Review of the Trading Book) is the other important application, with a deadline in 2019.
◮ Electronic trading and deep learning.
◮ Lloyd Blankfein declared about Goldman Sachs: "We are a technology firm".
HPC in banks
◮ From distribution to parallelization.
◮ From small to big nodes.
◮ The efficiency of GPUs has become undeniable.
◮ Use of .NET, C, C++ and C#.
Remaining challenges and fears
◮ Code management.
◮ Possible conflicts within quant teams.
◮ Can we extend the results obtained for toy models to more general models?
Lokman (UPMC, LPMA) GTC Europe Lab 4 / 28
Local volatility challenges GPUs
Plan
Introduction
Local volatility challenges GPUs
Monte Carlo and OpenACC parallelization
Monte Carlo and CUDA parallelization
PDE formulation & Crank-Nicolson scheme
PDE simulation and CUDA parallelization
Lokman (UPMC, LPMA) GTC Europe Lab 5 / 28
Local volatility challenges GPUs
Local volatility from implied volatility
First array: rg
dSt = St rg(t) dt + St σloc(St, t) dWt,  S0 = x0
◮ S is the stock price process and x0 is the spot price
◮ W is a Brownian motion with W0 = 0
◮ rg is the risk-free rate, assumed piecewise constant
◮ σloc(x, t) is a local volatility function from R+* × R+ to R+*
Dupire Equation
Given a family (C(K, T))_{K,T} of call prices with strike K and maturity T,

σ²loc(K, T) = 2 [ ∂C/∂T + K rg(T) ∂C/∂K ] / [ K² ∂²C/∂K² ]
From implied to local
Using the Black & Scholes implied volatility σimp(x, t), Andersen and Brotherton-Ratcliffe (1997) showed that

σ²loc(x, t) = [ 2 ∂σimp/∂t + σimp/t + 2 x rg(t) ∂σimp/∂x ]
              / ( x² [ ∂²σimp/∂x² − d+ √t (∂σimp/∂x)² + (1/σimp) (1/(x√t) + d+ ∂σimp/∂x)² ] )

with d+ = ( log(x0/x) + ∫_0^t rg(u)du ) / (σimp √t) + (σimp √t)/2
(UPMC, LPMA) GTC Europe Lab 6 / 28
Local volatility challenges GPUs
SVI to simulate implied volatility
+2 arrays: Tg, Kg
For (x, t) ∈ Kg × Tg, we assume that the observed implied volatility is σimp(x, t) = √(w(x, t)/t), where the cumulative variance is parameterized by

w(x, t) = (θt/2) [ 1 + ρ ϕ(θt) k(x, t) + √( (ϕ(θt) k(x, t) + ρ)² + (1 − ρ²) ) ],   k(x, t) = log(x/x0) − ∫_0^t rg(u)du   (1)

with ϕ(θt) = η / ( θt^γ (1 + θt)^(1−γ) ) and θt = a² t + b(1 − e^(−λt))

All parameters of (1) are discussed in Gatheral & Jacquier (2013)

Interpolation on ]0, max(Kg)] × ]0, max(Tg)]
We compute some values of σimp(x, t) on the grid Kg × Tg. Then we interpolate these values to obtain σ̄(x, t) defined on ]0, max(Kg)] × ]0, max(Tg)], and we assume σimp(x, t) ≈ σ̄(x, t) when (x, t) ∈ ]0, max(Kg)] × ]0, max(Tg)]
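As an illustration, a small C sketch of (1) with a flat rate and hypothetical parameter values for a², b, λ, ρ, η and γ (the calibrated values used in the lab are not reproduced here):

/* Minimal sketch of the SSVI cumulative variance (1) and the implied
   volatility sigma_imp = sqrt(w/t); all parameter values are hypothetical. */
#include <math.h>
#include <stdio.h>

static const double a2 = 0.04, b = 0.1, lambda = 1.0;   /* theta_t parameters */
static const double rho = -0.3, eta = 0.5, gam = 0.4;   /* SSVI parameters    */
static const double r = 0.02, x0 = 50.0;                /* flat rate, spot    */

double w_ssvi(double x, double t) {
    double theta = a2 * t + b * (1.0 - exp(-lambda * t));
    double phi   = eta / (pow(theta, gam) * pow(1.0 + theta, 1.0 - gam));
    double k     = log(x / x0) - r * t;            /* forward log-moneyness */
    double root  = sqrt((phi * k + rho) * (phi * k + rho) + 1.0 - rho * rho);
    return 0.5 * theta * (1.0 + rho * phi * k + root);
}

double sigma_imp(double x, double t) { return sqrt(w_ssvi(x, t) / t); }

int main(void) {
    printf("sigma_imp(55, 1) = %f\n", sigma_imp(55.0, 1.0));
    return 0;
}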
Lokman (UPMC, LPMA) GTC Europe Lab 7 / 28
Local volatility challenges GPUs
Bicubic interpolation for implied volatility
Let k, q be such that (x, t) ∈ ]Kg[k], Kg[k+1]] × ]Tg[q], Tg[q+1]]. Then

σ̄(x, t) = Σ_{i=0}^{3} Σ_{j=0}^{3} Cg(k, q, i, j) l^i u^j,  where l = (t − Tg[q]) / (Tg[q+1] − Tg[q]) and u = (x − Kg[k]) / (Kg[k+1] − Kg[k])

Fourth array: Cg
Cg(k, q, i, j) = Cg[k*(nt−1)*16 + q*16 + i*4 + j] with (k, q, i, j) ∈ {0, ..., nk−1} × {0, ..., nt−1} × {0, ..., 3} × {0, ..., 3}, where nk and nt are the sizes of Kg and Tg
Approximated local volatility
σ²loc(x, t) ≈ min(max(σ²(x, t), 0.0001), 0.5) where

σ²(x, t) = [ 2 ∂σ̄/∂t + σ̄/t + 2 x rg(t) ∂σ̄/∂x ]
           / ( x² [ ∂²σ̄/∂x² − d+ √t (∂σ̄/∂x)² + (1/σ̄) (1/(x√t) + d+ ∂σ̄/∂x)² ] )

with d+ = ( log(x0/x) + ∫_0^t rg(u)du ) / (σ̄ √t) + (σ̄ √t)/2
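A minimal C sketch of the bicubic evaluation with the flattened indexing above, assuming the interval indices k and q have already been located (Kg[k] < x ≤ Kg[k+1], Tg[q] < t ≤ Tg[q+1]):

/* Minimal sketch: evaluate sigma_bar(x,t) from the flattened coefficient
   array Cg using the indexing Cg[k*(nt-1)*16 + q*16 + i*4 + j].          */
double sigma_bar(double x, double t,
                 const double *Kg, const double *Tg,
                 const double *Cg, int nt, int k, int q)
{
    double u = (x - Kg[k]) / (Kg[k + 1] - Kg[k]);
    double l = (t - Tg[q]) / (Tg[q + 1] - Tg[q]);
    double s = 0.0, li = 1.0;
    for (int i = 0; i <= 3; i++) {            /* double sum over l^i u^j */
        double uj = 1.0;
        for (int j = 0; j <= 3; j++) {
            s  += Cg[k * (nt - 1) * 16 + q * 16 + i * 4 + j] * li * uj;
            uj *= u;
        }
        li *= l;
    }
    return s;
}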
(UPMC, LPMA) GTC Europe Lab 8 / 28
Monte Carlo and OpenACC parallelization
Plan
Introduction
Local volatility challenges GPUs
Monte Carlo and OpenACC parallelization
Monte Carlo and CUDA parallelization
PDE formulation & Crank-Nicolson scheme
PDE simulation and CUDA parallelization
Lokman (UPMC, LPMA) GTC Europe Lab 9 / 28
Monte Carlo and OpenACC parallelization
Pricing a bullet option with Monte Carlo (MC)
Price for t ∈ [0, T)

Ft = e^{−∫_t^T rg(u)du} E[ (ST − K)+ 1_{IT ∈ [P1, P2]} | St, It ],  with IT = Σ_{i=1}^{M} 1_{STi < B}

◮ K, T are respectively the contract's strike and maturity
◮ 0 < T1 < ... < TM < T is a predetermined schedule
◮ The barrier B must be crossed IT ∈ {P1, ..., P2} times, with {P1, ..., P2} ⊂ {0, ..., M}

(St, It) is Markov
F(t, x, j) = e^{−∫_t^T rg(u)du} E(X | St = x, It = j)

MC procedure
Simulate F(0, x0, 0) = E(X) using a family {Xi}_{i≤n} of i.i.d. copies of X
◮ Strong law of large numbers: P( lim_{n→+∞} (X1 + X2 + ... + Xn)/n = E(X) ) = 1
◮ Central limit theorem: denoting εn = E(X) − (X1 + X2 + ... + Xn)/n, (√n/σ) εn → G ∼ N(0, 1), where σ is the standard deviation of X
◮ There is a 95% chance of having |εn| ≤ 1.96 σ/√n
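For instance, a small C sketch of the estimator and of the 95% half-width 1.96 σ/√n, assuming the i.i.d. samples are already stored in an array:

/* Minimal sketch: Monte Carlo mean and 95% half-width 1.96*sigma/sqrt(n). */
#include <math.h>

void mc_estimate(const double *X, int n, double *mean, double *halfWidth)
{
    double s = 0.0, s2 = 0.0;
    for (int i = 0; i < n; i++) { s += X[i]; s2 += X[i] * X[i]; }
    *mean = s / n;
    double var = s2 / n - (*mean) * (*mean);   /* biased estimator, fine for large n */
    *halfWidth = 1.96 * sqrt(var / n);
}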
Lokman (UPMC, LPMA) GTC Europe Lab 10 / 28
Monte Carlo and OpenACC parallelization
Discretization grid 0 = t0 < t1 < ... < tNt = T, finer than 0 < T1 < ... < TM < T, with δt = √(tk+1 − tk)

Iterating for a path-dependent contract (x0 = 50)
For each tk, k = 0, ..., Nt − 1:
1. Random number generation (RNG) of independent Normal variables Gi
2. Stock price actualization: S^i_{tk+1} = S^i_{tk} (1 + rg(tk) δt² + σloc(S^i_{tk}, tk) δt Gi)
3. If tk+1 = Tl with l ∈ {1, ..., M}: I^i_{Tl} = I^i_{Tl−1} + 1_{S^i_{Tl} < B}

At tNt
Compute the payoff X^i, then average
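A single-path C sketch of these three steps, keeping the deck's convention δt = √(tk+1 − tk); the helpers gaussian(), r_g(), sigma_loc() and isScheduleDate() are hypothetical names for routines assumed to exist elsewhere in the code:

#include <math.h>

/* Helpers assumed to exist elsewhere (hypothetical names). */
extern double gaussian(void);                 /* one N(0,1) draw           */
extern double r_g(double t);                  /* piecewise-constant rate   */
extern double sigma_loc(double s, double t);  /* local volatility          */
extern int    isScheduleDate(int k);          /* is t_k one of the T_l ?   */

/* Minimal sketch of one path of the Euler scheme with barrier counting. */
void simulate_path(double x0, double T, int Nt, double B,
                   double *ST, int *IT)
{
    double dt = sqrt(T / Nt);     /* deck's convention: dt = sqrt(t_{k+1} - t_k) */
    double S  = x0;
    int    I  = 0;
    for (int k = 0; k < Nt; k++) {
        double t = (double)k * dt * dt;
        double G = gaussian();                                    /* step 1 */
        S *= 1.0 + r_g(t) * dt * dt + sigma_loc(S, t) * dt * G;   /* step 2 */
        if (isScheduleDate(k + 1)) I += (S < B);                  /* step 3 */
    }
    *ST = S;   /* the payoff X^i is then computed from S_T and I_T and averaged */
    *IT = I;
}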
Lokman (UPMC, LPMA) GTC Europe Lab 11 / 28
Monte Carlo and OpenACC parallelization
Generation of uniformly distributed random variables
General Form of linear RNGs
Without loss of generality:
Xn = (A Xn−1 + C) mod m = (A : C) (Xn−1; 1) mod m   (3)

Parallel-RNG from Period Splitting of One RNG
* Pierre L'Ecuyer proposed a very efficient RNG (1996): a CMRG (Combined Multiple Recursive Generator) with lag 3 for each MRG,
xn = (a1 xn−1 + a2 xn−2 + a3 xn−3) mod m
* Very long period ∼ 2^185

Pre-computations
We launch as many parallel RNGs as the number of paths

Use
We prefer local variables to store the RNG's state vector
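As an illustration of period splitting for the linear form (3), here is a C sketch that jumps ahead in a toy linear congruential generator by exponentiating the augmented map (A : C) modulo m; the constants are hypothetical and this is not L'Ecuyer's CMRG.

/* Minimal sketch (not L'Ecuyer's CMRG): jump-ahead for a toy LCG
   X_n = (A*X_{n-1} + C) mod m, using the augmented form of equation (3).
   Composing the affine map with itself p times (binary exponentiation)
   lets each path start its own stream p draws apart.                    */
#include <stdint.h>
#include <stdio.h>

#define M_MOD 2147483647ULL   /* hypothetical modulus (2^31 - 1) */
#define A_MUL 48271ULL        /* hypothetical multiplier         */
#define C_INC 0ULL            /* hypothetical increment          */

typedef struct { uint64_t a, c; } Aff;      /* x -> (a*x + c) mod m */

static Aff compose(Aff f, Aff g) {          /* f o g */
    Aff h;
    h.a = (f.a * g.a) % M_MOD;
    h.c = (f.a * g.c + f.c) % M_MOD;
    return h;
}

static Aff power(Aff f, uint64_t p) {       /* f^p by binary exponentiation */
    Aff r = {1ULL, 0ULL};                   /* identity map */
    while (p) {
        if (p & 1ULL) r = compose(r, f);
        f = compose(f, f);
        p >>= 1;
    }
    return r;
}

int main(void) {
    Aff step = {A_MUL, C_INC};
    uint64_t x0 = 12345ULL, nPaths = 4, skip = 1000000ULL;
    for (uint64_t i = 0; i < nPaths; i++) {
        Aff jump = power(step, i * skip);                /* skip i*10^6 draws */
        uint64_t seed = (jump.a * x0 + jump.c) % M_MOD;
        printf("path %llu starts from state %llu\n",
               (unsigned long long)i, (unsigned long long)seed);
    }
    return 0;
}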
Lokman (UPMC, LPMA) GTC Europe Lab 12 / 28
Monte Carlo and OpenACC parallelization
Some OpenACC pragmas: #pragma acc clause
global variables
clause = declare device_resident, applied to rg, Kg, Tg and Cg. These arrays are known by all GPU functions
copying to GPU
clause = enter data copyin copies the values to the GPU
embarrassingly parallel
clause = parallel loop present (+reduction(operation,arguments)) parallelizes the execution of the loop without data movement. reduction(operation,arguments) reduces all the private copies of arguments into one final result using operation
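A minimal sketch of these pragmas on a hypothetical payoff average (not the lab's MC.cpp): the arrays are copied once with enter data copyin, and the loop is parallelized with present and a sum reduction.

/* Minimal OpenACC sketch: copyin once, then an embarrassingly parallel
   loop with a reduction; the payoff and the sizes are hypothetical.    */
#include <stdio.h>
#include <stdlib.h>

#define NR 10
float rg[NR];                        /* piecewise-constant rates (host copy) */

int main(void) {
    int n = 1 << 20;
    float *S = (float*) malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) S[i] = 50.0f + (i % 100) * 0.1f;
    for (int i = 0; i < NR; i++) rg[i] = 0.02f;

    /* copying to GPU: arrays become visible on the device */
    #pragma acc enter data copyin(rg[0:NR], S[0:n])

    /* embarrassingly parallel loop, no data movement, one reduced sum */
    float sum = 0.0f;
    #pragma acc parallel loop present(rg[0:NR], S[0:n]) reduction(+:sum)
    for (int i = 0; i < n; i++) {
        float payoff = S[i] - 100.0f;            /* hypothetical payoff */
        sum += (payoff > 0.0f ? payoff : 0.0f);
    }

    printf("MC estimate = %f\n", sum / n);
    #pragma acc exit data delete(rg[0:NR], S[0:n])
    free(S);
    return 0;
}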
Lokman (UPMC, LPMA) GTC Europe Lab 13 / 28
Monte Carlo and CUDA parallelization
Plan
Introduction
Local volatility challenges GPUs
Monte Carlo and OpenACC parallelization
Monte Carlo and CUDA parallelization
PDE formulation & Crank-Nicolson scheme
PDE simulation and CUDA parallelization
Lokman (UPMC, LPMA) GTC Europe Lab 14 / 28
Monte Carlo and CUDA parallelization
GPU architecture
[Figure: GPU architecture — several Streaming Multiprocessors, each with registers, L1 cache, constant, texture and shared memory; an L2 cache; global memory (GPU RAM). Legend: memory clocked at the processing rate, processing units, cached memory, GPU RAM.]
Hardware/software equivalence
◮ Streaming processor: executes threads
◮ Streaming multiprocessor: executes blocks
Built-in variables
Known within functions executed on the GPU: threadIdx.x, blockIdx.x, blockDim.x, gridDim.x
Global memory
GPU RAM, allocated thanks to cudaMalloc. Transfer values from GPU/CPU to CPU/GPU thanks to cudaMemcpy
Constant memory
Read only, declared globally. Data transfer with cudaMemcpyToSymbol
Shared
Declared in the kernel
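A minimal CUDA sketch that exercises the three memory spaces above; the array sizes, names and kernel are hypothetical, not the lab's variables:

/* Minimal sketch: constant memory + global memory + a kernel using shared
   memory.  Sizes and names are hypothetical.                              */
#include <cstdio>

#define NT 20
__constant__ float Tg_c[NT];          /* constant memory, read-only on GPU */

__global__ void firstMaturity(const float *t, float *out, int n)
{
    __shared__ float T0;                           /* shared, per block    */
    if (threadIdx.x == 0) T0 = Tg_c[0];
    __syncthreads();
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n) out[gid] = (t[gid] < T0) ? 0.0f : 1.0f;
}

int main()
{
    int n = 1024;
    float hT[NT], hIn[1024], hOut[1024], *dIn, *dOut;
    for (int i = 0; i < NT; i++) hT[i] = 0.5f * (i + 1);
    for (int i = 0; i < n; i++)  hIn[i] = 0.01f * i;

    cudaMemcpyToSymbol(Tg_c, hT, NT * sizeof(float));   /* to constant mem */
    cudaMalloc(&dIn,  n * sizeof(float));               /* global memory   */
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIn, hIn, n * sizeof(float), cudaMemcpyHostToDevice);

    firstMaturity<<<(n + 255) / 256, 256>>>(dIn, dOut, n);

    cudaMemcpy(hOut, dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("out[100] = %f\n", hOut[100]);
    cudaFree(dIn); cudaFree(dOut);
    return 0;
}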
Lokman (UPMC, LPMA) GTC Europe Lab 15 / 28
Monte Carlo and CUDA parallelization
Function declaration and calling
Standard C functions
The same as for C or C++ programming
Device functions
◮ Called by the GPU and executed on the GPU
◮ Declared as
__device__ void myDivFun (...) { ...; }
__device__ float myDivFun (...) { ...; }
◮ Called simply by myDivFun(...), but only within other device functions
Kernel functions
◮ Called by the CPU and executed on the GPU
◮ Declared as __global__ void myKernel (...) { ...; }
◮ Called by myKernel<<<numBlocks, threadsPerBlock>>>(...); where numBlocks should take into account the number of multiprocessors and threadsPerBlock should be equal to 128, 256, 512 or 1024
◮ Dynamic parallelism: kernels can be called within kernels by the GPU and executed on the GPU
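To make the calling conventions concrete, here is a small hypothetical example (not the lab's code) of a device function called from a kernel, with the launch configuration shown as a comment:

/* Minimal sketch: a __device__ function called from a __global__ kernel. */
__device__ float callPayoff(float s, float k)      /* device function      */
{
    return fmaxf(s - k, 0.0f);
}

__global__ void payoffKernel(const float *S, float *P, float K, int n)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;   /* built-in variables */
    if (gid < n) P[gid] = callPayoff(S[gid], K);       /* device call        */
}

/* Host side: one thread per element, threadsPerBlock = 256.
   payoffKernel<<<(n + 255) / 256, 256>>>(dS, dP, 100.0f, n);              */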
Lokman (UPMC, LPMA) GTC Europe Lab 16 / 28
Monte Carlo and CUDA parallelization
Adaptation steps
Extension
Change the file extensions from MC.cpp and rng.cpp to MC.cu and rng.cu
__constant__
Declare Tg, Kg, rg and Cg using constant memory
Copying
Copy the values Tg, Kg, rg, Cg and CMRG on the GPU
__device__
Except main, MC, VarMalloc, FreeVar and parameters, define all other functions in MC.cu using __device__
Parallel
Declare MC function as a kernel using __global__ + Replace the Monte Carlo loop using threadIdx.x, blockDim.x, blockIdx.x in MC kernel
Reduction
Perform the reduction using __shared__ memory in kernel MC and atomicAdd function
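A minimal sketch of that reduction step, assuming one payoff per thread and 256 threads per block (this is an illustration, not the MC kernel you are asked to write):

/* Minimal sketch of the reduction: each thread holds one simulated payoff,
   the block reduces it in shared memory, and one atomicAdd per block
   accumulates the total.                                                  */
__global__ void sumPayoffs(const float *payoff, float *result, int n)
{
    __shared__ float buf[256];                 /* blockDim.x assumed 256 */
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    buf[tid] = (gid < n) ? payoff[gid] : 0.0f;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(result, buf[0]);   /* one add per block */
}

It would be launched, for instance, as sumPayoffs<<<(n + 255)/256, 256>>>(dPayoff, dResult, n) after zeroing *dResult with cudaMemset.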
Lokman (UPMC, LPMA) GTC Europe Lab 17 / 28
Monte Carlo and CUDA parallelization
GPU timer
float Tim;
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
/*******************************************************
 To compute the execution time of this part of the code
*******************************************************/
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&Tim, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);
Lokman (UPMC, LPMA) GTC Europe Lab 18 / 28
PDE formulation & Crank-Nicolson scheme
Plan
Introduction
Local volatility challenges GPUs
Monte Carlo and OpenACC parallelization
Monte Carlo and CUDA parallelization
PDE formulation & Crank-Nicolson scheme
PDE simulation and CUDA parallelization
Lokman (UPMC, LPMA) GTC Europe Lab 19 / 28
PDE formulation & Crank-Nicolson scheme
PDE, u(t, x, j) with respect to x, Black & Scholes PDE
The price
F(t, x, j) = e^{−∫_t^T rg(u)du} E(X | St = x, It = j)

For u(t, x, j) = e^{∫_t^T rg(u)du} F(t, e^x, j), we equivalently solve the PDE satisfied by u:

(1/2) σ²loc(x, t) ∂²u/∂x²(t, x, j) + µ(x, t) ∂u/∂x(t, x, j) = −∂u/∂t(t, x, j)   (5)

µ(x, t) = rg(t) − σ²loc(x, t)/2,  (x, t, j) ∈ ]0, max(Kg)] × ]Tk, Tk+1[ × [0, P2] with T0 = 0 and TM+1 = T

Final and boundary conditions
u(T, x, j) = max(e^x − K, 0) for any (x, j), and heuristically u(t, log[min(Kg)], j) = pmin = 0 and u(t, log[max(Kg)], j) = pmax = max(Kg) − K
Crank-Nicolson scheme
Denoting uk,i = u(tk, xi, j), σ = σloc(xi, tk) and µ = µ(xi, tk):

qu uk,i+1 + qm uk,i + qd uk,i−1 = pu uk+1,i+1 + pm uk+1,i + pd uk+1,i−1

qu = −σ²∆t/(4∆x²) − µ∆t/(4∆x),  qm = 1 + σ²∆t/(2∆x²),  qd = −σ²∆t/(4∆x²) + µ∆t/(4∆x)
pu =  σ²∆t/(4∆x²) + µ∆t/(4∆x),  pm = 1 − σ²∆t/(2∆x²),  pd =  σ²∆t/(4∆x²) − µ∆t/(4∆x)
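As a small worked example, a C sketch that evaluates the six coefficients above at one grid node, assuming σ = σloc(xi, tk) and µ = µ(xi, tk) have already been computed:

/* Minimal sketch: Crank-Nicolson coefficients at one grid node (x_i, t_k). */
typedef struct { double qu, qm, qd, pu, pm, pd; } CNCoef;

CNCoef cn_coefficients(double sigma, double mu, double dt, double dx)
{
    CNCoef c;
    double A = sigma * sigma * dt / (4.0 * dx * dx);   /* diffusion weight  */
    double B = mu * dt / (4.0 * dx);                   /* convection weight */
    c.qu = -A - B;  c.qm = 1.0 + 2.0 * A;  c.qd = -A + B;
    c.pu =  A + B;  c.pm = 1.0 - 2.0 * A;  c.pd =  A - B;
    return c;
}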
Lokman (UPMC, LPMA) GTC Europe Lab 20 / 28
PDE formulation & Crank-Nicolson scheme
PDE, u(t, x, j) with respect to j, barrier condition
ut(x, j) = E[ (ST − K)+ 1_{Σ_{i=1}^{M} 1_{STi < B} ∈ [P1, P2]} | St = x, Σ_{Ti ≤ t} 1_{STi < B} = j ]

t ∈ [TM, T[
ut(x, j) = E[ (ST − K)+ | St = x ]

t ∈ [TM−1, TM]
ut(x, j) = E[ (ST − K)+ 1_{STM ≥ B} | St = x ] 1_{j = P2}
         + E[ (ST − K)+ 1_{STM < B} | St = x ] 1_{j = P1 − 1}
         + E[ (ST − K)+ | St = x ] 1_{j ∈ [P1, P2 − 1]}

t ∈ [TM−k−1, TM−k], k = M − 1, ..., 1 (T0 = 0)
ut(x, j) = E[ uTM−k(STM−k, P2) 1_{STM−k ≥ B} | St = x ] 1_{j = P2}
         + E[ uTM−k(STM−k, p1_k) 1_{STM−k < B} | St = x ] 1_{j = p1_k − 1}
         + E[ uTM−k(STM−k, j) 1_{STM−k ≥ B} + uTM−k(STM−k, j + 1) 1_{STM−k < B} | St = x ] 1_{j ∈ [p1_k, P2 − 1]}

with p1_k = max(P1 − k, 0)
Lokman (UPMC, LPMA) GTC Europe Lab 21 / 28
PDE formulation & Crank-Nicolson scheme
Lokman (UPMC, LPMA) GTC Europe Lab 22 / 28
PDE simulation and CUDA parallelization
Plan
Introduction
Local volatility challenges GPUs
Monte Carlo and OpenACC parallelization
Monte Carlo and CUDA parallelization
PDE formulation & Crank-Nicolson scheme
PDE simulation and CUDA parallelization
Lokman (UPMC, LPMA) GTC Europe Lab 23 / 28
PDE simulation and CUDA parallelization
Solve tridiagonal system for the implicit part
T = [ d1  c1                      ]
    [ a2  d2  c2                  ]
    [     a3  d3  c3              ]
    [          .    .    .        ]
    [               .    .   cn−1 ]
    [                    an   dn  ]
Cyclic Reduction
[Figure: reduction tree on equations e1, ..., e8 with unknowns z1, ..., z8]
Step 1: forward reduction to a 4-unknown system involving z2, z4, z6 and z8
Step 2: forward reduction to a 2-unknown system involving z4 and z8
Step 3: solve the 2-unknown system
Step 4: backward substitution to solve the remaining 2 unknowns
Step 5: backward substitution to solve the remaining 4 unknowns
Lokman (UPMC, LPMA) GTC Europe Lab 24 / 28
PDE simulation and CUDA parallelization
Reduction (R) and permutation (P) steps used by PCR on a 7×7 symmetric tridiagonal matrix (dots denote zeros):

[ d1 c1                ]        [ d'1  .  c'2  .   .   .   .  ]
[ c1 d2 c2             ]        [  .  d'2  .  c'3  .   .   .  ]
[    c2 d3 c3          ]  (R)   [ c'2  .  d'3  .  c'4  .   .  ]
[       c3 d4 c4       ]  -->   [  .  c'3  .  d'4  .  c'5  .  ]
[          c4 d5 c5    ]        [  .   .  c'4  .  d'5  .  c'6 ]
[             c5 d6 c6 ]        [  .   .   .  c'5  .  d'6  .  ]
[                c6 d7 ]        [  .   .   .   .  c'6  .  d'7 ]

(P) reorder the unknowns (odd indices first, then even): two independent tridiagonal systems

[ d'1 c'2         ]        [ d'2 c'3     ]
[ c'2 d'3 c'4     ]  and   [ c'3 d'4 c'5 ]
[     c'4 d'5 c'6 ]        [     c'5 d'6 ]
[         c'6 d'7 ]

(R) inside each system, every row again couples only to rows two positions away

[ d''1  .   c''2  .   ]        [ d''2  .   c''3 ]
[  .   d''3  .   c''4 ]  and   [  .   d''4  .   ]
[ c''2  .   d''5  .   ]        [ c''3  .   d''6 ]
[  .   c''4  .   d''7 ]

(P) reorder once more: independent 2×2 and 1×1 systems that are solved directly

[ d''1 c''2 ]   [ d''3 c''4 ]   [ d''2 c''3 ]   [ d''4 ]
[ c''2 d''5 ]   [ c''4 d''7 ]   [ c''3 d''6 ]
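A sequential C sketch of this reduce-then-reorder idea (parallel cyclic reduction written with an explicit stride loop; on the GPU each row update of the inner loop is handled by one thread). The names and the double-buffering layout are illustrative, not the PCR code distributed with the lab.

/* Minimal sequential sketch of parallel cyclic reduction (PCR) for
   a[i]*z[i-s] + d[i]*z[i] + c[i]*z[i+s] = b[i], stride s = 1 initially.
   Requires a[0] = 0 and c[n-1] = 0; a, d, c, b are overwritten.        */
#include <stdlib.h>
#include <string.h>

void pcr_solve(double *a, double *d, double *c, double *b, double *z, int n)
{
    double *na = malloc(n * sizeof *na), *nd = malloc(n * sizeof *nd);
    double *nc = malloc(n * sizeof *nc), *nb = malloc(n * sizeof *nb);
    for (int s = 1; s < n; s *= 2) {               /* (R)+(P) sweeps          */
        for (int i = 0; i < n; i++) {              /* one GPU thread per row  */
            double alpha = (i - s >= 0) ? -a[i] / d[i - s] : 0.0;
            double beta  = (i + s <  n) ? -c[i] / d[i + s] : 0.0;
            nd[i] = d[i] + (i - s >= 0 ? alpha * c[i - s] : 0.0)
                         + (i + s <  n ? beta  * a[i + s] : 0.0);
            na[i] = (i - s >= 0) ? alpha * a[i - s] : 0.0;
            nc[i] = (i + s <  n) ? beta  * c[i + s] : 0.0;
            nb[i] = b[i] + (i - s >= 0 ? alpha * b[i - s] : 0.0)
                         + (i + s <  n ? beta  * b[i + s] : 0.0);
        }
        memcpy(a, na, n * sizeof *a); memcpy(d, nd, n * sizeof *d);
        memcpy(c, nc, n * sizeof *c); memcpy(b, nb, n * sizeof *b);
    }
    for (int i = 0; i < n; i++) z[i] = b[i] / d[i];   /* fully decoupled */
    free(na); free(nd); free(nc); free(nb);
}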
Lokman (UPMC, LPMA) GTC Europe Lab 25 / 28
PDE simulation and CUDA parallelization
Adaptation steps
Extension
Change the file extension from PDE.cpp to PDE.cu
__constant__
Declare Tg, Kg, rg and Cg using constant memory
Copying
Copy the values Tg, Kg, rg, Cg on the GPU
__device__
Except for main, PDE_Diff, Out2In, VarMalloc, FreeVar and parameters, define all other functions in PDE.cu using __device__
Registers
Compare the CPU version of PCR to the GPU version given to you. Note the heavy use of shared memory and registers.
Parallel
Declare PDE_Diff and Out2In functions as kernels using __global__ + Replace loops related to St by threadIdx.x and loops related to It by blockIdx.x in PDE_Diff and Out2In kernels as well as in Exp device function.
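A hypothetical kernel skeleton (not the PDE.cu distributed in the lab) showing the mapping: one block per value of the counter It (blockIdx.x) and one thread per point of the St grid (threadIdx.x); the Crank-Nicolson weights are only suggested here.

/* Minimal sketch: space index i <- threadIdx.x, barrier count j <- blockIdx.x.
   u is stored as u[j*nx + i]; the actual update is only suggested.           */
__global__ void timeStepSketch(float *uNew, const float *uOld, int nx)
{
    int i = threadIdx.x;     /* grid point in the log-price direction */
    int j = blockIdx.x;      /* value of the counter I_t              */

    if (i > 0 && i < nx - 1) {
        /* explicit part of Crank-Nicolson: p_u, p_m, p_d applied to uOld */
        float rhs = 0.25f * uOld[j * nx + i + 1]
                  + 0.50f * uOld[j * nx + i]          /* illustrative weights */
                  + 0.25f * uOld[j * nx + i - 1];
        uNew[j * nx + i] = rhs;  /* implicit part is then solved by PCR */
    }
}

/* Host side, with P2+1 counter values and nx <= 1024 grid points:
   timeStepSketch<<<P2 + 1, nx>>>(dUNew, dUOld, nx);                         */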
Lokman (UPMC, LPMA) GTC Europe Lab 26 / 28
PDE simulation and CUDA parallelization
GPU timer
float Tim;
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
/*******************************************************
 To compute the execution time of this part of the code
*******************************************************/
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&Tim, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);
Lokman (UPMC, LPMA) GTC Europe Lab 27 / 28
PDE simulation and CUDA parallelization
References
◮ "… derivatives on graphics processing units using Monte Carlo simulation", Concurrency and Computation: Practice and Experience, vol. 26(9), pp. 1679–1697, 2014.
◮ "… linear systems on graphics processing units", The Journal of Supercomputing, vol. 73(4), pp. 1360–1386, 2017.
◮ L. Andersen and R. Brotherton-Ratcliffe, "The equity option volatility smile: an implicit finite difference approach", Journal of Computational Finance, vol. 1(2), pp. 5–37, 1997.
◮ J. Gatheral and A. Jacquier, "Arbitrage-free SVI volatility surfaces", Quantitative Finance, vol. 14(1), pp. 59–71, 2013.
◮ P. L'Ecuyer, "Combined multiple recursive random number generators", Operations Research, vol. 44(5), pp. 816–822, 1996.
Lokman (UPMC, LPMA) GTC Europe Lab 28 / 28