Counterparty Credit Risk and IM Computation for CCP on Multicore Systems
Prasad Pawar, Nishant Kumar, Amit Kalele
Tata Consultancy Services Limited
Overview:
Introduction
Counterparty Credit Risk
Basic Terminology
Sequential Algorithm
Parallel Algorithm using CUDA
Optimizations Applied on GPGPU
Parallelized and Optimized Algorithm on the Intel Platform
Comparison Results
Conclusion and Future Work
Counterparty credit risk is defined as the risk that the counterparty to a transaction could default before the final settlement of the transaction’s cash flows.
Basic terminology:
Interest rate swap: used for speculation on interest rate direction by the market participants.
Cash flows: paid or received on predefined cash flow dates that are mentioned in the trade.
Yield curve: interest rates plotted against the length of time they have to run to maturity; it provides the forward rates and spot rates used for calculating and discounting cash flows.
Mark to market (MTM): reflects the monetary gain or loss on the trade to the two parties of the trade.
The CCP computes the margin requirement and makes margin calls to the required members. IM is computed using 250 different yield curves that are picked from historical data.
The workload consists of 20,000 trades, each with ~150 cash flows. MTM computation for 20,000 trades with ~150 cash flows each on the existing .NET based solution is slow; such high timings delay identifying the rates at which the valuation happens, while margin may be needed on an urgent basis. For IM computation across 250 yield curves, the time taken would be more than 2 hours, which is virtually unacceptable. Trades are guaranteed for settlement from the point of trade in TS, which increases the risk for the CCP.
Rates must be obtained by interpolation for tenors whose swap rates are not provided, where (x, y) represents a tenor and the corresponding interest rate. Zero rates are then computed using the compounding method for tenors beyond one year.
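The interpolation for missing tenors can be sketched as follows; this is a minimal illustration that assumes simple linear interpolation between the two nearest quoted points (the exact formula from the slide is not reproduced here):

```python
def interpolate_rate(tenors, rates, target):
    """Linearly interpolate the rate for a tenor that is not quoted,
    using the nearest bracketing (tenor, rate) pairs."""
    for (x0, y0), (x1, y1) in zip(zip(tenors, rates), zip(tenors[1:], rates[1:])):
        if x0 <= target <= x1:
            return y0 + (y1 - y0) * (target - x0) / (x1 - x0)
    raise ValueError("target tenor outside the quoted range")

# Quoted swap tenors (years) and rates; the 2-year point is missing
tenors = [1.0, 3.0, 5.0]
rates = [0.060, 0.064, 0.070]
rate_2y = interpolate_rate(tenors, rates, 2.0)  # halfway between 0.060 and 0.064
```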
7
The inputs to the mark-to-market computation are the business date, the immediately previous and future cash flow dates, the principal amount, the accrued interest, the fixed rate, the floating rate and the zero rate.
1. Compute the fixed cash flows, and compute the floating cash flows using the discrete equivalent formula for the future floating interest rates, i.e. the forward rates.
2. Calculate the MTM value of each trade by discounting the fixed and floating cash flows using the zero rate and netting off the fixed and floating cash flows:
   Discounted Value = Present value of cash flows
   Trade MTM = Sum(Discounted Value)
3. The margin requirement for each member is obtained by aggregating the MTM values of all the trades of that member:
   MTM for Member = Sum(Trade MTM of that Member)
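Steps 2 and 3 above can be sketched as follows. The cash flow amounts and discount factors are illustrative assumptions, not values from the presentation:

```python
def trade_mtm(fixed_cfs, floating_cfs, discount_factors):
    """Discount the fixed and floating cash flows with the zero-rate
    discount factors and net them off to get the trade MTM."""
    mtm = 0.0
    for fixed, floating, df in zip(fixed_cfs, floating_cfs, discount_factors):
        mtm += (floating - fixed) * df  # net cash flow, discounted
    return mtm

def member_mtm(trades):
    """Margin requirement for a member: sum of its trade MTMs."""
    return sum(trade_mtm(*t) for t in trades)

# Two toy trades for one member: (fixed leg, floating leg, discount factors)
trades = [
    ([100.0, 100.0], [103.0, 101.0], [0.95, 0.90]),
    ([50.0, 50.0], [49.0, 52.0], [0.95, 0.90]),
]
portfolio_mtm = member_mtm(trades)  # net discounted gain across the member's trades
```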
To compute IM, the portfolio is revalued under the current zero rates and 249 historical zero rates, and the 249 results are recorded. The resulting changes in portfolio value are weighted: the more recent the result, the higher the weight. The change at the 5th percentile of this weighted distribution, i.e. the worst 5 percent, is selected; the corresponding percentage change is then multiplied by the portfolio value and adjusted for the holding period to compute IM.
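A minimal sketch of the IM calculation above. The exponential decay weighting and the square-root holding-period scaling are illustrative assumptions; the presentation does not specify the exact scheme:

```python
def initial_margin(pct_changes, portfolio_value, holding_period_days=2,
                   decay=0.99):
    """Historical-simulation IM sketch: weight the revaluation results
    so that recent scenarios count more, take the worst-5% change, and
    scale by portfolio value and holding period."""
    n = len(pct_changes)
    # Most recent scenario (last index) gets the largest weight
    weights = [decay ** (n - 1 - i) for i in range(n)]
    total = sum(weights)
    # Rank scenarios from worst (most negative change) to best
    ranked = sorted(zip(pct_changes, weights))
    # Walk up the cumulative weight until the 5% tail is covered
    cum, tail_change = 0.0, ranked[0][0]
    for change, w in ranked:
        cum += w
        tail_change = change
        if cum >= 0.05 * total:
            break
    return abs(tail_change) * portfolio_value * holding_period_days ** 0.5

# 249 illustrative scenario changes for a 1,000,000 portfolio
im = initial_margin([-0.031, 0.012, -0.018] * 83, portfolio_value=1_000_000)
```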
Swap Rate, Swap Tenor
Cash_flow_dates, Prev_cash_flow_dates, Notional_amt, accrued_int, fixed_rate
Compute zero rates
MTMfinal = 0
for trade = 0 to nTrade do
    MTM[trade] = 0
    for CF = 0 to nCf do
        Compute Present_CashFlow[CF]
        MTM[trade] = MTM[trade] + Present_CashFlow[CF]
    end for
    MTMfinal = MTMfinal + MTM[trade]
end for
for CF = 0 to nCf do
    if (CF == 0)
        Read eff_date, curr_cash_flow_date
    else
        Read last_cash_flow_date, curr_cash_flow_date
    1. Calculate the number of days between curr_date and last_date to obtain the tenor.
    2. Calculate intermediate values such as fw_rate, comp_fw_rate, dist_fw_rate.
    3. Calculate the floating cash flow of the trade from the above values and inputs such as notional_amt, accrued_int and fixed_rate.
    4. Compute the present value of the cash flow using the fixed/floating cash flow and the yield curve.
    5. MTM = MTM + Present_CashFlow[CF]
end for
Compute the current date's zero rates and retrieve 249 different zero rates from the database.
MTMfinal[nRate] = 0
for rate = 0 to nRate do
    for trade = 0 to nTrade do
        MTM[trade] = 0
        for CF = 0 to nCf do
            Compute Present_CashFlow[CF]
            MTM[trade] = MTM[trade] + Present_CashFlow[CF]
        end for
        MTMfinal[rate] = MTMfinal[rate] + MTM[trade]
    end for
end for
Compute IM using MTMfinal[nRate]
Single MTM Computations
Kepler K20x - Device: Nvidia Kepler K20x GPU, 796 MHz, 2496 cores, 5 GB RAM. Host: Intel Xeon CPU E5-2697 v3 @ 2.1 GHz, dual socket, 6 cores/socket, 16 GB RAM.
Kepler K40 - Device: Nvidia Kepler K40 GPU, 745 MHz, 2880 cores, 12 GB RAM. Host: Intel Xeon CPU E5-2697 v3 @ 2.60 GHz, dual socket, 14 cores/socket, 64 GB RAM.
Kepler K80 - Device: Nvidia Kepler K80 GPU, 562 MHz, 2x2496 cores, 2x12 GB RAM. Host: Intel Xeon CPU E5-2697 v3 @ 2.60 GHz, dual socket, 14 cores/socket, 64 GB RAM.
MTMfinal = 0
Launch CUDA kernel with nTrade = 20000 threads (T1, T2, ..., T20000)
Kernel computation, per thread Ti:
    Compute zero rates
    MTM[Ti] = 0
    for CF = 0 to nCf do
        Compute discount_rate[CF]
        MTM[Ti] = MTM[Ti] + discount_rate[CF]
    end for
MTMfinal = Σ MTM[Ti]
Sr. No. | Experiment | Time (sec) | Performance Gain
1 | Sequential computation of 20000 trades with 150 cash flows each on 250 diff. yield curves | 81.15 | -
2 | Parallel computation of 20000 trades with 150 cash flows each on 250 diff. yield curves on K20x | 9.612 | 8.44x
Results taken on the Kepler K20x system.
Hyper-Q allows connections from multiple CUDA streams, Message Passing Interface (MPI) processes, or multiple threads of the same process. It provides 32 concurrent work queues and can receive work from up to 32 processes or cores at the same time. A 1.5x performance benefit was achieved.
Figure source: nvidia.com
Compute zero rates and retrieve 249 previous zero rates
Set 32 CUDA streams (S0, S1, S2, ..., S31) and nRate = 250
Distribute nRate/32 computations to each of the 32 streams
MTMfinal[nRate] = 0
Compute IM using MTMfinal[nRate]
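Distributing nRate = 250 revaluations over 32 streams amounts to splitting the scenario indices into near-equal contiguous chunks. A host-side sketch of that partitioning (the helper name is hypothetical):

```python
def partition(n_items, n_streams):
    """Split item indices 0..n_items-1 into n_streams nearly equal
    contiguous chunks, one chunk per CUDA stream."""
    base, extra = divmod(n_items, n_streams)
    chunks, start = [], 0
    for s in range(n_streams):
        size = base + (1 if s < extra else 0)  # spread the remainder
        chunks.append(range(start, start + size))
        start += size
    return chunks

chunks = partition(250, 32)  # 26 chunks of 8 scenarios and 6 chunks of 7
```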
nvcc --default-stream per-thread -c MTM_value.cu -arch sm_35 -w -Xcompiler -fopenmp
Sr. No. | Experiment | Time (sec) | Performance Gain
1 | Sequential computation of 20000 trades with 150 cash flows each on 250 diff. yield curves | 81.15 | -
2 | Parallel computation of 20000 trades with 150 cash flows each on 250 diff. zero rates | 9.612 | 1x
3 | Using default streaming flag | 6.24 | 1.54x
Results taken on the Kepler K20x system.
CUDA memory hierarchy (figure): within the grid, each block (Block (0,0), Block (1,0), ...) has its own shared memory and per-thread registers; all blocks access the device's global and constant memory, which the host reads and writes. Figure source: nvidia.com
Global memory access patterns (figure): before the optimization, threads 0-31 access global memory at widely separated offsets (128, 192, 256, ..., 2112, 2176 bytes); after the optimization, threads 0-31 access consecutive offsets (128, 132, 136, ..., 252, 256 bytes) and the reused data is kept in shared memory.
Here ARRAY_COUNT = 22 and the number of threads per block is 32, 64, .... As the zero_rate and swap_tenor data is used multiple times per thread, it is stored in shared memory.
Sr. No. | Experiment | Time (sec) | Performance Gain
1 | Sequential computation of 20000 trades with 150 cash flows each on 250 diff. yield curves | 81.15 | -
2 | Parallel computation of 20000 trades with 150 cash flows each on 250 diff. zero rates | 9.612 | 1x
3 | Using default streaming flag | 6.24 | 1.54x
4 | Shared memory used to store zero_rate and swap_tenor data | 5.221 | 1.84x
Results taken on the Kepler K20x system.
Data structure: Cash_flow_dates and prev_cash_flow_date were stored as a 12-byte structure:
typedef struct _d {
    int day;   /* 4 bytes */
    int mon;   /* 4 bytes */
    int year;  /* 4 bytes */
} dt;          /* 12 bytes total */
Access pattern with the 12-byte date structure (figure): threads T0-T31 read from global memory at strided offsets. In one cycle the cache fetches 128 bytes, which covers only 128/12 ≈ 10 date elements, so a large amount of data has to be fetched from global memory and the access pattern is strided.
Instead of the 12-byte structure, the date information is stored in a single integer:
int tmp_date;
date.day = 12;
date.mon = 3;
date.year = 2010;
tmp_date = date.day | date.mon << 8 | date.year << 16;
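The packing can be checked with a short sketch; the 8/8/16-bit field layout follows the shifts above, and the helper names are illustrative:

```python
def pack_date(day, mon, year):
    """Pack a date into one 32-bit integer: day in bits 0-7,
    month in bits 8-15, year in bits 16-31."""
    return day | (mon << 8) | (year << 16)

def unpack_date(packed):
    """Recover (day, mon, year) from the packed integer."""
    return packed & 0xFF, (packed >> 8) & 0xFF, packed >> 16

packed = pack_date(12, 3, 2010)
unpacked = unpack_date(packed)  # (12, 3, 2010)
```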
Access pattern with packed integer dates (figure): threads T0-T31 read consecutive 4-byte integers (offsets 4, 8, 12, 16, 20, 24, ...). In one cycle the cache fetches 128 bytes, which covers 128/4 = 32 date elements, so less data needs to be fetched from global memory and the access pattern is coalesced.
Sr. No. | Experiment | Time (sec) | Performance Gain
1 | Sequential computation of 20000 trades with 150 cash flows each on 250 diff. yield curves | 81.15 | -
2 | Parallel computation of 20000 trades with 150 cash flows each on 250 diff. zero rates | 9.612 | 1x
3 | Using default streaming flag | 6.24 | 1.54x
4 | Shared memory used to store zero_rate and swap_tenor data | 5.221 | 1.84x
5 | Change in data structure: cash flow dates stored as a single integer value instead of a structure | 4.229 | 2.27x
Results taken on the Kepler K20x system.
Sr. No. | Experiment | Time (sec) | Performance Gain
1 | Sequential computation of 20000 trades with 150 cash flows each on 250 diff. yield curves | 81.15 | -
2 | Parallel computation of 20000 trades with 150 cash flows each on 250 diff. zero rates | 9.612 | 1x
3 | Using default streaming flag | 6.24 | 1.54x
4 | Shared memory used to store zero_rate and swap_tenor data | 5.221 | 1.84x
5 | Change in data structure: cash flow dates stored as a single integer value instead of a structure | 4.229 | 2.27x
6 | Resolved warp divergence (by separating the if-else code paths) | 3.148 | 3.05x
Results taken on the Kepler K20x system.
__constant__ float cst_ptr[size];
cudaMemcpyToSymbol(cst_ptr,host_ptr,data_size);
Here ARRAY_COUNT = 22 and the number of threads per block is 32, 64, .... Issue with the above implementation: warp divergence.
Sr. No. | Experiment | Time (sec) | Performance Gain
1 | Sequential computation of 20000 trades with 150 cash flows each on 250 diff. yield curves | 81.15 | -
2 | Parallel computation of 20000 trades with 150 cash flows each on 250 diff. zero rates | 9.612 | 1x
3 | Using default streaming flag | 6.24 | 1.54x
4 | Shared memory used to store zero_rate and swap_tenor data | 5.221 | 1.84x
5 | Change in data structure: cash flow dates stored as a single integer value instead of a structure | 4.229 | 2.27x
6 | Resolved warp divergence (by separating the if-else code paths) | 3.148 | 3.05x
7 | Changed shared memory to constant memory | 2.892 | 3.32x
Results taken on the Kepler K20x system.
Figure: kernel timelines without and with the read-only cache.
Sr. No. | Experiment | Time (sec) | Performance Gain
1 | Sequential computation of 20000 trades with 150 cash flows each on 250 diff. yield curves | 81.15 | -
2 | Parallel computation of 20000 trades with 150 cash flows each on 250 diff. zero rates | 9.612 | 1x
3 | Using default streaming flag | 6.24 | 1.54x
4 | Shared memory used to store zero_rate and swap_tenor data | 5.221 | 1.84x
5 | Change in data structure: cash flow dates stored as a single integer value instead of a structure | 4.229 | 2.27x
6 | Resolved warp divergence (by separating the if-else code paths) | 3.148 | 3.05x
7 | Changed shared memory to constant memory | 2.892 | 3.32x
8 | Read-only cache memory using const __restrict__ | 2.765 | 3.47x
Results taken on the Kepler K20x system.
Parallelization using OpenMP: the sequential regions run on a single thread, and in the parallel section each thread (T1, T2, ..., Tn) calculates the MTM value of one trade.
Syntax:
#pragma omp parallel for clauses(private, firstprivate, ...)
for (trade = 0 to nTrade) {
    for (CF = 0 to nCf) {
        computation steps
    }
}
-no-prec-div
-unroll-aggressive
-ipo
-no-prec-div: enables optimized floating-point division, which may give slightly less precise results than full IEEE division.
-unroll-aggressive: determines whether the compiler uses more aggressive unrolling for certain loops.
-ipo: permits inlining and other interprocedural optimizations among multiple source files.
Sr. No. | Experiment | Time (sec) | Performance Gain
1 | Parallel computation using OpenMP of 20000 trades with 150 cash flows each | 10.6647 | -
2 | Using KMP_AFFINITY=granularity=fine,compact,1,0 | 7.7874 | 1.37x
3 | Using optimization flag -O2 | 3.4197 | 3.12x
4 | Using optimization flag -no-prec-div | 3.1642 | 3.37x
5 | Using optimization flag -ipo | 2.7451 | 3.87x
Results taken on the Intel Haswell system.
Traditional computing approaches are not suitable for the workload involved in the computation of MTM and IM for CCP risk assessment. Using HPC, 750 million cash flows (20,000 trades × ~150 cash flows × 250 yield curves) can be computed to identify the liquidity requirement in a few seconds. More than a 100x improvement was achieved compared to the best known system for single MTM computation. HPC is well suited for complex calculations such as credit value adjustment to price counterparty risk in a deal (our next PoC use case), expected shortfall calculation for market risk measurement, potential future exposure and exposure at default calculations, collateral valuation, and basis risk computation. Risk management is moving towards intraday risk computation for all major risk categories (credit risk, market risk, liquidity risk and counterparty credit risk), and HPC is well suited for meeting the performance demands of these computations. The performance of the Nvidia Kepler K80 is the best among all systems compared in our benchmarking results.
We are thankful to Vinay Deshpande from Nvidia Pune, India for providing access to latest K40 and K80 GPU systems to benchmark and fine tune the application. We are thankful to HPC Advisory Council for providing access to the Thor system for evaluating our application.