SLIDE 1

Counterparty Credit Risk and IM Computation for CCP on Multicore Systems

Prasad Pawar, Nishant Kumar, Amit Kalele
Tata Consultancy Services Limited

SLIDE 2

Overview

  • Introduction
  • Counterparty Credit Risk
      • Basic Terminology
  • Sequential Algorithm
  • Parallel Algorithm using CUDA
  • Optimizations Applied on GPGPU
  • Parallelized and Optimized Algorithm on Intel Platform
  • Comparison Results
  • Conclusion and Future Work

SLIDE 3

Introduction

Counterparty credit risk

Counterparty credit risk is the risk that the counterparty to a transaction defaults before the final settlement of the transaction's cash flows.

Basic Terminology

  • IRS trade - An interest rate swap trade, entered into by market participants primarily to hedge against, or speculate on, the direction of interest rates.
  • Cash flow - In the context of IRS trades, the sum of money to be paid or received on the predefined cash flow dates specified in the trade.
  • Zero coupon yield curve - A curve of the yields of zero coupon bonds plotted against their time to maturity. It provides the forward rates and spot rates used for calculating cash flows and for discounting.
  • MTM value - The mark-to-market value of an IRS trade, which reflects the monetary gain or loss on the trade to its two parties.

SLIDE 4

Counter Party Credit Risk

Mark to Market Computations

  • The central counterparty (CCP) values, using the current yield curve, the complete portfolio of all interest rate swap trades received from all members on an intraday basis.
  • It calculates the mark-to-market (MTM) margin requirement for each member.
  • It blocks the margin from the member's collateral and, if the collateral is not sufficient, makes a margin call to the members concerned.

Initial Margin Computations

  • The IM computation requires valuing the member's current portfolio 250 times, using 250 different yield curves picked from historical data.

SLIDE 5

Challenges

  • On traditional (database) systems, the MTM computation takes ~25 min for 20,000 trades, each with ~150 cash flows.
  • On a .NET based solution, the Initial Margin (IM) computation takes ~10 min for 250 different yield curves, each applied to 20,000 trades with ~150 cash flows.

Such high timings lead to the following problems:

  • The process is inefficient, as the user is unproductive during the 25 minutes the valuation takes.
  • The timings are too high when senior executives or regulators need information on an urgent basis.
  • If trade volumes increase to, say, 100,000 (a realistic possibility), the time taken will exceed 2 hours, which is virtually unacceptable.
  • Until the IM result is computed, a member can continue trading, but trades are guaranteed for settlement from the point of trade in TS, which increases risk for the CCP.

An efficient solution is required to solve this problem.

SLIDE 6

Computational Steps

Yield curve generation:

  • 1. Using linear interpolation, compute the intermediate swap rates for tenors whose swap rates are not provided, where (x, y) represents a tenor and its corresponding interest rate.
  • 2. Zero rates for tenors up to one year are computed using the continuous compounding method.
  • 3. The standard bootstrapping method is used to compute zero rates for tenors beyond one year.
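Step 1 can be illustrated with a minimal C sketch of textbook linear interpolation between the two known (tenor, rate) points that bracket the requested tenor, y = y1 + (x - x1)(y2 - y1)/(x2 - x1). The function name `interp_swap_rate`, the array layout, and the choice to return the nearest end value outside the quoted range are assumptions made for illustration and are not taken from the slides.

```c
#include <stddef.h>

/* Linearly interpolate the swap rate at tenor x from n known
   (tenor, rate) points. tenors[] must be sorted in strictly
   increasing order. Outside the quoted range the nearest end
   value is returned (an assumption, not stated in the slides). */
double interp_swap_rate(const double *tenors, const double *rates,
                        size_t n, double x)
{
    if (n == 0) return 0.0;
    if (x <= tenors[0]) return rates[0];
    if (x >= tenors[n - 1]) return rates[n - 1];
    for (size_t i = 1; i < n; ++i) {
        if (x <= tenors[i]) {
            double x1 = tenors[i - 1], x2 = tenors[i];
            double y1 = rates[i - 1],  y2 = rates[i];
            /* y = y1 + (x - x1) * (y2 - y1) / (x2 - x1) */
            return y1 + (x - x1) * (y2 - y1) / (x2 - x1);
        }
    }
    return rates[n - 1];
}
```

For example, with quoted rates of 3.0 at tenor 2 and 5.0 at tenor 4, the interpolated rate at tenor 3 lies halfway between them.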

SLIDE 7

Computational Steps

MTM values

The inputs to the mark-to-market computation are the business date, the immediately previous and future cash flow dates, principal amount, accrued interest, fixed rate, floating rate and zero rate.

  • 1. Compute the fixed cash flows, and compute the floating cash flows using the discrete equivalent formula for future floating interest rates, i.e. forward rates.
  • 2. Calculate the MTM value of each trade by discounting the fixed and floating cash flows using the zero rate and netting them off.

Discounted Value = Present value of cash flows
Trade MTM = Sum (Discounted Value)

  • 3. The MTM value, i.e. the margin requirement, for each member is obtained by aggregating the MTM values of all the trades of that member.

MTM for Member = Sum (Trade MTM of that Member)
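The two summations above can be sketched in C as follows. This is only an illustration: the function names, the split into "receive" and "pay" legs, and the assumption that discount factors have already been derived from the zero rates are all hypothetical, since the slides do not show the forward-rate or discounting formulas.

```c
#include <stddef.h>

/* Trade MTM = sum over cash flows of the netted cash flow
   (receive leg minus pay leg) times its discount factor.
   discount[i] is assumed to be precomputed from the zero rate. */
double trade_mtm(const double *receive_cf, const double *pay_cf,
                 const double *discount, size_t n_cf)
{
    double mtm = 0.0;
    for (size_t i = 0; i < n_cf; ++i)
        mtm += (receive_cf[i] - pay_cf[i]) * discount[i];
    return mtm;
}

/* Member MTM = sum of the MTM values of that member's trades. */
double member_mtm(const double *trade_mtms, size_t n_trades)
{
    double total = 0.0;
    for (size_t i = 0; i < n_trades; ++i)
        total += trade_mtms[i];
    return total;
}
```

A negative result for a trade means the discounted payments exceed the discounted receipts for the party being valued.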

SLIDE 8

Computational Steps

Initial Margin calculation

  • 1. Value the complete IRS trade portfolio of a member 250 times using 250 historical zero rates.
  • 2. Compute the daily percentage change in the MTM value of the portfolio and record the 249 results.
  • 3. Assign a weight to each result using an EWMA scheme, so that the more recent the result, the greater its weight.
  • 4. Sort the 249 results in ascending order.
  • 5. Add the weights from the top. Where the cumulative weight reaches 0.05, i.e. the worst 5 percent, the corresponding percentage change is multiplied by the portfolio value and adjusted for the holding period to compute the IM.
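Steps 2 to 5 can be sketched in C roughly as below. The decay factor `lambda`, the normalization of the weights so that they sum to 1, the oldest-first ordering of the input, and all names are assumptions for illustration; the slides do not give the EWMA parameters. The sketch also assumes positive portfolio values and leaves the final multiplication by portfolio value and the holding-period adjustment to the caller.

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct { double change; double weight; } scenario;

/* qsort comparator: ascending by percentage change. */
static int cmp_change(const void *a, const void *b)
{
    double x = ((const scenario *)a)->change;
    double y = ((const scenario *)b)->change;
    return (x > y) - (x < y);
}

/* mtm[0..n-1]: portfolio values, oldest first (assumed positive).
   Returns the percentage change at which the cumulative EWMA
   weight, accumulated from the worst change upward, first reaches
   `tail` (for example 0.05). Returns 0.0 if n < 2 or if the
   allocation fails. */
double weighted_tail_change(const double *mtm, size_t n,
                            double lambda, double tail)
{
    if (n < 2) return 0.0;
    size_t m = n - 1;                     /* 250 values give 249 changes */
    scenario *s = malloc(m * sizeof *s);
    if (!s) return 0.0;

    double w = 1.0, total = 0.0;
    for (size_t k = 0; k < m; ++k) {      /* k = 0 is the most recent change */
        size_t i = m - k;
        s[k].change = (mtm[i] - mtm[i - 1]) / mtm[i - 1];
        s[k].weight = w;                  /* weight decays with age */
        total += w;
        w *= lambda;
    }
    qsort(s, m, sizeof *s, cmp_change);   /* worst change first */

    double cum = 0.0, result = s[m - 1].change;
    for (size_t k = 0; k < m; ++k) {
        cum += s[k].weight / total;       /* normalized cumulative weight */
        if (cum >= tail) { result = s[k].change; break; }
    }
    free(s);
    return result;
}
```

Because the weights are not uniform, the selected scenario is generally not simply the 12th or 13th worst of the 249 results; recent large losses reach the 0.05 threshold sooner than old ones.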

SLIDE 9

Implementation Details

SLIDE 10

Input and Output of IRS

  • Input for yield curve generation - Swap rate, swap tenor
  • Output - Yield curve
  • Input for MTM computation - Cash_flow_dates, Prev_cash_flow_dates, Notional_amt, accrued_int, fixed_rate
  • Output - Present CashFlow, MTM value

SLIDE 11

Sequential Algorithm

Single MTM Computation

Compute zero rates
MTMfinal = 0
for trade = 0 to nTrade do
    MTM[trade] = 0
    for CF = 0 to nCf do
        Compute Present_CashFlow[CF]
        MTM[trade] = MTM[trade] + Present_CashFlow[CF]
    end for
    MTMfinal = MTMfinal + MTM[trade]
end for

SLIDE 12

Sequential Algorithm

Single MTM Computation

for CF = 0 to nCf do
    if (CF == 0)
        Read eff_date, curr_cash_flow_date
    else
        Read last_cash_flow_date, curr_cash_flow_date
    1. Calculate the number of days between curr_date and last_date to obtain the tenor.
    2. Calculate the intermediate values fw_rate, comp_fw_rate, dist_fw_rate, etc.
    3. Calculate the floating cash flow of the trade from the values above and inputs such as notional_amt, accrued_int and fixed_rate.
    4. Compute the present value of the cash flow using the fixed/floating cash flow and the yield curve.
    5. MTM = MTM + Present_CashFlow[CF]
end for
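Step 1 of the loop, turning two calendar dates into a tenor, might look like the C sketch below. It uses a well-known civil-date-to-day-number conversion for the Gregorian calendar. The function names and the ACT/365 day count (days divided by 365) are assumptions; the slides do not state which day count convention is used.

```c
/* Days since 1970-01-01 for a Gregorian calendar date
   (m in 1..12, d in 1..31). */
long days_from_civil(int y, int m, int d)
{
    y -= m <= 2;                                   /* treat Jan/Feb as end of previous year */
    long era = (y >= 0 ? y : y - 399) / 400;
    long yoe = y - era * 400;                      /* year of era: 0..399 */
    long doy = (153 * (m + (m > 2 ? -3 : 9)) + 2) / 5 + d - 1;   /* day of year, March-based */
    long doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;            /* day of era */
    return era * 146097 + doe - 719468;
}

/* Tenor in years between two dates, assuming an ACT/365 day count. */
double tenor_years(int y1, int m1, int d1, int y2, int m2, int d2)
{
    long days = days_from_civil(y2, m2, d2) - days_from_civil(y1, m1, d1);
    return (double)days / 365.0;
}
```

For instance, the period from 12 March 2010 to 12 June 2010 spans 92 days, giving a tenor of 92/365 years under this convention.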

SLIDE 13

Sequential Algorithm

Initial Margin Computation using 250 MTM values:

Compute the current date's zero rates and retrieve 249 different zero rates from the database.
MTMfinal[rate] = 0 for every rate

for rate = 0 to nRate do
    for trade = 0 to nTrade do          (the two inner loops form a single MTM computation)
        MTM[trade] = 0
        for CF = 0 to nCf do
            Compute Present_CashFlow[CF]
            MTM[trade] = MTM[trade] + Present_CashFlow[CF]
        end for
        MTMfinal[rate] = MTMfinal[rate] + MTM[trade]
    end for
end for

Compute IM using the nRate values in MTMfinal

SLIDE 14

NVIDIA GPU Systems

  • Kepler K20x
      Device: NVIDIA Kepler K20x GPU, 796 MHz, 2496 cores, 5 GB RAM.
      Host: Intel Xeon(R) CPU E5-2697 v3 @ 2.1 GHz, dual socket, 6 cores/socket, 16 GB RAM.
  • Kepler K40
      Device: NVIDIA Kepler K40 GPU, 745 MHz, 2880 cores, 12 GB RAM.
      Host: Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz, dual socket, 14 cores/socket, 64 GB RAM.
  • Kepler K80
      Device: NVIDIA Kepler K80 GPU, 562 MHz, 2x2496 cores, 2x12 GB RAM.
      Host: Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz, dual socket, 14 cores/socket, 64 GB RAM.

SLIDE 15

GPU Algorithm for single MTM

MTMfinal = 0
Launch CUDA kernel with nTrade = 20000 threads (T1, T2, T3, T4, ..., T20000)
Compute zero rates

Kernel computation, executed by each thread Ti:
    MTM[Ti] = 0
    for CF = 0 to nCf do
        Compute discount rate[CF]
        MTM[Ti] = MTM[Ti] + discount rate[CF]
    end for

MTMfinal = ∑ MTM

SLIDE 16

Results

Sr. No. | Experiment | Time in sec | Performance gain
1 | Sequential computation of 20000 trades with 150 cash flows each on 250 different yield curves | 81.15 | -
2 | Parallel computation of 20000 trades with 150 cash flows each on 250 different yield curves on K20x | 9.612 | 8.44x

Results taken on the Kepler K20x system.

SLIDE 17

Further Optimization

Nvidia GPU optimization:

  • Multi level parallelism using Hyper-Q
  • Using Shared Memory with coalesced memory access
  • Modified data structure
  • Resolved the issue of warp divergence
  • Using constant memory
  • Read-only cache memory using const __restrict__

SLIDE 18

Multi level parallelism using Hyper-Q

  • Hyper-Q allows connections from multiple CUDA streams, Message Passing Interface (MPI) processes, or multiple threads of the same process.
  • It provides 32 concurrent work queues, so the GPU can receive work from 32 process cores at the same time.
  • A 1.5x performance benefit was achieved.

Figure source: nvidia.com

SLIDE 19

GPU Algorithm for 250 MTM & IM Computation

Set 32 CUDA streams and nRate = 250
Distribute nRate/32 computations to each of the 32 streams (S0, S1, S2, ..., S31)
MTMfinal[nRate] = 0
Compute zero rates and retrieve 249 previous zero rates
Compute IM using MTM[nRate]
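Since 250 is not a multiple of 32, the yield curves cannot be split evenly over the streams. The slides do not say how the remainder is handled; one plausible scheme, sketched here in C with hypothetical names, gives each stream a contiguous block and hands one extra curve to each of the first `n_rate % n_streams` streams.

```c
#include <stddef.h>

/* Contiguous block of yield-curve indices handled by one stream. */
typedef struct { size_t begin; size_t count; } rate_block;

/* Split n_rate curves over n_streams streams as evenly as possible.
   The first (n_rate % n_streams) streams get one extra curve. */
rate_block block_for_stream(size_t n_rate, size_t n_streams, size_t stream)
{
    size_t base  = n_rate / n_streams;
    size_t extra = n_rate % n_streams;
    rate_block b;
    b.count = base + (stream < extra ? 1 : 0);
    b.begin = stream * base + (stream < extra ? stream : extra);
    return b;
}
```

With 250 curves and 32 streams, streams 0 to 25 each process 8 curves and streams 26 to 31 each process 7, covering all 250 exactly once.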

SLIDE 20

Hyper-Q using default streaming

nvcc --default-stream per-thread -c MTM_value.cu -arch sm_35 -w -Xcompiler -fopenmp

SLIDE 21

Results

Sr. No. | Experiment | Time in sec | Performance gain
1 | Sequential computation of 20000 trades with 150 cash flows each on 250 different yield curves | 81.15 | -
2 | Parallel computation of 20000 trades with 150 cash flows each on 250 different zero rates | 9.612 | 1x
3 | Using the default streaming flag | 6.24 | 1.54x

Results taken on the Kepler K20x system.

SLIDE 22

Using Shared Memory

[Figure: CUDA memory hierarchy showing the host, the grid, global memory, constant memory, and blocks (0, 0) and (1, 0), each with its own shared memory and per-thread registers. Figure source: nvidia.com]

  • Read/write per block
  • Speed equivalent to local cache
  • 100x faster than global memory
  • Limit up to 48 KB
  • Zero rate and swap tenor are stored in shared memory

SLIDE 23

Coalesced global memory access

[Figure: two global memory access patterns for threads 0 to 31. In the first, consecutive threads read addresses 64 bytes apart (128, 192, 256, ..., 2112, 2176). In the second, consecutive threads read addresses 4 bytes apart (128, 132, 136, ..., 252, 256) and the data is placed in shared memory.]

SLIDE 24

Using Shared Memory

Here ARRAY_COUNT = 22 and the number of threads per block is 32, 64, … Because the zero_rate and swap_tenor data are used multiple times by each thread, they are stored in shared memory.

SLIDE 25

Results

Sr. No. | Experiment | Time in sec | Performance gain
1 | Sequential computation of 20000 trades with 150 cash flows each on 250 different yield curves | 81.15 | -
2 | Parallel computation of 20000 trades with 150 cash flows each on 250 different zero rates | 9.612 | 1x
3 | Using the default streaming flag | 6.24 | 1.54x
4 | Shared memory used to store zero_rate and swap_tenor data | 5.221 | 1.84x

Results taken on the Kepler K20x system.

SLIDE 26

NVVP Profiling

SLIDE 27

Data Structure of Dates

Data structure - Cash_flow_dates and prev_cash_flow_date are stored in the form of a structure:

typedef struct _d {
    int day;    /* 4 bytes */
    int mon;    /* 4 bytes */
    int year;   /* 4 bytes */
} dt;           /* 12 bytes in total */

SLIDE 28

Issue with Data Structure

[Figure: threads T0, T1, T2, ..., T30, T31 reading global memory at byte offsets 4, 8, 12, 16, 20, 24, ...]

  • In one cycle the cache fetches 128 bytes, which holds only 128/12 ≈ 10 date elements.
  • A large amount of data has to be fetched from global memory.
  • The access pattern to global memory is strided.

SLIDE 29

Modified Data Structure

Original data structure - Cash_flow_dates and prev_cash_flow_date stored in the form of a structure:

typedef struct _d {
    int day;    /* 4 bytes */
    int mon;    /* 4 bytes */
    int year;   /* 4 bytes */
} dt;           /* 12 bytes in total */

Instead of a structure, the date information is stored in a single integer, e.g.:

dt date;
int tmp_date;
date.day = 12;
date.mon = 3;
date.year = 2010;
tmp_date = date.day | date.mon << 8 | date.year << 16;

SLIDE 30

Modified Data Structure

To split the integer back into day, month and year, we use the method below:

date.day = tmp_date & 0xFF;
date.mon = (tmp_date >> 8) & 0xFF;
date.year = (tmp_date >> 16) & 0xFFFF;
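The packing and unpacking steps can be combined into a small, self-contained C sketch. The helper names `pack_date` and `unpack_date` and the use of an unsigned 32-bit type (to keep the shifts well defined) are choices made here for illustration; the bit layout follows the masks shown above: day in bits 0 to 7, month in bits 8 to 15, year in bits 16 to 31.

```c
#include <stdint.h>

typedef struct { int day; int mon; int year; } dt;

/* Pack a date into one 32-bit word:
   bits 0-7 = day, bits 8-15 = month, bits 16-31 = year. */
uint32_t pack_date(dt d)
{
    return ((uint32_t)d.day & 0xFFu)
         | (((uint32_t)d.mon & 0xFFu) << 8)
         | (((uint32_t)d.year & 0xFFFFu) << 16);
}

/* Recover day, month and year from the packed word. */
dt unpack_date(uint32_t packed)
{
    dt d;
    d.day  = (int)(packed & 0xFFu);
    d.mon  = (int)((packed >> 8) & 0xFFu);
    d.year = (int)((packed >> 16) & 0xFFFFu);
    return d;
}
```

Each date then occupies 4 bytes instead of 12, so a 128-byte fetch serves all 32 threads of a warp when they read consecutive dates.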

SLIDE 31

Modified Data Structure

[Figure: threads T0, T1, T2, ..., T30, T31 reading global memory at byte offsets 4, 8, 12, 16, 20, 24, ...]

  • In one cycle the cache fetches 128 bytes, which now holds 128/4 = 32 date elements.
  • Less data has to be fetched from global memory.
  • The access pattern to global memory is coalesced.

SLIDE 32

Results

Sr. No. | Experiment | Time in sec | Performance gain
1 | Sequential computation of 20000 trades with 150 cash flows each on 250 different yield curves | 81.15 | -
2 | Parallel computation of 20000 trades with 150 cash flows each on 250 different zero rates | 9.612 | 1x
3 | Using the default streaming flag | 6.24 | 1.54x
4 | Shared memory used to store zero_rate and swap_tenor data | 5.221 | 1.84x
5 | Change in data structure: cash flow dates stored as a single integer value instead of a structure | 4.229 | 2.27x

Results taken on the Kepler K20x system.

SLIDE 33

NVVP Profiling

SLIDE 34

Warp Divergence

Threads are executed in warps of 32, with all threads in a warp executing the same instruction at the same time.

SLIDE 35

Warp Divergence

SLIDE 36

Warp Divergence - solution

SLIDE 37

Results

Sr. No. | Experiment | Time in sec | Performance gain
1 | Sequential computation of 20000 trades with 150 cash flows each on 250 different yield curves | 81.15 | -
2 | Parallel computation of 20000 trades with 150 cash flows each on 250 different zero rates | 9.612 | 1x
3 | Using the default streaming flag | 6.24 | 1.54x
4 | Shared memory used to store zero_rate and swap_tenor data | 5.221 | 1.84x
5 | Change in data structure: cash flow dates stored as a single integer value instead of a structure | 4.229 | 2.27x
6 | Resolved the issue of warp divergence (by separating the if-else part of the code) | 3.148 | 3.05x

Results taken on the Kepler K20x system.

SLIDE 38

NVVP Profiling

SLIDE 39

Constant memory

Where is constant memory?

  • Data is stored in the device global memory
  • Data is read through the multiprocessor constant cache
  • 64 KB of constant memory, with an 8 KB cache for each SM

// declare constant memory
__constant__ float cst_ptr[size];

// copy data from host to constant memory
cudaMemcpyToSymbol(cst_ptr, host_ptr, data_size);

SLIDE 40

Constant memory

Here ARRAY_COUNT = 22 and the number of threads per block is 32, 64, …

Issue with the above implementation: warp divergence.

SLIDE 41

Results

Sr. No. | Experiment | Time in sec | Performance gain
1 | Sequential computation of 20000 trades with 150 cash flows each on 250 different yield curves | 81.15 | -
2 | Parallel computation of 20000 trades with 150 cash flows each on 250 different zero rates | 9.612 | 1x
3 | Using the default streaming flag | 6.24 | 1.54x
4 | Shared memory used to store zero_rate and swap_tenor data | 5.221 | 1.84x
5 | Change in data structure: cash flow dates stored as a single integer value instead of a structure | 4.229 | 2.27x
6 | Resolved the issue of warp divergence (by separating the if-else part of the code) | 3.148 | 3.05x
7 | Changed shared memory to constant memory | 2.892 | 3.32x

Results taken on the Kepler K20x system.

SLIDE 42

NVVP Profiling

SLIDE 43

Read-Only Cache Memory

  • The read-only data cache was introduced with Compute Capability 3.5 architectures (e.g. Tesla K20c/K20X and GeForce GTX Titan/780 GPUs).
  • Each SMX has a 48 KB read-only cache.
  • The CUDA compiler automatically accesses data via the read-only cache when it can determine that the data is read-only for the lifetime of the kernel. In practice, you need to qualify pointers as const and __restrict__ before the compiler can satisfy this condition.
  • You can also request a load through the read-only data cache explicitly with the __ldg() intrinsic function.

SLIDE 44

Read-Only Cache Memory

[Figure: side-by-side comparison, without read-only cache vs. with read-only cache]

SLIDE 45

NVVP Profiling

SLIDE 46

Results

Sr. No. | Experiment | Time in sec | Performance gain
1 | Sequential computation of 20000 trades with 150 cash flows each on 250 different yield curves | 81.15 | -
2 | Parallel computation of 20000 trades with 150 cash flows each on 250 different zero rates | 9.612 | 1x
3 | Using the default streaming flag | 6.24 | 1.54x
4 | Shared memory used to store zero_rate and swap_tenor data | 5.221 | 1.84x
5 | Change in data structure: cash flow dates stored as a single integer value instead of a structure | 4.229 | 2.27x
6 | Resolved the issue of warp divergence (by separating the if-else part of the code) | 3.148 | 3.05x
7 | Changed shared memory to constant memory | 2.892 | 3.32x
8 | Read-only cache memory using const __restrict__ | 2.765 | 3.47x

Results taken on the Kepler K20x system.

SLIDE 47

Intel Multi-core Systems

Experimental Setup

  • Ivy Bridge: Intel Xeon E5-2650 v2, 2.6 GHz, dual socket, 8 cores/socket with Hyper-threading, 24 GB RAM.
  • Haswell: Intel Xeon E5-2697 v3 @ 2.60GHz, dual socket, 14 cores/socket with Hyper-threading, 64 GB RAM.

SLIDE 48

Intel OpenMP

Parallelization using OpenMP

[Figure: fork-join model. A sequential region forks into a parallel section in which threads T1, T2, ..., Tn each calculate the MTM value of a trade, then joins back into a sequential region.]

Syntax:

#pragma omp parallel for clauses(private, firstprivate, ...)
for (trades = 0 to nTrade) {
    for (CF = 0 to nCf) {
        computation steps
    }
}
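A concrete version of this pattern might look like the C sketch below. The function name, the flat row-major cash flow array, the precomputed discount factors and the chunk size of 64 are all assumptions for illustration; the slides name only the directive and the dynamic schedule, not the clauses or data layout actually used. The `reduction` clause gives each thread a private partial sum and combines them at the end, which avoids a race on the total.

```c
#include <stddef.h>

/* Sum the discounted cash flows of all trades in parallel.
   cash_flows is a flat n_trade x n_cf array (row-major);
   discount[cf] is a precomputed discount factor per cash flow date. */
double portfolio_mtm(const double *cash_flows, const double *discount,
                     int n_trade, int n_cf)
{
    double mtm_final = 0.0;

    /* One trade per iteration; chunks of 64 trades are handed out
       dynamically, and per-thread partial sums are combined by the
       reduction clause. Without OpenMP enabled, the pragma is ignored
       and the loop runs sequentially with the same result. */
    #pragma omp parallel for schedule(dynamic, 64) reduction(+:mtm_final)
    for (int trade = 0; trade < n_trade; ++trade) {
        double mtm = 0.0;   /* declared inside the loop, so private to each thread */
        for (int cf = 0; cf < n_cf; ++cf)
            mtm += cash_flows[(size_t)trade * (size_t)n_cf + (size_t)cf] * discount[cf];
        mtm_final += mtm;
    }
    return mtm_final;
}
```

Declaring `mtm` inside the loop body removes the need for an explicit `private` clause, which is a common way to keep such loops free of shared temporaries.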

SLIDE 49

Intel Optimizations

  • Multi thread settings
  • Optimization using compiler flags
      • -no-prec-div
      • -unroll-aggressive
      • -ipo
  • Schedule (dynamic, chunk)

SLIDE 50

Intel Optimizations

  • -no-prec-div
    -prec-div improves the precision of floating-point divides. -no-prec-div disables this option and enables optimizations that give slightly less precise results than full IEEE division.
  • -unroll-aggressive
    This option determines whether the compiler uses more aggressive unrolling for certain loops.
  • -ipo[n]
    This option permits inlining and other interprocedural optimizations among multiple source files.

SLIDE 51

Intel Optimizations Results on HSW-EP

Sr. No. | Experiment | Time in sec | Throughput gain
1 | Parallel computation using OpenMP of 20000 trades with 150 cash flows each | 10.6647 | -
2 | Using KMP_AFFINITY=granularity=fine,compact,1,0 | 7.7874 | 1.37x
3 | Using optimization flag -O2 | 3.4197 | 3.12x
4 | Using optimization flags -no-prec-div -unroll-aggressive | 3.1642 | 3.37x
5 | Using optimization flag -ipo | 2.7451 | 3.87x

Results taken on the Intel Haswell system.

SLIDE 52

Comparative Performance

SLIDE 53

Conclusion and Future Work

  • Traditional computing approaches are not suitable for the workload involved in computing MTM and IM for CCP risk assessment.
  • Using HPC, 750 million cash flows can be computed in a few seconds to identify the liquidity requirement.
  • We achieved an improvement of more than 100 times over the best known system for a single MTM computation.
  • HPC is well suited for complex calculations such as credit value adjustment to price counterparty risk in a deal (our next PoC use case), expected shortfall calculation for market risk measurement, potential future exposure and exposure at default calculations, collateral valuation, basis risk computation, etc.
  • Risk management is moving towards intraday risk computation for all major risk categories (credit risk, market risk, liquidity risk and counterparty credit risk), and HPC is well suited to meeting the performance demands of these computations.
  • The Nvidia Kepler K80 performed best among all systems compared in our benchmarking results.

SLIDE 54

Acknowledgement

We are thankful to Vinay Deshpande from Nvidia Pune, India, for providing access to the latest K40 and K80 GPU systems to benchmark and fine-tune the application.

We are thankful to the HPC Advisory Council for providing access to the Thor system for evaluating our application.

SLIDE 55

THANK YOU