SLIDE 1

Juicing Up Ye Olde GPU Monte Carlo Code

Richard Hayden, Andrey Zhezherun, Oleg Rasskazov (JP Morgan)

March 2018

SLIDE 2

GPUs in JP Morgan

❑ JP Morgan has been using GPUs extensively to speed up risk calculations and reduce computational costs since 2011.

❑ Speedup as of 2011: ~40x
❑ Large Cross Asset Quant Library (C++, CUDA)
❑ Monte Carlo and PDEs
❑ GPU code
  ❑ Hand-written CUDA kernels
  ❑ Thrust
  ❑ Auto-generated CUDA kernels

SLIDE 3

GPU Compute use cases

❑ Revolutionary compute density within the node

❑ Machine Learning applications

❑ Reducing cost of compute
❑ Fastest end-to-end calculations (focus of the talk)
  ❑ Real-time risk
  ❑ Example: pricing multiple “similar” instruments (a sketch of such a parameter set follows below)
    ❑ Common Monte Carlo diffusion
    ❑ Large number of similar payoffs
    ❑ E.g. parameterised by
      ❑ Coupon terms
      ❑ Barrier level
      ❑ Basket components
      ❑ Maturity
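
To make the “similar instruments” idea concrete, a batch of payoffs can be described by a plain parameter record so that one Monte Carlo diffusion serves all of them. The record below is a hypothetical illustration; the field names and layout are not the library's actual interface.

    // Hypothetical parameter record for one auto-call-style payoff variant.
    // Every instrument in a batch shares the same diffused paths; only the
    // values in this record differ from instrument to instrument.
    struct PayoffParams {
        int    basket[5];        // indices of the basket components in the diffused asset universe
        double weights[5];       // basket weights
        double coupon;           // coupon paid per observation date while the note is alive
        double barrier;          // autocall / knock-in barrier level
        int    numObservations;  // number of observation dates up to maturity
    };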

SLIDE 4

Starting point

❑ K80 on x86
  ❑ Throughput-oriented setup
  ❑ Multi-tenancy on GPU

❑ Single instrument pricing interface
  ❑ Excess precision? (storage/calculations are predominantly in double)
  ❑ Large overheads on multiple instrument pricing

❑ Repeated computations*
  ❑ CPU setup code
  ❑ Random number generation (GPU)
  ❑ Diffusion (GPU)
  ❑ Payoff compilation
  ❑ CUDA API calls

* 1. There are modelling questions around “global” diffusion and correlations.
* 2. Computations do not always fully overlap.

SLIDE 5

Target improvements

❑ K80 on x86
  ❑ Throughput-oriented setup
  ❑ Multi-tenancy on GPU

❑ Single instrument pricing interface
  ❑ Excess precision? (storage/calculations are predominantly in double)
  ❑ Large overheads on multiple instrument pricing

❑ Repeated computations*
  ❑ CPU setup code
  ❑ Random number generation (GPU)
  ❑ Diffusion (GPU)
  ❑ Payoff compilation
  ❑ CUDA API calls

* 1. There are modelling questions around “global” diffusion and correlations.
* 2. Computations do not always fully overlap.

SLIDE 6

IBM Power 8+ with P100 GPUs

❑ Half of the server (one chip).

[Diagram from https://www.ibm.com/blogs/systems/ibm-power8-cpu-and-nvidia-pascal-gpu-speed-ahead-with-nvlink/]

SLIDE 7

IBM Power 9 (AC922) with 4 V100 GPUs

❑ NVLink 2

  ❑ 6 bricks per CPU
  ❑ 1 brick = 25 GB/s each way
  ❑ CPU <-> GPU: 3 bricks per GPU = 75 GB/s each way; +85% over Power 8

❑ NVIDIA Volta V100 GPUs
❑ Half of the Power 9 system shown in the picture

[Diagram: Power 9 CPU connected over NVLink 2 to two V100 GPUs and to system memory]

SLIDE 8

Payoff pricing interface

❑ Example: auto-call instrument priced on 500,000 different 5-asset baskets, 10k MC paths each
  ❑ ~20% instrument/model object creation
  ❑ ~50% payoff compilation (on-the-fly CUDA)
  ❑ ~25% diffusion and setup
  ❑ <1% doing actual payoff computation on GPU
❑ GPU only running kernels for about 1.5% of that time

❑ Vectorised payoff pricing interface (a sketch follows below)
  ❑ Create instruments/model once and share for all payoff computations
  ❑ Compile payoff once (exposing all required parameterisations)
  ❑ Set up and diffuse the entire universe of required assets up front
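
A rough sketch of what the vectorised interface amounts to on the GPU side (the struct, kernel name, parameters and payoff logic are simplified illustrations, not the actual library code): the universe is diffused once, and a single kernel launch then evaluates every instrument parameterisation on every path.

    // Simplified parameter record (a trimmed-down version of the earlier sketch).
    struct PayoffParams {
        double barrier;   // autocall barrier level
        double coupon;    // coupon per observation date
        int    asset;     // index of the underlying in the diffused universe
    };

    // One thread prices one (instrument, path) pair, so a single launch covers
    // the whole batch of parameterisations instead of one launch per instrument.
    // Note: atomicAdd on double requires sm_60+ (P100/V100).
    __global__ void priceAll(const double* __restrict__ paths,          // [numAssets x numSteps x numPaths]
                             int numSteps, int numPaths,
                             const PayoffParams* __restrict__ params,   // [numInstruments]
                             int numInstruments,
                             double* __restrict__ sumPayoffs)           // [numInstruments], pre-zeroed
    {
        long long idx = (long long)blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= (long long)numInstruments * numPaths) return;

        int inst = (int)(idx / numPaths);
        int path = (int)(idx % numPaths);
        const PayoffParams p = params[inst];

        double payoff = 0.0;
        for (int t = 0; t < numSteps; ++t) {
            double spot = paths[((long long)p.asset * numSteps + t) * numPaths + path];
            if (spot >= p.barrier) {            // autocalled: pay accrued coupons and stop
                payoff = p.coupon * (t + 1);
                break;
            }
        }
        atomicAdd(&sumPayoffs[inst], payoff);   // accumulate; divide by numPaths on the host
    }

One launch over numInstruments x numPaths threads replaces 500,000 separate pricings, which is what removes the per-instrument object creation, payoff compilation and API overhead listed above.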

SLIDE 9

Vectorised payoff pricing interface – initial results

❑ From >300 hrs to <1 minute
❑ Lots of time spent in the CUDA API (cudaMalloc, cudaFree)
❑ Use a custom block allocator (a sketch follows after the tables below):

Initial results:

                     Pricing time (s)   GPU time (s)   API time (s)
Intel Haswell/K40         318.0              –              –
IBM Power8/P100            62.5             41.7           50.6
IBM Power9/V100            36.5             14.7           21.4

With the custom block allocator (speedup vs. the initial results above):

                     Pricing time (s)   GPU time (s)   API time (s)   Speedup
Power8/P100                57.0             41.7           43.0         1.10
Power9/V100                31.1             14.7           16.5         1.17
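
A minimal sketch of the custom block allocator idea, under the assumption that it is a simple bump-pointer sub-allocator (the production allocator is more sophisticated): one large cudaMalloc up front, sub-allocations by advancing an offset, and a reset between batches, so the pricing hot path makes no cudaMalloc/cudaFree calls at all.

    #include <cuda_runtime.h>
    #include <cassert>
    #include <cstddef>

    // Simplified bump-pointer device allocator: one cudaMalloc for a large slab,
    // then each sub-allocation is just pointer arithmetic on the host, avoiding
    // the cudaMalloc/cudaFree traffic that dominated the API time above.
    class DeviceBlockAllocator {
    public:
        explicit DeviceBlockAllocator(size_t bytes) : capacity_(bytes) {
            cudaMalloc(&base_, capacity_);
        }
        ~DeviceBlockAllocator() { cudaFree(base_); }

        void* allocate(size_t bytes) {
            size_t aligned = (bytes + 255) & ~size_t(255);   // keep 256-byte alignment
            assert(offset_ + aligned <= capacity_);
            void* p = static_cast<char*>(base_) + offset_;
            offset_ += aligned;
            return p;
        }

        void reset() { offset_ = 0; }   // recycle the whole slab between pricing batches

    private:
        void*  base_     = nullptr;
        size_t capacity_ = 0;
        size_t offset_   = 0;
    };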

SLIDE 10

GPU utilisation is low

❑ Move extra code to the GPU, reuse data structures (illustrated below)

                     Pricing time (s)   GPU time (s)   API time (s)   Speedup
Power8/P100           (57.0) 53.0        (41.7) 41.8    (43.0) 43.2     1.18
Power9/V100           (31.1) 26.2        (14.7) 14.7    (16.5) 16.6     1.39

(Figures in parentheses are from the previous step; speedup is relative to the initial vectorised results.)
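
One hypothetical instance of “move extra code to the GPU” (not necessarily the specific change made here): a per-instrument averaging step that would otherwise run on the CPU after copying every per-path payoff back can run as a small reduction kernel, so only one number per instrument leaves the device and the per-path buffer stays resident for reuse by the next batch.

    // Hypothetical example: average per-path payoffs on the device instead of
    // copying numPaths values per instrument back to the host.
    // Launch with one block per instrument and a power-of-two block size, e.g.
    //   averagePayoffs<<<numInstruments, 256, 256 * sizeof(double)>>>(perPath, numPaths, prices);
    __global__ void averagePayoffs(const double* __restrict__ perPath,   // [numInstruments x numPaths]
                                   int numPaths,
                                   double* __restrict__ prices)          // [numInstruments]
    {
        extern __shared__ double partial[];
        long long inst = blockIdx.x;

        double sum = 0.0;
        for (int p = threadIdx.x; p < numPaths; p += blockDim.x)
            sum += perPath[inst * numPaths + p];
        partial[threadIdx.x] = sum;
        __syncthreads();

        for (int s = blockDim.x / 2; s > 0; s >>= 1) {    // tree reduction in shared memory
            if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0) prices[inst] = partial[0] / numPaths;
    }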

SLIDE 11

Single precision

❑ Use single precision for intermediate storage (see the sketch below): more paths fit into GPU memory at a time => further reduction in the associated CPU overhead
❑ Use single precision also for computation of intermediate values

Single-precision storage:

                     Pricing time (s)   GPU time (s)   API time (s)   Speedup
Power8/P100           (53.0) 45.0        (41.8) 38.1    (43.2) 39.1     1.39
Power9/V100           (26.2) 19.6        (14.7) 12.8    (16.6) 14.1     1.86

Single-precision storage and computation:

                     Pricing time (s)   GPU time (s)   API time (s)   Speedup
Power8/P100               35.7               29.1           30.1          1.75
Power9/V100               17.7               11.1           12.4          2.06
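
A minimal illustration of the split (hypothetical, simplified payoff): intermediate path values are stored and processed as float, which halves memory footprint and traffic, while the cross-path accumulation that is most sensitive to rounding is kept in double.

    // Hypothetical illustration: paths stored in float (half the memory and
    // bandwidth of double), per-path intermediate values computed in float,
    // but the cross-path sum accumulated in double to limit rounding error.
    // Note: atomicAdd on double requires sm_60+ (P100/V100).
    __global__ void accumulatePayoffs(const float* __restrict__ paths,   // [numSteps x numPaths], single precision
                                      int numSteps, int numPaths,
                                      float barrier, float coupon,
                                      double* __restrict__ sum)          // double accumulator, pre-zeroed
    {
        int path = blockIdx.x * blockDim.x + threadIdx.x;
        if (path >= numPaths) return;

        float payoff = 0.0f;                                  // intermediate value in single precision
        for (int t = 0; t < numSteps; ++t) {
            if (paths[t * numPaths + path] >= barrier) {
                payoff = coupon * (t + 1);
                break;
            }
        }
        atomicAdd(sum, static_cast<double>(payoff));          // accumulation stays in double
    }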

SLIDE 12

Unified memory

❑ Use host memory to store final prices, leveraging unified memory / NVLink to access it directly from the GPU (see the sketch below)
❑ This frees up GPU memory for computing more paths/parameterisations at a time, reducing the associated CPU overhead
❑ Final speedup, Power 9 / V100 vs. production code (K80): 20x

                     Pricing time (s)   GPU time (s)   API time (s)   Speedup
Power8/P100           (35.7) 40.6        (29.1) 34.2    (30.1) 38.4     1.54
Power9/V100           (17.7) 15.8        (11.1) 10.5    (12.4) 15.7     2.31
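
A minimal sketch of the final-prices buffer in host-backed memory, assuming cudaMallocManaged is used (pinned zero-copy memory via cudaHostAlloc is an alternative); the kernel and sizes are placeholders. The kernel writes prices straight into the managed buffer and the host reads them with no explicit cudaMemcpy; on the Power systems above the GPU reaches this memory over NVLink rather than PCIe.

    #include <cuda_runtime.h>
    #include <cstdio>

    // Placeholder kernel: in the real flow this would be the payoff kernel
    // writing its per-instrument result straight into the managed buffer.
    __global__ void writePrices(double* prices, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) prices[i] = 0.0;
    }

    int main() {
        const int numInstruments = 500000;
        double* prices = nullptr;

        // Managed allocation visible to both CPU and GPU: the per-instrument
        // results no longer occupy ordinary device memory, leaving more room
        // for paths/parameterisations per batch.
        cudaMallocManaged(&prices, numInstruments * sizeof(double));

        writePrices<<<(numInstruments + 255) / 256, 256>>>(prices, numInstruments);
        cudaDeviceSynchronize();          // results readable on the host, no cudaMemcpy needed

        printf("first price: %f\n", prices[0]);
        cudaFree(prices);
        return 0;
    }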

SLIDE 13

Summary

❑ Speedup, Power 9 / V100 vs. production: 20x (code optimizations + hardware)
❑ New use cases and hardware advances require architecture rethinks
❑ Our code is predominantly memory bound
  ❑ V100 and NVLink 2 help
  ❑ Selective single precision works for computations, but the benefit comes mostly from memory throughput and storage
❑ Much more work to do
  ❑ Restructure the code to eliminate CUDA API overheads
  ❑ Optimize kernels for V100
  ❑ Use all 4 GPUs within the node
  ❑ ~30-50x vs. baseline feasible???
  ❑ Benchmark against Intel architecture with V100 (no NVLink to CPU)