Automated Creation of Tests from CUDA Kernels


  1. Automated Creation of Tests from CUDA Kernels
     April 2016
     Oleg Rasskazov, Andrey Zhezherun, Antti Lamberg (JP Morgan)

  2. GPUs in JP Morgan
     - JP Morgan has been using GPUs extensively since 2011 to speed up risk calculations and reduce computational costs.
     - Speedup as of 2011: ~40x
     - Large cross-asset quant library (C++, CUDA)
       - Monte Carlo and PDEs
     - GPU code:
       - Hand-written CUDA kernels
       - Thrust
       - Auto-generated CUDA kernels
     - Hardest part in delivering GPUs to production: bugs

  3. Auto-generating GPU code
     - Porting all of a quant library to GPU is hard.
       - Parts of the code change frequently, so they would need to be rewritten constantly.
       - Domain-specific languages (DSLs) can help: interpreted or compiled.
     - We auto-generate lots of GPU code.
       - The auto-generator is simplistic, converting the DSL to a .cu file (see the sketch below).
       - We rely on the CUDA compiler:
         - for optimizations,
         - for making sense of our horrible auto-generated .cu code.
     - We need a regression test harness around it.
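
A minimal sketch of what such a "simplistic" generator can look like: it splices a DSL expression into a fixed kernel template and leaves all optimization to the CUDA compiler. The template, names, and signature here are illustrative assumptions, not JP Morgan's actual generator.

```cpp
#include <string>
#include <fstream>

// Hypothetical DSL-to-.cu step: the generator only pastes an expression
// (already lowered to valid C, e.g. "fmax(s[i] - k, 0.0)") into a
// fixed elementwise-kernel template.
std::string generate_kernel(const std::string& name,
                            const std::string& dsl_expr) {
    return "extern \"C\" __global__ void " + name +
           "(const double* s, double k, double* out, int n) {\n"
           "    int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
           "    if (i < n) out[i] = " + dsl_expr + ";\n"
           "}\n";
}

int main() {
    // Emit a .cu file and let nvcc do the hard work.
    std::ofstream("payoff.cu") << generate_kernel("payoff",
                                                  "fmax(s[i] - k, 0.0)");
}
```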

  4. (Rare) compiler issues
     - Sources (ways to notice):
       - Driver upgrades
       - SDK upgrades
       - Hardware upgrades
     - Mitigation:
       - Hand-written code: modify the code to work around the issue.
       - Auto-generated code: ???
         - Modifying the generator is hard: complex code, and what about performance and backward compatibility?
     - Share an extensive set of our regression tests with NVIDIA.

  5. Pointers that we have a compiler issue
     - How to verify that the issue is not a bug in your own code (maybe it is, but):
       - Different behaviour on different cards
       - Different behaviour with different versions of CUDA
       - CPU/GPU code match (in some cases)
       - PTX inspection
     - Assume the issue can be reproduced by running a standalone kernel, i.e.
       - no concurrent-execution issues,
       - nothing related to special objects allocated by the driver (stream data, local memory, etc.).
     - => Create a small reproducer.

  6. Creating standalone kernel tests/reproducers
     - Capture:
       - The kernel code (the auto-generated .cu file)
       - The kernel inputs
       - The correct outputs
       - The current GPU memory being operated on by the kernel
         - That memory state was created by a complex interaction of previous kernels and CPU calls.
         - We would like to be very generic at this point.
     - Replay (see the sketch below):
       - Restore the GPU memory.
         - How? cudaMalloc does not let one choose the address range of newly allocated memory.
       - Compile and load the kernel.
       - Pass in the parameters and run.
       - Compare the outputs.
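
A sketch of the replay half using the CUDA driver API. It assumes context setup and the dump format are handled elsewhere, that the captured image can be restored at its original virtual addresses (the tricks on the next slides), and that the kernel takes a single pointer argument; `replay` and its parameters are hypothetical names.

```cpp
#include <cuda.h>
#include <cstring>
#include <vector>

// Replays one captured kernel: restore memory, launch, compare outputs.
// Assumes cuInit/cuCtxCreate have already been called.
bool replay(const char* cubin, const char* kernel_name,
            const std::vector<char>& input_dump,
            const std::vector<char>& expected_output,
            CUdeviceptr base, size_t out_offset) {
    CUmodule mod;
    CUfunction fn;
    cuModuleLoad(&mod, cubin);                 // compiled from the captured .cu
    cuModuleGetFunction(&fn, mod, kernel_name);

    // Restore the captured memory image at its original address `base`
    // (making this possible is the hard part; see the 32/64-bit tricks).
    cuMemcpyHtoD(base, input_dump.data(), input_dump.size());

    void* args[] = { &base };                  // the captured kernel parameter
    cuLaunchKernel(fn, /*grid*/ 1, 1, 1, /*block*/ 256, 1, 1,
                   0, nullptr, args, nullptr);
    cuCtxSynchronize();

    // Compare the region holding the outputs against the captured result.
    std::vector<char> actual(expected_output.size());
    cuMemcpyDtoH(actual.data(), base + out_offset, actual.size());
    return std::memcmp(actual.data(), expected_output.data(),
                       actual.size()) == 0;
}
```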

  7. Why dumping/restoring memory is hard
     - Dump an array from GPU memory:
       - The restored array can be allocated at a different address.
       - That is fine as long as we know all the pointers to the array and can re-point them to the new allocation.
       - But what if we had an array of pointers to objects? Complex data structures?
     - Ideally we want to snapshot/restore the entire current state of GPU memory.
       - There is no public API from NVIDIA for this.
       - The problem is hard because there is "private" driver memory that depends on the kernels loaded, the local memory configuration, etc.
     - We came up with a set of tricks (illustrated on the next slides).
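
A tiny illustration of the core problem: as soon as device memory itself contains a device pointer, a dump is only valid if every allocation is restored at exactly its original address. The example is contrived for brevity.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    double* data;   cudaMalloc(&data, sizeof(double));
    double** table; cudaMalloc(&table, sizeof(double*));
    // Store a device pointer *inside* device memory.
    cudaMemcpy(table, &data, sizeof(double*), cudaMemcpyHostToDevice);

    // "Dump": copy the table back to the host.
    double* dumped;
    cudaMemcpy(&dumped, table, sizeof(double*), cudaMemcpyDeviceToHost);

    // "Restore" in a fresh process: cudaMalloc may return different
    // addresses, so the stored pointer is stale unless `data` lands at
    // exactly the same virtual address again.
    printf("stored device pointer %p is only meaningful if the original\n"
           "allocation is recreated at the same address\n", (void*)dumped);
}
```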

  8. 32-bit GPU code: dumping/restoring memory
     - Assume the GPU memory fits into 2 GB.
     - Intercept GPU memory allocations in your code and replace them with a custom allocator drawing from a preallocated blob:
       - Allocate a 3 GB block of GPU memory, BB.
       - On 32-bit we have a 4 GB address space, so the 3 GB block always covers the virtual address range 1 GB to 3 GB.
       - Custom-allocate starting from the 1 GB mark within BB.
     - Dump all the custom-allocated memory starting from the 1 GB address.
     - Replay simply allocates 3 GB and is again guaranteed to cover the range 1 GB to 3 GB:
       - Simply load the dump starting from the 1 GB address.
       - All internal data pointers are guaranteed to work, since the addresses are exactly the same (see the sketch below).
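
A sketch of the 32-bit scheme, with hypothetical names. The only load-bearing fact is arithmetic: a 3 GB block inside a 4 GB address space must start at or below the 1 GB mark, so the window [1 GB, 3 GB) always lies inside it and the allocator can hand out fixed addresses.

```cpp
#include <cuda_runtime.h>
#include <cassert>
#include <cstddef>
#include <cstdint>

const size_t GB = 1u << 30;

char* blob;   // the preallocated 3 GB block BB
char* next;   // bump pointer of the custom allocator

void init_allocator() {
    cudaMalloc(&blob, 3 * GB);
    // A 3 GB block in a 4 GB address space must start at or below 1 GB,
    // so it always contains the fixed window [1 GB, 3 GB).
    assert((uintptr_t)blob <= GB);
    next = (char*)(uintptr_t)GB;             // allocate from the fixed window
}

// Replaces cudaMalloc inside the application.
void* custom_alloc(size_t bytes) {
    void* p = next;
    next += (bytes + 255) & ~(size_t)255;    // keep allocations aligned
    return p;
}

// Snapshot everything handed out so far; restore is the mirror image,
// copying the saved bytes back to the identical fixed addresses.
size_t dump_to(void* host_buf) {
    size_t used = next - (char*)(uintptr_t)GB;
    cudaMemcpy(host_buf, (void*)(uintptr_t)GB, used, cudaMemcpyDeviceToHost);
    return used;
}
```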

  9. 64-bit GPU code: dumping memory
     - Assume the GPU memory used by the kernel fits into 1 GB.
     - Intercept GPU memory allocations in your code and replace them with a custom allocator drawing from a preallocated blob.
     - Assume we do not store pointers to GPU memory back in CPU memory.
     - Run 1:
       - Allocate a 2 GB block of GPU memory, BB.
       - BB contains at least one 1 GB range, M, starting on a 1 GB boundary.
       - Use the custom allocator starting from M and, just before running the kernel, dump the GPU memory: BB_M.
     - Run 2:
       - Repeat run 1, but with the 1 GB address range starting at N, N != M, and dump the GPU memory: BB_N.
     - (See the capture sketch below.)
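
A sketch of one capture run under the slide's assumptions (the kernel's working set fits in 1 GB; no device pointers live on the CPU side). How the second run is forced onto a different window N is not spelled out on the slide; one would retry or over-allocate until the window differs from the recorded M.

```cpp
#include <cuda_runtime.h>
#include <cstdint>
#include <vector>

const size_t GB = 1ull << 30;

struct Window {
    char* base;   // start of the 1 GB-aligned, 1 GB-sized window inside BB
    char* next;   // bump pointer of the custom allocator
};

Window open_window() {
    char* bb;
    cudaMalloc(&bb, 2 * GB);
    // Any 2 GB block contains at least one 1 GB range starting on a
    // 1 GB boundary; round up to that boundary to find it.
    char* aligned = (char*)(((uintptr_t)bb + GB - 1) & ~(uintptr_t)(GB - 1));
    // Both runs must start from identical (zeroed) memory so that the
    // dumps differ only where real pointers are stored.
    cudaMemset(bb, 0, 2 * GB);
    return { aligned, aligned };
}

// Taken just before the kernel runs; done once with the window at M
// (giving BB_M) and once at N (giving BB_N).
std::vector<char> dump(const Window& w) {
    std::vector<char> image(w.next - w.base);
    cudaMemcpy(image.data(), w.base, image.size(), cudaMemcpyDeviceToHost);
    return image;
}
```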

  10. 64-bit GPU code: restoring memory
     - Allocate a 2 GB block of GPU memory, BB.
     - Find a 1 GB range, P, starting on a 1 GB boundary.
     - Assume the code paths of run 1 and run 2 of the application are deterministic and identical.
     - Assume the preallocated BB was zeroed in both runs.
     - Relocate BB_N's addresses into P (see the sketch below):
       - Unless non-linear address arithmetic was involved, the dumps BB_N and BB_M differ only where GPU memory stores pointers to GPU memory, the difference being N - M.
       - The size of the difference can be used to validate our assumptions about the dumps.
       - Starting from the BB_N dump, replace the differing words (i.e., addresses into the Nth GB) with addresses into the Pth GB.
     - Now we are ready to run the kernel.
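
A sketch of the relocation pass. It assumes device pointers are stored 8-byte aligned, so the two dumps can be compared as 64-bit words; any difference other than exactly N - M means the determinism assumption failed and the capture should be discarded.

```cpp
#include <cstddef>
#include <cstdint>

// dump_m / dump_n: the two images captured with windows at M and N.
// Patches dump_n in place so embedded pointers refer to window P instead.
// Returns the number of relocated words (a sanity check on the dump),
// or SIZE_MAX if a difference other than N - M shows up.
size_t relocate(const uint64_t* dump_m, uint64_t* dump_n, size_t words,
                uint64_t M, uint64_t N, uint64_t P) {
    size_t patched = 0;
    for (size_t i = 0; i < words; ++i) {
        if (dump_n[i] == dump_m[i]) continue;     // plain data, identical
        if (dump_n[i] - dump_m[i] != N - M)       // not an N-vs-M pointer:
            return SIZE_MAX;                      // assumptions violated
        dump_n[i] = dump_n[i] - N + P;            // rebase pointer into P
        ++patched;
    }
    return patched;
}
```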

  11. Summary
     - If a number of preconditions hold, we can automatically create standalone CUDA test cases out of our auto-generated kernels.
     - Surprisingly, the preconditions hold for us roughly 99% of the time.
     - 100 GB worth of standalone tests (uncompressed) from a snapshot of our production:
       - 64-bit GPU code
       - Based on hundreds of trades
       - Can be shipped outside of JP Morgan without sharing the proprietary quant library.
     - Refreshed tests are to be shipped to NVIDIA (pending internal clearance).
