
Accelerating Financial Applications on the GPU
Scott Grauer-Gray, William Killian, Robert Searles, John Cavazos
Department of Computer and Information Science, University of Delaware
Sixth Workshop on General Purpose Processing Using GPUs

Outline
1 Introduction
  - QuantLib and Financial Applications
  - Directive-Based Acceleration
2 Experiment Setup
  - Source Code Modifications
  - Compilation
  - Execution Environment
3 Application Results
  - NVIDIA K20 Results
4 Auto-Tuning
  - Framework
  - Results
  - Alternate Architectures
5 Conclusion
  - Future Work
  - Final Notes

QuantLib
- Open-source library for quantitative finance
- Written in C++
- Contains various financial models and methods
  - Models: yield curves, interest rates, volatility
  - Methods: analytic formulae, finite difference, Monte-Carlo
- The financial applications optimized are particular code paths in QuantLib

Financial Applications
- Four financial applications selected for parallelization:
  Application    Description                                                          Precision
  Black-Scholes  Option pricing using Black-Scholes-Merton pricing                    Single
  Monte-Carlo    Pricing of a single option using QMB (Sobol) Monte-Carlo method      Single
  Bonds          Bond pricing using a fixed-rate bond with a flat forward-curve       Double
  Repo           Repurchase agreement pricing of securities which are sold and        Double
                 bought back later
- Each application is data-parallelized
- Algorithm for each application is parallelized where possible

Overview on Directive-Based Acceleration
- Syntax comparable to OpenMP
- Annotates what code should run on an accelerator
- Focuses on highlighting parallelism of code
- Preserves serial implementation of code
- Simplifies interaction between scientists and programmers

Directive-Based Programming Languages
- HMPP
  - Originally developed by CAPS Entreprise
  - Fundamental execution unit is a codelet
  - Provides fine-grain control for optimizations
- OpenACC
  - Joint collaboration between CAPS Entreprise, CRAY, PGI, and NVIDIA
  - Directive syntax near identical to OpenMP with added data clauses
  - Introduces a kernel directive that drives compiler-assisted parallelization
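To make the directive style concrete, here is a minimal sketch of an OpenACC-annotated loop in C. The function call_payoffs and its arguments are hypothetical and do not come from the talk. The pragma uses the standard OpenACC parallel loop construct with copyin and copyout data clauses. A compiler without OpenACC support ignores the pragma and runs the loop serially, which is how the serial implementation is preserved.

```c
#include <assert.h>
#include <stddef.h>

/* Payoff at expiry of n call options that share one strike.
   Hypothetical example function, not taken from the talk. */
void call_payoffs(size_t n, float strike, const float *spot, float *payoff)
{
    assert(spot && payoff);
    /* copyin: spot is only read on the device.
       copyout: payoff is only written on the device. */
    #pragma acc parallel loop copyin(spot[0:n]) copyout(payoff[0:n])
    for (size_t i = 0; i < n; ++i)
        payoff[i] = spot[i] > strike ? spot[i] - strike : 0.0f;
}
```

The loop body is unchanged from what a serial version would contain; only the annotation and its data clauses are added.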

Source Code Modifications
- Implementations derived from sequential C code
  - Code flattening: QuantLib C++ code paths ⇒ sequential C code
  - Argument passing: Structure of Arrays
- Verification: compared all results to original QuantLib code
  - All results agreed to within three digits of precision (10^-3)
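The two ideas on this slide, Structure of Arrays argument passing and verification against reference results, can be sketched as follows. The struct layouts, the field names, and the helpers aos_to_soa and results_match are assumptions for illustration; the talk does not show its actual data layout. The 10^-3 tolerance is the one stated above.

```c
#include <assert.h>
#include <stddef.h>

/* Array of Structures: one record per option (hypothetical fields). */
struct option_aos {
    float spot;
    float strike;
};

/* Structure of Arrays: one contiguous array per field, so consecutive
   loop iterations read consecutive memory addresses. */
struct options_soa {
    size_t n;      /* on input: capacity of the arrays; on output: count */
    float *spot;
    float *strike;
};

/* Copy n records into caller-provided SoA storage. */
void aos_to_soa(const struct option_aos *in, size_t n, struct options_soa *out)
{
    assert(in && out && out->spot && out->strike && out->n >= n);
    for (size_t i = 0; i < n; ++i) {
        out->spot[i] = in[i].spot;
        out->strike[i] = in[i].strike;
    }
    out->n = n;
}

/* Return 1 if every candidate value is within tol of its reference value,
   0 otherwise. A NaN difference counts as a mismatch. */
int results_match(const float *reference, const float *candidate,
                  size_t n, float tol)
{
    for (size_t i = 0; i < n; ++i) {
        float diff = reference[i] - candidate[i];
        if (diff < 0.0f)
            diff = -diff;
        if (!(diff <= tol))
            return 0;
    }
    return 1;
}
```

A check in the spirit of the slide would call results_match(reference, accelerated, n, 1e-3f) after each accelerated run.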

Code Flattening
    // C++ code:
    struct C {
      int x;
      void addFour() { x += 4; }
    };
    struct B {
      C myObj;
      virtual void foo() = 0;
    };
    struct A : public B {
      virtual void foo() {
        myObj.addFour();
      }
    };
    A inst;
    inst.foo();

    // Flattened code:
    int inst_x;
    inst_x += 4;

    // Alternative flattening:
    int addFour (int x) { return x + 4; }
    int inst_x;
    inst_x = addFour (inst_x);

Compilation
- Host code compiled with GCC 4.7.0
  - Serial: -O2 flag
  - OpenMP: -O3 -march=native flags
- OpenACC and HMPP compiled with HMPP Workbench 3.2.1
- CUDA compiled with CUDA 5 Toolkit
- OpenCL used NVIDIA driver version 304.54

Compile Workflow Using HMPP Workbench
- HMPP Workbench used for HMPP and OpenACC code compilation
- Target CUDA and OpenCL code generation

Execution Environment
- CPU: Dual Xeon X5530 (Quad-Core @ 2.40GHz) with 24GB DDR3-1066 ECC RAM
- GPU: NVIDIA K20c (2496 CUDA Cores @ 706MHz) with 5GB GDDR5 2.6GHz ECC RAM
- NOTE: Also ran all experiments on NVIDIA C2050
- Auto-Tuning Targets:
  NVIDIA GPU      Architecture    CUDA Cores
  NVIDIA C1060    Tesla           240
  NVIDIA C2050    Fermi           448
  NVIDIA GTX 670  Kepler GK104    1344
  NVIDIA K20c     Kepler GK110    2496

Black-Scholes: K20 Results
[Figure: two panels, "CUDA Results" and "OpenCL Results". Each plots speedup over sequential (log scale) against number of options for OpenMP, the manual GPU version (CUDA or OpenCL), HMPP, and OpenACC.]

Black-Scholes: K20 Results
- CUDA outperformed OpenCL on NVIDIA K20
  - 461x speedup for CUDA
  - 446x speedup for OpenCL
- HMPP and OpenACC targeting the same language achieved near-identical speedup
- HMPP and OpenACC targeting OpenCL were faster than when targeting CUDA
  - 369x speedup for CUDA
  - 380x speedup for OpenCL

Monte-Carlo: K20 Results
[Figure: two panels, "CUDA Results" and "OpenCL Results". Each plots speedup over sequential (log scale) against number of samples for OpenMP, the manual GPU version (CUDA or OpenCL), HMPP, and OpenACC.]
- Random number generation:
  - C/OpenMP: rand
  - CUDA: cuRAND
  - HMPP/OpenACC/OpenCL: Mersenne Twister
- Dropoff in speedup for CUDA ⇒ cache misses

Monte-Carlo: K20 Results
- Manual CUDA outperformed manual OpenCL
  - Up to 1006x vs 180x
- HMPP and OpenACC performed similarly
- Targeting CUDA was faster than targeting OpenCL
  - Up to 162x vs up to 130x

Bonds and Repo: K20 Results
[Figure: two panels, "Bonds (CUDA)" and "Repo (CUDA)". They plot speedup over sequential against number of bonds and number of repos, respectively, for OpenMP, CUDA, HMPP, and OpenACC.]
- Problem: generating OpenCL code from HMPP and OpenACC

Bonds and Repo: K20 Results
- Bonds: up to 87.9x speedup
- Repo: up to 94x speedup
- HMPP and OpenACC versions produced near-identical execution times
- HMPP and OpenACC versions ran within 2% of the execution time of manually written CUDA
- Speedup flattened as problem size increased beyond 100,000 bonds and 2,000,000 repos
