

SLIDE 1

GPU Accelerated Machine Learning for Bond Price Prediction

Venkat Bala Rafael Nicolas Fermin Cota

SLIDE 2

Motivation

Primary Goals

  • Demonstrate potential benefits of using GPUs over CPUs for machine learning
  • Exploit inherent parallelism to improve model performance
  • Real-world application using a bond trade dataset

SLIDE 3

Highlights

Ensemble

  • Bagging: Train independent regressors on equal-sized bags of samples
  • Generally, performance is superior to any single individual regressor
  • Scalable: Each individual model can be trained independently and in parallel

Hardware Specifications

  • CPU: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
  • GPU: GeForce GTX 1080 Ti
  • RAM: 1 TB (DDR4, 2400 MHz)

SLIDE 4

Bond Trade Dataset

Feature Set

  • 100+ features per trade
  • Trade Size/Historical Features
  • Coupon Rate/Time to Maturity
  • Bond Rating
  • Trade Type: Buy/Sell
  • Reporting Delays
  • Current Yield/Yield To Maturity

Response

  • Trade Price

SLIDE 5

Modeling Approach

SLIDE 6

The Machine Learning Pipeline

[Diagram: data processing → training set + CV/test set → model building → evaluate → deploy]

Accelerate each stage in the pipeline for maximum performance

SLIDE 7

Data Preprocessing

Exposing Data Parallelism

  • Important stage in the pipeline (garbage in → garbage out)
  • Many models rely on input data being on the same scale
  • Standardization, log transformations, imputations, polynomial/non-linear feature generation, etc.
  • In most cases there is no data dependence, so each operation can be executed independently
  • Significant speedups can be obtained using GPUs, given sufficient data/computation

SLIDE 8

Data Preprocessing: Sequential Approach

Apply function F(·) sequentially to each element in a feature column

[Diagram: F(·) applied one element at a time to a0, a1, …, aN]

SLIDE 9

Data Preprocessing: Parallel Approach

Apply function F(·) in parallel to each element in a feature column

[Diagram: independent instances of F(·) map a0, a1, …, aN to b0, b1, …, bN simultaneously]

SLIDE 10

Programming Details

Implementation Basics

  • Task is embarrassingly parallel
  • Improve CPU code performance:
      • Auto-vectorization + compiler optimizations
      • Using performance libraries (Intel MKL)
      • Adopting threaded (OpenMP)/distributed computing (MPI) approaches
  • Great application case for GPUs:
      • Offload computations onto the GPU via CUDA kernels
      • Launch as many threads as there are data elements
      • Launch several kernels concurrently using CUDA streams

SLIDE 11

Toy Example: Speedup Over Sequential C++

  • Log transformation of an array of floats
  • N = 2^p elements, p = log2(N)

[Figure: speedup over sequential C++ vs. p (18–23), for vectorized C++ and CUDA; y-axis 2–10x]

SLIDE 12

Bond Dataset Preprocessing

Applied Transformations

  • Log transformation of highly skewed features (Trade Size, Time to Maturity)
  • Standardization (Trade Price & historical prices)
  • Missing value imputation
  • Winsorizing features to handle outliers
  • Feature generation (Price differences, Yield measurements)

Implementation Details

  • CPU: C++ implementation using Intel MKL/Armadillo
  • GPU: CUDA

SLIDE 13

GPU Speedup over CPU Implementation

  • Nearly 10x speedup obtained after CUDA optimizations

[Figure: speedup over CPU vs. p (20–25), for unoptimized and optimized CUDA; y-axis 2–10x]

SLIDE 14

CUDA Optimizations

Standard Tricks

  • Concurrent kernel execution using CUDA streams to maximize GPU utilization
  • Use of optimized libraries such as cuBLAS/Thrust
  • Coalesced memory access
  • Maximizing memory bandwidth for low arithmetic-intensity operations
  • Caching using GPU shared memory

SLIDE 15

Model Building

SLIDE 16

Ensemble Model

Model Choices

  • GBT: XGBoost, DNN: TensorFlow/Keras

[Diagram: ensemble model combining multiple GBT models and a DNN]

SLIDE 17

Hyperparameter Tuning: Hyperopt

GBT: XGBoost

  • Learning Rate
  • Max depth
  • Minimum child weight
  • Subsample, Colsample-bytree
  • Regularization parameters

DNN: MLPs

  • Learning Rate/Decay Rate
  • Batch Size
  • Epochs
  • Hidden layers/Layer width
  • Activations/Dropouts

SLIDE 18

Hyperparameter Tuning: Hyperopt

[Figure: learning-rate values proposed by Hyperopt over 1000 iterations; y-axis 0.0–1.0]

SLIDE 19

XGBoost: Training & Hyperparameter Optimization Time

[Figure: average training time in hours, GPU vs. CPU]

GBT speedup ≈ 3x (GTX 1080 Ti vs. Intel(R) Xeon(R) E5-2699, 32 cores)

SLIDE 20

TensorFlow/Keras Time Per Epoch

[Figure: time per epoch (s) vs. p (15–18); speedup ≈ 3x for the GTX 1080 Ti over the Intel(R) Xeon(R) E5-2699, 32 cores]

SLIDE 21

Model Test Set Performance

[Figure: predicted vs. actual (valid) trade prices on the test set, range ≈ 20–160]

Test set R^2: 0.9858

SLIDE 22

Summary

SLIDE 23

Summary

Final Remarks

  • Leveraging GPU compute power → dramatic speedups
  • Maximum performance when GPUs are incorporated into every stage of the pipeline
  • Ensembles: bagging/boosting to improve model accuracy/throughput
  • Shorter training times allow more experimentation
  • Extensive support available
  • Deploying this pipeline now on our in-house DGX-1

SLIDE 24

Questions?