SLIDE 1
GPU Accelerated Machine Learning for Bond Price Prediction
Venkat Bala, Rafael Nicolas Fermin Cota
SLIDE 2 Motivation
Primary Goals
- Demonstrate potential benefits of using GPUs over CPUs for machine learning
- Exploit inherent parallelism to improve model performance
- Real world application using a bond trade dataset
SLIDE 3 Highlights
Ensemble
- Bagging: Train independent regressors on equal-sized bags of samples
- Generally, performance is superior to any single individual regressor
- Scalable: Each individual model can be trained independently and in parallel
Hardware Specifications
- CPU: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
- GPU: GeForce GTX 1080 Ti
- RAM: 1 TB (DDR4, 2400 MHz)
SLIDE 4 Bond Trade Dataset
Feature Set
- 100+ features per trade
- Trade Size/Historical Features
- Coupon Rate/Time to Maturity
- Bond Rating
- Trade Type: Buy/Sell
- Reporting Delays
- Current Yield/Yield To Maturity
Response
- Bond trade price (prediction target)
SLIDE 5
Modeling Approach
SLIDE 6
The Machine Learning Pipeline
[Pipeline diagram: Data Processing → Training Set / CV-Test Set → Model Building → Evaluate → Deploy]
Accelerate each stage in the pipeline for maximum performance
SLIDE 7 Data Preprocessing
Exposing Data Parallelism
- Important stage in the pipeline (garbage in → garbage out)
- Many models rely on input data being on the same scale
- Standardization, log transformations, imputations, polynomial/non-linear feature generation, etc.
- In most cases there is no data dependence, so each operation can be executed independently
- Significant speedups can be obtained using GPUs, given sufficient data/computation
SLIDE 8
Data Preprocessing: Sequential Approach
Apply function F (·) sequentially to each element in a feature column
[Diagram: F(·) applied one element at a time to a0, a1, ..., aN]
SLIDE 9
Data Preprocessing: Parallel Approach
Apply function F (·) in parallel to each element in a feature column
[Diagram: F(·) applied concurrently to each of a0, a1, ..., aN, producing b0, b1, ..., bN]
SLIDE 10 Programming Details
Implementation Basics
- Task is embarrassingly parallel
- Improve CPU code performance
- Auto-vectorization + compiler optimizations
- Using performance libraries (Intel MKL)
- Adopting Threaded (OpenMP)/Distributed computing (MPI) approaches
- Great application case for GPUs
- Offload computations onto the GPU via CUDA kernels
- Launch as many threads as there are data elements
- Launch several kernels concurrently using CUDA streams
SLIDE 11 Toy Example: Speedup Over Sequential C++
- Log transformation of an array of floats
- N = 2^p elements, p = log2(N)
[Plot: speedup over sequential C++ vs. p (18-23) for vectorized C++ and CUDA]
SLIDE 12 Bond Dataset Preprocessing
Applied Transformations
- Log transformation of highly skewed features (Trade Size, Time to Maturity)
- Standardization (Trade Price & historical prices)
- Missing value imputation
- Winsorizing features to handle outliers
- Feature generation (Price differences, Yield measurements)
Implementation Details
- CPU: C++ implementation using Intel MKL/Armadillo
- GPU: CUDA
SLIDE 13 GPU Speedup over CPU implementation
- Nearly 10x speedup obtained after CUDA optimizations
[Plot: speedup over CPU vs. p (20-25) for unoptimized and optimized CUDA]
SLIDE 14 CUDA Optimizations
Standard Tricks
- Concurrent kernel execution using CUDA streams to maximize GPU utilization
- Use of optimized libraries such as cuBLAS/Thrust
- Coalesced memory access
- Maximizing memory bandwidth for operations with low arithmetic intensity
- Caching using GPU shared memory
SLIDE 15
Model Building
SLIDE 16 Ensemble Model
Model Choices
- GBT: XGBoost, DNN: TensorFlow/Keras
[Diagram: ensemble model combining GBT models and a DNN]
SLIDE 17 Hyperparameter Tuning: Hyperopt
GBT: XGBoost
- Learning Rate
- Max depth
- Minimum child weight
- Subsample, Colsample-bytree
- Regularization parameters
DNN: MLPs
- Learning Rate/Decay Rate
- Batch Size
- Epochs
- Hidden layers/Layer width
- Activations/Dropouts
SLIDE 18
Hyperparameter Tuning: Hyperopt
[Plot: sampled learning rate (0.0-1.0) over 1000 Hyperopt iterations]
SLIDE 19 XGBoost: Training & Hyperparameter Optimization Time
- GBT speedup ≈ 3x: GTX 1080 Ti vs. Intel(R) Xeon(R) E5-2699 (32 cores)
[Bar chart: GPU vs. CPU training & hyperparameter optimization time]
SLIDE 20
TensorFlow/Keras Time Per Epoch
- Speedup ≈ 3x: GTX 1080 Ti vs. Intel(R) Xeon(R) E5-2699 (32 cores)
[Plot: time per epoch (s) vs. p (15-18)]
SLIDE 21
Model Test Set Performance
[Scatter plot: predicted vs. actual trade prices (range 20-160)]
Test set R²: 0.9858
SLIDE 22
Summary
SLIDE 23 Summary
Final Remarks
- Leveraging GPU compute power → dramatic speedups
- Maximum performance when GPUs are incorporated into every stage of the pipeline
- Ensembles: bagging/boosting to improve model accuracy/throughput
- Shorter training times allow more experimentation
- Extensive library and tooling support available
- Deploying this pipeline on our in-house DGX-1
SLIDE 24
Questions?