High-Performance Data Loading and Augmentation for Deep Neural Network Training
Trevor Gale tgale@ece.neu.edu Steven Eliuk steven.eliuk@gmail.com Cameron Upright c.upright@samsung.com
Roadmap: 1. The General-Purpose Acceleration
dMath: a distributed math library used to accelerate popular machine learning frameworks.
Samsung Advanced Learning v1.0. Goal: design and build software and hardware infrastructure to accelerate machine learning and mathematical workloads, specifically through the use of many-GPU, distributed systems.
Custom GPU clusters used across Samsung
Expresso (topic for today)
[Slide diagram] dMath components: RDMA transfers, pass computation, runtime (topic for today), convolution. (See EDMNN@NIPS and the GTC 2016 talk.)
Previous batch computation
[Figure] Peak training speed (frames/second, bars) vs. batch size for AlexNet on 8 GPUs. Dotted lines mark peak data-loading speeds with 5 threads per GPU.
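The figure's point is that data loading can cap training throughput: workers decode and augment batches on the CPU while the GPUs train. A minimal sketch of that producer/consumer structure, using Python threads and a bounded queue (illustrative only, not dMath/Expresso code; the function names and shapes are assumptions):

```python
import queue
import threading

import numpy as np

def make_loader(num_threads, batches_per_thread, batch_shape=(8, 3, 32, 32)):
    # Toy prefetching loader: worker threads stand in for decode + augment
    # work on the CPU and push finished batches into a bounded queue,
    # overlapping data loading with training on the GPU.
    q = queue.Queue(maxsize=2 * num_threads)  # bounded queue applies back-pressure

    def worker():
        for _ in range(batches_per_thread):
            batch = np.zeros(batch_shape, dtype=np.uint8)  # placeholder for real I/O
            q.put(batch)

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(num_threads)]
    for t in threads:
        t.start()
    return q

# The training loop consumes batches as workers finish them; if the workers
# cannot keep up, q.get() blocks and the GPUs stall -- the dotted lines above.
loader = make_loader(num_threads=5, batches_per_thread=2)
batches = [loader.get() for _ in range(10)]
```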
Source: developer.nvidia.com/cudnn
(Note: we refer to the stage of the pipeline at which the data is moved to the GPU as the transfer index.)
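The transfer index can be pictured as a split point in the list of pipeline stages: everything before it runs on the CPU, the batch is then copied to the GPU, and the remaining stages run there. A small sketch of this idea (the stage names are illustrative, not dMath's API):

```python
def split_pipeline(stages, transfer_index):
    # Stages before the transfer index run on the CPU; the batch is then
    # moved to the GPU and the remaining stages run there.
    return stages[:transfer_index], stages[transfer_index:]

stages = ["decode", "crop", "mirror", "mean_subtract"]
cpu_stages, gpu_stages = split_pipeline(stages, transfer_index=2)
# cpu_stages == ["decode", "crop"]; gpu_stages == ["mirror", "mean_subtract"]
```

Moving the transfer index earlier shifts augmentation work onto the GPU but copies a larger (uncropped, uint8) batch; moving it later keeps the copy small but loads the CPU.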
[Diagram] Augmentation pipeline stages: Mean Subtraction (float), Mirror (uint8), Crop (uint8); each stage can be placed on the CPU or the GPU.
[Figure panels] AlexNet on 8 M40 GPUs; AlexNet on 8 P100 GPUs.
Note: because data augmentation can be performed on the GPU, it also benefits from the speedups of newer GPUs.
GoogLeNet on 8 M40 GPUs
Expresso AlexNet training speed compared to BVLC Caffe (12/09/16): a 2.12x speedup on average.