Distributed Deep Learning
Mathew Salvaris
What will be covered:
- Overview of distributed training
- What affects distributed training: network, model, data location, data format

Deep Learning Model (CNN)
[Figure: image classification CNN. RGB channels feed convolution layers with kernels, then pooling layers, then a fully connected layer; the penultimate layer sits just before the class outputs Cat, Dog, Mouse]
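A minimal Keras sketch of the architecture in the figure follows; the layer counts, filter sizes, and 224x224 input are illustrative assumptions, not values from the slides.

```python
# Minimal sketch of the CNN in the figure (layer sizes are assumptions).
from tensorflow.keras import layers, models

model = models.Sequential([
    # RGB input: 3 channels
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(224, 224, 3)),
    layers.MaxPooling2D((2, 2)),                   # pooling layer
    layers.Conv2D(64, (3, 3), activation="relu"),  # convolution layer with kernels
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),          # penultimate, fully connected layer
    layers.Dense(3, activation="softmax"),         # class scores: Cat, Dog, Mouse
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```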
Data parallelism
[Diagram: the dataset is split into subsets; Worker 1 and Worker 2 each hold a full replica of the CNN model and train on their own subset (Subset 1, Subset 2), coordinated by a job manager]

Model parallelism
[Diagram: the CNN model itself is split across Worker 1 and Worker 2, which process the same data (Subset 1), coordinated by a job manager]
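To make the data-parallel pattern concrete, here is a minimal sketch using PyTorch DistributedDataParallel. The slides do not name a specific library, so the launcher (torchrun), the NCCL backend, and the hyperparameters below are assumptions.

```python
# Sketch of data parallelism with PyTorch DistributedDataParallel.
# Launch with torchrun (one process per GPU); single node assumed,
# so the process rank doubles as the GPU index.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def train(model, dataset, epochs=1):
    dist.init_process_group("nccl")        # reads rank/world size from env vars
    rank = dist.get_rank()
    model = DDP(model.cuda(rank), device_ids=[rank])  # full replica per worker
    sampler = DistributedSampler(dataset)  # each worker sees its own subset
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.CrossEntropyLoss()
    for epoch in range(epochs):
        sampler.set_epoch(epoch)           # reshuffle shards every epoch
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x.cuda(rank)), y.cuda(rank))
            loss.backward()                # gradients are all-reduced here
            opt.step()
    dist.destroy_process_group()
```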
Benchmark setup:
- 345 experiments across many different models, including ResNet50, MobileNet V2, etc.
- Synthetic data
- Batch size remains 64 across all models and GPUs
- Uses the benchmarking scripts from TensorFlow
[Diagram: compute pool]

[Charts: images/second on a single GPU and on 32 GPUs, including MobileNet]
Scaling efficiency comes down to the ratio between the time it takes to transfer weights between GPUs and the time it takes to process a batch on the GPU.
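A back-of-the-envelope model of that ratio, as a sketch; the slides give no formula, and the timings below are made up for illustration.

```python
# Rough model: if weight transfer is not overlapped with compute,
# each distributed step costs batch time + transfer time.
def scaling_efficiency(batch_time_s, transfer_time_s):
    """Fraction of ideal per-GPU throughput retained."""
    return batch_time_s / (batch_time_s + transfer_time_s)

# Illustrative numbers: 100 ms per batch, 80 ms to transfer weights.
print(scaling_efficiency(0.100, 0.080))  # ~0.56, i.e. roughly 56% efficiency
```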
ResNet50 Full Precision vs Mixed Precision [32 V100s]
- Full precision [batch size 64]: 6,629 images/second, 54% scaling efficiency
- Mixed precision [batch size 256]: 23,436 images/second, 82% scaling efficiency
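The mixed-precision result above also used a larger batch size, which fp16 activations make room for. As a sketch of how mixed precision is typically enabled (the slides do not show the code used; torch.cuda.amp is one option among several):

```python
# Mixed-precision training sketch with torch.cuda.amp.
# The Linear layer is a stand-in for ResNet50; shapes are illustrative.
import torch

model = torch.nn.Linear(1024, 10).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()       # scales the loss to avoid fp16 underflow

def step(x, y):
    opt.zero_grad()
    with torch.cuda.amp.autocast():        # eligible ops run in float16
        loss = torch.nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(opt)                       # unscales gradients, then steps
    scaler.update()

step(torch.randn(256, 1024).cuda(), torch.randint(0, 10, (256,)).cuda())
```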
Storage benchmark setup:
- ResNet50 across three frameworks [PyTorch, TensorFlow, Keras]
- Real and synthetic data; real data on local, NFS, and Blob storage
- Batch size remains 64 across all configurations
- Uses V100 GPUs
[Diagrams: compute pool reading training data from an NFS share, from mounted Blob storage, and from a mounted fileshare, with the option of copying data locally]
ResNet50 - Relative performance across storage
[Chart: relative performance (0 to 1) for TensorFlow, Keras, and PyTorch across Synthetic, Local (SSD), NFS, Premium Blob, and Blob storage]
Data loaders:
- Simple loaders expose no parameters for buffering or parallelizing
- PyTorch: specify the number of loader worker processes with num_workers (sketched below)
- TensorFlow: tune the tf.data pipeline (sketched below); see https://www.tensorflow.org/guide/performance/datasets
- TFRecords are efficient for transferring and reading in the cloud
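Minimal sketches of the two loader knobs above; the dataset, file pattern, and feature schema are hypothetical stand-ins.

```python
# PyTorch: parallelize data loading with num_workers.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 3, 224, 224),   # stand-in images
                        torch.randint(0, 3, (1024,)))     # stand-in labels
loader = DataLoader(dataset, batch_size=64, shuffle=True,
                    num_workers=4,      # loader processes running in parallel
                    pin_memory=True)    # faster host-to-GPU copies

# TensorFlow: parallel parsing and prefetching per the tf.data guide.
import tensorflow as tf

filenames = tf.data.Dataset.list_files("data/*.tfrecord")  # hypothetical path
ds = (tf.data.TFRecordDataset(filenames, num_parallel_reads=4)
      .map(lambda rec: tf.io.parse_single_example(
               rec, {"image": tf.io.FixedLenFeature([], tf.string),
                     "label": tf.io.FixedLenFeature([], tf.int64)}),
           num_parallel_calls=tf.data.experimental.AUTOTUNE)
      .shuffle(10_000)
      .batch(64)
      .prefetch(tf.data.experimental.AUTOTUNE))
```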
ResNet50 – Data Format Performance [Average]
[Chart: average images/second at 8, 16, and 32 GPUs for Synthetic, Images, and TFRecords]
ResNet50 – Data Format Performance [Maximum]
[Chart: maximum images/second at 8, 16, and 32 GPUs for Synthetic, Images, and TFRecords]
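TFRecords pack many small image files into a few large sequential files, which is what makes them read well from cloud storage. A sketch of writing one follows; the file names and labels are hypothetical.

```python
# Pack images into a TFRecord file (paths and labels are hypothetical).
import tensorflow as tf

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

with tf.io.TFRecordWriter("train-00000.tfrecord") as writer:
    for path, label in [("cat1.jpg", 0), ("dog1.jpg", 1)]:  # hypothetical files
        with open(path, "rb") as f:
            example = tf.train.Example(features=tf.train.Features(feature={
                "image": _bytes_feature(f.read()),   # raw encoded JPEG bytes
                "label": _int64_feature(label),
            }))
        writer.write(example.SerializeToString())
```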
Not covered here:
- Asynchronous distributed training
- Tradeoff between batch size and other parameters
- Optimization of the TensorFlow pipeline
- Other data formats such as Parquet (Petastorm)
- Transform libraries [albumentations]
- Distributed file systems (BeeGFS) and other storage (GlusterFS, Lustre, etc.)
- Models other than CNNs
Takeaways:
- Do try to use enhanced networking wherever possible, especially for the latest GPUs
- Training small models with distributed training is not recommended
- Do use TFRecords or other columnar or row-based data formats
- Not all data loaders are equal