

SLIDE 1

Distributed Deep Learning

Mathew Salvaris

SLIDE 2

What will be covered

  • Overview of Distributed Training
  • What affects distributed training
  • Network
  • Model
  • Data location
  • Data format
SLIDE 3

Deep Learning Model (CNN)

[Figure: an RGB input image passes through convolution layers with kernels, pooling layers and a fully connected layer; the penultimate layer feeds the output classes Cat, Dog and Mouse.]

SLIDE 4

Distributed training mode: Data parallelism

[Diagram: the dataset is split into subsets; each worker (Worker 1, Worker 2) holds a full copy of the CNN model and trains on its own subset, coordinated by a job manager.]
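In practice, each worker runs the same training script in its own process and gradients are averaged across workers after every batch. A minimal sketch of this pattern using PyTorch DistributedDataParallel (used here only as an illustration; the benchmarks in this deck use Horovod and the TensorFlow benchmark scripts), with a toy CNN and random tensors standing in for the real model and dataset:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dist.init_process_group(backend="nccl")                   # one process per GPU, e.g. launched with torchrun
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# Synthetic stand-in for the dataset: 1024 random 224x224 RGB images, 3 classes
dataset = TensorDataset(torch.randn(1024, 3, 224, 224), torch.randint(0, 3, (1024,)))
sampler = DistributedSampler(dataset)                     # each worker sees only its own subset
loader = DataLoader(dataset, batch_size=64, sampler=sampler)

# Every worker holds a full replica of the model (a toy CNN standing in for ResNet50)
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, stride=2), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(), torch.nn.Linear(16, 3),
).cuda()
model = DDP(model, device_ids=[local_rank])

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
for images, labels in loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(images.cuda()), labels.cuda())
    loss.backward()                                        # gradients are averaged across all workers here
    optimizer.step()
```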

SLIDE 5

Distributed training mode: Model parallelism

[Diagram: the CNN model itself is split across workers; Worker 1 and Worker 2 each hold part of the model and process the same data subset (Subset 1), coordinated by a job manager.]
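A minimal sketch of model parallelism, assuming a single node with two GPUs: the convolutional half of a toy network lives on cuda:0 and the classifier on cuda:1, so each GPU holds only part of the weights and activations, and every batch flows through both devices:

```python
import torch
import torch.nn as nn

class TwoGPUCNN(nn.Module):
    """Toy CNN split across two GPUs: features on cuda:0, classifier on cuda:1."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        ).to("cuda:0")
        self.classifier = nn.Linear(64, 3).to("cuda:1")

    def forward(self, x):
        x = self.features(x.to("cuda:0"))
        return self.classifier(x.to("cuda:1"))   # hand activations over to the second GPU

model = TwoGPUCNN()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

images = torch.randn(64, 3, 224, 224)            # one synthetic batch
labels = torch.randint(0, 3, (64,), device="cuda:1")
loss = nn.functional.cross_entropy(model(images), labels)
loss.backward()                                  # autograd carries gradients back across the device boundary
optimizer.step()
```

Unlike data parallelism, only the activations at the split point cross devices, which is why this scales to models too large for one GPU but keeps each GPU busy only part of the time.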

SLIDE 6

Data parallelism vs model parallelism

Data parallelism

  • Easier implementation
  • Stronger fault tolerance
  • Higher cluster utilization

Model parallelism

  • Better scalability of large models
  • Less memory on each GPU
SLIDE 7

Horovod: Ring All Reduce
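Horovod averages gradients with a ring all-reduce over NCCL or MPI instead of a central parameter server. A minimal sketch of the PyTorch bindings (launched with horovodrun, one process per GPU; the toy model below is only a stand-in for ResNet50):

```python
import torch
import horovod.torch as hvd

hvd.init()                                       # one process per GPU, launched with horovodrun/mpirun
torch.cuda.set_device(hvd.local_rank())

model = torch.nn.Sequential(                     # toy model standing in for ResNet50
    torch.nn.Conv2d(3, 16, 3, stride=2), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(), torch.nn.Linear(16, 3),
).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# DistributedOptimizer hooks into backward() and averages gradients via ring all-reduce
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)   # identical starting weights on every worker
```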

SLIDE 8

Effects of Network, Model and Precision

SLIDE 9

Setup

  • Clusters of 8 nodes using K80, P40, P100 and V100 GPUs (4 GPUs per node + InfiniBand)
  • Two MPI configurations: OpenMPI + NCCL and Intel MPI

SLIDE 10

Experiments

  • 345 experiments across many different models, including ResNet50, MobileNet V2, etc.
  • Synthetic data
  • Batch size of 64 across all models and GPUs
  • Uses the benchmarking scripts from TensorFlow

SLIDE 11

Distributed training with synthetic data

[Diagram: a compute pool of GPU nodes training on synthetic data, with no external storage involved.]

SLIDE 12

Single GPU


SLIDE 13

32 GPUs

SLIDE 14

32 GPUs


SLIDE 15

MobileNet


SLIDE 16

MobileNet


SLIDE 17

[Chart: for each GPU type (K80, P40, P100, V100), the time it takes to transfer weights between GPUs (data transfer) versus the time it takes to process a batch on the GPU (batch execution).]

SLIDE 18

ResNet50 Full Precision vs Mixed Precision [32 V100s]

[Chart: full precision [64] reaches 6,629 images/second at 54% scaling efficiency; mixed precision [256] reaches 23,436 images/second at 82% scaling efficiency.]
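The slides do not show how mixed precision was enabled (the TensorFlow benchmark scripts control it through flags); a minimal sketch using the tf.keras mixed-precision API, assuming TensorFlow 2.4 or later:

```python
import tensorflow as tf

# Layers compute in float16 while keeping float32 master weights; on V100 Tensor Cores
# this roughly halves memory per image, which is what allows the larger batch size.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.applications.ResNet50(weights=None)
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")
```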

SLIDE 19

Effects of Storage

SLIDE 20

Experiments

  • ResNet50 across three frameworks [PyTorch, TensorFlow, Keras]
  • Real and synthetic data; real data on local, NFS and Blob storage
  • Batch size of 64 across all configurations
  • V100 GPUs

SLIDE 21

Distributed training with NFS

[Diagram: the compute pool reads training data from an NFS share mounted on each node; data is copied onto the share from a mounted fileshare.]

SLIDE 22

Distributed training with blob storage

[Diagram: the compute pool reads training data from blob storage mounted on each node; data is copied onto it from a mounted fileshare.]

SLIDE 23

Distributed training with local storage

[Diagram: training data is copied from a mounted fileshare onto each node's local storage, which the compute pool then reads during training.]

SLIDE 24

ResNet50 - Relative performance across storage

[Chart: relative throughput (0 to 1) for TensorFlow, Keras and PyTorch with synthetic data, local (SSD), NFS, premium blob and blob storage.]

SLIDE 25

Data Loaders and Preprocessors

Keras Data Loader

Simple, with no parameters for buffering or parallelizing

PyTorch Data Loader

Specify the number of preprocessing workers with num_workers
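A minimal sketch of the PyTorch loader with parallel workers; the /data/train path is a hypothetical ImageNet-style folder:

```python
import torch
from torchvision import datasets, transforms

# num_workers background processes decode and augment images while the GPU trains
# on the previous batch; pin_memory speeds up host-to-GPU copies.
train_set = datasets.ImageFolder(
    "/data/train",   # hypothetical path
    transforms.Compose([transforms.RandomResizedCrop(224), transforms.ToTensor()]),
)
loader = torch.utils.data.DataLoader(
    train_set, batch_size=64, shuffle=True, num_workers=8, pin_memory=True,
)
```

Too few workers starves the GPU; too many can thrash CPU and I/O, so the right value depends on the node.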

SLIDE 26

TensorFlow

  • Highly configurable
  • Many options: buffer, shuffle, cache and shard
  • Daunting and easy to get wrong

https://www.tensorflow.org/guide/performance/datasets
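A minimal sketch of such a pipeline, assuming TFRecords that hold a JPEG image and an integer label; the path, worker count and feature names are placeholders:

```python
import tensorflow as tf

NUM_WORKERS, WORKER_INDEX = 4, 0                 # would come from the cluster configuration

def parse_example(record):
    # Minimal parser for records with a JPEG 'image' feature and an int64 'label' feature
    features = tf.io.parse_single_example(record, {
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    })
    image = tf.image.resize(tf.io.decode_jpeg(features["image"], channels=3), [224, 224])
    return image, features["label"]

dataset = (tf.data.Dataset.list_files("/data/train/*.tfrecord")    # hypothetical path
           .shard(NUM_WORKERS, WORKER_INDEX)                       # each worker reads its own shard
           .interleave(tf.data.TFRecordDataset, num_parallel_calls=tf.data.AUTOTUNE)
           .shuffle(10_000)
           .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(64)
           .prefetch(tf.data.AUTOTUNE))                            # overlap input pipeline with training
```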

SLIDE 27

Effects of Data Type

SLIDE 28

TensorFlow Records

  • Binary data format created for TensorFlow – recommended format for TensorFlow
  • Can aggregate many examples into a smaller number of TFRecords – efficient for transferring and reading in the cloud
  • Have to export data to the format – has to be tailored to the use case
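A minimal sketch of exporting images into a TFRecord file; the file names, labels and feature names are placeholders to be tailored to the use case:

```python
import tensorflow as tf

def _bytes(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

# Pack many (image, label) pairs into a single TFRecord file
image_paths, labels = ["cat.jpg", "dog.jpg"], [0, 1]                # placeholder data
with tf.io.TFRecordWriter("train-00000.tfrecord") as writer:
    for path, label in zip(image_paths, labels):
        example = tf.train.Example(features=tf.train.Features(feature={
            "image": _bytes(tf.io.read_file(path).numpy()),          # raw JPEG bytes
            "label": _int64(label),
        }))
        writer.write(example.SerializeToString())
```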
SLIDE 29

ResNet50 – Data Type Performance [Average]

[Chart: average images/second for synthetic data, images and TFRecords at 8, 16 and 32 GPUs.]

SLIDE 30

ResNet50 – Data Format Performance [Maximum]

[Chart: maximum images/second for synthetic data, images and TFRecords at 8, 16 and 32 GPUs.]

SLIDE 31

Things not discussed

  • Asynchronous distributed training
  • Tradeoff between batch size and other parameters
  • Optimization of the TensorFlow pipeline
  • Other data formats such as Parquet (Petastorm)
  • Transform libraries [albumentations]
  • Distributed file systems (BeeGFS) and other storage (GlusterFS, Lustre, etc.)
  • Models other than CNNs

SLIDE 32

Summary

  • Do use enhanced networking wherever possible, especially for the latest GPUs
  • Distributed training is not recommended for small models
  • Do use TFRecords or other columnar or row-based data formats
  • Not all data loaders are equal

SLIDE 33

Thanks & Questions?