DISTRIBUTED TRAINING OF DEEP LEARNING MODELS
Mathew Salvaris @msalvaris Ilia Karmanov @ikdeepl Miguel Fierro @miguelgfierro
more info: https://github.com/ilkarman/DeepLearningFrameworks
Rosetta Stone of Deep Learning
ImageNet Competition
[Chart: ImageNet top-5 error (%) by model]
AlexNet (2012): 15.3% · VGG (2014): 7.3% · Inception (2015): 6.7% · ResNet (2015): 3.6% · Inception-ResNet (2016): 3.1% · NASNet (2017): 3.8% · AmoebaNet (2017): 3.8% · ResNeXt Instagram (2018): 2.4% · human: 5.1%
Distributed training mode: Data parallelism
[Diagram: the job manager splits the dataset into subsets; each worker holds a full copy of the CNN model and trains on its own subset.]
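A minimal single-process sketch of the idea (plain NumPy, not any framework's API): every worker holds a full model copy, computes gradients on its own subset of the mini-batch, and the job manager applies the averaged gradient to the shared weights.

```python
import numpy as np

# Single-process sketch of data parallelism: the model is replicated,
# the data is partitioned, and gradients are averaged each step.

rng = np.random.default_rng(0)
w = np.zeros(3)                       # shared model weights (replicated)
X = rng.normal(size=(8, 3))           # one mini-batch of data
y = X @ np.array([1.0, -2.0, 0.5])    # targets from a known linear model

def worker_grad(w, X_sub, y_sub):
    """Mean-squared-error gradient on this worker's subset."""
    err = X_sub @ w - y_sub
    return 2.0 * X_sub.T @ err / len(y_sub)

subsets = np.array_split(np.arange(len(y)), 2)   # two simulated workers
for _ in range(500):
    grads = [worker_grad(w, X[i], y[i]) for i in subsets]
    w -= 0.05 * np.mean(grads, axis=0)           # synchronous update

print(np.round(w, 2))   # approaches [ 1. -2.  0.5]
```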
Distributed training mode: Model parallelism
[Diagram: the job manager splits the CNN model into sub-models; each worker holds one sub-model, and the dataset is visible to all workers.]
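A minimal sketch of the split (plain NumPy; the layer shapes are made up for illustration): each "worker" owns only its own layer's parameters, and activations flow between workers instead of the data being partitioned.

```python
import numpy as np

# Single-process sketch of model parallelism: the model is split by
# layer, and the workers pass activations to each other.

rng = np.random.default_rng(1)

W1 = rng.normal(size=(4, 8)) * 0.5    # sub-model held by worker 1
W2 = rng.normal(size=(8, 2)) * 0.5    # sub-model held by worker 2

def worker1_forward(x):
    return np.maximum(x @ W1, 0.0)    # worker 1: first layer + ReLU

def worker2_forward(h):
    return h @ W2                     # worker 2: second layer

x = rng.normal(size=(3, 4))           # one mini-batch of 3 examples
h = worker1_forward(x)                # worker 1 sends activations ...
out = worker2_forward(h)              # ... worker 2 finishes the pass
print(out.shape)                      # (3, 2)
```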
Data parallelism vs model parallelism
Data parallelism
▪ Easier implementation
▪ Stronger fault tolerance
▪ Higher cluster utilization

Model parallelism
▪ Better scalability of large models
▪ Less memory on each GPU

Why not both? Data parallelism for CNN layers and model parallelism in FC layers.

Source: Alex Krizhevsky. 2014. One weird trick for parallelizing convolutional neural networks. https://arxiv.org/abs/1404.5997
Training strategies: parameter averaging
[Diagram: each worker trains its own copy of the CNN model on its subset of the data; the weights of the workers are then averaged.]
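A sketch of parameter averaging in plain NumPy (illustrative; the learning rate is made up, and for simplicity the weights are averaged after every local step, although in practice synchronization is periodic):

```python
import numpy as np

# Single-process sketch of parameter averaging: each worker takes local
# SGD steps on its own model copy, then all copies are replaced by the
# element-wise average of the workers' weights.

rng = np.random.default_rng(2)
X = rng.normal(size=(16, 3))
y = X @ np.array([2.0, 0.0, -1.0])
splits = np.array_split(np.arange(16), 2)   # each worker's data subset
weights = [np.zeros(3) for _ in splits]     # one model copy per worker

for _ in range(200):
    # Local SGD step on each worker's own subset.
    for k, idx in enumerate(splits):
        err = X[idx] @ weights[k] - y[idx]
        weights[k] -= 0.05 * 2.0 * X[idx].T @ err / len(idx)
    # Synchronization: average the weights across workers.
    avg = np.mean(weights, axis=0)
    weights = [avg.copy() for _ in splits]

print(np.round(weights[0], 2))   # approaches [ 2.  0. -1.]
```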
Training strategies: distributed gradient based
[Diagram: each worker computes gradients of the CNN model on its subset of the data; the gradients are aggregated either synchronously or asynchronously.]
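The two modes can be contrasted in a single-process NumPy sketch (illustrative): the synchronous variant averages all workers' gradients before one update, while the asynchronous variant lets each worker push its gradient as soon as it is ready, possibly computed from stale weights.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(8, 2))
y = X @ np.array([1.0, 3.0])
splits = np.array_split(np.arange(8), 2)

def grad(w, idx):
    err = X[idx] @ w - y[idx]
    return 2.0 * X[idx].T @ err / len(idx)

# Synchronous: one update per round, after all gradients arrive.
w_sync = np.zeros(2)
for _ in range(300):
    g = np.mean([grad(w_sync, i) for i in splits], axis=0)
    w_sync -= 0.05 * g

# Asynchronous: updates interleave; worker 2's gradient is computed
# from a stale snapshot taken before worker 1's update landed.
w_async = np.zeros(2)
for _ in range(300):
    stale = w_async.copy()                        # snapshot read by worker 2
    w_async -= 0.05 * grad(w_async, splits[0])    # worker 1 updates first
    w_async -= 0.05 * grad(stale, splits[1])      # worker 2 used stale weights

print(np.round(w_sync, 2), np.round(w_async, 2))
```

With a small learning rate and one step of staleness both variants still converge here; in practice, asynchronous updates trade some statistical efficiency for not having to wait on the slowest worker.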
Overview of distributed training
▪ Provision clusters
▪ Install software and containers
▪ Distribute data
▪ Schedule jobs
▪ Scale resources
▪ Handle failures
▪ Share results
Azure Distributed Platforms
▪ Batch AI
▪ Batch Shipyard
▪ DL Workspace
Horovod
Batch Shipyard
https://github.com/Azure/batch-shipyard
▪ Run your Docker and Singularity containers within the same job, side-by-side or even concurrently
▪ Load data onto the compute nodes from accessible storage systems and remote filesystems such as Azure Blob or File Storage, and NFS
Batch AI
https://github.com/Azure/BatchAI
▪ Jobs can run in a container as well as on the Data Science Virtual Machine
▪ Data can be loaded from Azure Blob or File Storage, and NFS
DL Workspace
https://github.com/Microsoft/DLWorkspace
▪ Not limited to just Azure
1) Create scripts to run on Batch AI and transfer them to file storage
2) Write the data to storage
3) Create the Docker containers for each DL framework and transfer them to a container registry
Training with Batch AI
1) Create a Batch AI pool
2) Each job pulls in the appropriate container and script, and loads data from the chosen storage
3) Once the job is completed, all the results are written to the fileshare
Batch AI Interface
▪ CLI: az batchai cluster create
▪ Python SDK
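The cluster-creation call can be sketched with the CLI as below; the resource names (myrg, nc24cluster, demo) are placeholders, and the flags follow the 2018-era az batchai CLI, so check `az batchai cluster create --help` for the exact options in your version.

```shell
# Hypothetical resource names; flags per the 2018-era Batch AI CLI.
az batchai cluster create \
    --name nc24cluster \
    --resource-group myrg \
    --vm-size Standard_NC24 \
    --image UbuntuLTS \
    --min 2 --max 2 \
    --user-name demo \
    --password demo
```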
Distributed training with NFS
▪ Batch AI cluster configuration with NFS share
[Diagram: the data is copied to an NFS share, which is mounted on each node of the Batch AI pool alongside the mounted fileshare; the cluster is created with az batchai cluster create.]
Distributed training with blob storage
▪ Batch AI cluster configuration with mounted blob
[Diagram: the data is copied to blob storage, which is mounted on each node of the Batch AI pool alongside the mounted fileshare; the cluster is created with az batchai cluster create.]
Distributed training with local storage
▪ Batch AI cluster configuration with copying the data to the nodes
[Diagram: during node preparation the data is copied from the mounted fileshare to the local storage of each node in the Batch AI pool; the cluster is created with az batchai cluster create.]
Distributed training results

[Charts: distributed training throughput, measured in images/second]
Distributed training with Horovod
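Under the hood, Horovod averages gradients with ring all-reduce. The following is a single-process NumPy simulation of that algorithm (illustrative only; Horovod itself runs the exchange over MPI/NCCL and exposes wrappers such as hvd.DistributedOptimizer):

```python
import numpy as np

def ring_allreduce(tensors):
    """Average equally-shaped 1-D gradients as n ring workers would."""
    n = len(tensors)
    chunks = [list(np.array_split(t.astype(float), n)) for t in tensors]
    # Scatter-reduce: each worker passes one chunk to its right
    # neighbour per step; after n - 1 steps every chunk is fully
    # summed on some worker.
    for s in range(n - 1):
        sent = [chunks[i][(i - s) % n].copy() for i in range(n)]
        for i in range(n):
            chunks[(i + 1) % n][(i - s) % n] += sent[i]
    # All-gather: circulate the completed chunks so every worker ends
    # up holding the whole summed vector.
    for s in range(n - 1):
        sent = [chunks[i][(i + 1 - s) % n].copy() for i in range(n)]
        for i in range(n):
            chunks[(i + 1) % n][(i + 1 - s) % n] = sent[i]
    return [np.concatenate(c) / n for c in chunks]

grads = [np.full(6, float(i)) for i in range(3)]   # 3 workers' gradients
averaged = ring_allreduce(grads)
print(averaged[0])   # every worker now sees the mean: all 1.0
```

Each worker only ever talks to its neighbours, so the bandwidth per worker stays constant as the ring grows; that is the property that makes Horovod scale well.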
Distributed training with PyTorch
Distributed training with Chainer
Distributed training with CNTK
▪ 1-bit SGD with MPI
▪ Blocked Momentum with MPI
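The core trick in 1-bit SGD is error feedback: quantize each transmitted gradient to one bit per element plus a scale, and fold the quantization error into the next step's gradient. A minimal NumPy sketch of that mechanism (illustrative, not CNTK's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(4)
g_stream = rng.normal(size=(100, 4))   # a stream of local gradients
residual = np.zeros(4)                 # quantization error carried forward
sent_total = np.zeros(4)               # what actually went over the wire

for g in g_stream:
    v = g + residual                         # add back the previous error
    q = np.sign(v) * np.mean(np.abs(v))      # 1 bit per element + one scale
    residual = v - q                         # remember what was lost
    sent_total += q

# Error feedback makes sent_total + residual equal the sum of the true
# gradients exactly, so the compressed stream loses no gradient mass
# over time.
print(np.allclose(sent_total + residual, g_stream.sum(axis=0)))  # True
```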
Hongzhi Li, Alex Sutton, Alex Yukhanov
Attribution of some images: http://morguefile.com/