DISTRIBUTED TRAINING OF DEEP LEARNING MODELS
Mathew Salvaris @msalvaris – Ilia Karmanov @ikdeepl – Miguel Fierro @miguelgfierro


  1. DISTRIBUTED TRAINING OF DEEP LEARNING MODELS – Mathew Salvaris @msalvaris, Ilia Karmanov @ikdeepl, Miguel Fierro @miguelgfierro

  2. Rosetta Stone of Deep Learning more info: https://github.com/ilkarman/DeepLearningFrameworks Mathew Salvaris (@msalvaris) – Ilia Karmanov (@ikdeepl) – Miguel Fierro (@miguelgfierro)

  3. ImageNet competition, top-5 error (%). The chart traces error from AlexNet (2012, 15.3%) through VGG (2014, 7.3%), Inception (6.7%), human performance (5.1%), ResNet (2015), Inception-ResNet (2016), ResNeXt and NASNet (2017), down to AmoebaNet and Instagram-pretrained models (2018), with the later models between 3.8% and 2.4%

  4. Distributed training mode: data parallelism. Each worker (Worker 1, Worker 2) holds a full copy of the CNN model and trains on its own subset of the dataset (Subset 1, Subset 2); a job manager coordinates the workers.
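As a rough illustration (pure Python with hypothetical names, not code from the talk), data parallelism keeps a full model replica on every worker and shards only the data:

```python
# Sketch of data parallelism: same model everywhere, different data shards.
# `Worker` and `shard_dataset` are hypothetical names for illustration.

def shard_dataset(dataset, num_workers):
    """Split the dataset into one strided subset per worker."""
    return [dataset[i::num_workers] for i in range(num_workers)]

class Worker:
    def __init__(self, model_weights, subset):
        self.weights = dict(model_weights)  # full model replica
        self.subset = subset                # this worker's private data shard

dataset = list(range(10))
workers = [Worker({"w": 0.0}, s) for s in shard_dataset(dataset, 2)]

assert workers[0].subset == [0, 2, 4, 6, 8]
assert workers[1].subset == [1, 3, 5, 7, 9]
```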

  5. Distributed training mode: model parallelism. The CNN model is split into submodels; Worker 1 runs Submodel 1 and Worker 2 runs Submodel 2, each processing the full dataset, with a job manager coordinating.
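A minimal sketch of the contrast (hypothetical names, pure Python): in model parallelism the model itself is partitioned, and activations, not data shards, cross the worker boundary:

```python
# Sketch of model parallelism: the model's layers are split across workers
# and activations flow between them. Functions are hypothetical stand-ins.

def submodel_1(x):
    # e.g. the convolutional layers, running on worker 1
    return x * 2

def submodel_2(x):
    # e.g. the fully connected layers, running on worker 2
    return x + 1

def forward(x):
    # Worker 1 computes its part, then ships the activation to worker 2.
    return submodel_2(submodel_1(x))

assert forward(3) == 7
```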

  6. Data parallelism vs model parallelism
     Data parallelism: ▪ Easier implementation ▪ Stronger fault tolerance ▪ Higher cluster utilization
     Model parallelism: ▪ Better scalability of large models ▪ Less memory on each GPU
     Why not both? Data parallelism for the CNN layers and model parallelism in the FC layers.
     Source: Alex Krizhevsky. 2014. One weird trick for parallelizing convolutional neural networks. https://arxiv.org/abs/1404.5997

  7. Training strategies: parameter averaging. Worker 1 and Worker 2 each train the CNN model on their own subset, and the new global model is the average of the weights of each worker.
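A minimal sketch of parameter averaging, assuming weights are stored as plain dicts (hypothetical names, not the talk's code):

```python
def average_weights(worker_weights):
    """Element-wise average of each parameter across all workers."""
    keys = worker_weights[0].keys()
    n = len(worker_weights)
    return {k: sum(w[k] for w in worker_weights) / n for k in keys}

# Two workers that have drifted apart after training on different subsets:
w1 = {"conv": 0.25, "fc": 1.0}
w2 = {"conv": 0.75, "fc": 3.0}
assert average_weights([w1, w2]) == {"conv": 0.5, "fc": 2.0}
```

The averaged dict becomes the new global model that is broadcast back to every worker before the next round.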

  8. Training strategies: distributed gradient based. Worker 1 and Worker 2 each compute gradients on their own subset; the gradients of each worker are combined either synchronously or asynchronously.
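Synchronous distributed SGD can be sketched as follows (pure Python, hypothetical names); in the asynchronous variant each worker's gradient would instead be applied as soon as it arrives, without waiting for the others:

```python
def sgd_step(weights, worker_gradients, lr=0.5):
    """One synchronous update: average the per-worker gradients,
    then apply a single SGD step to the shared weights."""
    n = len(worker_gradients)
    avg = {k: sum(g[k] for g in worker_gradients) / n for k in weights}
    return {k: weights[k] - lr * avg[k] for k in weights}

weights = {"w": 1.0}
grads = [{"w": 2.0}, {"w": 4.0}]      # gradients from worker 1 and worker 2
weights = sgd_step(weights, grads)    # average gradient is 3.0
assert weights == {"w": -0.5}         # 1.0 - 0.5 * 3.0
```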

  9. Overview of distributed training: provision clusters of VMs, install software and containers, distribute data, schedule jobs, scale resources, handle failures, share results.

  10. Azure Distributed Platforms ▪ Batch AI ▪ Batch Shipyard ▪ DL Workspace ▪ Horovod

  11. Batch Shipyard • Supports Docker and Singularity: run your Docker and Singularity containers within the same job, side-by-side or even concurrently • Move data easily between locally accessible storage systems, remote filesystems, Azure Blob or File Storage, and compute nodes • Supports local storage, Azure Blob or File Storage, and NFS • Low priority nodes https://github.com/Azure/batch-shipyard

  12. Batch AI • Supports running in a Docker container as well as on the Data Science Virtual Machine • Supports local storage, Azure Blob or File Storage, and NFS • Low priority nodes https://github.com/Azure/BatchAI

  13. DL Workspace • Runs jobs inside Docker • Uses Kubernetes • Can be deployed anywhere, not just Azure • Supports local storage and NFS https://github.com/Microsoft/DLWorkspace

  14. Training with Batch AI: 1) Create scripts to run on Batch AI and transfer them to file storage. 2) Write the data to storage. 3) Create the Docker containers for each DL framework and transfer them to a container registry.

  15. 1) Create a Batch AI pool. 2) Each job pulls in the appropriate container and script and loads data from the chosen storage. 3) Once the job is completed, all the results are written to the fileshare.

  16. Batch AI interface: CLI and Python SDK

      az batchai cluster create \
        --name nc24r \
        --image UbuntuLTS \
        --vm-size Standard_NC24rs_v3 \
        --min 8 --max 8 \
        --afs-name $FILESHARE_NAME \
        --afs-mount-path extfs \
        --storage-account-name $STORAGE_ACCOUNT_NAME \
        --storage-account-key $storage_account_key \
        --nfs $NFS_NAME \
        --nfs-mount-path nfs

  17. Distributed training with NFS
      ▪ Batch AI cluster configuration with an NFS share: data is copied to the NFS share, which is mounted on the Batch AI pool alongside the mounted fileshare.

      az batchai cluster create \
        --name nc24r \
        --image UbuntuLTS \
        --vm-size Standard_NC24rs_v3 \
        --min 8 --max 8 \
        --afs-name $FILESHARE_NAME \
        --afs-mount-path extfs \
        --storage-account-name $STORAGE_ACCOUNT_NAME \
        --storage-account-key $storage_account_key \
        --nfs $NFS_NAME \
        --nfs-mount-path nfs

  18. Distributed training with blob storage
      ▪ Batch AI cluster configuration with a mounted blob container: data is copied to blob storage, which is mounted on the Batch AI pool alongside the mounted fileshare.

      az batchai cluster create \
        --name nc24r \
        --image UbuntuLTS \
        --vm-size Standard_NC24rs_v3 \
        --min 8 --max 8 \
        --afs-name $FILESHARE_NAME \
        --afs-mount-path extfs \
        --container-name $CONTAINER_NAME \
        --container-mount-path extcn \
        --storage-account-name $STORAGE_ACCOUNT_NAME \
        --storage-account-key $storage_account_key

  19. Distributed training with local storage
      ▪ Batch AI cluster configuration that copies the data to the nodes, using a node preparation configuration (-c cluster.json); the fileshare is mounted for results.

      az batchai cluster create \
        --name nc24r \
        --image UbuntuLTS \
        --vm-size Standard_NC24r \
        --min 8 --max 8 \
        --afs-name $FILESHARE_NAME \
        --afs-mount-path extfs \
        --container-name $CONTAINER_NAME \
        --container-mount-path extcn \
        --storage-account-name $STORAGE_ACCOUNT_NAME \
        --storage-account-key $storage_account_key \
        -c cluster.json

  20. Distributed training results, in images/second (chart not transcribed)

  21. Distributed training results, in images/second (chart not transcribed)

  22. Distributed training results, in images/second (chart not transcribed)

  23. Distributed training with Horovod

  24. Distributed training with Horovod

  25. Distributed training with Horovod
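The Horovod code shown on these slides did not survive transcription. Horovod's core communication primitive is ring-allreduce, in which each of N workers exchanges gradient chunks only with its ring neighbours over 2*(N-1) steps. A pure-Python simulation of that algorithm (hypothetical names, illustrative only; a real job would run over NCCL or MPI):

```python
def ring_allreduce(grads):
    """Simulated ring-allreduce: every worker ends up holding the element-wise
    sum of all workers' gradient vectors, using only neighbour-to-neighbour
    exchanges. Assumes vector length is divisible by the number of workers."""
    n = len(grads)
    size = len(grads[0]) // n
    # Each worker splits its vector into n chunks.
    chunks = [[g[i * size:(i + 1) * size] for i in range(n)] for g in grads]

    # Phase 1, reduce-scatter: after n-1 steps, worker r holds the fully
    # summed chunk (r + 1) % n.
    for step in range(n - 1):
        for r in range(n):
            c = (r - step) % n          # chunk that worker r sends right
            dst = (r + 1) % n
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], chunks[r][c])]

    # Phase 2, allgather: circulate the finished chunks for n-1 more steps.
    for step in range(n - 1):
        for r in range(n):
            c = (r + 1 - step) % n      # finished chunk that worker r sends
            chunks[(r + 1) % n][c] = list(chunks[r][c])

    return [[v for chunk in worker for v in chunk] for worker in chunks]

# Three workers, three parameters each: all end with the element-wise sum.
grads = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
assert ring_allreduce(grads) == [[12.0, 15.0, 18.0]] * 3
```

The bandwidth per worker is independent of N, which is why this scheme scales better than sending every gradient to a central parameter server.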

  26. Distributed training with PyTorch
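The PyTorch slide's code was likewise lost in transcription. A hedged sketch of the standard torch.distributed + DistributedDataParallel pattern; to stay runnable as a standalone script it fakes a single-process "gloo" group on CPU, whereas a real job would be started by a launcher that assigns a rank to each process:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# In a real multi-node run the launcher sets these; here we fake one process.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

# DDP replicates the model per process and allreduces gradients on backward.
model = DDP(torch.nn.Linear(4, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(8, 4), torch.randn(8, 1)   # this rank's mini-batch
loss = torch.nn.functional.mse_loss(model(x), y)
opt.zero_grad()
loss.backward()        # gradients are averaged across ranks here
opt.step()

dist.destroy_process_group()
```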

  27. Distributed training with Chainer

  28. Distributed training with CNTK: 1-bit SGD with MPI, Blocked Momentum with MPI
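CNTK's 1-bit SGD compresses each gradient component to a single bit before communication and carries the quantization error into the next minibatch (error feedback), which is what keeps convergence intact despite the aggressive compression. A simplified pure-Python sketch of that quantization step (the real algorithm uses per-column reconstruction values; names here are hypothetical):

```python
def one_bit_quantize(grad, residual):
    """Quantize grad + carried residual to one bit per component (the sign),
    scaled by the mean magnitude; return the quantized gradient and the new
    residual to carry into the next minibatch (error feedback)."""
    adjusted = [g + r for g, r in zip(grad, residual)]
    scale = sum(abs(a) for a in adjusted) / len(adjusted)
    quantized = [scale if a >= 0 else -scale for a in adjusted]
    new_residual = [a - q for a, q in zip(adjusted, quantized)]
    return quantized, new_residual

grad = [0.5, -1.5, 1.0, -2.0]
q, r = one_bit_quantize(grad, [0.0] * 4)
# mean magnitude = (0.5 + 1.5 + 1.0 + 2.0) / 4 = 1.25
assert q == [1.25, -1.25, 1.25, -1.25]
assert r == [-0.75, -0.25, -0.25, -0.75]
```

Only the sign bits and one scale need to cross the MPI link, so the communicated payload shrinks by roughly 32x relative to float32 gradients.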

  29. Demo

  30. Acknowledgements: Hongzhi Li, Alex Sutton, Alex Yukhanov. Attribution of some images: http://morguefile.com/

  31. Thanks! Mathew Salvaris @msalvaris Ilia Karmanov @ikdeepl Miguel Fierro @miguelgfierro
