SLIDE 1

Project Adam:

Building an Efficient and Scalable Deep Learning Training System

Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman, Microsoft Research

Credits: Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (Alex Zahdeh)

SLIDE 2

Traditional Machine Learning

SLIDE 3

Deep Learning

(Diagram labels: Data, Deep Learning, Humans, Prediction, Objective Function)

SLIDE 4

Deep Learning

SLIDE 5

Problem with Deep Learning

Current computational needs on the order of petaFLOPS!

SLIDE 6

Accuracy scales with data and model size

SLIDE 7

Neural Networks

http://neuralnetworksanddeeplearning.com/images/tikz11.png

Activation function: e.g. the sigmoid, σ(z) = 1/(1 + e^(−z))

SLIDE 8

Convolutional Neural Networks

http://colah.github.io/posts/2014-07-Conv-Nets-Modular/img/Conv2-9x5-Conv2Conv2.png

SLIDE 9

Convolutional Neural Networks with Max Pooling

http://colah.github.io/posts/2014-07-Conv-Nets-Modular/img/Conv-9-Conv2Max2Conv2.png

SLIDE 10

Neural Network Training (with Stochastic Gradient Descent)

  • Inputs processed one at a time, in random order, with three steps:
  • 1. Feed-forward evaluation
  • 2. Back-propagation
  • 3. Weight updates
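
A minimal sketch of one SGD step for a tiny two-layer sigmoid network (shapes, learning rate, and the squared-error loss are illustrative assumptions, not the paper's model):

```python
# One SGD step: feed-forward, back-propagation, weight updates.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(784, 128)) * 0.01   # input -> hidden weights
W2 = rng.normal(size=(128, 10)) * 0.01    # hidden -> output weights
lr = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(x, y):
    global W1, W2
    # 1. Feed-forward evaluation
    h = sigmoid(x @ W1)
    p = sigmoid(h @ W2)
    # 2. Back-propagation (squared-error loss, sigmoid derivative)
    d2 = (p - y) * p * (1 - p)
    d1 = (d2 @ W2.T) * h * (1 - h)
    # 3. Weight updates
    W2 -= lr * np.outer(h, d2)
    W1 -= lr * np.outer(x, d1)

# inputs are visited one at a time in random order:
for x, y in zip(rng.normal(size=(5, 784)), np.eye(10)[:5]):
    sgd_step(x, y)
```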

SLIDE 11

Project Adam

  • Optimizing and balancing both computation and communication for this application through whole-system co-design
  • Achieving high performance and scalability by exploiting the ability of machine learning training to tolerate inconsistencies well
  • Demonstrating that system efficiency, scaling, and asynchrony all contribute to improvements in trained model accuracy

SLIDE 12

Adam System Architecture

SLIDE 13

Fast Data Serving

  • Large quantities of data needed (10-100 TB)
  • Data requires transformation to prevent over-fitting
  • Small set of machines configured separately to perform transformations and serve data
  • Data servers pre-cache images, using nearly all of system memory as a cache
  • Model training machines fetch data in advance, in batches, in the background (sketched below)
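
A minimal sketch of that background prefetching; fetch_batch() is a hypothetical stand-in for the RPC that pulls a transformed batch from a data server:

```python
# Background batch prefetching: a daemon thread keeps a bounded
# queue of ready batches so the training loop never waits on I/O.
import queue
import threading
import numpy as np

def fetch_batch():
    # Stand-in for a request to a data server returning a transformed batch.
    return np.zeros((256, 3, 224, 224), dtype=np.float32)

prefetch_q = queue.Queue(maxsize=8)   # bounded buffer of ready batches

def prefetcher():
    while True:
        prefetch_q.put(fetch_batch())  # blocks once 8 batches are buffered

threading.Thread(target=prefetcher, daemon=True).start()

batch = prefetch_q.get()   # the training loop consumes batches as they arrive
```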

SLIDE 14

Multi Threaded Training

  • Multiple threads on a single machine
  • Different images assigned to threads that share model weights
  • Per-thread training context stores activations and weight-update values

SLIDE 15

Fast Weight Updates

  • Weights updated locally without locks
  • Race condition permitted
  • Weight updates are commutative and associative
  • Deep neural networks are resilient to small amounts of noise
  • Important for good scaling
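
A minimal sketch of these lock-free updates (Hogwild-style), with a toy stand-in gradient; the shared weights W are updated by all threads without any lock, which is exactly the race the slide says is permitted:

```python
# Lock-free multithreaded weight updates: threads race on shared W.
# The "gradient" here is a toy stand-in, not real backpropagation.
import threading
import numpy as np

W = np.zeros(1024)    # model weights shared by all training threads
lr = 0.01

def train_thread(images):
    global W
    for x in images:
        grad = W - x        # stand-in for a per-image weight gradient
        W -= lr * grad      # racy update: no lock; updates commute, so
                            # occasional lost/torn writes act as noise

rng = np.random.default_rng(1)
threads = [threading.Thread(target=train_thread,
                            args=(rng.normal(size=(100, 1024)),))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```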
SLIDE 16

Reducing Memory Copies

  • Pass pointers rather than copying data for local communication
  • Custom network library for non-local communication
  • Exploit knowledge of the static model partitioning to optimize communication
  • Reference counting to ensure safety under asynchronous network IO
SLIDE 17

Memory System Optimizations

  • Partition so that model layers fit in L3 cache
  • Optimize computation for cache locality
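
For illustration only, a cache-blocking sketch in the spirit of these optimizations: tile a layer's matrix product so each working tile stays resident in cache (tile size and shapes are assumptions, not Adam's actual kernels):

```python
# Cache-blocking sketch: process the matrices in tiles small enough
# to stay resident in cache while the inner loop reuses them.
import numpy as np

def blocked_matmul(A, B, tile=64):
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                # Each tile-product touches O(tile^2) data repeatedly,
                # so it stays hot in cache across the inner loop.
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C
```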
SLIDE 18

Mitigating the Impact of Slow Machines

  • Allow threads to process multiple images in parallel
  • Use a dataflow framework to trigger progress on individual images based on arrival of data from remote machines
  • At end of epoch, only wait for 75% of the model replicas to complete (sketched below)
  • Arrived at through empirical observation
  • No impact on accuracy
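
A sketch of that end-of-epoch policy, assuming hypothetical replica handles with an epoch_done() status call:

```python
# End-of-epoch policy: continue once 75% of model replicas finish,
# leaving stragglers behind. epoch_done() is a hypothetical status RPC.
import time

def wait_for_quorum(replicas, epoch, fraction=0.75, poll_s=0.1):
    quorum = int(len(replicas) * fraction)
    while sum(r.epoch_done(epoch) for r in replicas) < quorum:
        time.sleep(poll_s)   # slow replicas are simply not waited for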
SLIDE 19

Parameter Server Communication

Two protocols for communicating parameter weight updates:

  • 1. Locally compute and accumulate weight updates, and periodically send them to the server
  • Works well for convolutional layers, since the volume of weights is low due to weight sharing
  • 2. Send the activation and error gradient vectors to the parameter servers, so that weight updates can be computed there
  • Needed for fully connected layers due to the volume of weights; this reduces traffic volume from M×N to K×(M+N)
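
For illustration (layer and batch sizes assumed, not from the paper): a fully connected layer with M = N = 2048 has M×N ≈ 4.2M values per weight update, while sending activation and error-gradient vectors for K = 32 inputs costs K×(M+N) = 131,072 values, about a 32× reduction; the parameter server then reconstructs the update as a sum of K outer products.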

SLIDE 20

Evaluation

  • Visual Object Recognition Benchmarks
  • System Hardware
  • Baseline Performance and Accuracy
  • System Scaling and Accuracy
SLIDE 21

Visual Object Recognition Benchmarks

  • MNIST digit recognition

http://cs.nyu.edu/~roweis/data/mnist_train1.jpg

SLIDE 22

Visual Object Recognition Benchmarks

  • ImageNet 22k Image Classification

American Foxhound English Foxhound

http://www.exoticdogs.com/breeds/english-fh/4.jpg http://www.juvomi.de/hunde/bilder/m/FOXEN01M.jpg

SLIDE 23

System Hardware

  • 120 HP ProLiant servers
  • Each server has an Intel Xeon E5-2450L processor: 16 cores, 1.8 GHz
  • Each server has 98 GB of main memory, two 10 Gb NICs, and one 1 Gb NIC
  • 90 model training machines, 20 parameter servers, 10 image servers
  • 3 racks of 40 servers each, connected by IBM G8264 switches
SLIDE 24

Baseline Performance and Accuracy

  • Single model training machine, single parameter server.
  • Small model on MNIST digit classification task
SLIDE 25

Model Training System Baseline

SLIDE 26

Parameter Server Baseline

SLIDE 27

Model Accuracy Baseline

SLIDE 28

System Scaling and Accuracy

  • Scaling with Model Workers
  • Scaling with Model Replicas
  • Trained Model Accuracy
SLIDE 29

Scaling with Model Workers

SLIDE 30

Scaling with Model Replicas

SLIDE 31

Trained Model Accuracy at Scale

SLIDE 32

Trained Model Accuracy at Scale

SLIDE 33

Exascale Deep Learning for Climate Analytics

Thorsten Kurth*, Josh Romero*, Sean Treichler, Mayur Mudigonda, Nathan Luehr, Everett Phillips, Ankur Mahesh, Michael Matheson, Jack Deslippe, Massimiliano Fatica, Prabhat, Michael Houston

Credits: NERSC, NVIDIA, Oak Ridge National Laboratory

SLIDE 34

Socio-Economic Impact of Extreme Weather Events

  • tropical cyclones and atmospheric rivers have a major impact on the modern economy and society
  • CA: 50% of rainfall arrives through atmospheric rivers
  • FL: flooding, influence on insurance premiums and home prices
  • $200B worth of damage in 2017
  • costs of ~$10B/event for large events

(Photos: Harvey 2017, Katrina 2005, Berkeley 2019, Santa Rosa 2018)

SLIDE 35

Understanding Extreme Weather Phenomena

  • will there be more hurricanes?
  • will they be more intense?
  • will they make landfall more often?
  • will atmospheric rivers carry more water?
  • can they help mitigate droughts and decrease the risk of forest fires?
  • will they cause flooding and heavy precipitation?

SLIDE 36

Impact Quantification of Extreme Weather Events

  • detect hurricanes and atmospheric rivers in climate model projections
  • enable geospatial analysis of extreme weather (EW) events and statistical impact studies for regions around the world
  • flexible and scalable detection algorithm
  • gear up for future simulations with ~1 km² spatial resolution

M.F. Wehner, doi:10.1002/2013MS000276

SLIDE 37

Unique Challenges for Climate Analytics

  • interpret as a segmentation problem
  • 3 classes: background (BG), tropical cyclones (TC), atmospheric rivers (AR)
  • deep learning has proven successful for these tasks
  • climate data is complex:
  • high imbalance: more than 95% of pixels are background
  • high variance: the shapes of events change
  • many input channels with different properties
  • high resolution required
  • no static background; highly variable in space and time

SLIDE 38

Unique Challenges for Deep Learning

  • need labeled data for supervised approach
  • can be leveraged from existing heuristic-based approaches
  • define neural network architecture
  • balance between compute performance and model accuracy
  • employ high-productivity/flexibility frameworks for rapid prototyping
  • performance optimization requires a holistic approach
  • hyperparameter tuning (HPO)
  • necessary for convergence and accuracy
SLIDE 39

Unique Challenges for Deep Learning at Extreme Scale

  • data management
  • shuffling/loading/processing/feeding a 20 TB dataset to keep GPUs busy
  • efficient use of remote filesystem
  • multi-node coordination and synchronization
  • synchronous reduction of O(50) MB across 27,360 GPUs after each iteration
  • hyperparameter tuning (HPO)
  • convergence and accuracy challenging due to larger global batch sizes
SLIDE 40

Label Creation: Atmospheric Rivers

  • 1. The climate model predicts water vapor, wind speeds, and humidity
  • 2. These observables are used to compute the Integrated Water Vapor Transport

SLIDE 41

Label Creation: Atmospheric Rivers

  • 3. Binarization by thresholding at the 95th percentile
  • 4. A flood-fill algorithm generates AR candidates by masking out regions in the mid-latitudes (sketched below)
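
A minimal sketch of steps 3-4 under stated assumptions: ivt stands in for the precomputed Integrated Water Vapor Transport field, the mid-latitude band is illustrative, and scipy.ndimage connected components stand in for the flood fill. The tropical-cyclone labels on the next slide follow the same threshold-and-binarize pattern.

```python
# Sketch of AR label creation (steps 3-4) on a stand-in IVT field.
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
ivt = rng.gamma(shape=2.0, scale=100.0, size=(768, 1152))  # stand-in field

# 3. Binarize by thresholding at the 95th percentile
binary = ivt > np.percentile(ivt, 95)

# 4. Connected components (flood fill) -> AR candidate regions,
#    restricted to an illustrative mid-latitude band
lat = np.linspace(-90, 90, ivt.shape[0])[:, None]
binary &= (np.abs(lat) > 20) & (np.abs(lat) < 60)
candidates, n = ndimage.label(binary)   # each AR candidate gets an id
```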

SLIDE 42

Label Creation: Tropical Cyclones

  • 1. Extract the cyclone center and radius using thresholds for pressure, temperature, and vorticity
  • 2. Binarize the patch around the cyclone center using thresholds for water vapor, wind, and precipitation

SLIDE 43

Systems

Piz Daint
  • Cray XC50 HPC system at CSCS, 5th on the Top500
  • 5320 nodes, each with an Intel Xeon E5-2695v3 and 1 NVIDIA P100 GPU
  • Cray Aries interconnect in a diameter-5 dragonfly topology
  • ~54.4 PetaFlop/s peak performance (FP32)

Summit
  • leadership-class HPC system at OLCF, 1st on the Top500
  • 4608 nodes, each with 2 IBM POWER9 CPUs and 6 NVIDIA V100 GPUs
  • 300 GB/s NVLink connection between the 3 GPUs in a group
  • 800 GB of NVMe storage available per node
  • dual-rail EDR InfiniBand in a fat-tree topology
  • ~3.45 ExaFlop/s theoretical peak performance (FP16)

SLIDE 44

Single GPU

(Stack diagram: TensorFlow running on cuDNN, single GPU)

  • Things to consider:
  • Is my TensorFlow model efficiently using GPU resources?
  • Is my data input pipeline keeping up?
  • Is my TensorFlow model providing reasonable results?

SLIDE 45

Single Node

(Stack diagram: data-parallel TensorFlow over NCCL/MPI, with cuDNN on each of six GPUs)

  • Things to consider:
  • Is my data input pipeline still keeping up?
  • Is my data-parallel TensorFlow model providing reasonable results?
  • How is my performance scaling over PCIe/NVLink using Horovod?

SLIDE 46

Multi Node

(Stack diagram: many single-node stacks connected via NCCL/MPI)

  • Things to consider:
  • Is my data-parallel TensorFlow model still providing reasonable results?
  • How is my performance scaling over NVLink + InfiniBand using Horovod?
  • How do I distribute my data across so many nodes?

SLIDE 47

Deep Learning Models for Extreme Weather Segmentation

(Architecture diagrams: two encoder-decoder segmentation networks; input 1152×768 with 16 channels, output 1152×768 with 3 class channels)

  • Tiramisu: 35 layers, 7.8M parameters, 4.2 TF/sample
  • DeepLabv3+ (encoder + ASPP + decoder): 66 layers, 43.7M parameters, 14.4 TF/sample
SLIDE 48

On-Node I/O Pipeline

  • files are in HDF5, with a single sample + label per file
  • list of filenames is passed to the TensorFlow Dataset API (tf.data)
  • HDF5 serialization bottleneck addressed with multiprocessing + h5py
  • extract and batch using the tf.data input pipeline: shuffle the file list, 4-way parallel read + preprocess on the CPU, then batch and feed the GPU (sketched below)
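
A minimal sketch of that pipeline, assuming one sample + label per HDF5 file under keys "data"/"label" (names, shapes, and batch size are illustrative, not the paper's exact schema):

```python
# On-node input pipeline: shuffle filenames, read HDF5 in parallel,
# batch, and prefetch so the GPU is never starved.
import glob
import h5py
import numpy as np
import tensorflow as tf

def load_h5(path):
    with h5py.File(path.decode(), "r") as f:
        return f["data"][...].astype(np.float32), f["label"][...].astype(np.int32)

filenames = sorted(glob.glob("data-*.h5"))
ds = (tf.data.Dataset.from_tensor_slices(filenames)
        .shuffle(len(filenames))                      # shuffle the file list
        .map(lambda p: tf.numpy_function(load_h5, [p], [tf.float32, tf.int32]),
             num_parallel_calls=4)                    # 4-way parallel read
        .batch(2)
        .prefetch(1))                                 # keep the GPU fed
```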

SLIDE 49

Improvements to Horovod: Original Control Plane

(Diagram: every worker sends its list of ready tensors, e.g. {1, 2, 5, 8, 13}, to a single coordinator w1; w1 gathers the lists, intersects them to find the tensors ready everywhere, e.g. {5, 13}, and broadcasts that set for the next allreduce.)

SLIDE 50

Improvements to Horovod: Tree-based Control Plane

(Diagram: the gather + intersect of ready-tensor lists is performed asynchronously over a tree of workers, and the resulting allreduce schedule, e.g. {5, 13}, is distributed by a tree-based broadcast.)

SLIDE 51

Improvements to Horovod: Hybrid All-Reduce

  • NCCL uses NVLink for high throughput, but ring-based algorithms are latency-limited at scale
  • hybrid NCCL/MPI strategy uses the strengths of both
  • one inter-node allreduce per virtual NIC
  • MPI work overlaps well with GPU computation

(Diagram: intra-node allreduce (NCCL), then 4× inter-node allreduce (MPI), then intra-node broadcast (NCCL); sketched below)
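
A sketch of the hybrid scheme using mpi4py, splitting MPI_COMM_WORLD by node; in the real system the intra-node steps use NCCL, for which plain MPI sub-communicators stand in here (ranks-per-node and buffer size are assumptions):

```python
# Hybrid allreduce: intra-node reduce, leaders-only inter-node
# allreduce, intra-node broadcast.
import numpy as np
from mpi4py import MPI

world = MPI.COMM_WORLD
node_id = world.Get_rank() // 6            # assume 6 GPU ranks per node
local = world.Split(color=node_id, key=world.Get_rank())
leaders = world.Split(color=(local.Get_rank() == 0), key=world.Get_rank())

grad = np.ones(1 << 20, dtype=np.float32)  # this rank's gradient buffer
summed = np.empty_like(grad)

# 1. intra-node reduce to the local leader (NCCL in the real system)
local.Reduce(grad, summed, op=MPI.SUM, root=0)
# 2. inter-node allreduce among node leaders only (MPI)
if local.Get_rank() == 0:
    leaders.Allreduce(MPI.IN_PLACE, summed, op=MPI.SUM)
# 3. intra-node broadcast of the result (NCCL in the real system)
local.Bcast(summed, root=0)
```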

SLIDE 52

Gradient Pipelining (Lag)

(Diagram: with lag-0, fully synchronous, every worker's gradient g_k is allreduced into ḡ_k and applied before step k+1 proceeds; with lag-1, each worker enqueues g_k, applies the already-reduced ḡ_{k-1} from the previous step, and overlaps the allreduce of g_k with the next step's computation; sketched below)
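
A sketch of lag-1 pipelining with a non-blocking MPI allreduce (mpi4py's Iallreduce); compute_gradient and apply_update are hypothetical stand-ins for the model and optimizer:

```python
# Lag-1 gradient pipelining: step k applies the reduced gradient
# from step k-1 while the allreduce of g_k runs in the background.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def train(steps, compute_gradient, apply_update):
    pending, inflight, reduced = None, None, None
    for k in range(steps):
        g = compute_gradient(k)                       # local gradient g_k
        if pending is not None:
            pending.Wait()                            # allreduce of g_{k-1} done
            apply_update(reduced / comm.Get_size())   # apply mean gradient
        inflight = g                  # keep g_k alive while MPI uses its buffer
        reduced = np.empty_like(g)
        pending = comm.Iallreduce(inflight, reduced, op=MPI.SUM)
    # note: the final in-flight gradient is left unapplied in this sketch
```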

SLIDE 53

Scaling Tiramisu

  • FP16 model is sensitive to communication
  • FP16 model is bandwidth-bound (only 2.5× faster than FP32)
  • almost ideal scaling for both precisions on Summit when gradient lag is used

SLIDE 54

Scaling DeepLabv3+

  • FP16 model is sensitive to communication
  • FP16 model is bandwidth-bound (only 2.5× faster than FP32)
  • excellent scaling for both precisions on Summit when gradient lag is used

SLIDE 55

Concurrency/Precision and Convergence


SLIDE 56

Concurrency/Precision and Convergence

~2.1× improvement in time to solution

SLIDE 57

Model/Lag and Convergence

SLIDE 58

Segmentation Analysis

  • best result obtained for intersection-over-union (IoU): ~73%
  • result at large scale (batch size > 1500): IoU ~55%


SLIDE 59

Conclusions

  • deep learning and HPC converge, achieving exascale performance
  • compute capabilities of contemporary HPC systems can be utilized to tackle challenging scientific deep learning problems
  • HPO and convergence at scale are still an open problem, but now we can do it
  • software enhancements benefit the deep learning community at large
  • deep learning-powered techniques usher in a new era of precision analytics for various science areas

SLIDE 60

Thank You. Questions?