Distributed Training
Khoa Le & Somin Wadhwa
Background
The Problem?
Go-to Solution: Distribute Machine Learning Applications Across Multiple Processors/Nodes
Traditional Machine Learning Task
Distributed Machine Learning Task
Obvious solution:
- Utilize TensorFlow’s native support for distributed training.
Issues:
- New concepts, but not a lot of documentation!
- Simply didn’t scale well. :(
Data Parallel Approach:
Updating gradients: Parameter Server Approach
- What should the ratio of workers to parameter servers be?
- Increased complexity leads to an increase in the amount of information being passed around.
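A minimal single-process sketch of the parameter-server pattern (all names here are illustrative, not any framework's API): workers compute gradients against the current weights, and the server averages and applies them synchronously.

```python
import numpy as np

def worker_gradient(weights, batch):
    """Stand-in for a real forward/backward pass (linear regression here)."""
    xs, ys = batch
    preds = xs @ weights
    return xs.T @ (preds - ys) / len(xs)

class ParameterServer:
    """Holds the global weights; applies averaged worker gradients."""
    def __init__(self, dim, lr=0.01):
        self.weights = np.zeros(dim)
        self.lr = lr

    def apply_gradients(self, grads):
        # Synchronous update: average the gradients from all workers.
        self.weights -= self.lr * np.mean(grads, axis=0)

ps = ParameterServer(dim=10)
shards = [(np.random.randn(32, 10), np.random.randn(32)) for _ in range(4)]
for step in range(100):
    grads = [worker_gradient(ps.weights, shard) for shard in shards]  # one per worker
    ps.apply_gradients(grads)
```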
Better way to update gradients: Ring-AllReduce
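To make the communication pattern concrete, here is a NumPy simulation of ring-allreduce (a sketch of the algorithm itself, not Baidu's or Horovod's actual code). Each worker's buffer is split into N chunks that make one lap around the ring accumulating sums (reduce-scatter), then a second lap distributing the results (allgather), so each worker sends only 2(N-1)/N of its data, independent of the number of workers.

```python
import numpy as np

def ring_allreduce(buffers):
    """Simulate ring-allreduce: sums all workers' buffers."""
    n = len(buffers)
    # Each worker's buffer is viewed as n chunks of (nearly) equal size.
    chunks = [np.array_split(buf, n) for buf in buffers]

    # Phase 1: reduce-scatter. After n-1 steps, worker i holds the
    # fully reduced chunk (i + 1) % n.
    for s in range(n - 1):
        for i in range(n):
            k = (i - 1 - s) % n
            chunks[i][k] = chunks[i][k] + chunks[(i - 1) % n][k]

    # Phase 2: allgather. Reduced chunks travel one more lap around the ring.
    for s in range(n - 1):
        for i in range(n):
            k = (i - s) % n
            chunks[i][k] = chunks[(i - 1) % n][k]

    return [np.concatenate(c) for c in chunks]

# 4 simulated workers, each with a gradient vector of length 8.
grads = [np.full(8, float(rank)) for rank in range(4)]
reduced = ring_allreduce(grads)
assert all(np.array_equal(r, np.full(8, 0.0 + 1 + 2 + 3)) for r in reduced)
```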
Horovod:
- Created a standalone package based on Baidu’s ring-allreduce algorithm, fully integrated with TensorFlow.
- Replaced the original ring-allreduce implementation with NVIDIA’s NCCL (i.e., optimized ring-allreduce across multiple machines).
- Added support for models that fit inside a single server, potentially on multiple GPUs, whereas the original version only supported models that fit on a single GPU. (How??)
- API improvements.
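For reference, the canonical usage pattern from the Horovod paper (TF1-era API): only a few lines change relative to single-GPU code.

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each process to a single GPU, selected by its local rank.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Scale the learning rate by the number of workers, then wrap the optimizer
# so gradients are averaged with ring-allreduce after every step.
opt = tf.train.AdagradOptimizer(0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

# Broadcast rank 0's initial variables so every worker starts identically.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
```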
Benchmarking:
Motivation for CROSSBOW
- To reduce training time, systems exploit data parallelism across many GPUs.
- How: parallel synchronous SGD (S-SGD).
Motivation for CROSSBOW
- To utilise many GPUs effectively, a large batch size is needed.
- Why: the communication overhead of moving data to and from the GPU dominates if the batch size is too small.
Motivation for CROSSBOW
- However, large batch sizes reduce statistical efficiency.
- Why: small batches reach a target accuracy in fewer epochs and are more likely to find solutions that generalise well.
- Typical solutions: dynamically adjusting the batch size, or other hyper-parameters such as the learning rate and momentum.
- These do not always work well because the methodology is time-consuming and model-specific.
- Need a DL system that trains effectively with small batch sizes (2-32) while still scaling to many GPUs.
=> CROSSBOW: a single-server, multi-GPU DL system that improves statistical efficiency when increasing the number of GPUs, irrespective of the batch size.
Motivation for CROSSBOW
Key contributions of CROSSBOW
- Synchronous model averaging
- Auto-tuning the number of learners
- Concurrent task engine
SMA with Learners (Important Concepts)
- Learner: an entity that independently trains a single model replica with a given batch size.
- In contrast with parallel S-SGD, model replicas with learners evolve independently because they are not reset to a single global model after each batch.
Parallel Synchronous SGD VS SMA with learners
SMA with Learners (Important Concepts)
- Synchronize the local models of learners by maintaining a central average model.
- Prevent learners from diverging by applying a correction to each local model.
- Use momentum to make the central model converge faster than the learners.
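A NumPy sketch of one SMA step, following the EASGD-style update the paper builds on; grad_fn, lr, alpha and mu are our illustrative stand-ins for the paper's notation, not CROSSBOW's actual API. Each learner takes a gradient step plus a correction toward the central model, and the central model absorbs the corrections with momentum.

```python
import numpy as np

def sma_step(replicas, central, velocity, grad_fn, batches,
             lr=0.1, alpha=0.1, mu=0.9):
    """One synchronous model averaging step across all learners."""
    corrections = []
    for i, batch in enumerate(batches):
        g = grad_fn(replicas[i], batch)          # independent gradient step
        c = alpha * (replicas[i] - central)      # pull toward the central model
        replicas[i] = replicas[i] - lr * g - c   # learner update with correction
        corrections.append(c)
    # The central model accumulates the corrections; momentum (mu) makes it
    # converge faster than the individual learners.
    velocity = mu * velocity + np.sum(corrections, axis=0)
    central = central + velocity
    return replicas, central, velocity
```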
SMA with Learners (Important Concepts)
- Input/output
- Initialization
- Iterative Learning process
- Update to central average model
SMA with Learners (Algorithm)
- To achieve high hardware utilisation, we can execute multiple learners per GPU.
- Local reference model vs. central average model.
SMA with Learners (Expansion)
CROSSBOW System Design
- Must share GPU efficiently
- Decide on # of learners/GPU at runtime
- Global SMA synchronization needs to be efficient.
- Data pre-processors: prepare the training dataset into batches.
- Task manager: controls the pools of model replicas, input batches and learner streams.
- Task scheduler: assigns learning tasks to GPUs based on the available resources.
- Auto-tuner
SMA with Learners (Crossbow Implementation)
- Learner streams: hold a learning task and a corresponding local synchronization task.
- Synchronization streams: hold a global synchronization task.
- Overlaps the synchronization tasks from one iteration with the learning tasks of the next.
SMA with Learners (Crossbow Implementation)
- Concurrency!
- Use all-reduce (hello Horovod!) for inter-GPU operations: it evenly distributes the computation of the average-model update among the GPUs.
- Schedule new learning tasks on a first-come-first-served basis.
SMA with Learners (Crossbow Implementation)
Choosing the number of learners
- Too few, and a GPU is under-utilised, wasting resources.
- Too many, and the execution of otherwise independent learners is partially sequentialized on a GPU, leading to a slow-down.
=> Tune the number of learners per GPU based on the training throughput at runtime.
Tuning Learners (CROSSBOW Implementation)
- Auto-tuner: measures the training throughput by considering the rate at which learning tasks complete, as recorded by the task manager.
- On a server with homogeneous GPUs, it measures only the throughput of a single GPU to adapt the number of learners for all GPUs.
- Adding a learner to a GPU requires allocating a new model replica and a corresponding learner stream.
- Temporarily places a global execution barrier between two successive iterations.
- Also locks the resource pools, preventing access by the task scheduler or manager during resizing.
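A sketch of the tuning policy as we understand it (function names are ours, not CROSSBOW's): keep adding learners per GPU while the measured throughput improves, reverting the last addition once it stops helping.

```python
def autotune_learners(measure_throughput, add_learner, remove_learner,
                      learners=1, max_learners=8):
    """Grow the number of learners per GPU while training throughput improves."""
    best = measure_throughput()
    while learners < max_learners:
        add_learner()      # barrier between iterations, allocate replica + stream
        learners += 1
        current = measure_throughput()
        if current <= best:
            remove_learner()   # no improvement: revert and stop searching
            learners -= 1
            break
        best = current
    return learners
```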
Tuning Learners (CROSSBOW Implementation)
Memory MGMT (CROSSBOW Implementation)
- CROSSBOW uses double buffering to create a pipeline between the data pre-processors and the task scheduler.
- An offline memory plan reuses the output buffers of operators using reference counters, which reduces the memory footprint of a learner by up to 50%.
- With multiple learners per GPU, an online memory plan enables sharing some of the output buffers among learners on the same GPU, to avoid over-allocating memory.
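A toy Python sketch of the reference-counting idea (illustrative only; the offline/online planning itself is omitted): an operator's output buffer returns to a free list once its last downstream consumer releases it, so later operators can reuse it.

```python
from collections import defaultdict

class BufferPool:
    """Reuse operator output buffers once all consumers have read them."""
    def __init__(self):
        self.free = defaultdict(list)   # size -> buffers available for reuse
        self.refcount = {}              # id(buffer) -> pending consumers

    def allocate(self, size, num_consumers):
        pool = self.free[size]
        buf = pool.pop() if pool else bytearray(size)
        self.refcount[id(buf)] = num_consumers
        return buf

    def release(self, buf):
        # A consumer finished reading; recycle the buffer at count zero.
        self.refcount[id(buf)] -= 1
        if self.refcount[id(buf)] == 0:
            del self.refcount[id(buf)]
            self.free[len(buf)].append(buf)
```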
Scalability Results
Statistical Efficiency VS Hardware Efficiency
Selecting number of learners
SMA vs other
Synchronization efficiency
- Pros:
- It introduces an alternative synchronization strategy (SMA), which allows training with small batch sizes to achieve better hardware efficiency.
- The system design provides efficient concurrent execution of learning and synchronisation tasks on GPUs.
- Cons:
- It lacks automatic differentiation and other more advanced user primitives when compared to TensorFlow.
- It was only tested on a single multi-GPU server; distributing CROSSBOW across a cluster would face more challenges, such as heterogeneous resources.
Imperative Programming:
- NumPy
- Matlab
Declarative Programming:
- Caffe
- TensorFlow (sort of... a mixture of both)
MXNet (mix-net) Programming Interface:
- Symbol: used to declare the compute graph (compositions of symbols range from simple operators to complex ConvLayers).
- Supports auto-diff, in addition to load, save, visualize, etc.
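A small example of declaring a graph with the Symbol API (layer names and sizes here are arbitrary):

```python
import mxnet as mx

# Declarative side: compose Symbols into a compute graph; no data flows yet.
data = mx.sym.Variable('data')
fc1  = mx.sym.FullyConnected(data=data, num_hidden=128, name='fc1')
act1 = mx.sym.Activation(data=fc1, act_type='relu', name='relu1')
fc2  = mx.sym.FullyConnected(data=act1, num_hidden=10, name='fc2')
net  = mx.sym.SoftmaxOutput(data=fc2, name='softmax')

net.save('net.json')            # serialize the graph
print(net.list_arguments())     # e.g. ['data', 'fc1_weight', 'fc1_bias', ...]
```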
MXNet (mix-net) Programming Interface:
- NDArray: imperative computations that work seamlessly with Symbols.
- Fills the gap between declarative symbolic expressions and the host language.
- Complex symbolic expressions are often evaluated efficiently because MXNet also uses lazy execution of NDArrays: the dependency engine can batch and overlap independent operations.
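To make the lazy-execution point concrete, a minimal NDArray example:

```python
import mxnet as mx

# Imperative side: NDArray ops look eager but are queued to MXNet's
# dependency engine, which can run independent operations in parallel.
a = mx.nd.ones((100, 100))
b = mx.nd.ones((100, 100))
c = a * 2 + b             # returns immediately; execution is scheduled lazily
d = mx.nd.dot(a, b)       # independent of c, so it may run concurrently
print(c.asnumpy()[0, 0])  # asnumpy() blocks until c has actually been computed
```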
MXNet (mix-net) Programming Interface:
- KVStore: a distributed key-value store for data synchronization over multiple nodes.
- Weight updating function is registered to the KVStore.
- Each worker repeatedly pulls the newest weight from the store.
- Pushes out the locally computed gradient.
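A minimal single-machine sketch of the KVStore workflow (the 'local' store here; 'dist_sync' distributes it across nodes):

```python
import mxnet as mx

kv = mx.kv.create('local')      # 'dist_sync' would span multiple nodes

# Register the weight-update rule once; the store applies it to pushed gradients.
kv.set_optimizer(mx.optimizer.SGD(learning_rate=0.1))

shape = (2, 3)
kv.init(3, mx.nd.zeros(shape))      # key 3 holds one weight array
kv.push(3, mx.nd.ones(shape))       # push a locally computed gradient
weight = mx.nd.zeros(shape)
kv.pull(3, out=weight)              # pull the newest weight
print(weight.asnumpy())             # moved by -lr * grad = -0.1
```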
MXNet Implementation:
- Graph Computation: the graph representation suggests straightforward optimizations, e.g. at inference time only the forward pass is needed, features can be extracted by simply skipping the last layers, and multiple operators can be grouped into one.
- Memory Allocation: simple idea, reuse the memory of non-intersecting variables. To reduce the complexity of determining such an allocation, a heuristic is proposed.