Distributed Training, by Khoa Le & Somin Wadhwa (PowerPoint presentation)



SLIDE 1

Distributed Training

Khoa Le & Somin Wadhwa

SLIDE 2

Background

SLIDE 3

The Problem?

SLIDE 4

Go-to Solution: Distribute Machine Learning Applications Across Multiple Processors/Nodes

SLIDE 5

SLIDE 6

Traditional Machine Learning Task

SLIDE 7

Distributed Machine Learning Task

SLIDE 8

SLIDE 9

SLIDE 10

Obvious solution:

  • Utilize TensorFlow’s native support for distributed training.

Issues:

  • New concepts, but not a lot of documentation!
  • It simply didn’t scale well. :(
SLIDE 11

Data Parallel Approach:

SLIDE 12

Updating gradients: Parameter Server Approach

  • What is the right ratio of workers to parameter servers?
  • Increased complexity leads to an increase in the amount of information being passed around...
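One round of the parameter-server approach can be sketched in a single process (plain Python, not any framework's actual API; the one-parameter model y = w * x, the shards, and all constants are invented for illustration):

```python
# Toy single-process sketch of one parameter-server round. Workers
# compute gradients on their own data shards; the server averages the
# gradients and updates the shared weights.

def worker_gradient(weights, shard):
    # Stand-in for a forward/backward pass: gradient of mean squared
    # error for the toy model y = w * x on this worker's shard.
    w = weights[0]
    g = sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)
    return [g]

def parameter_server_step(weights, gradients, lr=0.1):
    # The server averages the workers' gradients and applies one update.
    n = len(gradients)
    return [w - lr * sum(g[i] for g in gradients) / n
            for i, w in enumerate(weights)]

weights = [0.0]
shards = [[(1.0, 2.0)], [(2.0, 4.0)]]   # one data shard per worker
for _ in range(50):
    grads = [worker_gradient(weights, s) for s in shards]
    weights = parameter_server_step(weights, grads)
print(round(weights[0], 2))  # converges to 2.0, the true slope
```

Every extra worker adds another gradient that must travel to the server each step, which is why the worker-to-server ratio above matters.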

SLIDE 13

Better way to update gradients: Ring-AllReduce
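The idea can be sketched in a single-process simulation (plain Python; the worker count, one-element chunks, and data are illustrative):

```python
# Single-process simulation of ring-allreduce. Each of the n workers
# holds a gradient vector split into n chunks (here one element per
# chunk). After n-1 scatter-reduce steps plus n-1 all-gather steps,
# every worker holds the full sum, and each worker only ever sends
# one chunk per step to its ring neighbour.

def ring_allreduce(grads):
    n = len(grads)
    data = [list(g) for g in grads]   # data[r][c]: worker r's copy of chunk c
    # Phase 1, scatter-reduce: pass partial sums around the ring.
    for step in range(n - 1):
        sends = [(r, (r - step) % n) for r in range(n)]
        values = [data[r][c] for r, c in sends]  # snapshot: sends are concurrent
        for (r, c), v in zip(sends, values):
            data[(r + 1) % n][c] += v
    # Worker r now owns the fully reduced chunk (r + 1) % n.
    # Phase 2, all-gather: circulate the reduced chunks around the ring.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n) for r in range(n)]
        values = [data[r][c] for r, c in sends]
        for (r, c), v in zip(sends, values):
            data[(r + 1) % n][c] = v
    return data

grads = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
print(ring_allreduce(grads))  # every worker ends with [111, 222, 333]
```

Unlike the parameter-server layout, each worker's per-step traffic is constant in the number of workers, which is what makes the ring bandwidth-optimal.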

SLIDE 14

Horovod:

  • Created a standalone package based on Baidu’s ring-allreduce implementation, fully integrated with TensorFlow.
  • Replaced the baseline ring-allreduce implementation with NVIDIA’s NCCL library, which provides a highly optimized version of ring-allreduce, including across multiple machines.
  • Added support for models that fit inside a single server, potentially on multiple GPUs, whereas the original version only supported models that fit on a single GPU. (How??)
  • API improvements.
SLIDE 15

Benchmarking:

SLIDE 16

SLIDE 17

Motivation for CROSSBOW

  • To reduce training time, systems exploit data parallelism across many GPUs.
  • How: parallel synchronous SGD.
SLIDE 18

Motivation for CROSSBOW

  • To utilise many GPUs effectively, a large batch size is needed.
  • Why: the communication overhead of moving data to and from the GPU dominates if the batch size is too small.

SLIDE 19

Motivation for CROSSBOW

  • However, large batch sizes reduce statistical efficiency.
  • Why: small batches ensure faster training and are more likely to find solutions with better accuracy.

SLIDE 20
Motivation for CROSSBOW

  • Typical solutions: dynamically adjusting the batch size, or other hyperparameters such as the learning rate and momentum.
  • These do not always work well, because they involve time-consuming, model-specific methodology.
  • We need a DL system that trains effectively with small batch sizes (2-32) while still scaling to many GPUs => CROSSBOW: a single-server, multi-GPU DL system that improves statistical efficiency as the number of GPUs increases, irrespective of the batch size.

SLIDE 21

Key contributions of CROSSBOW

  • Synchronous model averaging
  • Auto-tuning the number of learners
  • Concurrent task engine
SLIDE 22

SMA with Learners (Important Concepts)

  • Learner: an entity that trains a single model replica independently with a given batch size.
  • In contrast with parallel S-SGD, model replicas trained by learners evolve independently because they are not reset to a single global model after each batch.

SLIDE 23

SMA with Learners (Important Concepts)

Parallel Synchronous SGD vs. SMA with Learners

SLIDE 24
SMA with Learners (Important Concepts)

  • Synchronize the learners’ local models by maintaining a Central Average Model.
  • Prevent learners from diverging by applying a Correction to each local model.
  • Use momentum to make the central model converge faster than the learners.

SLIDE 25
SMA with Learners (Algorithm)

  • Input/output
  • Initialization
  • Iterative learning process
  • Update to the central average model
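The algorithm outline can be made concrete with a heavily simplified sketch (plain Python, on a 1-D toy loss; the constants and the exact correction and momentum forms are illustrative, not CROSSBOW's precise update rule):

```python
# Heavily simplified sketch of synchronous model averaging (SMA) on
# the toy loss (w - 2)^2. Each learner takes an independent SGD step,
# a correction pulls it toward the central average model, and the
# central model absorbs the corrections with momentum.

def sma_step(learners, central, prev_central, lr=0.1, alpha=0.1, mu=0.9):
    # 1. Each learner takes an independent SGD step on its own batch.
    grads = [2.0 * (w - 2.0) for w in learners]
    # 2. A correction pulls each learner toward the central model.
    corrections = [alpha * (w - central) for w in learners]
    new_learners = [w - lr * g - c
                    for w, g, c in zip(learners, grads, corrections)]
    # 3. The central model absorbs all corrections, with momentum.
    new_central = central + sum(corrections) + mu * (central - prev_central)
    return new_learners, new_central, central

# Three learners start from different replicas; unlike parallel S-SGD,
# they are never reset to a single global model.
learners, central, prev = [0.0, 1.0, 4.0], 0.0, 0.0
for _ in range(200):
    learners, central, prev = sma_step(learners, central, prev)
print(round(central, 2))  # both learners and central model approach 2.0
```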

SLIDE 26
SMA with Learners (Expansion)

  • To achieve high hardware utilisation, we can execute multiple learners per GPU.
  • Local Reference Model vs. Central Average Model

SLIDE 27

CROSSBOW System Design

  • Must share each GPU efficiently.
  • Decide on the number of learners per GPU at runtime.
  • Global SMA synchronization needs to be efficient.

SLIDE 28
SMA with Learners (CROSSBOW Implementation)

  • Data pre-processors: prepare the training dataset into batches.
  • Task manager: controls the pools of model replicas, input batches, and learner streams.
  • Task scheduler: assigns learning tasks to GPUs based on the available resources.
  • Auto-tuner

SLIDE 29

SMA with Learners (CROSSBOW Implementation)

SLIDE 30
SMA with Learners (CROSSBOW Implementation)

  • Learner streams: each holds a learning task and a corresponding local synchronization task.
  • Synchronization streams: each holds a global synchronization task.
  • Overlap the synchronization tasks from one iteration with the learning tasks of the next.

SLIDE 31
SMA with Learners (CROSSBOW Implementation)

  • Concurrency!
  • Use all-reduce (hello Horovod!) for inter-GPU operations: it evenly distributes the computation of the average-model update among the GPUs.
  • Schedule new learning tasks on a first-come-first-served basis.

SLIDE 32

Choosing the number of learners

  • Too few, and a GPU is under-utilised, wasting resources.
  • Too many, and the execution of otherwise independent learners is partially sequentialized on a GPU, leading to a slow-down. => Tune the number of learners per GPU based on the training throughput at runtime.

SLIDE 33

Tuning Learners (CROSSBOW Implementation)

  • Auto-tuner: measures the training throughput by considering the rate at which learning tasks complete, as recorded by the task manager.
  • On a server with homogeneous GPUs, it measures only the throughput of a single GPU to adapt the number of learners for all GPUs.

SLIDE 34
Tuning Learners (CROSSBOW Implementation)

  • Adding a learner to a GPU requires allocating a new model replica and a corresponding learner stream.
  • Temporarily places a global execution barrier between two successive iterations.
  • Also locks the resource pools, preventing access by the task scheduler or manager during resizing.
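The throughput-based tuning loop can be sketched as follows (the throughput curve below is invented; CROSSBOW measures the real rate at which learning tasks complete, as recorded by the task manager):

```python
# Hedged sketch of throughput-based auto-tuning of learners per GPU:
# keep adding a learner while measured throughput improves, and stop
# once it no longer does.

def measured_throughput(num_learners):
    # Hypothetical curve: throughput rises with utilisation, then falls
    # once learners start serialising on the GPU (peak at 4 here).
    if num_learners <= 4:
        return num_learners * 100
    return 400 - (num_learners - 4) * 50

def autotune(max_learners=16):
    best, best_tp = 1, measured_throughput(1)
    for n in range(2, max_learners + 1):
        tp = measured_throughput(n)
        if tp <= best_tp:   # throughput stopped improving: back off
            break
        best, best_tp = n, tp
    return best

print(autotune())  # → 4 under the toy throughput model above
```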

SLIDE 35

Memory MGMT (CROSSBOW Implementation)

  • CROSSBOW uses double buffering to create a pipeline between the data pre-processors and the task scheduler.
  • An offline memory plan reuses the output buffers of operators using reference counters, which reduces the memory footprint of a learner by up to 50%.
  • With multiple learners per GPU, an online memory plan shares some of the output buffers among learners on the same GPU, to avoid over-allocating memory.
SLIDE 36

Scalability Results

SLIDE 37

Statistical Efficiency VS Hardware Efficiency

SLIDE 38

Selecting number of learners

SLIDE 39

SMA vs other

SLIDE 40

Synchronization efficiency

SLIDE 41
Pros:

  • It introduces an alternative synchronization strategy (SMA), which allows training with small batch sizes to achieve better hardware efficiency.
  • The system design provides efficient concurrent execution of learning and synchronization tasks on GPUs.

Cons:

  • It lacks automatic differentiation and other more advanced user primitives when compared to TensorFlow.
  • It is only tested on a single multi-GPU server. Distributing CROSSBOW across a cluster would face more challenges, such as heterogeneous resources.

SLIDE 42

SLIDE 43

Declarative Programming:

  • Caffe
  • TensorFlow (sort of… a mixture of both)

Imperative Programming:

  • NumPy
  • Matlab
SLIDE 44

MXNet (mix-net) Programming Interface:

  • Symbol: used to declare the compute graph (compositions of symbols range from simple operators to complex constructs such as convolution layers).
  • Supports auto-differentiation, in addition to load, save, visualize, etc.
SLIDE 45

MXNet (mix-net) Programming Interface:

  • NDArray: imperative tensor computations that work seamlessly with Symbols.
  • Fills the gap between declarative symbolic expressions and the host language.
  • Mixed symbolic and NDArray expressions are often evaluated efficiently because MXNet also uses lazy evaluation of NDArray operations. (So?)
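Why lazy evaluation matters can be illustrated with a toy (plain Python, not MXNet's actual classes): operations only record a graph of pending work, so the engine is free to batch, reorder, or fuse them, and nothing executes until a value is actually read.

```python
# Toy lazily-evaluated array: arithmetic builds a graph of pending
# operations; eval() forces the computation on demand (MXNet does
# this when the host language reads the data, e.g. via asnumpy()).

class LazyArray:
    def __init__(self, value=None, op=None, args=()):
        self.value, self.op, self.args = value, op, args

    def __add__(self, other):
        return LazyArray(op=lambda a, b: a + b, args=(self, other))

    def __mul__(self, other):
        return LazyArray(op=lambda a, b: a * b, args=(self, other))

    def eval(self):
        # Force the pending computation, recursively evaluating inputs.
        if self.value is None:
            self.value = self.op(*(a.eval() for a in self.args))
        return self.value

x, y = LazyArray(3.0), LazyArray(4.0)
z = x * y + x           # builds the graph; nothing is computed yet
print(z.value is None)  # True: evaluation is deferred
print(z.eval())         # 15.0
```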

SLIDE 46

MXNet (mix-net) Programming Interface:

  • KVStore: a distributed key-value store for data synchronization across multiple nodes.
  • A weight-updating function is registered with the KVStore.
  • Each worker repeatedly pulls the newest weights from the store.
  • Each worker pushes out its locally computed gradients.
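The push/pull pattern can be sketched with a toy in-process store (this mimics the semantics only; it is not MXNet's actual kvstore API, and the SGD updater is illustrative):

```python
# Toy sketch of the KVStore pattern: an updater function is registered
# once, workers push local gradients, and workers pull back the
# updated weights.

class ToyKVStore:
    def __init__(self):
        self.store, self.updater = {}, None

    def set_updater(self, fn):
        # Register the weight-updating function, applied store-side.
        self.updater = fn

    def init(self, key, value):
        self.store[key] = value

    def push(self, key, grads):
        # Aggregate the gradients pushed by the workers, then update.
        self.store[key] = self.updater(self.store[key], sum(grads))

    def pull(self, key):
        # Workers pull the newest weights.
        return self.store[key]

kv = ToyKVStore()
kv.set_updater(lambda weight, grad: weight - 0.1 * grad)  # SGD rule
kv.init("w", 1.0)
kv.push("w", [0.5, 1.5])   # two workers push local gradients
print(kv.pull("w"))        # 1.0 - 0.1 * (0.5 + 1.5) = 0.8
```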
SLIDE 47

MXNet Implementation:

  • Graph computation: enables straightforward optimizations. For example, at inference time only the forward pass is needed; to extract features we can simply skip the last layers; multiple operators can be grouped into one; etc.
  • Memory allocation: the simple idea is to reuse memory across variables whose lifetimes do not intersect. To reduce the complexity of determining such an allocation, a heuristic is used.
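A minimal sketch of the reuse idea (the graph and op names are invented): track a reference count per intermediate output and return its buffer to a free pool once its last consumer has run, so variables with non-intersecting lifetimes share memory.

```python
# Lifetime-based buffer reuse via reference counting. Each op's output
# gets a buffer; when the last consumer of an output has run, its
# buffer goes back to a free pool for later ops to reuse.

def plan_buffers(ops):
    """ops: list of (output_name, input_names) in execution order.
    Returns a mapping output_name -> buffer id."""
    # Reference count = number of ops that consume each output.
    refcount = {}
    for _, inputs in ops:
        for name in inputs:
            refcount[name] = refcount.get(name, 0) + 1
    free, assignment, next_buf = [], {}, 0
    for out, inputs in ops:
        if free:                        # reuse a freed buffer if possible
            assignment[out] = free.pop()
        else:                           # otherwise allocate a new one
            assignment[out] = next_buf
            next_buf += 1
        for name in inputs:             # this op consumed its inputs
            refcount[name] -= 1
            if refcount[name] == 0:     # last consumer has run
                free.append(assignment[name])
    return assignment

# A simple chain: data -> conv -> relu -> fc -> softmax.
chain = [("data", []), ("conv", ["data"]), ("relu", ["conv"]),
         ("fc", ["relu"]), ("softmax", ["fc"])]
plan = plan_buffers(chain)
print(plan)                    # five outputs share just two buffers
print(max(plan.values()) + 1)  # → 2
```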

SLIDE 48

SLIDE 49

Discussions