Distributed Training
Khoa Le & Somin Wadhwa
Background
The Problem?
Go-to Solution: Distribute Machine Learning Applications Across Multiple Processors/Nodes
Traditional Machine Learning Task
Distributed Machine Learning Task
Obvious solution:
- Utilize TensorFlow’s native support for distributed training.
Issues:
- New concepts, but not a lot of documentation!
- Simply didn’t scale well. :(
Data Parallel Approach:
Updating gradients: Parameter Server Approach
- What should the ratio of workers to parameter servers be?
- Increased complexity leads to an increase in the amount of information being passed around.
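A minimal single-process sketch of the parameter-server pattern (all names here are illustrative, not any framework's API): workers compute gradients against the current weights, and the server averages and applies them synchronously.

```python
import numpy as np

def worker_gradient(weights, batch):
    """Stand-in for a real forward/backward pass (linear regression here)."""
    xs, ys = batch
    preds = xs @ weights
    return xs.T @ (preds - ys) / len(xs)

class ParameterServer:
    """Holds the global weights; applies averaged worker gradients."""
    def __init__(self, dim, lr=0.01):
        self.weights = np.zeros(dim)
        self.lr = lr

    def apply_gradients(self, grads):
        # Synchronous update: average the gradients from all workers.
        self.weights -= self.lr * np.mean(grads, axis=0)

ps = ParameterServer(dim=10)
shards = [(np.random.randn(32, 10), np.random.randn(32)) for _ in range(4)]
for step in range(100):
    grads = [worker_gradient(ps.weights, shard) for shard in shards]  # one per worker
    ps.apply_gradients(grads)
```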
Better way to update gradients: Ring-AllReduce
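To make the communication pattern concrete, here is a NumPy simulation of ring-allreduce (a sketch of the algorithm itself, not Baidu's or Horovod's actual code). Each worker's buffer is split into N chunks that make one lap around the ring accumulating sums (reduce-scatter), then a second lap distributing the results (allgather), so each worker sends only 2(N-1)/N of its data, independent of the number of workers.

```python
import numpy as np

def ring_allreduce(buffers):
    """Simulate ring-allreduce: sums all workers' buffers."""
    n = len(buffers)
    # Each worker's buffer is viewed as n chunks of (nearly) equal size.
    chunks = [np.array_split(buf, n) for buf in buffers]

    # Phase 1: reduce-scatter. After n-1 steps, worker i holds the
    # fully reduced chunk (i + 1) % n.
    for s in range(n - 1):
        for i in range(n):
            k = (i - 1 - s) % n
            chunks[i][k] = chunks[i][k] + chunks[(i - 1) % n][k]

    # Phase 2: allgather. Reduced chunks travel one more lap around the ring.
    for s in range(n - 1):
        for i in range(n):
            k = (i - s) % n
            chunks[i][k] = chunks[(i - 1) % n][k]

    return [np.concatenate(c) for c in chunks]

# 4 simulated workers, each with a gradient vector of length 8.
grads = [np.full(8, float(rank)) for rank in range(4)]
reduced = ring_allreduce(grads)
assert all(np.array_equal(r, np.full(8, 0.0 + 1 + 2 + 3)) for r in reduced)
```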
Horovod:
- Created a standalone package based on Baidu’s ring-allreduce algorithm, fully integrated with TensorFlow.
- Replaced the original ring-allreduce implementation with NVIDIA’s NCCL (i.e., optimized ring-allreduce across multiple machines).
- Added support for models that fit inside a single server, potentially on multiple GPUs, whereas the original version only supported models that fit on a single GPU. (How??)
- API improvements.
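For reference, the canonical usage pattern from the Horovod paper (TF1-era API): only a few lines change relative to single-GPU code.

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each process to a single GPU, selected by its local rank.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Scale the learning rate by the number of workers, then wrap the optimizer
# so gradients are averaged with ring-allreduce after every step.
opt = tf.train.AdagradOptimizer(0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

# Broadcast rank 0's initial variables so every worker starts identically.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
```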
Benchmarking:
Motivation for CROSSBOW
- To reduce training time, systems exploit data parallelism across many GPUs.
- How: parallel synchronous SGD (S-SGD).
Motivation for CROSSBOW
- To utilise many GPUs effectively, a large batch size is needed.
- Why: the communication overhead of moving data to and from the GPU dominates if the batch size is too small.
Motivation for CROSSBOW
- However, large batch sizes reduce statistical efficiency.
- Why: small batches reach a target accuracy in fewer epochs and are more likely to find solutions that generalise well.
- Typical solutions: dynamically adjusting the batch size, or other hyper-parameters such as the learning rate and momentum.
- These do not always work well because the methodology is time-consuming and model-specific.
- Need a DL system that trains effectively with small batch sizes (2-32) while still scaling to many GPUs.
=> CROSSBOW: a single-server, multi-GPU DL system that improves statistical efficiency when increasing the number of GPUs, irrespective of the batch size.
Motivation for CROSSBOW
Key contributions of CROSSBOW
- Synchronous model averaging
- Auto-tuning the number of learners
- Concurrent task engine
SMA with Learners (Important Concepts)
- Learner: an entity that independently trains a single model replica with a given batch size.
- In contrast with parallel S-SGD, model replicas with learners evolve independently because they are not reset to a single global model after each batch.
Parallel Synchronous SGD VS SMA with learners
SMA with Learners (Important Concepts)
- Synchronize the local models of learners by maintaining a central average model.
- Prevent learners from diverging by applying a correction to each local model.
- Use momentum to make the central model converge faster than the learners.
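A NumPy sketch of one SMA step, following the EASGD-style update the paper builds on; grad_fn, lr, alpha and mu are our illustrative stand-ins for the paper's notation, not CROSSBOW's actual API. Each learner takes a gradient step plus a correction toward the central model, and the central model absorbs the corrections with momentum.

```python
import numpy as np

def sma_step(replicas, central, velocity, grad_fn, batches,
             lr=0.1, alpha=0.1, mu=0.9):
    """One synchronous model averaging step across all learners."""
    corrections = []
    for i, batch in enumerate(batches):
        g = grad_fn(replicas[i], batch)          # independent gradient step
        c = alpha * (replicas[i] - central)      # pull toward the central model
        replicas[i] = replicas[i] - lr * g - c   # learner update with correction
        corrections.append(c)
    # The central model accumulates the corrections; momentum (mu) makes it
    # converge faster than the individual learners.
    velocity = mu * velocity + np.sum(corrections, axis=0)
    central = central + velocity
    return replicas, central, velocity
```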
SMA with Learners (Important Concepts)
- Input/output
- Initialization
- Iterative Learning process
- Update to central average model
SMA with Learners (Algorithm)
- To achieve high hardware utilisation, we can execute multiple learners per GPU.
- Local reference model vs. central average model.
SMA with Learners (Expansion)
CROSSBOW System Design
- Must share GPU efficiently
- Decide on # of learners/GPU at runtime
- Global SMA synchronization needs to be efficient.
- Data pre-processors: prepare the training dataset into batches.
- Task manager: controls the pools of model replicas, input batches and learner streams.
- Task scheduler: assigns learning tasks to GPUs based on the available resources.
- Auto-tuner
SMA with Learners (Crossbow Implementation)
- Learner streams: hold a learning task and a corresponding local synchronization task.
- Synchronization streams: hold a global synchronization task.
- Overlaps the synchronization tasks from one iteration with the learning tasks of the next.
SMA with Learners (Crossbow Implementation)
- Concurrency!
- Use all-reduce (hello Horovod!) for inter-GPU operations: it evenly distributes the computation of the average-model update among the GPUs.
- Schedule new learning tasks on a first-come-first-served basis.
SMA with Learners (Crossbow Implementation)
Choosing the number of learners
- Too few, and a GPU is under-utilised, wasting resources.
- Too many, and the execution of otherwise independent learners is partially sequentialized on a GPU, leading to a slow-down.
=> Tune the number of learners per GPU based on the training throughput at runtime.
Tuning Learners (CROSSBOW Implementation)
- Auto-tuner: measures the training throughput by considering the rate at which learning tasks complete, as recorded by the task manager.
- On a server with homogeneous GPUs, it measures only the throughput of a single GPU to adapt the number of learners for all GPUs.
- Adding a learner to a GPU requires allocating a new model replica and a corresponding learner stream.
- Temporarily places a global execution barrier between two successive iterations.
- Also locks the resource pools, preventing access by the task scheduler or manager during resizing.
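A sketch of the tuning policy as we understand it (function names are ours, not CROSSBOW's): keep adding learners per GPU while the measured throughput improves, reverting the last addition once it stops helping.

```python
def autotune_learners(measure_throughput, add_learner, remove_learner,
                      learners=1, max_learners=8):
    """Grow the number of learners per GPU while training throughput improves."""
    best = measure_throughput()
    while learners < max_learners:
        add_learner()      # barrier between iterations, allocate replica + stream
        learners += 1
        current = measure_throughput()
        if current <= best:
            remove_learner()   # no improvement: revert and stop searching
            learners -= 1
            break
        best = current
    return learners
```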
Tuning Learners (CROSSBOW Implementation)
Memory MGMT (CROSSBOW Implementation)
- CROSSBOW uses double buffering to create a pipeline between the data pre-processors and the task scheduler.
- An offline memory plan reuses the output buffers of operators using reference counters, which reduces the memory footprint of a learner by up to 50%.
- With multiple learners per GPU, an online memory plan enables sharing some of the output buffers among learners on the same GPU, to avoid over-allocating memory.
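A toy Python sketch of the reference-counting idea (illustrative only; the offline/online planning itself is omitted): an operator's output buffer returns to a free list once its last downstream consumer releases it, so later operators can reuse it.

```python
from collections import defaultdict

class BufferPool:
    """Reuse operator output buffers once all consumers have read them."""
    def __init__(self):
        self.free = defaultdict(list)   # size -> buffers available for reuse
        self.refcount = {}              # id(buffer) -> pending consumers

    def allocate(self, size, num_consumers):
        pool = self.free[size]
        buf = pool.pop() if pool else bytearray(size)
        self.refcount[id(buf)] = num_consumers
        return buf

    def release(self, buf):
        # A consumer finished reading; recycle the buffer at count zero.
        self.refcount[id(buf)] -= 1
        if self.refcount[id(buf)] == 0:
            del self.refcount[id(buf)]
            self.free[len(buf)].append(buf)
```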
Scalability Results
Statistical Efficiency VS Hardware Efficiency
Selecting number of learners
SMA vs other
Synchronization efficiency
- Pros:
- It introduces an alternative synchronization strategy (SMA), which allows training with small batch sizes to achieve better hardware efficiency.
- The system design provides efficient concurrent execution of learning and synchronisation tasks on GPUs.
- Cons:
- It lacks automatic differentiation and other more advanced user primitives when compared to TensorFlow.
- It was only tested on a single multi-GPU server; distributing CROSSBOW across a cluster would face more challenges, such as heterogeneous resources.
Imperative Programming:
- NumPy
- Matlab
Declarative Programming:
- Caffe
- TensorFlow (sort of... a mixture of both)
MXNet (mix-net) Programming Interface:
- Symbol: used to declare the compute graph (compositions of symbols range from simple operators to complex ConvLayers).
- Supports auto-diff, in addition to load, save, visualize, etc.
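A small example of declaring a graph with the Symbol API (layer names and sizes here are arbitrary):

```python
import mxnet as mx

# Declarative side: compose Symbols into a compute graph; no data flows yet.
data = mx.sym.Variable('data')
fc1  = mx.sym.FullyConnected(data=data, num_hidden=128, name='fc1')
act1 = mx.sym.Activation(data=fc1, act_type='relu', name='relu1')
fc2  = mx.sym.FullyConnected(data=act1, num_hidden=10, name='fc2')
net  = mx.sym.SoftmaxOutput(data=fc2, name='softmax')

net.save('net.json')            # serialize the graph
print(net.list_arguments())     # e.g. ['data', 'fc1_weight', 'fc1_bias', ...]
```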
MXNet (mix-net) Programming Interface:
- NDArray: imperative computations that work seamlessly with Symbols.
- Fills the gap between declarative symbolic expressions and the host language.
- Complex symbolic expressions are often evaluated efficiently because MXNet also uses lazy execution of NDArrays: the dependency engine can batch and overlap independent operations.
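To make the lazy-execution point concrete, a minimal NDArray example:

```python
import mxnet as mx

# Imperative side: NDArray ops look eager but are queued to MXNet's
# dependency engine, which can run independent operations in parallel.
a = mx.nd.ones((100, 100))
b = mx.nd.ones((100, 100))
c = a * 2 + b             # returns immediately; execution is scheduled lazily
d = mx.nd.dot(a, b)       # independent of c, so it may run concurrently
print(c.asnumpy()[0, 0])  # asnumpy() blocks until c has actually been computed
```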
MXNet (mix-net) Programming Interface:
- KVStore: a distributed key-value store for data synchronization over multiple nodes.
- Weight updating function is registered to the KVStore.
- Each worker repeatedly pulls the newest weight from the store.
- Pushes out the locally computed gradient.
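A minimal single-machine sketch of the KVStore workflow (the 'local' store here; 'dist_sync' distributes it across nodes):

```python
import mxnet as mx

kv = mx.kv.create('local')      # 'dist_sync' would span multiple nodes

# Register the weight-update rule once; the store applies it to pushed gradients.
kv.set_optimizer(mx.optimizer.SGD(learning_rate=0.1))

shape = (2, 3)
kv.init(3, mx.nd.zeros(shape))      # key 3 holds one weight array
kv.push(3, mx.nd.ones(shape))       # push a locally computed gradient
weight = mx.nd.zeros(shape)
kv.pull(3, out=weight)              # pull the newest weight
print(weight.asnumpy())             # moved by -lr * grad = -0.1
```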
MXNet Implementation:
- Graph Computation: the graph representation suggests straightforward optimizations, e.g. at inference time only the forward pass is needed, features can be extracted by simply skipping the last layers, and multiple operators can be grouped into one.
- Memory Allocation: simple idea, reuse the memory of non-intersecting variables. To reduce the complexity of determining such an allocation, a heuristic is proposed.