

SLIDE 1

PipeDream: Generalized Pipeline Parallelism for DNN Training

Deepak Narayanan§, Aaron Harlap†, Amar Phanishayee★, Vivek Seshadri★, Nikhil R. Devanur★, Gregory R. Ganger†, Phillip B. Gibbons†, Matei Zaharia§

★Microsoft Research † Carnegie Mellon University §Stanford University

SLIDE 2

Deep Neural Networks have empowered state-of-the-art results across a range of applications…

Examples: Image Classification (cat, dog); Machine Translation ("வணக்கம், என் பெயர் தீபக்" → "Hello, my name is Deepak"); Speech-to-Text; Game Playing

SLIDE 3

…but first need to be trained!

Input: an image of a tiger with label y = tiger; the forward pass produces activations and a prediction ŷ (here, "lion"). Gradients of loss(y, ŷ) with respect to the weight parameters W flow backward, and W is optimized using standard iterative optimization procedures:

W = W − α · ∇W

SLIDE 4

Background: DNN Training

(Same training loop as on the previous slide: the forward pass produces activations and a prediction ŷ, the backward pass produces gradients of loss(y, ŷ), and the weight parameters W are updated as W = W − α · ∇W.)

Model training is time- and compute-intensive!
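To make the update rule concrete, here is a minimal sketch of one training step in PyTorch; the model, data, and learning rate are placeholders chosen for illustration, not anything from the talk.

import torch
import torch.nn as nn

# Toy stand-ins for a real DNN and dataset (illustrative only).
model = nn.Linear(512, 10)
loss_fn = nn.CrossEntropyLoss()
lr = 0.1                                 # the step size (alpha) in W = W - alpha * grad(W)

x = torch.randn(32, 512)                 # a mini-batch of inputs
y = torch.randint(0, 10, (32,))          # labels (e.g., "tiger")

y_hat = model(x)                         # forward pass: activations and prediction
loss = loss_fn(y_hat, y)                 # loss(y, y_hat)
loss.backward()                          # backward pass: gradients of the loss w.r.t. W

with torch.no_grad():                    # apply W = W - alpha * grad(W) to every parameter
    for p in model.parameters():
        p -= lr * p.grad
        p.grad = None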

SLIDE 5

Parallelizing DNN Training: Data Parallelism

n copies of the same model: Worker 1 computes ∇W_1, …, Worker n computes ∇W_n. Gradient aggregation using AllReduce:

∇W = ∇W_1 + ∇W_2 + ⋯ + ∇W_n

Despite many performance optimizations, the communication overhead is high! (Measurements on 8xV100s with NVLink (AWS), PyTorch + NCCL 2.4.)
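A minimal sketch of the data-parallel gradient aggregation described above, using torch.distributed; process-group initialization is assumed to have happened elsewhere, and this is far simpler than what the NCCL-backed baselines actually do (no overlap of communication and computation, for instance).

import torch.distributed as dist

def aggregate_gradients(model, world_size):
    """Average each worker's local gradient so all replicas apply the same update."""
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)   # sum of grad_W_1 ... grad_W_n
            p.grad /= world_size                            # divide to get the average

Each of the n workers would call this after its local backward pass and before the optimizer step; the volume communicated is proportional to the full model size, which is why the overhead grows with the model.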

SLIDE 6

Parallelizing DNN Training: Model Parallelism

All inputs flow through a single version of the weights that is split over the workers (Worker 1 … Worker n); activations and gradients are sent between workers using peer-to-peer communication.

Low hardware efficiency: with a single input in flight, only one worker is busy at a time.
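A minimal sketch of inter-layer model parallelism in PyTorch, assuming two GPUs are available; the split point and layer sizes are made up for illustration.

import torch
import torch.nn as nn

class TwoWorkerModel(nn.Module):
    """One set of weights split across two devices; activations cross the boundary."""
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(512, 1024), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(1024, 10).to("cuda:1")

    def forward(self, x):
        a = self.part1(x.to("cuda:0"))      # computed on worker/GPU 0
        return self.part2(a.to("cuda:1"))   # activation shipped to worker/GPU 1

# In the backward pass, autograd sends the activation's gradient back to cuda:0.

With a single input in flight, each GPU idles while the other computes, which is exactly the hardware-efficiency problem pipelining addresses.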

SLIDE 7

PipeDream: Pipeline-Parallel Training

We propose pipeline parallelism, a combination of data and model parallelism with pipelining. Pipeline-parallel training is up to 5.3x faster than data parallelism without sacrificing the final accuracy of the model.

SLIDE 8

Pipelining in DNN Training != Traditional Pipelining


  • How should the operators in a DNN model be partitioned into pipeline stages?
  • Each operator has a different computation time
  • Activations and gradients need to be communicated across stages
  • How should forward and backward passes of different inputs be scheduled?
  • Training is bidirectional
  • Forward pass followed by backward pass to compute gradients
  • How should weight and activation versions be managed?
  • Backward pass operators depend on internal state (weights W, activations)
SLIDE 9

Outline


  • Background and Motivation
  • Challenges for effective pipeline-parallel training
  • Partitioning and load balancing operators across workers
  • Scheduling of forward and backward passes of different inputs
  • Managing weights and activation versions for effective learning
  • Evaluation
SLIDE 10

How do we assign operators to pipeline stages?

Stage 1 (compute time T1) → Stage 2 (compute time T2) → Stage 3 (compute time T3)

  • Desideratum #1: T1, T2, T3 as close to each other as possible
  • Compute resources seldom idle → better hardware efficiency
  • Desideratum #2: the inter-stage communication (1→2 comm and 2→3 comm) minimized
  • Less communication → better hardware efficiency
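A minimal sketch of the partitioning problem, not PipeDream's actual optimizer: a greedy heuristic that splits an ordered list of profiled per-layer compute times into contiguous stages of roughly equal total time. The profile numbers are hypothetical.

def partition_layers(compute_times, num_stages):
    """Greedily split an ordered list of per-layer compute times (from a profile)
    into contiguous stages with roughly equal total compute time."""
    target = sum(compute_times) / num_stages
    stages, current, current_time = [], [], 0.0
    for i, t in enumerate(compute_times):
        current.append(i)
        current_time += t
        remaining = num_stages - len(stages) - 1
        layers_left = len(compute_times) - i - 1
        # Close this stage once it reaches the per-stage target, as long as
        # enough layers remain to populate the stages still to be formed.
        if remaining > 0 and current_time >= target and layers_left >= remaining:
            stages.append(current)
            current, current_time = [], 0.0
    stages.append(current)
    return stages

# Hypothetical per-layer compute times (ms) split across 3 workers:
print(partition_layers([2.0, 2.0, 1.0, 1.0, 1.0, 2.0, 3.0], num_stages=3))
# -> [[0, 1], [2, 3, 4, 5], [6]]   (stage times 4.0, 5.0, 3.0)

PipeDream's actual optimizer (described in the paper) solves this with a dynamic program that also accounts for the activations and gradients communicated between stages and for replicating stages, rather than balancing compute alone.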

SLIDE 11

How do we assign operators to pipeline stages?

Replication of stages helps load balance computation and reduce communication between workers.

Example: a stage with compute time 2 sits next to a stage with compute time 1 (throughput 1). Replicating the slower stage across two workers gives it throughput (1 / 2) × 2 = 1, matching the other stage.

  • Better load balancing across stages
  • Data-parallel communication within a replicated stage is small: for some operators, the weight gradients exchanged between replicas are smaller than the intermediate activations that would otherwise cross the stage boundary

SLIDE 12

Example PipeDream configuration


Stages can have different replication factors

Configuration: 2-3-2-1
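A minimal sketch of how a configuration string like 2-3-2-1 could be turned into a stage-to-worker assignment; the helper and its naming are hypothetical, not PipeDream's interface.

def assign_workers(replication_factors):
    """Map each pipeline stage to the group of worker ids that replicate it.
    [2, 3, 2, 1] uses 2 + 3 + 2 + 1 = 8 workers in total."""
    assignment, next_worker = {}, 0
    for stage, factor in enumerate(replication_factors):
        assignment[stage] = list(range(next_worker, next_worker + factor))
        next_worker += factor
    return assignment

print(assign_workers([2, 3, 2, 1]))
# {0: [0, 1], 1: [2, 3, 4], 2: [5, 6], 3: [7]}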

SLIDE 13

PipeDream Profiler and Optimizer

Workflow: the Profiler takes the input DNN and produces a computational graph annotated with a profile; the Optimizer combines this graph with deployment constraints (such as number of accelerators, memory, and interconnect characteristics) and determines a partitioning of operators amongst workers, while also deciding replication factors.

Generalizes along many axes

  • Hardware topologies
  • Model structures
  • Memory capacities of workers

See paper for details of algorithm!
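A minimal sketch of the kind of per-operator profiling described above: time each layer on sample data so the optimizer has compute costs to balance. It is illustrative only (CPU timing of forward passes); PipeDream's profiler additionally records backward-pass times, activation sizes, and parameter sizes. The model here is a placeholder.

import time
import torch
import torch.nn as nn

def profile_layers(layers, sample_input, steps=10):
    """Return the average forward-pass time (seconds) of each layer."""
    times, x = [], sample_input
    for layer in layers:
        start = time.time()
        for _ in range(steps):
            out = layer(x)
        times.append((time.time() - start) / steps)
        x = out                      # feed this layer's output to the next layer
    return times

layers = [nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 10)]
print(profile_layers(layers, torch.randn(32, 512)))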

SLIDE 14

Outline


  • Background and Motivation
  • Challenges for effective pipeline-parallel training
  • Partitioning and load balancing operators across workers
  • Scheduling of forward and backward passes of different inputs
  • Managing weights and activation versions for effective learning
  • Evaluation
SLIDE 15

1F1B Scheduling

Workers alternate between forward and backward passes

  • Workers are always utilized
  • Gradients are used to update the model immediately

To support stage replication, this mechanism needs to be modified slightly (see the paper for details); a minimal schedule sketch of the basic case follows.
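This is a sketch of the 1F1B steady state for a single stage; the queue-based loop and the forward_fn/backward_fn/next_input callables are assumptions made for illustration, not PipeDream's runtime.

from collections import deque

def run_stage_1f1b(num_warmup, forward_fn, backward_fn, next_input):
    """After `num_warmup` start-up forward passes (to fill the pipeline),
    the stage strictly alternates one forward pass and one backward pass."""
    in_flight = deque()                        # inputs awaiting their backward pass

    for _ in range(num_warmup):                # warm-up: keep downstream stages busy
        in_flight.append(forward_fn(next_input()))

    while True:                                # steady state: 1 forward, then 1 backward
        x = next_input()
        if x is None:
            break
        in_flight.append(forward_fn(x))        # forward pass for a new input
        backward_fn(in_flight.popleft())       # backward pass for the oldest input

    while in_flight:                           # drain remaining backward passes
        backward_fn(in_flight.popleft())

Because a backward pass completes for every new forward pass admitted, gradients can be applied immediately and the worker never idles in the steady state.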

SLIDE 16

Outline


  • Background and Motivation
  • Challenges for effective pipeline-parallel training
  • Partitioning and load balancing operators across workers
  • Scheduling of forward and backward passes of different inputs
  • Managing weights and activation versions for effective learning
  • Evaluation
SLIDE 17

Naïve pipelining leads to weight version mismatches

Input i's forward pass runs with weights W^(t), but by the time its backward pass runs the weights have already been updated to W^(t+k): the input sees updates in the backward pass that were not seen in the forward pass, leading to incorrect gradients.

SLIDE 18

1F1B Scheduling + Weight Stashing

Naïve pipelining leads to mismatch in weight versions → store multiple <weight, activation> versions ("weight stashing"): an input's forward pass runs with some weight version W^(t), and weight stashing ensures its backward pass uses that same W^(t).

  • Ensures the same weight versions are used in both the forward and backward pass of an input
  • Worst-case memory footprint is similar to that of data parallelism

Each stage keeps the stashed weight versions (e.g., W^(t), W^(t−1), W^(t−2)) still needed by in-flight inputs.
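A minimal sketch of the bookkeeping behind weight stashing for one stage; the class, its deep copies, and its method names are illustrative assumptions, not PipeDream's implementation (which stashes versions far more cheaply).

import copy

class WeightStash:
    """Remember the weight version each in-flight input saw in its forward pass,
    so the matching backward pass can use exactly the same version."""
    def __init__(self):
        self._stash = {}                       # input id -> snapshot of stage weights

    def save(self, input_id, module):
        # Snapshot the weights at the time of input_id's forward pass.
        self._stash[input_id] = copy.deepcopy(module.state_dict())

    def restore(self, input_id, module):
        # Reload that snapshot just before input_id's backward pass,
        # then discard it to bound the memory footprint.
        module.load_state_dict(self._stash.pop(input_id))

At most one stashed version is kept per in-flight input, which is why the worst-case footprint stays comparable to data parallelism.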

SLIDE 19

Outline


  • Background and Motivation
  • Challenges for effective pipeline-parallel training
  • Evaluation
  • Setup
  • Comparison to Data Parallelism on Time-to-Accuracy
  • Communication Overhead of Pipeline Parallelism
  • Comparison to Model Parallelism and Hybrid Parallelism on Throughput
  • PipeDream’s Memory Footprint
SLIDE 20

Evaluation Setup

  • Integrated PipeDream with PyTorch in ~3000 lines of Python code
  • Integrated with PyTorch’s communication library (a minimal backend-initialization sketch follows this list)
  • NCCL backend for Data Parallelism baselines
  • Gloo backend for PipeDream
  • Experiments run on three different server types
  • Cluster A: 4xV100 GPUs, PCIe intra-server, and 10 Gbps inter-server (Azure)
  • Cluster B: 8xV100 GPUs, NVLink intra-server, and 25 Gbps inter-server (AWS)
  • Cluster C: 1xTitan X, and 40 Gbps inter-server (private)
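A minimal sketch of selecting the communication backend the way the setup above describes (NCCL for the data-parallel baselines, Gloo for PipeDream); the helper name, flag, and rendezvous address are hypothetical.

import torch.distributed as dist

def init_communication(use_pipedream, rank, world_size):
    """Initialize the PyTorch process group with the backend used in the evaluation."""
    backend = "gloo" if use_pipedream else "nccl"
    dist.init_process_group(backend=backend,
                            init_method="tcp://127.0.0.1:29500",   # placeholder rendezvous
                            rank=rank, world_size=world_size)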


SLIDE 21

PipeDream > Data Parallelism (DP) end-to-end: up to 5.28x faster and 2.46x faster in the highlighted results.

SLIDE 22


PipeDream vs. Data Parallelism on Time-to-Accuracy

SLIDE 23


PipeDream vs. Data Parallelism on Time-to-Accuracy

Experiments on 4 different tasks: image classification, translation, language modeling, video captioning

SLIDE 24


PipeDream vs. Data Parallelism on Time-to-Accuracy

With the same number of GPUs, PipeDream up to 5.3x faster than Data Parallelism

SLIDE 25


PipeDream vs. Data Parallelism on Time-to-Accuracy

The optimizer recommends a number of different configurations, such as 15-1, Straight, and a fully data-parallel setup

SLIDE 26

PipeDream reduces communication overhead

For many models, the intermediate activations and gradients communicated between pipeline stages are an order of magnitude smaller than the gradient communication required by Data Parallelism (DP)


SLIDE 27

Conclusion

https://cs.stanford.edu/~deepakn/

  • Model and data parallelism often suffer from high communication overhead and low resource utilization for certain models and deployments
  • PipeDream shows pipelining can be used to accelerate DNN training
  • Pipelining, when combined with data and model parallelism in a principled way, achieves end-to-end speedups of up to 5.3x

Code available at https://github.com/msr-fiddle/pipedream
