HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters - PowerPoint PPT Presentation



SLIDE 1

HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism

Jay H. Park, Gyeongchan Yun, Chang M. Yi, Nguyen T. Nguyen, Seungmin Lee, Jaesik Choi†, Sam H. Noh, and Young-ri Choi

SLIDE 2

Contents

▪ Motivation & Background
▪ HetPipe in a Nutshell
▪ Our System: HetPipe
▪ Evaluation
▪ Conclusion

SLIDE 3

Motivation

▪ DNN (Deep Neural Network) models continue to grow
  • Need more powerful GPUs for training!
SLIDE 4

Motivation

▪ Short release cycle of new GPU architectures
  • Use of heterogeneous GPUs is inevitable!
  • What to do with whimpy GPUs?

[Figure: successive GPU architecture releases; the older generations are the "whimpy" GPUs]

SLIDE 5

DNN Training

▪ Forward pass j: minibatch j (training data) is fed through the model to compute the loss (e.g., is this a cat?)
▪ Backward pass j: the update u_j for minibatch j is computed from the loss
▪ The weight parameter w is then updated: w_{j+1} = w_j - η · u_j (sketched below)
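
To make the update rule concrete, here is a minimal single-GPU sketch in plain Python/NumPy for a toy linear model (illustrative only; the function names and the model are not from HetPipe):

    import numpy as np

    def forward(w, x, y):
        # Forward pass: prediction and mean squared-error loss.
        pred = x @ w
        return np.mean((pred - y) ** 2)

    def backward(w, x, y):
        # Backward pass: gradient of the loss w.r.t. w (the update u_j).
        pred = x @ w
        return 2 * x.T @ (pred - y) / len(y)

    def train(w, minibatches, lr=0.01):
        for x_j, y_j in minibatches:       # minibatch j (training data)
            loss = forward(w, x_j, y_j)    # forward pass j: compute the loss
            u_j = backward(w, x_j, y_j)    # backward pass j: compute the update u_j
            w = w - lr * u_j               # w_{j+1} = w_j - eta * u_j
            print("loss:", loss)
        return w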

SLIDE 6

Parallelizing DNN Training

▪ Model parallelism (MP)
  • Low GPU utilization
▪ Data parallelism (DP)
  • Weights synchronized through PS or AllReduce (a toy parameter-server sketch follows)
  • GPU memory limitation

[Figure: Worker 1 ... Worker n each run forward and backward passes and synchronize weights with a Parameter Server (PS)]
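
To illustrate the DP synchronization pattern, a toy in-process parameter-server sketch (class and method names are illustrative, not HetPipe's API):

    import numpy as np

    class ParameterServer:
        # Toy in-process parameter server holding the global weights.
        def __init__(self, w):
            self.w = np.array(w, dtype=float)

        def push(self, update):
            # A worker pushes its update; the PS applies it to the global weights.
            self.w += update

        def pull(self):
            # A worker pulls the latest global weights.
            return self.w.copy()

    def data_parallel_step(ps, workers, lr=0.01):
        # Each DP worker computes an update on its own minibatch, then synchronizes.
        for grad_fn, minibatch in workers:
            w = ps.pull()
            u = -lr * grad_fn(w, minibatch)
            ps.push(u)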

SLIDE 7

Parallelizing DNN Training

▪ Attempts to improve MP utilization
  • Pipelined model parallelism (PMP): PipeDream [SOSP'19], GPipe [NIPS'19]
  • Designed for homogeneous GPUs
  • Designed for a single PMP worker

[Figure: a PMP worker overlapping forward and backward passes of different minibatches across its GPUs]
SLIDE 8

HetPipe in a Nutshell

▪ Integrates PMP + DP
▪ WSP (Wave Synchronous Parallel) parameter synchronization
▪ Supports heterogeneous GPUs
▪ VW (Virtual Worker): a group of multiple GPUs

[Figure: Virtual Worker (VW) 1 ... VW n, each a group of GPUs of mixed types (V, R, G, Q) running PMP internally, with DP across VWs through a Parameter Server]

SLIDE 9

Challenges in Integrating PMP + DP on Heterogeneous GPUs

  • What weight version should be used by each VW to synchronize with other VWs?
  • How do we reduce virtual worker stragglers when we consider DP?
  • Many more in the paper

[Figure: VWs synchronizing through the Parameter Server]

SLIDE 10

HetPipe Contributions

▪ Integrates PMP + DP
▪ Novel parameter synchronization model: WSP (Wave Synchronous Parallel)
▪ Enables large DNN training on heterogeneous GPUs
  • Aggregates heterogeneous resources
  • Reduces the straggler problem
▪ Proof of WSP convergence

SLIDE 11

HetPipe Workflow

▪ Resource Allocator: given the cluster configuration, assigns k GPUs to each virtual worker
▪ Model Partitioner: divides the DNN model into k partitions (P1 ... P4), one per GPU of a VW (a rough sketch follows below)

[Figure: the partitions P1-P4 are mapped onto the k GPUs of VW 1 ... VW n (GPU types V, Q, R, G) and executed as a pipeline over time; the VWs synchronize with the PS]
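
A rough sketch of this workflow with hypothetical helpers (HetPipe's actual partitioner balances partitions using profiled per-layer compute time and memory usage; the naive split below is only for illustration):

    def allocate_virtual_workers(gpus, k):
        # Resource allocator: group the cluster's GPUs into virtual workers of k GPUs each.
        return [gpus[i:i + k] for i in range(0, len(gpus), k)]

    def partition_model(layers, k):
        # Model partitioner (naive): split the layer list into k contiguous partitions.
        per_stage = (len(layers) + k - 1) // k
        return [layers[i:i + per_stage] for i in range(0, len(layers), per_stage)]

    gpus = ["V0", "V1", "R0", "R1", "G0", "G1", "Q0", "Q1"]   # 8 GPUs, mixed types
    layers = ["layer%d" % i for i in range(16)]
    for vw in allocate_virtual_workers(gpus, k=4):            # 2 virtual workers
        for gpu, part in zip(vw, partition_model(layers, k=4)):
            print(gpu, "runs", part[0], "..", part[-1])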

SLIDE 12

HetPipe Workflow

▪ Pipelined execution of the partitions within each VW gives rise to local staleness; synchronization across VWs through the PS gives rise to global staleness

[Figure: same workflow as Slide 11, annotated with where local staleness (inside a VW's pipeline) and global staleness (across VWs via the PS) arise]

SLIDE 13

Outline

▪ Motivation & Background
▪ HetPipe in a Nutshell
▪ Our System: HetPipe
  • Pipelined Model Parallelism Within a VW
  • Data Parallelism with Multiple VWs
▪ Evaluation
▪ Conclusion

SLIDE 14

Pipelined Model Parallelism Within a VW

▪ Execution of a virtual worker
  • w_local is a consistent version of the weights within a VW
  • Nm minibatches are processed concurrently in a pipelined manner (a toy schedule printer follows)

[Figure: pipeline schedule of forward and backward passes of minibatches 1, 2, 3, ... across GPU1-GPU4 over time]
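
For intuition about how Nm minibatches overlap, a toy printer of the forward-only pipeline fill (illustrative; HetPipe's real schedule also interleaves backward passes, as the figure shows):

    def pipeline_forward_fill(num_stages, num_minibatches):
        # At step t, stage s runs the forward pass of minibatch t - s (if any).
        for t in range(num_stages + num_minibatches - 1):
            cells = []
            for s in range(num_stages):
                mb = t - s
                cells.append("mb%d" % (mb + 1) if 0 <= mb < num_minibatches else "idle")
            print("step %d: " % (t + 1) + "  ".join(
                "GPU%d=%s" % (s + 1, c) for s, c in enumerate(cells)))

    pipeline_forward_fill(num_stages=4, num_minibatches=4)   # Nm = 4 minibatches in flight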

SLIDE 15

Pipelined Model Parallelism Within a VW

▪ Weight management procedure (see the sketch below)
  • Initially, w_local is the initial weight version w_0, so the first four minibatches all start with it: w_local = w_0 = w_1 = w_2 = w_3 = w_4
  • When the backward pass of minibatch 1 completes, its update u_1 is applied: w_local ← w_local + u_1
  • The next minibatch then starts with the updated weights: w_5 ← w_local

[Figure: the pipeline schedule annotated with the weight versions w_1-w_4 and the update of w_local by u_1]
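
A minimal sketch of this per-VW bookkeeping, assuming updates are simple additive values (the class and method names are illustrative):

    class LocalWeights:
        # Toy bookkeeping of w_local inside one virtual worker.
        def __init__(self, w0):
            self.w_local = w0       # consistent local weight version (starts at w_0)
            self.version_used = {}  # minibatch id -> weight version it started with

        def start_minibatch(self, p):
            # Minibatch p starts with the current local version: w_p <- w_local.
            self.version_used[p] = self.w_local
            return self.version_used[p]

        def finish_minibatch(self, p, u_p):
            # Backward pass of minibatch p completed: w_local <- w_local + u_p.
            self.w_local = self.w_local + u_p

    vw = LocalWeights(w0=0.0)
    for p in (1, 2, 3, 4):
        vw.start_minibatch(p)        # w_1 .. w_4 all equal w_0
    vw.finish_minibatch(1, u_p=0.1)  # apply u_1
    vw.start_minibatch(5)            # w_5 <- w_local = w_0 + u_1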

SLIDE 16

Pipelined Model Parallelism Within a VW

▪ Local staleness (s_local): the maximum number of missing updates (see the snippet below)
  • After u_1 is applied (w_local ← w_local + u_1), minibatch 5 starts with w_5 ← w_local
  • w_5 is missing the updates of minibatches 2 to 4, which are still in the pipeline
  • Here s_local = 3

[Figure: pipeline schedule highlighting that minibatches 2-4 have not yet updated w_local when minibatch 5 starts]
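
In the same toy setting, the missing updates for a starting minibatch can be enumerated directly; with Nm = 4 minibatches in flight, at most s_local = Nm - 1 = 3 updates can be missing (names below are illustrative):

    def missing_updates(p, applied, nm):
        # Updates of earlier minibatches not yet reflected when minibatch p starts.
        missing = [q for q in range(1, p) if q not in applied]
        assert len(missing) <= nm - 1   # bounded by local staleness s_local = Nm - 1
        return missing

    # Minibatch 5 starts right after u_1 has been applied (Nm = 4):
    print(missing_updates(5, applied={1}, nm=4))      # [2, 3, 4] -> 3 missing = s_local
    # Minibatch 6 starts after u_1 and u_2 have been applied:
    print(missing_updates(6, applied={1, 2}, nm=4))   # [3, 4, 5]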

SLIDE 17

Pipelined Model Parallelism Within a VW

▪ Local staleness (s_local): the maximum number of missing updates
  • When minibatch 2 completes, w_local ← w_local + u_2, and minibatch 6 starts with w_6 ← w_local
  • w_6 is missing the updates of minibatches 3 to 5 (minibatch 5 had started with w_0 + u_1)
  • s_local is still 3

[Figure: pipeline schedule at the point where minibatch 6 starts]

SLIDE 18

Outline

▪ Motivation & Background
▪ HetPipe in a Nutshell
▪ Our System: HetPipe
  • Pipelined Model Parallelism Within a VW
  • Data Parallelism with Multiple VWs
▪ Evaluation
▪ Conclusion

SLIDE 19

Data Parallelism with Multiple VWs

▪ Wave: a sequence of Nm concurrently executing minibatches
▪ VW 1 ... VW n push updates to and pull weights from the parameter server, which holds w_global; minibatch execution progresses in clocks

[Figure: per-VW timelines; wave 0 is minibatches 1-4 and wave 1 is minibatches 5-8, executed over clocks 1, 2, ...; push & pull with the Parameter Server (w_global)]

SLIDE 20

Data Parallelism with Multiple VWs

▪ Push occurs every clock
  • At the end of the clock, VW 1 pushes the aggregated updates of wave 0: ũ = u_1 + u_2 + u_3 + u_4
  • The parameter server applies them: w_global ← w_global + ũ
  • Minibatch 8 is blocked until this synchronization completes

[Figure: VW 1's timeline; wave 0 (minibatches 1-4) has completed, minibatches 5-7 are in flight, minibatch 8 is blocked; push & pull with the Parameter Server (w_global)]

SLIDE 21

Data Parallelism with Multiple VWs

▪ Pull occurs intermittently, depending on the user-defined clock distance threshold D
  • If D = 0, a pull occurs every clock
  • VW 1 waits to pull until VW 2 has pushed

[Figure: VW 1 has pushed wave 0 and waits at minibatch 8; VW 2 is still executing minibatches 1-4; Parameter Server holds w_global]

SLIDE 22

Data Parallelism with Multiple VWs

▪ Pull occurs intermittently, depending on the user-defined clock distance threshold D
  • When VW 2 pushes its aggregated updates ũ, the parameter server applies them: w_global ← w_global + ũ
  • VW 1 has been waiting for this push before it pulls (D = 0)

[Figure: VW 2's push of wave 0 arrives at the Parameter Server while VW 1 waits; minibatch 8 of VW 1 is still blocked]

SLIDE 23

Data Parallelism with Multiple VWs

▪ Pull occurs intermittently, depending on the user-defined clock distance threshold D
  • The pull occurs after all VWs have pushed
  • With D = 0, VW 1 then pulls the global weights: w_local ← w_global

[Figure: after both VWs' pushes, VW 1 pulls w_global from the Parameter Server]

SLIDE 24

Data Parallelism with Multiple VWs

▪ Pull occurs intermittently, depending on the user-defined clock distance threshold D
  • With D = 0, minibatch 8 then starts with w_8 = w_0 + (u_1 + u_2 + u_3 + u_4) from both VW 1 and VW 2 (a toy WSP sketch follows)

[Figure: VW 1 resumes with minibatch 8 after the pull; VW 2 continues with minibatches 5-7; push & pull with the Parameter Server (w_global)]
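
The push-every-clock and pull-within-D behavior, sketched as a toy single-process model (hypothetical names; in the real system the PS and the VWs are separate processes):

    class WSPServer:
        # Toy Wave Synchronous Parallel bookkeeping.
        def __init__(self, w0, num_vws, D):
            self.w_global = w0
            self.clocks = [0] * num_vws   # clock of each virtual worker
            self.D = D                    # clock distance threshold

        def push(self, vw, u_wave):
            # Every clock, a VW pushes the aggregated update of its wave.
            self.w_global = self.w_global + u_wave
            self.clocks[vw] += 1

        def can_start_next_wave(self, vw):
            # A VW may run at most D clocks ahead of the slowest VW;
            # otherwise it must wait (and pull the updated w_global).
            return self.clocks[vw] - min(self.clocks) <= self.D

        def pull(self):
            return self.w_global

    ps = WSPServer(w0=0.0, num_vws=2, D=0)
    ps.push(0, u_wave=0.5)                       # VW 1 pushes wave 0
    print(ps.can_start_next_wave(0))             # False: minibatch 8 is blocked (D = 0)
    ps.push(1, u_wave=0.3)                       # VW 2 pushes wave 0
    print(ps.can_start_next_wave(0), ps.pull())  # True; w_global now reflects both waves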

SLIDE 25

Data Parallelism with Multiple VWs

▪ Local staleness (s_local) and global staleness (s_global) with WSP
  • After the first clock (D = 0): w_global = w_0 + (u_1 + u_2 + u_3 + u_4) from VW 1 and VW 2
  • In VW 1, minibatch 11 starts with w_11 = w_global + (u_5 + u_6 + u_7) from VW 1
  • w_11 misses VW 1's in-flight updates (u_8 + u_9 + u_10): local staleness s_local
  • w_11 also misses VW 2's updates (u_5 + u_6 + u_7): part of the global staleness s_global

[Figure: timelines of VW 1 and VW 2 over clocks 1-2, annotated with which updates are reflected in w_11]

SLIDE 26

Data Parallelism with Multiple VWs

▪ Local staleness (s_local) and global staleness (s_global) with WSP
  • With D = 0, minibatch 12 of VW 1 has to wait until the next clock's synchronization with the parameter server completes

[Figure: VW 1 has reached minibatch 12 while VW 2 is still executing wave 1; Parameter Server holds w_global]

SLIDE 27

Data Parallelism with Multiple VWs

▪ Example of the clock distance threshold D
  • If D = 1, VW 1 can start minibatch 8 without a pull, since it may run up to one clock ahead of the slowest VW

[Figure: VW 1 proceeds to minibatch 8 while VW 2 is still executing minibatches 1-4; push & pull with the Parameter Server (w_global)]

SLIDE 28

Data Parallelism with Multiple VWs

▪ Example of the clock distance threshold D
  • With D = 1, w_global is still w_0 (no pull has occurred), and minibatch 11 of VW 1 starts with w_11 = w_0 + (u_1 + ... + u_7) from VW 1
  • w_11 misses VW 1's in-flight updates (u_8 + u_9 + u_10): local staleness s_local
  • w_11 also misses all of VW 2's updates (u_1 + ... + u_7): part of the global staleness s_global
  • Minibatch 12 has to wait

[Figure: timelines of VW 1 (minibatches 1-12) and VW 2 (minibatches 1-7) with D = 1, annotated with the updates missing from w_11]

SLIDE 29

Outline

▪ Motivation & Background
▪ HetPipe in a Nutshell
▪ Our System: HetPipe
▪ Evaluation
  • Setup
  • Resource Allocation for Virtual Workers
  • Results
▪ Conclusion

SLIDE 30

Evaluation Setup

▪ Cluster setup: 4 heterogeneous GPU nodes connected by InfiniBand (56 Gbps)
  • Each node has four GPUs of one type: TITAN V (V0-V3), TITAN RTX (R0-R3), GeForce RTX 2060 (G0-G3), or Quadro P4000 (Q0-Q3)
  • Computation power: V > R > G > Q; the GPU types also differ in memory size
▪ Two DNN models (both trained on ImageNet with minibatch size 32):
  • ResNet-152: model parameter size 230 MB; characteristic: large activation output
  • VGG-19: model parameter size 548 MB; characteristic: large parameter size

SLIDE 31

Resource Allocation for Virtual Workers: NP, ED, HD

▪ NP (Node Partition): each virtual worker consists of the four GPUs of a single node (VW 1 = Node 1, ..., VW 4 = Node 4)
  • Minimum communication overhead within a VW
  • Performance of each virtual worker varies
  • Straggler may degrade performance with DP
SLIDE 32

Resource Allocation for Virtual Workers: NP, ED, HD

▪ ED (Equal Distribution): each virtual worker gets one GPU from every node, so all VWs have the same GPU mix
  • Performance will be the same across the VWs
  • Mitigates the straggler problem
  • High communication overhead within each VW
SLIDE 33

Resource Allocation for Virtual Workers: NP, ED, HD

▪ HD (Hybrid Distribution): a hybrid of NP and ED that considers both the computation power (V > R > G > Q) and the memory size of the GPU types (a sketch of the three policies follows)
  • Mitigates the straggler problem
  • Reduces communication overhead within each VW
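
A rough sketch of what the three allocation policies could look like over the 16-GPU cluster (illustrative only; the HD grouping below is one possible hybrid, not necessarily the one used in the paper's evaluation):

    nodes = {
        "Node1": ["V0", "V1", "V2", "V3"],   # TITAN V
        "Node2": ["R0", "R1", "R2", "R3"],   # TITAN RTX
        "Node3": ["G0", "G1", "G2", "G3"],   # GeForce RTX 2060
        "Node4": ["Q0", "Q1", "Q2", "Q3"],   # Quadro P4000
    }

    def np_policy(nodes):
        # NP: each virtual worker is one whole node.
        return [list(gpus) for gpus in nodes.values()]

    def ed_policy(nodes):
        # ED: each virtual worker takes one GPU from every node.
        return [list(col) for col in zip(*nodes.values())]

    def hd_policy(nodes):
        # HD (one possible hybrid): two GPUs from each of two nodes per virtual worker.
        n = list(nodes.values())
        return [n[0][:2] + n[1][:2], n[0][2:] + n[1][2:],
                n[2][:2] + n[3][:2], n[2][2:] + n[3][2:]]

    for policy in (np_policy, ed_policy, hd_policy):
        print(policy.__name__, policy(nodes))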

SLIDE 34

Parameter Placement

▪ Round-robin policy (default)
  • Parameters of each layer are placed on the parameter server partitions of the nodes in round-robin fashion (a minimal sketch follows)
  • Can be used with all three allocation policies: NP, ED, and HD

[Figure: example with ED; the parameters of each layer are spread across Node 1 - Node 4]
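
A minimal sketch of round-robin layer-to-node parameter placement (hypothetical helper, not HetPipe's code):

    def round_robin_placement(layers, nodes):
        # Assign the parameters of each layer to a parameter-server partition,
        # cycling over the nodes in round-robin order.
        return {layer: nodes[i % len(nodes)] for i, layer in enumerate(layers)}

    layers = ["layer%d" % i for i in range(8)]
    nodes = ["Node1", "Node2", "Node3", "Node4"]
    print(round_robin_placement(layers, nodes))
    # layer0 -> Node1, layer1 -> Node2, ..., layer4 -> Node1, ...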

SLIDE 35

Parameter Placement

▪ Local placement policy
  • ED-local: with ED allocation, the parameters of each layer are kept on the node where that layer's partition is executed, so parameter communication occurs within the node
  • Significantly reduces communication overhead compared to ED with round-robin placement

[Figure: example with ED-local; each node holds the parameters used by the partition running on it]

SLIDE 36

Compare Throughput with Horovod

▪ Baseline: Horovod, state-of-the-art DP using AllReduce
▪ HetPipe improves throughput by up to 1.4x for ResNet-152 and 1.8x for VGG-19
  • ED: reduces the straggler problem
  • ED-local: significantly reduces communication overhead
  • For ResNet-152, the whole model is too large to be loaded into a single G type GPU (batch size = 32)

[Figure: throughput of the HetPipe configurations vs. Horovod for ResNet-152 and VGG-19]

SLIDE 37

Performance Improvement of Adding Whimpy GPUs

▪ Starting from the four V GPUs, whimpy GPUs are added incrementally (+ R GPUs, + Q GPUs, + G GPUs)
  • With the additional GPUs, HetPipe achieves up to a 2.3x speedup
  • Additional whimpy systems allow for faster training

[Figure: throughput as whimpy GPU nodes are added to the V node]

SLIDE 38

Convergence Results

▪ ResNet-152 (target accuracy: 74%)
  • HetPipe reduces the straggler problem in the heterogeneous environment: up to 39% faster convergence
  • Adding four more whimpy G GPUs (12 GPUs to 16 GPUs) improves performance even more: 7% faster

[Figure: accuracy over training time with 12 GPUs and 16 GPUs]

SLIDE 39

Convergence Results

▪ VGG-19 (target accuracy: 67%)
  • HetPipe (D = 0) is 29% faster than Horovod
  • Up to 49% faster than Horovod
  • With D = 32, convergence is 4.7% slower: higher global staleness (i.e., 32) can degrade convergence performance

[Figure: accuracy over training time for Horovod and HetPipe with D = 0, 4, and 32]

SLIDE 40

Not Presented But Discussed in Paper

▪ Convergence proof of WSP
▪ Partitioning algorithm
▪ Performance of a single virtual worker
▪ Comparison to PipeDream

SLIDE 41

Conclusion

▪ HetPipe makes it possible to efficiently train large DNN models with heterogeneous GPUs
▪ Integrates pipelined model parallelism with data parallelism
▪ Proposes a novel parameter synchronization model: WSP
▪ DNN models converge up to 49% faster with HetPipe

SLIDE 42

Thank you!