HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters - PowerPoint PPT Presentation



SLIDE 1

HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism

Jay H. Park, Gyeongchan Yun, Chang M. Yi, Nguyen T. Nguyen, Seungmin Lee, Jaesik Choi†, Sam H. Noh, and Young-ri Choi

SLIDE 2

Contents

▪ Motivation & Background
▪ HetPipe in a Nutshell
▪ Our System: HetPipe
▪ Evaluation
▪ Conclusion

SLIDE 3

Motivation

▪ DNN (Deep Neural Network) models continue to grow
  • Need more powerful GPUs for training!
SLIDE 4

Motivation

▪ Short release cycle of new GPU architectures
  • Use of heterogeneous GPUs is inevitable!
  • What to do with whimpy GPUs?

[Figure: successive GPU architecture releases; the older generations are the "whimpy" GPUs]

SLIDE 5

DNN Training

▪ Forward pass j: minibatch j (training data) is fed through the model to compute the loss (e.g., is this a cat?)
▪ Backward pass j: the update u_j for minibatch j is computed from the loss
▪ The weight parameter w is then updated: w_{j+1} = w_j - η · u_j (sketched below)
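
To make the update rule concrete, here is a minimal single-GPU sketch in plain Python/NumPy for a toy linear model (illustrative only; the function names and the model are not from HetPipe):

    import numpy as np

    def forward(w, x, y):
        # Forward pass: prediction and mean squared-error loss.
        pred = x @ w
        return np.mean((pred - y) ** 2)

    def backward(w, x, y):
        # Backward pass: gradient of the loss w.r.t. w (the update u_j).
        pred = x @ w
        return 2 * x.T @ (pred - y) / len(y)

    def train(w, minibatches, lr=0.01):
        for x_j, y_j in minibatches:       # minibatch j (training data)
            loss = forward(w, x_j, y_j)    # forward pass j: compute the loss
            u_j = backward(w, x_j, y_j)    # backward pass j: compute the update u_j
            w = w - lr * u_j               # w_{j+1} = w_j - eta * u_j
            print("loss:", loss)
        return w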

SLIDE 6

Parallelizing DNN Training

▪ Model parallelism (MP)
  • Low GPU utilization
▪ Data parallelism (DP)
  • Weights synchronized through PS or AllReduce (a toy parameter-server sketch follows)
  • GPU memory limitation

[Figure: Worker 1 ... Worker n each run forward and backward passes and synchronize weights with a Parameter Server (PS)]
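
To illustrate the DP synchronization pattern, a toy in-process parameter-server sketch (class and method names are illustrative, not HetPipe's API):

    import numpy as np

    class ParameterServer:
        # Toy in-process parameter server holding the global weights.
        def __init__(self, w):
            self.w = np.array(w, dtype=float)

        def push(self, update):
            # A worker pushes its update; the PS applies it to the global weights.
            self.w += update

        def pull(self):
            # A worker pulls the latest global weights.
            return self.w.copy()

    def data_parallel_step(ps, workers, lr=0.01):
        # Each DP worker computes an update on its own minibatch, then synchronizes.
        for grad_fn, minibatch in workers:
            w = ps.pull()
            u = -lr * grad_fn(w, minibatch)
            ps.push(u)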

SLIDE 7

Parallelizing DNN Training

▪ Attempts to improve MP utilization
  • Pipelined model parallelism (PMP): PipeDream [SOSP'19], GPipe [NIPS'19]
  • Designed for homogeneous GPUs
  • Designed for a single PMP worker

[Figure: a PMP worker overlapping forward and backward passes of different minibatches across its GPUs]
SLIDE 8

HetPipe in a Nutshell

▪ Integrates PMP + DP
▪ WSP (Wave Synchronous Parallel) parameter synchronization
▪ Supports heterogeneous GPUs
▪ VW (Virtual Worker): a group of multiple GPUs

[Figure: Virtual Worker (VW) 1 ... VW n, each a group of GPUs of mixed types (V, R, G, Q) running PMP internally, with DP across VWs through a Parameter Server]

SLIDE 9

Challenges in Integrating PMP + DP on Heterogeneous GPUs

  • What weight version should be used by each VW to synchronize with other VWs?
  • How do we reduce virtual worker stragglers when we consider DP?
  • Many more in the paper

[Figure: VWs synchronizing through the Parameter Server]

SLIDE 10

HetPipe Contributions

▪ Integrates PMP + DP
▪ Novel parameter synchronization model: WSP (Wave Synchronous Parallel)
▪ Enables large DNN training on heterogeneous GPUs
  • Aggregates heterogeneous resources
  • Reduces the straggler problem
▪ Proof of WSP convergence

SLIDE 11

HetPipe Workflow

▪ Resource Allocator: given the cluster configuration, assigns k GPUs to each virtual worker
▪ Model Partitioner: divides the DNN model into k partitions (P1 ... P4), one per GPU of a VW (a rough sketch follows below)

[Figure: the partitions P1-P4 are mapped onto the k GPUs of VW 1 ... VW n (GPU types V, Q, R, G) and executed as a pipeline over time; the VWs synchronize with the PS]
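
A rough sketch of this workflow with hypothetical helpers (HetPipe's actual partitioner balances partitions using profiled per-layer compute time and memory usage; the naive split below is only for illustration):

    def allocate_virtual_workers(gpus, k):
        # Resource allocator: group the cluster's GPUs into virtual workers of k GPUs each.
        return [gpus[i:i + k] for i in range(0, len(gpus), k)]

    def partition_model(layers, k):
        # Model partitioner (naive): split the layer list into k contiguous partitions.
        per_stage = (len(layers) + k - 1) // k
        return [layers[i:i + per_stage] for i in range(0, len(layers), per_stage)]

    gpus = ["V0", "V1", "R0", "R1", "G0", "G1", "Q0", "Q1"]   # 8 GPUs, mixed types
    layers = ["layer%d" % i for i in range(16)]
    for vw in allocate_virtual_workers(gpus, k=4):            # 2 virtual workers
        for gpu, part in zip(vw, partition_model(layers, k=4)):
            print(gpu, "runs", part[0], "..", part[-1])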

SLIDE 12

HetPipe Workflow

▪ Pipelined execution of the partitions within each VW gives rise to local staleness; synchronization across VWs through the PS gives rise to global staleness

[Figure: same workflow as Slide 11, annotated with where local staleness (inside a VW's pipeline) and global staleness (across VWs via the PS) arise]

SLIDE 13

Outline

▪ Motivation & Background
▪ HetPipe in a Nutshell
▪ Our System: HetPipe
  • Pipelined Model Parallelism Within a VW
  • Data Parallelism with Multiple VWs
▪ Evaluation
▪ Conclusion

SLIDE 14

Pipelined Model Parallelism Within a VW

▪ Execution of a virtual worker
  • w_local is a consistent version of the weights within a VW
  • Nm minibatches are processed concurrently in a pipelined manner (a toy schedule printer follows)

[Figure: pipeline schedule of forward and backward passes of minibatches 1, 2, 3, ... across GPU1-GPU4 over time]
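
For intuition about how Nm minibatches overlap, a toy printer of the forward-only pipeline fill (illustrative; HetPipe's real schedule also interleaves backward passes, as the figure shows):

    def pipeline_forward_fill(num_stages, num_minibatches):
        # At step t, stage s runs the forward pass of minibatch t - s (if any).
        for t in range(num_stages + num_minibatches - 1):
            cells = []
            for s in range(num_stages):
                mb = t - s
                cells.append("mb%d" % (mb + 1) if 0 <= mb < num_minibatches else "idle")
            print("step %d: " % (t + 1) + "  ".join(
                "GPU%d=%s" % (s + 1, c) for s, c in enumerate(cells)))

    pipeline_forward_fill(num_stages=4, num_minibatches=4)   # Nm = 4 minibatches in flight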

SLIDE 15

Pipelined Model Parallelism Within a VW

▪ Weight management procedure (see the sketch below)
  • Initially, w_local is the initial weight version w_0, so the first four minibatches all start with it: w_local = w_0 = w_1 = w_2 = w_3 = w_4
  • When the backward pass of minibatch 1 completes, its update u_1 is applied: w_local ← w_local + u_1
  • The next minibatch then starts with the updated weights: w_5 ← w_local

[Figure: the pipeline schedule annotated with the weight versions w_1-w_4 and the update of w_local by u_1]
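
A minimal sketch of this per-VW bookkeeping, assuming updates are simple additive values (the class and method names are illustrative):

    class LocalWeights:
        # Toy bookkeeping of w_local inside one virtual worker.
        def __init__(self, w0):
            self.w_local = w0       # consistent local weight version (starts at w_0)
            self.version_used = {}  # minibatch id -> weight version it started with

        def start_minibatch(self, p):
            # Minibatch p starts with the current local version: w_p <- w_local.
            self.version_used[p] = self.w_local
            return self.version_used[p]

        def finish_minibatch(self, p, u_p):
            # Backward pass of minibatch p completed: w_local <- w_local + u_p.
            self.w_local = self.w_local + u_p

    vw = LocalWeights(w0=0.0)
    for p in (1, 2, 3, 4):
        vw.start_minibatch(p)        # w_1 .. w_4 all equal w_0
    vw.finish_minibatch(1, u_p=0.1)  # apply u_1
    vw.start_minibatch(5)            # w_5 <- w_local = w_0 + u_1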

SLIDE 16

Pipelined Model Parallelism Within a VW

▪ Local staleness (s_local): the maximum number of missing updates (see the snippet below)
  • After u_1 is applied (w_local ← w_local + u_1), minibatch 5 starts with w_5 ← w_local
  • w_5 is missing the updates of minibatches 2 to 4, which are still in the pipeline
  • Here s_local = 3

[Figure: pipeline schedule highlighting that minibatches 2-4 have not yet updated w_local when minibatch 5 starts]
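
In the same toy setting, the missing updates for a starting minibatch can be enumerated directly; with Nm = 4 minibatches in flight, at most s_local = Nm - 1 = 3 updates can be missing (names below are illustrative):

    def missing_updates(p, applied, nm):
        # Updates of earlier minibatches not yet reflected when minibatch p starts.
        missing = [q for q in range(1, p) if q not in applied]
        assert len(missing) <= nm - 1   # bounded by local staleness s_local = Nm - 1
        return missing

    # Minibatch 5 starts right after u_1 has been applied (Nm = 4):
    print(missing_updates(5, applied={1}, nm=4))      # [2, 3, 4] -> 3 missing = s_local
    # Minibatch 6 starts after u_1 and u_2 have been applied:
    print(missing_updates(6, applied={1, 2}, nm=4))   # [3, 4, 5]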

SLIDE 17

Pipelined Model Parallelism Within a VW

▪ Local staleness (s_local): the maximum number of missing updates
  • When minibatch 2 completes, w_local ← w_local + u_2, and minibatch 6 starts with w_6 ← w_local
  • w_6 is missing the updates of minibatches 3 to 5 (minibatch 5 had started with w_0 + u_1)
  • s_local is still 3

[Figure: pipeline schedule at the point where minibatch 6 starts]

SLIDE 18

Outline

▪ Motivation & Background
▪ HetPipe in a Nutshell
▪ Our System: HetPipe
  • Pipelined Model Parallelism Within a VW
  • Data Parallelism with Multiple VWs
▪ Evaluation
▪ Conclusion

SLIDE 19

Data Parallelism with Multiple VWs

▪ Wave: a sequence of Nm concurrently executing minibatches
▪ VW 1 ... VW n push updates to and pull weights from the parameter server, which holds w_global; minibatch execution progresses in clocks

[Figure: per-VW timelines; wave 0 is minibatches 1-4 and wave 1 is minibatches 5-8, executed over clocks 1, 2, ...; push & pull with the Parameter Server (w_global)]

SLIDE 20

Data Parallelism with Multiple VWs

▪ Push occurs every clock
  • At the end of the clock, VW 1 pushes the aggregated updates of wave 0: ũ = u_1 + u_2 + u_3 + u_4
  • The parameter server applies them: w_global ← w_global + ũ
  • Minibatch 8 is blocked until this synchronization completes

[Figure: VW 1's timeline; wave 0 (minibatches 1-4) has completed, minibatches 5-7 are in flight, minibatch 8 is blocked; push & pull with the Parameter Server (w_global)]

SLIDE 21

Data Parallelism with Multiple VWs

▪ Pull occurs intermittently, depending on the user-defined clock distance threshold D
  • If D = 0, a pull occurs every clock
  • VW 1 waits to pull until VW 2 has pushed

[Figure: VW 1 has pushed wave 0 and waits at minibatch 8; VW 2 is still executing minibatches 1-4; Parameter Server holds w_global]

SLIDE 22

Data Parallelism with Multiple VWs

▪ Pull occurs intermittently, depending on the user-defined clock distance threshold D
  • When VW 2 pushes its aggregated updates ũ, the parameter server applies them: w_global ← w_global + ũ
  • VW 1 has been waiting for this push before it pulls (D = 0)

[Figure: VW 2's push of wave 0 arrives at the Parameter Server while VW 1 waits; minibatch 8 of VW 1 is still blocked]

SLIDE 23

Data Parallelism with Multiple VWs

▪ Pull occurs intermittently, depending on the user-defined clock distance threshold D
  • The pull occurs after all VWs have pushed
  • With D = 0, VW 1 then pulls the global weights: w_local ← w_global

[Figure: after both VWs' pushes, VW 1 pulls w_global from the Parameter Server]

SLIDE 24

Data Parallelism with Multiple VWs

▪ Pull occurs intermittently, depending on the user-defined clock distance threshold D
  • With D = 0, minibatch 8 then starts with w_8 = w_0 + (u_1 + u_2 + u_3 + u_4) from both VW 1 and VW 2 (a toy WSP sketch follows)

[Figure: VW 1 resumes with minibatch 8 after the pull; VW 2 continues with minibatches 5-7; push & pull with the Parameter Server (w_global)]
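
The push-every-clock and pull-within-D behavior, sketched as a toy single-process model (hypothetical names; in the real system the PS and the VWs are separate processes):

    class WSPServer:
        # Toy Wave Synchronous Parallel bookkeeping.
        def __init__(self, w0, num_vws, D):
            self.w_global = w0
            self.clocks = [0] * num_vws   # clock of each virtual worker
            self.D = D                    # clock distance threshold

        def push(self, vw, u_wave):
            # Every clock, a VW pushes the aggregated update of its wave.
            self.w_global = self.w_global + u_wave
            self.clocks[vw] += 1

        def can_start_next_wave(self, vw):
            # A VW may run at most D clocks ahead of the slowest VW;
            # otherwise it must wait (and pull the updated w_global).
            return self.clocks[vw] - min(self.clocks) <= self.D

        def pull(self):
            return self.w_global

    ps = WSPServer(w0=0.0, num_vws=2, D=0)
    ps.push(0, u_wave=0.5)                       # VW 1 pushes wave 0
    print(ps.can_start_next_wave(0))             # False: minibatch 8 is blocked (D = 0)
    ps.push(1, u_wave=0.3)                       # VW 2 pushes wave 0
    print(ps.can_start_next_wave(0), ps.pull())  # True; w_global now reflects both waves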

SLIDE 25

Data Parallelism with Multiple VWs

▪ Local staleness (s_local) and global staleness (s_global) with WSP
  • After the first clock (D = 0): w_global = w_0 + (u_1 + u_2 + u_3 + u_4) from VW 1 and VW 2
  • In VW 1, minibatch 11 starts with w_11 = w_global + (u_5 + u_6 + u_7) from VW 1
  • w_11 misses VW 1's in-flight updates (u_8 + u_9 + u_10): local staleness s_local
  • w_11 also misses VW 2's updates (u_5 + u_6 + u_7): part of the global staleness s_global

[Figure: timelines of VW 1 and VW 2 over clocks 1-2, annotated with which updates are reflected in w_11]

SLIDE 26

Data Parallelism with Multiple VWs

▪ Local staleness (s_local) and global staleness (s_global) with WSP
  • With D = 0, minibatch 12 of VW 1 has to wait until the next clock's synchronization with the parameter server completes

[Figure: VW 1 has reached minibatch 12 while VW 2 is still executing wave 1; Parameter Server holds w_global]

SLIDE 27

Data Parallelism with Multiple VWs

▪ Example of the clock distance threshold D
  • If D = 1, VW 1 can start minibatch 8 without a pull, since it may run up to one clock ahead of the slowest VW

[Figure: VW 1 proceeds to minibatch 8 while VW 2 is still executing minibatches 1-4; push & pull with the Parameter Server (w_global)]

SLIDE 28

Data Parallelism with Multiple VWs

▪ Example of the clock distance threshold D
  • With D = 1, w_global is still w_0 (no pull has occurred), and minibatch 11 of VW 1 starts with w_11 = w_0 + (u_1 + ... + u_7) from VW 1
  • w_11 misses VW 1's in-flight updates (u_8 + u_9 + u_10): local staleness s_local
  • w_11 also misses all of VW 2's updates (u_1 + ... + u_7): part of the global staleness s_global
  • Minibatch 12 has to wait

[Figure: timelines of VW 1 (minibatches 1-12) and VW 2 (minibatches 1-7) with D = 1, annotated with the updates missing from w_11]

SLIDE 29

Outline

▪ Motivation & Background
▪ HetPipe in a Nutshell
▪ Our System: HetPipe
▪ Evaluation
  • Setup
  • Resource Allocation for Virtual Workers
  • Results
▪ Conclusion

SLIDE 30

Evaluation Setup

▪ Cluster setup: 4 heterogeneous GPU nodes connected by InfiniBand (56 Gbps)
  • Each node has four GPUs of one type: TITAN V (V0-V3), TITAN RTX (R0-R3), GeForce RTX 2060 (G0-G3), or Quadro P4000 (Q0-Q3)
  • Computation power: V > R > G > Q; the GPU types also differ in memory size
▪ Two DNN models (both trained on ImageNet with minibatch size 32):
  • ResNet-152: model parameter size 230 MB; characteristic: large activation output
  • VGG-19: model parameter size 548 MB; characteristic: large parameter size

SLIDE 31

Resource Allocation for Virtual Workers: NP, ED, HD

▪ NP (Node Partition): each virtual worker consists of the four GPUs of a single node (VW 1 = Node 1, ..., VW 4 = Node 4)
  • Minimum communication overhead within a VW
  • Performance of each virtual worker varies
  • Straggler may degrade performance with DP
SLIDE 32

Resource Allocation for Virtual Workers: NP, ED, HD

▪ ED (Equal Distribution): each virtual worker gets one GPU from every node, so all VWs have the same GPU mix
  • Performance will be the same across the VWs
  • Mitigates the straggler problem
  • High communication overhead within each VW
SLIDE 33

Resource Allocation for Virtual Workers: NP, ED, HD

▪ HD (Hybrid Distribution): a hybrid of NP and ED that considers both the computation power (V > R > G > Q) and the memory size of the GPU types (a sketch of the three policies follows)
  • Mitigates the straggler problem
  • Reduces communication overhead within each VW
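
A rough sketch of what the three allocation policies could look like over the 16-GPU cluster (illustrative only; the HD grouping below is one possible hybrid, not necessarily the one used in the paper's evaluation):

    nodes = {
        "Node1": ["V0", "V1", "V2", "V3"],   # TITAN V
        "Node2": ["R0", "R1", "R2", "R3"],   # TITAN RTX
        "Node3": ["G0", "G1", "G2", "G3"],   # GeForce RTX 2060
        "Node4": ["Q0", "Q1", "Q2", "Q3"],   # Quadro P4000
    }

    def np_policy(nodes):
        # NP: each virtual worker is one whole node.
        return [list(gpus) for gpus in nodes.values()]

    def ed_policy(nodes):
        # ED: each virtual worker takes one GPU from every node.
        return [list(col) for col in zip(*nodes.values())]

    def hd_policy(nodes):
        # HD (one possible hybrid): two GPUs from each of two nodes per virtual worker.
        n = list(nodes.values())
        return [n[0][:2] + n[1][:2], n[0][2:] + n[1][2:],
                n[2][:2] + n[3][:2], n[2][2:] + n[3][2:]]

    for policy in (np_policy, ed_policy, hd_policy):
        print(policy.__name__, policy(nodes))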

SLIDE 34

Parameter Placement

▪ Round-robin policy (default)
  • Parameters of each layer are placed on the parameter server partitions of the nodes in round-robin fashion (a minimal sketch follows)
  • Can be used with all three allocation policies: NP, ED, and HD

[Figure: example with ED; the parameters of each layer are spread across Node 1 - Node 4]
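
A minimal sketch of round-robin layer-to-node parameter placement (hypothetical helper, not HetPipe's code):

    def round_robin_placement(layers, nodes):
        # Assign the parameters of each layer to a parameter-server partition,
        # cycling over the nodes in round-robin order.
        return {layer: nodes[i % len(nodes)] for i, layer in enumerate(layers)}

    layers = ["layer%d" % i for i in range(8)]
    nodes = ["Node1", "Node2", "Node3", "Node4"]
    print(round_robin_placement(layers, nodes))
    # layer0 -> Node1, layer1 -> Node2, ..., layer4 -> Node1, ...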

SLIDE 35

Parameter Placement

▪ Local placement policy
  • ED-local: with ED allocation, the parameters of each layer are kept on the node where that layer's partition is executed, so parameter communication occurs within the node
  • Significantly reduces communication overhead compared to ED with round-robin placement

[Figure: example with ED-local; each node holds the parameters used by the partition running on it]

SLIDE 36

Compare Throughput with Horovod

▪ Baseline: Horovod, state-of-the-art DP using AllReduce
▪ HetPipe improves throughput by up to 1.4x for ResNet-152 and 1.8x for VGG-19
  • ED: reduces the straggler problem
  • ED-local: significantly reduces communication overhead
  • For ResNet-152, the whole model is too large to be loaded into a single G type GPU (batch size = 32)

[Figure: throughput of the HetPipe configurations vs. Horovod for ResNet-152 and VGG-19]

SLIDE 37

Performance Improvement of Adding Whimpy GPUs

▪ Starting from the four V GPUs, whimpy GPUs are added incrementally (+ R GPUs, + Q GPUs, + G GPUs)
  • With the additional GPUs, HetPipe achieves up to a 2.3x speedup
  • Additional whimpy systems allow for faster training

[Figure: throughput as whimpy GPU nodes are added to the V node]

SLIDE 38

Convergence Results

▪ ResNet-152 (target accuracy: 74%)
  • HetPipe reduces the straggler problem in the heterogeneous environment: up to 39% faster convergence
  • Adding four more whimpy G GPUs (12 GPUs to 16 GPUs) improves performance even more: 7% faster

[Figure: accuracy over training time with 12 GPUs and 16 GPUs]

SLIDE 39

Convergence Results

▪ VGG-19 (target accuracy: 67%)
  • HetPipe (D = 0) is 29% faster than Horovod
  • Up to 49% faster than Horovod
  • With D = 32, convergence is 4.7% slower: higher global staleness (i.e., 32) can degrade convergence performance

[Figure: accuracy over training time for Horovod and HetPipe with D = 0, 4, and 32]

SLIDE 40

Not Presented But Discussed in Paper

▪ Convergence proof of WSP
▪ Partitioning algorithm
▪ Performance of a single virtual worker
▪ Comparison to PipeDream

SLIDE 41

Conclusion

▪ HetPipe makes it possible to efficiently train large DNN models with heterogeneous GPUs
▪ Integrates pipelined model parallelism with data parallelism
▪ Proposes a novel parameter synchronization model: WSP
▪ DNN models converge up to 49% faster with HetPipe

SLIDE 42

Thank you!