Spotnik Designing Distributed M a chine Le a rning for Tr a nsient - - PowerPoint PPT Presentation

spotnik
SMART_READER_LITE
LIVE PREVIEW

Spotnik Designing Distributed M a chine Le a rning for Tr a nsient - - PowerPoint PPT Presentation

Spotnik Designing Distributed M a chine Le a rning for Tr a nsient Cloud Resources M a rcel W a genl nder, Luo M a i, Guo Li, Peter Pietzuch L a rge-Sc a le D a t a & Systems (LSDS) Group Imperial College London Distributed ML Tr a in a m a


slide-1
SLIDE 1

Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group

Imperial College London

Spotnik

Designing Distributed Machine Learning for Transient Cloud Resources

slide-2
SLIDE 2

Distributed ML

2

Train a machine learning model

worker 0 worker 1 worker 2

Δ Δ Δ

slide-3
SLIDE 3

Distributed ML

3

Train a machine learning model

worker 0 worker 1 worker 2

Δ Δ Δ

Learn a model

slide-4
SLIDE 4

Distributed ML

4

Train a machine learning model

worker 0 worker 1 worker 2

Δ Δ Δ

Data parallelism

slide-5
SLIDE 5

Distributed ML

5

Train a machine learning model

worker 0 worker 1 worker 2

Δ Δ Δ

Ring all-reduce

slide-6
SLIDE 6

Challenges of distributed ML

  • Distributed ML is resource-

hungry

  • Accelerated resources are

expensive Example Megatron-LM3

  • Training of BERT-like model
  • 512 V100 GPUs
  • One epoch (68,507 iterations)

takes 2.1 days

  • Cost on Azure: $92,613

6

3Shoeybi, Mohammad, et al. Megatron-LM: Training multi-billion parameter language models using gpu model parallelism, 2017

slide-7
SLIDE 7

Transient cloud resources

1https://azure.microsoft.com/en-us/pricing/spot/

7

  • Examples: AWS Spot instances, Azure Spot VMs
  • Follows the law of a free market
  • Revocations
  • Notifications
  • Economic incentive
  • Offers a cost reduction of up to 90%1

A Megatron-LM epoch would drop from $92,613 to $15,152

worker 2

90%

slide-8
SLIDE 8

Implications of transient resources

  • New workers become available or old workers get revoked

➙ System must cope with disappearing resources

  • Changes can happen at any time

➙ System must ensure consistency of updates

8

worker 0 worker 1 worker 2 worker 3 worker 4 Cluster Size

slide-9
SLIDE 9

Implications of transient resources

  • New workers become available or old workers get revoked

➙ System must cope with disappearing resources

  • Changes can happen at any time

➙ System must ensure consistency of updates

  • Cluster sizes are unknown beforehand

➙ System must adapt to different conditions

9

HogWild! AD-PSGD EA-SGD SMA S-SGD Network efgiciency Model accuracy

slide-10
SLIDE 10

Current approach: Checkpoint & recovery

  • Tensorflow and Pytorch
  • Changes to the cluster are not considered
  • Recovery takes about 20 seconds with ResNet50 and ImageNet

Recovery Training

10

Checkpoint Training

slide-11
SLIDE 11
  • Mix dedicated resources with transient resources
  • Proteus2: Placement of parameter server on dedicated resources so that the

training state is save

Parameter server Worker 0 Worker 1 Worker 2

Dedicated resources Transient resources

2Harlap et al. Proteus: agile ML elasticity through tiered reliability in dynamic re- source markets. EuroSys, 2017

11

Current approaches: Hybrid

slide-12
SLIDE 12

Spotnik

12

Challenges Solutions Workers become available

  • r get revoked

Reuse communication channels for synchronisation to repair the cluster Changes can happen at any time Ensure atomic model updates by waiting for all synchronisations to finish first Cluster sizes are unknown beforehand Change the synchronisation strategy based on the number of workers

slide-13
SLIDE 13

Spotnik

13

Challenges Solutions Workers become available

  • r get revoked

Reuse communication channels for synchronisation to repair the cluster Changes can happen at any time Ensure atomic model updates by waiting for all synchronisations to finish first Cluster sizes are unknown beforehand Change the synchronisation strategy based on the number of workers

slide-14
SLIDE 14

Revocation recovery algorithm

  • Handle revocations within a ring topology
  • Number of total messages is bounded by

messages

  • K is the number of simultaneous revocations
  • N is the number of workers

➙ Scale to many transient resources

  • No reliance on revocation notifications

O(N ⋅ K)

14

slide-15
SLIDE 15

Revocation recovery algorithm

Repairing a broken all-reduce ring

W0 W1 W2 W3 W4 W5

Worker Membership

[0, 1, 2, 3, 4, 5]

1

[0, 1, 2, 3, 4, 5]

2

[0, 1, 2, 3, 4, 5]

3

[0, 1, 2, 3, 4, 5]

4

[0, 1, 2, 3, 4, 5]

5

[0, 1, 2, 3, 4, 5]

15

slide-16
SLIDE 16

Repairing a broken all-reduce ring

W0 W1 W2 W3 W4 W5

Worker Membership

[0, 1, 2, 3, 4, 5]

2

[0, 1, 2, 3, 4, 5]

3

[0, 1, 2, 3, 4, 5]

4

[0, 1, 2, 3, 4, 5]

5

[0, 1, 2, 3, 4, 5]

16

Reconcile

Revocation

Revocation recovery algorithm

slide-17
SLIDE 17

Repairing a broken all-reduce ring

W0 W1 W2 W3 W4 W5

Worker Membership

[0, 2, 3, 4, 5]

2

[0, 2, 3, 4, 5]

3

[0, 1, 2, 3, 4, 5]

4

[0, 1, 2, 3, 4, 5]

5

[0, 1, 2, 3, 4, 5]

Reconcile

Bypass

Revocation recovery algorithm

17

slide-18
SLIDE 18

Repairing a broken all-reduce ring

W0 W1 W2 W3 W4 W5

Worker Membership

[0, 2, 3, 4, 5]

2

[0, 2, 3, 4, 5]

3

[0, 2, 3, 4, 5]

4

[0, 1, 2, 3, 4, 5]

5

[0, 1, 2, 3, 4, 5]

18

Reconcile

Revocation recovery algorithm

slide-19
SLIDE 19

Repairing a broken all-reduce ring

W0 W1 W2 W3 W4 W5

Worker Membership

[0, 2, 3, 4, 5]

2

[0, 2, 3, 4, 5]

3

[0, 2, 3, 4, 5]

4 5

[0, 1, 2, 3, 4, 5]

19

Reconcile

Revocation

Revocation recovery algorithm

slide-20
SLIDE 20

Repairing a broken all-reduce ring

W0 W1 W2 W3 W4 W5

Worker Membership

[0, 2, 3, 4, 5]

2

[0, 2, 3, 4, 5]

3

[0, 2, 3, 5]

4 5

[0, 2, 3, 5]

20

Reconcile

Bypass

Revocation recovery algorithm

slide-21
SLIDE 21

Repairing a broken all-reduce ring

W0 W1 W2 W3 W4 W5

Worker Membership

[0, 2, 3, 5]

2

[0, 2, 3, 4, 5]

3

[0, 2, 3, 5]

4 5

[0, 2, 3, 5]

21

Reconcile

Revocation recovery algorithm

slide-22
SLIDE 22

Repairing a broken all-reduce ring

W0 W1 W2 W3 W4 W5

Worker Membership

[0, 2, 3, 5]

2

[0, 2, 3, 5]

3

[0, 2, 3, 5]

4 5

[0, 2, 3, 5]

22

Reconcile

Revocation recovery algorithm

slide-23
SLIDE 23

Repairing a broken all-reduce ring

W0 W1 W2 W3 W4 W5

Worker Membership

[0, 2, 3, 5]

2

[0, 2, 3, 5]

3

[0, 2, 3, 5]

4 5

[0, 2, 3, 5]

23

Accept

Revocation recovery algorithm

slide-24
SLIDE 24

Repairing a broken all-reduce ring

W0 W2 W3 W5

Worker Membership

[0, 2, 3, 5]

2

[0, 2, 3, 5]

3

[0, 2, 3, 5]

5

[0, 2, 3, 5]

24

Restart

Revocation recovery algorithm

slide-25
SLIDE 25

Spotnik

25

Challenges Solutions Workers become available

  • r get revoked

Reuse communication channels for synchronisation to repair the cluster Changes can happen at any time Ensure atomic model updates by waiting for all synchronisations to finish first Cluster sizes are unknown beforehand Change the synchronisation strategy based on the number of workers

slide-26
SLIDE 26

Atomic worker state update

  • Pipelined synchronisation

26 26

Parameter Parameter Sync. Sync. Parameter Sync. Update Update Update

slide-27
SLIDE 27

Atomic worker state update

  • Pipelined synchronisation
  • Revocations can happen meanwhile

➙ Partial update leads to inconsistency

27 27

Parameter Parameter Sync. Sync. Parameter Sync. Update Update Update

slide-28
SLIDE 28

Atomic worker state update

  • Atomicity: Wait for all synchronisation communications to finish
  • Discard updates if communication fails

28 28

Parameter Parameter Sync. Sync. Parameter Sync. Update Update Update

Barrier

slide-29
SLIDE 29

Spotnik

29

Challenges Solutions Workers become available

  • r get revoked

Reuse communication channels for synchronisation to repair the cluster Changes can happen at any time Ensure atomic model updates by waiting for all synchronisations to finish first Cluster sizes are unknown beforehand Change the synchronisation strategy based on the number of workers

slide-30
SLIDE 30

Adaptive synchronisation strategies

  • Support a range of synchronisation primitives
  • collective and point-to-point synchronisation
  • Monitor a metric
  • Number of workers

30

slide-31
SLIDE 31

Adaptive synchronisation strategies

  • Support a range of synchronisation primitives
  • collective and point-to-point synchronisation
  • Monitor a metric
  • Number of workers
  • Define a policy in the beginning
  • When to choose which sync strategy

31

W0 W1 W2 3 workers W0 W3 W2 6 workers W1 W4 W5

AD-PSGD S-SGD

slide-32
SLIDE 32

Evaluation

How does the recovery latency change with increasing number of revocations?

32

Cluster

  • 16 workers

Hardware

  • Azure NC6 VMs
  • Nvidia K80

Software

  • KungFu 0.2.1
  • Tensorflow 1.15

ML

  • ResNet50
  • ImageNet

No significant increase of recovery latency if the number of revocation increases

slide-33
SLIDE 33

Evaluation

How much does the training slow down if we use atomic worker state updates?

33

Cluster

  • 32 workers

Hardware

  • Azure NC6 VMs
  • Nvidia K80

Software

  • KungFu 0.2.1
  • Tensorflow 1.15

*different Setup Cluster

  • 16 workers

Hardware

  • Huawei ModelArts
  • Nvidia V100
  • InfiniBand

Software

  • KungFu 0.2.1
  • Tensorflow 1.12

Throughput decrease is small

slide-34
SLIDE 34

Evaluation

How does the throughput change, if the cluster changes?

34

Cluster

  • up to 32 workers

Hardware

  • Azure NC6 VMs
  • Nvidia K80

Software

  • KungFu 0.2.1
  • Tensorflow 1.15

ML

  • ResNet50
  • ImageNet

Cluster size

5 10 15 20 25 30

slide-35
SLIDE 35

Evaluation

How does the throughput change, if the cluster changes?

35

Cluster

  • up to 32 workers

Hardware

  • Azure NC6 VMs
  • Nvidia K80

Software

  • KungFu 0.2.1
  • Tensorflow 1.15

ML

  • ResNet50
  • ImageNet

Cluster size

5 10 15 20 25 30

slide-36
SLIDE 36

Evaluation

How does the throughput change, if the cluster changes?

Switching from S-SGD to AD-SGD

36

Cluster

  • up to 32 workers

Hardware

  • Azure NC6 VMs
  • Nvidia K80

Software

  • KungFu 0.2.1
  • Tensorflow 1.15

ML

  • ResNet50
  • ImageNet

Changing clusters need adaptation

Cluster size

5 10 15 20 25 30

slide-37
SLIDE 37

Conclusion

37

  • Transient cloud resources offer potential to save money for ML training
  • No system that runs exclusively on transient resources and has low overhead
slide-38
SLIDE 38

Conclusion

38

  • Transient cloud resources offer potential to save money for ML training
  • No system that runs exclusively on transient resources and has low overhead

Spotnik

  • Repair cluster with low overhead
  • Ensure consistent model updates
  • Adapt to changes of the cluster
slide-39
SLIDE 39

Conclusion

39

  • Transient cloud resources offer potential to save money for ML training
  • No system that runs exclusively on transient resources and has low overhead

Spotnik

  • Repair cluster with low overhead
  • Ensure consistent model updates
  • Adapt to changes of the cluster

KungFu github.com/lsds/KungFu

slide-40
SLIDE 40

Conclusion

40

  • Transient cloud resources offer potential to save money for ML training
  • No system that runs exclusively on transient resources and has low overhead
  • Repair cluster with low overhead
  • Ensure consistent model updates
  • Adapt to changes of the cluster

Website lsds.doc.ic.ac.uk | E-Mail marcel.wagenlander19@imperial.ac.uk Twitter @marwage

Spotnik

KungFu github.com/lsds/KungFu