Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch Large-Scale Data & Systems (LSDS) Group
Imperial College London
Spotnik
Designing Distributed Machine Learning for Transient Cloud Resources
Train a machine learning model
worker 0 worker 1 worker 2
Δ Δ Δ
Learn a model
Data parallelism
Ring all-reduce
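A minimal sketch of ring all-reduce may help make the slide concrete. It simulates, in a single process, how each worker's update (the Δ above) could be summed across a ring in two phases, reduce-scatter and all-gather. The function name `ring_all_reduce`, the list-of-lists representation, and the requirement that the vector length is a multiple of the number of workers are assumptions made for illustration; this is not the implementation used by Spotnik or KungFu.

```python
from typing import List


def ring_all_reduce(grads: List[List[float]]) -> List[List[float]]:
    """Simulate a sum ring all-reduce over one gradient vector per worker."""
    n = len(grads)
    size = len(grads[0])
    # Assumption for this sketch: the vector splits evenly into n chunks.
    assert size % n == 0, "vector length must be a multiple of the worker count"
    chunk = size // n
    data = [list(g) for g in grads]  # each worker's local buffer

    def span(c: int) -> range:
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: in step s, worker w sends chunk (w - s) % n to
    # its right neighbour, which adds the payload to its own copy.
    for s in range(n - 1):
        sends = [(w, (w - s) % n) for w in range(n)]
        payloads = [[data[w][i] for i in span(c)] for w, c in sends]
        for (w, c), payload in zip(sends, payloads):
            dst = (w + 1) % n
            for i, v in zip(span(c), payload):
                data[dst][i] += v

    # Worker w now holds the fully reduced chunk (w + 1) % n.
    # Phase 2, all-gather: in step s, worker w forwards chunk (w + 1 - s) % n,
    # and the right neighbour overwrites its copy with the reduced values.
    for s in range(n - 1):
        sends = [(w, (w + 1 - s) % n) for w in range(n)]
        payloads = [[data[w][i] for i in span(c)] for w, c in sends]
        for (w, c), payload in zip(sends, payloads):
            dst = (w + 1) % n
            for i, v in zip(span(c), payload):
                data[dst][i] = v

    return data
```

Each worker sends and receives only to and from its ring neighbours, and every step moves one chunk per worker, which is why a single revoked worker breaks the whole ring.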
Resource-hungry and expensive
Example: Megatron-LM [3] takes 2.1 days
[3] Shoeybi, Mohammad, et al. Megatron-LM: Training multi-billion parameter language models using GPU model parallelism, 2019
[1] https://azure.microsoft.com/en-us/pricing/spot/
A Megatron-LM epoch would drop from $92,613 to $15,152
90%
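As a quick check of the two dollar figures quoted for a Megatron-LM epoch, the relative saving can be computed directly. The helper name `relative_saving` is hypothetical; only the two prices come from the slide.

```python
def relative_saving(full_price: float, discounted_price: float) -> float:
    """Return the saving as a percentage of the full price."""
    return 100.0 * (full_price - discounted_price) / full_price


# Prices for one Megatron-LM epoch as quoted on the slide.
saving = relative_saving(92_613, 15_152)
print(f"{saving:.1f}% cheaper")  # prints "83.6% cheaper"
```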
➙ System must cope with disappearing resources
➙ System must ensure consistency of updates
➙ System must adapt to different conditions
HogWild!, AD-PSGD, EA-SGD, SMA, S-SGD
Axes: network efficiency, model accuracy
Timeline: Checkpoint, Training, Recovery, Training
Training state is saved
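The checkpoint and recovery cycle named on this slide can be sketched as follows. This is a toy loop under stated assumptions: the helper names (`save_checkpoint`, `load_checkpoint`, `train`), the `fail_at` parameter that simulates a revocation, and the use of `pickle` with an atomic rename are all illustrative choices, not taken from the talk.

```python
import os
import pickle


def save_checkpoint(path, step, weights):
    """Write the training state to a temp file, then rename it atomically."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump({"step": step, "weights": weights}, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return (step, weights) from the last checkpoint, or a fresh state."""
    if not os.path.exists(path):
        return 0, [0.0, 0.0]
    with open(path, "rb") as f:
        state = pickle.load(f)
    return state["step"], state["weights"]


def train(path, total_steps, checkpoint_every, fail_at=None):
    """Toy training loop that resumes from the last checkpoint on start-up."""
    step, weights = load_checkpoint(path)
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            # Hypothetical stand-in for a revoked transient worker.
            raise RuntimeError("worker revoked")
        weights = [w + 1.0 for w in weights]  # stand-in for a model update
        step += 1
        if step % checkpoint_every == 0:
            save_checkpoint(path, step, weights)
    return step, weights
```

If a run with `checkpoint_every=4` is interrupted at step 6, the restart resumes from step 4, so the two steps after the last checkpoint are recomputed. That lost work is the cost of recovery from a checkpoint.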
Parameter server Worker 0 Worker 1 Worker 2
Dedicated resources Transient resources
[2] Harlap et al. Proteus: agile ML elasticity through tiered reliability in dynamic resource markets. EuroSys, 2017
Challenges and solutions
Challenge: Workers become available
Solution: Reuse communication channels for synchronisation to repair the cluster
Challenge: Changes can happen at any time
Solution: Ensure atomic model updates by waiting for all synchronisations to finish first
Challenge: Cluster sizes are unknown beforehand
Solution: Change the synchronisation strategy based on the number of workers
O(N ⋅ K) messages
➙ Scale to many transient resources
Repairing a broken all-reduce ring
Ring: W0 W1 W2 W3 W4 W5
Worker membership, step by step:
Step 1: Every worker holds [0, 1, 2, 3, 4, 5]
Step 2: Revocation of W1; W0, W2, W3, W4 and W5 still hold [0, 1, 2, 3, 4, 5]
Step 3: Bypass of W1; W0 and W2 hold [0, 2, 3, 4, 5]; W3, W4 and W5 hold [0, 1, 2, 3, 4, 5]
Step 4: W0, W2 and W3 hold [0, 2, 3, 4, 5]; W4 and W5 hold [0, 1, 2, 3, 4, 5]
Step 5: Revocation of W4; W0, W2 and W3 hold [0, 2, 3, 4, 5]; W5 holds [0, 1, 2, 3, 4, 5]
Step 6: Bypass of W4; W0 and W2 hold [0, 2, 3, 4, 5]; W3 and W5 hold [0, 2, 3, 5]
Step 7: W0, W3 and W5 hold [0, 2, 3, 5]; W2 holds [0, 2, 3, 4, 5]
Step 8: W0, W2, W3 and W5 all hold [0, 2, 3, 5]
Step 9: Repaired ring: W0 W2 W3 W5
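The membership changes in the ring-repair slides can be reproduced with a small simulation. It is one possible reading of the animation, not KungFu's implementation: each worker keeps its own membership list, and when it forwards along the ring it drops any revoked successor, bypasses it, and hands its updated list to the next live worker. The names `successor`, `forward`, `views` and `revoked` are assumptions made for this sketch.

```python
from typing import Dict, List, Set


def successor(worker: int, view: List[int]) -> int:
    """Next worker after `worker` in ring order, according to `view`."""
    return view[(view.index(worker) + 1) % len(view)]


def forward(sender: int, views: Dict[int, List[int]], revoked: Set[int]) -> int:
    """Pass the sender's membership to the next live worker.

    Revoked successors are removed from the sender's list and bypassed.
    Returns the worker that received the list.
    """
    view = list(views[sender])
    nxt = successor(sender, view)
    while nxt in revoked:
        view.remove(nxt)  # repair: drop the revoked worker
        nxt = successor(sender, view)
    views[sender] = view
    views[nxt] = list(view)  # the receiver adopts the sender's membership
    return nxt


# Replay of the slide sequence: W1 is revoked first, W4 later.
views = {w: [0, 1, 2, 3, 4, 5] for w in range(6)}
revoked = {1}
w = forward(0, views, revoked)  # W0 bypasses W1 and reaches W2
w = forward(w, views, revoked)  # W2 passes the list to W3
revoked.add(4)
w = forward(w, views, revoked)  # W3 bypasses W4 and reaches W5
w = forward(w, views, revoked)  # W5 passes the list to W0
w = forward(w, views, revoked)  # W0 passes the list to W2
live = [x for x in views if x not in revoked]
```

After the last call, every live worker holds the same list, and the ring W0 W2 W3 W5 can resume all-reduce. The repair travels over the same neighbour-to-neighbour links that synchronisation already uses.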
Challenge: Changes can happen at any time
Solution: Ensure atomic model updates by waiting for all synchronisations to finish first
Parameter: Sync., then Update (one pair per parameter)
➙ Partial update leads to inconsistency
Barrier: finish every Sync. before any Update
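A minimal sketch of the barrier idea, assuming a per-parameter `sync` callback that may fail when a worker disappears: all synchronisations are collected first, and updates are applied only once every one of them has finished. The names `atomic_step` and `SyncFailed`, and the plain gradient-descent update, are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict


class SyncFailed(Exception):
    """Raised by the hypothetical sync callback when a peer is revoked."""


def atomic_step(
    params: Dict[str, float],
    grads: Dict[str, float],
    sync: Callable[[str, float], float],
    lr: float,
) -> bool:
    """Apply one model update atomically; return False if it was abandoned."""
    synced: Dict[str, float] = {}
    try:
        # Synchronise every parameter's gradient before touching the model.
        for name, grad in grads.items():
            synced[name] = sync(name, grad)
    except SyncFailed:
        return False  # no parameter has been modified

    # Barrier reached: all synchronisations finished, so update everything.
    for name, grad in synced.items():
        params[name] -= lr * grad
    return True
```

Without the barrier, a failure after the first parameter's update would leave the model with some parameters from the new step and some from the old one, which is the partial update the slide warns about.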
Challenge: Cluster sizes are unknown beforehand
Solution: Change the synchronisation strategy based on the number of workers
3 workers: W0 W1 W2
6 workers: W0 W1 W2 W3 W4 W5
Strategies: AD-PSGD, S-SGD
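Choosing a synchronisation strategy from the current number of workers could look like the sketch below. The threshold of 16 workers and the mapping (S-SGD for small clusters, AD-PSGD for larger ones) are assumptions for illustration; the slides name the two strategies but do not state a switching rule.

```python
from typing import List, Tuple


def choose_strategy(num_workers: int, threshold: int = 16) -> str:
    """Pick a strategy for the current cluster size (hypothetical policy)."""
    return "S-SGD" if num_workers < threshold else "AD-PSGD"


def plan_switches(cluster_sizes: List[int], threshold: int = 16) -> List[Tuple[int, int, str]]:
    """Return (step, cluster size, strategy) each time the strategy changes."""
    events = []
    current = None
    for step, size in enumerate(cluster_sizes):
        strategy = choose_strategy(size, threshold)
        if strategy != current:
            events.append((step, size, strategy))
            current = strategy
    return events
```

For example, `plan_switches([3, 6, 20, 25, 10])` reports a switch to AD-PSGD when the cluster grows to 20 workers and a switch back to S-SGD when it shrinks to 10.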
How does the recovery latency change with an increasing number of revocations?
No significant increase in recovery latency as the number of revocations increases
How much does training slow down if we use atomic worker state updates?
Throughput decrease is small
How does the throughput change if the cluster changes?
Plot axis: cluster size, 5 to 30
Switching from S-SGD to AD-PSGD
Changing clusters need adaptation
KungFu github.com/lsds/KungFu
Website lsds.doc.ic.ac.uk | E-Mail marcel.wagenlander19@imperial.ac.uk | Twitter @marwage