
Spotnik: Designing Distributed Machine Learning for Transient Cloud Resources - PowerPoint PPT Presentation



  1. Spotnik: Designing Distributed Machine Learning for Transient Cloud Resources. Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch. Large-Scale Data & Systems (LSDS) Group, Imperial College London

  2. Distributed ML: Train a machine learning model. (Diagram: worker 0, worker 1 and worker 2 each compute an update Δ.)

  3. Distributed ML: Train a machine learning model. (Same diagram; annotation: "Learn a model".)

  4. Distributed ML: Train a machine learning model. (Same diagram; annotation: "Data parallelism": each worker trains on a different shard of the data.)

  5. Distributed ML: Train a machine learning model. (Same diagram; annotation: "Ring all-reduce": the workers exchange their updates Δ around a ring to average them.)
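The ring all-reduce on this slide can be sketched in a few lines. The following is a minimal in-memory simulation for illustration, not the KungFu/Spotnik implementation: each worker's gradient is split into one chunk per worker, chunks are summed as they travel around the ring (reduce-scatter), and the completed chunks then circulate back (all-gather).

```python
# Minimal in-memory simulation of ring all-reduce (illustrative sketch,
# not the KungFu/Spotnik implementation). Each of the n workers holds a
# gradient split into n chunks; chunks travel around the ring.
def ring_all_reduce(grads):
    """grads: n lists of n floats (one chunk list per worker).
    Returns the averaged gradient that every worker ends up holding."""
    n = len(grads)
    chunks = [list(g) for g in grads]
    # Reduce-scatter: in step s, worker w sends chunk (w - s) mod n to its
    # right neighbour, which adds it to its own copy. After n - 1 steps,
    # worker w holds the complete sum of chunk (w + 1) mod n.
    for s in range(n - 1):
        for w in range(n):
            c = (w - s) % n
            chunks[(w + 1) % n][c] += chunks[w][c]
    # All-gather: in step s, worker w forwards its completed chunk
    # (w + 1 - s) mod n; the receiver overwrites its own copy.
    for s in range(n - 1):
        for w in range(n):
            c = (w + 1 - s) % n
            chunks[(w + 1) % n][c] = chunks[w][c]
    # All workers now hold identical sums; return the average.
    return [c / n for c in chunks[0]]
```

With three workers holding gradients [1, 2, 3], [4, 5, 6] and [7, 8, 9], every worker ends up with the element-wise average [4.0, 5.0, 6.0]. Each worker only ever talks to its ring neighbour, which is why the bandwidth per worker stays constant as the ring grows.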

  6. Challenges of distributed ML: Distributed ML is resource-hungry, and accelerated resources are expensive. Example: Megatron-LM³, training a BERT-like model on 512 V100 GPUs; one epoch (68,507 iterations) takes 2.1 days and costs $92,613 on Azure. ³ Shoeybi, Mohammad, et al. "Megatron-LM: Training multi-billion parameter language models using GPU model parallelism", 2019

  7. Transient cloud resources: Examples are AWS Spot Instances and Azure Spot VMs. Pricing follows the law of a free market: instances can be revoked (with notifications), and in return there is an economic incentive, a cost reduction of up to 90%¹. A Megatron-LM epoch would drop from $92,613 to $15,152. ¹ https://azure.microsoft.com/en-us/pricing/spot/

  8. Implications of transient resources: New workers become available or old workers get revoked ➙ the system must cope with disappearing resources. Changes can happen at any time ➙ the system must ensure consistency of updates. (Diagram: the cluster size varies as workers 0–4 join and leave.)

  9. Implications of transient resources: New workers become available or old workers get revoked ➙ the system must cope with disappearing resources. Changes can happen at any time ➙ the system must ensure consistency of updates. Cluster sizes are unknown beforehand ➙ the system must adapt to different conditions. (Diagram: the synchronisation algorithms HogWild!, AD-PSGD, EA-SGD, SMA and S-SGD plotted on a trade-off between network efficiency and model accuracy.)

  10. Current approach: checkpoint & recovery. TensorFlow and PyTorch checkpoint the training state, but changes to the cluster are not considered, and recovery takes about 20 seconds with ResNet50 and ImageNet. (Timeline: Training ➙ Checkpoint ➙ Recovery ➙ Training.)

  11. Current approaches: hybrid. Mix dedicated resources with transient resources. Proteus²: placement of the parameter server on dedicated resources so that the training state is safe, with workers 0–2 on transient resources. ² Harlap et al. "Proteus: agile ML elasticity through tiered reliability in dynamic resource markets". EuroSys, 2017

  12. Spotnik: challenges and solutions.
  • Workers become available or get revoked ➙ reuse communication channels for synchronisation to repair the cluster.
  • Changes can happen at any time ➙ ensure atomic model updates by waiting for all synchronisations to finish first.
  • Cluster sizes are unknown beforehand ➙ change the synchronisation strategy based on the number of workers.

  13. Spotnik: challenges and solutions (table repeated from slide 12).

  14. Revocation recovery algorithm: Handle revocations within a ring topology. The total number of messages is bounded by O(N·K), where K is the number of simultaneous revocations and N is the number of workers ➙ scales to many transient resources. No reliance on revocation notifications.

  15. Revocation recovery algorithm: repairing a broken all-reduce ring. Initial ring W0–W5; every worker's membership view is [0, 1, 2, 3, 4, 5].

  16. (Reconcile) W1 is revoked; the surviving workers 0, 2, 3, 4 and 5 still hold the view [0, 1, 2, 3, 4, 5].

  17. (Reconcile) The revoked W1 is bypassed: workers 0 and 2 update their views to [0, 2, 3, 4, 5]; workers 3, 4 and 5 still hold [0, 1, 2, 3, 4, 5].

  18. (Reconcile) Worker 3 also updates: workers 0, 2 and 3 hold [0, 2, 3, 4, 5]; workers 4 and 5 still hold [0, 1, 2, 3, 4, 5].

  19. (Reconcile) W4 is now also revoked: workers 0, 2 and 3 hold [0, 2, 3, 4, 5]; worker 5 still holds [0, 1, 2, 3, 4, 5].

  20. (Reconcile) The revoked W4 is bypassed: workers 3 and 5 update to [0, 2, 3, 5]; workers 0 and 2 still hold [0, 2, 3, 4, 5].

  21. (Reconcile) Worker 0 updates: workers 0, 3 and 5 hold [0, 2, 3, 5]; worker 2 still holds [0, 2, 3, 4, 5].

  22. (Reconcile) Worker 2 updates: all surviving workers 0, 2, 3 and 5 hold [0, 2, 3, 5].

  23. (Accept) All views agree, so the survivors accept the new membership [0, 2, 3, 5].

  24. (Restart) Training restarts on the repaired ring of W0, W2, W3 and W5, each with membership [0, 2, 3, 5].
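The walkthrough above can be condensed into a toy simulation. This is a deliberately simplified sketch, not the paper's actual protocol: revoked workers are dropped from every survivor's membership view as reconcile notices travel the ring, the views are checked for agreement (accept), and the message count illustrates the O(N·K) bound from slide 14. All names (`repair_ring`, `views`, `messages`) are hypothetical.

```python
# Simplified sketch of the reconcile/bypass/accept phases from the
# walkthrough (illustrative only, not Spotnik's actual protocol).
def repair_ring(workers, revoked):
    """workers: initial ring membership; revoked: ids that disappeared.
    Returns the agreed membership and the number of reconcile messages."""
    survivors = [w for w in workers if w not in revoked]
    # Every survivor starts from the old membership view.
    views = {w: list(workers) for w in survivors}
    messages = 0
    for dead in revoked:
        # A reconcile notice for each revocation travels the ring once:
        # every survivor bypasses the dead worker and forwards the notice,
        # so the total is at most N * K messages for K revocations.
        for w in survivors:
            views[w].remove(dead)
            messages += 1
    # Accept: repair finishes once all surviving views agree.
    assert all(view == survivors for view in views.values())
    return survivors, messages
```

Running it on the slides' scenario, the ring [0, 1, 2, 3, 4, 5] with workers 1 and 4 revoked, yields the accepted membership [0, 2, 3, 5] after 8 reconcile messages (2 revocations × 4 survivors), well within the O(N·K) bound.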

  25. Spotnik: challenges and solutions (table repeated from slide 12).

  26. Atomic worker state update: pipelined synchronisation. (Timeline: the Sync. and Update phases of successive parameter partitions overlap.)

  27. Atomic worker state update: pipelined synchronisation, but revocations can happen meanwhile ➙ a partial update leads to inconsistency. (Same pipelined Sync./Update timeline.)

  28. Atomic worker state update: for atomicity, wait for all synchronisation communications to finish, and discard the updates if any communication fails. (Timeline: all Sync. phases complete, then a barrier, then the Updates.)
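The barrier semantics of slide 28 can be sketched with futures. The following is a hedged, hypothetical illustration (not the KungFu API): every partition's synchronisation is launched, a barrier waits for all of them, and the update is applied only if every synchronisation succeeded; otherwise the whole round is discarded, so no partial update ever reaches the model.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch of the atomic-update barrier. `sync_tasks` stands in
# for the per-partition synchronisation calls and `apply_update` for the
# parameter update; both are illustrative, not real KungFu functions.
def synchronise_atomically(sync_tasks, apply_update):
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(task) for task in sync_tasks]
        try:
            # The barrier: block until every synchronisation has finished.
            results = [f.result() for f in futures]
        except Exception:
            # A peer was revoked mid-round: discard the partial update.
            return None
    # Only now, with all partitions synchronised, apply the update.
    return apply_update(results)
```

Because `f.result()` re-raises any exception from a failed synchronisation, one revoked peer is enough to discard the round, which is exactly the all-or-nothing behaviour the slide asks for.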

  29. Spotnik: challenges and solutions (table repeated from slide 12).

  30. Adaptive synchronisation strategies: support a range of synchronisation primitives (collective and point-to-point synchronisation) and monitor a metric, the number of workers.

  31. Adaptive synchronisation strategies: support a range of synchronisation primitives (collective and point-to-point), monitor the number of workers, and define a policy in the beginning that decides when to choose which synchronisation strategy. (Diagram: a 3-worker cluster runs S-SGD; a 6-worker cluster runs AD-PSGD.)
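Such a policy can be written as a one-line rule. The strategy names come from the deck, but the threshold below is an illustrative assumption, not a value from the paper:

```python
# Hypothetical policy function: pick the synchronisation strategy from the
# monitored worker count. Strategy names are from the deck; the default
# threshold of 4 is an illustrative assumption.
def choose_strategy(num_workers, threshold=4):
    """Small clusters synchronise fully (S-SGD); larger clusters switch to
    the asynchronous, gossip-style AD-PSGD to stay network-efficient."""
    return "S-SGD" if num_workers <= threshold else "AD-PSGD"
```

With this policy, the diagram's 3-worker cluster would run S-SGD and the 6-worker cluster AD-PSGD; the policy is evaluated whenever the monitored worker count changes.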

  32. Evaluation: how does the recovery latency change with an increasing number of revocations? Cluster: 16 workers. Hardware: Azure NC6 VMs, Nvidia K80. Software: KungFu 0.2.1, TensorFlow 1.15. ML: ResNet50, ImageNet. Result: no significant increase in recovery latency as the number of revocations increases.

  33. Evaluation: how much does training slow down if we use atomic worker state updates? Two different setups. Setup A: cluster of 32 workers; hardware: Huawei ModelArts, Nvidia V100, InfiniBand; software: KungFu 0.2.1, TensorFlow 1.12. Setup B: cluster of 16 workers; hardware: Azure NC6 VMs, Nvidia K80; software: KungFu 0.2.1, TensorFlow 1.15. Result: the throughput decrease is small.

  34. Evaluation: how does the throughput change if the cluster changes? Cluster: up to 32 workers. Hardware: Azure NC6 VMs, Nvidia K80. Software: KungFu 0.2.1, TensorFlow 1.15. ML: ResNet50, ImageNet. (Plot: throughput over cluster sizes 5–30.)

  35. Evaluation: how does the throughput change if the cluster changes? (Repeated from slide 34.)

  36. Evaluation: how does the throughput change if the cluster changes? Same setup as slide 34; the plot marks the point of switching from S-SGD to AD-PSGD as the cluster grows. Result: changing clusters need adaptation.
