Spotnik: Designing Distributed Machine Learning for Transient Cloud Resources

Spotnik: Designing Distributed Machine Learning for Transient Cloud Resources
Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch
Large-Scale Data & Systems (LSDS) Group, Imperial College London

Distributed ML
• Train a machine learning model
• Learn a model
• Data parallelism
• Ring all-reduce
[Figure: workers 0, 1 and 2, each exchanging an update Δ]
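The ring all-reduce named on this slide can be sketched as a small single-process simulation. This is an illustrative sketch, not code from the talk: the function name `ring_all_reduce` is hypothetical, each worker's update Δ is a plain list, and the vector is assumed to have exactly one element (chunk) per worker to keep the indexing short.

```python
from typing import List


def ring_all_reduce(grads: List[List[float]]) -> List[List[float]]:
    """Simulate a sum all-reduce over a ring of len(grads) workers.

    Assumption for this sketch: every vector has exactly one element
    (chunk) per worker.
    """
    n = len(grads)
    if any(len(g) != n for g in grads):
        raise ValueError("this sketch needs one chunk per worker")
    data = [list(g) for g in grads]

    # Phase 1, scatter-reduce: in step s, worker i sends chunk (i - s) % n
    # to its right neighbour, which adds it to its own copy.
    for s in range(n - 1):
        sends = [(i, (i - s) % n, data[i][(i - s) % n]) for i in range(n)]
        for i, chunk, value in sends:
            data[(i + 1) % n][chunk] += value

    # Worker i now holds the complete sum of chunk (i + 1) % n.
    # Phase 2, all-gather: in step s, worker i forwards chunk (i + 1 - s) % n
    # and the right neighbour overwrites its copy with the finished value.
    for s in range(n - 1):
        sends = [(i, (i + 1 - s) % n, data[i][(i + 1 - s) % n]) for i in range(n)]
        for i, chunk, value in sends:
            data[(i + 1) % n][chunk] = value
    return data


if __name__ == "__main__":
    deltas = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [100.0, 200.0, 300.0]]
    for worker, summed in enumerate(ring_all_reduce(deltas)):
        print(f"worker {worker}: {summed}")
```

Each worker only ever talks to its right neighbour, which is why a single revoked worker breaks the ring and why the repair procedure later in the deck is needed.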

Challenges of distributed ML
• Distributed ML is resource-hungry
• Accelerated resources are expensive
Example: Megatron-LM [3]
• Training of a BERT-like model
• 512 V100 GPUs
• One epoch (68,507 iterations) takes 2.1 days
• Cost on Azure: $92,613
[3] Shoeybi, Mohammad, et al. Megatron-LM: Training multi-billion parameter language models using GPU model parallelism, 2019

Transient cloud resources
• Examples: AWS Spot instances, Azure Spot VMs
• Follows the law of a free market
• Revocations
• Notifications
• Economic incentive
• Offers a cost reduction of up to 90% [1]
A Megatron-LM epoch would drop from $92,613 to $15,152
[1] https://azure.microsoft.com/en-us/pricing/spot/
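As a quick check of the two prices quoted on this slide, the relative saving can be computed directly. The helper name `relative_saving` is only illustrative; the two dollar amounts are the ones from the slide.

```python
def relative_saving(full_price: float, reduced_price: float) -> float:
    """Return the fraction of the full price that is saved."""
    if full_price <= 0:
        raise ValueError("full_price must be positive")
    return 1.0 - reduced_price / full_price


if __name__ == "__main__":
    on_demand = 92_613  # cost of one Megatron-LM epoch on Azure (slide)
    spot = 15_152       # cost of the same epoch on transient resources (slide)
    print(f"saving: {relative_saving(on_demand, spot):.1%}")
```

With the slide's numbers the saving comes out at roughly 83.6%, which sits inside the advertised ceiling of up to 90%.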

Implications of transient resources
• New workers become available or old workers get revoked
➙ System must cope with disappearing resources
• Changes can happen at any time
➙ System must ensure consistency of updates
[Figure: cluster size changing as workers 0 to 4 come and go]

Implications of transient resources
• New workers become available or old workers get revoked
➙ System must cope with disappearing resources
• Changes can happen at any time
➙ System must ensure consistency of updates
• Cluster sizes are unknown beforehand
➙ System must adapt to different conditions
[Figure: network efficiency against model accuracy for HogWild!, AD-PSGD, EA-SGD, SMA and S-SGD]

Current approach: Checkpoint & recovery
• TensorFlow and PyTorch
• Changes to the cluster are not considered
• Recovery takes about 20 seconds with ResNet50 and ImageNet
[Figure: timeline of Training, Checkpoint, Recovery, Training]

Current approaches: Hybrid
• Mix dedicated resources with transient resources
• Proteus [2]: Placement of the parameter server on dedicated resources so that the training state is safe
[Figure: parameter server on dedicated resources; workers 0, 1 and 2 on transient resources]
[2] Harlap et al. Proteus: agile ML elasticity through tiered reliability in dynamic resource markets. EuroSys, 2017

Spotnik
Challenge: Workers become available or get revoked
Solution: Reuse communication channels for synchronisation to repair the cluster
Challenge: Changes can happen at any time
Solution: Ensure atomic model updates by waiting for all synchronisations to finish first
Challenge: Cluster sizes are unknown beforehand
Solution: Change the synchronisation strategy based on the number of workers

Revocation recovery algorithm
• Handle revocations within a ring topology
• Total number of messages is bounded by O(N ⋅ K)
• K is the number of simultaneous revocations
• N is the number of workers
➙ Scale to many transient resources
• No reliance on revocation notifications

Revocation recovery algorithm: Repairing a broken all-reduce ring
Ring of six workers, W0 to W5. Membership list held by each worker at every step:
• Initial state: workers 0, 1, 2, 3, 4, 5 hold [0, 1, 2, 3, 4, 5]
• Reconcile, worker 1 is revoked: workers 0, 2, 3, 4, 5 hold [0, 1, 2, 3, 4, 5]
• Reconcile, worker 1 is bypassed: workers 0, 2 hold [0, 2, 3, 4, 5]; workers 3, 4, 5 hold [0, 1, 2, 3, 4, 5]
• Reconcile: workers 0, 2, 3 hold [0, 2, 3, 4, 5]; workers 4, 5 hold [0, 1, 2, 3, 4, 5]
• Reconcile, worker 4 is revoked: workers 0, 2, 3 hold [0, 2, 3, 4, 5]; worker 5 holds [0, 1, 2, 3, 4, 5]
• Reconcile, worker 4 is bypassed: workers 0, 2 hold [0, 2, 3, 4, 5]; workers 3, 5 hold [0, 2, 3, 5]
• Reconcile: workers 0, 3, 5 hold [0, 2, 3, 5]; worker 2 holds [0, 2, 3, 4, 5]
• Reconcile: workers 0, 2, 3, 5 hold [0, 2, 3, 5]
• Accept: workers 0, 2, 3, 5 hold [0, 2, 3, 5]
• Restart: the ring is rebuilt from W0, W2, W3, W5, all holding [0, 2, 3, 5]
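The walkthrough on these slides can be imitated with a small simulation. This is a hypothetical sketch and not Spotnik's implementation: the function `repair_ring`, its token-passing loop and its message counter are assumptions made for illustration, and all revocations are assumed to have happened before the repair starts, whereas on the slides worker 4 is revoked part-way through.

```python
from typing import Dict, List, Set, Tuple


def repair_ring(workers: List[int], revoked: Set[int]) -> Tuple[Dict[int, List[int]], int]:
    """Pass a membership view around the ring, bypassing revoked successors.

    Returns the final membership per surviving worker and the number of
    send attempts. Assumes at least one worker survives.
    """
    survivors = [w for w in workers if w not in revoked]
    if not survivors:
        raise ValueError("at least one worker must survive")

    view = list(workers)        # membership view carried by the message
    current = survivors[0]      # worker that starts the reconcile phase
    membership: Dict[int, List[int]] = {}
    stable_hops = 0             # consecutive hops that left the view unchanged
    messages = 0

    # Reconcile: stop once the view has travelled a full circle unchanged,
    # which is the point where every survivor can accept it.
    while stable_hops < len(view):
        changed = False
        while True:
            successor = view[(view.index(current) + 1) % len(view)]
            messages += 1       # one send attempt, successful or not
            if successor not in revoked:
                break
            view.remove(successor)   # bypass the revoked worker
            changed = True
        membership[current] = list(view)
        stable_hops = 0 if changed else stable_hops + 1
        current = successor
    return membership, messages


if __name__ == "__main__":
    final, sent = repair_ring([0, 1, 2, 3, 4, 5], revoked={1, 4})
    for worker, members in sorted(final.items()):
        print(f"worker {worker}: {members}")
    print(f"send attempts: {sent}")
```

In this sketch a revocation is only discovered when a send to the successor fails, which mirrors the slide's point that the repair does not rely on revocation notifications.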

Atomic worker state update
• Pipelined synchronisation
• Revocations can happen meanwhile
➙ Partial update leads to inconsistency
[Figure: three Sync. stages pipelined with three Parameter Update stages]

Atomic worker state update
• Atomicity: Wait for all synchronisation communications to finish
• Discard updates if communication fails
[Figure: three Sync. stages, then a Barrier, then three Parameter Update stages]
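The barrier idea can be sketched as a stage-then-commit update. This is a minimal illustration under stated assumptions, not the talk's code: `atomic_update`, the `sync` callback and the use of `ConnectionError` to signal a failed communication are all hypothetical, and tensors are plain lists.

```python
from typing import Callable, Dict, List

Tensor = List[float]


def atomic_update(params: Dict[str, Tensor],
                  grads: Dict[str, Tensor],
                  sync: Callable[[str, Tensor], Tensor],
                  lr: float = 0.1) -> bool:
    """Apply all parameter updates, or none of them.

    `sync` stands in for one synchronisation (for example an all-reduce)
    and is assumed to raise ConnectionError if a peer was revoked.
    """
    staged: Dict[str, Tensor] = {}
    try:
        # Wait for every synchronisation to finish before touching params.
        for name, grad in grads.items():
            staged[name] = sync(name, grad)
    except ConnectionError:
        # A communication failed: discard the staged results, keep params.
        return False

    # Barrier passed: every tensor was synchronised, so commit them all.
    for name, grad in staged.items():
        params[name] = [p - lr * g for p, g in zip(params[name], grad)]
    return True


if __name__ == "__main__":
    def flaky_sync(name: str, grad: Tensor) -> Tensor:
        if name == "layer2":
            raise ConnectionError("peer revoked")
        return grad

    weights = {"layer1": [1.0, 1.0], "layer2": [2.0, 2.0]}
    updates = {"layer1": [2.0, 2.0], "layer2": [4.0, 4.0]}
    print(atomic_update(weights, updates, flaky_sync, lr=0.5), weights)
```

Without the staging step, `layer1` would already have been updated when the synchronisation of `layer2` failed, which is the partial update the previous slide warns about.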

Adaptive synchronisation strategies
• Support a range of synchronisation primitives
• Collective and point-to-point synchronisation
• Monitor a metric
• Number of workers
• Define a policy in the beginning
• When to choose which sync strategy
[Figure: AD-PSGD on 6 workers (W0 to W5) and S-SGD on 3 workers (W0 to W2)]
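A policy of the kind described here can be sketched as a lookup from the monitored worker count to a strategy name. The function `choose_strategy`, the `POLICY` table and the threshold of 4 workers are assumptions for illustration; the slide only shows S-SGD with 3 workers and AD-PSGD with 6.

```python
from bisect import bisect_right
from typing import List, Tuple

# Hypothetical policy, defined once before training starts:
# (minimum number of workers, synchronisation strategy), sorted by size.
POLICY: List[Tuple[int, str]] = [(1, "S-SGD"), (4, "AD-PSGD")]


def choose_strategy(num_workers: int, policy: List[Tuple[int, str]] = POLICY) -> str:
    """Return the strategy whose lower bound is the largest one <= num_workers."""
    bounds = [lower for lower, _ in policy]
    index = bisect_right(bounds, num_workers) - 1
    if index < 0:
        raise ValueError("no strategy defined for this cluster size")
    return policy[index][1]


if __name__ == "__main__":
    for size in (3, 6):
        print(f"{size} workers -> {choose_strategy(size)}")
```

Keeping the policy as data means the monitored metric could be swapped for something other than the number of workers without changing the lookup.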

Evaluation
How does the recovery latency change with an increasing number of revocations?
• Cluster: 16 workers
• Hardware: Azure NC6 VMs, Nvidia K80
• Software: KungFu 0.2.1, TensorFlow 1.15
• ML: ResNet50, ImageNet
No significant increase in recovery latency as the number of revocations increases

Evaluation
How much does the training slow down if we use atomic worker state updates?
Setup 1 (*different setup)
• Cluster: 32 workers
• Hardware: Huawei ModelArts, Nvidia V100, InfiniBand
• Software: KungFu 0.2.1, TensorFlow 1.12
Setup 2
• Cluster: 16 workers
• Hardware: Azure NC6 VMs, Nvidia K80
• Software: KungFu 0.2.1, TensorFlow 1.15
Throughput decrease is small

Evaluation
How does the throughput change if the cluster changes?
• Cluster: up to 32 workers
• Hardware: Azure NC6 VMs, Nvidia K80
• Software: KungFu 0.2.1, TensorFlow 1.15
• ML: ResNet50, ImageNet
[Figure: throughput against cluster size (5 to 30 workers), annotated "Switching from S-SGD to AD-PSGD"]
Changing clusters need adaptation
