SLIDE 24: Parameter (and Model) consistency - decentralized
▪ Parameter exchange frequency can be controlled, while still attaining convergence:
  ▪ Using Gossip Algorithms [Jin et al. 2016] or Partial Collectives [Li et al. 2020] (see the gossip sketch below)
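To make the gossip idea concrete, here is a minimal sketch, assuming mpi4py, NumPy weight vectors, and a power-of-two number of agents; it is an illustration only, not the algorithm of Jin et al., and names such as gossip_average and exchange_every are hypothetical. Each agent periodically averages its weights with a single partner instead of joining a global allreduce, so the exchange period directly controls how often parameters are exchanged.

```python
# Hypothetical sketch of gossip-style parameter averaging with mpi4py
# (illustration only, not the scheme from Jin et al. 2016).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()              # number of training agents, assumed a power of two

def gossip_average(w, step, exchange_every=4):
    """Average local weights w with one partner every `exchange_every` steps."""
    if size == 1 or step % exchange_every != 0:
        return w                    # no exchange this step: lower frequency, more staleness
    # Hypercube-style pairing: partners change every exchange round and are symmetric.
    dim = (step // exchange_every) % int(np.log2(size))
    partner = rank ^ (1 << dim)
    w_partner = np.empty_like(w)
    comm.Sendrecv(w, dest=partner, recvbuf=w_partner, source=partner)
    return 0.5 * (w + w_partner)    # pairwise average instead of a global mean
```

Raising exchange_every cuts communication at the cost of more drift between the agents' parameter copies, which is the consistency trade-off shown in the figure below.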
[Figure: weight-exchange timelines of m training agents for three consistency schemes.
Synchronous: after every step, all agents block in a collective allreduce of the weights w, so every agent moves through identical versions x(0), x(1), ..., x(T).
Stale Synchronous / Bounded Asynchronous: agents keep local versions x(1,1), x(2,1), ... that may drift for a bounded number of steps before an allreduce merges them.
Asynchronous: agents r and k advance through their own versions x(1,r), x(2,r), ... and merge updates without a global synchronization point.]
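As a point of comparison with the gossip sketch above, here is a minimal sketch of the synchronous panel of the figure, again assuming mpi4py and NumPy; synchronous_step, local_gradient, and lr are placeholder names. Every agent blocks in one allreduce per step, so all m agents apply the same averaged gradient and hold identical weights x(t) throughout training.

```python
# Sketch of the synchronous scheme from the figure: one blocking allreduce per step.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
m = comm.Get_size()                      # number of training agents

def synchronous_step(x, local_gradient, lr=0.1):
    g_local = local_gradient(x)          # gradient on this agent's mini-batch (placeholder)
    g_sum = np.empty_like(g_local)
    comm.Allreduce(g_local, g_sum, op=MPI.SUM)   # all agents wait here (the All-Reduce box)
    return x - lr * (g_sum / m)          # identical update -> identical x(t+1) on every agent
```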
Peter H. Jin et al., "How to scale distributed deep learning?", ML Systems Workshop at NIPS 2016
Shigang Li et al., "Taming unbalanced training workloads in deep learning with partial collective operations", PPoPP 2020
[Figure: parameter (and model) consistency spectrum for decentralized training, from consistent to inconsistent: Synchronous SGD at the consistent end, Ensemble Learning at the inconsistent end, with Stale-Synchronous SGD, Model Averaging (e.g., elastic), and Asynchronous SGD (HOGWILD!) in between.]
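The stale-synchronous point on this spectrum is usually defined by a single staleness bound. A minimal sketch of that condition follows; the clock bookkeeping and the helper name may_proceed are assumptions for illustration, not any particular system's API. An agent may only start a new iteration while it is at most `staleness` steps ahead of the slowest agent.

```python
# Sketch of bounded staleness (stale-synchronous SGD). How the iteration counters of
# all agents are tracked (e.g., by a coordinator) is assumed to exist elsewhere.
def may_proceed(my_clock, all_clocks, staleness=2):
    """True if this agent may start its next iteration under the staleness bound."""
    return my_clock - min(all_clocks) <= staleness

# staleness = 0 recovers synchronous SGD; an unbounded staleness approaches
# HOGWILD!-style asynchrony, where agents never wait for each other.
```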