SLIDE 20: Parameter (and Model) consistency - decentralized
▪ Parameter exchange frequency can be controlled while still attaining convergence (see the sketch below)
▪ May also consider limited/slower distribution – gossip [Jin et al. 2016]
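As a concrete illustration of controlled exchange frequency, here is a minimal local-SGD-style sketch: each agent takes tau purely local steps, then all agents average their models with a blocking allreduce. The setup (MPI ranks as agents, the names tau and gradient_step) is an illustrative assumption, not from the slide.

```python
# Minimal sketch of frequency-controlled synchronous exchange (local SGD).
# Assumptions: x is a numpy float array of parameters, gradient_step is any
# local optimizer update, tau is the exchange period.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
tau = 8  # exchange period: larger means less communication, more local drift

def train_sync(x, num_steps, gradient_step):
    for step in range(1, num_steps + 1):
        x = gradient_step(x)               # purely local update
        if step % tau == 0:                # global synchronization point
            avg = np.empty_like(x)
            comm.Allreduce(x, avg, op=MPI.SUM)
            x = avg / comm.Get_size()      # adopt the model average
    return x
```

Raising tau trades communication volume for divergence between the agents' models between averaging points, which is exactly the knob the bullet above refers to.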
[Figure: execution timelines for three consistency models, with training agents 1…m along a time axis. Synchronous: all agents block at each collective allreduce of 𝒙, advancing together from 𝒙(0) to 𝒙(U). Stale synchronous / bounded asynchronous: agents keep local versions 𝒙(t,i) and run a bounded number of steps ahead before merging at an allreduce. Asynchronous: agents merge local versions without global barriers. A fourth panel shows decentralized gossip between agents r and k, which exchange and merge their local versions pairwise.]
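The stale-synchronous / bounded-asynchronous panel can be approximated by overlapping a nonblocking allreduce with local compute, so the average each agent folds in lags by at most one exchange period. A hedged sketch, assuming MPI-3 Iallreduce; tau and the 50/50 merge rule are illustrative choices, not the slide's prescription:

```python
# Sketch of bounded-staleness exchange: post a nonblocking allreduce, keep
# computing, and merge its (up to tau-steps-stale) result at the next
# exchange point. Assumptions: numpy parameter array x, any gradient_step.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
tau = 8  # staleness bound: the merged average lags at most tau local steps

def train_stale_sync(x, num_steps, gradient_step):
    req, send, avg = None, None, None
    for step in range(1, num_steps + 1):
        x = gradient_step(x)                       # overlaps communication
        if step % tau == 0:
            if req is not None:
                req.Wait()                         # previous exchange is due
                x = 0.5 * (x + avg / comm.Get_size())  # fold in stale average
            send = x.copy()                        # send buffer must stay fixed
            avg = np.empty_like(x)
            req = comm.Iallreduce(send, avg, op=MPI.SUM)
    if req is not None:
        req.Wait()                                 # drain the last exchange
        x = 0.5 * (x + avg / comm.Get_size())
    return x
```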
Peter H. Jin et al., “How to scale distributed deep learning?”, ML Systems Workshop at NIPS 2016
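For the gossip-style decentralized exchange, a minimal pairwise-averaging sketch in the spirit of [Jin et al. 2016]: it assumes MPI ranks as agents and a shared RNG seed per round so all ranks derive the same pairing; the pairing scheme is illustrative, not the paper's exact protocol.

```python
# One gossip round: agents pair up at random and average their models.
# Deadlock-free because each pair uses a single symmetric Sendrecv.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

def gossip_round(x, rng):
    perm = rng.permutation(size)                # identical on all ranks (shared seed)
    idx = int(np.flatnonzero(perm == rank)[0])
    if size % 2 == 1 and idx == size - 1:
        return x                                # odd agent count: one rank sits out
    partner = int(perm[idx + 1] if idx % 2 == 0 else perm[idx - 1])
    remote = np.empty_like(x)
    comm.Sendrecv(x, dest=partner, recvbuf=remote, source=partner)
    return 0.5 * (x + remote)                   # average with the partner's model
```

Each rank would call gossip_round(x, np.random.default_rng(round_id)) every few local steps; pairwise averages then spread information across agents without any global barrier, which is the "limited/slower distribution" the bullet refers to.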