

SLIDE 1

Stochastic Gradient Push for Distributed Deep Learning

Mido Assran, Nicolas Loizou, Nicolas Ballas, Mike Rabbat

SLIDE 2

Data Parallel Training

Parallel stochastic gradient descent: each of the $n$ nodes evaluates a stochastic gradient of its local loss $\tilde f_i$ at the shared parameters, the gradients are combined by an inter-node average, and every node applies the same update:

$x^{(k+1)} = x^{(k)} - \gamma^{(k)} \frac{1}{n} \sum_{i=1}^{n} \nabla \tilde f_i(x^{(k)})$

[Figure: $n$ nodes, each holding a copy of $x$ and its local stochastic gradient $\nabla \tilde f_i(x)$, connected through an inter-node average.]

Equivalently, each node takes a local SGD step and the resulting parameters are averaged:

$x^{(k+1)} = \frac{1}{n} \sum_{i=1}^{n} \left( x^{(k)} - \gamma^{(k)} \nabla \tilde f_i(x^{(k)}) \right)$
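Below is a minimal sketch of this data-parallel update in PyTorch style, assuming a process group is already initialized; the function name and arguments are illustrative, not taken from the talk or the released code.

    import torch
    import torch.distributed as dist

    def parallel_sgd_step(model, loss_fn, batch, lr):
        """One data-parallel SGD step: each node computes a local stochastic
        gradient, the gradients are averaged across all n nodes with AllReduce,
        and every node applies the same update (illustrative sketch)."""
        inputs, targets = batch
        model.zero_grad()
        loss_fn(model(inputs), targets).backward()

        n = dist.get_world_size()
        for p in model.parameters():
            # inter-node average of stochastic gradients: (1/n) * sum_i grad_i
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(n)
            # x^(k+1) = x^(k) - gamma^(k) * averaged gradient
            p.data.add_(p.grad, alpha=-lr)

Since the same averaged gradient is applied to identical parameter copies on every node, this is exactly the parameter-averaging form shown above: averaging the locally updated copies $x^{(k)} - \gamma^{(k)} \nabla \tilde f_i(x^{(k)})$ yields the same $x^{(k+1)}$.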

SLIDE 3

Data Parallel Training: Existing Approaches

1. Parallel SGD (AllReduce gradient aggregation, all nodes)

SLIDE 4

Data Parallel Training: Existing Approaches

1. Parallel SGD (AllReduce gradient aggregation, all nodes): blocks all nodes

SLIDE 5

Data Parallel Training: Existing Approaches

1. Parallel SGD (AllReduce gradient aggregation, all nodes): blocks all nodes
2. D-PSGD (PushPull parameter aggregation, neighboring nodes)
3. AD-PSGD (PushPull parameter aggregation, pairs of nodes)

  • 1. Goyal et al., "Accurate, large minibatch SGD: Training ImageNet in 1 hour," arXiv preprint arXiv:1706.02677, 2017.
  • 2. Lian et al., "Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent," NeurIPS, 2017.
  • 3. Lian et al., "Asynchronous decentralized parallel stochastic gradient descent," ICML, 2018.

SLIDE 6

Data Parallel Training: Existing Approaches

1. Parallel SGD (AllReduce gradient aggregation, all nodes): blocks all nodes
2. D-PSGD (PushPull parameter aggregation, neighboring nodes): blocks subsets of nodes and requires deadlock avoidance
3. AD-PSGD (PushPull parameter aggregation, pairs of nodes): blocks subsets of nodes and requires deadlock avoidance


SLIDE 7

Data Parallel Training

Existing Approaches:
1. Parallel SGD (AllReduce gradient aggregation, all nodes): blocks all nodes
2. D-PSGD (PushPull parameter aggregation, neighboring nodes): blocks subsets of nodes and requires deadlock avoidance
3. AD-PSGD (PushPull parameter aggregation, pairs of nodes): blocks subsets of nodes and requires deadlock avoidance

Proposed Approach:
Stochastic Gradient Push (PushSum parameter aggregation): nonblocking, no deadlock avoidance required

SLIDE 8


Stochastic Gradient Push

Enables optimization over directed and time-varying graphs

  • 1. Nedic, A. and Olshevsky, A., "Stochastic gradient-push for strongly convex functions on time-varying directed graphs," IEEE Trans. Automatic Control, 2016.
SLIDE 9

Stochastic Gradient Push

Enables optimization over directed and time-varying graphs, and naturally enables asynchronous implementations.

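As a rough illustration of the PushSum-style aggregation (a single-process simulation sketch, not the authors' implementation): each node i keeps parameters x_i, a push-sum weight w_i, and de-biased parameters z_i = x_i / w_i; it takes a local SGD step using a gradient evaluated at z_i, then pushes scaled copies of (x_i, w_i) to its out-neighbors along directed edges. The uniform mixing weights and function names below are illustrative assumptions.

    import numpy as np

    def sgp_round(x, w, grad_fn, out_neighbors, lr):
        """One synchronous round of a PushSum-style SGP update (sketch).

        x: (n, d) array of node parameters     w: (n,) push-sum weights
        grad_fn(z): returns (n, d) stochastic gradients at the de-biased z
        out_neighbors[i]: nodes that i pushes to (each list includes i itself)
        """
        n, _ = x.shape
        z = x / w[:, None]                  # de-biased estimates z_i = x_i / w_i
        x_half = x - lr * grad_fn(z)        # local SGD step

        new_x = np.zeros_like(x)
        new_w = np.zeros(n)
        for i in range(n):
            dests = out_neighbors[i]
            p = 1.0 / len(dests)            # column-stochastic: node i's outgoing weights sum to 1
            for j in dests:                 # push scaled (x, w) along directed edges
                new_x[j] += p * x_half[i]
                new_w[j] += p * w[i]
        return new_x, new_w

Because each node's outgoing weights sum to one, the network-wide sum (and hence average) of the parameters is preserved at every round, and the de-biased ratios z_i track that average even on directed graphs; this is what removes the need for the symmetric PushPull exchanges used by D-PSGD and AD-PSGD.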
SLIDE 10

Stochastic Gradient Push

[Figure: timeline in which each node repeatedly runs local optimization while its gossip messages are handled by nonblocking communication, so communication overlaps with computation over time.]
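A sketch of what this overlap might look like with nonblocking point-to-point calls in PyTorch, assuming an initialized process group; the single out-peer/in-peer topology, the half-and-half mixing weights, and the function names are illustrative assumptions rather than the released implementation:

    import torch
    import torch.distributed as dist

    def overlapped_round(params, weight, out_peer, in_peer, local_step):
        """Overlap one gossip exchange with local computation (sketch).

        params: flat 1-D tensor of model parameters
        weight: 1-element tensor holding this node's push-sum weight
        local_step(): local work (e.g. the next forward/backward pass)
                      that runs while messages are in flight
        """
        # Keep half of (x, w) locally and push the other half to one out-peer.
        send_buf = torch.cat([0.5 * params, 0.5 * weight])
        recv_buf = torch.empty_like(send_buf)

        # Nonblocking sends/receives return immediately with request handles.
        reqs = [dist.isend(send_buf, dst=out_peer),
                dist.irecv(recv_buf, src=in_peer)]

        local_step()                        # computation overlaps with communication

        for r in reqs:                      # complete the exchange before mixing
            r.wait()
        params.mul_(0.5).add_(recv_buf[:-1])
        weight.mul_(0.5).add_(recv_buf[-1:])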

SLIDE 11

Distributed Stochastic Optimization

Setup: ImageNet, ResNet-50; 32 nodes (256 GPUs) interconnected via 10 Gbps Ethernet.

[Figure: validation accuracy (%) versus training time relative to SGD, comparing SGP, AllReduce SGD, AD-PSGD, and D-PSGD, with runs at 90 and 270 epochs.]

SLIDE 12

Data Parallelism: Stochastic Gradient Push

Algorithm features:
* asynchronous gossip
* nonblocking communication

SLIDE 13

Data Parallelism: Stochastic Gradient Push

Algorithm features:
* asynchronous gossip
* nonblocking communication
* convergence guarantees for smooth non-convex functions with arbitrary (bounded) message staleness

paper: arxiv.org/pdf/1811.10792.pdf
code: github.com/facebookresearch/stochastic_gradient_push
poster: Pacific Ballroom #183