

SLIDE 1

2016 IEEE International Workshop on Machine Learning for Signal Processing (MLSP’16)

Parallel and Distributed Training of Neural Networks via Successive Convex Approximation

Authors: Paolo Di Lorenzo and Simone Scardapane

SLIDE 2

Contents

◮ Introduction: Overview; State-of-the-art
◮ The NEXT Framework: Problem Formulation; Derivation of the NEXT Algorithm
◮ Application to Distributed NN Training: Choice of surrogate function; Parallel computing of the surrogate function; A practical example
◮ Experimental results and conclusions: Experimental results; Conclusions

SLIDE 3

Content at a glance

◮ Setting: training of neural networks (NNs) where data is distributed across agents with sparse connectivity (e.g., as in wireless sensor networks).
◮ State-of-the-art: very limited literature on the distributed optimization of nonconvex objective functions, as required by NN training.
◮ Objective: we propose a general framework with theoretical guarantees that can be customized to multiple loss functions and regularizers, and that allows agents to exploit parallel multi-core processors.

SLIDE 4

Visual representation

[Diagram: four nodes, each holding a local dataset S1–S4 and a copy of the model, connected by input/output links.]

Figure 1: Example of distributed learning with four agents agreeing on a common (neural network) model.

SLIDE 5

State-of-the-art

Distributed learning with convex objective functions is well established:

◮ Kernel ridge regression [Predd, Kulkarni and Poor, IEEE SPM, 2006]

◮ Sparse linear regression [Mateos, Bazerque and Giannakis, IEEE TSP, 2010]

◮ Support vector machines [Forero, Cano and Giannakis, JMLR, 2010]

◮ Local convex solvers & communication [Jaggi et al., NIPS, 2014]

This reflects the availability of general-purpose methods for distributed optimization of convex losses, e.g., the ADMM.

SLIDE 6

Our contribution

1. Distributed learning of neural networks has mostly been considered with sub-optimal ensemble procedures (e.g., boosting), or using some form of centralized server [Dean et al., NIPS, 2012].

2. Similarly, the literature on distributed nonconvex optimization is recent and smaller.

3. We customize a novel framework called in-NEtwork nonconveX opTimization (NEXT), combining a convexification-decomposition technique with a dynamic consensus procedure [Di Lorenzo and Scutari, IEEE TSIPN, 2016].

SLIDE 7

Contents

◮ Introduction: Overview; State-of-the-art
◮ The NEXT Framework: Problem Formulation; Derivation of the NEXT Algorithm
◮ Application to Distributed NN Training: Choice of surrogate function; Parallel computing of the surrogate function; A practical example
◮ Experimental results and conclusions: Experimental results; Conclusions

SLIDE 8

Problem formulation

Distributed training of a neural network f(w; x) can be cast as the minimization of a social cost function G plus a regularization term r:

min_w U(w) = G(w) + r(w) = Σ_{i=1}^{I} g_i(w) + r(w),   (1)

where g_i(·) is the local cost function of agent i, defined as:

g_i(w) = Σ_{m∈S_i} l(d_{i,m}, f(w; x_{i,m})),   (2)

where l(·, ·) is a (convex) loss function, and (x_{i,m}, d_{i,m}) is a training example. Problem (1) is typically nonconvex due to the NN mapping f(w; x).
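As a concrete sketch of the cost structure in (1)–(2), the snippet below evaluates a local cost g_i and the social cost U for a tiny one-hidden-layer network with squared loss and ℓ2 regularization. All names, sizes, and the toy architecture are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def nn_forward(w, X, n_hidden=3):
    # Tiny one-hidden-layer network f(w; x); w packs both weight matrices.
    n_in = X.shape[1]
    W1 = w[:n_in * n_hidden].reshape(n_in, n_hidden)
    W2 = w[n_in * n_hidden:].reshape(n_hidden, 1)
    return np.tanh(X @ W1) @ W2

def local_cost(w, X_i, d_i):
    # g_i(w): squared loss summed over agent i's local samples S_i, as in (2).
    return float(np.sum((d_i - nn_forward(w, X_i)) ** 2))

def social_cost(w, datasets, lam=1e-3):
    # U(w) = sum_i g_i(w) + r(w), with r(w) = lam * ||w||_2^2, as in (1).
    return sum(local_cost(w, X, d) for X, d in datasets) + lam * float(w @ w)
```

Because each g_i depends only on agent i's data, no single agent can evaluate U(w) alone; this is precisely what makes the distributed setting nontrivial.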

SLIDE 9

Network model

◮ The network is modeled as a digraph G[n] = (V, E[n]), where V = {1, ..., I} is the set of agents, and E[n] is the set of (possibly) time-varying directed edges.

◮ Associated with each graph G[n], we introduce (possibly) time-varying weights c_ij[n] matching G[n]:

c_ij[n] = θ_ij ∈ [ϑ, 1] if j ∈ N_i^in[n]; 0 otherwise,   (3)

for some ϑ ∈ (0, 1), and define the matrix C[n] ≜ (c_ij[n])_{i,j=1}^{I}.

◮ The weights define the communication topology.

SLIDE 10

Network assumptions

1. The sequence of graphs G[n] is B-strongly connected, i.e., the graph G_B[k] = (V, E_B[k]), with E_B[k] = ∪_{n=kB}^{(k+1)B−1} E[n], is strongly connected for all k ≥ 0 and some B > 0.

2. Every weight matrix C[n] in (3) is doubly stochastic, i.e., it satisfies

C[n] 1 = 1 and 1^T C[n] = 1^T for all n.   (4)

3. Each agent i knows only its own cost function g_i (but not the entire G), and the common function r.
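One standard way to satisfy the doubly stochastic condition (4) on an undirected graph is the Metropolis–Hastings weight rule. A minimal sketch (function name and the adjacency-matrix interface are illustrative):

```python
import numpy as np

def metropolis_weights(adj):
    # Build c_ij from an undirected adjacency matrix. The resulting matrix is
    # symmetric and row-stochastic, hence doubly stochastic: C 1 = 1, 1^T C = 1^T.
    I = adj.shape[0]
    deg = adj.sum(axis=1)
    C = np.zeros((I, I))
    for i in range(I):
        for j in range(I):
            if i != j and adj[i, j]:
                C[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        C[i, i] = 1.0 - C[i].sum()   # self-weight absorbs the remainder
    return C
```

Each agent can compute its own row using only the degrees of its neighbors, which fits the decentralized setting.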

SLIDE 11

Step 1 - Local optimization

At every step, a local estimate w̃_i[n] is obtained by solving a strongly convex surrogate of the original cost function:

w̃_i[n] = argmin_{w_i} g̃_i(w_i; w_i[n]) + π_i[n]^T (w_i − w_i[n]) + r(w_i),   (5)

where

π_i[n] ≜ Σ_{j≠i} ∇_w g_j(w_i[n]),   (6)

and g̃_i(w_i; w_i[n]) is a convex approximation of g_i at the point w_i[n], preserving the first-order properties of g_i. The quantity π_i[n] is not available to the agents and must be approximated.

SLIDE 12

Step 2 - Computation of new estimate

The new estimate is obtained as the convex combination:

z_i[n] = w_i[n] + α[n] (w̃_i[n] − w_i[n]),   (7)

where α[n] is a (possibly time-varying) step-size sequence.

SLIDE 13

Step 3 - Consensus phase

Each agent i updates w_i[n] with a consensus step over its in-neighbors:

w_i[n + 1] = Σ_{j∈N_i^in[n]} c_ij[n] z_j[n].   (8)

Finally, we replace π_i[n] with a local estimate π̃_i[n], asymptotically converging to π_i[n]:

π̃_i[n] ≜ I · y_i[n] − ∇g_i(w_i[n]),   (9)

where y_i[n] is a local auxiliary variable that asymptotically tracks the average of the gradients, updated as:

y_i[n + 1] ≜ Σ_{j=1}^{I} c_ij[n] y_j[n] + (∇g_i(w_i[n + 1]) − ∇g_i(w_i[n])).   (10)
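Putting Steps 1–3 together, here is a minimal numerical sketch of the iteration (5)–(10), using the FL surrogate (whose local problem has a closed form) on a toy convex quadratic g_i(w) = 0.5‖w − a_i‖², so the common stationary point is known to be the average of the a_i. The constant step size, ring topology, and all constants are illustrative choices, not the paper's setup (the theory uses a diminishing step size).

```python
import numpy as np

rng = np.random.default_rng(0)
I, Q = 5, 3                                       # agents, parameters
a = rng.normal(size=(I, Q))                       # g_i(w) = 0.5 * ||w - a_i||^2
grad = lambda i, w: w - a[i]                      # local gradient of g_i

# Doubly stochastic consensus weights on a ring, matching assumption (4).
C = np.zeros((I, I))
for i in range(I):
    C[i, i] = 0.5
    C[i, (i + 1) % I] = C[i, (i - 1) % I] = 0.25

tau, alpha = 5.0, 0.05                            # surrogate weight, constant step size
w = rng.normal(size=(I, Q))                       # local estimates w_i
y = np.stack([grad(i, w[i]) for i in range(I)])   # y_i[0] = grad g_i(w_i[0])

for n in range(1000):
    g_now = np.stack([grad(i, w[i]) for i in range(I)])
    pi = I * y - g_now                            # (9): local estimate of pi_i
    w_tilde = w - (g_now + pi) / tau              # (5) with FL surrogate, closed form
    z = w + alpha * (w_tilde - w)                 # (7): convex combination
    w = C @ z                                     # (8): consensus on the z_j
    g_new = np.stack([grad(i, w[i]) for i in range(I)])
    y = C @ y + g_new - g_now                     # (10): gradient tracking

w_star = a.mean(axis=0)                           # minimizer of sum_i g_i
```

All agents end up agreeing on w_star, illustrating both consensus and convergence to a stationary point of Σ_i g_i; with a nonconvex g_i one would only obtain stationarity, not global optimality.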

SLIDE 14

Convergence

Theorem

Let {w[n]}_n ≜ {(w_i[n])_{i=1}^{I}}_n be the sequence generated by the algorithm, and let {w̄[n]}_n ≜ {(1/I) Σ_{i=1}^{I} w_i[n]}_n be its average. Suppose that the step-size sequence {α[n]}_n is chosen so that α[n] ∈ (0, 1] for all n, with

Σ_{n=0}^{∞} α[n] = ∞ and Σ_{n=0}^{∞} α[n]^2 < ∞.   (11)

If the sequence {w̄[n]}_n is bounded, then (a) all its limit points are stationary solutions of the original problem; (b) all the sequences {w_i[n]}_n asymptotically agree, i.e., w_i[n] − w̄[n] → 0 as n → ∞, for all i.

Proof.

See [Di Lorenzo and Scutari, IEEE TSIPN, 2016].

SLIDE 15

Contents

◮ Introduction: Overview; State-of-the-art
◮ The NEXT Framework: Problem Formulation; Derivation of the NEXT Algorithm
◮ Application to Distributed NN Training: Choice of surrogate function; Parallel computing of the surrogate function; A practical example
◮ Experimental results and conclusions: Experimental results; Conclusions

SLIDE 16

Choice of surrogate function

Strategy (a): Partial linearization (PL). We only linearize the NN mapping:

g̃_i(w_i; w_i[n]) = Σ_{m∈S_i} l(d_{i,m}, f̃(w_i; w_i[n], x_{i,m})) + (τ_i/2) ‖w_i − w_i[n]‖^2,   (12)

where τ_i ≥ 0, and

f̃(w_i; w_i[n], x_{i,m}) = f(w_i[n]; x_{i,m}) + ∇_w f(w_i[n]; x_{i,m})^T (w_i − w_i[n]).

Strategy (b): Full linearization (FL). We linearize g_i around w_i[n]:

g̃_i(w_i; w_i[n]) = g_i(w_i[n]) + ∇g_i(w_i[n])^T (w_i − w_i[n]) + (τ_i/2) ‖w_i − w_i[n]‖^2.   (13)
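The key property of the PL surrogate is that the linearized mapping f̃ in (12) matches f to first order at the expansion point w_i[n]. A small numerical sketch, with a toy scalar model and a finite-difference gradient standing in for the NN and its Jacobian (both are illustrative assumptions):

```python
import numpy as np

def f(w, x):
    # Toy scalar model standing in for the NN output f(w; x).
    return float(np.tanh(w @ x) * w[0])

def grad_f(w, x, eps=1e-6):
    # Central finite-difference gradient of f with respect to w.
    g = np.zeros_like(w)
    for k in range(w.size):
        e = np.zeros_like(w)
        e[k] = eps
        g[k] = (f(w + e, x) - f(w - e, x)) / (2 * eps)
    return g

def f_tilde(w, w_n, x):
    # Linearization of f around w_n, as used inside the PL surrogate (12).
    return f(w_n, x) + float(grad_f(w_n, x) @ (w - w_n))
```

The surrogate agrees with f at w_n and its error shrinks quadratically as w approaches w_n, which is the "first-order properties" requirement on g̃_i.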

SLIDE 17

Parallel computing of the surrogate function

◮ Assume there are C cores available at each node i, and partition w_i = (w_{i,c})_{c=1}^{C} into C nonoverlapping blocks.

◮ Choose g̃_i additively separable in the blocks:

g̃_i(w_i; w_i[n]) = Σ_{c=1}^{C} g̃_{i,c}(w_{i,c}; w_{i,−c}[n]),

where each g̃_{i,c}(·; w_{i,−c}[n]) satisfies the surrogate assumptions in the variable w_{i,c}.

◮ The surrogate optimization problem then decomposes into C separate strongly convex subproblems:

w̃_{i,c}[n] = argmin_{w_{i,c}} g̃_{i,c}(w_{i,c}; w_{i,−c}[n]) + π̃_{i,c}[n]^T (w_{i,c} − w_{i,c}[n]) + r(w_{i,c}).
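To see why separability enables parallelism, this sketch minimizes a separable strongly convex surrogate once jointly and once block-by-block, and checks that the answers coincide. The quadratic form, the targets t, and all sizes are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
Q, C_cores, tau = 12, 4, 2.0
blocks = np.split(np.arange(Q), C_cores)     # nonoverlapping partition of w_i
t = rng.normal(size=Q)                       # targets of a separable quadratic surrogate
pi = rng.normal(size=Q)                      # stand-in for the correction term pi_i[n]

# Each core c solves its own strongly convex subproblem, independently of the others:
#   min_{w_c} (tau/2) * ||w_c - t_c||^2 + pi_c^T w_c   =>   w_c = t_c - pi_c / tau
w_parallel = np.concatenate([t[b] - pi[b] / tau for b in blocks])

# The joint minimizer of the full separable surrogate is identical:
w_joint = t - pi / tau
```

Since no subproblem depends on another block's variable, the C solves can run on C cores with no synchronization until the results are concatenated.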

SLIDE 18

A practical example I

We consider a squared loss l(·, ·) = (d_{i,m} − f(w; x_{i,m}))^2, and an ℓ2-norm regularization r(w) = λ ‖w‖_2^2. Define:

A_i[n] = Σ_{m=1}^{M} J_{i,m}[n]^T J_{i,m}[n] + λI,   (14)

b_i[n] = Σ_{m=1}^{M} J_{i,m}[n]^T r_{i,m}[n],   (15)

with

[J_{i,m}[n]]_{kl} = ∂f_k(w_i[n]; x_{i,m}) / ∂w_l,   (16)

r_{i,m}[n] = d_{i,m} − f(w_i[n]; x_{i,m}) + J_{i,m}[n] w_i[n].   (17)

SLIDE 19

A practical example II

The cost function at agent i and core c for the PL formulation can be cast as:

Ũ_{i,c}(w_{i,c}; w_i[n], π̃_{i,c}[n]) = w_{i,c}^T A_{i,c,c}[n] w_{i,c} − 2 (b_{i,c}[n] + A_{i,c,−c}[n] w_{i,−c}[n] − 0.5 π̃_{i,c}[n])^T w_{i,c},   (18)

where A_{i,c,c}[n] is the block of A_i[n] corresponding to the c-th partition, and similarly for A_{i,c,−c}[n]. The solution is given in closed form as:

w̃_{i,c}[n] = A_{i,c,c}[n]^{−1} (b_{i,c}[n] + A_{i,c,−c}[n] w_{i,−c}[n] − 0.5 π̃_{i,c}[n]).   (19)
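A numerical sketch of the PL closed form (19): build a block A_{i,c,c}[n] with the Gauss-Newton structure of (14) (so it is symmetric positive definite), lump the entire linear term of (18) into a single vector v, and check that solving the linear system minimizes the quadratic cost. All sizes and the random data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
Qc, K, M, lam = 4, 2, 10, 1e-3               # block size, outputs, samples, lambda

# A_cc = sum_m J_m^T J_m + lam*I over one block: symmetric positive definite.
J = rng.normal(size=(M, K, Qc))
A_cc = sum(J[m].T @ J[m] for m in range(M)) + lam * np.eye(Qc)

# v lumps b_{i,c}[n] + A_{i,c,-c}[n] w_{i,-c}[n] - 0.5 * pi_{i,c}[n] from (18)-(19).
v = rng.normal(size=Qc)

w_c = np.linalg.solve(A_cc, v)               # closed form (19)

def U(w):
    # Quadratic block cost (18): w^T A_cc w - 2 v^T w.
    return float(w @ A_cc @ w - 2 * v @ w)
```

In practice one would solve the linear system (e.g., via a Cholesky factorization of A_cc) rather than form an explicit inverse.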

SLIDE 20

A practical example III

In the FL case, the cost function at agent i and core c can be cast as:

Ũ_{i,c}(w_{i,c}; w_i[n], π̃_{i,c}[n]) = (0.5 τ_i + λ) ‖w_{i,c}‖^2 − (τ_i w_{i,c}[n] − ∇_c g_i(w_i[n]) − π̃_{i,c}[n])^T w_{i,c}.   (20)

This leads to the closed-form solution:

w̃_{i,c}[n] = (1/(τ_i + 2λ)) (τ_i w_{i,c}[n] − ∇_c g_i(w_i[n]) − π̃_{i,c}[n]).   (21)
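The FL minimizer follows by setting the gradient of (20) to zero: (τ_i + 2λ) w_{i,c} = τ_i w_{i,c}[n] − ∇_c g_i(w_i[n]) − π̃_{i,c}[n]. A quick numerical check, with all vectors as random stand-ins for the quantities in (20):

```python
import numpy as np

rng = np.random.default_rng(3)
Qc, tau, lam = 5, 2.0, 1e-3
w_old = rng.normal(size=Qc)                  # w_{i,c}[n]
g_c = rng.normal(size=Qc)                    # block gradient grad_c g_i(w_i[n])
pi_c = rng.normal(size=Qc)                   # pi_{i,c}[n]

def U(w):
    # FL block cost (20): (0.5*tau + lam)*||w||^2 - (tau*w_old - g_c - pi_c)^T w.
    return float((0.5 * tau + lam) * w @ w - (tau * w_old - g_c - pi_c) @ w)

w_new = (tau * w_old - g_c - pi_c) / (tau + 2 * lam)   # closed form (21)
```

Unlike the PL case, no linear system needs to be solved here, which is what makes the FL update cheaper per iteration.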

SLIDE 21

Contents

◮ Introduction: Overview; State-of-the-art
◮ The NEXT Framework: Problem Formulation; Derivation of the NEXT Algorithm
◮ Application to Distributed NN Training: Choice of surrogate function; Parallel computing of the surrogate function; A practical example
◮ Experimental results and conclusions: Experimental results; Conclusions

SLIDE 22

Experimental setup

◮ We use the Wisconsin breast cancer database (WDBC) as a distributed medical scenario.

◮ We perform a 3-fold cross-validation on the training data, and for every fold we partition the training data uniformly among a predefined number of I = 5 agents with random connectivity.

◮ The overall cross-validation process is repeated 15 times.

◮ We consider a NN with 20 hidden nodes, for a total of Q = 641 free parameters, using a small regularization factor λ = 10^{−3}.

◮ In all cases, the step-sizes are chosen according to the rule:

α[n] = α[n − 1] (1 − μ α[n − 1]), n ≥ 1.   (22)

◮ An open-source library is available: https://bitbucket.org/ispamm/parallel-and-distributed-neural-networks/
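The recursion (22) generates a diminishing step-size sequence (for large n it decays like 1/(μn), so it satisfies the summability conditions in (11)). A minimal sketch, with α[0] and μ as illustrative values rather than the paper's settings:

```python
def step_sizes(alpha0=0.9, mu=0.01, n_steps=1000):
    # alpha[n] = alpha[n-1] * (1 - mu * alpha[n-1]), n >= 1, as in (22).
    alphas = [alpha0]
    for _ in range(n_steps - 1):
        a = alphas[-1]
        alphas.append(a * (1.0 - mu * a))
    return alphas
```

As long as α[0] ∈ (0, 1] and μα[0] < 1, every factor (1 − μα[n−1]) lies in (0, 1), so the sequence stays positive and strictly decreases.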

SLIDE 23

Experimental results

[Plots: (a) cost function vs. epoch (log scale, epochs 200–1000) for PL-SCA-NN (C = 1), PL-SCA-NN (C = 16) and FL-SCA-NN; (b) relative training-time decrease vs. processors per agent (1–32) for PL-SCA-NN.]

Figure 2: (a) Cost function's evolution for FL-SCA-NN and PL-SCA-NN. (b) Relative decrease in training time (with respect to the case C = 1) obtained when varying the number of processors.

SLIDE 24

Disagreement evolution

[Plot: average disagreement vs. epoch (log scale, 10^{−8} to 10^{0}, epochs 200–1000) for PL-SCA-NN (C = 1) and PL-SCA-NN (C = 16).]

Figure 3: Behavior of the average disagreement versus the number of local communication exchanges.

SLIDE 25

Conclusive remarks

◮ We have proposed a novel framework for parallel and distributed training of neural networks, where training data is distributed over a set of agents interconnected through a sparse network topology.

◮ The method hinges on a (primal) successive convex approximation framework, and leverages dynamic consensus to propagate information over the network.

◮ To the best of our knowledge, the proposed method is the first available in the literature to solve nonconvex distributed learning problems with provable theoretical guarantees.

◮ We envision future work on other loss formulations, more complex network structures (e.g., recurrent NNs), and stochastic updates of the surrogate functions.

SLIDE 26

Questions?