2016 IEEE International Workshop on Machine Learning for Signal Processing (MLSP16)
Parallel and Distributed Training of Neural Networks via Successive Convex Approximation
Authors: Paolo Di Lorenzo and Simone Scardapane
Contents
◮ Introduction: Overview; State-of-the-art
◮ The NEXT Framework: Problem Formulation; Derivation of the NEXT Algorithm
◮ Application to Distributed NN Training: Choice of surrogate function; Parallel computing of the surrogate function; A practical example
◮ Experimental results and conclusions: Experimental results; Conclusions
Content at a glance
◮ Setting: Training of neural networks (NNs) where data is distributed across agents with sparse connectivity (e.g., as in wireless sensor networks).
◮ State-of-the-art: Very limited literature on distributed optimization of the nonconvex objective functions required by NN training.
◮ Objective: We propose a general framework with theoretical guarantees that can be customized to multiple loss functions and regularizers, and that allows agents to exploit parallel multi-core processors.
Visual representation
Figure 1: Example of distributed learning with four agents, each holding a local dataset (S1-S4) and agreeing on a common (neural network) model.
State-of-the-art
Distributed learning with convex objective functions is well established:
◮ Kernel Ridge Regression [Predd, Kulkarni and Poor, IEEE SPM, 2006]
◮ Sparse Linear Regression [Mateos, Bazerque and Giannakis, IEEE TSP, 2010]
◮ Support Vector Machines [Forero, Cano and Giannakis, JMLR, 2010]
◮ Local convex solvers & communication [Jaggi et al., NIPS, 2014]
This reflects the availability of general-purpose methods for distributed optimization of convex losses, e.g., the ADMM.
Our contribution
1. Distributed learning of neural networks has mostly been considered with sub-optimal ensemble procedures (e.g., boosting), or using some form of centralized server [Dean et al., NIPS, 2012].
2. Similarly, the literature on distributed nonconvex optimization is recent and smaller.
3. We customize a novel framework called in-NEtwork nonconveX opTimization (NEXT), combining a convexification-decomposition technique with a dynamic consensus procedure [Di Lorenzo and Scutari, IEEE TSIPN, 2016].
Problem formulation
Distributed training of a neural network f(x; w) can be cast as the minimization of a social cost function G plus a regularization term r:

\min_{w} \; U(w) = G(w) + r(w) = \sum_{i=1}^{I} g_i(w) + r(w),  (1)

where g_i(\cdot) is the local cost function of agent i, defined as:

g_i(w) = \sum_{m \in S_i} l\big(d_{i,m}, f(x_{i,m}; w)\big),  (2)

where l(\cdot, \cdot) is a (convex) loss function, and (x_{i,m}, d_{i,m}) is a training example. Problem (1) is typically nonconvex due to f(x; w).
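As a concrete reference point, here is a minimal NumPy sketch of the local cost g_i(w) in (2); the single-hidden-layer tanh network and the squared loss are illustrative choices (consistent with the practical example and the experiments later in the talk), not requirements of the formulation:

```python
import numpy as np

def nn_forward(w, X, n_hidden):
    """Single-output, one-hidden-layer network f(x; w) with tanh units.

    w packs [W1 (n_hidden x d), b1 (n_hidden), w2 (n_hidden), b2 (scalar)]."""
    d = X.shape[1]
    W1 = w[:n_hidden * d].reshape(n_hidden, d)
    b1 = w[n_hidden * d:n_hidden * (d + 1)]
    w2 = w[n_hidden * (d + 1):-1]
    b2 = w[-1]
    return np.tanh(X @ W1.T + b1) @ w2 + b2

def local_cost(w, X_i, d_i, n_hidden):
    """g_i(w) in Eq. (2): sum of losses over agent i's local samples S_i."""
    residuals = d_i - nn_forward(w, X_i, n_hidden)
    return np.sum(residuals ** 2)   # squared loss l(d, f), used as illustration
```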
Network model
◮ The network is modeled as a digraph G[n] = (V, E[n]), where V = \{1, \dots, I\} is the set of agents, and E[n] is the set of (possibly) time-varying directed edges.
◮ Associated with each graph G[n], we introduce (possibly) time-varying weights c_{ij}[n] matching G[n]:

c_{ij}[n] = \begin{cases} \theta_{ij} \in [\vartheta, 1] & \text{if } j \in N_i^{\mathrm{in}}[n]; \\ 0 & \text{otherwise,} \end{cases}  (3)

for some \vartheta \in (0, 1), and define the matrix C[n] \triangleq (c_{ij}[n])_{i,j=1}^{I}.
◮ The weights define the communication topology.
Network assumptions
1. The sequence of graphs G[n] is B-strongly connected, i.e., the graph G_B[k] = (V, E_B[k]) with E_B[k] = \bigcup_{n=kB}^{(k+1)B-1} E[n] is strongly connected, for all k \geq 0 and some B > 0.
2. Every weight matrix C[n] in (3) is doubly stochastic, i.e., it satisfies C[n] \mathbf{1} = \mathbf{1} and \mathbf{1}^T C[n] = \mathbf{1}^T for all n.  (4)
3. Each agent i knows only its own cost function g_i (but not the entire G), and the common function r.
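The slides do not prescribe how to build C[n]; for undirected, time-invariant graphs, one common choice that satisfies the double-stochasticity assumption (4) is the Metropolis-Hastings rule, sketched below:

```python
import numpy as np

def metropolis_weights(adj):
    """Doubly stochastic weight matrix C from a symmetric 0/1 adjacency matrix.

    Metropolis-Hastings rule: c_ij = 1 / (1 + max(deg_i, deg_j)) for neighbors,
    with the diagonal filled so that every row (and column) sums to one."""
    I = adj.shape[0]
    deg = adj.sum(axis=1)
    C = np.zeros((I, I))
    for i in range(I):
        for j in range(I):
            if i != j and adj[i, j]:
                C[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        C[i, i] = 1.0 - C[i].sum()
    return C
```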
Step 1 - Local optimization
At every step, a local estimate \tilde{w}_i[n] is obtained by solving a strongly convex surrogate of the original cost function:

\tilde{w}_i[n] = \arg\min_{w_i} \; \tilde{g}_i(w_i; w_i[n]) + \pi_i[n]^T (w_i - w_i[n]) + r(w_i),  (5)

where

\pi_i[n] \triangleq \sum_{j \neq i} \nabla_w g_j(w_i[n]),  (6)

and \tilde{g}_i(w_i; w_i[n]) is a convex approximation of g_i at the point w_i[n], preserving the first-order properties of g_i. The quantity \pi_i[n] is not available to the agents and must be approximated.
Step 2 - Computation of new estimate
The new estimate is obtained as the convex combination:

z_i[n] = w_i[n] + \alpha[n] \left( \tilde{w}_i[n] - w_i[n] \right),  (7)

where \alpha[n] is a possibly time-varying step-size sequence.
Step 3 - Consensus phase
Each agent i updates w_i[n] with a consensus procedure:

w_i[n+1] = \sum_{j \in N_i^{\mathrm{in}}[n]} c_{ij}[n] \, z_j[n].  (8)

Finally, we replace \pi_i[n] with a local estimate \tilde{\pi}_i[n], asymptotically converging to \pi_i[n]. We can update the local estimate \tilde{\pi}_i[n] as:

\tilde{\pi}_i[n] \triangleq I \cdot y_i[n] - \nabla g_i(w_i[n]),  (9)

where y_i[n] is a local auxiliary variable that asymptotically tracks the average of the gradients, updated as:

y_i[n+1] \triangleq \sum_{j=1}^{I} c_{ij}[n] \, y_j[n] + \left( \nabla g_i(w_i[n+1]) - \nabla g_i(w_i[n]) \right).  (10)
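Putting Steps 1-3 and the tracking updates (9)-(10) together, a minimal centralized simulation of one synchronous NEXT round could look as follows; grad_g and solve_surrogate are placeholders for the agent-specific gradient and surrogate solver (e.g., one of the closed forms derived later):

```python
import numpy as np

def next_iteration(w, y, pi, C, grad_g, solve_surrogate, alpha):
    """One synchronous NEXT round over all I agents (simulation sketch).

    w, y, pi : (I, Q) arrays of local estimates, gradient trackers, and
               local estimates of pi_i[n] as in (9).
    C        : (I, I) doubly stochastic weight matrix.
    grad_g(i, w_i)                -> gradient of the local cost g_i at w_i.
    solve_surrogate(i, w_i, pi_i) -> solution w~_i of the local problem (5)."""
    I, _ = w.shape
    grads_old = np.stack([grad_g(i, w[i]) for i in range(I)])

    # Step 1: local surrogate optimization (5); Step 2: convex combination (7)
    w_tilde = np.stack([solve_surrogate(i, w[i], pi[i]) for i in range(I)])
    z = w + alpha * (w_tilde - w)

    # Step 3: consensus on the estimates (8)
    w_new = C @ z

    # Gradient tracking update (10) and new pi estimate (9)
    grads_new = np.stack([grad_g(i, w_new[i]) for i in range(I)])
    y_new = C @ y + (grads_new - grads_old)
    pi_new = I * y_new - grads_new
    return w_new, y_new, pi_new
```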
Convergence
Theorem
Let \{w[n]\}_n \triangleq \{(w_i[n])_{i=1}^{I}\}_n be the sequence generated by the algorithm, and let \{\bar{w}[n]\}_n \triangleq \{(1/I) \sum_{i=1}^{I} w_i[n]\}_n be its average. Suppose that the step-size sequence \{\alpha[n]\}_n is chosen so that \alpha[n] \in (0, 1] for all n,

\sum_{n=0}^{\infty} \alpha[n] = \infty \quad \text{and} \quad \sum_{n=0}^{\infty} \alpha[n]^2 < \infty.  (11)

If the sequence \{\bar{w}[n]\}_n is bounded, then (a) all its limit points are stationary solutions of the original problem; (b) all the sequences \{w_i[n]\}_n asymptotically agree, i.e., \|w_i[n] - \bar{w}[n]\| \to 0 as n \to \infty, for all i.
Proof.
See [Di Lorenzo and Scutari, IEEE TSIPN, 2016].
Choice of surrogate function
Strategy (a): Partial linearization (PL). We only linearize the NN mapping as:

\tilde{g}_i(w_i; w_i[n]) = \sum_{m \in S_i} l\big(d_{i,m}, \tilde{f}(w_i; w_i[n], x_{i,m})\big) + \frac{\tau_i}{2} \|w_i - w_i[n]\|^2,  (12)

where \tau_i \geq 0, and

\tilde{f}(w_i; w_i[n], x_{i,m}) = f(x_{i,m}; w_i[n]) + \nabla_w f(x_{i,m}; w_i[n])^T (w_i - w_i[n]).

Strategy (b): Full linearization (FL). We linearize g_i around w_i[n]:

\tilde{g}_i(w_i; w_i[n]) = g_i(w_i[n]) + \nabla g_i(w_i[n])^T (w_i - w_i[n]) + \frac{\tau_i}{2} \|w_i - w_i[n]\|^2.  (13)
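For reference, both surrogates can be written down in a few lines; this sketch assumes a squared loss for the PL case (as in the practical example below) and takes the anchor-point quantities (network outputs, Jacobians, gradient of g_i) as precomputed inputs:

```python
import numpy as np

def surrogate_fl(w, w_n, g_n, grad_g_n, tau):
    """Full linearization, Eq. (13): g_i linearized at w_i[n] plus a proximal term."""
    return g_n + grad_g_n @ (w - w_n) + 0.5 * tau * np.sum((w - w_n) ** 2)

def surrogate_pl(w, w_n, d_i, f_n, J_n, tau):
    """Partial linearization, Eq. (12), with a squared loss: only the NN mapping is linearized.

    f_n : network outputs at w_i[n] for the local samples, shape (M,).
    J_n : Jacobians of the outputs w.r.t. the weights at w_i[n], shape (M, Q)."""
    f_lin = f_n + J_n @ (w - w_n)               # linearized mapping f~ in (12)
    return np.sum((d_i - f_lin) ** 2) + 0.5 * tau * np.sum((w - w_n) ** 2)
```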
Parallel computing of the surrogate function
◮ Assume there are C cores available at each node i, and partition w_i = (w_{i,c})_{c=1}^{C} into C non-overlapping blocks.
◮ Choose \tilde{g}_i as additively separable in the blocks:

\tilde{g}_i(w_i; w_i[n]) = \sum_{c=1}^{C} \tilde{g}_{i,c}(w_{i,c}; w_{i,-c}[n]),

where each \tilde{g}_{i,c}(\cdot\,; w_{i,-c}[n]) satisfies the surrogate assumptions in the variable w_{i,c}.
◮ The surrogate optimization problem then decomposes into C separate strongly convex subproblems, which can be solved in parallel (a minimal sketch follows the list):

\tilde{w}_{i,c}[n] = \arg\min_{w_{i,c}} \; \tilde{g}_{i,c}(w_{i,c}; w_{i,-c}[n]) + \tilde{\pi}_{i,c}[n]^T (w_{i,c} - w_{i,c}[n]) + r(w_{i,c}).
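Since the surrogate is block-separable, the C subproblems can be dispatched to separate cores; a minimal sketch using a Python process pool (one possible realization, not the library's actual implementation) is:

```python
from concurrent.futures import ProcessPoolExecutor

def parallel_surrogate_step(block_data, solve_block, n_cores):
    """Solve the C strongly convex block subproblems independently.

    block_data  : list with one entry per block c, holding whatever
                  solve_block needs (e.g., w_{i,c}[n], pi_{i,c}[n], A/b blocks).
    solve_block : top-level (picklable) function returning w~_{i,c}[n] for one block."""
    with ProcessPoolExecutor(max_workers=n_cores) as pool:
        return list(pool.map(solve_block, block_data))
```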
A practical example I
We consider a squared loss l(d_{i,m}, f) = (d_{i,m} - f(x_{i,m}; w))^2 and an \ell_2-norm regularization r(w) = \lambda \|w\|_2^2. Define:

A_i[n] = \sum_{m=1}^{M} J_{i,m}[n]^T J_{i,m}[n] + \lambda I,  (14)

b_i[n] = \sum_{m=1}^{M} J_{i,m}[n]^T r_{i,m}[n],  (15)

with

[J_{i,m}[n]]_{kl} = \frac{\partial f_k(x_{i,m}; w_i[n])}{\partial w_l},  (16)

r_{i,m}[n] = d_{i,m} - f(x_{i,m}; w_i[n]) + J_{i,m}[n] \, w_i[n].  (17)
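A sketch of how A_i[n] and b_i[n] in (14)-(17) could be assembled; the finite-difference Jacobian is only a device to keep the sketch architecture-agnostic (in practice backpropagation would supply J_{i,m}[n]):

```python
import numpy as np

def jacobian_fd(f, w, eps=1e-6):
    """Numerical Jacobian of the network output(s) w.r.t. w, as in Eq. (16)."""
    f0 = np.atleast_1d(f(w))
    J = np.zeros((f0.size, w.size))
    for l in range(w.size):
        w_p = w.copy()
        w_p[l] += eps
        J[:, l] = (np.atleast_1d(f(w_p)) - f0) / eps
    return J

def build_A_b(f, w_n, X_i, d_i, lam):
    """Assemble A_i[n] and b_i[n] of Eqs. (14)-(15) for the squared-loss example."""
    Q = w_n.size
    A = lam * np.eye(Q)
    b = np.zeros(Q)
    for x, d in zip(X_i, d_i):
        J = jacobian_fd(lambda w: f(w, x), w_n)      # Eq. (16)
        r = d - f(w_n, x) + J @ w_n                  # Eq. (17)
        A += J.T @ J                                 # Eq. (14)
        b += J.T @ np.atleast_1d(r)                  # Eq. (15)
    return A, b
```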
A practical example II
The cost function at agent i and core c for the PL formulation can be cast as:

U_{i,c}(w_{i,c}; w_i[n], \tilde{\pi}_{i,c}[n]) = w_{i,c}^T A_{i,c,c}[n] \, w_{i,c} - 2 \big( b_{i,c}[n] + A_{i,c,-c}[n] \, w_{i,-c}[n] - 0.5 \cdot \tilde{\pi}_{i,c}[n] \big)^T w_{i,c},  (18)

where A_{i,c,c}[n] is the block of A_i[n] corresponding to the c-th partition, and similarly for A_{i,c,-c}[n]. The solution is given in closed form as:

\tilde{w}_{i,c}[n] = A_{i,c,c}[n]^{-1} \big( b_{i,c}[n] + A_{i,c,-c}[n] \, w_{i,-c}[n] - 0.5 \cdot \tilde{\pi}_{i,c}[n] \big).  (19)
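The block update (19), as written above, in code (assuming the relevant blocks of A_i[n], b_i[n] and \tilde{\pi}_i[n] have already been extracted):

```python
import numpy as np

def pl_block_update(A_cc, A_cmc, b_c, w_mc_n, pi_c):
    """Closed-form PL block solution, Eq. (19)."""
    rhs = b_c + A_cmc @ w_mc_n - 0.5 * pi_c
    return np.linalg.solve(A_cc, rhs)   # equivalent to A_{i,c,c}[n]^{-1} rhs
```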
A practical example III
In the FL case, the cost function at agent i and core c can be cast as:

U_{i,c}(w_{i,c}; w_i[n], \tilde{\pi}_{i,c}[n]) = (0.5 \cdot \tau_i + \lambda) \|w_{i,c}\|^2 - \big( \tau_i w_{i,c}[n] - \nabla_c g_i(w_i[n]) - \tilde{\pi}_{i,c}[n] \big)^T w_{i,c}.  (20)

This leads to the closed-form solution:

\tilde{w}_{i,c}[n] = \frac{1}{\tau_i + 2\lambda} \big( \tau_i w_{i,c}[n] - \nabla_c g_i(w_i[n]) - \tilde{\pi}_{i,c}[n] \big).  (21)
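The FL block update (21) is a single line; here tau and lam stand for the proximal and regularization constants τ_i and λ:

```python
import numpy as np

def fl_block_update(w_c_n, grad_c, pi_c, tau, lam):
    """Closed-form FL block solution, Eq. (21)."""
    return (tau * w_c_n - grad_c - pi_c) / (tau + 2.0 * lam)
```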
Experimental setup
◮ We use the Wisconsin breast cancer database (WDBC) as a distributed medical scenario.
◮ We perform a 3-fold cross-validation on the training data and, for every fold, we partition the training data uniformly among a predefined number of I = 5 agents with random connectivity.
◮ The overall cross-validation process is repeated 15 times.
◮ We consider a NN with 20 hidden nodes, for a total of Q = 641 free parameters, using a small regularization factor λ = 10^{-3}.
◮ In all cases, the step-sizes are chosen according to the rule (a minimal sketch follows the list):
  α[n] = α[n−1](1 − µα[n−1]),  n ≥ 1.  (22)
◮ An open-source library is available at: https://bitbucket.org/ispamm/parallel-and-distributed-neural-networks/
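A sketch of the step-size recursion (22); alpha0 and mu below are illustrative values, not necessarily those used in the experiments:

```python
def step_sizes(alpha0, mu, n_epochs):
    """Diminishing step-size rule (22): alpha[n] = alpha[n-1] * (1 - mu * alpha[n-1]).

    For 0 < alpha0 <= 1 and small mu > 0 this sequence satisfies the
    conditions in (11) (non-summable, but square-summable)."""
    alphas = [alpha0]
    for _ in range(1, n_epochs):
        alphas.append(alphas[-1] * (1.0 - mu * alphas[-1]))
    return alphas

# example: alphas = step_sizes(alpha0=0.1, mu=0.01, n_epochs=1000)
```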
Experimental results
Figure 2: (a) Evolution of the cost function (versus training epoch, log scale) for FL-SCA-NN and PL-SCA-NN with C = 1 and C = 16. (b) Relative decrease in training time [%] (with respect to the case C = 1) obtained when varying the number of processors per agent, for PL-SCA-NN.
Disagreement evolution
Figure 3: Behavior of the average disagreement (log scale) versus the number of local communication exchanges, for PL-SCA-NN with C = 1 and C = 16.
Conclusive remarks
◮ We have proposed a novel framework for parallel and distributed training of neural networks, where training data is distributed over a set of agents that are interconnected through a sparse network topology.
◮ The method hinges on a (primal) successive convex approximation framework, and leverages dynamic consensus to propagate the information over the network.
◮ To the best of our knowledge, the proposed method is the first available in the literature to solve nonconvex distributed learning problems with provable theoretical guarantees.
◮ We envision future works on other loss formulations, more com-