

SLIDE 1

2016 IEEE International Workshop on Machine Learning for Signal Processing (MLSP’16)

Parallel and Distributed Training of Neural Networks via Successive Convex Approximation

Authors: Paolo Di Lorenzo and Simone Scardapane

SLIDE 2

Contents

◮ Introduction: Overview; State-of-the-art
◮ The NEXT Framework: Problem Formulation; Derivation of the NEXT Algorithm
◮ Application to Distributed NN Training: Choice of surrogate function; Parallel computing of the surrogate function; A practical example
◮ Experimental results and conclusions: Experimental results; Conclusions

SLIDE 3

Content at a glance

◮ Setting: training of neural networks (NNs) where data is distributed across agents with sparse connectivity (e.g., as in wireless sensor networks).
◮ State-of-the-art: very limited literature on the distributed optimization of nonconvex objective functions, as required by NN training.
◮ Objective: we propose a general framework with theoretical guarantees that can be customized to multiple loss functions and regularizers, and that allows agents to exploit parallel multi-core processors.

SLIDE 4

Visual representation

[Diagram: four nodes, each holding a local dataset S1–S4 and a copy of the model, connected by input/output links.]

Figure 1: Example of distributed learning with four agents agreeing on a common (neural network) model.

SLIDE 5

State-of-the-art

Distributed learning with convex objective functions is well established:

◮ Kernel ridge regression [Predd, Kulkarni and Poor, IEEE SPM, 2006]

◮ Sparse linear regression [Mateos, Bazerque and Giannakis, IEEE TSP, 2010]

◮ Support vector machines [Forero, Cano and Giannakis, JMLR, 2010]

◮ Local convex solvers & communication [Jaggi et al., NIPS, 2014]

This reflects the availability of general-purpose methods for distributed optimization of convex losses, e.g., the ADMM.

SLIDE 6

Our contribution

1. Distributed learning of neural networks has mostly been considered with sub-optimal ensemble procedures (e.g., boosting), or using some form of centralized server [Dean et al., NIPS, 2012].

2. Similarly, the literature on distributed nonconvex optimization is recent and smaller.

3. We customize a novel framework called in-NEtwork nonconveX opTimization (NEXT), combining a convexification-decomposition technique with a dynamic consensus procedure [Di Lorenzo and Scutari, IEEE TSIPN, 2016].

SLIDE 7

Contents

◮ Introduction: Overview; State-of-the-art
◮ The NEXT Framework: Problem Formulation; Derivation of the NEXT Algorithm
◮ Application to Distributed NN Training: Choice of surrogate function; Parallel computing of the surrogate function; A practical example
◮ Experimental results and conclusions: Experimental results; Conclusions

SLIDE 8

Problem formulation

Distributed training of a neural network f(w; x) can be cast as the minimization of a social cost function G plus a regularization term r:

min_w U(w) = G(w) + r(w) = Σ_{i=1}^{I} g_i(w) + r(w),   (1)

where g_i(·) is the local cost function of agent i, defined as:

g_i(w) = Σ_{m∈S_i} l(d_{i,m}, f(w; x_{i,m})),   (2)

where l(·, ·) is a (convex) loss function, and (x_{i,m}, d_{i,m}) is a training example. Problem (1) is typically nonconvex due to the NN mapping f(w; x).
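As a concrete sketch of the cost structure in (1)–(2), the snippet below evaluates a local cost g_i and the social cost U for a tiny one-hidden-layer network with squared loss and ℓ2 regularization. All names, sizes, and the toy architecture are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def nn_forward(w, X, n_hidden=3):
    # Tiny one-hidden-layer network f(w; x); w packs both weight matrices.
    n_in = X.shape[1]
    W1 = w[:n_in * n_hidden].reshape(n_in, n_hidden)
    W2 = w[n_in * n_hidden:].reshape(n_hidden, 1)
    return np.tanh(X @ W1) @ W2

def local_cost(w, X_i, d_i):
    # g_i(w): squared loss summed over agent i's local samples S_i, as in (2).
    return float(np.sum((d_i - nn_forward(w, X_i)) ** 2))

def social_cost(w, datasets, lam=1e-3):
    # U(w) = sum_i g_i(w) + r(w), with r(w) = lam * ||w||_2^2, as in (1).
    return sum(local_cost(w, X, d) for X, d in datasets) + lam * float(w @ w)
```

Because each g_i depends only on agent i's data, no single agent can evaluate U(w) alone; this is precisely what makes the distributed setting nontrivial.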

SLIDE 9

Network model

◮ The network is modeled as a digraph G[n] = (V, E[n]), where V = {1, ..., I} is the set of agents, and E[n] is the set of (possibly) time-varying directed edges.

◮ Associated with each graph G[n], we introduce (possibly) time-varying weights c_ij[n] matching G[n]:

c_ij[n] = θ_ij ∈ [ϑ, 1] if j ∈ N_i^in[n]; 0 otherwise,   (3)

for some ϑ ∈ (0, 1), and define the matrix C[n] ≜ (c_ij[n])_{i,j=1}^{I}.

◮ The weights define the communication topology.

SLIDE 10

Network assumptions

1. The sequence of graphs G[n] is B-strongly connected, i.e., the graph G_B[k] = (V, E_B[k]), with E_B[k] = ∪_{n=kB}^{(k+1)B−1} E[n], is strongly connected for all k ≥ 0 and some B > 0.

2. Every weight matrix C[n] in (3) is doubly stochastic, i.e., it satisfies

C[n] 1 = 1 and 1^T C[n] = 1^T for all n.   (4)

3. Each agent i knows only its own cost function g_i (but not the entire G), and the common function r.
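One standard way to satisfy the doubly stochastic condition (4) on an undirected graph is the Metropolis–Hastings weight rule. A minimal sketch (function name and the adjacency-matrix interface are illustrative):

```python
import numpy as np

def metropolis_weights(adj):
    # Build c_ij from an undirected adjacency matrix. The resulting matrix is
    # symmetric and row-stochastic, hence doubly stochastic: C 1 = 1, 1^T C = 1^T.
    I = adj.shape[0]
    deg = adj.sum(axis=1)
    C = np.zeros((I, I))
    for i in range(I):
        for j in range(I):
            if i != j and adj[i, j]:
                C[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        C[i, i] = 1.0 - C[i].sum()   # self-weight absorbs the remainder
    return C
```

Each agent can compute its own row using only the degrees of its neighbors, which fits the decentralized setting.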

SLIDE 11

Step 1 - Local optimization

At every step, a local estimate w̃_i[n] is obtained by solving a strongly convex surrogate of the original cost function:

w̃_i[n] = argmin_{w_i} g̃_i(w_i; w_i[n]) + π_i[n]^T (w_i − w_i[n]) + r(w_i),   (5)

where

π_i[n] ≜ Σ_{j≠i} ∇_w g_j(w_i[n]),   (6)

and g̃_i(w_i; w_i[n]) is a convex approximation of g_i at the point w_i[n], preserving the first-order properties of g_i. The quantity π_i[n] is not available to the agents and must be approximated.

SLIDE 12

Step 2 - Computation of new estimate

The new estimate is obtained as the convex combination:

z_i[n] = w_i[n] + α[n] (w̃_i[n] − w_i[n]),   (7)

where α[n] is a (possibly time-varying) step-size sequence.

SLIDE 13

Step 3 - Consensus phase

Each agent i updates w_i[n] with a consensus step over its in-neighbors:

w_i[n + 1] = Σ_{j∈N_i^in[n]} c_ij[n] z_j[n].   (8)

Finally, we replace π_i[n] with a local estimate π̃_i[n], asymptotically converging to π_i[n]:

π̃_i[n] ≜ I · y_i[n] − ∇g_i(w_i[n]),   (9)

where y_i[n] is a local auxiliary variable that asymptotically tracks the average of the gradients, updated as:

y_i[n + 1] ≜ Σ_{j=1}^{I} c_ij[n] y_j[n] + (∇g_i(w_i[n + 1]) − ∇g_i(w_i[n])).   (10)
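Putting Steps 1–3 together, here is a minimal numerical sketch of the iteration (5)–(10), using the FL surrogate (whose local problem has a closed form) on a toy convex quadratic g_i(w) = 0.5‖w − a_i‖², so the common stationary point is known to be the average of the a_i. The constant step size, ring topology, and all constants are illustrative choices, not the paper's setup (the theory uses a diminishing step size).

```python
import numpy as np

rng = np.random.default_rng(0)
I, Q = 5, 3                                       # agents, parameters
a = rng.normal(size=(I, Q))                       # g_i(w) = 0.5 * ||w - a_i||^2
grad = lambda i, w: w - a[i]                      # local gradient of g_i

# Doubly stochastic consensus weights on a ring, matching assumption (4).
C = np.zeros((I, I))
for i in range(I):
    C[i, i] = 0.5
    C[i, (i + 1) % I] = C[i, (i - 1) % I] = 0.25

tau, alpha = 5.0, 0.05                            # surrogate weight, constant step size
w = rng.normal(size=(I, Q))                       # local estimates w_i
y = np.stack([grad(i, w[i]) for i in range(I)])   # y_i[0] = grad g_i(w_i[0])

for n in range(1000):
    g_now = np.stack([grad(i, w[i]) for i in range(I)])
    pi = I * y - g_now                            # (9): local estimate of pi_i
    w_tilde = w - (g_now + pi) / tau              # (5) with FL surrogate, closed form
    z = w + alpha * (w_tilde - w)                 # (7): convex combination
    w = C @ z                                     # (8): consensus on the z_j
    g_new = np.stack([grad(i, w[i]) for i in range(I)])
    y = C @ y + g_new - g_now                     # (10): gradient tracking

w_star = a.mean(axis=0)                           # minimizer of sum_i g_i
```

All agents end up agreeing on w_star, illustrating both consensus and convergence to a stationary point of Σ_i g_i; with a nonconvex g_i one would only obtain stationarity, not global optimality.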

SLIDE 14

Convergence

Theorem

Let {w[n]}_n ≜ {(w_i[n])_{i=1}^{I}}_n be the sequence generated by the algorithm, and let {w̄[n]}_n ≜ {(1/I) Σ_{i=1}^{I} w_i[n]}_n be its average. Suppose that the step-size sequence {α[n]}_n is chosen so that α[n] ∈ (0, 1] for all n, with

Σ_{n=0}^{∞} α[n] = ∞ and Σ_{n=0}^{∞} α[n]^2 < ∞.   (11)

If the sequence {w̄[n]}_n is bounded, then (a) all its limit points are stationary solutions of the original problem; (b) all the sequences {w_i[n]}_n asymptotically agree, i.e., w_i[n] − w̄[n] → 0 as n → ∞, for all i.

Proof.

See [Di Lorenzo and Scutari, IEEE TSIPN, 2016].

SLIDE 15

Contents

◮ Introduction: Overview; State-of-the-art
◮ The NEXT Framework: Problem Formulation; Derivation of the NEXT Algorithm
◮ Application to Distributed NN Training: Choice of surrogate function; Parallel computing of the surrogate function; A practical example
◮ Experimental results and conclusions: Experimental results; Conclusions

SLIDE 16

Choice of surrogate function

Strategy (a): Partial linearization (PL). We only linearize the NN mapping:

g̃_i(w_i; w_i[n]) = Σ_{m∈S_i} l(d_{i,m}, f̃(w_i; w_i[n], x_{i,m})) + (τ_i/2) ‖w_i − w_i[n]‖^2,   (12)

where τ_i ≥ 0, and

f̃(w_i; w_i[n], x_{i,m}) = f(w_i[n]; x_{i,m}) + ∇_w f(w_i[n]; x_{i,m})^T (w_i − w_i[n]).

Strategy (b): Full linearization (FL). We linearize g_i around w_i[n]:

g̃_i(w_i; w_i[n]) = g_i(w_i[n]) + ∇g_i(w_i[n])^T (w_i − w_i[n]) + (τ_i/2) ‖w_i − w_i[n]‖^2.   (13)
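The key property of the PL surrogate is that the linearized mapping f̃ in (12) matches f to first order at the expansion point w_i[n]. A small numerical sketch, with a toy scalar model and a finite-difference gradient standing in for the NN and its Jacobian (both are illustrative assumptions):

```python
import numpy as np

def f(w, x):
    # Toy scalar model standing in for the NN output f(w; x).
    return float(np.tanh(w @ x) * w[0])

def grad_f(w, x, eps=1e-6):
    # Central finite-difference gradient of f with respect to w.
    g = np.zeros_like(w)
    for k in range(w.size):
        e = np.zeros_like(w)
        e[k] = eps
        g[k] = (f(w + e, x) - f(w - e, x)) / (2 * eps)
    return g

def f_tilde(w, w_n, x):
    # Linearization of f around w_n, as used inside the PL surrogate (12).
    return f(w_n, x) + float(grad_f(w_n, x) @ (w - w_n))
```

The surrogate agrees with f at w_n and its error shrinks quadratically as w approaches w_n, which is the "first-order properties" requirement on g̃_i.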

SLIDE 17

Parallel computing of the surrogate function

◮ Assume there are C cores available at each node i, and partition w_i = (w_{i,c})_{c=1}^{C} into C nonoverlapping blocks.

◮ Choose g̃_i additively separable in the blocks:

g̃_i(w_i; w_i[n]) = Σ_{c=1}^{C} g̃_{i,c}(w_{i,c}; w_{i,−c}[n]),

where each g̃_{i,c}(·; w_{i,−c}[n]) satisfies the surrogate assumptions in the variable w_{i,c}.

◮ The surrogate optimization problem then decomposes into C separate strongly convex subproblems:

w̃_{i,c}[n] = argmin_{w_{i,c}} g̃_{i,c}(w_{i,c}; w_{i,−c}[n]) + π̃_{i,c}[n]^T (w_{i,c} − w_{i,c}[n]) + r(w_{i,c}).
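To see why separability enables parallelism, this sketch minimizes a separable strongly convex surrogate once jointly and once block-by-block, and checks that the answers coincide. The quadratic form, the targets t, and all sizes are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
Q, C_cores, tau = 12, 4, 2.0
blocks = np.split(np.arange(Q), C_cores)     # nonoverlapping partition of w_i
t = rng.normal(size=Q)                       # targets of a separable quadratic surrogate
pi = rng.normal(size=Q)                      # stand-in for the correction term pi_i[n]

# Each core c solves its own strongly convex subproblem, independently of the others:
#   min_{w_c} (tau/2) * ||w_c - t_c||^2 + pi_c^T w_c   =>   w_c = t_c - pi_c / tau
w_parallel = np.concatenate([t[b] - pi[b] / tau for b in blocks])

# The joint minimizer of the full separable surrogate is identical:
w_joint = t - pi / tau
```

Since no subproblem depends on another block's variable, the C solves can run on C cores with no synchronization until the results are concatenated.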

SLIDE 18

A practical example I

We consider a squared loss l(·, ·) = (d_{i,m} − f(w; x_{i,m}))^2, and an ℓ2-norm regularization r(w) = λ ‖w‖_2^2. Define:

A_i[n] = Σ_{m=1}^{M} J_{i,m}[n]^T J_{i,m}[n] + λI,   (14)

b_i[n] = Σ_{m=1}^{M} J_{i,m}[n]^T r_{i,m}[n],   (15)

with

[J_{i,m}[n]]_{kl} = ∂f_k(w_i[n]; x_{i,m}) / ∂w_l,   (16)

r_{i,m}[n] = d_{i,m} − f(w_i[n]; x_{i,m}) + J_{i,m}[n] w_i[n].   (17)

SLIDE 19

A practical example II

The cost function at agent i and core c for the PL formulation can be cast as:

Ũ_{i,c}(w_{i,c}; w_i[n], π̃_{i,c}[n]) = w_{i,c}^T A_{i,c,c}[n] w_{i,c} − 2 (b_{i,c}[n] + A_{i,c,−c}[n] w_{i,−c}[n] − 0.5 π̃_{i,c}[n])^T w_{i,c},   (18)

where A_{i,c,c}[n] is the block of A_i[n] corresponding to the c-th partition, and similarly for A_{i,c,−c}[n]. The solution is given in closed form as:

w̃_{i,c}[n] = A_{i,c,c}[n]^{−1} (b_{i,c}[n] + A_{i,c,−c}[n] w_{i,−c}[n] − 0.5 π̃_{i,c}[n]).   (19)
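A numerical sketch of the PL closed form (19): build a block A_{i,c,c}[n] with the Gauss-Newton structure of (14) (so it is symmetric positive definite), lump the entire linear term of (18) into a single vector v, and check that solving the linear system minimizes the quadratic cost. All sizes and the random data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
Qc, K, M, lam = 4, 2, 10, 1e-3               # block size, outputs, samples, lambda

# A_cc = sum_m J_m^T J_m + lam*I over one block: symmetric positive definite.
J = rng.normal(size=(M, K, Qc))
A_cc = sum(J[m].T @ J[m] for m in range(M)) + lam * np.eye(Qc)

# v lumps b_{i,c}[n] + A_{i,c,-c}[n] w_{i,-c}[n] - 0.5 * pi_{i,c}[n] from (18)-(19).
v = rng.normal(size=Qc)

w_c = np.linalg.solve(A_cc, v)               # closed form (19)

def U(w):
    # Quadratic block cost (18): w^T A_cc w - 2 v^T w.
    return float(w @ A_cc @ w - 2 * v @ w)
```

In practice one would solve the linear system (e.g., via a Cholesky factorization of A_cc) rather than form an explicit inverse.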

SLIDE 20

A practical example III

In the FL case, the cost function at agent i and core c can be cast as:

Ũ_{i,c}(w_{i,c}; w_i[n], π̃_{i,c}[n]) = (0.5 τ_i + λ) ‖w_{i,c}‖^2 − (τ_i w_{i,c}[n] − ∇_c g_i(w_i[n]) − π̃_{i,c}[n])^T w_{i,c}.   (20)

This leads to the closed-form solution:

w̃_{i,c}[n] = (1/(τ_i + 2λ)) (τ_i w_{i,c}[n] − ∇_c g_i(w_i[n]) − π̃_{i,c}[n]).   (21)
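The FL minimizer follows by setting the gradient of (20) to zero: (τ_i + 2λ) w_{i,c} = τ_i w_{i,c}[n] − ∇_c g_i(w_i[n]) − π̃_{i,c}[n]. A quick numerical check, with all vectors as random stand-ins for the quantities in (20):

```python
import numpy as np

rng = np.random.default_rng(3)
Qc, tau, lam = 5, 2.0, 1e-3
w_old = rng.normal(size=Qc)                  # w_{i,c}[n]
g_c = rng.normal(size=Qc)                    # block gradient grad_c g_i(w_i[n])
pi_c = rng.normal(size=Qc)                   # pi_{i,c}[n]

def U(w):
    # FL block cost (20): (0.5*tau + lam)*||w||^2 - (tau*w_old - g_c - pi_c)^T w.
    return float((0.5 * tau + lam) * w @ w - (tau * w_old - g_c - pi_c) @ w)

w_new = (tau * w_old - g_c - pi_c) / (tau + 2 * lam)   # closed form (21)
```

Unlike the PL case, no linear system needs to be solved here, which is what makes the FL update cheaper per iteration.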

SLIDE 21

Contents

◮ Introduction: Overview; State-of-the-art
◮ The NEXT Framework: Problem Formulation; Derivation of the NEXT Algorithm
◮ Application to Distributed NN Training: Choice of surrogate function; Parallel computing of the surrogate function; A practical example
◮ Experimental results and conclusions: Experimental results; Conclusions

SLIDE 22

Experimental setup

◮ We use the Wisconsin breast cancer database (WDBC) as a distributed medical scenario.

◮ We perform a 3-fold cross-validation on the training data, and for every fold we partition the training data uniformly among a predefined number of I = 5 agents with random connectivity.

◮ The overall cross-validation process is repeated 15 times.

◮ We consider a NN with 20 hidden nodes, for a total of Q = 641 free parameters, using a small regularization factor λ = 10^{−3}.

◮ In all cases, the step-sizes are chosen according to the rule:

α[n] = α[n − 1] (1 − μ α[n − 1]), n ≥ 1.   (22)

◮ An open-source library is available: https://bitbucket.org/ispamm/parallel-and-distributed-neural-networks/
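The recursion (22) generates a diminishing step-size sequence (for large n it decays like 1/(μn), so it satisfies the summability conditions in (11)). A minimal sketch, with α[0] and μ as illustrative values rather than the paper's settings:

```python
def step_sizes(alpha0=0.9, mu=0.01, n_steps=1000):
    # alpha[n] = alpha[n-1] * (1 - mu * alpha[n-1]), n >= 1, as in (22).
    alphas = [alpha0]
    for _ in range(n_steps - 1):
        a = alphas[-1]
        alphas.append(a * (1.0 - mu * a))
    return alphas
```

As long as α[0] ∈ (0, 1] and μα[0] < 1, every factor (1 − μα[n−1]) lies in (0, 1), so the sequence stays positive and strictly decreases.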

SLIDE 23

Experimental results

[Plots: (a) cost function vs. epoch (log scale, epochs 200–1000) for PL-SCA-NN (C = 1), PL-SCA-NN (C = 16) and FL-SCA-NN; (b) relative training-time decrease vs. processors per agent (1–32) for PL-SCA-NN.]

Figure 2: (a) Cost function's evolution for FL-SCA-NN and PL-SCA-NN. (b) Relative decrease in training time (with respect to the case C = 1) obtained when varying the number of processors.

SLIDE 24

Disagreement evolution

[Plot: average disagreement vs. epoch (log scale, 10^{−8} to 10^{0}, epochs 200–1000) for PL-SCA-NN (C = 1) and PL-SCA-NN (C = 16).]

Figure 3: Behavior of the average disagreement versus the number of local communication exchanges.

SLIDE 25

Conclusive remarks

◮ We have proposed a novel framework for parallel and distributed training of neural networks, where training data is distributed over a set of agents interconnected through a sparse network topology.

◮ The method hinges on a (primal) successive convex approximation framework, and leverages dynamic consensus to propagate information over the network.

◮ To the best of our knowledge, the proposed method is the first available in the literature to solve nonconvex distributed learning problems with provable theoretical guarantees.

◮ We envision future work on other loss formulations, more complex network structures (e.g., recurrent NNs), and stochastic updates of the surrogate functions.

SLIDE 26

Questions?