Newton Methods for Neural Networks: Part 1


  1. Newton Methods for Neural Networks: Part 1. Chih-Jen Lin, National Taiwan University. Last updated: June 18, 2019.

  2. Outline: (1) Introduction; (2) Newton method; (3) Hessian and Gauss-Newton matrices.

  3. Introduction.

  4. Optimization Methods Other than Stochastic Gradient. We have explained why stochastic gradient (SG) is popular for deep learning. The same reasons may explain why other methods are less suitable for deep learning. But we also notice that going from the simplest SG to what people actually use required many modifications. Can we extend other optimization methods so that they are suitable for deep learning?

  5. Newton Method.

  6. Newton Method. Consider an optimization problem $\min_\theta f(\theta)$. The Newton method minimizes a second-order approximation of $f$ to obtain a direction $d$:
     $$\min_d \; \nabla f(\theta)^T d + \frac{1}{2} d^T \nabla^2 f(\theta)\, d. \qquad (1)$$
     If $f(\theta)$ is not strictly convex, (1) may not have a unique solution.
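
To see how (1) leads to a linear system, set the gradient of the quadratic model in (1) to zero (assuming $\nabla^2 f(\theta)$ is positive definite so that the minimizer is unique):
$$\nabla_d \left( \nabla f(\theta)^T d + \tfrac{1}{2} d^T \nabla^2 f(\theta)\, d \right) = \nabla f(\theta) + \nabla^2 f(\theta)\, d = 0 \;\;\Longrightarrow\;\; \nabla^2 f(\theta)\, d = -\nabla f(\theta).$$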

  7. Newton Method (Cont'd). We may use a positive-definite $G$ to approximate $\nabla^2 f(\theta)$. Then (1) can be solved via the linear system $G d = -\nabla f(\theta)$. The resulting direction is a descent one:
     $$\nabla f(\theta)^T d = -\nabla f(\theta)^T G^{-1} \nabla f(\theta) < 0.$$
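
As a quick numerical check (a minimal NumPy sketch, not from the slides) that such a direction is a descent one whenever $G$ is positive definite:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
G = A.T @ A + np.eye(n)      # positive definite by construction
g = rng.standard_normal(n)   # stands in for the gradient of f at theta

d = np.linalg.solve(G, -g)   # direction from G d = -gradient
print(g @ d)                 # strictly negative, so d is a descent direction
```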

  8. Newton Method (Cont'd). The procedure:
     while stopping condition not satisfied do
         let G be ∇²f(θ) or its approximation
         exactly or approximately solve G d = -∇f(θ)
         find a suitable step size α
         update θ ← θ + α d
     end while
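
A minimal NumPy sketch of this procedure (an illustrative implementation with my own simplifications, not code from the course): the stopping condition is a small gradient norm, $G$ is the Hessian plus a damping term to keep it positive definite, and the step size is fixed to 1 (a backtracking line search is sketched after the next slide).

```python
import numpy as np

def newton(f, grad, hess, theta0, damping=1e-3, tol=1e-6, max_iter=100):
    """Basic damped Newton iteration with a fixed step size."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        g = grad(theta)
        if np.linalg.norm(g) < tol:                      # stopping condition
            break
        G = hess(theta) + damping * np.eye(theta.size)   # G: Hessian or an approximation
        d = np.linalg.solve(G, -g)                       # exactly solve G d = -grad f(theta)
        alpha = 1.0                                      # step size (see the line search below)
        theta = theta + alpha * d                        # update theta <- theta + alpha d
    return theta

# Toy usage: minimize the convex quadratic f(theta) = 0.5 theta^T A theta - b^T theta.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
sol = newton(lambda t: 0.5 * t @ A @ t - b @ t,
             lambda t: A @ t - b,
             lambda t: A,
             theta0=np.zeros(2))
print(sol, np.linalg.solve(A, b))   # the two should nearly coincide
```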

  9. Step Size I. For selecting the step size α there are usually two types of approaches: line search and trust region (or its predecessor, the Levenberg-Marquardt algorithm). If using line search, the details are similar to those for gradient descent: we gradually reduce α until
     $$f(\theta + \alpha d) < f(\theta) + \nu\, \nabla f(\theta)^T (\alpha d).$$
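
A minimal backtracking implementation of this condition (illustrative only; the values of $\nu$ and the halving factor are my own choices, not from the slides):

```python
import numpy as np

def backtracking_step(f, theta, d, g, nu=1e-4, beta=0.5, max_tries=50):
    """Reduce alpha until f(theta + alpha d) < f(theta) + nu * g^T (alpha d)."""
    f0 = f(theta)
    alpha = 1.0
    for _ in range(max_tries):
        if f(theta + alpha * d) < f0 + nu * (g @ (alpha * d)):
            return alpha
        alpha *= beta          # gradually reduce the step size
    return alpha
```

The returned alpha could replace the fixed step size in the Newton sketch above.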

  10. Newton versus Gradient Descent I. We know they use second-order and first-order information, respectively. What are their special properties? It is known that using higher-order information leads to faster local convergence in the final stage.

  11. Newton versus Gradient Descent II. An illustration (modified from Tsai et al., 2014) presented earlier: two plots of distance to optimum versus time, one showing slow final convergence and the other fast final convergence.

  12. Newton versus Gradient Descent III. But the question is: for machine learning, do we need fast local convergence? The answer is no. However, higher-order methods tend to be more robust, and their behavior may be more consistent across easy and difficult problems. It is known that stochastic gradient is sometimes sensitive to its parameter settings. Thus what we hope to explore here is whether we can have a more robust optimization method.

  13. Difficulties of Newton for NN I. The Newton linear system
     $$G d = -\nabla f(\theta) \qquad (2)$$
     can be large: $G \in \mathbb{R}^{n \times n}$, where $n$ is the total number of variables. Thus $G$ is often too large to be stored.

  14. Difficulties of Newton for NN II. Even if we can store $G$, calculating $d = -G^{-1} \nabla f(\theta)$ is usually very expensive. Thus a direct use of Newton's method for deep learning is hopeless.
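
To make the cost concrete (illustrative numbers, not from the slides), consider a model with $n = 10^7$ variables:
$$\text{storing } G:\; n^2 = 10^{14} \text{ entries} \approx 4 \times 10^{14} \text{ bytes (400 TB at 4 bytes per entry)}, \qquad \text{direct solve of } Gd = -\nabla f(\theta):\; O(n^3) = O(10^{21}) \text{ operations}.$$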

  15. Existing Works Trying to Make Newton Practical I. Many works have tried to address this issue, and their approaches vary significantly. I roughly categorize them into two groups: Hessian-free methods (Martens, 2010; Martens and Sutskever, 2012; Wang et al., 2018b; Henriques et al., 2018) and Hessian approximation, in particular diagonal approximation (Martens and Grosse, 2015; Botev et al., 2017; Zhang et al., 2017).
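
A rough illustration of the Hessian-free idea (a minimal sketch under my own assumptions, not the method of any cited paper): solve $Gd = -\nabla f(\theta)$ by the conjugate gradient method, which only needs matrix-vector products $Gv$ and therefore never forms or stores $G$.

```python
import numpy as np

def cg_solve(matvec, b, tol=1e-8, max_iter=250):
    """Conjugate gradient for G d = b, where G is accessed only through matvec(v) = G v."""
    d = np.zeros_like(b)
    r = b.copy()                 # residual b - G d (d = 0 initially)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Gp = matvec(p)           # the only way G is used: one product per iteration
        alpha = rs / (p @ Gp)
        d += alpha * p
        r -= alpha * Gp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d

# Toy usage: here matvec wraps an explicit matrix, but in a Hessian-free method it would
# compute a Hessian- or Gauss-Newton-vector product directly from the network.
G = np.array([[4.0, 1.0], [1.0, 3.0]])
g = np.array([1.0, 2.0])
print(cg_solve(lambda v: G @ v, -g))
print(np.linalg.solve(G, -g))    # same result
```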

  16. Existing Works Trying to Make Newton Practical II. There are many others that I did not put into the above two groups for various reasons (Osawa et al., 2019; Wang et al., 2018a; Chen et al., 2019; Wilamowski et al., 2007). There are also comparisons (Chen and Hsieh, 2018). With so many possibilities, it is difficult to reach conclusions. We decide to first check the robustness of standard Newton methods on small-scale data, so that no approximations are needed.

  17. Existing Works Trying to Make Newton Practical III. We will see more details in the project description.

  18. Hessian and Gauss-Newton Matrices.

  19. Introduction. We will examine techniques to address the difficulty of storing or inverting the Hessian. But before that, let us derive its mathematical form.

  20. Hessian Matrix I. For a CNN, the gradient of $f(\theta)$ is
     $$\nabla f(\theta) = \frac{1}{C}\theta + \frac{1}{l}\sum_{i=1}^{l} (J^i)^T \nabla_{z^{L+1,i}}\, \xi(z^{L+1,i}; y^i, Z^{1,i}), \qquad (3)$$
     where
     $$J^i = \begin{bmatrix} \dfrac{\partial z^{L+1,i}_1}{\partial \theta_1} & \cdots & \dfrac{\partial z^{L+1,i}_1}{\partial \theta_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial z^{L+1,i}_{n_{L+1}}}{\partial \theta_1} & \cdots & \dfrac{\partial z^{L+1,i}_{n_{L+1}}}{\partial \theta_n} \end{bmatrix} \in \mathbb{R}^{n_{L+1} \times n}, \quad i = 1, \ldots, l, \qquad (4)$$

  21. Hessian Matrix II. is the Jacobian of $z^{L+1,i}(\theta)$. The Hessian matrix of $f(\theta)$ is
     $$\nabla^2 f(\theta) = \frac{1}{C}I + \frac{1}{l}\sum_{i=1}^{l} (J^i)^T B^i J^i + \frac{1}{l}\sum_{i=1}^{l}\sum_{j=1}^{n_{L+1}} \frac{\partial \xi(z^{L+1,i}; y^i, Z^{1,i})}{\partial z^{L+1,i}_j} \begin{bmatrix} \dfrac{\partial^2 z^{L+1,i}_j}{\partial \theta_1 \partial \theta_1} & \cdots & \dfrac{\partial^2 z^{L+1,i}_j}{\partial \theta_1 \partial \theta_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 z^{L+1,i}_j}{\partial \theta_n \partial \theta_1} & \cdots & \dfrac{\partial^2 z^{L+1,i}_j}{\partial \theta_n \partial \theta_n} \end{bmatrix},$$

  22. Hessian Matrix III. where $I$ is the identity matrix and $B^i$ is the Hessian of $\xi(\cdot)$ with respect to $z^{L+1,i}$:
     $$B^i = \nabla^2_{z^{L+1,i},\, z^{L+1,i}}\, \xi(z^{L+1,i}; y^i, Z^{1,i}).$$
     More precisely,
     $$B^i_{ts} = \frac{\partial^2 \xi(z^{L+1,i}; y^i, Z^{1,i})}{\partial z^{L+1,i}_t\, \partial z^{L+1,i}_s}, \quad \forall\, t, s = 1, \ldots, n_{L+1}. \qquad (5)$$
     Usually $B^i$ is very simple.

  23. Hessian Matrix IV. For example, if the squared loss
     $$\xi(z^{L+1,i}; y^i) = \| z^{L+1,i} - y^i \|^2$$
     is used, then
     $$B^i = \begin{bmatrix} 2 & & \\ & \ddots & \\ & & 2 \end{bmatrix} = 2I.$$
     Usually we consider a loss function $\xi(z^{L+1,i}; y^i)$ that is convex with respect to $z^{L+1,i}$.
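
The entries of $B^i$ for the squared loss follow directly from (5):
$$\frac{\partial \xi}{\partial z^{L+1,i}_t} = 2\,(z^{L+1,i}_t - y^i_t), \qquad B^i_{ts} = \frac{\partial^2 \xi}{\partial z^{L+1,i}_t\, \partial z^{L+1,i}_s} = 2\,\delta_{ts},$$
so $B^i = 2I$ is positive definite.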

  24. Hessian Matrix V. Thus $B^i$ is positive semi-definite. However, the last term of $\nabla^2 f(\theta)$ may not be positive semi-definite. Note that for a twice differentiable function $f(\theta)$, $f(\theta)$ is convex if and only if $\nabla^2 f(\theta)$ is positive semi-definite.

  25. Jacobian Matrix. The Jacobian matrix of $z^{L+1,i}(\theta) \in \mathbb{R}^{n_{L+1}}$ is
     $$J^i = \begin{bmatrix} \dfrac{\partial z^{L+1,i}_1}{\partial \theta_1} & \cdots & \dfrac{\partial z^{L+1,i}_1}{\partial \theta_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial z^{L+1,i}_{n_{L+1}}}{\partial \theta_1} & \cdots & \dfrac{\partial z^{L+1,i}_{n_{L+1}}}{\partial \theta_n} \end{bmatrix} \in \mathbb{R}^{n_{L+1} \times n}, \quad i = 1, \ldots, l.$$
     Here $n_{L+1}$ is the number of neurons in the output layer and $n$ is the total number of variables, so $n_{L+1} \times n$ can be large.
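
For a sense of scale (illustrative numbers, not from the slides): with $n = 10^7$ variables and $n_{L+1} = 10$ output neurons, each $J^i$ has $n_{L+1} \times n = 10^8$ entries, and storing the Jacobians of all $l = 10^5$ training examples would require $10^{13}$ entries.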

  26. Gauss-Newton Matrix I. The Hessian matrix $\nabla^2 f(\theta)$ is in general not positive definite, so we may need a positive-definite approximation. This is a deep research issue. Many existing Newton methods for NN have considered the Gauss-Newton matrix (Schraudolph, 2002),
     $$G = \frac{1}{C}I + \frac{1}{l}\sum_{i=1}^{l} (J^i)^T B^i J^i,$$
     obtained by removing the last term of $\nabla^2 f(\theta)$.

  27. Gauss-Newton Matrix II. The Gauss-Newton matrix is positive definite if each $B^i$ is positive semi-definite, which holds if we use a loss function that is convex in terms of $z^{L+1,i}(\theta)$. We then solve $G d = -\nabla f(\theta)$.
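
A minimal NumPy sketch of forming $G$ and solving $Gd = -\nabla f(\theta)$ on a tiny problem (illustrative only: the Jacobians $J^i$, the matrices $B^i$, and the gradient are random stand-ins rather than quantities computed from an actual network):

```python
import numpy as np

rng = np.random.default_rng(0)
l, n, n_out = 8, 6, 3                           # l examples, n variables, n_{L+1} outputs
C = 1.0                                         # regularization parameter

J = rng.standard_normal((l, n_out, n))          # stand-ins for the Jacobians J^i
B = np.repeat(2.0 * np.eye(n_out)[None], l, 0)  # B^i = 2I, as for the squared loss
grad = rng.standard_normal(n)                   # stand-in for the gradient of f at theta

# G = (1/C) I + (1/l) sum_i (J^i)^T B^i J^i
G = np.eye(n) / C
for i in range(l):
    G += J[i].T @ B[i] @ J[i] / l

d = np.linalg.solve(G, -grad)                   # Gauss-Newton direction
print(grad @ d)                                 # negative: G is positive definite, so d is a descent direction
```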
