

1. Newton Methods for Neural Networks: Gauss-Newton Matrix-Vector Product
Chih-Jen Lin, National Taiwan University
Last updated: June 1, 2020

2. Outline
(1) Backward setting: Jacobian evaluation; Gauss-Newton matrix-vector products
(2) Forward + backward settings: R operator; Gauss-Newton matrix-vector product

3. Backward setting: Outline
(1) Backward setting: Jacobian evaluation; Gauss-Newton matrix-vector products
(2) Forward + backward settings: R operator; Gauss-Newton matrix-vector product

4. Backward setting / Jacobian evaluation: Outline
(1) Backward setting: Jacobian evaluation; Gauss-Newton matrix-vector products
(2) Forward + backward settings: R operator; Gauss-Newton matrix-vector product

5. Jacobian Evaluation: Convolutional Layer I
For an instance $i$, the Jacobian can be partitioned into $L$ blocks according to layers:
$$J^i = \begin{bmatrix} J^{1,i} & J^{2,i} & \cdots & J^{L,i} \end{bmatrix}, \quad m = 1, \ldots, L, \qquad (1)$$
where
$$J^{m,i} = \begin{bmatrix} \dfrac{\partial z^{L+1,i}}{\partial (b^m)^T} & \dfrac{\partial z^{L+1,i}}{\partial \mathrm{vec}(W^m)^T} \end{bmatrix}.$$
The calculation seems to be very similar to that for the gradient.
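
To fix ideas, here is a minimal NumPy sketch of the partition in (1). All sizes are made up, and the blocks are left as zeros, since only the block structure is being illustrated; J_blocks[m] plays the role of $J^{m,i}$.

```python
import numpy as np

n_out = 3             # n_{L+1}: number of network outputs
n_params = [8, 20, 6] # assumed per-layer counts of b^m plus vec(W^m) entries (L = 3)

# One Jacobian block per layer; the column count matches that layer's parameters.
J_blocks = [np.zeros((n_out, p)) for p in n_params]

# The full per-instance Jacobian J^i is the horizontal concatenation of the blocks.
J_i = np.hstack(J_blocks)
print(J_i.shape)      # (3, 34): n_{L+1} rows, one column per model parameter
```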

6. Jacobian Evaluation: Convolutional Layer II
For the convolutional layers, recall that for the gradient we have
$$\frac{\partial f}{\partial W^m} = \frac{1}{C} W^m + \frac{1}{l} \sum_{i=1}^{l} \frac{\partial \xi_i}{\partial W^m}$$
and
$$\frac{\partial \xi_i}{\partial \mathrm{vec}(W^m)^T} = \mathrm{vec}\left( \frac{\partial \xi_i}{\partial S^{m,i}} \phi(\mathrm{pad}(Z^{m,i}))^T \right)^T.$$

7. Jacobian Evaluation: Convolutional Layer III
Now we have
$$\frac{\partial z^{L+1,i}}{\partial \mathrm{vec}(W^m)^T} = \begin{bmatrix} \frac{\partial z^{L+1,i}_1}{\partial \mathrm{vec}(W^m)^T} \\ \vdots \\ \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial \mathrm{vec}(W^m)^T} \end{bmatrix} = \begin{bmatrix} \mathrm{vec}\left( \frac{\partial z^{L+1,i}_1}{\partial S^{m,i}} \phi(\mathrm{pad}(Z^{m,i}))^T \right)^T \\ \vdots \\ \mathrm{vec}\left( \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial S^{m,i}} \phi(\mathrm{pad}(Z^{m,i}))^T \right)^T \end{bmatrix}.$$
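
A small NumPy sketch of this row-stacking under assumed toy shapes: G[j] stands for $\partial z^{L+1,i}_j / \partial S^{m,i}$, Phi for $\phi(\mathrm{pad}(Z^{m,i}))$, and vec() is column-major to match the column-stacking convention of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n_out, d_out, positions, h = 3, 2, 4, 5   # n_{L+1}, filters, conv positions, patch size

G = rng.standard_normal((n_out, d_out, positions))  # G[j] = dz_j^{L+1,i}/dS^{m,i}
Phi = rng.standard_normal((h, positions))           # phi(pad(Z^{m,i}))

vec = lambda A: A.reshape(-1, order="F")            # column-major vec()

# Row j of the Jacobian block w.r.t. vec(W^m) is vec(G[j] @ Phi.T); stack all rows.
J_W = np.stack([vec(G[j] @ Phi.T) for j in range(n_out)])
print(J_W.shape)                                    # (3, 10): one row per output z_j
```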

8. Jacobian Evaluation: Convolutional Layer IV
If $b^m$ is considered, the result is
$$\begin{bmatrix} \dfrac{\partial z^{L+1,i}}{\partial (b^m)^T} & \dfrac{\partial z^{L+1,i}}{\partial \mathrm{vec}(W^m)^T} \end{bmatrix} = \begin{bmatrix} \left( \frac{\partial z^{L+1,i}_1}{\partial S^{m,i}} \mathbb{1}_{a^m_{\mathrm{conv}} b^m_{\mathrm{conv}}} \right)^T & \mathrm{vec}\left( \frac{\partial z^{L+1,i}_1}{\partial S^{m,i}} \phi(\mathrm{pad}(Z^{m,i}))^T \right)^T \\ \vdots & \vdots \\ \left( \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial S^{m,i}} \mathbb{1}_{a^m_{\mathrm{conv}} b^m_{\mathrm{conv}}} \right)^T & \mathrm{vec}\left( \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial S^{m,i}} \phi(\mathrm{pad}(Z^{m,i}))^T \right)^T \end{bmatrix}.$$

9. Jacobian Evaluation: Convolutional Layer V
We can see that this is more complicated than the gradient: the gradient is a vector, but the Jacobian is a matrix.

10. Jacobian Evaluation: Backward Process I
For the gradient, earlier we needed a backward process to calculate
$$\frac{\partial \xi_i}{\partial S^{m,i}}.$$
Now what we need are
$$\frac{\partial z^{L+1,i}_1}{\partial S^{m,i}}, \ldots, \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial S^{m,i}}.$$
The process is similar.

11. Jacobian Evaluation: Backward Process II
With the ReLU activation function and max pooling, for the gradient we had
$$\frac{\partial \xi_i}{\partial \mathrm{vec}(S^{m,i})^T} = \left( \frac{\partial \xi_i}{\partial \mathrm{vec}(Z^{m+1,i})^T} \odot \mathrm{vec}(I[Z^{m+1,i}])^T \right) P^{m,i}_{\mathrm{pool}}.$$

12. Jacobian Evaluation: Backward Process III
Assume that $\partial z^{L+1,i} / \partial \mathrm{vec}(Z^{m+1,i})^T$ is available. Then
$$\frac{\partial z^{L+1,i}_j}{\partial \mathrm{vec}(S^{m,i})^T} = \left( \frac{\partial z^{L+1,i}_j}{\partial \mathrm{vec}(Z^{m+1,i})^T} \odot \mathrm{vec}(I[Z^{m+1,i}])^T \right) P^{m,i}_{\mathrm{pool}}, \quad j = 1, \ldots, n_{L+1}.$$

13. Jacobian Evaluation: Backward Process IV
These row vectors can be written together as a matrix:
$$\frac{\partial z^{L+1,i}}{\partial \mathrm{vec}(S^{m,i})^T} = \left( \frac{\partial z^{L+1,i}}{\partial \mathrm{vec}(Z^{m+1,i})^T} \odot \left( \mathbb{1}_{n_{L+1}} \mathrm{vec}(I[Z^{m+1,i}])^T \right) \right) P^{m,i}_{\mathrm{pool}}.$$
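
A NumPy sketch of this stacked step, with hypothetical sizes and a hand-built pooling matrix: broadcasting the indicator row across all $n_{L+1}$ rows realizes the $\mathbb{1}_{n_{L+1}} \mathrm{vec}(I[Z^{m+1,i}])^T$ term.

```python
import numpy as np

rng = np.random.default_rng(0)
n_out, n_pooled, n_conv = 3, 4, 8   # n_{L+1}, pooled size, conv-output size

dZ_next = rng.standard_normal((n_out, n_pooled))  # dz^{L+1,i}/dvec(Z^{m+1,i})^T
Z_next = rng.standard_normal(n_pooled)            # vec(Z^{m+1,i})

# P_pool: one 1 per row, marking which conv output each pooled value came from.
P_pool = np.zeros((n_pooled, n_conv))
P_pool[np.arange(n_pooled), [0, 3, 5, 6]] = 1.0   # hypothetical max positions

indicator = (Z_next > 0).astype(float)            # vec(I[Z^{m+1,i}])^T
dS = (dZ_next * indicator) @ P_pool               # all n_{L+1} rows at once
print(dS.shape)                                   # (3, 8)
```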

14. Jacobian Evaluation: Backward Process V
For the gradient, we use $\partial \xi_i / \partial S^{m,i}$ to obtain
$$\frac{\partial \xi_i}{\partial \mathrm{vec}(Z^{m,i})^T} = \mathrm{vec}\left( (W^m)^T \frac{\partial \xi_i}{\partial S^{m,i}} \right)^T P^m_{\phi} P^m_{\mathrm{pad}}$$
and pass it to the previous layer.

15. Jacobian Evaluation: Backward Process VI
Now we need to generate $\partial z^{L+1,i} / \partial \mathrm{vec}(Z^{m,i})^T$ and pass it to the previous layer:
$$\frac{\partial z^{L+1,i}}{\partial \mathrm{vec}(Z^{m,i})^T} = \begin{bmatrix} \mathrm{vec}\left( (W^m)^T \frac{\partial z^{L+1,i}_1}{\partial S^{m,i}} \right)^T P^m_{\phi} P^m_{\mathrm{pad}} \\ \vdots \\ \mathrm{vec}\left( (W^m)^T \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial S^{m,i}} \right)^T P^m_{\phi} P^m_{\mathrm{pad}} \end{bmatrix}.$$

16. Jacobian Evaluation: Fully-connected Layer I
We do not discuss details, but list all results below:
$$\frac{\partial z^{L+1,i}}{\partial \mathrm{vec}(W^m)^T} = \begin{bmatrix} \mathrm{vec}\left( \frac{\partial z^{L+1,i}_1}{\partial s^{m,i}} (z^{m,i})^T \right) & \cdots & \mathrm{vec}\left( \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial s^{m,i}} (z^{m,i})^T \right) \end{bmatrix}^T.$$

17. Jacobian Evaluation: Fully-connected Layer II
$$\frac{\partial z^{L+1,i}}{\partial (b^m)^T} = \frac{\partial z^{L+1,i}}{\partial (s^{m,i})^T},$$
$$\frac{\partial z^{L+1,i}}{\partial (s^{m,i})^T} = \frac{\partial z^{L+1,i}}{\partial (z^{m+1,i})^T} \odot \left( \mathbb{1}_{n_{L+1}} I[z^{m+1,i}]^T \right),$$
$$\frac{\partial z^{L+1,i}}{\partial (z^{m,i})^T} = \frac{\partial z^{L+1,i}}{\partial (s^{m,i})^T} W^m.$$
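
A compact NumPy sketch of the three fully-connected formulas, with assumed small sizes; ds stands for $\partial z^{L+1,i} / \partial (s^{m,i})^T$, and the ReLU indicator is broadcast across the $n_{L+1}$ rows.

```python
import numpy as np

rng = np.random.default_rng(0)
n_out, n_next, n_m = 3, 4, 5                   # n_{L+1}, n_{m+1}, n_m

dz_next = rng.standard_normal((n_out, n_next)) # dz^{L+1,i}/d(z^{m+1,i})^T
z_next = rng.standard_normal(n_next)           # z^{m+1,i}
z_m = rng.standard_normal(n_m)                 # z^{m,i}
W = rng.standard_normal((n_next, n_m))         # W^m

vec = lambda A: A.reshape(-1, order="F")       # column-major vec()

ds = dz_next * (z_next > 0)                    # odot (1_{n_{L+1}} I[z^{m+1,i}]^T)
J_b = ds                                       # block w.r.t. b^m
J_W = np.stack([vec(np.outer(ds[j], z_m)) for j in range(n_out)])  # w.r.t. vec(W^m)
J_z = ds @ W                                   # passed to the previous layer
print(J_b.shape, J_W.shape, J_z.shape)         # (3, 4) (3, 20) (3, 5)
```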

18. Jacobian Evaluation: Fully-connected Layer III
For layer $L+1$, if using the squared loss and the linear activation function, we have
$$\frac{\partial z^{L+1,i}}{\partial (s^{L,i})^T} = I_{n_{L+1}}.$$
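
In code this is just the seed of the backward recursion: a trivial sketch, assuming $n_{L+1} = 3$.

```python
import numpy as np

n_out = 3              # n_{L+1}
dz_ds = np.eye(n_out)  # dz^{L+1,i}/d(s^{L,i})^T = I_{n_{L+1}}
# This identity initializes the recursion that then runs over m = L, L-1, ..., 1.
```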

19. Gradient versus Jacobian I
Operations for the gradient:
$$\frac{\partial \xi_i}{\partial \mathrm{vec}(S^{m,i})^T} = \left( \frac{\partial \xi_i}{\partial \mathrm{vec}(Z^{m+1,i})^T} \odot \mathrm{vec}(I[Z^{m+1,i}])^T \right) P^{m,i}_{\mathrm{pool}},$$
$$\frac{\partial \xi_i}{\partial W^m} = \frac{\partial \xi_i}{\partial S^{m,i}} \phi(\mathrm{pad}(Z^{m,i}))^T,$$
$$\frac{\partial \xi_i}{\partial \mathrm{vec}(Z^{m,i})^T} = \mathrm{vec}\left( (W^m)^T \frac{\partial \xi_i}{\partial S^{m,i}} \right)^T P^m_{\phi} P^m_{\mathrm{pad}}.$$

20. Gradient versus Jacobian II
For the Jacobian we have
$$\frac{\partial z^{L+1,i}}{\partial \mathrm{vec}(S^{m,i})^T} = \left( \frac{\partial z^{L+1,i}}{\partial \mathrm{vec}(Z^{m+1,i})^T} \odot \left( \mathbb{1}_{n_{L+1}} \mathrm{vec}(I[Z^{m+1,i}])^T \right) \right) P^{m,i}_{\mathrm{pool}},$$
$$\frac{\partial z^{L+1,i}}{\partial \mathrm{vec}(W^m)^T} = \begin{bmatrix} \mathrm{vec}\left( \frac{\partial z^{L+1,i}_1}{\partial S^{m,i}} \phi(\mathrm{pad}(Z^{m,i}))^T \right)^T \\ \vdots \\ \mathrm{vec}\left( \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial S^{m,i}} \phi(\mathrm{pad}(Z^{m,i}))^T \right)^T \end{bmatrix}.$$

21. Gradient versus Jacobian III
$$\frac{\partial z^{L+1,i}}{\partial \mathrm{vec}(Z^{m,i})^T} = \begin{bmatrix} \mathrm{vec}\left( (W^m)^T \frac{\partial z^{L+1,i}_1}{\partial S^{m,i}} \right)^T P^m_{\phi} P^m_{\mathrm{pad}} \\ \vdots \\ \mathrm{vec}\left( (W^m)^T \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial S^{m,i}} \right)^T P^m_{\phi} P^m_{\mathrm{pad}} \end{bmatrix}.$$

22. Implementation I
For the gradient we did
$$\Delta \leftarrow \mathrm{mat}\left( \mathrm{vec}(\Delta)^T P^{m,i}_{\mathrm{pool}} \right)$$
$$\frac{\partial \xi_i}{\partial W^m} = \Delta \cdot \phi(\mathrm{pad}(Z^{m,i}))^T$$
$$\Delta \leftarrow \mathrm{vec}\left( (W^m)^T \Delta \right)^T P^m_{\phi} P^m_{\mathrm{pad}}$$
$$\Delta \leftarrow \Delta \odot I[Z^{m,i}]$$
Now for the Jacobian we have similar settings, but there are some differences.
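
A minimal sketch of these four updates for one convolutional layer, assuming $d^m = 1$ and tiny sizes. The P_* matrices are shape-only 0/1 stand-ins, not real pooling/im2col/padding operators; mat() undoes the column-major vec().

```python
import numpy as np

rng = np.random.default_rng(0)
d_next, n_conv, n_pooled = 2, 4, 2  # d^{m+1}, a^m_conv b^m_conv, a^{m+1} b^{m+1}
h, n_pad, n_z = 3, 6, 4             # patch size, padded size, original size (d^m = 1)

vec = lambda A: A.reshape(-1, order="F")
mat = lambda v, shp: np.asarray(v).reshape(shp, order="F")

P_pool = np.eye(d_next * n_pooled, d_next * n_conv)  # placeholder pooling matrix
P_phi = np.eye(h * n_conv, n_pad)                    # placeholder im2col matrix
P_pad = np.eye(n_pad, n_z)                           # placeholder padding matrix

Delta = rng.standard_normal((d_next, n_pooled))      # arriving from layer m+1
Phi = rng.standard_normal((h, n_conv))               # phi(pad(Z^{m,i}))
W = rng.standard_normal((d_next, h))                 # W^m
Z_m = rng.standard_normal(n_z)                       # vec(Z^{m,i})

Delta = mat(vec(Delta) @ P_pool, (d_next, n_conv))   # now d(xi)/dS^{m,i}
grad_W = Delta @ Phi.T                               # d(xi)/dW^m
Delta = vec(W.T @ Delta) @ P_phi @ P_pad             # d(xi)/dvec(Z^{m,i})^T
Delta = Delta * (Z_m > 0)                            # odot I[Z^{m,i}]
print(grad_W.shape, Delta.shape)                     # (2, 3) (4,)
```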

23. Implementation II
We don't really store the Jacobian:
$$\frac{\partial z^{L+1,i}}{\partial \mathrm{vec}(W^m)^T} = \begin{bmatrix} \mathrm{vec}\left( \frac{\partial z^{L+1,i}_1}{\partial S^{m,i}} \phi(\mathrm{pad}(Z^{m,i}))^T \right)^T \\ \vdots \\ \mathrm{vec}\left( \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial S^{m,i}} \phi(\mathrm{pad}(Z^{m,i}))^T \right)^T \end{bmatrix}.$$
Recall the Jacobian is used for matrix-vector products:
$$G_S v = \frac{1}{C} v + \frac{1}{|S|} \sum_{i \in S} (J^i)^T \left( B^i (J^i v) \right). \qquad (2)$$
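
Equation (2) translates directly into a loop that never forms $G_S$ explicitly. A sketch with random $J^i$ and placeholder positive semidefinite $B^i$; all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_out, n_param, C = 3, 10, 1.0   # n_{L+1}, total #parameters, regularization constant
S = range(4)                     # indices of the sampled subset

J = [rng.standard_normal((n_out, n_param)) for _ in S]   # J^i
B = []
for _ in S:                      # B^i: PSD stand-ins (e.g., loss Hessian blocks)
    M = rng.standard_normal((n_out, n_out))
    B.append(M @ M.T)

def gauss_newton_matvec(v):
    """Compute G_S v as in (2), one instance at a time."""
    Gv = v / C
    for i in S:
        Gv = Gv + J[i].T @ (B[i] @ (J[i] @ v)) / len(S)
    return Gv

v = rng.standard_normal(n_param)
print(gauss_newton_matvec(v).shape)   # (10,)
```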

24. Implementation III
The form
$$\frac{\partial z^{L+1,i}}{\partial \mathrm{vec}(W^m)^T} = \begin{bmatrix} \mathrm{vec}\left( \frac{\partial z^{L+1,i}_1}{\partial S^{m,i}} \phi(\mathrm{pad}(Z^{m,i}))^T \right)^T \\ \vdots \\ \mathrm{vec}\left( \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial S^{m,i}} \phi(\mathrm{pad}(Z^{m,i}))^T \right)^T \end{bmatrix}$$
is like the product of two things.

25. Implementation IV
If we have
$$\frac{\partial z^{L+1,i}_1}{\partial S^{m,i}}, \ldots, \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial S^{m,i}} \quad \text{and} \quad \phi(\mathrm{pad}(Z^{m,i})),$$
we can probably do the matrix-vector product without multiplying these two things out. We will talk about this again later.
Thus our Jacobian evaluation focuses solely on obtaining
$$\frac{\partial z^{L+1,i}_1}{\partial S^{m,i}}, \ldots, \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial S^{m,i}}.$$
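
The saving can be checked numerically: since row $j$ of the block is $\mathrm{vec}(G_j \Phi^T)^T$ with $G_j = \partial z^{L+1,i}_j / \partial S^{m,i}$ and $\Phi = \phi(\mathrm{pad}(Z^{m,i}))$, we have $(Jv)_j = \sum \left( G_j \odot (V\Phi) \right)$ with $V = \mathrm{mat}(v)$, so the product $G_j \Phi^T$ never needs to be formed. A sketch with assumed toy shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
n_out, d_out, positions, h = 3, 2, 4, 3

G = rng.standard_normal((n_out, d_out, positions))  # dz_j^{L+1,i}/dS^{m,i}
Phi = rng.standard_normal((h, positions))           # phi(pad(Z^{m,i}))
v = rng.standard_normal(d_out * h)                  # slice of v matching vec(W^m)

vec = lambda A: A.reshape(-1, order="F")            # column-major vec()
V = v.reshape((d_out, h), order="F")                # mat(v), same shape as W^m

# Explicit: materialize each Jacobian row, then multiply.
J = np.stack([vec(G[j] @ Phi.T) for j in range(n_out)])
explicit = J @ v

# Implicit: contract G[j] with V @ Phi; no Jacobian row is ever formed.
implicit = np.array([np.sum(G[j] * (V @ Phi)) for j in range(n_out)])
print(np.allclose(explicit, implicit))              # True
```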

26. Implementation V
Further, we need to take all data (or the data in the selected subset) into account. In the end, what we have is the following procedure.
In the beginning,
$$\Delta \in \mathbb{R}^{d^{m+1} a^{m+1} b^{m+1} \times n_{L+1} \times l}.$$
This corresponds to
$$\frac{\partial z^{L+1,i}}{\partial \mathrm{vec}(Z^{m+1,i})^T} \odot \left( \mathbb{1}_{n_{L+1}} \mathrm{vec}(I[Z^{m+1,i}])^T \right), \quad \forall i = 1, \ldots, l.$$
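
A sketch of this all-instances layout, with illustrative sizes: Delta holds one slice per instance, and a single broadcast applies the ReLU indicator to every output row of every instance at once.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pooled, n_out, l = 8, 3, 5  # d^{m+1} a^{m+1} b^{m+1}, n_{L+1}, #instances

Delta = rng.standard_normal((n_pooled, n_out, l))   # one slice per instance i
Z_next = rng.standard_normal((n_pooled, l))         # vec(Z^{m+1,i}) for all i

# Broadcast the indicator over the n_{L+1} axis: the 1_{n_{L+1}} vec(I[.])^T term.
Delta = Delta * (Z_next[:, None, :] > 0)
print(Delta.shape)                                  # (8, 3, 5)
```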
