

1. Newton Methods for Neural Networks: Gauss-Newton Matrix-Vector Product
Chih-Jen Lin, National Taiwan University
Last updated: June 1, 2020

2. Outline
(1) Backward setting: Jacobian evaluation; Gauss-Newton matrix-vector products
(2) Forward + backward settings: R operator; Gauss-Newton matrix-vector product

3. Backward setting: Outline
(1) Backward setting: Jacobian evaluation; Gauss-Newton matrix-vector products
(2) Forward + backward settings: R operator; Gauss-Newton matrix-vector product

4. Backward setting / Jacobian evaluation: Outline
(1) Backward setting: Jacobian evaluation; Gauss-Newton matrix-vector products
(2) Forward + backward settings: R operator; Gauss-Newton matrix-vector product

5. Jacobian Evaluation: Convolutional Layer I
For an instance $i$, the Jacobian can be partitioned into $L$ blocks according to layers:
$$J^i = \begin{bmatrix} J^{1,i} & J^{2,i} & \cdots & J^{L,i} \end{bmatrix}, \quad m = 1, \ldots, L, \qquad (1)$$
where
$$J^{m,i} = \begin{bmatrix} \dfrac{\partial z^{L+1,i}}{\partial (b^m)^T} & \dfrac{\partial z^{L+1,i}}{\partial \mathrm{vec}(W^m)^T} \end{bmatrix}.$$
The calculation seems to be very similar to that for the gradient.
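
To fix ideas, here is a minimal NumPy sketch of the partition in (1). All sizes are made up, and the blocks are left as zeros, since only the block structure is being illustrated; J_blocks[m] plays the role of $J^{m,i}$.

```python
import numpy as np

n_out = 3             # n_{L+1}: number of network outputs
n_params = [8, 20, 6] # assumed per-layer counts of b^m plus vec(W^m) entries (L = 3)

# One Jacobian block per layer; the column count matches that layer's parameters.
J_blocks = [np.zeros((n_out, p)) for p in n_params]

# The full per-instance Jacobian J^i is the horizontal concatenation of the blocks.
J_i = np.hstack(J_blocks)
print(J_i.shape)      # (3, 34): n_{L+1} rows, one column per model parameter
```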

6. Jacobian Evaluation: Convolutional Layer II
For the convolutional layers, recall that for the gradient we have
$$\frac{\partial f}{\partial W^m} = \frac{1}{C} W^m + \frac{1}{l} \sum_{i=1}^{l} \frac{\partial \xi_i}{\partial W^m}$$
and
$$\frac{\partial \xi_i}{\partial \mathrm{vec}(W^m)^T} = \mathrm{vec}\left( \frac{\partial \xi_i}{\partial S^{m,i}} \phi(\mathrm{pad}(Z^{m,i}))^T \right)^T.$$

7. Jacobian Evaluation: Convolutional Layer III
Now we have
$$\frac{\partial z^{L+1,i}}{\partial \mathrm{vec}(W^m)^T} = \begin{bmatrix} \frac{\partial z^{L+1,i}_1}{\partial \mathrm{vec}(W^m)^T} \\ \vdots \\ \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial \mathrm{vec}(W^m)^T} \end{bmatrix} = \begin{bmatrix} \mathrm{vec}\left( \frac{\partial z^{L+1,i}_1}{\partial S^{m,i}} \phi(\mathrm{pad}(Z^{m,i}))^T \right)^T \\ \vdots \\ \mathrm{vec}\left( \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial S^{m,i}} \phi(\mathrm{pad}(Z^{m,i}))^T \right)^T \end{bmatrix}.$$
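
A small NumPy sketch of this row-stacking under assumed toy shapes: G[j] stands for $\partial z^{L+1,i}_j / \partial S^{m,i}$, Phi for $\phi(\mathrm{pad}(Z^{m,i}))$, and vec() is column-major to match the column-stacking convention of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n_out, d_out, positions, h = 3, 2, 4, 5   # n_{L+1}, filters, conv positions, patch size

G = rng.standard_normal((n_out, d_out, positions))  # G[j] = dz_j^{L+1,i}/dS^{m,i}
Phi = rng.standard_normal((h, positions))           # phi(pad(Z^{m,i}))

vec = lambda A: A.reshape(-1, order="F")            # column-major vec()

# Row j of the Jacobian block w.r.t. vec(W^m) is vec(G[j] @ Phi.T); stack all rows.
J_W = np.stack([vec(G[j] @ Phi.T) for j in range(n_out)])
print(J_W.shape)                                    # (3, 10): one row per output z_j
```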

8. Jacobian Evaluation: Convolutional Layer IV
If $b^m$ is considered, the result is
$$\begin{bmatrix} \dfrac{\partial z^{L+1,i}}{\partial (b^m)^T} & \dfrac{\partial z^{L+1,i}}{\partial \mathrm{vec}(W^m)^T} \end{bmatrix} = \begin{bmatrix} \left( \frac{\partial z^{L+1,i}_1}{\partial S^{m,i}} \mathbb{1}_{a^m_{\mathrm{conv}} b^m_{\mathrm{conv}}} \right)^T & \mathrm{vec}\left( \frac{\partial z^{L+1,i}_1}{\partial S^{m,i}} \phi(\mathrm{pad}(Z^{m,i}))^T \right)^T \\ \vdots & \vdots \\ \left( \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial S^{m,i}} \mathbb{1}_{a^m_{\mathrm{conv}} b^m_{\mathrm{conv}}} \right)^T & \mathrm{vec}\left( \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial S^{m,i}} \phi(\mathrm{pad}(Z^{m,i}))^T \right)^T \end{bmatrix}.$$

9. Jacobian Evaluation: Convolutional Layer V
We can see that this is more complicated than the gradient: the gradient is a vector, but the Jacobian is a matrix.

10. Jacobian Evaluation: Backward Process I
For the gradient, earlier we needed a backward process to calculate
$$\frac{\partial \xi_i}{\partial S^{m,i}}.$$
Now what we need are
$$\frac{\partial z^{L+1,i}_1}{\partial S^{m,i}}, \ldots, \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial S^{m,i}}.$$
The process is similar.

11. Jacobian Evaluation: Backward Process II
With the ReLU activation function and max pooling, for the gradient we had
$$\frac{\partial \xi_i}{\partial \mathrm{vec}(S^{m,i})^T} = \left( \frac{\partial \xi_i}{\partial \mathrm{vec}(Z^{m+1,i})^T} \odot \mathrm{vec}(I[Z^{m+1,i}])^T \right) P^{m,i}_{\mathrm{pool}}.$$

12. Jacobian Evaluation: Backward Process III
Assume that $\partial z^{L+1,i} / \partial \mathrm{vec}(Z^{m+1,i})^T$ is available. Then
$$\frac{\partial z^{L+1,i}_j}{\partial \mathrm{vec}(S^{m,i})^T} = \left( \frac{\partial z^{L+1,i}_j}{\partial \mathrm{vec}(Z^{m+1,i})^T} \odot \mathrm{vec}(I[Z^{m+1,i}])^T \right) P^{m,i}_{\mathrm{pool}}, \quad j = 1, \ldots, n_{L+1}.$$

13. Jacobian Evaluation: Backward Process IV
These row vectors can be written together as a matrix:
$$\frac{\partial z^{L+1,i}}{\partial \mathrm{vec}(S^{m,i})^T} = \left( \frac{\partial z^{L+1,i}}{\partial \mathrm{vec}(Z^{m+1,i})^T} \odot \left( \mathbb{1}_{n_{L+1}} \mathrm{vec}(I[Z^{m+1,i}])^T \right) \right) P^{m,i}_{\mathrm{pool}}.$$
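
A NumPy sketch of this stacked step, with hypothetical sizes and a hand-built pooling matrix: broadcasting the indicator row across all $n_{L+1}$ rows realizes the $\mathbb{1}_{n_{L+1}} \mathrm{vec}(I[Z^{m+1,i}])^T$ term.

```python
import numpy as np

rng = np.random.default_rng(0)
n_out, n_pooled, n_conv = 3, 4, 8   # n_{L+1}, pooled size, conv-output size

dZ_next = rng.standard_normal((n_out, n_pooled))  # dz^{L+1,i}/dvec(Z^{m+1,i})^T
Z_next = rng.standard_normal(n_pooled)            # vec(Z^{m+1,i})

# P_pool: one 1 per row, marking which conv output each pooled value came from.
P_pool = np.zeros((n_pooled, n_conv))
P_pool[np.arange(n_pooled), [0, 3, 5, 6]] = 1.0   # hypothetical max positions

indicator = (Z_next > 0).astype(float)            # vec(I[Z^{m+1,i}])^T
dS = (dZ_next * indicator) @ P_pool               # all n_{L+1} rows at once
print(dS.shape)                                   # (3, 8)
```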

14. Jacobian Evaluation: Backward Process V
For the gradient, we use $\partial \xi_i / \partial S^{m,i}$ to obtain
$$\frac{\partial \xi_i}{\partial \mathrm{vec}(Z^{m,i})^T} = \mathrm{vec}\left( (W^m)^T \frac{\partial \xi_i}{\partial S^{m,i}} \right)^T P^m_{\phi} P^m_{\mathrm{pad}}$$
and pass it to the previous layer.

15. Jacobian Evaluation: Backward Process VI
Now we need to generate $\partial z^{L+1,i} / \partial \mathrm{vec}(Z^{m,i})^T$ and pass it to the previous layer:
$$\frac{\partial z^{L+1,i}}{\partial \mathrm{vec}(Z^{m,i})^T} = \begin{bmatrix} \mathrm{vec}\left( (W^m)^T \frac{\partial z^{L+1,i}_1}{\partial S^{m,i}} \right)^T P^m_{\phi} P^m_{\mathrm{pad}} \\ \vdots \\ \mathrm{vec}\left( (W^m)^T \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial S^{m,i}} \right)^T P^m_{\phi} P^m_{\mathrm{pad}} \end{bmatrix}.$$

16. Jacobian Evaluation: Fully-connected Layer I
We do not discuss details, but list all results below:
$$\frac{\partial z^{L+1,i}}{\partial \mathrm{vec}(W^m)^T} = \begin{bmatrix} \mathrm{vec}\left( \frac{\partial z^{L+1,i}_1}{\partial s^{m,i}} (z^{m,i})^T \right) & \cdots & \mathrm{vec}\left( \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial s^{m,i}} (z^{m,i})^T \right) \end{bmatrix}^T.$$

17. Jacobian Evaluation: Fully-connected Layer II
$$\frac{\partial z^{L+1,i}}{\partial (b^m)^T} = \frac{\partial z^{L+1,i}}{\partial (s^{m,i})^T},$$
$$\frac{\partial z^{L+1,i}}{\partial (s^{m,i})^T} = \frac{\partial z^{L+1,i}}{\partial (z^{m+1,i})^T} \odot \left( \mathbb{1}_{n_{L+1}} I[z^{m+1,i}]^T \right),$$
$$\frac{\partial z^{L+1,i}}{\partial (z^{m,i})^T} = \frac{\partial z^{L+1,i}}{\partial (s^{m,i})^T} W^m.$$
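
A compact NumPy sketch of the three fully-connected formulas, with assumed small sizes; ds stands for $\partial z^{L+1,i} / \partial (s^{m,i})^T$, and the ReLU indicator is broadcast across the $n_{L+1}$ rows.

```python
import numpy as np

rng = np.random.default_rng(0)
n_out, n_next, n_m = 3, 4, 5                   # n_{L+1}, n_{m+1}, n_m

dz_next = rng.standard_normal((n_out, n_next)) # dz^{L+1,i}/d(z^{m+1,i})^T
z_next = rng.standard_normal(n_next)           # z^{m+1,i}
z_m = rng.standard_normal(n_m)                 # z^{m,i}
W = rng.standard_normal((n_next, n_m))         # W^m

vec = lambda A: A.reshape(-1, order="F")       # column-major vec()

ds = dz_next * (z_next > 0)                    # odot (1_{n_{L+1}} I[z^{m+1,i}]^T)
J_b = ds                                       # block w.r.t. b^m
J_W = np.stack([vec(np.outer(ds[j], z_m)) for j in range(n_out)])  # w.r.t. vec(W^m)
J_z = ds @ W                                   # passed to the previous layer
print(J_b.shape, J_W.shape, J_z.shape)         # (3, 4) (3, 20) (3, 5)
```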

18. Jacobian Evaluation: Fully-connected Layer III
For layer $L+1$, if using the squared loss and the linear activation function, we have
$$\frac{\partial z^{L+1,i}}{\partial (s^{L,i})^T} = I_{n_{L+1}}.$$
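
In code this is just the seed of the backward recursion: a trivial sketch, assuming $n_{L+1} = 3$.

```python
import numpy as np

n_out = 3              # n_{L+1}
dz_ds = np.eye(n_out)  # dz^{L+1,i}/d(s^{L,i})^T = I_{n_{L+1}}
# This identity initializes the recursion that then runs over m = L, L-1, ..., 1.
```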

19. Gradient versus Jacobian I
Operations for the gradient:
$$\frac{\partial \xi_i}{\partial \mathrm{vec}(S^{m,i})^T} = \left( \frac{\partial \xi_i}{\partial \mathrm{vec}(Z^{m+1,i})^T} \odot \mathrm{vec}(I[Z^{m+1,i}])^T \right) P^{m,i}_{\mathrm{pool}},$$
$$\frac{\partial \xi_i}{\partial W^m} = \frac{\partial \xi_i}{\partial S^{m,i}} \phi(\mathrm{pad}(Z^{m,i}))^T,$$
$$\frac{\partial \xi_i}{\partial \mathrm{vec}(Z^{m,i})^T} = \mathrm{vec}\left( (W^m)^T \frac{\partial \xi_i}{\partial S^{m,i}} \right)^T P^m_{\phi} P^m_{\mathrm{pad}}.$$

20. Gradient versus Jacobian II
For the Jacobian we have
$$\frac{\partial z^{L+1,i}}{\partial \mathrm{vec}(S^{m,i})^T} = \left( \frac{\partial z^{L+1,i}}{\partial \mathrm{vec}(Z^{m+1,i})^T} \odot \left( \mathbb{1}_{n_{L+1}} \mathrm{vec}(I[Z^{m+1,i}])^T \right) \right) P^{m,i}_{\mathrm{pool}},$$
$$\frac{\partial z^{L+1,i}}{\partial \mathrm{vec}(W^m)^T} = \begin{bmatrix} \mathrm{vec}\left( \frac{\partial z^{L+1,i}_1}{\partial S^{m,i}} \phi(\mathrm{pad}(Z^{m,i}))^T \right)^T \\ \vdots \\ \mathrm{vec}\left( \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial S^{m,i}} \phi(\mathrm{pad}(Z^{m,i}))^T \right)^T \end{bmatrix}.$$

21. Gradient versus Jacobian III
$$\frac{\partial z^{L+1,i}}{\partial \mathrm{vec}(Z^{m,i})^T} = \begin{bmatrix} \mathrm{vec}\left( (W^m)^T \frac{\partial z^{L+1,i}_1}{\partial S^{m,i}} \right)^T P^m_{\phi} P^m_{\mathrm{pad}} \\ \vdots \\ \mathrm{vec}\left( (W^m)^T \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial S^{m,i}} \right)^T P^m_{\phi} P^m_{\mathrm{pad}} \end{bmatrix}.$$

22. Implementation I
For the gradient we did
$$\Delta \leftarrow \mathrm{mat}\left( \mathrm{vec}(\Delta)^T P^{m,i}_{\mathrm{pool}} \right)$$
$$\frac{\partial \xi_i}{\partial W^m} = \Delta \cdot \phi(\mathrm{pad}(Z^{m,i}))^T$$
$$\Delta \leftarrow \mathrm{vec}\left( (W^m)^T \Delta \right)^T P^m_{\phi} P^m_{\mathrm{pad}}$$
$$\Delta \leftarrow \Delta \odot I[Z^{m,i}]$$
Now for the Jacobian we have similar settings, but there are some differences.
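
A minimal sketch of these four updates for one convolutional layer, assuming $d^m = 1$ and tiny sizes. The P_* matrices are shape-only 0/1 stand-ins, not real pooling/im2col/padding operators; mat() undoes the column-major vec().

```python
import numpy as np

rng = np.random.default_rng(0)
d_next, n_conv, n_pooled = 2, 4, 2  # d^{m+1}, a^m_conv b^m_conv, a^{m+1} b^{m+1}
h, n_pad, n_z = 3, 6, 4             # patch size, padded size, original size (d^m = 1)

vec = lambda A: A.reshape(-1, order="F")
mat = lambda v, shp: np.asarray(v).reshape(shp, order="F")

P_pool = np.eye(d_next * n_pooled, d_next * n_conv)  # placeholder pooling matrix
P_phi = np.eye(h * n_conv, n_pad)                    # placeholder im2col matrix
P_pad = np.eye(n_pad, n_z)                           # placeholder padding matrix

Delta = rng.standard_normal((d_next, n_pooled))      # arriving from layer m+1
Phi = rng.standard_normal((h, n_conv))               # phi(pad(Z^{m,i}))
W = rng.standard_normal((d_next, h))                 # W^m
Z_m = rng.standard_normal(n_z)                       # vec(Z^{m,i})

Delta = mat(vec(Delta) @ P_pool, (d_next, n_conv))   # now d(xi)/dS^{m,i}
grad_W = Delta @ Phi.T                               # d(xi)/dW^m
Delta = vec(W.T @ Delta) @ P_phi @ P_pad             # d(xi)/dvec(Z^{m,i})^T
Delta = Delta * (Z_m > 0)                            # odot I[Z^{m,i}]
print(grad_W.shape, Delta.shape)                     # (2, 3) (4,)
```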

23. Implementation II
We don't really store the Jacobian:
$$\frac{\partial z^{L+1,i}}{\partial \mathrm{vec}(W^m)^T} = \begin{bmatrix} \mathrm{vec}\left( \frac{\partial z^{L+1,i}_1}{\partial S^{m,i}} \phi(\mathrm{pad}(Z^{m,i}))^T \right)^T \\ \vdots \\ \mathrm{vec}\left( \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial S^{m,i}} \phi(\mathrm{pad}(Z^{m,i}))^T \right)^T \end{bmatrix}.$$
Recall the Jacobian is used for matrix-vector products:
$$G_S v = \frac{1}{C} v + \frac{1}{|S|} \sum_{i \in S} (J^i)^T \left( B^i (J^i v) \right). \qquad (2)$$
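
Equation (2) translates directly into a loop that never forms $G_S$ explicitly. A sketch with random $J^i$ and placeholder positive semidefinite $B^i$; all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_out, n_param, C = 3, 10, 1.0   # n_{L+1}, total #parameters, regularization constant
S = range(4)                     # indices of the sampled subset

J = [rng.standard_normal((n_out, n_param)) for _ in S]   # J^i
B = []
for _ in S:                      # B^i: PSD stand-ins (e.g., loss Hessian blocks)
    M = rng.standard_normal((n_out, n_out))
    B.append(M @ M.T)

def gauss_newton_matvec(v):
    """Compute G_S v as in (2), one instance at a time."""
    Gv = v / C
    for i in S:
        Gv = Gv + J[i].T @ (B[i] @ (J[i] @ v)) / len(S)
    return Gv

v = rng.standard_normal(n_param)
print(gauss_newton_matvec(v).shape)   # (10,)
```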

24. Implementation III
The form
$$\frac{\partial z^{L+1,i}}{\partial \mathrm{vec}(W^m)^T} = \begin{bmatrix} \mathrm{vec}\left( \frac{\partial z^{L+1,i}_1}{\partial S^{m,i}} \phi(\mathrm{pad}(Z^{m,i}))^T \right)^T \\ \vdots \\ \mathrm{vec}\left( \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial S^{m,i}} \phi(\mathrm{pad}(Z^{m,i}))^T \right)^T \end{bmatrix}$$
is like the product of two things.

25. Implementation IV
If we have
$$\frac{\partial z^{L+1,i}_1}{\partial S^{m,i}}, \ldots, \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial S^{m,i}} \quad \text{and} \quad \phi(\mathrm{pad}(Z^{m,i})),$$
we can probably do the matrix-vector product without multiplying these two things out. We will talk about this again later.
Thus our Jacobian evaluation focuses solely on obtaining
$$\frac{\partial z^{L+1,i}_1}{\partial S^{m,i}}, \ldots, \frac{\partial z^{L+1,i}_{n_{L+1}}}{\partial S^{m,i}}.$$
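
The saving can be checked numerically: since row $j$ of the block is $\mathrm{vec}(G_j \Phi^T)^T$ with $G_j = \partial z^{L+1,i}_j / \partial S^{m,i}$ and $\Phi = \phi(\mathrm{pad}(Z^{m,i}))$, we have $(Jv)_j = \sum \left( G_j \odot (V\Phi) \right)$ with $V = \mathrm{mat}(v)$, so the product $G_j \Phi^T$ never needs to be formed. A sketch with assumed toy shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
n_out, d_out, positions, h = 3, 2, 4, 3

G = rng.standard_normal((n_out, d_out, positions))  # dz_j^{L+1,i}/dS^{m,i}
Phi = rng.standard_normal((h, positions))           # phi(pad(Z^{m,i}))
v = rng.standard_normal(d_out * h)                  # slice of v matching vec(W^m)

vec = lambda A: A.reshape(-1, order="F")            # column-major vec()
V = v.reshape((d_out, h), order="F")                # mat(v), same shape as W^m

# Explicit: materialize each Jacobian row, then multiply.
J = np.stack([vec(G[j] @ Phi.T) for j in range(n_out)])
explicit = J @ v

# Implicit: contract G[j] with V @ Phi; no Jacobian row is ever formed.
implicit = np.array([np.sum(G[j] * (V @ Phi)) for j in range(n_out)])
print(np.allclose(explicit, implicit))              # True
```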

26. Implementation V
Further, we need to take all data (or the data in the selected subset) into account. In the end, what we have is the following procedure.
In the beginning,
$$\Delta \in \mathbb{R}^{d^{m+1} a^{m+1} b^{m+1} \times n_{L+1} \times l}.$$
This corresponds to
$$\frac{\partial z^{L+1,i}}{\partial \mathrm{vec}(Z^{m+1,i})^T} \odot \left( \mathbb{1}_{n_{L+1}} \mathrm{vec}(I[Z^{m+1,i}])^T \right), \quad \forall i = 1, \ldots, l.$$
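
A sketch of this all-instances layout, with illustrative sizes: Delta holds one slice per instance, and a single broadcast applies the ReLU indicator to every output row of every instance at once.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pooled, n_out, l = 8, 3, 5  # d^{m+1} a^{m+1} b^{m+1}, n_{L+1}, #instances

Delta = rng.standard_normal((n_pooled, n_out, l))   # one slice per instance i
Z_next = rng.standard_normal((n_pooled, l))         # vec(Z^{m+1,i}) for all i

# Broadcast the indicator over the n_{L+1} axis: the 1_{n_{L+1}} vec(I[.])^T term.
Delta = Delta * (Z_next[:, None, :] > 0)
print(Delta.shape)                                  # (8, 3, 5)
```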
