  1. Step Size Matters in Deep Learning. Kamil Nar, Shankar Sastry. Neural Information Processing Systems, December 4, 2018.

  2. Gradient Descent: Effect of Step Size. Example: min_{x ∈ ℝ} (x² + 1)(x − 1)²(x − 2)². [Plot of f(x) showing its two local minima, x₁* = 1 and x₂* = 2.]

  3. Gradient Descent: Effect of Step Size. Example: min_{x ∈ ℝ} (x² + 1)(x − 1)²(x − 2)², with local minima x₁* = 1 and x₂* = 2. From random initialization, gradient descent converges to x₁* only if δ ≤ 0.5, and converges to x₂* only if δ ≤ 0.2.

  4. Gradient Descent: Effect of Step Size. Example: min_{x ∈ ℝ} (x² + 1)(x − 1)²(x − 2)², with local minima x₁* = 1 and x₂* = 2. From random initialization, gradient descent converges to x₁* only if δ ≤ 0.5, and converges to x₂* only if δ ≤ 0.2. Hence, if the algorithm converges with δ = 0.3, the solution is x₁*.
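The example above is easy to reproduce numerically. Below is a minimal NumPy sketch (my own illustration, not code from the talk): it runs gradient descent on f(x) = (x² + 1)(x − 1)²(x − 2)² from random starting points in an arbitrarily chosen interval and reports the limits of the runs that settle. The step size 0.15 is my choice below the 0.2 threshold so that both minima can appear; at δ = 0.3 only x₁* = 1 should remain among the converged runs.

```python
import numpy as np

def grad_f(x):
    # f(x) = (x^2 + 1)(x - 1)^2 (x - 2)^2, differentiated by the product rule
    return (2 * x * (x - 1)**2 * (x - 2)**2
            + (x**2 + 1) * 2 * (x - 1) * (x - 2)**2
            + (x**2 + 1) * (x - 1)**2 * 2 * (x - 2))

def run_gd(x0, delta, steps=5000):
    x = x0
    for _ in range(steps):
        x = x - delta * grad_f(x)
        if not np.isfinite(x) or abs(x) > 1e6:
            return None                          # run diverged
    return x if abs(grad_f(x)) < 1e-8 else None  # keep only runs that settled at a critical point

rng = np.random.default_rng(0)
starts = rng.uniform(0.6, 2.3, size=20)
for delta in (0.3, 0.15):
    limits = [run_gd(x0, delta) for x0 in starts]
    limits = sorted({round(float(x), 3) for x in limits if x is not None})
    # Expected: only 1.0 appears at delta = 0.3; both 1.0 and 2.0 at delta = 0.15.
    print(f"delta = {delta}: converged limits = {limits}")
```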

  5. Deep Linear Networks: x ↦ W_L W_{L−1} ··· W_2 W_1 x

  6. Deep Linear Networks: x ↦ W_L W_{L−1} ··· W_2 W_1 x. The cost function has infinitely many local minima, and different optima have different dynamic characteristics.

  7. Lyapunov Stability of Gradient Descent: Deep Linear Networks. Proposition: let λ ∈ ℝ with λ ≠ 0, and estimate λ as a product of scalar parameters {w_i} by solving min_{w_i} (1/2)(w_L ··· w_2 w_1 − λ)².

  8. Lyapunov Stability of Gradient Descent: Deep Linear Networks. Proposition: let λ ∈ ℝ with λ ≠ 0, and estimate λ as a product of scalar parameters {w_i} by solving min_{w_i} (1/2)(w_L ··· w_2 w_1 − λ)². For convergence to {w_i*} with w_L* ··· w_2* w_1* = λ, the step size must satisfy δ ≤ 2 / Σ_{i=1}^L (λ / w_i*)².

  9. Lyapunov Stability of Gradient Descent: Deep Linear Networks. δ needs to be very small for equilibria with disproportionate {w_i*}. For each δ, the algorithm can converge only to a subset of the optima.

  10. Lyapunov Stability of Gradient Descent: Deep Linear Networks. δ needs to be very small for equilibria with disproportionate {w_i*}. For each δ, the algorithm can converge only to a subset of the optima. There is no finite Lipschitz constant for the gradient over the whole parameter space.
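The step-size condition in the proposition is easy to evaluate. The short sketch below (plain NumPy, my own illustration) computes the largest stable step size 2 / Σ_{i=1}^L (λ / w_i*)² for two factorizations of the same λ — one balanced, one deliberately disproportionate — which makes the point above concrete: disproportionate equilibria are reachable only with a much smaller δ. The function name and the example factorizations are mine.

```python
import numpy as np

def max_stable_step(ws, lam):
    """Largest step size delta for which the equilibrium {w_i} (with product lam)
    remains Lyapunov stable: delta <= 2 / sum_i (lam / w_i)^2."""
    ws = np.asarray(ws, dtype=float)
    assert np.isclose(np.prod(ws), lam)
    return 2.0 / np.sum((lam / ws) ** 2)

lam = 8.0
print(max_stable_step([2.0, 2.0, 2.0], lam))    # balanced factors:        ~0.042
print(max_stable_step([0.1, 1.0, 80.0], lam))   # disproportionate factors: ~0.0003
```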

  11. Deep Linear Networks. Theorem: assume {x_i}_{i ∈ [N]} satisfies (1/N) Σ_{i=1}^N x_i x_iᵀ = I, and R is estimated as a product of matrices {W_j} by solving min_{W_j} (1/(2N)) Σ_{i=1}^N ‖R x_i − W_L W_{L−1} ··· W_2 W_1 x_i‖₂².

  12. Deep Linear Networks. Theorem: assume {x_i}_{i ∈ [N]} satisfies (1/N) Σ_{i=1}^N x_i x_iᵀ = I, and R is estimated as a product of matrices {W_j} by solving min_{W_j} (1/(2N)) Σ_{i=1}^N ‖R x_i − W_L W_{L−1} ··· W_2 W_1 x_i‖₂². Assume the gradient descent algorithm with random initialization has converged to R̂. Then ρ(R̂) ≤ (2 / (Lδ))^{L/(2L−2)} almost surely.

  13. Deep Linear Networks. Theorem: assume {x_i}_{i ∈ [N]} satisfies (1/N) Σ_{i=1}^N x_i x_iᵀ = I, and R is estimated as a product of matrices {W_j} by solving min_{W_j} (1/(2N)) Σ_{i=1}^N ‖R x_i − W_L W_{L−1} ··· W_2 W_1 x_i‖₂². Assume the gradient descent algorithm with random initialization has converged to R̂. Then ρ(R̂) ≤ (2 / (Lδ))^{L/(2L−2)} almost surely. The step size bounds the Lipschitz constant of the estimated function.

  14. Deep Linear Networks. Theorem: assume {x_i}_{i ∈ [N]} satisfies (1/N) Σ_{i=1}^N x_i x_iᵀ = I, and R is estimated as a product of matrices {W_j} by solving min_{W_j} (1/(2N)) Σ_{i=1}^N ‖R x_i − W_L W_{L−1} ··· W_2 W_1 x_i‖₂². Assume the gradient descent algorithm with random initialization has converged to R̂. Then ρ(R̂) ≤ (2 / (Lδ))^{L/(2L−2)} almost surely. The step size bounds the Lipschitz constant of the estimated function, in contrast with ordinary least squares.
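A rough numerical illustration of the theorem, with arbitrary choices of R, depth, and step sizes (a sketch, not the authors' code). Since the inputs are assumed whitened, the objective reduces to (1/2)‖W_L ··· W_1 − R‖_F², so the sketch runs gradient descent directly on that form from a near-identity initialization. For the two smaller step sizes the bound (2/(Lδ))^{L/(2L−2)} exceeds ρ(R) = 3 and training should recover R; for δ = 0.2 the bound drops to about 2.47 < 3, so that run should either fail to converge or settle on an estimate whose spectral radius stays within the bound.

```python
import numpy as np

n, L = 3, 3
R = np.diag([3.0, 1.0, 0.2])             # assumed target map; spectral radius rho(R) = 3

def end_to_end(Ws):
    P = np.eye(n)
    for W in Ws:
        P = W @ P                         # builds W_L ... W_2 W_1
    return P

def train(delta, steps=20_000, seed=0):
    rng = np.random.default_rng(seed)
    Ws = [np.eye(n) + 0.01 * rng.standard_normal((n, n)) for _ in range(L)]
    for _ in range(steps):
        E = end_to_end(Ws) - R            # gradient of (1/2)||W_L...W_1 - R||_F^2 w.r.t. the product
        if not np.all(np.isfinite(E)) or np.linalg.norm(E) > 1e12:
            return None                   # iterates blew up
        grads = [end_to_end(Ws[j + 1:]).T @ E @ end_to_end(Ws[:j]).T for j in range(L)]
        Ws = [W - delta * G for W, G in zip(Ws, grads)]
    if max(np.linalg.norm(G) for G in grads) > 1e-8:
        return None                       # still moving after `steps` iterations
    return end_to_end(Ws)

for delta in (0.02, 0.05, 0.2):
    bound = (2.0 / (L * delta)) ** (L / (2 * L - 2))
    W_hat = train(delta)
    if W_hat is None:
        print(f"delta={delta}: bound={bound:.2f}, gradient descent did not converge")
    else:
        rho = max(abs(np.linalg.eigvals(W_hat)))
        print(f"delta={delta}: bound={bound:.2f}, rho(R_hat)={rho:.2f}")
```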

  15. Deep Linear Networks. Symmetric PSD matrices: the bound is tight with identity initialization, and identity initialization allows convergence with the largest step size.

  16. Nonlinear Networks (Poster #8). Two-layer ReLU network: x ↦ W(Vx − b)₊

  17. Nonlinear Networks (Poster #8). Two-layer ReLU network: x ↦ W(Vx − b)₊. Theorem: let f : ℝⁿ → ℝᵐ be estimated by solving min_{W,V} (1/2) Σ_{i=1}^N ‖W(V x_i − b)₊ − f(x_i)‖₂².

  18. Nonlinear Networks (Poster #8). Two-layer ReLU network: x ↦ W(Vx − b)₊. Theorem: let f : ℝⁿ → ℝᵐ be estimated by solving min_{W,V} (1/2) Σ_{i=1}^N ‖W(V x_i − b)₊ − f(x_i)‖₂². If the algorithm converges, then the estimate f̂ satisfies max_{i ∈ [N]} ‖x_i‖ ‖f̂(x_i)‖ ≤ 1/δ almost surely.
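A small numerical check of this bound, with made-up data and network sizes (a sketch, not the authors' experiment): fit a two-layer ReLU network by full-batch gradient descent at step size δ and compare max_i ‖x_i‖ ‖f̂(x_i)‖ against 1/δ. The bias b is held fixed at zero, the targets are tanh of a random linear map, and with the small δ used here the inequality is loose; the point is only that the quantity the theorem constrains stays below 1/δ for the fitted network.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, hidden, N, delta = 4, 2, 16, 50, 0.01

X = rng.standard_normal((N, n))               # synthetic inputs x_i (one per row)
Y = np.tanh(X @ rng.standard_normal((n, m)))  # synthetic targets f(x_i)
b = np.zeros(hidden)                          # bias kept fixed at zero in this sketch

V = 0.1 * rng.standard_normal((hidden, n))
W = 0.1 * rng.standard_normal((m, hidden))

# Full-batch gradient descent on (1/2) sum_i ||W(V x_i - b)_+ - f(x_i)||^2.
for _ in range(100_000):
    H = np.maximum(X @ V.T - b, 0.0)          # hidden activations (V x_i - b)_+
    E = H @ W.T - Y                           # residuals W(V x_i - b)_+ - f(x_i)
    grad_W = E.T @ H
    grad_V = ((E @ W) * (H > 0)).T @ X        # chain rule through the ReLU mask
    W -= delta * grad_W
    V -= delta * grad_V

F_hat = np.maximum(X @ V.T - b, 0.0) @ W.T    # fitted values f_hat(x_i)
lhs = np.max(np.linalg.norm(X, axis=1) * np.linalg.norm(F_hat, axis=1))
tgt = np.max(np.linalg.norm(X, axis=1) * np.linalg.norm(Y, axis=1))
print(f"max_i ||x_i|| ||f_hat(x_i)|| = {lhs:.2f}, same quantity for the targets = {tgt:.2f}, 1/delta = {1/delta:.0f}")
```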
