

SLIDE 1

Step Size Matters in Deep Learning

Kamil Nar and Shankar Sastry
Neural Information Processing Systems, December 4, 2018

SLIDE 2

Gradient Descent: Effect of Step Size

Example

min_{x ∈ ℝ} (x² + 1)(x − 1)²(x − 2)²

with the two global minima x*_1 = 1 and x*_2 = 2.

[Plot of f(x) over x, with both minima marked.]

SLIDE 3

Gradient Descent: Effect of Step Size

Example

min_{x ∈ ℝ} (x² + 1)(x − 1)²(x − 2)²

with the two global minima x*_1 = 1 and x*_2 = 2.

[Plot of f(x) over x, with both minima marked.]

From random initialization, gradient descent
  • converges to x*_1 only if δ ≤ 0.5
  • converges to x*_2 only if δ ≤ 0.2

SLIDE 4

Gradient Descent: Effect of Step Size

Example

min_{x ∈ ℝ} (x² + 1)(x − 1)²(x − 2)²

with the two global minima x*_1 = 1 and x*_2 = 2.

[Plot of f(x) over x, with both minima marked.]

From random initialization, gradient descent
  • converges to x*_1 only if δ ≤ 0.5
  • converges to x*_2 only if δ ≤ 0.2

If the algorithm converges with δ = 0.3, the solution is x*_1.
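A minimal numerical sketch of this example (not from the slides; Python with NumPy, and the starting interval, iteration budget, and the step sizes 0.45, 0.3, 0.15 are illustrative choices sitting on either side of the thresholds 0.5 and 0.2 above):

    import numpy as np

    # f(x) = (x^2 + 1) (x - 1)^2 (x - 2)^2, the example from the slide
    def grad_f(x):
        return (2 * x * (x - 1) ** 2 * (x - 2) ** 2
                + (x ** 2 + 1) * 2 * (x - 1) * (x - 2) ** 2
                + (x ** 2 + 1) * (x - 1) ** 2 * 2 * (x - 2))

    def gradient_descent(step, x0, iters=5000):
        x = x0
        for _ in range(iters):
            x = x - step * grad_f(x)
            if abs(x) > 1e3:                 # treat blow-up as divergence
                return None
        return x if abs(grad_f(x)) < 1e-8 else None   # report only converged runs

    rng = np.random.default_rng(0)
    starts = rng.uniform(0.5, 2.5, size=50)
    # 0.45 and 0.3 respect the threshold 0.5 for x*_1 but violate 0.2 for x*_2;
    # 0.15 respects both thresholds, so both minima can appear as limits.
    for step in (0.45, 0.3, 0.15):
        limits = {round(x, 2) for x in (gradient_descent(step, s) for s in starts)
                  if x is not None}
        print(f"step size {step}: converged limit points {sorted(limits)}")

For the two larger step sizes only x*_1 should appear among the converged limit points; with step size 0.15 both minima should appear.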

SLIDE 5

Deep Linear Networks

x → W_L W_{L−1} ⋯ W_2 W_1 x

SLIDE 6

Deep Linear Networks

x → W_L W_{L−1} ⋯ W_2 W_1 x

  • The cost function has infinitely many local minima
  • Different dynamic characteristics at different optima (worked example below)
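For intuition, a worked two-layer scalar case (my own illustration, in the spirit of the next slides): fitting a scalar λ ≠ 0 with two weights,

min_{w_1, w_2} ½ (w_2 w_1 − λ)²,

every pair (c, λ/c) with c ≠ 0 is a global minimum, so the minima form a continuum. The Hessian at (c, λ/c) has eigenvalues 0 and c² + λ²/c², so the curvature seen by gradient descent depends on which minimum is approached: it is smallest, 2|λ|, at balanced factorizations with c² = |λ|, and it grows without bound as the factorization becomes disproportionate.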

SLIDE 7

Lyapunov Stability of Gradient Descent

Deep Linear Networks

Proposition

  • λ ∈ ℝ and λ ≠ 0
  • λ is estimated as the product of scalar parameters {w_i} by

min_{w_i} ½ (w_L ⋯ w_2 w_1 − λ)²

SLIDE 8

Lyapunov Stability of Gradient Descent

Deep Linear Networks

Proposition

  • λ ∈ ℝ and λ ≠ 0
  • λ is estimated as the product of scalar parameters {w_i} by

min_{w_i} ½ (w_L ⋯ w_2 w_1 − λ)²

For convergence to {w*_i} with w*_L ⋯ w*_2 w*_1 = λ, the step size must satisfy

δ ≤ 2 / Σ_{i=1}^{L} (λ / w*_i)²
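A quick numerical check of the proposition for L = 2 and λ = 1 (toy values of my own): at the disproportionate optimum (w*_1, w*_2) = (4, 0.25) the bound gives δ ≤ 2 / ((1/4)² + (1/0.25)²) ≈ 0.124, so gradient descent started right next to this optimum should stay near it only for step sizes below roughly 0.124; with a larger step it leaves that optimum (settling elsewhere or diverging).

    import numpy as np

    lam = 1.0
    w_star = np.array([4.0, 0.25])                 # disproportionate optimum, w_2 * w_1 = lam
    bound = 2.0 / np.sum((lam / w_star) ** 2)      # proposition's bound, about 0.124

    def gd_from(delta, iters=50_000):
        w = w_star + 1e-3                          # start just next to the optimum
        for _ in range(iters):
            err = w[0] * w[1] - lam
            grad = err * np.array([w[1], w[0]])    # gradient of 0.5 * (w_2 w_1 - lam)^2
            w = w - delta * grad
            if not np.all(np.isfinite(w)) or np.abs(w).max() > 1e6:
                return None                        # diverged
        return w

    print(f"step size bound for this optimum: {bound:.3f}")
    for delta in (0.1, 0.2):
        w = gd_from(delta)
        if w is None:
            print(f"delta = {delta}: diverged")
        else:
            print(f"delta = {delta}: final (w_1, w_2) = ({w[0]:.3f}, {w[1]:.3f})")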

SLIDE 9

Lyapunov Stability of Gradient Descent

Deep Linear Networks

  • δ needs to be very small for equilibria with disproportionate {w*_i}
  • For each δ, the algorithm can converge only to a subset of optima

SLIDE 10

Lyapunov Stability of Gradient Descent

Deep Linear Networks

  • δ needs to be very small for equilibria with disproportionate {w*_i}
  • For each δ, the algorithm can converge only to a subset of optima
  • No finite Lipschitz constant for the gradient on the whole parameter space (see the note below)
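To see the last bullet concretely (continuing the two-layer scalar illustration above, not from the slides): the gradient of ½(w_2 w_1 − λ)² has components (w_2 w_1 − λ) w_2 and (w_2 w_1 − λ) w_1, which are cubic in the parameters, so their derivatives grow without bound over the parameter space. No single Lipschitz constant for the gradient exists, and the usual smoothness-based step size rule cannot be applied globally.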

SLIDE 11

Deep Linear Networks

Theorem

  • {x_i}_{i∈[N]} satisfies (1/N) Σ_{i=1}^{N} x_i x_i⊤ = I
  • R is estimated as the product of the {W_j} by

min_{W_j} (1/(2N)) Σ_{i=1}^{N} ‖R x_i − W_L W_{L−1} ⋯ W_2 W_1 x_i‖₂²

SLIDE 12

Deep Linear Networks

Theorem

  • {x_i}_{i∈[N]} satisfies (1/N) Σ_{i=1}^{N} x_i x_i⊤ = I
  • R is estimated as the product of the {W_j} by

min_{W_j} (1/(2N)) Σ_{i=1}^{N} ‖R x_i − W_L W_{L−1} ⋯ W_2 W_1 x_i‖₂²

Assume the gradient descent algorithm with random initialization has converged to R̂. Then

ρ(R̂) ≤ (2 / (Lδ))^(L/(2L−2)) almost surely.

SLIDE 13

Deep Linear Networks

Theorem

  • {x_i}_{i∈[N]} satisfies (1/N) Σ_{i=1}^{N} x_i x_i⊤ = I
  • R is estimated as the product of the {W_j} by

min_{W_j} (1/(2N)) Σ_{i=1}^{N} ‖R x_i − W_L W_{L−1} ⋯ W_2 W_1 x_i‖₂²

Assume the gradient descent algorithm with random initialization has converged to R̂. Then

ρ(R̂) ≤ (2 / (Lδ))^(L/(2L−2)) almost surely.

  • Step size bounds the Lipschitz constant of the estimated function

SLIDE 14

Deep Linear Networks

Theorem

  • {x_i}_{i∈[N]} satisfies (1/N) Σ_{i=1}^{N} x_i x_i⊤ = I
  • R is estimated as the product of the {W_j} by

min_{W_j} (1/(2N)) Σ_{i=1}^{N} ‖R x_i − W_L W_{L−1} ⋯ W_2 W_1 x_i‖₂²

Assume the gradient descent algorithm with random initialization has converged to R̂. Then

ρ(R̂) ≤ (2 / (Lδ))^(L/(2L−2)) almost surely.

  • Step size bounds the Lipschitz constant of the estimated function
  • In contrast to ordinary least squares, where the step size does not constrain the converged estimate (numerical sketch below)
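To get a feel for the numbers (an illustrative sketch, not from the slides): the bound (2/(Lδ))^(L/(2L−2)) reduces to 1/δ at L = 2, and its exponent tends to 1/2 as L grows.

    # rho(R_hat) <= (2 / (L * delta)) ** (L / (2 * L - 2)) for a deep linear network of depth L.
    # For ordinary least squares (a single linear layer), the step size only affects whether
    # gradient descent converges, not which least-squares estimate it converges to, so no
    # analogous bound on the estimate appears.
    def rho_bound(L, delta):
        return (2.0 / (L * delta)) ** (L / (2 * L - 2))

    for L in (2, 3, 5, 10):
        for delta in (0.01, 0.1):
            print(f"L = {L:2d}, delta = {delta:5.2f}: rho(R_hat) <= {rho_bound(L, delta):8.2f}")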

SLIDE 15

Deep Linear Networks

Symmetric PSD matrices:

  • The bound is tight with identity initialization
  • Identity initialization allows convergence with the largest step size (scalar check below)
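A scalar sanity check of both bullets (my own sketch: R is taken as a 1×1 PSD matrix, i.e., a positive scalar λ, and every layer is a scalar weight). With identity initialization all layers stay equal along the trajectory, and convergence should occur precisely for λ below the bound (2/(Lδ))^(L/(2L−2)) from the previous theorem, i.e., the bound is attained:

    import numpy as np

    def rho_bound(L, delta):
        return (2.0 / (L * delta)) ** (L / (2 * L - 2))

    def identity_init_gd(lam, L, delta, iters=20_000):
        w = np.ones(L)                        # identity initialization: every scalar weight = 1
        for _ in range(iters):
            prod = np.prod(w)
            grad = (prod - lam) * prod / w    # d/dw_i of 0.5 * (prod - lam)^2, valid while w_i != 0
            w = w - delta * grad
            if not np.all(np.isfinite(w)) or np.abs(w).max() > 1e6:
                return False                  # diverged
        return abs(np.prod(w) - lam) < 1e-6

    L, delta = 3, 0.01
    print(f"bound: {rho_bound(L, delta):.2f}")     # about 23.4 for these values
    for lam in (20.0, 27.0):                       # one value below the bound, one above
        print(f"lambda = {lam}: converges = {identity_init_gd(lam, L, delta)}")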

SLIDE 16

Nonlinear Networks (Poster #8)

Two-layer ReLU network: x → W (V x − b)_+

SLIDE 17

Nonlinear Networks (Poster #8)

Two-layer ReLU network: x → W (V x − b)_+

Theorem

Let f : ℝⁿ → ℝᵐ be estimated by

min_{W,V} ½ Σ_{i=1}^{N} ‖W (V x_i − b)_+ − f(x_i)‖₂²

SLIDE 18

Nonlinear Networks (Poster #8)

Two-layer ReLU network: x → W (V x − b)_+

Theorem

Let f : ℝⁿ → ℝᵐ be estimated by

min_{W,V} ½ Σ_{i=1}^{N} ‖W (V x_i − b)_+ − f(x_i)‖₂²

If the algorithm converges, then the estimate f̂(x_i) satisfies

max_{i∈[N]} ‖x_i‖₂ ‖f̂(x_i)‖₂ ≤ 1/δ almost surely.
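One way to read this bound (an illustrative sketch; the data and the target f(x) = 3x below are made up): if gradient descent with step size δ converges, the fitted values cannot have max_i ‖x_i‖·‖f̂(x_i)‖ above 1/δ, so for a given data set an exact fit is ruled out once the step size is too large.

    import numpy as np

    # Made-up 1-D data and target f(x) = 3x, used only to evaluate the constraint.
    X = np.array([[0.5], [1.0], [2.0]])
    Y = 3.0 * X
    needed = max(np.linalg.norm(x) * np.linalg.norm(y) for x, y in zip(X, Y))

    for delta in (0.05, 0.1, 0.5):
        allowed = 1.0 / delta
        verdict = "exact fit possible" if needed <= allowed else "exact fit ruled out at this step size"
        print(f"delta = {delta}: max ||x_i|| ||f_hat(x_i)|| <= {allowed:.1f}, "
              f"exact fit needs {needed:.1f} -> {verdict}")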
