

  1. AdaGrad Stepsizes: Sharp Convergence Over Nonconvex Landscapes
     Xiaoxia (Shirley) Wu⋆, PhD Candidate, The University of Texas at Austin
     June 11th, 2019
     ⋆ Joint work with Rachel Ward and Léon Bottou, at Facebook AI Research.

  2. Outline
     Motivations
     Theoretical Contributions: we provide a novel convergence result for AdaGrad-Norm that emphasizes its robustness to hyper-parameter tuning over nonconvex landscapes.
     Practical Implications

  3. Outline
     Motivations
     Theoretical Contributions: we provide a novel convergence result for AdaGrad-Norm that emphasizes its robustness to hyper-parameter tuning over nonconvex landscapes.
     Practical Implications

  4. Motivation
     Problem Setup: Given a differentiable non-convex function F : ℝ^d → ℝ,
     ◮ ‖∇F(x) − ∇F(y)‖ ≤ L ‖x − y‖, ∀ x, y ∈ ℝ^d

  5. Motivation
     Problem Setup: Given a differentiable non-convex function F : ℝ^d → ℝ,
     ◮ ‖∇F(x) − ∇F(y)‖ ≤ L ‖x − y‖, ∀ x, y ∈ ℝ^d
     Our desired goal ⇒ min_{x ∈ ℝ^d} F(x)
     We can achieve ⇒ ‖∇F(x)‖² ≤ ε

  6. Motivation
     Problem Setup: Given a differentiable non-convex function F : ℝ^d → ℝ,
     ◮ ‖∇F(x) − ∇F(y)‖ ≤ L ‖x − y‖, ∀ x, y ∈ ℝ^d
     Our desired goal ⇒ min_{x ∈ ℝ^d} F(x)
     We can achieve ⇒ ‖∇F(x)‖² ≤ ε
     Algorithm: Stochastic Gradient Descent (SGD). At the j-th iteration,
         x_{j+1} ← x_j − η_j G(x_j),   (1)
     where E[G(x_j)] = ∇F(x_j) and η_j > 0 is the stepsize.
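To make update (1) concrete, here is a minimal Python sketch of plain SGD with a user-chosen stepsize sequence. The quadratic objective and the Gaussian gradient noise are illustrative assumptions, not part of the talk.

    import numpy as np

    def sgd(grad_oracle, x0, stepsize, num_iters):
        """Plain SGD, update (1): x_{j+1} = x_j - eta_j * G(x_j), with E[G(x_j)] = grad F(x_j)."""
        x = x0.copy()
        for j in range(num_iters):
            g = grad_oracle(x)         # stochastic gradient G(x_j)
            x = x - stepsize(j) * g    # step with eta_j = stepsize(j)
        return x

    # Illustrative toy problem (assumed): F(x) = 0.5 * ||x||^2, so grad F(x) = x,
    # observed through additive Gaussian noise.
    rng = np.random.default_rng(0)
    grad_oracle = lambda x: x + 0.1 * rng.standard_normal(x.shape)
    x_out = sgd(grad_oracle, x0=np.ones(10), stepsize=lambda j: 1.0 / (j + 1), num_iters=1000)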

  7. Motivation
     Algorithm: SGD. Set a sequence {η_j}_{j≥0} for x_{j+1} ← x_j − η_j G(x_j).
     Q: How to set the sequence {η_j}_{j≥0}?
     [1] E[‖G(x) − ∇F(x)‖²] ≤ σ²

  8. Motivation
     Algorithm: SGD. Set a sequence {η_j}_{j≥0} for x_{j+1} ← x_j − η_j G(x_j).
     Q: How to set the sequence {η_j}_{j≥0}?
     Difficulty in Choosing Stepsizes: the classical Robbins/Monro theory (Robbins and Monro, 1951) says that if
         Σ_{j=1}^∞ η_j = ∞  and  Σ_{j=1}^∞ η_j² < ∞,   (2)
     and the variance of the gradient is bounded [1], then lim_{j→∞} E[‖∇F(x_j)‖²] = 0.
     [1] E[‖G(x) − ∇F(x)‖²] ≤ σ²

  9. Motivation
     Algorithm: SGD. Set a sequence {η_j}_{j≥0} for x_{j+1} ← x_j − η_j G(x_j).
     Q: How to set the sequence {η_j}_{j≥0}?
     Difficulty in Choosing Stepsizes: the classical Robbins/Monro theory (Robbins and Monro, 1951) says that if
         Σ_{j=1}^∞ η_j = ∞  and  Σ_{j=1}^∞ η_j² < ∞,   (3)
     and the variance of the gradient is bounded, then lim_{j→∞} E[‖∇F(x_j)‖²] = 0.
     However, the rule is too general for practical applications (for example, every schedule η_j = c/j with c > 0 satisfies (3)).

  10. Motivation
      Algorithm: SGD. Set a sequence {η_j}_{j≥0} for x_{j+1} ← x_j − η_j G(x_j).
      Possible Choice: Manual Tuning
          η_j = { η,      j ≤ T_1
                  α_1 η,  T_1 ≤ j ≤ T_2
                  α_2 η,  T_2 ≤ j ≤ T_3
                  ... }
      [2] ‖∇F(x) − ∇F(y)‖ ≤ L ‖x − y‖, ∀ x, y ∈ ℝ^d

  11. Motivation
      Algorithm: SGD. Set a sequence {η_j}_{j≥0} for x_{j+1} ← x_j − η_j G(x_j).
      Possible Choice: Manual Tuning
          η_j = { η,      j ≤ T_1
                  α_1 η,  T_1 ≤ j ≤ T_2
                  α_2 η,  T_2 ≤ j ≤ T_3
                  ... }
      However, tuning η, α_1, α_2, T_1, T_2, ... is computationally costly. In particular, it requires η ≤ 2/L [2].
      [2] ‖∇F(x) − ∇F(y)‖ ≤ L ‖x − y‖, ∀ x, y ∈ ℝ^d
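As a small illustration of the manual-tuning schedule above, here is one way it might be coded; the breakpoints T_k and decay factors α_k below are hypothetical placeholders, since in practice they are exactly the quantities that must be searched over.

    def piecewise_constant(eta, alphas, breakpoints):
        """Return j -> eta_j for the schedule eta_j = alpha_k * eta on [T_k, T_{k+1})."""
        def stepsize(j):
            factor = 1.0                        # alpha_0 = 1 before the first breakpoint
            for T, alpha in zip(breakpoints, alphas):
                if j >= T:
                    factor = alpha              # keep the factor of the last breakpoint passed
            return eta * factor
        return stepsize

    # Hypothetical schedule: start at eta = 0.1, drop by 10x at iterations 500 and 800.
    stepsize = piecewise_constant(eta=0.1, alphas=[0.1, 0.01], breakpoints=[500, 800])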

  12. Motivation
      Algorithm: SGD with Adaptive Stepsize. Set a sequence {b_j}_{j≥0}; for ℓ = 1, 2, ..., d,
          [x_{j+1}]_ℓ ← [x_j]_ℓ − (η / [b_{j+1}]_ℓ) [G(x_j)]_ℓ
      Possible Choice: Adaptive Gradient Methods. Among many variants, one is AdaGrad:
          ([b_{j+1}]_ℓ)² = ([b_j]_ℓ)² + ([G(x_j)]_ℓ)²

  13. Motivation
      Algorithm: SGD with Adaptive Stepsize. Set a sequence {b_j}_{j≥0}; for ℓ = 1, 2, ..., d,
          [x_{j+1}]_ℓ ← [x_j]_ℓ − (η / [b_{j+1}]_ℓ) [G(x_j)]_ℓ
      Possible Choice: Adaptive Gradient Methods. Among many variants, one is AdaGrad:
          ([b_{j+1}]_ℓ)² = ([b_j]_ℓ)² + ([G(x_j)]_ℓ)²
      ◮ It helps with "increasing the stepsize for more sparse parameters and decreasing the stepsize for less sparse ones." (Duchi et al., 2011)

  14. Motivation
      Algorithm: SGD with Adaptive Stepsize. Set a sequence {b_j}_{j≥0}; for ℓ = 1, 2, ..., d,
          [x_{j+1}]_ℓ ← [x_j]_ℓ − (η / [b_{j+1}]_ℓ) [G(x_j)]_ℓ
      Possible Choice: Adaptive Gradient Methods. Among many variants, one is AdaGrad:
          ([b_{j+1}]_ℓ)² = ([b_j]_ℓ)² + ([G(x_j)]_ℓ)²
      ◮ It helps with "increasing the stepsize for more sparse parameters and decreasing the stepsize for less sparse ones." (Duchi et al., 2011)
      ◮ However, coordinate-wise AdaGrad changes the optimization problem by introducing "bias" in the solutions, leading to worse generalization (Wilson et al., 2017).
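A minimal sketch of the coordinate-wise AdaGrad update shown above, in the same b_0 and η notation; the gradient oracle and the default values are illustrative assumptions, not prescriptions from the talk.

    import numpy as np

    def adagrad(grad_oracle, x0, eta=0.1, b0=1e-2, num_iters=1000):
        """Coordinate-wise AdaGrad:
        ([b_{j+1}]_l)^2 = ([b_j]_l)^2 + ([G(x_j)]_l)^2,
        [x_{j+1}]_l = [x_j]_l - (eta / [b_{j+1}]_l) * [G(x_j)]_l."""
        x = x0.astype(float)
        b_sq = np.full_like(x, b0 ** 2)     # one accumulator per coordinate
        for _ in range(num_iters):
            g = grad_oracle(x)
            b_sq += g ** 2                  # accumulate squared gradient, coordinate-wise
            x -= eta / np.sqrt(b_sq) * g    # per-coordinate effective stepsize eta / [b_{j+1}]_l
        return x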

  15. Motivation
      Algorithm: SGD with Adaptive Stepsize. Set a sequence {b_j}_{j≥0}; for ℓ = 1, 2, ..., d,
          [x_{j+1}]_ℓ ← [x_j]_ℓ − (η / b_{j+1}) [G(x_j)]_ℓ
      Possible Variant: Norm Version of AdaGrad
          b²_{j+1} = b²_j + ‖G(x_j)‖²   (AdaGrad-Norm)

  16. Motivation
      Algorithm: SGD with Adaptive Stepsize. Set a sequence {b_j}_{j≥0}; for ℓ = 1, 2, ..., d,
          [x_{j+1}]_ℓ ← [x_j]_ℓ − (η / b_{j+1}) [G(x_j)]_ℓ
      Possible Variant: Norm Version of AdaGrad
          b²_{j+1} = b²_j + ‖G(x_j)‖²   (AdaGrad-Norm)
      ◮ Auto-tuning property (Wu, Ward, and Bottou, 2018): robustness to the choices of the hyper-parameters b_0 and η; connection to Weight/Layer/Batch Normalization.
      ◮ Does not affect generalization.
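By contrast with the coordinate-wise sketch above, AdaGrad-Norm keeps a single scalar b_j instead of d per-coordinate accumulators; a minimal sketch under the same illustrative assumptions:

    import numpy as np

    def adagrad_norm(grad_oracle, x0, eta=0.1, b0=1e-2, num_iters=1000):
        """AdaGrad-Norm: b_{j+1}^2 = b_j^2 + ||G(x_j)||^2,
        x_{j+1} = x_j - (eta / b_{j+1}) * G(x_j)."""
        x = x0.astype(float)
        b_sq = b0 ** 2                      # single scalar accumulator b_j^2
        for _ in range(num_iters):
            g = grad_oracle(x)
            b_sq += float(np.dot(g, g))     # accumulate the squared gradient norm
            x -= eta / np.sqrt(b_sq) * g    # one effective stepsize eta / b_{j+1} for all coordinates
        return x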

  17. Outline
      Motivations
      Theoretical Contributions: we provide a novel convergence result for AdaGrad-Norm that emphasizes its robustness to hyper-parameter tuning over nonconvex landscapes.
      Practical Implications

  18. Theory
      Algorithm: SGD with Adaptive Stepsize
          x_{j+1} ← x_j − (η / b_{j+1}) G(x_j)   with   b²_{j+1} = b²_j + ‖G(x_j)‖²
      What is the convergence rate of AdaGrad-Norm?
      ◮ Intuition: if E[‖G(x_j)‖²] ≤ γ², then the effective stepsize η/b_j satisfies
            E[η / b_j] ≥ η / √(j γ² + b_0²)

  19. Theory
      Algorithm: SGD with Adaptive Stepsize
          x_{j+1} ← x_j − (η / b_{j+1}) G(x_j)   with   b²_{j+1} = b²_j + ‖G(x_j)‖²
      What is the convergence rate of AdaGrad-Norm?
      ◮ Intuition: if E[‖G(x_j)‖²] ≤ γ², then the effective stepsize η/b_j satisfies
            E[η / b_j] ≥ η / √(j γ² + b_0²)
      ◮ Convex landscapes: O(1/√T) (Levy, 2018)

  20. Theory
      Algorithm: SGD with Adaptive Stepsize
          x_{j+1} ← x_j − (η / b_{j+1}) G(x_j)   with   b²_{j+1} = b²_j + ‖G(x_j)‖²
      What is the convergence rate of AdaGrad-Norm?
      ◮ Intuition: if E[‖G(x_j)‖²] ≤ γ², then the effective stepsize η/b_j satisfies
            E[η / b_j] ≥ η / √(j γ² + b_0²)
      ◮ Convex landscapes: O(1/√T) (Levy, 2018)
      ◮ Nonconvex landscapes: O(log(T)/√T) (Ours, Theorem 2.1)
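The effective-stepsize intuition on this slide is a standard Jensen-type argument; a short sketch (not spelled out on the slide, using only b²_j = b_0² + Σ_{i<j} ‖G(x_i)‖² and E[‖G(x_i)‖²] ≤ γ²):

    % Taking expectations of the accumulator:
    E[b_j^2] = b_0^2 + \sum_{i=0}^{j-1} E\|G(x_i)\|^2 \le b_0^2 + j\gamma^2 .
    % Jensen's inequality, first with the convex map 1/t and then with the concave map \sqrt{t}:
    E\!\left[\frac{\eta}{b_j}\right] \ge \frac{\eta}{E[b_j]}
        \ge \frac{\eta}{\sqrt{E[b_j^2]}} \ge \frac{\eta}{\sqrt{j\gamma^2 + b_0^2}} .
    % So the effective stepsize cannot shrink faster than order 1/\sqrt{j} in expectation.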

  21. Theory
      Algorithm: SGD with Adaptive Stepsize
          (1) At the j-th iteration, generate ξ_j and G(x_j) = G(x_j, ξ_j)
          (2) x_{j+1} ← x_j − (η / b_{j+1}) G(x_j)   with   b²_{j+1} = b²_j + ‖G(x_j)‖²
      Theorem. Under the assumptions:
      1. The random vectors ξ_j, j = 0, 1, 2, ..., are mutually independent and also independent of x_j;
      2. Bounded variance [3]: E_{ξ_j}[‖G(x_j, ξ_j) − ∇F(x_j)‖²] ≤ σ²;
      3. Bounded gradient norm: ‖∇F(x_j)‖ ≤ γ uniformly;
      [3] E_{ξ_j} denotes the expectation with respect to ξ_j conditional on x_j.

  22. Theory
      Algorithm: SGD with Adaptive Stepsize
          (1) At the j-th iteration, generate ξ_j and G(x_j) = G(x_j, ξ_j)
          (2) x_{j+1} ← x_j − (η / b_{j+1}) G(x_j)   with   b²_{j+1} = b²_j + ‖G(x_j)‖²
      Theorem. Under the assumptions:
      1. The random vectors ξ_j, j = 0, 1, 2, ..., are mutually independent and also independent of x_j;
      2. Bounded variance [3]: E_{ξ_j}[‖G(x_j, ξ_j) − ∇F(x_j)‖²] ≤ σ²;
      3. Bounded gradient norm: ‖∇F(x_j)‖ ≤ γ uniformly;
      AdaGrad-Norm converges to a stationary point w.h.p. at the rate
          min_{ℓ=0,1,...,T−1} ‖∇F(x_ℓ)‖² ≤ C²/T + σC/√T,
      where C = Õ(log(T/b_0 + 1)) and Õ hides η, L and F(x_0) − F*.
      [3] E_{ξ_j} denotes the expectation with respect to ξ_j conditional on x_j.

  23. Theory
      Algorithm: SGD with Adaptive Stepsize
          (1) At the j-th iteration, generate ξ_j and G(x_j) = G(x_j, ξ_j)
          (2) x_{j+1} ← x_j − (η / b_{j+1}) G(x_j)   with   b²_{j+1} = b²_j + ‖G(x_j)‖²
      Challenges in the proof: b_{j+1} is a random variable correlated with ∇F(x_j) and G(x_j).
      ◮ The L-Lipschitz continuous gradient [4] gives
            F_{j+1} − F_j ≤ −(η / b_{j+1}) ‖∇F_j‖² + (η / b_{j+1}) ⟨∇F_j, ∇F_j − G_j⟩ + (η² L / (2 b²_{j+1})) ‖G_j‖²,
        where the inner-product term is the key term.
      [4] We write F(x_j) = F_j, ∇F(x_j) = ∇F_j and G(x_j) = G_j.
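For reference, the inequality above is the usual descent lemma for L-smooth functions applied to the AdaGrad-Norm step, with the inner product split so that the correlated term is isolated; a sketch of that step:

    % Descent lemma for L-smooth F, with x_{j+1} = x_j - (\eta / b_{j+1}) G_j:
    F_{j+1} \le F_j + \langle \nabla F_j,\, x_{j+1} - x_j \rangle + \frac{L}{2}\,\|x_{j+1} - x_j\|^2
            = F_j - \frac{\eta}{b_{j+1}}\,\langle \nabla F_j, G_j \rangle
                  + \frac{\eta^2 L}{2\, b_{j+1}^2}\,\|G_j\|^2 .
    % Writing  -\langle \nabla F_j, G_j \rangle = -\|\nabla F_j\|^2 + \langle \nabla F_j,\, \nabla F_j - G_j \rangle
    % gives the displayed bound. In plain SGD the middle ("key") term vanishes in conditional
    % expectation; here b_{j+1} depends on G_j, so this term need not vanish, and handling the
    % correlation is the main difficulty in the proof.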
