AdaGrad Stepsizes: Sharp Convergence Over Nonconvex Landscapes
Xiaoxia (Shirley) Wu
PhD Candidate, The University of Texas at Austin
June 11th, 2019
Joint work with Rachel Ward and Léon Bottou (Facebook AI Research)
Outline
◮ Motivations
◮ Theoretical Contributions: we provide a novel convergence result for AdaGrad-Norm, emphasizing its robustness to hyper-parameter tuning over nonconvex landscapes.
◮ Practical Implications
Motivation
Problem Setup
Given a differentiable nonconvex function $F : \mathbb{R}^d \to \mathbb{R}$ whose gradient is L-Lipschitz:
◮ $\|\nabla F(x) - \nabla F(y)\| \le L\|x - y\|, \quad \forall x, y \in \mathbb{R}^d$
Our desired goal: $\min_{x \in \mathbb{R}^d} F(x)$. What we can achieve: $\|\nabla F(x)\|^2 \le \varepsilon$.
Algorithm
Stochastic Gradient Descent (SGD) at the j-th iteration:
$x_{j+1} \leftarrow x_j - \eta_j G(x_j), \tag{1}$
where $\mathbb{E}[G(x_j)] = \nabla F(x_j)$ and $\eta_j > 0$ is the stepsize.
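A minimal sketch of this update in Python (NumPy), assuming a caller-supplied stochastic gradient oracle grad_oracle and stepsize schedule stepsize (both hypothetical names, not part of the talk):

import numpy as np

def sgd(x0, grad_oracle, stepsize, num_iters):
    """SGD: x_{j+1} = x_j - eta_j * G(x_j).

    grad_oracle(x) returns an unbiased estimate G(x) of grad F(x);
    stepsize(j) returns eta_j > 0. Both are caller-supplied.
    """
    x = np.asarray(x0, dtype=float)
    for j in range(num_iters):
        x = x - stepsize(j) * grad_oracle(x)
    return x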
Motivation
Algorithm: SGD
Set a sequence $\{\eta_j\}_{j \ge 0}$ for $x_{j+1} \leftarrow x_j - \eta_j G(x_j)$.
Q: How do we set the sequence $\{\eta_j\}_{j \ge 0}$?
Difficulty in Choosing Stepsizes
The classical theory of Robbins and Monro (1951): if
$\sum_{j=1}^{\infty} \eta_j = \infty \quad \text{and} \quad \sum_{j=1}^{\infty} \eta_j^2 < \infty, \tag{2}$
and the variance of the gradient is bounded¹, then $\lim_{j \to \infty} \mathbb{E}[\|\nabla F(x_j)\|^2] = 0$. For example, $\eta_j = c/j$ satisfies both conditions. However, this rule is too general for practical applications.
¹ $\mathbb{E}[\|G(x) - \nabla F(x)\|^2] \le \sigma^2$.
Motivation
Algorithm: SGD
Set a sequence $\{\eta_j\}_{j \ge 0}$ for $x_{j+1} \leftarrow x_j - \eta_j G(x_j)$.
Possible Choice: Manual Tuning
$\eta_j = \begin{cases} \eta, & j \le T_1 \\ \alpha_1 \eta, & T_1 \le j \le T_2 \\ \alpha_2 \eta, & T_2 \le j \le T_3 \\ \;\vdots \end{cases}$
However, tuning $\eta, \alpha_1, \alpha_2, T_1, T_2, \dots$ is computationally costly. In particular, it requires $\eta \le 2/L$.²
² $\|\nabla F(x) - \nabla F(y)\| \le L\|x - y\|, \ \forall x, y \in \mathbb{R}^d$.
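A minimal sketch of such a step-decay schedule in Python; the values of eta, the milestones T_i, and the factors alpha_i below are illustrative placeholders, not tuned choices:

def step_decay(eta, milestones, factors):
    """Piecewise-constant schedule: eta until T1, alpha1*eta until T2, ...

    milestones = [T1, T2, ...] and factors = [alpha1, alpha2, ...] are
    exactly the hyper-parameters that must be tuned by hand.
    """
    def stepsize(j):
        scale = 1.0
        for T, alpha in zip(milestones, factors):
            if j >= T:
                scale = alpha
        return scale * eta
    return stepsize

# Illustrative values only; finding good ones is the costly part.
stepsize = step_decay(eta=0.1, milestones=[1000, 2000], factors=[0.1, 0.01])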
Motivation
Algorithm: SGD with Adaptive Stepsize
Set a sequence $\{b_j\}_{j \ge 0}$; for $\ell = 1, 2, \dots, d$:
$[x_{j+1}]_\ell \leftarrow [x_j]_\ell - \frac{\eta}{[b_{j+1}]_\ell}[G(x_j)]_\ell$
Possible Choice: Adaptive Gradient Methods
Among many variants, one is AdaGrad:
$([b_{j+1}]_\ell)^2 = ([b_j]_\ell)^2 + ([G(x_j)]_\ell)^2$
◮ It helps with "increasing the stepsize for more sparse parameters and decreasing the stepsize for less sparse ones" (Duchi et al., 2011).
◮ However, "coordinate" AdaGrad changes the optimization problem by introducing a "bias" in the solutions, leading to worse generalization (Wilson et al., 2017).
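A minimal sketch of this coordinatewise update in Python (NumPy), assuming the same hypothetical unbiased gradient oracle grad_oracle(x) as before:

import numpy as np

def adagrad_coordinate(x0, grad_oracle, eta, b0, num_iters):
    """Coordinatewise AdaGrad: each coordinate l keeps its own accumulator
    [b_{j+1}]_l^2 = [b_j]_l^2 + [G(x_j)]_l^2 and its own effective stepsize."""
    x = np.asarray(x0, dtype=float)
    b_sq = np.full_like(x, b0 ** 2)    # per-coordinate accumulators [b_j]_l^2
    for _ in range(num_iters):
        g = grad_oracle(x)
        b_sq += g ** 2                 # update each coordinate's accumulator
        x -= eta / np.sqrt(b_sq) * g   # per-coordinate effective stepsize
    return x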
Motivation
Algorithm: SGD with Adaptive Stepsize
Set a sequence $\{b_j\}_{j \ge 0}$; for $\ell = 1, 2, \dots, d$:
$[x_{j+1}]_\ell \leftarrow [x_j]_\ell - \frac{\eta}{b_{j+1}}[G(x_j)]_\ell$
Possible Variant: Norm Version of AdaGrad
(AdaGrad-Norm) $\; b_{j+1}^2 = b_j^2 + \|G(x_j)\|^2$
◮ Auto-tuning property (Wu, Ward, and Bottou, 2018): robustness to the choices of the hyper-parameters $b_0$ and $\eta$; connection to Weight/Layer/Batch Normalization.
◮ Does not affect generalization.
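A minimal sketch of AdaGrad-Norm in Python (NumPy); compared with the coordinatewise version, only the scalar accumulator changes (grad_oracle remains the hypothetical unbiased-gradient oracle):

import numpy as np

def adagrad_norm(x0, grad_oracle, eta, b0, num_iters):
    """AdaGrad-Norm: a single scalar accumulator
    b_{j+1}^2 = b_j^2 + ||G(x_j)||^2 scales the whole update."""
    x = np.asarray(x0, dtype=float)
    b_sq = b0 ** 2
    for _ in range(num_iters):
        g = grad_oracle(x)
        b_sq += np.dot(g, g)           # b_{j+1}^2 = b_j^2 + ||G(x_j)||^2
        x -= eta / np.sqrt(b_sq) * g   # same stepsize for every coordinate
    return x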
Theory
Algorithm: SGD with Adaptive Stepsize
$x_{j+1} \leftarrow x_j - \frac{\eta}{b_{j+1}} G(x_j)$ with $b_{j+1}^2 = b_j^2 + \|G(x_j)\|^2$
What is the convergence rate of AdaGrad-Norm?
◮ Intuition: if $\mathbb{E}[\|G(x_j)\|^2] \le \gamma^2$, then the effective stepsize $\frac{\eta}{b_j}$ satisfies
$\mathbb{E}\left[\frac{\eta}{b_j}\right] \ge \frac{\eta}{\sqrt{j\gamma^2 + b_0^2}}$
◮ Convex landscapes: $O\!\left(\frac{1}{\sqrt{T}}\right)$ (Levy, 2018)
◮ Nonconvex landscapes: $O\!\left(\frac{\log(T)}{\sqrt{T}}\right)$ (Ours, Theorem 2.1)
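The lower bound on the effective stepsize follows from two short Jensen steps (a sketch in the slide's notation):

% Unrolling b_j^2 = b_0^2 + \sum_{i<j} ||G(x_i)||^2 and using
% E[||G(x_i)||^2] <= \gamma^2 gives E[b_j^2] <= b_0^2 + j\gamma^2.
% Then, since 1/t is convex and \sqrt{t} is concave,
\mathbb{E}\!\left[\frac{\eta}{b_j}\right]
  \;\ge\; \frac{\eta}{\mathbb{E}[b_j]}
  \;\ge\; \frac{\eta}{\sqrt{\mathbb{E}[b_j^2]}}
  \;\ge\; \frac{\eta}{\sqrt{b_0^2 + j\gamma^2}}.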
Theory
Algorithm: SGD with Adaptive Stepsize
(1) At the j-th iteration, generate $\xi_j$ and $G(x_j) = G(x_j, \xi_j)$;
(2) $x_{j+1} \leftarrow x_j - \frac{\eta}{b_{j+1}} G(x_j)$ with $b_{j+1}^2 = b_j^2 + \|G(x_j)\|^2$.
Theorem
Under the assumptions:
1. the random vectors $\xi_j$, $j = 0, 1, 2, \dots$, are mutually independent and also independent of $x_j$;
2. bounded variance³: $\mathbb{E}_{\xi_j}[\|G(x_j, \xi_j) - \nabla F(x_j)\|^2] \le \sigma^2$;
3. uniformly bounded gradient norm: $\|\nabla F(x_j)\| \le \gamma$;
AdaGrad-Norm converges to a stationary point w.h.p. at the rate
$\min_{\ell = 0, 1, \dots, T-1} \|\nabla F(x_\ell)\|^2 \le \frac{C^2}{T} + \frac{\sigma C}{\sqrt{T}}$,
where $C = O(\log(T/b_0 + 1))$ and $O$ hides $\eta$, $L$, and $F(x_0) - F^*$.
³ The expectation is taken with respect to $\xi_j$, conditional on $x_j$.
Theory
Algorithm: SGD with Adaptive Stepsize
(1) At the j-th iteration, generate $\xi_j$ and $G(x_j) = G(x_j, \xi_j)$;
(2) $x_{j+1} \leftarrow x_j - \frac{\eta}{b_{j+1}} G(x_j)$ with $b_{j+1}^2 = b_j^2 + \|G(x_j)\|^2$.
Challenges in the proof: $b_{j+1}$ is a random variable correlated with $\nabla F(x_j)$ and $G(x_j)$.
◮ The L-Lipschitz continuous gradient⁴ yields
$\frac{F_{j+1} - F_j}{\eta} \le -\frac{\|\nabla F_j\|^2}{b_{j+1}} + \underbrace{\frac{\langle \nabla F_j, \nabla F_j - G_j \rangle}{b_{j+1}}}_{\text{KeyTerm}} + \frac{\eta L \|G_j\|^2}{2 b_{j+1}^2}.$
◮ Unlike standard SGD with a constant stepsize, here $\mathbb{E}_{\xi_j}\left[\frac{\langle \nabla F_j, \nabla F_j - G_j \rangle}{b_{j+1}}\right] \ne 0$.
◮ New techniques are needed to bound KeyTerm: a careful tower rule, Cauchy-Schwarz, Hölder's inequality, etc.
⁴ We write $F(x_j) = F_j$, $\nabla F(x_j) = \nabla F_j$, and $G(x_j) = G_j$.
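For completeness, the descent inequality above follows from L-smoothness by a standard expansion (sketched here in the slide's notation):

% L-smoothness: F_{j+1} <= F_j + <\nabla F_j, x_{j+1} - x_j> + (L/2)||x_{j+1} - x_j||^2.
% Substitute x_{j+1} - x_j = -(\eta / b_{j+1}) G_j, divide by \eta, and split
% -<\nabla F_j, G_j> = -||\nabla F_j||^2 + <\nabla F_j, \nabla F_j - G_j>:
\frac{F_{j+1} - F_j}{\eta}
  \;\le\; -\frac{\langle \nabla F_j, G_j \rangle}{b_{j+1}} + \frac{\eta L \|G_j\|^2}{2 b_{j+1}^2}
  \;=\; -\frac{\|\nabla F_j\|^2}{b_{j+1}}
        + \frac{\langle \nabla F_j, \nabla F_j - G_j \rangle}{b_{j+1}}
        + \frac{\eta L \|G_j\|^2}{2 b_{j+1}^2}.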
Practice
AdaGrad-Norm
We show that AdaGrad-Norm converges⁵ at the rate
$\min_{\ell = 0, 1, \dots, T-1} \|\nabla F(x_\ell)\|^2 \le O\!\left(\frac{C_1}{T} + \frac{\sigma C_2}{\sqrt{T}}\right)$,
where the constants $C_1$ and $C_2$ are explicit and robust to the hyper-parameters $b_0$ and $\eta$. Recall: $\mathbb{E}_{\xi_j}[\|G(x_j, \xi_j) - \nabla F(x_j)\|^2] \le \sigma^2$.
◮ For $\sigma \approx 0$: suppose we know $F^*$ and set $\eta = F(x_0) - F^*$; then the constant $C_1$ almost matches that of GD with the best stepsize.⁶
◮ For $\sigma > 0$: set $\eta = 1$; then the constant $C_2$ almost matches that of SGD with a well-tuned stepsize, up to a factor of $L \log(T/b_0 + 1)$.
⁵ Note: we combine Theorem 2.1 and Theorem 2.2.
⁶ For the case $b_1 \ge \eta L \approx \Delta L$, where $\Delta = F(x_0) - F^*$.
Practice: Synthetic Data with Linear Regression
[Figure 1: gradient norm at iterations 10, 2000, and 5000 (top row) and effective learning rate (bottom row), plotted against the initialization $b_0$, for AdaGrad_Norm, SGD_Constant, and SGD_DecaySqrt.]
Figure 1: Randomly initialized $x_0$ with $\eta = F(x_0) - F^* = 650 - 0 = 650$. Stepsizes: (AdaGrad-Norm) $650/b_j$; (SGD-Constant) $650/b_0$; (SGD-DecaySqrt) $650/(b_0\sqrt{j})$.
Practice: ResNet-18 on CIFAR10
[Figure 2: train and test accuracy at epochs 10, 60, and 120, plotted against the initialization $b_0^2$.]
Figure 2: Randomly initialized $x_0$ with $\eta = 1$. Stepsizes: (AdaGrad-Norm) $1/b_j$; (SGD-Constant) $1/b_0$; (SGD-DecaySqrt) $1/(b_0\sqrt{j})$.
AdaGrad-Norm code: https://github.com/xwuShirley/pytorch/blob/master/torch/optim/adagradnorm.py
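For reference, a minimal PyTorch-style sketch of the AdaGrad-Norm update as a custom optimizer; this is a simplified illustration of the rule above, not the code at the link:

import torch

class AdaGradNorm(torch.optim.Optimizer):
    """Sketch of AdaGrad-Norm: x <- x - (eta / b_{j+1}) * G(x), where the
    scalar accumulator satisfies b_{j+1}^2 = b_j^2 + ||G(x)||^2."""

    def __init__(self, params, eta=1.0, b0=1e-2):
        super().__init__(params, dict(eta=eta))
        self.b_sq = b0 ** 2  # scalar accumulator b_j^2

    @torch.no_grad()
    def step(self, closure=None):
        loss = None if closure is None else closure()
        # Accumulate the squared norm of the full stochastic gradient.
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is not None:
                    self.b_sq += p.grad.pow(2).sum().item()
        # Apply the same effective stepsize eta / b_{j+1} to every parameter.
        for group in self.param_groups:
            scale = group["eta"] / (self.b_sq ** 0.5)
            for p in group["params"]:
                if p.grad is not None:
                    p.add_(p.grad, alpha=-scale)
        return loss

Usage mirrors any torch optimizer: construct AdaGradNorm(model.parameters(), eta=1.0, b0=0.01), call loss.backward(), then step().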
Practice: ResNet-50 on ImageNet
[Figure 3: train and test accuracy at epochs 30, 50, and 90, plotted against the initialization $b_0^2$.]
Figure 3: Randomly initialized $x_0$ with $\eta = 1$. Stepsizes: (AdaGrad-Norm) $1/b_j$; (SGD-Constant) $1/b_0$; (SGD-DecaySqrt) $1/(b_0\sqrt{j})$.
Conclusion
◮ We provide a novel convergence result for AdaGrad-Norm in nonconvex optimization. The analysis is useful for other adaptive-type methods.
◮ The convergence bound for AdaGrad-Norm is explicit and comparable with a well-tuned stepsize choice in SGD, but without careful tuning of AdaGrad-Norm's hyper-parameters.
◮ Numerical experiments suggest that the robustness of AdaGrad-Norm extends to state-of-the-art models in deep learning, without sacrificing generalization.
See you at the poster session: Pacific Ballroom #56 (today, 6:30-9:00 PM).
Practice: ResNet-50 on ImageNet
[Figure 4: train and test accuracy at epochs 30, 50, and 90, plotted against the initialization $b_0^2$, now also including AdaGrad_Coordinate alongside AdaGrad_Norm, SGD_Constant, and SGD_DecaySqrt.]
Figure 4: Randomly initialized $x_0$ with $\eta = 1$. Stepsizes: (AdaGrad-Norm) $1/b_j$; (SGD-Constant) $1/b_0$; (SGD-DecaySqrt) $1/(b_0\sqrt{j})$.
Theory
Difficulty: Proofs for SGD do not extend straightforwardly because $b_{k+1}$ is a random variable correlated with $\nabla F(x_k)$, i.e.,
$\mathbb{E}_{\xi_j}\left[\frac{\langle \nabla F_j, \nabla F_j - G_j \rangle}{b_{j+1}}\right] \ne \frac{\mathbb{E}_{\xi_j}\left[\langle \nabla F_j, \nabla F_j - G_j \rangle\right]}{b_{j+1}} = \frac{1}{b_{j+1}} \cdot 0;$
(Cauchy-Schwarz)
$\mathbb{E}_{\xi_j}\left[\left(\frac{1}{\sqrt{b_j^2 + C^2}} - \frac{1}{b_{j+1}}\right) \langle \nabla F_j, G_j \rangle\right] \le \mathbb{E}_{\xi_j}\left[\left|\frac{1}{\sqrt{b_j^2 + C^2}} - \frac{1}{b_{j+1}}\right| \, \|\nabla F_j\| \, \|G_j\|\right];$
(Hölder's inequality)
$\mathbb{E}\left[\frac{\|\nabla F_k\|^2}{b_{k+1}}\right] \ge \frac{\left(\mathbb{E}\left[\|\nabla F_k\|^{4/3}\right]\right)^{3/2}}{\sqrt{\mathbb{E}\left[b_{k+1}^2\right]}}.$