On the steplength selection in Stochastic Gradient Methods



  1. On the steplength selection in Stochastic Gradient Methods
      Giorgia Franchini (giorgia.franchini@unimore.it), Università degli studi di Modena e Reggio Emilia
      Como, 16-18 July, 2018
      Outline: Introduction; Stochastic Gradient Methods and their properties; A numerical experiment: the test problem; Future developments.

  2. Optimization problem in machine learning
      The following optimization problem, which minimizes an average of cost functions over a finite training set with sample data a_i ∈ R^d and class labels b_i ∈ {±1}, i ∈ {1, ..., n}, appears frequently in machine learning:
          min_x F(x) ≡ (1/n) Σ_{i=1}^n f_i(x),    (1)
      where n is the sample size and each f_i : R^d → R is the cost function associated with the i-th training element. For example, in the logistic regression case,
          f_i(x) = log(1 + exp(-b_i a_i^T x)).
      We are interested in finding an x that minimizes (1).
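As a side illustration (not in the original slides), a minimal NumPy sketch of the finite-sum objective (1) for the logistic-regression case; the names `logistic_F`, `A` and `b` are assumptions made here, with the rows of `A` playing the role of the samples a_i.

```python
import numpy as np

def logistic_F(x, A, b):
    """F(x) = (1/n) * sum_i log(1 + exp(-b_i * a_i^T x)).

    A : (n, d) array whose rows are the samples a_i
    b : (n,) array of class labels in {-1, +1}
    """
    margins = -b * (A @ x)                      # -b_i * a_i^T x for every sample
    return np.mean(np.logaddexp(0.0, margins))  # stable log(1 + exp(.)), averaged over i
```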

  3. Stochastic Gradient Descent (SGD)
      For a given x, computing F(x) and ∇F(x) is prohibitively expensive due to the large size of the training set; when n is large, the Stochastic Gradient Descent (SGD) method and its variants have been the main approaches for solving (1).
      At the t-th iteration of SGD, a random index i_t of a training sample is drawn from {1, 2, ..., n} and the iterate x_t is updated by
          x_{t+1} = x_t − η_t ∇f_{i_t}(x_t),
      where ∇f_{i_t}(x_t) denotes the gradient of the i_t-th component function at x_t, and η_t > 0 is the step size or learning rate.
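A minimal sketch of the SGD iteration above for the logistic loss; this is not the author's code, and `grad_fi` and the uniform sampling of i_t are illustrative choices.

```python
import numpy as np

def grad_fi(x, a_i, b_i):
    """Gradient of f_i(x) = log(1 + exp(-b_i * a_i^T x))."""
    return -b_i * a_i / (1.0 + np.exp(b_i * (a_i @ x)))

def sgd_fixed_step(x0, A, b, eta, n_iters, seed=0):
    """Plain SGD: x_{t+1} = x_t - eta * grad f_{i_t}(x_t), with a fixed step size."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    n = A.shape[0]
    for _ in range(n_iters):
        i_t = rng.integers(n)                # draw a random sample index
        x = x - eta * grad_fi(x, A[i_t], b[i_t])
    return x
```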

  4. SGD properties
      Theorem (Strongly Convex Objective, Fixed Step size). Suppose that the SGD method is run with a fixed step size η_t = η̄ for all t ∈ N, satisfying
          0 < η̄ ≤ µ / (L M_G).
      Then the expected optimality gap satisfies
          E[F(x_t) − F_*] → η̄ L M / (2 c µ)   as t → ∞.
      Here L > 0 is the Lipschitz constant of the gradient of F(x); c > 0 is the strong convexity constant of F(x); µ and M are related to the first and second moment of the stochastic gradient.
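To see the plateau the theorem predicts, one can run fixed-step SGD on a toy strongly convex problem and observe that the gap stalls at a level proportional to the step size; a sketch under these assumptions (a one-dimensional least-squares problem and a hand-picked η̄, not from the slides):

```python
import numpy as np

# Toy problem: F(x) = (1/n) * sum_i 0.5*(x - y_i)^2, minimized by the mean of y.
rng = np.random.default_rng(0)
y = rng.normal(size=1000)
F_star = 0.5 * np.mean((y.mean() - y) ** 2)

x, eta = 5.0, 0.1                            # fixed step size eta_t = eta
for _ in range(20000):
    i = rng.integers(y.size)
    x -= eta * (x - y[i])                    # stochastic gradient of 0.5*(x - y_i)^2

gap = 0.5 * np.mean((x - y) ** 2) - F_star
print(f"final optimality gap ~ {gap:.4f}")   # stays O(eta) instead of converging to 0
```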

  5. Notation and assumptions
      L > 0: Lipschitz constant of the gradient of F;
      c > 0: strong convexity constant;
      there exist scalars µ_G ≥ µ > 0 such that, for all t ∈ N,
          ∇F(x_t)^T E_{ξ_t}[g(x_t, ξ_t)] ≥ µ ‖∇F(x_t)‖_2^2   and   ‖E_{ξ_t}[g(x_t, ξ_t)]‖_2 ≤ µ_G ‖∇F(x_t)‖_2;
      there exist scalars M ≥ 0 and M_V ≥ 0 such that, for all t ∈ N,
          V_{ξ_t}[g(x_t, ξ_t)] ≤ M + M_V ‖∇F(x_t)‖_2^2;
      M_G := M_V + µ_G^2 ≥ µ^2 > 0.

  6. SGD properties, Diminishing Step sizes
      Theorem (Strongly Convex Objective, Diminishing Step sizes). Suppose that the SGD method is run with a step size sequence such that
          Σ_{t=1}^∞ η_t = ∞   and   Σ_{t=1}^∞ η_t^2 < ∞.
      Then the expected optimality gap satisfies
          E[F(x_t) − F_*] ≤ ν / (γ + t),
      where ν and γ are constants.

  7. Notation
      η_t = β / (γ + t) for some β > 1/(c µ) and γ > 0 such that η_1 ≤ µ / (L M_G);
      ν := max{ β^2 L M / (2 (β c µ − 1)), (γ + 1)(F(x_1) − F_*) }.
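A sketch of SGD with the diminishing schedule η_t = β/(γ + t) from this slide; β and γ below are illustrative placeholders rather than tuned constants from the presentation, and `grad_fi` is the per-sample gradient sketched earlier.

```python
import numpy as np

def sgd_diminishing(x0, A, b, grad_fi, beta=1.0, gamma=10.0, n_iters=10000, seed=0):
    """SGD with the O(1/t) schedule eta_t = beta / (gamma + t)."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    n = A.shape[0]
    for t in range(1, n_iters + 1):
        eta_t = beta / (gamma + t)           # diminishing step size
        i_t = rng.integers(n)
        x = x - eta_t * grad_fi(x, A[i_t], b[i_t])
    return x
```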

  8. A numerical experiment: the test problem
      Logistic regression with l2-norm regularization:
          min_x F(x) = (1/n) Σ_{i=1}^n log(1 + exp(-b_i a_i^T x)) + (λ/2) ‖x‖_2^2,
      where a_i ∈ R^d and b_i ∈ {±1} are the feature vector and class label of the i-th sample, respectively, and λ > 0 is a regularization parameter.
      Database: MNIST, digits 8 and 9; dimension 11800 × 784.
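A sketch of the regularized objective of the test problem and its full gradient, assuming the same A (n × d feature matrix) and b (±1 labels) as before; loading the MNIST 8/9 subset is omitted.

```python
import numpy as np

def F_reg(x, A, b, lam):
    """F(x) = (1/n) sum_i log(1 + exp(-b_i a_i^T x)) + (lam/2) * ||x||^2."""
    margins = -b * (A @ x)
    return np.mean(np.logaddexp(0.0, margins)) + 0.5 * lam * np.dot(x, x)

def grad_F_reg(x, A, b, lam):
    """Full gradient of the regularized logistic objective."""
    n = A.shape[0]
    p = 1.0 / (1.0 + np.exp(b * (A @ x)))    # sigma(-b_i a_i^T x) for every sample
    return -(A * (b * p)[:, None]).sum(axis=0) / n + lam * x
```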

  9. MNIST
      [Figure: sample images from the MNIST database.]

  10. A numerical experiment: the algorithms
      The two full-gradient BB rules:
      - BB1 full gradient: a nonmonotone gradient method with the first Barzilai-Borwein step size rule;
      - Adaptive BB (ABB) full gradient: a nonmonotone gradient method with a step size rule that alternates between the two BB rules.
      Stochastic methods:
      - ADAM: a stochastic gradient method based on adaptive moment estimation;
      - ADAM ABB.
      Behaviour is reported with respect to the epochs: one epoch = 11800 SGD steps.

  11. Deterministic cases
          η_t^{BB1} = (s_{t−1}^T s_{t−1}) / (s_{t−1}^T v_{t−1}),      η_t^{BB2} = (s_{t−1}^T v_{t−1}) / (v_{t−1}^T v_{t−1}),
      where s_{t−1} = x_t − x_{t−1}, v_{t−1} = ∇F(x_t) − ∇F(x_{t−1}) and ∇F(x) = (1/n) Σ_{i=1}^n ∇f_i(x).
          η_t^{ABB} = min{ η_j^{BB2} : j = max{1, t − m_a}, ..., t }   if η_t^{BB2} / η_t^{BB1} < τ,
          η_t^{ABB} = η_t^{BB1}   otherwise,
      where m_a is a nonnegative integer and τ ∈ (0, 1).
      [Di Serafino, Ruggiero, Toraldo, Zanni, On the steplength selection in gradient methods for unconstrained optimization, AMC 318, 2018]
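A minimal sketch of the BB1, BB2 and alternating (ABB) step size rules described on this slide; the variable names and the safeguard on tiny denominators are choices made here for illustration.

```python
import numpy as np

def bb_steps(x_prev, x_curr, g_prev, g_curr, eps=1e-12):
    """Return the two Barzilai-Borwein step sizes (eta_BB1, eta_BB2)."""
    s = x_curr - x_prev                      # s_{t-1} = x_t - x_{t-1}
    v = g_curr - g_prev                      # v_{t-1} = grad F(x_t) - grad F(x_{t-1})
    sv = max(s @ v, eps)                     # safeguard against non-positive curvature
    eta_bb1 = (s @ s) / sv
    eta_bb2 = sv / max(v @ v, eps)
    return eta_bb1, eta_bb2

def abb_step(eta_bb1, eta_bb2, bb2_history, tau=0.5, m_a=5):
    """Alternating rule: min of the last m_a+1 BB2 steps when BB2/BB1 < tau, else BB1."""
    bb2_history.append(eta_bb2)
    if eta_bb2 / eta_bb1 < tau:
        return min(bb2_history[-(m_a + 1):])
    return eta_bb1
```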

  12. Adam
      Kingma, Lei Ba, Adam: a method for stochastic optimization, ArXiv, 2017
      Algorithm 1 Adam
          Choose maxit, η, ε, β_1 and β_2 ∈ [0, 1), x_0
          initialize m_0 ← 0, v_0 ← 0, t ← 0
          for t ∈ {0, ..., maxit} do
              t ← t + 1
              g_t ← ∇f_{i_t}(x_{t−1})
              m_t ← β_1 · m_{t−1} + (1 − β_1) · g_t
              v_t ← β_2 · v_{t−1} + (1 − β_2) · g_t^2
              η_t ← η · √(1 − β_2^t) / (1 − β_1^t)
              x_t ← x_{t−1} − η_t · m_t / (√v_t + ε)
          end for
          Result: x_t
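A runnable Python sketch following Algorithm 1 above; the stochastic gradient oracle `stoch_grad` (returning ∇f_{i_t} at the current iterate) is a placeholder to be supplied by the caller.

```python
import numpy as np

def adam(x0, stoch_grad, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, maxit=1000):
    """Adam: adaptive moment estimation with a bias-corrected step size."""
    x = x0.copy()
    m = np.zeros_like(x)                     # first-moment (mean) estimate
    v = np.zeros_like(x)                     # second-moment (uncentered variance) estimate
    for t in range(1, maxit + 1):
        g = stoch_grad(x)                    # g_t = grad f_{i_t}(x_{t-1})
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        eta_t = eta * np.sqrt(1 - beta2**t) / (1 - beta1**t)   # bias-corrected step
        x = x - eta_t * m / (np.sqrt(v) + eps)
    return x
```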

  13. Behaviour of the deterministic and the stochastic methods
      [Figure: optimality gap F − F_* (log scale, from 10^0 down to 10^-3) versus epochs 0-100, comparing ADAM with the BB1 and ABB full-gradient methods.]

  14. Comparison between different SGD types
      [Figure: optimality gap F − F_* (log scale, from 10^0 down to 10^-2) versus epochs 0-20, comparing ADAM, SGD and MOMENTUM.]
