SLIDE 1

Outline:

  • Introduction
  • Stochastic Gradient Methods and their properties
  • A numerical experiment: the test problem
  • Future developments

On the steplength selection in Stochastic Gradient Methods

Giorgia Franchini giorgia.franchini@unimore.it

Università degli studi di Modena e Reggio Emilia

Como, 16-18 July, 2018

SLIDE 2

Optimization problem in machine learning

The following optimization problem, which minimizes the average of cost functions over samples from a finite training set composed of sample data $a_i \in \mathbb{R}^d$ and class labels $b_i \in \{\pm 1\}$ for $i \in \{1, \dots, n\}$, appears frequently in machine learning:

$$\min_x F(x) \equiv \frac{1}{n} \sum_{i=1}^{n} f_i(x), \qquad (1)$$

where $n$ is the sample size and each $f_i : \mathbb{R}^d \to \mathbb{R}$ is the cost function corresponding to a training set element. For example, in the logistic regression case we have

$$f_i(x) = \log\left(1 + \exp(-b_i a_i^T x)\right).$$

We are interested in finding $x$ that minimizes (1).
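As a concrete illustration (ours, not from the slides), the objective (1) and its full gradient for the logistic loss take only a few lines of NumPy; the array names A (an n × d sample matrix) and b (a vector of ±1 labels) are assumptions of this sketch:

import numpy as np

def F(x, A, b):
    """Average logistic loss: F(x) = (1/n) * sum_i log(1 + exp(-b_i a_i^T x))."""
    z = -b * (A @ x)                      # z_i = -b_i a_i^T x
    return np.mean(np.logaddexp(0.0, z))  # log(1 + e^z), computed stably

def grad_F(x, A, b):
    """Full gradient: (1/n) * sum_i -b_i * sigma(z_i) * a_i."""
    z = -b * (A @ x)
    sigma = 1.0 / (1.0 + np.exp(-z))      # logistic function of z_i
    return -(A.T @ (b * sigma)) / len(b)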

SLIDE 3

Stochastic Gradient Descent (SGD)

For a given $x$, computing $F(x)$ and $\nabla F(x)$ exactly is prohibitively expensive, due to the large size of the training set. When $n$ is large, the Stochastic Gradient Descent (SGD) method and its variants have been the main approaches for solving (1). In the $t$-th iteration of SGD, a random index $i_t$ of a training sample is chosen from $\{1, 2, \dots, n\}$ and the iterate $x_t$ is updated by

$$x_{t+1} = x_t - \eta_t \nabla f_{i_t}(x_t),$$

where $\nabla f_{i_t}(x_t)$ denotes the gradient of the $i_t$-th component function at $x_t$, and $\eta_t > 0$ is the step size, or learning rate.
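A minimal sketch of this update loop, assuming a user-supplied oracle grad_fi(x, i) that returns the gradient of the i-th component function:

import numpy as np

def sgd(grad_fi, x0, n, eta=0.01, max_iter=1000, seed=0):
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for t in range(max_iter):
        i_t = rng.integers(n)          # draw a random sample index i_t
        x = x - eta * grad_fi(x, i_t)  # x_{t+1} = x_t - eta_t * grad f_{i_t}(x_t)
    return x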

SLIDE 4

SGD properties

Theorem (Strongly Convex Objective, Fixed Step Size)
Suppose that the SGD method is run with a fixed step size, $\eta_t = \bar\eta$ for all $t \in \mathbb{N}$, satisfying $0 < \bar\eta \le \frac{\mu}{L M_G}$. Then the expected optimality gap satisfies

$$\mathbb{E}[F(x_t) - F_*] \xrightarrow{\; t \to \infty \;} \frac{\bar\eta L M}{2 c \mu}.$$

Here $L > 0$ is the Lipschitz constant of the gradient of $F(x)$, $c > 0$ is the strong convexity constant of $F(x)$, and $\mu$ and $M$ are related to the first and second moments of the stochastic gradient.
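To see the theorem at work, the following toy simulation (an illustration of ours, not from the talk) runs fixed-step SGD on $F(x) = (c/2)x^2$ with additive gradient noise of variance M; the averaged gap settles on a plateau proportional to the step size, below the theoretical limit:

import numpy as np

c, mu, L = 1.0, 1.0, 1.0   # strong convexity constant; here mu = 1 and L = c
M, eta = 1.0, 0.1          # gradient-noise variance and fixed step size
rng = np.random.default_rng(0)
x, gaps = 5.0, []
for t in range(20000):
    g = c * x + rng.normal(scale=np.sqrt(M))  # unbiased stochastic gradient
    x -= eta * g
    gaps.append(0.5 * c * x**2)               # F(x_t) - F*, since F* = 0
print(np.mean(gaps[-5000:]), "<= limit", eta * L * M / (2 * c * mu))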

SLIDE 5

Notation and assumptions

  • $L > 0$: Lipschitz constant of the gradient;
  • $c > 0$: strong convexity constant;
  • there exist scalars $\mu_G \ge \mu > 0$ such that, for all $t \in \mathbb{N}$,

$$\nabla F(x_t)^T \, \mathbb{E}_{\xi_t}[g(x_t, \xi_t)] \ge \mu \, \|\nabla F(x_t)\|_2^2,$$
$$\|\mathbb{E}_{\xi_t}[g(x_t, \xi_t)]\|_2 \le \mu_G \, \|\nabla F(x_t)\|_2;$$

  • there exist scalars $M \ge 0$ and $M_V \ge 0$ such that, for all $t \in \mathbb{N}$,

$$\mathbb{V}_{\xi_t}[g(x_t, \xi_t)] \le M + M_V \, \|\nabla F(x_t)\|_2^2;$$

  • $M_G := M_V + \mu_G^2 \ge \mu^2 > 0$.

SLIDE 6

SGD properties, Diminishing Step sizes

Theorem (Strongly Convex Objective, Diminishing Step Sizes)
Suppose that the SGD method is run with a step size sequence such that

$$\sum_{t=1}^{\infty} \eta_t = \infty \quad \text{and} \quad \sum_{t=1}^{\infty} \eta_t^2 < \infty.$$

Then the expected optimality gap satisfies

$$\mathbb{E}[F(x_t) - F_*] \le \frac{\nu}{\gamma + t},$$

where $\nu$ and $\gamma$ are constants.

SLIDE 7

Notation

$$\eta_t = \frac{\beta}{\gamma + t} \quad \text{for some } \beta > \frac{1}{c \mu} \text{ and } \gamma > 0 \text{ such that } \eta_1 \le \frac{\mu}{L M_G};$$

$$\nu := \max\left\{ \frac{\beta^2 L M}{2(\beta c \mu - 1)}, \; (\gamma + 1)\,(F(x_1) - F_*) \right\}.$$
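A small sketch of this schedule; the constants c, mu, L, M_G and the choices of beta and gamma below are placeholder values of ours, picked only to satisfy the two stated conditions:

c, mu, L, M_G = 1.0, 1.0, 1.0, 1.0
beta = 2.0 / (c * mu)              # satisfies beta > 1/(c*mu)
gamma = beta * L * M_G / mu - 1.0  # makes eta(1) = beta/(gamma+1) = mu/(L*M_G)

def eta(t):
    """Diminishing step size eta_t = beta / (gamma + t)."""
    return beta / (gamma + t)

assert beta > 1.0 / (c * mu) and eta(1) <= mu / (L * M_G)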

SLIDE 8

A numerical experiment: the test problem

Logistic regression with $\ell_2$-norm regularization:

$$\min_x F(x) = \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + \exp(-b_i a_i^T x)\right) + \frac{\lambda}{2} \|x\|_2^2,$$

where $a_i \in \mathbb{R}^d$ and $b_i \in \{\pm 1\}$ are the feature vector and class label of the $i$-th sample, respectively, and $\lambda > 0$ is a regularization parameter.

Database: MNIST, digits 8 and 9; dimension: 11800 × 784.
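For concreteness, a sketch of the regularized objective; since the MNIST extraction itself is not shown in the slides, random data of the same 11800 × 784 shape stands in for the real feature matrix:

import numpy as np

def F_reg(x, A, b, lam):
    """(1/n) sum_i log(1 + exp(-b_i a_i^T x)) + (lam/2) ||x||_2^2."""
    z = -b * (A @ x)
    return np.mean(np.logaddexp(0.0, z)) + 0.5 * lam * np.dot(x, x)

rng = np.random.default_rng(0)
A = rng.normal(size=(11800, 784))        # stand-in for the MNIST 8-vs-9 features
b = rng.choice([-1.0, 1.0], size=11800)  # stand-in for the 8-vs-9 labels
print(F_reg(np.zeros(784), A, b, lam=1e-4))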

SLIDE 9

MNIST

[Figure: sample images from the MNIST handwritten digit database.]

SLIDE 10

A numerical experiment: the algorithms

The two full-gradient BB rules:

  • BB1 full gradient: a nonmonotone gradient method with the first Barzilai-Borwein step size rule;
  • Adaptive BB (ABB): a nonmonotone gradient method with a full-gradient step size rule that alternates between the two BB rules.

Stochastic methods:

  • ADAM: a stochastic gradient method based on adaptive moment estimation;
  • ADAM ABB: ADAM combined with the ABB step size rule.

Behaviour is measured with respect to the epochs: one epoch = 11800 SGD steps.

SLIDE 11

Deterministic cases

$$\eta_t^{BB1} = \frac{s_{t-1}^T s_{t-1}}{s_{t-1}^T v_{t-1}}, \qquad \eta_t^{BB2} = \frac{s_{t-1}^T v_{t-1}}{v_{t-1}^T v_{t-1}},$$

where $s_{t-1} = x_t - x_{t-1}$ and $v_{t-1} = \nabla F(x_t) - \nabla F(x_{t-1})$, with $\nabla F(x) = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(x)$.

$$\eta_t^{ABBmin} = \begin{cases} \min\{\eta_j^{BB2} : j = \max\{1, t - m_a\}, \dots, t\}, & \text{if } \eta_t^{BB2} / \eta_t^{BB1} < \tau, \\ \eta_t^{BB1}, & \text{otherwise,} \end{cases}$$

where $m_a$ is a nonnegative integer and $\tau \in (0, 1)$.

[Di Serafino, Ruggiero, Toraldo, Zanni, On the steplength selection in gradient methods for unconstrained optimization, AMC 318, 2018]
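A sketch, with invented helper names, of how the two BB rules and the ABBmin selection above might be computed:

import numpy as np

def bb_steps(x_prev, x_curr, g_prev, g_curr):
    """Return (eta^BB1, eta^BB2) from consecutive iterates and gradients."""
    s = x_curr - x_prev                # s_{t-1}
    v = g_curr - g_prev                # v_{t-1}
    return np.dot(s, s) / np.dot(s, v), np.dot(s, v) / np.dot(v, v)

def abb_min(bb1_t, bb2_history, m_a=2, tau=0.8):
    """ABBmin rule; bb2_history holds eta_j^{BB2} for j = 1..t, newest last."""
    bb2_t = bb2_history[-1]
    if bb2_t / bb1_t < tau:
        return min(bb2_history[-(m_a + 1):])  # min over j = max{1, t-m_a}..t
    return bb1_t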

SLIDE 12

[Kingma, Lei Ba, Adam: a method for stochastic optimization, arXiv, 2017]

Algorithm 1 Adam

1: Choose maxit, η, ε, β1 and β2 ∈ [0, 1), x0;
2: initialize m0 ← 0, v0 ← 0, t ← 0;
3: for t ∈ {0, . . . , maxit} do
4:   t ← t + 1
5:   g_t ← ∇f_{i_t}(x_{t−1})
6:   m_t ← β1 · m_{t−1} + (1 − β1) · g_t
7:   v_t ← β2 · v_{t−1} + (1 − β2) · g_t^2
8:   η_t ← η · sqrt(1 − β2^t) / (1 − β1^t)
9:   x_t ← x_{t−1} − η_t · m_t / (sqrt(v_t) + ε)
10: end for
11: Result: x_t
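A direct NumPy transcription of Algorithm 1 might look as follows; the stochastic-gradient oracle grad_fi and the uniform sampling of i_t are assumptions about the surrounding code, not part of the slide:

import numpy as np

def adam(grad_fi, x0, n, eta=0.001, eps=1e-8, beta1=0.9, beta2=0.999,
         maxit=1000, seed=0):
    rng = np.random.default_rng(seed)
    x = x0.copy()
    m = np.zeros_like(x)  # first moment estimate m_t
    v = np.zeros_like(x)  # second moment estimate v_t
    for t in range(1, maxit + 1):
        g = grad_fi(x, rng.integers(n))         # stochastic gradient g_t
        m = beta1 * m + (1 - beta1) * g         # line 6 of Algorithm 1
        v = beta2 * v + (1 - beta2) * g**2      # line 7
        eta_t = eta * np.sqrt(1 - beta2**t) / (1 - beta1**t)  # line 8
        x = x - eta_t * m / (np.sqrt(v) + eps)  # line 9
    return x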

SLIDE 13

Behaviour of the deterministic and the stochastic methods

[Figure: optimality gap F − F* versus epoch (1-100, log scale); curves: ADAM, BB1 FULL GRADIENT, ABB FULL GRADIENT.]

SLIDE 14

Comparison between different SGD types

[Figure: optimality gap F − F* versus epoch (1-20, log scale); curves: ADAM, SGD MOMENTUM.]

SLIDE 15

Are BB rules useful in the stochastic framework?

$$\eta_t^{BB1} = \frac{s_{t-1}^T s_{t-1}}{s_{t-1}^T v_{t-1}}, \qquad \eta_t^{BB2} = \frac{s_{t-1}^T v_{t-1}}{v_{t-1}^T v_{t-1}},$$

where $s_{t-1} = x_t - x_{t-1}$ and $v_{t-1} = \nabla f_{i_t}(x_t) - \nabla f_{i_{t-1}}(x_{t-1})$. There is not just one way to calculate them.

[Sopyla, Drozda, SGD with BB update step for SVM, Inf. Sci., 2015]

[Tan, Ma, Dai, Qian, BB Step Size for SGD, Adv. NIPS, 2016]: comparable performance with popular SGD approaches using a best-tuned step size, but without expensive parameter tuning.
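A sketch of the stochastic variant: the only change from the deterministic rules is that v_{t−1} is built from stochastic gradients evaluated on (generally different) sampled components; grad_fi is assumed to return the gradient of the i-th component function:

import numpy as np

def stochastic_bb1(x_prev, x_curr, i_prev, i_curr, grad_fi):
    s = x_curr - x_prev
    v = grad_fi(x_curr, i_curr) - grad_fi(x_prev, i_prev)  # stochastic v_{t-1}
    return np.dot(s, s) / np.dot(s, v)  # eta_t^{BB1}; may be noisy or negative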

SLIDE 16

ABB in Adam

$$\eta_t^{ABBmin} = \begin{cases} \min\{\eta_j^{BB2} : j = \max\{1, t - m_a\}, \dots, t\}, & \text{if } \eta_t^{BB2} / \eta_t^{BB1} < \tau, \\ \eta_t^{BB1}, & \text{otherwise,} \end{cases}$$

where $m_a$ is a nonnegative integer and $\tau \in (0, 1)$;

$$\eta_t = \max\{\eta, \eta_t^{ABBmin}\}.$$
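A tiny sketch of this safeguard; flooring at the base step size η keeps an unreliable (e.g. tiny or negative) BB estimate from stalling the iteration:

def safeguarded_step(eta, eta_abbmin):
    """eta_t = max(eta, eta_t^{ABBmin})."""
    return max(eta, eta_abbmin)

print(safeguarded_step(0.001, 2e-5))  # falls back to the base step 0.001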

SLIDE 17

Comparison between Adam and Adam with ABB

[Figure: optimality gap F − F* versus epoch (1-20, log scale); curves: ADAM, ADAM ABB.]

SLIDE 18

Future developments

  • Further study to make the adaptive step size rules useful also in the stochastic case;
  • validation of the Adam-ABB version: experiments on other databases and other loss functions;
  • exploiting minibatches of adaptive size;
  • analysing the sensitivity of the step size rules to the minibatch size.

SLIDE 19

Thanks for your attention!

giorgia.franchini@unimore.it
