SLIDE 1

Algorithms: Gradient Descent

This classic greedy algorithm for minimization uses the negative of the gradient of the function at the current point x as the descent direction ∆x. This choice of ∆x corresponds to the direction of steepest descent under the L₂ (Euclidean) norm, and follows from the Cauchy-Schwarz inequality.

Find a starting point x^(0) ∈ D

repeat

  • 1. Set ∆x^(k) = −∇f(x^(k)).
  • 2. Choose a step size t^(k) > 0 using exact or backtracking ray search.
  • 3. Obtain x^(k+1) = x^(k) + t^(k)∆x^(k).
  • 4. Set k = k + 1.

until stopping criterion (such as ∥∇f(x^(k+1))∥₂ ≤ ϵ) is satisfied

The steepest descent method can be thought of as changing the coordinate system in a particular way and then applying the gradient descent method in the changed coordinate system.
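A minimal Python sketch of this loop, with a fixed step size (the objective, gradient, and step size below are illustrative stand-ins; the exact/backtracking ray search of step 2 is sketched a few slides later):

```python
import numpy as np

def gradient_descent(grad_f, x0, t=0.5, eps=1e-6, max_iter=10_000):
    """Fixed-step gradient descent: dx^(k) = -grad f(x^(k)),
    x^(k+1) = x^(k) + t * dx^(k); stop when ||grad f(x)||_2 <= eps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        dx = -grad_f(x)                  # step 1: descent direction
        if np.linalg.norm(dx) <= eps:    # stopping criterion
            break
        x = x + t * dx                   # step 3: take the step
    return x

# Illustrative objective: f(x) = 0.5 ||x||^2, minimized at the origin.
grad_f = lambda x: x
print(gradient_descent(grad_f, x0=[3.0, -4.0]))  # ~ [0, 0]
```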


SLIDE 2-4

Convergence of the Gradient Descent Algorithm

We recap the (necessary) inequality (36) resulting from Lipschitz continuity of ∇f(x):

f(y) ≤ f(x) + ∇f(x)⊤(y − x) + (L/2)∥y − x∥²

Considering x_k ≡ x and x_{k+1} = x_k − t_k∇f(x_k) ≡ y, we get

f(x_{k+1}) ≤ f(x_k) − t_k∇f(x_k)⊤∇f(x_k) + (L t_k²/2)∥∇f(x_k)∥²
⟹ f(x_{k+1}) ≤ f(x_k) − (1 − L t_k/2) t_k∥∇f(x_k)∥²

We desire to have the following inequality (46). It holds if the step size is chosen suitably:

f(x_{k+1}) ≤ f(x_k) − (t̂/2)∥∇f(x_k)∥²    (46)

▶ With fixed step size t_k = t̂, we ensure that 0 < t̂ ≤ 1/L ⟹ 1 − L t̂/2 ≥ 1/2.
▶ With backtracking ray search, (46) holds with t̂ = min{1, 2β(1 − c₁)/L}.

See https://www.youtube.com/watch?v=SGZdsQviFYs&list=PLsd82ngobrvcYfCdnSnqM7lKLqE9qUUpX&index=17

  • The drop in the value of the objective will be at least of the order of the squared norm of the gradient.
  • The derivation is provided a few slides later.
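A quick numerical sanity check of (46), not from the slides: for f(x) = ½ x⊤Ax the gradient is Ax and L = λ_max(A), so the fixed step t̂ = 1/L should satisfy the inequality (all values illustrative):

```python
import numpy as np

A = np.array([[3.0, 0.0], [0.0, 1.0]])        # f(x) = 0.5 x^T A x
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

L = np.max(np.linalg.eigvalsh(A))             # Lipschitz constant of grad f
t_hat = 1.0 / L                               # fixed step with 0 < t_hat <= 1/L
x = np.array([1.0, 2.0])
x_next = x - t_hat * grad(x)

# Inequality (46): f(x_{k+1}) <= f(x_k) - (t_hat / 2) ||grad f(x_k)||^2
assert f(x_next) <= f(x) - (t_hat / 2) * np.linalg.norm(grad(x)) ** 2 + 1e-12
print("inequality (46) holds")
```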

SLIDE 5

Aside: Backtracking Ray Search and Lipschitz Continuity

Recap the backtracking ray search algorithm:

▶ Choose a β ∈ (0, 1)
▶ Start with t = 1
▶ While f(x + t∆x) > f(x) + c₁t∇f(x)⊤∆x, do
  ⋆ Update t ← βt

On convergence, f(x + t∆x) ≤ f(x) + c₁t∇f(x)⊤∆x.

For gradient descent, this means f(x + t∆x) ≤ f(x) − c₁t∥∇f(x)∥².

For a function f with Lipschitz continuous ∇f(x), we have that f(x_{k+1}) ≤ f(x_k) − (t̂/2)∥∇f(x_k)∥² is satisfied if t̂ = min{1, 2β(1 − c₁)/L}.

Reason: with backtracking ray search, if 1 − L t_k/2 ≥ c₁, the Armijo rule will be satisfied. That is, 0 < t_k ≤ 2(1 − c₁)/L ⟹ 1 − L t_k/2 ≥ c₁. If not, there must exist an integer j for which 2β(1 − c₁)/L ≤ βʲ ≤ 2(1 − c₁)/L, so we take t̂ = min{1, 2β(1 − c₁)/L}.
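A short sketch of this loop (assuming f is smooth and ∆x is a descent direction; the default values for c₁ and β and the test function are illustrative):

```python
import numpy as np

def backtracking_ray_search(f, grad_f, x, dx, c1=1e-4, beta=0.5):
    """Shrink t by beta until the Armijo condition
    f(x + t*dx) <= f(x) + c1 * t * grad_f(x)^T dx holds."""
    t = 1.0
    fx, slope = f(x), grad_f(x) @ dx      # slope < 0 for a descent direction
    while f(x + t * dx) > fx + c1 * t * slope:
        t *= beta
    return t

# For gradient descent dx = -grad_f(x), so on exit
# f(x + t*dx) <= f(x) - c1 * t * ||grad_f(x)||^2.
f = lambda x: 0.5 * x @ np.diag([1.0, 20.0]) @ x
grad_f = lambda x: np.diag([1.0, 20.0]) @ x
x = np.array([1.0, 1.0])
print(backtracking_ray_search(f, grad_f, x, -grad_f(x)))  # some t in (0, 1]
```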


SLIDE 6-7

Using convexity, we have f(x*) ≥ f(x_k) + ∇f(x_k)⊤(x* − x_k)
⟹ f(x_k) ≤ f(x*) + ∇f(x_k)⊤(x_k − x*)

Thus, f(x_{k+1}) ≤ f(x_k) − (t/2)∥∇f(x_k)∥²
⟹ f(x_{k+1}) ≤ f(x*) + ∇f(x_k)⊤(x_k − x*) − (t/2)∥∇f(x_k)∥²
⟹ f(x_{k+1}) ≤ f(x*) + (1/2t)∥x_k − x*∥² + ∇f(x_k)⊤(x_k − x*) − (t/2)∥∇f(x_k)∥² − (1/2t)∥x_k − x*∥²
⟹ f(x_{k+1}) ≤ f(x*) + (1/2t)(∥x_k − x*∥² − ∥x_k − x* − t∇f(x_k)∥²)
⟹ f(x_{k+1}) ≤ f(x*) + (1/2t)(∥x_k − x*∥² − ∥x_{k+1} − x*∥²)
⟹ f(x_{k+1}) − f(x*) ≤ (1/2t)(∥x_k − x*∥² − ∥x_{k+1} − x*∥²)    (47)

(The third step adds and subtracts (1/2t)∥x_k − x*∥², the fourth completes the square, and the fifth uses x_{k+1} = x_k − t∇f(x_k).)
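A per-step check of (47) on a toy quadratic, with x* = 0, f(x*) = 0, and t = 1/L (a sketch with illustrative values, not the slides' code):

```python
import numpy as np

A = np.diag([1.0, 8.0])             # f(x) = 0.5 x^T A x, so x* = 0
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

t = 1.0 / 8.0                       # t = 1/L with L = lambda_max(A)
x = np.array([2.0, -1.0])
for _ in range(20):
    x_new = x - t * grad(x)
    lhs = f(x_new) - 0.0                                     # f(x_{k+1}) - f(x*)
    rhs = (np.dot(x, x) - np.dot(x_new, x_new)) / (2 * t)    # RHS of (47)
    assert lhs <= rhs + 1e-12
    x = x_new
print("inequality (47) holds at every step")
```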


SLIDE 8-9

Summing inequality (47) over the iterations and telescoping,

∑_{i=1}^{k} ( f(x_i) − f(x*) ) ≤ (1/2t)∥x^(0) − x*∥²

The ray and line search ensure⁶ that f(x_{i+1}) ≤ f(x_i) ∀i = 0, 1, . . . , k. We thus get

f(x_k) − f(x*) ≤ (1/k) ∑_{i=1}^{k} ( f(x_i) − f(x*) ) ≤ ∥x^(0) − x*∥² / (2tk)

Thus, as k → ∞, f(x_k) → f(x*). This shows convergence for gradient descent.

  • To get ϵ-close to f(x*), it is sufficient for k to be O(1/ϵ).

⁶ By the Armijo condition in (29), for some 0 < c₁ < 1, f(x_{i+1}) ≤ f(x_i) + c₁t_i∇f(x_i)⊤∆x_i.
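A numeric illustration of the aggregate O(1/k) bound above (toy quadratic with x* = 0 and fixed t = 1/L; illustrative only):

```python
import numpy as np

A = np.diag([1.0, 10.0])                 # f(x) = 0.5 x^T A x, x* = 0, f(x*) = 0
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

t = 1.0 / 10.0                           # t = 1/L with L = lambda_max(A)
x0 = np.array([5.0, 5.0])
x = x0.copy()
for k in range(1, 101):
    x = x - t * grad(x)
    bound = np.dot(x0, x0) / (2 * t * k)   # ||x0 - x*||^2 / (2 t k)
    assert f(x) - 0.0 <= bound
print("f(x_k) - f(x*) stays below the O(1/k) bound")
```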

SLIDE 10

Rates of Convergence

SLIDE 11

Convergence

  • Observe acceleration (this is what we observed for the better algorithm on the Rosenbrock function); rate of convergence = increasing slope.
  • Order of convergence.
SLIDE 12

Linear Convergence

A sequence v_1, . . . , v_k is linearly (or, more specifically, Q-linearly) convergent to v* if

∥v_{k+1} − v*∥ / ∥v_k − v*∥ ≤ r   for all k ≥ θ, with r ∈ (0, 1)

▶ 'Q' here stands for 'quotient' of the norms, as shown above.

SLIDE 13-15

Q-convergence

A sequence v_1, . . . , v_k is Q-linearly convergent to v* if

∥v_{k+1} − v*∥ / ∥v_k − v*∥ ≤ r   for all k ≥ θ, with r ∈ (0, 1)

▶ 'Q' here stands for 'quotient' of the norms, as shown above.
▶ Consider the sequence s¹ = [11/2, 21/4, 41/8, . . . , 5 + 1/2ⁿ, . . .]

The sequence converges to s* = 5, and it is Q-linearly convergent because:

∥s¹_{k+1} − s*∥ / ∥s¹_k − s*∥ = (1/2^{k+1}) / (1/2^k) = 1/2 < 0.6 (= r)

▶ How about the convergence result we got by assuming Lipschitz continuity with backtracking and exact line searches?
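The quotient in this example can be checked in a couple of lines (names illustrative):

```python
s_star = 5.0
s = [s_star + 1.0 / 2 ** n for n in range(1, 12)]     # 11/2, 21/4, 41/8, ...
ratios = [abs(s[k + 1] - s_star) / abs(s[k] - s_star) for k in range(len(s) - 1)]
print(ratios)   # every quotient is exactly 0.5 < 0.6, so r = 0.5 works
```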


SLIDE 16-18

Generalizing Q-convergence to R-convergence

Consider the sequence r¹ = [5, 21/4, 21/4, . . . , 5 + 1/4^{⌊n/2⌋}, . . .]

The sequence converges to r* = 5, but not Q-linearly!

Let us consider the convergence result we got by assuming Lipschitz continuity with backtracking and exact line searches:

f(x_k) − f(x*) ≤ ∥x^(0) − x*∥² / (2tk)

Q-convergence by itself is insufficient here. We will generalize it to R-convergence. 'R' here stands for 'root', as we are looking at convergence rooted at x*. We say that the sequence s_1, . . . , s_k is R-linearly convergent if ∥s_k − s*∥ ≤ v_k, ∀k, and {v_k} converges Q-linearly to zero.
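A sketch checking an R-linear envelope for this example (the envelope v_n = 2·(1/2)ⁿ is my choice, not from the slides; any Q-linearly vanishing majorant works):

```python
r_star = 5.0
for n in range(2, 16):
    err = 1.0 / 4 ** (n // 2)       # |r1_n - r_star|, as on the slide
    v_n = 2.0 * 0.5 ** n            # illustrative envelope, Q-linear with quotient 1/2
    assert err <= v_n
print("R-linear envelope holds even though the Q-quotient is 1 on repeated terms")
```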


SLIDE 19-21

R-convergence assuming Lipschitz continuity

Consider v_k = ∥x^(0) − x*∥² / (2tk) = α/k, where α is a constant.

Here, with v* = 0, we have

∥v_{k+1} − v*∥ / ∥v_k − v*∥ = (α/(k+1)) / (α/k) = k/(k+1)

Now k/(k+1) < 1, but we don't have k/(k+1) ≤ r for any fixed r < 1: the quotient k/(k+1) → 1 as k → ∞.

Thus, v_k = α/k is not Q-linearly convergent, as there exists no r < 1 such that (α/(k+1))/(α/k) = k/(k+1) ≤ r, ∀k ≥ θ.

Strictly speaking, with Lipschitz continuity alone, gradient descent is not guaranteed to give R-linear convergence.

In practice, Lipschitz continuity gives "almost" R-linear convergence, which is not too bad! We say that gradient descent with Lipschitz continuity has convergence rate O(1/k); that is, to obtain f(x_k) − f(x*) ≤ ϵ, we need O(1/ϵ) iterations.
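The creeping quotient is easy to see numerically:

```python
quotients = [k / (k + 1) for k in (1, 2, 10, 100, 10_000)]
print(quotients)   # 0.5, 0.667, 0.909, 0.990, 0.9999... -> creeps up to 1
```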


SLIDE 22

Taking a hint from this analysis: if the convergence is Q-linear,

∥s_{k+1} − s*∥ / ∥s_k − s*∥ ≤ r ∈ (0, 1)

then

∥s_k − s*∥ ≤ r∥s_{k−1} − s*∥ ≤ r²∥s_{k−2} − s*∥ ≤ . . . ≤ r^k∥s^(0) − s*∥, which serves as v_k for R-linear convergence.

Thus, Q-linear convergence ⟹ R-linear convergence.

▶ Q-linear is a special case of R-linear.
▶ R-linear gives a more general way of characterizing linear convergence.

Q-linear is an 'order of convergence'; r is the 'rate of convergence'.

Any Q-linearly convergent sequence is also R-linearly convergent.
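A sketch of the unrolling on the earlier Q-linear example (names illustrative):

```python
r, s_star = 0.5, 5.0
s = [s_star + 1.0 / 2 ** n for n in range(12)]   # the earlier Q-linear sequence
v0 = abs(s[0] - s_star)                          # |s_0 - s*| = 1
assert all(abs(s[k] - s_star) <= r ** k * v0 for k in range(len(s)))
print("v_k = r**k * |s_0 - s*| is a valid R-linear envelope")
```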

SLIDE 23

Question: could we analyze gradient descent more specifically? Assume backtracking line search.

Continue assuming Lipschitz continuity:

▶ Curvature is upper bounded: ∇²f(x) ⪯ LI

Assume strong convexity:

▶ Curvature is lower bounded: ∇²f(x) ⪰ mI
▶ For instance, we might not want to use plain gradient descent for a quadratic function (curvature is not accounted for).

  • Could we look at more conditions (strong convexity) for a better order of convergence with the existing gradient descent? Or could we look at completely different algorithms for a better order of convergence?
  • Without strong convexity, gradient descent is R-sublinear; with strong convexity, gradient descent is also Q-linear.

SLIDE 24

(Better) Convergence Using Strong Convexity

There exists (Fenchel) duality between strong convexity and Lipschitz continuity of the gradient. That is, with a good understanding of one, we can easily understand the other. See http://xingyuzhou.org/talks/Fenchel_duality.pdf for a quick summary!

SLIDE 25

Proof of Second Order Conditions for Convexity

In other words, ∇²f(x) ⪰ cI_{n×n}, where I_{n×n} is the n × n identity matrix and ⪰ corresponds to the positive semidefinite inequality. That is, the function f is strongly convex if ∇²f(x) − cI_{n×n} is positive semidefinite for all x ∈ D and for some constant c > 0, which corresponds to the positive minimum curvature of f.

PROOF: We will prove only the first statement; the other two statements are proved in a similar manner.

Necessity: Suppose f is a convex function, and consider a point x ∈ D. We will prove that for any h ∈ ℜⁿ, h⊤∇²f(x)h ≥ 0. Since f is convex, we have

f(x + th) ≥ f(x) + t∇f(x)⊤h    (48)

Consider the function ϕ(t) = f(x + th) defined on the domain D_ϕ = [0, 1].
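Since the proof works through ϕ(t) = f(x + th), a finite-difference estimate of ϕ''(0) = h⊤∇²f(x)h can illustrate the claim (the log-sum-exp test function and all values here are my choices, not from the slides):

```python
import numpy as np

def phi_second_derivative(f, x, h, eps=1e-4):
    """Central-difference estimate of phi''(0) = h^T (nabla^2 f(x)) h
    for phi(t) = f(x + t*h)."""
    return (f(x + eps * h) - 2.0 * f(x) + f(x - eps * h)) / eps ** 2

f = lambda x: np.log(np.sum(np.exp(x)))          # log-sum-exp, a convex function
x = np.array([0.3, -1.2, 2.0])
h = np.array([1.0, -2.0, 0.5])
print(phi_second_derivative(f, x, h))            # nonnegative, as the proof requires
```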


SLIDE 26

Lipschitz Continuity vs. Strong Convexity

Lipschitz continuity of the gradient (references to ∇² assume twice differentiability):

∇²f(x) ⪯ LI
∥∇f(x) − ∇f(y)∥ ≤ L∥x − y∥
f(y) ≤ f(x) + ∇f(x)⊤(y − x) + (L/2)∥y − x∥²

Strong convexity (curvature should be at least somewhat positive):

∇²f(x) ⪰ mI
f(y) ≥ f(x) + ∇f(x)⊤(y − x) + (m/2)∥y − x∥²

▶ m = 0 corresponds to (a sufficient condition for) ordinary convexity.
▶ Later: for example, the augmented Lagrangian is used to introduce strong convexity.
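Both quadratic bounds can be checked numerically for a quadratic f, where m and L are the extreme eigenvalues of A (a sketch with illustrative values):

```python
import numpy as np

A = np.array([[4.0, 1.0], [1.0, 2.0]])           # symmetric positive definite
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

m, L = np.linalg.eigvalsh(A)[[0, -1]]            # lambda_min, lambda_max
rng = np.random.default_rng(0)
for _ in range(100):
    x, y = rng.normal(size=2), rng.normal(size=2)
    base = f(x) + grad(x) @ (y - x)
    d2 = np.dot(y - x, y - x)
    # strong convexity lower bound and Lipschitz upper bound
    assert base + (m / 2) * d2 - 1e-9 <= f(y) <= base + (L / 2) * d2 + 1e-9
print(f"m = {m:.3f}, L = {L:.3f}: both quadratic bounds hold")
```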
