
Algorithms: Gradient Descent


  1. Algorithms: Gradient Descent

This classic greedy algorithm for minimization uses the negative of the gradient of the function at the current point x as the descent direction ∆x. This choice of ∆x corresponds to the direction of steepest descent under the L2 (Euclidean) norm and follows from the Cauchy-Schwarz inequality.

Find a starting point x(0) ∈ D
repeat
1. Set ∆x(k) = −∇f(x(k)).
2. Choose a step size t(k) > 0 using exact or backtracking ray search.
3. Obtain x(k+1) = x(k) + t(k)∆x(k).
4. Set k = k + 1.
until stopping criterion (such as ∥∇f(x(k+1))∥₂ ≤ ϵ) is satisfied

The steepest descent method can be thought of as changing the coordinate system in a particular way and then applying the gradient descent method in the changed coordinate system.
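As a concrete illustration (not part of the original slides), here is a minimal Python sketch of the algorithm above, using the backtracking ray search recapped on slide 5; the quadratic test function and the choices c1 = 0.01, beta = 0.5 are illustrative assumptions.

    import numpy as np

    def backtracking(f, g, x, dx, c1=0.01, beta=0.5):
        """Shrink t until the Armijo condition f(x + t dx) <= f(x) + c1 t g^T dx holds."""
        t = 1.0
        while f(x + t * dx) > f(x) + c1 * t * (g @ dx):
            t *= beta
        return t

    def gradient_descent(f, grad_f, x0, eps=1e-6, max_iter=10000):
        x = np.asarray(x0, dtype=float)
        for k in range(max_iter):
            g = grad_f(x)
            if np.linalg.norm(g) <= eps:       # stopping criterion ||grad f(x)||_2 <= eps
                break
            dx = -g                            # steepest descent direction under the L2 norm
            t = backtracking(f, g, x, dx)      # step size via backtracking ray search
            x = x + t * dx
        return x

    # Illustrative test on a strongly convex quadratic f(x) = 0.5 x^T A x - b^T x
    A = np.array([[3.0, 0.5], [0.5, 1.0]])
    b = np.array([1.0, -2.0])
    x_min = gradient_descent(lambda x: 0.5 * x @ A @ x - b @ x,
                             lambda x: A @ x - b, np.zeros(2))
    print(x_min, np.linalg.solve(A, b))        # the two should agree

With ∆x = −∇f(x), the Armijo test reduces to f(x + t∆x) ≤ f(x) − c₁t∥∇f(x)∥², which is exactly what the while-loop checks.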

  2.–4. Convergence of the Gradient Descent Algorithm

We recap the (necessary) inequality (36) resulting from Lipschitz continuity of ∇f(x):

f(y) ≤ f(x) + ∇f(x)⊤(y − x) + (L/2)∥y − x∥₂²

Considering x ≡ x_k and y ≡ x_{k+1} = x_k − t_k∇f(x_k), we get

f(x_{k+1}) ≤ f(x_k) − t_k ∇f(x_k)⊤∇f(x_k) + (L t_k²/2)∥∇f(x_k)∥²
⟹ f(x_{k+1}) ≤ f(x_k) − (1 − L t_k/2) t_k ∥∇f(x_k)∥²

We desire to have the following (46); it holds if the factor 1 − L t_k/2 can be bounded below by 1/2:

f(x_{k+1}) ≤ f(x_k) − (t̂/2)∥∇f(x_k)∥²    (46)

so that the drop in the value of the objective is at least of the order of the square of the norm of the gradient.

▶ With fixed step size t̂ = t, choosing 0 < t̂ ≤ 1/L ensures that 1 − Lt/2 ≥ 1/2.
▶ With backtracking ray search, (46) holds with t̂ = min{1, 2β(1 − c₁)/L} (derivation provided a few slides later).

See https://www.youtube.com/watch?v=SGZdsQviFYs&list=PLsd82ngobrvcYfCdnSnqM7lKLqE9qUUpX&index=17
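As a quick numerical sanity check of (46), here is a sketch under illustrative assumptions: for a quadratic f(x) = ½x⊤Ax − b⊤x, the gradient is Lipschitz with L = λ_max(A), and a fixed step t = 1/L should decrease f by at least (t/2)∥∇f(x_k)∥² at every iteration.

    import numpy as np

    A = np.array([[3.0, 0.5], [0.5, 1.0]])       # illustrative strongly convex quadratic
    b = np.array([1.0, -2.0])
    f = lambda x: 0.5 * x @ A @ x - b @ x
    grad_f = lambda x: A @ x - b

    L = np.linalg.eigvalsh(A).max()              # Lipschitz constant of grad f
    t = 1.0 / L                                  # fixed step with 0 < t <= 1/L
    x = np.array([5.0, -5.0])
    for _ in range(20):
        g = grad_f(x)
        x_next = x - t * g
        assert f(x_next) <= f(x) - 0.5 * t * (g @ g) + 1e-12   # inequality (46) with t_hat = t
        x = x_next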

  5. Aside: Backtracking ray search and Lipschitz Continuity

Recap the backtracking ray search algorithm:
▶ Choose a β ∈ (0, 1)
▶ Start with t = 1
▶ While f(x + t∆x) > f(x) + c₁t∇f(x)⊤∆x, do
⋆ Update t ← βt

On convergence, f(x + t∆x) ≤ f(x) + c₁t∇f(x)⊤∆x is satisfied. For gradient descent, this means f(x + t∆x) ≤ f(x) − c₁t∥∇f(x)∥².

For a function f with Lipschitz continuous ∇f(x) we have that

f(x_{k+1}) ≤ f(x_k) − (t̂/2)∥∇f(x_k)∥², with t̂ = min{1, 2β(1 − c₁)/L}

Reason: with backtracking ray search, if 1 − Lt_k/2 ≥ c₁, the Armijo rule will be satisfied. That is, 0 < t_k ≤ 2(1 − c₁)/L ⟹ 1 − Lt_k/2 ≥ c₁. If t = 1 does not already satisfy this, there must exist an integer j for which β^j ≤ 2(1 − c₁)/L ≤ β^{j−1}, so the accepted step is at least β · 2(1 − c₁)/L; hence we take t̂ = min{1, 2β(1 − c₁)/L}.
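The lower bound on the accepted step can also be checked numerically; in this sketch (same illustrative quadratic as above, with assumed values c₁ = 0.1 and β = 0.5), the t accepted by backtracking never falls below t̂ = min{1, 2β(1 − c₁)/L}.

    import numpy as np

    A = np.array([[3.0, 0.5], [0.5, 1.0]])
    b = np.array([1.0, -2.0])
    f = lambda x: 0.5 * x @ A @ x - b @ x
    grad_f = lambda x: A @ x - b
    L = np.linalg.eigvalsh(A).max()              # Lipschitz constant of grad f

    c1, beta = 0.1, 0.5
    t_hat = min(1.0, 2 * beta * (1 - c1) / L)    # claimed lower bound on the accepted step

    x = np.array([5.0, -5.0])
    for _ in range(30):
        g = grad_f(x)
        t = 1.0
        while f(x - t * g) > f(x) - c1 * t * (g @ g):   # Armijo test with dx = -grad f(x)
            t *= beta
        assert t >= t_hat - 1e-12                # backtracking never undershoots t_hat
        x = x - t * g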

  6.–7. Using convexity, we have f(x*) ≥ f(x_k) + ∇f(x_k)⊤(x* − x_k)

⟹ f(x_k) ≤ f(x*) + ∇f(x_k)⊤(x_k − x*)

Combining this with (46) (taking t̂ ≡ t), we get

f(x_{k+1}) ≤ f(x*) + ∇f(x_k)⊤(x_k − x*) − (t/2)∥∇f(x_k)∥²

⟹ f(x_{k+1}) ≤ f(x*) + (1/2t)(∥x_k − x*∥² − ∥x_k − x* − t∇f(x_k)∥²)

⟹ f(x_{k+1}) ≤ f(x*) + (1/2t)(∥x_k − x*∥² − ∥x_{k+1} − x*∥²)

⟹ f(x_{k+1}) − f(x*) ≤ (1/2t)(∥x_k − x*∥² − ∥x_{k+1} − x*∥²)    (47)

(The second step completes the square; the third uses x_{k+1} = x_k − t∇f(x_k).)
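To see the telescoping structure of (47) concretely, the following sketch (same illustrative quadratic) checks (47) at every iteration of gradient descent with fixed step t = 1/L.

    import numpy as np

    A = np.array([[3.0, 0.5], [0.5, 1.0]])
    b = np.array([1.0, -2.0])
    f = lambda x: 0.5 * x @ A @ x - b @ x
    grad_f = lambda x: A @ x - b
    x_star = np.linalg.solve(A, b)               # exact minimizer of the quadratic
    t = 1.0 / np.linalg.eigvalsh(A).max()        # fixed step t = 1/L

    x = np.array([5.0, -5.0])
    for _ in range(50):
        x_next = x - t * grad_f(x)
        lhs = f(x_next) - f(x_star)
        rhs = (np.sum((x - x_star)**2) - np.sum((x_next - x_star)**2)) / (2 * t)
        assert lhs <= rhs + 1e-12                # inequality (47) at every iteration
        x = x_next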

  8.–9. The ray and line search ensure that f(x_{i+1}) ≤ f(x_i) ∀ i = 0, 1, ..., k.⁶ Summing (47) over the iterations, the right-hand side telescopes and we get

Σ_{i=1}^{k} (f(x_i) − f(x*)) ≤ (1/2t) ∥x(0) − x*∥²

f(x_k) − f(x*) ≤ (1/k) Σ_{i=1}^{k} (f(x_i) − f(x*)) ≤ ∥x(0) − x*∥² / (2tk)

Thus, as k → ∞, f(x_k) → f(x*). This shows convergence for gradient descent. To get ϵ-close to f(x*), it is sufficient for k to be O(1/ϵ).

⁶ By the Armijo condition in (29), for some 0 < c₁ < 1, f(x_{i+1}) ≤ f(x_i) + c₁ t ∇f(x_i)⊤∆x_i.
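The O(1/ϵ) claim says the suboptimality gap shrinks like 1/k; here is a sketch checking the bound f(x_k) − f(x*) ≤ ∥x(0) − x*∥²/(2tk) on the same illustrative quadratic.

    import numpy as np

    A = np.array([[3.0, 0.5], [0.5, 1.0]])
    b = np.array([1.0, -2.0])
    f = lambda x: 0.5 * x @ A @ x - b @ x
    grad_f = lambda x: A @ x - b
    x_star = np.linalg.solve(A, b)               # exact minimizer
    t = 1.0 / np.linalg.eigvalsh(A).max()        # fixed step t = 1/L

    x0 = np.array([5.0, -5.0])
    x = x0.copy()
    bound_const = np.sum((x0 - x_star)**2) / (2 * t)
    for k in range(1, 101):
        x = x - t * grad_f(x)
        assert f(x) - f(x_star) <= bound_const / k + 1e-12   # gap <= ||x(0) - x*||^2 / (2tk)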

  10. Rates of Convergence

  11. Convergence

[Figure: error-vs-iteration curves on a log scale. The rate of convergence appears as the slope, with the curves shown in increasing order of slope; observe the acceleration. This is what we observed for the better algorithm on the Rosenbrock function.]

  12. Linear Convergence

A sequence v^(1), ..., v^(k), ... with limit v* is linearly (or, more specifically, Q-linearly) convergent if

∥v^(k+1) − v*∥ ≤ r ∥v^(k) − v*∥

for all k ≥ θ, for some threshold θ and some r ∈ (0, 1).
▶ 'Q' here stands for 'quotient' of the norms as shown above

  13.–15. Q-convergence

▶ Consider the sequence s_n = 5 + 1/2^n, i.e., [5 + 1/2, 5 + 1/4, 5 + 1/8, ...]

The sequence converges to s* = 5, and it is Q-linearly convergent (see the check below) because

|s_{k+1} − s*| / |s_k − s*| = (1/2^{k+1}) / (1/2^k) = 1/2 < 0.6 (= r)

▶ How about the convergence result we got by assuming Lipschitz continuity with backtracking and exact line searches?
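A few lines of Python (illustrative, not from the slides) confirm the constant quotient for s_n = 5 + 1/2^n:

    s_star = 5.0
    s = [5 + 2.0**(-n) for n in range(1, 12)]                # s_n = 5 + 1/2^n
    print([abs(s[k + 1] - s_star) / abs(s[k] - s_star) for k in range(len(s) - 1)])
    # every quotient is exactly 0.5, so any r in [0.5, 1) witnesses Q-linear convergence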

  16.–18. Generalizing Q-convergence to R-convergence

Consider the sequence r_n = 5 + 1/4^{⌊n/2⌋}, whose terms repeat in pairs before dropping by a factor of 1/4.

The sequence converges to r* = 5, but not Q-linearly: the quotient |r_{n+1} − r*| / |r_n − r*| equals 1 on every other step, so no single r ∈ (0, 1) works for all large n (see the sketch after this slide).

Let us consider the convergence result we got by assuming Lipschitz continuity with backtracking and exact line searches:

f(x_k) − f(x*) ≤ ∥x(0) − x*∥² / (2tk)

Q-convergence by itself is insufficient for analyzing such sequences and bounds. We will generalize it to R-convergence ('R' here stands for 'root', as we are looking at convergence rooted at x*).

We say that the sequence s^(1), ..., s^(k), ... is R-linearly convergent if ∥s^(k) − s*∥ ≤ v_k, ∀ k, and {v_k} converges Q-linearly to zero.
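A sketch contrasting the two notions on the sequence above (the exact formula is reconstructed from the slide, so treat it as an assumption): the quotients hit 1, yet |r_n − r*| is dominated by the envelope v_n = 2·(1/2)^n, which converges Q-linearly to zero.

    r_star = 5.0
    r = [5 + 4.0**(-(n // 2)) for n in range(1, 25)]         # r_n = 5 + 1/4^floor(n/2)
    quotients = [abs(r[k + 1] - r_star) / abs(r[k] - r_star) for k in range(len(r) - 1)]
    print(max(quotients))                                    # 1.0: no r < 1 works, not Q-linear

    # Yet the envelope v_n = 2 * (1/2)^n converges Q-linearly to zero and dominates |r_n - 5|:
    assert all(abs(rn - r_star) <= 2.0 * 0.5**n for n, rn in enumerate(r, start=1))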

  19. R-convergence assuming Lipschitz continuity

Consider v_k = ∥x(0) − x*∥² / (2tk) = α/k, where α is a constant.

Here, with v* = 0, we have

∥v_{k+1} − v*∥ / ∥v_k − v*∥ = k/(k+1) → 1 as k → ∞

so {v_k} does not converge Q-linearly to zero.
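A two-line check (illustrative) that this quotient approaches 1, so the envelope v_k = α/k is not Q-linearly convergent to zero:

    alpha = 1.0
    v = [alpha / k for k in range(1, 1001)]                  # v_k = alpha / k, with v* = 0
    print([v[k + 1] / v[k] for k in (0, 9, 99, 998)])        # 0.5, 0.909..., 0.990..., 0.999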
