

  1. CS257 Linear and Convex Optimization, Lecture 10
Bo Jiang
John Hopcroft Center for Computer Science, Shanghai Jiao Tong University
November 9, 2020

  2. Recap
Strong convexity. f is m-strongly convex if f(x) − (m/2)‖x‖² is convex.
• first-order condition: f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (m/2)‖y − x‖²
• second-order condition: ∇²f(x) ⪰ mI ⟺ λ_min(∇²f(x)) ≥ m
Convergence. For m-strongly convex and L-smooth f with minimum x*, gradient descent with constant step size t ∈ (0, 1/L] satisfies
f(x_k) − f(x*) ≤ (L/m)(1 − mt)^k [f(x_0) − f(x*)]
Condition number. For Q ≻ O, κ(Q) = λ_max(Q)/λ_min(Q). Well-/ill-conditioned if κ(Q) is small/large ⟹ fast/slow convergence.

  3. Today
• exact line search
• backtracking line search
• Newton's method

  4. Step Size
Gradient descent: x_{k+1} = x_k − t_k ∇f(x_k)
• constant step size: t_k = t for all k
• exact line search: optimal t_k for each step, t_k = argmin_s f(x_k − s∇f(x_k))
• backtracking line search (Armijo's rule): t_k satisfies f(x_k) − f(x_k − t_k∇f(x_k)) ≥ α t_k ‖∇f(x_k)‖² for some given α ∈ (0, 1).
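The first variant, constant-step gradient descent, can be sketched in a few lines. This is a minimal numpy sketch; the diagonal test quadratic Q, the starting point, and the iteration count are hypothetical choices, not from the slides.

```python
import numpy as np

# Hypothetical test problem: f(x) = 1/2 x^T Q x, so grad f(x) = Q x.
Q = np.diag([0.5, 1.0])
grad = lambda x: Q @ x

def gradient_descent(x0, t, iters=100):
    """Constant-step-size gradient descent: x_{k+1} = x_k - t * grad f(x_k)."""
    x = x0.copy()
    for _ in range(iters):
        x = x - t * grad(x)
    return x

# t = 1/L with L = lambda_max(Q) = 1, the largest admissible constant step
x = gradient_descent(np.array([2.0, 1.0]), t=1.0)
```

For this quadratic the iterates contract toward the minimum x* = 0 at a geometric rate.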

  5. Exact Line Search
1: initialization x ← x_0 ∈ R^n
2: while ‖∇f(x)‖ > δ do
3:   t ← argmin_s f(x − s∇f(x))
4:   x ← x − t∇f(x)
5: end while
6: return x
[Figure: level curves of a quadratic f(x_1, x_2); along the ray x_k − s∇f(x_k), the 1-D function f(x_k − s∇f(x_k)) is minimized at s = t.]
Note. Exact line search is often impractical; it is used only if the inner minimization is cheap.
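The algorithm above can be sketched directly, with the inner 1-D minimization done by ternary search. This is a sketch under assumptions: the quadratic Q, the search interval [0, 10], and the tolerances are hypothetical; a real implementation would use a proper scalar minimizer.

```python
import numpy as np

# Hypothetical test quadratic: f(x) = 1/2 x^T Q x
Q = np.diag([0.5, 1.0])
f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x

def line_min(h, lo=0.0, hi=10.0, iters=100):
    """Minimize a unimodal 1-D function h on [lo, hi] by ternary search."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if h(m1) < h(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

def exact_line_search_gd(x0, delta=1e-8):
    """Gradient descent where each step size minimizes f along -grad f."""
    x = x0.copy()
    while np.linalg.norm(grad(x)) > delta:
        g = grad(x)
        t = line_min(lambda s: f(x - s * g))
        x = x - t * g
    return x

x = exact_line_search_gd(np.array([2.0, 1.0]))
```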

  6. Exact Line Search for Quadratic Functions
f(x) = (1/2)xᵀQx + bᵀx, Q ≻ O
• gradient at x_k is g_k = ∇f(x_k) = Qx_k + b
• the second-order Taylor expansion is exact for quadratic functions:
h(t) = f(x_k − t g_k) = f(x_k) + ∇f(x_k)ᵀ(−t g_k) + (1/2)(−t g_k)ᵀ∇²f(x_k)(−t g_k)
     = (1/2)(g_kᵀQg_k) t² − (g_kᵀg_k) t + f(x_k)
• minimizing h(t) yields the best step size t_k = (g_kᵀg_k)/(g_kᵀQg_k)
• update step: x_{k+1} = x_k − t_k g_k = x_k − [(g_kᵀg_k)/(g_kᵀQg_k)] g_k
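For quadratics the inner minimization is therefore a closed-form ratio. A minimal numpy sketch of one exact-line-search step, using a hypothetical diagonal Q and b = 0:

```python
import numpy as np

# f(x) = 1/2 x^T Q x + b^T x, with Q positive definite (hypothetical example)
Q = np.diag([0.5, 1.0])
b = np.zeros(2)

def exact_step(x):
    g = Q @ x + b                      # gradient g_k = Q x_k + b
    if g @ g == 0:                     # already stationary
        return x
    t = (g @ g) / (g @ Q @ g)          # closed-form t_k from the slide
    return x - t * g

x = np.array([2.0, 1.0])
for _ in range(50):
    x = exact_step(x)
```

No 1-D search is needed: each step just costs one matrix-vector product with Q.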

  7. Example
f(x_1, x_2) = (1/2)xᵀQx = (γ/2)x_1² + (1/2)x_2², Q = diag{γ, 1}
Well-conditioned. γ = 0.5, x_0 = (2, 1)ᵀ
[Figure: iterates on the level curves of f, and the error f(x_k) − f(x*) versus iteration k.]
Fast convergence.
Note. Successive gradient directions are always orthogonal, as
0 = h′(t_k) = −∇f(x_k − t_k∇f(x_k))ᵀ∇f(x_k) = −∇f(x_{k+1})ᵀ∇f(x_k)
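The orthogonality of successive gradients can be checked numerically. A short sketch with the same quadratic family (γ = 0.5 and the starting point are taken from the example; everything else follows the closed-form step for quadratics):

```python
import numpy as np

gamma = 0.5
Q = np.diag([gamma, 1.0])              # f(x) = (gamma/2) x1^2 + (1/2) x2^2
grad = lambda x: Q @ x

x = np.array([2.0, 1.0])
g = grad(x)
t = (g @ g) / (g @ Q @ g)              # exact line search step for a quadratic
x_next = x - t * g

# h'(t) = 0 at the minimizer implies grad f(x_{k+1})^T grad f(x_k) = 0
inner = grad(x_next) @ g
```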

  8. Example (cont'd)
f(x_1, x_2) = (1/2)xᵀQx = (γ/2)x_1² + (1/2)x_2², Q = diag{γ, 1}
Ill-conditioned. γ = 0.01; the convergence rate depends on the initial point.
[Figure: iterates and the error f(x_k) − f(x*) versus iteration k for two initial points.]
x_0 = (2, 0.3)ᵀ: fast convergence. x_0 = (2, 0.02)ᵀ: slow convergence.

  9. Convergence Analysis
Theorem. If f is m-strongly convex and L-smooth, and x* is a minimum of f, then the sequence {x_k} produced by gradient descent with exact line search satisfies
f(x_k) − f(x*) ≤ (1 − m/L)^k [f(x_0) − f(x*)]
Notes.
• 0 ≤ 1 − m/L < 1, so x_k → x* and f(x_k) → f(x*) exponentially fast.
• The number of iterations to reach f(x_k) − f(x*) ≤ ε is O(log(1/ε)). For ε = 10^{−p}, k = O(p), linear in the number of significant digits.
• The convergence rate depends on the condition number L/m and can be slow if L/m is large. When close to x*, we can estimate L/m by κ(∇²f(x*)).
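The theorem's bound can be checked empirically on a strongly convex quadratic, where m and L are the extreme eigenvalues of Q. The specific Q, starting point, and iteration count below are hypothetical:

```python
import numpy as np

m, L = 0.5, 1.0
Q = np.diag([m, L])                    # eigenvalues m and L
f = lambda x: 0.5 * x @ Q @ x          # minimum f(x*) = 0 at x* = 0
grad = lambda x: Q @ x

x = np.array([2.0, 1.0])
gap0 = f(x)
gaps = []                              # record (k, f(x_k)) for each iteration
for k in range(1, 11):
    g = grad(x)
    x = x - ((g @ g) / (g @ Q @ g)) * g   # exact line search step
    gaps.append((k, f(x)))
```

On this example the observed decrease is in fact faster than the worst-case (1 − m/L)^k rate, as the bound only needs to hold as an upper limit.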

  10. Proof
1. By the quadratic upper bound for L-smooth functions,
f(x_k − t∇f(x_k)) ≤ f(x_k) − t‖∇f(x_k)‖² + (Lt²/2)‖∇f(x_k)‖² ≕ q(t)
2. Minimizing over t in step 1,
f(x_{k+1}) = min_t f(x_k − t∇f(x_k)) ≤ min_t q(t) = q(1/L) = f(x_k) − (1/2L)‖∇f(x_k)‖²
3. By m-strong convexity,
f(x) ≥ f(x_k) + ∇f(x_k)ᵀ(x − x_k) + (m/2)‖x − x_k‖² ≕ f̂(x)
4. Minimizing over x in step 3,
f(x*) = min_x f(x) ≥ min_x f̂(x) = f̂(x_k − (1/m)∇f(x_k)) = f(x_k) − (1/2m)‖∇f(x_k)‖²
5. By step 4, ‖∇f(x_k)‖² ≥ 2m[f(x_k) − f(x*)]. Plugging into step 2,
f(x_{k+1}) − f(x*) ≤ (1 − m/L)[f(x_k) − f(x*)]

  11. Backtracking Line Search
Exact line search is often expensive and not worth it; it suffices to find a good enough step size. One way to do so is backtracking line search, aka Armijo's rule.
Gradient descent with backtracking line search
1: initialization x ← x_0 ∈ R^n
2: while ‖∇f(x)‖ > δ do
3:   t ← t_0
4:   while f(x − t∇f(x)) > f(x) − αt‖∇f(x)‖² do
5:     t ← βt
6:   end while
7:   x ← x − t∇f(x)
8: end while
9: return x
α ∈ (0, 1) and β ∈ (0, 1) are constants. Armijo used α = β = 0.5. Values suggested in [BV]: α ∈ [0.01, 0.3], β ∈ [0.1, 0.8].
Note. For a general descent direction d, use the condition f(x + td) > f(x) + αt∇f(x)ᵀd.
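The pseudocode above translates almost line for line into numpy. The test quadratic and the particular constants (α = 0.3, β = 0.5, t_0 = 1, within the [BV] ranges) are hypothetical choices:

```python
import numpy as np

# Hypothetical test quadratic: f(x) = 1/2 x^T Q x
Q = np.diag([0.5, 1.0])
f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x

def backtracking_gd(x0, t0=1.0, alpha=0.3, beta=0.5, delta=1e-8):
    """Gradient descent with Armijo backtracking line search."""
    x = x0.copy()
    while np.linalg.norm(grad(x)) > delta:
        g = grad(x)
        t = t0
        # shrink t geometrically until the sufficient-decrease condition holds
        while f(x - t * g) > f(x) - alpha * t * (g @ g):
            t *= beta
        x = x - t * g
    return x

x = backtracking_gd(np.array([2.0, 1.0]))
```

Each outer iteration costs a handful of extra function evaluations instead of a full 1-D minimization.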

  12. Backtracking Line Search (cont'd)
[Figure: f(x_k + t d_k) versus t, together with the tangent line f(x_k) + t∇f(x_k)ᵀd_k and the relaxed line f(x_k) + αt∇f(x_k)ᵀd_k; candidate step sizes t_0, t_1 = βt_0, t_2 = β²t_0.]
• ∇f(x_k)ᵀd_k < 0 for a descent direction d_k
• start from some "large" step size t_0 ([BV] uses t_0 = 1)
• reduce the step size geometrically until the decrease is "large enough":
f(x_k) − f(x_k + t d_k) ≥ α · t|∇f(x_k)ᵀd_k|
(left side: actual decrease in function value; right side: α times the decrease along the tangent line)

  13. Example
f(x_1, x_2) = (1/2)xᵀQx = (γ/2)x_1² + (1/2)x_2², Q = diag{γ, 1}
Well-conditioned. γ = 0.5, x_0 = (2, 1)ᵀ
[Figure: iterates on the level curves of f, and the error f(x_k) − f(x*) versus iteration k.]
Fast convergence.

  14. Example (cont'd)
f(x_1, x_2) = (1/2)xᵀQx = (γ/2)x_1² + (1/2)x_2², Q = diag{γ, 1}
Ill-conditioned. γ = 0.01
[Figure: iterates and the error f(x_k) − f(x*) versus iteration k for two initial points.]
x_0 = (2, 0.3)ᵀ: slow convergence. x_0 = (2, 0.02)ᵀ: slow convergence.

  15. Convergence Analysis
Theorem. If f is m-strongly convex and L-smooth, and x* is a minimum of f, then the sequence {x_k} produced by gradient descent with backtracking line search satisfies
f(x_k) − f(x*) ≤ c^k [f(x_0) − f(x*)]
where
c = 1 − min{2mαt_0, 4mβα(1 − α)/L}
Notes.
• c ∈ (0, 1), since 4mβα(1 − α)/L ≤ βm/L ≤ β < 1 (using 4α(1 − α) ≤ 1 and m ≤ L), so x_k → x* and f(x_k) → f(x*) exponentially fast.
• The number of iterations to reach f(x_k) − f(x*) ≤ ε is O(log(1/ε)). For ε = 10^{−p}, k = O(p), linear in the number of significant digits.
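The constant c and the geometric bound can be checked on a quadratic where m and L are the extreme eigenvalues of Q. The particular Q, starting point, and backtracking constants below are hypothetical:

```python
import numpy as np

m, L = 0.5, 1.0
Q = np.diag([m, L])
f = lambda x: 0.5 * x @ Q @ x          # f(x*) = 0 at x* = 0
grad = lambda x: Q @ x
alpha, beta, t0 = 0.3, 0.5, 1.0
c = 1 - min(2 * m * alpha * t0, 4 * m * beta * alpha * (1 - alpha) / L)

x = np.array([2.0, 1.0])
gap0 = f(x)
gaps = []                              # record (k, f(x_k))
for k in range(1, 11):
    g = grad(x)
    t = t0
    while f(x - t * g) > f(x) - alpha * t * (g @ g):
        t *= beta                      # Armijo backtracking
    x = x - t * g
    gaps.append((k, f(x)))
```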

  16. Proof
The inner loop terminates with a step size bounded from below.
1. By the quadratic upper bound for L-smooth functions,
f(x_k − t∇f(x_k)) ≤ f(x_k) − t(1 − Lt/2)‖∇f(x_k)‖²
2. The inner loop is guaranteed to terminate once
−t(1 − Lt/2)‖∇f(x_k)‖² ≤ −αt‖∇f(x_k)‖² ⟺ t ≤ 2(1 − α)/L
3. The step size in backtracking line search satisfies
t_k ≥ η ≜ min{t_0, 2β(1 − α)/L}
  ◦ t_k = t_0 if Armijo's condition is satisfied by t_0
  ◦ otherwise t_k/β > 2(1 − α)/L, since the inner loop did not terminate at t_k/β, hence t_k > 2β(1 − α)/L

  17. Proof (cont'd)
Now we look at the outer loop.
4. By Armijo's condition in the inner loop,
f(x_{k+1}) = f(x_k − t_k∇f(x_k)) ≤ f(x_k) − αt_k‖∇f(x_k)‖²
5. By steps 3 and 4,
f(x_{k+1}) − f(x*) ≤ f(x_k) − f(x*) − αη‖∇f(x_k)‖²
6. By step 4 of the exact-line-search proof, ‖∇f(x_k)‖² ≥ 2m[f(x_k) − f(x*)]
7. By steps 5 and 6,
f(x_{k+1}) − f(x*) ≤ (1 − 2mαη)[f(x_k) − f(x*)] = c[f(x_k) − f(x*)]
so f(x_k) − f(x*) ≤ c^k [f(x_0) − f(x*)]

  18. Better Descent Direction
Gradient descent uses first-order information (i.e. the gradient):
x_{k+1} = x_k − t_k∇f(x_k)
Locally −∇f(x_k) is the direction of steepest descent, but globally it may not be the "right" direction.
Example. For f(x) = (1/2)xᵀQx with Q = diag{0.01, 1}, the optimum is x* = 0. The negative gradient is
−∇f(x) = −Qx = −(0.01x_1, x_2)ᵀ
quite different from the "right" descent direction d = −x. Note
d = −Q⁻¹∇f(x) = −[∇²f(x)]⁻¹∇f(x)
With second-order information (i.e. the Hessian), we hope to do better.
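For the quadratic in this example, the direction −[∇²f(x)]⁻¹∇f(x) equals −x, so a single full step lands exactly on x* = 0. A short numpy sketch (the starting point is a hypothetical choice):

```python
import numpy as np

Q = np.diag([0.01, 1.0])               # f(x) = 1/2 x^T Q x, optimum x* = 0
grad = lambda x: Q @ x
hess = lambda x: Q                      # Hessian of a quadratic is constant

x = np.array([2.0, 0.3])
d = -np.linalg.solve(hess(x), grad(x))  # direction -[hess f(x)]^{-1} grad f(x)
x_new = x + d                           # one full step
```

This is the idea behind Newton's method, developed next: rescale the gradient by the inverse Hessian so the step accounts for curvature.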
