CS257 Linear and Convex Optimization Lecture 10 Bo Jiang John Hopcroft Center for Computer Science Shanghai Jiao Tong University November 9, 2020

Recap
Strong convexity. $f$ is $m$-strongly convex if $f(x) - \frac{m}{2}\|x\|^2$ is convex.
• first-order condition: $f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{m}{2}\|y - x\|^2$
• second-order condition: $\nabla^2 f(x) \succeq mI \iff \lambda_{\min}(\nabla^2 f(x)) \ge m$
Convergence. For $m$-strongly convex and $L$-smooth $f$ with minimum $x^*$, gradient descent with constant step size $t \in (0, \frac{1}{L}]$ satisfies
$$f(x_k) - f(x^*) \le \frac{L}{m}(1 - mt)^k [f(x_0) - f(x^*)]$$
Condition number. For $Q \succ O$, $\kappa(Q) = \dfrac{\lambda_{\max}(Q)}{\lambda_{\min}(Q)}$.
Well-/ill-conditioned if $\kappa(Q)$ is small/large $\implies$ fast/slow convergence.
1/24
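For a quadratic $f(x) = \frac{1}{2}x^T Q x$ the Hessian is $Q$ everywhere, so one can take $m = \lambda_{\min}(Q)$ and $L = \lambda_{\max}(Q)$. The sketch below computes these constants for the matrix $Q = \mathrm{diag}\{0.01, 1\}$ used later in the lecture; the function name `strong_convexity_constants` is an illustrative choice, not something from the slides.

```python
import numpy as np

def strong_convexity_constants(Q):
    """Return (m, L, kappa) for f(x) = 0.5 * x^T Q x with symmetric Q > 0."""
    eigvals = np.linalg.eigvalsh(Q)   # eigenvalues in ascending order
    m, L = eigvals[0], eigvals[-1]    # lambda_min and lambda_max of the Hessian
    return m, L, L / m

Q = np.diag([0.01, 1.0])
m, L, kappa = strong_convexity_constants(Q)

# Per-step factor (1 - m t) from the recap bound with the step size t = 1/L.
rate = 1.0 - m / L
print(f"m={m}, L={L}, kappa={kappa}, rate={rate}")
```

A large `kappa` pushes `rate` toward 1, which is the slow-convergence regime described above.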

Today • exact line search • backtracking line search • Newton’s method 2/24

Step Size
Gradient descent: $x_{k+1} = x_k - t_k \nabla f(x_k)$
• constant step size: $t_k = t$ for all $k$
• exact line search: optimal $t_k$ for each step, $t_k = \arg\min_s f(x_k - s \nabla f(x_k))$
• backtracking line search (Armijo's rule): $t_k$ satisfies $f(x_k) - f(x_k - t_k \nabla f(x_k)) \ge \alpha t_k \|\nabla f(x_k)\|_2^2$ for some given $\alpha \in (0, 1)$.
3/24

Exact Line Search
    initialization: $x \leftarrow x_0 \in \mathbb{R}^n$
    while $\|\nabla f(x)\| > \delta$ do
        $t \leftarrow \arg\min_s f(x - s \nabla f(x))$
        $x \leftarrow x - t \nabla f(x)$
    end while
    return $x$
[Figure: level curves of $f(x_1, x_2)$ with the direction $-\nabla f$ at the current point, and the curve $f(x_k - s \nabla f(x_k))$ plotted against $s$ with its minimizer $t$ marked.]
Note. Often impractical; used only if the inner minimization is cheap.
4/24

Exact Line Search for Quadratic Functions
$f(x) = \frac{1}{2} x^T Q x + b^T x$, $Q \succ O$
• gradient at $x_k$ is $g_k = \nabla f(x_k) = Q x_k + b$
• second-order Taylor expansion is exact for quadratic functions,
$$h(t) = f(x_k - t g_k) = f(x_k) + \nabla f(x_k)^T (-t g_k) + \frac{1}{2} (-t g_k)^T \nabla^2 f(x_k) (-t g_k) = \left(\frac{1}{2} g_k^T Q g_k\right) t^2 - g_k^T g_k \, t + f(x_k)$$
• minimizing $h(t)$ yields the best step size $t_k = \dfrac{g_k^T g_k}{g_k^T Q g_k}$
• update step: $x_{k+1} = x_k - t_k g_k = x_k - \dfrac{g_k^T g_k}{g_k^T Q g_k} g_k$
5/24
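The closed-form step size makes exact line search cheap for quadratics. A minimal NumPy sketch of the update above follows; the function name `exact_line_search_gd`, the tolerance `delta`, and the cap `max_iter` are illustrative assumptions, not part of the slides.

```python
import numpy as np

def exact_line_search_gd(Q, b, x0, delta=1e-8, max_iter=10_000):
    """Minimize f(x) = 0.5 x^T Q x + b^T x for symmetric positive definite Q.

    Returns the final iterate and the number of iterations performed.
    """
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = Q @ x + b                      # g_k = Q x_k + b
        if np.linalg.norm(g) <= delta:     # stopping rule ||grad f(x)|| <= delta
            return x, k
        t = (g @ g) / (g @ Q @ g)          # t_k = g^T g / g^T Q g
        x = x - t * g                      # x_{k+1} = x_k - t_k g_k
    return x, max_iter

# Well-conditioned example from the next slide: gamma = 0.5, x0 = (2, 1).
Q = np.diag([0.5, 1.0])
b = np.zeros(2)
x_min, iters = exact_line_search_gd(Q, b, [2.0, 1.0])
print(x_min, iters)
```

The denominator `g @ Q @ g` is positive whenever `g` is nonzero because $Q \succ O$, so the division is safe after the stopping test.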

Example
$f(x_1, x_2) = \frac{1}{2} x^T Q x = \frac{\gamma}{2} x_1^2 + \frac{1}{2} x_2^2$, $Q = \mathrm{diag}\{\gamma, 1\}$
Well-conditioned. $\gamma = 0.5$, $x_0 = (2, 1)^T$
[Figure: left, iterates on the level curves in the $(x_1, x_2)$ plane; right, error $f(x_k) - f(x^*)$ against iteration $k$ on a log scale.]
Fast convergence.
Note. Successive gradient directions are always orthogonal, as
$$0 = h'(t_k) = -\nabla f(x_k - t_k \nabla f(x_k))^T \nabla f(x_k) = -\nabla f(x_{k+1})^T \nabla f(x_k)$$
6/24

Example (cont'd)
$f(x_1, x_2) = \frac{1}{2} x^T Q x = \frac{\gamma}{2} x_1^2 + \frac{1}{2} x_2^2$, $Q = \mathrm{diag}\{\gamma, 1\}$
Ill-conditioned. $\gamma = 0.01$, convergence rate depends on the initial point.
[Figure: two panels, each showing the iterates in the $(x_1, x_2)$ plane and the error $f(x_k) - f(x^*)$ against iteration $k$ on a log scale.]
• $x_0 = (2, 0.3)$: fast convergence
• $x_0 = (2, 0.02)$: slow convergence
7/24
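The two observations from these examples (orthogonal successive gradients, and sensitivity to the starting point when $\gamma = 0.01$) can be checked numerically. The sketch below is an illustration under assumed settings: the helper name `run_exact`, the tolerance `delta = 1e-6`, and the iteration cap are choices made here, not values from the slides.

```python
import numpy as np

def run_exact(gamma, x0, delta=1e-6, max_iter=100_000):
    """Exact line search on f(x) = 0.5 x^T diag(gamma, 1) x.

    Returns (iterations, largest |grad f(x_{k+1})^T grad f(x_k)| observed).
    """
    Q = np.diag([gamma, 1.0])
    x = np.asarray(x0, dtype=float)
    g = Q @ x
    k, max_dot = 0, 0.0
    while np.linalg.norm(g) > delta and k < max_iter:
        t = (g @ g) / (g @ Q @ g)          # exact step size for a quadratic
        x = x - t * g
        g_new = Q @ x
        max_dot = max(max_dot, abs(g_new @ g))   # should vanish up to rounding
        g = g_new
        k += 1
    return k, max_dot

iters_fast, dot_fast = run_exact(0.01, [2.0, 0.3])
iters_slow, dot_slow = run_exact(0.01, [2.0, 0.02])
print(iters_fast, iters_slow, dot_fast, dot_slow)
```

Starting from $(2, 0.02)$ the two gradient components have equal magnitude, which is the unfavorable case for this $Q$, so the iteration count is much larger than from $(2, 0.3)$.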

Convergence Analysis
Theorem. If $f$ is $m$-strongly convex and $L$-smooth, and $x^*$ is a minimum of $f$, then the sequence $\{x_k\}$ produced by gradient descent with exact line search satisfies
$$f(x_k) - f(x^*) \le \left(1 - \frac{m}{L}\right)^k [f(x_0) - f(x^*)]$$
Notes.
• $0 \le 1 - \frac{m}{L} < 1$, so $x_k \to x^*$ and $f(x_k) \to f(x^*)$ exponentially fast.
• The number of iterations to reach $f(x_k) - f(x^*) \le \epsilon$ is $O(\log \frac{1}{\epsilon})$. For $\epsilon = 10^{-p}$, $k = O(p)$, linear in the number of significant digits.
• The convergence rate depends on the condition number $L/m$ and can be slow if $L/m$ is large. When close to $x^*$, we can estimate $L/m$ by $\kappa(\nabla^2 f(x^*))$.
8/24
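A worked step behind the $O(\log \frac{1}{\epsilon})$ count, as a sketch: it assumes $m < L$ (so that $0 < 1 - m/L < 1$) and $E_0 > \epsilon > 0$, where $E_0 = f(x_0) - f(x^*)$ is shorthand introduced here and not used on the slides. The second line uses $\log\frac{1}{1-u} \ge u$ for $u \in [0, 1)$.

```latex
\left(1 - \frac{m}{L}\right)^{k} E_0 \le \epsilon
\iff
k \ge \frac{\log(E_0/\epsilon)}{\log\frac{1}{1 - m/L}},
\qquad
\log\frac{1}{1 - m/L} \ge \frac{m}{L}
\;\implies\;
k \ge \frac{L}{m}\,\log\frac{E_0}{\epsilon}\ \text{ suffices.}
```

The sufficient count scales linearly with the condition number $L/m$ and logarithmically with $1/\epsilon$.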

Proof
1. By the quadratic upper bound for $L$-smooth functions,
$$f(x_k - t \nabla f(x_k)) \le f(x_k) - t \|\nabla f(x_k)\|^2 + \frac{L t^2}{2} \|\nabla f(x_k)\|^2 \triangleq q(t)$$
2. Minimizing over $t$ in step 1,
$$f(x_{k+1}) = \min_t f(x_k - t \nabla f(x_k)) \le \min_t q(t) = q\left(\tfrac{1}{L}\right) = f(x_k) - \frac{1}{2L} \|\nabla f(x_k)\|^2$$
3. By $m$-strong convexity,
$$f(x) \ge f(x_k) + \nabla f(x_k)^T (x - x_k) + \frac{m}{2} \|x - x_k\|^2 \triangleq \hat{f}(x)$$
4. Minimizing over $x$ in step 3,
$$f(x^*) = \min_x f(x) \ge \min_x \hat{f}(x) = \hat{f}\left(x_k - \tfrac{1}{m} \nabla f(x_k)\right) = f(x_k) - \frac{1}{2m} \|\nabla f(x_k)\|^2$$
5. By 4, $\|\nabla f(x_k)\|^2 \ge 2m [f(x_k) - f(x^*)]$. Plugging into 2,
$$f(x_{k+1}) - f(x^*) \le \left(1 - \frac{m}{L}\right) [f(x_k) - f(x^*)]$$
9/24

Backtracking Line Search
Exact line search is often expensive and not worth the cost. It suffices to find a good enough step size. One way to do so is backtracking line search, also known as Armijo's rule.
Gradient descent with backtracking line search
    initialization: $x \leftarrow x_0 \in \mathbb{R}^n$
    while $\|\nabla f(x)\| > \delta$ do
        $t \leftarrow t_0$
        while $f(x - t \nabla f(x)) > f(x) - \alpha t \|\nabla f(x)\|_2^2$ do
            $t \leftarrow \beta t$
        end while
        $x \leftarrow x - t \nabla f(x)$
    end while
    return $x$
$\alpha \in (0, 1)$ and $\beta \in (0, 1)$ are constants. Armijo used $\alpha = \beta = 0.5$.
Values suggested in [BV]: $\alpha \in [0.01, 0.3]$, $\beta \in [0.1, 0.8]$.
Note. For a general direction $d$, backtrack while $f(x + t d) > f(x) + \alpha t \nabla f(x)^T d$.
10/24
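The algorithm above translates almost line for line into NumPy. This is a minimal sketch: the function name `backtracking_gd`, the default values of `alpha`, `beta`, `t0`, `delta`, and the cap `max_iter` are choices made for illustration (the defaults fall inside the [BV] ranges quoted above).

```python
import numpy as np

def backtracking_gd(f, grad, x0, alpha=0.3, beta=0.5, t0=1.0,
                    delta=1e-8, max_iter=10_000):
    """Gradient descent with backtracking line search (Armijo's rule)."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        g_sq = g @ g                         # ||grad f(x)||_2^2
        if np.sqrt(g_sq) <= delta:           # outer stopping rule
            return x, k
        t, fx = t0, f(x)
        # Shrink t geometrically until the sufficient-decrease test passes.
        while f(x - t * g) > fx - alpha * t * g_sq:
            t *= beta
        x = x - t * g
    return x, max_iter

# Well-conditioned example: Q = diag{0.5, 1}, x0 = (2, 1).
Q = np.diag([0.5, 1.0])

def f(x):
    return 0.5 * x @ Q @ x

def grad(x):
    return Q @ x

x_min, iters = backtracking_gd(f, grad, [2.0, 1.0])
print(x_min, iters)
```

Each outer iteration costs one gradient evaluation plus one function evaluation per trial step, which is the trade-off against the inner minimization of exact line search.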

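Backtracking Line Search (cont'd)
[Figure: the curve $f(x_k + t d_k)$ against $t$, together with the level $f(x_k)$, the tangent line $f(x_k) + t \nabla f(x_k)^T d_k$, and the relaxed line $f(x_k) + \alpha t \nabla f(x_k)^T d_k$; trial step sizes $t_0$, $t_1 = \beta t_0$, $t_2 = \beta^2 t_0$ are marked on the $t$-axis.]
• $\nabla f(x_k)^T d_k < 0$ for a descent direction $d_k$
• start from some "large" step size $t_0$ ([BV] uses $t_0 = 1$)
• reduce the step size geometrically until the decrease is "large enough":
$$\underbrace{f(x_k) - f(x_k + t d_k)}_{\text{actual decrease in function value}} \ge \alpha \times \underbrace{t \, |\nabla f(x_k)^T d_k|}_{\text{decrease along tangent line}}$$
11/24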

Example
$f(x_1, x_2) = \frac{1}{2} x^T Q x = \frac{\gamma}{2} x_1^2 + \frac{1}{2} x_2^2$, $Q = \mathrm{diag}\{\gamma, 1\}$
Well-conditioned. $\gamma = 0.5$, $x_0 = (2, 1)^T$
[Figure: left, iterates on the level curves in the $(x_1, x_2)$ plane; right, error $f(x_k) - f(x^*)$ against iteration $k$ on a log scale.]
Fast convergence.
12/24

Example (cont'd)
$f(x_1, x_2) = \frac{1}{2} x^T Q x = \frac{\gamma}{2} x_1^2 + \frac{1}{2} x_2^2$, $Q = \mathrm{diag}\{\gamma, 1\}$
Ill-conditioned. $\gamma = 0.01$
[Figure: two panels, each showing the iterates in the $(x_1, x_2)$ plane and the error $f(x_k) - f(x^*)$ against iteration $k$ on a log scale.]
• $x_0 = (2, 0.3)$: slow convergence
• $x_0 = (2, 0.02)$: slow convergence
13/24

Convergence Analysis
Theorem. If $f$ is $m$-strongly convex and $L$-smooth, and $x^*$ is a minimum of $f$, then the sequence $\{x_k\}$ produced by gradient descent with backtracking line search satisfies
$$f(x_k) - f(x^*) \le c^k [f(x_0) - f(x^*)]$$
where
$$c = 1 - \min\left\{2 m \alpha t_0, \ \frac{4 m \beta \alpha (1 - \alpha)}{L}\right\}$$
Notes.
• $c \in (0, 1)$, as $\dfrac{4 m \beta \alpha (1 - \alpha)}{L} \le \dfrac{\beta m}{L} \le \beta < 1$, so $x_k \to x^*$ and $f(x_k) \to f(x^*)$ exponentially fast.
• The number of iterations to reach $f(x_k) - f(x^*) \le \epsilon$ is $O(\log \frac{1}{\epsilon})$. For $\epsilon = 10^{-p}$, $k = O(p)$, linear in the number of significant digits.
14/24

Proof
The inner loop terminates with a step size bounded from below.
1. By the quadratic upper bound for $L$-smooth functions,
$$f(x_k - t \nabla f(x_k)) \le f(x_k) - t \left(1 - \frac{L t}{2}\right) \|\nabla f(x_k)\|^2$$
2. The inner loop terminates for sure if
$$-t \left(1 - \frac{L t}{2}\right) \|\nabla f(x_k)\|^2 \le -\alpha t \|\nabla f(x_k)\|^2,$$
which for $t > 0$ and $\nabla f(x_k) \ne 0$ holds exactly when $t \le \dfrac{2(1 - \alpha)}{L}$.
3. The step size in backtracking line search satisfies
$$t_k \ge \eta \triangleq \min\left\{t_0, \ \frac{2 \beta (1 - \alpha)}{L}\right\}$$
    ▸ $t_k = t_0$ if Armijo's condition is satisfied by $t_0$
    ▸ otherwise, $\dfrac{t_k}{\beta} > \dfrac{2(1 - \alpha)}{L}$, since the inner loop did not terminate at $\dfrac{t_k}{\beta}$
15/24

Proof (cont'd)
Now we look at the outer loop.
4. By Armijo's condition in the inner loop,
$$f(x_{k+1}) = f(x_k - t_k \nabla f(x_k)) \le f(x_k) - \alpha t_k \|\nabla f(x_k)\|^2$$
5. By 3 and 4,
$$f(x_{k+1}) - f(x^*) \le f(x_k) - f(x^*) - \alpha \eta \|\nabla f(x_k)\|^2$$
6. By step 4 of slide 9,
$$\|\nabla f(x_k)\|^2 \ge 2m [f(x_k) - f(x^*)]$$
7. By 5 and 6,
$$f(x_{k+1}) - f(x^*) \le (1 - 2 m \alpha \eta) [f(x_k) - f(x^*)] = c [f(x_k) - f(x^*)]$$
so $f(x_k) - f(x^*) \le c^k [f(x_0) - f(x^*)]$.
16/24
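The per-step contraction in step 7 can be checked numerically on the ill-conditioned quadratic from the examples, where $m = \gamma$, $L = 1$ and $f(x^*) = 0$. The sketch below is an illustration under assumed parameters ($\alpha = 0.3$, $\beta = 0.5$, $t_0 = 1$, 50 iterations); the helper name `contraction_factor` and the small floating-point slack are choices made here.

```python
import numpy as np

def contraction_factor(m, L, alpha, beta, t0):
    """c = 1 - min{2 m alpha t0, 4 m beta alpha (1 - alpha) / L}."""
    return 1.0 - min(2.0 * m * alpha * t0,
                     4.0 * m * beta * alpha * (1.0 - alpha) / L)

gamma, alpha, beta, t0 = 0.01, 0.3, 0.5, 1.0
Q = np.diag([gamma, 1.0])
m, L = gamma, 1.0                  # extreme eigenvalues of the Hessian Q
c = contraction_factor(m, L, alpha, beta, t0)

def f(x):
    return 0.5 * x @ Q @ x

x = np.array([2.0, 0.02])
errors = [f(x)]                    # f(x*) = 0, so f(x_k) is the error itself
for _ in range(50):
    g = Q @ x
    t = t0
    while f(x - t * g) > f(x) - alpha * t * (g @ g):   # Armijo test
        t *= beta
    x = x - t * g
    errors.append(f(x))

# Check f(x_{k+1}) - f(x*) <= c [f(x_k) - f(x*)] at every step.
bound_holds = all(e_next <= c * e + 1e-15
                  for e, e_next in zip(errors, errors[1:]))
print(c, bound_holds)
```

The bound is a worst-case guarantee, so the observed per-step decrease is typically better than the factor `c`.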

Better Descent Direction
Gradient descent uses first-order information (i.e. the gradient),
$$x_{k+1} = x_k - t_k \nabla f(x_k)$$
Locally $-\nabla f(x_k)$ is the max-rate descending direction, but globally it may not be the "right" direction.
Example. For $f(x) = \frac{1}{2} x^T Q x$ with $Q = \mathrm{diag}\{0.01, 1\}$, the optimum is $x^* = 0$. The negative gradient is
$$-\nabla f(x) = -Qx = -(0.01 x_1, x_2)^T,$$
quite different from the "right" descent direction $d = -x$. Note
$$d = -Q^{-1} \nabla f(x) = -[\nabla^2 f(x)]^{-1} \nabla f(x)$$
With second-order information (i.e. the Hessian), we hope to do better.
17/24
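A small numerical illustration of the example above, comparing the negative gradient with the Hessian-scaled direction $d = -Q^{-1} \nabla f(x)$. The test point $x = (2, 0.3)$ is borrowed from the earlier examples as an assumption, and the helper `cosine` is introduced here only to measure how far apart the two directions are.

```python
import numpy as np

Q = np.diag([0.01, 1.0])
x = np.array([2.0, 0.3])           # assumed test point, reused from slide 7

grad = Q @ x                        # gradient of f(x) = 0.5 x^T Q x
neg_grad = -grad                    # steepest-descent direction
# Solve Q d = -grad instead of forming Q^{-1} explicitly.
newton_dir = -np.linalg.solve(Q, grad)

def cosine(u, v):
    """Cosine of the angle between two nonzero vectors."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

cos_angle = cosine(neg_grad, -x)    # alignment of -grad with the direction to x* = 0
print(neg_grad, newton_dir, cos_angle)
```

Here `newton_dir` equals $-x$, so a unit step along it lands on $x^* = 0$, while the negative gradient points mostly along the $x_2$ axis.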
