1. Local Function Optimization
COMPSCI 371D Machine Learning

2. Outline
1 Gradient, Hessian, and Convexity
2 A Local, Unconstrained Optimization Template
3 Steepest Descent
4 Termination
5 Convergence Speed of Steepest Descent
6 Convergence Speed of Newton's Method
7 Newton's Method
8 Counting Steps versus Clocking

3. Motivation and Scope
• Parametric predictor: h(x; v) : R^d × R^m → Y
• As a predictor: h(x; v) = h_v(x) : R^d → Y
• Risk: L_T(v) = (1/N) Σ_{n=1}^{N} ℓ(y_n, h(x_n; v)) : R^m → R
• For risk minimization, h(x_n; v) = h_{x_n}(v) : R^m → Y
• Training a parametric predictor with m real parameters is function optimization: v̂ ∈ arg min_{v ∈ R^m} L_T(v)
• Some v may be subject to constraints. We ignore those ML problems for now.
• Other v may be integer-valued. We ignore those, too (combinatorial optimization).
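To make the risk formula concrete, here is a minimal Python sketch of evaluating L_T(v); the particular predictor h, the squared loss, and the random data are placeholder assumptions, not anything prescribed by the slides.

```python
import numpy as np

def training_risk(v, X, y, h, loss):
    """Empirical risk L_T(v) = (1/N) sum_n loss(y_n, h(x_n; v))."""
    N = len(X)
    return sum(loss(y[n], h(X[n], v)) for n in range(N)) / N

# Placeholder choices: a linear predictor with squared loss on synthetic data
h = lambda x, v: v[0] + v[1:] @ x          # h(x; v) = b + w^T x
loss = lambda y, yhat: (y - yhat) ** 2
X = np.random.randn(100, 3)                # N = 100 points in R^3
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3
print(training_risk(np.zeros(4), X, y, h, loss))
```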

4. Example
• A binary linear classifier has decision boundary b + w^T x = 0
• So v = [b; w] ∈ R^{d+1}
• m = d + 1
• Counterexample: Can you think of an ML method that does not involve v ∈ R^m?
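A small sketch of how the parameter vector v of this example might be packed and used in code; the specific numbers and the sign convention for the two labels are assumptions for illustration.

```python
import numpy as np

d = 3
w = np.array([0.5, -1.0, 2.0])    # weights, shape (d,)
b = 0.25                          # offset
v = np.concatenate(([b], w))      # v = [b; w] in R^(d+1), so m = d + 1

def classify(x, v):
    # Decision boundary b + w^T x = 0; the label is the side of the boundary
    b, w = v[0], v[1:]
    return 1 if b + w @ x >= 0 else -1

print(classify(np.array([1.0, 0.0, 0.0]), v))
```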

5. Warning: Change of Notation
• Optimization is used for much more than ML
• Even in ML, there is more to optimize than risks
• So we use "generic notation" for optimization
• Function to be minimized: f(u) : R^m → R
• More in keeping with the literature... except that we use u instead of x (too loaded for us!)
• Minimizing f(u) is the same as maximizing −f(u)

6. Only Local Minimization
• All we know about f is a "black box" (think Python function)
• For many problems, f has many local minima
• Start somewhere (u_0), and take steps "down": f(u_{k+1}) < f(u_k)
• When we get stuck at a local minimum, we declare success
• We would like global minima, but all we get is local ones
• For some problems, f has a unique minimum...
• ... or at least a single connected set of minima

7. Gradient, Hessian, and Convexity: Gradient
∇f(u) = ∂f/∂u = [∂f/∂u_1, …, ∂f/∂u_m]^T
• ∇f(u) is the direction of fastest growth of f at u
• If ∇f(u) exists everywhere, the condition ∇f(u) = 0 is necessary and sufficient for a stationary point (max, min, or saddle)
• Warning: only necessary for a minimum!
• Reduces to the first derivative df/du for f : R → R
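Since f is treated as a black box, a gradient can be approximated and sanity-checked numerically; a minimal sketch with central finite differences on an arbitrary test function (not from the slides):

```python
import numpy as np

def numerical_gradient(f, u, eps=1e-6):
    """Approximate the gradient of f at u by central differences."""
    g = np.zeros_like(u, dtype=float)
    for i in range(u.size):
        e = np.zeros_like(u, dtype=float)
        e[i] = eps
        g[i] = (f(u + e) - f(u - e)) / (2 * eps)
    return g

f = lambda u: u[0] ** 2 + 3 * u[0] * u[1] + np.sin(u[1])
u0 = np.array([1.0, 2.0])
print(numerical_gradient(f, u0))   # close to [2*1 + 3*2, 3*1 + cos(2)]
```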

8. Gradient, Hessian, and Convexity: First-Order Taylor Expansion
f(u) ≈ g_1(u) = f(u_0) + [∇f(u_0)]^T (u − u_0)
approximates f(u) near u_0 with a (hyper)plane through u_0
[Figure: graph of f(u) over the (u_1, u_2) plane and its tangent plane at u_0]
• ∇f(u_0) points in the direction of steepest increase of f at u_0
• If we want to find u_1 where f(u_1) < f(u_0), going along −∇f(u_0) seems promising
• This is the general idea of steepest descent
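A quick numerical illustration of the first-order model g_1: as u approaches u_0, the gap between f and its tangent-plane approximation shrinks roughly quadratically in the step size. The test function and its analytic gradient are assumptions for illustration.

```python
import numpy as np

f      = lambda u: np.exp(u[0]) + u[1] ** 2
grad_f = lambda u: np.array([np.exp(u[0]), 2 * u[1]])   # analytic gradient of f

u0 = np.array([0.0, 1.0])
g1 = lambda u: f(u0) + grad_f(u0) @ (u - u0)             # first-order Taylor model at u0

for t in [1.0, 0.1, 0.01]:
    u = u0 + t * np.array([1.0, -1.0])
    print(t, abs(f(u) - g1(u)))      # the gap shrinks roughly like t**2
```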

9. Gradient, Hessian, and Convexity: Hessian
H(u) = ∂²f/∂u², the m × m matrix with entries H_ij = ∂²f / (∂u_i ∂u_j)
• Symmetric matrix because of Schwarz's theorem: ∂²f/(∂u_i ∂u_j) = ∂²f/(∂u_j ∂u_i)
• Eigenvalues are real because of symmetry
• Reduces to d²f/du² for f : R → R

10. Gradient, Hessian, and Convexity: Convexity
[Figure: the chord from (u, f(u)) to (u', f(u')) lies above the graph of f, so f(z u + (1−z) u') ≤ z f(u) + (1−z) f(u')]
• Convex everywhere: for all u, u' in the (open) domain of f and for all z ∈ [0, 1],
  f(z u + (1−z) u') ≤ z f(u) + (1−z) f(u')
• Convex at u_0: the function f is convex everywhere in some open neighborhood of u_0
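A tiny numerical spot check of the convexity inequality on random points and mixing weights; the test function ‖u‖² is an assumed convex example.

```python
import numpy as np

f = lambda u: np.sum(u ** 2)          # an assumed convex test function

rng = np.random.default_rng(0)
ok = True
for _ in range(1000):
    u, u2 = rng.normal(size=3), rng.normal(size=3)
    z = rng.uniform()
    # Convexity: f(z u + (1-z) u') <= z f(u) + (1-z) f(u')
    ok &= f(z * u + (1 - z) * u2) <= z * f(u) + (1 - z) * f(u2) + 1e-12
print("convexity inequality held on all samples:", bool(ok))
```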

11. Gradient, Hessian, and Convexity: Convexity and Hessian
• If H(u) is defined at a stationary point u of f, then H(u) ⪰ 0 is necessary for u to be a minimum; H ⪰ 0 in a neighborhood of u (local convexity) is sufficient
• "⪰" means positive semidefinite: u^T H u ≥ 0 for all u ∈ R^m (this is the definition of H(u) ⪰ 0)
• To check computationally: all eigenvalues are nonnegative
• H(u) ⪰ 0 reduces to d²f/du² ≥ 0 for f : R → R
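A sketch of the computational check above: estimate the Hessian by finite differences (symmetric, per Schwarz's theorem) and verify that its eigenvalues are nonnegative. The test function is an assumed convex quadratic.

```python
import numpy as np

def numerical_hessian(f, u, eps=1e-4):
    """Finite-difference Hessian of f at u."""
    m = u.size
    H = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            ei = np.zeros(m); ei[i] = eps
            ej = np.zeros(m); ej[j] = eps
            H[i, j] = (f(u + ei + ej) - f(u + ei - ej)
                       - f(u - ei + ej) + f(u - ei - ej)) / (4 * eps ** 2)
    return H

f = lambda u: u[0] ** 2 + u[0] * u[1] + 2 * u[1] ** 2    # assumed convex quadratic
H = numerical_hessian(f, np.array([0.3, -0.7]))
eigenvalues = np.linalg.eigvalsh(H)                       # real, since H is symmetric
print(eigenvalues, "positive semidefinite:", bool(np.all(eigenvalues >= -1e-6)))
```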

12. Gradient, Hessian, and Convexity: Second-Order Taylor Expansion
f(u) ≈ g_2(u) = f(u_0) + [∇f(u_0)]^T (u − u_0) + (1/2) (u − u_0)^T H(u_0) (u − u_0)
approximates f(u) near u_0 with a quadratic function through u_0
• For minimization, this is useful only when H(u) ⪰ 0
• The function then looks locally like a bowl
[Figure: graph of f(u) over the (u_1, u_2) plane, the bowl-shaped quadratic approximation at u_0, and its minimum u_1]
• If we want to find u_1 where f(u_1) < f(u_0), going to the bottom of the bowl seems promising
• This is the general idea of Newton's method
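A sketch of "going to the bottom of the bowl": minimize the quadratic model g_2 at u_0, which (when H(u_0) is positive definite) gives the step u_1 = u_0 − H(u_0)^{-1} ∇f(u_0). The test function and its derivatives are assumptions for illustration.

```python
import numpy as np

# Assumed smooth test function with analytic gradient and Hessian
f      = lambda u: np.exp(u[0]) + u[0] * u[1] + u[1] ** 2
grad_f = lambda u: np.array([np.exp(u[0]) + u[1], u[0] + 2 * u[1]])
hess_f = lambda u: np.array([[np.exp(u[0]), 1.0], [1.0, 2.0]])

u0 = np.array([0.5, 0.5])
g, H = grad_f(u0), hess_f(u0)

# Minimizer of g2(u) = f(u0) + g^T (u - u0) + 1/2 (u - u0)^T H (u - u0),
# a sensible step only when H is positive definite (the "bowl" case)
u1 = u0 - np.linalg.solve(H, g)
print(f(u0), f(u1))    # f decreases here because the quadratic model is good
```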

13. A Local, Unconstrained Optimization Template: A Template
• Regardless of method, most local unconstrained optimization methods fit the following template, given a starting point u_0:

  k = 0
  while u_k is not a minimum
      compute step direction p_k
      compute step-size multiplier α_k > 0
      u_{k+1} = u_k + α_k p_k
      k = k + 1
  end
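The template translates almost line for line into Python; in this sketch, step_direction and step_size are the plug-in design decisions discussed on the next slide, and the stopping test ‖u_{k+1} − u_k‖ < δ anticipates the Termination slide. All names here are placeholders, not from the slides.

```python
import numpy as np

def minimize(f, grad_f, u0, step_direction, step_size, delta=1e-6, max_iters=10_000):
    """Generic template for local, unconstrained minimization."""
    u = np.asarray(u0, dtype=float)
    for k in range(max_iters):
        p = step_direction(f, grad_f, u)       # e.g. -grad_f(u) for steepest descent
        alpha = step_size(f, u, p)             # e.g. found by line search
        u_new = u + alpha * p
        if np.linalg.norm(u_new - u) < delta:  # stand-in for "while u_k is not a minimum"
            return u_new
        u = u_new
    return u
```

Steepest descent and Newton's method differ only in the step_direction and step_size functions plugged into this template.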

14. A Local, Unconstrained Optimization Template: Design Decisions
• Whether to stop ("while u_k is not a minimum")
• In what direction to proceed (p_k)
• How long a step to take in that direction (α_k)
• Different decisions for the last two lead to different methods with very different behaviors and computational costs

15. Steepest Descent: Follow the Gradient
• In what direction to proceed: p_k = −∇f(u_k)
• "Steepest descent" or "gradient descent"
• The problem reduces to one dimension: h(α) = f(u_k + α p_k), with α = 0 ⇒ u = u_k (see the sketch below)
• Find α = α_k > 0 such that f(u_k + α_k p_k) is a local minimum along the line
• Line search (search along a line)
• Q1: How to find α_k?
• Q2: Is this a good strategy?
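The one-dimensional restriction h(α) is just a function of the single scalar α, built from the current iterate and direction; a minimal self-contained sketch (the objective and the point are assumed examples):

```python
import numpy as np

f   = lambda u: np.sum(u ** 2)      # assumed objective
u_k = np.array([1.0, 2.0])          # current iterate
p_k = -2.0 * u_k                    # steepest-descent direction, -grad f(u_k)

h = lambda alpha: f(u_k + alpha * p_k)   # one-dimensional restriction; h(0) = f(u_k)
print(h(0.0), h(0.5))                    # 5.0 and 0.0: here alpha = 0.5 reaches the minimum
```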

16. Steepest Descent: Line Search
• Bracketing triple: a < b < c with h(a) ≥ h(b) and h(b) ≤ h(c)
• Such a triple contains a (local) minimum!
• Split the bigger of [a, b] and [b, c] in half with a point z
• Find a new, narrower bracketing triple involving z and two out of a, b, c
• Stop when the bracket is narrow enough (say, 10^{-6})
• This pins down a minimum to within 10^{-6}

17. Steepest Descent: Phase 1, Find a Bracketing Triple
[Figure: plot of h(α) versus α]

18. Steepest Descent: Phase 2, Shrink the Bracketing Triple
[Figure: plot of h(α) versus α]

19. Steepest Descent

  if b − a > c − b
      z = (a + b)/2
      if h(z) > h(b)
          (a, b, c) = (z, b, c)
      otherwise
          (a, b, c) = (a, z, b)
      end
  otherwise
      z = (b + c)/2
      if h(z) > h(b)
          (a, b, c) = (a, b, z)
      otherwise
          (a, b, c) = (b, z, c)
      end
  end
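A Python sketch of the complete line search: Phase 1 finds a bracketing triple (the doubling strategy used here is an assumption, since the slides do not specify one), and Phase 2 applies the shrinking rule above.

```python
def bracket(h, step=1e-3, alpha_max=1e6):
    """Phase 1: find a < b < c with h(a) >= h(b) and h(b) <= h(c)."""
    a, b = 0.0, step
    while h(b) > h(a):                         # initial step overshoots: shrink it
        b /= 2
        if b < 1e-12:
            raise ValueError("h increases immediately; no descent along this line")
    c = 2 * b
    while h(c) < h(b) and c < alpha_max:       # march forward until h turns up again
        a, b, c = b, c, 2 * c
    return a, b, c

def shrink(h, a, b, c, tol=1e-6):
    """Phase 2: shrink the bracketing triple until it is narrower than tol."""
    while c - a > tol:
        if b - a > c - b:                      # split the bigger of [a, b] and [b, c]
            z = (a + b) / 2
            a, b, c = (z, b, c) if h(z) > h(b) else (a, z, b)
        else:
            z = (b + c) / 2
            a, b, c = (a, b, z) if h(z) > h(b) else (b, z, c)
    return b

# Example on an assumed one-dimensional function with minimum at alpha = 1.5
h = lambda alpha: (alpha - 1.5) ** 2 + 0.2
print(shrink(h, *bracket(h)))                  # approximately 1.5
```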

20. Termination
• Are we still making "significant progress"?
• Check f(u_{k−1}) − f(u_k)? (We want this to be strictly positive)
• Check ‖u_{k−1} − u_k‖? (We want this to be large enough)
• The second check is more stringent close to the minimum, because there ∇f(u) ≈ 0
• Stop when ‖u_{k−1} − u_k‖ < δ
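Both progress checks in code form, as a minimal sketch; the thresholds δ and ε are assumed values, and the slides ultimately keep only the step-size test.

```python
import numpy as np

def still_progressing(f, u_prev, u, delta=1e-6, eps=1e-12):
    value_drop = f(u_prev) - f(u)               # want this strictly positive
    step_norm  = np.linalg.norm(u_prev - u)     # more stringent near the minimum
    return value_drop > eps and step_norm > delta
```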

21. Termination: Is Steepest Descent a Good Strategy?
• "We are going in the direction of fastest descent"
• "We choose an optimal step by line search"
• "Must be good, no?" Not so fast! (Pun intended)
• An example for which we know the answer:
  f(u) = c + a^T u + (1/2) u^T Q u with Q ≻ 0 (positive definite): a convex paraboloid
• All smooth functions look like this close enough to u*
[Figure: elliptical isocontours of f around the minimum u*]

22. Termination: Skating to a Minimum
[Figure: steepest-descent path from u_0, with first direction p_0, toward the minimum u*]
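A small experiment illustrating this skating behavior on the paraboloid of the previous slide: exact-line-search steepest descent on an elongated quadratic converges, but only by a fixed factor per step, zig-zagging across the isocontours. The matrix Q, the starting point, and the closed-form line-search step for quadratics are assumptions for illustration.

```python
import numpy as np

Q = np.diag([1.0, 25.0])               # positive definite, elongated isocontours
f      = lambda u: 0.5 * u @ Q @ u     # paraboloid with c = 0, a = 0, minimum u* = 0
grad_f = lambda u: Q @ u

u = np.array([25.0, 1.0])
for k in range(10):
    p = -grad_f(u)                      # steepest-descent direction
    alpha = (p @ p) / (p @ Q @ p)       # exact line search for a quadratic
    u = u + alpha * p
    print(k, np.linalg.norm(u))         # the error shrinks by the same factor every step
```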

23. Termination: How to Measure Convergence Speed
• Asymptotics (k → ∞) are what matters
• If u* is the true solution, how does ‖u_{k+1} − u*‖ compare with ‖u_k − u*‖ for large k?
• Which converges faster: ‖u_{k+1} − u*‖ ≈ β ‖u_k − u*‖^1 or ‖u_{k+1} − u*‖ ≈ β ‖u_k − u*‖^2?
• Close to convergence these distances are small numbers, so ‖u_k − u*‖^2 ≪ ‖u_k − u*‖^1 [Example: (0.001)^2 ≪ (0.001)^1]
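A back-of-the-envelope comparison of the two rates (β = 0.5 and the starting error are arbitrary assumptions): with exponent 1 the error gains a constant factor per step, while with exponent 2 the number of correct digits roughly doubles per step.

```python
beta, e_lin, e_quad = 0.5, 0.1, 0.1
for k in range(6):
    e_lin, e_quad = beta * e_lin, beta * e_quad ** 2   # linear vs quadratic error recursions
    print(k + 1, e_lin, e_quad)
```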
