SLIDE 1

Local Function Optimization

COMPSCI 371D — Machine Learning


SLIDE 2

Outline

1. Gradient, Hessian, and Convexity
2. A Local, Unconstrained Optimization Template
3. Steepest Descent
4. Termination
5. Convergence Speed of Steepest Descent
6. Convergence Speed of Newton’s Method
7. Newton’s Method
8. Counting Steps versus Clocking


SLIDE 3

Motivation and Scope

  • Parametric predictor: h(x ; v) : R^d × R^m → Y
  • As a predictor: h(x ; v) = h_v(x) : R^d → Y
  • Risk: LT(v) = (1/N) Σ_{n=1..N} ℓ(yn, h(xn ; v)) : R^m → R
  • For risk minimization, h(xn ; v) = h_{xn}(v) : R^m → Y
  • Training a parametric predictor with m real parameters is function optimization: v̂ ∈ arg min_{v ∈ R^m} LT(v)
  • Some v may be subject to constraints. We ignore those ML problems for now.
  • Other v may be integer-valued. We ignore those, too (combinatorial optimization).
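To make this concrete, here is a minimal Python sketch of LT(v) as a plain function of v; the predictor h, the loss, and the tiny dataset below are made up for illustration. This function of v is exactly the kind of black box that the rest of the deck calls f:

```python
import numpy as np

def empirical_risk(loss, h, v, X, y):
    """L_T(v) = (1/N) * sum over n of loss(y_n, h(x_n ; v))."""
    return np.mean([loss(yn, h(xn, v)) for xn, yn in zip(X, y)])

# Made-up ingredients: affine predictor h(x ; v) with v = (b, w), squared loss.
h = lambda x, v: v[0] + v[1:] @ x
loss = lambda y, yhat: (y - yhat) ** 2

X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 1.0, 2.0])
print(empirical_risk(loss, h, np.array([0.0, 1.0]), X, y))  # 0.0: a perfect fit
```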


SLIDE 4

Example

  • A binary linear classifier has decision boundary b + wᵀx = 0
  • So v = (b, w) ∈ R^(d+1), so that m = d + 1
  • Counterexample: Can you think of an ML method that does not involve v ∈ R^m?


SLIDE 5

Warning: Change of Notation

  • Optimization is used for much more than ML
  • Even in ML, there are more than risks to optimize
  • So we use “generic notation” for optimization
  • Function to be minimized: f(u) : R^m → R
  • More in keeping with the literature... except that we use u instead of x (too loaded for us!)
  • Minimizing f(u) is the same as maximizing −f(u)


SLIDE 6

Only Local Minimization

  • All we know about f is a “black box” (think Python function)
  • For many problems, f has many local minima
  • Start somewhere (u0), and take steps “down”: f(uk+1) < f(uk)

  • When we get stuck at a local minimum, we declare success
  • We would like global minima, but all we get is local ones
  • For some problems, f has a unique minimum...
  • ... or at least a single connected set of minima


SLIDE 7

Gradient, Hessian, and Convexity

Gradient

∇f(u) = ∂f/∂u = [∂f/∂u1 ... ∂f/∂um]ᵀ

  • ∇f(u) is the direction of fastest growth of f at u
  • If ∇f(u) exists everywhere, the condition ∇f(u) = 0 is necessary and sufficient for a stationary point (max, min, or saddle)
  • Warning: it is only necessary for a minimum!
  • Reduces to the first derivative for f : R → R


SLIDE 8

Gradient, Hessian, and Convexity

First Order Taylor Expansion

f(u) ≈ g1(u) = f(u0) + [∇f(u0)]ᵀ(u − u0)

approximates f(u) near u0 with a (hyper)plane through u0.

[Figure: surface plot of f(u) over (u1, u2); ∇f(u0) points in the direction of steepest increase of f at u0]

  • If we want to find u1 where f(u1) < f(u0), going along −∇f(u0) seems promising
  • This is the general idea of steepest descent


SLIDE 9

Gradient, Hessian, and Convexity

Hessian

H(u) = [ ∂²f/∂u1²    ···  ∂²f/∂u1∂um ]
       [    ⋮         ⋱       ⋮      ]
       [ ∂²f/∂um∂u1  ···  ∂²f/∂um²   ]

  • Symmetric matrix, because of Schwarz’s theorem: ∂²f/∂ui∂uj = ∂²f/∂uj∂ui
  • Eigenvalues are real because of symmetry
  • Reduces to d²f/du² for f : R → R
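To make the definitions concrete, here is a small numpy sketch (an addition, not from the slides) that estimates ∇f(u) and H(u) by central differences and checks them against a hand-computed example:

```python
import numpy as np

def numerical_gradient(f, u, h=1e-6):
    """Central-difference estimate of ∇f(u), one coordinate at a time."""
    g = np.zeros_like(u)
    for i in range(u.size):
        e = np.zeros_like(u); e[i] = h
        g[i] = (f(u + e) - f(u - e)) / (2 * h)
    return g

def numerical_hessian(f, u, h=1e-4):
    """Central-difference estimate of the m x m Hessian H(u)."""
    m = u.size
    H = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            ei = np.zeros(m); ei[i] = h
            ej = np.zeros(m); ej[j] = h
            H[i, j] = (f(u + ei + ej) - f(u + ei - ej)
                       - f(u - ei + ej) + f(u - ei - ej)) / (4 * h * h)
    return H

# Example: f(u) = u1^2 + 3 u1 u2 has gradient (2 u1 + 3 u2, 3 u1)
# and constant, symmetric Hessian [[2, 3], [3, 0]].
f = lambda u: u[0] ** 2 + 3 * u[0] * u[1]
print(numerical_gradient(f, np.array([1.0, 2.0])))               # ≈ [8., 3.]
H = numerical_hessian(f, np.array([1.0, 2.0]))
print(np.allclose(H, [[2.0, 3.0], [3.0, 0.0]], atol=1e-4))       # True
```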


SLIDE 10

Gradient, Hessian, and Convexity

Convexity

[Figure: graph of a convex function; for points u, u′ and z ∈ [0, 1], the chord value z f(u) + (1 − z) f(u′) lies above the function value f(z u + (1 − z) u′)]

  • Convex everywhere: for all u, u′ in the (open) domain of f and for all z ∈ [0, 1],
    f(z u + (1 − z) u′) ≤ z f(u) + (1 − z) f(u′)
  • Convex at u0: the function f is convex everywhere in some open neighborhood of u0


SLIDE 11

Gradient, Hessian, and Convexity

Convexity and Hessian

  • If H(u) is defined at a stationary point u of f, then u is a minimum only if H(u) ⪰ 0 (and is guaranteed to be one if H(u) ≻ 0)
  • “H(u) ⪰ 0” means positive semidefinite: uᵀHu ≥ 0 for all u ∈ R^m (this is the definition of H(u) ⪰ 0)
  • To check computationally: all eigenvalues of H(u) are nonnegative
  • H(u) ⪰ 0 reduces to d²f/du² ≥ 0 for f : R → R
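The eigenvalue check, as a minimal numpy sketch (the tolerance is an assumption, there to absorb round-off):

```python
import numpy as np

def is_positive_semidefinite(H, tol=1e-10):
    """H ⪰ 0 iff all eigenvalues of the symmetric matrix H are nonnegative."""
    return bool(np.all(np.linalg.eigvalsh(H) >= -tol))  # eigvalsh: symmetric H

print(is_positive_semidefinite(np.array([[2.0, 1.0], [1.0, 2.0]])))   # True
print(is_positive_semidefinite(np.array([[1.0, 0.0], [0.0, -1.0]])))  # False
```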


SLIDE 12

Gradient, Hessian, and Convexity

Second Order Taylor Expansion

f(u) ≈ g2(u) = f(u0) + [∇f(u0)]ᵀ(u − u0) + ½ (u − u0)ᵀH(u0)(u − u0)

approximates f(u) near u0 with a quadratic function through u0.

  • For minimization, this is useful only when H(u0) ⪰ 0
  • Function looks locally like a bowl

[Figure: bowl-shaped surface g2(u) over (u1, u2)]

  • If we want to find u1 where f(u1) < f(u0), going to the bottom of the bowl seems promising
  • This is the general idea of Newton’s method


SLIDE 13

A Local, Unconstrained Optimization Template

A Template

  • Regardless of method, most local unconstrained optimization methods fit the following template, given a starting point u0:

    k = 0
    while uk is not a minimum
        compute step direction pk
        compute step-size multiplier αk > 0
        uk+1 = uk + αk pk
        k = k + 1
    end
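As a Python sketch, the template looks as follows; direction_fn and step_size_fn are hypothetical plug-ins (each specific method supplies its own), and the stopping test is the step-size criterion of SLIDE 20:

```python
import numpy as np

def local_minimize(f, u0, direction_fn, step_size_fn, delta=1e-6, max_iters=10000):
    """Generic template: u_{k+1} = u_k + alpha_k * p_k until steps become tiny.

    direction_fn(u) returns the step direction p_k;
    step_size_fn(f, u, p) returns the multiplier alpha_k > 0.
    """
    u = np.asarray(u0, dtype=float)
    for k in range(max_iters):
        p = direction_fn(u)                     # compute step direction p_k
        alpha = step_size_fn(f, u, p)           # compute alpha_k > 0
        u_next = u + alpha * p                  # u_{k+1} = u_k + alpha_k p_k
        if np.linalg.norm(u_next - u) < delta:  # "uk is not a minimum" test
            return u_next                       # (termination, SLIDE 20)
        u = u_next
    return u
```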


SLIDE 14

A Local, Unconstrained Optimization Template

Design Decisions

  • Whether to stop (“while uk is not a minimum”)
  • In what direction to proceed (pk)
  • How long a step to take in that direction (αk)
  • Different decisions for the last two lead to different methods with very different behaviors and computational costs


SLIDE 15

Steepest Descent

Steepest Descent: Follow the Gradient

  • In what direction to proceed: pk = −∇f(uk)
  • “Steepest descent” or “gradient descent”
  • Problem reduces to one dimension: h(α) = f(uk + αpk)
  • α = 0 ⇒ u = uk
  • Find α = αk > 0 s.t. f(uk + αkpk) is a local minimum along the line

  • Line search (search along a line)
  • Q1: How to find αk?
  • Q2: Is this a good strategy?


SLIDE 16

Steepest Descent

Line Search

  • Bracketing triple: a < b < c with h(a) ≥ h(b) and h(b) ≤ h(c)
  • It contains a (local) minimum!
  • Split the bigger of [a, b] and [b, c] in half with a point z
  • Find a new, narrower bracketing triple involving z and two out of a, b, c
  • Stop when the bracket is narrow enough (say, 10⁻⁶)
  • This pins down a minimum to within 10⁻⁶


SLIDE 17

Steepest Descent

Phase 1: Find a Bracketing Triple

[Figure: h(α) versus α, probing to find a bracketing triple]


SLIDE 18

Steepest Descent

Phase 2: Shrink the Bracketing Triple

[Figure: h(α) versus α, shrinking the bracketing triple]


SLIDE 19

Steepest Descent

if b − a > c − b
    z = (a + b)/2
    if h(z) > h(b)
        (a, b, c) = (z, b, c)
    otherwise
        (a, b, c) = (a, z, b)
    end
otherwise
    z = (b + c)/2
    if h(z) > h(b)
        (a, b, c) = (a, b, z)
    otherwise
        (a, b, c) = (b, z, c)
    end
end
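The same loop as a runnable Python sketch (Phase 1, which produces the initial bracketing triple, is assumed to have been done already):

```python
def line_search(h, a, b, c, tol=1e-6):
    """Shrink a bracketing triple a < b < c with h(a) >= h(b) <= h(c)
    until c - a < tol; return the middle point b as the minimizer."""
    while c - a > tol:
        if b - a > c - b:          # split the bigger of [a, b] and [b, c]
            z = (a + b) / 2
            if h(z) > h(b):
                a = z              # (a, b, c) = (z, b, c)
            else:
                a, b, c = a, z, b
        else:
            z = (b + c) / 2
            if h(z) > h(b):
                c = z              # (a, b, c) = (a, b, z)
            else:
                a, b, c = b, z, c
    return b

# For example, with h(alpha) = (alpha - 1)^2 and the triple (0, 0.5, 2):
print(line_search(lambda alpha: (alpha - 1) ** 2, 0.0, 0.5, 2.0))  # ≈ 1.0
```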


SLIDE 20

Termination

Termination

  • Are we still making “significant progress”?
  • Check f(uk−1) − f(uk)? (We want this to be strictly positive)
  • Check ‖uk−1 − uk‖? (We want this to be large enough)
  • The second check is more stringent close to the minimum, where ∇f(u) ≈ 0 and f barely changes even across sizable steps
  • Stop when ‖uk−1 − uk‖ < δ


SLIDE 21

Termination

Is Steepest Descent a Good Strategy?

  • “We are going in the direction of fastest descent”
  • “We choose an optimal step by line search”
  • “Must be good, no?”

Not so fast! (Pun intended)

  • An example for which we know the answer:
    f(u) = c + aᵀu + ½ uᵀQu with Q ≻ 0 (convex paraboloid)
  • All smooth functions look like this close enough to u∗

[Figure: elliptical isocontours of f around the minimum u∗]
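A small numpy sketch of steepest descent on such a paraboloid (c, a, Q, and u0 below are made up; for a quadratic, the exact line-search step has the closed form αk = gᵀg / (gᵀQg), so no numerical line search is needed):

```python
import numpy as np

# Convex paraboloid f(u) = c + a^T u + 1/2 u^T Q u with ill-conditioned Q ≻ 0.
Q = np.array([[1.0, 0.0], [0.0, 20.0]])
a = np.zeros(2)                       # minimum at u* = -Q^{-1} a = (0, 0)

u = np.array([10.0, 1.0])             # starting point u0
for k in range(100):
    g = a + Q @ u                     # gradient at u_k
    if np.linalg.norm(g) < 1e-12:     # already at the minimum
        break
    alpha = (g @ g) / (g @ Q @ g)     # exact line search along p_k = -g
    u = u - alpha * g                 # zigzags across the elongated isocontours
print(u)                              # ≈ u* = (0, 0)
```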


SLIDE 22

Termination

Skating to a Minimum

[Figure: steepest-descent iterates zigzag from u0, starting along p0, toward the minimum u∗]


SLIDE 23

Termination

How to Measure Convergence Speed

  • Asymptotics (k → ∞) are what matters
  • If u∗ is the true solution, how does ‖uk+1 − u∗‖ compare with ‖uk − u∗‖ for large k?
  • Which converges faster:
    ‖uk+1 − u∗‖ ≈ β ‖uk − u∗‖¹  or  ‖uk+1 − u∗‖ ≈ β ‖uk − u∗‖² ?
  • Close to convergence these distances are small numbers, and
    ‖uk − u∗‖² ≪ ‖uk − u∗‖¹  [Example: (0.001)² ≪ (0.001)¹]


SLIDE 24

Termination

How to Measure Convergence Speed

  • A fast algorithm has a large exponent q in
    ‖uk+1 − u∗‖ ≈ β ‖uk − u∗‖^q when k is large
  • The order of convergence is the largest number q such that
    0 < lim_{k→∞} ‖uk+1 − u∗‖ / ‖uk − u∗‖^q < ∞
  • (The value of the limit, when finite, is β)
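A tiny numeric illustration with made-up error sequences ek = ‖uk − u∗‖, previewing the next two slides: with q = 1 and β = 0.1 the error gains about one digit per step, while with q = 2 and β = 1 the number of correct digits doubles:

```python
e_lin, e_quad = 0.1, 0.1     # start both error sequences at 0.1
for k in range(5):
    print(f"k={k}:  linear {e_lin:.0e}   quadratic {e_quad:.0e}")
    e_lin = 0.1 * e_lin      # q = 1, beta = 0.1: one more correct digit
    e_quad = e_quad ** 2     # q = 2, beta = 1: correct digits double
# By k = 4: linear error 1e-05, quadratic error 1e-16.
```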


SLIDE 25

Convergence Speed of Steepest Descent

Convergence Speed of Steepest Descent

  • Steepest descent has order of convergence q = 1 (“linear convergence rate”):
    0 < lim_{k→∞} ‖uk+1 − u∗‖ / ‖uk − u∗‖ < ∞
  • Hopefully, when q = 1, we have at least β < 1
  • Example: β = 0.1 gives ‖uk+1 − u∗‖ ≈ 0.1 ‖uk − u∗‖
  • Gain one correct decimal digit at every iteration


SLIDE 26

Convergence Speed of Newton’s Method

Newton’s Method

  • Newton’s method has a quadratic convergence rate:
    0 < lim_{k→∞} ‖uk+1 − u∗‖ / ‖uk − u∗‖² < ∞
  • Now things are OK even if β ≥ 1
  • Example: β = 1 gives ‖uk+1 − u∗‖ ≈ ‖uk − u∗‖²
  • Double the number of correct digits at every iteration
  • Very fast!


SLIDE 27

Newton’s Method

Newton’s Method

f(u) ≈ g2(u) = f(uk) + [∇f(uk)]ᵀ(u − uk) + ½ (u − uk)ᵀH(uk)(u − uk)

  • Check that H(uk) ≻ 0 (otherwise, fall back on steepest descent for that step)
  • Let ∆u = u − uk:
    f(u) ≈ g2(u) = f(uk) + [∇f(uk)]ᵀ∆u + ½ (∆u)ᵀH(uk)∆u

[Figure: bowl-shaped quadratic approximation g2(u) to f over (u1, u2)]

  • Solve H(uk)∆u = −∇f(uk) (jump to the bottom of the bowl)
  • Repeat
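One Newton step as a numpy sketch (the Cholesky test for H ≻ 0 and the plain-gradient fallback are assumptions of this sketch; the slides’ fallback would also run a line search on the steepest-descent direction):

```python
import numpy as np

def newton_step(grad, hess, u):
    """One step of Newton's method: solve H(u_k) du = -grad f(u_k)."""
    g, H = grad(u), hess(u)
    try:
        np.linalg.cholesky(H)        # succeeds iff H is positive definite
        du = np.linalg.solve(H, -g)  # jump to the bottom of the local bowl
    except np.linalg.LinAlgError:
        du = -g                      # H not ≻ 0: fall back on steepest descent
    return u + du

# On the quadratic of SLIDE 21, a single Newton step lands exactly on u*.
Q, a = np.array([[1.0, 0.0], [0.0, 20.0]]), np.zeros(2)
u1 = newton_step(lambda u: a + Q @ u, lambda u: Q, np.array([10.0, 1.0]))
print(u1)  # [0. 0.]
```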


SLIDE 28

Counting Steps versus Clocking

Counting Steps versus Clocking

  • Newton’s method finds direction and step at once (∆u)
  • No line search required
  • But we need to evaluate the gradient and the Hessian at every step: m + m(m+1)/2 derivatives!
  • ...and solve an m × m linear system
  • For example, with m = 10⁶ parameters that is already about 5 × 10¹¹ second derivatives per step
  • Asymptotic complexity depends on assumptions on arithmetic (exact, fixed-precision)
  • Practical times are significant, and naive algorithms are O(m³)
  • So, both storage space and computation time become prohibitive as m grows


SLIDE 29

Counting Steps versus Clocking

Bottom Line

  • Newton’s method takes few steps to converge... but each step is expensive
  • Advantageous when m is small
  • For bigger problems, use steepest descent
  • Compromises exist (e.g., conjugate gradients; see the Appendix in the notes)
  • For very big m, even steepest descent is too expensive
  • Stay tuned
