
Optimization CMPUT 296: Basics of Machine Learning Textbook 4.1-4.4 - PowerPoint PPT Presentation



  1. Optimization CMPUT 296: Basics of Machine Learning Textbook §4.1-4.4

  2. Logistics
Reminders:
• Thought Question 1 due TODAY, September 17, by 11:59pm
  • To be handed in via eClass
• Assignment 1 (due Thursday, September 24)
Tutorial:
• Python tutorial from yesterday is available on eClass

  3. Recap: Estimators
• An estimator is a random variable representing a procedure for estimating the value of an unobserved quantity based on observed data
• Concentration inequalities let us bound the probability of a given estimator being at least ϵ away from the estimated quantity
• An estimator is consistent if it converges in probability to the estimated quantity

  4. Recap: Sample Complexity
• Sample complexity is the number of samples needed to attain a desired error bound ϵ with a desired probability 1 − δ
• The mean squared error of an estimator decomposes into bias (squared) and variance
• Using a biased estimator can have lower error than an unbiased estimator
  • Bias the estimator based on some prior information
  • But this only helps if the prior information is correct
  • Cannot reduce error by adding in arbitrary bias

  5. Outline
1. Recap & Logistics
2. Optimization by Gradient Descent
3. Multivariate Gradient Descent
4. Adaptive Step Sizes
5. Optimization Properties

  6. Optimization
• We often want to find the argument w* that minimizes an objective function c: w* = arg min_w c(w)
• Example: Using linear regression to fit a dataset {(x_i, y_i)}_{i=1}^n
  • Estimate the targets by ŷ = f(x) = w_0 + w_1 x
  • Each vector w = (w_0, w_1) specifies a particular f
  • Objective is the total error c(w) = ∑_{i=1}^n (f(x_i) − y_i)²
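The total-error objective above can be computed directly. A minimal sketch on a hypothetical toy dataset (the values below are illustrative, not from the slides):

```python
import numpy as np

# Hypothetical dataset {(x_i, y_i)}: generated by y = 1 + 2x, so w = (1, 2) fits exactly.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

def c(w):
    """Total squared error c(w) = sum_i (f(x_i) - y_i)^2 for f(x) = w0 + w1*x."""
    w0, w1 = w
    return np.sum((w0 + w1 * x - y) ** 2)

print(c((1.0, 2.0)))  # 0.0: the true parameters fit the data exactly
print(c((0.0, 0.0)))  # residuals -1,-3,-5,-7, so 1 + 9 + 25 + 49 = 84.0
```

Minimizing c over w is exactly the optimization problem the rest of the lecture addresses.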

  7. Stationary Points
• Recall that every minimum of an everywhere-differentiable function c(w) must* occur at a stationary point: a point at which c′(w) = 0
  ✴ Question: What is the exception?
• However, not every stationary point is a minimum
• Every stationary point is either:
  • A local minimum
  • A local maximum
  • A saddle point
• The global minimum is either a local minimum, or a boundary point
[Figure: a curve with its local minima, a saddle point, and the global minimum labelled]

  8. Numerical Optimization
• So a simple recipe for optimizing a function is to find its stationary points; one of those must be the minimum (as long as the domain is unbounded)
• Question: Why don't we always just do that?
• We will almost never be able to analytically compute the minimum of the functions that we want to optimize
  ✴ (Linear regression is an important exception)
• Instead, we must try to find the minimum numerically
• Main techniques: first-order and second-order gradient descent

  9. Taylor Series
Definition: A Taylor series is a way of approximating a function c in a small neighbourhood around a point a:

c(w) ≈ c(a) + c′(a)(w − a) + c′′(a)/2 (w − a)² + ⋯ + c⁽ᵏ⁾(a)/k! (w − a)ᵏ
     = c(a) + ∑_{i=1}^k c⁽ⁱ⁾(a)/i! (w − a)ⁱ

• Intuition: Following the tangent line of the function approximates how it changes
  • i.e., following a function with the same first derivative
• Following a function with the same first and second derivatives is a better approximation; with the same first, second, and third derivatives is even better; etc.
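The claim that higher-order approximations are better near a can be checked numerically. A small sketch, using c(w) = exp(w) (chosen here only because all of its derivatives are exp, so the coefficients are easy):

```python
import math

def taylor(c_derivs, a, w, k):
    """k-th order Taylor approximation of c around a, evaluated at w.
    c_derivs[i] is the i-th derivative of c (c_derivs[0] is c itself)."""
    return sum(c_derivs[i](a) * (w - a) ** i / math.factorial(i)
               for i in range(k + 1))

derivs = [math.exp] * 4  # every derivative of exp is exp
a, w = 0.0, 0.5
for k in range(4):
    approx = taylor(derivs, a, w, k)
    print(k, approx, abs(approx - math.exp(w)))
# The error shrinks as k grows: each extra matched derivative improves the fit.
```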

  10. Second-Order Gradient Descent (Newton–Raphson Method)
1. Approximate the target function with a second-order Taylor series around the current guess w_t:
   c(w) ≈ c(w_t) + c′(w_t)(w − w_t) + c′′(w_t)/2 (w − w_t)²
2. Find the stationary point of the approximation:
   w_{t+1} ← w_t − c′(w_t)/c′′(w_t)
3. If the stationary point of the approximation is a (good enough) stationary point of the objective, then stop. Else, goto 1.

Derivation of step 2 (writing a = w_t):
0 = d/dw [c(a) + c′(a)(w − a) + c′′(a)/2 (w − a)²]
  = c′(a) + c′′(a)(w − a)
⟺ −c′(a) = c′′(a)(w − a)
⟺ w − a = −c′(a)/c′′(a)
⟺ w = a − c′(a)/c′′(a)
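The three steps above can be sketched in a few lines. A minimal implementation, assuming the first and second derivatives are supplied as functions:

```python
def newton_minimize(c1, c2, w0, tol=1e-8, max_iters=100):
    """Second-order gradient descent: repeatedly jump to the stationary
    point of the local quadratic approximation, w <- w - c'(w)/c''(w)."""
    w = w0
    for _ in range(max_iters):
        w -= c1(w) / c2(w)
        if abs(c1(w)) < tol:  # (good enough) stationary point of the objective
            break
    return w

# Example: c(w) = (w - 3)^2 + 1, so c'(w) = 2(w - 3) and c''(w) = 2.
w_star = newton_minimize(lambda w: 2 * (w - 3), lambda w: 2.0, w0=10.0)
print(w_star)  # 3.0: a quadratic equals its own second-order Taylor series,
               # so Newton-Raphson minimizes it in a single step
```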

  11. (First-Order) Gradient Descent
• We can run Newton–Raphson whenever we have access to both the first and second derivatives of the target function
• Often we want to only use the first derivative (why?)
• First-order gradient descent: Replace the second derivative with a constant 1/η (η is the step size) in the approximation:
  c(w) ≈ c(w_t) + c′(w_t)(w − w_t) + 1/(2η) (w − w_t)²
• By exactly the same derivation as before: w_{t+1} ← w_t − η c′(w_t)
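The first-order update can be sketched just as compactly; only c′ is needed, at the cost of many small steps instead of one Newton jump. The function and step size below are illustrative choices:

```python
def gradient_descent(c1, w0, eta, iters):
    """First-order gradient descent with a constant step size eta:
    w <- w - eta * c'(w)."""
    w = w0
    for _ in range(iters):
        w -= eta * c1(w)
    return w

# Same quadratic as before: c(w) = (w - 3)^2 + 1, c'(w) = 2(w - 3).
w = gradient_descent(lambda w: 2 * (w - 3), w0=10.0, eta=0.1, iters=200)
print(w)  # converges to 3.0, shrinking the error by a factor 0.8 per step
```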

  12. Partial Derivatives
• So far: Optimizing a univariate function c : ℝ → ℝ
• But actually: Optimizing a multivariate function c : ℝᵈ → ℝ
  • d is typically huge (d ≫ 10,000 is not uncommon)
• The first derivative of a multivariate function is a vector of partial derivatives
Definition:
The partial derivative ∂f/∂x_i of a function f(x_1, …, x_d) at (x_1, …, x_d) with respect to x_i is g′(x_i), where g(y) = f(x_1, …, x_{i−1}, y, x_{i+1}, …, x_d)

  13. Gradients
The multivariate analog of the first derivative is called a gradient.
Definition:
The gradient ∇f(x) of a function f : ℝᵈ → ℝ at x ∈ ℝᵈ is the vector of all the partial derivatives of f at x:

∇f(x) = [∂f/∂x_1(x), ∂f/∂x_2(x), …, ∂f/∂x_d(x)]ᵀ
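The definition on the previous slide suggests a direct numerical check: perturb one coordinate at a time while holding the others fixed. A sketch using central differences (the step h and example function are illustrative):

```python
def numerical_gradient(f, x, h=1e-6):
    """Approximate the gradient of f: R^d -> R at x. Entry i perturbs only
    x_i, matching the definition of the partial derivative via g(y)."""
    grad = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        grad.append((f(xp) - f(xm)) / (2 * h))
    return grad

# f(x) = x_1^2 + 3*x_2, so the exact gradient at (2, 5) is (4, 3).
g = numerical_gradient(lambda x: x[0] ** 2 + 3 * x[1], [2.0, 5.0])
print(g)  # approximately [4.0, 3.0]
```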

  14. Multivariate Gradient Descent
• First-order gradient descent for a multivariate function c : ℝᵈ → ℝ is just:
  w_{t+1} ← w_t − η_t ∇c(w_t)
• Notice the subscript t on η: we can choose a different η_t for each iteration
• Indeed, for univariate functions, Newton–Raphson can be understood as first-order gradient descent that chooses a step size of η_t = 1/c′′(w_t) at each iteration
• Choosing a good step size is crucial to efficiently using first-order gradient descent
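The multivariate update is the same one-liner with vectors in place of scalars. A minimal sketch (constant η_t and the quadratic objective below are illustrative choices):

```python
import numpy as np

def multivariate_gd(grad_c, w0, eta, iters):
    """w_{t+1} <- w_t - eta_t * grad c(w_t); here eta_t is held constant."""
    w = np.asarray(w0, dtype=float)
    for _ in range(iters):
        w = w - eta * grad_c(w)
    return w

# Example: c(w) = ||w - b||^2 with b = (1, -2), so grad c(w) = 2(w - b)
# and the unique minimizer is w* = b.
b = np.array([1.0, -2.0])
w = multivariate_gd(lambda w: 2 * (w - b), w0=[0.0, 0.0], eta=0.1, iters=100)
print(w)  # approximately [1.0, -2.0]
```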

  15. Adaptive Step Sizes
[Figure: (a) step size too small, (b) step size too big, (c) adaptive step size]
• If the step size is too small, gradient descent will "work", but take forever
• Too big, and we can overshoot the optimum
• Ideally, we would choose η_t = arg min_{η ∈ ℝ⁺} c(w_t − η ∇c(w_t))
  • But that's another optimization!
• There are some heuristics that we can use to adaptively guess good values for η_t

  16. Line Search
Intuition:
• Big step sizes are better so long as they don't overshoot
• Try a big step size! If it increases the objective, try a smaller one.
• Keep trying smaller ones until you decrease the objective; then start iteration t + 1 from η_max again.
A simple heuristic: line search
1. Try some largest-reasonable step size η_t⁽⁰⁾ = η_max
2. Is c(w_t − η_t⁽ˢ⁾ ∇c(w_t)) < c(w_t)? If yes, w_{t+1} ← w_t − η_t⁽ˢ⁾ ∇c(w_t)
3. Otherwise, try η_t⁽ˢ⁺¹⁾ = τ η_t⁽ˢ⁾ (for τ < 1) and goto 2
• Typically τ ∈ [0.5, 0.9]
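The three-step heuristic above can be sketched as one backtracking step. The defaults η_max = 1 and τ = 0.7 are illustrative, and a shrink limit is added so the loop terminates even at a stationary point (a detail the slide leaves implicit):

```python
def line_search_step(c, grad_c, w, eta_max=1.0, tau=0.7, max_shrinks=50):
    """One gradient-descent step with backtracking line search: start at
    eta_max and shrink by tau until the step decreases the objective."""
    g = grad_c(w)
    eta = eta_max
    for _ in range(max_shrinks):
        if c(w - eta * g) < c(w):       # step 2: does the objective decrease?
            return w - eta * g, eta
        eta *= tau                      # step 3: shrink and retry
    return w, 0.0  # no improving step found (e.g., already stationary)

# Example: c(w) = w^4; from w = 1 the full step eta_max = 1 overshoots badly
# (it lands at w = -3, where c = 81), so the step size must be shrunk.
w, eta = line_search_step(lambda w: w ** 4, lambda w: 4 * w ** 3, w=1.0)
print(w, eta)  # an accepted step with c(w) < c(1) = 1
```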

  17. Optimization Properties
1. Maximizing c(w) is the same as minimizing −c(w):
   arg max_w c(w) = arg min_w −c(w)
2. Convex functions have a global minimum at every stationary point:
   c is convex ⟺ c(t w_1 + (1 − t) w_2) ≤ t c(w_1) + (1 − t) c(w_2) for all w_1, w_2 and t ∈ [0, 1]
3. Identifiability: Sometimes we want the actual global minimum; other times we want a good-enough minimizer (i.e., a local minimum might be OK).
4. Equivalence under constant shifts: Adding, subtracting, or multiplying by a positive constant does not change the minimizer of a function: for all k ∈ ℝ⁺,
   arg min_w c(w) = arg min_w c(w) + k = arg min_w c(w) − k = arg min_w k c(w)

  18. Summary
• We often want to find the argument w* that minimizes an objective function c: w* = arg min_w c(w)
• Every interior minimum is a stationary point, so check the stationary points
• Stationary points are usually identified numerically
  • Typically, by gradient descent
• Choosing the step size is important for efficiency and correctness
  • Common approach: adaptive step size
  • E.g., by line search
