Optimization CMPUT 296: Basics of Machine Learning Textbook §4.1-4.4

Logistics
Reminders:
• Thought Question 1 due TODAY, September 17, by 11:59pm
  • To be handed in via eClass
• Assignment 1 (due Thursday, September 24)
Tutorial:
• Python tutorial from yesterday is available on eClass

Recap: Estimators
• An estimator is a random variable representing a procedure for estimating the value of an unobserved quantity based on observed data
• Concentration inequalities let us bound the probability of a given estimator being at least $\epsilon$ away from the estimated quantity
• An estimator is consistent if it converges in probability to the estimated quantity

Recap: Sample Complexity
• Sample complexity is the number of samples needed to attain a desired error bound $\epsilon$ at a desired probability $1 - \delta$
• The mean squared error of an estimator decomposes into bias (squared) and variance
• A biased estimator can have lower error than an unbiased estimator
  • Bias the estimator based on some prior information
  • But this only helps if the prior information is correct
  • Cannot reduce error by adding in arbitrary bias

Outline
1. Recap & Logistics
2. Optimization by Gradient Descent
3. Multivariate Gradient Descent
4. Adaptive Step Sizes
5. Optimization Properties

Optimization
We often want to find the argument $w^*$ that minimizes an objective function $c$:
$$w^* = \arg\min_w c(w)$$
Example: Using linear regression to fit a dataset $\{(x_i, y_i)\}_{i=1}^n$
• Estimate the targets by $\hat{y} = f(x) = w_0 + w_1 x$
• Each vector $w$ specifies a particular $f$
• Objective is the total error $c(w) = \sum_{i=1}^n \left( f(x_i) - y_i \right)^2$
[Figure: data points $(x_1, y_1)$ and $(x_2, y_2)$, the fitted line $f(x)$, and the error $e_1$ between $f(x_1)$ and $y_1$.]
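
As a minimal sketch of the objective above, the total squared error can be computed directly from the data. The function names and the toy dataset are assumptions made for illustration; they are not from the slides.

```python
def predict(w, x):
    """Linear model f(x) = w0 + w1 * x for a weight vector w = (w0, w1)."""
    w0, w1 = w
    return w0 + w1 * x


def total_squared_error(w, data):
    """Objective c(w): sum over the dataset of (f(x_i) - y_i)^2."""
    return sum((predict(w, x) - y) ** 2 for x, y in data)


# Hypothetical toy dataset generated exactly by y = 1 + 2x (illustration only).
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]

cost_at_truth = total_squared_error((1.0, 2.0), data)   # the generating weights fit perfectly
cost_elsewhere = total_squared_error((0.0, 0.0), data)  # errors are 1, 3, 5, so 1 + 9 + 25
```

Each choice of `w` gives a different line and therefore a different cost; optimization is the search for the `w` with the lowest cost.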

Stationary Points
• Recall that every minimum of an everywhere-differentiable function $c(w)$ must* occur at a stationary point: a point at which $c'(w) = 0$
  ✴ Question: What is the exception?
• However, not every stationary point is a minimum
• Every stationary point is either:
  • A local minimum
  • A local maximum
  • A saddlepoint
• The global minimum is either a local minimum, or a boundary point
[Figure: a curve with its local minima, a saddlepoint, and the global minimum labelled.]

Numerical Optimization
• So a simple recipe for optimizing a function is to find its stationary points; one of those must be the minimum (as long as the domain is unbounded)
• Question: Why don't we always just do that?
• We will almost never be able to analytically compute the minimum of the functions that we want to optimize
  ✴ (Linear regression is an important exception)
• Instead, we must try to find the minimum numerically
• Main techniques: First-order and second-order gradient descent

Taylor Series
Definition: A Taylor series is a way of approximating a function $c$ in a small neighbourhood around a point $a$:
$$c(w) \approx c(a) + c'(a)(w - a) + \frac{c''(a)}{2}(w - a)^2 + \cdots + \frac{c^{(k)}(a)}{k!}(w - a)^k$$
$$= c(a) + \sum_{i=1}^{k} \frac{c^{(i)}(a)}{i!}(w - a)^i$$
• Intuition: Following the tangent line of the function approximates how it changes
  • i.e., following a function with the same first derivative
• Following a function with the same first and second derivatives is a better approximation; with the same first, second, and third derivatives is even better; etc.
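
The claim that higher-order approximations are better near $a$ can be checked numerically. This is a small sketch that assumes $c(w) = e^w$ as the example function (chosen because every derivative of $e^w$ is again $e^w$); the function name and the evaluation point are illustrative choices.

```python
import math


def taylor_exp(w, a, order):
    """Order-k Taylor approximation of exp around a.

    Every derivative of exp at a equals exp(a), so term i is
    exp(a) * (w - a)**i / i!  (term 0 is c(a) itself).
    """
    return sum(math.exp(a) * (w - a) ** i / math.factorial(i)
               for i in range(order + 1))


a, w = 0.0, 0.1
# Absolute approximation error for first-, second-, and third-order series.
errors = [abs(math.exp(w) - taylor_exp(w, a, k)) for k in (1, 2, 3)]
```

The entries of `errors` shrink as the order grows, matching the intuition that matching more derivatives gives a closer local approximation.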

Second-Order Gradient Descent (Newton-Raphson Method)
1. Approximate the target function with a second-order Taylor series around the current guess $w_t$:
$$\hat{c}(w) = c(w_t) + c'(w_t)(w - w_t) + \frac{c''(w_t)}{2}(w - w_t)^2$$
2. Find the stationary point of the approximation:
$$w_{t+1} \leftarrow w_t - \frac{c'(w_t)}{c''(w_t)}$$
3. If the stationary point of the approximation is a (good enough) stationary point of the objective, then stop. Else, goto 1.
Derivation of step 2, writing $a$ for the current guess:
$$0 = \frac{d}{dw}\left[ c(a) + c'(a)(w - a) + \frac{c''(a)}{2}(w - a)^2 \right]$$
$$= c'(a) + 2\frac{c''(a)}{2}w - 2\frac{c''(a)}{2}a$$
$$= c'(a) + c''(a)(w - a)$$
$$\iff -c'(a) = c''(a)(w - a)$$
$$\iff (w - a) = -\frac{c'(a)}{c''(a)}$$
$$\iff w = a - \frac{c'(a)}{c''(a)}$$
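
The three steps above can be sketched as a short loop. This is a minimal illustration, not a robust implementation: the example objective $c(w) = e^w - 2w$, the tolerance, and the iteration cap are all assumptions, and the code assumes $c''(w) \neq 0$ at every iterate.

```python
import math


def newton_raphson(c_prime, c_double_prime, w0, tol=1e-10, max_iters=50):
    """Repeat w <- w - c'(w) / c''(w) until c'(w) is close to zero."""
    w = w0
    for _ in range(max_iters):
        # Stationary point of the second-order Taylor approximation at w.
        w = w - c_prime(w) / c_double_prime(w)
        # "Good enough" stationary point of the objective: stop.
        if abs(c_prime(w)) < tol:
            break
    return w


# Example objective (an assumption): c(w) = exp(w) - 2w,
# so c'(w) = exp(w) - 2 and c''(w) = exp(w); the minimum is at w = ln 2.
w_star = newton_raphson(lambda w: math.exp(w) - 2.0,
                        lambda w: math.exp(w),
                        w0=0.0)
```

Starting from `w0 = 0.0`, the first update moves to `w = 1.0`, and later updates close in on `math.log(2)`.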

(First-Order) Gradient Descent
• We can run Newton-Raphson whenever we have access to both the first and second derivatives of the target function
• Often we want to only use the first derivative (why?)
• First-order gradient descent: Replace the second derivative with a constant $\frac{1}{\eta}$ (where $\eta$ is the step size) in the approximation:
$$\hat{c}(w) = c(w_t) + c'(w_t)(w - w_t) + \frac{c''(w_t)}{2}(w - w_t)^2$$
becomes
$$\hat{c}(w) = c(w_t) + c'(w_t)(w - w_t) + \frac{1}{2\eta}(w - w_t)^2$$
• By exactly the same derivation as before:
$$w_{t+1} \leftarrow w_t - \eta\, c'(w_t)$$
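
The first-order update can be sketched in a few lines. The objective $c(w) = (w - 3)^2$, the step size, and the iteration count are assumptions chosen so that the iterates visibly converge.

```python
def gradient_descent_1d(c_prime, w0, eta, num_iters):
    """Univariate first-order gradient descent: w <- w - eta * c'(w)."""
    w = w0
    for _ in range(num_iters):
        w = w - eta * c_prime(w)
    return w


# Example objective (an assumption): c(w) = (w - 3)^2, so c'(w) = 2 (w - 3).
w_min = gradient_descent_1d(lambda w: 2.0 * (w - 3.0),
                            w0=0.0, eta=0.1, num_iters=100)
```

With `eta = 0.1` each update shrinks the distance to the minimizer `w = 3` by a factor of 0.8, so no second derivative is needed, at the cost of more iterations than Newton-Raphson would take on this function.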

Partial Derivatives
• So far: Optimizing univariate function $c : \mathbb{R} \to \mathbb{R}$
• But actually: Optimizing multivariate function $c : \mathbb{R}^d \to \mathbb{R}$
• $d$ is typically huge ($d \gg 10{,}000$ is not uncommon)
• First derivative of a multivariate function is a vector of partial derivatives
Definition: The partial derivative $\frac{\partial f}{\partial x_i}(x_1, \ldots, x_d)$ of a function $f(x_1, \ldots, x_d)$ at $x_1, \ldots, x_d$ with respect to $x_i$ is $g'(x_i)$, where
$$g(y) = f(x_1, \ldots, x_{i-1}, y, x_{i+1}, \ldots, x_d)$$

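Gradients
The multivariate analog to a first derivative is called a gradient.
Definition: The gradient $\nabla f(x)$ of a function $f : \mathbb{R}^d \to \mathbb{R}$ at $x \in \mathbb{R}^d$ is a vector of all the partial derivatives of $f$ at $x$:
$$\nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1}(x) \\ \frac{\partial f}{\partial x_2}(x) \\ \vdots \\ \frac{\partial f}{\partial x_d}(x) \end{bmatrix}$$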
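
The definitions of the partial derivative and the gradient translate almost directly into code. The sketch below estimates each partial derivative numerically with a central difference, which is an approximation chosen for illustration (the slides define the exact derivative); the example function `f` and the step `h` are assumptions.

```python
def partial_derivative(f, x, i, h=1e-6):
    """Estimate df/dx_i at x by varying only coordinate i."""
    def g(y):
        # g(y) = f(x_1, ..., x_{i-1}, y, x_{i+1}, ..., x_d)
        return f(x[:i] + [y] + x[i + 1:])
    # Central-difference estimate of g'(x_i).
    return (g(x[i] + h) - g(x[i] - h)) / (2 * h)


def gradient(f, x, h=1e-6):
    """Vector of all partial derivatives of f at x (x is a list of floats)."""
    return [partial_derivative(f, x, i, h) for i in range(len(x))]


def f(x):
    # Example function (an assumption): f(x1, x2) = x1^2 + 3 x1 x2
    return x[0] ** 2 + 3 * x[0] * x[1]


# Analytic gradient is (2 x1 + 3 x2, 3 x1), which is (8, 3) at the point (1, 2).
grad_at_point = gradient(f, [1.0, 2.0])
```

In practice gradients are usually derived analytically or by automatic differentiation; a finite-difference estimate like this one is mainly useful for checking such a derivation.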

Multivariate Gradient Descent
First-order gradient descent for multivariate functions $c : \mathbb{R}^d \to \mathbb{R}$ is just:
$$w_{t+1} \leftarrow w_t - \eta_t \nabla c(w_t)$$
• Notice the subscript $t$ on $\eta$: we can choose a different $\eta_t$ for each iteration
• Indeed, for univariate functions, Newton-Raphson can be understood as first-order gradient descent that chooses a step size of $\eta_t = \frac{1}{c''(w_t)}$ at each iteration
• Choosing a good step size is crucial to efficiently using first-order gradient descent
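
As a sketch of the multivariate update, the loop below fits the linear regression objective $c(w) = \sum_i (w_0 + w_1 x_i - y_i)^2$ by gradient descent. The toy dataset, the constant step size of 0.05, and the iteration count are assumptions; the step size is passed as a function of $t$ only to mirror the $\eta_t$ notation.

```python
# Hypothetical toy dataset generated exactly by y = 1 + 2x (illustration only).
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]


def grad_cost(w):
    """Gradient of c(w) = sum_i (w0 + w1 * x_i - y_i)^2 with respect to (w0, w1)."""
    residuals = [(w[0] + w[1] * x - y, x) for x, y in data]
    return [2.0 * sum(r for r, _ in residuals),
            2.0 * sum(r * x for r, x in residuals)]


def gradient_descent(grad, w0, step_size, num_iters):
    """Multivariate update w <- w - eta_t * grad(w), with eta_t = step_size(t)."""
    w = list(w0)
    for t in range(num_iters):
        eta_t = step_size(t)
        g = grad(w)
        w = [wi - eta_t * gi for wi, gi in zip(w, g)]
    return w


w_fit = gradient_descent(grad_cost, [0.0, 0.0], lambda t: 0.05, 500)
```

On this dataset the iterates approach the generating weights `(1, 2)`. A noticeably larger constant step size makes the same loop diverge, which is one reason the choice of $\eta_t$ matters.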

Adaptive Step Sizes
[Figure panels: (a) Step-size too small, (b) Step-size too big, (c) Adaptive step-size]
• If the step size is too small, gradient descent will "work", but take forever
• Too big, and we can overshoot the optimum
• Ideally, we would choose $\eta_t = \arg\min_{\eta \in \mathbb{R}^+} c\left( w_t - \eta \nabla c(w_t) \right)$
  • But that's another optimization!
• There are some heuristics that we can use to adaptively guess good values for $\eta_t$
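
The effect of the step size is easy to see on a one-dimensional example. The sketch below assumes the objective $c(w) = w^2$ and three illustrative step sizes; none of these values come from the slides.

```python
def run_gd(eta, w0=1.0, num_iters=20):
    """Gradient descent on c(w) = w^2, whose derivative is c'(w) = 2w."""
    w = w0
    for _ in range(num_iters):
        w = w - eta * 2.0 * w
    return w


too_small = run_gd(0.001)   # each step multiplies w by 0.998: barely moves
too_big = run_gd(1.1)       # each step multiplies w by -1.2: overshoots and grows
reasonable = run_gd(0.4)    # each step multiplies w by 0.2: converges quickly
```

After 20 iterations the small step size is still close to the starting point, the large one has moved further from the minimum at `w = 0` than where it started, and the moderate one is essentially at the minimum.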

Line Search
Intuition:
• Big step sizes are better so long as they don't overshoot
• Try a big step size! If it increases the objective, try a smaller one.
• Keep trying smaller ones until you decrease the objective; then start iteration $t + 1$ from $\eta_{\max}$ again.
A simple heuristic: line search
1. Try some largest-reasonable step size $\eta_t^{(0)} = \eta_{\max}$
2. Is $c\left( w_t - \eta_t^{(s)} \nabla c(w_t) \right) < c(w_t)$? If yes, $w_{t+1} \leftarrow w_t - \eta_t^{(s)} \nabla c(w_t)$
3. Otherwise, try $\eta_t^{(s+1)} = \tau \eta_t^{(s)}$ (for $\tau < 1$) and goto 2
• Typically $\tau \in [0.5, 0.9]$
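
The line search heuristic above can be sketched as follows. The cap on the number of shrink attempts (`max_shrinks`) is an added safeguard that is not part of the slide's procedure; it prevents an endless loop when no step decreases the objective, for example when the gradient is zero. The example objective and all parameter values are assumptions.

```python
def line_search_step(c, grad_c, w, eta_max=1.0, tau=0.5, max_shrinks=50):
    """One gradient descent iteration with the step size chosen by line search."""
    g = grad_c(w)
    eta = eta_max                      # step 1: start from the largest step size
    for _ in range(max_shrinks):
        candidate = [wi - eta * gi for wi, gi in zip(w, g)]
        if c(candidate) < c(w):        # step 2: accept if the objective decreases
            return candidate
        eta *= tau                     # step 3: otherwise shrink and try again
    return w                           # safeguard: no decreasing step was found


def c(w):
    # Example objective (an assumption): c(w) = w0^2 + 10 w1^2, minimized at (0, 0).
    return w[0] ** 2 + 10.0 * w[1] ** 2


def grad_c(w):
    return [2.0 * w[0], 20.0 * w[1]]


w = [1.0, 1.0]
for _ in range(200):
    w = line_search_step(c, grad_c, w)  # every iteration restarts from eta_max
final_cost = c(w)
```

Each accepted step strictly decreases the objective, so the step size adapts to the local shape of the function without being tuned by hand.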

Optimization Properties
1. Maximizing $c(w)$ is the same as minimizing $-c(w)$:
$$\arg\max_w c(w) = \arg\min_w -c(w)$$
2. Convex functions have a global minimum at every stationary point, where $c$ is convex if and only if, for all $w_1, w_2$ and all $t \in [0, 1]$,
$$c\left( t w_1 + (1 - t) w_2 \right) \le t\, c(w_1) + (1 - t)\, c(w_2)$$
3. Identifiability: Sometimes we want the actual global minimum; other times we want a good-enough minimizer (i.e., local minimum might be OK).
4. Equivalence under constant shifts: Adding or subtracting a constant, or multiplying by a positive constant, does not change the minimizer of a function:
$$\forall k \in \mathbb{R}^+ \quad \arg\min_w c(w) = \arg\min_w c(w) + k = \arg\min_w c(w) - k = \arg\min_w k\, c(w)$$
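
Properties 1 and 4 can be checked by brute force over a grid of candidate values. This is only a numerical illustration under assumptions: the objective $c(w) = (w - 1)^2 + 2$, the grid, and the constant $k = 5$ are all arbitrary choices.

```python
# Candidate values of w from -3.00 to 3.00 in steps of 0.01.
grid = [i / 100 for i in range(-300, 301)]


def c(w):
    # Example objective (an assumption), minimized at w = 1.
    return (w - 1.0) ** 2 + 2.0


def argmin(f):
    return min(grid, key=f)


def argmax(f):
    return max(grid, key=f)


k = 5.0
base = argmin(c)
shifted_up = argmin(lambda w: c(w) + k)     # adding a constant
shifted_down = argmin(lambda w: c(w) - k)   # subtracting a constant
scaled = argmin(lambda w: k * c(w))         # multiplying by a positive constant
negated = argmax(lambda w: -c(w))           # maximizing -c is minimizing c
```

All five searches return the same grid point, since none of these transformations changes the ordering of the objective values.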

Summary
• We often want to find the argument $w^*$ that minimizes an objective function $c$: $w^* = \arg\min_w c(w)$
• Every interior minimum is a stationary point, so check the stationary points
• Stationary points are usually identified numerically
  • Typically, by gradient descent
• Choosing the step size is important for efficiency and correctness
  • Common approach: Adaptive step size
  • E.g., by line search
