15-780: Optimization, J. Zico Kolter, March 14-16, 2015

Outline: Introduction to optimization; Types of optimization problems; Unconstrained optimization; Constrained optimization; Practical optimization

Beyond linear programming
Linear programming:
$$\operatorname*{minimize}_{x} \; c^T x \quad \text{subject to} \quad Gx \le h, \quad Ax = b$$
General (continuous) optimization:
$$\operatorname*{minimize}_{x} \; f(x) \quad \text{subject to} \quad g_i(x) \le 0, \; i = 1, \dots, m, \quad h_i(x) = 0, \; i = 1, \dots, p$$
where $x \in \mathbb{R}^n$ is the optimization variable, $f : \mathbb{R}^n \to \mathbb{R}$ is the objective function, $g_i : \mathbb{R}^n \to \mathbb{R}$ are the inequality constraints, and $h_i : \mathbb{R}^n \to \mathbb{R}$ are the equality constraints.

Example: image deblurring
[Figure: original image, blurred image, and reconstruction; figures from (Wang et al., 2009)]
Given a corrupted $m \times n$ image represented as a vector $y \in \mathbb{R}^{m \cdot n}$, find $x \in \mathbb{R}^{m \cdot n}$ by solving the optimization problem
$$\operatorname*{minimize}_{x} \; \|K * x - y\|_2^2 + \lambda \left( \sum_{i=1}^{n-1} |x_{mi} - x_{m(i+1)}| + \sum_{i=1}^{m-1} |x_{ni} - x_{n(i+1)}| \right)$$
where $K *$ denotes 2D convolution with some filter $K$.

Example: machine learning
Virtually all machine learning algorithms can be expressed as minimizing a loss function over observed data.
Given inputs $x^{(i)} \in \mathcal{X}$, desired outputs $y^{(i)} \in \mathcal{Y}$, a hypothesis function $h_\theta : \mathcal{X} \to \mathcal{Y}$ defined by parameters $\theta \in \mathbb{R}^n$, and a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$, machine learning algorithms solve the optimization problem
$$\operatorname*{minimize}_{\theta} \; \sum_{i=1}^{m} \ell\left(h_\theta(x^{(i)}), y^{(i)}\right)$$
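
To make the summed-loss objective concrete, here is a minimal sketch that evaluates it for one particular choice of hypothesis and loss. The linear hypothesis $h_\theta(x) = \theta^T x$, the squared loss, and the toy data are all assumptions made for illustration; the slide does not commit to any of them.

```python
# Illustrative only: the linear hypothesis, squared loss, and toy data
# are assumptions, not taken from the slides.

def hypothesis(theta, x):
    """h_theta(x) = theta^T x for a linear model."""
    return sum(t * xi for t, xi in zip(theta, x))

def squared_loss(prediction, target):
    """l(y_hat, y) = (y_hat - y)^2, which is always nonnegative."""
    return (prediction - target) ** 2

def objective(theta, inputs, outputs):
    """Sum of the losses over the m training examples."""
    return sum(squared_loss(hypothesis(theta, x), y)
               for x, y in zip(inputs, outputs))

# Toy data generated by y = 2 * x1 + 1; the second feature is a constant 1.
inputs = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
outputs = [1.0, 3.0, 5.0]

print(objective((2.0, 1.0), inputs, outputs))  # parameters that fit the data exactly
print(objective((0.0, 0.0), inputs, outputs))  # a poor choice of parameters
```

A learning algorithm in this framing is whatever procedure searches over theta to drive `objective` down.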

Example: robot trajectory planning
[Figure from (Schulman et al., 2013)]
Robot state $x_t$ and control inputs $u_t$:
$$\operatorname*{minimize}_{x_{1:T},\, u_{1:T-1}} \; \sum_{t=1}^{T-1} \left( \|x_t - x_{t+1}\|_2^2 + \|u_t\|_2^2 \right)$$
subject to
$x_{t+1} = f_{\text{dynamics}}(x_t, u_t)$ (robot dynamics),
$f_{\text{collision}}(x_t) \ge 0.1$ (avoid collisions),
$x_1 = x_{\text{init}}, \; x_T = x_{\text{goal}}$.

Many other applications
We've already seen many applications: any linear programming setting is also an example of continuous optimization, but there are many other nonlinear problems.
Applications in control, machine learning, finance, forecasting, signal processing, communications, structural design, and many others.
The move to optimization-based formalisms has been one of the primary trends in AI in the past 15+ years.

Classes of optimization problems
[Figure: three axes of classification: constrained vs. unconstrained, convex vs. nonconvex, smooth vs. nonsmooth]
The many different classifications for (continuous) optimization problems (linear programming, nonlinear programming, quadratic programming, semidefinite programming, second order cone programming, geometric programming, etc.) can get overwhelming.
We focus on three distinctions: unconstrained vs. constrained, convex vs. nonconvex, and (less so) smooth vs. nonsmooth.

Unconstrained vs. constrained optimization
$$\operatorname*{minimize}_{x} \; f(x) \qquad \text{vs.} \qquad \operatorname*{minimize}_{x} \; f(x) \quad \text{subject to} \quad g_i(x) \le 0, \; i = 1, \dots, m, \quad h_i(x) = 0, \; i = 1, \dots, p$$
In unconstrained optimization, every point $x \in \mathbb{R}^n$ is "feasible", so the singular focus is on finding a low value of $f(x)$.
In constrained optimization (where constraints truly need to hold exactly) it may be difficult to find an initial feasible point and to maintain feasibility during optimization.
This typically leads to different classes of algorithms.

Convex vs. nonconvex optimization
Originally researchers distinguished between linear (easy) and nonlinear (hard) optimization problems.
But in the 80s and 90s, it became clear that this wasn't the right line: the real distinction is between convex (easy) and nonconvex (hard) problems.
The optimization problem
$$\operatorname*{minimize}_{x} \; f(x) \quad \text{subject to} \quad g_i(x) \le 0, \; i = 1, \dots, m, \quad h_i(x) = 0, \; i = 1, \dots, p$$
is convex if $f$ and the $g_i$'s are all convex functions and the $h_i$'s are affine functions.

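Convex functions
[Figure: the chord between $(x, f(x))$ and $(y, f(y))$ on the graph of a convex function]
A function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if, for any $x, y \in \mathbb{R}^n$ and $\theta \in [0, 1]$,
$$f(\theta x + (1 - \theta) y) \le \theta f(x) + (1 - \theta) f(y)$$
$f$ is concave if $-f$ is convex.
$f$ is affine if it is both convex and concave; it must then take the form $f(x) = a^T x + b$ for $a \in \mathbb{R}^n$, $b \in \mathbb{R}$.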
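
The defining inequality can be probed numerically. The sketch below samples random triples $(x, y, \theta)$ in one dimension and counts how often the inequality fails. The function names, the sampling range, and the tolerance are assumptions made for illustration. A count of zero is only evidence of convexity on the sampled range, not a proof, whereas a single violation does prove nonconvexity.

```python
import math
import random

def violates_convexity(f, x, y, theta, tol=1e-12):
    """True if f(theta*x + (1-theta)*y) > theta*f(x) + (1-theta)*f(y)."""
    lhs = f(theta * x + (1 - theta) * y)
    rhs = theta * f(x) + (1 - theta) * f(y)
    return lhs > rhs + tol  # tol absorbs floating-point rounding

def count_violations(f, trials=1000, low=-5.0, high=5.0, seed=0):
    """Sample random (x, y, theta) triples and count violated inequalities."""
    rng = random.Random(seed)
    count = 0
    for _ in range(trials):
        x, y = rng.uniform(low, high), rng.uniform(low, high)
        theta = rng.random()
        if violates_convexity(f, x, y, theta):
            count += 1
    return count

print(count_violations(lambda x: x * x))  # x^2 is convex: no violations expected
print(count_violations(math.sin))         # sin is not convex: some violations expected
```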

Why is convex optimization easy?
[Figure: a nonconvex function $f_1(x)$ and a convex function $f_2(x)$]
Convex functions "curve upward everywhere", and convex constraints define a convex set (for any $x, y$ that are feasible, so is $\theta x + (1 - \theta) y$ for $\theta \in [0, 1]$).
Together, these properties imply that any local optimum must also be a global optimum.
Thus, for convex problems we can use local methods to find the globally optimal solution (cf. linear programming vs. integer programming).

Smooth vs. nonsmooth optimization
[Figure: a smooth function $f_1(x)$ and a nonsmooth function $f_2(x)$]
In optimization, we care about smoothness in terms of whether functions are (first or second order) continuously differentiable.
A function $f$ is first order continuously differentiable if its derivative $f'$ exists and is continuous; the Lipschitz constant of its derivative is a constant $L$ such that for all $x, y$
$$|f'(x) - f'(y)| \le L |x - y|$$
In the next section, we will use first and second derivative information to optimize functions, so whether or not these derivatives exist affects which methods we can apply.

Solving optimization problems
Starting with the unconstrained, smooth, one dimensional case.
[Figure: plot of $f(x)$ against $x$]
To find the minimum point $x^\star$, we can look at the derivative of the function $f'(x)$: any location where $f'(x) = 0$ will be a "flat" point in the function.
For convex problems, this is guaranteed to be a minimum (instead of a maximum).

The gradient
[Figure: the gradient $\nabla_x f(x)$ drawn in the $(x_1, x_2)$ plane]
For a multivariate function $f : \mathbb{R}^n \to \mathbb{R}$, its gradient is an $n$-dimensional vector containing the partial derivatives with respect to each dimension
$$\nabla_x f(x) = \begin{bmatrix} \dfrac{\partial f(x)}{\partial x_1} \\ \vdots \\ \dfrac{\partial f(x)}{\partial x_n} \end{bmatrix}$$
For continuously differentiable $f$ and unconstrained optimization, the optimal point must have $\nabla_x f(x^\star) = 0$.

Properties of the gradient
[Figure: $f(x)$ and its tangent $f(x_0) + \nabla_x f(x_0)^T (x - x_0)$ at the point $x_0$]
The gradient defines the first order Taylor approximation to the function $f$ around a point $x_0$
$$f(x) \approx f(x_0) + \nabla_x f(x_0)^T (x - x_0)$$
For convex $f$, the first order Taylor approximation is always an underestimate
$$f(x) \ge f(x_0) + \nabla_x f(x_0)^T (x - x_0)$$

Some common gradients
For $f(x) = a^T x$, the gradient is given by $\nabla_x f(x) = a$, since
$$\frac{\partial f(x)}{\partial x_i} = \frac{\partial}{\partial x_i} \sum_{j=1}^{n} a_j x_j = a_i$$
For $f(x) = \frac{1}{2} x^T Q x$, the gradient is given by $\nabla_x f(x) = \frac{1}{2} (Q + Q^T) x$, or just $\nabla_x f(x) = Qx$ if $Q$ is symmetric ($Q = Q^T$).
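
A closed-form gradient like the one for $\frac{1}{2} x^T Q x$ can be sanity-checked against central finite differences. The following is a minimal sketch; the helper names, the particular non-symmetric $Q$, the test point, and the step `eps` are assumptions chosen for illustration.

```python
def quad(Q, x):
    """f(x) = 0.5 * x^T Q x."""
    n = len(x)
    return 0.5 * sum(x[i] * Q[i][j] * x[j] for i in range(n) for j in range(n))

def quad_gradient(Q, x):
    """Closed form from the slide: 0.5 * (Q + Q^T) x."""
    n = len(x)
    return [0.5 * sum((Q[i][j] + Q[j][i]) * x[j] for j in range(n))
            for i in range(n)]

def numerical_gradient(f, x, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        up[i] += eps
        down = list(x)
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

Q = [[1.0, 2.0], [0.0, 3.0]]  # deliberately non-symmetric
x = [1.0, -2.0]
analytic = quad_gradient(Q, x)
numeric = numerical_gradient(lambda v: quad(Q, v), x)
print(analytic)
print(numeric)
```

Using a non-symmetric $Q$ matters here: with a symmetric matrix the check could not tell $\frac{1}{2}(Q + Q^T)x$ apart from $Qx$.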

How do we find $\nabla_x f(x) = 0$?
Direct solution: In some cases, it is possible to analytically compute the $x^\star$ such that $\nabla_x f(x^\star) = 0$. Example:
$$f(x) = 2x_1^2 + x_2^2 + x_1 x_2 - 6x_1 - 5x_2 \;\Longrightarrow\; \nabla_x f(x) = \begin{bmatrix} 4x_1 + x_2 - 6 \\ 2x_2 + x_1 - 5 \end{bmatrix}$$
$$\Longrightarrow\; x^\star = \begin{bmatrix} 4 & 1 \\ 1 & 2 \end{bmatrix}^{-1} \begin{bmatrix} 6 \\ 5 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$$
Iterative methods: more commonly, the condition that the gradient equals zero will not have an analytical solution, and we require iterative methods.
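
For this quadratic example, setting the gradient $(4x_1 + x_2 - 6,\; x_1 + 2x_2 - 5)$ to zero is a 2x2 linear system, so the direct solution can be reproduced in a few lines. This sketch uses Cramer's rule purely to stay dependency-free; the helper name `solve_2x2` is hypothetical, and a general-purpose linear solver would be the usual choice for larger systems.

```python
def solve_2x2(A, b):
    """Solve A x = b for a 2x2 matrix A via Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    if det == 0:
        raise ValueError("matrix is singular")
    x1 = (b[0] * A[1][1] - A[0][1] * b[1]) / det
    x2 = (A[0][0] * b[1] - A[1][0] * b[0]) / det
    return [x1, x2]

def gradient(x):
    """Gradient of f(x) = 2*x1^2 + x2^2 + x1*x2 - 6*x1 - 5*x2."""
    return [4 * x[0] + x[1] - 6, x[0] + 2 * x[1] - 5]

# Gradient = 0 is equivalent to [[4, 1], [1, 2]] x = [6, 5].
A = [[4.0, 1.0], [1.0, 2.0]]
b = [6.0, 5.0]
x_star = solve_2x2(A, b)
print(x_star)            # the stationary point
print(gradient(x_star))  # should vanish at the stationary point
```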

Gradient descent
The gradient doesn't just give us the optimality condition, it also points in the direction of "steepest ascent" for the function $f$.
[Figure: the gradient $\nabla_x f(x)$ drawn in the $(x_1, x_2)$ plane]
This motivates the gradient descent algorithm, which repeatedly takes steps in the direction of the negative gradient:
Repeat: $x \leftarrow x - \alpha \nabla_x f(x)$
for some step size $\alpha > 0$.
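
The update rule translates almost line for line into code. Below is a minimal sketch applied to the quadratic $f(x) = 2x_1^2 + x_2^2 + x_1 x_2 - 6x_1 - 5x_2$ used elsewhere in these slides. The starting point $(0, 0)$ and the fixed step size $\alpha = 0.2$ are assumptions made for illustration, not settings confirmed by the slides.

```python
def f(x):
    """The quadratic example from the slides."""
    return 2 * x[0] ** 2 + x[1] ** 2 + x[0] * x[1] - 6 * x[0] - 5 * x[1]

def grad_f(x):
    """Its gradient, computed by hand."""
    return [4 * x[0] + x[1] - 6, x[0] + 2 * x[1] - 5]

def gradient_descent(grad, x0, alpha, iterations):
    """Repeat x <- x - alpha * grad(x) for a fixed number of iterations."""
    x = list(x0)
    for _ in range(iterations):
        g = grad(x)
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return x

# Assumed settings: start at the origin, alpha = 0.2, 100 iterations.
x = gradient_descent(grad_f, [0.0, 0.0], alpha=0.2, iterations=100)
print(x, f(x))  # should approach the minimizer (1, 2), where f = -8
```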

[Figure: 100 iterations of gradient descent on the function $f(x) = 2x_1^2 + x_2^2 + x_1 x_2 - 6x_1 - 5x_2$. Left panel: iterates in the $(x_1, x_2)$ plane. Right panel: $f - f^\star$ (log scale) versus iteration.]

How do we choose step size $\alpha$?
The choice of $\alpha$ plays a big role in the convergence of the algorithm.
[Figure: gradient descent iterates in the $(x_1, x_2)$ plane for $\alpha = 0.05$ and for $\alpha = 0.42$]

[Figure: convergence of gradient descent for different step sizes. $f - f^\star$ (log scale) versus iteration for $\alpha = 0.05$, $\alpha = 0.2$, and $\alpha = 0.42$.]
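
The effect of the step size can be reproduced qualitatively with a short experiment on the same quadratic. This sketch reports $f - f^\star$ after 100 iterations for the three step sizes named in the figure, plus $\alpha = 0.5$, which is an extra value not in the slides. It is included because, for this particular quadratic, the Hessian has eigenvalues $3 \pm \sqrt{2}$, so fixed step sizes above $2 / (3 + \sqrt{2}) \approx 0.453$ make the iteration diverge. The starting point $(0, 0)$ is also an assumption, so the numbers need not match the original plot exactly.

```python
def f(x):
    """The quadratic example from the slides."""
    return 2 * x[0] ** 2 + x[1] ** 2 + x[0] * x[1] - 6 * x[0] - 5 * x[1]

def grad_f(x):
    """Its gradient, computed by hand."""
    return [4 * x[0] + x[1] - 6, x[0] + 2 * x[1] - 5]

F_STAR = -8.0  # value of f at its minimizer (1, 2)

def suboptimality_after(alpha, iterations=100, x0=(0.0, 0.0)):
    """Run fixed-step gradient descent and return f(x) - f*."""
    x = list(x0)
    for _ in range(iterations):
        g = grad_f(x)
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return f(x) - F_STAR

# 0.05, 0.2, 0.42 come from the figure; 0.5 is an added divergent case.
for alpha in (0.05, 0.2, 0.42, 0.5):
    print(alpha, suboptimality_after(alpha))
```

In this setup the small step converges slowly, the large but stable step oscillates its way down, and the intermediate step does best.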

If we know the gradient is Lipschitz continuous with constant $L$, step size $\alpha = 1/L$ is good in theory and practice.
But what if we don't know the Lipschitz constant, or the derivative has an unbounded Lipschitz constant?
Idea #1 ("exact" line search): choose $\alpha$ to minimize $f(x_0 - \alpha \nabla f(x_0))$ for the current iterate $x_0$; this is just another optimization problem, but with a single variable $\alpha$.
Idea #2 (backtracking line search): try a few $\alpha$'s on each iteration until we get one that causes a suitable decrease in the function.
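
One common way to make "suitable decrease" precise is the Armijo sufficient-decrease condition, $f(x - \alpha g) \le f(x) - c\,\alpha\,\|g\|_2^2$ with $g = \nabla f(x)$. The slides do not say which test they use, so the condition itself, the constants `alpha0`, `beta`, and `c`, and the cap on the number of halvings are all assumptions in the sketch below.

```python
def f(x):
    """The quadratic example from the slides."""
    return 2 * x[0] ** 2 + x[1] ** 2 + x[0] * x[1] - 6 * x[0] - 5 * x[1]

def grad_f(x):
    """Its gradient, computed by hand."""
    return [4 * x[0] + x[1] - 6, x[0] + 2 * x[1] - 5]

def backtracking_step(f, grad, x, alpha0=1.0, beta=0.5, c=0.3, max_halvings=50):
    """Shrink alpha until the (assumed) Armijo condition holds, then step."""
    g = grad(x)
    g_norm_sq = sum(gi * gi for gi in g)
    fx = f(x)
    alpha = alpha0
    for _ in range(max_halvings):
        candidate = [xi - alpha * gi for xi, gi in zip(x, g)]
        if f(candidate) <= fx - c * alpha * g_norm_sq:
            return candidate, alpha
        alpha *= beta
    return list(x), 0.0  # no acceptable step found; stay where we are

def minimize(f, grad, x0, iterations=100, tol=1e-8):
    """Gradient descent where each step size comes from backtracking."""
    x = list(x0)
    for _ in range(iterations):
        if sum(gi * gi for gi in grad(x)) < tol ** 2:
            break  # gradient is numerically zero
        x, _ = backtracking_step(f, grad, x)
    return x

first_point, first_alpha = backtracking_step(f, grad_f, [0.0, 0.0])
print(first_alpha, first_point)  # step size accepted on the first iteration
print(minimize(f, grad_f, [0.0, 0.0]))
```

No Lipschitz constant is supplied anywhere: the search discovers a workable step size on each iteration, at the cost of a few extra function evaluations.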
