

SLIDE 1

Convex Optimization

DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis
http://www.cims.nyu.edu/~cfgranda/pages/OBDA_fall17/index.html
Carlos Fernandez-Granda

SLIDE 2

Convexity
Differentiable convex functions
Minimizing differentiable convex functions

SLIDE 3

Convex functions

A function f : R^n → R is convex if for any x, y ∈ R^n and any θ ∈ (0, 1)

θ f(x) + (1 − θ) f(y) ≥ f(θx + (1 − θ)y)

A function f is concave if −f is convex

SLIDE 4

Convex functions

[Figure: the chord value θ f(x) + (1 − θ) f(y) between f(x) and f(y) lies above the graph value f(θx + (1 − θ)y)]
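The inequality on the slide can be checked numerically for a specific convex function. This is an illustrative sketch, not part of the slides; the choice f(x) = ||x||² is an assumption made for the example.

```python
import numpy as np

# Check the convexity inequality
#   theta*f(x) + (1-theta)*f(y) >= f(theta*x + (1-theta)*y)
# for the squared Euclidean norm, which is convex.
def f(x):
    return np.sum(x ** 2)

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    theta = rng.uniform(0, 1)
    lhs = theta * f(x) + (1 - theta) * f(y)   # chord value
    rhs = f(theta * x + (1 - theta) * y)      # function value at the combination
    assert lhs >= rhs - 1e-12                 # convexity inequality holds
```

For this f the gap lhs − rhs equals θ(1 − θ)||x − y||², which is nonnegative and zero only when x = y.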

SLIDE 5

Linear functions are convex

If f is linear,

f(θx + (1 − θ)y) = θ f(x) + (1 − θ) f(y)

so the convexity inequality holds with equality

SLIDE 7

Strictly convex functions

A function f : R^n → R is strictly convex if for any x ≠ y ∈ R^n and any θ ∈ (0, 1)

θ f(x) + (1 − θ) f(y) > f(θx + (1 − θ)y)

SLIDE 8

Local minima are global

Any local minimum of a convex function is also a global minimum

SLIDE 9

Proof

Let x_loc be a local minimum: for all x ∈ R^n such that ||x − x_loc||_2 ≤ γ,

f(x_loc) ≤ f(x)

Suppose x_loc is not global, so there is a global minimum x_glob with f(x_glob) < f(x_loc)

Choose θ so that x_θ := θ x_loc + (1 − θ) x_glob satisfies ||x_θ − x_loc||_2 ≤ γ. Then

f(x_loc) ≤ f(x_θ)
         = f(θ x_loc + (1 − θ) x_glob)
         ≤ θ f(x_loc) + (1 − θ) f(x_glob)    by convexity of f
         < f(x_loc)                          because f(x_glob) < f(x_loc)

a contradiction

SLIDE 14

Norm

Let V be a vector space; a norm is a function ||·|| from V to R with the following properties

◮ It is homogeneous: for any scalar α and any x ∈ V, ||αx|| = |α| ||x||
◮ It satisfies the triangle inequality: ||x + y|| ≤ ||x|| + ||y||. In particular, ||x|| ≥ 0
◮ ||x|| = 0 implies x = 0

SLIDE 15

Norms are convex

For any x, y ∈ R^n and any θ ∈ (0, 1)

||θx + (1 − θ)y|| ≤ ||θx|| + ||(1 − θ)y||    triangle inequality
                  = θ ||x|| + (1 − θ) ||y||   homogeneity

SLIDE 18

Composition of convex and affine function

If f : R^n → R is convex, then for any A ∈ R^{n×m} and b ∈ R^n

h(x) := f(Ax + b)

is convex

Consequence: f(x) := ||Ax + b|| is convex for any A and b

SLIDE 19

Composition of convex and affine function

h(θx + (1 − θ)y) = f( θ(Ax + b) + (1 − θ)(Ay + b) )
                 ≤ θ f(Ax + b) + (1 − θ) f(Ay + b)    by convexity of f
                 = θ h(x) + (1 − θ) h(y)

SLIDE 23

ℓ0 “norm”

Number of nonzero entries in a vector

Not a norm! It is not homogeneous: ||2x||_0 = ||x||_0 ≠ 2 ||x||_0

Not convex either. Let x := (1, 0) and y := (0, 1); for any θ ∈ (0, 1)

||θx + (1 − θ)y||_0 = 2 > 1 = θ ||x||_0 + (1 − θ) ||y||_0
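The counterexample on the slide can be reproduced directly; this is an illustrative sketch, not part of the slides.

```python
import numpy as np

# The l0 "norm" counts nonzero entries; it violates both homogeneity
# and the convexity inequality, as in the slide's example.
def l0(x):
    return int(np.count_nonzero(x))

x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])
theta = 0.3

assert l0(2 * x) == l0(x)                      # scaling does not double it
mid = l0(theta * x + (1 - theta) * y)          # both entries nonzero: 2
chord = theta * l0(x) + (1 - theta) * l0(y)    # equals 1 for any theta
assert mid > chord                             # convexity fails
```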

SLIDE 30

Promoting sparsity

Finding sparse vectors consistent with data is often very useful

Toy problem: find t such that

v_t := ( t, t − 1, t − 1 )

is sparse

Strategy: minimize f(t) := ||v_t||

SLIDE 31

Promoting sparsity

[Figure: f(t) = ||v_t|| as a function of t for the ℓ0, ℓ1, ℓ2 and ℓ∞ norms; the ℓ1 cost attains its minimum at the sparse solution t = 1]

SLIDE 32

The rank is not convex

The rank of matrices in R^{n×n}, interpreted as a function from R^{n×n} to R, is not convex

Let X := diag(1, 0) and Y := diag(0, 1). For any θ ∈ (0, 1)

rank(θX + (1 − θ)Y) = 2 > 1 = θ rank(X) + (1 − θ) rank(Y)
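The matrices in this transcription are garbled; the diagonal rank-1 pair below is the standard counterexample consistent with the stated rank values, used here as an assumption for an illustrative check.

```python
import numpy as np

# Two rank-1 matrices whose convex combination has rank 2,
# so the rank violates the convexity inequality.
X = np.diag([1.0, 0.0])
Y = np.diag([0.0, 1.0])
theta = 0.5

lhs = np.linalg.matrix_rank(theta * X + (1 - theta) * Y)   # rank of the average
rhs = theta * np.linalg.matrix_rank(X) + (1 - theta) * np.linalg.matrix_rank(Y)
assert lhs == 2 and rhs == 1.0                             # convexity fails
```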

SLIDE 36

Matrix norms

Frobenius norm:  ||A||_F := sqrt( Σ_{i=1}^{m} Σ_{j=1}^{n} A_{ij}² ) = sqrt( Σ_{i=1}^{min{m,n}} σ_i² )

Operator norm:  ||A|| := max_{ ||x||_2 = 1, x ∈ R^n } ||Ax||_2 = σ_1

Nuclear norm:  ||A||_* := Σ_{i=1}^{min{m,n}} σ_i
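All three norms can be computed from the singular values, matching the formulas above; an illustrative sketch with a random matrix (not from the slides):

```python
import numpy as np

# Compute the three matrix norms on the slide from the singular values.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))
s = np.linalg.svd(A, compute_uv=False)   # singular values, descending

fro = np.sqrt(np.sum(s ** 2))    # Frobenius norm
op = s[0]                        # operator norm = largest singular value
nuc = np.sum(s)                  # nuclear norm

# Cross-check against NumPy's built-in matrix norms
assert np.isclose(fro, np.linalg.norm(A, 'fro'))
assert np.isclose(op, np.linalg.norm(A, 2))
assert np.isclose(nuc, np.linalg.norm(A, 'nuc'))
```

Note the ordering σ_1 ≤ ||A||_F ≤ ||A||_*, which holds for any matrix.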

SLIDE 37

Promoting low-rank structure

Finding low-rank matrices consistent with data is often very useful

Toy problem: find t such that

M(t) := ( 0.5 + t   1       1
          0.5       0.5     t
          0.5       1 − t   0.5 )

is low rank

Strategy: minimize f(t) := ||M(t)||

SLIDE 38

Promoting low-rank structure

[Figure: rank, operator norm, Frobenius norm and nuclear norm of M(t) as a function of t]

SLIDE 39

Convexity
Differentiable convex functions
Minimizing differentiable convex functions

SLIDE 40

Gradient

∇f(x) = ( ∂f(x)/∂x[1], ∂f(x)/∂x[2], …, ∂f(x)/∂x[n] )ᵀ

If the gradient exists at every point, the function is said to be differentiable

SLIDE 41

Directional derivative

Encodes the first-order rate of change in a particular direction:

f'_u(x) := lim_{h→0} ( f(x + hu) − f(x) ) / h = ⟨∇f(x), u⟩

where ||u||_2 = 1

SLIDE 42

Direction of maximum variation

∇f is the direction of maximum increase; −∇f is the direction of maximum decrease

|f'_u(x)| = |∇f(x)ᵀ u|
          ≤ ||∇f(x)||_2 ||u||_2    Cauchy–Schwarz inequality
          = ||∇f(x)||_2

Equality holds if and only if u = ± ∇f(x) / ||∇f(x)||_2

SLIDE 46

Gradient

[Figure]

SLIDE 47

First-order approximation

The first-order or linear approximation of f : R^n → R at x is

f^1_x(y) := f(x) + ∇f(x)ᵀ(y − x)

If f is continuously differentiable at x,

lim_{y→x} ( f(y) − f^1_x(y) ) / ||y − x||_2 = 0

SLIDE 48

First-order approximation

[Figure: f(y) and its linear approximation f^1_x(y) at x]

SLIDE 49

Convexity

A differentiable function f : R^n → R is convex if and only if for every x, y ∈ R^n

f(y) ≥ f(x) + ∇f(x)ᵀ(y − x)

It is strictly convex if and only if for every y ≠ x

f(y) > f(x) + ∇f(x)ᵀ(y − x)

SLIDE 50

Optimality condition

If f is convex and ∇f(x) = 0, then for any y ∈ R^n

f(y) ≥ f(x)

If f is strictly convex, then for any y ≠ x

f(y) > f(x)

SLIDE 51

Epigraph

The epigraph of f : R^n → R is

epi(f) := { x ∈ R^{n+1} | f( x[1], …, x[n] ) ≤ x[n + 1] }

SLIDE 52

Epigraph

[Figure: epi(f), the region above the graph of f]

SLIDE 53

Supporting hyperplane

A hyperplane H is a supporting hyperplane of a set S at x if

◮ H and S intersect at x
◮ S is contained in one of the half-spaces bounded by H

SLIDE 54

Geometric intuition

Geometrically, f is convex if and only if for every x the hyperplane

H_{f,x} := { y ∈ R^{n+1} | y[n + 1] = f^1_x( y[1], …, y[n] ) }

is a supporting hyperplane of the epigraph at x

If ∇f(x) = 0, the hyperplane is horizontal

SLIDE 55

Convexity

[Figure: f(y) lies above its linear approximation f^1_x(y), the supporting hyperplane at x]

SLIDE 56

Hessian matrix

If f has a Hessian matrix at every point, it is twice differentiable

∇²f(x) = ( ∂²f(x)/∂x[1]²        ∂²f(x)/∂x[1]∂x[2]   ···   ∂²f(x)/∂x[1]∂x[n]
           ∂²f(x)/∂x[1]∂x[2]    ∂²f(x)/∂x[2]²       ···   ∂²f(x)/∂x[2]∂x[n]
           ···                  ···                 ···   ···
           ∂²f(x)/∂x[1]∂x[n]    ∂²f(x)/∂x[2]∂x[n]   ···   ∂²f(x)/∂x[n]² )

SLIDE 57

Curvature

The second directional derivative f''_u of f at x equals

f''_u(x) = uᵀ ∇²f(x) u

for any unit-norm vector u ∈ R^n

SLIDE 58

Second-order approximation

The second-order or quadratic approximation of f at x is

f^2_x(y) := f(x) + ∇f(x)ᵀ(y − x) + (1/2) (y − x)ᵀ ∇²f(x) (y − x)

SLIDE 59

Second-order approximation

[Figure: f(y) and its quadratic approximation f^2_x(y) at x]

SLIDE 60

Quadratic form

Second-order polynomial in several dimensions:

q(x) := xᵀAx + bᵀx + c

parametrized by a symmetric matrix A ∈ R^{n×n}, a vector b ∈ R^n and a constant c

SLIDE 61

Quadratic approximation

The quadratic approximation f^2_x : R^n → R at x ∈ R^n of a twice-continuously differentiable function f : R^n → R satisfies

lim_{y→x} ( f(y) − f^2_x(y) ) / ||y − x||_2² = 0

SLIDE 62

Eigendecomposition of symmetric matrices

Let A = UΛUᵀ be the eigendecomposition of a symmetric matrix A

Eigenvalues: λ_1 ≥ ··· ≥ λ_n (which can be negative or 0)
Eigenvectors: u_1, …, u_n, an orthonormal basis

λ_1 = max_{ ||x||_2 = 1, x ∈ R^n } xᵀAx    u_1 = arg max_{ ||x||_2 = 1, x ∈ R^n } xᵀAx
λ_n = min_{ ||x||_2 = 1, x ∈ R^n } xᵀAx    u_n = arg min_{ ||x||_2 = 1, x ∈ R^n } xᵀAx

SLIDE 63

Maximum and minimum curvature

Let ∇²f(x) = UΛUᵀ be the eigendecomposition of the Hessian at x

Direction of maximum curvature: u_1
Direction of minimum curvature (or maximum negative curvature): u_n

SLIDE 64

Positive semidefinite matrices

For any x

xᵀAx = xᵀUΛUᵀx = Σ_{i=1}^{n} λ_i ⟨u_i, x⟩²

All eigenvalues are nonnegative if and only if xᵀAx ≥ 0 for all x

The matrix is then said to be positive semidefinite
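The identity above can be verified numerically; an illustrative sketch (not from the slides) using a Gram matrix, which is always positive semidefinite:

```python
import numpy as np

# A symmetric PSD matrix: Gram matrices B^T B have nonnegative eigenvalues,
# and x^T A x equals the weighted sum of squared projections on the slide.
rng = np.random.default_rng(0)
B = rng.normal(size=(3, 3))
A = B.T @ B

lam, U = np.linalg.eigh(A)       # eigendecomposition of a symmetric matrix
assert np.all(lam >= -1e-10)     # nonnegative spectrum

x = rng.normal(size=3)
quad = x @ A @ x
# x^T A x = sum_i lambda_i <u_i, x>^2
assert np.isclose(quad, np.sum(lam * (U.T @ x) ** 2))
assert quad >= -1e-10
```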

SLIDE 65

Positive (negative) (semi)definite matrices

Positive (semi)definite: all eigenvalues are positive (nonnegative); equivalently, xᵀAx > (≥) 0 for all x ≠ 0
Quadratic form: all directions have positive curvature

Negative (semi)definite: all eigenvalues are negative (nonpositive); equivalently, xᵀAx < (≤) 0 for all x ≠ 0
Quadratic form: all directions have negative curvature

SLIDE 66

Convexity

A twice-differentiable function g : R → R is convex if and only if g''(x) ≥ 0 for all x ∈ R

A twice-differentiable function on R^n is convex if and only if its Hessian is positive semidefinite at every point

If the Hessian is positive definite at every point, the function is strictly convex

SLIDE 67

Second-order approximation

[Figure: f(y) and its quadratic approximation f^2_x(y)]

SLIDE 68

Convex

SLIDE 69

Concave

SLIDE 70

Neither

SLIDE 71

Convexity
Differentiable convex functions
Minimizing differentiable convex functions

SLIDE 72

Problem

Challenge: minimizing differentiable convex functions

min_{x ∈ R^n} f(x)

SLIDE 73

Gradient descent

Intuition: make local progress in the steepest-descent direction −∇f(x)

Set the initial point x^(0) to an arbitrary value

Update by setting

x^(k+1) := x^(k) − α_k ∇f( x^(k) )

where α_k > 0 is the step size, until a stopping criterion is met
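The update rule above can be sketched in a few lines; this is an illustrative implementation, not part of the slides, and the test function and fixed step size are assumptions.

```python
import numpy as np

# Gradient descent with a fixed step size:
# x^(k+1) = x^(k) - alpha * grad f(x^(k))
def gradient_descent(grad, x0, alpha, steps):
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - alpha * grad(x)
    return x

# Example: f(x) = ||x - c||^2 has gradient 2(x - c) and minimizer c.
c = np.array([1.0, -2.0])
x_hat = gradient_descent(lambda x: 2 * (x - c), x0=np.zeros(2), alpha=0.1, steps=200)
assert np.allclose(x_hat, c, atol=1e-6)
```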

SLIDE 74

Gradient descent

[Figure: gradient descent iterates on the contour lines of f]

SLIDE 76

Small step size

[Figure: with a small step size, the iterates make slow progress]

SLIDE 77

Large step size

[Figure: with a large step size, the iterates can oscillate or diverge]

SLIDE 78

Line search

Idea: minimize the function along the descent direction,

α_k := arg min_{α ∈ R} h(α) = arg min_{α ∈ R} f( x^(k) − α ∇f( x^(k) ) )
SLIDE 79

Backtracking line search with Armijo rule

Given α_0 ≥ 0 and β, η ∈ (0, 1), set α_k := α_0 β^i for the smallest i such that

x^(k+1) := x^(k) − α_k ∇f( x^(k) )

satisfies

f( x^(k+1) ) ≤ f( x^(k) ) − (1/2) α_k ||∇f( x^(k) )||_2²

a condition known as the Armijo rule
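The backtracking rule can be sketched as follows; an illustrative implementation (not from the slides), with the default parameters chosen as assumptions.

```python
import numpy as np

# Backtracking: shrink alpha = alpha0 * beta^i until the sufficient-decrease
# condition f(x - alpha*g) <= f(x) - (alpha/2)*||g||^2 holds.
def backtracking_step(f, grad, x, alpha0=1.0, beta=0.5, max_iter=50):
    g = grad(x)
    alpha = alpha0
    for _ in range(max_iter):
        x_new = x - alpha * g
        if f(x_new) <= f(x) - 0.5 * alpha * np.dot(g, g):
            return x_new, alpha          # Armijo condition satisfied
        alpha *= beta                    # backtrack
    return x_new, alpha

f = lambda x: np.sum(x ** 2)
grad = lambda x: 2 * x
x = np.array([3.0, -1.0])
x_new, alpha = backtracking_step(f, grad, x)
assert f(x_new) < f(x)                   # the accepted step makes progress
```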

SLIDE 80

Backtracking line search with Armijo rule

[Figure: gradient descent iterates with backtracking line search]

SLIDE 81

Gradient descent for least squares

Aim: use n examples ( y^(1), x^(1) ), ( y^(2), x^(2) ), …, ( y^(n), x^(n) ) to fit a linear model by minimizing the least-squares cost function

minimize_{β ∈ R^p} ||y − Xβ||_2²

The gradient of the quadratic function

f(β) := ||y − Xβ||_2² = βᵀXᵀXβ − 2βᵀXᵀy + yᵀy

equals

∇f(β) = 2XᵀXβ − 2Xᵀy

Gradient descent updates are

β^(k+1) = β^(k) + 2α_k Xᵀ( y − Xβ^(k) )
        = β^(k) + 2α_k Σ_{i=1}^{n} ( y^(i) − ⟨x^(i), β^(k)⟩ ) x^(i)
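The least-squares updates above can be checked numerically; an illustrative sketch (not from the slides) on synthetic noiseless data, with the step size chosen from the Lipschitz constant of the gradient:

```python
import numpy as np

# Gradient descent for least squares:
# beta^(k+1) = beta^(k) + 2*alpha*X^T(y - X beta^(k))
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true                       # noiseless targets

beta = np.zeros(3)
L = 2 * np.linalg.norm(X.T @ X, 2)      # Lipschitz constant of the gradient
alpha = 1.0 / L                         # step size alpha <= 1/L
for _ in range(5000):
    beta = beta + 2 * alpha * X.T @ (y - X @ beta)

beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta, beta_ls, atol=1e-6)   # matches the least-squares solution
```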
SLIDE 86

Gradient ascent for logistic regression

Aim: use n examples ( y^(1), x^(1) ), ( y^(2), x^(2) ), …, ( y^(n), x^(n) ) to fit a logistic-regression model by maximizing the log-likelihood

f(β) := Σ_{i=1}^{n} y^(i) log g( ⟨x^(i), β⟩ ) + ( 1 − y^(i) ) log( 1 − g( ⟨x^(i), β⟩ ) )

where g(t) = 1 / ( 1 + exp(−t) )

Since g'(t) = g(t)( 1 − g(t) ) and ( 1 − g(t) )' = −g(t)( 1 − g(t) ), the gradient of the cost function equals

∇f(β) = Σ_{i=1}^{n} y^(i) ( 1 − g(⟨x^(i), β⟩) ) x^(i) − ( 1 − y^(i) ) g(⟨x^(i), β⟩) x^(i)

The gradient ascent updates are

β^(k+1) := β^(k) + α_k Σ_{i=1}^{n} y^(i) ( 1 − g(⟨x^(i), β^(k)⟩) ) x^(i) − ( 1 − y^(i) ) g(⟨x^(i), β^(k)⟩) x^(i)
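The ascent updates can be sketched compactly, since the two gradient terms collapse to (y − g)x; an illustrative run on synthetic data (not from the slides; data and step size are assumptions):

```python
import numpy as np

# Gradient ascent on the logistic-regression log-likelihood.
def g(t):
    return 1.0 / (1.0 + np.exp(-t))

def log_likelihood(beta, X, y):
    p = g(X @ beta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X @ np.array([1.5, -1.0]) + 0.3 * rng.normal(size=100) > 0).astype(float)

beta = np.zeros(2)
ll_old = log_likelihood(beta, X, y)
for _ in range(200):
    p = g(X @ beta)
    beta = beta + 0.01 * X.T @ (y - p)   # ascent step; (y - p) collapses both terms
assert log_likelihood(beta, X, y) > ll_old   # the likelihood improved
```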

SLIDE 91

Convergence of gradient descent

Does the method converge? How fast (or slow)? For what step sizes?

It depends on the function

SLIDE 93

Lipschitz continuity

A function f : R^n → R^m is Lipschitz continuous if for any x, y ∈ R^n

||f(y) − f(x)||_2 ≤ L ||y − x||_2

L is the Lipschitz constant

SLIDE 94

Lipschitz-continuous gradients

If ∇f is Lipschitz continuous with Lipschitz constant L,

||∇f(y) − ∇f(x)||_2 ≤ L ||y − x||_2

then for any x, y ∈ R^n we have a quadratic upper bound

f(y) ≤ f(x) + ∇f(x)ᵀ(y − x) + (L/2) ||y − x||_2²

SLIDE 95

Local progress of gradient descent

For x^(k+1) := x^(k) − α_k ∇f( x^(k) ),

f( x^(k+1) ) ≤ f( x^(k) ) + ∇f( x^(k) )ᵀ( x^(k+1) − x^(k) ) + (L/2) ||x^(k+1) − x^(k)||_2²
            = f( x^(k) ) − α_k ( 1 − α_k L / 2 ) ||∇f( x^(k) )||_2²

If α_k ≤ 1/L,

f( x^(k+1) ) ≤ f( x^(k) ) − (α_k / 2) ||∇f( x^(k) )||_2²

SLIDE 99

Convergence of gradient descent

Assume

◮ f is convex
◮ ∇f is L-Lipschitz continuous
◮ There exists a point x* at which f achieves a finite minimum
◮ The step size is set to α_k := α ≤ 1/L

Then

f( x^(k) ) − f(x*) ≤ ||x^(0) − x*||_2² / ( 2αk )

SLIDE 100

Convergence of gradient descent

By the local-progress bound,

f( x^(k) ) ≤ f( x^(k−1) ) − (α/2) ||∇f( x^(k−1) )||_2²

By the first-order characterization of convexity,

f( x^(k−1) ) + ∇f( x^(k−1) )ᵀ( x* − x^(k−1) ) ≤ f(x*)

Combining the two,

f( x^(k) ) − f(x*) ≤ f( x^(k−1) ) − f(x*) − (α/2) ||∇f( x^(k−1) )||_2²
                  ≤ ∇f( x^(k−1) )ᵀ( x^(k−1) − x* ) − (α/2) ||∇f( x^(k−1) )||_2²
                  = (1/(2α)) ( ||x^(k−1) − x*||_2² − ||x^(k−1) − x* − α∇f( x^(k−1) )||_2² )
                  = (1/(2α)) ( ||x^(k−1) − x*||_2² − ||x^(k) − x*||_2² )
SLIDE 105

Convergence of gradient descent

Since f( x^(i) ) − f(x*) never increases with i,

f( x^(k) ) − f(x*) ≤ (1/k) Σ_{i=1}^{k} ( f( x^(i) ) − f(x*) )
                  = (1/(2αk)) Σ_{i=1}^{k} ( ||x^(i−1) − x*||_2² − ||x^(i) − x*||_2² )
                  = (1/(2αk)) ( ||x^(0) − x*||_2² − ||x^(k) − x*||_2² )
                  ≤ ||x^(0) − x*||_2² / ( 2αk )

SLIDE 111

Accelerated gradient descent

◮ Gradient descent takes O(1/ε) iterations to achieve an error of ε
◮ The optimal rate is O(1/√ε)
◮ Gradient descent can be accelerated by adding a momentum term

SLIDE 112

Accelerated gradient descent

Set the initial point x^(0) to an arbitrary value

Update by setting

y^(k+1) = x^(k) − α_k ∇f( x^(k) )
x^(k+1) = β_k y^(k+1) + γ_k y^(k)

where α_k is the step size and β_k > 0 and γ_k > 0 are parameters
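The slides leave the parameter schedule unspecified; the sketch below uses the standard Nesterov momentum weight for strongly convex problems as an assumption, on an ill-conditioned quadratic where plain gradient descent is slow.

```python
import numpy as np

# Accelerated gradient descent with a constant Nesterov-style momentum
# weight (sqrt(kappa)-1)/(sqrt(kappa)+1), where kappa is the condition number.
kappa = 100.0
momentum = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)

def accelerated_gd(grad, x0, alpha, steps):
    x = np.asarray(x0, dtype=float)
    x_prev = x.copy()
    for _ in range(steps):
        v = x + momentum * (x - x_prev)   # look-ahead point with momentum
        x_prev = x
        x = v - alpha * grad(v)           # gradient step at the look-ahead
    return x

# Quadratic test problem f(x) = 0.5 x^T A x with condition number 100.
A = np.diag([100.0, 1.0])
grad = lambda x: A @ x
x_acc = accelerated_gd(grad, np.array([1.0, 1.0]), alpha=0.01, steps=300)
assert np.linalg.norm(x_acc) < 1e-3      # converged to the minimizer at 0
```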

SLIDE 113

Digit classification

MNIST data. Aim: determine whether a digit is a 5 or not

x_i is an image; y_i = 1 if image i is a 5 and y_i = 0 otherwise

We fit a logistic-regression model

SLIDE 114

Digit classification

[Figure: running time versus problem size for gradient descent and accelerated gradient descent]

SLIDE 115

Stochastic gradient descent

Cost functions used to fit models are often additive:

f(x) = (1/m) Σ_{i=1}^{m} f_i(x)

◮ Linear regression: Σ_{i=1}^{n} ( y^(i) − x^(i)ᵀβ )² = ||y − Xβ||_2²

◮ Logistic regression: Σ_{i=1}^{n} y^(i) log g( ⟨x^(i), β⟩ ) + ( 1 − y^(i) ) log( 1 − g( ⟨x^(i), β⟩ ) )

SLIDE 116

Stochastic gradient descent

In the big-data regime (very large n), gradient descent is too slow

In some cases, data are acquired sequentially (online setting)

Stochastic gradient descent: update the solution using a subset of the data

SLIDE 117

Stochastic gradient descent

Set the initial point x^(0) to an arbitrary value

Update by

1. Choosing a random subset of b indices B (b ≪ m is the batch size)
2. Setting

x^(k+1) := x^(k) − α_k Σ_{i∈B} ∇f_i( x^(k) )

where α_k is the step size
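The two steps above can be sketched for the least-squares cost from the earlier slides; an illustrative run (not from the slides; the data, batch size and step size are assumptions):

```python
import numpy as np

# Minibatch SGD: each step uses the gradient of a random batch B
# instead of the full sum over all examples.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true                        # noiseless targets

beta = np.zeros(3)
batch = 10
alpha = 0.01
for _ in range(3000):
    B = rng.choice(len(y), size=batch, replace=False)   # random batch of indices
    residual = y[B] - X[B] @ beta
    beta = beta + 2 * alpha * X[B].T @ residual         # stochastic update
assert np.allclose(beta, beta_true, atol=1e-2)
```

Because the data are noiseless, the stochastic gradient vanishes at the solution, so the iterates settle there rather than hovering around it.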

SLIDE 118

Stochastic gradient descent

We replace ∇f by

Σ_{i∈B} ∇f_i( x^(k) ) = Σ_{i=1}^{m} 1_{i∈B} ∇f_i( x^(k) )

a noisy estimate of ∇f

Unbiased (up to a constant factor) if every example is in the batch with probability p:

E( Σ_{i=1}^{m} 1_{i∈B} ∇f_i( x^(k) ) ) = Σ_{i=1}^{m} E( 1_{i∈B} ) ∇f_i( x^(k) )
                                       = Σ_{i=1}^{m} P( i ∈ B ) ∇f_i( x^(k) )
                                       = p m ∇f( x^(k) )
SLIDE 122

Stochastic gradient descent

◮ Linear regression:

β^(k+1) := β^(k) + 2α_k Σ_{i∈B} ( y^(i) − ⟨x^(i), β^(k)⟩ ) x^(i)

◮ Logistic regression:

β^(k+1) := β^(k) + α_k Σ_{i∈B} y^(i) ( 1 − g(⟨x^(i), β^(k)⟩) ) x^(i) − ( 1 − y^(i) ) g(⟨x^(i), β^(k)⟩) x^(i)

SLIDE 123

Digit classification

MNIST data. Aim: determine whether a digit is a 5 or not

x_i is an image; y_i = 1 if image i is a 5 and y_i = 0 otherwise

We fit a logistic-regression model

SLIDE 124

Digit classification

[Figure: training loss versus number of steps for gradient descent and SGD with batch sizes 1, 10, 100, 1000 and 10000]

SLIDE 125

Newton’s method

Motivation: convex functions are often almost quadratic, f ≈ f^2_x

Idea: iteratively minimize the quadratic approximation

f^2_x(y) := f(x) + ∇f(x)ᵀ(y − x) + (1/2) (y − x)ᵀ ∇²f(x) (y − x)

The minimum has a closed form:

arg min_{y ∈ R^n} f^2_x(y) = x − ∇²f(x)⁻¹ ∇f(x)

SLIDE 126

Proof

We have

∇f^2_x(y) = ∇f(x) + ∇²f(x)(y − x)

which equals zero if ∇²f(x)(y − x) = −∇f(x)

If the Hessian is positive definite, the only minimum of f^2_x is at x − ∇²f(x)⁻¹ ∇f(x)

SLIDE 127

Newton’s method

Set the initial point x^(0) to an arbitrary value

Update by setting

x^(k+1) := x^(k) − ∇²f( x^(k) )⁻¹ ∇f( x^(k) )

until a stopping criterion is met
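The Newton update can be sketched directly; an illustrative implementation (not from the slides) that solves the linear system instead of forming the inverse, verified on a quadratic where a single step reaches the minimum:

```python
import numpy as np

# Newton's method: x^(k+1) = x^(k) - Hessian^{-1} gradient.
def newton(grad, hess, x0, steps):
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - np.linalg.solve(hess(x), grad(x))   # solve, don't invert
    return x

# On f(x) = 0.5 x^T A x - b^T x the quadratic approximation is exact,
# so Newton converges in one step from any starting point.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda x: A @ x - b
hess = lambda x: A
x_hat = newton(grad, hess, x0=np.array([5.0, 5.0]), steps=1)
assert np.allclose(A @ x_hat, b)   # first-order optimality after one step
```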

SLIDE 128

Newton’s method

[Figure: the quadratic approximation minimized at each step]

SLIDE 129

Quadratic function

[Figure]

SLIDE 131

Convex function

[Figure]

SLIDE 132

Logistic regression

∂²f(β) / ∂β[j] ∂β[l] = − Σ_{i=1}^{n} g(⟨x^(i), β⟩) ( 1 − g(⟨x^(i), β⟩) ) x^(i)[j] x^(i)[l]

∇²f(β) = −XᵀG(β)X

The rows of X ∈ R^{n×p} contain x^(1), …, x^(n)

G(β) is a diagonal matrix such that

G(β)_ii := g(⟨x^(i), β⟩) ( 1 − g(⟨x^(i), β⟩) ),   1 ≤ i ≤ n

SLIDE 133

Logistic regression

Since ∇²f(β) = −XᵀG(β)X, the Newton updates are

β^(k+1) := β^(k) + ( XᵀG( β^(k) )X )⁻¹ ∇f( β^(k) )

Sanity check: the cost function is concave, since for any β, v ∈ R^p

vᵀ ∇²f(β) v = − Σ_{i=1}^{n} G(β)_ii ( (Xv)[i] )² ≤ 0