

  1. Modern Computational Statistics, Lecture 2: Optimization. Cheng Zhang, School of Mathematical Sciences, Peking University. September 11, 2019.

  2. Least Square Regression Models
  ◮ Consider the following least square problem: minimize L(β) = (1/2)‖Y − Xβ‖²
  ◮ Note that this is a quadratic problem, which can be solved by setting the gradient to zero: ∇_β L(β̂) = −X^T(Y − Xβ̂) = 0 ⇒ β̂ = (X^T X)^{−1} X^T Y, given that the Hessian is positive definite: ∇²L(β) = X^T X ≻ 0, which is true iff X has independent columns.
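The closed-form solution on this slide can be checked numerically. A minimal sketch with NumPy (the data here are simulated and the variable names are illustrative, not from the lecture):

```python
import numpy as np

# Simulated data (illustrative): X has independent columns, so X^T X is
# positive definite and the closed-form solution applies.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
beta_true = np.array([1.0, -2.0, 0.5])
Y = X @ beta_true + 0.1 * rng.normal(size=100)

# beta_hat = (X^T X)^{-1} X^T Y, computed via a linear solve rather than
# an explicit matrix inverse, for numerical stability.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
```

The same estimate is what `np.linalg.lstsq` returns for a full-column-rank design matrix.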

  3. Regularized Regression Models
  ◮ In practice, we would like to solve least square problems with some constraints on the parameters to control the complexity of the resulting model
  ◮ One common approach is to use Bridge regression models (Frank and Friedman, 1993): minimize L(β) = (1/2)‖Y − Xβ‖² subject to Σ_{j=1}^p |β_j|^γ ≤ s
  ◮ Two important special cases are ridge regression (Hoerl and Kennard, 1970), γ = 2, and the Lasso (Tibshirani, 1996), γ = 1
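For the γ = 2 case, the penalized (Lagrangian) form has a closed-form solution. A sketch, assuming the penalized rather than the constrained formulation (each constraint level s corresponds to some penalty λ; the function name is illustrative):

```python
import numpy as np

def ridge(X, Y, lam):
    """Ridge regression in penalized form:
    argmin_b (1/2)||Y - X b||^2 + (lam/2)||b||^2,
    with closed-form solution (X^T X + lam * I)^{-1} X^T Y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
```

Setting lam = 0 recovers ordinary least squares, while larger lam shrinks the coefficients toward zero, mirroring a tighter constraint level s.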

  4. General Optimization Problems
  ◮ In general, optimization problems take the following form:
  minimize f_0(x)
  subject to f_i(x) ≤ 0, i = 1, ..., m
  h_j(x) = 0, j = 1, ..., p
  ◮ We are mostly interested in convex optimization problems, where the objective function f_0(x) and the inequality constraint functions f_i(x) are convex and the equality constraint functions h_j(x) are affine.

  5. Convex Sets
  ◮ A set C is convex if the line segment between any two points in C also lies in C, i.e., θx_1 + (1 − θ)x_2 ∈ C, ∀ x_1, x_2 ∈ C, 0 ≤ θ ≤ 1
  ◮ If C is a convex set in R^n and f : R^n → R^m is an affine function, then f(C), i.e., the image of C, is also a convex set.

  6. Convex Functions
  ◮ A function f : R^n → R is convex if its domain D_f is a convex set and, ∀ x, y ∈ D_f and 0 ≤ θ ≤ 1, f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y)
  ◮ For example, norms are convex functions: ‖x‖_p = (Σ_i |x_i|^p)^{1/p}, p ≥ 1

  7. Convex Functions
  ◮ First-order condition. Suppose f is differentiable; then f is convex iff D_f is convex and f(y) ≥ f(x) + ∇f(x)^T (y − x), ∀ x, y ∈ D_f. Corollary (Jensen's inequality): for a convex function f, f(E(X)) ≤ E(f(X)) — take x = E(X), y = X, and take expectations of both sides.
  ◮ Second-order condition. Suppose f is twice differentiable; then f is convex iff D_f is convex and ∇²f(x) ⪰ 0, ∀ x ∈ D_f
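The defining inequality of convexity can be spot-checked numerically for a concrete function. A sketch for the Euclidean norm, on randomly drawn points (the sample sizes are arbitrary):

```python
import numpy as np

# Spot-check f(theta*x + (1-theta)*y) <= theta*f(x) + (1-theta)*f(y)
# for f(v) = ||v||_2, which is convex (a p-norm with p = 2 >= 1).
rng = np.random.default_rng(0)
f = lambda v: np.linalg.norm(v, 2)

holds = True
for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    theta = rng.uniform()
    lhs = f(theta * x + (1 - theta) * y)
    rhs = theta * f(x) + (1 - theta) * f(y)
    holds &= lhs <= rhs + 1e-12   # small tolerance for floating-point error
```

Such a check can refute convexity (one violated inequality suffices) but of course cannot prove it.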

  8. Basic Terminology and Notation
  ◮ Optimal value: p* = inf { f_0(x) | f_i(x) ≤ 0, h_j(x) = 0 }
  ◮ x is feasible if x ∈ D = ∩_{i=0}^m D_{f_i} ∩ ∩_{j=1}^p D_{h_j} and satisfies the constraints.
  ◮ A feasible x* is optimal if f_0(x*) = p*
  ◮ Optimality criterion. Assuming f_0 is convex and differentiable, a feasible x is optimal iff ∇f_0(x)^T (y − x) ≥ 0, ∀ feasible y. Remark: for unconstrained problems, x is optimal iff ∇f_0(x) = 0

  9. Basic Terminology and Notation
  Local optimality: x is locally optimal if for some R > 0 it is optimal for
  minimize f_0(z)
  subject to f_i(z) ≤ 0, i = 1, ..., m
  h_j(z) = 0, j = 1, ..., p
  ‖z − x‖ ≤ R
  In convex optimization problems, any locally optimal point is also globally optimal.

  10. The Lagrangian
  ◮ Consider a general optimization problem:
  minimize f_0(x)
  subject to f_i(x) ≤ 0, i = 1, ..., m
  h_j(x) = 0, j = 1, ..., p
  ◮ To take the constraints into account, we augment the objective function with a weighted sum of the constraints and define the Lagrangian L : R^n × R^m × R^p → R as
  L(x, λ, ν) = f_0(x) + Σ_{i=1}^m λ_i f_i(x) + Σ_{j=1}^p ν_j h_j(x)
  where λ and ν are the dual variables, or Lagrange multipliers.

  11. The Lagrangian Dual Function
  ◮ We define the Lagrangian dual function as g(λ, ν) = inf_{x ∈ D} L(x, λ, ν)
  ◮ As the pointwise infimum of a family of affine functions of (λ, ν), the dual function is concave, even when the original problem is not convex.
  ◮ If λ ≥ 0, then for each feasible point x̃: g(λ, ν) = inf_{x ∈ D} L(x, λ, ν) ≤ L(x̃, λ, ν) ≤ f_0(x̃)
  ◮ Therefore, g(λ, ν) is a lower bound on the optimal value: g(λ, ν) ≤ p*, ∀ λ ≥ 0, ν ∈ R^p

  12. The Lagrangian Dual Problem
  ◮ Finding the best lower bound leads to the Lagrangian dual problem: maximize g(λ, ν) subject to λ ≥ 0
  ◮ The above problem is a convex optimization problem.
  ◮ We denote the optimal value by d* and call the corresponding solution (λ*, ν*) the dual optimal point
  ◮ In contrast, the original problem is called the primal problem, whose solution x* is called the primal optimal point

  13. Weak vs. Strong Duality
  ◮ d* is the best lower bound on p* that can be obtained from the Lagrangian dual function.
  ◮ Weak duality: d* ≤ p*
  ◮ The difference p* − d* is called the optimal duality gap
  ◮ Strong duality: d* = p*
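A concrete toy example (illustrative, not from the slides) makes both notions tangible: minimize x² subject to 1 − x ≤ 0. The Lagrangian is L(x, λ) = x² + λ(1 − x), whose infimum over x is attained at x = λ/2, giving g(λ) = λ − λ²/4:

```python
import numpy as np

# Toy problem: minimize x^2 subject to 1 - x <= 0; primal optimum p* = 1
# at x = 1. Dual function: g(lam) = lam - lam^2 / 4 for lam >= 0.
lams = np.linspace(0.0, 4.0, 401)
g = lams - lams**2 / 4

p_star = 1.0       # primal optimum, attained at x = 1
d_star = g.max()   # dual optimum, attained at lam = 2
```

Every value of g lower-bounds p* (weak duality), and here the best bound is tight, d* = p* (strong duality; this problem is convex and x = 2 is strictly feasible).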

  14. Slater's Condition
  ◮ Strong duality doesn't hold in general, but if the primal problem is convex, it usually holds under some conditions called constraint qualifications
  ◮ A simple and well-known constraint qualification is Slater's condition: there exists an x in the relative interior of D such that f_i(x) < 0, i = 1, ..., m, and Ax = b

  15. Complementary Slackness
  ◮ Consider a primal optimal x* and a dual optimal (λ*, ν*)
  ◮ If strong duality holds:
  f_0(x*) = g(λ*, ν*)
  = inf_x ( f_0(x) + Σ_{i=1}^m λ*_i f_i(x) + Σ_{j=1}^p ν*_j h_j(x) )
  ≤ f_0(x*) + Σ_{i=1}^m λ*_i f_i(x*) + Σ_{j=1}^p ν*_j h_j(x*)
  ≤ f_0(x*)
  where the last inequality uses h_j(x*) = 0 and λ*_i f_i(x*) ≤ 0
  ◮ Therefore, these are all equalities

  16. Complementary Slackness
  ◮ Important conclusions:
  ◮ x* minimizes L(x, λ*, ν*)
  ◮ λ*_i f_i(x*) = 0, i = 1, ..., m
  ◮ The latter is called complementary slackness, which indicates
  λ*_i > 0 ⇒ f_i(x*) = 0
  f_i(x*) < 0 ⇒ λ*_i = 0
  ◮ When the dual problem is easier to solve, we can find (λ*, ν*) and then minimize L(x, λ*, ν*). If the resulting solution is primal feasible, then it is primal optimal.
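The recipe in the last bullet can be sketched on a toy problem (illustrative, not from the slides): minimize x² subject to f1(x) = 1 − x ≤ 0, whose dual optimum is λ* = 2. We minimize the Lagrangian at λ* over a grid, then check feasibility and slackness:

```python
import numpy as np

# Toy problem: minimize x^2 subject to f1(x) = 1 - x <= 0,
# with dual optimal lam_star = 2 (maximizing g(lam) = lam - lam^2 / 4).
lam_star = 2.0
xs = np.linspace(-3.0, 3.0, 6001)
L = xs**2 + lam_star * (1.0 - xs)   # Lagrangian L(x, lam_star)
x_star = xs[np.argmin(L)]           # grid minimizer of the Lagrangian

slack = lam_star * (1.0 - x_star)   # complementary slackness term lam* f1(x*)
```

The minimizer is x* = 1, which is primal feasible and hence primal optimal; since λ* > 0, the constraint is active and the slackness term vanishes.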

  17. Entropy Maximization
  ◮ Consider the entropy maximization problem:
  minimize f_0(x) = Σ_{i=1}^n x_i log x_i
  subject to −x_i ≤ 0, i = 1, ..., n
  Σ_{i=1}^n x_i = 1
  ◮ Lagrangian:
  L(x, λ, ν) = Σ_{i=1}^n x_i log x_i − Σ_{i=1}^n λ_i x_i + ν (Σ_{i=1}^n x_i − 1)
  ◮ We minimize L(x, λ, ν) by setting ∂L/∂x_i to zero: log x̂_i + 1 − λ_i + ν = 0 ⇒ x̂_i = exp(λ_i − ν − 1)

  18. Entropy Maximization
  ◮ The dual function is g(λ, ν) = −Σ_{i=1}^n exp(λ_i − ν − 1) − ν
  ◮ Dual problem: maximize g(λ, ν) = −exp(−ν − 1) Σ_{i=1}^n exp(λ_i) − ν, subject to λ ≥ 0
  ◮ Since g is decreasing in each λ_i, we find the dual optimal
  λ*_i = 0, i = 1, ..., n
  ν* = −1 + log n

  19. Entropy Maximization
  ◮ We now minimize L(x, λ*, ν*): log x*_i + 1 − λ*_i + ν* = 0 ⇒ x*_i = exp(λ*_i − ν* − 1) = 1/n
  ◮ Therefore, the discrete probability distribution that has maximum entropy is the uniform distribution
  Exercise: Show that X ∼ N(µ, σ²) is the maximum entropy distribution such that EX = µ and EX² = µ² + σ². How about fixing the first k moments at EX^i = m_i, i = 1, ..., k?
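The conclusion is easy to sanity-check numerically: the uniform distribution attains entropy log n, and random points on the probability simplex fall below it. A sketch (the Dirichlet sampling is just one convenient way to draw random distributions):

```python
import numpy as np

def entropy(x):
    """Shannon entropy -sum_i x_i log x_i (the negative of the objective above)."""
    return -np.sum(x * np.log(x))

n = 10
uniform = np.full(n, 1.0 / n)   # the claimed maximizer; entropy is log(n)

# Compare against random distributions drawn from the probability simplex.
rng = np.random.default_rng(0)
best_random = max(entropy(rng.dirichlet(np.ones(n))) for _ in range(1000))
```

None of the random draws exceed the uniform distribution's entropy, consistent with the derivation above.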

  20. Karush-Kuhn-Tucker (KKT) Conditions
  ◮ Suppose the functions f_0, f_1, ..., f_m, h_1, ..., h_p are all differentiable, and x* and (λ*, ν*) are primal and dual optimal points with zero duality gap
  ◮ Since x* minimizes L(x, λ*, ν*), the gradient vanishes at x*:
  ∇f_0(x*) + Σ_{i=1}^m λ*_i ∇f_i(x*) + Σ_{j=1}^p ν*_j ∇h_j(x*) = 0
  ◮ Additionally:
  f_i(x*) ≤ 0, i = 1, ..., m
  h_j(x*) = 0, j = 1, ..., p
  λ*_i ≥ 0, i = 1, ..., m
  λ*_i f_i(x*) = 0, i = 1, ..., m
  ◮ These are called the Karush-Kuhn-Tucker (KKT) conditions

  21. KKT Conditions for Convex Problems
  ◮ When the primal problem is convex, the KKT conditions are also sufficient for the points to be primal and dual optimal with zero duality gap.
  ◮ Let x̃, λ̃, ν̃ be any points that satisfy the KKT conditions; then x̃ is primal feasible and minimizes L(x, λ̃, ν̃), so
  g(λ̃, ν̃) = L(x̃, λ̃, ν̃) = f_0(x̃) + Σ_{i=1}^m λ̃_i f_i(x̃) + Σ_{j=1}^p ν̃_j h_j(x̃) = f_0(x̃)
  ◮ Therefore, for convex optimization problems with differentiable functions that satisfy Slater's condition, the KKT conditions are necessary and sufficient

  22. Example
  ◮ Consider the following problem:
  minimize (1/2) x^T P x + q^T x + r, with P ⪰ 0
  subject to Ax = b
  ◮ KKT conditions:
  P x* + q + A^T ν* = 0
  A x* = b
  ◮ To find x* and ν*, we can solve the above system of linear equations
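The two KKT conditions stack into a single linear system. A sketch with illustrative problem data (the particular P, q, A, b below are made up; any positive definite P and full-row-rank A work):

```python
import numpy as np

# Illustrative problem data: P positive definite, one equality constraint.
P = np.array([[2.0, 0.5],
              [0.5, 1.0]])
q = np.array([1.0, -1.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

# Stack the KKT conditions into one linear system:
#   [ P  A^T ] [ x  ]   [ -q ]
#   [ A   0  ] [ nu ] = [  b ]
n, m = P.shape[0], A.shape[0]
K = np.block([[P, A.T], [A, np.zeros((m, m))]])
sol = np.linalg.solve(K, np.concatenate([-q, b]))
x_star, nu_star = sol[:n], sol[n:]
```

The block matrix K is nonsingular whenever P is positive definite on the nullspace of A and A has full row rank, so the solve recovers both the primal and dual optimal points at once.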

  23. Descent Methods
  ◮ We now focus on numerical solutions for unconstrained optimization problems: minimize f(x), where f : R^n → R is twice differentiable
  ◮ Descent method. We set up a sequence x^(k+1) = x^(k) + t^(k) Δx^(k), t^(k) > 0, such that f(x^(k+1)) < f(x^(k)), k = 0, 1, ...
  ◮ Δx^(k) is called the search direction; t^(k) is called the step size, or learning rate in machine learning.
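The simplest instance of this scheme is gradient descent, with search direction Δx^(k) = −∇f(x^(k)) and a fixed step size. A sketch on a quadratic objective (the step size and iteration count are illustrative choices, not prescribed by the slides):

```python
import numpy as np

def gradient_descent(grad, x0, t=0.1, iters=200):
    """Fixed-step descent: x_{k+1} = x_k + t * dx_k, with dx_k = -grad(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - t * grad(x)
    return x

# Example: f(x) = (1/2) x^T P x with gradient P x; the unique minimizer is 0.
P = np.array([[3.0, 0.0],
              [0.0, 1.0]])
x_min = gradient_descent(lambda x: P @ x, x0=[5.0, -2.0])
```

A fixed step works here because t is below 2/L, where L = 3 is the largest eigenvalue of P; practical descent methods instead choose t^(k) adaptively, e.g. by line search.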
