SLIDE 1
Karush-Kuhn-Tucker conditions
Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725
SLIDE 2 Remember duality
Given a minimization problem
    min_{x∈R^n} f(x)
    subject to h_i(x) ≤ 0, i = 1, …, m
               ℓ_j(x) = 0, j = 1, …, r
we defined the Lagrangian:
    L(x, u, v) = f(x) + ∑_{i=1}^m u_i h_i(x) + ∑_{j=1}^r v_j ℓ_j(x)
and the Lagrange dual function:
    g(u, v) = min_{x∈R^n} L(x, u, v)
SLIDE 3
The subsequent dual problem is:
    max_{u∈R^m, v∈R^r} g(u, v)
    subject to u ≥ 0
Important properties:
- Dual problem is always convex, i.e., g is always concave (even if primal problem is not convex)
- The primal and dual optimal values, f⋆ and g⋆, always satisfy weak duality: f⋆ ≥ g⋆
- Slater's condition: for a convex primal, if there is an x such that h_1(x) < 0, …, h_m(x) < 0 and ℓ_1(x) = 0, …, ℓ_r(x) = 0, then strong duality holds: f⋆ = g⋆. (Can be further refined to requiring strict inequalities only over the nonaffine h_i, i = 1, …, m)
SLIDE 4
Duality gap
Given primal feasible x and dual feasible u, v, the quantity
    f(x) − g(u, v)
is called the duality gap between x and u, v. Note that
    f(x) − f⋆ ≤ f(x) − g(u, v)
so if the duality gap is zero, then x is primal optimal (and similarly, u, v are dual optimal).
From an algorithmic viewpoint, this provides a stopping criterion: if f(x) − g(u, v) ≤ ε, then we are guaranteed that f(x) − f⋆ ≤ ε. Very useful, especially in conjunction with iterative methods ... more dual uses in coming lectures
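For concreteness, a minimal sketch of this stopping rule as it would sit inside an iterative primal-dual method; the callables f and g here are hypothetical stand-ins for the primal objective and the Lagrange dual function:

```python
def duality_gap_converged(f, g, x, u, v, eps=1e-6):
    """Stop when the duality gap certifies eps-suboptimality.

    f(x) - g(u, v) upper-bounds f(x) - f_star, so a small gap
    guarantees x is within eps of the primal optimum.
    """
    return f(x) - g(u, v) <= eps
```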
SLIDE 5 Dual norms
Let ‖x‖ be a norm, e.g.,
- ℓ_p norm: ‖x‖_p = (∑_{i=1}^n |x_i|^p)^{1/p}, for p ≥ 1
- nuclear norm: ‖X‖_nuc = ∑_i σ_i(X)
We define its dual norm ‖x‖_∗ as
    ‖x‖_∗ = max_{‖z‖≤1} zᵀx
This gives us the inequality |zᵀx| ≤ ‖z‖ ‖x‖_∗, like Cauchy-Schwarz. Back to our examples:
- ℓ_p norm dual: (‖x‖_p)_∗ = ‖x‖_q, where 1/p + 1/q = 1
- nuclear norm dual: (‖X‖_nuc)_∗ = ‖X‖_spec = σ_max(X)
Dual norm of dual norm: it turns out that ‖x‖_∗∗ = ‖x‖ ... connections to duality (including this one) in coming lectures
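A quick numeric sanity check of this generalized Cauchy-Schwarz inequality for ℓ_p/ℓ_q dual pairs; a sketch with numpy, not specific to the lecture material:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10)
z = rng.standard_normal(10)

# Dual pairs (p, q) with 1/p + 1/q = 1; check |z^T x| <= ||z||_p ||x||_q
for p, q in [(1.0, np.inf), (2.0, 2.0), (3.0, 1.5)]:
    lhs = abs(z @ x)
    rhs = np.linalg.norm(z, ord=p) * np.linalg.norm(x, ord=q)
    assert lhs <= rhs + 1e-12, (p, q)
```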
SLIDE 6 Outline
Today:
- KKT conditions
- Examples
- Constrained and Lagrange forms
- Uniqueness with 1-norm penalties
SLIDE 7 Karush-Kuhn-Tucker conditions
Given general problem
    min_{x∈R^n} f(x)
    subject to h_i(x) ≤ 0, i = 1, …, m
               ℓ_j(x) = 0, j = 1, …, r
The Karush-Kuhn-Tucker conditions or KKT conditions are:
- 0 ∈ ∂f(x) + ∑_{i=1}^m u_i ∂h_i(x) + ∑_{j=1}^r v_j ∂ℓ_j(x)   (stationarity)
- u_i · h_i(x) = 0 for all i   (complementary slackness)
- h_i(x) ≤ 0, ℓ_j(x) = 0 for all i, j   (primal feasibility)
- u_i ≥ 0 for all i   (dual feasibility)
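As an illustration, a small numpy sketch that measures how far a candidate point and multipliers are from satisfying these four conditions in the smooth case (where ∂f(x) = {∇f(x)}); the function names and signatures are ours, not the course's:

```python
import numpy as np

def kkt_residuals(grad_f, hs, grad_hs, ls, grad_ls, x, u, v):
    """Residuals of the four KKT conditions for a smooth problem.

    hs/ls are lists of constraint functions h_i, l_j; grad_hs/grad_ls
    their gradients; u, v the candidate multipliers. All residuals are
    zero exactly when (x, u, v) satisfies the KKT conditions.
    """
    stationarity = grad_f(x) \
        + sum(ui * gh(x) for ui, gh in zip(u, grad_hs)) \
        + sum(vj * gl(x) for vj, gl in zip(v, grad_ls))
    comp_slack = np.array([ui * hi(x) for ui, hi in zip(u, hs)])
    primal_ineq = np.array([max(0.0, hi(x)) for hi in hs])  # violation of h_i <= 0
    primal_eq = np.array([lj(x) for lj in ls])
    dual_feas = np.minimum(np.asarray(u), 0.0)              # violation of u >= 0
    return stationarity, comp_slack, primal_ineq, primal_eq, dual_feas
```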
SLIDE 8 Necessity
Let x⋆ and u⋆, v⋆ be primal and dual solutions with zero duality gap (strong duality holds, e.g., under Slater's condition). Then
    f(x⋆) = g(u⋆, v⋆)
          = min_{x∈R^n} f(x) + ∑_{i=1}^m u⋆_i h_i(x) + ∑_{j=1}^r v⋆_j ℓ_j(x)
          ≤ f(x⋆) + ∑_{i=1}^m u⋆_i h_i(x⋆) + ∑_{j=1}^r v⋆_j ℓ_j(x⋆)
          ≤ f(x⋆)
In other words, all these inequalities are actually equalities
SLIDE 9
Two things to learn from this:
- The point x⋆ minimizes L(x, u⋆, v⋆) over x ∈ R^n. Hence the subdifferential of L(x, u⋆, v⋆) must contain 0 at x = x⋆; this is exactly the stationarity condition
- We must have ∑_{i=1}^m u⋆_i h_i(x⋆) = 0, and since each term here is ≤ 0, this implies u⋆_i h_i(x⋆) = 0 for every i; this is exactly complementary slackness
Primal and dual feasibility obviously hold. Hence, we've shown:
    If x⋆ and u⋆, v⋆ are primal and dual solutions, with zero duality gap, then x⋆, u⋆, v⋆ satisfy the KKT conditions
(Note that this statement assumes nothing a priori about convexity of our problem, i.e., of f, h_i, ℓ_j)
SLIDE 10 Sufficiency
If there exist x⋆, u⋆, v⋆ that satisfy the KKT conditions, then
    g(u⋆, v⋆) = f(x⋆) + ∑_{i=1}^m u⋆_i h_i(x⋆) + ∑_{j=1}^r v⋆_j ℓ_j(x⋆)
              = f(x⋆)
where the first equality holds from stationarity, and the second holds from complementary slackness.
Therefore the duality gap is zero (and x⋆ and u⋆, v⋆ are primal and dual feasible), so x⋆ and u⋆, v⋆ are primal and dual optimal. I.e., we've shown:
    If x⋆ and u⋆, v⋆ satisfy the KKT conditions, then x⋆ and u⋆, v⋆ are primal and dual solutions
SLIDE 11 Putting it together
In summary, KKT conditions:
- always sufficient
- necessary under strong duality
Putting it together:
    For a problem with strong duality (e.g., assume Slater's condition: convex problem and there exists x strictly satisfying the nonaffine inequality constraints),
    x⋆ and u⋆, v⋆ are primal and dual solutions
    ⇔ x⋆ and u⋆, v⋆ satisfy the KKT conditions
(Warning, concerning the stationarity condition: for a differentiable function f, we cannot use ∂f(x) = {∇f(x)} unless f is convex)
SLIDE 12 What's in a name?
Older folks will know these as the KT (Kuhn-Tucker) conditions:
- First appeared in publication by Kuhn and Tucker in 1951
- Later people found out that Karush had the conditions in his unpublished master's thesis of 1939
Many people (including the instructor!) use the term KKT conditions for unconstrained problems, i.e., to refer to the stationarity condition.
Note that we could have alternatively derived the KKT conditions from studying optimality entirely via subgradients:
    0 ∈ ∂f(x⋆) + ∑_{i=1}^m N_{h_i≤0}(x⋆) + ∑_{j=1}^r N_{ℓ_j=0}(x⋆)
where recall N_C(x) is the normal cone of C at x
SLIDE 13 Quadratic with equality constraints
Consider, for Q ⪰ 0,
    min_{x∈R^n} (1/2)xᵀQx + cᵀx
    subject to Ax = 0
E.g., as in the Newton step for min_{x∈R^n} f(x) subject to Ax = b.
Convex problem, no inequality constraints, so by the KKT conditions: x is a solution if and only if
    [ Q  Aᵀ ] [ x ]   [ −c ]
    [ A  0  ] [ u ] = [  0 ]
for some u. This linear system combines stationarity and primal feasibility (complementary slackness and dual feasibility are vacuous)
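A minimal numpy sketch of solving this block KKT system directly (assuming the system is nonsingular; the helper name is ours):

```python
import numpy as np

def eq_constrained_qp(Q, c, A):
    """Solve min 0.5 x^T Q x + c^T x subject to Ax = 0
    via the KKT system  [Q A^T; A 0][x; u] = [-c; 0]."""
    n, m = Q.shape[0], A.shape[0]
    K = np.block([[Q, A.T],
                  [A, np.zeros((m, m))]])
    rhs = np.concatenate([-c, np.zeros(m)])
    sol = np.linalg.solve(K, rhs)       # assumes K is nonsingular
    return sol[:n], sol[n:]             # primal solution x, dual variable u
```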
SLIDE 14 Water-filling
Example from B & V page 245: consider the problem
    min_{x∈R^n} −∑_{i=1}^n log(α_i + x_i)
    subject to x ≥ 0, 1ᵀx = 1
Information theory: think of log(α_i + x_i) as the communication rate of the ith channel.
KKT conditions:
    −1/(α_i + x_i) − u_i + v = 0, i = 1, …, n
    u_i · x_i = 0, i = 1, …, n;  x ≥ 0, 1ᵀx = 1, u ≥ 0
Eliminate u:
    1/(α_i + x_i) ≤ v, i = 1, …, n
    x_i (v − 1/(α_i + x_i)) = 0, i = 1, …, n;  x ≥ 0, 1ᵀx = 1
SLIDE 15
Can argue directly that stationarity and complementary slackness imply
    x_i = { 1/v − α_i   if v ≤ 1/α_i
          { 0           if v > 1/α_i
        = max{0, 1/v − α_i},  i = 1, …, n
Still need x to be feasible, i.e., 1ᵀx = 1, and this gives
    ∑_{i=1}^n max{0, 1/v − α_i} = 1
Univariate equation, piecewise linear in 1/v and not hard to solve. This reduced problem is called water-filling. (From B & V page 246)
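A sketch of solving this univariate equation numerically: we bisect on w = 1/v, since the left-hand side is piecewise linear and nondecreasing in w (function name ours):

```python
import numpy as np

def water_filling(alpha, tol=1e-12):
    """Solve sum_i max(0, w - alpha_i) = 1 for w = 1/v, return the optimal x."""
    alpha = np.asarray(alpha, dtype=float)
    total = lambda w: np.maximum(0.0, w - alpha).sum()
    lo, hi = alpha.min(), alpha.min() + 1.0    # total(lo) = 0, total(hi) >= 1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if total(mid) < 1.0 else (lo, mid)
    w = 0.5 * (lo + hi)
    return np.maximum(0.0, w - alpha)          # x_i = max{0, 1/v - alpha_i}

# Example: x = water_filling([0.6, 1.0, 2.0]); x sums to 1, favoring low-alpha channels
```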
SLIDE 16 Lasso
Let's return to the lasso problem: given response y ∈ R^n, predictors A ∈ R^{n×p} (columns A_1, …, A_p), solve
    min_{x∈R^p} (1/2)‖y − Ax‖² + λ‖x‖_1
KKT conditions:
    Aᵀ(y − Ax) = λs
where s ∈ ∂‖x‖_1, i.e.,
    s_i ∈ { {1}       if x_i > 0
          { {−1}      if x_i < 0
          { [−1, 1]   if x_i = 0
Now we can read off an important fact: if |A_iᵀ(y − Ax)| < λ, then x_i = 0
... we'll return to this problem shortly
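A sketch of checking these conditions numerically at a candidate solution (the tolerance handling and function name are ours):

```python
import numpy as np

def lasso_kkt_holds(A, y, x, lam, tol=1e-8):
    """Check A^T(y - Ax) = lam * s for some s in the subdifferential of ||x||_1."""
    g = A.T @ (y - A @ x)
    active = x != 0
    on_active = np.allclose(g[active], lam * np.sign(x[active]), atol=tol)
    on_inactive = np.all(np.abs(g[~active]) <= lam + tol)   # |A_i^T(y - Ax)| <= lam
    return on_active and on_inactive
```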
SLIDE 17 Group lasso
Suppose the predictors A = [A_(1) A_(2) … A_(G)] are split up into groups, with each A_(i) ∈ R^{n×p_(i)}. If we want to select entire groups rather than individual predictors, then we solve the group lasso problem:
    min_{x=(x_(1),…,x_(G))∈R^p} (1/2)‖y − Ax‖² + λ ∑_{i=1}^G √p_(i) ‖x_(i)‖_2
(From Yuan and Lin (2006), "Model selection and estimation in regression with grouped variables")
SLIDE 18
KKT conditions:
    A_(i)ᵀ(y − Ax) = λ√p_(i) s_(i),  i = 1, …, G
where each s_(i) ∈ ∂‖x_(i)‖_2, i.e.,
    s_(i) ∈ { {x_(i)/‖x_(i)‖_2}               if x_(i) ≠ 0
            { {z ∈ R^{p_(i)} : ‖z‖_2 ≤ 1}     if x_(i) = 0
    , i = 1, …, G
Hence if ‖A_(i)ᵀ(y − Ax)‖_2 < λ√p_(i), then x_(i) = 0. On the other hand, if x_(i) ≠ 0, then
    x_(i) = (A_(i)ᵀA_(i) + (λ√p_(i)/‖x_(i)‖_2) · I)^{−1} A_(i)ᵀ r_{−(i)}
where r_{−(i)} = y − ∑_{j≠i} A_(j) x_(j)
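A short numpy sketch of the blockwise zero test at a candidate solution, with the group structure passed as lists (names ours):

```python
import numpy as np

def groups_forced_to_zero(A_groups, x_groups, y, lam):
    """For each group i, test ||A_(i)^T (y - Ax)||_2 < lam * sqrt(p_(i)),
    which by the KKT conditions forces x_(i) = 0."""
    residual = y - sum(Ai @ xi for Ai, xi in zip(A_groups, x_groups))
    return [np.linalg.norm(Ai.T @ residual) < lam * np.sqrt(Ai.shape[1])
            for Ai in A_groups]
```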
SLIDE 19 Constrained and Lagrange forms
Often in statistics and machine learning we'll switch back and forth between the constrained form, where t ∈ R is a tuning parameter,
    min_{x∈R^n} f(x) subject to h(x) ≤ t        (C)
and the Lagrange form, where λ ≥ 0 is a tuning parameter,
    min_{x∈R^n} f(x) + λ·h(x)                   (L)
and claim these are equivalent. Is this true (assuming convex f, h)?
(C) to (L): if problem (C) is strictly feasible, then strong duality holds, and there exists some λ ≥ 0 (a dual solution) such that any solution x⋆ in (C) minimizes
    f(x) + λ·(h(x) − t)
so x⋆ is also a solution in (L)
SLIDE 20
(L) to (C): if x⋆ is a solution in (L), then the KKT conditions for (C) are satisfied by taking t = h(x⋆), so x⋆ is a solution in (C)
Conclusion:
    ⋃_{λ≥0} {solutions in (L)} ⊆ ⋃_t {solutions in (C)}
    ⋃_{λ≥0} {solutions in (L)} ⊇ ⋃_{t : (C) strictly feasible} {solutions in (C)}
Strictly speaking this is not a perfect equivalence (albeit a minor nonequivalence). Note: when the only value of t that leads to a feasible but not strictly feasible constraint set is t = 0, i.e.,
    {x : h(x) ≤ t} ≠ ∅, {x : h(x) < t} = ∅  ⇒  t = 0
(e.g., this is true if h is a norm), then we do get perfect equivalence
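To make the correspondence concrete, a small sketch under an orthonormal-design assumption (AᵀA = I), where the Lagrange-form lasso solution is available in closed form by soft-thresholding; each λ then maps to the constraint level t = ‖x⋆(λ)‖_1 at which (C) has the same solution. This setup is our own illustration, not from the slides:

```python
import numpy as np

def soft_threshold(z, lam):
    """Closed-form lasso solution when A^T A = I: x_i = S_lam((A^T y)_i)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

rng = np.random.default_rng(0)
A, _ = np.linalg.qr(rng.standard_normal((20, 5)))   # orthonormal columns
y = rng.standard_normal(20)

for lam in [0.1, 0.3, 1.0]:
    x_star = soft_threshold(A.T @ y, lam)
    t = np.abs(x_star).sum()   # constraint level with h(x) = ||x||_1
    print(f"lambda = {lam:4.1f}  ->  t = {t:.4f}")  # (L) solution also solves (C) at this t
```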
SLIDE 21 Uniqueness in 1-norm penalized problems
Using the KKT conditions and simple probability arguments, we can produce the following (perhaps surprising) result:
    Theorem: Let f be differentiable and strictly convex, let A ∈ R^{n×p}, λ > 0. Consider
        min_{x∈R^p} f(Ax) + λ‖x‖_1
    If the entries of A are drawn from a continuous probability distribution (on R^{np}), then with probability 1 the solution x⋆ ∈ R^p is unique and has at most min{n, p} nonzero components
Remark: here f must be strictly convex, but there are no restrictions on the dimensions of A (we could have p ≫ n)
Proof: the KKT conditions are
    −Aᵀ∇f(Ax) = λs,  s_i ∈ { {sign(x_i)}  if x_i ≠ 0
                             { [−1, 1]    if x_i = 0
    , i = 1, …, p
SLIDE 22
Note that Ax and s are unique. Define
    S = {j : |A_jᵀ∇f(Ax)| = λ},
also unique, and note that any solution satisfies x_i = 0 for all i ∉ S.
First assume that rank(A_S) < |S| (here A_S ∈ R^{n×|S|} is the submatrix of A corresponding to columns in S). Then for some i ∈ S,
    A_i = ∑_{j∈S\{i}} c_j A_j
for constants c_j ∈ R, hence
    s_i A_i = ∑_{j∈S\{i}} (s_i s_j c_j) · (s_j A_j)
Taking an inner product with −∇f(Ax),
    λ = ∑_{j∈S\{i}} (s_i s_j c_j) λ,  i.e.,  ∑_{j∈S\{i}} s_i s_j c_j = 1
SLIDE 23
In other words, we've proved that rank(A_S) < |S| implies s_i A_i is in the affine span of s_j A_j, j ∈ S \ {i} (a subspace of dimension < n)
We say that the matrix A has columns in general position if any affine subspace L of dimension k < n contains no more than k + 1 elements of {±A_1, …, ±A_p} (excluding antipodal pairs)
It is straightforward to show that, if the entries of A have a density over R^{np}, then A is in general position with probability 1
[Figure: points A_1, A_2, A_3, A_4 illustrating columns in general position]
SLIDE 24
Therefore, if the entries of A are drawn from a continuous probability distribution, any solution must satisfy rank(A_S) = |S|
Recalling the KKT conditions, this means the number of nonzero components in any solution is ≤ |S| ≤ min{n, p}
Furthermore, we can reduce our optimization problem (by partially solving) to
    min_{x_S∈R^{|S|}} f(A_S x_S) + λ‖x_S‖_1
Finally, strict convexity implies uniqueness of the solution in this reduced problem, and hence in our original problem
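A quick empirical illustration of the theorem for the lasso case f(z) = (1/2)‖y − z‖²; the coordinate-descent solver below is a generic textbook implementation, not the course's code:

```python
import numpy as np

def lasso_cd(A, y, lam, n_sweeps=2000):
    """Cyclic coordinate descent for min_x 0.5||y - Ax||^2 + lam*||x||_1."""
    n, p = A.shape
    x, r = np.zeros(p), y.copy()        # r tracks the residual y - Ax
    col_sq = (A ** 2).sum(axis=0)       # ||A_j||^2 for each column
    for _ in range(n_sweeps):
        for j in range(p):
            r += A[:, j] * x[j]         # remove coordinate j's contribution
            zj = A[:, j] @ r
            x[j] = np.sign(zj) * max(abs(zj) - lam, 0.0) / col_sq[j]
            r -= A[:, j] * x[j]
    return x

rng = np.random.default_rng(0)
n, p = 10, 50                           # p >> n: theorem still gives a unique, sparse solution
A = rng.standard_normal((n, p))         # continuous distribution => general position w.p. 1
y = rng.standard_normal(n)
x_star = lasso_cd(A, y, lam=0.5)
print(np.count_nonzero(np.abs(x_star) > 1e-10), "<=", min(n, p))
```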
SLIDE 25 Back to duality
One of the most important uses of duality is that, under strong duality, we can characterize primal solutions from dual solutions.
Recall that under strong duality, the KKT conditions are necessary for optimality. Given dual solutions u⋆, v⋆, any primal solution x⋆ satisfies the stationarity condition
    0 ∈ ∂f(x⋆) + ∑_{i=1}^m u⋆_i ∂h_i(x⋆) + ∑_{j=1}^r v⋆_j ∂ℓ_j(x⋆)
In other words, x⋆ achieves the minimum in min_{x∈R^n} L(x, u⋆, v⋆)
- Generally, this reveals a characterization of primal solutions
- In particular, if this is satisfied uniquely (i.e., the above problem has a unique minimizer), then the corresponding point must be the primal solution
SLIDE 26 References
- S. Boyd and L. Vandenberghe (2004), Convex Optimization, Cambridge University Press, Chapter 5
- R. T. Rockafellar (1970), Convex Analysis, Princeton University Press, Chapters 28–30