Uses of duality Geoff Gordon & Ryan Tibshirani Optimization - PowerPoint PPT Presentation

Uses of duality
Geoff Gordon & Ryan Tibshirani
Optimization 10-725 / 36-725

Remember conjugate functions

Given $f : \mathbb{R}^n \to \mathbb{R}$, the function
$$f^*(y) = \max_{x \in \mathbb{R}^n} \; y^T x - f(x)$$
is called its conjugate

• Conjugates appear frequently in dual programs, as
$$-f^*(y) = \min_{x \in \mathbb{R}^n} \; f(x) - y^T x$$
• If $f$ is closed and convex, then $f^{**} = f$. Also,
$$x \in \partial f^*(y) \iff y \in \partial f(x) \iff x \in \operatorname*{argmin}_{z \in \mathbb{R}^n} \; f(z) - y^T z$$
and for strictly convex $f$,
$$\nabla f^*(y) = \operatorname*{argmin}_{z \in \mathbb{R}^n} \; \big( f(z) - y^T z \big)$$
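As a quick numerical illustration of the conjugate and its gradient, here is a minimal sketch for one assumed example, the quadratic $f(x) = \frac{1}{2} x^T Q x$ with $Q$ positive definite. The matrix `Q`, the test point `y`, and the helper names are illustrative choices, not part of the slides. For this $f$ the conjugate has the closed form $f^*(y) = \frac{1}{2} y^T Q^{-1} y$, which the code compares against the definition.

```python
import numpy as np

# Assumed example: f(x) = 0.5 * x^T Q x with Q symmetric positive definite.
Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])

def f(x):
    return 0.5 * x @ Q @ x

def conjugate_maximizer(y):
    # The maximizer of y^T x - f(x) satisfies the stationarity condition Q x = y.
    return np.linalg.solve(Q, y)

def f_conj(y):
    # Evaluate f*(y) = max_x y^T x - f(x) at the maximizer.
    x = conjugate_maximizer(y)
    return y @ x - f(x)

y = np.array([1.0, -2.0])
x_star = conjugate_maximizer(y)                 # equals grad f*(y) for this f
closed_form = 0.5 * y @ np.linalg.solve(Q, y)   # 0.5 * y^T Q^{-1} y
```

Here `x_star` plays the role of $\nabla f^*(y)$, and `Q @ x_star` recovers `y`, matching the relation $y \in \partial f(x)$.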

Uses of duality

We already discussed two key uses of duality:

• For $x$ primal feasible and $u, v$ dual feasible, $f(x) - g(u, v)$ is called the duality gap between $x$ and $u, v$. Since
$$f(x) - f(x^\star) \le f(x) - g(u, v),$$
a zero duality gap implies optimality. Also, the duality gap can be used as a stopping criterion in algorithms
• Under strong duality, given dual optimal $u^\star, v^\star$, any primal solution minimizes $L(x, u^\star, v^\star)$ over $x \in \mathbb{R}^n$ (i.e., satisfies the stationarity condition). This can be used to characterize or compute primal solutions

Outline

• Examples
• Dual gradient methods
• Dual decomposition
• Augmented Lagrangians

(And many more uses of duality, e.g., dual certificates in recovery theory, dual simplex algorithm, dual smoothing)

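Lasso and projections onto polyhedra

Recall the lasso problem:
$$\min_{x \in \mathbb{R}^p} \; \frac{1}{2} \|y - Ax\|^2 + \lambda \|x\|_1$$
and its dual problem:
$$\min_{u \in \mathbb{R}^n} \; \|y - u\|^2 \quad \text{subject to} \quad \|A^T u\|_\infty \le \lambda$$
According to the stationarity condition (with respect to $z, x$ blocks):
$$Ax^\star = y - u^\star$$
$$A_i^T u^\star \in \begin{cases} \{\lambda\} & \text{if } x_i^\star > 0 \\ \{-\lambda\} & \text{if } x_i^\star < 0 \\ [-\lambda, \lambda] & \text{if } x_i^\star = 0 \end{cases}, \quad i = 1, \ldots p$$
where $A_1, \ldots A_p$ are the columns of $A$. I.e., $|A_i^T u^\star| < \lambda$ implies $x_i^\star = 0$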

Directly from the dual problem,
$$\min_{u \in \mathbb{R}^n} \; \|y - u\|^2 \quad \text{subject to} \quad \|A^T u\|_\infty \le \lambda$$
we see that
$$u^\star = P_C(y),$$
the projection of $y$ onto the polyhedron
$$C = \{u \in \mathbb{R}^n : \|A^T u\|_\infty \le \lambda\} = \bigcap_{i=1}^p \{u : A_i^T u \le \lambda\} \cap \{u : A_i^T u \ge -\lambda\}$$
Therefore the lasso fit is
$$Ax^\star = (I - P_C)(y),$$
the residual from projecting onto $C$
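The relationship $u^\star = y - Ax^\star = P_C(y)$ can be checked numerically. The sketch below is only an illustration under assumptions: it solves a small random lasso problem with plain proximal gradient (ISTA), which is a solver choice of mine rather than anything prescribed by the slides, and then measures how far $u^\star$ is from dual feasibility. The data, `lam`, and the function names are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(A, y, lam, iters=5000):
    # Proximal gradient (ISTA) on 0.5 * ||y - A x||^2 + lam * ||x||_1.
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = soft_threshold(x + step * A.T @ (y - A @ x), step * lam)
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 8))    # assumed toy design matrix
y = rng.normal(size=30)
lam = 2.0

x_star = lasso_ista(A, y, lam)
u_star = y - A @ x_star         # stationarity: A x* = y - u*
# Non-positive (up to solver tolerance) when u* lies in C.
dual_infeasibility = np.max(np.abs(A.T @ u_star)) - lam
```

Because $u^\star$ is the projection of $y$ onto $C$, the inequality $(y - u^\star)^T (u - u^\star) \le 0$ should also hold for every $u \in C$, which gives a second way to test the output.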

Consider the lasso fit $Ax^\star$ as a function of $y \in \mathbb{R}^n$, for fixed $A, \lambda$. From the dual perspective (and some geometric arguments):

• The lasso fit $Ax^\star$ is nonexpansive with respect to $y$, i.e., it is Lipschitz with constant $L = 1$:
$$\|Ax^\star(y) - Ax^\star(y')\| \le \|y - y'\| \quad \text{for all } y, y'$$
• Each face of the polyhedron $C$ corresponds to a particular active set $S$ for lasso solutions¹
• For almost every $y \in \mathbb{R}^n$, if we move $y$ slightly, it will still project to the same face of $C$
• Therefore, for almost every $y$, the active set $S$ of the lasso solution is locally constant,² and the lasso fit is a locally affine projection map

¹,² These statements assume that the lasso solution is unique; analogous statements exist for the nonunique case
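The nonexpansiveness claim is easy to probe on a toy instance. The following sketch, again using ISTA as an assumed solver and made-up data, computes the lasso fit for two response vectors and compares the distance between the fits with the distance between the responses.

```python
import numpy as np

def lasso_fit(A, y, lam, iters=5000):
    # Returns the fit A x*, with x* computed by proximal gradient (ISTA).
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        v = x + step * A.T @ (y - A @ x)
        x = np.sign(v) * np.maximum(np.abs(v) - step * lam, 0.0)
    return A @ x

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 8))     # assumed toy design matrix
y1 = rng.normal(size=30)
y2 = y1 + 0.5 * rng.normal(size=30)
lam = 2.0

fit_gap = np.linalg.norm(lasso_fit(A, y1, lam) - lasso_fit(A, y2, lam))
data_gap = np.linalg.norm(y1 - y2)
```

Up to solver tolerance, `fit_gap` should never exceed `data_gap`, whatever pair of responses is used.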

Safe rules

For the lasso problem, somewhat amazingly, we have a safe rule:³
$$|A_i^T y| < \lambda - \|A_i\| \|y\| \frac{\lambda_{\max} - \lambda}{\lambda_{\max}} \;\Rightarrow\; x_i^\star = 0, \quad \text{all } i = 1, \ldots p$$
where $\lambda_{\max} = \|A^T y\|_\infty$ (the smallest value of $\lambda$ such that $x^\star = 0$), i.e., we can eliminate features a priori, without solving the problem. (Note: this is not an if and only if statement!)

Why this rule? The construction comes from the lasso dual:
$$\max_{u \in \mathbb{R}^n} \; g(u) \quad \text{subject to} \quad \|A^T u\|_\infty \le \lambda$$
where $g(u) = (\|y\|^2 - \|y - u\|^2)/2$. Suppose $u_0$ is a dual feasible point (e.g., take $u_0 = y \cdot \lambda/\lambda_{\max}$). Then $\gamma = g(u_0)$ lower bounds the dual optimal value, so the dual problem is equivalent to
$$\max_{u \in \mathbb{R}^n} \; g(u) \quad \text{subject to} \quad \|A^T u\|_\infty \le \lambda, \;\; g(u) \ge \gamma$$

³ L. El Ghaoui et al. (2010), Safe feature elimination in sparse learning. Safe rules extend to lasso logistic regression and 1-norm SVMs, only $g$ changes

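Now consider computing
$$m_i = \max_{u \in \mathbb{R}^n} \; |A_i^T u| \quad \text{subject to} \quad g(u) \ge \gamma, \quad \text{for } i = 1, \ldots p$$
Note that
$$m_i < \lambda \;\Rightarrow\; |A_i^T u^\star| < \lambda \;\Rightarrow\; x_i^\star = 0$$
Through another dual argument,⁴ we can explicitly compute $m_i$, and
$$m_i < \lambda \iff |A_i^T y| < \lambda - \sqrt{\|y\|^2 - 2\gamma} \cdot \|A_i\|$$
Substituting $\gamma = g(y \cdot \lambda/\lambda_{\max})$, for which $\sqrt{\|y\|^2 - 2\gamma} = \|y\| (\lambda_{\max} - \lambda)/\lambda_{\max}$, then gives the safe rule on the previous slide

⁴ From L. El Ghaoui et al. (2010), Safe feature elimination in sparse learning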
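The safe rule translates directly into a screening mask. The sketch below is a minimal version under assumptions: it takes $0 < \lambda \le \lambda_{\max}$, uses random toy data, and compares the screened features against a lasso solution computed by ISTA (a solver chosen only for this check). The function names are hypothetical.

```python
import numpy as np

def safe_rule_mask(A, y, lam):
    # True where the safe rule certifies x_i* = 0; assumes 0 < lam <= lam_max.
    corr = np.abs(A.T @ y)
    lam_max = corr.max()
    slack = np.linalg.norm(A, axis=0) * np.linalg.norm(y) * (lam_max - lam) / lam_max
    return corr < lam - slack

def lasso_ista(A, y, lam, iters=5000):
    # Proximal gradient on 0.5 * ||y - A x||^2 + lam * ||x||_1, used only to verify the rule.
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        v = x + step * A.T @ (y - A @ x)
        x = np.sign(v) * np.maximum(np.abs(v) - step * lam, 0.0)
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 8))            # assumed toy design matrix
y = rng.normal(size=30)
lam = 0.9 * np.max(np.abs(A.T @ y))     # a value just below lam_max

eliminated = safe_rule_mask(A, y, lam)  # features screened out before solving
x_star = lasso_ista(A, y, lam)          # screened coordinates should be zero here
```

Since the rule is sufficient but not necessary, `x_star` may contain zeros at coordinates that `eliminated` does not flag.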

Beyond pure sparsity

Consider something like a reverse lasso problem (also called 1-norm analysis):
$$\min_{x \in \mathbb{R}^n} \; \frac{1}{2} \|y - x\|^2 + \lambda \|Dx\|_1$$
where $D \in \mathbb{R}^{m \times n}$ is a given penalty matrix (analysis operator). Note this cannot be turned into a lasso problem if $\operatorname{rank}(D) < m$

Basic idea: $Dx^\star$ is now sparse, and we choose $D$ so that this gives some type of desired structure in $x^\star$. E.g., fused lasso (also called total variation denoising problems), where $D$ is chosen so that
$$\|Dx\|_1 = \sum_{(i,j) \in E} |x_i - x_j|$$
for some set of pairs $E$. In other words, $D$ is the incidence matrix for the graph $G = (\{1, \ldots n\}, E)$, with arbitrary edge orientations
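To make the incidence-matrix construction concrete, here is a small sketch that builds $D$ from an edge list and confirms that $\|Dx\|_1$ equals the sum of absolute differences over edges. The helper `incidence_matrix`, the chain graph, and the vector `x` are illustrative assumptions, and nodes are 0-based to follow Python indexing.

```python
import numpy as np

def incidence_matrix(n, edges):
    # One row per edge (i, j): +1 at node i, -1 at node j (orientation is arbitrary).
    D = np.zeros((len(edges), n))
    for row, (i, j) in enumerate(edges):
        D[row, i] = 1.0
        D[row, j] = -1.0
    return D

edges = [(0, 1), (1, 2), (2, 3)]        # chain graph on 4 nodes
x = np.array([1.0, 1.0, 3.0, 0.0])

D = incidence_matrix(4, edges)
penalty = np.abs(D @ x).sum()                              # ||D x||_1
edge_sum = sum(abs(x[i] - x[j]) for i, j in edges)         # sum over edges of |x_i - x_j|
```

For this `x` both quantities equal $|1-1| + |1-3| + |3-0| = 5$.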

[Figure: three panels showing the original image, a noisy version, and the fused lasso solution.] Here $D$ is the incidence matrix on a 2d grid

For each state, we have the log proportion of H1N1 cases in 2009 (from the CDC)

[Figure: two panels showing the observed data and the fused lasso solution.] Here $D$ is the incidence matrix on the graph formed by joining US states to their geographic neighbors

Using similar steps as in the lasso dual derivation, here the dual problem is:
$$\min_{u \in \mathbb{R}^m} \; \|y - D^T u\|^2 \quad \text{subject to} \quad \|u\|_\infty \le \lambda$$
and the primal-dual relationship is
$$x^\star = y - D^T u^\star$$
$$u_i^\star \in \begin{cases} \{\lambda\} & \text{if } (Dx^\star)_i > 0 \\ \{-\lambda\} & \text{if } (Dx^\star)_i < 0 \\ [-\lambda, \lambda] & \text{if } (Dx^\star)_i = 0 \end{cases}, \quad i = 1, \ldots m$$
Clearly $D^T u^\star = P_C(y)$, where now
$$C = \{D^T u : \|u\|_\infty \le \lambda\}$$
is also a polyhedron, and therefore $x^\star = (I - P_C)(y)$

Same arguments as before show that:

• Primal solution $x^\star$ is Lipschitz continuous as a function of $y$ (for fixed $D, \lambda$) with constant $L = 1$
• Each face of the polyhedron $C$ corresponds to a nonzero pattern in $Dx^\star$
• Almost everywhere in $y$, the primal solution $x^\star$ admits a locally constant structure $S = \operatorname{supp}(Dx^\star)$, and therefore is a locally affine projection map

The dual is also very helpful for algorithmic reasons: it uncomplicates (disentangles) the involvement of the linear operator $D$ with the 1-norm

The prox function in the dual problem is now very easy (projection onto the $\infty$-norm ball), so we can use, e.g., generalized gradient descent or the accelerated generalized gradient method on the dual problem
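A minimal sketch of generalized (here, projected) gradient descent on the fused lasso dual, for an assumed 1d chain graph and synthetic piecewise-constant data. The function name, the data, and `lam` are illustrative; the projection onto the $\infty$-norm ball is just a coordinatewise clip. The primal solution is recovered as $x^\star = y - D^T u^\star$, and the quantity `gap` is the primal-dual gap $\lambda \|Dx\|_1 - u^T D x$ evaluated at the returned pair.

```python
import numpy as np

def fused_lasso_dual_pg(y, D, lam, iters=5000):
    # Projected gradient on 0.5 * ||y - D^T u||^2 subject to ||u||_inf <= lam.
    step = 1.0 / np.linalg.norm(D, 2) ** 2
    u = np.zeros(D.shape[0])
    for _ in range(iters):
        grad = D @ (D.T @ u - y)
        u = np.clip(u - step * grad, -lam, lam)   # projection onto the inf-norm ball
    return y - D.T @ u, u                         # primal solution x* = y - D^T u*

n = 20
D = np.diff(np.eye(n), axis=0)   # rows e_{i+1} - e_i: incidence matrix of a chain graph

rng = np.random.default_rng(0)
y = np.concatenate([np.zeros(10), 2.0 * np.ones(10)]) + 0.1 * rng.normal(size=n)
lam = 1.0

x_hat, u_hat = fused_lasso_dual_pg(y, D, lam)
# Primal objective minus dual objective at (x_hat, u_hat); nonnegative by weak duality.
gap = lam * np.abs(D @ x_hat).sum() - u_hat @ (D @ x_hat)
```

A gap near zero indicates that the dual iterate and the recovered primal point are both close to optimal.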

Dual gradient methods

What if we can't derive the dual (conjugate) in closed form, but want to utilize the dual relationship? It turns out we can still use dual-based subgradient or gradient methods

E.g., consider the problem
$$\min_{x \in \mathbb{R}^n} \; f(x) \quad \text{subject to} \quad Ax = b$$
Its dual problem is
$$\max_{u \in \mathbb{R}^m} \; -f^*(-A^T u) - b^T u$$
where $f^*$ is the conjugate of $f$. Defining $g(u) = f^*(-A^T u)$, note that $\partial g(u) = -A \, \partial f^*(-A^T u)$, and recall
$$x \in \partial f^*(-A^T u) \iff x \in \operatorname*{argmin}_{z \in \mathbb{R}^n} \; f(z) + u^T A z$$

Therefore the dual subgradient method (for minimizing the negative of the dual objective) starts with an initial dual guess $u^{(0)}$, and repeats for $k = 1, 2, 3, \ldots$
$$x^{(k)} \in \operatorname*{argmin}_{x \in \mathbb{R}^n} \; f(x) + (u^{(k-1)})^T A x$$
$$u^{(k)} = u^{(k-1)} + t_k (A x^{(k)} - b)$$
where $t_k$ are step sizes, chosen in standard ways

Recall that if $f$ is strictly convex, then $f^*$ is differentiable, and so we get dual gradient ascent, which repeats for $k = 1, 2, 3, \ldots$
$$x^{(k)} = \operatorname*{argmin}_{x \in \mathbb{R}^n} \; f(x) + (u^{(k-1)})^T A x$$
$$u^{(k)} = u^{(k-1)} + t_k (A x^{(k)} - b)$$
(the difference is that $x^{(k)}$ is unique, here)
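The two-step iteration can be sketched on a problem where the $x$-update has a closed form. The example below assumes $f(x) = \frac{1}{2}\|x - c\|^2$, so that $\operatorname{argmin}_x f(x) + u^T A x = c - A^T u$. The matrices, the vector `c`, and the constant step size $1/\|A\|_2^2$ are my choices for this toy problem; the step accounts for the factor of $\|A\|_2^2$ that the dual gradient's Lipschitz constant picks up from $A$ (here $f$ is strongly convex with parameter 1).

```python
import numpy as np

# Assumed toy problem: minimize 0.5 * ||x - c||^2 subject to A x = b.
A = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, -1.0]])
b = np.array([1.0, 2.0])
c = np.array([0.5, -1.0, 2.0, 0.0])

def dual_gradient_ascent(A, b, c, iters=500):
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    u = np.zeros(A.shape[0])
    for _ in range(iters):
        x = c - A.T @ u               # x-update: argmin_x 0.5*||x - c||^2 + u^T A x
        u = u + step * (A @ x - b)    # u-update: ascent step along A x - b
    return c - A.T @ u, u

x_hat, u_hat = dual_gradient_ascent(A, b, c)

# Closed-form solution (projection of c onto {x : A x = b}) for comparison.
x_exact = c - A.T @ np.linalg.solve(A @ A.T, A @ c - b)
```

Note that intermediate `x` iterates are generally infeasible; feasibility $Ax = b$ is only reached in the limit.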

In fact,

$f$ strongly convex with parameter $d$ $\;\Rightarrow\;$ $\nabla f^*$ Lipschitz with parameter $1/d$

Check: if $f$ is strongly convex and $x$ is its minimizer, then
$$f(y) \ge f(x) + \frac{d}{2} \|y - x\|^2, \quad \text{all } y$$
Hence defining $x_u = \nabla f^*(u)$, $x_v = \nabla f^*(v)$ (so that $x_u$ minimizes $f(x) - u^T x$ and $x_v$ minimizes $f(x) - v^T x$),
$$f(x_v) - u^T x_v \ge f(x_u) - u^T x_u + \frac{d}{2} \|x_u - x_v\|^2$$
$$f(x_u) - v^T x_u \ge f(x_v) - v^T x_v + \frac{d}{2} \|x_u - x_v\|^2$$
Adding these together:
$$d \|x_u - x_v\|^2 \le (u - v)^T (x_u - x_v)$$
Use Cauchy-Schwarz and rearrange,
$$\|x_u - x_v\| \le (1/d) \cdot \|u - v\|$$
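A numerical sanity check of the Lipschitz bound, for one assumed instance: $f(x) = \frac{1}{2} x^T Q x$, which is strongly convex with parameter $d = \lambda_{\min}(Q)$ and has $\nabla f^*(u) = Q^{-1} u$. The matrix `Q` and the random test points are illustrative, and passing this check on samples is of course not a proof.

```python
import numpy as np

Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])
d = np.linalg.eigvalsh(Q).min()   # strong convexity parameter of f(x) = 0.5 * x^T Q x

def grad_conj(u):
    # grad f*(u) = argmin_x f(x) - u^T x, i.e. the solution of Q x = u.
    return np.linalg.solve(Q, u)

rng = np.random.default_rng(0)
ratios = []
for _ in range(1000):
    u, v = rng.normal(size=2), rng.normal(size=2)
    ratios.append(np.linalg.norm(grad_conj(u) - grad_conj(v)) / np.linalg.norm(u - v))

worst_ratio = max(ratios)         # should not exceed 1 / d
```

For a quadratic the bound is tight: the ratio approaches $1/d$ when $u - v$ aligns with the eigenvector of $Q$ for its smallest eigenvalue.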

Applying what we know about gradient descent: if $f$ is strongly convex with parameter $d$, then dual gradient ascent with constant step size $t_k \le d$ converges at rate $O(1/k)$. (Note: this is quite a strong assumption leading to a modest rate!)

Dual generalized gradient ascent and the accelerated dual generalized gradient method carry through in a similar manner

Disadvantages of dual methods:

• Can be slow to converge (think of subgradient method)
• Poor convergence properties: even though we may achieve convergence in dual objective value, convergence of $u^{(k)}, x^{(k)}$ to solutions requires strong assumptions (primal iterates $x^{(k)}$ can even end up being infeasible in the limit)

Advantage: decomposability
