
Convex Optimization in Machine Learning and Inverse Problems, Part 2: First-Order Methods
Mário A. T. Figueiredo (Instituto de Telecomunicações, Instituto Superior Técnico, Lisboa, Portugal) and Stephen J. Wright (Computer Sciences, University of Wisconsin-Madison)


1. Summary: Linear Convergence, Strictly Convex f
Steepest descent: linear rate approx (1 - 2/κ); heavy-ball: linear rate approx (1 - 2/√κ). Big difference!
To reduce ‖x_k - x*‖ by a factor ε, need k large enough that
  (1 - 2/κ)^k ≤ ε  ⇐  k ≥ (κ/2) |log ε|   (steepest descent)
  (1 - 2/√κ)^k ≤ ε  ⇐  k ≥ (√κ/2) |log ε|   (heavy-ball)
A factor of √κ difference; e.g. if κ = 1000 (not at all uncommon in inverse problems), need ∼30 times fewer steps.
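
As a quick check of the "∼30 times fewer steps" claim (a back-of-the-envelope computation using the natural logarithm; not from the slides): for κ = 1000 and ε = 10⁻⁶, |log ε| ≈ 13.8, so steepest descent needs k ≥ (1000/2)(13.8) ≈ 6900 iterations while heavy-ball needs k ≥ (√1000/2)(13.8) ≈ 220, a ratio of √1000 ≈ 31.6.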

2. Conjugate Gradient
Basic conjugate gradient (CG) step is
  x_{k+1} = x_k + α_k p_k,   p_k = -∇f(x_k) + γ_k p_{k-1}.
Can be identified with heavy-ball, with β_k = α_k γ_k / α_{k-1}.
However, CG can be implemented in a way that doesn't require knowledge (or estimation) of L and μ:
Choose α_k to (approximately) minimize f along p_k;
Choose γ_k by a variety of formulae (Fletcher-Reeves, Polak-Ribiere, etc.), all of which are equivalent if f is convex quadratic; e.g.
  γ_k = ‖∇f(x_k)‖² / ‖∇f(x_{k-1})‖².

3. Conjugate Gradient
Nonlinear CG: variants include Fletcher-Reeves, Polak-Ribiere, Hestenes-Stiefel. Restarting periodically with p_k = -∇f(x_k) is useful (e.g. every n iterations, or when p_k is not a descent direction).
For quadratic f, convergence analysis is based on the eigenvalues of A and Chebyshev polynomials, via min-max arguments. Get:
Finite termination in as many iterations as there are distinct eigenvalues;
Asymptotic linear convergence with rate approx 1 - 2/√κ (like heavy-ball). (Nocedal and Wright, 2006)
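
A minimal sketch of nonlinear CG with the Fletcher-Reeves formula and periodic restarts (an illustration only, assuming f and grad are user-supplied callables, and using a crude backtracking line search in place of the approximate one-dimensional minimization):

    import numpy as np

    def cg_fletcher_reeves(f, grad, x0, iters=200, restart=None):
        x = x0.copy()
        g = grad(x)
        p = -g
        restart = restart or x0.size                      # restart every n iterations by default
        for k in range(iters):
            # crude backtracking line search standing in for "minimize f along p"
            alpha, fx, gTp = 1.0, f(x), g.dot(p)
            for _ in range(50):
                if f(x + alpha * p) <= fx + 1e-4 * alpha * gTp:
                    break
                alpha *= 0.5
            x = x + alpha * p
            g_new = grad(x)
            gamma = g_new.dot(g_new) / g.dot(g)           # Fletcher-Reeves formula
            p = -g_new + gamma * p
            g = g_new
            if (k + 1) % restart == 0 or g.dot(p) >= 0:   # periodic restart, or p not a descent direction
                p = -g
        return x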

4. Accelerated First-Order Methods
Accelerate the rate to 1/k² for weakly convex, while retaining the linear rate (related to √κ) for the strongly convex case. Nesterov (1983) describes a method that requires κ.
Initialize: Choose x_0, α_0 ∈ (0, 1); set y_0 ← x_0.
Iterate: x_{k+1} ← y_k - (1/L) ∇f(y_k); (*short-step*)
  find α_{k+1} ∈ (0, 1): α_{k+1}² = (1 - α_{k+1}) α_k² + α_{k+1}/κ;
  set β_k = α_k (1 - α_k) / (α_k² + α_{k+1});
  set y_{k+1} ← x_{k+1} + β_k (x_{k+1} - x_k).
Still works for weakly convex (κ = ∞).
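
A sketch of this constant-step scheme, assuming L and μ (hence κ = L/μ) are known and that grad is a user-supplied callable; the recurrence defining α_{k+1} is a quadratic equation solved in closed form:

    import numpy as np

    def nesterov_accelerated(grad, x0, L, mu, iters=500):
        kappa = L / mu
        x = x0.copy()
        y = x0.copy()
        alpha = 0.5                                   # any alpha_0 in (0, 1)
        for _ in range(iters):
            x_new = y - grad(y) / L                   # short gradient step from y_k
            # solve alpha_new^2 = (1 - alpha_new)*alpha^2 + alpha_new/kappa, root in (0, 1)
            c = alpha**2 - 1.0 / kappa
            alpha_new = 0.5 * (-c + np.sqrt(c**2 + 4.0 * alpha**2))
            beta = alpha * (1.0 - alpha) / (alpha**2 + alpha_new)
            y = x_new + beta * (x_new - x)            # momentum step
            x, alpha = x_new, alpha_new
        return x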

5. (Figure: the sequence of iterates x_k, y_k, x_{k+1}, y_{k+1}, x_{k+2}, y_{k+2}.) Separates the "gradient descent" and "momentum" step components.

6. Convergence Results: Nesterov
If α_0 ≥ 1/√κ, have
  f(x_k) - f(x*) ≤ c_1 min{ (1 - 1/√κ)^k, 4L/(√L + c_2 k)² },
where the constants c_1 and c_2 depend on x_0, α_0, L.
Linear convergence at the "heavy-ball" rate for strongly convex f; 1/k² sublinear rate otherwise.
In the special case of α_0 = 1/√κ, this scheme yields
  α_k ≡ 1/√κ,   β_k ≡ 1 - 2/(√κ + 1).

7. Beck and Teboulle
Beck and Teboulle (2009) propose a similar algorithm, with a fairly short and elementary analysis (though still not intuitive).
Initialize: Choose x_0; set y_1 = x_0, t_1 = 1.
Iterate: x_k ← y_k - (1/L) ∇f(y_k);
  t_{k+1} ← (1/2) [1 + √(1 + 4 t_k²)];
  y_{k+1} ← x_k + ((t_k - 1)/t_{k+1}) (x_k - x_{k-1}).
For (weakly) convex f, converges with f(x_k) - f(x*) ∼ 1/k².
When L is not known, increase an estimate of L until it's big enough.
Beck and Teboulle (2009) do the convergence analysis in 2-3 pages; elementary, but "technical."

8. A Non-Monotone Gradient Method: Barzilai-Borwein
Barzilai and Borwein (1988) (BB) proposed an unusual choice of α_k. Allows f to increase (sometimes a lot) on some steps: non-monotone.
  x_{k+1} = x_k - α_k ∇f(x_k),   α_k := arg min_α ‖s_k - α z_k‖²,
where s_k := x_k - x_{k-1}, z_k := ∇f(x_k) - ∇f(x_{k-1}).
Explicitly, we have α_k = s_k^T z_k / z_k^T z_k.
Note that for f(x) = (1/2) x^T A x, we have
  α_k = s_k^T A s_k / s_k^T A² s_k ∈ [1/L, 1/μ].
BB can be viewed as a quasi-Newton method, with the Hessian approximated by α_k^{-1} I.
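
A minimal sketch of BB gradient descent with the step length derived above (α_k = s_k^T z_k / z_k^T z_k); purely illustrative, with no safeguards, assuming grad is a user-supplied callable:

    import numpy as np

    def bb_gradient(grad, x0, alpha0=1.0, iters=100):
        x_old = x0.copy()
        g_old = grad(x_old)
        x = x_old - alpha0 * g_old               # plain gradient step to get things started
        for _ in range(iters):
            g = grad(x)
            s, z = x - x_old, g - g_old          # s_k and z_k as defined on the slide
            if z.dot(z) == 0.0:                  # gradient no longer changing: stop
                break
            alpha = s.dot(z) / z.dot(z)          # BB step length
            x_old, g_old = x, g
            x = x - alpha * g                    # non-monotone step (f may increase)
        return x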

9. Comparison: BB vs Greedy Steepest Descent

10. There Are Many BB Variants
use α_k = s_k^T s_k / s_k^T z_k in place of α_k = s_k^T z_k / z_k^T z_k;
alternate between these two formulae;
hold α_k constant for a number (2, 3, 5) of successive steps;
take α_k to be the steepest descent step from the previous iteration.
Nonmonotonicity appears essential to performance. Some variants get global convergence by requiring a sufficient decrease in f over the worst of the last M (say 10) iterates.
The original 1988 analysis in BB's paper is nonstandard and illuminating (just for a 2-variable quadratic). In fact, most analyses of BB and related methods are nonstandard, and consider only special cases. The precursor of such analyses is Akaike (1959). More recently, see Ascher, Dai, Fletcher, Hager and others.

11. Extending to the Constrained Case: x ∈ Ω
How to change these methods to handle the constraint x ∈ Ω (assuming that Ω is a closed convex set)?
Some algorithms and theory stay much the same, if we can involve the constraint x ∈ Ω explicitly in the subproblems.
Example: Nesterov's constant step scheme requires just one calculation to be changed from the unconstrained version.
Initialize: Choose x_0, α_0 ∈ (0, 1); set y_0 ← x_0.
Iterate: x_{k+1} ← arg min_{y ∈ Ω} (1/2) ‖y - [y_k - (1/L) ∇f(y_k)]‖²₂;
  find α_{k+1} ∈ (0, 1): α_{k+1}² = (1 - α_{k+1}) α_k² + α_{k+1}/κ;
  set β_k = α_k (1 - α_k) / (α_k² + α_{k+1});
  set y_{k+1} ← x_{k+1} + β_k (x_{k+1} - x_k).
Convergence theory is unchanged.

12. Regularized Optimization
How to change these methods to handle regularized optimization?
  min_x f(x) + τ ψ(x),
where f is convex and smooth, while ψ is convex but usually nonsmooth.
Often, all that is needed is to change the update step to
  x_{k+1} = arg min_x (1/2) ‖x - Φ(x_k)‖²₂ + λ ψ(x),
where Φ(x_k) is a gradient descent step, or something more complicated; e.g., an accelerated method with Φ(x_k, x_{k-1}) instead of Φ(x_k).
This is the shrinkage/thresholding step; how to solve it with nonsmooth ψ? That's the topic of the following slides.

13. Handling Nonsmoothness (e.g. the ℓ1 Norm)
Convexity ⇒ continuity (on the domain of the function).
Convexity does not imply differentiability (e.g., ψ(x) = ‖x‖₁).
Subgradients generalize gradients for general convex functions: v is a subgradient of f at x if f(x') ≥ f(x) + v^T (x' - x) (a linear lower bound).
Subdifferential: ∂f(x) = { all subgradients of f at x }.
If f is differentiable, ∂f(x) = {∇f(x)}.
(Figure: the linear lower bound in the differentiable and nondifferentiable cases.)

14. More on Subgradients and Subdifferentials
The subdifferential is a set-valued function: f: R^d → R ⇒ ∂f: R^d → 2^{R^d} (the power set of R^d).
Example:
  f(x) = -2x - 1 for x ≤ -1;  -x for -1 < x ≤ 0;  x²/2 for x > 0.
  ∂f(x) = {-2} for x < -1;  [-2, -1] for x = -1;  {-1} for -1 < x < 0;  [-1, 0] for x = 0;  {x} for x > 0.
Fermat's rule: x ∈ arg min_x f(x) ⇔ 0 ∈ ∂f(x).

15. A Key Tool: Moreau's Proximity Operators
Moreau (1962) proximity operator:
  prox_ψ(y) := arg min_x (1/2) ‖x - y‖²₂ + ψ(x)
...well defined for convex ψ, since ‖· - y‖²₂ is coercive and strictly convex.
Example (seen above): prox_{τ|·|}(y) = soft(y, τ) = sign(y) max{|y| - τ, 0}.
Block separability: for x = (x_1, ..., x_N) (a partition of the components of x),
  ψ(x) = Σ_i ψ_i(x_i)  ⇒  (prox_ψ(y))_i = prox_{ψ_i}(y_i).
Relationship with the subdifferential: z = prox_ψ(y) ⇔ y - z ∈ ∂ψ(z).
Resolvent: z = prox_ψ(y) ⇔ 0 ∈ ∂ψ(z) + (z - y) ⇔ y ∈ (∂ψ + I)(z), i.e. prox_ψ(y) = (∂ψ + I)^{-1}(y).
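
A small sketch of the soft-threshold as the prox of τ|·|; applying it entrywise to a vector is exactly the block-separability property for ψ(x) = τ‖x‖₁:

    import numpy as np

    def soft(y, tau):
        # prox of tau*|.|: sign(y) * max(|y| - tau, 0), applied componentwise
        return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

    y = np.array([3.0, -0.2, 0.5, -2.0])
    print(soft(y, 1.0))      # small entries are zeroed out, the others shrink toward 0 by 1.0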

16. Important Proximity Operators
Soft-thresholding is the proximity operator of the ℓ1 norm.
Indicator ι_S of a convex set S:
  prox_{ι_S}(u) = arg min_x (1/2) ‖x - u‖²₂ + ι_S(x) = arg min_{x ∈ S} (1/2) ‖x - u‖²₂ = P_S(u),
...the Euclidean projection on S.
Squared Euclidean norm (separable, smooth):
  prox_{τ‖·‖²₂}(y) = arg min_x (1/2) ‖x - y‖²₂ + τ ‖x‖²₂ = y/(1 + 2τ).
Euclidean norm (not separable, nonsmooth):
  prox_{τ‖·‖₂}(y) = (y/‖y‖₂)(‖y‖₂ - τ) if ‖y‖₂ > τ;  0 if ‖y‖₂ ≤ τ.
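
Sketches of the three operators listed above, using the prox definition with the 1/2 factor (box projection for the indicator of a box, scaled shrinkage for the squared norm, block soft-threshold for the Euclidean norm); purely illustrative:

    import numpy as np

    def proj_box(u, lo, hi):
        # prox of the indicator of the box [lo, hi]^n = Euclidean projection onto it
        return np.clip(u, lo, hi)

    def prox_sq_l2(y, tau):
        # prox of tau*||x||_2^2 (with the 1/2 factor in the prox definition)
        return y / (1.0 + 2.0 * tau)

    def prox_l2(y, tau):
        # prox of tau*||x||_2: shrink the whole vector toward 0, or zero it out
        nrm = np.linalg.norm(y)
        return np.zeros_like(y) if nrm <= tau else (1.0 - tau / nrm) * y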

17. More Proximity Operators (table; Combettes and Pesquet, 2011)

18. Another Key Tool: Fenchel-Legendre Conjugates
The Fenchel-Legendre conjugate of a proper convex function f, denoted f*: R^n → R̄, is defined by
  f*(u) = sup_x x^T u - f(x).
Main properties and relationship with proximity operators:
Biconjugation: if f is convex, proper (and lower semicontinuous), f** = f.
Moreau's decomposition: prox_f(u) + prox_{f*}(u) = u ...meaning that, if you know prox_f, you know prox_{f*}, and vice-versa.
Conjugate of an indicator: if f(x) = ι_C(x), where C is a convex set,
  f*(u) = sup_x x^T u - ι_C(x) = sup_{x ∈ C} x^T u ≡ σ_C(u) (the support function of C).

19. From Conjugates to Proximity Operators
Notice that |u| = sup_{x ∈ [-1,1]} xu = σ_{[-1,1]}(u), thus |·|* = ι_{[-1,1]}.
Using Moreau's decomposition, we easily derive the soft-threshold:
  prox_{τ|·|} = I - prox_{ι_{[-τ,τ]}} = I - P_{[-τ,τ]} = soft(·, τ).
Conjugate of a norm: if f(x) = τ ‖x‖_p then f* = ι_{{x: ‖x‖_q ≤ τ}}, where 1/p + 1/q = 1 (a Hölder pair, or Hölder conjugates).
That is, ‖·‖_p and ‖·‖_q are dual norms:
  ‖z‖_q = sup{ x^T z : ‖x‖_p ≤ 1 } = sup_{x ∈ B_p(1)} x^T z = σ_{B_p(1)}(z).
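
A quick numerical check of prox_{τ|·|} = I - P_{[-τ,τ]} (a sketch comparing the Moreau-decomposition form with the direct soft-threshold):

    import numpy as np

    tau = 0.7
    y = np.array([2.0, -0.3, 0.5, -1.5])
    via_moreau = y - np.clip(y, -tau, tau)                       # I - P_[-tau, tau]
    direct = np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)       # soft(y, tau)
    print(np.allclose(via_moreau, direct))                       # True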

20. From Conjugates to Proximity Operators
Proximity operator of a norm: prox_{τ‖·‖_p} = I - P_{B_q(τ)}, where B_q(τ) = {x: ‖x‖_q ≤ τ} and 1/p + 1/q = 1.
Example: computing prox_{‖·‖∞} (notice ℓ∞ is not separable): since 1/∞ + 1/1 = 1, prox_{τ‖·‖∞} = I - P_{B_1(τ)} ...the proximity operator of the ℓ∞ norm is the residual of the projection on an ℓ1 ball.
Projection on the ℓ1 ball has no closed form, but there are efficient (linear cost) algorithms (Brucker, 1984), (Maculan and de Paula, 1989).
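
A sketch of prox_{τ‖·‖∞} = I - P_{B_1(τ)} using a simple O(n log n) sort-based projection onto the ℓ1 ball (the linear-cost algorithms cited above are more involved):

    import numpy as np

    def project_l1_ball(v, tau):
        # Euclidean projection of v onto { x : ||x||_1 <= tau }, sort-based
        if np.abs(v).sum() <= tau:
            return v.copy()
        u = np.sort(np.abs(v))[::-1]                  # |v| sorted in decreasing order
        css = np.cumsum(u)
        rho = np.nonzero(u - (css - tau) / np.arange(1, v.size + 1) > 0)[0][-1]
        theta = (css[rho] - tau) / (rho + 1.0)
        return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

    def prox_linf(y, tau):
        # residual of the projection onto the l1 ball of radius tau
        return y - project_l1_ball(y, tau)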

21. Geometry and Effect of prox_{ℓ∞}
Whereas ℓ1 promotes sparsity, ℓ∞ promotes equality (in absolute value).

22. From Conjugates to Proximity Operators
The dual of the ℓ2 norm is the ℓ2 norm (so prox_{τ‖·‖₂} = I - P_{B_2(τ)}).

23. Group Norms and their Prox Operators
Group-norm regularizer: ψ(x) = Σ_{m=1}^M λ_m ‖x_{G_m}‖_p.
In the non-overlapping case (G_1, ..., G_M is a partition of {1, ..., n}), simply use separability:
  (prox_ψ(u))_{G_m} = prox_{λ_m ‖·‖_p}(u_{G_m}).
In the tree-structured case, can get a complete ordering of the groups: G_1 ⪯ G_2 ⪯ ... ⪯ G_M, where (G ⪯ G') ⇔ (G ⊂ G') or (G ∩ G' = ∅).
Define Π_m: R^n → R^n by (Π_m(u))_{G_m} = prox_{λ_m ‖·‖_p}(u_{G_m}) and (Π_m(u))_{Ḡ_m} = u_{Ḡ_m}, where Ḡ_m = {1, ..., n} \ G_m.
Then prox_ψ = Π_M ∘ ... ∘ Π_2 ∘ Π_1 ...only valid for p ∈ {1, 2, ∞} (Jenatton et al., 2011).
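
A sketch of the non-overlapping case with p = 2: apply the block soft-threshold group by group; `groups` (a list of index arrays forming a partition) and `lambdas` are illustrative inputs:

    import numpy as np

    def prox_group_l2(u, groups, lambdas):
        # prox of sum_m lambda_m * ||x_{G_m}||_2 for a partition G_1, ..., G_M
        x = u.copy()
        for g, lam in zip(groups, lambdas):
            nrm = np.linalg.norm(u[g])
            x[g] = 0.0 if nrm <= lam else (1.0 - lam / nrm) * u[g]
        return x

    u = np.arange(6, dtype=float)
    groups = [np.array([0, 1, 2]), np.array([3, 4, 5])]
    print(prox_group_l2(u, groups, [1.0, 10.0]))      # the second group is zeroed out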

24. Matrix Nuclear Norm and its Prox Operator
Recall the trace/nuclear norm: ‖X‖_* = Σ_{i=1}^{min{m,n}} σ_i.
The dual of a Schatten p-norm is a Schatten q-norm, with 1/p + 1/q = 1. Thus, the dual of the nuclear norm is the spectral norm: ‖X‖_∞ = max{σ_1, ..., σ_{min{m,n}}}.
If Y = U Λ V^T is the SVD of Y, we have
  prox_{τ‖·‖_*}(Y) = U Λ V^T - P_{{X: max{σ_1,...,σ_{min{m,n}}} ≤ τ}}(U Λ V^T) = U soft(Λ, τ) V^T.
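
A sketch of the resulting singular-value soft-thresholding operator (assuming numpy's SVD; the soft-threshold acts on the vector of singular values):

    import numpy as np

    def prox_nuclear(Y, tau):
        # prox of tau*||.||_*: soft-threshold the singular values of Y
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        return (U * np.maximum(s - tau, 0.0)) @ Vt

    Y = np.random.randn(8, 5)
    print(np.linalg.matrix_rank(prox_nuclear(Y, 1.0)))   # rank drops when singular values fall below tau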

25. Atomic Norms: A Unified View

26. Another Use of Fenchel-Legendre Conjugates
The original problem: min_x f(x) + ψ(x). Often this has the form min_x g(Ax) + ψ(x).
Using the definition of the conjugate, g(Ax) = sup_u u^T Ax - g*(u), so
  min_x g(Ax) + ψ(x) = inf_x sup_u u^T Ax - g*(u) + ψ(x)
   = sup_u (-g*(u)) + inf_x [u^T Ax + ψ(x)]
   = sup_u (-g*(u)) - sup_x [-x^T A^T u - ψ(x)]   (the last sup is ψ*(-A^T u))
   = -inf_u [g*(u) + ψ*(-A^T u)].
The problem inf_u g*(u) + ψ*(-A^T u) is sometimes easier to handle.
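
As a concrete instance (not on the slides): with g(z) = (1/2)‖z - b‖²₂, A = B, and ψ(x) = τ‖x‖₁ (the ℓ2-ℓ1 problem treated below), one has g*(u) = (1/2)‖u‖²₂ + b^T u and ψ*(-B^T u) = ι_{{u: ‖B^T u‖∞ ≤ τ}}(u), so the dual is the bound-constrained quadratic min_u (1/2)‖u‖²₂ + b^T u subject to ‖B^T u‖∞ ≤ τ.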

27. Basic Proximal-Gradient Algorithm
Use the basic structure
  x_{k+1} = arg min_x (1/2) ‖x - Φ(x_k)‖²₂ + α_k ψ(x),
with Φ(x_k) a simple gradient descent step, thus
  x_{k+1} = prox_{α_k ψ}( x_k - α_k ∇f(x_k) ).
This approach goes by different names, such as "proximal gradient algorithm" (PGA), "iterative shrinkage/thresholding" (IST or ISTA), "forward-backward splitting" (FBS). It has been proposed in different communities: optimization, PDEs, convex analysis, signal processing, machine learning.
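
A minimal sketch of the IST/ISTA/FBS iteration for the ℓ2-ℓ1 case f(x) = (1/2)‖Bx - b‖²₂, ψ = ‖·‖₁, with a fixed step α = 1/L where L = ‖B‖²₂; the synthetic data at the end are only for illustration:

    import numpy as np

    def soft(y, t):
        return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

    def ista(B, b, tau, iters=500):
        L = np.linalg.norm(B, 2) ** 2                # Lipschitz constant of the gradient
        alpha = 1.0 / L
        x = np.zeros(B.shape[1])
        for _ in range(iters):
            grad = B.T @ (B @ x - b)
            x = soft(x - alpha * grad, alpha * tau)  # prox of alpha*tau*||.||_1
        return x

    rng = np.random.default_rng(0)
    B = rng.standard_normal((40, 100))
    x_true = np.zeros(100); x_true[:5] = 1.0
    b = B @ x_true
    print(np.count_nonzero(ista(B, b, tau=0.1)))     # support size of the (approximately sparse) solution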

28. Convergence of the Proximal-Gradient Algorithm
Basic algorithm: x_{k+1} = prox_{α_k ψ}( x_k - α_k ∇f(x_k) ).
Generalized (possibly inexact) version:
  x_{k+1} = (1 - λ_k) x_k + λ_k [ prox_{α_k ψ}( x_k - α_k ∇f(x_k) + b_k ) + a_k ],
where a_k and b_k are "errors" in computing the prox and the gradient, and λ_k is an over-relaxation parameter.
Convergence is guaranteed (Combettes and Wajs, 2006) if
  0 < inf α_k ≤ sup α_k < 2/L;
  λ_k ∈ (0, 1], with inf λ_k > 0;
  Σ_k ‖a_k‖ < ∞ and Σ_k ‖b_k‖ < ∞.

29. Proximal-Gradient Algorithm: Quadratic Case
Consider the quadratic case (of great interest): f(x) = (1/2) ‖Bx - b‖²₂.
Here, ∇f(x) = B^T (Bx - b) and the IST/PGA/FBS algorithm is
  x_{k+1} = prox_{α_k ψ}( x_k - α_k B^T (B x_k - b) ),
which requires only matrix-vector multiplications with B and B^T. Very important in signal/image processing, where fast algorithms exist to compute these products (e.g. FFT, DWT).
In this case, some more refined convergence results are available; even more refined results are available if ψ(x) = τ‖x‖₁.

30. More on IST/FBS/PGA for the ℓ2-ℓ1 Case
Problem: x̂ ∈ G = arg min_{x ∈ R^n} (1/2) ‖Bx - b‖²₂ + τ‖x‖₁ (recall B^T B ⪯ L I).
IST/FBS/PGA becomes x_{k+1} = soft( x_k - α B^T (B x_k - b), ατ ), with α < 2/L.
The zero set Z ⊆ {1, ..., n}: x̂ ∈ G ⇒ x̂_Z = 0. Zeros are found in a finite number of iterations (Hale et al., 2008): after a finite number of iterations, (x_k)_Z = 0.
After that, if B_Z̄^T B_Z̄ ⪰ μI with μ > 0 (thus κ(B_Z̄^T B_Z̄) = L/μ < ∞), we have linear convergence
  ‖x_{k+1} - x̂‖₂ ≤ ((κ - 1)/(κ + 1)) ‖x_k - x̂‖₂
for the optimal choice α = 2/(L + μ) (see the unconstrained theory).

31. Heavy Ball Acceleration: FISTA
FISTA (fast iterative shrinkage-thresholding algorithm) is a heavy-ball-type acceleration of IST (based on Nesterov (1983)) (Beck and Teboulle, 2009).
Initialize: Choose α ≤ 1/L and x_0; set y_1 = x_0, t_1 = 1.
Iterate: x_k ← prox_{ατψ}( y_k - α ∇f(y_k) );
  t_{k+1} ← (1/2) [1 + √(1 + 4 t_k²)];
  y_{k+1} ← x_k + ((t_k - 1)/t_{k+1}) (x_k - x_{k-1}).
Acceleration: FISTA: f(x_k) - f(x̂) ∼ O(1/k²); IST: f(x_k) - f(x̂) ∼ O(1/k).
When L is not known, increase an estimate of L until it's big enough.
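
A sketch of FISTA for the same ℓ2-ℓ1 problem, assuming L is known (the backtracking variant mentioned above is omitted):

    import numpy as np

    def soft(y, t):
        return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

    def fista(B, b, tau, iters=500):
        L = np.linalg.norm(B, 2) ** 2
        alpha = 1.0 / L
        x = np.zeros(B.shape[1])
        y, t = x.copy(), 1.0
        for _ in range(iters):
            grad = B.T @ (B @ y - b)
            x_prev, x = x, soft(y - alpha * grad, alpha * tau)
            t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
            y = x + ((t - 1.0) / t_next) * (x - x_prev)      # momentum extrapolation
            t = t_next
        return x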

32. Heavy Ball Acceleration: TwIST
TwIST (two-step iterative shrinkage-thresholding (Bioucas-Dias and Figueiredo, 2007)) is a heavy-ball-type acceleration of IST, for
  min_x (1/2) ‖Bx - b‖²₂ + τ ψ(x).
Iterations (with α < 2/L):
  x_{k+1} = (γ - β) x_k + (1 - γ) x_{k-1} + β prox_{ατψ}( x_k - α B^T (B x_k - b) ).
Analysis in the strongly convex case: μI ⪯ B^T B ⪯ LI, with μ > 0. Condition number κ = L/μ < ∞.
Optimal parameters γ = ρ² + 1 and β = 2α/(μ + L), where ρ = (√κ - 1)/(√κ + 1), yield linear convergence
  ‖x_{k+1} - x̂‖₂ ≤ ((√κ - 1)/(√κ + 1)) ‖x_k - x̂‖₂,
versus ((κ - 1)/(κ + 1)) for IST.
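
A sketch of the TwIST two-step iteration for ψ = τ‖·‖₁; γ, β, and α are passed in as parameters (in practice they would be set from estimates of μ and L as above), and the first iterate is produced by one plain IST step:

    import numpy as np

    def soft(y, t):
        return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

    def twist(B, b, tau, gamma, beta, alpha, iters=300):
        x_prev = np.zeros(B.shape[1])
        x = soft(x_prev - alpha * B.T @ (B @ x_prev - b), alpha * tau)     # one IST step to start
        for _ in range(iters):
            p = soft(x - alpha * B.T @ (B @ x - b), alpha * tau)           # thresholded gradient step
            x_prev, x = x, (gamma - beta) * x + (1.0 - gamma) * x_prev + beta * p
        return x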

33. Illustration of the TwIST Acceleration

34. Acceleration via Larger Steps: SpaRSA
The standard step size α_k < 2/L in IST is too timid.
The SpaRSA (sparse reconstruction by separable approximation) framework proposes bolder choices of α_k (Wright et al., 2009):
Barzilai-Borwein, to mimic Newton steps;
keep increasing α_k until monotonicity is violated: backtrack.
Convergence to critical points (minima in the convex case) is guaranteed for a safeguarded version: ensure sufficient decrease w.r.t. the worst value in the previous M (say 10) iterations.
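
A sketch of the SpaRSA idea for the ℓ2-ℓ1 problem: a BB-style step and a nonmonotone acceptance test against the worst of the last M objective values; the details only loosely follow Wright et al. (2009):

    import numpy as np

    def soft(y, t):
        return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

    def sparsa(B, b, tau, iters=200, M=10, sigma=1e-4):
        obj = lambda x: 0.5 * np.sum((B @ x - b) ** 2) + tau * np.abs(x).sum()
        x = np.zeros(B.shape[1])
        g = B.T @ (B @ x - b)
        alpha, history = 1.0, [obj(x)]
        for _ in range(iters):
            while True:
                x_new = soft(x - alpha * g, alpha * tau)
                d = x_new - x
                # accept if sufficiently below the worst of the last M objective values
                if obj(x_new) <= max(history[-M:]) - 0.5 * (sigma / alpha) * d.dot(d):
                    break
                alpha *= 0.5                                    # backtrack: smaller step
            g_new = B.T @ (B @ x_new - b)
            s, z = x_new - x, g_new - g
            alpha = s.dot(s) / max(s.dot(z), 1e-12)             # BB step for the next iteration
            x, g = x_new, g_new
            history.append(obj(x))
        return x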

35. Another Approach: Gradient Projection
min_x (1/2) ‖Bx - b‖²₂ + τ‖x‖₁ can be written as a standard QP:
  min_{u,v} (1/2) ‖B(u - v) - b‖²₂ + τ u^T 1 + τ v^T 1   s.t. u ≥ 0, v ≥ 0,
where u_i = max{0, x_i} and v_i = max{0, -x_i}.
With z = [u; v], the problem can be written in canonical form:
  min_z (1/2) z^T Q z + c^T z   s.t. z ≥ 0.
Solving this problem with projected gradient using Barzilai-Borwein steps: GPSR (gradient projection for sparse reconstruction) (Figueiredo et al., 2007).
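
A bare-bones sketch of the splitting x = u - v with projected-gradient steps and a BB step length (an illustration only, not the full GPSR method of Figueiredo et al. (2007)):

    import numpy as np

    def gpsr_like(B, b, tau, iters=200):
        n = B.shape[1]
        u, v = np.zeros(n), np.zeros(n)                  # x = u - v with u, v >= 0
        alpha = 1.0
        for _ in range(iters):
            r = B @ (u - v) - b
            gu = B.T @ r + tau                           # gradient of the QP w.r.t. u
            gv = -B.T @ r + tau                          # gradient of the QP w.r.t. v
            u_new = np.maximum(u - alpha * gu, 0.0)      # projected gradient step onto z >= 0
            v_new = np.maximum(v - alpha * gv, 0.0)
            d = (u_new - u) - (v_new - v)
            Qd = B.T @ (B @ d)                           # Q s, exploiting the block structure of Q
            s = np.concatenate([u_new - u, v_new - v])
            z = np.concatenate([Qd, -Qd])
            alpha = s.dot(s) / max(s.dot(z), 1e-12)      # BB step for the next iteration
            u, v = u_new, v_new
        return u - v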

36. Speed Comparisons
Lorenz (2011) proposed a way of generating problem instances with known solution x̂: useful for speed comparisons. Define
  R_k = ‖x_k - x̂‖₂ / ‖x̂‖₂  and  r_k = (L(x_k) - L(x̂)) / L(x̂),  where L(x) = f(x) + τ ψ(x).

37. More Speed Comparisons

38. Even More Speed Comparisons

39. Acceleration by Continuation
IST/FBS/PGA can be very slow if τ is very small and/or f is poorly conditioned.
A very simple acceleration strategy: continuation/homotopy.
Initialization: Set τ_0 ≫ τ, starting point x̄, factor σ ∈ (0, 1), and k = 0.
Iterations: Find an approximate solution x(τ_k) of min_x f(x) + τ_k ψ(x), starting from x̄;
  if τ_k = τ, STOP;
  set τ_{k+1} ← max(τ, στ_k) and x̄ ← x(τ_k).
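
A sketch of the continuation loop wrapped around a fixed-iteration IST inner solver, warm-started at the previous solution; the choice of τ_0 and the inner iteration count are illustrative:

    import numpy as np

    def soft(y, t):
        return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

    def ist_inner(B, b, tau, x0, alpha, iters=50):
        x = x0.copy()
        for _ in range(iters):
            x = soft(x - alpha * B.T @ (B @ x - b), alpha * tau)
        return x

    def continuation(B, b, tau, sigma=0.5, inner_iters=50):
        alpha = 1.0 / np.linalg.norm(B, 2) ** 2
        tau_k = max(tau, 0.1 * np.abs(B.T @ b).max())    # tau_0 >> tau (illustrative choice)
        x = np.zeros(B.shape[1])
        while True:
            x = ist_inner(B, b, tau_k, x, alpha, inner_iters)   # approximate solve, warm start
            if tau_k == tau:
                return x
            tau_k = max(tau, sigma * tau_k)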
