Proximal methods

S. Villa

21st October 2013


0.1 Review of the basics

Often machine learning problems require the solution of minimization problems. For instance, the ERM algorithm requires solving a problem of the form

min_{c ∈ R^d} ‖y − Kc‖²,

for various choices of the loss function. Another typical problem is the regularized one, e.g. Tikhonov regularization, where for linear kernels one looks for

min_{w ∈ R^d} (1/n) Σ_{i=1}^n V(w, x_i, y_i) + λ R(w).

More generally, we are interested in solving a minimization problem

min_{w ∈ R^d} F(w).
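The Tikhonov problem above can be written down directly. Below is a minimal sketch with the square loss; the data X, y and the parameter lam are synthetic and illustrative, not part of the notes:

```python
import numpy as np

# Synthetic data (illustrative): n samples, d features.
rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.standard_normal((n, d))
w_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ w_true + 0.1 * rng.standard_normal(n)
lam = 0.1

def tikhonov_objective(w):
    # (1/n) sum_i V(w, x_i, y_i) + lam * R(w), with the square loss and R = ||.||^2
    return np.mean((X @ w - y) ** 2) + lam * np.dot(w, w)

# For this choice the minimizer has a closed form: setting the gradient
# to zero gives (X^T X / n + lam I) w = X^T y / n.
w_star = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

# The minimizer beats any other candidate, e.g. w = 0.
assert tikhonov_objective(w_star) <= tikhonov_objective(np.zeros(d))
```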

We review the basic concepts needed to study the problem.

Existence of a minimizer. We will consider extended real valued functions F : R^d → R ∪ {+∞}. The domain of F is dom F = {w ∈ R^d : F(w) < +∞}. We say that F is proper if its domain is nonempty. It is useful to consider extended valued functions since they allow constraints to be included in the regularization term. F is lower semicontinuous if epi F is closed. F is coercive if lim_{‖w‖→+∞} F(w) = +∞.

Theorem 0.1.1. If F is proper, lower semicontinuous and coercive, then there exists w* such that F(w*) = min F.

We will always assume that the functions we consider are lower semicontinuous.

0.1.1 Convexity concepts

Convexity. F is convex if

(∀w, w′ ∈ dom F)(∀λ ∈ [0, 1])  F(λw + (1 − λ)w′) ≤ λF(w) + (1 − λ)F(w′).

If F is differentiable, we can write an equivalent characterization of convexity based on the gradient:

(∀w, w′ ∈ R^d)  F(w′) ≥ F(w) + ⟨∇F(w), w′ − w⟩.

If F is twice differentiable with Hessian ∇²F, convexity is equivalent to ∇²F(w) being positive semidefinite for all w ∈ R^d. If a function is convex and differentiable, then ∇F(w) = 0 implies that w is a global minimizer.

Strict convexity. F is strictly convex if

(∀w, w′ ∈ dom F, w ≠ w′)(∀λ ∈ (0, 1))  F(λw + (1 − λ)w′) < λF(w) + (1 − λ)F(w′).

If F is differentiable, we can write an equivalent characterization of strict convexity based on the gradient:

(∀w, w′ ∈ R^d, w ≠ w′)  F(w′) > F(w) + ⟨∇F(w), w′ − w⟩.

If F is twice differentiable, strict convexity is implied by ∇²F(w) being positive definite for all w ∈ R^d. The minimizer of a strictly convex function is unique (if it exists).
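These characterizations can be checked numerically on the square loss F(w) = ‖y − Xw‖², whose Hessian is the constant matrix 2XᵀX. A small sketch with synthetic data (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 4
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

F = lambda w: np.sum((y - X @ w) ** 2)
grad = lambda w: 2 * X.T @ (X @ w - y)

# Second-order characterization: the Hessian 2 X^T X is positive semidefinite.
H = 2 * X.T @ X
assert np.all(np.linalg.eigvalsh(H) >= -1e-10)

# First-order characterization: F lies above all of its tangents.
for _ in range(10):
    w, wp = rng.standard_normal(d), rng.standard_normal(d)
    assert F(wp) >= F(w) + grad(w) @ (wp - w) - 1e-8
```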


Strong convexity. F is µ-strongly convex if the function F − (µ/2)‖·‖² is convex, i.e.

(∀w, w′ ∈ dom F)(∀λ ∈ [0, 1])  F(λw + (1 − λ)w′) ≤ λF(w) + (1 − λ)F(w′) − (µ/2) λ(1 − λ) ‖w − w′‖².

If F is differentiable, strong convexity is equivalent to

(∀w, w′ ∈ R^d)  F(w′) ≥ F(w) + ⟨∇F(w), w′ − w⟩ + (µ/2)‖w′ − w‖².

If F is twice differentiable, strong convexity is equivalent to ∇²F(w) ⪰ µI for all w ∈ R^d. If F is strongly convex then it is coercive; therefore, if it is also lower semicontinuous, it admits a unique minimizer. Moreover

F(w) − F(w*) ≥ (µ/2)‖w − w*‖².

We will often assume Lipschitz continuity of the gradient,

‖∇F(w) − ∇F(w′)‖ ≤ L‖w − w′‖.

This gives a useful quadratic upper bound on F:

F(w′) ≤ F(w) + ⟨∇F(w), w′ − w⟩ + (L/2)‖w′ − w‖²  (∀w, w′ ∈ dom F).  (1)

Moreover, for every w ∈ dom F and every minimizer w*,

(1/2L)‖∇F(w)‖² ≤ F(w) − F(w*) ≤ (L/2)‖w − w*‖².

The second inequality follows by substituting w = w* and w′ = w in the quadratic upper bound; the first by substituting w′ = w − (1/L)∇F(w).
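For the square loss these bounds can be verified numerically, taking L = 2λ_max(XᵀX). A sketch on synthetic data (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 30, 4
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

F = lambda w: np.sum((y - X @ w) ** 2)
grad = lambda w: 2 * X.T @ (X @ w - y)

# For F(w) = ||y - Xw||^2 the gradient is Lipschitz with L = 2 * lambda_max(X^T X).
L = 2 * np.linalg.eigvalsh(X.T @ X).max()
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)

for _ in range(5):
    w = rng.standard_normal(d)
    wp = rng.standard_normal(d)
    # quadratic upper bound (1)
    assert F(wp) <= F(w) + grad(w) @ (wp - w) + L / 2 * np.sum((wp - w) ** 2) + 1e-8
    # sandwich between gradient norm and distance to the minimizer
    assert np.sum(grad(w) ** 2) / (2 * L) <= F(w) - F(w_star) + 1e-8
    assert F(w) - F(w_star) <= L / 2 * np.sum((w - w_star) ** 2) + 1e-8
```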

0.2 Convergence of the gradient method with constant step-size

Assume F to be convex and differentiable, with L-Lipschitz continuous gradient, and that a minimizer exists. The first order necessary condition is ∇F(w) = 0, therefore

w* − α∇F(w*) = w*.

This suggests an algorithm based on the fixed point iteration

w_{k+1} = w_k − α∇F(w_k).

We want to study convergence of this algorithm. Convergence can be intended in two senses: towards the minimum value, or towards a minimizer. We start from the first one. There are different strategies to choose the stepsize; here we keep α fixed and determine a priori conditions guaranteeing convergence. From the quadratic upper bound (1) we get

F(w_{k+1}) ≤ F(w_k) − α‖∇F(w_k)‖² + (Lα²/2)‖∇F(w_k)‖² = F(w_k) − α(1 − (L/2)α)‖∇F(w_k)‖².


If 0 < α < 2/L the iteration decreases the function value. Choose α = 1/L (which maximizes the guaranteed decrease) and get

F(w_{k+1}) ≤ F(w_k) − (1/2L)‖∇F(w_k)‖²
 ≤ F(w*) + ⟨∇F(w_k), w_k − w*⟩ − (1/2L)‖∇F(w_k)‖²
 = F(w*) + (L/2)(‖w_k − w*‖² − ‖w_k − (1/L)∇F(w_k) − w*‖²)
 = F(w*) + (L/2)(‖w_k − w*‖² − ‖w_{k+1} − w*‖²),

where the second line uses convexity and the third follows by expanding the squares. Summing for k = 0, …, K − 1, the right-hand side telescopes:

Σ_{k=0}^{K−1} (F(w_{k+1}) − F(w*)) ≤ (L/2)(‖w_0 − w*‖² − ‖w_K − w*‖²) ≤ (L/2)‖w_0 − w*‖².

Noting that (F(w_k)) is decreasing, F(w_K) − F(w*) ≤ F(w_{k+1}) − F(w*) for every k ≤ K − 1, therefore

F(w_K) − F(w*) ≤ (L/2K)‖w_0 − w*‖².

This is called a sublinear rate of convergence. For strongly convex functions, it is possible to prove that the operator I − α∇F is a contraction, and therefore we get a linear convergence rate:

‖w_K − w*‖² ≤ ((L − µ)/(L + µ))^{2K} ‖w_0 − w*‖²,

which gives, using the bound following (1),

F(w_K) − F(w*) ≤ (L/2) ((L − µ)/(L + µ))^{2K} ‖w_0 − w*‖²,

which is much better. It is known that for general convex problems with Lipschitz continuous gradient, the performance of any first order method is lower bounded by a quantity of order 1/k². Nesterov in 1983 devised an algorithm attaining this lower bound. The algorithm is called accelerated gradient descent and is very similar to the gradient method, but it needs to store two iterates instead of only one. It is of the form

w_{k+1} = u_k − (1/L)∇F(u_k)
u_{k+1} = a_k w_k + b_k w_{k+1},

for some w_0 ∈ dom F, with u_1 = w_0 and a suitable (a priori determined) sequence of parameters a_k and b_k. More precisely: choose w_0 ∈ dom F, set u_1 = w_0 and t_1 = 1, and define

w_{k+1} = u_k − (1/L)∇F(u_k)
t_{k+1} = (1 + √(1 + 4t_k²)) / 2
u_{k+1} = w_{k+1} + ((t_k − 1)/t_{k+1}) (w_{k+1} − w_k).


We obtain

F(w_k) − F(w*) ≤ L‖w_0 − w*‖² / (2k²).
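Both schemes take only a few lines on a smooth problem such as least squares. The sketch below (synthetic data, illustrative names) checks the O(1/k) bound for plain gradient descent and, for the accelerated scheme, the standard bound 2L‖w_0 − w*‖²/(k + 1)²:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 40, 10
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

F = lambda w: np.sum((y - X @ w) ** 2)
grad = lambda w: 2 * X.T @ (X @ w - y)
L = 2 * np.linalg.eigvalsh(X.T @ X).max()
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
F_star = F(w_star)

K = 200
w0 = np.zeros(d)

# Plain gradient descent with constant step 1/L.
w = w0.copy()
for k in range(K):
    w = w - grad(w) / L
gd_gap = F(w) - F_star

# Accelerated (Nesterov) scheme with the t_k sequence from the notes.
wk = w0.copy()
u = w0.copy()
t = 1.0
for k in range(K):
    w_next = u - grad(u) / L
    t_next = (1 + np.sqrt(1 + 4 * t * t)) / 2
    u = w_next + (t - 1) / t_next * (w_next - wk)
    wk, t = w_next, t_next
acc_gap = F(wk) - F_star

# Both satisfy their theoretical bounds.
assert gd_gap <= L * np.sum((w0 - w_star) ** 2) / (2 * K) + 1e-8
assert acc_gap <= 2 * L * np.sum((w0 - w_star) ** 2) / (K + 1) ** 2 + 1e-8
```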

0.3 Regularized optimization

We often want to minimize

min_{w ∈ R^d} F(w) + R(w),

where either F is smooth (e.g. the square loss) and R is convex and nonsmooth, or R is smooth and F is not (as for the SVM). We would like to write a condition similar to ∇F = 0 to characterize a minimizer. We use the subdifferential. Let R be a convex, lower semicontinuous, proper function. A vector η ∈ R^d is a subgradient of R at w if

R(w′) ≥ R(w) + ⟨η, w′ − w⟩  for all w′ ∈ R^d.

The subdifferential ∂R(w) is the set of all subgradients at w. It is easy to see that

R(w*) = min R ⇔ 0 ∈ ∂R(w*).

If R is differentiable, the subdifferential is a singleton whose unique element is the gradient.

Examples. 1) Indicator function of a convex set C (constrained regularization). Let w ∉ C. Then ∂i_C(w) = ∅. If w ∈ C, then η ∈ ∂i_C(w) if and only if, for all v ∈ C,

i_C(v) − i_C(w) ≥ ⟨η, v − w⟩ ⇔ 0 ≥ ⟨η, v − w⟩.

The set of such η is the normal cone to C at w.

2) Subdifferential of R(w) = ‖w‖₁.

By definition, η ∈ ∂R(w) if and only if, for all v ∈ R^d,

Σ_{i=1}^d |v_i| − Σ_{i=1}^d |w_i| ≥ ⟨η, v − w⟩.

If η is such that for all i = 1, …, d

|v_i| − |w_i| ≥ η_i (v_i − w_i),

then η ∈ ∂R(w), since summing over i gives the inequality above. Vice versa, taking v_j = w_j for all j ≠ i, we get that η ∈ ∂R(w) implies |v_i| − |w_i| ≥ η_i (v_i − w_i), and thus η_i ∈ ∂|·|(w_i). We therefore proved that

∂R(w) = ∂|·|(w_1) × ⋯ × ∂|·|(w_d).

Proximity operator. Let R be lower semicontinuous, convex and proper. Then

prox_R(v) = argmin_{w ∈ R^d} { R(w) + (1/2)‖w − v‖² }

is well-defined and single-valued, since the objective is strongly convex. Imposing the first order conditions, we get

u = prox_R(v) ⇔ 0 ∈ ∂R(u) + (u − v) ⇔ v − u ∈ ∂R(u) ⇔ u = (I + ∂R)^{−1}(v).

Examples. If R = 0 then prox_R(v) = v. If R = i_C then prox_R(v) = P_C(v), the projection onto C.

Proximity operator of the ℓ1 norm. Let v ∈ R^d and u = prox_R(v) with R = ‖·‖₁. Then v − u ∈ ∂‖·‖₁(u). Since the subdifferential can be computed componentwise, the same holds for the prox: u = (I + ∂R)^{−1}(v) can be computed one coordinate at a time. By the previous example,

((I + ∂R)(v))_i =
  v_i + 1   if v_i > 0,
  [−1, 1]   if v_i = 0,
  v_i − 1   if v_i < 0.


Inverting the previous relationship we get

(prox_{‖·‖₁}(u))_i =
  u_i − 1   if u_i > 1,
  0         if u_i ∈ [−1, 1],
  u_i + 1   if u_i < −1.
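In code this is the soft-thresholding operator. The sketch below also checks it against the definition of the prox by comparing with random candidate points:

```python
import numpy as np

def soft_threshold(v, lam=1.0):
    # prox of lam * ||.||_1, computed componentwise:
    # shrink each entry toward zero by lam, zeroing it inside [-lam, lam]
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

v = np.array([2.5, -0.3, 1.0, -4.0, 0.0])
u = soft_threshold(v)
assert np.allclose(u, [1.5, 0.0, 0.0, -3.0, 0.0])

# Check the prox definition: u minimizes ||w||_1 + 0.5 * ||w - v||^2.
obj = lambda w: np.sum(np.abs(w)) + 0.5 * np.sum((w - v) ** 2)
rng = np.random.default_rng(0)
for _ in range(100):
    w = v + rng.standard_normal(v.shape)
    assert obj(u) <= obj(w) + 1e-12
```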

0.4 Basic proximal algorithm (forward-backward splitting)

Assume that F is convex and differentiable with Lipschitz continuous gradient, and that R is convex, lower semicontinuous and proper. As for gradient descent, the idea is to start from a fixed point equation characterizing the minimizers. Writing the first order conditions, for α > 0 we get

0 ∈ ∇F(w*) + ∂R(w*) ⇔ −α∇F(w*) ∈ α ∂R(w*) ⇔ (w* − α∇F(w*)) − w* ∈ ∂(αR)(w*) ⇔ w* = prox_{αR}(w* − α∇F(w*)).

We consider the fixed point iteration

w_{k+1} = prox_{α_k R}(w_k − α_k ∇F(w_k)).

Another interpretation:

w_{k+1} = argmin_w { α_k R(w) + (1/2)‖w − (w_k − α_k ∇F(w_k))‖² }
       = argmin_w { R(w) + (1/2α_k)‖w − w_k‖² + ⟨∇F(w_k), w − w_k⟩ + F(w_k) }.

Special cases: R = 0 (gradient method) and R = i_C (projected gradient method). The proof of convergence for the sequence of objective values with α_k = 1/L is similar to the proof of convergence in the differentiable case, and the rate of convergence is the same (this would not be the case if a subgradient method were used):

(F + R)(w_k) − (F + R)(w*) ≤ L‖w_0 − w*‖² / (2k).

Convergence proof. Set α_k = 1/L and define the "gradient mapping"

G_{1/L}(w) = L (w − prox_{R/L}(w − (1/L)∇F(w))),

so that w_{k+1} = w_k − (1/L) G_{1/L}(w_k). Note that G_{1/L}(w) is not a gradient or a subgradient of F + R. Writing the first order condition for the prox operator, we get

G_{1/L}(w) ∈ ∇F(w) + ∂R(w − (1/L) G_{1/L}(w)).

Recalling the upper bound (1), we obtain

F(w − (1/L) G_{1/L}(w)) ≤ F(w) − (1/L)⟨∇F(w), G_{1/L}(w)⟩ + (1/2L)‖G_{1/L}(w)‖².  (2)
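With F(w) = ‖y − Xw‖² and R = λ‖·‖₁, the iteration above is the ISTA algorithm for the lasso. A minimal sketch on synthetic data (all names illustrative), checking the fixed point property at the end:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 60, 20
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.5, 1.0]               # sparse ground truth
y = X @ w_true + 0.01 * rng.standard_normal(n)
lam = 0.5

grad = lambda w: 2 * X.T @ (X @ w - y)      # gradient of F(w) = ||y - Xw||^2
L = 2 * np.linalg.eigvalsh(X.T @ X).max()   # Lipschitz constant of grad F
alpha = 1 / L

def prox_l1(v, t):
    # prox of t * ||.||_1 (soft-thresholding)
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

w = np.zeros(d)
for _ in range(2000):
    # forward (gradient) step on F, backward (prox) step on lam * ||.||_1
    w = prox_l1(w - alpha * grad(w), alpha * lam)

# A minimizer is exactly a fixed point of the iteration.
assert np.linalg.norm(w - prox_l1(w - alpha * grad(w), alpha * lam)) < 1e-6
```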


If inequality (2) holds, then for every v ∈ R^d:

(F + R)(w − (1/L) G_{1/L}(w)) ≤ (F + R)(v) + ⟨G_{1/L}(w), w − v⟩ − (1/2L)‖G_{1/L}(w)‖²,

and the proof proceeds as in the smooth case, with G_{1/L} playing the role of the gradient.

Accelerated versions can be obtained as for the gradient method. The limitation is that the forward-backward algorithm is effective only when the prox is easy to compute. Note indeed that we replaced our original problem with a sequence of new minimization problems; they are strongly convex (therefore easier), but in general not solvable in closed form.

0.5 Fenchel conjugate and Moreau decomposition

Fenchel conjugate. The Fenchel conjugate of R is the function R* : R^d → R ∪ {+∞} defined as

R*(η) = sup_{w ∈ R^d} { ⟨η, w⟩ − R(w) }.

R* is a convex function (even if R is not), since it is the pointwise supremum of affine functions.

Examples.
1. Conjugate of an affine function. If R(w) = ⟨a, w⟩ + b, then R*(η) = −b if η = a and +∞ otherwise, i.e. R* = i_{{a}} − b.
2. Conjugate of an indicator function i_C: it is the support function of C, η ↦ sup_{w ∈ C} ⟨η, w⟩.
3. Conjugate of a norm R(w) = ‖w‖. Define the dual norm ‖η‖* = sup_{‖w‖≤1} ⟨η, w⟩. Then R* = i_{B*(1)}, the indicator of the dual unit ball (for the ℓ1 norm, the dual norm is the ℓ∞ norm).

To prove 3, let η ∉ B*(1). Then ‖η‖* = sup_{‖w‖≤1} ⟨η, w⟩ > 1, therefore there exists w̄ with ‖w̄‖ ≤ 1 such that ⟨η, w̄⟩ > 1. Thus

R*(η) = sup_{w ∈ R^d} ⟨η, w⟩ − ‖w‖ ≥ ⟨η, w̄⟩ − ‖w̄‖ > 0.

Now, taking w = t w̄ and letting t → +∞, we derive R*(η) = +∞. On the other hand, if η ∈ B*(1), then ⟨η, w⟩ ≤ ‖w‖ for every w, and thus

R*(η) = sup_{w ∈ R^d} ⟨η, w⟩ − ‖w‖ ≤ 0.

Taking w = 0 we obtain R*(η) = 0.

By definition, R*(η) + R(w) ≥ ⟨η, w⟩ for all η, w ∈ R^d (Fenchel-Young inequality). Moreover,

R(w) + R*(η) = ⟨η, w⟩ ⇔ η ∈ ∂R(w) ⇔ w ∈ ∂R*(η).

For the first equivalence, note that R*(η) = ⟨η, w⟩ − R(w) iff ⟨η, w′⟩ − R(w′) ≤ ⟨η, w⟩ − R(w) for every w′, iff η ∈ ∂R(w). For the second, from R(w) + R*(η) = ⟨w, η⟩ we get

R*(η′) = sup_u ⟨η′, u⟩ − R(u) ≥ ⟨η′, w⟩ − R(w) = ⟨η′ − η, w⟩ + ⟨η, w⟩ − R(w) = ⟨η′ − η, w⟩ + R*(η),

i.e. w ∈ ∂R*(η). If R is lower semicontinuous and convex, then R** = R, which gives the converse implication.
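The computation for the ℓ1 norm can be checked numerically coordinate by coordinate, since the conjugate of a separable function is separable. A sketch over a bounded grid, which can only witness the +∞ case as a value growing with the grid radius:

```python
import numpy as np

# 1-D grid with step 0.01 that contains 0 exactly (integer grid scaled).
grid = np.arange(-5000, 5001) * 0.01

def conj_abs(eta_i):
    # Approximate conjugate of |.| at eta_i: sup_w eta_i * w - |w| over the grid.
    return np.max(eta_i * grid - np.abs(grid))

# Inside the dual (l_inf) unit ball the sup is attained at w = 0 and equals 0.
assert conj_abs(0.7) == 0.0
assert conj_abs(-1.0) == 0.0
# Outside it grows with the grid radius (it is +infinity in the limit).
assert conj_abs(1.5) > 20.0
```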


Moreau decomposition:

w = prox_R(w) + prox_{R*}(w).

It follows from the properties of the subdifferential and of the conjugate stated above:

u = prox_R(w) ⇔ w − u ∈ ∂R(u) ⇔ u ∈ ∂R*(w − u) ⇔ w − (w − u) ∈ ∂R*(w − u) ⇔ w − u = prox_{R*}(w).

Note that this is a generalization of the classical decomposition into orthogonal components: if V is a linear subspace and V⊥ its orthogonal complement, we know that w = P_V(w) + P_{V⊥}(w). This is the special case of the Moreau decomposition obtained by choosing R = i_V (and noting that R* = i_{V⊥}).

Properties of the proximity operator: examples.

Separable sum: if R(w) = R₁(w₁) + R₂(w₂), then prox_R(w) = (prox_{R₁}(w₁), prox_{R₂}(w₂)).

Scaling: prox_{R + (µ/2)‖·‖²}(v) = prox_{(1/(1+µ)) R}(v/(1 + µ)).

"Generalized" Moreau decomposition: for every λ > 0,

w = prox_{λR}(w) + λ prox_{R*/λ}(w/λ).

Sometimes the Moreau decomposition is useful to compute proximity operators. Let R(w) = λ‖w‖. We have seen that R* = i_{B*(λ)}, where B*(λ) is the ball of radius λ in the dual norm. Therefore, from the Moreau decomposition, we get prox_R(w) = w − P_{B*(λ)}(w). In particular, if R = λ‖·‖₁, noting that the dual of ‖·‖₁ is ‖·‖∞, we obtain again the soft-thresholding formula seen before. (Combined with the scaling rule, this also gives the proximity operator of the elastic-net penalty λ‖·‖₁ + (µ/2)‖·‖².)

Group lasso. Let G = {G₁, …, G_t} be a partition of the indices {1, …, d}. The following norm is called the group lasso penalty:

R(w) = Σ_{i=1}^t ‖w_{G_i}‖,  where ‖w_{G_i}‖² = Σ_{j ∈ G_i} w_j².

The dual norm is

max_{j=1,…,t} ‖w_{G_j}‖,

and therefore prox_{λR}(w) = w − P_{B*(λ)}(w), where

B*(λ) = {w ∈ R^d : ‖w_{G_j}‖ ≤ λ, ∀j = 1, …, t}.

The projection onto this set can be expressed blockwise as

(P_{B*(λ)}(w))_{G_j} =
  w_{G_j}                 if ‖w_{G_j}‖ ≤ λ,
  λ w_{G_j} / ‖w_{G_j}‖   otherwise.
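The decomposition and the blockwise projection are easy to check numerically. A sketch for R = λ‖·‖₁ (whose dual ball is the ℓ∞ ball) and for a hypothetical group partition:

```python
import numpy as np

def prox_l1(v, lam):
    # prox of lam * ||.||_1: soft-thresholding
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def project_linf_ball(v, lam):
    # Projection onto B*(lam) = {u : ||u||_inf <= lam}, the dual ball of lam * ||.||_1
    return np.clip(v, -lam, lam)

rng = np.random.default_rng(5)
w = rng.standard_normal(8) * 3
lam = 1.0

# Moreau decomposition: w = prox_R(w) + prox_{R*}(w), with R = lam * ||.||_1
# and prox_{R*} = projection onto the l_inf ball of radius lam.
assert np.allclose(w, prox_l1(w, lam) + project_linf_ball(w, lam))

def prox_group_lasso(v, groups, lam):
    # Blockwise: shrink each group's norm by lam (zero if the norm is below lam).
    u = np.zeros_like(v)
    for g in groups:
        nrm = np.linalg.norm(v[g])
        if nrm > lam:
            u[g] = (1 - lam / nrm) * v[g]
    return u

groups = [[0, 1, 2], [3, 4], [5, 6, 7]]   # hypothetical partition of the indices
u = prox_group_lasso(w, groups, lam)
# Check against the Moreau identity: prox_{lam*R}(w) = w - P_{B*(lam)}(w),
# where the projection scales each oversized group back to norm lam.
for g in groups:
    nrm = np.linalg.norm(w[g])
    proj = w[g] if nrm <= lam else lam * w[g] / nrm
    assert np.allclose(u[g], w[g] - proj)
```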


0.6 References

P. L. Combettes and J.-C. Pesquet, Proximal splitting methods in signal processing, 2009.
P. L. Combettes and V. R. Wajs, Signal recovery by proximal forward-backward splitting, Multiscale Model. Simul., 2005.
Y. Nesterov, A basic course in optimization.
A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sciences, 2009.