Harnessing Structure in Optimization for Machine Learning


  1. Harnessing Structure in Optimization for Machine Learning
     Franck Iutzeler, LJK, Univ. Grenoble Alpes
     Optimization for Machine Learning, CIRM, 9-13 March 2020

  2. >>> Regularization in Learning
     Structure Regularization
     Linear inverse problems: for a chosen regularization r, we seek
         x⋆ ∈ arg min r(x)   such that   Ax = b,
     with e.g. sparsity (r = ‖·‖_1), anti-sparsity (r = ‖·‖_∞), low rank (r = ‖·‖_*).
     Regularized Empirical Risk Minimization problem:
         Find x⋆ ∈ arg min_{x ∈ R^n} R(x; {a_i, b_i}_{i=1}^m) + λ r(x)
     where R is obtained from the chosen statistical modeling and r is a chosen regularization.
     e.g. Lasso:
         Find x⋆ ∈ arg min_{x ∈ R^n} (1/2) Σ_{i=1}^m (a_i^⊤ x − b_i)^2 + λ ‖x‖_1
     Regularization can improve statistical properties (generalization, stability, ...).
     ⋄ Tibshirani: Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society (1996)
     ⋄ Tibshirani et al.: Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society (2004)
     ⋄ Vaiter, Peyré, Fadili: Model consistency of partly smooth regularizers. IEEE Trans. on Information Theory (2017)
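As a companion to the slide above, here is a minimal NumPy sketch (not part of the original slides) of the Lasso objective viewed as a regularized ERM instance; the name `lasso_objective`, the synthetic data, and the value of `lam` are illustrative assumptions.

```python
import numpy as np

def lasso_objective(x, A, b, lam):
    """Regularized ERM for the Lasso: (1/2) * ||A x - b||^2 + lam * ||x||_1."""
    residual = A @ x - b
    return 0.5 * residual @ residual + lam * np.abs(x).sum()

# Illustrative synthetic instance: m observations a_i (rows of A), n features,
# and a sparse ground-truth signal so that the l1 regularization is meaningful.
rng = np.random.default_rng(0)
m, n = 50, 100
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[:5] = rng.standard_normal(5)                 # only 5 active coordinates
b = A @ x_true + 0.01 * rng.standard_normal(m)

print(lasso_objective(np.zeros(n), A, b, lam=0.1))  # objective value at x = 0
```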

  3. >>> Optimization for Machine Learning
     Composite minimization
         Find x⋆ ∈ arg min_{x ∈ R^n} R(x; {a_i, b_i}_{i=1}^m) + λ r(x)
         Find x⋆ ∈ arg min_{x ∈ R^n} f(x) + g(x)          (f smooth, g non-smooth)
     > f: differentiable surrogate of the empirical risk ⇒ Gradient
       a non-linear smooth function that depends on all the data
     > g: non-smooth but chosen regularization ⇒ Proximity operator
         prox_{γg}(u) = arg min_{y ∈ R^n} { g(y) + (1/(2γ)) ‖y − u‖^2 }
       - non-differentiability on some manifolds implies structure on the solutions
       - closed form / easy to compute for many regularizations:
         g(x) = ‖x‖_1,   g(x) = TV(x),   g(x) = indicator_C(x)
     Natural optimization method: proximal gradient
         u^{k+1} = x^k − γ ∇f(x^k)
         x^{k+1} = prox_{γg}(u^{k+1})
     and its stochastic variants: proximal SGD, etc.
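A minimal sketch of the proximal gradient iteration described above, assuming the user supplies a gradient oracle for f and a prox oracle for g. The helper `make_lasso_oracles` and the lambda-based oracles are illustrative choices, not the speaker's code.

```python
import numpy as np

def proximal_gradient(grad_f, prox_g, x0, gamma, n_iters=500):
    """Proximal gradient: u = x - gamma * grad_f(x), then x = prox_{gamma g}(u)."""
    x = x0.copy()
    for _ in range(n_iters):
        u = x - gamma * grad_f(x)   # explicit gradient step on the smooth part f
        x = prox_g(u, gamma)        # proximal step on the non-smooth part g
    return x

# A possible pairing for the Lasso of the previous slide:
#   f(x) = (1/2) ||Ax - b||^2  ->  grad_f(x) = A^T (Ax - b)
#   g(x) = lam * ||x||_1       ->  prox is coordinate-wise soft-thresholding
def make_lasso_oracles(A, b, lam):
    grad_f = lambda x: A.T @ (A @ x - b)
    prox_g = lambda u, gamma: np.sign(u) * np.maximum(np.abs(u) - gamma * lam, 0.0)
    return grad_f, prox_g
```

For the Lasso, the prox oracle used here is exactly the coordinate-wise soft-thresholding detailed on the next slide.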

  4. >>> Structure, Non-differentiability, and Proximity operator
     Example: LASSO
         Find x⋆ ∈ arg min_{x ∈ R^n} R(x; {a_i, b_i}_{i=1}^m) + λ r(x)
         Find x⋆ ∈ arg min_{x ∈ R^n} (1/2) ‖Ax − b‖^2 + λ ‖x‖_1          (smooth + non-smooth)
     Structure ↔ Optimality conditions (per coordinate):
         x⋆_i = 0   ⇔   A_i^⊤ (Ax⋆ − b) ∈ [−λ, λ]          for all i
     Proximity operator (per coordinate, soft-thresholding):
         prox_{γλ‖·‖_1}(u)_i = u_i − λγ   if u_i > λγ
                             = 0          if u_i ∈ [−λγ, λγ]
                             = u_i + λγ   if u_i < −λγ
     Proximal Gradient (aka ISTA):
         u^{k+1} = x^k − γ A^⊤ (Ax^k − b)
         x^{k+1} = prox_{γλ‖·‖_1}(u^{k+1})
     [Figure: graph of the soft-thresholding operator; the interval [−λγ, λγ] is mapped to {0} on each coordinate]
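The closed-form prox above is easy to check in code. Below is a small sketch, assuming NumPy, that implements both the per-coordinate three-case formula and its usual vectorized form and verifies that they agree; the test vector and threshold are arbitrary.

```python
import numpy as np

def soft_threshold_coord(u_i, tau):
    """Per-coordinate prox of tau*|.|: the three cases of the slide."""
    if u_i > tau:
        return u_i - tau
    elif u_i < -tau:
        return u_i + tau
    return 0.0   # u_i in [-tau, tau] is mapped exactly to 0: this is what creates sparsity

def soft_threshold(u, tau):
    """Vectorized form: prox_{tau ||.||_1}(u) = sign(u) * max(|u| - tau, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

u = np.array([2.0, 0.3, -1.5, -0.1])
assert np.allclose(soft_threshold(u, 0.5), [soft_threshold_coord(v, 0.5) for v in u])
```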

  5. >>> Structure, Non-differentiability, and Proximity operator
     Example: LASSO
         Find x⋆ ∈ arg min_{x ∈ R^n} R(x; {a_i, b_i}_{i=1}^m) + λ r(x)
         Find x⋆ ∈ arg min_{x ∈ R^n} (1/2) ‖Ax − b‖^2 + λ ‖x‖_1          (smooth + non-smooth)
     Structure ↔ Optimality conditions ↔ Proximity operation (per coordinate):
         x⋆_i = 0   ⇔   A_i^⊤ (Ax⋆ − b) ∈ [−λ, λ]   ⇔   prox_{γλ‖·‖_1}(u⋆)_i = 0,
         where u⋆ = x⋆ − γ A^⊤ (Ax⋆ − b).
     Proximity operator (per coordinate, soft-thresholding):
         prox_{γλ‖·‖_1}(u)_i = u_i − λγ   if u_i > λγ
                             = 0          if u_i ∈ [−λγ, λγ]
                             = u_i + λγ   if u_i < −λγ
     Proximal Gradient (aka ISTA):
         u^{k+1} = x^k − γ A^⊤ (Ax^k − b)
         x^{k+1} = prox_{γλ‖·‖_1}(u^{k+1})
     [Figure: 2D illustration of Proximal Gradient iterates converging to x⋆]
     The iterates (x^k) reach the same structure as x⋆ in finite time!
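The finite-time identification claim can be observed numerically. The sketch below is an illustration, not the experiment shown on the slide: it runs ISTA on a synthetic Lasso instance and records the sparsity pattern of each iterate, using the support of the last iterate as a stand-in for that of x⋆; all problem sizes and constants are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, lam, n_iters = 40, 80, 0.5, 300
x_true = np.concatenate([rng.standard_normal(4), np.zeros(n - 4)])   # 4-sparse signal
A = rng.standard_normal((m, n))
b = A @ x_true + 0.05 * rng.standard_normal(m)
gamma = 1.0 / np.linalg.norm(A, 2) ** 2        # step size 1/L with L = ||A||_2^2

def soft_threshold(u, tau):
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

x = np.zeros(n)
supports = []
for k in range(n_iters):
    x = soft_threshold(x - gamma * A.T @ (A @ x - b), gamma * lam)   # one ISTA step
    supports.append(frozenset(map(int, np.flatnonzero(x))))          # sparsity pattern at iteration k

# The support (the "structure") typically freezes long before the values converge;
# here the final support is used as a proxy for the support of x_star.
first_stable = next(k for k in range(n_iters) if all(s == supports[-1] for s in supports[k:]))
print("support identified at iteration", first_stable, "->", sorted(supports[-1]))
```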

  6. >>> Mathematical properties of Proximal Algorithms
     Proximal Algorithms:
         u^{k+1} = x^k − γ ∇f(x^k)
         x^{k+1} = prox_{γg}(u^{k+1})
     [Figure: 2D illustration of Proximal Gradient iterates converging to x⋆]
     > They project on manifolds.
       Let M be a manifold and u^k be such that
           x^k = prox_{γg}(u^k) ∈ M   and   (u^k − x^k)/γ ∈ ri ∂g(x^k).
       If g is partly smooth at x^k relative to M, then prox_{γg}(u) ∈ M for any u close to u^k.
     ⋄ Hare, Lewis: Identifying active constraints via partial smoothness and prox-regularity. Journal of Convex Analysis (2004)
     ⋄ Daniilidis, Hare, Malick: Geometrical interpretation of the predictor-corrector type algorithms in structured optimization problems. Optimization (2006)

  7. >>> Mathematical properties of Proximal Algorithms
     Proximal Algorithms:
         u^{k+1} = x^k − γ ∇f(x^k)
         x^{k+1} = prox_{γg}(u^{k+1})
     [Figure: 2D illustration of Proximal Gradient iterates; u⋆ is mapped to x⋆ by soft-thresholding]
     > They project on manifolds.
     > They identify the optimal structure.
       Let (x^k) and (u^k) be a pair of sequences such that x^k = prox_{γg}(u^k) → x⋆ = prox_{γg}(u⋆), and let M be a manifold.
       If x⋆ ∈ M and
           ∃ ε > 0 such that prox_{γg}(u) ∈ M for all u ∈ B(u⋆, ε)          (QC)
       holds, then, after some finite but unknown time, x^k ∈ M.
     ⋄ Lewis: Active sets, nonsmoothness, and sensitivity. SIAM Journal on Optimization (2002)
     ⋄ Fadili, Malick, Peyré: Sensitivity analysis for mirror-stratifiable convex functions. SIAM Journal on Optimization (2018)
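For g = λ‖·‖_1 and M_i = {x : x_i = 0}, condition (QC) can be checked directly: whenever |u⋆_i| < γλ strictly, every u in a small ball around u⋆ is soft-thresholded to 0 on coordinate i. The sketch below is a small numerical illustration of this, with arbitrary values and an infinity-norm ball for simplicity.

```python
import numpy as np

def prox_l1(u, tau):
    """prox of tau*||.||_1: coordinate-wise soft-thresholding."""
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

# Illustrative check of (QC) for g = lam*||.||_1 and M_i = {x : x_i = 0}:
# if |u_star_i| < gamma*lam strictly, every u close to u_star is thresholded
# to 0 on coordinate i, so prox_{gamma g}(u) stays on M_i near u_star.
gamma, lam, eps = 0.1, 1.0, 0.02
u_star = np.array([0.8, 0.05, -0.03])   # coords 1 and 2 lie strictly inside [-gamma*lam, gamma*lam]

rng = np.random.default_rng(0)
for _ in range(1000):
    u = u_star + eps * rng.uniform(-1.0, 1.0, size=u_star.shape)  # a point of the inf-norm ball B(u_star, eps)
    x = prox_l1(u, gamma * lam)
    assert x[1] == 0.0 and x[2] == 0.0   # the structure is preserved on coordinates 1 and 2
```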

  8. >>> “Nonsmoothness can help”
     > Nonsmoothness is actively studied in Numerical Optimization...
       Subgradients, Partial Smoothness/prox-regularity, Bregman metrics, Error Bounds/Kurdyka-Łojasiewicz, etc.
     ⋄ Hare, Lewis: Identifying active constraints via partial smoothness and prox-regularity. Journal of Convex Analysis (2004)
     ⋄ Lemaréchal, Oustry, Sagastizábal: The U-Lagrangian of a convex function. Transactions of the AMS (2000)
     ⋄ Bolte, Daniilidis, Lewis: The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization (2007)
     ⋄ Chen, Teboulle: A proximal-based decomposition method for convex minimization problems. Mathematical Programming (1994)

  9. >>> “Nonsmoothness can help”
     > Nonsmoothness is actively studied in Numerical Optimization...
       Subgradients, Partial Smoothness/prox-regularity, Bregman metrics, Error Bounds/Kurdyka-Łojasiewicz, etc.
     > ...but it is often endured rather than exploited, for lack of structure or explicit expressions.
       Bundle methods, Gradient Sampling, Smoothing, Inexact proximal methods, etc.
     ⋄ Nesterov: Smooth minimization of non-smooth functions. Mathematical Programming (2005)
     ⋄ Burke, Lewis, Overton: A robust gradient sampling algorithm for nonsmooth, nonconvex optimization. SIAM Journal on Optimization (2005)
     ⋄ Solodov, Svaiter: A hybrid projection-proximal point algorithm. Journal of Convex Analysis (1999)
     ⋄ de Oliveira, Sagastizábal: Bundle methods in the XXIst century: A bird’s-eye view. Pesquisa Operacional (2014)

  10. >>> “Nonsmoothness can help”
      > Nonsmoothness is actively studied in Numerical Optimization...
        Subgradients, Partial Smoothness/prox-regularity, Bregman metrics, Error Bounds/Kurdyka-Łojasiewicz, etc.
      > ...but it is often endured rather than exploited, for lack of structure or explicit expressions.
        Bundle methods, Gradient Sampling, Smoothing, Inexact proximal methods, etc.
      > For Machine Learning objectives, it can often be harnessed:
        - explicit/“proximable” regularizations (ℓ1, nuclear norm);
        - we know the expressions and activity of the sought structures (sparsity, rank).
        See the talks of ...
      ⋄ Bach et al.: Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning (2012)
      ⋄ Massias, Salmon, Gramfort: Celer: a fast solver for the lasso with dual extrapolation. ICML (2018)
      ⋄ Liang, Fadili, Peyré: Local linear convergence of forward–backward under partial smoothness. NeurIPS (2014)
      ⋄ O’Donoghue, Candès: Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics (2015)

  11. >>> Noticeable Structure
          Find x⋆ ∈ arg min_{x ∈ R^n} R(x; {a_i, b_i}_{i=1}^m) + λ r(x)
          Find x⋆ ∈ arg min_{x ∈ R^n} f(x) + g(x)          (f smooth, g non-smooth)
      A reason why the nonsmoothness of ML problems can be leveraged is their noticeable structure, that is:
      we can design a lookout collection C = {M_1, ..., M_p} of closed sets such that
      (i)   we have a projection mapping proj_{M_i} onto M_i for all i;
      (ii)  prox_{γg}(u) is a singleton and can be computed explicitly for any u and γ;
      (iii) upon computation of x = prox_{γg}(u), we know whether x ∈ M_i or not, for all i.
      ⇒ Identification can be directly harnessed.
      Example: sparse structure and g = ‖·‖_1, ‖·‖_{0.5}^{0.5}, ‖·‖_0, ..., with M_i = {x ∈ R^n : x_i = 0} and C = {M_1, ..., M_n}.
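Here is a sketch of how such a lookout collection could be used in code for the sparse example, assuming NumPy; the helper names `active_manifolds` and `project_manifold` are hypothetical and only illustrate items (i)-(iii).

```python
import numpy as np

def prox_l1(u, tau):
    """Explicit prox of tau*||.||_1 (item (ii)): coordinate-wise soft-thresholding."""
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def active_manifolds(x, tol=0.0):
    """Item (iii) for the sparse collection M_i = {x : x_i = 0}: membership is read
    directly off the prox output, coordinate i is identified iff x_i = 0."""
    return set(int(i) for i in np.flatnonzero(np.abs(x) <= tol))

def project_manifold(x, identified):
    """Item (i): projection onto the intersection of the identified M_i, i.e. zero those coordinates."""
    y = x.copy()
    y[sorted(identified)] = 0.0
    return y

u = np.array([1.3, 0.02, -0.4, -0.05])
x = prox_l1(u, tau=0.1)
print(active_manifolds(x))                                    # {1, 3}: the prox output says which M_i the point lies on
print(project_manifold(np.array([1.0, 0.5, -2.0, 0.3]),       # project another point onto the identified structure
                       active_manifolds(x)))
```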
