Harnessing Structure in Optimization for Machine Learning, Franck Iutzeler



SLIDE 1

Harnessing Structure in Optimization for Machine Learning

Franck Iutzeler

LJK, Univ. Grenoble Alpes

Optimization for Machine Learning, CIRM, 9-13 March 2020

SLIDE 2

>>> Regularization in Learning

Structure ↔ Regularization: sparsity r = ‖·‖₁; anti-sparsity r = ‖·‖∞; low rank r = ‖·‖∗; ...

Linear inverse problems: for a chosen regularization, we seek

x⋆ ∈ arg min_x r(x)  such that  Ax = b

Regularized Empirical Risk Minimization problem: Find

x⋆ ∈ arg min_{x ∈ Rn} R(x; {ai, bi}_{i=1}^m) + λ r(x)

where R is obtained from the chosen statistical modeling and r is the chosen regularization.

e.g. Lasso: Find

x⋆ ∈ arg min_{x ∈ Rn} Σ_{i=1}^m ½ (ai⊤x − bi)² + λ ‖x‖₁

Regularization can improve statistical properties (generalization, stability, ...).

⋄ Tibshirani: Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society (1996)
⋄ Tibshirani et al.: Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society (2004)
⋄ Vaiter, Peyré, Fadili: Model consistency of partly smooth regularizers. IEEE Trans. on Information Theory (2017)

1 / 18

SLIDE 3

>>> Optimization for Machine Learning Composite minimization

Find x⋆ ∈ arg min_{x ∈ Rn} R(x; {ai, bi}_{i=1}^m) + λ r(x)   ⇔   Find x⋆ ∈ arg min_{x ∈ Rn} f(x) + g(x)

with f smooth and g non-smooth.

> f: differentiable surrogate of the empirical risk ⇒ Gradient
  a non-linear smooth function that depends on all the data
> g: non-smooth but chosen regularization ⇒ Proximity operator
  non-differentiability on some manifolds implies structure on the solutions

proxγg(u) = arg min_{y ∈ Rn} { g(y) + 1/(2γ) ‖y − u‖₂² }

  • closed form/easy for many regularizations: g(x) = ‖x‖₁, g(x) = TV(x), g(x) = indicatorC(x)

Natural optimization method: proximal gradient
  uk+1 = xk − γ∇f(xk)
  xk+1 = proxγg(uk+1)
and its stochastic variants: proximal SGD, etc.
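The proximal gradient iteration above is short enough to sketch directly. Below is a minimal Python/NumPy version (not from the slides): grad_f and prox_g are user-supplied, and the least-squares/ℓ1 instantiation at the bottom is purely illustrative.

    import numpy as np

    def proximal_gradient(grad_f, prox_g, x0, gamma, n_iter=500):
        # uk+1 = xk - gamma * grad f(xk) ; xk+1 = prox_{gamma g}(uk+1)
        x = x0.copy()
        for _ in range(n_iter):
            u = x - gamma * grad_f(x)
            x = prox_g(u, gamma)
        return x

    # Illustrative instantiation: f(x) = 0.5 * ||Ax - b||^2, g(x) = lam * ||x||_1
    rng = np.random.default_rng(0)
    A, b, lam = rng.normal(size=(20, 50)), rng.normal(size=20), 0.1
    grad_f = lambda x: A.T @ (A @ x - b)
    prox_l1 = lambda u, g: np.sign(u) * np.maximum(np.abs(u) - lam * g, 0.0)   # soft-thresholding
    gamma = 1.0 / np.linalg.norm(A, 2) ** 2                                    # step <= 1/L with L = ||A||^2
    x_hat = proximal_gradient(grad_f, prox_l1, np.zeros(50), gamma)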

2 / 18

SLIDE 4

>>> Structure, Non-differentiability, and Proximity operator Example: LASSO

Find x⋆ ∈ arg min_{x ∈ Rn} R(x; {ai, bi}_{i=1}^m) + λ r(x)   ⇒   Find x⋆ ∈ arg min_{x ∈ Rn} ½ ‖Ax − b‖₂² + λ ‖x‖₁

with the quadratic part smooth and the ℓ1 part non-smooth.

Coordinates: Structure ↔ Optimality conditions

  ∀i:  x⋆i = 0  ⇔  Ai⊤(Ax⋆ − b) ∈ [−λ, λ]

Proximity Operator: per coordinate

  [proxγλ‖·‖₁(u)]i = ui − λγ if ui > λγ;  0 if ui ∈ [−λγ, λγ];  ui + λγ if ui < −λγ

Proximal Gradient (aka ISTA):

  uk+1 = xk − γA⊤(Axk − b)
  xk+1 = proxγλ‖·‖₁(uk+1)

[Figure: soft-thresholding of |·|, mapping [−1, 1] to {0} on each coordinate]
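The three cases of the per-coordinate formula translate directly into code. A small illustrative sketch follows (the coordinate loop is kept explicit to mirror the slide; the usual vectorized sign/max shortcut computes the same thing).

    import numpy as np

    def prox_l1(u, gamma, lam):
        # [prox_{gamma*lam*||.||_1}(u)]_i, case by case as on the slide
        out = np.empty_like(u)
        for i, ui in enumerate(u):
            if ui > lam * gamma:
                out[i] = ui - lam * gamma      # shrink from above
            elif ui < -lam * gamma:
                out[i] = ui + lam * gamma      # shrink from below
            else:
                out[i] = 0.0                   # the whole interval [-lam*gamma, lam*gamma] maps to 0
        return out

    print(prox_l1(np.array([-2.0, -0.3, 0.0, 0.7, 3.0]), gamma=1.0, lam=1.0))
    # -> [-1.  0.  0.  0.  2.]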

3 / 18

SLIDE 5

>>> Structure, Non-differentiability, and Proximity operator Example: LASSO

Find x⋆ ∈ arg min_{x ∈ Rn} R(x; {ai, bi}_{i=1}^m) + λ r(x)   ⇒   Find x⋆ ∈ arg min_{x ∈ Rn} ½ ‖Ax − b‖₂² + λ ‖x‖₁

with the quadratic part smooth and the ℓ1 part non-smooth.

Coordinates: Structure ↔ Optimality conditions ↔ Proximity operation

  ∀i:  x⋆i = 0  ⇔  Ai⊤(Ax⋆ − b) ∈ [−λ, λ]  ⇔  [proxγλ‖·‖₁(u⋆)]i = 0,  where u⋆ = x⋆ − γA⊤(Ax⋆ − b)

  [proxγλ‖·‖₁(u)]i = ui − λγ if ui > λγ;  0 if ui ∈ [−λγ, λγ];  ui + λγ if ui < −λγ

Proximal Gradient (aka ISTA):

  uk+1 = xk − γA⊤(Axk − b)
  xk+1 = proxγλ‖·‖₁(uk+1)

[Figure: contour plot of the objective with the Proximal Gradient iterates converging to x⋆]

Iterates (xk) reach the same structure as x⋆ in finite time!
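To see this finite-time identification numerically, here is a small illustrative sketch (synthetic data; all names and sizes are assumptions): it runs ISTA and records the last iteration at which the support of xk changed.

    import numpy as np

    rng = np.random.default_rng(1)
    n, m, lam = 50, 30, 0.5
    A = rng.normal(size=(m, n))
    x_true = np.zeros(n); x_true[:3] = [2.0, -1.5, 1.0]           # sparse ground truth
    b = A @ x_true + 0.01 * rng.normal(size=m)

    gamma = 1.0 / np.linalg.norm(A, 2) ** 2
    soft = lambda u, t: np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

    x = np.zeros(n)
    support, last_change = np.zeros(n, dtype=bool), 0
    for k in range(2000):
        x = soft(x - gamma * A.T @ (A @ x - b), gamma * lam)      # ISTA step
        new_support = x != 0
        if not np.array_equal(new_support, support):
            support, last_change = new_support, k
    print("support frozen after iteration", last_change, "with", support.sum(), "nonzeros")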

3 / 18

SLIDE 6

>>> Mathematical properties of Proximal Algorithms Proximal Algorithms:

  uk+1 = xk − γ∇f(xk)
  xk+1 = proxγg(uk+1)

[Figure: contour plot of the objective with the Proximal Gradient iterates converging to x⋆]

> project on manifolds

Let M be a manifold and uk be such that xk = proxγg(uk) ∈ M and (uk − xk)/γ ∈ ri ∂g(xk).
If g is partly smooth at xk relative to M, then proxγg(u) ∈ M for any u close to uk.

⋄ Hare, Lewis: Identifying active constraints via partial smoothness and prox-regularity. Journal of Convex Analysis (2004)
⋄ Daniilidis, Hare, Malick: Geometrical interpretation of the predictor-corrector type algorithms in structured optimization problems. Optimization (2006)

4 / 18

SLIDE 7

>>> Mathematical properties of Proximal Algorithms Proximal Algorithms:

  uk+1 = xk − γ∇f(xk)
  xk+1 = proxγg(uk+1)

[Figure: contour plot with the Proximal Gradient iterates converging to x⋆, and the soft-thresholding of u⋆ onto x⋆]

> project on manifolds
> identify the optimal structure

Let (xk) and (uk) be a pair of sequences such that xk = proxγg(uk) → x⋆ = proxγg(u⋆), and let M be a manifold. If x⋆ ∈ M and
  (QC)  ∃ε > 0 such that proxγg(u) ∈ M for all u ∈ B(u⋆, ε)
holds, then, after some finite but unknown time, xk ∈ M.
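For g = λ‖·‖₁ the condition (QC) can be checked in closed form: the prox is coordinatewise soft-thresholding, so it maps a whole neighborhood of u⋆ into M = {x : xi = 0, i ∈ I} exactly when the thresholded coordinates of u⋆ lie strictly inside [−γλ, γλ]. A small illustrative sketch of that check, assuming u⋆ is available:

    import numpy as np

    def qc_holds_l1(u_star, gamma, lam, tol=1e-12):
        # zero pattern produced by prox_{gamma*lam*||.||_1} at u_star
        x_star = np.sign(u_star) * np.maximum(np.abs(u_star) - gamma * lam, 0.0)
        zero = (x_star == 0.0)
        # (QC) for the sparse manifolds: every u near u_star is thresholded to the
        # same zeros, i.e. those entries of u_star are strictly below the threshold
        return bool(np.all(np.abs(u_star[zero]) < gamma * lam - tol))

    print(qc_holds_l1(np.array([2.0, 0.3, -0.1]), gamma=1.0, lam=1.0))   # True: strict zeros
    print(qc_holds_l1(np.array([2.0, 1.0, -0.1]), gamma=1.0, lam=1.0))   # False: one entry sits exactly at the threshold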

⋄ Lewis: Active sets, nonsmoothness, and sensitivity. SIAM Journal on Optimization (2002)
⋄ Fadili, Malick, Peyré: Sensitivity analysis for mirror-stratifiable convex functions. SIAM Journal on Optimization (2018)

4 / 18

SLIDE 8

>>> “Nonsmoothness can help”

> Nonsmoothness is actively studied in Numerical Optimization...

Subgradients, Partial Smoothness/prox-regularity, Bregman metrics, Error Bounds/Kurdyka-Łojasiewicz, etc.

⋄ Hare, Lewis: Identifying active constraints via partial smoothness and prox-regularity. Journal of Convex Analysis (2004)
⋄ Lemarechal, Oustry, Sagastizabal: The U-Lagrangian of a convex function. Transactions of the AMS (2000)
⋄ Bolte, Daniilidis, Lewis: The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization (2007)
⋄ Chen, Teboulle: A proximal-based decomposition method for convex minimization problems. Mathematical Programming (1994)

5 / 18

SLIDE 9

>>> “Nonsmoothness can help”

> Nonsmoothness is actively studied in Numerical Optimization...

Subgradients, Partial Smoothness/prox-regularity, Bregman metrics, Error Bounds/Kurdyka-Łojasiewicz, etc.

> ...but it is often suffered rather than exploited, due to a lack of structure/explicit expressions.

Bundle methods, Gradient Sampling, Smoothing, Inexact proximal methods, etc.

⋄ Nesterov: Smooth minimization of non-smooth functions. Mathematical Programming (2005)
⋄ Burke, Lewis, Overton: A robust gradient sampling algorithm for nonsmooth, nonconvex optimization. SIAM Journal on Optimization (2005)
⋄ Solodov, Svaiter: A hybrid projection-proximal point algorithm. Journal of Convex Analysis (1999)
⋄ de Oliveira, Sagastizábal: Bundle methods in the XXIst century: A bird’s-eye view. Pesquisa Operacional (2014)

5 / 18

SLIDE 10

>>> “Nonsmoothness can help”

> Nonsmoothness is actively studied in Numerical Optimization...

Subgradients, Partial Smoothness/prox-regularity, Bregman metrics, Error Bounds/Kurdyka-Łojasiewicz, etc.

> ...but it is often suffered rather than exploited, due to a lack of structure/explicit expressions.

Bundle methods, Gradient Sampling, Smoothing, Inexact proximal methods, etc.

> For Machine Learning objectives, it can often be harnessed:
  • Explicit/“proximable” regularizations (ℓ1, nuclear norm)
  • We know the expressions and activity of the sought structures (sparsity, rank)

See the talks of ...

⋄ Bach et al.: Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning (2012)
⋄ Massias, Salmon, Gramfort: Celer: a fast solver for the lasso with dual extrapolation. ICML (2018)
⋄ Liang, Fadili, Peyré: Local linear convergence of forward–backward under partial smoothness. NeurIPS (2014)
⋄ O’Donoghue, Candes: Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics (2015)

5 / 18

SLIDE 11

>>> Noticeable Structure

Find x⋆ ∈ arg min_{x ∈ Rn} R(x; {ai, bi}_{i=1}^m) + λ r(x)   ⇔   Find x⋆ ∈ arg min_{x ∈ Rn} f(x) + g(x)

with f smooth and g non-smooth.

A reason why the nonsmoothness of ML problems can be leveraged is their noticeable structure, that is: we can design a lookout collection C = {M1, .., Mp} of closed sets such that:
(i) we have a projection mapping projMi onto Mi for all i;
(ii) proxγg(u) is a singleton and can be computed explicitly for any u and γ;
(iii) upon computation of x = proxγg(u), we know whether x ∈ Mi or not, for all i.
⇒ Identification can be directly harnessed.

Example: Sparse structure and g = ‖·‖₁, ‖·‖_{0.5}^{0.5}, ‖·‖₀, ...
C = {M1, . . . , Mn} with Mi = {x ∈ Rn : xi = 0}
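For this sparse lookout collection, the three points are one-liners. An illustrative sketch (the names are mine) of the projection onto Mi, the explicit prox, and the membership test (iii) after a prox computation:

    import numpy as np

    def proj_Mi(x, i):
        # projection onto Mi = {x : x_i = 0}: zero out coordinate i
        y = x.copy(); y[i] = 0.0
        return y

    def prox_l1(u, t):
        # point (ii): a singleton, explicit for any u and t
        return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

    def active_manifolds(x):
        # point (iii): after computing x = prox(u), we know exactly which Mi contain x
        return {i for i in range(x.size) if x[i] == 0.0}

    x = prox_l1(np.array([0.3, -2.0, 0.05, 1.4]), t=0.5)
    print(active_manifolds(x))   # -> {0, 2}: coordinates set exactly to zero by the prox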

6 / 18

SLIDE 12

>>> Question

Lookout collection C = {M1, .., Mp} of closed sets such that:
(i) we have a projection mapping projMi onto Mi for all i;
(ii) proxγg(u) is a singleton and can be computed explicitly for any u and γ;
(iii) upon computation of x = proxγg(u), we know whether x ∈ Mi or not, for all i.
(QC) ∃ε > 0 such that proxγg(u) ∈ M for all u ∈ B(u⋆, ε)

Take any proximal algorithm
  uk+1 = Update(f; {xℓ}ℓ≤k; {uℓ}ℓ≤k; γ)
  xk+1 = proxγg(uk+1)
(prox-Update) such that (uk) converges almost surely to a point u⋆ with x⋆ = proxγg(u⋆) a solution of the problem.

Let's use the structure. What can we do on the way to identification, or when screening is inefficient?
(not close to x⋆, no explicit or bad dual (non-convex), proxγg(uk) difficult to evaluate)

7 / 18

SLIDE 13

>>> Question

Lookout collection C = {M1, .., Mp} of closed sets such that:
(i) we have a projection mapping projMi onto Mi for all i;
(ii) proxγg(u) is a singleton and can be computed explicitly for any u and γ;
(iii) upon computation of x = proxγg(u), we know whether x ∈ Mi or not, for all i.
(QC) ∃ε > 0 such that proxγg(u) ∈ M for all u ∈ B(u⋆, ε)

Take any proximal algorithm
  uk+1 = Update(f; {xℓ}ℓ≤k; {uℓ}ℓ≤k; γ)
  xk+1 = proxγg(uk+1)
(prox-Update) such that (uk) converges almost surely to a point u⋆ with x⋆ = proxγg(u⋆) a solution of the problem.

Define Mk := Rn ∩ ⋂_{i : xk ∈ Mi} Mi and M⋆ := Rn ∩ ⋂_{i : x⋆ ∈ Mi} Mi. Then:
  Mk ⊂ Rn : partial identification/screening,   and   Mk = M⋆ after some finite time : identification.

1. Observing Mk can help reduce the dimension of the problem along the way. Can we efficiently restrict Update using Mk?
2. The structure uncovered along the way bears valuable information. Does accelerated proximal gradient identify as well as vanilla?

7 / 18

SLIDE 14

ADAPTIVE SUBSPACE DESCENT
INTERPLAY BETWEEN ACCELERATION AND IDENTIFICATION

SLIDE 15

>>> Adaptive Proximal Gradient ADAPTIVE DESCENT
Disclaimer: this part of the talk assumes that the identified manifolds are linear subspaces, e.g. ‖Dx‖₁.

  yk = xk − γ∇f(xk)
  zk = yk
  xk+1 = proxγg(zk)

> Vanilla Proximal Gradient identifies but does not use it: the full gradient is computed at each iteration.

Example: Sparse structure and g = ‖·‖₁; C = {M1, . . . , Mn} with Mi = {x ∈ Rn : xi = 0}; Mk = {x ∈ Rn : xi = 0 for all i such that xk,i = 0}

8 / 18

SLIDE 16

>>> Adaptive Proximal Gradient ADAPTIVE DESCENT
Disclaimer: this part of the talk assumes that the identified manifolds are linear subspaces, e.g. ‖Dx‖₁.

  Observe Mk = Rn ∩ ⋂_{i : xk ∈ Mi} Mi
  yk = xk − γ∇f(xk)
  zk = projMk(yk) + proj⊥Mk(zk−1)
  xk+1 = proxγg(zk)

> Direct use of identification may not converge, e.g. when starting from 0.

Example: Sparse structure and g = ‖·‖₁; C = {M1, . . . , Mn} with Mi = {x ∈ Rn : xi = 0}; Mk = {x ∈ Rn : xi = 0 for all i such that xk,i = 0}

8 / 18

SLIDE 17

>>> Adaptive Proximal Gradient ADAPTIVE DESCENT
Disclaimer: this part of the talk assumes that the identified manifolds are linear subspaces, e.g. ‖Dx‖₁.

  Observe Mk = Rn ∩ ⋂_{i : xk ∈ Mi} (ξk,i Mi + (1 − ξk,i) Rn) with ξk,i ∼ B(p)
  yk = xk − γ∇f(xk)
  zk = projMk(yk) + proj⊥Mk(zk−1)
  xk+1 = proxγg(zk)

> Mixing identification and randomized coordinate descent biases the gradient: convergence issues.

Example: Sparse structure and g = ‖·‖₁; C = {M1, . . . , Mn} with Mi = {x ∈ Rn : xi = 0}; Mk = {x ∈ Rn : xi = 0 for a random subset of the i such that xk,i = 0}
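A rough sketch of one such randomized update for the sparse case, under my reading of the slide's notation (a currently-zero coordinate stays constrained with probability p and is freed otherwise); the Qk correction of the next slide is omitted, so this is the biased variant whose convergence issues are mentioned here. All names are illustrative.

    import numpy as np

    def adaptive_step(x, z_prev, grad_f, prox_g, gamma, p, rng):
        # One iteration of the randomized update sketched on this slide (simplified).
        zero = (x == 0.0)
        constrained = zero & (rng.random(x.size) < p)   # zero coordinates kept in Mk with prob. p
        y = x - gamma * grad_f(x)                       # yk = xk - gamma * grad f(xk)
        z = np.where(constrained, z_prev, y)            # zk = proj_Mk(yk) + proj_Mk^perp(z_{k-1})
        return prox_g(z, gamma), z

    # Illustrative usage on a tiny lasso instance:
    rng = np.random.default_rng(0)
    A, b, lam = rng.normal(size=(20, 40)), rng.normal(size=20), 0.3
    soft = lambda u, t: np.sign(u) * np.maximum(np.abs(u) - lam * t, 0.0)
    gamma = 1.0 / np.linalg.norm(A, 2) ** 2
    x = z = np.zeros(40)
    for _ in range(300):
        x, z = adaptive_step(x, z, lambda v: A.T @ (A @ v - b), soft, gamma, p=0.5, rng=rng)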

8 / 18

SLIDE 18

>>> Adaptive Proximal Gradient ADAPTIVE DESCENT
Disclaimer: this part of the talk assumes that the identified manifolds are linear subspaces, e.g. ‖Dx‖₁.

  Observe Mk = Rn ∩ ⋂_{i : xk ∈ Mi} (ξk,i Mi + (1 − ξk,i) Rn) with ξk,i ∼ B(p)
  yk = xk − γ∇f(xk)
  zk = Qk^{-1} (projMk(Qk yk) + proj⊥Mk(zk−1))
  xk+1 = proxγg(zk)

> With Qk := (E projMk)^{−1/2}, this works after identification, but before... no, which prevents identification...

TV-regularized logistic regression:
[Figure: iterate structural sparsity and suboptimality vs. iteration, adapting at every iteration vs. as in theory]

8 / 18

SLIDE 19

>>> Adaptive Proximal Gradient ADAPTIVE DESCENT
Disclaimer: this part of the talk assumes that the identified manifolds are linear subspaces, e.g. ‖Dx‖₁.

  Observe Mk = Rn ∩ ⋂_{i : xℓ ∈ Mi} (ξk,i Mi + (1 − ξk,i) Rn) with ξk,i ∼ B(p)
  yk = xk − γ∇f(xk)
  zk = Qk^{-1} (projMk(Qk yk) + proj⊥Mk(zk−1))
  xk+1 = proxγg(zk)
  Check if an adaptation can be performed; if so, ℓ ← k + 1

> Generalized support: adaptation can be performed at some iterations; it depends on the amount of change Qk Qk+1^{-1} and on the harshness of the sparsification λmin(Qk).

TV-regularized logistic regression:
[Figure: iterate structural sparsity and suboptimality vs. iteration, adapting at every iteration vs. as in theory]

8 / 18

SLIDE 20

>>> Adaptive Subspace descent ADAPTIVE DESCENT TV-reg. logistic regression on a1a (1605 × 143), 90% final jump sparsity

[Figure: iterate density, suboptimality vs. iteration, and suboptimality vs. number of subspaces explored, for PGD, 20% sampling w/o identification, and ARPSD with 10%, 20%, 50% sampling]

> The iterate structure enforced by nonsmooth regularizers can be used to adapt the selection probabilities of coordinate descent/sketching;
> Before identification, adaptation has to be moderate.

⊲ Grishchenko, I., & Malick: Proximal Gradient Methods with Adaptive Subspace Sampling, in revision for Mathematics of Operations Research
  (available on my webpage, more details at SMAI MODE)

9 / 18

SLIDE 21

ADAPTIVE SUBSPACE DESCENT
INTERPLAY BETWEEN ACCELERATION AND IDENTIFICATION

SLIDE 22

>>> Acceleration of the Proximal Gradient ACCELERATION?

  uk+1 = yk − γ∇f(yk)
  xk+1 = proxγg(uk+1)
  yk+1 = xk+1 + αk+1(xk+1 − xk)   ← inertia/acceleration

> αk+1 ≡ 0 : vanilla Proximal Gradient
> αk+1 = (k−1)/(k+3) : accelerated Proximal Gradient (aka FISTA)

Optimal rate for composite problems (coefficients may vary a little):

                                 PG        Accel. PG
  F(xk) − F⋆                    O(1/k)     O(1/k²)
  iterates convergence          yes        yes
  monotone functional decrease  yes        no
  Fejér-monotone iterates       yes        no
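An illustrative sketch of the accelerated iteration with αk+1 = (k−1)/(k+3) in Python (names are mine; setting alpha to zero recovers the vanilla proximal gradient above):

    import numpy as np

    def accelerated_proximal_gradient(grad_f, prox_g, x0, gamma, n_iter=500):
        # FISTA-type scheme: uk+1 = yk - gamma * grad f(yk);
        # xk+1 = prox_{gamma g}(uk+1); yk+1 = xk+1 + alpha_{k+1} (xk+1 - xk)
        x = y = x0.copy()
        for k in range(1, n_iter + 1):
            x_new = prox_g(y - gamma * grad_f(y), gamma)
            alpha = (k - 1) / (k + 3)              # inertia coefficient from the slide
            y = x_new + alpha * (x_new - x)
            x = x_new
        return x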

⋄ Nesterov: A method for solving the convex programming problem with convergence rate O(1/k²). Dokladi A.N. SSSR (1983)
⋄ Beck, Teboulle: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences (2009)
⋄ Chambolle, Dossal: On the convergence of the iterates of “FISTA”. Journal of Optimization Theory and Applications (2015)
⋄ I., Malick: On the Proximal Gradient Algorithm with Alternated Inertia. Journal of Optimization Theory and Applications (2018)

10 / 18

SLIDE 23

>>> Interplay between Acceleration and Identification ACCELERATION?

min_{x ∈ R²} ‖Ax − b‖₂² + λ g(x)

[Figure: contour plots with the Proximal Gradient iterates converging to x⋆, for the two regularizers below]

Left: g(x) = ‖x‖₁ (ℓ1-norm regularization).   Right: g(x) = max(‖x‖_{1.3} − 1, 0) (distance to the 1.3-norm unit ball).

11 / 18

SLIDE 24

>>> Interplay between Acceleration and Identification ACCELERATION?

min_{x ∈ R²} ‖Ax − b‖₂² + λ g(x)

[Figure: contour plots with the Proximal Gradient and Accelerated Proximal Gradient iterates converging to x⋆ (and the manifold M on the right), for the two regularizers below]

Left: g(x) = ‖x‖₁ (ℓ1-norm regularization).   Right: g(x) = max(‖x‖_{1.3} − 1, 0) (distance to the 1.3-norm unit ball).

> PG identifies well;
> Accelerated PG explores well, identifies eventually, but erratically.

Can we converge fast and identify well?

11 / 18

SLIDE 25

>>> A test-based algorithm ACCELERATION?

T is a boolean function of the past iterates; it decides whether to accelerate or not.

  uk+1 = yk − γ∇f(yk)
  xk+1 = proxγg(uk+1)
  yk+1 = xk+1 + αk+1(xk+1 − xk)   if T = 1
  yk+1 = xk+1                     if T = 0

Proposed tests (using our lookout collection C):

  1. No acceleration, i.e. T1 = 0, when reaching a new manifold:
     xk+1 ∈ M and xk ∉ M for some M ∈ C.

  2. No acceleration, i.e. T2 = 0, if accelerating means leaving one:
     Tγ(xk+1) ∈ M and Tγ(xk+1 + αk+1(xk+1 − xk)) ∉ M for some M ∈ C.

where Tγ := proxγg(· − γ∇f(·)) is the proximal gradient operator. For analysis reasons, we allow no acceleration only when ‖Tγ(yk) − yk‖₂ ≤ δ and F(Tγ(yk)) ≤ F(x0).
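An illustrative sketch of how test T1 could look for the sparse lookout collection (my reading, not the authors' code): acceleration is skipped exactly when the new iterate lands on a manifold Mi = {x : xi = 0} that the previous iterate was not on; the δ and F(x0) safeguards mentioned above are omitted for brevity.

    import numpy as np

    def prox_grad_with_test_T1(grad_f, prox_g, x0, gamma, n_iter=500):
        # Accelerate by default; set T1 = 0 (no acceleration) when xk+1 reaches
        # a manifold Mi = {x : x_i = 0} that xk was not on (a new zero appears).
        x = y = x0.copy()
        for k in range(1, n_iter + 1):
            x_new = prox_g(y - gamma * grad_f(y), gamma)
            reached_new_manifold = np.any((x_new == 0.0) & (x != 0.0))
            if reached_new_manifold:               # T1 = 0: restart the inertia from xk+1
                y = x_new
            else:                                  # T1 = 1: usual inertial step
                y = x_new + (k - 1) / (k + 3) * (x_new - x)
            x = x_new
        return x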

12 / 18

SLIDE 26

>>> Convergence result ACCELERATION?

Theorem. Let f, g be two convex functions such that f is L-smooth, g is lower semi-continuous, and f + g is semi-algebraic with a minimizer. Take γ ∈ (0, 1/L]. Then, the iterates of the proposed method with test T1 or T2 verify

  F(xk+1) − F⋆ ≤ 9‖x0 − x⋆‖² / (2γ(k+2)²) + 9kR / (2γ(k+2)²) = O(1/k)

for some R > 0. Furthermore, if the problem has a unique minimizer x⋆ and the qualifying constraint (QC) holds, then the iterates sequence (xk) converges, finite-time identification happens, and

  F(xk+1) − F(x⋆) ≤ 9‖x0 − x⋆‖² / (2γ(k+2)²) + 9KR / (2γ(k+2)²) = O(1/k²)

for some finite K > 0.

L-smooth means that f is differentiable and ∇f is L-Lipschitz continuous.

13 / 18

SLIDE 27

>>> Back to initial problems: ℓ1 norm ACCELERATION?

min_{x ∈ R²} ‖Ax − b‖₂² + λ‖x‖₁

[Figure: contour plot with the Proximal Gradient and Accelerated Proximal Gradient iterates converging to x⋆]

14 / 18

SLIDE 28

>>> Back to initial problems: ℓ1 norm ACCELERATION?

min_{x ∈ R²} ‖Ax − b‖₂² + λ‖x‖₁

[Figure: contour plot with the iterates of Proximal Gradient, Accelerated Proximal Gradient, T1, and T2; convergence plot of F(xk) − F⋆ vs. number of proximal gradient steps for Proximal Gradient, Accel. Proximal Gradient, Prov. Alg – T1, and Prov. Alg – T2; ⊕ marks identification time]

14 / 18

SLIDE 29

>>> Back to initial problems: distance to 1.3-norm ball ACCELERATION?

min_{x ∈ R²} ‖Ax − b‖₂² + λ max(‖x‖_{1.3} − 1, 0)

[Figure: contour plot with the manifold M and the Proximal Gradient and Accelerated Proximal Gradient iterates converging to x⋆]

15 / 18

SLIDE 30

>>> Back to initial problems: distance to 1.3-norm ball ACCELERATION?

min_{x ∈ R²} ‖Ax − b‖₂² + λ max(‖x‖_{1.3} − 1, 0)

[Figure: contour plot with the iterates of Proximal Gradient, Accelerated Proximal Gradient, T1, and T2; convergence plot of F(xk) − F⋆ vs. number of proximal gradient steps for Proximal Gradient, Accel. Proximal Gradient, Prov. Alg – T1, and Prov. Alg – T2; ⊕ marks identification time]

15 / 18

SLIDE 31

>>> Matrix regression with nuclear-norm regularization ACCELERATION?

min_{X ∈ R^{20×20}} ‖AX − B‖F² + λ‖X‖∗

> S ∈ R^{20×20} is a rank-3 matrix;
> A ∈ R^{(16×16)×(20×20)} is drawn from the normal distribution;
> B = AS + E with E drawn from the normal distribution with variance .01
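The proximity operator of the nuclear norm soft-thresholds the singular values. Below is a sketch of this experiment's flavour in Python: the slide's exact operator A is not reproduced, a random linear map and a plain proximal gradient loop (with a 0.5-scaled least-squares term) are stand-ins, and all names are mine.

    import numpy as np

    def prox_nuclear(M, t):
        # prox_{t ||.||_*}(M): soft-threshold the singular values of M
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        return U @ np.diag(np.maximum(s - t, 0.0)) @ Vt

    rng = np.random.default_rng(0)
    n, r, lam = 20, 3, 1.0
    S = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))        # rank-3 ground truth
    A = rng.normal(size=(16 * 16, n * n))                        # random linear map (stand-in for the slide's A)
    B = A @ S.ravel() + 0.1 * rng.normal(size=16 * 16)           # noisy measurements

    gamma = 1.0 / np.linalg.norm(A, 2) ** 2
    X = np.zeros((n, n))
    for _ in range(500):
        G = (A.T @ (A @ X.ravel() - B)).reshape(n, n)            # gradient of 0.5 * ||A vec(X) - B||^2
        X = prox_nuclear(X - gamma * G, gamma * lam)
    print("rank of X:", np.linalg.matrix_rank(X, tol=1e-6))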

[Figure: for Proximal Gradient, Accel. Proximal Gradient, T1, and T2: suboptimality F(xk) − F⋆ and dim Ker(Xk)/dim Ker(S) (in %) vs. iterations]

16 / 18

SLIDE 32

>>> On the Interplay between Acceleration and Identification ACCELERATION?

> acceleration can hurt identification for the proximal gradient algorithm;
> we proposed a method with stable identification behavior, maintaining an accelerated convergence rate.

⊲ Bareilles & I.: On the Interplay between Acceleration and Identification for the Proximal Gradient algorithm. arXiv:1909.08944
  Try it in Julia on https://github.com/GillesBareilles/Acceleration-Identification

17 / 18

SLIDE 33

>>> Harnessing Structure in Optimization for ML ACCELERATION?

> Machine Learning problems often have a noticeable structure;
> We can design a lookout collection C = {M1, .., Mp} of sets: (i) with easy projections; (ii) identified by proximity operations; (iii) for which we know whether they are identified or not;
> This structure can and should be harnessed, but doing so may be tricky before identification.

⊲ Malick & I.: Nonsmoothness can help! On the Specific Structure of Machine Learning problems, review/pedagogical paper coming hopefully soon
  (thanks to this week at CIRM, but it also depends on whether we go hiking/running in the calanques, which may very well be the case)

Thanks to ANR JCJC STROLL & IDEX UGA IRS DOLL & PGMO

Thank you! – Franck IUTZELER

http://www.iutzeler.org

18 / 18