SLIDE 1

Structured sparsity through convex optimization

Francis Bach, INRIA - École Normale Supérieure, Paris, France
Joint work with R. Jenatton, J. Mairal, G. Obozinski
Journées INRIA - Apprentissage - December 2011

SLIDE 2

Outline

  • SIERRA project-team
  • Introduction: Sparse methods for machine learning

– Need for structured sparsity: Going beyond the ℓ1-norm

  • Classical approaches to structured sparsity

– Linear combinations of ℓq-norms

  • Structured sparsity through submodular functions

– Relaxation of the penalization of supports
– Unified algorithms and analysis

SLIDE 3

SIERRA - created January 1st, 2011
Composition of the INRIA/ENS/CNRS team

  • 3 researchers (Sylvain Arlot, Francis Bach, Guillaume Obozinski)
  • 4 post-docs (Simon Lacoste-Julien, Nicolas Le Roux, Ronny Luss, Mark Schmidt)
  • 9 PhD students (Louise Benoit, Florent Couzinie-Devy, Edouard Grave, Toby Hocking, Armand Joulin, Augustin Lefèvre, Anil Nelakanti, Fabian Pedregosa, Matthieu Solnon)

SLIDE 4

Machine learning
Computer science and applied mathematics

  • Modelling, prediction and control from training examples
  • Theory

– Analysis of statistical performance

  • Algorithms

– Numerical efficiency and stability

  • Applications

– Computer vision, bioinformatics, neuro-imaging, text, audio


SLIDE 8

Scientific objectives - SIERRA tenet

  • Machine learning does not exist in the void
  • Specific domain knowledge must be exploited
  • Scientific challenges
    – Fully automated data processing
    – Incorporating structure
    – Large-scale learning
  • Scientific objectives
    – Supervised learning
    – Parsimony
    – Optimization
    – Unsupervised learning
  • Interdisciplinary collaborations
    – Computer vision
    – Bioinformatics
    – Neuro-imaging
    – Text, audio, natural language

SLIDE 9

Supervised learning

  • Data (xi, yi) ∈ X × Y, i = 1, . . . , n
  • Goal: predict y ∈ Y from x ∈ X, i.e., find f : X → Y
  • Empirical risk minimization:
    (1/n) Σ_{i=1}^n ℓ(yi, f(xi)) + (λ/2) ‖f‖²   (data fitting + regularization)
  • SIERRA scientific objectives:
    – Studying generalization error (S. Arlot, M. Solnon, F. Bach)
    – Improving calibration (S. Arlot, M. Solnon, F. Bach)
    – Two main types of norms: ℓ2 vs. ℓ1 (G. Obozinski, F. Bach)

SLIDE 10

Sparsity in supervised machine learning

  • Observed data (xi, yi) ∈ Rp × R, i = 1, . . . , n
    – Response vector y = (y1, . . . , yn)⊤ ∈ Rn
    – Design matrix X = (x1, . . . , xn)⊤ ∈ Rn×p
  • Regularized empirical risk minimization:
    min_{w∈Rp} (1/n) Σ_{i=1}^n ℓ(yi, w⊤xi) + λΩ(w) = min_{w∈Rp} L(y, Xw) + λΩ(w)
  • Norm Ω to promote sparsity
    – Square loss + ℓ1-norm ⇒ basis pursuit in signal processing (Chen et al., 2001), Lasso in statistics/machine learning (Tibshirani, 1996)
    – Proxy for interpretability
    – Allows high-dimensional inference: log p = O(n)
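As a concrete illustration of the ℓ1-regularized problem above, here is a minimal numpy sketch (problem sizes and the regularization level are illustrative, not from the slides): proximal gradient (ISTA) on the Lasso, showing that the ℓ1-norm drives most coefficients exactly to zero.

```python
import numpy as np

# Illustrative toy problem: n observations, p > n features,
# only 3 nonzero coefficients in the true w.
rng = np.random.default_rng(0)
n, p = 50, 100
w_true = np.zeros(p)
w_true[:3] = [2.0, -1.5, 1.0]
X = rng.standard_normal((n, p))
y = X @ w_true + 0.01 * rng.standard_normal(n)

def soft_threshold(z, t):
    """Elementwise shrinkage towards zero (prox of t*||.||_1)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# ISTA on (1/2n)||y - Xw||^2 + lam*||w||_1
lam = 0.3
L = np.linalg.norm(X, 2) ** 2 / n      # Lipschitz constant of the gradient
w = np.zeros(p)
for _ in range(500):
    grad = X.T @ (X @ w - y) / n
    w = soft_threshold(w - grad / L, lam / L)

print("nonzeros:", np.count_nonzero(np.abs(w) > 1e-8))  # far fewer than p
```

The soft-thresholding step is exactly the "thresholded gradient descent" that the optimization slides below attribute to Ω(w) = ‖w‖₁.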

SLIDE 12

Sparsity in unsupervised machine learning

  • Multiple responses/signals y = (y1, . . . , yk) ∈ Rn×k:
    min_{X=(x1,...,xp)} min_{w1,...,wk∈Rp} Σ_{j=1}^k L(yj, Xwj) + λΩ(wj)
  • Only responses are observed ⇒ dictionary learning
    – Learn X = (x1, . . . , xp) ∈ Rn×p such that ∀j, ‖xj‖₂ ≤ 1
    – Olshausen and Field (1997); Elad and Aharon (2006); Mairal et al. (2009a)
  • Sparse PCA: replace the constraint ‖xj‖₂ ≤ 1 by Θ(xj) ≤ 1
SLIDE 13

Sparsity in signal processing

  • Multiple responses/signals x = (x1, . . . , xk) ∈ Rn×k:
    min_{D=(d1,...,dp)} min_{α1,...,αk∈Rp} Σ_{j=1}^k L(xj, Dαj) + λΩ(αj)
  • Only responses are observed ⇒ dictionary learning
    – Learn D = (d1, . . . , dp) ∈ Rn×p such that ∀j, ‖dj‖₂ ≤ 1
    – Olshausen and Field (1997); Elad and Aharon (2006); Mairal et al. (2009a)
  • Sparse PCA: replace the constraint ‖dj‖₂ ≤ 1 by Θ(dj) ≤ 1
SLIDE 14

Why structured sparsity?

  • Interpretability
    – Structured dictionary elements (Jenatton et al., 2009b)
    – Dictionary elements "organized" in a tree or a grid (Kavukcuoglu et al., 2009; Jenatton et al., 2010; Mairal et al., 2010)

SLIDE 15

Structured sparse PCA (Jenatton et al., 2009b)

(Figure: raw data / sparse PCA dictionary elements.)

  • Unstructured sparse PCA ⇒ many zeros do not lead to better interpretability


SLIDE 17

Structured sparse PCA (Jenatton et al., 2009b)

(Figure: raw data / structured sparse PCA dictionary elements.)

  • Enforce selection of convex nonzero patterns ⇒ robustness to occlusion in face identification

SLIDE 20

Modelling of text corpora (Jenatton et al., 2010)


SLIDE 22

Why structured sparsity?

  • Interpretability
    – Structured dictionary elements (Jenatton et al., 2009b)
    – Dictionary elements "organized" in a tree or a grid (Kavukcuoglu et al., 2009; Jenatton et al., 2010; Mairal et al., 2010)
  • Stability and identifiability
    – The optimization problem min_{w∈Rp} L(y, Xw) + λ‖w‖₁ is unstable
    – "Codes" wj often used in later processing (Mairal et al., 2009c)
  • Prediction or estimation performance
    – When prior knowledge matches data (Haupt and Nowak, 2006; Baraniuk et al., 2008; Jenatton et al., 2009a; Huang et al., 2009)
  • Numerical efficiency
    – Non-linear variable selection with 2^p subsets (Bach, 2008)

SLIDE 23

Classical approaches to structured sparsity

  • Many application domains
    – Computer vision (Cevher et al., 2008; Mairal et al., 2009b)
    – Neuro-imaging (Gramfort and Kowalski, 2009; Jenatton et al., 2011)
    – Bio-informatics (Rapaport et al., 2008; Kim and Xing, 2010)
  • Non-convex approaches
    – Haupt and Nowak (2006); Baraniuk et al. (2008); Huang et al. (2009)
  • Convex approaches
    – Design of sparsity-inducing norms

SLIDE 24

Outline

  • SIERRA project-team
  • Introduction: Sparse methods for machine learning

– Need for structured sparsity: Going beyond the ℓ1-norm

  • Classical approaches to structured sparsity

– Linear combinations of ℓq-norms

  • Structured sparsity through submodular functions

– Relaxation of the penalization of supports
– Unified algorithms and analysis

SLIDE 25

Sparsity-inducing norms

  • Popular choice for Ω: the ℓ1-ℓ2 norm
    Σ_{G∈H} ‖wG‖₂ = Σ_{G∈H} ( Σ_{j∈G} wj² )^{1/2}
    – with H a partition of {1, . . . , p}
    – The ℓ1-ℓ2 norm sets to zero groups of non-overlapping variables (as opposed to single variables for the ℓ1-norm)
    – For the square loss, group Lasso (Yuan and Lin, 2006)

(Figure: groups G1, G2, G3 partitioning the variables.)
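The ℓ1-ℓ2 norm above can be evaluated in a few lines; the sketch below (with an illustrative vector and partition) also checks the two degenerate cases: singleton groups recover the ℓ1-norm and a single group recovers the ℓ2-norm.

```python
import numpy as np

def l1_l2_norm(w, groups):
    """Omega(w) = sum over groups G of ||w_G||_2, for groups partitioning the indices."""
    return sum(np.linalg.norm(w[list(G)]) for G in groups)

w = np.array([3.0, 4.0, 0.0, -2.0, 0.0, 0.0])
p = len(w)

# Singleton groups => l1-norm; one big group => l2-norm.
assert np.isclose(l1_l2_norm(w, [[j] for j in range(p)]), np.abs(w).sum())
assert np.isclose(l1_l2_norm(w, [list(range(p))]), np.linalg.norm(w))

# A genuine partition in between: H = {{0,1}, {2,3}, {4,5}}
H = [[0, 1], [2, 3], [4, 5]]
print(l1_l2_norm(w, H))  # 5 + 2 + 0 = 7
```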

SLIDE 26

Unit norm balls
Geometric interpretation

(Figure: unit balls of ‖w‖₂, ‖w‖₁, and (w1² + w2²)^{1/2} + |w3|.)

SLIDE 27

Sparsity-inducing norms

  • Popular choice for Ω: the ℓ1-ℓ2 norm
    Σ_{G∈H} ‖wG‖₂ = Σ_{G∈H} ( Σ_{j∈G} wj² )^{1/2}
    – with H a partition of {1, . . . , p}
    – The ℓ1-ℓ2 norm sets to zero groups of non-overlapping variables (as opposed to single variables for the ℓ1-norm)
    – For the square loss, group Lasso (Yuan and Lin, 2006)
  • However, the ℓ1-ℓ2 norm encodes fixed/static prior information and requires knowing in advance how to group the variables
  • What happens if the set of groups H is not a partition anymore?

SLIDE 29

Structured sparsity with overlapping groups (Jenatton, Audibert, and Bach, 2009a)

  • When penalizing by the ℓ1-ℓ2 norm
    Σ_{G∈H} ‖wG‖₂ = Σ_{G∈H} ( Σ_{j∈G} wj² )^{1/2}
    – The ℓ1 norm induces sparsity at the group level:
      ∗ Some wG's are set to zero
    – Inside the groups, the ℓ2 norm does not promote sparsity

(Figure: overlapping groups G1, G2, G3.)

  • The zero pattern of w is given by
    {j, wj = 0} = ∪_{G∈H′} G   for some H′ ⊆ H
  • Zero patterns are unions of groups
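The "zero patterns are unions of groups" property can be made concrete on a tiny example. The sketch below (the sequence length and groups are illustrative) enumerates every union of overlapping contiguous groups; these unions, plus the empty union, are exactly the zero patterns the penalty can produce.

```python
from itertools import combinations

# Illustrative overlapping groups on a sequence of p = 4 variables.
H = [{0, 1}, {1, 2}, {2, 3}]

# All achievable zero patterns: unions of subsets of H (empty union included).
patterns = {frozenset()}
for r in range(1, len(H) + 1):
    for Hs in combinations(H, r):
        patterns.add(frozenset().union(*Hs))

print(sorted(sorted(P) for P in patterns))
```

Note that e.g. {0, 2} is not a union of groups, so it can never be a zero pattern here; this is how the choice of H encodes structural prior knowledge.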
SLIDE 30

Examples of set of groups H

  • Selection of contiguous patterns on a sequence, p = 6
    – H is the set of blue groups
    – Any union of blue groups set to zero leads to the selection of a contiguous pattern

SLIDE 31

Examples of set of groups H

  • Selection of rectangles on a 2-D grid, p = 25
    – H is the set of blue/green groups (with their complements, not displayed)
    – Any union of blue/green groups set to zero leads to the selection of a rectangle
SLIDE 32

Examples of set of groups H

  • Selection of diamond-shaped patterns on a 2-D grid, p = 25
    – It is possible to extend such settings to 3-D space, or more complex topologies

SLIDE 33

Unit norm balls
Geometric interpretation

(Figure: unit balls of (w1² + w2²)^{1/2} + |w3| and ‖w‖₂ + |w1| + |w2|.)


SLIDE 35

Optimization for sparsity-inducing norms (see Bach, Jenatton, Mairal, and Obozinski, 2011)

  • Gradient descent as a proximal method (differentiable functions)
    – w_{t+1} = argmin_{w∈Rp} L(wt) + (w − wt)⊤∇L(wt) + (B/2) ‖w − wt‖₂²
    – w_{t+1} = wt − (1/B) ∇L(wt)
  • Problems of the form: min_{w∈Rp} L(w) + λΩ(w)
    – w_{t+1} = argmin_{w∈Rp} L(wt) + (w − wt)⊤∇L(wt) + λΩ(w) + (B/2) ‖w − wt‖₂²
    – Ω(w) = ‖w‖₁ ⇒ thresholded gradient descent
  • Similar convergence rates to smooth optimization
    – Acceleration methods (Nesterov, 2007; Beck and Teboulle, 2009)
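For Ω(w) = ‖w‖₁, the proximal step above separates across coordinates and has the soft-thresholding closed form. The sketch below (constants are illustrative) checks that closed form against a brute-force 1-d grid search on the quadratic-plus-ℓ1 model.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# Per coordinate, the proximal step solves
#   argmin_w (B/2)(w - z)^2 + lam*|w|,  with z = w_t - grad/B,
# whose solution is soft_threshold(z, lam/B). Check against a grid search.
B, lam = 2.0, 0.5
grid = np.linspace(-3, 3, 200001)
for z in [-1.3, -0.1, 0.0, 0.2, 0.9]:
    obj = 0.5 * B * (grid - z) ** 2 + lam * np.abs(grid)
    w_grid = grid[np.argmin(obj)]
    w_closed = soft_threshold(z, lam / B)
    assert abs(w_grid - w_closed) < 1e-3
print("closed-form prox matches grid search")
```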

SLIDE 36

Comparison of optimization algorithms (Mairal, Jenatton, Obozinski, and Bach, 2010) - Small scale

  • Specific norms which can be implemented through network flows

(Figure: log(primal − optimum) vs. log(seconds), n = 100, p = 1000, one-dimensional DCT; methods: ProxFlox, SG, ADMM, Lin-ADMM, QP, CP.)

SLIDE 37

Comparison of optimization algorithms (Mairal, Jenatton, Obozinski, and Bach, 2010) - Large scale

  • Specific norms which can be implemented through network flows

(Figures: log(primal − optimum) vs. log(seconds), one-dimensional DCT; n = 1024, p = 10000 with ProxFlox, SG, ADMM, Lin-ADMM, CP, and n = 1024, p = 100000 with ProxFlox, SG, ADMM, Lin-ADMM.)

SLIDE 38

Approximate proximal methods (Schmidt, Le Roux, and Bach, 2011)

  • Exact computation of the proximal operator
    argmin_{w∈Rp} (1/2) ‖w − z‖₂² + λΩ(w)
    – Closed form for the ℓ1-norm
    – Efficient for overlapping group norms (Jenatton et al., 2010; Mairal et al., 2010)
  • Convergence rates: O(1/t) and O(1/t²) (with acceleration)
  • Gradient or proximal operator may be only approximate
    – Preserved convergence rate with appropriate control
    – Approximate gradient with non-random errors
    – Complex regularizers
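For non-overlapping groups, the proximal operator of the ℓ1-ℓ2 norm is also available in closed form: block soft-thresholding, which zeroes whole groups at once. A minimal sketch (vector and partition illustrative):

```python
import numpy as np

def prox_group(z, groups, t):
    """Prox of t * sum_G ||.||_2 for a partition of the indices: block soft-thresholding."""
    w = z.copy()
    for G in groups:
        G = list(G)
        nrm = np.linalg.norm(z[G])
        w[G] = 0.0 if nrm <= t else (1 - t / nrm) * z[G]
    return w

z = np.array([3.0, 4.0, 0.3, -0.4, 1.0, 0.0])
H = [[0, 1], [2, 3], [4, 5]]
w = prox_group(z, H, 1.0)
# Group {2,3} has ||z_G||_2 = 0.5 <= 1, so the whole group is set to zero;
# group {0,1} (norm 5) is shrunk by factor 1 - 1/5 but keeps its direction.
print(w)
```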

SLIDE 39

Stochastic approximation (Bach and Moulines, 2011)

  • Loss = generalization error L(w) = E_{(x,y)} ℓ(y, w⊤x)
  • Stochastic approximation: optimizing L(w) given a sequence of samples (xt, yt)
  • Context: large-scale learning
  • Main algorithm: stochastic gradient descent (a.k.a. Robbins-Monro)
    – Iteration: wt = w_{t−1} − γt [∂/∂w ℓ(yt, w⊤xt)]_{w=w_{t−1}}
    – Classical choice in machine learning: γt = C/t ⇒ wrong choice
  • Good choice: use averaging of iterates with γt = C/t^{1/2}
    – Robustness to the difficulty of the problem and to the setting of C
SLIDE 40

Application to background subtraction (Mairal, Jenatton, Obozinski, and Bach, 2010)

(Figure: input / ℓ1-norm / structured norm.)

SLIDE 41

Application to background subtraction (Mairal, Jenatton, Obozinski, and Bach, 2010)

(Figure: background / ℓ1-norm / structured norm.)

SLIDE 42

Application to neuro-imaging: structured sparsity for fMRI (Jenatton et al., 2011)

  • "Brain reading": prediction of (seen) object size
  • Multi-scale activity levels through hierarchical penalization
SLIDE 45

Sparse structured PCA (Jenatton, Obozinski, and Bach, 2009b)

  • Learning sparse and structured dictionary elements:
    min_{W∈Rk×n, X∈Rp×k} (1/n) Σ_{i=1}^n ‖yi − Xwi‖₂² + λ Σ_{j=1}^k Ω(xj)   s.t. ∀i, ‖wi‖₂ ≤ 1

SLIDE 46

Application to face databases (1/3)

(Figure: raw data / (unstructured) NMF.)

  • NMF obtains partially local features
SLIDE 47

Application to face databases (2/3)

(Figure: (unstructured) sparse PCA / structured sparse PCA.)

  • Enforce selection of convex nonzero patterns ⇒ robustness to occlusion
SLIDE 49

Application to face databases (3/3)

  • Quantitative performance evaluation on classification task

(Figure: % correct classification vs. dictionary size, for raw data, PCA, NMF, SPCA, shared-SPCA, SSPCA, shared-SSPCA.)

SLIDE 50

Structured sparse PCA on resting state activity (Varoquaux, Jenatton, Gramfort, Obozinski, Thirion, and Bach, 2010)

SLIDE 51

Dictionary learning vs. sparse structured PCA: exchange the roles of X and w

  • Sparse structured PCA (structured dictionary elements):
    min_{W∈Rk×n, X∈Rp×k} (1/n) Σ_{i=1}^n ‖yi − Xwi‖₂² + λ Σ_{j=1}^k Ω(xj)   s.t. ∀i, ‖wi‖₂ ≤ 1
  • Dictionary learning with structured sparsity for codes w:
    min_{W∈Rk×n, X∈Rp×k} (1/n) Σ_{i=1}^n ( ‖yi − Xwi‖₂² + λΩ(wi) )   s.t. ∀j, ‖xj‖₂ ≤ 1
  • Optimization:
    – Alternating optimization
    – Modularity of implementation if the proximal step is efficient (Jenatton et al., 2010; Mairal et al., 2010)
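The alternating-optimization scheme can be sketched in a few dozen lines for the simplest case, Ω = ℓ1 (all sizes, the penalty level, and iteration counts are illustrative): ISTA steps on the codes W with X fixed, then projected gradient steps on the dictionary X with W fixed, projecting columns onto the unit ball. Both sub-steps are monotone, so the objective decreases.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, m = 20, 8, 100          # signal dim, dictionary size, number of signals
Y = rng.standard_normal((n, m))
lam = 0.1

X = rng.standard_normal((n, k))
X /= np.linalg.norm(X, axis=0)           # start with ||x_j||_2 <= 1
W = np.zeros((k, m))

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def objective():
    return 0.5 * np.sum((Y - X @ W) ** 2) / m + lam * np.abs(W).sum() / m

objs = [objective()]
for _ in range(10):
    # (1) Codes: ISTA on W with X fixed (l1 penalty on the codes).
    L = np.linalg.norm(X, 2) ** 2
    for _ in range(50):
        W = soft_threshold(W - X.T @ (X @ W - Y) / L, lam / L)
    # (2) Dictionary: projected gradient on X with W fixed,
    #     projecting each column onto the l2 unit ball.
    Lx = np.linalg.norm(W, 2) ** 2 + 1e-12
    for _ in range(20):
        X = X - (X @ W - Y) @ W.T / Lx
        X = X / np.maximum(np.linalg.norm(X, axis=0), 1.0)
    objs.append(objective())

print(objs[0], "->", objs[-1])           # objective decreases
```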

SLIDE 52

Hierarchical dictionary learning (Jenatton, Mairal, Obozinski, and Bach, 2010)

  • Structure on codes w (not on dictionary X)
  • Hierarchical penalization: Ω(w) = Σ_{G∈H} ‖wG‖₂, where the groups G in H are the sets of descendants of nodes in a tree
  • A variable is selected after its ancestors (Zhao et al., 2009; Bach, 2008)
SLIDE 53

Hierarchical dictionary learning: modelling of text corpora

  • Each document is modelled through word counts
  • Low-rank matrix factorization of the word-document matrix
  • Probabilistic topic models (Blei et al., 2003)
    – Similar structures based on non-parametric Bayesian methods (Blei et al., 2004)
    – Can we achieve similar performance with a simple matrix factorization formulation?

SLIDE 54

Modelling of text corpora - Dictionary tree

SLIDE 55

Structured sparsity - Audio processing: source separation (Lefèvre et al., 2011)

(Figure: amplitude vs. time and frequency vs. time representations of the signals.)

SLIDE 56

Structured sparsity - Audio processing: musical instrument separation (Lefèvre et al., 2011)

  • Unsupervised source separation with a group-sparsity prior
    – Top: mixture
    – Left: source tracks (guitar, voice); right: separated tracks

(Figure: spectrograms of the mixture, the source tracks, and the separated tracks.)

SLIDE 57

Structured sparsity - Bioinformatics

  • Collaboration with J.-P. Vert, Institut Curie (T. Hocking, G. Obozinski, F. Bach)
  • Metastasis prediction from microarray data (G. Obozinski)
    – Biological pathways
    – Dedicated sparsity-inducing norm for better interpretability and prediction

SLIDE 58

Outline

  • SIERRA project-team
  • Introduction: Sparse methods for machine learning

– Need for structured sparsity: Going beyond the ℓ1-norm

  • Classical approaches to structured sparsity

– Linear combinations of ℓq-norms

  • Structured sparsity through submodular functions

– Relaxation of the penalization of supports
– Unified algorithms and analysis

SLIDE 59

ℓ1-norm = convex envelope of the cardinality of the support

  • Let w ∈ Rp, V = {1, . . . , p}, and Supp(w) = {j ∈ V, wj ≠ 0}
  • Cardinality of the support: ‖w‖₀ = Card(Supp(w))
  • Convex envelope = largest convex lower bound (see, e.g., Boyd and Vandenberghe, 2004)

(Figure: ‖w‖₀ and ‖w‖₁ on [−1, 1].)

  • ℓ1-norm = convex envelope of the ℓ0-quasi-norm on the ℓ∞-ball [−1, 1]^p
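The "lower bound on the ℓ∞-ball" part of the envelope statement is easy to check numerically: on [−1, 1]^p we have ‖w‖₁ ≤ ‖w‖₀, with equality at sign vectors (where every |wj| is 0 or 1). A small sketch (dimension and sample count illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
p = 4
# On the l_infinity ball, the l1-norm lies below the l0-quasi-norm.
for _ in range(1000):
    w = rng.uniform(-1, 1, size=p)
    assert np.abs(w).sum() <= np.count_nonzero(w) + 1e-12

# Equality holds exactly at sign vectors, i.e. where |w_j| is 0 or 1.
w = np.array([1.0, -1.0, 0.0, 1.0])
print(np.abs(w).sum(), np.count_nonzero(w))  # 3.0 3
```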
SLIDE 60

Convex envelopes of general functions of the support (Bach, 2010)

  • Let F : 2^V → R be a set-function
    – Assume F is non-decreasing (i.e., A ⊂ B ⇒ F(A) ≤ F(B))
    – Explicit prior knowledge on supports (Haupt and Nowak, 2006; Baraniuk et al., 2008; Huang et al., 2009)
  • Define Θ(w) = F(Supp(w)): how to get its convex envelope?
    1. Possible if F is also submodular
    2. Allows a unified theory and algorithm
    3. Provides new regularizers
SLIDE 64

Submodular functions (Fujishige, 2005; Bach, 2010b)

  • F : 2^V → R is submodular if and only if
    ∀A, B ⊂ V, F(A) + F(B) ≥ F(A ∩ B) + F(A ∪ B)
    ⇔ ∀k ∈ V, A ↦ F(A ∪ {k}) − F(A) is non-increasing
  • Intuition 1: defined like concave functions ("diminishing returns")
    – Example: F : A ↦ g(Card(A)) is submodular if g is concave
  • Intuition 2: behave like convex functions
    – Polynomial-time minimization, conjugacy theory
  • Used in several areas of signal processing and machine learning
    – Total variation/graph cuts (Chambolle, 2005; Boykov et al., 2001)
    – Optimal design (Krause and Guestrin, 2005)
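The "concave function of the cardinality" example can be verified by brute force on a small ground set; the sketch below checks the inequality F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B) over all pairs of subsets, and shows that a convex function of the cardinality fails it.

```python
import math
from itertools import chain, combinations

V = range(4)

def subsets(V):
    s = list(V)
    return [frozenset(c)
            for c in chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))]

def is_submodular(F, V):
    """Brute-force check of F(A) + F(B) >= F(A | B) + F(A & B) over all pairs."""
    S = subsets(V)
    return all(F(A) + F(B) >= F(A | B) + F(A & B) - 1e-12 for A in S for B in S)

F_sqrt = lambda A: math.sqrt(len(A))   # concave in |A|  => submodular
F_sq = lambda A: len(A) ** 2           # convex in |A|   => not submodular

print(is_submodular(F_sqrt, V), is_submodular(F_sq, V))  # True False
```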

SLIDE 65

Submodular functions - Examples

  • Concave functions of the cardinality: g(|A|)
  • Cuts
  • Entropies

– H((Xk)k∈A) from p random variables X1, . . . , Xp

  • Network flows

– Efficient representation for set covers

  • Rank functions of matroids
SLIDE 66

Submodular functions - Lovász extension

  • Subsets may be identified with elements of {0, 1}^p
  • Given any set-function F and w such that w_{j1} ≥ · · · ≥ w_{jp}, define:
    f(w) = Σ_{k=1}^p w_{jk} [F({j1, . . . , jk}) − F({j1, . . . , jk−1})]
    – If w = 1_A, f(w) = F(A) ⇒ extension from {0, 1}^p to R^p
    – f is piecewise affine and positively homogeneous
  • F is submodular if and only if f is convex (Lovász, 1982)
    – Minimizing f(w) on w ∈ [0, 1]^p is equivalent to minimizing F on 2^V
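The telescoping-sum definition of the Lovász extension translates directly into code; the sketch below (set-function and dimension illustrative) verifies the extension property f(1_A) = F(A) on indicator vectors.

```python
import math
import numpy as np

def lovasz(F, w):
    """Lovász extension: sort coordinates decreasingly, then take the telescoping sum."""
    order = np.argsort(-w)
    f, prev = 0.0, 0.0
    chosen = []
    for j in order:
        chosen.append(int(j))
        cur = F(frozenset(chosen))
        f += w[j] * (cur - prev)
        prev = cur
    return f

F = lambda A: math.sqrt(len(A))   # a submodular set-function (concave in |A|)

# On indicator vectors, the extension agrees with the set-function: f(1_A) = F(A).
p = 4
for A in [frozenset(), frozenset({1}), frozenset({0, 2}), frozenset(range(p))]:
    w = np.zeros(p)
    w[list(A)] = 1.0
    assert abs(lovasz(F, w) - F(A)) < 1e-12
print("f(1_A) = F(A) verified")
```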


SLIDE 68

Submodular functions and structured sparsity

  • Let F : 2^V → R be a non-decreasing submodular set-function
  • Proposition: the convex envelope of Θ : w ↦ F(Supp(w)) on the ℓ∞-ball is Ω : w ↦ f(|w|), where f is the Lovász extension of F
  • Sparsity-inducing properties: Ω is a polyhedral norm

(Figure: polyhedral unit ball with extreme points (1,0)/F({1}), (0,1)/F({2}), (1,1)/F({1,2}).)

    – A is stable if for all B ⊃ A, B ≠ A ⇒ F(B) > F(A)
    – With probability one, stable sets are the only allowed active sets

SLIDE 69

Polyhedral unit balls

(Figure: unit balls in R³ for the following choices.)

  • F(A) = |A| ⇒ Ω(w) = ‖w‖₁
  • F(A) = min{|A|, 1} ⇒ Ω(w) = ‖w‖∞
  • F(A) = |A|^{1/2}: all possible extreme points
  • F(A) = 1_{A∩{1}≠∅} + 1_{A∩{2,3}≠∅} ⇒ Ω(w) = |w1| + ‖w_{{2,3}}‖∞
  • F(A) = 1_{A∩{1,2,3}≠∅} + 1_{A∩{2,3}≠∅} + 1_{A∩{3}≠∅} ⇒ Ω(w) = ‖w‖∞ + ‖w_{{2,3}}‖∞ + |w3|

SLIDE 70

Submodular functions and structured sparsity

  • Unified theory and algorithms
    – Generic computation of the proximal operator
    – Unified oracle inequalities
  • Extensions
    – Shaping level sets through symmetric submodular functions (Bach, 2010a)
    – ℓq-relaxations of combinatorial penalties (Obozinski and Bach, 2011)


SLIDE 72

Conclusion

  • Structured sparsity for machine learning and statistics
    – Many applications (image, audio, text, etc.)
    – May be achieved through structured sparsity-inducing norms
    – Link with submodular functions: unified analysis and algorithms
  • On-going/related work on structured sparsity
    – Norm design beyond submodular functions
    – Complementary approach of Jacob, Obozinski, and Vert (2009)
    – Theoretical analysis of dictionary learning (Jenatton, Bach, and Gribonval, 2011)
    – Achieving log p = O(n) algorithmically (Bach, 2008)

SLIDE 73

INRIA and machine learning

  • Machine learning is a relatively recent field
    – Between applied mathematics and computer science
    – INRIA is a key actor (core ML + interactions)
  • What INRIA can do for machine learning
    – Junior researcher positions (CR)
    – Invited professors

SLIDE 74

References

  • F. Bach. Exploring large feature spaces with hierarchical multiple kernel learning. In Advances in Neural Information Processing Systems, 2008.
  • F. Bach. Structured sparsity-inducing norms through submodular functions. In NIPS, 2010.
  • F. Bach. Shaping level sets with submodular functions. Technical Report 00542949, HAL, 2010a.
  • F. Bach. Convex analysis and optimization with submodular functions: a tutorial. Technical Report 00527714, HAL, 2010b.
  • F. Bach and E. Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. 2011.
  • F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Technical Report 00613125, HAL, 2011.
  • R. G. Baraniuk, V. Cevher, M. F. Duarte, and C. Hegde. Model-based compressive sensing. Technical report, arXiv:0808.3572, 2008.
  • A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
  • D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, January 2003.
  • D. Blei, T. L. Griffiths, M. I. Jordan, and J. B. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. Advances in Neural Information Processing Systems, 16:106, 2004.
  • S. P. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
  • Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Trans. PAMI, 23(11):1222–1239, 2001.
  • V. Cevher, M. F. Duarte, C. Hegde, and R. G. Baraniuk. Sparse signal recovery using Markov random fields. In Advances in Neural Information Processing Systems, 2008.
  • A. Chambolle. Total variation minimization and a class of binary MRF models. In Energy Minimization Methods in Computer Vision and Pattern Recognition, pages 136–152. Springer, 2005.
  • S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129–159, 2001.
  • M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736–3745, 2006.
  • S. Fujishige. Submodular Functions and Optimization. Elsevier, 2005.
  • A. Gramfort and M. Kowalski. Improving M/EEG source localization with an inter-condition sparse prior. In IEEE International Symposium on Biomedical Imaging, 2009.
  • J. Haupt and R. Nowak. Signal reconstruction from noisy random projections. IEEE Transactions on Information Theory, 52(9):4036–4048, 2006.
  • J. Huang, T. Zhang, and D. Metaxas. Learning with structured sparsity. In Proceedings of the 26th International Conference on Machine Learning (ICML), 2009.
  • L. Jacob, G. Obozinski, and J.-P. Vert. Group Lasso with overlaps and graph Lasso. In Proceedings of the 26th International Conference on Machine Learning (ICML), 2009.
  • R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. Technical report, arXiv:0904.3523, 2009a.
  • R. Jenatton, G. Obozinski, and F. Bach. Structured sparse principal component analysis. Technical report, arXiv:0909.1440, 2009b.
  • R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for sparse hierarchical dictionary learning. In Submitted to ICML, 2010.
  • R. Jenatton, A. Gramfort, V. Michel, G. Obozinski, E. Eger, F. Bach, and B. Thirion. Multi-scale mining of fMRI data with hierarchical structured sparsity. Technical report, preprint arXiv:1105.0363, 2011. In submission to SIAM Journal on Imaging Sciences.
  • K. Kavukcuoglu, M. Ranzato, R. Fergus, and Y. LeCun. Learning invariant features through topographic filter maps. In Proceedings of CVPR, 2009.
  • S. Kim and E. P. Xing. Tree-guided group Lasso for multi-task regression with structured sparsity. In Proceedings of the International Conference on Machine Learning (ICML), 2010.
  • A. Krause and C. Guestrin. Near-optimal nonmyopic value of information in graphical models. In Proc. UAI, 2005.
  • L. Lovász. Submodular functions and convexity. Mathematical programming: the state of the art, Bonn, pages 235–257, 1982.
  • J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Technical report, arXiv:0908.0050, 2009a.
  • J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Non-local sparse models for image restoration. In Computer Vision, 2009 IEEE 12th International Conference on, pages 2272–2279. IEEE, 2009b.
  • J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Supervised dictionary learning. Advances in Neural Information Processing Systems (NIPS), 21, 2009c.
  • J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Network flow algorithms for structured sparsity. In NIPS, 2010.
  • Y. Nesterov. Gradient methods for minimizing composite objective function. Center for Operations Research and Econometrics (CORE), Catholic University of Louvain, Tech. Rep. 76, 2007.
  • G. Obozinski and F. Bach. Convex relaxation of combinatorial penalties. Technical report, HAL, 2011.
  • B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37:3311–3325, 1997.
  • F. Rapaport, E. Barillot, and J.-P. Vert. Classification of arrayCGH data using fused SVM. Bioinformatics, 24(13):i375–i382, July 2008.
  • M. Schmidt, N. Le Roux, and F. Bach. Convergence rates of inexact proximal-gradient methods for convex optimization. arXiv preprint arXiv:1109.2415, 2011.
  • R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B, 58(1):267–288, 1996.
  • G. Varoquaux, R. Jenatton, A. Gramfort, G. Obozinski, B. Thirion, and F. Bach. Sparse structured dictionary learning for brain resting-state activity modeling. In NIPS Workshop on Practical Applications of Sparse Modeling: Open Issues and New Directions, 2010.
  • M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B, 68(1):49–67, 2006.
  • P. Zhao, G. Rocha, and B. Yu. Grouped and hierarchical model selection through composite absolute penalties. Annals of Statistics, 37(6A):3468–3497, 2009.