Structured sparsity through convex optimization


  1. Structured sparsity through convex optimization
  Francis Bach, INRIA - École Normale Supérieure, Paris, France
  Joint work with R. Jenatton, J. Mairal, G. Obozinski
  Journées INRIA - Apprentissage - December 2011

  2. Outline
  • SIERRA project-team
  • Introduction: Sparse methods for machine learning
    – Need for structured sparsity: Going beyond the ℓ1-norm
  • Classical approaches to structured sparsity
    – Linear combinations of ℓq-norms
  • Structured sparsity through submodular functions
    – Relaxation of the penalization of supports
    – Unified algorithms and analysis

  3. SIERRA - created January 1st, 2011
  Composition of the INRIA/ENS/CNRS team
  • 3 Researchers (Sylvain Arlot, Francis Bach, Guillaume Obozinski)
  • 4 Post-docs (Simon Lacoste-Julien, Nicolas Le Roux, Ronny Luss, Mark Schmidt)
  • 9 PhD students (Louise Benoit, Florent Couzinie-Devy, Edouard Grave, Toby Hocking, Armand Joulin, Augustin Lefèvre, Anil Nelakanti, Fabian Pedregosa, Matthieu Solnon)

  4. Machine learning
  Computer science and applied mathematics
  • Modeling, prediction and control from training examples
  • Theory
    – Analysis of statistical performance
  • Algorithms
    – Numerical efficiency and stability
  • Applications
    – Computer vision, bioinformatics, neuro-imaging, text, audio

  5. Scientific objectives - SIERRA tenet
  – Machine learning does not exist in the void
  – Specific domain knowledge must be exploited

  6. Scientific objectives - SIERRA tenet
  – Machine learning does not exist in the void
  – Specific domain knowledge must be exploited
  • Scientific challenges
    – Fully automated data processing
    – Incorporating structure
    – Large-scale learning

  7. Scientific objectives - SIERRA tenet
  – Machine learning does not exist in the void
  – Specific domain knowledge must be exploited
  • Scientific challenges
    – Fully automated data processing
    – Incorporating structure
    – Large-scale learning
  • Scientific objectives
    – Supervised learning
    – Parsimony
    – Optimization
    – Unsupervised learning

  8. Scientific objectives - SIERRA tenet
  – Machine learning does not exist in the void
  – Specific domain knowledge must be exploited
  • Scientific challenges
    – Fully automated data processing
    – Incorporating structure
    – Large-scale learning
  • Scientific objectives
    – Supervised learning
    – Parsimony
    – Optimization
    – Unsupervised learning
  • Interdisciplinary collaborations
    – Computer vision
    – Bioinformatics
    – Neuro-imaging
    – Text, audio, natural language

  9. Supervised learning
  • Data (x_i, y_i) ∈ X × Y, i = 1, ..., n
  • Goal: predict y ∈ Y from x ∈ X, i.e., find f : X → Y
  • Empirical risk minimization:
    $\frac{1}{n} \sum_{i=1}^n \ell(y_i, f(x_i)) + \frac{\lambda}{2} \|f\|^2$
    (data-fitting + regularization)
  • SIERRA scientific objectives:
    – Studying generalization error (S. Arlot, M. Solnon, F. Bach)
    – Improving calibration (S. Arlot, M. Solnon, F. Bach)
    – Two main types of norms: ℓ2 vs. ℓ1 (G. Obozinski, F. Bach)
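
As a concrete instance of the objective above, here is a minimal numpy sketch for the square loss with an ℓ2 penalty, which admits a closed-form (ridge) solution; the data, dimensions and value of λ are illustrative assumptions, not taken from the talk.

```python
import numpy as np

# Synthetic regression data (dimensions and noise level are illustrative)
rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
w_true = rng.standard_normal(p)
y = X @ w_true + 0.1 * rng.standard_normal(n)

lam = 0.1  # regularization parameter lambda

# Square loss + (lambda/2)||w||_2^2: minimizing
#   (1/2n)||y - Xw||^2 + (lambda/2)||w||^2
# gives the closed-form (ridge) solution below.
w_hat = np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

data_fit = 0.5 * np.mean((y - X @ w_hat) ** 2)   # empirical risk term
penalty = 0.5 * lam * np.sum(w_hat ** 2)         # regularization term
print(f"data-fitting = {data_fit:.4f}, regularization = {penalty:.4f}")
```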

  10. Sparsity in supervised machine learning
  • Observed data (x_i, y_i) ∈ R^p × R, i = 1, ..., n
    – Response vector y = (y_1, ..., y_n)^⊤ ∈ R^n
    – Design matrix X = (x_1, ..., x_n)^⊤ ∈ R^{n×p}
  • Regularized empirical risk minimization:
    $\min_{w \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n \ell(y_i, w^\top x_i) + \lambda \Omega(w) = \min_{w \in \mathbb{R}^p} L(y, Xw) + \lambda \Omega(w)$
  • Norm Ω to promote sparsity
    – Square loss + ℓ1-norm ⇒ basis pursuit in signal processing (Chen et al., 2001), Lasso in statistics/machine learning (Tibshirani, 1996)
    – Proxy for interpretability
    – Allows high-dimensional inference: log p = O(n)
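
A minimal sketch of the ℓ1-regularized case using scikit-learn's Lasso, whose objective is (1/2n)‖y − Xw‖² + α‖w‖₁; the dimensions, the sparsity of the ground truth and the value of α are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

# High-dimensional setup with a sparse ground truth (all values illustrative)
rng = np.random.default_rng(0)
n, p = 60, 200
X = rng.standard_normal((n, p))
w_true = np.zeros(p)
w_true[:5] = [1.0, -2.0, 1.5, 0.5, -1.0]          # only 5 active variables
y = X @ w_true + 0.1 * rng.standard_normal(n)

# scikit-learn's Lasso minimizes (1/2n)||y - Xw||^2 + alpha * ||w||_1
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

print("nonzero coefficients:", int(np.sum(lasso.coef_ != 0)), "out of", p)
```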

  11. Sparsity in unsupervised machine learning
  • Multiple responses/signals y = (y^1, ..., y^k) ∈ R^{n×k}
    $\min_{X = (x_1, \dots, x_p)} \ \min_{w^1, \dots, w^k \in \mathbb{R}^p} \ \sum_{j=1}^k \left\{ L(y^j, X w^j) + \lambda \Omega(w^j) \right\}$

  12. Sparsity in unsupervised machine learning
  • Multiple responses/signals y = (y^1, ..., y^k) ∈ R^{n×k}
    $\min_{X = (x_1, \dots, x_p)} \ \min_{w^1, \dots, w^k \in \mathbb{R}^p} \ \sum_{j=1}^k \left\{ L(y^j, X w^j) + \lambda \Omega(w^j) \right\}$
  • Only responses are observed ⇒ dictionary learning
    – Learn X = (x_1, ..., x_p) ∈ R^{n×p} such that ∀j, ‖x_j‖_2 ≤ 1
    – Olshausen and Field (1997); Elad and Aharon (2006); Mairal et al. (2009a)
  • Sparse PCA: replace ‖x_j‖_2 ≤ 1 by Θ(x_j) ≤ 1

  13. Sparsity in signal processing
  • Multiple responses/signals x = (x^1, ..., x^k) ∈ R^{n×k}
    $\min_{D = (d_1, \dots, d_p)} \ \min_{\alpha^1, \dots, \alpha^k \in \mathbb{R}^p} \ \sum_{j=1}^k \left\{ L(x^j, D \alpha^j) + \lambda \Omega(\alpha^j) \right\}$
  • Only responses are observed ⇒ dictionary learning
    – Learn D = (d_1, ..., d_p) ∈ R^{n×p} such that ∀j, ‖d_j‖_2 ≤ 1
    – Olshausen and Field (1997); Elad and Aharon (2006); Mairal et al. (2009a)
  • Sparse PCA: replace ‖d_j‖_2 ≤ 1 by Θ(d_j) ≤ 1
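
To make the dictionary-learning formulation concrete, the sketch below runs scikit-learn's MiniBatchDictionaryLearning (based on the online approach of Mairal et al., 2009a, cited above) on synthetic signals; all sizes and regularization values are illustrative assumptions, and the library's solver details differ from the exact formulation on the slide.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

# k synthetic signals of dimension n, stacked as rows (all sizes illustrative)
rng = np.random.default_rng(0)
k, n, p = 200, 64, 32
signals = rng.standard_normal((k, n))

# Alternate between updating p dictionary atoms (kept at bounded l2 norm)
# and computing l1-regularized sparse codes for the signals.
dico = MiniBatchDictionaryLearning(
    n_components=p,
    alpha=1.0,
    transform_algorithm="lasso_lars",
    transform_alpha=1.0,
    random_state=0,
)
codes = dico.fit_transform(signals)   # sparse codes, shape (k, p)
atoms = dico.components_              # learned dictionary, shape (p, n)

print("average nonzeros per code:", float(np.mean(np.sum(codes != 0, axis=1))))
```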

  14. Why structured sparsity?
  • Interpretability
    – Structured dictionary elements (Jenatton et al., 2009b)
    – Dictionary elements “organized” in a tree or a grid (Kavukcuoglu et al., 2009; Jenatton et al., 2010; Mairal et al., 2010)

  15. Structured sparse PCA (Jenatton et al., 2009b)
  [Figures: raw data vs. sparse PCA dictionary elements]
  • Unstructured sparse PCA ⇒ many zeros do not lead to better interpretability

  16. Structured sparse PCA (Jenatton et al., 2009b)
  [Figures: raw data vs. sparse PCA dictionary elements]
  • Unstructured sparse PCA ⇒ many zeros do not lead to better interpretability

  17. Structured sparse PCA (Jenatton et al., 2009b)
  [Figures: raw data vs. structured sparse PCA dictionary elements]
  • Enforce selection of convex nonzero patterns ⇒ robustness to occlusion in face identification

  18. Structured sparse PCA (Jenatton et al., 2009b)
  [Figures: raw data vs. structured sparse PCA dictionary elements]
  • Enforce selection of convex nonzero patterns ⇒ robustness to occlusion in face identification

  19. Why structured sparsity?
  • Interpretability
    – Structured dictionary elements (Jenatton et al., 2009b)
    – Dictionary elements “organized” in a tree or a grid (Kavukcuoglu et al., 2009; Jenatton et al., 2010; Mairal et al., 2010)

  20. Modelling of text corpora (Jenatton et al., 2010)

  21. Why structured sparsity?
  • Interpretability
    – Structured dictionary elements (Jenatton et al., 2009b)
    – Dictionary elements “organized” in a tree or a grid (Kavukcuoglu et al., 2009; Jenatton et al., 2010; Mairal et al., 2010)

  22. Why structured sparsity?
  • Interpretability
    – Structured dictionary elements (Jenatton et al., 2009b)
    – Dictionary elements “organized” in a tree or a grid (Kavukcuoglu et al., 2009; Jenatton et al., 2010; Mairal et al., 2010)
  • Stability and identifiability
    – The optimization problem $\min_{w \in \mathbb{R}^p} L(y, Xw) + \lambda \|w\|_1$ is unstable
    – “Codes” w^j are often used in later processing (Mairal et al., 2009c)
  • Prediction or estimation performance
    – When prior knowledge matches data (Haupt and Nowak, 2006; Baraniuk et al., 2008; Jenatton et al., 2009a; Huang et al., 2009)
  • Numerical efficiency
    – Non-linear variable selection with 2^p subsets (Bach, 2008)

  23. Classical approaches to structured sparsity
  • Many application domains
    – Computer vision (Cevher et al., 2008; Mairal et al., 2009b)
    – Neuro-imaging (Gramfort and Kowalski, 2009; Jenatton et al., 2011)
    – Bio-informatics (Rapaport et al., 2008; Kim and Xing, 2010)
  • Non-convex approaches
    – Haupt and Nowak (2006); Baraniuk et al. (2008); Huang et al. (2009)
  • Convex approaches
    – Design of sparsity-inducing norms

  24. Outline
  • SIERRA project-team
  • Introduction: Sparse methods for machine learning
    – Need for structured sparsity: Going beyond the ℓ1-norm
  • Classical approaches to structured sparsity
    – Linear combinations of ℓq-norms
  • Structured sparsity through submodular functions
    – Relaxation of the penalization of supports
    – Unified algorithms and analysis

  25. Sparsity-inducing norms
  • Popular choice for Ω: the ℓ1-ℓ2 norm
    $\sum_{G \in H} \|w_G\|_2 = \sum_{G \in H} \Big( \sum_{j \in G} w_j^2 \Big)^{1/2}$
    with H a partition of {1, ..., p}
    [Figure: groups G1, G2, G3 forming a partition of the variables]
    – The ℓ1-ℓ2 norm sets to zero groups of non-overlapping variables (as opposed to single variables for the ℓ1-norm)
    – For the square loss: group Lasso (Yuan and Lin, 2006)
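
A small sketch of this ℓ1-ℓ2 norm and of its proximal operator for a partition (block soft-thresholding), the building block that group-Lasso solvers typically iterate; the groups and the vector below are illustrative.

```python
import numpy as np

def l1_l2_norm(w, groups):
    """Omega(w) = sum over G in H of ||w_G||_2, for a list of index groups."""
    return sum(np.linalg.norm(w[g]) for g in groups)

def prox_l1_l2(w, groups, t):
    """Proximal operator of t * Omega when the groups form a partition:
    block soft-thresholding applied to each sub-vector w_G independently."""
    out = w.copy()
    for g in groups:
        norm_g = np.linalg.norm(w[g])
        out[g] = 0.0 if norm_g <= t else (1.0 - t / norm_g) * w[g]
    return out

# A partition of {0, ..., 5} into three groups (purely illustrative)
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
w = np.array([0.1, -0.2, 3.0, -1.0, 0.05, 0.02])

print("Omega(w) =", l1_l2_norm(w, groups))
print("prox:", prox_l1_l2(w, groups, t=0.5))   # small groups are zeroed entirely
```

Note that when the groups overlap, the proximal operator no longer separates across blocks like this, which is one motivation for the later part of the talk.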

  26. Unit norm balls - geometric interpretation
  [Figures: unit balls of $\|w\|_2$, $\|w\|_1$, and $\sqrt{w_1^2 + w_2^2} + |w_3|$]

  27. Sparsity-inducing norms
  • Popular choice for Ω: the ℓ1-ℓ2 norm
    $\sum_{G \in H} \|w_G\|_2 = \sum_{G \in H} \Big( \sum_{j \in G} w_j^2 \Big)^{1/2}$
    with H a partition of {1, ..., p}
    – The ℓ1-ℓ2 norm sets to zero groups of non-overlapping variables (as opposed to single variables for the ℓ1-norm)
    – For the square loss: group Lasso (Yuan and Lin, 2006)
  • However, the ℓ1-ℓ2 norm encodes fixed/static prior information and requires knowing in advance how to group the variables
  • What happens if the set of groups H is no longer a partition?

  28. Structured sparsity with overlapping groups (Jenatton, Audibert, and Bach, 2009a)
  • When penalizing by the ℓ1-ℓ2 norm
    $\sum_{G \in H} \|w_G\|_2 = \sum_{G \in H} \Big( \sum_{j \in G} w_j^2 \Big)^{1/2}$
    [Figure: overlapping groups G1, G2, G3]
    – The ℓ1 norm induces sparsity at the group level: some w_G's are set to zero
    – Inside the groups, the ℓ2 norm does not promote sparsity
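
To see how this penalty behaves once groups overlap, the sketch below evaluates Ω(w) = Σ_{G∈H} ‖w_G‖₂ for two overlapping groups (both groups and coefficients are illustrative): a vector whose support avoids one group entirely is cheaper than a vector of the same ℓ2 energy spread over both groups, consistent with whole sub-vectors w_G being set to zero.

```python
import numpy as np

def overlapping_group_norm(w, groups):
    """Omega(w) = sum over G in H of ||w_G||_2; the groups may overlap."""
    return sum(np.linalg.norm(w[g]) for g in groups)

# Two overlapping groups on {0, ..., 4}: G1 = {0,1,2}, G2 = {2,3,4} (illustrative)
groups = [np.array([0, 1, 2]), np.array([2, 3, 4])]

# Two vectors with the same l2 energy but different supports
w_avoids_G2 = np.array([1.0, 1.0, 0.0, 0.0, 0.0])   # support misses G2 entirely
w_spread = np.array([1.0, 0.0, 0.0, 1.0, 0.0])      # support touches both groups

print(overlapping_group_norm(w_avoids_G2, groups))  # ~1.41: only ||w_G1|| is paid
print(overlapping_group_norm(w_spread, groups))     # 2.00: both groups contribute
```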

