Structured sparsity and convex optimization


Francis Bach, INRIA - École Normale Supérieure, Paris, France. Joint work with R. Jenatton, J. Mairal, G. Obozinski. December 2015.


  1. Structured sparsity and convex optimization Francis Bach INRIA - École Normale Supérieure, Paris, France Joint work with R. Jenatton, J. Mairal, G. Obozinski December 2015


  3. Structured sparsity and convex optimization Outline • Structured sparsity • Hierarchical dictionary learning – Known topology but unknown location/projection – Tree: Efficient linear-time computations • Non-linear variable selection – Known topology and location – Directed acyclic graph: semi-efficient active-set algorithm

  4. Sparsity in machine learning and statistics • Assumption: y = w⊤x + ε, with w ∈ R^p sparse – Proxy for interpretability – Allows high-dimensional inference: log p = O(n) • Sparsity and convexity (ℓ1-norm regularization): min_{w ∈ R^p} L(w) + ‖w‖_1 [figure: unit balls of the ℓ2- and ℓ1-norms in the (w1, w2) plane]

  5. Sparsity in supervised machine learning • Observed data (x_i, y_i) ∈ R^p × R, i = 1, ..., n – Response vector y = (y_1, ..., y_n)⊤ ∈ R^n – Design matrix X = (x_1, ..., x_n)⊤ ∈ R^{n×p} • Regularized empirical risk minimization: min_{w ∈ R^p} (1/n) Σ_{i=1}^n ℓ(y_i, w⊤x_i) + λ Ω(w) = min_{w ∈ R^p} L(y, Xw) + λ Ω(w) • Norm Ω to promote sparsity – Main example: ℓ1-norm – Square loss ⇒ basis pursuit in signal processing (Chen et al., 2001), Lasso in statistics/machine learning (Tibshirani, 1996)
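To make this concrete, here is a minimal sketch of the square-loss case, i.e. the Lasso, on synthetic data; the use of scikit-learn, the problem sizes and the regularization value are illustrative assumptions, not part of the talk.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 200                      # high-dimensional setting: p > n
w_true = np.zeros(p)
w_true[:5] = rng.normal(size=5)      # sparse ground-truth weight vector
X = rng.normal(size=(n, p))
y = X @ w_true + 0.1 * rng.normal(size=n)

# scikit-learn's Lasso minimizes (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1,
# which matches the regularized empirical risk above up to the loss scaling.
model = Lasso(alpha=0.1).fit(X, y)
print("nonzero coefficients:", np.count_nonzero(model.coef_))
```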

  6. Sparsity in unsupervised machine learning and signal processing: Dictionary learning • Response y ∈ R^n, design matrix X ∈ R^{n×p} – Lasso: min_{w ∈ R^p} L(y, Xw) + λ Ω(w)

  7. Sparsity in unsupervised machine learning and signal processing: Dictionary learning • Single signal x ∈ R^p, given dictionary D ∈ R^{p×k} – Basis pursuit: min_{α ∈ R^k} L(x, Dα) + λ Ω(α)

  8. Sparsity in unsupervised machine learning and signal processing: Dictionary learning • Single signal x ∈ R^p, given dictionary D ∈ R^{p×k} – Basis pursuit: min_{α ∈ R^k} L(x, Dα) + λ Ω(α) • Multiple signals x_i ∈ R^p, i = 1, ..., n, given dictionary D ∈ R^{p×k}: min_{α_1, ..., α_n ∈ R^k} Σ_{i=1}^n [ L(x_i, Dα_i) + λ Ω(α_i) ]

  9. Sparsity in unsupervised machine learning and signal processing: Dictionary learning • Single signal x ∈ R^p, given dictionary D ∈ R^{p×k} – Basis pursuit: min_{α ∈ R^k} L(x, Dα) + λ Ω(α) • Multiple signals x_i ∈ R^p, i = 1, ..., n, given dictionary D ∈ R^{p×k}: min_{α_1, ..., α_n ∈ R^k} Σ_{i=1}^n [ L(x_i, Dα_i) + λ Ω(α_i) ] • Dictionary learning: also optimize over D = (d_1, ..., d_k) such that ∀j, ‖d_j‖_2 ≤ 1: min_D min_{α_1, ..., α_n ∈ R^k} Σ_{i=1}^n [ L(x_i, Dα_i) + λ Ω(α_i) ] • Olshausen and Field (1997); Elad and Aharon (2006)
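The joint problem over D and the codes is non-convex, and a standard way to attack it is to alternate between the two blocks of variables. Below is a rough sketch of such an alternating scheme under the square loss and the ℓ1 penalty, with a plain least-squares dictionary update followed by projection of the columns onto the unit ball; the sizes and the update rule are illustrative assumptions, not the algorithms of the cited papers.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
p, k, n, lam = 20, 10, 50, 0.1
X = rng.normal(size=(p, n))              # n signals x_i stored as columns
D = rng.normal(size=(p, k))
D /= np.linalg.norm(D, axis=0)           # start with unit-norm atoms d_j

for _ in range(10):
    # Sparse coding: one l1-regularized least-squares problem per signal x_i.
    coder = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
    A = np.column_stack([coder.fit(D, X[:, i]).coef_.copy() for i in range(n)])
    # Dictionary update: least squares in D, then project columns onto ||d_j||_2 <= 1.
    D = X @ np.linalg.pinv(A)
    D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)
```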

  10. Why structured sparsity? • Interpretability – Structured dictionary elements (Jenatton et al., 2009b) – Dictionary elements “organized” in a tree or a grid (Kavukcuoglu et al., 2009; Jenatton et al., 2011b; Mairal et al., 2010)

  11. Why structured sparsity? • Interpretability – Structured dictionary elements (Jenatton et al., 2009b) – Dictionary elements “organized” in a tree or a grid (Kavukcuoglu et al., 2009; Jenatton et al., 2011b; Mairal et al., 2010) • Stability and identifiability – The optimization problem min_{α ∈ R^k} L(x, Dα) + λ‖α‖_1 is unstable – Codes α often used in later processing (Mairal et al., 2009b) • Prediction or estimation performance – When prior knowledge matches data (Haupt and Nowak, 2006; Baraniuk et al., 2008; Jenatton et al., 2009a; Huang et al., 2009)

  12. Why structured sparsity? • Interpretability – Structured dictionary elements (Jenatton et al., 2009b) – Dictionary elements “organized” in a tree or a grid (Kavukcuoglu et al., 2009; Jenatton et al., 2011b; Mairal et al., 2010) • Stability and identifiability – The optimization problem min_{α ∈ R^k} L(x, Dα) + λ‖α‖_1 is unstable – Codes α often used in later processing (Mairal et al., 2009b) • Prediction or estimation performance – When prior knowledge matches data (Haupt and Nowak, 2006; Baraniuk et al., 2008; Jenatton et al., 2009a; Huang et al., 2009) • Multi-resolution analysis

  13. Classical approaches to structured sparsity (pre-2011) • Many application domains – Computer vision (Cevher et al., 2008; Kavukcuoglu et al., 2009; Mairal et al., 2009a) – Neuro-imaging (Gramfort and Kowalski, 2009; Jenatton et al., 2011a) – Bio-informatics (Rapaport et al., 2008; Kim and Xing, 2010) – Audio processing (Lefèvre et al., 2011) • Non-convex approaches – Haupt and Nowak (2006); Baraniuk et al. (2008); Huang et al. (2009) • Convex approaches – Design of sparsity-inducing norms

  14. Classical approaches to structured sparsity (pre-2011) • Many application domains – Computer vision (Cevher et al., 2008; Kavukcuoglu et al., 2009; Mairal et al., 2009a) – Neuro-imaging (Gramfort and Kowalski, 2009; Jenatton et al., 2011a) – Bio-informatics (Rapaport et al., 2008; Kim and Xing, 2010) – Audio processing (Lefèvre et al., 2011) • Non-convex approaches – Haupt and Nowak (2006); Baraniuk et al. (2008); Huang et al. (2009) • Convex approaches – Design of sparsity-inducing norms

  15. Unit-norm balls Geometric interpretation

  16. Hierarchical dictionary learning (Jenatton, Mairal, Obozinski, and Bach, 2011b) • Structure on the codes α (not on the dictionary D) • Hierarchical penalization: Ω(α) = Σ_{G ∈ G} ‖α_G‖_2, where the groups G ∈ G are the sets of descendants of the nodes of a tree • A variable is selected only after its ancestors (Zhao et al., 2009; Bach, 2008b)
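For illustration, the penalty can be written down directly from the tree; the 7-node binary tree, the node indexing and the coefficient values below are made-up examples, not taken from the slides.

```python
import numpy as np

# children[j] lists the children of node j; each node indexes one coefficient.
children = {0: [1, 2], 1: [3, 4], 2: [5, 6], 3: [], 4: [], 5: [], 6: []}

def descendants(node):
    """A node together with all of its descendants: one group G of the penalty."""
    group = [node]
    for c in children[node]:
        group.extend(descendants(c))
    return group

def omega(alpha):
    """Tree-structured penalty: sum over nodes of the l2-norm of the subtree coefficients."""
    return sum(np.linalg.norm(alpha[descendants(j)]) for j in children)

alpha = np.array([1.0, 0.0, 0.5, 0.0, 0.0, 0.2, 0.0])
print(omega(alpha))
```

With this choice of groups, the zeros of the solution form a union of subtrees, so a variable can be selected only if all of its ancestors are selected, which is the property stated on the slide.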

  17. Hierarchical dictionary learning Efficient optimization min_{D ∈ R^{p×k}, A ∈ R^{k×n}} Σ_{i=1}^n ‖x_i − Dα_i‖_2^2 + λ Ω(α_i) s.t. ∀j, ‖d_j‖_2 ≤ 1 • Minimization with respect to α_i: regularized least-squares – Many algorithms dedicated to the ℓ1-norm Ω(α) = ‖α‖_1 • Proximal methods: first-order methods with optimal convergence rate (Nesterov, 2007; Beck and Teboulle, 2009) – Requires solving many times min_{α ∈ R^p} (1/2)‖y − α‖_2^2 + λ Ω(α) • Tree-structured regularization: efficient linear-time algorithm based on primal-dual decomposition (Jenatton et al., 2011b)
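As a reference point, a plain (non-accelerated) proximal-gradient loop for the sparse-coding step looks as follows; this sketch plugs in the ℓ1 prox (soft-thresholding), and the tree-structured prox of the next slides would be substituted in the same place. The step size and iteration count are illustrative, and this is not the implementation of Jenatton et al. (2011b).

```python
import numpy as np

def soft_threshold(v, t):
    """Proximity operator of t * ||.||_1, applied coordinate-wise."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(D, x, lam, n_iter=200):
    """Proximal-gradient iterations for min_alpha 0.5*||x - D alpha||^2 + lam*Omega(alpha)."""
    L = np.linalg.norm(D, 2) ** 2            # Lipschitz constant of the smooth part's gradient
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ alpha - x)         # gradient of the quadratic loss
        alpha = soft_threshold(alpha - grad / L, lam / L)  # prox step on lam*||.||_1
    return alpha
```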

  18. Decomposability of the proximity operator • Sum of simple norms: Ω(α) = Σ_{G ∈ G} ‖α_G‖_2 – Each proximity operator is simple (soft-thresholding of the ℓ2-norm) • In general, the proximity operator of the sum is not the composition of the proximity operators

  19. Decomposability of the proximity operator • Sum of simple norms: Ω(α) = Σ_{G ∈ G} ‖α_G‖_2 – Each proximity operator is simple (soft-thresholding of the ℓ2-norm) • In general, the proximity operator of the sum is not the composition of the proximity operators • In this particular case, it is! – Which direction?

  20. Decomposability of the proximity operator • Sum of simple norms: Ω(α) = Σ_{G ∈ G} ‖α_G‖_2 – Each proximity operator is simple (soft-thresholding of the ℓ2-norm) • In general, the proximity operator of the sum is not the composition of the proximity operators • In this particular case, it is! – From the leaves to the root
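A direct transcription of this statement, as a sketch following the slides' description (made-up tree; not the authors' released code): compose the block soft-thresholdings of all subtree groups, visiting every node only after all of its descendants.

```python
import numpy as np

# Made-up binary tree over 7 coefficients; children[j] lists the children of node j.
children = {0: [1, 2], 1: [3, 4], 2: [5, 6], 3: [], 4: [], 5: [], 6: []}

def descendants(node):
    group = [node]
    for c in children[node]:
        group.extend(descendants(c))
    return group

def group_soft_threshold(v, t):
    """Proximity operator of t * ||.||_2 (block soft-thresholding)."""
    norm = np.linalg.norm(v)
    return np.zeros_like(v) if norm <= t else (1.0 - t / norm) * v

def tree_prox(alpha, lam):
    """Prox of lam * Omega for the tree-structured penalty, composed from the leaves to the root."""
    out = alpha.copy()
    # Sorting by subtree size visits each node after all of its descendants.
    for j in sorted(children, key=lambda j: len(descendants(j))):
        g = descendants(j)
        out[g] = group_soft_threshold(out[g], lam)
    return out

print(tree_prox(np.array([1.0, 0.8, 0.1, 0.6, 0.0, 0.05, 0.0]), 0.3))
```

Each group prox is the soft-thresholding of the ℓ2-norm from slide 18, and processing smaller subtrees first is exactly the leaves-to-root direction of this slide.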

  21. Application to image denoising - Dictionary tree

  22. Hierarchical dictionary learning Modelling of text corpora • Each document is modelled through its word counts • Low-rank matrix factorization of the word-document matrix • Probabilistic topic models (Blei et al., 2003) – Similar structures based on non-parametric Bayesian methods (Blei et al., 2004) – Can we achieve similar performance with a simple matrix factorization formulation?
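As a rough sketch of this matrix-factorization view (toy four-document corpus, plain non-negative matrix factorization from scikit-learn, two topics: all illustrative assumptions rather than the hierarchical formulation of the talk):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF

docs = ["sparse coding of natural images",
        "convex optimization for sparse models",
        "topic models for large text corpora",
        "hierarchical dictionary learning for text"]
counts = CountVectorizer().fit_transform(docs)      # document x word count matrix
nmf = NMF(n_components=2, init="nndsvd", random_state=0)
codes = nmf.fit_transform(counts)                   # per-document codes (mixture weights)
topics = nmf.components_                            # per-topic word weights
```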

  23. Modelling of text corpora - Dictionary tree

  24. Application to neuro-imaging (supervised) Structured sparsity for fMRI (Jenatton et al., 2011a) • “Brain reading”: prediction of (seen) object size • Multi-scale activity levels through hierarchical penalization



  27. Non-linear variable selection • Given x = (x_1, ..., x_q) ∈ R^q, find a function f(x_1, ..., x_q) which depends only on a few variables • Sparse generalized additive models (Ravikumar et al., 2008; Bach, 2008a): – restricted to f(x_1, ..., x_q) = f_1(x_1) + ··· + f_q(x_q) • Cosso (Lin and Zhang, 2006): – restricted to f(x_1, ..., x_q) = Σ_{J ⊂ {1,...,q}, |J| ≤ 2} f_J(x_J)
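As a crude sketch of the additive route (not an implementation of the methods cited above): expand each variable into a few polynomial features, assign one group per variable, and run a plain proximal-gradient loop with a group penalty so that whole components f_j are kept or discarded. The data, basis, regularization level and solver are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q, degree, lam = 200, 5, 3, 5.0
X = rng.uniform(-1, 1, size=(n, q))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=n)   # depends on 2 of the q variables

# Design matrix: polynomial features of each variable, with one group per variable.
Phi = np.concatenate([np.column_stack([X[:, j] ** d for d in range(1, degree + 1)])
                      for j in range(q)], axis=1)
groups = [list(range(j * degree, (j + 1) * degree)) for j in range(q)]

def group_soft_threshold(v, t):
    norm = np.linalg.norm(v)
    return np.zeros_like(v) if norm <= t else (1.0 - t / norm) * v

L = np.linalg.norm(Phi, 2) ** 2          # Lipschitz constant of the squared-loss gradient
w = np.zeros(Phi.shape[1])
for _ in range(500):
    w -= Phi.T @ (Phi @ w - y) / L       # gradient step on the squared loss
    for g in groups:                     # group-wise prox: zero out whole components f_j
        w[g] = group_soft_threshold(w[g], lam / L)

selected = [j for j, g in enumerate(groups) if np.linalg.norm(w[g]) > 1e-8]
print("selected variables:", selected)
```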
