SLIDE 1

Structured sparsity and convex optimization

Francis Bach, INRIA - École Normale Supérieure, Paris, France


Joint work with R. Jenatton, J. Mairal, G. Obozinski December 2015

SLIDE 3

Structured sparsity and convex optimization - Outline

  • Structured sparsity

  • Hierarchical dictionary learning

– Known topology but unknown location/projection
– Tree: efficient linear-time computations

  • Non-linear variable selection

– Known topology and location
– Directed acyclic graph: semi-efficient active-set algorithm

SLIDE 4

Sparsity in machine learning and statistics

  • Assumption: y = w⊤x + ε, with w ∈ Rp sparse

– Proxy for interpretability
– Allow high-dimensional inference: log p = O(n)

  • Sparsity and convexity (ℓ1-norm regularization):

    min_{w∈Rp} L(w) + λ‖w‖1

[Figure: two panels in the (w1, w2)-plane illustrating the geometry of ℓ1-regularization]
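Note (added for illustration, not part of the original slides): the ℓ1 penalty above is what yields exact zeros in the solution, and its proximal operator has the familiar soft-thresholding closed form; a minimal NumPy sketch, with names of our choosing:

```python
import numpy as np

def soft_threshold(y, lam):
    """Proximal operator of lam * ||.||_1: shrink every coordinate towards 0,
    setting coordinates with |y_j| <= lam exactly to 0 (hence the sparsity)."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

print(soft_threshold(np.array([3.0, -0.5, 1.2]), lam=1.0))  # -> [ 2.  -0.   0.2]
```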

SLIDE 5

Sparsity in supervised machine learning

  • Observed data (xi, yi) ∈ Rp × R, i = 1, . . . , n

– Response vector y = (y1, . . . , yn)⊤ ∈ Rn
– Design matrix X = (x1, . . . , xn)⊤ ∈ Rn×p

  • Regularized empirical risk minimization:

    min_{w∈Rp} (1/n) Σ_{i=1}^{n} ℓ(yi, w⊤xi) + λΩ(w)  =  min_{w∈Rp} L(y, Xw) + λΩ(w)

  • Norm Ω to promote sparsity

– Main example: ℓ1-norm
– Square loss ⇒ basis pursuit in signal processing (Chen et al., 2001), Lasso in statistics/machine learning (Tibshirani, 1996)
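Note (our illustration, not code from the talk): for the square loss and Ω = ℓ1-norm, the problem above is the Lasso, and a minimal proximal-gradient (ISTA) solver looks as follows; the step-size choice and function names are ours.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=200):
    """Minimize (1/(2n)) * ||y - X w||_2^2 + lam * ||w||_1 by proximal gradient (ISTA)."""
    n, p = X.shape
    w = np.zeros(p)
    step = n / (np.linalg.norm(X, 2) ** 2)          # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n                # gradient of the smooth (square-loss) part
        z = w - step * grad                         # gradient step
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # prox of step*lam*||.||_1
    return w
```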

SLIDE 6

Sparsity in unsupervised machine learning and signal processing: Dictionary learning

  • Responses y ∈ Rn, design matrix X ∈ Rn×p

– Lasso: min_{w∈Rp} L(y, Xw) + λΩ(w)

SLIDE 7

Sparsity in unsupervised machine learning and signal processing: Dictionary learning

  • Single signal x ∈ Rp, given dictionary D ∈ Rp×k

– Basis pursuit: min_{α∈Rk} L(x, Dα) + λΩ(α)

SLIDE 8

Sparsity in unsupervised machine learning and signal processing: Dictionary learning

  • Single signal x ∈ Rp, given dictionary D ∈ Rp×k

– Basis pursuit: min_{α∈Rk} L(x, Dα) + λΩ(α)

  • Multiple signals xi ∈ Rp, i = 1, . . . , n, given dictionary D ∈ Rp×k

    min_{α1,...,αn∈Rk} Σ_{i=1}^{n} L(xi, Dαi) + λΩ(αi)
SLIDE 9

Sparsity in unsupervised machine learning and signal processing: Dictionary learning

  • Single signal x ∈ Rp, given dictionary D ∈ Rp×k

– Basis pursuit: min_{α∈Rk} L(x, Dα) + λΩ(α)

  • Multiple signals xi ∈ Rp, i = 1, . . . , n, given dictionary D ∈ Rp×k

    min_{α1,...,αn∈Rk} Σ_{i=1}^{n} L(xi, Dαi) + λΩ(αi)

  • Dictionary learning: D = (d1, . . . , dk) such that ∀j, ‖dj‖2 ≤ 1

    min_{D} min_{α1,...,αn∈Rk} Σ_{i=1}^{n} L(xi, Dαi) + λΩ(αi)

  • Olshausen and Field (1997); Elad and Aharon (2006)
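Note (our illustration, not code from the talk): the bi-level objective above is typically handled by alternating between the sparse codes and the dictionary; the sketch below uses the square loss, an ℓ1 penalty, a simple projected least-squares dictionary update, and names of our choosing.

```python
import numpy as np

def dictionary_learning(X, k, lam, n_outer=20, n_inner=50, seed=0):
    """Alternate over A (sparse coding) and D (dictionary update) for
    min_D min_A (1/2)*||X - D A||_F^2 + lam * sum_i ||a_i||_1, with ||d_j||_2 <= 1."""
    p, n = X.shape
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((p, k))
    D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)       # project columns onto the unit ball
    A = np.zeros((k, n))
    for _ in range(n_outer):
        # Sparse coding: one Lasso per signal, solved jointly by proximal gradient.
        step = 1.0 / (np.linalg.norm(D, 2) ** 2 + 1e-12)
        for _ in range(n_inner):
            G = D.T @ (D @ A - X)                          # gradient of the smooth part
            Z = A - step * G
            A = np.sign(Z) * np.maximum(np.abs(Z) - step * lam, 0.0)
        # Dictionary update: least squares in D, then a heuristic projection of its columns.
        D = X @ A.T @ np.linalg.pinv(A @ A.T + 1e-8 * np.eye(k))
        D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)
    return D, A
```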
SLIDE 10

Why structured sparsity?

  • Interpretability

– Structured dictionary elements (Jenatton et al., 2009b)
– Dictionary elements “organized” in a tree or a grid (Kavukcuoglu et al., 2009; Jenatton et al., 2011b; Mairal et al., 2010)

SLIDE 11

Why structured sparsity?

  • Interpretability

– Structured dictionary elements (Jenatton et al., 2009b)
– Dictionary elements “organized” in a tree or a grid (Kavukcuoglu et al., 2009; Jenatton et al., 2011b; Mairal et al., 2010)

  • Stability and identifiability

– Optimization problem min_{α∈Rp} L(x, Dα) + λ‖α‖1 is unstable
– Codes α often used in later processing (Mairal et al., 2009b)

  • Prediction or estimation performance

– When prior knowledge matches data (Haupt and Nowak, 2006; Baraniuk et al., 2008; Jenatton et al., 2009a; Huang et al., 2009)

SLIDE 12

Why structured sparsity?

  • Interpretability

– Structured dictionary elements (Jenatton et al., 2009b)
– Dictionary elements “organized” in a tree or a grid (Kavukcuoglu et al., 2009; Jenatton et al., 2011b; Mairal et al., 2010)

  • Stability and identifiability

– Optimization problem min_{α∈Rp} L(x, Dα) + λ‖α‖1 is unstable
– Codes α often used in later processing (Mairal et al., 2009b)

  • Prediction or estimation performance

– When prior knowledge matches data (Haupt and Nowak, 2006; Baraniuk et al., 2008; Jenatton et al., 2009a; Huang et al., 2009)

  • Multi-resolution analysis
SLIDE 13

Classical approaches to structured sparsity (pre-2011)

  • Many application domains

– Computer vision (Cevher et al., 2008; Kavukcuoglu et al., 2009; Mairal et al., 2009a)
– Neuro-imaging (Gramfort and Kowalski, 2009; Jenatton et al., 2011a)
– Bio-informatics (Rapaport et al., 2008; Kim and Xing, 2010)
– Audio processing (Lefèvre et al., 2011)

  • Non-convex approaches

– Haupt and Nowak (2006); Baraniuk et al. (2008); Huang et al. (2009)

  • Convex approaches

– Design of sparsity-inducing norms
SLIDE 14

Classical approaches to structured sparsity (pre-2011)

  • Many application domains

– Computer vision (Cevher et al., 2008; Kavukcuoglu et al., 2009; Mairal et al., 2009a)
– Neuro-imaging (Gramfort and Kowalski, 2009; Jenatton et al., 2011a)
– Bio-informatics (Rapaport et al., 2008; Kim and Xing, 2010)
– Audio processing (Lefèvre et al., 2011)

  • Non-convex approaches

– Haupt and Nowak (2006); Baraniuk et al. (2008); Huang et al. (2009)

  • Convex approaches

– Design of sparsity-inducing norms

SLIDE 15

Unit-norm balls - Geometric interpretation

SLIDE 16

Hierarchical dictionary learning (Jenatton, Mairal, Obozinski, and Bach, 2011b)

  • Structure on codes α (not on dictionary D)
  • Hierarchical penalization: Ω(α) = Σ_{G∈G} ‖αG‖2, where the groups G ∈ G are the sets of descendants of the nodes of a tree

  • Variable selected after its ancestors (Zhao et al., 2009; Bach, 2008b)
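Note (our illustration, not from the slides): given the tree, the hierarchical penalty above is straightforward to evaluate; the sketch below assumes one coefficient per node and encodes the tree by a children map, with names of our choosing.

```python
import numpy as np

def tree_norm(alpha, children):
    """Omega(alpha) = sum over nodes g of || alpha restricted to descendants(g) ||_2,
    where descendants(g) includes g itself."""
    def descendants(g):
        out = [g]
        for c in children[g]:
            out.extend(descendants(c))
        return out
    return sum(np.linalg.norm(alpha[descendants(g)]) for g in range(len(alpha)))

# Small tree: node 0 is the root with children 1 and 2; node 2 has child 3.
children = {0: [1, 2], 1: [], 2: [3], 3: []}
alpha = np.array([1.0, 0.0, 2.0, 0.5])
print(tree_norm(alpha, children))
```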
SLIDE 17

Hierarchical dictionary learning - Efficient optimization

    min_{D∈Rp×k, A=(α1,...,αn)∈Rk×n} Σ_{i=1}^{n} ‖xi − Dαi‖2² + λΩ(αi)   s.t. ∀j, ‖dj‖2 ≤ 1

  • Minimization with respect to αi: regularized least-squares

– Many algorithms dedicated to the ℓ1-norm Ω(α) = ‖α‖1

  • Proximal methods: first-order methods with optimal convergence rate (Nesterov, 2007; Beck and Teboulle, 2009)

– Requires solving many times min_{α∈Rp} ½‖y − α‖2² + λΩ(α)

  • Tree-structured regularization: efficient linear-time algorithm based on primal-dual decomposition (Jenatton et al., 2011b)

SLIDE 18

Decomposability of the proximity operator

  • Sum of simple norms: Ω(α) = Σ_{G∈G} ‖αG‖2

– Each proximity operator is simple (soft-thresholding of ℓ2-norm)

  • In general, the proximity operator of the sum is not the composition of proximity operators

SLIDE 19

Decomposability of the proximity operator

  • Sum of simple norms: Ω(α) = Σ_{G∈G} ‖αG‖2

– Each proximity operator is simple (soft-thresholding of ℓ2-norm)

  • In general, the proximity operator of the sum is not the composition of proximity operators

  • In this particular case, it is!

– Which direction?

SLIDE 20

Decomposability of the proximity operator

  • Sum of simple norms: Ω(α) = Σ_{G∈G} ‖αG‖2

– Each proximity operator is simple (soft-thresholding of ℓ2-norm)

  • In general, the proximity operator of the sum is not the composition of proximity operators

  • In this particular case, it is!

– From leaves to the root
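Note (our illustration, with names of our choosing): for tree-structured groups, the proximal operator of the sum can be obtained by composing the group soft-thresholdings node by node, going from the leaves up to the root (Jenatton et al., 2011b); a minimal sketch:

```python
import numpy as np

def group_soft_threshold(u, lam):
    """Prox of lam * ||.||_2 on a vector u (block soft-thresholding)."""
    norm = np.linalg.norm(u)
    if norm <= lam:
        return np.zeros_like(u)
    return (1.0 - lam / norm) * u

def tree_prox(y, children, lam, root=0):
    """Prox of lam * sum_g ||alpha_{descendants(g)}||_2 for tree-structured groups,
    computed as one pass of group soft-thresholdings from the leaves to the root."""
    alpha = y.astype(float).copy()
    def descendants(g):
        out = [g]
        for c in children[g]:
            out.extend(descendants(c))
        return out
    def recurse(g):
        for c in children[g]:
            recurse(c)                      # process the subtrees (leaves) first
        idx = descendants(g)
        alpha[idx] = group_soft_threshold(alpha[idx], lam)
    recurse(root)
    return alpha

# Tiny example: chain 0 -> 1 -> 2 (node 0 is the root).
children = {0: [1], 1: [2], 2: []}
print(tree_prox(np.array([1.0, 0.5, 0.2]), children, lam=0.3))
```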

SLIDE 21

Application to image denoising - Dictionary tree

SLIDE 22

Hierarchical dictionary learning - Modelling of text corpora

  • Each document is modelled through word counts
  • Low-rank matrix factorization of word-document matrix
  • Probabilistic topic models (Blei et al., 2003)

– Similar structures based on nonparametric Bayesian methods (Blei et al., 2004)
– Can we achieve similar performance with a simple matrix factorization formulation?

SLIDE 23

Modelling of text corpora - Dictionary tree

SLIDE 24

Application to neuro-imaging (supervised) - Structured sparsity for fMRI (Jenatton et al., 2011a)

  • “Brain reading”: prediction of (seen) object size
  • Multi-scale activity levels through hierarchical penalization
SLIDE 25

Application to neuro-imaging (supervised) - Structured sparsity for fMRI (Jenatton et al., 2011a)

  • “Brain reading”: prediction of (seen) object size
  • Multi-scale activity levels through hierarchical penalization
SLIDE 26

Application to neuro-imaging (supervised) - Structured sparsity for fMRI (Jenatton et al., 2011a)

  • “Brain reading”: prediction of (seen) object size
  • Multi-scale activity levels through hierarchical penalization
SLIDE 27

Non-linear variable selection

  • Given x = (x1, . . . , xq) ∈ Rq, find a function f(x1, . . . , xq) which depends only on a few variables

  • Sparse generalized additive models (Ravikumar et al., 2008; Bach, 2008a):

– restricted to f(x1, . . . , xq) = f1(x1) + · · · + fq(xq)

  • COSSO (Lin and Zhang, 2006):

– restricted to f(x1, . . . , xq) = Σ_{J⊂{1,...,q}, |J|≤2} fJ(xJ)

SLIDE 28

Non-linear variable selection

  • Given x = (x1, . . . , xq) ∈ Rq, find a function f(x1, . . . , xq) which depends only on a few variables

  • Sparse generalized additive models (Ravikumar et al., 2008; Bach, 2008a):

– restricted to f(x1, . . . , xq) = f1(x1) + · · · + fq(xq)

  • COSSO (Lin and Zhang, 2006):

– restricted to f(x1, . . . , xq) = Σ_{J⊂{1,...,q}, |J|≤2} fJ(xJ)

  • Universally consistent non-linear selection requires all 2^q subsets:

    f(x1, . . . , xq) = Σ_{J⊂{1,...,q}} fJ(xJ)
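Note (a small illustration we add, with names of our choosing): the restriction on |J| is what keeps the number of components fJ manageable; without it the count is exponential in q.

```python
from itertools import combinations

def n_components(q, max_order=None):
    """Number of non-empty subsets J of {1,...,q} with |J| <= max_order (all of them if None)."""
    max_order = q if max_order is None else max_order
    return sum(len(list(combinations(range(q), k))) for k in range(1, max_order + 1))

q = 10
print(n_components(q, 1))   # additive model (one fJ per variable):      10
print(n_components(q, 2))   # COSSO-style, up to two-way interactions:   55
print(n_components(q))      # all non-empty subsets: 2**q - 1 =        1023
```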

SLIDE 29

Restricting the set of active kernels

  • V = set of subsets of {1, . . . , q}
  • One separate predictor wv for each subset of variables v ⊂ {1, . . . , q}

– Final prediction: f(x) = Σ_{v∈V} wv⊤ Φv(x)
– Implicit through kernel methods

  • With flat structure

– Consider block ℓ1-norm: Σ_{v∈V} ‖wv‖2
– cannot avoid being linear in p = #(V) = 2^q

SLIDE 30

Restricting the set of active kernels

  • V is endowed with a directed acyclic graph (DAG) structure: select a kernel only after all of its ancestors have been selected

  • Select a subset only after all its subsets have been selected

[Figure: DAG over the non-empty subsets of {1, 2, 3, 4}, ordered by inclusion]

SLIDE 31

DAG-adapted norm (Zhao & Yu, 2008)

  • Graph-based structured regularization

– D(v) is the set of descendants of v ∈ V:

    Σ_{v∈V} ‖w_{D(v)}‖2 = Σ_{v∈V} ( Σ_{t∈D(v)} ‖wt‖2² )^{1/2}

  • Main property: If v is selected, so are all its ancestors

[Figure: subset DAG of {1, 2, 3, 4} illustrating the selection property]
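Note (our illustration, with names of our choosing): the DAG-adapted norm above can be evaluated directly from the subset DAG; the sketch below builds the power set of {1,...,q}, takes descendants of a vertex to be its supersets, and sums the corresponding group norms.

```python
import numpy as np
from itertools import combinations

def power_set(q):
    """Non-empty subsets of {1,...,q}, as frozensets (the vertex set V)."""
    return [frozenset(c) for k in range(1, q + 1) for c in combinations(range(1, q + 1), k)]

def dag_norm(w, V):
    """Omega(w) = sum over v in V of || w restricted to D(v) ||_2,
    where D(v) = descendants of v = supersets of v in the subset DAG."""
    index = {v: i for i, v in enumerate(V)}
    total = 0.0
    for v in V:
        desc = [index[t] for t in V if v <= t]   # v <= t : v is a subset of t
        total += np.linalg.norm(w[desc])
    return total

V = power_set(3)                  # 7 vertices: {1}, {2}, {3}, {1,2}, ..., {1,2,3}
w = np.zeros(len(V)); w[0] = 1.0  # only the predictor on {1} is non-zero
print(dag_norm(w, V))
```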

SLIDE 32

DAG-adapted norm (Zhao & Yu, 2008)

  • Graph-based structured regularization

– D(v) is the set of descendants of v ∈ V:

    Σ_{v∈V} ‖w_{D(v)}‖2 = Σ_{v∈V} ( Σ_{t∈D(v)} ‖wt‖2² )^{1/2}

  • Main property: If v is selected, so are all its ancestors

[Figure: subset DAG of {1, 2, 3, 4} illustrating the selection property]

SLIDE 33

DAG-adapted norm (Zhao & Yu, 2008)

  • Graph-based structured regularization

– D(v) is the set of descendants of v ∈ V:

    Σ_{v∈V} ‖w_{D(v)}‖2 = Σ_{v∈V} ( Σ_{t∈D(v)} ‖wt‖2² )^{1/2}

  • Main property: If v is selected, so are all its ancestors

[Figure: subset DAG of {1, 2, 3, 4} illustrating the selection property]

SLIDE 34

DAG-adapted norm (Zhao & Yu, 2008)

  • Graph-based structured regularization

– D(v) is the set of descendants of v ∈ V:

    Σ_{v∈V} ‖w_{D(v)}‖2 = Σ_{v∈V} ( Σ_{t∈D(v)} ‖wt‖2² )^{1/2}

  • Main property: If v is selected, so are all its ancestors

[Figure: subset DAG of {1, 2, 3, 4} illustrating the selection property]

SLIDE 35

DAG-adapted norm (Zhao & Yu, 2008)

  • Graph-based structured regularization

– D(v) is the set of descendants of v ∈ V:

    Σ_{v∈V} ‖w_{D(v)}‖2 = Σ_{v∈V} ( Σ_{t∈D(v)} ‖wt‖2² )^{1/2}

  • Main property: If v is selected, so are all its ancestors
  • Hierarchical kernel learning (Bach, 2008b):

– polynomial-time active-set algorithm for this norm
– necessary/sufficient conditions for consistent kernel selection
– scaling between p, q, n for consistency
– applications to variable selection or other kernels
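Note (our simplified illustration, not the algorithm of Bach, 2008b): the general shape of a DAG active-set method is to solve the problem restricted to the current active set, inspect only the frontier of non-active vertices whose parents are all active, and add the vertices that violate the optimality check; `solve_restricted` and `violated` below are hypothetical callbacks standing in for the actual restricted solver and certificate.

```python
def active_set_loop(parents, sources, solve_restricted, violated, max_iter=100):
    """Grow the active set only through vertices whose parents are already active,
    re-solving the restricted problem after each addition (a generic sketch)."""
    vertices = list(parents)
    active = set(sources)                      # start from the sources of the DAG
    w = solve_restricted(active)
    for _ in range(max_iter):
        frontier = {v for v in vertices if v not in active
                    and all(u in active for u in parents[v])}
        to_add = {v for v in frontier if violated(v, w)}
        if not to_add:
            return active, w                   # frontier passes the optimality check
        active |= to_add
        w = solve_restricted(active)
    return active, w

# Toy usage with placeholder callbacks (illustration only).
parents = {"1": [], "2": [], "12": ["1", "2"]}
active, w = active_set_loop(parents, sources=["1", "2"],
                            solve_restricted=lambda A: {v: 1.0 for v in A},
                            violated=lambda v, w: False)
print(sorted(active))                          # ['1', '2']
```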

SLIDE 36

Scaling between p, n and other graph-related quantities

n = number of observations
p = number of vertices in the DAG
deg(V) = maximum out-degree in the DAG
num(V) = number of connected components in the DAG

  • Proposition (Bach, 2009): Assume consistency condition satisfied, Gaussian noise, and data generated from a sparse function; then the support is recovered with high probability as soon as: log deg(V) + log num(V) = O(n)

SLIDE 37

Scaling between p, n and other graph-related quantities

n = number of observations
p = number of vertices in the DAG
deg(V) = maximum out-degree in the DAG
num(V) = number of connected components in the DAG

  • Proposition (Bach, 2009): Assume consistency condition satisfied, Gaussian noise, and data generated from a sparse function; then the support is recovered with high probability as soon as: log deg(V) + log num(V) = O(n)

  • Unstructured case: num(V ) = p ⇒ log p = O(n)
  • Power set of q elements: deg(V ) = q ⇒ log q = log log p = O(n)
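Note (a back-of-the-envelope illustration we add): for q = 1000 input variables the two scalings above differ by two orders of magnitude.

```python
import math

q = 1000                          # number of input variables; p = 2**q kernels in the power set
print(round(q * math.log(2), 1))  # log p ≈ 693.1 : what the unstructured case would require to be O(n)
print(round(math.log(q), 1))      # log q ≈ 6.9   : what the structured (power-set DAG) case requires
```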
SLIDE 38

Conclusion

  • Hierarchical sparsity

– Known topologies within supervised and unsupervised learning
– Unknown topologies (Shervashidze and Bach, 2015)

  • Algorithmic issues

– Large datasets
– Structured sparsity and convex optimization

  • Theoretical issues

– Identifiability of structures and features
– Improved predictive performance
– Other approaches to sparsity and structure (e.g., submodularity)

SLIDE 39

References

  • F. Bach. Consistency of the group Lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179–1225, 2008a.
  • F. Bach. Exploring large feature spaces with hierarchical multiple kernel learning. In Advances in Neural Information Processing Systems, 2008b.
  • F. Bach. High-dimensional non-linear variable selection through hierarchical kernel learning. Technical report, arXiv:0909.0844, 2009.
  • R. G. Baraniuk, V. Cevher, M. F. Duarte, and C. Hegde. Model-based compressive sensing. Technical report, arXiv:0808.3572, 2008.
  • A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
  • D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, January 2003.
  • D. Blei, T. L. Griffiths, M. I. Jordan, and J. B. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. Advances in Neural Information Processing Systems, 16:106, 2004.
  • V. Cevher, M. F. Duarte, C. Hegde, and R. G. Baraniuk. Sparse signal recovery using Markov random fields. In Advances in Neural Information Processing Systems, 2008.
  • S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129–159, 2001.
  • M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736–3745, 2006.
  • A. Gramfort and M. Kowalski. Improving M/EEG source localization with an inter-condition sparse prior. In IEEE International Symposium on Biomedical Imaging, 2009.
  • J. Haupt and R. Nowak. Signal reconstruction from noisy random projections. IEEE Transactions on Information Theory, 52(9):4036–4048, 2006.
  • J. Huang, T. Zhang, and D. Metaxas. Learning with structured sparsity. In Proceedings of the 26th International Conference on Machine Learning (ICML), 2009.
  • R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. Technical report, arXiv:0904.3523, 2009a.
  • R. Jenatton, G. Obozinski, and F. Bach. Structured sparse principal component analysis. Technical report, arXiv:0909.1440, 2009b.
  • R. Jenatton, A. Gramfort, V. Michel, G. Obozinski, E. Eger, F. Bach, and B. Thirion. Multi-scale mining of fMRI data with hierarchical structured sparsity. Technical report, arXiv:1105.0363, 2011a. In submission to SIAM Journal on Imaging Sciences.
  • R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for sparse hierarchical dictionary learning. Submitted to ICML, 2011b.
  • K. Kavukcuoglu, M. Ranzato, R. Fergus, and Y. LeCun. Learning invariant features through topographic filter maps. In Proceedings of CVPR, 2009.
  • S. Kim and E. P. Xing. Tree-guided group Lasso for multi-task regression with structured sparsity. In Proceedings of the International Conference on Machine Learning (ICML), 2010.
  • A. Lefèvre, F. Bach, and C. Févotte. Itakura-Saito nonnegative matrix factorization with group sparsity. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2011.
  • Y. Lin and H. H. Zhang. Component selection and smoothing in multivariate nonparametric regression. Annals of Statistics, 34(5):2272–2297, 2006.
  • J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Non-local sparse models for image restoration. In IEEE 12th International Conference on Computer Vision, pages 2272–2279. IEEE, 2009a.
  • J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Supervised dictionary learning. Advances in Neural Information Processing Systems (NIPS), 21, 2009b.
  • J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Network flow algorithms for structured sparsity. In NIPS, 2010.
  • Y. Nesterov. Gradient methods for minimizing composite objective function. Technical report, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain, 2007.
  • B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37:3311–3325, 1997.
  • F. Rapaport, E. Barillot, and J.-P. Vert. Classification of arrayCGH data using fused SVM. Bioinformatics, 24(13):i375–i382, July 2008.
  • P. Ravikumar, H. Liu, J. Lafferty, and L. Wasserman. SpAM: Sparse additive models. In Advances in Neural Information Processing Systems (NIPS), 2008.
  • N. Shervashidze and F. Bach. Learning the structure for structured sparsity. IEEE Transactions on Signal Processing, 63(18):4894–4902, 2015.
  • R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
  • P. Zhao, G. Rocha, and B. Yu. Grouped and hierarchical model selection through composite absolute penalties. Annals of Statistics, 37(6A):3468–3497, 2009.