Structured sparse methods for matrix factorization
Francis Bach, Sierra team, INRIA - École Normale Supérieure
March 2011
Joint work with R. Jenatton, J. Mairal, G. Obozinski
Structured sparse methods for matrix factorization - Outline
- Learning problems on matrices
- Sparse methods for matrices
– Sparse principal component analysis
– Dictionary learning
- Structured sparse PCA
– Sparsity-inducing norms and overlapping groups
– Structure on dictionary elements
– Structure on decomposition coefficients
Learning on matrices - Collaborative filtering
- Given nX “movies” x ∈ X and nY “customers” y ∈ Y,
- Predict the “rating” z(x, y) ∈ Z of customer y for movie x
- Training data: large nX × nY incomplete matrix Z that describes the known ratings of some customers for some movies
- Goal: complete the matrix.
[Figure: incomplete rating matrix with known entries in {1, 2, 3}; the goal is to fill in the missing cells]
Learning on matrices - Image denoising
- Simultaneously denoise all patches of a given image
- Example from Mairal, Bach, Ponce, Sapiro, and Zisserman (2009c)
Learning on matrices - Source separation
- Single microphone (Benaroya et al., 2006; Févotte et al., 2009)
Learning on matrices - Multi-task learning
- k linear prediction tasks on same covariates x ∈ Rp
– k weight vectors wj ∈ Rp
– Joint matrix of predictors W = (w1, . . . , wk) ∈ Rp×k
- Classical applications
– Transfer learning
– Multi-category classification (one task per class) (Amit et al., 2007)
- Share parameters between tasks
– Joint variable or feature selection (Obozinski et al., 2009; Pontil et al., 2007)
Learning on matrices - Dimension reduction
- Given data matrix X = (x1⊤, . . . , xn⊤) ∈ Rn×p
– Principal component analysis: xi ≈ Dαi
– K-means: xi ≈ dk ⇒ X = DA
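To make the factorization view concrete, here is a small sketch (my own illustration, not from the slides) showing that PCA and K-means both produce an approximation X ≈ DA with data points as columns; they differ only in the constraints placed on D and A:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
p, n, k = 10, 200, 3
X = rng.standard_normal((p, k)) @ rng.standard_normal((k, n))  # approximately rank-k data
X += 0.01 * rng.standard_normal((p, n))

# PCA: D = top-k left singular vectors, A = unconstrained projections.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
D_pca = U[:, :k]
A_pca = D_pca.T @ X
print("PCA residual:", np.linalg.norm(X - D_pca @ A_pca))

# K-means: D = centroids, A = one-hot assignments (each xi approximated by one dk).
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X.T)
D_km = km.cluster_centers_.T
A_km = np.eye(k)[km.labels_].T      # exactly one 1 per column: U in {0,1}, U1 = 1
print("K-means residual:", np.linalg.norm(X - D_km @ A_km))
```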
Sparsity in machine learning
- Assumption: y = w⊤x + ε, with w ∈ Rp sparse
– Proxy for interpretability
– Allows high-dimensional inference: log p = O(n)
- Sparsity and convexity (ℓ1-norm regularization):

min_{w∈Rp} L(w) + λ‖w‖₁

[Figure: regularization balls in 2-D (axes w1, w2): the corners of the ℓ1-ball induce sparse solutions, unlike the smooth ℓ2-ball]
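As an illustration of this convex formulation, the sketch below solves the ℓ1-regularized least-squares (Lasso) problem by iterative soft-thresholding (ISTA); the helper names and step-size choice are mine, not from the talk:

```python
import numpy as np

def soft_threshold(v, t):
    # Prox of t * ||.||_1: shrink every coordinate toward zero by t.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    # min_w 0.5 * ||y - Xw||_2^2 + lam * ||w||_1
    L = np.linalg.norm(X, 2) ** 2        # Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)
        w = soft_threshold(w - grad / L, lam / L)
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 100))        # n = 50 observations, p = 100 variables
w_true = np.zeros(100); w_true[:5] = 1.0  # sparse ground truth
y = X @ w_true + 0.01 * rng.standard_normal(50)
print("nonzeros:", np.count_nonzero(lasso_ista(X, y, lam=0.5)))
```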
Two types of sparsity for matrices M ∈ Rn×p (I) - Directly on the elements of M
- Many zero elements: Mij = 0
- Many zero rows (or columns): (Mi1, . . . , Mip) = 0
Two types of sparsity for matrices M ∈ Rn×p (II) - Through a factorization M = UV⊤
- Matrix M = UV⊤, with U ∈ Rn×k and V ∈ Rp×k
- Low rank: k small
- Sparse decomposition: U sparse
Structured (sparse) matrix factorizations
- Matrix M = UV⊤, U ∈ Rn×k and V ∈ Rp×k
- Structure on U and/or V
– Low-rank: U and V have few columns
– Dictionary learning / sparse PCA: U has many zeros
– Clustering (k-means): U ∈ {0, 1}n×k, U1 = 1
– Pointwise positivity: non-negative matrix factorization (NMF)
– Specific patterns of zeros
– Low-rank + sparse (Candès et al., 2009)
– etc.
- Many applications
- Many open questions: algorithms, identifiability, evaluation
Sparse principal component analysis
- Given data X = (x1, . . . , xn) ∈ Rp×n, two views of PCA:
– Analysis view: find the projection d ∈ Rp of maximum variance (with deflation to obtain more components)
– Synthesis view: find the basis d1, . . . , dk such that all xi have low reconstruction error when decomposed on this basis
- For regular PCA, the two views are equivalent
- Sparse (and/or non-negative) extensions
– Interpretability
– High-dimensional inference
– The two views are different
– For the analysis view, see d’Aspremont, Bach, and El Ghaoui (2008)
Sparse principal component analysis - Synthesis view
- Find d1, . . . , dk ∈ Rp sparse so that

Σ_{i=1}^n min_{αi∈Rk} ‖xi − Σ_{j=1}^k (αi)j dj‖₂² = Σ_{i=1}^n min_{αi∈Rk} ‖xi − Dαi‖₂²  is small

– Look for A = (α1, . . . , αn) ∈ Rk×n and D = (d1, . . . , dk) ∈ Rp×k such that D is sparse and ‖X − DA‖F² is small
- Sparse formulation (Witten et al., 2009; Bach et al., 2008)
– Penalize/constrain dj by the ℓ1-norm for sparsity
– Penalize/constrain αi by the ℓ2-norm to avoid trivial solutions

min_{D,A} Σ_{i=1}^n ‖xi − Dαi‖₂² + λ Σ_{j=1}^k ‖dj‖₁  s.t.  ∀i, ‖αi‖₂ ≤ 1
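For reference, scikit-learn's SparsePCA implements essentially this synthesis formulation (ℓ1 penalty on the dictionary elements, ℓ2-constrained codes); a hedged usage sketch, with purely illustrative parameter values:

```python
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 30))     # n = 100 samples xi in R^30 (rows)

spca = SparsePCA(n_components=5, alpha=1.0, random_state=0)  # alpha plays the role of lambda
A = spca.fit_transform(X)              # codes alpha_i, shape (100, 5)
D = spca.components_                   # dictionary elements dj, shape (5, 30), many exact zeros
print("fraction of zeros in D:", np.mean(D == 0))
```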
Sparse PCA vs. dictionary learning
- Sparse PCA: xi ≈ Dαi, D sparse
- Dictionary learning: xi ≈ Dαi, αi sparse
Structured matrix factorizations (Bach et al., 2008)
min_{D,A} Σ_{i=1}^n ‖xi − Dαi‖₂² + λ Σ_{j=1}^k ‖dj‖⋆  s.t.  ∀i, ‖αi‖• ≤ 1

min_{D,A} Σ_{i=1}^n ‖xi − Dαi‖₂² + λ Σ_{i=1}^n ‖αi‖•  s.t.  ∀j, ‖dj‖⋆ ≤ 1
- Optimization by alternating minimization (non-convex)
- αi decomposition coefficients (or “code”), dj dictionary elements
- Two related/equivalent problems:
– Sparse PCA = sparse dictionary (ℓ1-norm on dj)
– Dictionary learning = sparse decompositions (ℓ1-norm on αi) (Olshausen and Field, 1997; Elad and Aharon, 2006; Lee et al., 2007)
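A bare-bones sketch of this alternating minimization for the dictionary-learning variant (ℓ1-norm on the αi, ℓ2 constraint on the dj); this is an illustrative loop of mine, not the authors' optimized solver:

```python
import numpy as np
from sklearn.linear_model import Lasso

def dict_learn(X, k, lam, n_iter=20, seed=0):
    # X: (n, p) data matrix with rows xi. Returns D (p, k) with unit-ball columns
    # and sparse codes A (k, n).
    rng = np.random.default_rng(seed)
    n, p = X.shape
    D = rng.standard_normal((p, k))
    D /= np.linalg.norm(D, axis=0)
    A = np.zeros((k, n))
    for _ in range(n_iter):
        # Code step: one convex Lasso problem per sample
        # (alpha = lam / p matches sklearn's 1/(2p) loss normalization).
        lasso = Lasso(alpha=lam / p, fit_intercept=False, max_iter=2000)
        for i in range(n):
            A[:, i] = lasso.fit(D, X[i]).coef_
        # Dictionary step: least squares, then project each column onto the unit ball.
        D = np.linalg.lstsq(A.T, X, rcond=None)[0].T
        D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)
    return D, A
```

Being non-convex, the loop only reaches a local minimum, and the result depends on the initialization seed.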
Dictionary learning for image denoising
x (measurements) = y (original image) + ε (noise)
Dictionary learning for image denoising
- Solving the denoising problem (Elad and Aharon, 2006)
– Extract all overlapping 8 × 8 patches xi ∈ R64
– Form the matrix X = (x1⊤, . . . , xn⊤) ∈ Rn×64
– Solve a matrix factorization problem:

min_{D,A} ‖X − DA‖F² = min_{D,A} Σ_{i=1}^n ‖xi − Dαi‖₂²

where A is sparse and D is the dictionary
– Each patch is decomposed as xi ≈ Dαi
– Average the reconstructions Dαi of the patches xi to form the full-sized image
- The number of patches n is large (= number of pixels)
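A skeleton of this extract/decompose/average pipeline using scikit-learn's patch utilities; the random input image, dictionary size, and alpha are illustrative stand-ins:

```python
import numpy as np
from sklearn.feature_extraction.image import extract_patches_2d, reconstruct_from_patches_2d
from sklearn.decomposition import MiniBatchDictionaryLearning

img = np.random.rand(64, 64)                    # stand-in for a noisy grayscale image
patches = extract_patches_2d(img, (8, 8))       # all overlapping 8 x 8 patches
X = patches.reshape(len(patches), -1)           # rows xi in R^64
means = X.mean(axis=1, keepdims=True)
Xc = X - means                                  # remove the per-patch mean (DC component)

dl = MiniBatchDictionaryLearning(n_components=100, alpha=1.0, random_state=0)
A = dl.fit_transform(Xc)                        # sparse codes alpha_i
X_rec = A @ dl.components_ + means              # patch reconstructions D alpha_i

# Average the overlapping reconstructions back into a full-sized image.
img_rec = reconstruct_from_patches_2d(X_rec.reshape(patches.shape), img.shape)
```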
Online optimization for dictionary learning

min_{D∈D} Σ_{i=1}^n min_{αi∈Rk} ( ‖xi − Dαi‖₂² + λ‖αi‖₁ ),  where D ≜ {D ∈ Rp×k s.t. ∀j = 1, . . . , k, ‖dj‖₂ ≤ 1}

- Classical optimization alternates between D and A: good results, but very slow!
- Online learning (Mairal, Bach, Ponce, and Sapiro, 2009a) can
– handle potentially infinite datasets
– adapt to dynamic training sets
– online code (http://www.di.ens.fr/willow/SPAMS/)
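A hedged sketch of the online regime: scikit-learn's MiniBatchDictionaryLearning (based on Mairal et al., 2009a) can consume mini-batches from a stream through partial_fit, so the dataset never needs to fit in memory; the batch size and parameters below are illustrative:

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

dl = MiniBatchDictionaryLearning(n_components=100, alpha=1.0, random_state=0)
rng = np.random.default_rng(0)
for _ in range(1000):                       # a potentially infinite stream
    batch = rng.standard_normal((256, 64))  # stand-in for 256 fresh 8 x 8 patches
    dl.partial_fit(batch)                   # one online update of the dictionary
D = dl.components_                          # current dictionary, shape (100, 64)
```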
Denoising result (Mairal, Bach, Ponce, Sapiro, and Zisserman, 2009c)
What does the dictionary D look like?
Inpainting a 12-Mpixel photograph
Alternative usages of dictionary learning - Computer vision
- Use the “code” α as a representation of observations for subsequent processing (Raina et al., 2007; Yang et al., 2009)
- Adapt dictionary elements to specific tasks (Mairal, Bach, Ponce, Sapiro, and Zisserman, 2009b)
– Discriminative training for weakly supervised pixel classification (Mairal, Bach, Ponce, Sapiro, and Zisserman, 2008)
Structured sparse methods for matrix factorization - Outline
- Learning problems on matrices
- Sparse methods for matrices
– Sparse principal component analysis
– Dictionary learning
- Structured sparse PCA
– Sparsity-inducing norms and overlapping groups
– Structure on dictionary elements
– Structure on decomposition coefficients
Sparsity-inducing norms

min_{α∈Rp} f(α) + λψ(α)

where f is the data-fitting term and ψ a sparsity-inducing norm
- Regularizing by a sparsity-inducing norm ψ
- Most popular choice for ψ
– ℓ1-norm: ‖α‖₁ = Σ_{j=1}^p |αj|
– Lasso (Tibshirani, 1996), basis pursuit (Chen et al., 2001)
– The ℓ1-norm only encodes cardinality
- Structured sparsity
– Certain patterns are favored
– Improvement of interpretability and prediction performance
Sparsity-inducing norms
- Another popular choice for ψ:
– The ℓ1-ℓ2 norm, Σ_{G∈F} ‖αG‖₂ = Σ_{G∈F} ( Σ_{j∈G} αj² )^{1/2}, with F a partition of {1, . . . , p}
– The ℓ1-ℓ2 norm sets to zero groups of non-overlapping variables (as opposed to single variables for the ℓ1-norm)
– For the square loss, group Lasso (Yuan and Lin, 2006)
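A small numpy sketch of the ℓ1-ℓ2 norm and of its proximal operator for a partition F (block soft-thresholding, which is what sets whole groups to zero at once); the helper names are mine:

```python
import numpy as np

def group_norm(alpha, groups):
    # sum over G in F of ||alpha_G||_2, for a list of index arrays `groups`
    return sum(np.linalg.norm(alpha[G]) for G in groups)

def prox_group(alpha, groups, t):
    # Prox of t * group_norm: each block shrinks toward 0, possibly exactly to 0,
    # so whole groups of variables vanish together.
    out = alpha.copy()
    for G in groups:
        nrm = np.linalg.norm(alpha[G])
        out[G] = 0.0 if nrm <= t else (1 - t / nrm) * alpha[G]
    return out

groups = [np.arange(0, 3), np.arange(3, 6)]          # a partition of {0, ..., 5}
alpha = np.array([0.1, -0.2, 0.1, 3.0, -1.0, 2.0])
print(prox_group(alpha, groups, t=0.5))              # the first group is set exactly to zero
```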
- However, the ℓ1-ℓ2 norm encodes fixed/static prior information: it requires knowing in advance how to group the variables
- What happens if the set of groups F is not a partition anymore?
Structured Sparsity (Jenatton, Audibert, and Bach, 2009a)
- When penalizing by the ℓ1-ℓ2 norm Σ_{G∈F} ‖αG‖₂ = Σ_{G∈F} ( Σ_{j∈G} αj² )^{1/2}
– The ℓ1 norm induces sparsity at the group level:
∗ Some αG’s are set to zero
– Inside the groups, the ℓ2 norm does not promote sparsity
Examples of set of groups F
- Selection of contiguous patterns on a sequence, p = 6
– F is the set of blue groups
– Any union of blue groups set to zero leads to the selection of a contiguous pattern
Structured Sparsity (Jenatton, Audibert, and Bach, 2009a)
- Intuitively, the zero pattern of α is given by

{j ∈ {1, . . . , p} : αj = 0} = ∪_{G∈F′} G for some F′ ⊆ F

- This intuition is actually true and can be formalized
Examples of set of groups F
- Selection of rectangles on a 2-D grid, p = 25
– F is the set of blue/green groups (together with their complements, not displayed)
– Any union of blue/green groups set to zero leads to the selection of a rectangle
Examples of set of groups F
- Selection of diamond-shaped patterns on a 2-D grid, p = 25
– It is possible to extend such settings to 3-D spaces, or to more complex topologies
Relationship between F and Zero Patterns (Jenatton, Audibert, and Bach, 2009a)
- F → Zero patterns:
– by generating the union-closure of F
- Zero patterns → F:
– Design groups F from any union-closed set of zero patterns
– Design groups F from any intersection-closed set of non-zero patterns
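The F → zero patterns direction can be made concrete by enumerating the union-closure of a small set of groups (tractable only for tiny F; the encoding below is mine):

```python
from itertools import combinations

def union_closure(groups):
    # All achievable zero patterns: the empty set plus every union of groups in F.
    groups = [frozenset(g) for g in groups]
    patterns = {frozenset()}
    for r in range(1, len(groups) + 1):
        for subset in combinations(groups, r):
            patterns.add(frozenset().union(*subset))
    return patterns

# Groups selecting contiguous nonzero patterns on a sequence of length 4:
F = [{0}, {0, 1}, {3}, {2, 3}]
for pattern in sorted(union_closure(F), key=len):
    print(sorted(pattern))    # every printed zero pattern leaves a contiguous support
```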
Related work on structured sparsity
- Specific hierarchical structure (Zhao et al., 2009; Bach, 2008)
- Union-closed (as opposed to intersection-closed) family of nonzero patterns (Jacob, Obozinski, and Vert, 2009)
- Nonconvex penalties based on information-theoretic criteria with greedy optimization (Baraniuk et al., 2008; Huang et al., 2009)
- Link with submodular functions (Bach, 2010)
– Acting on supports or level sets
Sparse structured PCA (Jenatton, Obozinski, and Bach, 2009b)
- Learning sparse and structured dictionary elements:
min_{A∈Rk×n, D∈Rp×k} Σ_{i=1}^n ‖xi − Dαi‖₂² + λ Σ_{j=1}^k ψ(dj)  s.t.  ∀i, ‖αi‖₂ ≤ 1

- Structure of the dictionary elements determined by the choice of overlapping groups F (and thus ψ)
- Efficient learning procedures through the “η-trick”
– Reweighted ℓ2: Σ_{G∈F} ‖yG‖₂ = min_{ηG≥0, G∈F} (1/2) Σ_{G∈F} ( ‖yG‖₂² / ηG + ηG )
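A quick numerical check (a toy example of mine) of the variational identity behind the η-trick: for a single group, the minimum over η ≥ 0 of (1/2)(‖yG‖₂²/η + η) equals ‖yG‖₂, attained at η = ‖yG‖₂, which is what makes the reweighted-ℓ2 updates so simple:

```python
import numpy as np

y_G = np.array([1.0, -2.0, 2.0])
z = np.linalg.norm(y_G)                   # ||y_G||_2 = 3
etas = np.linspace(1e-3, 10.0, 100_000)
vals = 0.5 * (z**2 / etas + etas)
print(vals.min(), etas[vals.argmin()])    # ~3.0, attained near eta = 3.0
```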
Application to face databases
[Figure: raw face data and dictionary elements learned with (unstructured) NMF]
- NMF obtains partially local features
Application to face databases
[Figure: dictionary elements from (unstructured) sparse PCA vs. structured sparse PCA]
- Enforce selection of convex nonzero patterns ⇒ robustness to occlusion
Application to face databases
- Quantitative performance evaluation on classification task
[Figure: % correct classification (5 to 45) vs. dictionary size (20 to 140), comparing raw data, PCA, NMF, SPCA, shared-SPCA, SSPCA, and shared-SSPCA]
Dictionary learning vs. sparse structured PCA - Exchange the roles of D and A
- Sparse structured PCA (sparse and structured dictionary elements):

min_{A∈Rk×n, D∈Rp×k} Σ_{i=1}^n ‖xi − Dαi‖₂² + λ Σ_{j=1}^k ψ(dj)  s.t.  ∀i, ‖αi‖₂ ≤ 1

- Dictionary learning with structured sparsity for α:

min_{A∈Rk×n, D∈Rp×k} Σ_{i=1}^n ( ‖xi − Dαi‖₂² + λψ(αi) )  s.t.  ∀j, ‖dj‖₂ ≤ 1
Hierarchical dictionary learning (Jenatton, Mairal, Obozinski, and Bach, 2010)
- Structure on codes α (not on dictionary D)
- Hierarchical penalization: ψ(α) = Σ_{G∈F} ‖αG‖₂, where the groups G in F are the sets of descendants of nodes in a tree
- A variable is selected only after its ancestors (Zhao et al., 2009; Bach, 2008)
Hierarchical dictionary learning - Efficient optimization
min_{A∈Rk×n, D∈Rp×k} Σ_{i=1}^n ( ‖xi − Dαi‖₂² + λψ(αi) )  s.t.  ∀j, ‖dj‖₂ ≤ 1

- Minimization with respect to αi: regularized least-squares
– Many algorithms dedicated to the ℓ1-norm ψ(α) = ‖α‖₁
- Proximal methods: first-order methods with optimal convergence rate (Nesterov, 2007; Beck and Teboulle, 2009)
– Require solving, many times, min_{α∈Rp} (1/2)‖y − α‖₂² + λψ(α)
- Tree-structured regularization: efficient linear-time algorithm based on primal-dual decomposition (Jenatton et al., 2010)
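A toy sketch of that tree-structured proximal operator: for hierarchical groups (each group being the variables of a node and its descendants), Jenatton et al. (2010) show that the prox of λ Σ_{G∈F} ‖αG‖₂ is obtained by block soft-thresholding the groups one by one, children before parents; the uniform group weights and the sort-by-size traversal are simplifications of mine:

```python
import numpy as np

def tree_prox(alpha, groups, lam):
    out = alpha.copy()
    # Children before parents: for descendant-set groups, sorting by size suffices.
    for G in sorted(groups, key=len):
        nrm = np.linalg.norm(out[G])
        out[G] *= 0.0 if nrm <= lam else 1.0 - lam / nrm
    return out

# Tree on 3 variables: root 0 with two leaves 1 and 2.
groups = [[1], [2], [0, 1, 2]]             # descendant sets of each node
alpha = np.array([0.5, 0.3, 2.0])
print(tree_prox(alpha, groups, lam=0.4))   # leaf 1 is zeroed before the root group shrinks
```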
Hierarchical dictionary learning Application to image denoising
- Reconstruction of 100,000 8 × 8 natural image patches
– Remove randomly subsampled pixels
– Reconstruct with matrix factorization and structured sparsity

noise  |  50 %       |  60 %       |  70 %       |  80 %       |  90 %
flat   |  19.3 ± 0.1 |  26.8 ± 0.1 |  36.7 ± 0.1 |  50.6 ± 0.0 |  72.1 ± 0.0
tree   |  18.6 ± 0.1 |  25.7 ± 0.1 |  35.0 ± 0.1 |  48.0 ± 0.0 |  65.9 ± 0.3
Application to image denoising - Dictionary tree
Hierarchical dictionary learning - Modelling of text corpora
- Each document is modelled through word counts
- Low-rank matrix factorization of the word-document matrix
- Probabilistic topic models (Blei et al., 2003)
– Similar structures based on nonparametric Bayesian methods (Blei et al., 2004)
– Can we achieve similar performance with a simple matrix factorization formulation?
- Experiments:
– Qualitative: NIPS abstracts (1714 documents, 8274 words)
– Quantitative: newsgroup articles (1425 documents, 13312 words)
Modelling of text corpora - Dictionary tree
Modelling of text corpora
- Comparison on predicting newsgroup article subjects:
[Figure: classification accuracy (%) vs. number of topics (3 to 63), comparing PCA + SVM, NMF + SVM, LDA + SVM, SpDL + SVM, and SpHDL + SVM]
Topic models, NMF and matrix factorization
- Three different views on the same problem
– Interesting parallels to be made
– Common problems to be solved
- Structure on dictionary/decomposition coefficients with adapted priors, e.g., nested Chinese restaurant processes (Blei et al., 2004)
- Learning hyperparameters from data
- Identifiability and interpretation/evaluation of results
- Discriminative tasks (Blei and McAuliffe, 2008; Lacoste-Julien et al., 2008; Mairal et al., 2009b)
- Optimization and local minima
Application to background subtraction (Mairal, Jenatton, Obozinski, and Bach, 2010)
[Figure: background-subtraction results comparing the estimated background, the ℓ1-norm, and the structured norm]
Ongoing Work - Digital Zooming (Couzinie-Devy et al., 2010)
Ongoing Work - Task-driven dictionaries: inverse half-toning (Mairal et al., 2010)
Conclusion
- Structured matrix factorization has many applications
– Machine learning
– Image/signal processing, audio/music (Lefèvre et al., 2011)
– Extensions to other tasks
- Algorithmic issues
– Large datasets
– Structured sparsity and convex optimization
– Link with submodular functions (Bach, 2010)
- Theoretical issues
– Identifiability of structures and features
– Improved predictive performance
– Other approaches to sparsity and structure
References
- Y. Amit, M. Fink, N. Srebro, and S. Ullman. Uncovering shared structures in multiclass classification. In Proceedings of the 24th International Conference on Machine Learning (ICML), 2007.
- F. Bach. Exploring large feature spaces with hierarchical multiple kernel learning. In Advances in Neural Information Processing Systems, 2008.
- F. Bach. Structured sparsity-inducing norms through submodular functions. In NIPS, 2010.
- F. Bach, J. Mairal, and J. Ponce. Convex sparse matrix factorizations. Technical Report 0812.1869, ArXiv, 2008.
- R. G. Baraniuk, V. Cevher, M. F. Duarte, and C. Hegde. Model-based compressive sensing. Technical report, arXiv:0808.3572, 2008.
- A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
- L. Benaroya, F. Bimbot, and R. Gribonval. Audio source separation with a single sensor. IEEE Transactions on Speech and Audio Processing, 14(1):191, 2006.
- D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, January 2003.
- D. Blei, T. L. Griffiths, M. I. Jordan, and J. B. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. Advances in Neural Information Processing Systems, 16:106, 2004.
- D. M. Blei and J. McAuliffe. Supervised topic models. In Advances in Neural Information Processing Systems (NIPS), volume 20, 2008.
- E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Arxiv preprint arXiv:0912.3599, 2009.
- S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129–159, 2001.
- A. d’Aspremont, F. Bach, and L. El Ghaoui. Optimal solutions for sparse principal component analysis. Journal of Machine Learning Research, 9:1269–1294, 2008.
- M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736–3745, 2006.
- C. Févotte, N. Bertin, and J.-L. Durrieu. Nonnegative matrix factorization with the Itakura-Saito divergence, with application to music analysis. Neural Computation, 21(3), 2009.
- J. Huang, T. Zhang, and D. Metaxas. Learning with structured sparsity. In Proceedings of the 26th International Conference on Machine Learning (ICML), 2009.
- L. Jacob, G. Obozinski, and J.-P. Vert. Group Lasso with overlaps and graph Lasso. In Proceedings of the 26th International Conference on Machine Learning (ICML), 2009.
- R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. Technical report, arXiv:0904.3523, 2009a.
- R. Jenatton, G. Obozinski, and F. Bach. Structured sparse principal component analysis. Technical report, arXiv:0909.1440, 2009b.
- R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for sparse hierarchical dictionary learning. Submitted to ICML, 2010.
- S. Lacoste-Julien, F. Sha, and M. I. Jordan. DiscLDA: Discriminative learning for dimensionality reduction and classification. Advances in Neural Information Processing Systems (NIPS) 21, 2008.
- H. Lee, A. Battle, R. Raina, and A. Ng. Efficient sparse coding algorithms. In Advances in Neural Information Processing Systems (NIPS), 2007.
- J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Discriminative learned dictionaries for local image analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
- J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In International Conference on Machine Learning (ICML), 2009a.
- J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Supervised dictionary learning. Advances in Neural Information Processing Systems (NIPS), 21, 2009b.
- J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Non-local sparse models for image restoration. In International Conference on Computer Vision (ICCV), 2009c.
- J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Network flow algorithms for structured sparsity. In NIPS, 2010.
- Y. Nesterov. Gradient methods for minimizing composite objective function. Technical report, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain, 2007.
- G. Obozinski, B. Taskar, and M. I. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, pages 1–22, 2009.
- B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37:3311–3325, 1997.
- M. Pontil, A. Argyriou, and T. Evgeniou. Multi-task feature learning. In Advances in Neural Information Processing Systems, 2007.
- R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng. Self-taught learning: Transfer learning from unlabeled data. In Proceedings of the 24th International Conference on Machine Learning (ICML), 2007.
- R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, pages 267–288, 1996.
- D. M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, 2009.
- J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
- M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68(1):49–67, 2006.
- P. Zhao, G. Rocha, and B. Yu. Grouped and hierarchical model selection through composite absolute penalties. Annals of Statistics, 37(6A):3468–3497, 2009.