SLIDE 1

A D.C. Programming Approach to the Sparse Generalized Eigenvalue Problem

Bharath K. Sriperumbudur, David A. Torres and Gert R. G. Lanckriet

University of California, San Diego

OPT 2009

SLIDE 2

Generalized Eigenvalue Problem

Given a matrix pair (A, B), find a pair (λ, x) such that Ax = λBx, where A, B ∈ ℂ^{n×n}, ℂⁿ ∋ x ≠ 0 and λ ∈ ℂ.

Variational formulation:

λ_max(A, B) = max_x xᵀAx s.t. xᵀBx = 1,    (1)

where x ∈ ℝⁿ, A ∈ Sⁿ (symmetric) and B ∈ Sⁿ₊₊ (symmetric positive definite).

◮ Popular in multivariate statistics and machine learning:
  ◮ Classification: Fisher discriminant analysis
  ◮ Dimensionality reduction: Principal component analysis, Canonical correlation analysis
  ◮ Clustering: Spectral clustering
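As a quick numerical sketch of (1) (assuming NumPy and SciPy, which the slides do not mention): scipy.linalg.eigh solves the symmetric-definite generalized eigenproblem directly and normalizes eigenvectors so that xᵀBx = 1, hence the leading eigenvector attains the maximum in (1).

```python
# Sketch: check the variational formulation (1) numerically.
# Assumes NumPy/SciPy; A symmetric, B symmetric positive definite.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = (M + M.T) / 2                      # A in S^n
C = rng.standard_normal((n, n))
B = C @ C.T + n * np.eye(n)            # B in S^n_{++}

# Generalized eigenproblem Ax = lambda Bx; eigh returns eigenvalues ascending.
lam, V = eigh(A, B)
x = V[:, -1]                           # eigenvector for lambda_max(A, B)

# eigh normalizes so that x^T B x = 1, hence x^T A x equals lambda_max(A, B).
print(lam[-1], x @ A @ x, x @ B @ x)
```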

SLIDE 3

Applications

◮ Fisher Discriminant Analysis (FDA)
  ◮ A = (μ₁ − μ₂)(μ₁ − μ₂)ᵀ is the between-cluster variance.
  ◮ B = Σ₁ + Σ₂ is the within-cluster variance.
◮ Principal Component Analysis (PCA)
  ◮ A = Σ is the covariance matrix.
  ◮ B is the identity matrix.
◮ Canonical Correlation Analysis (CCA)
  ◮ A = [ 0  S_xy ; S_yx  0 ] and B = [ S_xx  0 ; 0  S_yy ], where S.. denotes the corresponding (cross-)covariance matrix (sketched below).
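A hedged sketch of how the CCA pair (A, B) above might be assembled from centered data; the variable names (X, Y, Sxx, Sxy, Syy) are illustrative, not from the slides.

```python
# Sketch: assemble the CCA matrix pair (A, B) from centered data
# blocks X (p x dx) and Y (p x dy), then solve the GEV problem (1).
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)
p, dx, dy = 100, 4, 3
X = rng.standard_normal((p, dx)); X -= X.mean(0)
Y = rng.standard_normal((p, dy)); Y -= Y.mean(0)

Sxx, Syy = X.T @ X / p, Y.T @ Y / p    # covariance blocks
Sxy = X.T @ Y / p                      # cross-covariance block

A = np.block([[np.zeros((dx, dx)), Sxy],
              [Sxy.T, np.zeros((dy, dy))]])
B = np.block([[Sxx, np.zeros((dx, dy))],
              [np.zeros((dy, dx)), Syy]])

# The leading generalized eigenvalue is the top canonical correlation.
lam, V = eigh(A, B)
print(lam[-1])
```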

SLIDE 4

Why Sparsity?

◮ Usually, the solutions of FDA, PCA and CCA are not sparse.
◮ This often makes the results difficult to interpret.
◮ PCA/CCA: for better interpretability, one wants a few relevant features that explain as much variance as possible.
  ◮ Applications: bioinformatics, finance, document translation, etc.
◮ FDA: promoting sparse solutions performs feature selection, which aids generalization performance.
◮ Sparse representations ⇒ better interpretation, better generalization and reduced computational cost.

SLIDE 5

Sparse Generalized Eigenvalue Problem

◮ The variational formulation of the sparse generalized eigenvalue problem is

max_x xᵀAx s.t. xᵀBx = 1, ‖x‖₀ ≤ k,    (2)

where 1 ≤ k ≤ n and ‖x‖₀ := ∑_{i=1}^n 1_{|x_i| ≠ 0} is the cardinality (number of nonzero entries) of x.
◮ (2) is non-convex, NP-hard and therefore intractable.
◮ Usually, the ℓ₁-norm approximation is used for the cardinality constraint, i.e., replace ‖x‖₀ ≤ k by ‖x‖₁ ≤ k.
◮ The problem is still computationally hard.

SLIDE 6

Sparse Generalized Eigenvalue Problem

◮ (2) can be written as

max_x xᵀAx − ρ̃‖x‖₀ s.t. xᵀBx ≤ 1,    (3)

where ρ̃ ≥ 0.
◮ Approximate ‖x‖₀ by ‖x‖_ε := ∑_{i=1}^n log(1 + |x_i|ε⁻¹) / log(1 + ε⁻¹) for sufficiently small ε > 0, since

‖x‖₀ = lim_{ε→0} ∑_{i=1}^n log(1 + |x_i|ε⁻¹) / log(1 + ε⁻¹).    (4)

◮ The approximation ‖x‖_ε can be interpreted as defining a limiting Student's t-distribution prior over x (leading to an improper prior), given by p(x) ∝ ∏_{i=1}^n 1/(|x_i| + ε), and computing its negative log-likelihood.

SLIDE 7

Approximation to ‖x‖₀

[Figure: log(1 + |x|ε⁻¹)/log(1 + ε⁻¹) plotted against x for ε ∈ {1, 10⁻², 10⁻⁵, 10⁻¹⁰}, together with ‖x‖₀ and ‖x‖₁.]

As ε → 0, ‖x‖_ε → ‖x‖₀, and as ε → ∞, ‖x‖_ε → ‖x‖₁.
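This limiting behavior is easy to verify numerically; a minimal sketch (the function name eps_norm is mine):

```python
# Sketch: check that ||x||_eps approaches ||x||_0 as eps -> 0
# and ||x||_1 as eps -> infinity, per (4).
import numpy as np

def eps_norm(x, eps):
    return np.sum(np.log1p(np.abs(x) / eps) / np.log1p(1.0 / eps))

x = np.array([0.0, 0.5, -2.0, 0.0, 1.0])
print(np.count_nonzero(x), np.abs(x).sum())   # ||x||_0 = 3, ||x||_1 = 3.5
for eps in [1e0, 1e-2, 1e-5, 1e-10]:
    print(eps, eps_norm(x, eps))              # -> 3 as eps shrinks
print(eps_norm(x, 1e6))                       # ~ 3.5 for large eps
```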

SLIDE 8

Sparse Generalized Eigenvalue Problem

◮ (3) reduces to the approximate program

max_x xᵀAx − ρ_ε ∑_{i=1}^n log(|x_i| + ε) s.t. xᵀBx ≤ 1,    (5)

where ρ_ε := ρ̃ / log(1 + ε⁻¹).
◮ The task reduces to solving the approximate program (5) with a small value of ε.
◮ (5) can be written as

min_x τ‖x‖₂² − ( xᵀ(A + τI)x − ρ_ε ∑_{i=1}^n log(|x_i| + ε) ) s.t. xᵀBx ≤ 1,    (6)

where τ ≥ max(0, −λ_min(A)).
◮ The objective in (6) is a difference of two convex functions.

SLIDE 9

Majorization-Minimization (MM)

◮ Suppose we want to minimize f over Ω ⊂ ℝⁿ. Construct a majorization function g over Ω × Ω such that

f(x) ≤ g(x, y) ∀ x, y ∈ Ω and f(x) = g(x, x) ∀ x ∈ Ω.

◮ The majorization algorithm corresponding to g updates x at iteration l by

x^(l+1) ∈ arg min_{x∈Ω} g(x, x^(l)),    (7)

unless we already have x^(l) ∈ arg min_{x∈Ω} g(x, x^(l)), in which case the algorithm stops.
◮ Monotone descent follows: f(x^(l+1)) ≤ g(x^(l+1), x^(l)) ≤ g(x^(l), x^(l)) = f(x^(l)).
◮ MM algorithms can be thought of as a generalization of the EM algorithm (a generic sketch follows below).
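A minimal generic sketch of the MM loop (7); the surrogate minimizer argmin_surrogate is a user-supplied, problem-specific function (an assumption of this sketch, not given in the slides). The monotone-descent chain above is what makes the stopping test safe.

```python
# Sketch: generic majorization-minimization loop implementing (7).
# `argmin_surrogate(y)` must return argmin over x in Omega of g(x, y);
# it is problem-specific and assumed supplied by the caller.
def mm(f, argmin_surrogate, x0, tol=1e-8, max_iter=1000):
    x = x0
    for _ in range(max_iter):
        x_new = argmin_surrogate(x)
        # Descent is guaranteed: f(x_new) <= g(x_new, x) <= g(x, x) = f(x).
        if f(x) - f(x_new) < tol:
            return x_new
        x = x_new
    return x
```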

SLIDE 10

Sparse Generalized Eigenvalue Algorithm

Proposition

The following function

g(x, y) = τ‖x‖₂² − 2xᵀ(A + τI_n)y + yᵀ(A + τI_n)y + ρ_ε ∑_{i=1}^n log(ε + |y_i|) + ρ_ε ∑_{i=1}^n (|x_i| − |y_i|)/(|y_i| + ε)    (8)

majorizes the objective function in (6). Following the minimization step in (7) with g as in (8), the sparse GEV algorithm is obtained as

x^(l+1) = arg min_x τ‖x‖₂² − 2xᵀ(A + τI_n)x^(l) + ρ_ε ∑_{i=1}^n |x_i|/(|x_i^(l)| + ε) s.t. xᵀBx ≤ 1,    (9)

which is a sequence of quadratically constrained quadratic programs (QCQPs).
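Since each update (9) is a convex QCQP, one step can be prototyped with an off-the-shelf modeling tool. The sketch below uses CVXPY; this is my choice of solver interface, not one the slides prescribe.

```python
# Sketch: one sparse-GEV update (9) as a convex QCQP via CVXPY.
# Assumes B is positive semidefinite (it is PSD here, B in S^n_{++}).
import cvxpy as cp
import numpy as np

def sparse_gev_step(A, B, x_prev, rho_eps, tau, eps=1e-6):
    n = len(x_prev)
    w = 1.0 / (np.abs(x_prev) + eps)          # weights w_i^(l)
    x = cp.Variable(n)
    obj = (tau * cp.sum_squares(x)
           - 2 * x @ ((A + tau * np.eye(n)) @ x_prev)
           + rho_eps * cp.sum(cp.multiply(w, cp.abs(x))))
    prob = cp.Problem(cp.Minimize(obj), [cp.quad_form(x, B) <= 1])
    prob.solve()
    return x.value
```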

SLIDE 11

Sparse Generalized Eigenvalue Program

◮ (9) can also be written as

x^(l+1) = arg min_x ‖x − (τ⁻¹A + I_n)x^(l)‖₂² + (ρ_ε/τ)‖W^(l)x‖₁ s.t. xᵀBx ≤ 1,    (10)

where w_i^(l) := 1/(|x_i^(l)| + ε), w^(l) := (w_1^(l), …, w_n^(l)) and W^(l) := diag(w^(l)).
◮ (10) is very similar to the LASSO [Tibshirani, 1996], except for the weighted ℓ₁-norm penalty and the quadratic constraint.
◮ When A ⪰ 0, B = I_n and τ = 0, (9) reduces to a very simple iterative rule:

x_i^(l+1) = [ |(Ax^(l))_i| − (ρ_ε/2) w_i^(l) ]₊ sign((Ax^(l))_i) / √( ∑_{i=1}^n [ |(Ax^(l))_i| − (ρ_ε/2) w_i^(l) ]₊² ), ∀ i,    (11)

where [a]₊ := max(0, a). We call this algorithm DC-PCA (a NumPy sketch follows).
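A compact sketch of the DC-PCA rule (11), assuming A ⪰ 0 and B = I_n; the function and variable names are mine.

```python
# Sketch: DC-PCA iteration (11) for A PSD, B = I_n, tau = 0.
import numpy as np

def dc_pca(A, rho_eps, eps=1e-6, max_iter=500, tol=1e-8):
    n = A.shape[0]
    x = np.ones(n) / np.sqrt(n)                # feasible start, ||x||_2 = 1
    for _ in range(max_iter):
        a = A @ x
        w = 1.0 / (np.abs(x) + eps)            # weights w_i^(l)
        s = np.maximum(np.abs(a) - 0.5 * rho_eps * w, 0.0)  # [.]_+ threshold
        x_new = np.sign(a) * s
        norm = np.linalg.norm(x_new)
        if norm == 0:                          # every entry thresholded away
            return x
        x_new /= norm                          # renormalize onto x^T x = 1
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x
```

Larger ρ_ε drives more coordinates to exact zero through the [·]₊ threshold, trading explained variance for sparsity.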

SLIDE 12

Convergence Analysis

Theorem

Let {x^(l)}_{l=0}^∞ be any sequence generated by the sparse GEV algorithm in (9). Then all the limit points of {x^(l)}_{l=0}^∞ are stationary points of the program in (5);

ρ_ε ∑_{i=1}^n log(ε + |x_i^(l)|) − [x^(l)]ᵀAx^(l) → ρ_ε ∑_{i=1}^n log(ε + |x*_i|) − [x*]ᵀAx* =: L*

for some stationary point x*; ‖x^(l+1) − x^(l)‖ → 0; and either {x^(l)}_{l=0}^∞ converges or the set of limit points of {x^(l)}_{l=0}^∞ is a connected and compact subset of S(L*), where S(a) := {x ∈ S : xᵀAx − ρ_ε ∑_{i=1}^n log(ε + |x_i|) = −a} and S is the set of stationary points of (5). If S(L*) is finite, then any sequence {x^(l)}_{l=0}^∞ generated by (9) converges to some x* in S(L*).

SLIDE 13

Convergence Analysis

Corollary

Let ρ̃ = 0 and λ_max(A, B) > 0. Then any sequence {x^(l)}_{l=0}^∞ generated by (9) converges to some x* such that λ_max(A, B) = [x*]ᵀAx* and [x*]ᵀBx* = 1.

◮ Local and global solutions coincide for ρ̃ = 0.

Corollary

Let A ⪰ 0, τ = 0 and ρ̃ = 0. Then any sequence {x^(l)}_{l=0}^∞ generated by the algorithm

x^(l+1) = B⁻¹Ax^(l) / √( [x^(l)]ᵀAB⁻¹Ax^(l) )    (12)

converges to some x* such that λ_max(A, B) = [x*]ᵀAx* and [x*]ᵀBx* = 1.

◮ With B = I_n, (12) reduces to the power method for computing λ_max(A).
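A sketch of the iteration (12), assuming B is invertible (function and variable names are mine); with B = I_n it reduces to the classical power method.

```python
# Sketch: generalized power iteration (12) for A PSD, tau = 0, rho_tilde = 0.
import numpy as np

def generalized_power_method(A, B, x0, max_iter=1000, tol=1e-10):
    x = x0                                     # require A @ x0 != 0
    for _ in range(max_iter):
        y = np.linalg.solve(B, A @ x)          # B^{-1} A x^(l)
        # Denominator is x^T A B^{-1} A x = x^T A y; the update makes
        # the new iterate satisfy x^T B x = 1.
        x_new = y / np.sqrt(x @ A @ y)
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x
```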

SLIDE 14

Applications: Sparse PCA

◮ Sparse PCA algorithms compared: the proposed method (DC-PCA), the SDP relaxation (DSPCA [d'Aspremont et al., 2005]), a greedy approach (GSPCA [Moghaddam et al., 2007]), a regression-based approach (SPCA [Zou et al., 2006]) and the generalized power method (GPowerℓ0 [Journée et al., 2008]).
◮ Pit props data [Jeffers, 1967]
  ◮ A benchmark dataset for testing sparse PCA algorithms.
  ◮ 180 observations and 13 measured variables.
  ◮ 6 principal directions are considered, as they capture 87% of the total variance.

Algorithm    Sparsity pattern   Cumulative cardinality   Cumulative variance
SPCA         (7,4,4,1,1,1)      18                       75.8%
DSPCA        (6,2,3,1,1,1)      14                       75.5%
GSPCA        (6,2,2,1,1,1)      13                       77.1%
GPowerℓ0     (6,2,2,1,1,1)      13                       77.1%
DC-PCA       (6,2,2,1,1,1)      13                       77.1%

SLIDE 15

Pit Props

Figure: Four panels comparing SPCA, DSPCA, GSPCA, GPowerℓ0 and DC-PCA on the pit props data: (a) cumulative variance and (b) cumulative cardinality for the first 6 sparse PCs; (c) proportion of explained variance (PEV) vs. cardinality for the first sparse PC; (d) dependence of sparsity and PEV on ρ̃ for the first sparse PC computed with DC-PCA.

SLIDE 16

Gene Datasets

Table: Gene expression datasets

Dataset        Samples (p)   Genes (n)   Reference
Colon cancer   62            2000        [Alon et al., 1999]
Leukemia       38            7129        [Golub et al., 1999]
Ramaswamy      127           16063       [Ramaswamy et al., 2001]

Table: Computation time (in seconds) to obtain the first sparse PC, averaged over cardinalities ranging from 1 to n, for the Colon cancer, Leukemia and Ramaswamy datasets.

               Colon cancer   Leukemia   Ramaswamy
n              2000           7129       16063
SPCA           2.057          3.548      38.731
GPowerℓ0       0.182          0.223      2.337
DC-PCA         0.034          0.156      0.547

SLIDE 17

Gene Datasets

Figure: Trade-off curves between explained variance and cardinality for (a) Colon cancer, (b) Leukemia and (c) Ramaswamy datasets, comparing GPowerℓ0, DC-PCA and SPCA. The proportion of explained variance is computed on the first sparse principal component.

SLIDE 18

Scalability

◮ Complexity:
  ◮ DC-PCA, GPowerℓ0: O(mn²), where m is the number of iterations before convergence.
  ◮ SPCA: O(mn³)
  ◮ GSPCA: O(n⁴)
  ◮ DSPCA: O(n⁴√log n)
◮ Randomly chosen problems of size n ranging from 10 to 10000.
◮ Linux workstation: 3 GHz, 4 GB RAM.

Figure: Average computation time (in seconds, log-log scale) for the first sparse PC of A vs. problem size n, over 100 randomly generated matrices A; methods compared: SPCA, DSPCA, GSPCA, GPowerℓ0, DC-PCA.

SLIDE 19

References

Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D., and Levine, A. J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon cancer tissues. Proceedings of the National Academy of Sciences, 96:6745–6750.

d'Aspremont, A., El Ghaoui, L., Jordan, M. I., and Lanckriet, G. R. G. (2005). A direct formulation for sparse PCA using semidefinite programming. In Saul, L. K., Weiss, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 17, pages 41–48, Cambridge, MA. MIT Press.

Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J., Coller, H., Loh, M. K., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., and Lander, E. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286:531–537.

Jeffers, J. (1967). Two case studies in the application of principal components. Applied Statistics, 16:225–236.

Journée, M., Nesterov, Y., Richtárik, P., and Sepulchre, R. (2008). Generalized power method for sparse principal component analysis. http://arxiv.org/abs/0811.4724v1.

Moghaddam, B., Weiss, Y., and Avidan, S. (2007). Spectral bounds for sparse PCA: Exact and greedy algorithms. In Schölkopf, B., Platt, J., and Hoffman, T., editors, Advances in Neural Information Processing Systems 19, Cambridge, MA. MIT Press.

Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J., Poggio, T., Gerald, W., Loda, M., Lander, E., and Golub, T. (2001). Multiclass cancer diagnosis using tumor gene expression signatures. Proceedings of the National Academy of Sciences, 98:15149–15154.

Tibshirani, R. (1996). Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society, Series B, 58(1):267–288.

Zou, H., Hastie, T., and Tibshirani, R. (2006). Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15:265–286.