 
              Methods for Robust High Dimensional Graphical Model Selection Bala Rajaratnam Department of Statistics Stanford University Fields Institute Joint work with S. Oh (UC Berkeley) and K. Khare (U. Florida)
Motivation • Availability of high-throughput data from various applications • Need for methodology/tools for analyzing high-dimensional data • Examples: ◮ Biology: gene expression data ◮ Environmental science: climate data on spatial grid ◮ Finance: returns on thousands of stocks ◮ Retail: consumer behavior • Common goals: ◮ Understand complex relationships & multivariate dependencies ◮ Formulate correct models & develop inferential procedures 1 / 44
Modeling relationships • Correlation: basic measure of linear pairwise relationships • Covariance matrix Σ : collection of relationships • Estimates of Σ required in procedures such as PCA, CCA, MANOVA, etc. • Estimating (functions of) Σ and Ω = Σ − 1 are of statistical interest • Estimating Σ is difficult in high dimensions 2 / 44
Sparse estimates • Matrix Σ or Ω of size p -by- p has O ( p 2 ) elements • Estimating O ( p 2 ) parameters with classical estimators is not viable, especially when n ≪ p • Reliably estimate small number of parameters in Σ • Model selection: zero/non-zero structure recovery • Gives rise to sparse estimates of Σ or Ω • Sparsity pattern can be represented by graphs/networks 3 / 44
Gaussian Graphical Models (GGM) • Assume Y = ( Y 1 , . . . , Y p ) ′ has distribution N p ( 0, Σ ) • Denote V = { 1, 2, . . . , p } • Covariance matrix cov ( Y ) = Σ encodes marginal dependencies Y i ⊥ ⊥ Y j ⇐ ⇒ cov ( Y i , Y j ) = [ Σ ] ij = 0 • Inverse covariance matrix Ω = Σ − 1 encodes conditional dependencies given the rest � � Y i ⊥ ⊥ Y j | Y V \{ i , j } ⇐ ⇒ [ Ω ] ij = 0 � �� � � �� � matrix element conditional independence • Also known as Markov Random Fields (MRF) 4 / 44
Gaussian Graphical Models (GGM) • Graph summarizes relationships with nodes V = { 1, . . . , p } and set E of edges [ Ω ] ij = 0 ⇐ ⇒ i � ∼ j ���� � �� � matrix element network/graph • Build a graph from sparse Ω A B C A � � 1 0.2 0.3 A Ω = 0.2 2 0 B 0.3 0 1.2 C B C 5 / 44
GGM estimation algorithms • Regularized Gaussian likelihood methods ◮ Block coordinate descent (COVSEL) [Banerjee et al., 2008] ◮ Graphical lasso (GLASSO) [Friedman, Hastie, & Tibshirani, 2008] ◮ Large-scale GLASSO [Mazumder & Hastie, 2012] ◮ QUIC [Hsieh et al., 2011] ◮ G-ISTA [Guillot, Rajaratnam et al., 2012] ◮ Graphical Dual Proximal Gradient Methods [Dalal & Rajaratnam, 2013] ◮ Others • Bayesian methods ◮ Dawid & Lauritzen, 1993, Annals of Statistics ◮ Letac & Massam, 2007, Annals of Statistics ◮ Rajaratnam, Massam & Carvalho, 2008, Annals of Statistics ◮ Khare & Rajaratnam, 2011, Annals of Statistics ◮ Others • Testing-based methods ◮ Hero & Rajaratnam, 2011, JASA ◮ Hero & Rajaratnam, 2012, IEEE, Information Theory 6 / 44
Regularized Gaussian likelihood graphical model selection • All ℓ 1 -regularized Gaussian-likelihood methods solve ˆ Ω = arg max Ω ≻ 0 { log det ( Ω ) − tr ( Ω S ) − λ � Ω � 1 } • S : sample covariance matrix • Graphical Lasso [Friedman, Hastie, & Tibshirani, 2008] • Ω can be computed by solving optimization problem • Adding ℓ 1 -regularization term λ � Ω � 1 introduces sparsity • Penalty parameter λ controls level of sparsity • Dependency on Gaussianity ◮ Parametric model ◮ Sensitivity to outliers ◮ Log-concave function 7 / 44
Regularized pseudo-likelihood graphical model selection • Two main main approaches: 1. ℓ 1 -regularized likelihood methods 2. ℓ 1 -regularized regression-based/pseudo-likelihood methods • Series of linear regressions form a pseudo-likelihood function • Objective function is the ℓ 1 -penalized pseudo-likelihood • Pseudo-likelihood assumes less about distribution of the data • Applicable to wider range of data 8 / 44
Partial covariance and correlation • Matrix Y ∈ R n × p denotes iid observations of random variable with mean zero, covariance Σ = Ω − 1 . • Goal : estimate partial correlation graph • Partial correlation in terms of Ω = [ ω ij ] : ω ij ρ ij = − √ ω ii ω jj • Called “partial” because correlation of residuals r k , where � � β ( k ) = arg r k = Y k − Y ˆ β ( k ) , where ˆ � Y k − Y β � 2 min . 2 β : β k = 0 Now, partial correlation is ρ ij = cor ( r i , r j ) � ω jj • It can be shown that ρ ij ω ii � �� � β ij • Zero/non-zero pattern of [ ρ ij ] is identical to that of Ω • Partial correlation graph is given by sparsity pattern of Ω 9 / 44
Regularized regression-based graphical model selection • Neighborhood selection (NS) [Meinshausen and B¨ uhlmann, 2006] � � ω ( i ) = arg min � Y i − Y β � 2 ˆ 2 + λ � β � 1 β : β i = 0 • Neighborhood of i is defined as ne ( i ) = { k : ˆ ω ( i ) � � = 0 } k • MB does not take into account symmetry of Ω ne ( i ) ne ( j ) j ∈ � ⇒ i ∈ � � = • Current state-of-the-art methods address this issue ◮ SPACE [Peng et al., 2009] ◮ SYMLASSO [Friedman, Hastie, & Tibshirani, 2010] ◮ SPLICE [Rocha et al., 2008] 10 / 44
Sparse PArtial Correlation Estimation [Peng et al., 2009] SPACE objective function : ( w i = 1 or w i = ω ii ) � ω jj p p Q spc ( Ω ) =: − 1 n log ω ii + 1 � � � � ρ ij Y j � 2 | ρ ij | w i � Y i − 2 + λ 2 2 ω ii i = 1 i = 1 j � = i 1 � i < j � p � �� � β ij 1. Update [ ρ ij ] coordinate-wise (using current estimates [ ˆ ω ii ] ): �   p   1 � � ω jj ˆ � [ ρ ij ] ← min ρ ij Y j � 2 | ρ ij | w i � Y i − 2 + λ 2 ˆ ω ii [ ρ ij ]   i = 1 j � = i 1 � i < j � p ρ ij ] and [ ˆ 2. Update [ ω ii ] (using current estimates [ ˆ ω ii ] ):   � − 1 ˆ � ω jj  � Y i − ρ ij Y j � 2  ω ii ← ˆ 2 ω ii ˆ j � = i 11 / 44
Non-converging example: p = 3 case 1e+01 1e+03 abs. difference between successive updates 1e−01 1e+00 1e−03 1e−03 Convergence Threshold Convergence Threshold 0 1000 2000 3000 4000 0 1000 2000 3000 4000 iterations iterations estimator pvar.1 pvar.2 pvar.3 pcor.1 pcor.2 pcor.3 estimator pvar.1 pvar.2 pvar.3 pcor.1 pcor.2 pcor.3 (a) SPACE ( w i = ω ii ) (b) SPACE ( w i = 1) Figure: Y ( i ) ∼ N 3 ( 0, Ω − 1 ) , (left) n = 4, (right) n = 100 12 / 44
Non-convergence of SPACE • Investigate the nature and extent of convergence issues: 1. Are such examples pathological? How widespread are they? 2. When do they occur ? • Consider a sparse 100 × 100 matrix Ω with edge density 4% and condition number of 100. • Generate 100 multivariate Gaussian datasets (with n = 100), µ = 0 and Σ = Ω − 1 . • Record the number of times (out of 100) for which SPACE1 (uniform weights) and SPACE2 (partial variance weights) do not converge within 1500 iterations. • Original implementation of SPACE by [Peng et al., 2009] claims only 3 iterations are sufficient. 13 / 44
Non-convergence of SPACE SPACE1 ( w i = 1) SPACE2 ( w i = ω ii ) λ ∗ λ ∗ NZ NC NZ NC 0.026 60.9% 92 0.085 79.8% 100 0.099 19.7% 100 0.160 28.3% 0 0.163 7.6% 100 0.220 10.7% 0 0.228 2.9% 100 0.280 4.8% 0 0.614 0.4% 0 0.730 0.5% 97 Table: Number of simulations (out of 100) that do not converge within 1500 iterations (NC) for select values of penalty parameter ( λ ∗ = λ/ n ). Average percentage of non-zeros (NZ) in ˆ Ω are also shown. • SPACE exhibits extensive non-convergence behavior • Problem exacerbated when condition number is high • Typical of high dimensional sample starved settings 14 / 44
Symmetric Lasso and SPLICE SYMLASSO [Friedman, Hastie, & Tibshirani, 2010]:   p Ω ) = 1  n log α ii + 1 � � � Q sym ( α , ˘ ω ij α ii Y j � 2  + λ � Y i + | ω ij | , 2 α ii i = 1 j � = i 1 � i < j � p where α ii = 1 /ω ii . SPLICE [Rocha et al., 2008]: p p Q spl ( B , D ) = n � ii ) + 1 � 1 � � β ij Y j � 2 + λ log ( d 2 � Y i − | β ij | , d 2 2 2 ii i = 1 i = 1 j � = i i < j where d 2 ii = ω ii . Also, alternating (off-diagonal vs diagonal) iterative algorithms No convergence guarantees 15 / 44
Regularized regression-based graphical model selection • Advantage : Regression-based methods perform better model selection than likelihood-based methods in finite sample • Advantage : Regression-based methods are less restrictive than Gaussian likelihood-based methods • Disadvantage : ˆ Ω may not be positive definite (can be fixed) • Disadvantage : Solution may not be computable • Cause : Iterative algorithms SPACE, SYMLASSO and SPLICE are not guaranteed to converge 16 / 44
Regression-based methods: summary METHOD SYMLASSO SPLICE SPACE NS Property Symmetry + + + Convergence guarantee N/A Asymptotic consistency ( n , p → ∞ ) + + How can we obtain all of the good properties simultaneously? 17 / 44
Recommend
More recommend