High-dimensional covariance estimation based on Gaussian graphical - PowerPoint PPT Presentation

High-dimensional covariance estimation based on Gaussian graphical models Shuheng Zhou Department of Statistics, The University of Michigan, Ann Arbor IMA workshop on High Dimensional Phenomena Sept. 26, 2011 Joint work with Philipp R¨ utimann, Min Xu, and Peter B¨ uhlmann

Problem definition Want to estimate the covariance matrix for Gaussian Distributions: e.g., stock prices Take a random sample of vectors X ( 1 ) , . . . , X ( n ) i.i.d. ∼ N p ( 0 , Σ 0 ) , where p is understood to depend on n Let Θ 0 := Σ − 1 denote the concentration matrix 0 Sparsity: certain elements of Θ 0 are assumed to be zero Task: Use the sample to obtain a set of zeros, and then an estimator for Θ 0 ( Σ 0 ) upon a given pattern of zeros Show consistency in predictive risk and in estimating Θ 0 and Σ 0 when n , p → ∞

Gaussian graphical model: representation Let X be a p -dimensional Gaussian random vector X ∼ ( X 1 , . . . , X p ) ∼ N ( 0 , Σ 0 ) , Σ 0 = Θ − 1 where 0 In Gaussian graphical model G ( V , E 0 ) , where | V | = p : A pair ( i , j ) is NOT contained in E 0 ( θ 0 , ij = 0) iff X i ⊥ X j | { X k ; k ∈ V \ { i , j }} Define Predictive Risk with Σ ≻ 0 as R (Σ) = tr (Σ − 1 Σ 0 ) + log | Σ | ∝ − 2 E 0 ( log f Σ ( X )) where the Gaussian Log-likelihood function using Σ ≻ 0 is − p 2 log 2 π − 1 2 log | Σ | − 1 2 X T Σ − 1 X log f Σ ( X ) =

Penalized maximum likelihood estimators To estimate a sparse model (i.e., | Θ 0 | 0 is small), recent work has considered ℓ 1 -penalized maximum likelihood estimators: let | Θ | 1 = � vec Θ � 1 = � � j | θ ij | , i � � tr (Θ � S n ) − log | Θ | + λ n | Θ | 1 � Θ n = arg min , where Θ ≻ 0 S n = n − 1 � n � r = 1 X ( r ) ( X ( r ) ) T is the sample covariance The graph � G n is determined by the non-zeros of � Θ n References: Yuan-Lin 07, d’Aspremont-Banerjee-El Ghaoui 08, Friedman-Hastie-Tibshirani 08, Rothman et al 08, Z-Lafferty-Wasserman 08, and Ravikumar et. al. 08

Predictive risks Fix a point of interest with f 0 = N ( 0 , Σ 0 ) For a given L n , consider a constrained set of positive � � Σ − 1 � 1 ≤ L n } � definite matrices: Γ n = { Σ : Σ ≻ 0 , Define the oracle estimator as Σ ∗ = arg min Σ ∈ Γ n R (Σ) Recall R (Σ) = tr (Σ − 1 Σ 0 ) + log | Σ | Σ n as the minimizer of � R n (Σ) subject to Σ ∈ Γ n , Define � � � � tr Σ − 1 � S n + log | Σ | Σ n = arg min � �� Σ ∈ Γ n R n (Σ) � R n (Σ) is the negative Gaussian log-likelihood function and � S n is the sample covariance �

Risk consistency Persistence Theorem: Let p < n ξ , for some ξ > 0. Given � � � Σ − 1 � n � 1 ≤ L n } , where L n = o ( log n ) , ∀ n . Γ n = { Σ : Σ ≻ 0 , n ) P Then R ( � Σ n ) − R (Σ ∗ → 0 , 1 n 2 L n = logn where R (Σ) = tr (Σ − 1 Σ 0 ) + log | Σ | n = arg min Σ ∈ Γ n R (Σ) and Σ ∗ n=200 400 800 o + o o + o + o o + Persistency answers the asymptotic question: How large may the set Γ n be, so that it is still possible to select empirically a predictor whose risk is close to that of the best predictor in the set (see Greenshtein-Ritov 04 )

Non-edges act as the constraints Suppose we obtain an edge set E such that E 0 ⊆ E : Define the estimator for the concentration matrix Θ 0 as: � � Θ n ( E ) = argmin Θ ∈M E � tr (Θ � S n ) − log | Θ | , where M E = { Θ ≻ 0 and θ ij = 0 ∀ ( i , j ) �∈ E , and i � = j } Theorem. Assume that 0 < ϕ min (Σ 0 ) < ϕ max (Σ 0 ) < ∞ . Suppose that E 0 ⊂ E and | E \ E 0 | = O ( S ) , where S = | E 0 | . �� Θ n ( E ) − Θ 0 � F = O P ( p + S ) log max ( n , p ) / n Then, � � This is the same rate as Rothman et al 08 for the ℓ 1 -penalized likelihood estimate

Get rid of the dependency on p Theorem. Assume that 0 < ϕ min (Σ 0 ) < ϕ max (Σ 0 ) < ∞ . Assume that Σ 0 , ii = 1 , ∀ i . Suppose we obtain an edge set E such that E 0 ⊆ E and | E \ E 0 | = O ( S ) , where S := | E 0 | = � p i = 1 s i . Then, �� Θ n ( E ) − Θ 0 � F = O P S log max ( n , p ) / n In the likelihood function, � S n will be replaced by the sample correlation matrix Γ n = diag ( � S n ) − 1 / 2 ( � S n ) diag ( � S n ) − 1 / 2 �

Main questions: How to select an edge set E so that we estimate Θ 0 well? What assumptions do we need to impose on Σ 0 or Θ 0 ? How does n scale with p , | E | , or the maximum node degree deg ( G ) ? What if some edges have very small weights? How to ensure that E \ E 0 is small? How does the edge-constrained maximum likelihood estimate behave with respect to E 0 \ E and E \ E 0 ?

Outline Introduction The regression model The method Theoretical results Conclusion

A Regression Model We assume a multivariate Gaussian model X = ( X 1 , . . . , X p ) ∼ N p ( 0 , Σ 0 ) , where Σ 0 , ii = 1 Consider a regression formulation of the model: For all i = 1 , . . . , p � β i β i X i = j X j + V i j = − θ 0 , ij /θ 0 , ii , and where j � = i V i ∼ N ( 0 , σ 2 V i ) is independent of { X j ; j � = i } for which we assume that there exists v 2 > 0 such that for all i , Var ( V i ) = 1 /θ 0 , ii ≥ v 2 Recall X i ⊥ X j | { X k ; k ∈ V \ { i , j }} ⇐ ⇒ θ 0 , ij = 0 ⇒ β j i = 0 and β i ⇐ j = 0

Want to recover the support of β i Take a random sample of size n , and use the sample to estimate β i , ∀ i ; that is, we have for each variable X i ,                                X . \ i      β i X i         ǫ = + ,                             n × ( p − 1 ) n n p − 1 where we assume p > n , that is, given high-dimensional data X Lasso (Tibshirani 96), a.k.a. Basis Pursuit (Chen, Donoho, and Saunders 98, and others): β i = arg min β � X i − X ·\ i β � 2 / 2 n + λ n � β � 1 �

Meinshausen and B¨ uhlmann 06 Perform p regressions using the Lasso to obtain p vectors β p where for each i , of regression coefficients � β 1 , . . . , � β i = { � β i � j ; j ∈ { 1 , . . . , p } \ i } Then estimate the edge set by the “OR” rule, estimate an edge between nodes i and j β j β i ⇒ � j � = 0 or � ⇐ i � = 0 Under sparsity and “Neighborhood Stability” conditions, they show P ( � E n = E 0 ) → 1 as n → ∞

Sparsity At row i , define s i 0 , n as the smallest integer such that: p � s i min { θ 2 0 , ij , λ 2 θ 0 , ii } 0 , n λ 2 θ 0 , ii ≤ j = 1 , j � = i The essential sparsity s i 0 , n at row i counts all ( i , j ) such that � ⇒ | β i | θ 0 , ij | � λ θ 0 , ii ⇐ j | � λσ V i Define S 0 , n = � p i = 1 s i 0 , n as the essential sparsity of the graph, which counts all ( i , j ) such that � � j | � λσ V i or | β j ⇒ | β i | θ 0 , ij | � λ min ( θ 0 , ii , θ 0 , jj ) ⇐ i | � λσ V j Aim to keep ≍ 2 S 0 edges in E

Defining 2 s 0 Let 0 ≤ s 0 ≤ s be the smallest integer such that � � p − 1 i , λ 2 σ 2 ) ≤ s 0 λ 2 σ 2 , where λ = 2 log p / n i = 1 min ( β 2 If we order the β j ’s in decreasing order of magnitude | β 1 | ≥ | β 2 | ... ≥ | β p − 1 | , then | β j | < λσ ∀ j > s 0 0.8 s 0 2s 0 s p = 512 n = 500 s = 96 σ = 1 0.6 λ n = logp n Value 0.4 σ 2logp n 0.2 σ logp n σ n 0.0 0 20 40 60 80 100 120 This notion of sparsity has been used in linear regression (Cand` es-Tao 07, Z09,10)

Selection: individual neighborhood We use the Lasso in combination with thresholding (Z09, Z10) � 2 log p / n for inferring the graph: Let λ = For each of the nodewise regressions, obtain an estimator β i init using the Lasso with penalty parameter λ n ≍ λ , n � � � ( X ( r ) j X ( r ) β i β i | β i ) 2 + λ n j | ∀ i , init = argmin β i − i j r = 1 j � = i j � = i Threshold β i init with τ ≍ λ to get the “Zero” set: Let � � � � D i = { j : j � = i , � β i � < τ } j , init

Selection: joining the neighborhoods Define the total “zeros” as: { ( i , j ) : i � = j : ( i , j ) ∈ D i ∩ D j } D = Select edge set E := { ( i , j ) : i , j = 1 , . . . , p , i � = j , ( i , j ) �∈ D} That is, edge set is the joint neighborhoods across all nodes in the graph This reflects the idea that the essential sparsity S 0 , n of the graph counts all ( i , j ) such that � � | θ 0 , ij | ≥ λ min ( θ 0 , ii , θ 0 , jj )

Example: a star graph Construct Σ 0 from a model used in Ravikumar et. al. 08:   1 ρ ρ ρ . . . 0   ρ 2 ρ 2 ρ 1 . . . 0     ρ 2 ρ 2 ρ 1 . . . 0   Σ 0 =   ρ 2 ρ 2   ρ 1 . . . 0     . . . . . . . . . . . . . . . . . . 0 . . . . . . . . . . . . 1 p × p

Example: original graph p = 128 , n = 96 , s = 8 , ρ = 0 . 5 � � 2 log p / n , τ = 0 . 2 2 log p / n λ n = 2

Example: estimated graph with n = 96 � 2 log p / n λ n = 2

Example: estimated graph � 2 log p / n λ n = 2

High-dimensional covariance estimation based on Gaussian graphical - PowerPoint PPT Presentation

High-dimensional covariance estimation based on Gaussian graphical models Shuheng Zhou Department of Statistics, The University of Michigan, Ann Arbor IMA workshop on High Dimensional Phenomena Sept. 26, 2011 Joint work with Philipp R

High-Dimensional Covariance Decomposition into Sparse Markov and Independence Domains Majid

Lecture 14 Covariance Functions 3/08/2018 1 More on Covariance Functions 2 Nugget Covariance

High-Dimensional Covariance Decomposition into Sparse Markov and Independence Domains Majid

Gaussian Filter The Gaussian filter 1 2 1 A Gaussian kernel gives less 1 2 4 2 weight to

High Dimensional Data, Covariance Matrices High Dimensional Data Examples and Application to

Fitting Covariance and Multioutput Gaussian Processes Neil D. Lawrence GPMC 6th February 2017

Fitting Covariance and Multioutput Gaussian Processes Neil D. Lawrence GPSS 13th September 2016

M-Estimation under High-Dimensional Asymptotics DLD, Andrea Montanari 2014-05-01 DLD, Andrea

Spiked Eigenvalues of High Dimensional Separable Sample Covariance Matrices Guangming Pan,

Covariance Matrices and Covariance Operators Theory and Applications H` a Quang Minh Functional

Lecture 3 Capacity of Multiuser Gaussian Channels The Gaussian uplink: 6.1 The fading

Multivariate Gaussian Mean vector: Covariance matrix: 2 1 Conditioning a Gaussian Joint

Introduction to Gaussian Processes Neil D. Lawrence GPMC 6th February 2017 Book Rasmussen and

On corrections of classical multivariate tests for high-dimensional data Jian-feng Yao with

Robust covariance estimation for financial applications Tim Verdonck , Mia Hubert, Peter Rousseeuw

Effects of non-Gaussian noise on covariance-based detectors Toma olc tomaz.solc@ijs.si

DAKAR WORKSHOP, Dakar, 12 th September 2013 MONITORING AND EVALUATION OF THE DAKAR STRATEGY

Data Bridge! How to Connect School PEIMS Data to the HUD Point in Time Texas Conference on Ending

SURPRISE CASH COUNTS ON THE ROAD AREA TRAINING BURNET, TEXAS P r e p a r e d B y : T i a n a

Point-in-Time Planning and Implementation Guidelines Webinar June 25, 2014 Overview of the Point

Coupling 3D radiative transfer models with soil vegetation transfer models for sparse vegetation

Covariate Balancing Propensity Score for General Treatment Regimes Kosuke Imai Princeton

Covariate Balancing Propensity Score Kosuke Imai Princeton University Winter Conference in

Modelling and Estimating the Clustering of Extreme Events Rob Lamb 3 , 4 Ross Towe 1 Jonathan Tawn

Sambuz

Useful Links

Newsletter

Mail Us