

SLIDE 1

High-dimensional covariance estimation based on Gaussian graphical models

Shuheng Zhou

Department of Statistics, The University of Michigan, Ann Arbor

IMA workshop on High Dimensional Phenomena, Sept. 26, 2011

Joint work with Philipp Rütimann, Min Xu, and Peter Bühlmann

SLIDE 2

Problem definition

  • Want to estimate the covariance matrix of a Gaussian distribution: e.g., stock prices
  • Take a random sample of vectors X^(1), . . . , X^(n) i.i.d. ∼ Np(0, Σ0), where p is understood to depend on n (sketch below)
  • Let Θ0 := Σ0^{-1} denote the concentration matrix
  • Sparsity: certain elements of Θ0 are assumed to be zero
  • Task: use the sample to obtain a set of zeros, and then an estimator for Θ0 (Σ0) given the pattern of zeros
  • Show consistency in predictive risk and in estimating Θ0 and Σ0 as n, p → ∞
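To make the setup concrete, here is a minimal numpy sketch of the data-generating model; the chain-graph Θ0 and the dimensions are assumed toy choices, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 30, 200                        # illustrative dimensions (assumed)

# An assumed sparse concentration matrix Theta_0: tridiagonal, i.e., a chain graph
Theta0 = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
Sigma0 = np.linalg.inv(Theta0)        # covariance Sigma_0 = Theta_0^{-1}

# Draw X^(1), ..., X^(n) i.i.d. ~ N_p(0, Sigma_0)
X = rng.multivariate_normal(np.zeros(p), Sigma0, size=n)

# Sample covariance S_n = (1/n) * sum_r X^(r) X^(r)^T (mean is known to be zero)
Sn = X.T @ X / n
```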

SLIDE 3

Gaussian graphical model: representation

  • Let X = (X1, . . . , Xp) be a p-dimensional Gaussian random vector, X ∼ N(0, Σ0), where Σ0 = Θ0^{-1}
  • In the Gaussian graphical model G(V, E0), where |V| = p: a pair (i, j) is NOT contained in E0 (θ0,ij = 0) iff Xi ⊥ Xj | {Xk; k ∈ V \ {i, j}}
  • Define the predictive risk for Σ ≻ 0 as (sketch below)

        R(Σ) = tr(Σ^{-1} Σ0) + log |Σ|,

    which equals −2 E0(log fΣ(X)) up to an additive constant, where the Gaussian log-likelihood with covariance Σ ≻ 0 is

        log fΣ(X) = −(p/2) log 2π − (1/2) log |Σ| − (1/2) X^T Σ^{-1} X
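The risk R(Σ) is straightforward to evaluate numerically; a small helper of my own (not from the talk):

```python
import numpy as np

def predictive_risk(Sigma, Sigma0):
    """R(Sigma) = tr(Sigma^{-1} Sigma0) + log|Sigma| for symmetric positive definite inputs."""
    A = np.linalg.solve(Sigma, Sigma0)      # Sigma^{-1} Sigma0 without an explicit inverse
    _, logdet = np.linalg.slogdet(Sigma)
    return np.trace(A) + logdet

# The risk is minimized at Sigma = Sigma0, where R(Sigma0) = p + log|Sigma0|.
```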

SLIDE 4

Penalized maximum likelihood estimators

  • To estimate a sparse model (i.e., |Θ0|_0 is small), recent work has considered ℓ1-penalized maximum likelihood estimators: let |Θ|_1 = ‖vec Θ‖_1 = ∑_i ∑_j |θij| and define (sketch below)

        Θ̂n = arg min_{Θ ≻ 0} { tr(Θ Sn) − log |Θ| + λn |Θ|_1 },

    where Sn = n^{-1} ∑_{r=1}^n X^(r) (X^(r))^T is the sample covariance
  • The graph Ĝn is determined by the non-zeros of Θ̂n
  • References: Yuan-Lin 07, d’Aspremont-Banerjee-El Ghaoui 08, Friedman-Hastie-Tibshirani 08, Rothman et al. 08, Zhou-Lafferty-Wasserman 08, and Ravikumar et al. 08
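scikit-learn's GraphicalLasso fits essentially this objective (the graphical lasso of Friedman-Hastie-Tibshirani 08; implementation details such as the treatment of the diagonal may differ slightly). A minimal sketch, with an illustrative penalty value and stand-in data where the talk's X would go:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
X = rng.multivariate_normal(np.zeros(5), np.eye(5), size=200)   # stand-in data (assumed)

model = GraphicalLasso(alpha=0.1)     # alpha plays the role of lambda_n (assumed value)
model.fit(X)

Theta_hat = model.precision_          # l1-penalized estimate of Theta_0
# Estimated edge set: non-zero off-diagonal entries of Theta_hat
p = Theta_hat.shape[0]
edges = [(i, j) for i in range(p) for j in range(i + 1, p)
         if abs(Theta_hat[i, j]) > 1e-8]
```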

SLIDE 5

Predictive risks

  • Fix a point of interest with f0 = N(0, Σ0)
  • For a given Ln, consider a constrained set of positive definite matrices: Γn = {Σ : Σ ≻ 0, ‖Σ^{-1}‖_1 ≤ Ln}
  • Define the oracle estimator as Σ* = arg min_{Σ ∈ Γn} R(Σ); recall R(Σ) = tr(Σ^{-1} Σ0) + log |Σ|
  • Define Σ̂n as the minimizer of the empirical risk R̂n(Σ) subject to Σ ∈ Γn:

        Σ̂n = arg min_{Σ ∈ Γn} R̂n(Σ),  where  R̂n(Σ) := tr(Σ^{-1} Sn) + log |Σ|

  • R̂n(Σ) is (up to constants and scaling) the negative Gaussian log-likelihood, and Sn is the sample covariance
SLIDE 6

Risk consistency

  • Persistence Theorem: Let p < n^ξ for some ξ > 0. Given Γn = {Σ : Σ ≻ 0, ‖Σ^{-1}‖_1 ≤ Ln}, where Ln = o(√(n / log n)), we have

        R(Σ̂n) − R(Σ*n) → 0 in probability,

    where R(Σ) = tr(Σ^{-1} Σ0) + log |Σ| and Σ*n = arg min_{Σ ∈ Γn} R(Σ)

  [Figure: illustration for n = 200, 400, 800 with Ln of order √(n / log n); the plot itself is not preserved in this transcript.]

  • Persistency answers the asymptotic question: how large may the set Γn be so that it is still possible to select empirically a predictor whose risk is close to that of the best predictor in the set (see Greenshtein-Ritov 04)

SLIDE 7

Non-edges act as the constraints

  • Suppose we obtain an edge set E such that E0 ⊆ E. Define the estimator for the concentration matrix Θ0 as

        Θ̂n(E) = arg min_{Θ ∈ M_E} { tr(Θ Sn) − log |Θ| },

    where M_E = {Θ ≻ 0 : θij = 0 ∀ (i, j) ∉ E, i ≠ j}
  • Theorem. Assume that 0 < ϕmin(Σ0) < ϕmax(Σ0) < ∞. Suppose that E0 ⊆ E and |E \ E0| = O(S), where S = |E0|. Then

        ‖Θ̂n(E) − Θ0‖_F = O_P(√((p + S) log max(n, p) / n))

  • This is the same rate as Rothman et al. 08 for the ℓ1-penalized likelihood estimate

SLIDE 8

Get rid of the dependency on p

  • Theorem. Assume that 0 < ϕmin(Σ0) < ϕmax(Σ0) < ∞ and that Σ0,ii = 1 ∀ i. Suppose we obtain an edge set E such that E0 ⊆ E and |E \ E0| = O(S), where S := |E0| = ∑_{i=1}^p si. Then

        ‖Θ̂n(E) − Θ0‖_F = O_P(√(S log max(n, p) / n))

  • In the likelihood function, Sn is replaced by the sample correlation matrix (sketch below)

        Γ̂n = diag(Sn)^{-1/2} Sn diag(Sn)^{-1/2}
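A one-line numpy helper for this rescaling (my own sketch, not from the talk):

```python
import numpy as np

def sample_correlation(Sn):
    """Gamma_n = diag(Sn)^{-1/2} Sn diag(Sn)^{-1/2}, the sample correlation matrix."""
    d = 1.0 / np.sqrt(np.diag(Sn))
    return Sn * np.outer(d, d)
```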

SLIDE 9

Main questions:

  • How to select an edge set E so that we estimate Θ0 well?
  • What assumptions do we need to impose on Σ0 or Θ0?
  • How does n scale with p, |E|, or the maximum node degree deg(G)?
  • What if some edges have very small weights? How do we ensure that E \ E0 is small?
  • How does the edge-constrained maximum likelihood estimate behave with respect to E0 \ E and E \ E0?

SLIDE 10

Outline

  • Introduction
  • The regression model
  • The method
  • Theoretical results
  • Conclusion

SLIDE 11

A Regression Model

  • We assume a multivariate Gaussian model X = (X1, . . . , Xp) ∼ Np(0, Σ0), where Σ0,ii = 1
  • Consider a regression formulation of the model: for all i = 1, . . . , p,

        Xi = ∑_{j ≠ i} β^i_j Xj + Vi,

    where β^i_j = −θ0,ij / θ0,ii, and Vi ∼ N(0, σ²_Vi) is independent of {Xj; j ≠ i} (sketch below)
  • We assume that there exists v² > 0 such that for all i, Var(Vi) = 1/θ0,ii ≥ v²
  • Recall Xi ⊥ Xj | {Xk; k ∈ V \ {i, j}} ⇐⇒ θ0,ij = 0 ⇐⇒ β^j_i = 0 and β^i_j = 0
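A minimal numpy check of the identities β^i_j = −θ0,ij/θ0,ii and Var(Vi) = 1/θ0,ii, using an assumed toy Θ0:

```python
import numpy as np

# Assumed toy concentration matrix (symmetric positive definite)
Theta0 = np.array([[ 2.0, -0.8,  0.0],
                   [-0.8,  2.0, -0.5],
                   [ 0.0, -0.5,  2.0]])
Sigma0 = np.linalg.inv(Theta0)

i = 0
# Regression coefficients of X_i on the remaining variables: beta^i_j = -theta_{0,ij} / theta_{0,ii}
beta_i = np.delete(-Theta0[i] / Theta0[i, i], i)
var_Vi = 1.0 / Theta0[i, i]                      # residual variance of the regression

# Equivalent expressions via the covariance (usual Gaussian conditioning formulas)
rest = [j for j in range(3) if j != i]
beta_check = np.linalg.solve(Sigma0[np.ix_(rest, rest)], Sigma0[rest, i])
assert np.allclose(beta_i, beta_check)
assert np.isclose(var_Vi, Sigma0[i, i] - Sigma0[i, rest] @ beta_check)
```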

SLIDE 12

Want to recover the support of β^i

  • Take a random sample of size n, and use the sample to estimate β^i for all i; that is, for each variable Xi we regress the n-vector Xi on the n × (p − 1) matrix X_{·\i} of the remaining variables:

        Xi = X_{·\i} β^i + ε,    β^i ∈ R^{p−1},

    where possibly p > n, that is, we are given high-dimensional data X
  • Lasso (Tibshirani 96), a.k.a. Basis Pursuit (Chen, Donoho, and Saunders 98, and others) — sketch below:

        β̂^i = arg min_β ‖Xi − X_{·\i} β‖²₂ / (2n) + λn ‖β‖₁
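A minimal sketch of the p nodewise Lasso regressions using scikit-learn; sklearn's Lasso minimizes ‖y − Aβ‖²₂/(2n) + α‖β‖₁, which matches the display above with α in the role of λn (the value below is illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

def nodewise_lasso(X, lam):
    """Run the p nodewise Lasso regressions; returns a (p, p) matrix B with
    B[i, j] = estimated beta^i_j (and B[i, i] = 0)."""
    n, p = X.shape
    B = np.zeros((p, p))
    for i in range(p):
        rest = np.delete(np.arange(p), i)
        fit = Lasso(alpha=lam, fit_intercept=False).fit(X[:, rest], X[:, i])
        B[i, rest] = fit.coef_
    return B

# Illustrative penalty lambda_n ~ sqrt(2 log p / n), as on the later slides:
# B = nodewise_lasso(X, lam=np.sqrt(2 * np.log(X.shape[1]) / X.shape[0]))
```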

SLIDE 13

Meinshausen and Bühlmann 06

  • Perform p regressions using the Lasso to obtain p vectors of regression coefficients β̂^1, . . . , β̂^p, where for each i, β̂^i = {β̂^i_j ; j ∈ {1, . . . , p} \ {i}}
  • Then estimate the edge set by the "OR" rule: include an edge between nodes i and j ⇐⇒ β̂^i_j ≠ 0 or β̂^j_i ≠ 0
  • Under sparsity and "Neighborhood Stability" conditions, they show P(Ên = E0) → 1 as n → ∞

SLIDE 14

Sparsity

  • At row i, define s^i_{0,n} as the smallest integer such that

        ∑_{j=1, j≠i}^p min{θ0,ij², λ² θ0,ii} ≤ s^i_{0,n} λ² θ0,ii

  • The essential sparsity s^i_{0,n} at row i counts all (i, j) such that |θ0,ij| ≥ λ √θ0,ii ⇐⇒ |β^i_j| ≥ λ σVi
  • Define S0,n = ∑_{i=1}^p s^i_{0,n} as the essential sparsity of the graph, which counts all (i, j) such that |θ0,ij| ≥ λ min(√θ0,ii, √θ0,jj) ⇐⇒ |β^i_j| ≥ λ σVi or |β^j_i| ≥ λ σVj
  • Aim to keep ≍ 2 S0,n edges in E

SLIDE 15

Defining 2s0

  • Let 0 ≤ s0 ≤ s be the smallest integer such that (sketch below)

        ∑_{i=1}^{p−1} min(βi², λ² σ²) ≤ s0 λ² σ²,  where λ = √(2 log p / n)

  • If we order the βj's in decreasing order of magnitude, |β1| ≥ |β2| ≥ . . . ≥ |βp−1|, then |βj| < λσ for all j > s0

  [Figure: sorted coefficient magnitudes plotted against the thresholds σ√(2 log p / n), σ√(log p / n), and σ/√n, with s0, 2s0, and s marked; p = 512, n = 500, s = 96, σ = 1, λn = √(log p / n). The plot itself is not preserved.]

  • This notion of sparsity has been used in linear regression (Candès-Tao 07, Z09, Z10)
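A small sketch (with assumed toy numbers) of how s0 can be computed directly from this definition:

```python
import numpy as np

def essential_sparsity(beta, lam, sigma):
    """Smallest s0 with sum_i min(beta_i^2, lam^2 sigma^2) <= s0 * lam^2 * sigma^2."""
    total = np.sum(np.minimum(beta ** 2, (lam * sigma) ** 2))
    return int(np.ceil(total / (lam * sigma) ** 2))

p, n, sigma = 512, 500, 1.0
lam = np.sqrt(2 * np.log(p) / n)
# Assumed toy coefficient vector: a few strong entries, many weak ones, the rest zero
beta = np.concatenate([np.full(10, 1.0), np.full(86, 0.01), np.zeros(p - 1 - 96)])
print(essential_sparsity(beta, lam, sigma))   # close to the number of strong entries
```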

SLIDE 16

Selection: individual neighborhood

  • We use the Lasso in combination with thresholding (Z09, Z10) for inferring the graph. Let λ = √(2 log p / n)
  • For each of the nodewise regressions, obtain an estimator β̂^i_init using the Lasso with penalty parameter λn ≍ λ:

        β̂^i_init = arg min_{β^i} ∑_{r=1}^n ( X^(r)_i − ∑_{j≠i} β^i_j X^(r)_j )² + λn ∑_{j≠i} |β^i_j|,  ∀ i

  • Threshold β̂^i_init with τ ≍ λ to obtain the "zero" set: let

        Di = {j : j ≠ i, |β̂^i_{j,init}| < τ}
SLIDE 17

Selection: joining the neighborhoods

  • Define the total "zeros" as D = {(i, j) : i ≠ j, (i, j) ∈ Di ∩ Dj}
  • Select the edge set E := {(i, j) : i, j = 1, . . . , p, i ≠ j, (i, j) ∉ D}; that is, the edge set joins the estimated neighborhoods across all nodes in the graph (a short sketch follows below)
  • This reflects the idea that the essential sparsity S0,n of the graph counts all (i, j) such that |θ0,ij| ≥ λ min(√θ0,ii, √θ0,jj)
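A minimal sketch of the thresholding-and-joining step, applied to the matrix B of nodewise Lasso coefficients from the earlier sketch; an edge (i, j) is kept unless it falls in both zero sets Di and Dj:

```python
import numpy as np

def select_edges(B, tau):
    """B[i, j] holds the initial Lasso estimate of beta^i_j; tau is the threshold.

    Returns the selected edge set E as a set of pairs (i, j) with i < j."""
    p = B.shape[0]
    keep = np.abs(B) >= tau            # entries surviving the threshold
    E = set()
    for i in range(p):
        for j in range(i + 1, p):
            # "OR" rule: the edge is dropped only if (i, j) lies in both D_i and D_j
            if keep[i, j] or keep[j, i]:
                E.add((i, j))
    return E

# Illustrative threshold tau ~ lambda = sqrt(2 log p / n), as on the slides:
# E = select_edges(B, tau=np.sqrt(2 * np.log(p) / n))
```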
SLIDE 18

Example: a star graph

  • Construct Σ0 from a model used in Ravikumar et al. 08, a star graph with node 1 as the hub (sketch below): the p × p matrix with Σ0,ii = 1, Σ0,1j = Σ0,j1 = ρ for j ≥ 2, and Σ0,ij = ρ² for i, j ≥ 2, i ≠ j:

        Σ0 = ⎡ 1  ρ   ρ   ρ   ⋯ ⎤
             ⎢ ρ  1   ρ²  ρ²  ⋯ ⎥
             ⎢ ρ  ρ²  1   ρ²  ⋯ ⎥
             ⎢ ρ  ρ²  ρ²  1   ⋯ ⎥
             ⎣ ⋮  ⋮   ⋮   ⋮   1 ⎦   (p × p)
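A small numpy sketch constructing this Σ0 (the values p = 128 and ρ = 0.5 are taken from the example that follows):

```python
import numpy as np

def star_covariance(p, rho):
    """Sigma_0 for the star graph: hub correlations rho, leaf-leaf correlations rho^2."""
    Sigma0 = np.full((p, p), rho ** 2)
    Sigma0[0, :] = rho
    Sigma0[:, 0] = rho
    np.fill_diagonal(Sigma0, 1.0)
    return Sigma0

Sigma0 = star_covariance(p=128, rho=0.5)
Theta0 = np.linalg.inv(Sigma0)
# Off-diagonal entries of Theta0 vanish (up to numerical precision) except in the
# hub's row and column: the graph is a star.
```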

SLIDES 19-28

Example: original and estimated graphs

p = 128, n = 96, s = 8, ρ = 0.5; λn = 2√(2 log p / n), τ = 0.2√(2 log p / n)

[Figures: the original star-graph example and a sequence of estimated graphs; only the captions and tuning parameters above are preserved in this transcript.]
SLIDE 29

Gelato: estimation of edge weights

  • Given a graph with edge set E, we estimate the concentration matrix by maximum likelihood
  • Denote the sample correlation matrix by Γ̂n = diag(Sn)^{-1/2} Sn diag(Sn)^{-1/2}
  • The estimator for the concentration matrix Θ0 is (sketch below)

        Θ̂n(E) = arg min_{Θ ∈ Mp,E} { tr(Θ Γ̂n) − log |Θ| },

    where Mp,E = {Θ ∈ R^{p×p} : Θ ≻ 0 and θij = 0 for all (i, j) ∈ D} and D := {(i, j) : i, j = 1, . . . , p, (i, j) ∉ E and i ≠ j}
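The zero-constrained MLE is a convex program; one possible way to compute it is sketched below with cvxpy (the library and solver choice are mine, not from the talk):

```python
import cvxpy as cp
import numpy as np

def constrained_mle(Gamma_n, E):
    """MLE of Theta over {Theta > 0 : theta_ij = 0 for (i, j) not in E, i != j}."""
    p = Gamma_n.shape[0]
    Theta = cp.Variable((p, p), PSD=True)
    objective = cp.Minimize(cp.trace(Theta @ Gamma_n) - cp.log_det(Theta))
    # Force zeros on all non-edges (the set D); E holds unordered pairs (i, j)
    constraints = [Theta[i, j] == 0
                   for i in range(p) for j in range(p)
                   if i != j and (i, j) not in E and (j, i) not in E]
    cp.Problem(objective, constraints).solve()
    return Theta.value
```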

SLIDE 30

Likelihood equations

  • Let diag(Sn)^{1/2} = diag(σ̂1, . . . , σ̂p). The following relationships hold for the maximum likelihood estimate Θ̂n and Σ̂n = (Θ̂n)^{-1}:
  • Σ̂n,ii = 1, ∀ i = 1, . . . , p
  • Σ̂n,ij = Γ̂n,ij = Sn,ij / (σ̂i σ̂j), ∀ (i, j) ∈ E, and
  • Θ̂n,ij = 0, ∀ (i, j) ∈ D
  • This is also known as a positive definite matrix completion problem

SLIDE 31

Set of assumptions

  • Let c, C be absolute constants
  • (A0) The size of the neighborhood of each node i ∈ V is bounded by an integer s < p, and the sample size satisfies n ≥ C s log(cp/s)
  • (A1) The dimension and the number of sufficiently strong edges S0,n satisfy: p = o(e^{cn}) for some 0 < c < 1 and S0,n = o(n / log max(n, p)) as n → ∞
  • (A2) The minimal and maximal eigenvalues of Σ0 are bounded, and Σ0,ii = 1 for all i

SLIDE 32

The main theorem: selection

  • Assume that (A0) and (A2) hold. Let λ = √(2 log p / n). Let d, C, D depend on sparse and restricted eigenvalues of Σ0, and let λn = dλ and τ = Dλn be appropriately chosen
  • Denote the estimated edge set by Ê = Ên(λn, τ). Then with high probability, |Ê| ≤ 2 S0,n, where |Ê \ E0| ≤ S0,n, and

        ‖Θ0,D̂‖_F ≤ C λn √( min{ S0,n (max_{i=1,...,p} θ0,ii²), s0 ‖diag(Θ0)‖_F² } ),

    where s0 = max_i s^i_{0,n} denotes the maximum "essential node degree"

SLIDE 33

Example: p = 128, s = 12, ρ = 0.5

λn = 2√(2 log p / n), τ = f √(2 log p / n)

[Figures: false positive rate (FPR) and false negative rate (FNR) plotted against n (60 to 120), comparing the plain Lasso with thresholding at f = 0.30, 0.35, 0.40; the plots themselves are not preserved.]

SLIDE 34

The main theorem: estimation

  • Assume that, in addition, (A1) holds. Then for Θ̂n and Σ̂n = (Θ̂n)^{-1}:

        ‖Θ̂n − Θ0‖_F = O_P(√(S0,n log max(n, p) / n))
        ‖Σ̂n − Σ0‖_F = O_P(√(S0,n log max(n, p) / n))
        R(Θ̂n) − R(Θ0) = O_P(S0,n log max(n, p) / n)

  • Consequently, ‖Θ̂n − Θ0‖_2 and ‖Σ̂n − Σ0‖_2 are also O_P(√(S0,n log max(n, p) / n))
SLIDE 35

Obtaining an edge set E

  • Let Si = {j : j ≠ i, β^i_j ≠ 0} and si = |Si|. Let D, λn, C be the same as in the main theorem
  • For each of the nodewise regressions, we apply the same thresholding rule to obtain a subset Ii:

        Ii = {j : j ≠ i, |β̂^i_{j,init}| ≥ τ = Dλn},  and  Di := {1, . . . , i − 1, i + 1, . . . , p} \ Ii

  • Then, with high probability, |Ii| ≤ 2 s^i_0 and |Ii ∪ Si| ≤ |Si| + s^i_0 and

        ‖Θ^i_{0,Di}‖_2 ≤ C λn θ0,ii √(s^i_0)  ⟸  ‖β^i_{Di}‖_2 ≤ C λn √(s^i_0)

  • The proof follows from results in Z10 on the thresholded Lasso estimator

SLIDE 36

Oracle inequalities for the Lasso

  • Theorem (Z10). Under (A0) and (A2), for all nodewise regressions, the Lasso estimator achieves squared ℓ2 loss of O_P(s0 σ² log p / n)

  [Figure: the same plot as on the sparsity slide, showing sorted coefficient magnitudes against the thresholds σ√(2 log p / n), σ√(log p / n), and σ/√n, with s0, 2s0, and s marked; p = 512, n = 500, s = 96, σ = 1, λn = √(log p / n). The plot itself is not preserved.]

SLIDE 37

Constructing a pivot point

  • Now, clearly, by the "OR" rule we have Ê = {(i, j) : j ∈ Ii, i = 1, . . . , p} and

        |Ê| ≤ ∑_{i=1}^p |Ii| ≤ ∑_{i=1}^p 2 s^i_0 = 2 S0

  • Given a 2S0-sparse set of edges Ê, define a sparse approximation Θ̃0 of Θ0 which is identical to Θ0 on Ê and the diagonal, and zero elsewhere:

        Θ̃0 = diag(Θ0) + Θ0,Ê = diag(Θ0) + Θ0,Ê∩E0

  • ‖Θ̃0‖_0 = p + 2|Ê ∩ E0| ≤ p + 4 S0, with (s + 1)-sparse row (column) vectors

SLIDE 38
Θ̃0 as a sparse approximation

  • The bias is small:

        ‖Θ0 − Θ̃0‖_F ≤ C max_{i=1,...,p}(θ0,ii) √(S0 log p / n)

  • For q = 1, 2, ∞,

        ‖Θ0 − Θ̃0‖_q ≤ C max_{i=1,...,p}(θ0,ii) √(s0) λn √s

  • Note that each row vector would be (2s0 + 1)-sparse if we applied the "AND" rule, however at the cost of a larger bias

SLIDE 39
Θ̃0 as a pivot

  • The sparsity and the small bias allow us to bound

        ‖Θ̂n(Ê) − Θ̃0‖_F = O_P(√(S0,n log max(n, p) / n)) =: rn,

    where we use the fact that both the estimator Θ̂n(Ê) and the pivot Θ̃0 are sparse
  • By the triangle inequality, we conclude that

        ‖Θ̂n(Ê) − Θ0‖_F ≤ ‖Θ̂n(Ê) − Θ̃0‖_F + ‖Θ̃0 − Θ0‖_F = O_P(√(S0,n log max(n, p) / n))
SLIDE 40

Generalization on the estimation step

  • Assume that (A1) and (A2) hold. Let σ²max := max_i Σ0,ii < ∞ and σ²min := min_i Σ0,ii > 0, and let W = diag(Σ0)^{1/2}
  • Suppose that we obtain an edge set E such that |E| = lin(S0,n) is a linear function of S0,n
  • For Θ̃0 = diag(Θ0) + Θ0,E,

        ‖Θ̃0 − Θ0‖_F ≤ C √(2 S0,n log(p) / n)

  • We note that this is equivalent to assuming ‖Ω̃0 − Ω0‖_F ≤ C′ √(2 S0,n log(p) / n), where Ω0 = W Θ0 W and Ω̃0 = W Θ̃0 W

SLIDE 41

Generalization on the estimation step

  • Theorem. Suppose the sample size satisfies n > M S0,n log max(n, p) for a sufficiently large constant M. Then

        ‖Ω̂n(E) − Ω0‖_F = O_P(√(2 S0,n log max(n, p) / n)),

    where Ω̂n(E) is the maximum likelihood estimator based on the sample correlation matrix Γ̂n:

        Ω̂n(E) = arg min_{Ω ∈ Mp,E} { tr(Ω Γ̂n) − log |Ω| }

SLIDE 42

Generalization on the estimation step

  • Given Ŵ = diag(Sn)^{1/2} and Ω̂n(E), compute (sketch below)

        Θ̂n = Ŵ^{-1} Ω̂n(E) Ŵ^{-1}  and  Σ̂n = Ŵ (Ω̂n(E))^{-1} Ŵ,

    for which the following hold:
  • Σ̂n,ij = Sn,ij ∀ (i, j) ∈ E ∪ {(i, i) : i = 1, . . . , p}
  • Θ̂n,ij = 0 ∀ (i, j) ∈ D
  • Following the bound on Ω̂n(E) and arguments in Rothman et al. (2008),

        ‖Θ̂n − Θ0‖_2 = O_P(√(S0,n log max(n, p) / n))
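A small numpy sketch of this rescaling step (Omega_hat is assumed to come from a zero-constrained MLE computed on the sample correlation matrix, as sketched earlier):

```python
import numpy as np

def rescale_estimates(Omega_hat, Sn):
    """Convert the correlation-scale estimate Omega_hat(E) back to the original scale:
    Theta_hat = W^{-1} Omega_hat W^{-1} and Sigma_hat = W Omega_hat^{-1} W, W = diag(Sn)^{1/2}."""
    w = np.sqrt(np.diag(Sn))
    Theta_hat = Omega_hat / np.outer(w, w)                  # W^{-1} Omega_hat W^{-1}
    Sigma_hat = np.linalg.inv(Omega_hat) * np.outer(w, w)   # W Omega_hat^{-1} W
    return Theta_hat, Sigma_hat
```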
SLIDE 43

Generalization on the estimation error

  • For the Frobenius norm and the risk to converge to zero, (A1) is replaced by: p ≍ n^c for some 0 < c < 1 and p + S0,n = o(n / log max(n, p)) as n → ∞. In this case we have

        ‖Θ̂n − Θ0‖_F = O_P(√((p + S0,n) log max(n, p) / n))
        ‖Σ̂n − Σ0‖_F = O_P(√((p + S0,n) log max(n, p) / n))
        R(Θ̂n) − R(Θ0) = O_P((p + S0,n) log max(n, p) / n)

  • We could achieve these rates with

        Θ̂n(E) = arg min_{Θ ∈ Mp,E} { tr(Θ Sn) − log |Θ| }

SLIDE 44

Conclusion

  • Gelato separates the tasks of model selection and (inverse) covariance estimation
  • Thresholding plays a key role in obtaining a sparse approximation of the graph with a small bias from a very small number of samples
  • Under stronger conditions on the sample size, convergence rates in operator and Frobenius norms and in KL divergence are established
  • The method is feasible in high dimensions: p > n is allowed

SLIDE 45

Related work on inverse/covariance estimation

  • Regression-based selection/estimation: Meinshausen-Bühlmann 06, Peng et al. 09, Yuan 10, Verzelen 10, Cai-Liu-Luo 11
  • Penalized likelihood methods based on the ℓ1-norm penalty: Yuan-Lin 07, d’Aspremont-Banerjee-El Ghaoui 08, Friedman-Hastie-Tibshirani 07, Rothman et al. 08, Zhou-Lafferty-Wasserman 08, Ravikumar et al. 08
  • Nonconvex penalties: Lam-Fan 09
  • Sparse covariance selection/estimation: Bickel and Levina 06, 08; El Karoui 08; Levina and Vershynin 10; and more...

SLIDE 46

References

RUDELSON, M. and ZHOU, S. (2011). Reconstruction from anisotropic random measurements. arXiv:1106.1151.

ZHOU, S. (2009). Restricted eigenvalue conditions on subgaussian random matrices. arXiv:0904.4723v2.

ZHOU, S. (2009). Thresholding procedures for high dimensional variable selection and statistical estimation. In Advances in Neural Information Processing Systems 22. MIT Press.

ZHOU, S. (2010). Thresholded Lasso for high dimensional variable selection and statistical estimation. arXiv:1002.1583.

ZHOU, S., RÜTIMANN, P., XU, M. and BÜHLMANN, P. (2011). High-dimensional covariance estimation based on Gaussian graphical models. arXiv:1009.0530v2.