SLIDE 1 High-dimensional covariance estimation based on Gaussian graphical models
Shuheng Zhou
Department of Statistics, The University of Michigan, Ann Arbor
IMA workshop on High Dimensional Phenomena
Joint work with Philipp Rütimann, Min Xu, and Peter Bühlmann
SLIDE 2
Problem definition
Want to estimate the covariance matrix of a Gaussian distribution: e.g., stock prices.
Take a random sample of vectors X(1), ..., X(n) i.i.d. ∼ Np(0, Σ0), where p is understood to depend on n.
Let Θ0 := Σ0^{-1} denote the concentration matrix.
Sparsity: certain elements of Θ0 are assumed to be zero.
Task: use the sample to obtain a set of zeros, and then an estimator for Θ0 (and for Σ0) based on the given pattern of zeros.
Show consistency in predictive risk and in estimating Θ0 and Σ0 as n, p → ∞.
SLIDE 3
Gaussian graphical model: representation
Let X = (X1, ..., Xp) ∼ N(0, Σ0) be a p-dimensional Gaussian random vector, where Σ0 = Θ0^{-1}.
In the Gaussian graphical model G = (V, E0), where |V| = p:
a pair (i, j) is NOT contained in E0 (θ0,ij = 0) iff Xi ⊥ Xj | {Xk; k ∈ V \ {i, j}}.
Define the predictive risk for Σ ≻ 0 as
R(Σ) = tr(Σ^{-1}Σ0) + log |Σ| ∝ −2 E0(log fΣ(X)),
where the Gaussian log-likelihood function using Σ ≻ 0 is
log fΣ(X) = −(p/2) log 2π − (1/2) log |Σ| − (1/2) X^T Σ^{-1} X.
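The predictive risk above is easy to evaluate numerically. A minimal NumPy sketch (the matrices are toy examples of my choosing), checking the defining property that R is minimized over positive definite Σ at Σ = Σ0:

```python
import numpy as np

def predictive_risk(Sigma, Sigma0):
    """R(Sigma) = tr(Sigma^{-1} Sigma0) + log|Sigma|, i.e. the expected
    negative log-likelihood under f0 = N(0, Sigma0), up to constants."""
    sign, logdet = np.linalg.slogdet(Sigma)
    if sign <= 0:
        raise ValueError("Sigma must be positive definite")
    return np.trace(np.linalg.solve(Sigma, Sigma0)) + logdet

# Sanity check: the risk is minimized at Sigma = Sigma0.
Sigma0 = np.array([[1.0, 0.5], [0.5, 1.0]])
r_true = predictive_risk(Sigma0, Sigma0)   # = p + log|Sigma0|
r_id = predictive_risk(np.eye(2), Sigma0)  # strictly larger
```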
SLIDE 4 Penalized maximum likelihood estimators
To estimate a sparse model (i.e., |Θ0|_0 is small), recent work has considered ℓ1-penalized maximum likelihood estimators. Let |Θ|_1 = ‖vec Θ‖_1 = Σ_{i,j} |θij|, and define
Θ̂n = argmin_{Θ≻0} tr(Θ Ŝn) − log |Θ| + λn |Θ|_1,
where Ŝn = (1/n) Σ_{r=1}^n X(r) (X(r))^T is the sample covariance.
The graph Ĝn is determined by the non-zeros of Θ̂n.
References: Yuan-Lin 07, d'Aspremont-Banerjee-El Ghaoui 08, Friedman-Hastie-Tibshirani 08, Rothman et al. 08, Zhou-Lafferty-Wasserman 08, and Ravikumar et al. 08
SLIDE 5 Predictive risks
Fix a point of interest with f0 = N(0, Σ0). For a given Ln, consider a constrained set of positive definite matrices:
Γn = {Σ : Σ ≻ 0, ‖Σ^{-1}‖_1 ≤ Ln}
Define the oracle estimator as Σ* = argmin_{Σ∈Γn} R(Σ); recall R(Σ) = tr(Σ^{-1}Σ0) + log |Σ|.
Define Σ̂n as the minimizer of R̂n(Σ) subject to Σ ∈ Γn:
Σ̂n = argmin_{Σ∈Γn} R̂n(Σ), where R̂n(Σ) = tr(Σ^{-1} Ŝn) + log |Σ|
R̂n(Σ) is the negative Gaussian log-likelihood function (up to constants), and Ŝn is the sample covariance.
SLIDE 6 Risk consistency
Persistence Theorem: Let p < n^ξ for some ξ > 0. Given
Γn = {Σ : Σ ≻ 0, ‖Σ^{-1}‖_1 ≤ Ln}, where Ln = o((n/log n)^{1/2}), ∀n,
it holds that R(Σ̂n) − R(Σ*n) →_P 0,
where R(Σ) = tr(Σ^{-1}Σ0) + log |Σ| and Σ*n = argmin_{Σ∈Γn} R(Σ).
Persistency answers the asymptotic question: how large may the set Γn be, so that it is still possible to select empirically a predictor whose risk is close to that of the best predictor in the set? (See Greenshtein-Ritov 04.)
SLIDE 7 Non-edges act as the constraints
Suppose we obtain an edge set E such that E0 ⊆ E. Define the estimator for the concentration matrix Θ0 as:
Θ̂n(E) = argmin_{Θ∈M_E} tr(Θ Ŝn) − log |Θ|, where
M_E = {Θ : Θ ≻ 0 and θij = 0 ∀(i, j) ∉ E with i ≠ j}
Theorem. Assume that 0 < ϕmin(Σ0) < ϕmax(Σ0) < ∞. Suppose that E0 ⊆ E and |E \ E0| = O(S), where S = |E0|. Then
‖Θ̂n(E) − Θ0‖_F = OP(√((p + S) log max(n, p)/n))
This is the same rate as Rothman et al. 08 for the ℓ1-penalized likelihood estimate.
SLIDE 8 Get rid of the dependency on p
Theorem. Assume that 0 < ϕmin(Σ0) < ϕmax(Σ0) < ∞, and that Σ0,ii = 1 ∀i. Suppose we obtain an edge set E such that E0 ⊆ E and |E \ E0| = O(S), where S := |E0| = Σ_{i=1}^p si. Then
‖Θ̂n(E) − Θ0‖_F = OP(√(S log max(n, p)/n))
In the likelihood function, Ŝn is replaced by the sample correlation matrix
Γ̂n = diag(Ŝn)^{-1/2} Ŝn diag(Ŝn)^{-1/2}
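The rescaling by diag(Ŝn)^{-1/2} is a one-liner. A minimal NumPy sketch (the sample data are invented for illustration):

```python
import numpy as np

def sample_correlation(S):
    """Gamma_n = diag(S)^{-1/2} S diag(S)^{-1/2}: rescale a sample
    covariance so that its diagonal becomes exactly 1."""
    d = 1.0 / np.sqrt(np.diag(S))
    return S * np.outer(d, d)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))
S = X.T @ X / 100              # sample covariance (mean-zero model)
G = sample_correlation(S)      # unit diagonal, same off-diagonal scale-free part
```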
SLIDE 9
Main questions:
How to select an edge set E so that we estimate Θ0 well?
What assumptions do we need to impose on Σ0 or Θ0?
How does n scale with p, |E|, or the maximum node degree deg(G)?
What if some edges have very small weights? How to ensure that E \ E0 is small?
How does the edge-constrained maximum likelihood estimate behave with respect to E0 \ E and E \ E0?
SLIDE 10
Outline
Introduction
The regression model
The method
Theoretical results
Conclusion
SLIDE 11 A Regression Model
We assume a multivariate Gaussian model X = (X1, ..., Xp) ∼ Np(0, Σ0), where Σ0,ii = 1.
Consider a regression formulation of the model: for all i = 1, ..., p,
Xi = Σ_{j≠i} β^i_j Xj + Vi,
where β^i_j = −θ0,ij/θ0,ii, and Vi ∼ N(0, σ²Vi) is independent of {Xj; j ≠ i},
for which we assume that there exists v² > 0 such that for all i, Var(Vi) = 1/θ0,ii ≥ v².
Recall Xi ⊥ Xj | {Xk; k ∈ V \ {i, j}} ⇐⇒ θ0,ij = 0 ⇐⇒ β^j_i = 0 and β^i_j = 0.
SLIDE 12 Want to recover the support of β^i
Take a random sample of size n, and use the sample to estimate β^i, ∀i; that is, we have for each variable Xi,
Xi (n×1) = X_{·\i} (n×(p−1)) β^i (p−1) + ε (n×1),
where we assume p > n; that is, we are given high-dimensional data X.
Lasso (Tibshirani 96), a.k.a. Basis Pursuit (Chen, Donoho, and Saunders 98, and others):
β̂^i = argmin_β ‖Xi − X_{·\i}β‖²₂/(2n) + λn‖β‖₁
SLIDE 13 Meinshausen and Bühlmann 06
Perform p regressions using the Lasso to obtain p vectors of regression coefficients β̂^1, ..., β̂^p, where for each i, β̂^i = {β̂^i_j ; j ∈ {1, ..., p} \ i}.
Then estimate the edge set by the "OR" rule: estimate an edge between nodes i and j ⇐⇒ β̂^i_j ≠ 0 or β̂^j_i ≠ 0.
Under sparsity and "Neighborhood Stability" conditions, they show P(Ên = E0) → 1 as n → ∞.
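The scheme above is easy to sketch end to end. The snippet below is a minimal illustration, not the authors' implementation: it uses a bare-bones cyclic coordinate-descent Lasso (fixed sweep count, no convergence checks) and then applies the "OR" rule.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize ||y - X b||^2/(2n) + lam*||b||_1 by cyclic
    coordinate descent with soft-thresholding updates."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y.astype(float).copy()          # current residual y - X b
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho = X[:, j] @ r / n + col_sq[j] * b[j]
            b_new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r += X[:, j] * (b[j] - b_new)
            b[j] = b_new
    return b

def mb_or_rule(X, lam):
    """Nodewise regressions + 'OR' rule: declare an edge (i, j) when
    either of the two regressions gives a nonzero coefficient."""
    n, p = X.shape
    B = np.zeros((p, p))   # B[i, j] = coefficient of X_j when regressing X_i
    for i in range(p):
        others = [j for j in range(p) if j != i]
        B[i, others] = lasso_cd(X[:, others], X[:, i], lam)
    return {(i, j) for i in range(p) for j in range(i + 1, p)
            if B[i, j] != 0.0 or B[j, i] != 0.0}
```

A useful sanity check on the solver: for λ ≥ ‖X^T y/n‖∞ the Lasso solution is exactly zero.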
SLIDE 14 Sparsity
At row i, define s^i_{0,n} as the smallest integer such that
Σ_{j≠i} min{θ²_{0,ij}, λ²θ0,ii} ≤ s^i_{0,n} λ² θ0,ii
The essential sparsity s^i_{0,n} at row i counts all (i, j) such that
|θ0,ij| ≥ λ√θ0,ii ⇐⇒ |β^i_j| ≥ λσVi
Define S0,n = Σ_{i=1}^p s^i_{0,n} as the essential sparsity of the graph, which counts all (i, j) such that
|θ0,ij| ≥ λ min(√θ0,ii, √θ0,jj) ⇐⇒ |β^i_j| ≥ λσVi or |β^j_i| ≥ λσVj
Aim to keep ≍ 2S0,n edges in E.
SLIDE 15 Defining 2s0
Let 0 ≤ s0 ≤ s be the smallest integer such that
Σ_{i=1}^{p−1} min(β²_i, λ²σ²) ≤ s0 λ²σ², where λ = √(2 log p/n)
If we order the βj's in decreasing order of magnitude |β1| ≥ |β2| ≥ ... ≥ |βp−1|, then |βj| < λσ ∀ j > s0.
[Figure: ordered coefficient magnitudes against the thresholds σ√(2 log p/n), σ√(log p/n), and σ/√n, marking s0, 2s0, and s; here p = 512, n = 500, s = 96, σ = 1, λn = √(log p/n).]
This notion of sparsity has been used in linear regression (Candès-Tao 07, Z09, Z10).
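Given the coefficients, s0 is a one-line computation: the defining inequality forces s0 = ⌈Σ min(β²_j, λ²σ²)/(λ²σ²)⌉. A small sketch with made-up numbers:

```python
import numpy as np

def essential_sparsity(beta, lam, sigma):
    """Smallest integer s0 with sum_j min(beta_j^2, lam^2 sigma^2)
    <= s0 * lam^2 * sigma^2."""
    cap = (lam * sigma) ** 2
    return int(np.ceil(np.minimum(beta ** 2, cap).sum() / cap))

# Two strong coefficients, plus a weak tail whose total mass adds one more.
beta = np.array([3.0, 2.0, 0.05, 0.04, 0.0])
s0 = essential_sparsity(beta, lam=0.1, sigma=1.0)   # -> 3
```

Consistent with the slide: ordering |βj| decreasingly, every coefficient past position s0 is below λσ.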
SLIDE 16 Selection: individual neighborhood
We use the Lasso in combination with thresholding (Z09, Z10) for inferring the graph. Let λ = √(2 log p/n).
For each of the nodewise regressions, obtain an initial estimator β̂^i_init using the Lasso with penalty parameter λn ≍ λ:
β̂^i_init = argmin_{β^i} (1/2n) Σ_{r=1}^n (X(r)_i − Σ_{j≠i} β^i_j X(r)_j)² + λn Σ_{j≠i} |β^i_j|, ∀i
Threshold β̂^i_init with τ ≍ λ to get the "zero" set: let
Di = {j : j ≠ i, |β̂^i_{j,init}| < τ}
SLIDE 17 Selection: joining the neighborhoods
Define the total "zeros" as D = {(i, j) : i ≠ j, j ∈ Di and i ∈ Dj}. Select the edge set E := {(i, j) : i, j = 1, ..., p, i ≠ j, (i, j) ∉ D}. That is, the edge set is the union of the neighborhoods across all nodes in the graph. This reflects the idea that the essential sparsity S0,n of the graph counts all (i, j) such that |θ0,ij| ≥ λ min(√θ0,ii, √θ0,jj).
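The selection step (threshold, then join by the "OR" rule) fits in a few lines of NumPy. Here B_init stands for the matrix of initial nodewise Lasso coefficients, with B_init[i, j] the coefficient of X_j in the regression of X_i; the matrix below is a toy example:

```python
import numpy as np

def gelato_select(B_init, tau):
    """Threshold each nodewise coefficient at tau; keep edge (i, j)
    unless BOTH directions are thresholded away, i.e. (i, j) lands in D."""
    p = B_init.shape[0]
    keep = np.abs(B_init) >= tau
    np.fill_diagonal(keep, False)
    return {(i, j) for i in range(p) for j in range(i + 1, p)
            if keep[i, j] or keep[j, i]}

B_init = np.array([[0.00, 0.50, 0.01],
                   [0.30, 0.00, 0.00],
                   [0.00, 0.25, 0.00]])
E = gelato_select(B_init, tau=0.2)   # -> {(0, 1), (1, 2)}
```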
SLIDE 18
Example: a star graph
Construct Σ0 from a model used in Ravikumar et al. 08 (a star graph: ρ in the first row and column, ρ² elsewhere off the diagonal):
Σ0 =
[ 1    ρ    ρ    ρ   ...
  ρ    1    ρ²   ρ²  ...
  ρ    ρ²   1    ρ²  ...
  ρ    ρ²   ρ²   1   ...
  ...                 1 ]  (p × p)
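This Σ0 is easy to construct: node 1 is the hub with correlation ρ to every other node, and any two non-hub nodes have correlation ρ². A NumPy sketch:

```python
import numpy as np

def star_covariance(p, rho):
    """Star-graph covariance: Sigma[0, j] = rho for j != 0,
    Sigma[i, j] = rho**2 for distinct non-hub i, j, unit diagonal."""
    S = np.full((p, p), rho ** 2)
    S[0, :] = rho
    S[:, 0] = rho
    np.fill_diagonal(S, 1.0)
    return S

Sigma0 = star_covariance(5, 0.5)
```

The matrix is positive definite for |ρ| < 1, since it is the covariance of X1 ∼ N(0, 1) and Xj = ρX1 + √(1 − ρ²) ej.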
SLIDE 19 Example: original graph
p = 128, n = 96, s = 8, ρ = 0.5, λn = 2√(2 log p/n), τ = 0.2√(2 log p/n)
SLIDE 20 Example: estimated graph with n = 96
λn = 2√(2 log p/n)
SLIDE 21 Example: estimated graph
λn = 2√(2 log p/n)
SLIDE 22 Example: estimated graph
λn = 2√(2 log p/n)
SLIDE 23 Example: estimated graph
λn = 2√(2 log p/n)
SLIDE 24 Example: estimated graph
λn = 2√(2 log p/n)
SLIDE 25 Example: estimated graph
λn = 2√(2 log p/n)
SLIDE 26 Example: estimated graph
λn = 2√(2 log p/n)
SLIDE 27 Example: estimated graph
λn = 2√(2 log p/n)
SLIDE 28 Example: estimated graph
τ = 0.2√(2 log p/n)
SLIDE 29 Gelato: estimation of edge weight
Given a graph with edge set E, we estimate the concentration matrix by maximum likelihood. Denote the sample correlation matrix by
Γ̂n = diag(Ŝn)^{-1/2} Ŝn diag(Ŝn)^{-1/2}
The estimator for the concentration matrix Θ0 is:
Θ̂n(E) = argmin_{Θ∈Mp,E} tr(Θ Γ̂n) − log |Θ|, where
Mp,E = {Θ ∈ Rp×p : Θ ≻ 0 and θij = 0 for all (i, j) ∈ D} and D := {(i, j) : i, j = 1, ..., p, (i, j) ∉ E and i ≠ j}
SLIDE 30 Likelihood equations
Let diag(Ŝn)^{1/2} = diag(σ̂1, ..., σ̂p). The following relationships hold for the maximum likelihood estimate Θ̂n and Σ̂n = (Θ̂n)^{-1}:
Σ̂n,ii = 1, ∀i = 1, ..., p
Σ̂n,ij = Ŝn,ij/(σ̂i σ̂j), ∀(i, j) ∈ E, and
Θ̂n,ij = 0, ∀(i, j) ∈ D
This is also known as a positive definite matrix completion problem.
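The constrained MLE can be computed in many ways; below is a deliberately simple projected-gradient sketch, not the authors' algorithm, using the fact that the gradient of tr(ΘΓ) − log|Θ| is Γ − Θ^{-1}. The 3-node star example and its Γ are toy choices; at the solution, the likelihood equations above hold: Σ̂ matches Γ̂n on E and the diagonal, and Θ̂ is zero on D.

```python
import numpy as np

def constrained_mle(Gamma, E, n_iter=2000, step=0.1):
    """Minimize tr(Theta @ Gamma) - log|Theta| over Theta > 0 with
    theta_ij = 0 for all (i, j) not in E (i != j): gradient descent
    restricted to the free entries, with step halving."""
    p = Gamma.shape[0]
    free = np.eye(p, dtype=bool)            # diagonal entries are always free
    for i, j in E:
        free[i, j] = free[j, i] = True

    def objective(T):
        sign, logdet = np.linalg.slogdet(T)
        return np.inf if sign <= 0 else np.trace(T @ Gamma) - logdet

    Theta = np.eye(p)
    f = objective(Theta)
    for _ in range(n_iter):
        grad = (Gamma - np.linalg.inv(Theta)) * free   # gradient on free entries
        t = step
        cand = Theta - t * grad
        while objective(cand) >= f and t > 1e-12:
            t /= 2.0                        # halve until we descend (or give up)
            cand = Theta - t * grad
        f_new = objective(cand)
        if f_new >= f:
            break                           # no descent left: converged
        Theta, f = cand, f_new
    return Theta

# 3-node star: node 0 is the hub, so (1, 2) is a non-edge (in D).
Gamma = np.array([[1.0, 0.5, 0.5],
                  [0.5, 1.0, 0.25],
                  [0.5, 0.25, 1.0]])
Theta_hat = constrained_mle(Gamma, E=[(0, 1), (0, 2)])
Sigma_hat = np.linalg.inv(Theta_hat)
```

Since the zero entries are never updated, Θ̂[1, 2] stays exactly 0, while Σ̂ completes Γ on the constrained pattern.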
SLIDE 31
Set of assumptions
Let c, C be some absolute constants.
(A0) The size of the neighborhood for each node i ∈ V is bounded by an integer s < p, and the sample size satisfies n ≥ Cs log(cp/s).
(A1) The dimension and the number of sufficiently strong edges S0,n satisfy: p = o(e^{cn}) for some 0 < c < 1 and S0,n = o(n/log max(n, p)) as n → ∞.
(A2) The minimal and maximal eigenvalues of Σ0 are bounded, and Σ0,ii = 1 for all i.
SLIDE 32 The main theorem: selection
Assume that (A0) and (A2) hold. Let λ = √(2 log p/n). Let d, C, D depend on sparse and restrictive eigenvalues of Σ0. Let λn = dλ and τ = Dλn be appropriately chosen. Denote the estimated edge set by E = En(λn, τ). Then, with high probability, |E| ≤ 2S0,n, where |E \ E0| ≤ S0,n and
‖Θ0,D‖_F ≤ Cλn √(min{S0,n max_{i=1,...,p} θ²0,ii, s0 ‖diag(Θ0)‖²_F}),
where s0 = max_i s^i_{0,n} denotes the maximum "essential node degree".
SLIDE 33 Example: p = 128, s = 12, ρ = 0.5
λn = 2√(2 log p/n), τ = f√(2 log p/n)
[Figure: two panels plotting FPR and FNR against n (60 to 120) for the Lasso alone and for thresholding with f = 0.30, 0.35, 0.40.]
SLIDE 34 The main theorem: estimation
Assume that, in addition, (A1) holds. Then for Θ̂n and Σ̂n = (Θ̂n)^{-1}:
‖Θ̂n − Θ0‖_F = OP(√(S0,n log max(n, p)/n))
‖Σ̂n − Σ0‖_F = OP(√(S0,n log max(n, p)/n))
R(Θ̂n) − R(Θ0) = OP(S0,n log max(n, p)/n)
So ‖Θ̂n − Θ0‖_2, ‖Σ̂n − Σ0‖_2 = OP(√(S0,n log max(n, p)/n))
SLIDE 35 Obtaining an edge set E
Let Si = {j : j ≠ i, β^i_j ≠ 0} and si = |Si|. Let D, λn, C be the same as in the main theorem. For each of the nodewise regressions, we apply the same thresholding rule to obtain a subset Ii as follows:
Ii = {j : j ≠ i, |β̂^i_{j,init}| ≥ τ},
and Di := {1, ..., i − 1, i + 1, ..., p} \ Ii. Then, with high probability,
|Ii| ≤ 2s^i_0 and |Ii ∪ Si| ≤ |Si| + s^i_0,
and the coefficients dropped into D are small, so that the bound on ‖Θ0,D‖_F of the main theorem holds.
Proof follows from results in Z10 on the thresholded Lasso estimator.
SLIDE 36 Oracle inequalities for the Lasso
Theorem (Z10). Under (A0) and (A2), for all nodewise regressions, the Lasso estimator achieves squared ℓ2 loss of OP(s0σ² log p/n).
[Figure: as on Slide 15 — ordered coefficient magnitudes against the thresholds σ√(2 log p/n), σ√(log p/n), and σ/√n, marking s0, 2s0, and s; p = 512, n = 500, s = 96, σ = 1, λn = √(log p/n).]
SLIDE 37 Constructing a pivot point
Now clearly, by the "OR" rule, we have E = {(i, j) : j ∈ Ii, i = 1, ..., p} and |E| ≤ Σ_{i=1}^p 2s^i_0 = 2S0.
Given a 2S0-sparse set of edges E, define a sparse approximation Θ̃0 of Θ0 which is identical to Θ0 on E and the diagonal, and zero elsewhere:
Θ̃0 = diag(Θ0) + Θ0,E = diag(Θ0) + Θ0,E∩E0
‖Θ̃0‖_0 = p + 2|E ∩ E0| ≤ p + 4S0, with (s + 1)-sparse row (column) vectors.
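Constructing the pivot from Θ0 and a candidate edge set is mechanical; a small NumPy sketch with a toy precision matrix. Note how a spurious edge in E contributes nothing, since Θ0 is zero there — which is why Θ0,E = Θ0,E∩E0 above:

```python
import numpy as np

def sparse_pivot(Theta0, E):
    """Theta_tilde: identical to Theta0 on the edges E and the
    diagonal, zero everywhere else."""
    T = np.diag(np.diag(Theta0))
    for i, j in E:
        T[i, j] = Theta0[i, j]
        T[j, i] = Theta0[j, i]
    return T

Theta0 = np.array([[ 2.0, -0.5,  0.0],
                   [-0.5,  2.0, -0.3],
                   [ 0.0, -0.3,  2.0]])
# E keeps the true edge (0, 1) and adds the spurious edge (0, 2):
T_tilde = sparse_pivot(Theta0, E=[(0, 1), (0, 2)])
```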
SLIDE 38
Θ̃0 as a sparse approximation
The bias is small:
‖Θ̃0 − Θ0‖_F ≤ C max_{i=1,...,p}(θ0,ii) √S0,n λn
For q = 1, 2, ∞:
‖Θ̃0 − Θ0‖_q ≤ C max_{i=1,...,p}(θ0,ii) √s0 λn √s
Note that each row vector would be (2s0 + 1)-sparse if we applied the "AND" rule, however at the cost of a larger bias.
SLIDE 39
The sparsity and the small bias allow us to bound
‖Θ̂n(E) − Θ̃0‖_F = OP(√(S0,n log max(n, p)/n)),
where we use the fact that both the estimator Θ̂n(E) and the pivot Θ̃0 are sparse. By the triangle inequality, we conclude that
‖Θ̂n(E) − Θ0‖_F ≤ ‖Θ̂n(E) − Θ̃0‖_F + ‖Θ̃0 − Θ0‖_F = OP(√(S0,n log max(n, p)/n))
SLIDE 40 Generalization on the estimation step
Assume that (A1) and (A2) hold. Let σ²max := maxi Σ0,ii < ∞ and σ²min := mini Σ0,ii > 0. Let W = diag(Σ0)^{1/2}. Suppose that we obtain an edge set E such that |E| = lin(S0,n) is a linear function in S0,n. For Θ̃0 = diag(Θ0) + Θ0,E,
‖Θ̃0 − Θ0‖_F ≤ C √(2S0,n log(p)/n)
We note that this is equivalent to assuming ‖Ω̃0 − Ω0‖_F ≤ C′ √(2S0,n log(p)/n), where Ω0 = WΘ0W and Ω̃0 = W Θ̃0 W.
SLIDE 41 Generalization on the estimation step
Theorem. Suppose the sample size satisfies, for a sufficiently large constant M, n > M S0,n log max(n, p). Then
‖Ω̂n(E) − Ω̃0‖_F = OP(√(2S0,n log max(n, p)/n)),
where Ω̂n(E) is the maximum likelihood estimator based on the sample correlation matrix Γ̂n:
Ω̂n(E) = argmin_{Ω∈Mp,E} tr(Ω Γ̂n) − log |Ω|
SLIDE 42 Generalization on the estimation step
Given Ŵ = diag(Ŝn)^{1/2} and Ω̂n(E), compute
Θ̂n = Ŵ^{-1} Ω̂n(E) Ŵ^{-1} and Σ̂n = Ŵ (Ω̂n(E))^{-1} Ŵ,
for which the following hold:
Σ̂n,ij = Ŝn,ij, ∀(i, j) ∈ E ∪ {(i, i) : i = 1, ..., p}, and
Θ̂n,ij = 0, ∀(i, j) ∈ D
Following the bound on Ω̂n(E) and arguments in Rothman et al. (2008),
‖Θ̂n − Θ0‖_F = OP(√(S0,n log max(n, p)/n))
SLIDE 43 Generalization on the estimation error
For the Frobenius norm and the risk to converge to zero, (A1) is to be replaced by: p ≍ n^c for some 0 < c < 1 and p + S0,n = o(n/log max(n, p)) as n → ∞. In this case, we have
‖Θ̂n − Θ0‖_F = OP(√((p + S0,n) log max(n, p)/n))
‖Σ̂n − Σ0‖_F = OP(√((p + S0,n) log max(n, p)/n))
R(Θ̂n) − R(Θ0) = OP((p + S0,n) log max(n, p)/n)
We could achieve these rates with
Θ̂n(E) = argmin_{Θ∈Mp,E} tr(Θ Ŝn) − log |Θ|
SLIDE 44
Conclusion
Gelato separates the tasks of model selection and (inverse) covariance estimation.
Thresholding plays a key role in obtaining a sparse approximation of the graph with a small bias using a very small amount of sample.
With stronger conditions on the sample size, convergence rates in terms of operator and Frobenius norms, and in KL divergence, are established.
The method is feasible in high dimensions: p > n is allowed.
SLIDE 45
Related work on inverse/covariance estimation
Regression-based selection/estimation: Meinshausen-Bühlmann 06, Peng et al. 09, Yuan 10, Verzelen 10, Cai-Liu-Luo 11
Penalized likelihood methods based on the ℓ1-norm penalty: Yuan-Lin 07, d'Aspremont-Banerjee-El Ghaoui 08, Friedman-Hastie-Tibshirani 07, Rothman et al. 08, Zhou-Lafferty-Wasserman 08, Ravikumar et al. 08
Nonconvex: Lam-Fan 09
Sparse covariance selection/estimation: Bickel and Levina 06, 08; El Karoui 08; Levina and Vershynin 10; and more...
SLIDE 46
References
RUDELSON, M. and ZHOU, S. (2011). Reconstruction from anisotropic random measurements. arXiv:1106.1151.
ZHOU, S. (2009). Restricted eigenvalue conditions on subgaussian random matrices. arXiv:0904.4723v2.
ZHOU, S. (2009). Thresholding procedures for high dimensional variable selection and statistical estimation. In Advances in Neural Information Processing Systems 22. MIT Press.
ZHOU, S. (2010). Thresholded Lasso for high dimensional variable selection and statistical estimation. arXiv:1002.1583.
ZHOU, S., RÜTIMANN, P., XU, M. and BÜHLMANN, P. (2011). High-dimensional covariance estimation based on Gaussian graphical models. arXiv:1009.0530v2.