Edgeworth and confidence interval correction in spiked PCA
Iain Johnstone & Jeha Yang
Statistics & Biomedical Data Science, Stanford & Two Sigma
Shanghai, December 10, 2019
Quadeer et al., PLOS Comp. Bio. 2018
[David Morales, Matt McKay]
2nd eigenvalue
[Figure: histogram of the sample eigenvalue, 2nd leading eigenvalue. c = 0.2, N = 300, N1 = 10, simple spikes = [2.8, 1.9], deg. spikes = [0.8, 0.9]; mean = 2.305 [2.322], std = 0.053 [0.054] (0.89 × Paul).]
[Figure: histogram of the sample eigenvalue, 2nd leading eigenvalue. c = 0.2, N = 300, N1 = 30, simple spikes = [6.8, 3.9], deg. spikes = [0.8, 0.9]; mean = 4.145 [4.169], std = 0.126 [0.125] (0.89 × Paul).]
ρ1 = 0.2; ρ2 = 0.1. The theoretical variance is quite accurate, but there appears to be a shift in the mean (similar to what we have seen before in the eigenvector projections of the sample covariance when spikes were close to each other).
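The bracketed values in the two histograms can be checked against the standard first-order limits for a supercritical spike. The sketch below is a reading of the slide, not the authors' code. It assumes that the bracketed mean is ρ(ℓ) = ℓ(1 + c/(ℓ − 1)), that "Paul" refers to the Gaussian-data standard deviation sqrt(2ℓ²(1 − c/(ℓ − 1)²)/n), that N is the dimension so that n = N/c = 1500, and that the second simple spike (1.9 and 3.9) drives the 2nd eigenvalue. The factor 0.89 is copied from the slide; its origin is not stated there. The names `rho_limit` and `paul_std` are hypothetical.

```python
import math

def rho_limit(ell, gamma):
    """First-order limit of a supercritical sample eigenvalue: ell * (1 + gamma / (ell - 1))."""
    return ell * (1.0 + gamma / (ell - 1.0))

def paul_std(ell, gamma, n):
    """Gaussian-data asymptotic std: sqrt(2 * ell^2 * (1 - gamma / (ell - 1)^2) / n)."""
    return math.sqrt(2.0 * ell ** 2 * (1.0 - gamma / (ell - 1.0) ** 2) / n)

c, N = 0.2, 300
n = round(N / c)  # ASSUMPTION: N is the dimension and c = N / n, hence n = 1500

# Second simple spike of each panel -> bracketed (mean, std) printed on the slide.
panels = {1.9: (2.322, 0.054), 3.9: (4.169, 0.125)}

checks = {}
for ell, (mean_slide, std_slide) in panels.items():
    mean_theory = rho_limit(ell, c)
    std_theory = 0.89 * paul_std(ell, c, n)  # 0.89 taken from the slide as given
    checks[ell] = (mean_theory, std_theory)
    print(f"ell={ell}: mean {mean_theory:.3f} (slide {mean_slide}), "
          f"std {std_theory:.3f} (slide {std_slide})")
```

Under these assumptions the formulas reproduce the bracketed numbers to the printed precision, which suggests that the observed shift (2.305 against 2.322, 4.145 against 4.169) is a deviation of the simulated mean from the first-order limit.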
i.i.d.
[Baik–Ben Arous–Péché, 05]
[Figure: sample eigenvalue $\lambda$ against population spike $\ell$. Critical point: $\ell = 1 + \sqrt{\gamma}$, $\lambda = (1 + \sqrt{\gamma})^2$. Below the critical point: Tracy-Widom fluctuation, $n^{-2/3}$. Above the critical point: Gaussian fluctuation, $n^{-1/2}$, with bias.]
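[Figure: first-order location of the leading sample eigenvalue as a function of $\ell$: the bulk edge $(1 + \sqrt{\gamma})^2$ up to the critical point $1 + \sqrt{\gamma}$, and $\rho(\ell) = \ell\,(1 + \gamma/(\ell - 1))$ beyond it. Label: bias (bulk).]
Statistical physics literature, 94-; Baik–Ben Arous–Péché (05); Paul (07); Baik–Silverstein (06); Bloemendal–Virág (11); Mo (11); Wang (12); Benaych-Georges–Guionnet–Maïda (11)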
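A minimal simulation sketch of the transition, assuming a single-spike Gaussian model with population covariance diag(ℓ, 1, …, 1). The values p = 100, n = 400, the two spikes 1.2 and 3.0, and the function names are illustrative choices, not taken from the slides. With γ = p/n = 0.25 the critical point is 1.5, so ℓ = 1.2 is subcritical and ℓ = 3.0 is supercritical.

```python
import numpy as np

def top_sample_eigenvalue(ell, p, n, rng):
    """Largest eigenvalue of S = X'X / n for rows X_i ~ N(0, diag(ell, 1, ..., 1))."""
    X = rng.standard_normal((n, p))
    X[:, 0] *= np.sqrt(ell)            # the first coordinate carries the spike
    S = X.T @ X / n
    return np.linalg.eigvalsh(S)[-1]   # eigvalsh returns eigenvalues in ascending order

def first_order_limit(ell, gamma):
    """Bulk edge below the critical point, rho(ell) = ell * (1 + gamma / (ell - 1)) above it."""
    if ell <= 1.0 + np.sqrt(gamma):
        return float((1.0 + np.sqrt(gamma)) ** 2)
    return float(ell * (1.0 + gamma / (ell - 1.0)))

rng = np.random.default_rng(0)
p, n = 100, 400
gamma = p / n

results = {}
for ell in (1.2, 3.0):
    draws = [top_sample_eigenvalue(ell, p, n, rng) for _ in range(50)]
    results[ell] = (float(np.mean(draws)), first_order_limit(ell, gamma))
    print(f"ell={ell}: simulated mean {results[ell][0]:.3f}, limit {results[ell][1]:.3f}")
```

The subcritical case should sit near the bulk edge (1 + √γ)² = 2.25 regardless of ℓ, while the supercritical case should sit near ρ(3.0) = 3.375, with a much larger spread across replications.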
Shi (2013)
[Figure: density of $\hat z_{kn}$ compared with the Normal density, four panels: $(n, \gamma_n, \ell) = (400, 1, (2.7))$, $\hat z_{1n}$; $(400, 1, (2.7, 2.2))$, $\hat z_{1n}$; $(400, 1, (3.2, 2.7))$, $\hat z_{2n}$; $(400, 1, (2.7, 2.4))$, $\hat z_{1n}$.]
Petrov, 1975, Hall, 1992
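As a reminder of what a one-term Edgeworth correction does in the classical setting of these references, here is a small sketch for the standardized mean of i.i.d. variables with skewness κ3, using the textbook form Φ(x) − n^{-1/2} κ3 (x² − 1)/6 · φ(x). The example (means of Exp(1) variables, n = 20) is an illustrative choice, not from the slides; it is used because the exact distribution is Erlang and can be computed in closed form.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def edgeworth_mean_cdf(x, n, skew):
    """One-term Edgeworth approximation for a standardized mean with skewness `skew`."""
    return Phi(x) - skew * (x * x - 1.0) / (6.0 * math.sqrt(n)) * phi(x)

def erlang_cdf(t, n):
    """Exact P(G <= t) for G ~ Gamma(shape n, rate 1) with integer n."""
    if t <= 0.0:
        return 0.0
    term, total = 1.0, 1.0            # k = 0 term of sum_{k<n} t^k / k!
    for k in range(1, n):
        term *= t / k
        total += term
    return 1.0 - math.exp(-t) * total

n, skew = 20, 2.0                     # Exp(1): mean 1, sd 1, skewness 2
errors = {}
for x in (-2.0, 0.0, 2.0):
    exact = erlang_cdf(n + x * math.sqrt(n), n)   # P(sqrt(n) * (mean - 1) <= x)
    errors[x] = (abs(Phi(x) - exact), abs(edgeworth_mean_cdf(x, n, skew) - exact))
    print(f"x={x:+.1f}: normal error {errors[x][0]:.4f}, Edgeworth error {errors[x][1]:.4f}")
```

At x = ±1 the correction term vanishes, so the comparison is made at other points.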
[Equation residue; surviving terms: $(\cdots + \gamma_n)$ and $(\cdots - \gamma_n)^{3/2}$.]
Muirhead-Chikuse (1975)
[Equation residue; surviving terms: $(\cdots + \gamma)^2$, $(\cdots - \gamma)^3$, bound $\le 0.2$.]
[Figure: density of $\hat\ell$ with Edgeworth and Normal approximations and the upper edge marked, four panels: $(n, \gamma, \ell\text{-factor}) = (50, 0.1, 0.3)$, $(50, 1, 0.3)$, $(100, 0.1, 0.3)$, $(100, 1, 0.3)$.]
[Figure: density of $\hat z_{kn}$ compared with the Normal density, three panels: $(n, \gamma_n, \ell) = (400, 1, (2.7, 2.2))$, $\hat z_{1n}$; $(400, 1, (2.7, 2.4))$, $\hat z_{1n}$; $(400, 1, (3.2, 2.7))$, $\hat z_{2n}$.]
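[Equation residue; surviving terms, all indexed by $k$: $(\cdots + \gamma_n)$, $(\cdots - \gamma_n)^{3/2}$, $(\cdots - \gamma)^{1/2}$, $(\cdots - \gamma_n)^{1/2}$.]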
Figure: Density of $\hat z_{kn}$ associated with $\ell_k = 2.7$, with Normal and Edgeworth approximations. Panels: $(n, \gamma_n, \ell) = (400, 1, (2.7))$, $\hat z_{1n}$; $(400, 1, (2.7, 2.2))$, $\hat z_{1n}$; $(400, 1, (3.2, 2.7))$, $\hat z_{2n}$; $(400, 1, (2.7, 2.4))$, $\hat z_{1n}$.
[Figure: density of $\hat\rho$ with Normal and Edgeworth approximations, two panels: $(n, \gamma_n, \ell) = (400, 0.5, (2.7, 2.2))$ and $(400, 0.5, (3.2, 2.7, 2.2))$.]
Blue, red, green vertical lines correspond to ρ1n, ρ2n, ρ3n, respectively.
[Equation residue; surviving terms: $\cdots = U \Lambda U'$; (Marcenko-Pastur-Bai-Silverstein); $1 + \ell_j m$, $1 + m$, $(\cdots - \gamma)$.]
[Equation residue; surviving terms: $(\cdots - \gamma_n)^2$, $\cdots + \text{h.o.t.}$]
$\cdots = \kappa_j F_{\gamma_n}(g^j_{kn}) + O(n^{-1/2})$
$\cdots + O(n^{-1/2}) = \alpha_2(h_k, \gamma_n) + O(n^{-1/2})$
$E_{kn}(x, \ell) = \Phi(x) + n^{-1/2}\, p_k(x; \ell, \gamma_n)\, \phi(x)$
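One way to use an approximation of the form Φ(x) + n^{-1/2} p(x) φ(x) for interval correction is to replace the normal quantile by the point where the corrected function reaches the target level. The sketch below does this by bisection. The polynomial `placeholder_poly` and its coefficients a = 0.5, b = 0.2 are hypothetical stand-ins, since the actual p_k(x; ℓ, γn) is not recoverable from this text, and all function names are illustrative.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def edgeworth_cdf(x, n, poly):
    """One-term Edgeworth-type approximation Phi(x) + n^{-1/2} * poly(x) * phi(x)."""
    return Phi(x) + poly(x) * phi(x) / math.sqrt(n)

def edgeworth_quantile(alpha, n, poly, lo=-8.0, hi=8.0, tol=1e-10):
    """Bisection for a root of edgeworth_cdf(x) = alpha on [lo, hi]."""
    f = lambda x: edgeworth_cdf(x, n, poly) - alpha
    if f(lo) > 0 or f(hi) < 0:
        raise ValueError("alpha is not bracketed on [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def placeholder_poly(x, a=0.5, b=0.2):
    """HYPOTHETICAL quadratic a * (1 - x^2) - b; not the p_k of the slides."""
    return a * (1.0 - x * x) - b

n = 400
q_normal = 1.6448536269514722          # standard normal 0.95 quantile
q_edge = edgeworth_quantile(0.95, n, placeholder_poly)
print(f"normal quantile {q_normal:.4f}, corrected quantile {q_edge:.4f}")
```

The corrected function is not a genuine distribution function: it need not be monotone and can leave [0, 1] in the tails, so bisection returns one root inside the bracket rather than a guaranteed unique quantile.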
[Equation residue; surviving terms: $E_{kn}(z_n(\theta_n, \ell_k), \cdots)$.]
[Figure: K-S distance and coverage of the upper and lower confidence bounds, plotted against the value of the other $h_j$. Panels: $h_k = 1.5$ (other $h_j$: None, 2.5, 2.0); $h_k = 2$ (other $h_j$: None, 1.5, 2.5); $h_k = 2.5$ (other $h_j$: None, 1.5, 2.0). Axis ticks: K-S 0.00 to 0.08; coverage 0.94 to 0.98. Label: selec_E positive.]
E < 0 with prob > 5% in tough cases:
r
k + Z
Reference: (single spike) Yang & J., Statistica Sinica 2018. (multispike) in preparation.