Spectral Methods for Latent Variable Models
Kaizheng Wang, Department of ORFE, Princeton University. March 20th, 2020.
Data Diversity
Unstructured, heterogeneous and incomplete information:
Credit: https://www.mathworks.com/help/textanalytics/gs/getting-started-with-topic-modeling.html, https://www.alliance-scotland.org.uk/alliance-homepage-holding-people-networking-2017-01-3/, https://medicalxpress.com/news/2015-04-tumor-only-genetic-sequencing-misguide-cancer.html, https://www.nature.com/articles/nature21386/figures/1, https://viterbi-web.usc.edu/~soltanol/RSC.pdf, Dzenan Hamzic.
Credit (upper right): https://viterbi-web.usc.edu/~soltanol/RSC.pdf.
Object-by-feature matrix ($n \times d$);
Object-by-object matrix ($n \times n$).
Common belief: high ambient dimension but low intrinsic dimension. Low-rank approximation: truncated SVD.

Principal Component Analysis (PCA)
Low-dimensional embedding via latent variables:
$$
\underbrace{\begin{pmatrix} X_1^\top \\ \vdots \\ X_n^\top \end{pmatrix}}_{n\times d\ \text{samples}}
= \underbrace{\begin{pmatrix} f_1^\top \\ \vdots \\ f_n^\top \end{pmatrix}}_{n\times r\ \text{latent coordinates}}
\underbrace{B}_{r\times d\ \text{latent bases}}
+ \underbrace{\begin{pmatrix} E_1^\top \\ \vdots \\ E_n^\top \end{pmatrix}}_{n\times d\ \text{noises}}.
$$
Novembre et al. (2008), Nature. n = 1387 individuals and d = 197146 SNPs; Figure 1a: 2-dim. embedding vs. labels.
Data: $\{X_i\}_{i=1}^n \subseteq \mathbb{R}^d$ i.i.d., $\mathbb{E}X_i = 0$, $\mathbb{E}(X_i X_i^\top) = \Sigma$.
Goal: estimate the principal subspace spanned by the $K$ leading eigenvectors of $\Sigma$.
PCA: $X = (X_1, \cdots, X_n)^\top \xrightarrow{\text{SVD}} \hat U = (\hat u_1, \cdots, \hat u_K) \in \mathbb{R}^{d\times K}$.
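For concreteness, a minimal NumPy sketch of this step (function and variable names are mine, and the data below is a random placeholder rather than the SNP example):

```python
import numpy as np

def pca_embedding(X, K):
    """Top-K PCA via truncated SVD of the centered data matrix X (n x d)."""
    Xc = X - X.mean(axis=0)                  # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    U_hat = Vt[:K].T                         # d x K: K leading eigenvectors of the sample covariance
    return U_hat, Xc @ U_hat                 # basis and n x K sample embedding

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 50))          # placeholder data
U_hat, scores = pca_embedding(X, K=2)        # 2-dim embedding, as in the SNP example
```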
Distributed PCA

[Diagram: local datasets Data 1, ..., Data 5, each connected to a Center node.]

$m$ local machines in total, each has $n$ samples. Machine $\ell$ computes $X^{(\ell)} \in \mathbb{R}^{n\times d} \xrightarrow{\text{SVD}} \hat U^{(\ell)} \in \mathbb{R}^{d\times K}$ and sends $\hat U^{(\ell)}$ to the central server.

Related works: McDonald et al. (2009); Zhang et al. (2013); Lee et al. (2015); Battey et al. (2015); Qu et al. (2002); El Karoui and d'Aspremont (2010); Liang et al. (2014).

How to find $\hat U \in \mathcal{O}_{d\times K}$ that best summarizes $\{\hat U^{(\ell)}\}_{\ell=1}^m$? Measure the subspace distance by $\rho(V, W) = \|VV^\top - WW^\top\|_F$ and take
$$
\hat U = \operatorname*{argmin}_{V \in \mathcal{O}_{d\times K}} \sum_{\ell=1}^m \rho^2(V, \hat U^{(\ell)}).
$$
The solution is computed by SVD: $(\hat U^{(1)}, \cdots, \hat U^{(m)}) \in \mathbb{R}^{d\times mK} \xrightarrow{\text{SVD}} \hat U \in \mathcal{O}_{d\times K}$.
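A minimal sketch of this one-shot scheme, under the assumption that each machine centers its own data; `local_pca` and `aggregate` are hypothetical names. The server-side step uses the equivalent formulation via the averaged projection matrix (made explicit below):

```python
import numpy as np

def local_pca(X_l, K):
    """Machine-side step: top-K eigenvectors of the local sample covariance."""
    _, _, Vt = np.linalg.svd(X_l - X_l.mean(axis=0), full_matrices=False)
    return Vt[:K].T                          # d x K orthonormal

def aggregate(U_list):
    """Server-side step: top-K eigenvectors of (1/m) * sum_l U_l U_l^T,
    equivalently the K leading left singular vectors of (U^(1), ..., U^(m))."""
    K = U_list[0].shape[1]
    P_bar = sum(U @ U.T for U in U_list) / len(U_list)
    _, vecs = np.linalg.eigh(P_bar)          # eigenvalues in ascending order
    return vecs[:, -K:]                      # K leading eigenvectors

rng = np.random.default_rng(1)
machines = [rng.standard_normal((200, 30)) for _ in range(5)]  # m=5, n=200, d=30
U_hat = aggregate([local_pca(X_l, K=3) for X_l in machines])
```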
Assume that $X_i$ is sub-Gaussian, i.e. $\|\Sigma^{-1/2} X_i\|_{\psi_2} \lesssim 1$. Define the effective rank and condition number as
$$
\tilde r = \mathrm{Tr}(\Sigma)/\lambda_1, \qquad \kappa = \lambda_1/(\lambda_K - \lambda_{K+1}).
$$

Theorem (FWWZ, AoS 2019)
There exists a constant $C$ such that when $n \ge C\kappa^2\sqrt{K}\,\tilde r$,
$$
\|\hat U\hat U^\top - UU^\top\|_F \lesssim \underbrace{\kappa\sqrt{\frac{K\tilde r}{mn}}}_{\text{variance}} + \underbrace{\kappa^2\frac{\sqrt{K}\,\tilde r}{n}}_{\text{bias}}.
$$
In particular, when $m \lesssim n/(\kappa^2\tilde r)$, the bias term is dominated by the variance term.
$\hat U$: top-$K$ eigenvectors of $\frac{1}{m}\sum_{\ell=1}^m \hat U^{(\ell)}\hat U^{(\ell)\top}$. Averaging reduces variance but retains bias.

[Diagram: $X^{(1)} \in \mathbb{R}^{n\times d} \xrightarrow{\text{SVD}} \hat U^{(1)} \in \mathcal{O}_{d\times K}$, $\ldots$, $X^{(m)} \in \mathbb{R}^{n\times d} \xrightarrow{\text{SVD}} \hat U^{(m)} \in \mathcal{O}_{d\times K}$, aggregated into $\hat U \in \mathcal{O}_{d\times K}$.]

Davis-Kahan gives $\|\hat U^{(\ell)}\hat U^{(\ell)\top} - UU^\top\|_F \lesssim \|(\hat\Sigma^{(\ell)} - \Sigma)U\|_F/\Delta$, where $\Delta = \lambda_K - \lambda_{K+1}$.
Theorem (FWWZ, AoS 2019)
$$
\big\|\hat U^{(\ell)}\hat U^{(\ell)\top} - [UU^\top + f(\hat\Sigma^{(\ell)} - \Sigma)]\big\|_F \lesssim \big[\|(\hat\Sigma^{(\ell)} - \Sigma)U\|_F/\Delta\big]^2,
$$
where $f: \mathbb{R}^{d\times d} \to \mathbb{R}^{d\times d}$ is a linear functional determined by $\Sigma$.

More precise than Davis-Kahan: $\|\hat U^{(\ell)}\hat U^{(\ell)\top} - UU^\top\|_F \lesssim \|(\hat\Sigma^{(\ell)} - \Sigma)U\|_F/\Delta$. Since $f$ is linear and $\mathbb{E}\hat\Sigma^{(\ell)} = \Sigma$, PCA has small bias:
$$
\|\mathbb{E}(\hat U^{(\ell)}\hat U^{(\ell)\top}) - UU^\top\|_F \lesssim \big[\|(\hat\Sigma^{(\ell)} - \Sigma)U\|_F/\Delta\big]^2.
$$

Theoretical guarantees for distributed PCA.

Paper (alphabetical order): Fan, Wang, Wang and Zhu. Distributed estimation of principal eigenspaces. The Annals of Statistics, 2019.
Recall: Novembre et al. (2008), Nature; 2-dim. embedding of SNP data.
General recipe of spectral methods:
1. Construct a similarity matrix, e.g. Gram matrix $XX^\top$ or adjacency matrix $A$;
2. Get $r$ eigen-pairs $\{(\lambda_j, u_j)\}_{j=1}^r$;
3. Embed the objects, e.g. using the rows of $(u_1, u_2, \ldots, u_r)$;
4. Run downstream analyses, e.g. visualization.

Ext.: {robust, probabilistic, sparse, nonnegative} PCA.

Pearson (1901), Hotelling (1933), Schölkopf (1997), Tipping and Bishop (1999), Shi and Malik (2000), Ng et al. (2002), Belkin and Niyogi (2003), Von Luxburg (2007).
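The four-step recipe above fits in a few lines of NumPy; the sketch below is my own illustrative version and is agnostic to whether the similarity matrix is a Gram matrix or an adjacency matrix:

```python
import numpy as np

def spectral_embedding(S, r):
    """Embed objects via rows of the r leading eigenvectors of symmetric S."""
    vals, vecs = np.linalg.eigh(S)                 # ascending eigenvalues
    idx = np.argsort(np.abs(vals))[::-1][:r]       # r eigen-pairs, largest in magnitude
    return vecs[:, idx]                            # n x r: one row per object

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 20))
emb = spectral_embedding(X @ X.T, r=2)             # Gram matrix -> 2-dim visualization
```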
Community detection in networks. [Network figure; credit: Yuxin Chen.]

Stochastic Block Model (Holland et al., 1983)

Symmetric adjacency matrix $A \in \{0,1\}^{n\times n}$ with communities $|J| = |J^c| = \frac{n}{2}$:
$$
\mathbb{P}(A_{ij} = 1) = \begin{cases} p, & \text{if } i, j \in J \text{ or } i, j \in J^c, \\ q, & \text{otherwise.} \end{cases}
$$

McSherry (2001); Coja-Oghlan (2006); Rohe et al. (2011); Mossel et al. (2013); Massoulie (2014); Lelarge et al. (2015); Chin et al. (2015); Abbe et al. (2016); Zhang and Zhou (2016).

$$
\mathbb{E}A = \begin{pmatrix} p\,\mathbf{1}_{J,J} & q\,\mathbf{1}_{J,J^c} \\ q\,\mathbf{1}_{J^c,J} & p\,\mathbf{1}_{J^c,J^c} \end{pmatrix}
= \frac{p+q}{2}\,\mathbf{1}\mathbf{1}^\top + \frac{p-q}{2}\begin{pmatrix} \mathbf{1}_J \\ -\mathbf{1}_{J^c} \end{pmatrix}\begin{pmatrix} \mathbf{1}_J^\top & -\mathbf{1}_{J^c}^\top \end{pmatrix}.
$$

The 2nd eigenvector $\bar u = \frac{1}{\sqrt n}(\mathbf{1}_J - \mathbf{1}_{J^c})$ of $\mathbb{E}A$ reveals $(J, J^c)$.

To recover $(J, J^c)$, we need $u \approx \bar u$ in a uniform (entrywise) way. Classical $\ell_2$ bounds (Davis and Kahan, 1970) are too loose!
Spectral method: $A \xrightarrow{\text{SVD}} u$, the 2nd eigenvector of $A$; output $\mathrm{sgn}(u)$.
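A small simulation of this method under the SBM above (the parameter values are mine, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, q = 1000, 0.10, 0.02                   # two communities of size n/2
labels = np.repeat([1, -1], n // 2)          # J and J^c

# symmetric adjacency matrix with P(A_ij = 1) = p within, q across
probs = np.where(np.outer(labels, labels) > 0, p, q)
upper = np.triu(rng.random((n, n)) < probs, k=1)
A = (upper | upper.T).astype(float)

vals, vecs = np.linalg.eigh(A)               # ascending eigenvalues
u = vecs[:, -2]                              # 2nd eigenvector (2nd largest eigenvalue)
pred = np.sign(u)

acc = max(np.mean(pred == labels), np.mean(pred == -labels))
print(f"community recovery accuracy: {acc:.3f}")
```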
Let $a \neq b$ and
$$
\mathbb{P}(A_{ij} = 1) = \begin{cases} \frac{a\log n}{n}, & \text{if } i, j \in J \text{ or } i, j \in J^c, \\ \frac{b\log n}{n}, & \text{otherwise.} \end{cases}
$$

Theorem (AFWZ, AoS 2020+)
If $(\sqrt a - \sqrt b)^2 > 2$, then with probability $1 - o(1)$, $\mathrm{sgn}(u)$ exactly recovers $(J, J^c)$; if $(\sqrt a - \sqrt b)^2 \le 2$, then exact recovery is impossible for any method. The threshold reflects the per-vertex error rate $n^{-(\sqrt a - \sqrt b)^2/2}$.

Key ingredients: $u = Au/\lambda \approx A\bar u/\bar\lambda$.

Theorem (Linear approximation)
$$
\mathbb{P}\big(\sqrt n\,\|u - A\bar u/\bar\lambda\|_\infty < \varepsilon_n\big) > 1 - n^{-3}.
$$

General results (AFWZ, AoS 2020+): $\min_{O \in \mathcal{O}_{r\times r}} \|UO - A\bar U\bar\Lambda^{-1}\|_{2,\infty} \ll \|\bar U\|_{2,\infty}$.
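The linear approximation $u \approx A\bar u/\bar\lambda$ can be probed numerically; the sketch below (constants mine) builds an SBM in the $\log n$-degree regime and reports $\sqrt n\,\|u - A\bar u/\bar\lambda\|_\infty$:

```python
import numpy as np

rng = np.random.default_rng(4)
n, a, b = 2000, 20.0, 2.0
labels = np.repeat([1.0, -1.0], n // 2)
p_in, p_out = a * np.log(n) / n, b * np.log(n) / n

probs = np.where(np.outer(labels, labels) > 0, p_in, p_out)
upper = np.triu(rng.random((n, n)) < probs, k=1)
A = (upper | upper.T).astype(float)

# population quantities: 2nd eigen-pair of EA
u_bar = labels / np.sqrt(n)                      # (1/sqrt(n)) (1_J - 1_{J^c})
lam_bar = (p_in - p_out) * n / 2                 # 2nd eigenvalue of EA

_, vecs = np.linalg.eigh(A)
u = vecs[:, -2]
u *= np.sign(u @ u_bar)                          # fix the global sign ambiguity

lin = A @ u_bar / lam_bar                        # first-order approximation of u
print("sqrt(n) * l_inf error:", np.sqrt(n) * np.max(np.abs(u - lin)))
```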
Merits and demerits of $\ell_\infty$ analysis
+ Characterizes individual objects precisely.
- Requires strong signals for uniform control: $\mathrm{SNR} \gtrsim \log n$.
Successor: $\ell_p$ analysis with $p < \infty$.
Heteroscedastic model

$\ell_i \in \{\pm 1\}$, $X_i \mid \ell_i \sim N(\ell_i\mu, \Sigma_i)$, $\Sigma_i \preceq \Sigma$.
Low-rank: $\mathbb{E}(X \mid \ell) = \ell\mu^\top$, where $X = (X_1, \cdots, X_n)^\top \in \mathbb{R}^{n\times d}$.

Related works: Montanari and Sun (2018); Ndaoud (2018); Cai et al. (2019).

Spectral method with hollowing: $G = \mathcal{H}(XX^\top) \in \mathbb{R}^{n\times n}$, where the hollowing operator $\mathcal{H}$ zeroes out the diagonal; $G \xrightarrow{\text{SVD}} u$, the 1st eigenvector; output $\mathrm{sgn}(u)$. Recall $\mathbb{E}(X \mid \ell) = \ell\mu^\top$.
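A minimal sketch of this procedure, assuming the hollowing operator $\mathcal{H}$ zeroes out the diagonal as defined above (the mixture parameters are mine):

```python
import numpy as np

def hollowed_spectral(X):
    """sgn of the 1st eigenvector of the hollowed Gram matrix H(X X^T)."""
    G = X @ X.T
    np.fill_diagonal(G, 0.0)             # hollowing: zero out the diagonal
    _, vecs = np.linalg.eigh(G)
    u = vecs[:, -1]                      # 1st (leading) eigenvector
    return np.sign(u), u

rng = np.random.default_rng(5)
n, d = 500, 2000                          # high-dimensional regime: d >> n
ell = rng.choice([1.0, -1.0], size=n)
mu = rng.standard_normal(d)
mu *= 3.0 * (d / n) ** 0.25 / np.linalg.norm(mu)   # signal above the (d/n)^{1/4} scale
X = np.outer(ell, mu) + rng.standard_normal((n, d))

pred, u = hollowed_spectral(X)
acc = max(np.mean(pred == ell), np.mean(pred == -ell))
print(f"label recovery accuracy: {acc:.3f}")
```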
Example: $X_i \sim \frac{1}{2}N(\mu, I_d) + \frac{1}{2}N(-\mu, I_d)$ with $d \gtrsim n$.

[Phase diagram in $(\log(d/n), \log\|\mu\|_2)$: the thresholds $\|\mu\|_2 \sim (d/n)^{1/4}$ and $\|\mu\|_2 \sim (d/n)^{1/2}$ separate the regimes where $\ell$ and $(\ell, \mu)$ are recoverable; the spectral method on $\mathcal{H}(XX^\top)$ succeeds down to the lower threshold.]
Let $u$ be the 1st eigenvector of $G = \mathcal{H}(XX^\top) \in \mathbb{R}^{n\times n}$, and let $(\bar\lambda, \bar u)$ be the 1st eigen-pair of $\bar X\bar X^\top$, where $\bar X = \ell\mu^\top$. Define
$$
\mathrm{SNR} = \frac{\|\mu\|_2^4}{\|\Sigma\|_2\|\mu\|_2^2 + \|\Sigma\|_F^2/n}.
$$

Theorem
If $2 \le p \lesssim \mathrm{SNR}$, there exists $\varepsilon_n \to 0$ s.t. with high probability,
$$
\|u - G\bar u/\bar\lambda\|_p < \varepsilon_n\|\bar u\|_p.
$$
When $p \asymp \mathrm{SNR} \gtrsim \log n$: $\|\cdot\|_p \asymp \|\cdot\|_\infty$ and
$$
\mathbb{P}\Big(\|u - G\bar u/\bar\lambda\|_\infty < \frac{\varepsilon_n}{\sqrt n}\Big) > 1 - \frac{1}{n^3}.
$$

Corollary (optimal clustering, $\Sigma = I_d$)
Take $p \asymp \mathrm{SNR}$. When $1 \ll \mathrm{SNR} \le 2\log n$, $\mathrm{sgn}(u)$ attains the optimal misclassification rate $e^{-\mathrm{SNR}/[2+o(1)]}$; when $\mathrm{SNR} > 2\log n$, it exactly recovers $\ell$.
Sharpness of spectral methods.
Papers (alphabetical order):
- Abbe, Fan, Wang and Zhong. Entrywise eigenvector analysis of random matrices with low expected rank. The Annals of Statistics, 2020+.
- ... contextual network analysis. Manuscript.

Extensions
Spectral methods are powerful but not omnipotent. Example: $X_i \sim \frac{1}{2}N(\mu, \Sigma) + \frac{1}{2}N(-\mu, \Sigma)$ has covariance $\mu\mu^\top + \Sigma$. When $\|\mu\|_2^2/\|\Sigma\|_2 \lesssim 1$ and $\Sigma \not\approx I$, the leading eigenvector of $\mu\mu^\top + \Sigma$ need not align with $\mu$ ($\neq$), so PCA projections can miss the cluster structure.

Seek clustering-friendly projections!
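A quick numerical illustration of this failure mode (all constants mine): stretching $\Sigma$ along a direction orthogonal to $\mu$ makes the top principal component align with the stretch rather than the signal.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 2000, 50
mu = np.zeros(d); mu[0] = 1.0                 # cluster mean along e_1
sigma = np.ones(d); sigma[1] = 3.0            # Sigma = diag(sigma^2), stretched along e_2

ell = rng.choice([1.0, -1.0], size=n)
X = np.outer(ell, mu) + rng.standard_normal((n, d)) * sigma

_, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
v1 = Vt[0]                                    # top principal direction
print("alignment with mu :", abs(v1[0]))      # near 0: PCA misses the signal
print("alignment with e_2:", abs(v1[1]))      # near 1: PCA picks the stretch
```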
70000 fashion products, 10 categories (Xiao et al. 2017). [Figure: visualization by PCA.]

Clustering via Uncoupled REgression (CURE): Wang, Yan and Diaz. Efficient clustering for stretched mixtures: landscape and optimality. Submitted.