Rotation Invariant Householder Parameterization for Bayesian PCA
Rajbir-Singh Nirwan, Nils Bertschinger
June 11, 2019
Outline
- Probabilistic PCA (PPCA)
- Non-identifiability issue of PPCA
- Conceptual solution to the problem
Probabilistic PCA (PPCA)
- Formulated as a projection from the data space Y to a lower-dimensional latent space X: $Y \in \mathbb{R}^{N \times D} \rightarrow X \in \mathbb{R}^{N \times Q}$
- The latent space maximizes the variance of the projected data and minimizes the reconstruction MSE
- Equivalently viewed as a generative model that maps the latent space X to the data space Y: $X \in \mathbb{R}^{N \times Q} \rightarrow Y \in \mathbb{R}^{N \times D}$
- $Y = X W^T + \epsilon$ with $X \sim \mathcal{N}(0, I)$ and $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$
- Marginalizing out X gives the likelihood $p(Y \mid W) = \prod_{n=1}^{N} \mathcal{N}(Y_{n,:} \mid 0,\, W W^T + \sigma^2 I)$
- Rotation invariance: $W R R^T W^T = W W^T$ for every $R$ with $R R^T = I$, so the likelihood only depends on $W W^T$ (a small numerical check follows below)
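The invariance can be verified numerically. A minimal sketch, assuming NumPy; the matrix sizes, rotation angle, and noise level are illustrative choices, not values from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
D, Q, sigma = 5, 2, 0.1

W = rng.standard_normal((D, Q))                    # arbitrary loading matrix
theta = 0.7                                        # arbitrary rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],     # 2x2 rotation, R @ R.T = I
              [np.sin(theta),  np.cos(theta)]])

C_W  = W @ W.T + sigma**2 * np.eye(D)              # marginal covariance from W
C_WR = (W @ R) @ (W @ R).T + sigma**2 * np.eye(D)  # marginal covariance from W R

print(np.allclose(C_W, C_WR))                      # True: the likelihood cannot distinguish W from W R
```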
Non-identifiability issue of PPCA
[Figure: posterior samples of W in the (W1, W2) plane]
- The posterior $p(W \mid Y) = p(Y \mid W)\, p(W) / p(Y)$ is rotation invariant as well (for a rotation-invariant prior $p(W)$)
- Posterior summaries of W are therefore meaningless, and interpretation of the latent space is almost impossible
[Figure: sampled values of W in the (W1, W2) plane]
Outline of procedure
- Decompose $W = U \Sigma V^T$ (SVD), so that $W W^T = (U \Sigma V^T)(U \Sigma V^T)^T = U \Sigma^2 U^T$
- Only $W W^T$ enters the likelihood, so $V$ can be fixed to $V = I$ and $W$ parameterized as $W = U \Sigma$ (checked numerically below)
- Choose the prior $p(U, \Sigma)$ such that the probabilistic model is not changed
- Sample from the posterior $p(U, \Sigma \mid Y)$
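A quick check of this reparameterization, as a sketch assuming NumPy (W is an arbitrary illustrative matrix): the thin SVD yields U and Σ that reproduce $W W^T$ without any reference to V.

```python
import numpy as np

rng = np.random.default_rng(1)
D, Q = 5, 2
W = rng.standard_normal((D, Q))

# Thin SVD: W = U diag(s) V^T with U in R^{DxQ} and s the singular values
U, s, Vt = np.linalg.svd(W, full_matrices=False)

# W W^T does not depend on V ...
print(np.allclose(W @ W.T, U @ np.diag(s**2) @ U.T))    # True

# ... so W can be re-parameterized as U Sigma with V = I
W_reparam = U @ np.diag(s)
print(np.allclose(W_reparam @ W_reparam.T, W @ W.T))    # True
```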
- To keep the model unchanged: under the standard prior $W \sim \mathcal{N}(0, I)$, the matrix $W W^T$ is Wishart distributed
- So we need priors $U \sim\, ?$ and $\Sigma \sim\, ?$ such that $U \Sigma \Sigma^T U^T$ is Wishart distributed
- $U \Sigma^2 U^T$ is the eigendecomposition of $W W^T$ → $U$ is the eigenvector matrix
- $U$ lives on the Stiefel manifold $\mathcal{W}_{Q,D} = \{U \in \mathbb{R}^{D \times Q} \mid U^T U = I\}$
- The eigenvectors of a Wishart matrix are distributed uniformly in the space of orthonormal matrices
- → $U$ must be uniformly distributed on the Stiefel manifold
- The joint density of the eigenvalues $\lambda$ of the Wishart matrix $W W^T$ is known analytically (James & Lee (2014)):
  $p(\lambda) = c\, e^{-\frac{1}{2}\sum_{q=1}^{Q}\lambda_q}\, \prod_{q=1}^{Q}\Big(\lambda_q^{\frac{D-Q-1}{2}} \prod_{q'=q+1}^{Q}(\lambda_q - \lambda_{q'})\Big)$
- The change of variables $\lambda_q = \sigma_q^2$ (Jacobian $\prod_{q=1}^{Q} 2\sigma_q$) gives the density of the singular values (an implementation sketch follows below):
  $p(\sigma_1, \ldots, \sigma_Q) = c\, e^{-\frac{1}{2}\sum_{q=1}^{Q}\sigma_q^2}\, \prod_{q=1}^{Q}\Big(\sigma_q^{D-Q-1} \prod_{q'=q+1}^{Q}(\sigma_q^2 - \sigma_{q'}^2)\Big)\, \prod_{q=1}^{Q} 2\sigma_q$
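A minimal sketch of this density, assuming NumPy; the normalization constant c is omitted and the function name is mine:

```python
import numpy as np

def log_p_sigma_unnormalized(sigma, D):
    """Unnormalized log density of the singular values sigma_1 > ... > sigma_Q
    of a D x Q matrix with i.i.d. standard normal entries (constant c omitted)."""
    sigma = np.asarray(sigma, dtype=float)
    Q = sigma.size
    log_p = -0.5 * np.sum(sigma**2)                  # exp(-1/2 sum_q sigma_q^2)
    log_p += (D - Q - 1) * np.sum(np.log(sigma))     # prod_q sigma_q^(D-Q-1)
    for q in range(Q):                               # prod_{q' > q} (sigma_q^2 - sigma_{q'}^2)
        for qp in range(q + 1, Q):
            log_p += np.log(sigma[q]**2 - sigma[qp]**2)
    log_p += np.sum(np.log(2.0 * sigma))             # Jacobian prod_q 2 sigma_q
    return log_p

print(log_p_sigma_unnormalized([3.0, 1.0], D=5))     # e.g. the values of the synthetic example below
```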
- Want: $U \sim$ uniform on the Stiefel manifold $\mathcal{W}_{Q,D}$, and $\Sigma \sim p(\Sigma)$ (easy, since we know the analytic expression for the density)
How to uniformly sample U on $\mathcal{W}_{Q,D}$ (Mezzadri (2007)):
- Householder reflections: $H_n(v_n) = \begin{pmatrix} I_{D-n} & 0 \\ 0 & \tilde{H}_n(v_n) \end{pmatrix}$ with $\tilde{H}_n(v_n) = -\operatorname{sgn}(v_{n1})\,(I - 2 u_n u_n^T)$ and
  $u_n = \dfrac{v_n + \operatorname{sgn}(v_{n1})\, \lVert v_n \rVert\, e_1}{\lVert v_n + \operatorname{sgn}(v_{n1})\, \lVert v_n \rVert\, e_1 \rVert}$
- For $n = D, \ldots, 1$: draw $v_n \sim$ uniform on the sphere $\mathbb{S}^{n-1}$
- Then $U = H_D(v_D)\, H_{D-1}(v_{D-1}) \cdots H_1(v_1)$ is uniformly (Haar) distributed; a sampling sketch follows below
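A minimal NumPy sketch of this construction (function names are mine, not from the talk's code); for the Stiefel manifold only the Q reflections that touch the first Q columns are needed, and drawing $v_n \sim \mathcal{N}(0, I)$ gives a direction that is uniform on the sphere:

```python
import numpy as np

def householder_matrix(v, D):
    """Embed the n-dimensional reflection H~_n(v_n) into a D x D matrix."""
    n = v.shape[0]
    sgn = np.sign(v[0]) if v[0] != 0 else 1.0
    e1 = np.zeros(n); e1[0] = 1.0
    u = v + sgn * np.linalg.norm(v) * e1
    u = u / np.linalg.norm(u)
    H_tilde = -sgn * (np.eye(n) - 2.0 * np.outer(u, u))
    H = np.eye(D)
    H[D - n:, D - n:] = H_tilde                  # block-diagonal: identity on top, reflection below
    return H

def sample_uniform_stiefel(D, Q, rng=None):
    """Draw U uniformly (Haar) on the Stiefel manifold {U in R^{DxQ} | U^T U = I}."""
    rng = rng if rng is not None else np.random.default_rng()
    U = np.eye(D)
    for n in range(D, D - Q, -1):                # n = D, D-1, ..., D-Q+1
        v = rng.standard_normal(n)               # direction of v_n is uniform on S^{n-1}
        U = U @ householder_matrix(v, D)         # accumulate H_D(v_D) H_{D-1}(v_{D-1}) ...
    return U[:, :Q]                              # first Q columns give the Stiefel element

U = sample_uniform_stiefel(D=5, Q=2)
print(np.allclose(U.T @ U, np.eye(2)))           # orthonormal columns -> True
```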
The full generative model for Bayesian PPCA:
- $v_D, \ldots, v_{D-Q+1} \sim \mathcal{N}(0, I)$ (only the direction of each $v_n$ matters, and a normalized Gaussian vector is uniform on the sphere)
- $\sigma \sim p(\sigma)$, $\mu \sim p(\mu)$, $\sigma_{\text{noise}} \sim p(\sigma_{\text{noise}})$
- $U = \prod_{q=1}^{Q} H_{D-q+1}(v_{D-q+1})$ (first $Q$ columns), $\Sigma = \operatorname{diag}(\sigma)$, $W = U \Sigma$
- $Y \sim \prod_{n=1}^{N} \mathcal{N}(Y_{n,:} \mid \mu,\, W W^T + \sigma_{\text{noise}}^2 I)$
Synthetic Dataset
$(N, D, Q) = (150, 5, 2)$
- $X \sim \mathcal{N}(0, I) \in \mathbb{R}^{N \times Q}$
- $U \sim$ uniform on the Stiefel manifold $\mathcal{W}_{Q,D}$, $\Sigma = \operatorname{diag}(\sigma_1, \sigma_2) = \operatorname{diag}(3.0, 1.0)$, $W = U \Sigma \in \mathbb{R}^{D \times Q}$
- $Y = X W^T + \epsilon$ with $\epsilon \sim \mathcal{N}(0, 0.01) \in \mathbb{R}^{N \times D}$ (reproduced in the sketch below)
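A sketch reproducing this synthetic dataset, assuming NumPy. Here U is drawn by QR-decomposing a Gaussian matrix and sign-correcting the columns, an equivalent Haar-uniform construction also described by Mezzadri (2007); the random seed is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(42)
N, D, Q = 150, 5, 2

# U uniform on the Stiefel manifold: QR of a Gaussian matrix, columns
# sign-corrected so the distribution is exactly Haar-uniform
A = rng.standard_normal((D, Q))
U, R = np.linalg.qr(A)
U = U * np.sign(np.diag(R))

Sigma = np.diag([3.0, 1.0])                 # true singular values
W = U @ Sigma                               # D x Q loading matrix

X = rng.standard_normal((N, Q))             # latent variables, X ~ N(0, I)
eps = 0.1 * rng.standard_normal((N, D))     # noise with variance 0.01
Y = X @ W.T + eps                           # observed data, N x D
```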
[Figure: histogram of samples from p(σ | Y), with the classical PCA estimate and the true values of σ marked]
[Figure: posterior samples of the columns of W in the (W1, W2) plane]
Breast Cancer Wisconsin Dataset
(N, D) = (569, 30)
[Figures: posterior samples of the first two columns of W, plotted as (W1, W2), for the breast cancer data]
Extension to nonlinear kernels (GPLVM)
- $p(Y \mid X) = \prod_{d=1}^{D} \mathcal{N}(Y_{:,d} \mid \mu,\, K + \sigma^2 I)$
- Linear kernel recovers PPCA: $K = X X^T$, i.e. $K_{ij} = X_{i,:}^T X_{j,:} = k(X_{i,:}, X_{j,:})$
- Squared-exponential kernel: $k_{\mathrm{SE}}(x, x') = \sigma_{\mathrm{SE}}^2 \exp\!\big(-0.5\, \lVert x - x' \rVert_2^2 / l^2\big)$ (see the kernel sketch below)
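A sketch of these two kernel choices, assuming NumPy; the latent points, σ_SE, l, and noise level are illustrative values:

```python
import numpy as np

def k_linear(X):
    """Linear kernel K = X X^T (recovers PPCA)."""
    return X @ X.T

def k_se(X, sigma_se=1.0, length_scale=1.0):
    """Squared-exponential kernel k_SE(x, x') = sigma_se^2 exp(-0.5 ||x - x'||^2 / l^2)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return sigma_se**2 * np.exp(-0.5 * sq_dists / length_scale**2)

# Covariance of each data column Y_{:,d} under the GP likelihood: K + sigma^2 I
X = np.random.default_rng(3).standard_normal((4, 2))   # four latent points in R^2
sigma_noise = 0.1
C = k_se(X) + sigma_noise**2 * np.eye(X.shape[0])
print(C.shape)                                          # (4, 4)
```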
[Figure: posterior samples of the latent space (X1, X2) for three chains, comparing the standard parameterization ("standard - chain: 1-3") with the proposed parameterization ("unique - chain: 1-3")]
- The proposed parameterization carries over to this model as well, at the cost of increased model complexity
Conclusion
- The proposed parameterization identifies principal components even though the likelihood and the posterior are rotationally symmetric
- The probabilistic model is not changed compared to a standard Gaussian prior on W (no Jacobian correction needed)
- It gives direct access to the principal variances
- The GPLVM suffers from the same rotational symmetry, and the approach addresses the problem there as well
Supervisor: Prof. Dr. Nils Bertschinger
Funder: Dr. h. c. Helmut O. Maucher
Poster session: #235
GitHub: https://github.com/RSNirwan/HouseholderBPCA