

SLIDE 1

Rotation Invariant Householder Parameterization for Bayesian PCA

Rajbir-Singh Nirwan, Nils Bertschinger

June 11, 2019

SLIDE 2

Outline

  • Probabilistic PCA (PPCA)
  • Non-identifiability issue of PPCA
  • Conceptual solution to the problem
  • Implementation
  • Results


SLIDE 3

Probabilistic PCA

  • Classical PCA

Formulated as a projection from the data space Y to a lower-dimensional latent space X: Y ∈ ℝ^{N×D} → X ∈ ℝ^{N×Q}. The latent space maximizes the variance of the projected data and minimizes the MSE.

SLIDE 4

Probabilistic PCA

  • Classical PCA

Formulated as a projection from the data space Y to a lower-dimensional latent space X. The latent space maximizes the variance of the projected data and minimizes the MSE.

  • Probabilistic PCA (PPCA)

Viewed as a generative model that maps the latent space X to the data space Y: X ∈ ℝ^{N×Q} → Y ∈ ℝ^{N×D}

Y = X W^T + ε,  X ∼ 𝒩(0, I),  ε ∼ 𝒩(0, σ²I)

Marginalizing out X gives the likelihood

p(Y|W) = ∏_{n=1}^{N} 𝒩(Y_{n,:} | 0, W W^T + σ²I),

which is rotation invariant: (W R)(W R)^T = W R R^T W^T = W W^T for all R with R R^T = I.
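As a sanity check, here is a minimal numpy sketch of this invariance (illustrative, not from the talk): rotating W by any orthogonal R leaves the marginal covariance, and hence the likelihood, unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
D, Q = 5, 2
W = rng.standard_normal((D, Q))

# Rotate the latent coordinates by an orthogonal R (R R^T = I).
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# The marginal covariance W W^T, and hence p(Y|W), is unchanged.
print(np.allclose(W @ W.T, (W @ R) @ (W @ R).T))   # True
```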

SLIDES 5-7

Probabilistic PCA (model repeated from SLIDE 4)

  • Optimization for D = 5, Q = 2
SLIDE 8

Probabilistic PCA (model repeated from SLIDE 4)

  • Rotation invariant likelihood
  • Optimization for D = 5, Q = 2

[Figure: optimized components W1 and W2, axes from −2 to 2]
SLIDE 9

Bayesian approach to PPCA

p(W|Y) = p(Y|W) p(W) / p(Y)

  • If the prior does not break the symmetry, the posterior will be rotation invariant as well
  • Sampling will be challenging, posterior averages are meaningless, and the interpretation of the latent space is almost impossible

SLIDE 10

Bayesian approach to PPCA (repeated from SLIDE 9)

[Figure: sampled components W1 and W2, axes from −2 to 2]

SLIDE 11

Solution

  • Find a different parameterization of the model, such that the probabilistic model is not changed

Outline of procedure

  • SVD of W: W = U Σ V^T, so W W^T = (U Σ V^T)(U Σ V^T)^T = U Σ² U^T
  • Fix the coordinate system: set V = I
  • Specify the correct prior p(U, Σ)
  • Sample from p(U, Σ | Y)

SLIDE 12

Solution (procedure repeated from SLIDE 11)

Matching the prior: W ∼ 𝒩(0, I) → W W^T is Wishart distributed, so we need priors U ∼ ?, Σ ∼ ? such that U Σ Σ^T U^T is Wishart distributed.
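A small numpy sketch of this step (illustrative variable names): the likelihood depends on W only through W W^T = U Σ² U^T, so the representative U Σ (i.e. V fixed to I) is rotation-free but has the same likelihood.

```python
import numpy as np

rng = np.random.default_rng(1)
D, Q = 5, 2
W = rng.standard_normal((D, Q))   # W ~ N(0, I), so W W^T is Wishart

# SVD: W = U diag(s) V^T. Since W W^T = U diag(s)^2 U^T does not involve V,
# fixing V = I removes the rotational degeneracy without changing the model.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
W_unique = U @ np.diag(s)

print(np.allclose(W @ W.T, W_unique @ W_unique.T))   # True: same likelihood
```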

SLIDE 13

Theory

  • Since (U, Σ) is the SVD of W, (U, Σ²) is the eigenvalue decomposition of W W^T → U is the eigenvector matrix
  • U lives on the Stiefel manifold: U ∈ 𝒲_{Q,D} = {U ∈ ℝ^{D×Q} | U^T U = I}
  • Eigenvectors of a Wishart matrix are distributed uniformly in the space of orthogonal matrices (Blai (2007), Uhlig (1994))

→ U is uniformly distributed on the Stiefel manifold

SLIDE 14

Theory (continued)

  • The square of the ordered eigenvalue matrix Σ is distributed as (James & Lee (2014))

p(λ) = c · exp(−½ Σ_{q=1}^{Q} λ_q) · ∏_{q=1}^{Q} λ_q^{(D−Q−1)/2} · ∏_{q=1}^{Q} ∏_{q′=q+1}^{Q} (λ_q − λ_{q′})

  • Changing variables to the singular values σ_q = √λ_q (Jacobian ∏_q 2σ_q) gives

p(σ_1, …, σ_Q) = c · exp(−½ Σ_{q=1}^{Q} σ_q²) · ∏_{q=1}^{Q} σ_q^{D−Q−1} · ∏_{q=1}^{Q} ∏_{q′=q+1}^{Q} (σ_q² − σ_{q′}²) · ∏_{q=1}^{Q} 2σ_q
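A direct numpy transcription of this singular-value density (a sketch; the function name is ours and the normalizing constant c is dropped, so this is the unnormalized log-density):

```python
import numpy as np

def log_p_sigma_unnorm(sigma, D):
    """Unnormalized log-density of the ordered singular values
    (sigma_1 > ... > sigma_Q > 0) of a D x Q standard-normal matrix W."""
    sigma = np.asarray(sigma, dtype=float)
    Q = sigma.size
    lp = -0.5 * np.sum(sigma ** 2)                 # exp(-1/2 sum_q sigma_q^2)
    lp += (D - Q - 1) * np.sum(np.log(sigma))      # prod_q sigma_q^{D-Q-1}
    for q in range(Q):                             # prod_{q < q'} (sigma_q^2 - sigma_q'^2)
        for qp in range(q + 1, Q):
            lp += np.log(sigma[q] ** 2 - sigma[qp] ** 2)
    lp += np.sum(np.log(2.0 * sigma))              # Jacobian prod_q 2 sigma_q
    return lp

print(log_p_sigma_unnorm([3.0, 1.0], D=5))
```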

SLIDE 15

Implementation

  • Need:

U ∼ uniform on the Stiefel manifold 𝒲_{Q,D}
Σ ∼ p(Σ) ← easy, since we know the analytic expression for the density

  • How to sample U uniformly on 𝒲_{Q,D} (Mezzadri (2007)):

for n = D, …, 1:  v_n ∼ uniform on the sphere 𝕊^{n−1}

H_n = I_{D−n} ⊕ H̃_n,  H̃_n(v_n) = −sgn(v_{n1}) (I − 2 u_n u_n^T)

u_n = (v_n + sgn(v_{n1}) ‖v_n‖ e_1) / ‖v_n + sgn(v_{n1}) ‖v_n‖ e_1‖

U = H_D(v_D) H_{D−1}(v_{D−1}) ⋯ H_1(v_1)
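A runnable numpy sketch of this construction (function names are ours; reflections H_n with n ≤ D−Q fix the first Q columns of the product, so only Q reflections are needed for U ∈ 𝒲_{Q,D}):

```python
import numpy as np

def householder(v):
    """H~(v) = -sgn(v_1)(I - 2 u u^T): an orthogonal matrix mapping e_1 to v/||v||."""
    sgn = 1.0 if v[0] >= 0 else -1.0
    u = v.copy()
    u[0] += sgn * np.linalg.norm(v)
    u /= np.linalg.norm(u)
    return -sgn * (np.eye(len(v)) - 2.0 * np.outer(u, u))

def sample_stiefel(D, Q, rng):
    """Uniform sample on {U in R^{D x Q} : U^T U = I} via Householder reflections."""
    U = np.eye(D)[:, :Q]
    for n in range(D - Q + 1, D + 1):          # apply H_{D-Q+1}, ..., H_D, rightmost first
        v = rng.standard_normal(n)             # normalized Gaussian = uniform on S^{n-1}
        H_n = np.eye(D)
        H_n[D - n:, D - n:] = householder(v)   # H_n = I_{D-n} (+) H~_n(v_n)
        U = H_n @ U
    return U

U = sample_stiefel(5, 2, np.random.default_rng(0))
print(np.allclose(U.T @ U, np.eye(2)))         # True: orthonormal columns
```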

SLIDE 16

Implementation

The full generative model for Bayesian PPCA:

v_D, …, v_{D−Q+1} ∼ 𝒩(0, I)   (a normalized Gaussian vector is uniform on the sphere)
σ ∼ p(σ),  μ ∼ p(μ),  σ_noise ∼ p(σ_noise)
U = ∏_{q=1}^{Q} H_{D−q+1}(v_{D−q+1})   (keeping the first Q columns)
Σ = diag(σ),  W = U Σ
Y ∼ ∏_{n=1}^{N} 𝒩(Y_{n,:} | μ, W W^T + σ²_noise I)
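Putting the pieces together, a sketch of one draw from this generative model (the concrete priors below are placeholders for p(σ), p(μ), p(σ_noise), not choices from the talk; sample_stiefel is the sampler sketched above):

```python
import numpy as np

def sample_ppca_dataset(N, D, Q, rng):
    """One draw from the Householder-parameterized PPCA generative model."""
    U = sample_stiefel(D, Q, rng)                        # uniform on the Stiefel manifold
    sigma = np.sort(rng.gamma(2.0, 1.0, size=Q))[::-1]   # placeholder for p(sigma), ordered
    mu = np.zeros(D)                                     # placeholder for p(mu)
    sigma_noise = 0.1                                    # placeholder for p(sigma_noise)
    W = U @ np.diag(sigma)
    C = W @ W.T + sigma_noise ** 2 * np.eye(D)           # marginal covariance of each row
    Y = rng.multivariate_normal(mu, C, size=N)           # latent X already integrated out
    return Y, W

Y, W = sample_ppca_dataset(150, 5, 2, np.random.default_rng(0))
```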

SLIDE 17

Results

Synthetic Dataset

  • Construction

(N, D, Q) = (150, 5, 2)
X ∼ 𝒩(0, I) ∈ ℝ^{N×Q}
U ∼ uniform on the Stiefel manifold 𝒲_{Q,D}
Σ = diag(σ_1, σ_2) = diag(3.0, 1.0)
W = U Σ ∈ ℝ^{D×Q}
Y = X W^T + ε,  ε ∼ 𝒩(0, 0.01) ∈ ℝ^{N×D}

  • Inference

[Figure: samples from p(σ|Y) compared with σ from classical PCA and the true values; posterior samples of components W1 and W2]
SLIDE 18

Results

Breast Cancer Wisconsin Dataset: (N, D) = (569, 30)

  • Bayesian PCA

[Figure: posterior samples of components W1 and W2]

  • Advantages
  • Breaks the rotation symmetry without changing the probabilistic model
  • Enriches the classical PCA solution with uncertainty estimates
  • Decomposes the prior into rotation and principal variances
  • Allows constructing other priors without issues:
  • a sparsity prior on the principal variances without an a-priori rotation preference
  • if desired, an a-priori rotation preference without affecting the variances

SLIDE 19

Extension to non-linear models

  • The GPLVM has the same rotation-invariance problem

p(Y|X) = ∏_{d=1}^{D} 𝒩(Y_{:,d} | μ, K + σ²I)

K = X X^T with K_{ij} = X_{i,:}^T X_{j,:} = k(X_{i,:}, X_{j,:})   (linear kernel)

k_SE(x, x′) = σ²_SE exp(−0.5 ‖x − x′‖²₂ / l²)   (squared-exponential kernel for the non-linear case)
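A short numpy sketch of the two kernels mentioned here (helper names are ours):

```python
import numpy as np

def k_linear(X):
    """Linear kernel K = X X^T: with this choice the GPLVM reduces to PPCA."""
    return X @ X.T

def k_se(X, sigma_se=1.0, ell=1.0):
    """Squared-exponential kernel k(x, x') = sigma_se^2 exp(-0.5 ||x - x'||^2 / ell^2)."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return sigma_se ** 2 * np.exp(-0.5 * d2 / ell ** 2)
```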

[Figure: latent space (X1, X2) inferred by three MCMC chains, standard vs. unique parameterization]
  • No rotation symmetry in the posterior for the suggested parameterization
  • Different chains converge to different solutions due to increased model complexity


SLIDE 20

Conclusion

  • Suggested a new parameterization for W in PPCA, which uniquely identifies principal components even though the likelihood and the posterior are rotationally symmetric
  • Showed how to set the prior on the new parameters such that the model is not changed compared to a standard Gaussian prior on W
  • Provided an efficient implementation via Householder transformations (no Jacobian correction needed)
  • The new parameterization allows for other interpretable priors on rotation and principal variances
  • Extended the approach to non-linear models and successfully solved the rotation problem there as well

SLIDE 21

Thanks for your attention!

Supervisor: Prof. Dr. Nils Bertschinger
Funder: Dr. h. c. Helmut O. Maucher

Poster session: #235 Github: https://github.com/RSNirwan/HouseholderBPCA