

  1. Rotation Invariant Householder Parameterization for Bayesian PCA
     Rajbir-Singh Nirwan, Nils Bertschinger
     June 11, 2019

  2. Outline
     • Probabilistic PCA (PPCA)
     • Non-identifiability issue of PPCA
     • Conceptual solution to the problem
     • Implementation
     • Results

  3. Probabilistic PCA
     • Classical PCA: formulated as a projection from the data space $Y \in \mathbb{R}^{N \times D}$ to a lower-dimensional latent space $X \in \mathbb{R}^{N \times Q}$. The latent space maximizes the variance of the projected data and minimizes the MSE.

  4. Probabilistic PCA (cont.)
     • Probabilistic PCA (PPCA): viewed as a generative model that maps the latent space $X \in \mathbb{R}^{N \times Q}$ to the data space $Y \in \mathbb{R}^{N \times D}$:
       $Y = XW^\top + \epsilon, \quad X \sim \mathcal{N}(0, I), \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I)$
     • Marginalizing over $X$ gives the likelihood
       $p(Y \mid W) = \prod_{n=1}^{N} \mathcal{N}(Y_{n,:} \mid 0,\, WW^\top + \sigma^2 I)$
     • The likelihood is invariant under rotations of $W$: $WRR^\top W^\top = WW^\top$ for all $R$ with $RR^\top = I$ (see the sketch below).
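
The rotation invariance is easy to verify numerically. Below is a minimal NumPy sketch, not taken from the linked repository; the noise scale and seed are illustrative choices, and $D = 5$, $Q = 2$ match the example on the following slides. It draws data from the PPCA generative model and checks that $WW^\top$, and hence the marginal covariance, is unchanged when $W$ is rotated.

```python
# Sketch: PPCA generative model and rotation invariance of W W^T.
import numpy as np

rng = np.random.default_rng(0)
N, D, Q, sigma = 150, 5, 2, 0.1

W = rng.normal(size=(D, Q))                      # loadings W
X = rng.normal(size=(N, Q))                      # latents, X ~ N(0, I)
Y = X @ W.T + sigma * rng.normal(size=(N, D))    # Y = X W^T + eps

theta = rng.uniform(0.0, 2.0 * np.pi)            # arbitrary rotation angle (Q = 2)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(np.allclose(W @ W.T, (W @ R) @ (W @ R).T))  # True: R is invisible to the likelihood
```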

  5.–7. Probabilistic PCA (cont.)
     • Optimization for $D = 5$, $Q = 2$.
     [Figures: animation frames of the optimization of $W$]


  8. Probabilistic PCA (cont.)
     • Rotation-invariant likelihood: since $WRR^\top W^\top = WW^\top$ for every orthogonal $R$, all rotations $WR$ of an optimized $W$ attain the same likelihood.
     [Figure: scatter of the optimized loadings, $W_2$ against $W_1$, axes from $-2$ to $2$]
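
To make the flat likelihood orbit concrete, here is a hedged SciPy/NumPy check (all sizes illustrative) that the PPCA log-likelihood takes exactly the same value at $W$ and at $WR$ for a random orthogonal $R$:

```python
# Sketch: p(Y | W) = p(Y | W R) for any orthogonal R.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
N, D, Q, sigma = 150, 5, 2, 0.1
W = rng.normal(size=(D, Q))
Y = rng.normal(size=(N, Q)) @ W.T + sigma * rng.normal(size=(N, D))

R, _ = np.linalg.qr(rng.normal(size=(Q, Q)))   # random orthogonal Q x Q matrix

def loglik(W_):
    C = W_ @ W_.T + sigma**2 * np.eye(D)       # marginal covariance W W^T + s^2 I
    return multivariate_normal(np.zeros(D), C).logpdf(Y).sum()

print(np.isclose(loglik(W), loglik(W @ R)))    # True: the optimum is a whole orbit
```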

  9. Bayesian approach to PPCA
     $p(W \mid Y) = \dfrac{p(Y \mid W)\, p(W)}{p(Y)}$
     • If the prior does not break the symmetry, the posterior is rotation invariant as well.
     • Sampling then becomes challenging: posterior averages are meaningless and interpretation of the latent space is almost impossible.

  10. Bayesian approach to PPCA (cont.)
     [Figure: samples from $p(W \mid Y)$, $W_2$ against $W_1$, axes from $-2$ to $2$]

  11. Solution
     • Find a different parameterization of the model, such that the probabilistic model is not changed.
     Outline of the procedure:
     • SVD of $W$: $WW^\top = U\Sigma V^\top (U\Sigma V^\top)^\top = U\Sigma^2 U^\top$
     • Fix the coordinate system: $V = I$
     • Specify the correct prior $p(U, \Sigma)$
     • Sample from $p(U, \Sigma \mid Y)$

  12. Solution (cont.)
     • Since $W \sim \mathcal{N}(0, I)$, the matrix $WW^\top = U\Sigma\Sigma^\top U^\top$ is Wishart distributed. Which distributions does this induce on the factors: $U \sim\,?$ and $\Sigma \sim\,?$ (See the sketch below for the reparameterization step itself.)
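
A small NumPy sketch of the reparameterization step: take the SVD of $W$, drop $V$ (fixing the coordinate system $V = I$), and confirm that $W' = U\Sigma$ leaves $WW^\top$, and therefore the model, unchanged. Sizes are illustrative.

```python
# Sketch: W = U S V^T; fixing V = I gives W' = U S with the same W W^T.
import numpy as np

rng = np.random.default_rng(2)
D, Q = 5, 2
W = rng.normal(size=(D, Q))

U, s, Vt = np.linalg.svd(W, full_matrices=False)  # U: D x Q, s: length Q
W_fixed = U * s                                   # equals U @ np.diag(s)

print(np.allclose(W @ W.T, W_fixed @ W_fixed.T))  # True: probabilistic model unchanged
```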

  13. Theory: $U \sim\,?$, $\Sigma \sim\,?$ such that $U\Sigma\Sigma^\top U^\top$ is Wishart
     • Since $(U, \Sigma)$ is the SVD of $W$ and $(U, \Sigma^2)$ is the eigenvalue decomposition of $WW^\top$, $U$ is the eigenvector matrix of $WW^\top$.
     • $U$ lives on the Stiefel manifold $\mathcal{V}_{Q,D} = \{ U \in \mathbb{R}^{D \times Q} \mid U^\top U = I \}$.
     • The eigenvectors of a Wishart matrix are distributed uniformly over the space of orthogonal matrices (Blai (2007), Uhlig (1994)), so $U$ is uniformly distributed on the Stiefel manifold.

  14. Theory (cont.)
     • The matrix $\Sigma^2$ of squared, ordered eigenvalues is distributed as (James & Lee (2014)):
       $p(\lambda) = c\, e^{-\frac{1}{2}\sum_{q=1}^{Q} \lambda_q}\ \prod_{q=1}^{Q} \lambda_q^{\frac{D-Q-1}{2}}\ \prod_{q=1}^{Q} \prod_{q'=q+1}^{Q} (\lambda_q - \lambda_{q'})$
     • Changing variables to the singular values $\sigma_q = \sqrt{\lambda_q}$ gives
       $p(\sigma_1, \dots, \sigma_Q) = c\, e^{-\frac{1}{2}\sum_{q=1}^{Q} \sigma_q^2}\ \prod_{q=1}^{Q} \sigma_q^{D-Q-1}\ \prod_{q=1}^{Q} \prod_{q'=q+1}^{Q} (\sigma_q^2 - \sigma_{q'}^2)\ \prod_{q=1}^{Q} 2\sigma_q$
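
As a transcription check, the singular-value density above can be evaluated directly. The sketch below computes its unnormalized log (the constant $c$ is dropped) and assumes $\sigma$ is sorted in descending order, so every factor $\sigma_q^2 - \sigma_{q'}^2$ is positive.

```python
# Sketch: unnormalized log p(sigma_1, ..., sigma_Q) from the formula above.
import numpy as np

def log_p_sigma(sigma, D):
    """sigma: singular values in descending order; D: data dimension."""
    Q = len(sigma)
    lp = -0.5 * np.sum(sigma**2)                      # exp(-1/2 sum sigma_q^2)
    lp += (D - Q - 1) * np.sum(np.log(sigma))         # prod sigma_q^{D-Q-1}
    for q in range(Q):
        for qp in range(q + 1, Q):
            lp += np.log(sigma[q]**2 - sigma[qp]**2)  # prod (sigma_q^2 - sigma_q'^2)
    lp += np.sum(np.log(2.0 * sigma))                 # Jacobian of lambda = sigma^2
    return lp

print(log_p_sigma(np.array([3.0, 1.0]), D=5))
```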

  15. Implementation
     • Need: $U \sim$ uniform on the Stiefel manifold $\mathcal{V}_{Q,D}$, and $\Sigma \sim p(\Sigma)$ (easy, since the analytic expression for the density is known).
     • How to sample $U$ uniformly on $\mathcal{V}_{Q,D}$ (Mezzadri (2007)); see the sketch below. For $n = D, \dots, 1$:
       $v_n \sim$ uniform on the sphere $\mathbb{S}^{n-1}$
       $u_n = \dfrac{v_n + \operatorname{sgn}(v_{n1})\, \lVert v_n \rVert\, e_1}{\lVert v_n + \operatorname{sgn}(v_{n1})\, \lVert v_n \rVert\, e_1 \rVert}$
       $H_n(v_n) = -\operatorname{sgn}(v_{n1})\, (I - 2 u_n u_n^\top), \qquad \tilde{H}_n = \begin{pmatrix} I & 0 \\ 0 & H_n \end{pmatrix}$
       $U = \tilde{H}_D(v_D)\, \tilde{H}_{D-1}(v_{D-1}) \cdots \tilde{H}_1(v_1)$
       (For $U \in \mathcal{V}_{Q,D}$ only the reflections $\tilde{H}_D, \dots, \tilde{H}_{D-Q+1}$ are needed, as on the next slide.)
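
A NumPy sketch of this Householder construction, following the recipe above rather than the authors' implementation. Gaussian draws stand in for the uniform directions on $\mathbb{S}^{n-1}$: the reflection only depends on the direction of $v_n$, so normalization is implicit.

```python
# Sketch: uniform (Haar) sampling of U on the Stiefel manifold via
# Householder reflections, U = H~_D H~_{D-1} ... H~_{D-Q+1} [e_1 ... e_Q].
import numpy as np

def householder_stiefel(D, Q, rng):
    """Return U (D x Q) with U^T U = I, uniform on the Stiefel manifold."""
    U = np.eye(D)[:, :Q]                     # first Q columns of the identity
    for n in range(D - Q + 1, D + 1):        # n = D-Q+1, ..., D
        v = rng.normal(size=n)               # direction ~ uniform on S^{n-1}
        sign = 1.0 if v[0] >= 0 else -1.0    # sgn(v_{n1}), with sgn(0) := 1
        u = v.copy()
        u[0] += sign * np.linalg.norm(v)
        u /= np.linalg.norm(u)
        H = -sign * (np.eye(n) - 2.0 * np.outer(u, u))  # n x n reflection H_n
        H_tilde = np.eye(D)
        H_tilde[D - n:, D - n:] = H          # embed H_n in the lower-right block
        U = H_tilde @ U                      # left-multiply: H~_D ends up outermost
    return U

rng = np.random.default_rng(3)
U = householder_stiefel(D=5, Q=2, rng=rng)
print(np.allclose(U.T @ U, np.eye(2)))       # True: orthonormal columns
```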

  16. Implementation (cont.)
     The full generative model for Bayesian PPCA:
     $v_D, \dots, v_{D-Q+1} \sim \mathcal{N}(0, I)$
     $\sigma \sim p(\sigma), \quad \mu \sim p(\mu), \quad \sigma_{\text{noise}} \sim p(\sigma_{\text{noise}})$
     $U = \prod_{q=1}^{Q} \tilde{H}_{D-q+1}(v_{D-q+1})$
     $\Sigma = \operatorname{diag}(\sigma), \quad W = U\Sigma$
     $Y \sim \prod_{n=1}^{N} \mathcal{N}(Y_{n,:} \mid \mu,\, WW^\top + \sigma_{\text{noise}}^2 I)$
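
Putting the pieces together, a forward-sampling sketch of this generative model, reusing householder_stiefel from the previous snippet. The priors $p(\sigma)$, $p(\mu)$, $p(\sigma_{\text{noise}})$ are not specified on the slide, so the draws below are placeholder choices for illustration only.

```python
# Sketch: forward sampling from the Bayesian PPCA generative model.
# Assumes householder_stiefel from the earlier snippet is in scope.
import numpy as np

rng = np.random.default_rng(4)
N, D, Q = 150, 5, 2

U = householder_stiefel(D, Q, rng)                        # uniform on the Stiefel manifold
sigma = np.sort(np.abs(rng.normal(size=Q)) + 1.0)[::-1]   # placeholder p(sigma), ordered
mu = rng.normal(size=D)                                   # placeholder p(mu)
sigma_noise = 0.1                                         # placeholder p(sigma_noise) draw

W = U * sigma                                             # W = U Sigma, Sigma = diag(sigma)
C = W @ W.T + sigma_noise**2 * np.eye(D)                  # marginal covariance
Y = rng.multivariate_normal(mu, C, size=N)                # Y_n ~ N(mu, W W^T + s^2 I)
```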

  17. Results: Synthetic Dataset
     • Construction, with $(N, D, Q) = (150, 5, 2)$:
       $X \sim \mathcal{N}(0, I) \in \mathbb{R}^{N \times Q}$
       $U \sim$ uniform on the Stiefel manifold $\mathcal{V}_{Q,D}$
       $\Sigma = \operatorname{diag}(\sigma_1, \sigma_2) = \operatorname{diag}(3.0, 1.0)$
       $W = U\Sigma \in \mathbb{R}^{D \times Q}$
       $\epsilon \sim \mathcal{N}(0, 0.01) \in \mathbb{R}^{N \times D}$
       $Y = XW^\top + \epsilon$
     • Inference
       [Figures: posterior samples of $W$, $W_2$ against $W_1$; histogram of samples from $p(\sigma \mid Y)$ with the classical PCA estimates and the true values marked]
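
The synthetic construction is straightforward to reproduce; here is a sketch, again assuming householder_stiefel from the earlier snippet, and reading $\epsilon \sim \mathcal{N}(0, 0.01)$ as variance $0.01$, i.e. standard deviation $0.1$.

```python
# Sketch: synthetic dataset with (N, D, Q) = (150, 5, 2), Sigma = diag(3, 1).
import numpy as np

rng = np.random.default_rng(5)
N, D, Q = 150, 5, 2

U = householder_stiefel(D, Q, rng)      # uniform on the Stiefel manifold
Sigma = np.diag([3.0, 1.0])
W = U @ Sigma                           # D x Q loadings
X = rng.normal(size=(N, Q))             # X ~ N(0, I)
eps = 0.1 * rng.normal(size=(N, D))     # eps ~ N(0, 0.01): std 0.1
Y = X @ W.T + eps
```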

  18. Results: Breast Cancer Wisconsin Dataset, $(N, D) = (569, 30)$
     • Bayesian PCA
       [Figures: posterior samples of the loadings, $W_2$ against $W_1$]
     • Advantages
       • Breaks the rotation symmetry without changing the probabilistic model
       • Enriches the classical PCA solution with uncertainty estimates
       • Decomposes the prior into rotation and principal variances
       • Allows constructing other priors without issues:
         • a sparsity prior on the principal variances without an a-priori rotation preference
         • if desired, an a-priori rotation preference without affecting the variances

  19. Extension to non-linear models
     • The GPLVM has the same rotation-invariance problem:
       $p(Y \mid X) = \prod_{d=1}^{D} \mathcal{N}(Y_{:,d} \mid \mu,\, K + \sigma^2 I)$
       Linear kernel: $K = XX^\top$, i.e. $K_{ij} = X_{i,:}^\top X_{j,:} = k(X_{i,:}, X_{j,:})$
       Squared-exponential kernel: $k_{\text{SE}}(x, x') = \sigma^2 \exp\!\left(-0.5\, \lVert x - x' \rVert^2 / l^2\right)$
     [Figures: posterior latent spaces, $X_2$ against $X_1$, for three chains each, under the standard parameterization and the suggested unique parameterization]
     • No rotation symmetry in the posterior for the suggested parameterization.
     • Different chains converge to different solutions due to the increased model complexity.
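
For reference, a sketch of the two kernels mentioned here: the linear kernel, under which the GPLVM recovers probabilistic PCA, and the squared-exponential kernel that makes the latent-to-data mapping non-linear. Hyperparameter values are illustrative.

```python
# Sketch: linear and squared-exponential (SE) GPLVM kernels.
import numpy as np

def linear_kernel(X):
    return X @ X.T                       # K_ij = X_i . X_j, recovers PPCA

def se_kernel(X, var=1.0, length=1.0):
    sq = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)  # ||x - x'||^2
    return var * np.exp(-0.5 * sq / length**2)

rng = np.random.default_rng(6)
X = rng.normal(size=(10, 2))
K = se_kernel(X) + 1e-2 * np.eye(10)     # K + sigma^2 I, as in p(Y | X)
```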

  20. Conclusion
     • Suggested a new parameterization for $W$ in PPCA that uniquely identifies the principal components even though the likelihood and the posterior are rotationally symmetric.
     • Showed how to set the prior on the new parameters such that the model is unchanged compared to a standard Gaussian prior on $W$.
     • Provided an efficient implementation via Householder transformations (no Jacobian correction needed).
     • The new parameterization allows for other interpretable priors on rotation and principal variances.
     • Extended to non-linear models and successfully solved the rotation problem there as well.

  21. Poster session: #235
     GitHub: https://github.com/RSNirwan/HouseholderBPCA
     Thanks for your attention!
     Supervisor: Prof. Dr. Nils Bertschinger
     Funder: Dr. h. c. Helmut O. Maucher
