

  1. Probabilistic Numerics – Part II – Linear Algebra and Nonlinear Optimization. Philipp Hennig, MLSS 2015, 20/07/2015. Emmy Noether Group on Probabilistic Numerics, Department of Empirical Inference, Max Planck Institute for Intelligent Systems, Tübingen, Germany

  2. Probabilistic Numerics – Recap from Saturday. On Saturday:
     ▸ computation is inference
     ▸ classic methods for integration and the solution of differential equations can be interpreted as MAP inference from Gaussian models
     ▸ customizing the implicit prior gives faster, tailored numerics
     ▸ the probabilistic formulation allows propagation of uncertainty through composite computations

  3. Linear Algebra. Solve Ax = b (in backslash notation, x = A\b) for A ∈ ℝ^{N×N} symmetric positive definite.

  4. Why you should care about linear algebra – least-squares, a most basic machine learning task:
     f̂(x) = k_xX (k_XX + σ²I)⁻¹ b = k_xX A⁻¹ b,  with A ≡ k_XX + σ²I,
     i.e. the prediction requires the solution A⁻¹b of a symmetric positive definite linear system.
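A minimal NumPy sketch of this reduction (the RBF kernel, toy data, and noise level below are illustrative choices, not from the slides): the posterior mean is one linear solve with the SPD matrix A = k_XX + σ²I.

```python
import numpy as np

# A toy GP / kernel least-squares posterior mean: everything reduces to one
# linear solve with the SPD matrix A = k_XX + sigma^2 I.
def rbf(X1, X2, ell=0.5):
    # squared-exponential kernel (illustrative choice)
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / ell**2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))                  # training inputs
b = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)   # noisy targets ("b" as on the slide)
x_star = np.linspace(-3, 3, 200)[:, None]             # test inputs

sigma2 = 0.1 ** 2
A = rbf(X, X) + sigma2 * np.eye(50)                   # A = k_XX + sigma^2 I  (SPD)
f_hat = rbf(x_star, X) @ np.linalg.solve(A, b)        # k_xX A^{-1} b
```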

  5. Inference on Matrix Elements – generic Gaussian priors [Hennig, SIOPT, 2015]
     ▸ prior on the elements of the inverse H = A⁻¹ ∈ ℝ^{N×N}, with Σ ∈ ℝ^{N²×N²}:
       p(H) = N(vec(H); vec(H₀), Σ) = (2π)^{−N²/2} |Σ|^{−1/2} exp[−½ (vec(H) − vec(H₀))⊺ Σ⁻¹ (vec(H) − vec(H₀))]
     ▸ can collect noise-free observations: AS = Y ⇔ S = HY ∈ ℝ^{N×M}, with likelihood p(S, Y | H) = δ(S − HY)
     ▸ this is a linear projection (using the Kronecker product): vec(S) = (I ⊗ Y⊺) vec(H) = C vec(H), C ∈ ℝ^{NM×N²}, since S_km = Σ_ij δ_ki Y_jm H_ij
     ▸ posterior: p(H | S, Y) = N[vec(H); vec(H₀) + ΣC⊺(CΣC⊺)⁻¹(vec(S) − C vec(H₀)), Σ − ΣC⊺(CΣC⊺)⁻¹CΣ]
     ▸ requires O(N³M) operations! Need structure in Σ
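The conditioning rule above can be written out directly for a tiny N. The following sketch (all matrices are toy examples) builds C = I ⊗ Y⊺ explicitly, which makes clear why the generic N² × N² covariance is unaffordable at scale.

```python
import numpy as np

# Generic Gaussian inference on vec(H) for a tiny N (all matrices are toy
# examples).  C = I (x) Y^T is built explicitly, so cost and memory scale with
# the N^2 x N^2 prior covariance -- exactly the problem the slide points out.
rng = np.random.default_rng(1)
N, M = 4, 2
A = rng.standard_normal((N, N)); A = A @ A.T + N * np.eye(N)   # SPD test matrix
S = rng.standard_normal((N, M))                                # search directions
Y = A @ S                                                      # observations: AS = Y, so S = HY

H0 = np.zeros((N, N))
Sigma = np.eye(N * N)                     # generic prior covariance on vec(H) (toy choice)

C = np.kron(np.eye(N), Y.T)               # vec(S) = (I (x) Y^T) vec(H), shape (N*M, N*N)
r = S.reshape(-1) - C @ H0.reshape(-1)     # residual vec(S) - C vec(H0)
K = C @ Sigma @ C.T                        # (NM x NM) Gram matrix

mean = H0.reshape(-1) + Sigma @ C.T @ np.linalg.solve(K, r)
cov = Sigma - Sigma @ C.T @ np.linalg.solve(K, C @ Sigma)
H_mean = mean.reshape(N, N)

print(np.allclose(H_mean @ Y, S))          # the posterior mean reproduces the observations
```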

  6. p(H | S, Y) = N[vec(H); vec(H₀) + ΣC⊺(CΣC⊺)⁻¹(vec(S) − C vec(H₀)), Σ − ΣC⊺(CΣC⊺)⁻¹CΣ]
     ▸ good probabilistic numerical methods must have both
       ▸ low computational cost
       ▸ meaningful prior assumptions

  7. A factorization assumption with support on all matrices: H = H₀ + C · D⊺
     ▸ cov(H_ij, H_kℓ) = V_ik W_jℓ  ⇒  p(H) = N(H; H₀, V ⊗ W)
     ▸ if V, W ≻ 0, this puts nonzero mass on all H ∈ ℝ^{N×N}, with var(H_ij) = V_ii W_jj
     ▸ draw n columns of C i.i.d. from N(C_:i; 0, V/n)
     ▸ draw n columns of D i.i.d. from N(D_:i; 0, W/n)
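For intuition, here is a sketch that samples from the Kronecker-factorized prior. It uses the standard matrix-normal construction H = H₀ + L_V Z L_W⊺ with Cholesky factors (equivalent in distribution to the column-wise construction above); V and W are arbitrary SPD toy matrices.

```python
import numpy as np

# Sampling from the Kronecker-factorized prior p(H) = N(H0, V (x) W), via the
# standard matrix-normal construction H = H0 + Lv Z Lw^T with Cholesky factors.
rng = np.random.default_rng(2)
N = 5
V = rng.standard_normal((N, N)); V = V @ V.T + N * np.eye(N)
W = rng.standard_normal((N, N)); W = W @ W.T + N * np.eye(N)
H0 = np.zeros((N, N))
Lv, Lw = np.linalg.cholesky(V), np.linalg.cholesky(W)

def sample_H():
    Z = rng.standard_normal((N, N))
    return H0 + Lv @ Z @ Lw.T              # cov(H_ij, H_kl) = V_ik W_jl

# empirical check of var(H_ij) = V_ii W_jj for one entry
draws = np.array([sample_H()[1, 3] for _ in range(20000)])
print(draws.var(), V[1, 1] * W[3, 3])      # the two numbers should roughly agree
```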

  10. A Structured Prior – computation requires trading expressivity and cost [Hennig, SIOPT, 2015]
     ▸ the prior p(H) = N(vec(H); vec(H₀), V ⊗ W) gives
       p(H | S, Y) = N[H; H₀ + (S − H₀Y)(Y⊺WY)⁻¹Y⊺W, V ⊗ (W − WY(Y⊺WY)⁻¹Y⊺W)]
     [figure: true inverse H_true, posterior mean H_M, and observations Y = A·S]
     ▸ two problems:
       ▸ still requires an O(M³) inversion just to compute the mean ↝ would like a diagonal Y⊺WY (conjugate observations)
       ▸ how to choose H₀, V, W to get a well-scaled prior? ↝ the ‘empirical Bayesian’ choice would have to include H itself
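A short sketch of this structured posterior for toy A, S, W (all illustrative choices); note the M × M matrix Y⊺WY that still has to be inverted to form the mean.

```python
import numpy as np

# The structured (Kronecker) posterior for toy A, S, W.
rng = np.random.default_rng(3)
N, M = 6, 3
A = rng.standard_normal((N, N)); A = A @ A.T + N * np.eye(N)
S = rng.standard_normal((N, M))
Y = A @ S                                   # observations: Y = AS  <=>  S = HY
H0 = np.zeros((N, N))
W = np.eye(N)                               # prior covariance factor (toy choice)

G = Y.T @ W @ Y                             # M x M -- an O(M^3) inversion
Ginv_YtW = np.linalg.solve(G, Y.T @ W)
H_mean = H0 + (S - H0 @ Y) @ Ginv_YtW       # posterior mean
W_post = W - W @ Y @ Ginv_YtW               # posterior covariance is V (x) W_post

print(np.allclose(H_mean @ Y, S))           # the mean is consistent with the observations
```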

  11. A Scaled Prior – probabilistic computation needs meaningful priors [Hennig, SIOPT, 2015]
     ▸ use H₀ = εI with ε ≪ 1. It would be nice to have W = V = H:
       var(H_ij) = V_ii W_jj = H_ii H_jj, and for symmetric positive definite matrices H_ii > 0 and H_ij² ≤ H_ii H_jj

  12. ▸ if W = V = H, then WY = HY = S and the posterior simplifies to
       p(H | S, Y) = N[H; H₀ + (S − H₀Y)(Y⊺S)⁻¹S⊺, W ⊗ (W − S(Y⊺S)⁻¹S⊺)]

  13. ▸ can choose conjugate directions, S⊺AS = S⊺Y = diag_i{g_i}, using a Gram–Schmidt process: pick an orthogonal set {u₁, ..., u_N} and set
       s_i = u_i − Σ_{j=1}^{i−1} s_j (y_j⊺ u_i) / (y_j⊺ s_j),
     then the posterior mean is
       E[H | S, Y] = H₀ + Σ_{m=1}^{M} (s_m − H₀ y_m) s_m⊺ / (y_m⊺ s_m)
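Putting the pieces together, a sketch of the conjugate-direction loop with unit vectors u_i = e_i (toy A and H₀ = εI); after N steps the posterior mean reproduces A⁻¹ up to numerical error.

```python
import numpy as np

# Conjugate directions via Gram-Schmidt and the resulting rank-one updates of
# the posterior mean, with unit vectors u_i = e_i (toy A and H0 = eps*I).
rng = np.random.default_rng(4)
N = 6
A = rng.standard_normal((N, N)); A = A @ A.T + N * np.eye(N)
H0 = 1e-6 * np.eye(N)
U = np.eye(N)                                # orthogonal set {u_1, ..., u_N}

S_cols, Y_cols = [], []
H_mean = H0.copy()
for i in range(N):
    # s_i = u_i - sum_j s_j (y_j^T u_i) / (y_j^T s_j)
    s = U[:, i].copy()
    for s_j, y_j in zip(S_cols, Y_cols):
        s -= s_j * (y_j @ U[:, i]) / (y_j @ s_j)
    y = A @ s                                # one matrix-vector product per step
    # E[H | S, Y] = H0 + sum_m (s_m - H0 y_m) s_m^T / (y_m^T s_m)
    H_mean += np.outer(s - H0 @ y, s) / (y @ s)
    S_cols.append(s); Y_cols.append(y)

print(np.allclose(H_mean, np.linalg.inv(A), atol=1e-6))   # after N steps: H_mean ~ A^{-1}
```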

  20. Active Learning of Matrix Inverses – Gaussian Elimination [C.F. Gauss, 1809]
     Which set of orthogonal directions should we choose?
     ▸ e.g. {u₁, ..., u_N} = {e₁, ..., e_N}
     [figure: observations S and Y = A·S built up column by column from unit vectors, posterior p(H), posterior mean H_M, and true inverse H_true]
     Gaussian elimination of A is maximum a-posteriori estimation of H under a well-scaled Gaussian prior, if the search directions are chosen from the unit vectors.

  21. Gaussian elimination as MAP inference:
     ▸ decide to use a Gaussian prior
     ▸ a factorization assumption (Kronecker structure) in the covariance gives a simple update
     ▸ implicitly choosing “W = H” gives a well-scaled prior
     ▸ conjugate directions allow efficient bookkeeping
     ▸ construct the projections from unit vectors

  22. What about Uncertainty? Calibrating the prior covariance at runtime [Hennig, SIOPT, 2015]
     Under “W = H”,
       p(H | S, Y) = N[H; H₀ + (S − H₀Y)(Y⊺S)⁻¹S⊺, W ⊗ (W − S(Y⊺S)⁻¹S⊺)],
     we just need WY = S. So choose
       W = S(Y⊺S)⁻¹S⊺ + (I − Y(Y⊺Y)⁻¹Y⊺) Ω (I − Y(Y⊺Y)⁻¹Y⊺)
     [plot: observed scales y_m⊺ s_m over steps m = 1, ..., 30]
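A sketch of this construction for toy A, S, with Ω a simple scaled identity standing in for an estimated scale on the unexplored subspace; the printed check confirms the constraint WY = S.

```python
import numpy as np

# Constructing a covariance factor W that satisfies W Y = S exactly, with the
# unexplored subspace governed by a free SPD matrix Omega (toy inputs).
rng = np.random.default_rng(5)
N, M = 8, 3
A = rng.standard_normal((N, N)); A = A @ A.T + N * np.eye(N)
S = rng.standard_normal((N, M))
Y = A @ S

P = np.eye(N) - Y @ np.linalg.solve(Y.T @ Y, Y.T)   # projector onto the complement of span(Y)
Omega = 0.5 * np.eye(N)                             # free scale on the unexplored subspace
W = S @ np.linalg.solve(Y.T @ S, S.T) + P @ Omega @ P

print(np.allclose(W @ Y, S))                        # the constraint W Y = S holds
```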

  23. What about Uncertainty? Calibrating the prior covariance at runtime [Hennig, SIOPT, 2015]
     [figure: posterior covariance factor W_M for W₀ = H, compared with W_M for W₀ estimated]

  24. ▸ a scaled, structured prior with exploration along unit vectors gives Gaussian elimination
     ▸ empirical Bayesian estimation of the covariance gives scaled posterior uncertainty and retains the classic estimate, at very low cost overhead

  25. Can we do better than Gaussian Elimination? Encode the symmetry H = H⊺ [Hennig, SIOPT, 2015]
     ▸ using Γ vec(H) = ½ vec(H − H⊺), symmetry can be imposed as the limit of a Gaussian likelihood, p(symm. | H) = lim_{β→0} N(0; Γ vec(H), β), which turns the Kronecker prior into a symmetric-Kronecker one:
       p(H | symm.) = N(vec(H); vec(H₀), W ⊗⊖ W),  with (W ⊗⊖ W)_{ij,kℓ} = ½ (W_ik W_jℓ + W_iℓ W_jk)
     [figure: samples H ∼ N(H₀, W ⊗ W) versus H ∼ N(H₀, W ⊗⊖ W)]
     ▸ p(S, Y | H) = δ(S − HY) now gives (with ∆ = S − H₀Y and G = Y⊺WY)
       p(H | S, Y) = N[H; H₀ + ∆G⁻¹Y⊺W + WYG⁻¹∆⊺ − WYG⁻¹∆⊺YG⁻¹Y⊺W, (W − WYG⁻¹Y⊺W) ⊗⊖ (W − WYG⁻¹Y⊺W)]
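A sketch of this symmetric posterior mean for toy inputs (W = I and H₀ = εI are illustrative choices); the checks confirm that the estimate is symmetric and reproduces the observations HY = S.

```python
import numpy as np

# Posterior mean under the symmetry-encoding (symmetric-Kronecker) prior,
# evaluated for toy inputs.
rng = np.random.default_rng(6)
N, M = 7, 3
A = rng.standard_normal((N, N)); A = A @ A.T + N * np.eye(N)
S = rng.standard_normal((N, M))
Y = A @ S
H0 = 1e-6 * np.eye(N)
W = np.eye(N)

Delta = S - H0 @ Y                                   # Delta = S - H0 Y
G = Y.T @ W @ Y                                      # G = Y^T W Y
WY = W @ Y
T = np.linalg.solve(G, Y.T @ W)                      # G^{-1} Y^T W
H_mean = (H0
          + Delta @ T
          + WY @ np.linalg.solve(G, Delta.T)
          - WY @ np.linalg.solve(G, Delta.T @ Y) @ T)

print(np.allclose(H_mean, H_mean.T))                 # symmetric
print(np.allclose(H_mean @ Y, S))                    # consistent with the observations
```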
