SLIDE 1

Probabilistic & Unsupervised Learning Expectation Propagation

Maneesh Sahani

maneesh@gatsby.ucl.ac.uk

Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science University College London Term 1, Autumn 2018

SLIDE 2

Intractabilities and approximations

◮ Inference – computational intractability
  ◮ Gibbs sampling, other MCMC
  ◮ Factored variational approx
  ◮ Loopy BP/EP/Power EP
  ◮ Recognition models

◮ Inference – analytic intractability
  ◮ Laplace approximation (global)
  ◮ (Sequential) Monte-Carlo
  ◮ Parametric variational approx (for special cases)
  ◮ Message approximations (linearised, sigma-point, Laplace)
  ◮ Assumed-density methods and Expectation-Propagation
  ◮ Recognition models

◮ Learning – intractable partition function
  ◮ Sampling parameters
  ◮ Contrastive divergence
  ◮ Score-matching

◮ Posterior estimation and model selection
  ◮ Laplace approximation / BIC
  ◮ Monte-Carlo
  ◮ (Annealed) importance sampling
  ◮ Reversible-jump MCMC
  ◮ Variational Bayes

Not a complete list!

SLIDE 13

Nonlinear state-space model (NLSSM)

[Figure: graphical model of the NLSSM — states z_1 … z_T with inputs u_1 … u_T and observations x_1 … x_T; linearised parameters A_t, B_t (dynamics) and C_t, D_t (observations).]

z_{t+1} = f(z_t, u_t) + w_t
x_t = g(z_t, u_t) + v_t

w_t, v_t usually still Gaussian.

Extended Kalman Filter (EKF): linearise the nonlinear functions about the current estimate ẑ_t^t:

z_{t+1} ≈ f(ẑ_t^t, u_t) + ∂f/∂z_t |_{ẑ_t^t} · (z_t − ẑ_t^t) + w_t
x_t ≈ g(ẑ_t^{t−1}, u_t) + ∂g/∂z_t |_{ẑ_t^{t−1}} · (z_t − ẑ_t^{t−1}) + v_t

Identifying B_t u_t ≡ f(ẑ_t^t, u_t), A_t ≡ ∂f/∂z_t |_{ẑ_t^t}, D_t u_t ≡ g(ẑ_t^{t−1}, u_t) and C_t ≡ ∂g/∂z_t |_{ẑ_t^{t−1}}, run the Kalman filter (smoother) on the non-stationary linearised system (A_t, B_t, C_t, D_t):

◮ Adaptively approximates non-Gaussian messages by Gaussians.
◮ Local linearisation depends on the central point of the distribution ⇒ the approximation degrades with increased state uncertainty. May work acceptably for close-to-linear systems.
◮ Can base an EM-like algorithm on the EKF/EKS (or alternatives).
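The EKF recursion above can be sketched numerically. A minimal scalar sketch with hypothetical nonlinearities f(z) = sin z and g(z) = z² (not from the slides), no inputs u_t, and Jacobians taken by central differences:

```python
import numpy as np

def f(z):                      # hypothetical transition nonlinearity
    return np.sin(z)

def g(z):                      # hypothetical observation nonlinearity
    return z ** 2

def jac(fun, z, eps=1e-6):
    # scalar Jacobian d fun / dz by central differences
    return (fun(z + eps) - fun(z - eps)) / (2 * eps)

def ekf_step(z_hat, V, x, Q, R):
    """One EKF predict/update step, linearising about the current estimate."""
    # predict: z_{t+1} ~ f(z_hat) + A (z - z_hat) + w,  A = df/dz at z_hat
    A = jac(f, z_hat)
    z_pred = f(z_hat)
    V_pred = A * V * A + Q
    # update: x_t ~ g(z_pred) + C (z - z_pred) + v,  C = dg/dz at z_pred
    C = jac(g, z_pred)
    S = C * V_pred * C + R     # innovation variance
    K = V_pred * C / S         # Kalman gain
    z_new = z_pred + K * (x - g(z_pred))
    V_new = (1 - K * C) * V_pred
    return z_new, V_new

z_hat, V = ekf_step(z_hat=0.5, V=1.0, x=0.3, Q=0.01, R=0.1)
```

The same two lines of linearisation give the time-varying (A_t, C_t) of the slide; a smoother simply runs the usual Kalman backward pass on them.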

SLIDE 23

Other message approximations

Consider the forward messages on a latent chain:

P(z_t|x_{1:t}) = (1/Z) P(x_t|z_t) ∫ dz_{t−1} P(z_t|z_{t−1}) P(z_{t−1}|x_{1:t−1})

We want to approximate the messages to retain a tractable form (i.e. Gaussian):

P̃(z_t|x_{1:t}) ≈ (1/Z) P(x_t|z_t) ∫ dz_{t−1} P(z_t|z_{t−1}) P̃(z_{t−1}|x_{1:t−1})

with P(z_t|z_{t−1}) = N(f(z_{t−1}), Q) and P̃(z_{t−1}|x_{1:t−1}) = N(ẑ_{t−1}, V_{t−1}).

◮ Linearisation at the peak (EKF) is only one approach.
◮ Laplace filter: use mode and curvature of the integrand.
◮ Sigma-point ("unscented") filter:
  ◮ Evaluate f(ẑ_{t−1}) and f(ẑ_{t−1} ± √λ v) for the eigenvalue–eigenvector pairs V̂_{t−1} v = λv.
  ◮ "Fit" a Gaussian to these 2K + 1 points.
  ◮ Equivalent to numerical evaluation of the mean and covariance by Gaussian quadrature.
  ◮ One form of "Assumed Density Filtering" and EP.
◮ Parametric variational: argmin KL[N(ẑ_t, V̂_t) ‖ (1/Z) P(x_t|z_t) ∫ dz_{t−1} …]. Requires Gaussian expectations of the log ⇒ may be challenging.
◮ The other KL: argmin KL[(1/Z) P(x_t|z_t) ∫ dz_{t−1} … ‖ N(ẑ_t, V̂_t)] needs only the first and second moments of the nonlinear message ⇒ EP.
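The sigma-point steps above can be sketched as a moment-matching pass. A minimal sketch, assuming the standard unscented weights with a hypothetical spread parameter κ (the slides specify only the mean and ±√λ v points):

```python
import numpy as np

def sigma_points(mean, cov, kappa=1.0):
    # 2K+1 sigma points: the mean plus/minus scaled eigen-directions of cov
    K = mean.size
    lam, V = np.linalg.eigh(cov)                 # pairs with  cov v = lam v
    scaled = V * np.sqrt((K + kappa) * lam)      # columns sqrt((K+kappa) lam) v
    pts = [mean] + [mean + scaled[:, k] for k in range(K)] \
                 + [mean - scaled[:, k] for k in range(K)]
    w0 = kappa / (K + kappa)
    wk = 1.0 / (2 * (K + kappa))
    return np.array(pts), np.array([w0] + [wk] * (2 * K))

def unscented_transform(f, mean, cov, noise_cov):
    # "fit" a Gaussian to the propagated points: weighted mean and covariance
    pts, w = sigma_points(mean, cov)
    fpts = np.array([f(p) for p in pts])
    m = w @ fpts
    d = fpts - m
    S = (w[:, None] * d).T @ d + noise_cov
    return m, S

f = lambda z: np.sin(z)                          # hypothetical transition
m, S = unscented_transform(f, np.zeros(1), np.eye(1), 0.01 * np.eye(1))
```

The weighted sums are exactly the Gaussian-quadrature evaluation of the mean and covariance mentioned in the bullet list.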

SLIDE 30

Variational learning

Free energy:

F(q, θ) = ⟨log P(X, Z|θ)⟩_{q(Z|X)} + H[q] = log P(X|θ) − KL[q(Z) ‖ P(Z|X, θ)] ≤ ℓ(θ)

E-steps:

◮ Exact EM: q(Z) = argmax_q F = P(Z|X, θ)
  ◮ Saturates the bound: converges to a local maximum of the likelihood.
◮ (Factored) variational approximation:
  q(Z) = argmax_{q1(Z1)q2(Z2)} F = argmin_{q1(Z1)q2(Z2)} KL[q1(Z1) q2(Z2) ‖ P(Z|X, θ)]
  ◮ Increases the bound: converges, but not necessarily to ML.
◮ Other approximations: q(Z) ≈ P(Z|X, θ)
  ◮ Usually no guarantees, but if learning converges it may be more accurate than the factored approximation.
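The factored variational E-step can be illustrated on a toy two-variable discrete posterior (hypothetical numbers): coordinate-ascent mean-field updates never increase KL[q1 q2 ‖ P], i.e. never decrease F:

```python
import numpy as np

P = np.array([[0.30, 0.10],
              [0.15, 0.45]])          # hypothetical joint P(z1, z2 | X)

def kl_q_to_p(q1, q2):
    q = np.outer(q1, q2)              # factored q(z1, z2) = q1(z1) q2(z2)
    return float(np.sum(q * np.log(q / P)))

q1 = np.array([0.5, 0.5])
q2 = np.array([0.5, 0.5])
kls = [kl_q_to_p(q1, q2)]
for _ in range(20):
    # mean-field updates: q1(z1) proportional to exp( sum_z2 q2(z2) log P(z1,z2) )
    q1 = np.exp(np.log(P) @ q2); q1 /= q1.sum()
    q2 = np.exp(np.log(P).T @ q1); q2 /= q2.sum()
    kls.append(kl_q_to_p(q1, q2))
```

Each update is the exact coordinate minimiser of this KL, which is why the iteration converges, though not necessarily to the exact posterior.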

SLIDE 35

Approximating the posterior

Linearisation (and the local Laplace, sigma-point and other such approaches) can seem ad hoc. A more principled approach might look for the approximate q that is closest to P in some sense:

q = argmin_{q∈Q} D(P ↔ q)

Open choices:

◮ form of the divergence D
◮ nature of the constraint space Q

◮ Variational methods: D = KL[q ‖ P].
◮ Choosing Q = {tree-factored distributions} leads to efficient message passing.
◮ Can we use other divergences?

SLIDE 43

The other KL

What about the ‘other’ KL (q = argmin KL[P ‖ q])? For a factored approximation, the (clique) marginals obtained by minimising this KL are correct:

argmin_{q_i} KL[ P(Z|X) ‖ ∏_j q_j(Z_j|X) ] = argmax_{q_i} ∫ dZ P(Z|X) log ∏_j q_j(Z_j|X)
  = argmax_{q_i} Σ_j ∫ dZ P(Z|X) log q_j(Z_j|X)
  = argmax_{q_i} ∫ dZ_i P(Z_i|X) log q_i(Z_i|X)
  = P(Z_i|X)

and the marginals are what we need for learning (although if factored over disjoint sets, as in the variational approximation, some cliques will be missing). Perversely, this means finding the best q for this KL is intractable! But it raises the hope that approximate minimisation might still yield useful results.
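The result above — under KL[P ‖ q] the best factored q is the product of the exact marginals — can be checked numerically on a toy joint (hypothetical numbers):

```python
import numpy as np

P = np.array([[0.30, 0.10],
              [0.15, 0.45]])          # hypothetical joint P(z1, z2 | X)

def kl_p_to_q(q1, q2):
    q = np.outer(q1, q2)
    return float(np.sum(P * np.log(P / q)))

# the claimed optimum: the exact marginals of P
m1, m2 = P.sum(axis=1), P.sum(axis=0)
best = kl_p_to_q(m1, m2)

# no other factored candidate does better: KL[P || q1 q2] decomposes as
# I(z1; z2) + KL[m1 || q1] + KL[m2 || q2], so the marginals are optimal
rng = np.random.default_rng(0)
for _ in range(200):
    a, b = rng.dirichlet(np.ones(2)), rng.dirichlet(np.ones(2))
    assert kl_p_to_q(a, b) >= best - 1e-9
```

The residual `best` is exactly the mutual information of P, the irreducible cost of forcing a factored form on a dependent posterior.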

SLIDE 50

Approximate optimisation

The posterior distribution in a graphical model is a (normalised) product of factors:

P(Z|X) = P(Z, X)/P(X) = (1/Z) ∏_i P(Z_i | pa(Z_i)) ∝ ∏_{i=1}^N f_i(Z_i)

where the Z_i are not necessarily disjoint. In the language of EP the f_i are called sites.

Consider q with the same factorisation, but with potentially approximated sites:

q(Z) ≝ ∏_{i=1}^N f̃_i(Z_i)

We would like to minimise (at least in some sense) KL[P ‖ q]. Possible optimisations:

min_{{f̃_i}} KL[ ∏_{i=1}^N f_i(Z_i) ‖ ∏_{i=1}^N f̃_i(Z_i) ]   (global: intractable)

min_{f̃_i} KL[ f_i(Z_i) ‖ f̃_i(Z_i) ]   (local, fixed: simple, inaccurate)

min_{f̃_i} KL[ f_i(Z_i) ∏_{j≠i} f̃_j(Z_j) ‖ f̃_i(Z_i) ∏_{j≠i} f̃_j(Z_j) ]   (local, contextual: iterative, accurate) ← EP
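The difference between the "local, fixed" and "local, contextual" fits can be seen on a 1D toy site (hypothetical numbers), moment-matching on a grid. Matched alone, a bimodal site gives a broad zero-mean Gaussian; matched in the context of a cavity, it tracks the mode the cavity supports:

```python
import numpy as np

z = np.linspace(-8.0, 8.0, 4001)
dz = z[1] - z[0]

def moments(p):
    # mean and variance of an unnormalised density tabulated on the grid
    p = p / (p.sum() * dz)
    m = (z * p).sum() * dz
    v = ((z - m) ** 2 * p).sum() * dz
    return m, v

gauss = lambda mu, var: np.exp(-(z - mu) ** 2 / (2 * var))
site = 0.5 * gauss(-2.0, 0.3) + 0.5 * gauss(2.0, 0.3)   # bimodal site f_i
cavity = gauss(1.5, 0.25)                               # cavity term

m_fix, v_fix = moments(site)            # local, fixed: match f_i alone
m_ctx, v_ctx = moments(site * cavity)   # local, contextual: match f_i times cavity
```

The fixed fit straddles both modes (mean ≈ 0, variance ≈ 4.3); the contextual fit, which is what EP moment-matches, concentrates near the mode at +2.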
SLIDE 54

Expectation? Propagation?

EP is really two ideas:

◮ Approximation of factors.
  ◮ Usually by "projection" to exponential families.
  ◮ This involves finding expected sufficient statistics, hence expectation.
◮ Local divergence minimisation in the context of other factors.
  ◮ This leads to a message-passing approach, hence propagation.

SLIDE 55

Local updates

Each EP update involves a KL minimisation:

f̃_i^new(Z_i) ← argmin_{f∈{f̃}} KL[ f_i(Z_i) q_¬i(Z) ‖ f(Z_i) q_¬i(Z) ],   q_¬i(Z) ≝ ∏_{j≠i} f̃_j(Z_j)

Write q_¬i(Z) = q_¬i(Z_i) q_¬i(Z_¬i|Z_i), with Z_¬i ≝ Z \ Z_i. Then:

min_f KL[ f_i(Z_i) q_¬i(Z) ‖ f(Z_i) q_¬i(Z) ]
  = max_f ∫ dZ_i dZ_¬i f_i(Z_i) q_¬i(Z) log[ f(Z_i) q_¬i(Z) ]
  = max_f ∫ dZ_i dZ_¬i f_i(Z_i) q_¬i(Z_i) q_¬i(Z_¬i|Z_i) [ log f(Z_i) q_¬i(Z_i) + log q_¬i(Z_¬i|Z_i) ]
  = max_f ∫ dZ_i f_i(Z_i) q_¬i(Z_i) log[ f(Z_i) q_¬i(Z_i) ] ∫ dZ_¬i q_¬i(Z_¬i|Z_i)
  = min_f KL[ f_i(Z_i) q_¬i(Z_i) ‖ f(Z_i) q_¬i(Z_i) ]

q_¬i(Z_i) is sometimes called the cavity distribution.

SLIDE 56

Expectation Propagation (EP)

Input: f_1(Z_1) … f_N(Z_N)
Initialise: f̃_1(Z_1) = argmin_{f∈{f̃}} KL[f_1(Z_1) ‖ f(Z_1)];  f̃_i(Z_i) = 1 for i > 1;  q(Z) ∝ ∏_i f̃_i(Z_i)
repeat
  for i = 1 … N do
    Delete:  q_¬i(Z) ← q(Z) / f̃_i(Z_i) = ∏_{j≠i} f̃_j(Z_j)
    Project: f̃_i^new(Z_i) ← argmin_{f∈{f̃}} KL[ f_i(Z_i) q_¬i(Z_i) ‖ f(Z_i) q_¬i(Z_i) ]
    Include: q(Z) ← f̃_i^new(Z_i) q_¬i(Z)
  end for
until convergence
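A complete worked instance of this loop is EP on the "clutter" problem — a standard illustration, not from the slides: θ ~ N(0, 100), and each x_i is N(θ, 1) with probability 1−w or clutter N(0, a) with probability w. The sites are the non-Gaussian likelihood terms; the Project step uses the standard closed-form moments of the tilted distribution:

```python
import numpy as np

def npdf(x, m, v):
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

def ep_clutter(x, w=0.5, a=10.0, prior_var=100.0, sweeps=20):
    n = len(x)
    r = np.zeros(n)                   # approximate-site precisions
    b = np.zeros(n)                   # approximate-site precision * mean
    prec, bq = 1.0 / prior_var, 0.0   # q(theta) natural parameters
    for _ in range(sweeps):
        for i in range(n):
            # Delete: form the cavity by removing site i from q
            pc, bc = prec - r[i], bq - b[i]
            if pc <= 0:               # skip if cavity improper (standard guard)
                continue
            mc, vc = bc / pc, 1.0 / pc
            # Project: moment-match the tilted distribution f_i * cavity
            rho = (1 - w) * npdf(x[i], mc, vc + 1)
            rho = rho / (rho + w * npdf(x[i], 0.0, a))
            mt = mc + rho * vc * (x[i] - mc) / (vc + 1)
            vt = vc - rho * vc ** 2 / (vc + 1) \
                 + rho * (1 - rho) * vc ** 2 * (x[i] - mc) ** 2 / (vc + 1) ** 2
            # Include: new site = matched moments with the cavity divided out
            prec, bq = 1.0 / vt, mt / vt
            r[i], b[i] = prec - pc, bq - bc
    return bq / prec, 1.0 / prec      # posterior mean and variance

x = np.array([1.8, 2.2, 2.1, 1.9, 8.0])   # last point is clutter
m, v = ep_clutter(x)
```

The clutter point gets a nearly flat approximate site (its responsibility ρ is tiny), so the posterior concentrates near θ ≈ 2 despite the outlier.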

SLIDE 62

Message Passing

◮ The cavity distribution (in a tree) can be further broken down into a product of terms

from each neighbouring clique: q¬i(Zi) =

  • j∈ne(i)

Mj→i(Zj ∩ Zi)

◮ Once the ith site has been approximated, the messages can be passed on to

neighbouring cliques by marginalising to the shared variables (SSM example follows).

⇒ belief propagation.

◮ In loopy graphs, we can use loopy belief propagation. In that case

q¬i(Zi) =

  • j∈ne(i)

Mj→i(Zj ∩ Zi) becomes an approximation to the true cavity distribution (or we can recast the approximation directly in terms of messages ⇒ later lecture).

◮ For some approximations (e.g. Gaussian) may be able to compute true loopy cavity

using approximate sites, even if computing exact message would have been intractable.

◮ In either case, message updates can be scheduled in any order. ◮ No guarantee of convergence (but see “power-EP” methods).
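The tree identity q¬i(Zi) = ∏_{j∈ne(i)} Mj→i(Zj ∩ Zi) can be checked directly on a small discrete chain. A minimal sketch (binary variables and made-up tabular site potentials, chosen only for illustration): the cavity around one pairwise site equals the outer product of the forward and backward messages into its variables.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n = 5                                    # binary chain z0 - z1 - ... - z4
sites = [rng.uniform(0.1, 1.0, (2, 2))   # site f_k(z_k, z_{k+1}), k = 0..3
         for _ in range(n - 1)]

k = 2  # cavity around the site coupling (z2, z3)

# brute force: marginalise the product of all *other* sites
cavity_bf = np.zeros((2, 2))
for z in product(range(2), repeat=n):
    val = np.prod([sites[j][z[j], z[j + 1]]
                   for j in range(n - 1) if j != k])
    cavity_bf[z[k], z[k + 1]] += val

# message passing: forward message into z_k, backward message into z_{k+1}
alpha = np.ones(2)
for j in range(k):
    alpha = alpha @ sites[j]
beta = np.ones(2)
for j in range(n - 2, k, -1):
    beta = sites[j] @ beta

cavity_mp = np.outer(alpha, beta)        # product of incoming messages
```

On a loopy graph the same product of messages would only approximate the cavity, which is the point of the bullet above.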

slide-63
SLIDE 63

EP for a NLSSM

[Graphical model: latent chain · · · → zi−1 → zi → zi+1 → · · · with observations xi−2 . . . xi+2]

P(zi | zi−1) = φi(zi, zi−1), e.g. exp(−‖zi − hs(zi−1)‖²/2σ²)
P(xi | zi) = ψi(zi), e.g. exp(−‖xi − ho(zi)‖²/2σ²)

Then fi(zi, zi−1) = φi(zi, zi−1)ψi(zi). As φi and ψi are non-linear, inference is not generally tractable.

Assume f̃i(zi, zi−1) is Gaussian. Then

q¬i(zi, zi−1) = ∫ dz1 . . . dzi−2 dzi+1 . . . dzn ∏_{i′≠i} f̃i′(zi′, zi′−1)
= [ ∫ dz1 . . . dzi−2 ∏_{i′<i} f̃i′(zi′, zi′−1) ] [ ∫ dzi+1 . . . dzn ∏_{i′>i} f̃i′(zi′, zi′−1) ]
= αi−1(zi−1) βi(zi)

with both α and β Gaussian.

f̃i(zi, zi−1) = argmin_{f∈N} KL[ φi(zi, zi−1)ψi(zi) αi−1(zi−1)βi(zi) ‖ f(zi, zi−1) αi−1(zi−1)βi(zi) ]
slide-67
SLIDE 67

NLSSM EP message updates

Write P(zi−1, zi) ∝ φi(zi, zi−1)ψi(zi) αi−1(zi−1)βi(zi) for the tilted joint. Then

f̃i(zi, zi−1) = argmin_{f∈N} KL[ φi(zi, zi−1)ψi(zi) αi−1(zi−1)βi(zi) ‖ f(zi, zi−1) αi−1(zi−1)βi(zi) ]

is found by first projecting the tilted joint onto a Gaussian,

P̃(zi−1, zi) = argmin_{P′∈N} KL[ P(zi−1, zi) ‖ P′(zi−1, zi) ],

and then dividing out the cavity:

f̃i(zi, zi−1) = P̃(zi−1, zi) / ( αi−1(zi−1) βi(zi) )

The updated messages follow by marginalisation:

αi(zi) = ∫ dz1 . . . dzi−1 ∏_{i′≤i} f̃i′(zi′, zi′−1) = ∫ dzi−1 αi−1(zi−1) f̃i(zi, zi−1) = (1/βi(zi)) ∫ dzi−1 P̃(zi−1, zi)

βi−1(zi−1) = ∫ dzi . . . dzn ∏_{i′≥i} f̃i′(zi′, zi′−1) = ∫ dzi βi(zi) f̃i(zi, zi−1) = (1/αi−1(zi−1)) ∫ dzi P̃(zi−1, zi)

[Figure: the Gaussian cavity terms αi−1 and βi around the exact site f give the tilted joint P; projection yields the Gaussian P̃, from which f̃i, αi and βi−1 are read off.]
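The project-then-divide update can be sketched numerically for a one-dimensional NLSSM site. A minimal sketch, assuming illustrative choices throughout: hs = sin, ho = tanh, the noise level, the observation, and the Gaussian cavity parameters are all made up; the tilted joint is handled by brute-force grid quadrature rather than anything the lecture prescribes.

```python
import numpy as np

# all parameters below are made-up illustrative values
hs = np.sin                 # nonlinear state dynamics h_s
ho = np.tanh                # nonlinear observation map h_o
sigma2 = 0.3
x_obs = 0.4                 # current observation x_i
a_mean, a_var = 0.1, 0.5    # Gaussian cavity alpha_{i-1}(z_{i-1})
b_mean, b_var = 0.2, 0.8    # Gaussian cavity beta_i(z_i)

g = np.linspace(-6, 6, 801)
zp, zc = np.meshgrid(g, g, indexing="ij")   # grid over (z_{i-1}, z_i)

def normal(x, m, v):
    return np.exp(-(x - m)**2 / (2 * v)) / np.sqrt(2 * np.pi * v)

# tilted joint: phi_i * psi_i * alpha_{i-1} * beta_i (then normalised)
tilted = (normal(zc, hs(zp), sigma2)        # phi_i(z_i, z_{i-1})
          * normal(x_obs, ho(zc), sigma2)   # psi_i(z_i)
          * normal(zp, a_mean, a_var)
          * normal(zc, b_mean, b_var))
tilted /= tilted.sum()

# project: moment-match a Gaussian P~ on (z_{i-1}, z_i)
m1 = (tilted * zp).sum(); m2 = (tilted * zc).sum()
v1 = (tilted * (zp - m1)**2).sum()
v2 = (tilted * (zc - m2)**2).sum()
c12 = (tilted * (zp - m1) * (zc - m2)).sum()

# divide out the cavity: in natural (information) parameters the site
# f~_i = P~ / (alpha * beta) is a subtraction of the cavity precisions
Lam = np.linalg.inv(np.array([[v1, c12], [c12, v2]]))
site_prec = Lam - np.diag([1 / a_var, 1 / b_var])
site_h = Lam @ np.array([m1, m2]) - np.array([a_mean / a_var, b_mean / b_var])
```

The division step is why approximate sites need not themselves be proper densities: site_prec can be indefinite even though P̃ is a valid Gaussian.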

slide-73
SLIDE 73

Moment Matching

Each EP update involves a KL minimisation:

f̃i^new(Zi) ← argmin_{f∈{f̃}} KL[ fi(Zi) q¬i(Z) ‖ f(Zi) q¬i(Z) ]

Usually, both q¬i(Zi) and f̃ are in the same exponential family. Let q(x) = (1/Z(θ)) e^{T(x)·θ}. Then

argmin_q KL[ p(x) ‖ q(x) ] = argmin_θ KL[ p(x) ‖ (1/Z(θ)) e^{T(x)·θ} ]
= argmin_θ − ∫ dx p(x) log[ (1/Z(θ)) e^{T(x)·θ} ]
= argmin_θ − ∫ dx p(x) T(x)·θ + log Z(θ)

Differentiating with respect to θ:

∂/∂θ [·] = − ∫ dx p(x) T(x) + (1/Z(θ)) ∂/∂θ ∫ dx e^{T(x)·θ}
= −⟨T(x)⟩p + (1/Z(θ)) ∫ dx e^{T(x)·θ} T(x)
= −⟨T(x)⟩p + ⟨T(x)⟩q

So the minimum is found by matching sufficient stats: ⟨T(x)⟩q = ⟨T(x)⟩p. This is usually moment matching.
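The matching result can be checked numerically: fit a Gaussian to a target p by matching mean and variance, and verify that perturbing the Gaussian parameters only increases KL[p ‖ q]. A minimal sketch; the mixture target and grid are made up for illustration.

```python
import numpy as np

# target p: a two-component Gaussian mixture (made-up parameters)
x = np.linspace(-12, 12, 4801)
dx = x[1] - x[0]

def normal(x, m, v):
    return np.exp(-(x - m)**2 / (2 * v)) / np.sqrt(2 * np.pi * v)

p = 0.3 * normal(x, -2.0, 0.5) + 0.7 * normal(x, 1.5, 1.0)

# matched sufficient statistics <T(x)>_p for a Gaussian: mean and variance
m = (p * x).sum() * dx
v = (p * (x - m)**2).sum() * dx

def kl(q):
    """KL[p || q] by quadrature on the grid"""
    return (p * np.log(p / q)).sum() * dx

kl_matched = kl(normal(x, m, v))
```

Any Gaussian with unmatched moments gives a larger KL, which is exactly the stationarity condition derived above.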

slide-74
SLIDE 74

Numerical issues

How do we calculate ⟨T(x)⟩p? Often analytically tractable, but even if not it requires only a (relatively) low-dimensional integral:

◮ Quadrature methods.
 ◮ Classical Gaussian quadrature (the same Gauss, but nothing to do with the distribution) gives an iterative version of sigma-point methods.
 ◮ Gives positive definite joints, but is not guaranteed to give positive definite messages.
 ◮ Heuristics include skipping non-positive-definite steps, or damping messages by interpolation or by exponentiating to a power < 1.
 ◮ Other quadrature approaches (e.g. GP quadrature) may be more accurate, and may allow a formal constraint to the positive-definite cone.

◮ Laplace approximation.
 ◮ Equivalent to Laplace propagation.
 ◮ As long as messages remain positive definite, this will converge to the global Laplace approximation.
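For exponential-family messages, damping by exponentiating to a power ε < 1 and multiplying by the old message to the power 1 − ε is exactly a convex combination in natural-parameter space. A minimal sketch for a one-dimensional Gaussian (the parameter values are made up):

```python
import numpy as np

# Gaussian in natural parameters: N(m, v) ~ exp(h*x - 0.5*lam*x^2),
# with h = m/v and lam = 1/v
def nat(m, v):
    return np.array([m / v, 1.0 / v])

old = nat(0.0, 2.0)   # previous site approximation (made-up numbers)
new = nat(1.0, 0.5)   # freshly moment-matched update (made-up numbers)
eps = 0.3             # damping factor

# N_old^{1-eps} * N_new^{eps}: raising to a power scales the natural
# parameters, so damping interpolates them linearly
damped = (1 - eps) * old + eps * new

prec = damped[1]          # damped precision (stays positive here)
mean = damped[0] / prec   # damped mean
```

Interpolating natural parameters keeps the precision in the positive cone whenever both endpoints are proper, which is why damping helps with the non-positive-definite message problem above.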

slide-83
SLIDE 83

EP for Gaussian process classification

EP provides a successful framework for Gaussian-process modelling of non-Gaussian observations (e.g. for classification).

[Graphical model: inputs x1 . . . xn index latent function values g1 . . . gn, jointly Gaussian with covariance K; observations y1 . . . yn depend on the corresponding gi; a test point x′ introduces g′ and y′]

Recall:

◮ A GP defines a multivariate Gaussian distribution on any finite subset of random vars {g1 . . . gn} drawn from a (usually uncountable) potential set indexed by "inputs" xi.

◮ The Gaussian parameters depend on the inputs: (µ = [µ(xi)], Σ = [K(xi, xj)]).

◮ If we think of the gs as function values, a GP provides a prior over functions.

◮ In a GP regression model, noisy observations yi are conditionally independent given gi.

◮ No parameters to learn (though often hyperparameters); instead, we make predictions on test data directly: [assuming µ = 0, and matrix Σ incorporates diagonal noise]

P(y′|x′, D) = N( Σx′,X Σ⁻¹X,X y , Σx′,x′ − Σx′,X Σ⁻¹X,X ΣX,x′ )
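The regression predictive above is a couple of linear solves. A minimal sketch with a squared-exponential kernel and made-up data (kernel, lengthscale, and noise level are illustrative choices, not part of the lecture):

```python
import numpy as np

def k_se(a, b, ell=1.0):
    """squared-exponential covariance (illustrative lengthscale ell)"""
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

rng = np.random.default_rng(1)
X = np.linspace(-3, 3, 20)
y = np.sin(X) + 0.05 * rng.standard_normal(20)   # noisy observations of sin
noise = 0.05**2

# mu = 0, and Sigma incorporates diagonal noise, as on the slide
Sigma = k_se(X, X) + noise * np.eye(20)
xs = np.array([0.5, 1.7])                        # test inputs x'

pred_mean = k_se(xs, X) @ np.linalg.solve(Sigma, y)
pred_cov = (k_se(xs, xs) + noise * np.eye(2)
            - k_se(xs, X) @ np.linalg.solve(Sigma, k_se(X, xs)))
```

With a Gaussian likelihood this is exact; the EP machinery below is needed precisely when the likelihood (e.g. classification) is not Gaussian.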

slide-89
SLIDE 89

GP EP updates

◮ We can write the GP joint on gi and yi as a factor graph:

P(g1 . . . gn, y1 . . . yn) = N(g1 . . . gn | 0, K) ∏i N(yi | gi, σ²)

with f0(G) = N(g1 . . . gn | 0, K) and site factors fi(gi) = N(yi | gi, σ²).

◮ The same factorisation applies to non-Gaussian P(yi|gi) (e.g. P(yi = 1) = 1/(1 + e^−gi)).

◮ EP: approximate each non-Gaussian fi(gi) by a Gaussian f̃i(gi) = N(µ̃i, ψ̃²i).

◮ q¬i(gi) can be constructed by the usual GP marginalisation. If Σ = K + diag[ψ̃²1 . . . ψ̃²n], then

q¬i(gi) = N( Σi,¬i Σ⁻¹¬i,¬i µ̃¬i , Ki,i − Σi,¬i Σ⁻¹¬i,¬i Σ¬i,i )

◮ The EP updates thus require calculating Gaussian expectations of fi(g) g^{1,2}:

f̃i^new(gi) = N( ∫dg q¬i(g) fi(g) g , ∫dg q¬i(g) fi(g) g² − (µ̃i^new)² ) / q¬i(gi)

(with the tilted moments normalised by Zi = ∫dg q¬i(g) fi(g))
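The update loop can be sketched end to end for binary classification with a logistic likelihood. This is a minimal illustration, not the lecture's implementation: the kernel, data, quadrature grid, sweep count, and the precision clipping (one of the heuristics mentioned earlier) are all assumed choices, and the tilted moments are computed by brute-force grid quadrature.

```python
import numpy as np

def k_se(a, b):
    """squared-exponential kernel (unit scale, illustrative)"""
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2)

rng = np.random.default_rng(0)
n = 8
X = np.sort(rng.uniform(-3, 3, n))
y = np.where(X > 0, 1.0, -1.0)          # labels in {-1, +1} (made up)
K = k_se(X, X) + 1e-6 * np.eye(n)       # jitter for numerical stability

# site natural parameters: f~_i(g) ~ exp(tau_i * g - 0.5 * nu_i * g^2)
tau, nu = np.zeros(n), np.zeros(n)

g = np.linspace(-8, 8, 2001)            # quadrature grid for tilted moments
dg = g[1] - g[0]

for sweep in range(20):
    for i in range(n):
        # current posterior approximation q(G) = N(mu, A)
        A = np.linalg.inv(np.linalg.inv(K) + np.diag(nu))
        mu = A @ tau
        # cavity q_{-i}(g_i): subtract site i in natural parameters
        cav_prec = 1.0 / A[i, i] - nu[i]
        cav_h = mu[i] / A[i, i] - tau[i]
        if cav_prec <= 0:               # heuristic: skip improper cavities
            continue
        cm, cv = cav_h / cav_prec, 1.0 / cav_prec
        # tilted distribution: cavity x logistic likelihood, by quadrature
        t = np.exp(-0.5 * (g - cm)**2 / cv) / (1.0 + np.exp(-y[i] * g))
        t /= t.sum() * dg
        tm = (t * g).sum() * dg
        tv = (t * (g - tm)**2).sum() * dg
        # moment matching: new site = tilted / cavity, clipped to stay proper
        nu[i] = max(1.0 / tv - cav_prec, 1e-8)
        tau[i] = tm / tv - cav_h

A = np.linalg.inv(np.linalg.inv(K) + np.diag(nu))
mu = A @ tau                            # approximate posterior mean of g
```

Each inner step is exactly the slide's recipe: form the cavity, moment-match the tilted distribution, and divide the cavity back out.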

slide-95
SLIDE 95

EP GP prediction

◮ Once approximate site potentials have stabilised, they can be used to make predictions.

◮ Introducing a test point changes K, but does not affect the marginal P(g1 . . . gn) (by consistency of the GP).

◮ The unobserved output factor provides no information about g′ (⇒ a constant factor on g′).

◮ Thus no change is needed to the approximating potentials f̃i.

◮ Predictions are obtained by marginalising the approximation: [let Ψ̃ = diag[ψ̃²1 . . . ψ̃²n]]

P(y′|x′, D) = ∫ dg′ P(y′|g′) N( g′ | Kx′,X (KX,X + Ψ̃)⁻¹ µ̃ , Kx′,x′ − Kx′,X (KX,X + Ψ̃)⁻¹ KX,x′ )
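A quick sanity check on this formula: if the likelihood really were Gaussian, the exact sites would be µ̃i = yi and ψ̃²i = σ², and the EP predictive should coincide with the standard GP regression predictor. A minimal sketch with made-up data:

```python
import numpy as np

def k_se(a, b):
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2)

rng = np.random.default_rng(2)
X = np.linspace(-2, 2, 10)
y = np.cos(X) + 0.1 * rng.standard_normal(10)
sigma2 = 0.1**2

# exact Gaussian sites: mu~ = y, Psi~ = sigma^2 * I
mu_tilde = y
Psi = sigma2 * np.eye(10)
xs = np.array([0.3])                    # test input x'

K = k_se(X, X)
ep_mean = k_se(xs, X) @ np.linalg.solve(K + Psi, mu_tilde)
ep_var = (k_se(xs, xs)
          - k_se(xs, X) @ np.linalg.solve(K + Psi, k_se(X, xs)))

# standard GP regression predictive, for comparison
gp_mean = k_se(xs, X) @ np.linalg.solve(K + sigma2 * np.eye(10), y)
```

For classification the sites come from the EP loop instead, but the predictive formula is used unchanged.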

slide-100
SLIDE 100

Normalisers

◮ Approximate sites determined by moment matching are naturally normalised.

◮ For posteriors, sufficient to normalise the product after convergence.
 ◮ Often straightforward for exponential-family approximations.

◮ To compute the likelihood, need to keep track of site integrals:
 ◮ minimising the "unnormalised KL":

KL[p‖q] = ∫ dx p(x) log( p(x)/q(x) ) + ∫ dx ( q(x) − p(x) )

incorporates a normaliser into each f̃ (match the zeroth moment, along with the suff stats), as well as the overall normaliser of ∏i f̃i(Zi).
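The zeroth-moment claim can be checked directly: for q = c·q0 with q0 normalised, minimising the unnormalised KL over the scale c recovers ∫dx p(x). A minimal sketch with a made-up unnormalised target of mass 0.7:

```python
import numpy as np

x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]
q0 = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)            # normalised N(0,1)
p = 0.7 * np.exp(-0.5 * (x - 1.0)**2) / np.sqrt(2 * np.pi)  # mass Z_p = 0.7

def ukl(c):
    """unnormalised KL between p and the scaled density c * q0"""
    q = c * q0
    return ((p * np.log(p / q)).sum() + (q - p).sum()) * dx

# minimise over the scale by a simple grid search
cs = np.linspace(0.1, 2.0, 1901)
best = cs[np.argmin([ukl(c) for c in cs])]
```

Analytically, d/dc [−Z_p log c + c] = 0 gives c = Z_p, which is the zeroth-moment matching described above.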

slide-106
SLIDE 106

Alpha divergences and Power EP

◮ Alpha divergences:

Dα[p‖q] = 1/(α(1 − α)) ∫ dx [ α p(x) + (1 − α) q(x) − p(x)^α q(x)^(1−α) ]

Special cases:

D−1[p‖q] = (1/2) ∫ dx (p(x) − q(x))² / p(x)

limα→0 Dα[p‖q] = KL[q‖p]    (note: limα→0 ((p(x)/q(x))^α − 1)/α = log(p(x)/q(x)))

D1/2[p‖q] = 2 ∫ dx ( p(x)^(1/2) − q(x)^(1/2) )²

limα→1 Dα[p‖q] = KL[p‖q]

D2[p‖q] = (1/2) ∫ dx (p(x) − q(x))² / q(x)

◮ Local (EP) minimisation gives fixed-point updates that blend messages (to power α) with previous site approximations:

f̃i^new = argmin_{f∈{f̃}} KL[ fi(Zi)^α f̃i(Zi)^(1−α) q¬i(Z) ‖ f(Zi) q¬i(Z) ]

◮ Small changes (for α < 1) lead to more stable updates, and more reliable convergence.
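The limiting cases of Dα can be verified numerically for two Gaussians (made-up parameters, brute-force grid quadrature):

```python
import numpy as np

# two made-up Gaussians on a quadrature grid
x = np.linspace(-12, 12, 6001)
dx = x[1] - x[0]

def normal(x, m, v):
    return np.exp(-(x - m)**2 / (2 * v)) / np.sqrt(2 * np.pi * v)

p = normal(x, 0.0, 1.0)
q = normal(x, 0.5, 1.5)

def alpha_div(a):
    """D_alpha[p || q] by quadrature (valid for a not in {0, 1})"""
    return (a * p + (1 - a) * q
            - p**a * q**(1 - a)).sum() * dx / (a * (1 - a))

kl_pq = (p * np.log(p / q)).sum() * dx          # KL[p || q]
kl_qp = (q * np.log(q / p)).sum() * dx          # KL[q || p]
hellinger = 2 * ((np.sqrt(p) - np.sqrt(q))**2).sum() * dx   # D_{1/2}
```

As α → 1 the divergence approaches KL[p‖q] (the EP direction) and as α → 0 it approaches KL[q‖p] (the variational direction), with α = 1/2 giving the symmetric Hellinger case.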