Probabilistic & Unsupervised Learning
Factored Variational Approximations and Variational Bayes

Maneesh Sahani (maneesh@gatsby.ucl.ac.uk)
Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept of Computer Science, University College London


  1. Free-energy-based variational approximation
What if finding the expected sufficient statistics under $P(Z|X, \theta)$ is computationally intractable?
For the generalised EM algorithm, we argued that intractable maximisations could be replaced by gradient M-steps.
◮ Each step increases the likelihood.
◮ A fixed point of the gradient M-step must be at a mode of the expected log-joint.
For the E-step we could:
◮ Parameterise $q = q_\rho(Z)$ and take a gradient step in $\rho$.
◮ Assume some simplified form for $q$, usually factored: $q = \prod_i q_i(Z_i)$, where the $Z_i$ partition $Z$, and maximise within this form.
In either case, we choose $q$ from within a limited set $\mathcal{Q}$:
VE step: maximise $F(q, \theta)$ wrt the constrained latent distribution, given parameters:
$$q^{(k)}(Z) := \operatorname*{argmax}_{q(Z) \in \mathcal{Q}} F\big(q(Z), \theta^{(k-1)}\big) \qquad \text{($q(Z) \in \mathcal{Q}$ is the constraint)}$$
M step: unchanged:
$$\theta^{(k)} := \operatorname*{argmax}_{\theta} F\big(q^{(k)}(Z), \theta\big) = \operatorname*{argmax}_{\theta} \int q^{(k)}(Z) \log p(Z, X|\theta)\, dZ$$
Unlike in GEM, the fixed point may not be at an unconstrained optimum of $F$.

  2. What do we lose?
What does restricting $q$ to $\mathcal{Q}$ cost us?
◮ Recall that the free energy is bounded above by Jensen: $F(q, \theta) \le \ell(\theta_{\mathrm{ML}})$. Thus, as long as every step increases $F$, convergence is still guaranteed.
◮ But, since $P(Z|X, \theta^{(k)})$ may not lie in $\mathcal{Q}$, we no longer saturate the bound after the E-step. Thus, the likelihood may not increase on each full EM step:
$$\ell\big(\theta^{(k-1)}\big) \;\underset{\text{E step}}{\geq}\; F\big(q^{(k)}, \theta^{(k-1)}\big) \;\underset{\text{M step}}{\leq}\; F\big(q^{(k)}, \theta^{(k)}\big) \;\underset{\text{Jensen}}{\leq}\; \ell\big(\theta^{(k)}\big)$$
(the first relation is no longer an equality, as it was for exact EM).
◮ This means we may not (and usually won't) converge to a maximum of $\ell$. The hope is that by increasing a lower bound on $\ell$ we will find a decent solution.
[Note that if $P(Z|X, \theta_{\mathrm{ML}}) \in \mathcal{Q}$, then $\theta_{\mathrm{ML}}$ is a fixed point of the variational algorithm.]

  3. KL divergence
Recall that
$$F(q, \theta) = \big\langle \log P(X, Z|\theta) \big\rangle_{q(Z)} + H[q]
 = \big\langle \log P(X|\theta) + \log P(Z|X, \theta) \big\rangle_{q(Z)} - \big\langle \log q(Z) \big\rangle_{q(Z)}
 = \big\langle \log P(X|\theta) \big\rangle_{q(Z)} - \mathrm{KL}\big[q \,\big\|\, P(Z|X, \theta)\big].$$
Thus the E step,
maximise $F(q, \theta)$ wrt the distribution over latents, given parameters:
$$q^{(k)}(Z) := \operatorname*{argmax}_{q(Z) \in \mathcal{Q}} F\big(q(Z), \theta^{(k-1)}\big),$$
is equivalent to the E step:
minimise $\mathrm{KL}[q \,\|\, p(Z|X, \theta)]$ wrt the distribution over latents, given parameters:
$$q^{(k)}(Z) := \operatorname*{argmin}_{q(Z) \in \mathcal{Q}} \int q(Z) \log \frac{q(Z)}{p(Z|X, \theta^{(k-1)})}\, dZ.$$
So, in each E step, the algorithm is trying to find the best approximation to $P(Z|X)$ in $\mathcal{Q}$ in a KL sense. This is related to ideas in information geometry. It also suggests generalisations to other distance measures.
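
The identity above is easy to check numerically. Below is a minimal sketch, assuming a toy discrete model (a single observation with one discrete latent; all sizes and values are illustrative): it builds an arbitrary joint $P(x, z)$, picks an arbitrary $q(z)$, and confirms that $F(q) = \log P(x) - \mathrm{KL}[q \,\|\, P(z|x)]$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: one observation x with a discrete latent z in {0, ..., K-1}.
# joint[k] = P(x, z=k | theta) for the observed x (a positive vector summing to P(x)).
K = 4
joint = rng.dirichlet(np.ones(K)) * 0.3          # sums to P(x) = 0.3

log_px = np.log(joint.sum())                     # log P(x | theta)
posterior = joint / joint.sum()                  # P(z | x, theta)

# An arbitrary (generally wrong) variational distribution q(z).
q = rng.dirichlet(np.ones(K))

free_energy = np.sum(q * np.log(joint)) - np.sum(q * np.log(q))   # <log P(x, z)>_q + H[q]
kl = np.sum(q * (np.log(q) - np.log(posterior)))                  # KL[q || P(z | x)]

print(free_energy, log_px - kl)                  # identical up to rounding
assert np.allclose(free_energy, log_px - kl)
```

Because the KL term is non-negative, the same few lines also confirm $F \le \log P(x)$, with equality only when $q$ equals the exact posterior.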

  4. Factored Variational E-step
The most common form of variational approximation partitions $Z$ into disjoint sets $Z_i$, with
$$\mathcal{Q} = \Big\{\, q(Z) = \prod_i q_i(Z_i) \,\Big\}.$$
In this case the E-step is itself iterative:
(Factored VE step)$_i$: maximise $F(q, \theta)$ wrt $q_i(Z_i)$, given the other $q_j$ and the parameters:
$$q_i^{(k)}(Z_i) := \operatorname*{argmax}_{q_i(Z_i)} F\Big(q_i(Z_i) \prod_{j \neq i} q_j(Z_j),\; \theta^{(k-1)}\Big).$$
◮ The $q_i$ updates are iterated to convergence to give a "complete" VE-step.
◮ In fact, every (VE)$_i$-step separately increases $F$, so any schedule of (VE)$_i$- and M-steps will converge. The choice can be dictated by practical issues (it is rarely efficient to fully converge the E-step before updating the parameters).

  5. Factored Variational E-step (general form)
The Factored Variational E-step has a general form. The free energy is
$$F\Big(\prod_j q_j(Z_j), \theta^{(k-1)}\Big) = \Big\langle \log P(X, Z|\theta^{(k-1)}) \Big\rangle_{\prod_j q_j(Z_j)} + \sum_j H\big[q_j(Z_j)\big]$$
$$= \int dZ_i\, q_i(Z_i) \Big\langle \log P(X, Z|\theta^{(k-1)}) \Big\rangle_{\prod_{j \neq i} q_j(Z_j)} + H[q_i] + \sum_{j \neq i} H[q_j].$$
Now, taking the variational derivative of the Lagrangian (enforcing normalisation of $q_i$):
$$\frac{\delta}{\delta q_i(Z_i)} \Big[ F + \lambda \Big( \int q_i - 1 \Big) \Big]
 = \Big\langle \log P(X, Z|\theta^{(k-1)}) \Big\rangle_{\prod_{j \neq i} q_j(Z_j)} - \log q_i(Z_i) - 1 + \lambda$$
$$(= 0) \;\Rightarrow\; q_i(Z_i) \;\propto\; \exp \Big\langle \log P(X, Z|\theta^{(k-1)}) \Big\rangle_{\prod_{j \neq i} q_j(Z_j)}.$$
In general, this depends only on the expected sufficient statistics under the $q_j$. Thus, again, we don't actually need the entire distributions, just the relevant expectations (now for approximate inference as well as for learning).
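
The update $q_i \propto \exp \langle \log P \rangle_{\prod_{j \neq i} q_j}$ can be exercised on a toy problem. The sketch below (all names and sizes are hypothetical) uses two discrete latents with an arbitrary joint, alternates the two closed-form updates, and checks that the free energy never decreases.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy joint P(x, z1, z2) for a fixed observed x: an arbitrary positive table over (z1, z2).
K1, K2 = 3, 4
logP = np.log(0.2 * rng.dirichlet(np.ones(K1 * K2)).reshape(K1, K2))   # log P(x, z1, z2)

def free_energy(q1, q2):
    # F = <log P(x, z1, z2)>_{q1 q2} + H[q1] + H[q2]
    return (q1[:, None] * q2[None, :] * logP).sum() \
           - (q1 * np.log(q1)).sum() - (q2 * np.log(q2)).sum()

q1 = np.full(K1, 1.0 / K1)
q2 = np.full(K2, 1.0 / K2)

F_old = -np.inf
for sweep in range(50):
    q1 = np.exp(logP @ q2); q1 /= q1.sum()     # q1(z1) ∝ exp <log P(x, z1, z2)>_{q2}
    q2 = np.exp(q1 @ logP); q2 /= q2.sum()     # q2(z2) ∝ exp <log P(x, z1, z2)>_{q1}
    F_new = free_energy(q1, q2)
    assert F_new >= F_old - 1e-12              # each coordinate update cannot decrease F
    F_old = F_new

print("converged free energy:", F_old)
print("true log P(x):        ", np.log(np.exp(logP).sum()))   # F lower-bounds this
```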

  6. Mean-field approximations
If $Z_i = z_i$ (i.e., $q$ is factored over all variables) then the variational technique is often called a "mean field" approximation.
◮ Suppose $P(X, Z)$ has sufficient statistics that are separable in the latent variables: e.g. the Boltzmann machine
$$P(X, Z) = \frac{1}{\mathcal{Z}} \exp\Big( \sum_{ij} W_{ij} s_i s_j + \sum_i b_i s_i \Big)$$
with some $s_i \in Z$ and the others observed.
◮ Expectations wrt a fully-factored $q$ distribute over all $s_i \in Z$:
$$\big\langle \log P(X, Z) \big\rangle_{\prod_i q_i} = \sum_{ij} W_{ij} \langle s_i \rangle_{q_i} \langle s_j \rangle_{q_j} + \sum_i b_i \langle s_i \rangle_{q_i} - \log \mathcal{Z}$$
(where $q_i$ for $s_i \in X$ is a delta function on the observed value).
◮ Thus, we can update each $q_i$ in turn given the means (or, in general, the mean sufficient statistics) of the others.
◮ Each variable sees the mean field imposed by its neighbours, and we update these fields until they all agree.
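
A minimal sketch of these mean-field updates for a Boltzmann machine with binary units $s_i \in \{0, 1\}$, assuming no self-connections (zero-diagonal $W$) and an arbitrary, purely illustrative split into observed and latent units. The update is the logistic-sigmoid fixed point $\mu_i \leftarrow \sigma\big(\sum_j (W_{ij} + W_{ji})\mu_j + b_i\big)$ implied by the expression above.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Boltzmann machine with binary units s_i in {0, 1}:
#   P(s) ∝ exp( sum_ij W_ij s_i s_j + sum_i b_i s_i ),  zero diagonal (no self-connections).
D = 8
W = 0.5 * rng.standard_normal((D, D))
np.fill_diagonal(W, 0.0)
b = rng.standard_normal(D)

observed = np.array([0, 1, 2])            # indices of visible units (hypothetical split)
x = np.array([1.0, 0.0, 1.0])             # their observed values
latent = np.setdiff1d(np.arange(D), observed)

# mu_i = <s_i>_{q_i}: delta functions on the observed values, 0.5 initial guess for latents.
mu = np.full(D, 0.5)
mu[observed] = x

for sweep in range(100):
    for i in latent:
        # field on unit i, given the mean sufficient statistics of all other units
        field_i = (W[i, :] + W[:, i]) @ mu + b[i]
        mu[i] = sigmoid(field_i)          # q_i(s_i = 1)

print("mean-field marginals for latent units:", mu[latent])
```

Sequential sweeps like this are coordinate updates, so each one cannot decrease $F$; fully parallel updates lose that guarantee.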

  7. Mean-field FHMM
[Figure: factorial HMM — $M = 3$ hidden Markov chains $s^{(m)}_{1:T}$ jointly generating observations $x_1, \ldots, x_T$.]
Use a fully factored (mean-field) approximation:
$$q(s^{1:M}_{1:T}) = \prod_{m,t} q^m_t(s^m_t).$$
The update for each factor is
$$q^m_t(s^m_t) \propto \exp \Big\langle \log P(s^{1:M}_{1:T}, x_{1:T}) \Big\rangle_{\prod_{(m',t') \neq (m,t)} q^{m'}_{t'}(s^{m'}_{t'})}$$
$$= \exp \Big\langle \sum_\tau \sum_\mu \log P(s^\mu_\tau | s^\mu_{\tau-1}) + \sum_\tau \log P(x_\tau | s^{1:M}_\tau) \Big\rangle_{q^{m'}_{t'},\, \neg(m,t)}$$
$$\propto \exp \Big( \big\langle \log P(s^m_t | s^m_{t-1}) \big\rangle_{q^m_{t-1}} + \big\langle \log P(x_t | s^{1:M}_t) \big\rangle_{q^{\neg m}_t} + \big\langle \log P(s^m_{t+1} | s^m_t) \big\rangle_{q^m_{t+1}} \Big).$$
Writing the exponential of the first two terms as $\alpha^m_t(i)$ and the exponential of the last term as $\beta^m_t(i)$, so that $q^m_t(i) \propto \alpha^m_t(i)\, \beta^m_t(i)$:
$$\alpha^m_t(i) \propto e^{\sum_j q^m_{t-1}(j) \log \Phi^m_{ji}} \cdot e^{\langle \log A_i(x_t) \rangle_{q^{\neg m}}}, \qquad
\beta^m_t(i) \propto e^{\sum_j q^m_{t+1}(j) \log \Phi^m_{ij}}.$$
Cf. forward–backward:
$$\alpha_t(i) \propto \sum_j \alpha_{t-1}(j)\, \Phi_{ji} \cdot A_i(x_t), \qquad
\beta_t(i) \propto \sum_j \Phi_{ij}\, A_j(x_{t+1})\, \beta_{t+1}(j).$$
◮ Yields a message-passing algorithm like forward–backward.
◮ Updates depend only on immediate neighbours in the chain.
◮ Chains couple only through the joint output.
◮ Multiple passes; messages depend on (approximate) marginals.
◮ Evidence does not appear explicitly in the backward message (cf. Kalman smoothing).

  8. Structured variational approximation
[Figure: a coupled time-series model with several interacting chains $A_t, B_t, C_t, D_t$.]
◮ $q(Z)$ need not be completely factorised.
◮ For example, suppose $Z$ can be partitioned into sets $Z_1$ and $Z_2$ such that computing the expected sufficient statistics under $P(Z_1|Z_2, X)$ and $P(Z_2|Z_1, X)$ would be tractable. ⇒ Then the factored approximation $q(Z) = q(Z_1)\, q(Z_2)$ is tractable.
◮ In particular, any factorisation of $q(Z)$ into a product of distributions on trees yields a tractable approximation.

  9. Structured FHMM
For the FHMM we can factor over the chains:
$$q(s^{1:M}_{1:T}) = \prod_m q^m(s^m_{1:T}).$$
$$q^m(s^m_{1:T}) \propto \exp \Big\langle \log P(s^{1:M}_{1:T}, x_{1:T}) \Big\rangle_{\prod_{m' \neq m} q^{m'}(s^{m'}_{1:T})}$$
$$= \exp \Big\langle \sum_t \sum_\mu \log P(s^\mu_t | s^\mu_{t-1}) + \sum_t \log P(x_t | s^{1:M}_t) \Big\rangle_{q^{m'},\, \neg m}$$
$$\propto \exp \Big( \sum_t \log P(s^m_t | s^m_{t-1}) + \sum_t \big\langle \log P(x_t | s^{1:M}_t) \big\rangle_{\prod_{m' \neq m} q^{m'}(s^{m'}_t)} \Big)$$
$$= \prod_t P(s^m_t | s^m_{t-1})\, e^{\langle \log P(x_t | s^{1:M}_t) \rangle_{\neg m}}.$$
This looks like a standard HMM joint, with a modified likelihood term ⇒ cycle through multiple forward–backward passes, updating the likelihood terms each time.
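
A compact sketch of this scheme, under an assumed linear-Gaussian emission model $x_t = \sum_m W^{(m)}_{:, s^m_t} + \varepsilon_t$, $\varepsilon_t \sim \mathcal N(0, \sigma^2 I)$ (the emission model, parameter values and sizes here are illustrative, not taken from the slides). For each chain, the expected log-likelihood under the other chains' marginals, with constants dropped, plays the role of a pseudo-emission probability inside an otherwise standard forward–backward pass.

```python
import numpy as np

rng = np.random.default_rng(3)

def forward_backward(pi, Phi, lik):
    """Exact HMM smoothing. Phi[j, i] = P(s_t = i | s_{t-1} = j); lik[t, i] is a
    (pseudo-)emission weight for state i at time t. Returns marginals q[t, i]."""
    T, K = lik.shape
    alpha = np.zeros((T, K))
    beta = np.zeros((T, K))
    a = pi * lik[0]
    alpha[0] = a / a.sum()
    for t in range(1, T):
        a = (alpha[t - 1] @ Phi) * lik[t]
        alpha[t] = a / a.sum()
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        b = Phi @ (lik[t + 1] * beta[t + 1])
        beta[t] = b / b.sum()
    q = alpha * beta
    return q / q.sum(axis=1, keepdims=True)

# Hypothetical FHMM: M binary chains, D-dimensional output x_t = sum_m W[m][:, s^m_t] + noise.
M, K, T, D, sigma2 = 3, 2, 100, 4, 0.5
W = [rng.standard_normal((D, K)) for _ in range(M)]
Phi = []
for m in range(M):
    P = np.full((K, K), 0.1 / (K - 1))
    np.fill_diagonal(P, 0.9)
    Phi.append(P)
pi = [np.full(K, 1.0 / K) for _ in range(M)]

# simulate a sequence so the sketch has data to run on
s = np.zeros((M, T), dtype=int)
for m in range(M):
    s[m, 0] = rng.choice(K, p=pi[m])
    for t in range(1, T):
        s[m, t] = rng.choice(K, p=Phi[m][s[m, t - 1]])
X = sum(W[m][:, s[m]] for m in range(M)).T + np.sqrt(sigma2) * rng.standard_normal((T, D))

# structured VB E-step: q(s^{1:M}_{1:T}) = prod_m q^m(s^m_{1:T}); cycle over the chains
q = [np.full((T, K), 1.0 / K) for _ in range(M)]
for sweep in range(10):
    for m in range(M):
        mu_rest = sum(q[mp] @ W[mp].T for mp in range(M) if mp != m)     # (T, D) expected output of other chains
        # <log P(x_t | s_t^{1:M})>_{q^{not m}} as a function of s^m_t = i, constants dropped:
        #   W[m][:, i] . (x_t - mu_rest_t) / sigma2  -  ||W[m][:, i]||^2 / (2 sigma2)
        loglik = (X - mu_rest) @ W[m] / sigma2 - 0.5 * np.sum(W[m] ** 2, axis=0) / sigma2
        lik = np.exp(loglik - loglik.max(axis=1, keepdims=True))         # stabilised pseudo-emissions
        q[m] = forward_backward(pi[m], Phi[m], lik)

print("chain 0 posterior marginals (first 5 steps):\n", np.round(q[0][:5], 2))
print("chain 0 true states (first 5 steps):        ", s[0, :5])
```

Only the other chains' single-time marginals enter the pseudo-emissions, which is why each chain can be refreshed with one exact forward–backward sweep.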

  10. Messages on an arbitrary graph
[Figure: a small example DAG on nodes A, B, C, D, E.]
Consider a DAG:
$$P(X, Z) = \prod_k P(V_k \,|\, \mathrm{pa}(V_k)),$$
and let $q(Z) = \prod_i q_i(Z_i)$ for disjoint sets $\{Z_i\}$.
We have that the VE update for $q_i$ is given by
$$q^*_i(Z_i) \propto \exp \big\langle \log p(Z, X) \big\rangle_{q_{\neg i}(Z)},$$
where $\langle \cdot \rangle_{q_{\neg i}(Z)}$ denotes averaging with respect to $q_j(Z_j)$ for all $j \neq i$. Then:
$$\log q^*_i(Z_i) = \Big\langle \sum_k \log P(V_k | \mathrm{pa}(V_k)) \Big\rangle_{q_{\neg i}(Z)} + \text{const}$$
$$= \sum_{j \in Z_i} \big\langle \log P(Z_j | \mathrm{pa}(Z_j)) \big\rangle_{q_{\neg i}(Z)} + \sum_{j \in \mathrm{ch}(Z_i)} \big\langle \log P(V_j | \mathrm{pa}(V_j)) \big\rangle_{q_{\neg i}(Z)} + \text{const}.$$
This defines messages that are passed between nodes in the graph. Each node receives messages from its Markov boundary: parents, children and parents of children (all neighbours in the corresponding factor graph).

  11. Non-factored variational methods
The term variational approximation is used whenever a bound on the likelihood (or on another estimation cost function) is optimised, but does not necessarily become tight. Many further variational approximations have been developed, including:
◮ parametric forms (e.g. Gaussian) for non-linear models
◮ closed-form updates in special cases
◮ numerical or sampling-based computation of expectations
◮ "recognition networks" or amortisation to estimate variational parameters
◮ non-free-energy-based bounds (both upper and lower) on the likelihood.
We can also see MAP- or zero-temperature EM and recognition models as parametric forms of variational inference.
Variational methods can also be used to find an approximate posterior on the parameters.

  12. Variational Bayes
So far, we have applied Jensen's bound and factorisations to help with integrals over latent variables. We can do the same for integrals over parameters, in order to bound the log marginal likelihood or evidence:
$$\log P(X|\mathcal{M}) = \log \iint dZ\, d\theta\; P(X, Z|\theta, \mathcal{M})\, P(\theta|\mathcal{M})$$
$$= \max_{Q} \iint dZ\, d\theta\; Q(Z, \theta) \log \frac{P(X, Z, \theta|\mathcal{M})}{Q(Z, \theta)}$$
$$\geq \max_{Q_Z, Q_\theta} \iint dZ\, d\theta\; Q_Z(Z)\, Q_\theta(\theta) \log \frac{P(X, Z, \theta|\mathcal{M})}{Q_Z(Z)\, Q_\theta(\theta)}.$$
The constraint that the distribution $Q$ must factor into the product $Q_Z(Z)\, Q_\theta(\theta)$ leads to the variational Bayesian EM algorithm, or just "Variational Bayes".
Some call this bound the "Evidence Lower Bound" (ELBO). I'm not fond of that term.

  13. Variational Bayesian EM
Coordinate maximisation of the VB free-energy lower bound
$$F(Q_Z, Q_\theta) = \iint dZ\, d\theta\; Q_Z(Z)\, Q_\theta(\theta) \log \frac{p(X, Z, \theta|\mathcal{M})}{Q_Z(Z)\, Q_\theta(\theta)}$$
leads to EM-like updates:
$$Q^*_Z(Z) \propto \exp \big\langle \log P(Z, X|\theta) \big\rangle_{Q_\theta(\theta)} \qquad \text{(E-like step)}$$
$$Q^*_\theta(\theta) \propto P(\theta) \exp \big\langle \log P(Z, X|\theta) \big\rangle_{Q_Z(Z)} \qquad \text{(M-like step)}$$
Maximising $F$ is equivalent to minimising the KL divergence between the approximate posterior, $Q_\theta(\theta)\, Q_Z(Z)$, and the true posterior, $P(\theta, Z|X)$:
$$\log P(X) - F(Q_Z, Q_\theta) = \log P(X) - \iint dZ\, d\theta\; Q_Z(Z)\, Q_\theta(\theta) \log \frac{P(X, Z, \theta)}{Q_Z(Z)\, Q_\theta(\theta)}$$
$$= \iint dZ\, d\theta\; Q_Z(Z)\, Q_\theta(\theta) \log \frac{Q_Z(Z)\, Q_\theta(\theta)}{P(Z, \theta|X)} = \mathrm{KL}(Q \,\|\, P).$$
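
A minimal sketch of these two coupled updates, for a deliberately simple conjugate model (everything here is illustrative): latent labels $z_i \sim \mathrm{Bernoulli}(\pi)$, observations drawn from one of two fixed unit-variance Gaussian components, and a $\mathrm{Beta}(a_0, b_0)$ prior on the single parameter $\pi$. The E-like step needs only $\langle \log \pi \rangle$ and $\langle \log(1-\pi) \rangle$ under the current Beta, and the M-like step is a conjugate Beta update.

```python
import numpy as np
from scipy.special import digamma
from scipy.stats import norm

rng = np.random.default_rng(4)

# Toy model: z_i ~ Bernoulli(pi), x_i | z_i ~ N(mu[z_i], 1), with the two component
# means known and a Beta(a0, b0) prior on pi (all values illustrative).
mu = np.array([-2.0, 2.0])
a0, b0 = 1.0, 1.0
x = np.concatenate([rng.normal(mu[0], 1.0, 70), rng.normal(mu[1], 1.0, 30)])

a, b = a0, b0                                   # Q_theta(pi) = Beta(a, b)
for it in range(50):
    # E-like step: Q_Z ∝ exp <log P(z, x | pi)>_{Q_theta}
    Elog_pi = digamma(a) - digamma(a + b)       # <log pi>
    Elog_1mpi = digamma(b) - digamma(a + b)     # <log (1 - pi)>
    log_r1 = Elog_pi + norm.logpdf(x, mu[1], 1.0)
    log_r0 = Elog_1mpi + norm.logpdf(x, mu[0], 1.0)
    r = 1.0 / (1.0 + np.exp(log_r0 - log_r1))   # Q(z_i = 1)
    # M-like step: Q_theta ∝ P(pi) exp <log P(z, x | pi)>_{Q_Z}  ->  conjugate Beta update
    a = a0 + r.sum()
    b = b0 + (1.0 - r).sum()

print("approximate posterior over pi: Beta(%.1f, %.1f), mean %.3f" % (a, b, a / (a + b)))
```

The result is a full Beta posterior over $\pi$ rather than a point estimate; with this balanced prior its mean should land close to the fraction of data generated from the second component.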

  14. Conjugate-Exponential models
Let's focus on conjugate-exponential (CE) latent-variable models:
◮ Condition (1). The joint probability over variables is in the exponential family:
$$P(Z, X|\theta) = f(Z, X)\, g(\theta) \exp\big\{ \phi(\theta)^{\mathsf T} T(Z, X) \big\},$$
where $\phi(\theta)$ is the vector of natural parameters and $T$ are the sufficient statistics.
◮ Condition (2). The prior over parameters is conjugate to this joint probability:
$$P(\theta|\nu, \tau) = h(\nu, \tau)\, g(\theta)^\nu \exp\big\{ \phi(\theta)^{\mathsf T} \tau \big\},$$
where $\nu$ and $\tau$ are hyperparameters of the prior.
Conjugate priors are computationally convenient and have an intuitive interpretation:
◮ $\nu$: number of pseudo-observations
◮ $\tau$: values of pseudo-observations
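
As a concrete (hypothetical) instance of conditions (1) and (2), take Bernoulli observations with no latent variables:

```latex
P(x \mid \theta) = \theta^{x}(1-\theta)^{1-x}
  = \underbrace{1}_{f(x)}\;\underbrace{(1-\theta)}_{g(\theta)}\;
    \exp\Big\{ \underbrace{\log\tfrac{\theta}{1-\theta}}_{\phi(\theta)}\;
               \underbrace{x}_{T(x)} \Big\}

% conjugate prior of condition (2):
P(\theta \mid \nu, \tau) \propto g(\theta)^{\nu}\, e^{\phi(\theta)\,\tau}
  = \theta^{\tau}(1-\theta)^{\nu-\tau}
  \;\equiv\; \mathrm{Beta}(\tau+1,\; \nu-\tau+1)
```

So $\nu$ counts pseudo-observations and $\tau$ records how many of them were ones, matching the interpretation above.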

  15. Conjugate-Exponential examples
In the CE family:
◮ Gaussian mixtures
◮ factor analysis, probabilistic PCA
◮ hidden Markov models and factorial HMMs
◮ linear dynamical systems and switching models
◮ discrete-variable belief networks
◮ other, as yet undreamt-of, models: combinations of Gaussian, Gamma, Poisson, Dirichlet, Wishart, Multinomial and others.
Not in the CE family:
◮ Boltzmann machines, MRFs (no simple conjugacy)
◮ logistic regression (no simple conjugacy)
◮ sigmoid belief networks (not exponential)
◮ independent components analysis (not exponential)
Note: one can often approximate such models with a suitable choice from the CE family.

  16. Conjugate-exponential VB
Given an iid data set $\mathcal{D} = (x_1, \ldots, x_n)$, if the model is CE then:
◮ $Q_\theta(\theta)$ is also conjugate, i.e.
$$Q_\theta(\theta) \propto P(\theta) \exp \Big\langle \sum_i \log P(z_i, x_i|\theta) \Big\rangle_{Q_Z}$$
$$= h(\nu, \tau)\, g(\theta)^\nu e^{\phi(\theta)^{\mathsf T} \tau} \cdot e^{\langle \log f(Z, X) \rangle_{Q_Z}}\, g(\theta)^n\, e^{\phi(\theta)^{\mathsf T} \sum_i \langle T(z_i, x_i) \rangle_{Q_Z}}$$
$$\propto h(\tilde\nu, \tilde\tau)\, g(\theta)^{\tilde\nu} e^{\phi(\theta)^{\mathsf T} \tilde\tau}, \qquad \text{with } \tilde\nu = \nu + n \text{ and } \tilde\tau = \tau + \sum_i \langle T(z_i, x_i) \rangle_{Q_Z}$$
⇒ we only need to track $\tilde\nu, \tilde\tau$.
◮ $Q_Z(Z) = \prod_{i=1}^n Q_{z_i}(z_i)$ takes the same form as in the E-step of regular EM:
$$Q_{z_i}(z_i) \propto \exp \big\langle \log P(z_i, x_i|\theta) \big\rangle_{Q_\theta}
 \propto f(z_i, x_i)\, e^{\langle \phi(\theta) \rangle_{Q_\theta}^{\mathsf T} T(z_i, x_i)} = P\big(z_i \,|\, x_i, \bar\phi(\theta)\big),$$
with natural parameters $\bar\phi(\theta) = \langle \phi(\theta) \rangle_{Q_\theta}$
⇒ inference is unchanged from regular EM.
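
Continuing the hypothetical Bernoulli instance (where $T(x_i) = x_i$ and there are no latent variables, so $Q_Z$ is trivial), the update above reduces to exact conjugate Bayesian updating:

```latex
\tilde\nu = \nu + n, \qquad
\tilde\tau = \tau + \sum_{i=1}^{n} x_i
\quad\Longrightarrow\quad
Q_\theta(\theta) = \mathrm{Beta}\Big(\tau + \textstyle\sum_i x_i + 1,\;\;
                                     \nu - \tau + n - \textstyle\sum_i x_i + 1\Big)
```

With latent variables present, each $x_i$ would simply be replaced by the expected sufficient statistic $\langle T(z_i, x_i) \rangle_{Q_Z}$.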

  17. The Variational Bayesian EM algorithm
EM for MAP estimation:
  Goal: maximise $P(\theta|X, m)$ wrt $\theta$.
  E step: compute $Q_Z(Z) \leftarrow p(Z|X, \theta)$.
  M step: $\theta \leftarrow \operatorname*{argmax}_\theta \int dZ\, Q_Z(Z) \log P(Z, X, \theta)$.
Variational Bayesian EM:
  Goal: maximise the bound on $P(X|m)$ wrt $Q_\theta$.
  VB-E step: compute $Q_Z(Z) \leftarrow p(Z|X, \bar\phi)$.
  VB-M step: $Q_\theta(\theta) \leftarrow \exp \int dZ\, Q_Z(Z) \log P(Z, X, \theta)$ (up to normalisation).
Properties:
◮ Reduces to the EM algorithm if $Q_\theta(\theta) = \delta(\theta - \theta^*)$.
◮ $F_m$ increases monotonically, and incorporates the model complexity penalty.
◮ Analytical parameter distributions (but not constrained to be Gaussian).
◮ The VB-E step has the same complexity as the corresponding E step.
◮ We can use the junction tree, belief propagation, Kalman filter, etc. algorithms in the VB-E step of VB-EM, but using expected natural parameters, $\bar\phi$.
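
A quick check of the first property (a sketch, ignoring the delta function's infinite differential entropy, which does not depend on the location $\theta^*$): restricting $Q_\theta$ to delta functions turns the two VB steps back into EM for MAP estimation.

```latex
Q_\theta(\theta) = \delta(\theta - \theta^*):\qquad
Q_Z(Z) \;\propto\; \exp\big\langle \log P(Z, X \mid \theta) \big\rangle_{Q_\theta}
       \;=\; P(Z \mid X, \theta^*) \quad\text{(the ordinary E step)},

\theta^* \;\leftarrow\; \operatorname*{argmax}_{\theta}
   \Big[ \log P(\theta) + \int \! dZ\, Q_Z(Z)\, \log P(Z, X \mid \theta) \Big]
   \quad\text{(the MAP M step)}.
```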

  18. VB and model selection
◮ Variational Bayesian EM yields an approximate posterior $Q_\theta$ over model parameters.
◮ It also yields an optimised lower bound on the (log) model evidence:
$$\max F_{\mathcal{M}}(Q_Z, Q_\theta) \le \log P(\mathcal{D}|\mathcal{M}).$$
◮ These lower bounds can be compared amongst models to learn the right model (its structure, connectivity, ...).
◮ If a continuous domain of models is specified by a hyperparameter $\eta$, then the VB free energy depends on that hyperparameter:
$$F(Q_Z, Q_\theta, \eta) = \iint dZ\, d\theta\; Q_Z(Z)\, Q_\theta(\theta) \log \frac{P(X, Z, \theta|\eta)}{Q_Z(Z)\, Q_\theta(\theta)} \le \log P(X|\eta).$$
A hyper-M step maximises the current bound wrt $\eta$:
$$\eta \leftarrow \operatorname*{argmax}_\eta \iint dZ\, d\theta\; Q_Z(Z)\, Q_\theta(\theta) \log P(X, Z, \theta|\eta).$$

  19. ARD for unsupervised learning
Recall that ARD (automatic relevance determination) was a hyperparameter method to select relevant or useful inputs in regression.
◮ A similar idea used with variational Bayesian methods can learn a latent dimensionality.
◮ Consider factor analysis:
$$x \sim \mathcal{N}(\Lambda z, \Psi), \qquad z \sim \mathcal{N}(0, I),$$
with a column-wise prior $\Lambda_{:i} \sim \mathcal{N}\big(0, \alpha_i^{-1} I\big)$.
◮ The VB free energy is
$$F\big(Q_Z(Z), Q_\Lambda(\Lambda), \Psi, \alpha\big) = \big\langle \log P(X, Z|\Lambda, \Psi) + \log P(\Lambda|\alpha) + \log P(\Psi) \big\rangle_{Q_Z Q_\Lambda} + \ldots$$
and so hyperparameter optimisation requires
$$\alpha \leftarrow \operatorname*{argmax}_\alpha \big\langle \log P(\Lambda|\alpha) \big\rangle_{Q_\Lambda}.$$
◮ Now $Q_\Lambda$ is Gaussian, with the same form as in linear regression, but with expected moments of $z$ appearing in place of the inputs.
◮ Optimisation wrt the distributions, $\Psi$ and $\alpha$ in turn causes some $\alpha_i$ to diverge, as in regression ARD.
◮ In this case, these parameters select "relevant" latent dimensions, effectively learning the dimensionality of $z$.
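
Carrying out the argmax over $\alpha$ explicitly (a sketch; here $D$ denotes the output dimensionality, i.e. the number of rows of $\Lambda$):

```latex
\big\langle \log P(\Lambda \mid \alpha) \big\rangle_{Q_\Lambda}
  = \sum_i \Big[ \tfrac{D}{2}\log\alpha_i
               - \tfrac{\alpha_i}{2}\,\big\langle \|\Lambda_{:i}\|^2 \big\rangle_{Q_\Lambda} \Big] + \text{const}
\quad\Longrightarrow\quad
\alpha_i \;\leftarrow\; \frac{D}{\big\langle \|\Lambda_{:i}\|^2 \big\rangle_{Q_\Lambda}}
```

Columns whose expected squared norm is driven toward zero therefore send $\alpha_i \to \infty$, switching off the corresponding latent dimension.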

  20. Augmented Variational Methods
In our examples so far, the approximate variational distribution has been over the "natural" latent variables (and parameters) of the generative model. Sometimes it may be useful to introduce additional latent variables, solely to achieve computational tractability. Two examples are GP regression and the GPLVM.

  21. Sparse GP approximations
GP predictions:
$$y' \,|\, X, Y, x' \sim \mathcal{N}\Big( K_{x'X}\big(K_{XX} + \sigma^2 I\big)^{-1} Y,\;\; K_{x'x'} - K_{x'X}\big(K_{XX} + \sigma^2 I\big)^{-1} K_{Xx'} + \sigma^2 \Big).$$
Evidence (for learning kernel hyperparameters):
$$\log P(Y|X) = -\tfrac{1}{2} \log\big|2\pi\big(K_{XX} + \sigma^2 I\big)\big| - \tfrac{1}{2} Y^{\mathsf T}\big(K_{XX} + \sigma^2 I\big)^{-1} Y.$$
Computing either form requires inverting the $N \times N$ matrix $K_{XX}$ (plus noise), in $O(N^3)$ time.
One proposal to make this more efficient is to find (or select) a smaller set of possibly fictitious measurements $U$ at inputs $Z$ such that $P(y'|Z, U, x') \approx P(y'|X, Y, x')$.
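
The two expressions above translate directly into code. The sketch below (squared-exponential kernel, 1-D toy data — both illustrative choices) computes the exact GP predictive and log evidence; the $N \times N$ solve with $K_{XX} + \sigma^2 I$ is the $O(N^3)$ bottleneck that sparse, inducing-point approximations are designed to avoid.

```python
import numpy as np

rng = np.random.default_rng(5)

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    # squared-exponential kernel: K[i, j] = variance * exp(-||a_i - b_j||^2 / (2 l^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

# toy 1-D regression data (illustrative)
N, sigma2 = 200, 0.1
X = rng.uniform(-3, 3, (N, 1))
Y = np.sin(2 * X[:, 0]) + np.sqrt(sigma2) * rng.standard_normal(N)
Xstar = np.array([[0.5]])

Kxx = rbf_kernel(X, X)
Ksx = rbf_kernel(Xstar, X)
Kss = rbf_kernel(Xstar, Xstar)

# the O(N^3) bottleneck: factorising the N x N matrix (K_XX + sigma^2 I)
C = Kxx + sigma2 * np.eye(N)
Cinv_Y = np.linalg.solve(C, Y)
Cinv_Ksx = np.linalg.solve(C, Ksx.T)

mean = Ksx @ Cinv_Y                                # K_x'X (K_XX + s2 I)^-1 Y
var = Kss + sigma2 - Ksx @ Cinv_Ksx                # predictive variance (incl. noise)
sign, logdet = np.linalg.slogdet(2 * np.pi * C)
evidence = -0.5 * logdet - 0.5 * Y @ Cinv_Y        # log P(Y | X)

print("predictive mean/var at x' = 0.5:", mean[0], var[0, 0])
print("log evidence:", evidence)
```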
