slide-1
SLIDE 1

Probabilistic & Unsupervised Learning Beyond linear-Gaussian models and Mixtures

Maneesh Sahani

maneesh@gatsby.ucl.ac.uk

Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science University College London Term 1, Autumn 2018

slide-2
SLIDE 2

Tractable Models

slide-7
SLIDE 7

Tractable Models

◮ Factor analysis, principal components analysis, probabilistic PCA.
◮ Linear regression, Gaussian processes.
◮ Mixture of Gaussians, mixture of experts.
◮ Hidden Markov models, linear-Gaussian state space models.

Models consisting of various combinations of:

◮ Linear Gaussian,
◮ Discrete variables,
◮ Chains and trees (or junction trees).

slide-11
SLIDE 11

A Generative Model for Generative Models

[Figure: a family tree of models (Gaussian; Mixture of Gaussians (VQ); Factor Analysis (PCA); Hidden Markov Models; Linear Dynamical Systems; Mixture of Factor Analysers; Mixture of HMMs; Mixture of LDSs), organised along the dimensions: discrete latent (mixture), linear-Gaussian latent, latent dynamics.]

Adapted from Roweis & Ghahramani (1999). A Unifying Review of Linear Gaussian Models. Neural Comput. 11(2).
slide-18
SLIDE 18

Expanding Our Horizons

Although these models can be powerful, they are undoubtedly still restrictive. There is a need to go beyond the confines of these structures. In this half of the course (and today) we will study:

◮ hierarchical models,
◮ distributed models,
◮ nonlinear models,
◮ non-Gaussian models,

and various combinations of these. Whilst sometimes tractable (particularly in corner cases), these models will most often require approximate inference.

slide-19
SLIDE 19

Why We Need . . . Nonlinear/Non-Gaussian Models

Much of the world is neither linear nor Gaussian

[Figure: histogram of filter responses (log probability scale) compared with a Gaussian density. Axes: filter response vs. probability. Legend: response histogram, Gaussian density.]

. . . and most of the interesting structure we would like to learn about is not, either.

slide-20
SLIDE 20

Why We Need . . . Hierarchical (Deep) Models

Many generative processes can be naturally described at different levels of detail.

[Figure: a hierarchy of descriptions, from high-level causes (e.g. objects, illumination, pose), to object parts and surfaces, to edges, down to the retinal image (i.e. pixels).]

Biology seems to have developed hierarchical representations.

slide-21
SLIDE 21

Why We Need . . . Distributed Models

[Figure: an observation x1 characterised by a single variable y1 vs. by a vector of attributes y1, y2, . . . , yK.]

In a distributed representation each observation is characterised by a vector of (discrete or continuous) attributes. Some of these attributes might be latent.

◮ Unitary representation: categorise voters into small groups who (may) vote similarly, e.g.:

London-based university professors of Asian descent.

◮ Distributed representation: consider contributions from a group of attributes, e.g.:

(Single, Black, Female, 34 yrs, Urban, Liberal, £35k p.a.).

◮ Attributes resemble factors, but may be discrete or non-Gaussian, and may outnumber observations.

Distributed representations can be exponentially efficient: K binary factors ⇒ K bits of information. (K parallel binary state variables in an HMM can replace one variable with 2^K states.)

slide-25
SLIDE 25

A Generative Model for Generative Models

[Figure: the family tree extended with Independent Components Analysis and Nonlinear Dynamical Systems (nonlinear link functions), Cooperative Vector Quantisation and Factorial HMMs (distributed latents), and Sigmoid Belief Nets and Nonlinear Gaussian Belief Nets (hierarchy), alongside Gaussian, Mixture of Gaussians (VQ), Factor Analysis (PCA), Hidden Markov Models, Linear Dynamical Systems, and Mixtures of Factor Analysers / HMMs / LDSs. Dimensions: discrete latent (mixture), linear-Gaussian latent, latent dynamics, nonlinear link functions, distributed latent, hierarchy.]

Adapted from Roweis & Ghahramani (1999). A Unifying Review of Linear Gaussian Models. Neural Comput. 11(2).
slide-26
SLIDE 26

Independent Components Analysis

slide-31
SLIDE 31

Independent Components Analysis

[Figure: scatter plots of data generated as a mixture of two heavy-tailed sources and as a mixture of two light-tailed sources.]

These distributions are generated by linearly combining (or mixing) two non-Gaussian sources.

◮ The ICA graphical model is identical to factor analysis:

x_d = Σ_{k=1}^K Λ_dk z_k + ε_d

but with the z_k ∼ P_z drawn iid from a non-Gaussian distribution.

[Figure: graphical model with latents z1, . . . , zK and observations x1, . . . , xD.]

◮ Well-posed even with K ≥ D (e.g. K = D = 2 above).
◮ Tractable for 0 noise (“PCA-like” case).
◮ Intractable in general: posterior non-Gaussian, MAP inference non-linear.
◮ Exact inference and learning difficult ⇒ “noise” components or variational approx.

slide-34
SLIDE 34

Square, Noiseless ICA

◮ The special case of K = D and zero observation noise has been studied extensively (also called infomax ICA, c.f. the information view of PCA):

x = Λz,   z = Wx with W = Λ^{-1}

The z are called the independent components; W is the unmixing matrix.

◮ The likelihood can be obtained by transforming the density of z to that of x. If F : z → x is a differentiable bijection, and if dz is a small neighbourhood around z, then

P_x(x) dx = P_z(z) dz = P_z(F^{-1}(x)) |dz/dx| dx = P_z(F^{-1}(x)) |∇F^{-1}| dx

◮ This gives (for parameter W):

P(x|W) = |W| Π_k P_z([Wx]_k)    (with z_k = [Wx]_k)

slide-35
SLIDE 35

Learning in ICA

◮ Log likelihood of data:

log P(x) = log |W| +

  • i

log Pz(Wix)

slide-36
SLIDE 36

Learning in ICA

◮ Log likelihood of data:

log P(x) = log |W| +

  • i

log Pz(Wix)

◮ Learning by gradient ascent:

∆W ∝ ∇W log P(x) = W −T + g(z)xT

g(z) = ∂ log Pz(z)

∂z

slide-37
SLIDE 37

Learning in ICA

◮ Log likelihood of data:

log P(x) = log |W| +

  • i

log Pz(Wix)

◮ Learning by gradient ascent:

∆W ∝ ∇W log P(x) = W −T + g(z)xT

g(z) = ∂ log Pz(z)

∂z

◮ Better approach: “natural” or covariant gradient

∆W ∝ ∇W log P(x) · (W TW) ≈ −∇∇ log P−1 = W + g(z)zTW

(see MacKay 1996).

slide-38
SLIDE 38

Learning in ICA

◮ Log likelihood of the data:

log P(x|W) = log |W| + Σ_i log P_z(W_i x)

◮ Learning by gradient ascent:

∆W ∝ ∇_W log P(x) = W^{-T} + g(z) x^T,   where g(z) = ∂ log P_z(z) / ∂z

◮ Better approach: the “natural” or covariant gradient

∆W ∝ ∇_W log P(x) · (W^T W) = W + g(z) z^T W,   using W^T W ≈ (−∇∇ log P)^{-1}

(see MacKay 1996).

◮ Note: we can’t use EM in the square noiseless causal ICA model. Why?
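The natural-gradient rule above is straightforward to simulate. Below is a minimal sketch (not from the slides), assuming a heavy-tailed source prior P_z(z) ∝ 1/cosh(z) so that g(z) = −tanh(z); the function name, learning rate and iteration count are illustrative choices.

```python
import numpy as np

def ica_natural_gradient(X, n_steps=2000, lr=0.01, seed=0):
    """X: (D, N) array of zero-mean mixed signals. Returns the unmixing matrix W."""
    rng = np.random.default_rng(seed)
    D, N = X.shape
    W = np.eye(D) + 0.1 * rng.standard_normal((D, D))
    for _ in range(n_steps):
        Z = W @ X                                   # current source estimates z = Wx
        g = -np.tanh(Z)                             # score g(z) for P_z(z) proportional to 1/cosh(z)
        dW = (np.eye(D) + (g @ Z.T) / N) @ W        # natural-gradient direction: (I + <g(z) z^T>) W
        W += lr * dW
    return W
```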

slide-42
SLIDE 42

Infomax ICA

◮ Consider a feedforward model:

z_i = W_i x;   ξ_i = f_i(z_i)

with a monotonic squashing function f_i(−∞) = 0, f_i(+∞) = 1.

◮ Infomax finds filtering weights W maximizing the information carried by ξ about x:

argmax_W I(x; ξ) = argmax_W [H(ξ) − H(ξ|x)] = argmax_W H(ξ)

Thus we just have to maximize the entropy of ξ: make it as uniform as possible on [0, 1] (note the squashing function).

◮ But if the data were generated from a square noiseless causal ICA model, then the best we can do is

ξ_i = f_i(z_i) = cdf_i(z_i)   and   W = Λ^{-1}

Infomax ICA ⇔ square noiseless causal ICA.

◮ Another view: redundancy reduction in the representation ξ of the data x.

argmax_W H(ξ) = argmax_W [ Σ_i H(ξ_i) − I(ξ_1, . . . , ξ_D) ]

See: MacKay (1996), Pearlmutter and Parra (1996), Cardoso (1997) for the equivalence; Teh et al (2003) for an energy-based view.

slide-43
SLIDE 43

Kurtosis

The kurtosis (or excess kurtosis) measures how “peaky” or “heavy-tailed” a distribution is:

K = E[(x − µ)^4] / E[(x − µ)^2]^2 − 3,   where µ = E[x] is the mean of x.

Gaussian distributions have zero kurtosis. Heavy tailed: positive kurtosis (leptokurtic). Light tailed: negative kurtosis (platykurtic).

Linear mixtures of independent non-Gaussian sources tend to be “more” Gaussian ⇒ K → 0.

Some ICA algorithms are essentially kurtosis-pursuit approaches. Possibly fewer assumptions about the generating distributions.
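A small illustration of the formula above: an assumed helper (not from the slides) that computes the empirical excess kurtosis of a sample. It should be near zero for Gaussian data, positive for heavy-tailed sources and negative for light-tailed ones.

```python
import numpy as np

def excess_kurtosis(x):
    """Empirical excess kurtosis: E[(x-mu)^4] / E[(x-mu)^2]^2 - 3."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    m2 = ((x - mu) ** 2).mean()
    m4 = ((x - mu) ** 4).mean()
    return m4 / m2 ** 2 - 3.0

# excess_kurtosis(np.random.standard_normal(100000))  ~ 0   (Gaussian)
# excess_kurtosis(np.random.laplace(size=100000))     ~ 3   (heavy-tailed)
```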

slide-44
SLIDE 44

ICA and BSS

Applications:

◮ Separating auditory sources
◮ Analysis of EEG data
◮ Analysis of functional MRI data
◮ Natural scene analysis
◮ . . .

Extensions:

◮ Non-zero output noise – approximate posteriors and learning.
◮ Undercomplete (K < D) or overcomplete (K > D).
◮ Learning prior distributions (on z).
◮ Dynamical hidden models (on z).
◮ Learning number of sources.
◮ Time-varying mixing matrix.
◮ Nonparametric, kernel ICA.
◮ . . .

slide-45
SLIDE 45

Blind Source Separation


◮ ICA solution to blind source separation assumes no dependence across time; still works fine much of the time.

◮ Many other algorithms: DCA, SOBI, JADE, . . .

slide-46
SLIDE 46

Images

[Figure: image model. W: filters; Φ: basis functions; a: causes; I: image patch; learned over an image ensemble.]

slide-47
SLIDE 47

Natural Scenes

[Figure from Olshausen & Field (1996).]

slide-48
SLIDE 48

Nonlinear state-space models

slide-55
SLIDE 55

Nonlinear state-space model (NLSSM)

[Figure: NLSSM graphical model: latent states z1 . . . zT, observations x1 . . . xT and inputs u1 . . . uT, with transition map f and output map g; after linearisation the system is described by time-varying matrices At, Bt, Ct, Dt.]

z_{t+1} = f(z_t, u_t) + w_t
x_t = g(z_t, u_t) + v_t

w_t, v_t usually still Gaussian.

Extended Kalman Filter (EKF): linearise the nonlinear functions about the current estimate ẑ_t^t:

z_{t+1} ≈ f(ẑ_t^t, u_t) + [∂f/∂z_t]|_{ẑ_t^t} (z_t − ẑ_t^t) + w_t
x_t ≈ g(ẑ_t^{t−1}, u_t) + [∂g/∂z_t]|_{ẑ_t^{t−1}} (z_t − ẑ_t^{t−1}) + v_t

where the constant terms play the role of B_t u_t and D_t u_t, and the Jacobians play the role of A_t and C_t.

Run the Kalman filter (smoother) on the non-stationary linearised system (At, Bt, Ct, Dt):

◮ Adaptively approximates non-Gaussian messages by Gaussians.
◮ Local linearisation depends on the central point of the distribution ⇒ the approximation degrades with increased state uncertainty. May work acceptably for close-to-linear systems.
◮ Can base an EM-like algorithm on the EKF/EKS (or alternatives).
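As a concrete illustration of the linearise-then-filter recipe, here is a minimal sketch of one EKF predict/update step. It assumes additive Gaussian noise with covariances Q and R and user-supplied functions and Jacobians (f, g, jac_f, jac_g); all names are illustrative, not from the slides.

```python
import numpy as np

def ekf_step(z_hat, P, x_obs, u, f, g, jac_f, jac_g, Q, R):
    """One EKF predict/update. z_hat, P: filtered mean and covariance of z_t."""
    # Predict: linearise f about the filtered estimate (A_t = df/dz at z_hat)
    A = jac_f(z_hat, u)
    z_pred = f(z_hat, u)                          # predicted mean of z_{t+1}
    P_pred = A @ P @ A.T + Q
    # Update: linearise g about the predicted estimate (C_t = dg/dz at z_pred)
    C = jac_g(z_pred, u)
    S = C @ P_pred @ C.T + R                      # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)           # Kalman gain
    z_new = z_pred + K @ (x_obs - g(z_pred, u))
    P_new = (np.eye(len(z_pred)) - K @ C) @ P_pred
    return z_new, P_new
```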

slide-63
SLIDE 63

Learning (online EKF)

Nonlinear message passing can also be used to implement online parameter learning in (non)linear latent state-space systems. E.g., for a linear model, augment the state vector to include the model parameters:

z̄_t = [z_t; A; C],

and introduce a nonlinear transition f̄ and output map ḡ:

z̄_{t+1} = f̄(z̄_t) + w̄_t,   with   f̄([z_t; A; C]) = [A z_t; A; C]   and   w̄_t = [w_t; 0; 0]
x_t = ḡ(z̄_t) + v_t,   with   ḡ([z_t; A; C]) = C z_t

(where A and C need to be vectorised and de-vectorised as appropriate).

Use the EKF to compute online estimates of E[z̄_t | x_1, . . . , x_t] and Cov[z̄_t | x_1, . . . , x_t]. These now include the mean and posterior variance of the parameter estimates.

◮ Pseudo-Bayesian approach: gives Gaussian distributions over parameters.
◮ Can model nonstationarity by assuming non-zero innovations noise in A, C.
◮ Not simple to implement for Q and R (e.g. covariance constraints?).
◮ May be faster than EM/gradient approaches.

Sometimes called the joint-EKF approach.

slide-64
SLIDE 64

Binary models: Boltzmann Machines and Sigmoid Belief Nets

slide-66
SLIDE 66

Boltzmann Machines

Undirected graphical model (i.e. a Markov network) over a vector of binary variables s_i ∈ {0, 1}. Some variables may be hidden, some may be visible (observed).

P(s|W, b) = (1/Z) exp{ Σ_ij W_ij s_i s_j − Σ_i b_i s_i }

where Z is the normalization constant (partition function). A jointly exponential-family model, with an intractable normaliser.

◮ Inference requires expectations of the hidden nodes s_H:  ⟨s_H⟩_{P(s_H|s_V,W,b)} and ⟨s_H s_H^T⟩_{P(s_H|s_V,W,b)}.
◮ Usually requires approximate methods: sampling or loopy BP.
◮ The intractable normaliser also complicates the M-step ⇒ doubly intractable.

slide-74
SLIDE 74

Learning in Boltzmann Machines

log P(s_V, s_H | W, b) = Σ_ij W_ij s_i s_j − Σ_i b_i s_i − log Z   with   Z = Σ_s exp{ Σ_ij W_ij s_i s_j − Σ_i b_i s_i }

Generalised (gradient M-step) EM requires the parameter step

∆W_ij ∝ ∂/∂W_ij ⟨ log P(s_V, s_H | W, b) ⟩_{P(s_H | s_V)}

Write ⟨·⟩_c (“clamped”) for expectations under P(s | s_V^obs) (with P(s_V | s_V^obs) = δ_{s_i^V, s_i^V,obs}). Then

[∇_W ⟨log P(s_V, s_H)⟩]_ij
  = ∂/∂W_ij [ Σ_ij W_ij ⟨s_i s_j⟩_c − Σ_i b_i ⟨s_i⟩_c − log Z ]
  = ⟨s_i s_j⟩_c − ∂/∂W_ij log Z
  = ⟨s_i s_j⟩_c − (1/Z) ∂/∂W_ij Σ_s exp{ Σ_ij W_ij s_i s_j − Σ_i b_i s_i }
  = ⟨s_i s_j⟩_c − Σ_s (1/Z) exp{ Σ_ij W_ij s_i s_j − Σ_i b_i s_i } s_i s_j
  = ⟨s_i s_j⟩_c − Σ_s P(s | W, b) s_i s_j
  = ⟨s_i s_j⟩_c − ⟨s_i s_j⟩_u

with ⟨·⟩_u (“unclamped”) the expectation under the current joint.
⇒ ExpFam moment matching, but requires simulation and gradient ascent.
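A minimal sketch of this clamped-minus-unclamped gradient, with both expectations estimated by Gibbs sampling. It assumes the sum Σ_ij runs over ordered pairs with W symmetric and zero diagonal, so the conditional is σ(2 Σ_j W_ij s_j − b_i); the network sizes and sampling schedule are illustrative, not from the slides.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gibbs_sweep(s, W, b, free_idx, rng):
    # resample each unclamped unit given all others: P(s_i = 1 | rest) = sigma(2 W[i].s - b[i])
    for i in free_idx:
        s[i] = rng.random() < sigmoid(2.0 * (W[i] @ s) - b[i])
    return s

def bm_weight_gradient(W, b, data, n_hid, n_sweeps=25, rng=None):
    """data: (N, n_vis) binary array. Returns <s s^T>_c - <s s^T>_u (the dW direction)."""
    rng = rng or np.random.default_rng(0)
    n = W.shape[0]                                 # n = n_vis + n_hid units
    hid = np.arange(n - n_hid, n)
    corr_c = np.zeros_like(W)                      # clamped phase: visibles fixed to data
    for v in data:
        s = np.concatenate([v, rng.integers(0, 2, n_hid)]).astype(float)
        for _ in range(n_sweeps):
            s = gibbs_sweep(s, W, b, hid, rng)
        corr_c += np.outer(s, s)
    corr_c /= len(data)
    s = rng.integers(0, 2, n).astype(float)        # unclamped phase: sample the joint
    corr_u = np.zeros_like(W)
    for _ in range(n_sweeps * len(data)):
        s = gibbs_sweep(s, W, b, np.arange(n), rng)
        corr_u += np.outer(s, s)
    corr_u /= n_sweeps * len(data)
    return corr_c - corr_u
```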

slide-75
SLIDE 75

Sigmoid Belief Networks

[Figure: layered directed network of binary units.]

Directed graphical model (i.e. a Bayesian network) over a vector of binary variables s_i ∈ {0, 1}.

P(s|W, b) = Π_i P(s_i | {s_j}_{j<i}, W, b),   s_i | {s_j}_{j<i}, W, b ∼ Bernoulli(σ( Σ_{j<i} W_ij s_j − b_i ))

P(s_i = 1 | {s_j}_{j<i}, W, b) = 1 / (1 + exp{ −Σ_{j<i} W_ij s_j + b_i })

◮ Parents most often grouped into layers.
◮ Logistic function σ of a linear combination of parents.
◮ A “generative multilayer perceptron” (“neural network”).

Learning algorithm: a gradient version of EM.

◮ The E step involves computing averages w.r.t. P(s_H | s_V, W, b). This could be done either exactly or approximately using Gibbs sampling or mean-field approximations, or using a parallel ‘recognition network’ (the Helmholtz machine).
◮ Unlike Boltzmann machines, there is no partition function, so there is no need for an unclamped phase in the M step.
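Because the model is fully directed, sampling from it is simple ancestral sampling in the ordering of the units. A minimal sketch under the slide's parameterisation (the names W, b are illustrative):

```python
import numpy as np

def sample_sbn(W, b, rng=None):
    """Ancestral sample: unit i depends only on units j < i (W strictly lower-triangular)."""
    rng = rng or np.random.default_rng(0)
    n = len(b)
    s = np.zeros(n)
    for i in range(n):
        p = 1.0 / (1.0 + np.exp(-(W[i, :i] @ s[:i] - b[i])))   # sigma(sum_{j<i} W_ij s_j - b_i)
        s[i] = rng.random() < p
    return s
```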

slide-76
SLIDE 76

Restricted Boltzmann Machines

Special-case Boltzmann Machine: W_ij = 0 for any two visible or any two hidden nodes (bipartite graph).

P(s_V | s_H) = (1/Z) exp{ Σ_{i∈V} Σ_{j∈H} W_ij s_i s_j − Σ_{i∈V} b_i s_i − Σ_{j∈H} b_j s_j }
            = (1/Z′) Π_{i∈V} exp{ s_i ( Σ_{j∈H} W_ij s_j − b_i ) }
            = Π_{i∈V} Bernoulli(σ( Σ_{j∈H} W_ij s_j − b_i ))

and similarly P(s_H | s_V) = Π_{j∈H} Bernoulli(σ( Σ_{i∈V} W_ij s_i − b_j )).

◮ So inference is tractable . . .
◮ . . . but learning is still intractable because of the normaliser.
◮ Unclamped samples can be generated efficiently by block Gibbs sampling.
◮ Often combined with a further approximation called contrastive divergence learning.
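A minimal sketch of a contrastive-divergence (CD-1) update built on the block-Gibbs conditionals above. The sign conventions follow the slide (σ(Σ_j W_ij s_j − b_i)); the names V, W, b_v, b_h and the learning rate are illustrative, not from the slides.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_update(V, W, b_v, b_h, lr=0.05, rng=None):
    """V: (N, n_vis) binary data batch. Returns updated (W, b_v, b_h)."""
    rng = rng or np.random.default_rng(0)
    # positive (clamped) phase
    p_h = sigmoid(V @ W - b_h)                         # P(h_j = 1 | v)
    H = (rng.random(p_h.shape) < p_h).astype(float)
    # one block-Gibbs step for a "negative" (unclamped) sample
    p_v = sigmoid(H @ W.T - b_v)                       # P(v_i = 1 | h)
    V_neg = (rng.random(p_v.shape) < p_v).astype(float)
    p_h_neg = sigmoid(V_neg @ W - b_h)
    # CD-1 gradient: clamped minus unclamped correlations
    N = V.shape[0]
    W = W + lr * ((V.T @ p_h) - (V_neg.T @ p_h_neg)) / N
    b_v = b_v - lr * (V - V_neg).mean(axis=0)          # minus sign: energy contains -(b.s) terms
    b_h = b_h - lr * (p_h - p_h_neg).mean(axis=0)
    return W, b_v, b_h
```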

slide-77
SLIDE 77

Distributed state models

slide-78
SLIDE 78

Factorial Hidden Markov Models

[Figure: factorial HMM graphical model: three independent Markov chains s^(1)_{1:T}, s^(2)_{1:T}, s^(3)_{1:T}, all contributing to the observations x_1 . . . x_T.]

◮ Hidden Markov models with many state variables (i.e. a distributed state representation).
◮ Each state variable evolves independently.
◮ The state can capture many bits of information about the sequence (linear in the number of state variables).
◮ The E step is typically intractable (due to explaining away in the latent states).
◮ Example case for variational approximation.

slide-79
SLIDE 79

Dynamic Bayesian Networks

[Figure: DBN with structured latent variables At, Bt, Ct, Dt and their dependencies unrolled over time steps t, t+1, t+2, . . .]

◮ Distributed HMM with structured dependencies amongst latent states.

slide-80
SLIDE 80

Latent Dirichlet Allocation

slide-82
SLIDE 82

Topic Modelling

Topic modelling: given a corpus of documents, find the “topics” they discuss. Example: consider abstracts of papers from PNAS.

“Global climate change and mammalian species diversity in U.S. national parks”: National parks and bioreserves are key conservation tools used to protect species and their habitats within the confines of fixed political boundaries. This inflexibility may be their “Achilles’ heel” as conservation tools in the face of emerging global-scale environmental problems such as climate change. Global climate change, brought about by rising levels of greenhouse gases, threatens to alter the geographic distribution of many habitats and their component species....

“The influence of large-scale wind power on global climate”: Large-scale use of wind power can alter local and global climate by extracting kinetic energy and altering turbulent transport in the atmospheric boundary layer. We report climate-model simulations that address the possible climatic impacts of wind power at regional to global scales by using two general circulation models and several parameterizations of the interaction of wind turbines with the boundary layer....

“Twentieth century climate change: Evidence from small glaciers”: The relation between changes in modern glaciers, not including the ice sheets of Greenland and Antarctica, and their climatic environment is investigated to shed light on paleoglacier evidence of past climate change and for projecting the effects of future climate warming on cold regions of the world. Loss of glacier volume has been more or less continuous since the 19th century, but it is not a simple adjustment to the end of an “anomalous” Little Ice Age....

slide-83
SLIDE 83

Topic Modelling

Example topics discovered from PNAS abstracts (each topic represented in terms of the top 5 most common words in that topic).

slide-84
SLIDE 84

Recap: Beta Distributions

Recall the Bayesian coin toss example: P(H|q) = q, P(T|q) = 1 − q.

The probability of a sequence of coin tosses is:

P(HHTT · · · HT | q) = q^{#heads} (1 − q)^{#tails}

A conjugate prior for q is the Beta distribution:

P(q) = [Γ(a + b) / (Γ(a)Γ(b))] q^{a−1} (1 − q)^{b−1},   a, b ≥ 0

[Figure: Beta densities P(q) over q ∈ [0, 1].]

slide-87
SLIDE 87

Dirichlet Distributions

Imagine a Bayesian dice-throwing example: P(1|q) = q_1, P(2|q) = q_2, . . . , P(6|q) = q_6, with q_i ≥ 0 and Σ_i q_i = 1.

The probability of a sequence of dice throws is:

P(34156 · · · 12 | q) = Π_{i=1}^6 q_i^{# face i}

A conjugate prior for q is the Dirichlet distribution:

P(q) = [Γ(Σ_i a_i) / Π_i Γ(a_i)] Π_i q_i^{a_i − 1},   q_i ≥ 0, Σ_i q_i = 1, a_i ≥ 0

[Figure: Dirichlet densities on the simplex for parameters [1,1,1], [2,2,2], [2,10,2] and [0.9,0.9,0.9].]
slide-95
SLIDE 95

Latent Dirichlet Allocation

Each document is a sequence of words; we model it using a mixture model, ignoring the sequential nature (the “bag-of-words” assumption).

[Figure: LDA graphical model: words x_id and topics z_id, plated over words i = 1 . . . N_d and documents d = 1 . . . D; document-specific topic distributions θ_d with prior parameter α; topic-specific word distributions φ_k, plated over topics k = 1 . . . K, with prior parameter β.]

◮ Draw topic distributions from a prior: φ_k ∼ Dir(β, . . . , β).
◮ For each document:
  ◮ draw a distribution over topics: θ_d ∼ Dir(α, . . . , α);
  ◮ generate words iid:
    ◮ draw a topic from the document-specific dist: z_id ∼ Discrete(θ_d);
    ◮ draw a word from the topic-specific dist: x_id ∼ Discrete(φ_{z_id}).

Multiple mixtures of discrete distributions, sharing the same set of components (topics).
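A minimal sketch of the generative process just described, sampling a synthetic corpus with numpy. All sizes and hyperparameter values (K, D, V, N_d, alpha, beta) are illustrative choices, not values from the slides.

```python
import numpy as np

def generate_lda_corpus(D=100, K=5, V=1000, N_d=50, alpha=0.1, beta=0.01, seed=0):
    """Sample a synthetic corpus from the LDA generative process."""
    rng = np.random.default_rng(seed)
    phi = rng.dirichlet([beta] * V, size=K)        # topic-specific word distributions
    docs = []
    for d in range(D):
        theta = rng.dirichlet([alpha] * K)         # document-specific topic distribution
        z = rng.choice(K, size=N_d, p=theta)       # topic for each word position
        words = np.array([rng.choice(V, p=phi[k]) for k in z])
        docs.append(words)
    return docs, phi
```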

slide-96
SLIDE 96

Latent Dirichlet Allocation as Matrix Decomposition

Let N_dw be the number of times word w appears in document d, and P_dw the probability of word w appearing in document d.

p(N|P) = Π_dw P_dw^{N_dw}    (likelihood term)

P_dw = Σ_k p(pick topic k) p(pick word w | k) = Σ_{k=1}^K θ_dk φ_kw

i.e. the matrix P factorises as a product of the document–topic matrix [θ_dk] and the topic–word matrix [φ_kw].

This decomposition is similar to PCA and factor analysis, but not Gaussian. Related to non-negative matrix factorisation (NMF).

slide-97
SLIDE 97

Latent Dirichlet Allocation

◮ Exact inference in latent Dirichlet allocation is intractable; typically either variational or Markov chain Monte Carlo approximations are deployed.
◮ Latent Dirichlet allocation is an example of a mixed membership model from statistics.
◮ Latent Dirichlet allocation has also been applied to computer vision, social network modelling, natural language processing. . .
◮ Generalizations:
  ◮ Relax the bag-of-words assumption (e.g. a Markov model).
  ◮ Model changes in topics through time.
  ◮ Model correlations among occurrences of topics.
  ◮ Model authors, recipients, multiple corpora.
  ◮ Cross-modal interactions (images and tags).
  ◮ Nonparametric generalisations.

slide-98
SLIDE 98

Nonlinear Dimensionality Reduction / Manifold Recovery

slide-99
SLIDE 99

Nonlinear Dimensionality Reduction

We can see matrix factorisation methods as performing linear dimensionality reduction. There are many ways to generalise PCA and FA to deal with data which lie on a nonlinear manifold:

◮ Nonlinear autoencoders
◮ Generative topographic mappings (GTM) and Kohonen self-organising maps (SOM)
◮ Multi-dimensional scaling (MDS)
◮ Kernel PCA (based on MDS representation)
◮ Isomap
◮ Locally linear embedding (LLE)
◮ Stochastic Neighbour Embedding
◮ Gaussian Process Latent Variable Models (GPLVM)

slide-100
SLIDE 100

Another view of PCA: matching inner products

We have viewed PCA as providing a decomposition of the covariance or scatter matrix S. We obtain similar results if we approximate the Gram matrix: minimise

E = Σ_ij (G_ij − z_i · z_j)^2

for z_i ∈ R^k. That is, look for a k-dimensional embedding in which dot products (which depend on lengths and angles) are preserved as well as possible. We will see that this is also equivalent to preserving distances between points.

slide-103
SLIDE 103

Another view of PCA: matching inner products

Consider the eigendecomposition of G: G = UΛU^T, arranged so that λ_1 ≥ · · · ≥ λ_m ≥ 0.

The best rank-k approximation G ≈ Z^T Z is given by:

Z^T = [U]_{1:m,1:k} [Λ^{1/2}]_{1:k,1:k} = [UΛ^{1/2}]_{1:m,1:k};   Z = [Λ^{1/2}U^T]_{1:k,1:m}

[Figure: the matrix with rows √λ_1 u_1^T, √λ_2 u_2^T, . . . , √λ_m u_m^T; its top k rows give the embedded points z_1, z_2, . . . , z_m as columns.]

The same operations can be performed on the kernel Gram matrix ⇒ Kernel PCA.

slide-104
SLIDE 104

Multidimensional Scaling

Suppose all we were given were distances or symmetric “dissimilarities” ∆ij.

∆ = [  0    ∆12  ∆13  ∆14
      ∆12   0   ∆23  ∆24
      ∆13  ∆23   0   ∆34
      ∆14  ∆24  ∆34   0  ]

Goal: find vectors z_i such that ‖z_i − z_j‖ ≈ ∆_ij. This is called Multidimensional Scaling (MDS).

slide-105
SLIDE 105

Metric MDS

Assume the dissimilarities represent Euclidean distances between points in some high-D space:

∆_ij = ‖x_i − x_j‖,   with Σ_i x_i = 0.

We have:

∆_ij^2 = ‖x_i‖^2 + ‖x_j‖^2 − 2 x_i · x_j
Σ_k ∆_ik^2 = m‖x_i‖^2 + Σ_k ‖x_k‖^2 − 0
Σ_k ∆_kj^2 = Σ_k ‖x_k‖^2 + m‖x_j‖^2 − 0
Σ_kl ∆_kl^2 = 2m Σ_k ‖x_k‖^2

⇒ G_ij = x_i · x_j = (1/2) [ (1/m) Σ_k (∆_ik^2 + ∆_kj^2) − (1/m^2) Σ_kl ∆_kl^2 − ∆_ij^2 ]

slide-106
SLIDE 106

Metric MDS and eigenvalues

We will actually minimize the error in the dot products:

E = Σ_ij (G_ij − z_i · z_j)^2

As in PCA, this is given by the top slice of the eigenvector matrix: the embedded points z_1, . . . , z_m are the columns of the top k rows √λ_1 u_1^T, . . . , √λ_k u_k^T of Λ^{1/2}U^T.

slide-107
SLIDE 107

Interpreting MDS

G = (1/2) [ (1/m)(∆^2 1 + 1 ∆^2) − (1/m^2) 1 ∆^2 1 − ∆^2 ]

G = UΛU^T;   Z = [Λ^{1/2}U^T]_{1:k,1:m}

(1 is a matrix of ones; ∆^2 is the matrix of squared dissimilarities.)

◮ Eigenvectors. Ordered, scaled and truncated to yield the low-dimensional embedded points z_i.
◮ Eigenvalues. Measure how much each dimension contributes to the dot products.
◮ Estimated dimensionality. Number of significant (nonnegative – negative values are possible if the ∆_ij are not metric) eigenvalues.
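A minimal sketch of this procedure. It uses the double-centring form G = −(1/2) J ∆^2 J with J = I − (1/m) 1, which is algebraically the same Gram matrix as the formula above; the names Delta and k are illustrative.

```python
import numpy as np

def classical_mds(Delta, k=2):
    """Delta: (m, m) symmetric dissimilarity matrix. Returns (m, k) embedding."""
    m = Delta.shape[0]
    D2 = Delta ** 2
    J = np.eye(m) - np.ones((m, m)) / m            # centring matrix
    G = -0.5 * J @ D2 @ J                          # Gram matrix (equivalent to the slide's formula)
    lam, U = np.linalg.eigh(G)                     # eigenvalues in ascending order
    idx = np.argsort(lam)[::-1][:k]                # top-k eigenvalues
    lam_k = np.clip(lam[idx], 0, None)             # negative values possible if non-metric
    return U[:, idx] * np.sqrt(lam_k)              # rows are the embedded points z_i
```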

slide-108
SLIDE 108

MDS and PCA

Dual matrices:

S = (1/m) X X^T   scatter matrix   (n × n)
G = X^T X   Gram matrix   (m × m)

◮ Same eigenvalues up to a constant factor.
◮ Equivalent on metric data, but MDS can run on non-metric dissimilarities.
◮ Computational cost is different:
  ◮ PCA: O((m + k)n^2)
  ◮ MDS: O((n + k)m^2)

slide-109
SLIDE 109

Non-metric MDS

MDS can be generalised to permit a monotonic mapping:

∆ij → g(∆ij),

even if this violates metric rules (like the triangle inequality). This can introduce a non-linear warping of the manifold.

slide-110
SLIDE 110

But

Rank ordering of Euclidean distances is NOT preserved in “manifold learning”.

[Figure: two configurations of points A, B, C: one with d(A,C) < d(A,B), the other with d(A,C) > d(A,B).]

slide-111
SLIDE 111

Isomap

Idea: try to trace distance along the manifold. Use geodesic instead of (transformed) Euclidean distances in MDS.

◮ preserves local structure
◮ estimates “global” structure
◮ preserves information (MDS)

slide-112
SLIDE 112

Stages of Isomap

1. Identify neighbourhoods around each point (local points, assumed to be local on the manifold). Euclidean distances are preserved within a neighbourhood.

2. For points outside the neighbourhood, estimate distances by hopping between points within neighbourhoods.

3. Embed using MDS.
slide-113
SLIDE 113

Step 1: Adjacency graph

First we construct a graph linking each point to its neighbours.

◮ vertices represent input points
◮ undirected edges connect neighbours (weight = Euclidean distance)

Forms a discretised approximation to the submanifold, assuming:

◮ Graph is singly-connected.
◮ Graph neighborhoods reflect manifold neighborhoods. No “short cuts”.

Defining the neighbourhood is critical: k-nearest neighbours, inputs within a ball of radius r, prior knowledge.

slide-114
SLIDE 114

Step 2: Geodesics

Estimate distances by the shortest path in the graph:

∆_ij = min_{path(x_i, x_j)} Σ_{e ∈ path(x_i, x_j)} δ_e

where the δ_e are the edge lengths along the path.

◮ Standard graph problem. Solved by Dijkstra’s algorithm (and others).
◮ Better estimates for denser sampling.
◮ Short cuts very dangerous (“average” path distance?).
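A minimal sketch of steps 1–2 (neighbourhood graph plus graph shortest paths), using scipy's shortest_path for Dijkstra. The resulting distance matrix can then be passed to metric MDS (e.g. the classical_mds sketch earlier); the names X and k are illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import shortest_path

def isomap_geodesics(X, k=8):
    """X: (m, n) inputs. Returns an (m, m) matrix of estimated geodesic distances."""
    D = cdist(X, X)                                # Euclidean distances
    W = np.full_like(D, np.inf)                    # inf marks "no edge"
    nn = np.argsort(D, axis=1)[:, 1:k + 1]         # k nearest neighbours of each point
    for i, neigh in enumerate(nn):
        W[i, neigh] = D[i, neigh]
    W = np.minimum(W, W.T)                         # symmetrise the graph
    return shortest_path(W, method='D', directed=False)   # Dijkstra shortest paths
```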

slide-115
SLIDE 115

Step 3: Embed

Embed using metric MDS (path distances obey the triangle inequality)

◮ Eigenvectors of the Gram matrix yield the low-dimensional embedding.
◮ Number of significant eigenvalues estimates the dimensionality.

slide-116
SLIDE 116

Isomap example 1

slide-117
SLIDE 117

Isomap example 2

slide-118
SLIDE 118

Locally Linear Embedding (LLE)

MDS and isomap preserve local and global (estimated, for isomap) distances. PCA preserves local and global structure. Idea: estimate local (linear) structure of manifold. Preserve this as well as possible.

◮ preserves local structure (not just distance)
◮ not explicitly global
◮ preserves only local information

slide-119
SLIDE 119

Stages of LLE

slide-120
SLIDE 120

Step 1: Neighbourhoods

Just as in isomap, we first define neighbouring points for each input. Equivalent to the isomap graph, but we won’t need the graph structure. Forms a discretised approximation to the submanifold, assuming:

◮ Graph is singly-connected — although will “work” if not.
◮ Neighborhoods reflect manifold neighborhoods. No “short cuts”.

Defining the neighbourhood is critical: k-nearest neighbours, inputs within a ball of radius r, prior knowledge.

slide-121
SLIDE 121

Step 2: Local weights

Estimate local weights to minimize the error

Φ(W) = Σ_i ‖ x_i − Σ_{j∈Ne(i)} W_ij x_j ‖^2,   subject to Σ_{j∈Ne(i)} W_ij = 1

◮ Linear regression – under- or over-constrained depending on |Ne(i)|.
◮ Local structure – optimal weights are invariant to rotation, translation and scaling.
◮ Short cuts less dangerous (one in many).

slide-122
SLIDE 122

Step 3: Embed

Minimise reconstruction errors in z-space under the same weights:

ψ(Z) = Σ_i ‖ z_i − Σ_{j∈Ne(i)} W_ij z_j ‖^2   subject to:   Σ_i z_i = 0;   Σ_i z_i z_i^T = mI

We can re-write the cost function in quadratic form:

ψ(Z) = Σ_ij Ψ_ij [Z^T Z]_ij   with   Ψ = (I − W)^T (I − W)

Minimise by setting Z to equal the bottom 2 . . . k + 1 eigenvectors of Ψ. (The bottom eigenvector is always 1 – discard it due to the centering constraint.)
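A minimal sketch of LLE steps 2–3 as described above. It adds a small regulariser when solving for the weights (a common practical choice, not stated on the slide); the names X, k, d and reg are illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist

def lle(X, k=8, d=2, reg=1e-3):
    """X: (m, n) inputs. Returns an (m, d) embedding."""
    m = X.shape[0]
    nn = np.argsort(cdist(X, X), axis=1)[:, 1:k + 1]
    W = np.zeros((m, m))
    for i in range(m):
        Ni = X[nn[i]] - X[i]                       # neighbours, centred on x_i
        C = Ni @ Ni.T
        C += reg * np.trace(C) * np.eye(k)         # regularise (needed if k > n)
        w = np.linalg.solve(C, np.ones(k))
        W[i, nn[i]] = w / w.sum()                  # enforce sum_j W_ij = 1
    M = np.eye(m) - W
    lam, U = np.linalg.eigh(M.T @ M)               # Psi = (I - W)^T (I - W)
    return U[:, 1:d + 1]                           # drop the bottom (constant) eigenvector
```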

slide-123
SLIDE 123

LLE example 1

Surfaces

N = 1000 inputs, k = 8 nearest neighbours, D = 3 input dimensions, d = 2 embedding dimensions.

slide-124
SLIDE 124

LLE example 2

slide-125
SLIDE 125

LLE example 3

slide-126
SLIDE 126

LLE and Isomap

Many similarities

◮ Graph-based, spectral methods
◮ No local optima

Essential differences

◮ LLE does not estimate dimensionality
◮ Isomap can be shown to be consistent; no theoretical guarantees for LLE.
◮ LLE diagonalises a sparse matrix – more efficient than isomap.
◮ Local weights vs. local & global distances.

slide-128
SLIDE 128

Maximum Variance Unfolding

Unfold neighbourhood graph preserving local structure.

1. Build the neighbourhood graph.

2. Find {z_i} ⊂ R^n (points in high-D space) with maximum variance, preserving local distances. Let K_ij = z_i^T z_j. Then:

   Maximise Tr[K] subject to:
   Σ_ij K_ij = 0   (centered)
   K ⪰ 0   (positive semi-definite)
   K_ii − 2K_ij + K_jj (= ‖z_i − z_j‖^2) = ‖x_i − x_j‖^2 for j ∈ Ne(i)   (locally metric)

   This is a semi-definite program: a convex optimisation with a unique solution.

3. Embed the z_i in R^k using linear methods (PCA/MDS).
slide-129
SLIDE 129

Stochastic Neighbour Embedding

Softer, “probabilistic” notions of neighbourhood and consistency. High-D “transition” probabilities:

p_{j|i} = exp(−(1/2) ‖x_i − x_j‖^2 / σ^2) / Σ_{k≠i} exp(−(1/2) ‖x_i − x_k‖^2 / σ^2)   for j ≠ i,   p_{i|i} = 0

Find {z_i} ⊂ R^k to minimise

Σ_ij p_{j|i} log ( p_{j|i} / q_{j|i} )   with   q_{j|i} = exp(−(1/2) ‖z_i − z_j‖^2) / Σ_{k≠i} exp(−(1/2) ‖z_i − z_k‖^2).

Nonconvex optimisation is initialisation dependent. The scale σ plays a similar role to the neighbourhood definition:

◮ Fixed σ: resembles a fixed-radius ball.
◮ Choose σ_i to maintain a consistent entropy in p_{j|i} of log_2 k: similar to k-nearest neighbours.

slide-130
SLIDE 130

SNE variants

◮ Symmetrise probabilities (p_ij = p_ji):

p_ij = exp(−(1/2) ‖x_i − x_j‖^2 / σ^2) / Σ_{k≠l} exp(−(1/2) ‖x_l − x_k‖^2 / σ^2)   for j ≠ i.

Define q_ij analogously and optimise the joint KL divergence.

◮ Heavy-tailed embedding distributions allow embedding to lower dimensions than the true manifold:

q_ij = (1 + ‖z_i − z_j‖^2)^{-1} / Σ_{k≠l} (1 + ‖z_k − z_l‖^2)^{-1}

The Student-t distribution defines “t-SNE”. Focus is on visualisation, rather than manifold discovery.

slide-136
SLIDE 136

Gaussian Process Latent Variable Models

Recap: probabilistic PCA:

x_i | z_i, Λ ∼ N(Λ z_i, β^{-1} I),   z_i ∼ N(0, I)

Usually: compute the posterior over Z = [z_1, . . . , z_N]^T, maximizing the likelihood over Λ.

Suppose instead we know the values of the latent Z; then we can integrate out Λ (c.f. linear regression), giving a conditional probability of X = [x_1 . . . x_N]^T:

Λ ∼ N(0, α^{-1} I)

p(X|Z) = |2πK|^{-D/2} exp( −(1/2) Tr[K^{-1} X X^T] ),   K = αZZ^T + βI

This is just D independent Gaussian processes, one for each dimension of X! Each Gaussian process describes a mapping from the latent space z to one dimension of x. Replacing the linear kernel with nonlinear kernels gives nonlinear mappings—nonlinear dimensionality reduction. But now the dependence on Z is complicated—instead of computing a posterior over Z we must find point values that maximise the likelihood (jointly with the hyperparameters), or use a variational approximation (cf. also the Locally-Linear Latent Variable Model).
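A minimal sketch of the GPLVM objective: the log marginal likelihood of X given latent coordinates Z, with the linear kernel αZZ^T + βI replaced by an RBF kernel as the slide suggests. In practice Z (and the hyperparameters) would be optimised by gradient ascent; all names and default values are illustrative.

```python
import numpy as np

def gplvm_log_likelihood(X, Z, alpha=1.0, lengthscale=1.0, beta=0.1):
    """X: (N, D) data, Z: (N, k) latent coordinates. Returns log p(X | Z)."""
    N, D = X.shape
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)          # squared distances in latent space
    K = alpha * np.exp(-0.5 * sq / lengthscale ** 2) + beta * np.eye(N)
    sign, logdet = np.linalg.slogdet(K)
    return -0.5 * D * (logdet + N * np.log(2 * np.pi)) \
           - 0.5 * np.trace(np.linalg.solve(K, X @ X.T))          # -1/2 Tr[K^{-1} X X^T]
```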

slide-137
SLIDE 137

Gaussian Process Latent Variable Models

slide-138
SLIDE 138

Intractability

For many probabilistic models of interest, exact inference is not computationally feasible. There are three (main) reasons:

◮ Distributions may have complicated forms (e.g. non-linearities in the generative model).
◮ “Explaining away”: observing the value of a child induces dependencies amongst its parents.
  [Figure: parents y1, y2, . . . , yK with a common child x1.]
◮ Even with simple models, Bayesian computation of the full posterior over both latent variables and parameters is made complicated by the strong coupling between latent variables and parameters.

We can still work with such models by using approximate inference techniques to estimate the latent variables.

slide-139
SLIDE 139

Approximate Inference

◮ Linearisation: Approximate nonlinearities by Taylor series expansion about a point (e.g. the approximate mean or mode of the hidden variable distribution). Linear approximations are particularly useful since Gaussian distributions are closed under linear transformations (e.g., EKF). Also Laplace’s approximation.

◮ Monte Carlo Sampling: Approximate the posterior distribution over unobserved variables by a set of random samples. We often need Markov chain Monte Carlo or sequential Monte Carlo methods to sample from difficult distributions.

◮ Variational Methods: Approximate the hidden variable posterior p(H) with a tractable form q(H), such that KL[q‖p] is minimised. This gives a lower bound on the likelihood that can be maximised with respect to the parameters of q(H).

◮ Local Message Passing Methods: Approximate the hidden variable posterior p(H) with a tractable form q(H), or with a set of locally consistent tractable forms, by other means (loopy belief propagation, expectation propagation).

◮ Recognition Models and Autoencoders: Approximate the hidden variable posterior distribution using an explicit bottom-up recognition model/network.

slide-140
SLIDE 140

References

◮ Pattern Classification. Duda, Hart and Stork. Wiley, 2000.
◮ A Unifying Review of Linear Gaussian Models. Roweis and Ghahramani. Neural Computation, 1999.
◮ Independent Component Analysis. Hyvärinen, Karhunen and Oja. John Wiley and Sons, 2001.
◮ Emergence of Simple-Cell Receptive Field Properties by Learning a Sparse Code for Natural Images. Olshausen & Field. Nature, 1996.
◮ A Learning Algorithm for Boltzmann Machines. Ackley, Hinton and Sejnowski. Cognitive Science, 1985.
◮ Connectionist Learning of Belief Networks. Neal. Artificial Intelligence, 1992.
◮ Latent Dirichlet Allocation. Blei, Ng and Jordan. Journal of Machine Learning Research, 2003.
◮ Factorial Hidden Markov Models. Ghahramani and Jordan. Machine Learning, 1997.
◮ Dynamic Bayesian Networks: Representation, Inference and Learning. Kevin Murphy. PhD Thesis, 2002.

slide-141
SLIDE 141

References

◮ Isomap. Tenenbaum, de Silva & Langford. Science, 290(5500):2319–23 (2000).
◮ LLE. Roweis & Saul. Science, 290(5500):2323–6 (2000).
◮ Laplacian Eigenmaps. Belkin & Niyogi. Neural Comput 15(6):1373–96 (2003).
◮ Hessian LLE. Donoho & Grimes. PNAS 100(10):5591–6 (2003).
◮ Maximum variance unfolding. Weinberger & Saul. Int J Comput Vis 70(1):77–90 (2006).
◮ Conformal eigenmaps. Sha & Saul. ICML 22:785–92 (2005).
◮ SNE. Hinton & Roweis. NIPS, 2002; t-SNE. van der Maaten & Hinton. JMLR, 9:2579–2605, 2008.
◮ Gaussian Process Latent Variable Models. Lawrence. Advances in Neural Information Processing Systems, 2004.
◮ Locally-Linear Latent Variable Models. Park et al. Advances in Neural Information Processing Systems, 2015.

More at: http://www.gatsby.ucl.ac.uk/~maneesh/dimred/