Probabilistic & Unsupervised Learning
Parametric Variational Methods and Recognition Models
Maneesh Sahani
maneesh@gatsby.ucl.ac.uk
Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science
University College London
Variational methods
◮ Our treatment of variational methods has (except EP) emphasised ‘natural’ choices of variational family – most often factorised, using the same functional (ExpFam) form as the joint.
◮ Mostly restricted to joint exponential families – facilitates hierarchical and distributed models, but not non-linear/non-conjugate ones.
◮ Parametric variational methods might extend our reach.
Define a parametric family of posterior approximations q(Y; ρ). The constrained (approximate) variational E-step becomes:

q(Y) := argmax_{q ∈ {q(Y;ρ)}} F(q(Y), θ^(k−1))   ⇒   ρ^(k) := argmax_ρ F(q(Y; ρ), θ^(k−1))

and so we can replace constrained optimisation of F(q, θ) with unconstrained optimisation of a constrained-form F(ρ, θ):

F(ρ, θ) = ⟨log P(X, Y|θ)⟩_q(Y;ρ) + H[ρ]
It might still be valuable to use coordinate ascent in ρ and θ, although this is no longer necessary.
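As a concrete check of this objective, here is a hypothetical one-dimensional example (an illustrative assumption, not from the slides): y ∼ N(0,1), x|y ∼ N(y,1) with x = 1 observed, so the exact posterior is N(0.5, 0.5) and log P(x) = log N(1; 0, 2). Evaluating F(ρ, θ) by Monte Carlo shows that it lower-bounds log P(x), with equality when q matches the posterior:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model (not from the slides): y ~ N(0,1), x|y ~ N(y,1), x = 1.
# Exact posterior: N(0.5, 0.5); exact evidence: log P(x) = log N(1; 0, 2).
x = 1.0
log_px = -0.5 * np.log(2 * np.pi * 2.0) - x**2 / (2 * 2.0)

def free_energy(m, s2, n=100_000):
    """Monte-Carlo estimate of F(rho) = <log P(x,y)>_q + H[q] for q = N(m, s2)."""
    y = m + np.sqrt(s2) * rng.standard_normal(n)       # samples from q(y; rho)
    log_joint = -np.log(2 * np.pi) - 0.5 * y**2 - 0.5 * (x - y)**2
    log_q = -0.5 * np.log(2 * np.pi * s2) - 0.5 * (y - m)**2 / s2
    return np.mean(log_joint - log_q)                  # <log P - log q>_q

F_opt = free_energy(0.5, 0.5)   # q = exact posterior: F = log P(x)
F_bad = free_energy(0.0, 0.5)   # mismatched q: F = log P(x) - KL[q||p] < log P(x)
```

Here F_opt recovers log P(x) almost exactly, while the mismatched q pays a KL penalty of 0.25 nats.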
Optimising the variational parameters

F(ρ, θ) = ⟨log P(X, Y|θ)⟩_q(Y;ρ) + H[ρ]

◮ In some special cases, the expectations of the log-joint under q(Y; ρ) can be expressed in closed form, but these are rare.
◮ Otherwise we might seek to follow ∇ρF.
◮ Naively, this requires evaluating a high-dimensional expectation wrt q(Y; ρ) as a function of ρ – not simple.
◮ At least three solutions:
  ◮ “Score-based” gradient estimate, and Monte Carlo (Ranganath et al. 2014).
  ◮ Recognition network trained in a separate phase – not strictly variational (Dayan et al. 1995).
  ◮ Recognition network trained simultaneously with the generative model using “frozen” samples (Kingma and Welling 2014; Rezende et al. 2014).
Score-based gradient estimate

We have:

∇ρF(ρ, θ) = ∇ρ ∫ dY q(Y; ρ) (log P(X, Y|θ) − log q(Y; ρ))
          = ∫ dY [∇ρq(Y; ρ)] (log P(X, Y|θ) − log q(Y; ρ)) + q(Y; ρ) ∇ρ[log P(X, Y|θ) − log q(Y; ρ)]

Now,

∇ρ log P(X, Y|θ) = 0   (no direct dependence)
∫ dY q(Y; ρ) ∇ρ log q(Y; ρ) = ∇ρ ∫ dY q(Y; ρ) = 0   (always normalised)
∇ρq(Y; ρ) = q(Y; ρ) ∇ρ log q(Y; ρ)

So,

∇ρF(ρ, θ) = ⟨[∇ρ log q(Y; ρ)] (log P(X, Y|θ) − log q(Y; ρ))⟩_q(Y;ρ)

This reduces the gradient of an expectation to the expectation of a gradient – easier to compute.
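The identity can be checked numerically on a hypothetical 1-D toy model (an illustrative assumption, not from the slides): y ∼ N(0,1), x|y ∼ N(y,1), x = 1 observed, with q(y; ρ) = N(m, s²) and ρ = (m, log s). At the exact posterior N(0.5, 0.5) the estimated gradient vanishes; at m = 0 it matches the analytic value x − 2m = 1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D toy model: y ~ N(0,1), x|y ~ N(y,1), observed x = 1,
# so the exact posterior is N(0.5, 0.5). Family: q(y; rho) = N(m, s^2).
x = 1.0

def score_gradient(m, log_s, n=100_000):
    """<[grad_rho log q(y)] (log P(x,y) - log q(y))>_q by Monte Carlo."""
    s = np.exp(log_s)
    y = m + s * rng.standard_normal(n)                 # y ~ q(y; rho)
    log_joint = -np.log(2 * np.pi) - 0.5 * y**2 - 0.5 * (x - y)**2
    log_q = -0.5 * np.log(2 * np.pi * s**2) - 0.5 * (y - m)**2 / s**2
    w = log_joint - log_q
    g_m = np.mean((y - m) / s**2 * w)                  # score wrt m
    g_ls = np.mean(((y - m)**2 / s**2 - 1.0) * w)      # score wrt log s
    return g_m, g_ls

g_m, g_ls = score_gradient(0.5, 0.5 * np.log(0.5))     # at the exact posterior
g_m0, _ = score_gradient(0.0, 0.5 * np.log(0.5))       # analytic grad_m = 1 here
```

Note the estimator needs only samples from q and evaluations of the log-joint – no gradients of P are required, which is what makes the approach “black-box”.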
Factorisation

∇ρF(ρ, θ) = ⟨[∇ρ log q(Y; ρ)](log P(X, Y|θ) − log q(Y; ρ))⟩_q(Y;ρ)

◮ Still requires a high-dimensional expectation, but can now be evaluated by Monte Carlo.
◮ Dimensionality reduced by factorisation (particularly where P(X, Y) is factorised).

Let q(Y) = ∏_i q(Y_i; ρ_i) factor over disjoint cliques; let Ȳ_i be the minimal Markov blanket of Y_i in the joint; P_Ȳi be the product of joint factors that include any element of Y_i (so the union of their arguments is Ȳ_i); and P_¬Ȳi the remaining factors. Then,

∇ρi F({ρj}, θ) = ⟨[∇ρi Σ_j log q(Y_j; ρ_j)](log P(X, Y|θ) − Σ_j log q(Y_j; ρ_j))⟩_q(Y)
= ⟨[∇ρi log q(Y_i; ρ_i)](log P_Ȳi(X, Ȳ_i) − log q(Y_i; ρ_i))⟩_q(Ȳi)
  + ⟨[∇ρi log q(Y_i; ρ_i)](log P_¬Ȳi(X, Y_¬i) − Σ_{j≠i} log q(Y_j; ρ_j))⟩_q(Y)

where the second factor in the last term is constant wrt Y_i. So the second term is proportional to ⟨∇ρi log q(Y_i; ρ_i)⟩_q(Yi), which = 0 as before. So expectations are only needed wrt q(Ȳ_i) → message passing!
Sampling
So the “black-box” variational approach is as follows:
◮ Choose a parametric (factored) variational family q(Y) = ∏_i q(Y_i; ρ_i).
◮ Initialise the factors.
◮ Repeat to convergence:
  ◮ Stochastic VE-step. For each i:
    ◮ Sample from q(Ȳ_i) and estimate the expected gradient ∇ρi F.
    ◮ Update ρ_i along the gradient.
  ◮ Stochastic M-step. For each i:
    ◮ Sample from each q(Ȳ_i).
    ◮ Update the corresponding parameters.

◮ Stochastic gradient steps may employ a Robbins–Monro step-size sequence to promote convergence.
◮ Variance of the gradient estimators can also be controlled by clever Monte-Carlo techniques (the original authors used a “control variate” method that we have not studied).
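The loop above can be sketched on a hypothetical single-factor problem (an illustrative assumption, not from the slides): y ∼ N(0,1), x|y ∼ N(y,1), x = 1, so the exact posterior is N(0.5, 0.5). With one clique the Markov blanket is trivial, and θ is held fixed (no M-step), leaving just the stochastic VE-step with Robbins–Monro step sizes:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical single-factor model: y ~ N(0,1), x|y ~ N(y,1), observed x = 1.
# Variational factor q(y; rho) = N(m, s^2); exact posterior N(0.5, 0.5).
x = 1.0
m, log_s = 0.0, 0.0                                    # initialise the factor

for t in range(500):
    s = np.exp(log_s)
    y = m + s * rng.standard_normal(2000)              # stochastic VE: sample q
    log_joint = -np.log(2 * np.pi) - 0.5 * y**2 - 0.5 * (x - y)**2
    log_q = -0.5 * np.log(2 * np.pi * s**2) - 0.5 * (y - m)**2 / s**2
    w = log_joint - log_q
    g_m = np.mean((y - m) / s**2 * w)                  # score-based gradients
    g_ls = np.mean(((y - m)**2 / s**2 - 1.0) * w)
    eta = 1.0 / (t + 10.0)                             # Robbins-Monro step sizes
    m += eta * g_m
    log_s += eta * g_ls

s2 = np.exp(2 * log_s)                                 # variational variance
```

The decaying step sizes satisfy the Robbins–Monro conditions (Ση_t = ∞, Ση_t² < ∞), so the noisy iterates settle down: m and s² approach the true posterior mean 0.5 and variance 0.5.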
Recognition Models
We have not generally distinguished between multivariate models and iid data instances. However, even for large models (such as HMMs), we often work with multiple data draws (e.g. multiple strings), and each instance requires its own variational optimisation. Suppose we have fixed-length vectors {(xi, yi)} (y is still latent).

◮ The optimal variational distribution q∗(yi) depends on xi.
◮ Learn this mapping (in parametric form): q(yi; f(xi; ρ)).
◮ f is a general function approximator (a GP, neural network or similar) parametrised by ρ, trained to map xi to the variational parameters of q(yi).
◮ The mapping function f is called a recognition model.
◮ This approach is now sometimes called amortised inference.
How to learn f?
The Helmholtz Machine
Dayan et al. (1995) originally studied a binary sigmoid belief net, with a parallel recognition model.
Two-phase learning:

◮ Wake phase: given the current f, estimate a mean-field representation from the data (the mean sufficient stats for a Bernoulli are just probabilities): ŷi = f(xi; ρ). Update the generative parameters θ according to ∇θF({ŷi}, θ).
◮ Sleep phase: sample {ys, xs}, s = 1 … S, from the current generative model. Update the recognition parameters ρ to direct f(xs) towards ys (simple gradient learning):

∆ρ ∝ Σs (ys − f(xs; ρ)) ∇ρ f(xs; ρ)
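The sleep-phase delta rule can be sketched on a hypothetical linear-Gaussian stand-in for the belief net (an assumption for illustration, not the original binary model): generative y ∼ N(0,1), x|y ∼ N(w·y, 1), with a linear recognition model f(x; ρ) = ρ·x. Regressing dreamed ys onto dreamed xs drives ρ to the exact posterior-mean coefficient w/(w² + 1):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical linear-Gaussian stand-in: generative y ~ N(0,1), x|y ~ N(w*y, 1);
# linear recognition model f(x; rho) = rho * x. The exact posterior mean is
# E[y|x] = w/(w^2 + 1) * x, i.e. rho* = 0.5 for w = 1.
w = 1.0
rho = 0.0

for t in range(200):
    ys = rng.standard_normal(1000)                 # sleep phase: dream (y, x)
    xs = w * ys + rng.standard_normal(1000)        # from the generative model
    rho += 0.1 * np.mean((ys - rho * xs) * xs)     # delta rule: push f(xs) -> ys
```

Because sleep samples come from the joint, this simple regression recovers the exact posterior mapping in the linear-Gaussian case; for non-linear models the recognition network only approximates it.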
The Helmholtz Machine
◮ Can sample y from the recognition model rather than just evaluate means.
◮ Expectations in the free energy can then be computed directly rather than by mean substitution.
◮ In hierarchical models, the output of higher recognition layers then depends on samples at previous stages, which introduces correlations between samples at different layers.
◮ The recognition model structure need not exactly echo the generative model.
◮ A more general approach is to train f to yield the expected sufficient statistics of an ExpFam q(y):

∆ρ ∝ Σs (s_q(ys) − f(xs; ρ)) ∇ρ f(xs; ρ)

Current work extends this to extremely flexible (non-normalisable) exponential families.

◮ Sleep-phase learning minimises KL[pθ(y|x) ‖ q(y; f(x, ρ))] – the opposite of the variational objective, but this may not matter if the divergence is small enough.
Variational Autoencoders
[Diagram: inputs x1 … xD feed a recognition network (hidden layers y(1)) up to latents y1 … yK; a generative network (layers y(3)) maps samples back to reconstructions x̂1 … x̂D, with external noise ǫ driving the latent samples.]
◮ Fuses the wake and sleep phases.
◮ Generates recognition samples using deterministic transformations of external random variates (the “reparametrisation trick”).
◮ E.g. if f gives marginals µi and σi for latent yi, and ǫ^s_i ∼ N(0, 1), then y^s_i = µi + σi ǫ^s_i.
◮ Now the generative and recognition parameters can be trained together by gradient descent (backprop), holding the ǫ^s fixed:
F_i(θ, ρ) = Σs [ log P(x_i, y^s_i; θ) − log q(y^s_i; f(x_i, ρ)) ]

∂F_i/∂θ = Σs ∇θ log P(x_i, y^s_i; θ)

∂F_i/∂ρ = Σs [ ∂/∂y^s_i (log P(x_i, y^s_i; θ) − log q(y^s_i; f(x_i))) · dy^s_i/dρ − ∂/∂f(x_i) log q(y^s_i; f(x_i)) · df(x_i)/dρ ]
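The pathwise (reparametrised) gradient can be illustrated on a hypothetical 1-D toy model (an assumption, not from the slides): y ∼ N(0,1), x|y ∼ N(y,1), x = 1, with q(y) = N(µ, σ²) and y = µ + σǫ for frozen ǫ ∼ N(0,1). The derivative is taken by hand here rather than by backprop:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical 1-D toy model: y ~ N(0,1), x|y ~ N(y,1), observed x = 1.
# Reparametrise: y = mu + sigma*eps with eps ~ N(0,1) held fixed ("frozen").
x, mu, sigma = 1.0, 0.0, np.sqrt(0.5)
eps = rng.standard_normal(10_000)
y = mu + sigma * eps

# d/dmu [log P(x,y) - log q(y; mu, sigma)]: for a Gaussian q,
# log q(mu + sigma*eps) = -0.5*log(2*pi*sigma^2) - eps^2/2 does not depend on
# mu, so only (d log P / dy) * (dy/dmu) = (-y + (x - y)) * 1 survives.
g_mu = np.mean(x - 2.0 * y)
# The analytic gradient of F wrt mu is x - 2*mu = 1.0 here.
```

In simple cases like this the pathwise estimator has noticeably lower variance than the score-based one, because it exploits the gradient of the log-joint rather than treating it as a black box.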
Variational Autoencoders
◮ Frozen samples ǫ^s can be redrawn to avoid overfitting.
◮ It may be possible to evaluate the entropy and log P(y) without sampling, reducing variance.
◮ Differentiable reparametrisations are available for a number of different distributions.
◮ The conditional P(x|y, θ) is often implemented as a neural network with additive noise at the output, or at transitions. If at transitions, the recognition network must estimate each noise