SLIDE 1

Machine Learning

Lecture 12: Variational Autoencoder

Nevin L. Zhang lzhang@cse.ust.hk
Department of Computer Science and Engineering
The Hong Kong University of Science and Technology

This set of notes is based on internet resources and:

  • D. P. Kingma, M. Welling (2013). Auto-Encoding Variational Bayes. https://arxiv.org/abs/1312.6114
  • C. Doersch (2016). Tutorial on Variational Autoencoders. https://arxiv.org/abs/1606.05908

SLIDE 2

Introduction to Unsupervised Learning

Outline

1. Introduction to Unsupervised Learning
2. The Task
3. The Objective function
4. Optimization
5. Generating Examples
6. Discussions

SLIDE 3

Introduction to Unsupervised Learning

Introduction

So far, supervised learning:

  • Discriminative methods: {(x(i), y(i))}_{i=1}^N → p(y|x)
  • Generative methods: {(x(i), y(i))}_{i=1}^N → P(y), p(x|y)

Next, unsupervised learning:

  • Finite mixture models for clustering [skipped]: {x(i)}_{i=1}^N → P(z), p(x|z)
  • Variational autoencoder for data generation and representation learning: {x(i)}_{i=1}^N, p(z) → p(x|z), with q(z|x) used in inference
  • Generative adversarial networks for data generation: {x(i)}_{i=1}^N, p(z) → x = g(z)

SLIDE 4

The Task

Outline

1. Introduction to Unsupervised Learning
2. The Task
3. The Objective function
4. Optimization
5. Generating Examples
6. Discussions

SLIDE 5

The Task

The Task

Suppose we have an unlabeled dataset X = {x(i)}_{i=1}^N, where each training example x(i) is a vector that represents an image and each component of x(i) represents a pixel in the image.

We would like to learn a distribution p(x) from the dataset so that we can generate more images that are similar (but different) to those in the dataset.

If we can solve this task, then we have the ability to learn very complex probabilistic models for high-dimensional data. The ability to generate realistic-looking images would be useful for video game designers.

SLIDE 6

The Task

The Generative Model

We assume that each image has a label z that is not observed. z is a vector of much lower dimension than x.

We further assume that the images are generated as follows:

  • z ∼ p(z) = N(0, I), where I is the identity matrix
  • x ∼ pθ(x|z), where θ denotes model parameters

Then we have pθ(x) = ∫ pθ(x|z) p(z) dz.

SLIDE 7

The Task

The Generative Model

In addition, we assume that the conditional distribution is a Gaussian:

pθ(x|z) = N(x | µx(z, θ), σ²x(z, θ) I)

with mean vector µx(z, θ) and diagonal covariance matrix σ²x(z, θ) I. The mean vector µx(z, θ) and the vector σx(z, θ) of standard deviations are deterministically computed from z by a deep neural network with parameters θ. So we make use of the ability of neural networks to represent complex functions in order to learn complicated probabilistic models.
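As a concrete illustration, here is a minimal PyTorch sketch of such a decoder; the layer sizes, names, and MNIST-like dimensions are our own illustrative assumptions, not part of the lecture:

```python
import torch
import torch.nn as nn

class GaussianDecoder(nn.Module):
    """Maps a latent z to the mean and sd vectors of p_theta(x|z)."""
    def __init__(self, z_dim=20, x_dim=784, hidden=400):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, x_dim)         # mu_x(z, theta)
        self.log_sigma = nn.Linear(hidden, x_dim)  # log sigma_x(z, theta)

    def forward(self, z):
        h = self.body(z)
        return self.mu(h), self.log_sigma(h).exp()  # exp keeps the sd positive
```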

SLIDE 8

The Objective function

Outline

1. Introduction to Unsupervised Learning
2. The Task
3. The Objective function
4. Optimization
5. Generating Examples
6. Discussions

SLIDE 9

The Objective function

The Likelihood Function

To learn the model parameters, we need to maximize the following log-likelihood function:

log pθ(X) = Σ_{i=1}^N log pθ(x(i)), where log pθ(x(i)) = log ∫ pθ(x(i)|z) p(z) dz

We want to use gradient ascent to maximize the likelihood function, which requires the gradient ∇θ log pθ(x(i)). The gradient is intractable because of the integration.

SLIDE 10

The Objective function

Naive Monte Carlo Gradient Estimator

Here is a naive method to estimate pθ(x(i)), and hence the gradients. Sample L points z(1), . . . , z(L) from p(z), and estimate pθ(x(i)) using

pθ(x(i)) ≈ (1/L) Σ_{l=1}^L pθ(x(i)|z(l))

Then we can compute ∇θ log pθ(x(i)).

Unfortunately, this would not work. The reason is that x is high-dimensional (thousands to millions of dimensions). Given z, pθ(x|z) is highly skewed, taking non-negligible values only in a very small region. To state it another way, for a given data point x(i), pθ(x(i)|z) takes non-negligible values only for z from a very small region. As such, L needs to be extremely large for the estimate to be accurate.
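To make the failure concrete, here is a numpy sketch of the naive estimator; `log_px_given_z` is a hypothetical function returning log pθ(x|z) for a fitted decoder:

```python
import numpy as np

def naive_log_px(x, log_px_given_z, L=1000, z_dim=20):
    """Estimate log p(x) by averaging p(x|z^(l)) over prior samples z^(l) ~ N(0, I)."""
    zs = np.random.randn(L, z_dim)
    vals = np.array([log_px_given_z(x, z) for z in zs])  # log p(x | z^(l))
    m = vals.max()                                       # log-sum-exp for stability
    return m + np.log(np.exp(vals - m).mean())
```

For high-dimensional x, almost every prior sample gives a vanishing pθ(x|z), so the average is dominated by rare lucky draws and the estimate has enormous variance unless L is astronomically large.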

SLIDE 11

The Objective function

Recognition Model

To overcome the aforementioned difficulty, we introduce a recognition model qφ(z|x):

qφ(z|x) = N(z | µz(x, φ), σ²z(x, φ) I)

The mean vector µz(x, φ) and the vector σz(x, φ) of standard deviations are deterministically computed from x by a deep neural network with parameters φ.

We hope to get from qφ(z|x(i)) samples of z for which pθ(x(i)|z) has non-negligible values. The question now is: how do we make use of qφ(z|x) when maximizing the likelihood log pθ(X) = Σ_{i=1}^N log pθ(x(i))?

The answer is: variational inference.
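A matching PyTorch sketch of the recognition network, mirroring the decoder sketch earlier (sizes again our own choices):

```python
class GaussianEncoder(nn.Module):
    """Maps an input x to the mean and sd vectors of q_phi(z|x)."""
    def __init__(self, x_dim=784, z_dim=20, hidden=400):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, z_dim)         # mu_z(x, phi)
        self.log_sigma = nn.Linear(hidden, z_dim)  # log sigma_z(x, phi)

    def forward(self, x):
        h = self.body(x)
        return self.mu(h), self.log_sigma(h).exp()
```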

SLIDE 12

The Objective function

The Variational Lower Bound

log pθ(x(i)) = E_{z∼qφ(z|x(i))}[log pθ(x(i))]
            = E_{z∼qφ}[log (pθ(x(i)|z) pθ(z) / pθ(z|x(i)))]
            = E_{z∼qφ}[log ((pθ(x(i)|z) pθ(z) / pθ(z|x(i))) · (qφ(z|x(i)) / qφ(z|x(i))))]
            = E_{z∼qφ}[log pθ(x(i)|z)] − E_{z∼qφ}[log (qφ(z|x(i)) / pθ(z))] + E_{z∼qφ}[log (qφ(z|x(i)) / pθ(z|x(i)))]
            = E_{z∼qφ}[log pθ(x(i)|z)] − DKL[qφ(z|x(i)) ‖ pθ(z)] + DKL[qφ(z|x(i)) ‖ pθ(z|x(i))]
            = L(x(i), θ, φ) + DKL[qφ(z|x(i)) ‖ pθ(z|x(i))]

Since DKL[qφ(z|x(i)) ‖ pθ(z|x(i))] ≥ 0, we have the following variational lower bound on the log-likelihood, which is tight if q has high capacity (i.e., if qφ(z|x(i)) can match pθ(z|x(i)) closely):

log pθ(x(i)) ≥ L(x(i), θ, φ)

SLIDE 13

The Objective function

The Variational Lower Bound: Alternative Perspective

SLIDE 14

The Objective function

The Variational Lower Bound: Alternative Perspective

SLIDE 15

The Objective function

The Objective Function

Our new objective is to maximize the variational lower bound w.r.t. both θ and φ:

L(x(i), θ, φ) = E_{z∼qφ(z|x(i))}[log pθ(x(i)|z)] − DKL[qφ(z|x(i)) ‖ pθ(z)]

Interpretation:

  • The recognition model qφ(z|x(i)) can be viewed as an encoder that takes a data point x(i) and probabilistically encodes it into a latent vector z.
  • The decoder pθ(x|z) then takes the latent representation and probabilistically decodes it into a vector x in the data space.
  • The first term in L measures how well (the distribution of) the decoded output matches the input x(i). It is the reconstruction term.
  • The second term is a regularization term that encourages the posterior distribution qφ(z|x(i)) of the encoding z to be close to the prior pθ(z).

So the method is called a variational autoencoder (VAE).

SLIDE 16

The Objective function

Illustration of Variational Autoencoder

L(x(i), θ, φ) = E_{z∼qφ(z|x(i))}[log pθ(x(i)|z)] − DKL[qφ(z|x(i)) ‖ pθ(z)]

SLIDE 17

The Objective function

Illustration of Variational Autoencoder

The encoder maps the data distribution, which is complex, to approximately a Gaussian distribution. The decoder maps a Gaussian distribution to the data distribution.

SLIDE 18

The Objective function

Illustration of Variational Autoencoder

Fake images are generated by picking points in the latent space and mapping them back to the data space using the decoder.

SLIDE 19

Optimization

Outline

1. Introduction to Unsupervised Learning
2. The Task
3. The Objective function
4. Optimization
5. Generating Examples
6. Discussions

SLIDE 20

Optimization

The Need For Reparameterization

The computation of the first term L1 of L requires sampling:

L1 = E_{z∼qφ(z|x(i))}[log pθ(x(i)|z)] ≈ (1/L) Σ_{l=1}^L log pθ(x(i)|z(i,l)), where z(i,l) ∼ qφ(z|x(i))

But sampling loses the gradient ∇φ: while the LHS depends on φ, the RHS does not. So the stochastic connections from µz and σz to z make backpropagation impossible.

SLIDE 21

Optimization

The Reparameterization Trick

Here is the recognition model:

qφ(z|x) = N(z | µz(x, φ), σ²z(x, φ) I)

Using the reparameterization trick, we change it into the following equivalent form:

z = µz(x, φ) + σz(x, φ) ⊙ ε, ε ∼ N(0, I)

where ⊙ is the element-wise product. Note that now z depends on µz, σz, and ε deterministically. ε is stochastic, but it is an input to the network.
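In code the trick is one line; a sketch continuing the encoder above:

```python
def reparameterize(mu, sigma):
    """z = mu + sigma * eps with eps ~ N(0, I); backprop flows through mu and sigma."""
    eps = torch.randn_like(sigma)  # stochastic input, held fixed during backprop
    return mu + sigma * eps
```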

SLIDE 22

Optimization

The Reparameterization Trick

The reconstruction error term can be written as

L1 = E_{z∼qφ(z|x(i))}[log pθ(x(i)|z)] = E_{ε∼p(ε)}[log pθ(x(i)|z(x(i), φ, ε))] ≈ (1/L) Σ_{l=1}^L log pθ(x(i)|z(i,l))

where z(i,l) = z(x(i), φ, ε(l)) = µz(x(i), φ) + σz(x(i), φ) ⊙ ε(l), and ε(l) ∼ N(0, I).

The gradient ∇θ,φ L1 can now be computed because, for each given ε, the network is deterministic.

SLIDE 23

Optimization

The Regularization Term

The second term of L is L2 = −DKL[qφ(z|x(i)) ‖ pθ(z)]. The two distributions involved are both Gaussian, so the term has a closed form:

L2 = (1/2) Σ_{j=1}^J (1 + log((σ(i)_j)²) − (µ(i)_j)² − (σ(i)_j)²)

where J is the dimension of z, σ(i)_j is the j-th component of σz(x(i), φ), and µ(i)_j is the j-th component of µz(x(i), φ).

The gradient ∇φ L2 is straightforward to compute.
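The closed form translates directly into code; a sketch consistent with the encoder output above:

```python
def neg_kl(mu, sigma):
    """L2 = -D_KL[q_phi(z|x) || N(0, I)] for a diagonal Gaussian q, summed over the J dims of z."""
    return 0.5 * (1 + (sigma ** 2).log() - mu ** 2 - sigma ** 2).sum(dim=-1)
```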

SLIDE 24

Optimization

The Final Objective Function

Putting everything together, this is the objective function that we maximize using gradient ascent:

L ≈ (1/L) Σ_{l=1}^L log pθ(x(i)|z(i,l)) + (1/2) Σ_{j=1}^J (1 + log((σ(i)_j)²) − (µ(i)_j)² − (σ(i)_j)²)

where z(i,l) = µz(x(i), φ) + σz(x(i), φ) ⊙ ε(l), and ε(l) ∼ N(0, I).

We have discussed how to compute the gradient ∇θ,φ L. Using it, we can estimate θ and φ simultaneously by gradient ascent. The number of samples L is usually set to 1, as in the sketch below.
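Putting the sketches above together, a minimal single-sample (L = 1) training objective might look as follows; the explicit Gaussian log-density and the Adam optimizer are our own concretizations of "gradient ascent":

```python
import math

def elbo(x, encoder, decoder):
    """One-sample estimate of L = reconstruction term + (-KL) regularizer."""
    mu_z, sigma_z = encoder(x)
    z = reparameterize(mu_z, sigma_z)
    mu_x, sigma_x = decoder(z)
    # log p_theta(x|z) for the diagonal-Gaussian decoder
    log_px = -0.5 * (((x - mu_x) / sigma_x) ** 2
                     + 2 * sigma_x.log() + math.log(2 * math.pi)).sum(dim=-1)
    return (log_px + neg_kl(mu_z, sigma_z)).mean()

# Gradient ascent on L = gradient descent on -L:
# opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)
# loss = -elbo(x_batch, encoder, decoder); opt.zero_grad(); loss.backward(); opt.step()
```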

SLIDE 25

Optimization

Comparison with Naive Monte Carlo

Earlier, we mentioned a naive method for optimizing the parameters θ of the generative model that involves the following objective:

log pθ(x(i)) ≈ log (1/L) Σ_{l=1}^L pθ(x(i)|z(l))    (1)

where the values of z are sampled from the prior p(z). Those values do not depend on x(i) and do not give high probabilities to x(i), so the RHS is a poor approximation of log pθ(x(i)).

Here is our final objective function:

L ≈ (1/L) Σ_{l=1}^L log pθ(x(i)|z(i,l)) + (1/2) Σ_{j=1}^J (1 + log((σ(i)_j)²) − (µ(i)_j)² − (σ(i)_j)²)    (2)

where the values of z are sampled in such a way that they depend on x(i). Those values usually give high probabilities to x(i), and L is a better approximation (in fact a lower bound) of log pθ(x(i)).

SLIDE 26

Generating Examples

Outline

1. Introduction to Unsupervised Learning
2. The Task
3. The Objective function
4. Optimization
5. Generating Examples
6. Discussions

SLIDE 27

Generating Examples

Example Generation

During learning, both the encoder and the decoder are trained simultaneously. To generate examples, we only need the decoder:

z ∼ p(z), x ∼ pθ(x|z)

SLIDE 28

Generating Examples

Example Generation

One way to generate examples is to sample z from N(0, I) and then sample x from pθ(x|z). Here are images sampled from VAEs learned from the MNIST dataset. Note that several values are used for the dimensionality of z.
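A sketch of this sampling procedure, returning the decoder mean rather than a sample from pθ(x|z) (a common choice for visualization):

```python
@torch.no_grad()
def generate(decoder, n=16, z_dim=20):
    """Draw z ~ N(0, I) and decode to image-space means."""
    z = torch.randn(n, z_dim)
    mu_x, _ = decoder(z)
    return mu_x
```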

SLIDE 29

Generating Examples

Example Generation

Alternatively, we can manually pick z and sample x from pθ(x|z). This allows us to interpret each dimension of z. Here are images sampled from a VAE learned from the Frey Face dataset: the X-axis represents head pose, and the Y-axis represents the degree of smile.
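For a 2-D latent space, such a sweep can be sketched as follows (the grid range is our own illustrative choice):

```python
@torch.no_grad()
def latent_grid(decoder, steps=15, lo=-3.0, hi=3.0):
    """Decode a regular grid over a 2-D latent space, one image per grid point."""
    axis = torch.linspace(lo, hi, steps)
    z = torch.cartesian_prod(axis, axis)  # shape (steps * steps, 2)
    mu_x, _ = decoder(z)                  # assumes a decoder trained with z_dim = 2
    return mu_x.view(steps, steps, -1)
```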

SLIDE 30

Discussions

Outline

1. Introduction to Unsupervised Learning
2. The Task
3. The Objective function
4. Optimization
5. Generating Examples
6. Discussions

SLIDE 31

Discussions

Discussions

The key reason for the excitement about VAEs is that they show that, by combining deep learning with the probabilistic approach, we can now learn complicated, high-quality probabilistic models.

In terms of specific functionality:

  • The decoder p(x|z) of a VAE can be used to generate samples (images) that are similar (but different) to those in the training set.
  • The decoder gives a distribution p(x) = ∫ p(x|z) p(z) dz, which can be approximated using the variational lower bound.
  • The encoder q(z|x) can be used to obtain a low-dimensional latent representation of data.

SLIDE 32

Discussions

Autoencoders

While the variational autoencoder is a probabilistic model, the autoencoder is deterministic. The model parameters are trained by minimizing the reconstruction error:

L(x, x′) = ||x − x′||²

It is designed to learn a latent representation of data. However, it does not define a probability distribution p(x) over the data space and cannot generate new samples.
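For contrast with the VAE sketches above, a deterministic autoencoder in the same style (sizes our own choices):

```python
class Autoencoder(nn.Module):
    """Deterministic encoder/decoder trained on squared reconstruction error."""
    def __init__(self, x_dim=784, z_dim=20, hidden=400):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, z_dim))
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, x_dim))

    def reconstruction_error(self, x):
        x_rec = self.dec(self.enc(x))          # x' = decode(encode(x))
        return ((x - x_rec) ** 2).sum(dim=-1)  # L(x, x') = ||x - x'||^2
```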

SLIDE 33

Discussions

Denoising Autoencoders

The denoising autoencoder is a probabilistic model. The input x is randomly corrupted using C(x̃|x) to get x̃, and the weights are optimized to minimize the following objective function:

−E_{x∼p_data} E_{x̃∼C(x̃|x)} log p(x′ = x | z(x̃))

It is more robust than the plain autoencoder as a tool for learning latent representations of data. However, it does not define a probability distribution p(x) over the data space and cannot generate new samples.
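One common choice of the corruption process C(x̃|x) is additive Gaussian noise, sketched below; the noise level is our own assumption:

```python
def corrupt(x, noise_sd=0.3):
    """Sample x_tilde ~ C(x_tilde | x) by adding isotropic Gaussian noise."""
    return x + noise_sd * torch.randn_like(x)

# Training: feed corrupt(x) through the network, score the output against the clean x.
```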

SLIDE 34

Discussions

Data Distribution in Latent Space (MNIST)

  • VAE: forces data into a normal distribution in the latent space.
  • DAE: preserves class separation better.

SLIDE 35

Discussions

Recent Developments: Flow-Based Generative Models

Dinh L., Sohl-Dickstein J., Bengio S. (2016). Density estimation using Real NVP. arXiv preprint arXiv:1605.08803.
Kingma D. P., Dhariwal P. (2018). Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, 10236-10245.

Generative model: z ∼ pz(z) (Gaussian), x = gθ(z), where gθ is invertible, i.e., there exists fθ(x) such that gθ(fθ(x)) = x. Consequently, z has the same dimensionality as x.

The model defines a distribution over inputs:

pθ(x) = pz(fθ(x)) |det(∂fθ(x)/∂x⊤)|

fθ is implemented as a sequence of invertible functions, called flows, each represented as a CNN. See the references above.

SLIDE 36

Discussions

Recent Developments: Flow-Based Generative Models

Objective function for learning: maximize the log-likelihood

log pθ(X) = Σ_{i=1}^N log pθ(x(i)) = Σ_{i=1}^N log pz(fθ(x(i))) + Σ_{i=1}^N log |det(∂fθ(x(i))/∂(x(i))⊤)|

Intuitively, pick θ so that the CNN maps images from their original space, where they are not Gaussian distributed, to a latent space where they are Gaussian distributed. A sketch of one building block follows.
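A minimal sketch of one such invertible flow, an affine coupling layer in the Real NVP style (fully connected instead of a CNN for brevity; all sizes are our own choices):

```python
class AffineCoupling(nn.Module):
    """z1 = x1;  z2 = x2 * exp(s(x1)) + t(x1).  Invertible, with triangular Jacobian."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.d = dim // 2
        self.net = nn.Sequential(nn.Linear(self.d, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * (dim - self.d)))

    def forward(self, x):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        s, t = self.net(x1).chunk(2, dim=-1)
        z2 = x2 * s.exp() + t
        log_det = s.sum(dim=-1)  # log |det(df/dx)| contributed by this layer
        return torch.cat([x1, z2], dim=-1), log_det
```

Stacking such layers (permuting dimensions in between) gives fθ; summing the per-layer log_det terms and adding log pz(fθ(x)) yields exactly the objective above.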

SLIDE 37

Discussions

Recent Developments: Flow-Based Generative Models

Image Synthesis: z ∼ pz(z) (Gaussian), x = gθ(z)

SLIDE 38

Discussions

Recent Developments: Flow-Based Generative Models

Image Interpolation: Take a pair of real images, encode them with the encoder, and linearly interpolate between the latents to obtain samples.

SLIDE 39

Discussions

Recent Developments: Flow-Based Generative Models

Semantic Manipulation:

  • z_pos: the average latent vector of images with an attribute (e.g., smiling)
  • z_neg: the average latent vector of images without the attribute
  • Use the difference z_pos − z_neg as a direction of manipulation, as sketched below.
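A sketch of this manipulation, where f and g are the flow encoder/decoder from the model above and alpha is a hypothetical strength parameter:

```python
@torch.no_grad()
def manipulate(x, f, g, z_pos, z_neg, alpha=1.0):
    """Move the latent code of x along the attribute direction and decode back."""
    direction = z_pos - z_neg          # e.g., the "smiling" direction
    return g(f(x) + alpha * direction)
```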
