CS7015 (Deep Learning) : Lecture 22 - Autoregressive Models (NADE, MADE)



SLIDE 1


CS7015 (Deep Learning) : Lecture 22

Autoregressive Models (NADE, MADE)
Mitesh M. Khapra

Department of Computer Science and Engineering Indian Institute of Technology Madras


SLIDE 2


Module 22.1: Neural Autoregressive Density Estimator (NADE)


SLIDE 3

[Figure: an RBM with visible units v1, ..., vm (V ∈ {0,1}^m, biases b1, ..., bm), hidden units h1, ..., hn (H ∈ {0,1}^n, biases c1, ..., cn), and weights W ∈ R^{m×n}; alongside a VAE with encoder Qθ(z|x) producing µ and Σ (plus noise ε) and decoder Pφ(x|z) producing x̂]

So far we have seen a few latent variable generative models such as RBMs and VAEs. Latent variable models make certain independence assumptions, which reduces the number of factors and, in turn, the number of parameters in the model. For example, in RBMs we assumed that the visible variables were independent given the hidden variables, which allowed us to do block Gibbs sampling. Similarly, in VAEs we assumed P(x|z) = N(µ(z), I), which effectively means that, given the latent variables, the x's are independent of each other (since Σ = I).


SLIDE 4


[Figure: a fully connected directed graphical model over x1, x2, x3, x4]

We will now look at autoregressive (AR) models, which do not contain any latent variables. The aim, of course, is to learn a joint distribution over x. As usual, for ease of illustration we will assume x ∈ {0, 1}^n. AR models do not make any independence assumption but use the default factorization of p(x) given by the chain rule:

p(x) = ∏_{i=1}^{n} p(x_i | x_{<i})

The above factorization contains n factors, and some of these factors contain many parameters (O(2^n) in total).


SLIDE 5


[Figure: the same directed graphical model over x1, x2, x3, x4]

Obviously, it is infeasible to learn such an exponential number of parameters. AR models work around this by using a neural network to parameterize these factors and then learning the parameters of this neural network. What does this mean? Let us see!


SLIDE 6


[Figure: a feedforward network with inputs x1, x2, x3, x4 and outputs p(x1), p(x2|x1), p(x3|x1, x2), p(x4|x1, x2, x3); the k-th output depends on the inputs through parameters V_k and W_{.,<k}]

At the output layer we want to predict n conditional probability distributions (each corresponding to one of the factors in our joint distribution). At the input layer we are given the n input variables. Now the catch is that the k-th output should only be connected to the previous k−1 inputs. In particular, when we are computing p(x3|x2, x1), the only inputs we should consider are x1 and x2, because these are the only variables given to us while computing this conditional.


SLIDE 7


[Figure: the NADE network with inputs x1, x2, x3, x4, hidden units h1, h2, h3, h4, shared weights W, output weights V, and outputs p(x1), p(x2|x1), p(x3|x1, x2), p(x4|x1, x2, x3)]

The Neural Autoregressive Density Estimator (NADE) proposes a simple solution for this. First, for every output unit, we compute a hidden representation using only the relevant input units. For example, for the k-th output unit, the hidden representation will be computed as:

h_k = σ(W_{.,<k} x_{<k} + b)

where h_k ∈ R^d, W ∈ R^{d×n}, and W_{.,<k} denotes the first k−1 columns of W. We then compute the output p(x_k | x_{<k}) as:

y_k = p(x_k = 1 | x_{<k}) = σ(V_k h_k + c_k)
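To make this concrete, here is a minimal NumPy sketch of the full forward pass under the notation above (all shapes and names are illustrative assumptions; h_1 is treated as a learned parameter, as discussed two slides ahead):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def nade_forward(x, W, b, V, c, h1):
        """Return y[k] = p(x_k = 1 | x_{<k}) for k = 1..n.

        Assumed shapes: x (n,), W (d, n), b (d,), V (n, d), c (n,), h1 (d,).
        """
        n = x.shape[0]
        y = np.zeros(n)
        y[0] = sigmoid(V[0] @ h1 + c[0])       # p(x_1 = 1); h_1 is a parameter
        for k in range(1, n):
            h = sigmoid(W[:, :k] @ x[:k] + b)  # h_k = σ(W_{.,<k} x_{<k} + b)
            y[k] = sigmoid(V[k] @ h + c[k])    # y_k = p(x_k = 1 | x_{<k})
        return y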


SLIDE 8


[Figure: the same NADE network, highlighting the parameters V_3 and W_{.,<3} used for the third output]

Let us look at the equations carefully:

h_k = σ(W_{.,<k} x_{<k} + b)
y_k = p(x_k = 1 | x_{<k}) = σ(V_k h_k + c_k)

How many parameters does this model have? Note that W ∈ R^{d×n} and b ∈ R^d are shared parameters: the same W and b are used for computing h_k for all n factors (of course, only the relevant columns of W are used for each k), resulting in nd + d parameters. In addition, we have V_k ∈ R^d and c_k ∈ R for each of the n factors, resulting in nd + n additional parameters.


SLIDE 9


[Figure: the same NADE network, again highlighting V_3 and W_{.,<3}]

There is also an additional parameter h_1 ∈ R^d (similar to the initial state in RNNs and LSTMs). The total number of parameters in the model is thus 2nd + n + 2d, which is linear in n. In other words, the model does not have an exponential number of parameters, which is typically the case for the default factorization p(x) = ∏_{i=1}^{n} p(x_i | x_{<i}). Why? Because we are sharing the parameters across the factors: the same W and b contribute to all the factors.
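As a quick sanity check of this count, under the shapes assumed in the sketch above:

    def nade_param_count(n, d):
        """W: n*d, b: d, all V_k: n*d, all c_k: n, h_1: d."""
        return (n * d) + d + (n * d) + n + d   # = 2*n*d + n + 2*d

    # e.g. n = 784 binary pixels, d = 500 hidden units (illustrative sizes)
    assert nade_param_count(784, 500) == 2 * 784 * 500 + 784 + 2 * 500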


SLIDE 10


[Figure: the NADE network, highlighting V_3 and W_{.,<3}]

How will you train such a network? Backpropagation: it is a neural network, after all. What loss function will you choose? For every output node we know the true probability distribution. For example, for a given training instance, if x3 = 1 then the true distribution is given by p(x3 = 1|x2, x1) = 1, p(x3 = 0|x2, x1) = 0, i.e., p = [0, 1]. If the predicted distribution is q = [0.7, 0.3], then we can just take the cross entropy between p and q as the loss function. The total loss is the sum of this cross-entropy loss over all n output nodes.
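In code, this total loss is just a sum of binary cross-entropies between each true value x_k and the corresponding prediction y_k; a minimal sketch under the same assumptions as before:

    def nade_loss(x, y, eps=1e-12):
        """Sum over k of -[x_k log y_k + (1 - x_k) log(1 - y_k)]."""
        y = np.clip(y, eps, 1.0 - eps)   # guard against log(0)
        return -np.sum(x * np.log(y) + (1.0 - x) * np.log(1.0 - y))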


SLIDE 11


[Figure: the NADE network, highlighting V_3 and W_{.,<3}]

Now let us ask a couple of questions about the model (assume training is done). Can the model be used for abstraction? That is, if we give it a test instance x, can the model give us a hidden abstract representation for x? Well, you will get a sequence of hidden representations h1, h2, ..., hn, but these are not really the kind of abstract representations we are interested in. For example, hn only captures the information required to reconstruct xn given x1 to xn−1 (compare this with an autoencoder, wherein the hidden representation can reconstruct all of x1, x2, ..., xn). These are not latent variable models and are, by design, not meant for abstraction.


SLIDE 12


[Figure: the NADE network, highlighting V_1 and W]

Can the model do generation? How? Well, we first compute p(x1 = 1) as y1 = σ(V_1 h_1 + c_1). Note that V_1, h_1, c_1 are all parameters of the model which will be learned during training. We will then sample a value for x1 from the distribution Bernoulli(y1).


SLIDE 13


[Figure: the NADE network, highlighting V_1, V_4, W, and W_{.,<4}]

We will now use the sampled value of x1 and compute h2 = σ(W_{.,<2} x_{<2} + b). Using h2, we will compute p(x2 = 1 | x1) as y2 = σ(V_2 h_2 + c_2). We will then sample a value for x2 from the distribution Bernoulli(y2). We will then continue this process till xn, generating the value of one random variable at a time. If x is an image, then this is equivalent to generating the image one pixel at a time (very slow).
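A minimal sampling loop under the same assumed shapes (reusing the sigmoid helper from the forward-pass sketch):

    def nade_sample(W, b, V, c, h1, rng=np.random.default_rng()):
        """Generate x one variable at a time, sampling x_k ~ Bernoulli(y_k)."""
        d, n = W.shape
        x = np.zeros(n)
        h = h1
        for k in range(n):
            if k > 0:
                h = sigmoid(W[:, :k] @ x[:k] + b)  # recompute h_k from x_{<k}
            y_k = sigmoid(V[k] @ h + c[k])
            x[k] = float(rng.random() < y_k)       # sample from Bernoulli(y_k)
        return x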


SLIDE 14


Of course, the model requires a lot of computation, because for generating each pixel we need to compute:

h_k = σ(W_{.,<k} x_{<k} + b)
y_k = p(x_k = 1 | x_{<k}) = σ(V_k h_k + c_k)

However, notice that

W_{.,<k+1} x_{<k+1} + b = W_{.,<k} x_{<k} + b + W_{.,k} x_k

Thus we can reuse some of the computations done for pixel k while predicting pixel k + 1 (this can be done even at training time).
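A sketch of this reuse, keeping a running pre-activation a = W_{.,<k} x_{<k} + b so that each step costs O(d) instead of O(kd) (names are illustrative, as before):

    def nade_forward_fast(x, W, b, V, c, h1):
        """Forward pass that updates the pre-activation incrementally."""
        n = x.shape[0]
        y = np.zeros(n)
        a = b.copy()                      # equals W_{.,<1} x_{<1} + b for k = 1
        y[0] = sigmoid(V[0] @ h1 + c[0])
        for k in range(1, n):
            a += W[:, k - 1] * x[k - 1]   # add the W_{.,k} x_k term
            h = sigmoid(a)                # h_k without recomputing the full sum
            y[k] = sigmoid(V[k] @ h + c[k])
        return y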


SLIDE 15


Things to remember about NADE:
• Uses the explicit representation of the joint distribution p(x) = ∏_{i=1}^{n} p(x_i | x_{<i})
• Each node in the output layer corresponds to one factor in this explicit representation
• Reduces the number of parameters by sharing weights in the neural network
• Not designed for abstraction
• Generation is slow because the model generates one pixel (or one random variable) at a time
• It is possible to speed up the computation by reusing some of the previous computations


SLIDE 16


Module 22.2: Masked Autoencoder Density Estimator (MADE)


SLIDE 17


[Figure: an autoencoder with inputs x1, x2, x3, x4, weight matrices W1, W2, V, and outputs p(x1), p(x2|x1), p(x3|x1, x2), p(x4|x1, x2, x3)]

Suppose the input x ∈ {0, 1}^n; then the output layer of an autoencoder also contains n units. Notice that the explicit factorization of the joint distribution p(x) also contains n factors:

p(x) = ∏_{k=1}^{n} p(x_k | x_{<k})

Question: can we tweak an autoencoder so that its output units predict the n conditional distributions instead of reconstructing the n inputs?


SLIDE 18


[Figure: the same fully connected autoencoder]

Note that this is not straightforward, because we need to make sure that the k-th output unit depends only on the previous k−1 inputs. In a standard autoencoder with fully connected layers, the k-th output unit obviously depends on all the input units; in simple words, there is a path from each input unit to each output unit. We cannot allow this if we want to predict the conditional distributions p(x_k | x_{<k}) (we need to ensure that we are only seeing the given variables x_{<k} and nothing else).


SLIDE 19

[Figure: a masked autoencoder with inputs x1, x2, x3, x4 (numbered 1 to 4), two hidden layers whose units each carry a number between 1 and n−1, masks M^{W1}, M^{W2}, M^{V} applied to the weights W1, W2, V, and outputs p(x1), p(x2|x1), p(x3|x1, x2), p(x4|x1, x2, x3)]

We could ensure this by masking some of the connections in the network, so that y_k depends only on x_{<k}. We will start by assuming some ordering on the inputs and just number them from 1 to n. Now we will randomly assign each hidden unit a number between 1 and n−1, which indicates the number of inputs it will be connected to. For example, if we assign a node the number 2 then it will be connected to the first two inputs. We will do a similar assignment for all the hidden layers, as in the sketch below.
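A minimal sketch of this mask construction, following the numbering rule described here and on the next slides (the function name and the use of NumPy, imported in the earlier sketches, are assumptions):

    def made_masks(n, hidden_sizes, rng=np.random.default_rng(0)):
        """Build binary masks for a MADE over n ordered inputs.

        Hidden units get random numbers in {1, ..., n-1}; a unit numbered m may
        receive connections only from units numbered <= m, and the k-th output
        (which predicts p(x_k | x_{<k})) only from units numbered < k.
        """
        m_prev = np.arange(1, n + 1)                # inputs are numbered 1..n
        masks = []
        for size in hidden_sizes:
            m_curr = rng.integers(1, n, size=size)  # numbers in {1, ..., n-1}
            masks.append((m_curr[:, None] >= m_prev[None, :]).astype(float))
            m_prev = m_curr
        m_out = np.arange(1, n + 1)                 # outputs are numbered 1..n
        masks.append((m_out[:, None] > m_prev[None, :]).astype(float))
        return masks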


SLIDE 20

[Figure: the masked autoencoder, highlighting a unit numbered 2 in the second hidden layer]

Let us see what this means. For the first hidden layer this numbering is clear: it simply indicates the number of ordered inputs to which this node will be connected. Let us now focus on the highlighted node in the second layer, which has the number 2. This node is only allowed to depend on inputs x1 and x2 (since it is numbered 2). This means that it should only be connected to those nodes in the previous hidden layer which have seen only x1 and x2. In other words, it should only have connections from those nodes which have been assigned a number ≤ 2.


SLIDE 21

[Figure: the masked autoencoder, highlighting the output node labeled 3]

Now consider the node labeled 3 in the output layer. This node is only allowed to see inputs x1 and x2, because it predicts p(x3|x2, x1) (and hence the given variables should only be x1 and x2). By the same argument that we made on the previous slide, this means that it should only be connected to those nodes in the previous hidden layer which have seen only x1 and x2. We can implement this by taking the weight matrices W1, W2 and V and applying an appropriate mask to each of them, so that the disallowed connections are dropped, as sketched below.
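A sketch of the resulting masked forward pass, reusing the hypothetical made_masks output and the sigmoid helper from the NADE sketches:

    def made_forward(x, weights, biases, masks):
        """Forward pass in which every weight matrix is elementwise-masked."""
        h = x
        for W, b, M in zip(weights, biases, masks):
            h = sigmoid((W * M) @ h + b)   # masked-out connections contribute 0
        return h                           # h[k] = p(x_k = 1 | x_{<k})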


SLIDE 22

[Figure: the masked autoencoder with masks M^{W1}, M^{W2}, M^{V} and outputs p(x1), p(x2|x1), p(x3|x1, x2), p(x4|x1, x2, x3)]

For example, we can apply the following mask at layer 2:

W² ⊙ M^{W²}

where W² ∈ R^{5×5} is the full weight matrix (with entries W²_{11}, ..., W²_{55}), ⊙ denotes elementwise multiplication, and M^{W²} ∈ {0, 1}^{5×5} is a binary mask with M^{W²}_{kl} = 1 if the number assigned to unit l of the first hidden layer is ≤ the number assigned to unit k of the second hidden layer, and 0 otherwise (so the disallowed connections are zeroed out).


SLIDE 23

[Figure: the masked autoencoder with masks M^{W1}, M^{W2}, M^{V} applied to W1, W2, V, and outputs p(x1), p(x2|x1), p(x3|x1, x2), p(x4|x1, x2, x3)]

The objective function for this network would again be a sum of cross entropies. The network can be trained using backpropagation, such that the errors will only be propagated along the active (unmasked) connections (similar to what happens in dropout).


SLIDE 24


[Figure: the masked autoencoder with its masks M^{W1}, M^{W2}, M^{V}, numbered units, and outputs p(x1), p(x2|x1), p(x3|x1, x2), p(x4|x1, x2, x3)]

Similar to NADE, this model is not designed for abstraction but for generation. How will you do generation in this model? Using the same iterative process that we used with NADE. First sample a value of x1. Now feed this value of x1 to the network and compute y2. Now sample x2 from Bernoulli(y2), and repeat the process till you generate all variables up to xn, as sketched below.
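A sketch of this loop, reusing the hypothetical made_forward above (note that, unlike NADE's incremental trick, each step here re-runs the full network):

    def made_sample(weights, biases, masks, n, rng=np.random.default_rng()):
        """Generate x autoregressively; the masks ensure y_k ignores x_{>=k}."""
        x = np.zeros(n)
        for k in range(n):
            y = made_forward(x, weights, biases, masks)
            x[k] = float(rng.random() < y[k])   # sample x_k ~ Bernoulli(y_k)
        return x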
