CS7015 (Deep Learning) : Lecture 22 Autoregressive Models (NADE, MADE)


SLIDE 1

CS7015 (Deep Learning) : Lecture 22

Autoregressive Models (NADE, MADE)

Mitesh M. Khapra

Department of Computer Science and Engineering, Indian Institute of Technology Madras

SLIDE 2

Module 22.1: Neural Autoregressive Density Estimator (NADE)

SLIDES 3-6

[Figure: an RBM with visible units v1 … vm (V ∈ {0, 1}^m, biases b1 … bm), hidden units h1 … hn (H ∈ {0, 1}^n, biases c1 … cn), and weights W ∈ R^{m×n}; and a VAE with encoder Qθ(z|x) producing µ and Σ (plus noise ε) and decoder Pφ(x|z) producing x̂]

So far we have seen a few latent variable generative models such as RBMs and VAEs

Latent variable models make certain independence assumptions which reduce the number of factors and, in turn, the number of parameters in the model

For example, in RBMs we assumed that the visible variables were independent given the hidden variables, which allowed us to do Block Gibbs Sampling

Similarly, in VAEs we assumed P(x|z) = N(0, I), which effectively means that given the latent variables the x's are independent of each other (since Σ = I)

SLIDES 7-11

[Figure: the variables x1, x2, x3, x4]

We will now look at Autoregressive (AR) Models which do not contain any latent variables

The aim, of course, is to learn a joint distribution over x

As usual, for ease of illustration we will assume x ∈ {0, 1}^n

AR models do not make any independence assumptions but use the default factorization of p(x) given by the chain rule: p(x) = ∏_{i=1}^{n} p(x_i | x_{<i})

The above factorization contains n factors and some of these factors contain many parameters (O(2^n) in total)
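The chain-rule factorization above can be made concrete with a small sketch. The probability values below are made up for illustration; with n = 3 binary variables, storing each conditional as a table already needs 1 + 2 + 4 = 2^n − 1 numbers, which is where the exponential parameter count comes from.

```python
# Hypothetical joint distribution over n = 3 binary variables, factorized
# by the chain rule as p(x) = p(x1) p(x2|x1) p(x3|x1,x2).
p_x1 = {(): 0.6}                       # p(x1 = 1)
p_x2 = {(0,): 0.3, (1,): 0.8}          # p(x2 = 1 | x1)
p_x3 = {(0, 0): 0.5, (0, 1): 0.2,
        (1, 0): 0.9, (1, 1): 0.4}      # p(x3 = 1 | x1, x2)

def bern(p, v):
    # probability that a Bernoulli(p) variable takes value v
    return p if v == 1 else 1.0 - p

def joint(x1, x2, x3):
    return (bern(p_x1[()], x1)
            * bern(p_x2[(x1,)], x2)
            * bern(p_x3[(x1, x2)], x3))

# A valid factorization: the joint sums to 1 over all 2^3 configurations
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(round(total, 9))  # 1.0
```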

SLIDES 12-14

Obviously, it is infeasible to learn such an exponential number of parameters

AR models work around this by using a neural network to parameterize these factors and then learning the parameters of this neural network

What does this mean? Let us see!

SLIDES 15-18

[Figure: a network with inputs x1, x2, x3, x4 and output units predicting p(x1), p(x2|x1), p(x3|x1, x2), p(x4|x1, x2, x3)]

At the output layer we want to predict n conditional probability distributions (each corresponding to one of the factors in our joint distribution)

At the input layer we are given the n input variables

Now the catch is that the nth output should only be connected to the previous n−1 inputs

In particular, when we are computing p(x3|x1, x2) the only inputs that we should consider are x1 and x2, because these are the only variables given to us while computing the conditional

SLIDES 19-22

[Figure: NADE network with inputs x1 … x4, hidden units h1 … h4, and outputs p(x1), p(x2|x1), p(x3|x1, x2), p(x4|x1, x2, x3); W are the input-to-hidden weights and V the hidden-to-output weights]

The Neural Autoregressive Density Estimator (NADE) proposes a simple solution for this

First, for every output unit, we compute a hidden representation using only the relevant input units

For example, for the kth output unit, the hidden representation is computed as h_k = σ(W_{·,<k} x_{<k} + b), where h_k ∈ R^d, W ∈ R^{d×n}, and W_{·,<k} denotes the first k−1 columns of W

We then compute the output p(x_k | x_{<k}) as y_k = p(x_k = 1 | x_{<k}) = σ(V_k h_k + c_k)

SLIDES 23-26

Let us look at the equations carefully: h_k = σ(W_{·,<k} x_{<k} + b), y_k = p(x_k = 1 | x_{<k}) = σ(V_k h_k + c_k)

How many parameters does this model have?

Note that W ∈ R^{d×n} and b ∈ R^d are shared parameters: the same W and b are used for computing h_k for all the n factors (of course, only the relevant columns of W are used for each k), resulting in nd + d parameters

In addition, we have V_k ∈ R^d and c_k ∈ R for each of the n factors, resulting in a further nd + n parameters
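The two equations above can be sketched in a few lines of NumPy. This is an illustrative sketch with random weights and assumed shapes, not the authors' code; in particular, h_1 here falls out as σ(b) rather than being the separately learned initial state the lecture mentions later.

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal NADE forward pass with the shapes from the slides
n, d = 4, 3                   # n binary variables, d hidden units
W = rng.normal(size=(d, n))   # shared input-to-hidden weights
b = rng.normal(size=d)        # shared hidden bias
V = rng.normal(size=(n, d))   # V[k] is the output weight vector of factor k
c = rng.normal(size=n)        # one scalar output bias per factor

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nade_forward(x):
    """Return y with y[k] = p(x_k = 1 | x_{<k}) (0-based k)."""
    y = np.empty(n)
    for k in range(n):
        h_k = sigmoid(W[:, :k] @ x[:k] + b)   # uses only the first k inputs
        y[k] = sigmoid(V[k] @ h_k + c[k])
    return y

y = nade_forward(np.array([1.0, 0.0, 1.0, 1.0]))
print(y.shape)  # (4,)
```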

SLIDES 27-32

There is also an additional parameter h1 ∈ R^d (similar to the initial state in RNNs and LSTMs)

The total number of parameters in the model is thus 2nd + n + 2d, which is linear in n

In other words, the model does not have an exponential number of parameters, which is typically the case for the default factorization p(x) = ∏_{i=1}^{n} p(x_i | x_{<i})

Why? Because we are sharing the parameters across the factors

The same W and b contribute to all the factors
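The count 2nd + n + 2d can be verified by just adding up the pieces listed above. The sizes n = 784, d = 500 are illustrative (e.g. a 28×28 binary image), not from the slides.

```python
# Sanity check of the 2nd + n + 2d parameter count
n, d = 784, 500

params = {
    "W":  d * n,   # shared input-to-hidden weights
    "b":  d,       # shared hidden bias
    "V":  n * d,   # one d-dimensional output weight vector per factor
    "c":  n,       # one scalar output bias per factor
    "h1": d,       # the additional initial hidden state
}

total = sum(params.values())
assert total == 2 * n * d + n + 2 * d   # linear in n, not exponential
print(total)  # 785784
```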

SLIDES 33-39

How will you train such a network? By backpropagation: it's a neural network after all

What is the loss function that you will choose?

For every output node we know the true probability distribution

For example, for a given training instance, if x3 = 1 then the true probability distribution is given by p(x3 = 1|x1, x2) = 1 and p(x3 = 0|x1, x2) = 0, i.e., p = [0, 1]

If the predicted distribution is q = [0.7, 0.3] then we can just take the cross entropy between p and q as the loss function

The total loss is the sum of this cross entropy loss over all the n output nodes
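The summed cross-entropy loss described above fits in a few lines. The observed instance and predicted probabilities below are made up; the point is that for a binary variable the cross entropy between the one-hot true distribution and the prediction reduces to minus the log of the probability assigned to the observed value.

```python
import numpy as np

x = np.array([1.0, 0.0, 1.0, 1.0])    # observed training instance
y = np.array([0.9, 0.4, 0.3, 0.8])    # predicted p(x_k = 1 | x_{<k})

# Per-output cross entropy: -log p(observed value)
per_factor = -(x * np.log(y) + (1 - x) * np.log(1 - y))
loss = per_factor.sum()               # total loss: sum over the n outputs
print(round(float(loss), 4))  # 2.0433
```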

SLIDES 40-44

Now let's ask a couple of questions about the model (assume training is done)

Can the model be used for abstraction? That is, if we give it a test instance x, can the model give us a hidden abstract representation of x?

Well, you will get a sequence of hidden representations h1, h2, ..., hn, but these are not really the kind of abstract representations that we are interested in

For example, hn only captures the information required to reconstruct xn given x1 to xn−1 (compare this with an autoencoder, wherein the hidden representation can reconstruct all of x1, x2, ..., xn)

These are not latent variable models and are, by design, not meant for abstraction

SLIDES 45-48

Can the model do generation? How?

Well, we first compute p(x1 = 1) as y1 = σ(V1 h1 + c1)

Note that V1, h1, c1 are all parameters of the model which will be learned during training

We then sample a value for x1 from the distribution Bernoulli(y1)

SLIDES 49-53

We now use the sampled value of x1 and compute h2 as h2 = σ(W_{·,<2} x_{<2} + b)

Using h2 we compute p(x2 = 1 | x1) as y2 = σ(V2 h2 + c2)

We then sample a value for x2 from the distribution Bernoulli(y2)

We continue this process till xn, generating the value of one random variable at a time

If x is an image then this is equivalent to generating the image one pixel at a time (very slow)
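The ancestral sampling loop just described can be sketched as follows (illustrative random weights, assumed shapes): each x_k is sampled from a Bernoulli whose parameter depends only on the values already sampled, then fed back into the model.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 4, 3
W = rng.normal(size=(d, n))   # shared input-to-hidden weights
b = rng.normal(size=d)
V = rng.normal(size=(n, d))
c = rng.normal(size=n)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.zeros(n)
for k in range(n):
    h_k = sigmoid(W[:, :k] @ x[:k] + b)   # depends only on sampled x_{<k}
    y_k = sigmoid(V[k] @ h_k + c[k])      # p(x_k = 1 | x_{<k})
    x[k] = rng.binomial(1, y_k)           # x_k ~ Bernoulli(y_k)

print(x)  # a sampled binary vector of length 4
```

For an image, this loop runs once per pixel, which is why generation is slow.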

SLIDES 54-56

Of course, the model requires a lot of computation because for generating each pixel we need to compute h_k = σ(W_{·,<k} x_{<k} + b) and y_k = p(x_k = 1 | x_{<k}) = σ(V_k h_k + c_k)

However, notice that W_{·,<k+1} x_{<k+1} + b = (W_{·,<k} x_{<k} + b) + W_{·,k} x_k

Thus we can reuse some of the computations done for pixel k while predicting pixel k + 1 (this can be done even at training time)
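The reuse identity above is easy to check numerically: keeping a running pre-activation and folding in one column of W per step gives exactly the same values as recomputing from scratch, in O(d) per step instead of O(kd). A small sketch with random weights:

```python
import numpy as np

rng = np.random.default_rng(1)

n, d = 6, 4
W = rng.normal(size=(d, n))
b = rng.normal(size=d)
x = rng.integers(0, 2, size=n).astype(float)

a = b.copy()                        # running value of W_{.,<k} x_{<k} + b
for k in range(n):
    naive = W[:, :k] @ x[:k] + b    # O(kd) recomputation from scratch
    assert np.allclose(a, naive)    # the O(d) running sum agrees with it
    a = a + W[:, k] * x[k]          # fold in x_k for the next step

print("incremental and naive computations match")
```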

SLIDES 57-62

Things to remember about NADE:

Uses the explicit representation of the joint distribution p(x) = ∏_{i=1}^{n} p(x_i | x_{<i})

Each node in the output layer corresponds to one factor in this explicit representation

Reduces the number of parameters by sharing weights in the neural network

Not designed for abstraction

Generation is slow because the model generates one pixel (or one random variable) at a time

Possible to speed up the computation by reusing some previous computations

SLIDE 63

Module 22.2: Masked Autoencoder Density Estimator (MADE)

SLIDES 64-66

[Figure: an autoencoder with inputs x1 … x4, hidden layers with weights W1 and W2, output weights V, and outputs that reconstruct x̂1 … x̂4 (or, after the tweak, predict p(x1), p(x2|x1), p(x3|x1, x2), p(x4|x1, x2, x3))]

Suppose the input x ∈ {0, 1}^n; then the output layer of an autoencoder also contains n units

Notice that the explicit factorization of the joint distribution p(x) also contains n factors: p(x) = ∏_{k=1}^{n} p(x_k | x_{<k})

Question: Can we tweak an autoencoder so that its output units predict the n conditional distributions instead of reconstructing the n inputs?

SLIDES 67-70

Note that this is not straightforward because we need to make sure that the kth output unit only depends on the previous k−1 inputs

In a standard autoencoder with fully connected layers the kth output unit obviously depends on all the input units

In simple words, there is a path from each of the input units to each of the output units

We cannot allow this if we want to predict the conditional distributions p(x_k | x_{<k}) (we need to ensure that we are only seeing the given variables x_{<k} and nothing else)

slide-71
SLIDE 71

19/24 x1 x2 x3 x4 ˆ x1 ˆ x2 ˆ x3 ˆ x4

W1 W2 V Masks

= MV = MW2 = MW1

x1 x2 x3 x4 p ( x

1

) p ( x

2

| x

1

) p ( x

3

| x

1

, x

2

) p ( x

4

| x

1

, x

2

, x

3

)

We could ensure this by masking some

  • f

the connections in the network to ensure that yk

  • nly

depends on x<k

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 22

slide-72
SLIDE 72

19/24 x1 x2 x3 x4 ˆ x1 ˆ x2 ˆ x3 ˆ x4

W1 W2 V Masks

= MV = MW2 = MW1

x1 x2 x3 x4

1 2 3 4 1 2 3 4

p ( x

1

) p ( x

2

| x

1

) p ( x

3

| x

1

, x

2

) p ( x

4

| x

1

, x

2

, x

3

)

We could ensure this by masking some

  • f

the connections in the network to ensure that yk

  • nly

depends on x<k We will start by assuming some

  • rdering
  • n

the inputs and just number them from 1 to n

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 22

slide-73
SLIDE 73

19/24 x1 x2 x3 x4 ˆ x1 ˆ x2 ˆ x3 ˆ x4

W1 W2 V Masks

= MV = MW2 = MW1

x1 x2 x3 x4

1 2 3 4 1 2 1 2 3 1 1 2 1 3 1 2 3 4

p ( x

1

) p ( x

2

| x

1

) p ( x

3

| x

1

, x

2

) p ( x

4

| x

1

, x

2

, x

3

)

We could ensure this by masking some

  • f

the connections in the network to ensure that yk

  • nly

depends on x<k We will start by assuming some

  • rdering
  • n

the inputs and just number them from 1 to n Now we will randomly assign each hidden unit a number between 1 to n-1 which indicates the number of inputs it will be connected to

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 22

slide-74
SLIDE 74

19/24 x1 x2 x3 x4 ˆ x1 ˆ x2 ˆ x3 ˆ x4

W1 W2 V Masks

= MV = MW2 = MW1

x1 x2 x3 x4

1 2 3 4 1 2 1 2 3 1 1 2 1 3 1 2 3 4

p ( x

1

) p ( x

2

| x

1

) p ( x

3

| x

1

, x

2

) p ( x

4

| x

1

, x

2

, x

3

)

We could ensure this by masking some

  • f

the connections in the network to ensure that yk

  • nly

depends on x<k We will start by assuming some

  • rdering
  • n

the inputs and just number them from 1 to n Now we will randomly assign each hidden unit a number between 1 to n-1 which indicates the number of inputs it will be connected to For example, if we assign a node the number 2 then it will be connected to the first two inputs


slide-75
SLIDE 75


We will do a similar assignment for all the hidden layers.
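The numbering steps described above can be sketched as follows (a minimal NumPy sketch; the layer sizes, variable names, and the use of a seeded random generator are illustrative assumptions, not values from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

n = 4            # number of inputs (x1..x4)
sizes = [5, 5]   # hidden layer sizes (illustrative)

# Inputs are numbered 1..n in the assumed ordering.
m_input = np.arange(1, n + 1)

# Each hidden unit gets a random number in {1, ..., n-1}: the number
# of ordered inputs it is allowed to depend on.
m_hidden = [rng.integers(1, n, size=s) for s in sizes]

print(m_input)      # [1 2 3 4]
print(m_hidden[0])  # random numbers in {1, 2, 3}
```

Note that `rng.integers(1, n)` draws from {1, ..., n−1}, matching the rule that no hidden unit may see all n inputs.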


slide-76
SLIDE 76


Let us see what this means


slide-77
SLIDE 77


For the first hidden layer this numbering is clear: it simply indicates the number of ordered inputs to which the node will be connected.


slide-78
SLIDE 78


Let us now focus on the highlighted node in the second layer, which has the number 2.


slide-79
SLIDE 79


This node is only allowed to depend on inputs x1 and x2 (since it is numbered 2).


slide-80
SLIDE 80


This means it should be connected only to those nodes in the previous hidden layer which have seen only x1 and x2.


slide-81
SLIDE 81


In other words, it should only have connections from those nodes which have been assigned a number ≤ 2.


slide-82
SLIDE 82


Now consider the node labeled 3 in the output layer


slide-83
SLIDE 83


This node is only allowed to see inputs x1 and x2, because it predicts p(x3|x1, x2) (and hence the conditioning variables should only be x1 and x2).


slide-84
SLIDE 84


By the same argument as on the previous slide, it should be connected only to those nodes in the previous hidden layer which have seen only x1 and x2.


slide-85
SLIDE 85


We can implement this by taking the weight matrices W1, W2 and V and applying appropriate masks to them so that the disallowed connections are dropped.


slide-86
SLIDE 86


For example, at layer 2 we can take the elementwise product of the 5×5 weight matrix W2 = [W2_kl] with a binary mask:

W2 ⊙ M^{W2}

where M^{W2}_{kl} = 1 if the number assigned to unit k of layer 2 is greater than or equal to the number assigned to unit l of layer 1, and 0 otherwise, so that the disallowed entries of W2 are zeroed out.


slide-87
SLIDE 87


The objective function for this network would again be a sum of cross entropies


slide-88
SLIDE 88


The network can be trained using backpropagation; the errors will only propagate along the active (unmasked) connections (similar to what happens in dropout).
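A minimal sketch of the masked forward pass and this cross-entropy objective (NumPy; the layer sizes, unit numbers, weight initialization, and omission of bias terms are illustrative assumptions, not the lecture's values):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

# Numbers assigned to inputs and hidden units (illustrative).
m0 = np.arange(1, n + 1)
m1 = np.array([2, 1, 3, 1, 2])
m2 = np.array([1, 2, 1, 3, 2])
MW1 = (m1[:, None] >= m0[None, :]).astype(float)
MW2 = (m2[:, None] >= m1[None, :]).astype(float)
MV = (m0[:, None] > m2[None, :]).astype(float)

W1 = rng.normal(scale=0.1, size=(5, n))
W2 = rng.normal(scale=0.1, size=(5, 5))
V = rng.normal(scale=0.1, size=(n, 5))

def forward(x):
    # Only unmasked connections contribute, so gradients also flow
    # only along these connections during backpropagation.
    h1 = np.tanh((W1 * MW1) @ x)
    h2 = np.tanh((W2 * MW2) @ h1)
    return 1.0 / (1.0 + np.exp(-(V * MV) @ h2))  # y[d] models p(x_d = 1 | x_<d)

def loss(x):
    # Sum of binary cross entropies, one term per variable.
    y = forward(x)
    return -np.sum(x * np.log(y) + (1 - x) * np.log(1 - y))

x = np.array([1.0, 0.0, 1.0, 1.0])
print(loss(x))
```

A sanity check of the autoregressive property: changing x4 must not change the predictions for x1, x2, x3, and since output 1 conditions on nothing (and this sketch has no bias terms), its prediction is always 0.5.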


slide-89
SLIDE 89


Similar to NADE, this model is not designed for abstraction but for generation


slide-90
SLIDE 90


How will you do generation in this model?


slide-91
SLIDE 91


Using the same iterative process that we used with NADE.


slide-92
SLIDE 92


First sample a value of x1.


slide-93
SLIDE 93


Now feed this value of x1 to the network and compute y2.


slide-94
SLIDE 94


Now sample x2 from Bernoulli(y2) and repeat the process until you have generated all variables up to xn.
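The generation loop described above can be sketched as follows (NumPy; the masks, unit numbers, and random untrained weights are illustrative; in practice the trained weights would be used):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
m0 = np.arange(1, n + 1)
m1 = np.array([2, 1, 3, 1, 2])  # illustrative hidden-unit numbers
m2 = np.array([1, 2, 1, 3, 2])
MW1 = (m1[:, None] >= m0[None, :]).astype(float)
MW2 = (m2[:, None] >= m1[None, :]).astype(float)
MV = (m0[:, None] > m2[None, :]).astype(float)
W1 = rng.normal(scale=0.1, size=(5, n))
W2 = rng.normal(scale=0.1, size=(5, 5))
V = rng.normal(scale=0.1, size=(n, 5))

def forward(x):
    h1 = np.tanh((W1 * MW1) @ x)
    h2 = np.tanh((W2 * MW2) @ h1)
    return 1.0 / (1.0 + np.exp(-(V * MV) @ h2))

# Generate one variable at a time: sample x_d from Bernoulli(y_d),
# feed the partially generated vector back in, and repeat.
x = np.zeros(n)
for d in range(n):
    y = forward(x)                     # y[d] = p(x_d = 1 | sampled x_<d)
    x[d] = float(rng.random() < y[d])  # sample x_d ~ Bernoulli(y[d])
print(x)
```

Thanks to the masks, the not-yet-sampled entries of x (still zero) cannot influence y[d], so the loop is a valid ancestral sampler.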

