

  1. CS7015 (Deep Learning): Lecture 22, Autoregressive Models (NADE, MADE). Mitesh M. Khapra, Department of Computer Science and Engineering, Indian Institute of Technology Madras

  2. Module 22.1: Neural Autoregressive Density Estimator (NADE)

  3. So far we have seen a few latent variable generative models such as RBMs and VAEs. [Figure: an RBM with visible units V ∈ {0,1}^m (biases b_1, …, b_m), hidden units H ∈ {0,1}^n (biases c_1, …, c_n), and weights W ∈ R^{m×n}, alongside a VAE with encoder Q_θ(z|x) producing μ and Σ, reparameterized sample z = μ + Σ ∗ ε, and decoder P_φ(x|z) producing x̂]

  4. Latent variable models make certain independence assumptions, which reduces the number of factors and, in turn, the number of parameters in the model.

  5. For example, in RBMs we assumed that the visible variables were independent given the hidden variables, which allowed us to do Block Gibbs Sampling.

  6. Similarly, in VAEs we assumed P(x|z) = N(0, I), which effectively means that given the latent variables, the x's are independent of each other (since Σ = I).

  7. We will now look at Autoregressive (AR) Models, which do not contain any latent variables.

  8. The aim, of course, is to learn a joint distribution over x.

  9. As usual, for ease of illustration, we will assume x ∈ {0,1}^n.

  10. AR models do not make any independence assumptions but use the default factorization of p(x) given by the chain rule: p(x) = ∏_{i=1}^{n} p(x_i | x_{<i}). [Figure: a fully connected autoregressive graph over x_1, x_2, x_3, x_4]

  11. The above factorization contains n factors, and some of these factors contain many parameters (O(2^n) in total).
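To make the blow-up concrete, here is a small illustrative sketch (not from the lecture): if each conditional over binary variables is stored as an explicit table, factor i needs one Bernoulli parameter per configuration of x_{<i}, i.e. 2^(i−1) entries, so the full factorization needs 2^n − 1 parameters.

```python
# Illustrative parameter count for tabular conditionals over binary x.
# Factor i conditions on i-1 binary variables, so it needs 2**(i-1)
# Bernoulli parameters, one per configuration of x_{<i}.
n = 20
params_per_factor = [2 ** (i - 1) for i in range(1, n + 1)]
total = sum(params_per_factor)
print(total)           # 1048575
print(2 ** n - 1)      # 1048575, i.e. the count grows as O(2^n)
```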

  12. Obviously, it is infeasible to learn such an exponential number of parameters.

  13. AR models work around this by using a neural network to parameterize these factors and then learning the parameters of this neural network.

  14. What does this mean? Let us see!

  15. At the output layer we want to predict n conditional probability distributions, each corresponding to one of the factors in our joint distribution. [Figure: a network with four outputs p(x_1), p(x_2 | x_1), p(x_3 | x_2, x_1), and p(x_4 | x_3, x_2, x_1)]

  16. At the input layer we are given the n input variables x_1, …, x_n.

  17. Now the catch is that the n-th output should only be connected to the previous n − 1 inputs.

  18. In particular, when we are computing p(x_3 | x_2, x_1), the only inputs that we should consider are x_1 and x_2, because these are the only variables given to us while computing the conditional (see the connectivity sketch below).
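The constraint can be visualized as a mask over input-to-output connections. This is a hedged sketch (NumPy, not from the lecture): for n = 4, the allowed connections form a strictly lower-triangular matrix.

```python
import numpy as np

# Connectivity constraint for an autoregressive model with n = 4:
# output k may only depend on inputs 1 .. k-1 (1 = allowed, 0 = forbidden).
n = 4
mask = np.tril(np.ones((n, n), dtype=int), k=-1)
print(mask)
# [[0 0 0 0]   <- p(x_1) uses no inputs
#  [1 0 0 0]   <- p(x_2 | x_1)
#  [1 1 0 0]   <- p(x_3 | x_2, x_1)
#  [1 1 1 0]]  <- p(x_4 | x_3, x_2, x_1)
```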

  19. The Neural Autoregressive Density Estimator (NADE) proposes a simple solution for this.

  20. First, for every output unit, we compute a hidden representation using only the relevant input units.

  21. For example, for the k-th output unit, the hidden representation will be computed as: h_k = σ(W_{·,<k} x_{<k} + b), where h_k ∈ R^d, W ∈ R^{d×n}, and W_{·,<k} denotes the first k − 1 columns of W.

  22. We now compute the output p(x_k | x_{k−1}, …, x_1) as: y_k = p(x_k | x_{k−1}, …, x_1) = σ(V_k h_k + c_k). A small numerical sketch of this forward pass follows below.
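Putting the two equations together, here is a minimal NumPy sketch of the NADE forward pass (randomly initialized weights and an arbitrary input, purely illustrative; the names n, d, W, V, b, c follow the slides):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

n, d = 4, 3                      # n visible units, d hidden units
rng = np.random.default_rng(0)
W = rng.normal(size=(d, n))      # W ∈ R^{d×n}
V = rng.normal(size=(n, d))      # V_k is the weight vector of output k
b = np.zeros(d)                  # hidden bias, shared across all k
c = np.zeros(n)                  # output biases

x = np.array([1.0, 0.0, 1.0, 1.0])

y = np.empty(n)
for k in range(1, n + 1):
    # h_k = σ(W_{·,<k} x_{<k} + b): only the first k-1 columns of W are used
    h_k = sigmoid(W[:, : k - 1] @ x[: k - 1] + b)
    # y_k = p(x_k = 1 | x_{k-1}, ..., x_1) = σ(V_k h_k + c_k)
    y[k - 1] = sigmoid(V[k - 1] @ h_k + c[k - 1])

# the joint probability is the product of the n Bernoulli factors
p_x = np.prod(np.where(x == 1, y, 1 - y))
print(y, p_x)
```

Note that for k = 1 the slice W[:, :0] is empty, so h_1 = σ(b), matching the fact that p(x_1) is unconditional.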

  23. Let us look at the equations carefully: h_k = σ(W_{·,<k} x_{<k} + b), y_k = p(x_k | x_{k−1}, …, x_1) = σ(V_k h_k + c_k)
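Because the model directly parameterizes every factor of the chain rule, generation is ancestral sampling: draw x_1 from p(x_1), feed it back in, draw x_2 from p(x_2 | x_1), and so on. A hedged sketch, reusing sigmoid, n, W, V, b, c, and rng from the previous snippet:

```python
# Ancestral sampling from the autoregressive factorization (illustrative).
x = np.zeros(n)
for k in range(1, n + 1):
    h_k = sigmoid(W[:, : k - 1] @ x[: k - 1] + b)
    y_k = sigmoid(V[k - 1] @ h_k + c[k - 1])   # p(x_k = 1 | x_{<k})
    x[k - 1] = float(rng.random() < y_k)       # Bernoulli draw
print(x)
```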
