CS7015 (Deep Learning) : Lecture 22 Autoregressive Models (NADE, MADE)


SLIDE 1

CS7015 (Deep Learning) : Lecture 22

Autoregressive Models (NADE, MADE)

Mitesh M. Khapra

Department of Computer Science and Engineering, Indian Institute of Technology Madras

SLIDE 2

Module 22.1: Neural Autoregressive Density Estimator (NADE)

SLIDES 3-6

[Figure: an RBM with visible units v1 … vm (V ∈ {0, 1}^m, biases b1 … bm), hidden units h1 … hn (H ∈ {0, 1}^n, biases c1 … cn), and weights W ∈ R^{m×n}; and a VAE with encoder Qθ(z|x) producing µ and Σ (plus noise ε) and decoder Pφ(x|z) producing x̂]

So far we have seen a few latent variable generative models such as RBMs and VAEs

Latent variable models make certain independence assumptions which reduce the number of factors and, in turn, the number of parameters in the model

For example, in RBMs we assumed that the visible variables were independent given the hidden variables, which allowed us to do Block Gibbs Sampling

Similarly, in VAEs we assumed P(x|z) = N(0, I), which effectively means that given the latent variables the x's are independent of each other (since Σ = I)

SLIDES 7-11

[Figure: the variables x1, x2, x3, x4]

We will now look at Autoregressive (AR) Models which do not contain any latent variables

The aim, of course, is to learn a joint distribution over x

As usual, for ease of illustration we will assume x ∈ {0, 1}^n

AR models do not make any independence assumptions but use the default factorization of p(x) given by the chain rule: p(x) = ∏_{i=1}^{n} p(x_i | x_{<i})

The above factorization contains n factors and some of these factors contain many parameters (O(2^n) in total)
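The chain-rule factorization above can be made concrete with a small sketch. The probability values below are made up for illustration; with n = 3 binary variables, storing each conditional as a table already needs 1 + 2 + 4 = 2^n − 1 numbers, which is where the exponential parameter count comes from.

```python
# Hypothetical joint distribution over n = 3 binary variables, factorized
# by the chain rule as p(x) = p(x1) p(x2|x1) p(x3|x1,x2).
p_x1 = {(): 0.6}                       # p(x1 = 1)
p_x2 = {(0,): 0.3, (1,): 0.8}          # p(x2 = 1 | x1)
p_x3 = {(0, 0): 0.5, (0, 1): 0.2,
        (1, 0): 0.9, (1, 1): 0.4}      # p(x3 = 1 | x1, x2)

def bern(p, v):
    # probability that a Bernoulli(p) variable takes value v
    return p if v == 1 else 1.0 - p

def joint(x1, x2, x3):
    return (bern(p_x1[()], x1)
            * bern(p_x2[(x1,)], x2)
            * bern(p_x3[(x1, x2)], x3))

# A valid factorization: the joint sums to 1 over all 2^3 configurations
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(round(total, 9))  # 1.0
```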

SLIDES 12-14

Obviously, it is infeasible to learn such an exponential number of parameters

AR models work around this by using a neural network to parameterize these factors and then learning the parameters of this neural network

What does this mean? Let us see!

SLIDES 15-18

[Figure: a network with inputs x1, x2, x3, x4 and output units predicting p(x1), p(x2|x1), p(x3|x1, x2), p(x4|x1, x2, x3)]

At the output layer we want to predict n conditional probability distributions (each corresponding to one of the factors in our joint distribution)

At the input layer we are given the n input variables

Now the catch is that the nth output should only be connected to the previous n−1 inputs

In particular, when we are computing p(x3|x1, x2) the only inputs that we should consider are x1 and x2, because these are the only variables given to us while computing the conditional

SLIDES 19-22

[Figure: NADE network with inputs x1 … x4, hidden units h1 … h4, and outputs p(x1), p(x2|x1), p(x3|x1, x2), p(x4|x1, x2, x3); W are the input-to-hidden weights and V the hidden-to-output weights]

The Neural Autoregressive Density Estimator (NADE) proposes a simple solution for this

First, for every output unit, we compute a hidden representation using only the relevant input units

For example, for the kth output unit, the hidden representation is computed as h_k = σ(W_{·,<k} x_{<k} + b), where h_k ∈ R^d, W ∈ R^{d×n}, and W_{·,<k} denotes the first k−1 columns of W

We then compute the output p(x_k | x_{<k}) as y_k = p(x_k = 1 | x_{<k}) = σ(V_k h_k + c_k)

SLIDES 23-26

Let us look at the equations carefully: h_k = σ(W_{·,<k} x_{<k} + b), y_k = p(x_k = 1 | x_{<k}) = σ(V_k h_k + c_k)

How many parameters does this model have?

Note that W ∈ R^{d×n} and b ∈ R^d are shared parameters: the same W and b are used for computing h_k for all the n factors (of course, only the relevant columns of W are used for each k), resulting in nd + d parameters

In addition, we have V_k ∈ R^d and c_k ∈ R for each of the n factors, resulting in a further nd + n parameters
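The two equations above can be sketched in a few lines of NumPy. This is an illustrative sketch with random weights and assumed shapes, not the authors' code; in particular, h_1 here falls out as σ(b) rather than being the separately learned initial state the lecture mentions later.

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal NADE forward pass with the shapes from the slides
n, d = 4, 3                   # n binary variables, d hidden units
W = rng.normal(size=(d, n))   # shared input-to-hidden weights
b = rng.normal(size=d)        # shared hidden bias
V = rng.normal(size=(n, d))   # V[k] is the output weight vector of factor k
c = rng.normal(size=n)        # one scalar output bias per factor

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nade_forward(x):
    """Return y with y[k] = p(x_k = 1 | x_{<k}) (0-based k)."""
    y = np.empty(n)
    for k in range(n):
        h_k = sigmoid(W[:, :k] @ x[:k] + b)   # uses only the first k inputs
        y[k] = sigmoid(V[k] @ h_k + c[k])
    return y

y = nade_forward(np.array([1.0, 0.0, 1.0, 1.0]))
print(y.shape)  # (4,)
```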

SLIDES 27-32

There is also an additional parameter h1 ∈ R^d (similar to the initial state in RNNs and LSTMs)

The total number of parameters in the model is thus 2nd + n + 2d, which is linear in n

In other words, the model does not have an exponential number of parameters, which is typically the case for the default factorization p(x) = ∏_{i=1}^{n} p(x_i | x_{<i})

Why? Because we are sharing the parameters across the factors

The same W and b contribute to all the factors
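The count 2nd + n + 2d can be verified by just adding up the pieces listed above. The sizes n = 784, d = 500 are illustrative (e.g. a 28×28 binary image), not from the slides.

```python
# Sanity check of the 2nd + n + 2d parameter count
n, d = 784, 500

params = {
    "W":  d * n,   # shared input-to-hidden weights
    "b":  d,       # shared hidden bias
    "V":  n * d,   # one d-dimensional output weight vector per factor
    "c":  n,       # one scalar output bias per factor
    "h1": d,       # the additional initial hidden state
}

total = sum(params.values())
assert total == 2 * n * d + n + 2 * d   # linear in n, not exponential
print(total)  # 785784
```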

SLIDES 33-39

How will you train such a network? By backpropagation: it's a neural network after all

What is the loss function that you will choose?

For every output node we know the true probability distribution

For example, for a given training instance, if x3 = 1 then the true probability distribution is given by p(x3 = 1|x1, x2) = 1 and p(x3 = 0|x1, x2) = 0, i.e., p = [0, 1]

If the predicted distribution is q = [0.7, 0.3] then we can just take the cross entropy between p and q as the loss function

The total loss is the sum of this cross entropy loss over all the n output nodes
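The summed cross-entropy loss described above fits in a few lines. The observed instance and predicted probabilities below are made up; the point is that for a binary variable the cross entropy between the one-hot true distribution and the prediction reduces to minus the log of the probability assigned to the observed value.

```python
import numpy as np

x = np.array([1.0, 0.0, 1.0, 1.0])    # observed training instance
y = np.array([0.9, 0.4, 0.3, 0.8])    # predicted p(x_k = 1 | x_{<k})

# Per-output cross entropy: -log p(observed value)
per_factor = -(x * np.log(y) + (1 - x) * np.log(1 - y))
loss = per_factor.sum()               # total loss: sum over the n outputs
print(round(float(loss), 4))  # 2.0433
```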

SLIDES 40-44

Now let's ask a couple of questions about the model (assume training is done)

Can the model be used for abstraction? That is, if we give it a test instance x, can the model give us a hidden abstract representation of x?

Well, you will get a sequence of hidden representations h1, h2, ..., hn, but these are not really the kind of abstract representations that we are interested in

For example, hn only captures the information required to reconstruct xn given x1 to xn−1 (compare this with an autoencoder, wherein the hidden representation can reconstruct all of x1, x2, ..., xn)

These are not latent variable models and are, by design, not meant for abstraction

SLIDES 45-48

Can the model do generation? How?

Well, we first compute p(x1 = 1) as y1 = σ(V1 h1 + c1)

Note that V1, h1, c1 are all parameters of the model which will be learned during training

We then sample a value for x1 from the distribution Bernoulli(y1)

SLIDES 49-53

We now use the sampled value of x1 and compute h2 as h2 = σ(W_{·,<2} x_{<2} + b)

Using h2 we compute p(x2 = 1 | x1) as y2 = σ(V2 h2 + c2)

We then sample a value for x2 from the distribution Bernoulli(y2)

We continue this process till xn, generating the value of one random variable at a time

If x is an image then this is equivalent to generating the image one pixel at a time (very slow)
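The ancestral sampling loop just described can be sketched as follows (illustrative random weights, assumed shapes): each x_k is sampled from a Bernoulli whose parameter depends only on the values already sampled, then fed back into the model.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 4, 3
W = rng.normal(size=(d, n))   # shared input-to-hidden weights
b = rng.normal(size=d)
V = rng.normal(size=(n, d))
c = rng.normal(size=n)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.zeros(n)
for k in range(n):
    h_k = sigmoid(W[:, :k] @ x[:k] + b)   # depends only on sampled x_{<k}
    y_k = sigmoid(V[k] @ h_k + c[k])      # p(x_k = 1 | x_{<k})
    x[k] = rng.binomial(1, y_k)           # x_k ~ Bernoulli(y_k)

print(x)  # a sampled binary vector of length 4
```

For an image, this loop runs once per pixel, which is why generation is slow.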

SLIDES 54-56

Of course, the model requires a lot of computation because for generating each pixel we need to compute h_k = σ(W_{·,<k} x_{<k} + b) and y_k = p(x_k = 1 | x_{<k}) = σ(V_k h_k + c_k)

However, notice that W_{·,<k+1} x_{<k+1} + b = (W_{·,<k} x_{<k} + b) + W_{·,k} x_k

Thus we can reuse some of the computations done for pixel k while predicting pixel k + 1 (this can be done even at training time)
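The reuse identity above is easy to check numerically: keeping a running pre-activation and folding in one column of W per step gives exactly the same values as recomputing from scratch, in O(d) per step instead of O(kd). A small sketch with random weights:

```python
import numpy as np

rng = np.random.default_rng(1)

n, d = 6, 4
W = rng.normal(size=(d, n))
b = rng.normal(size=d)
x = rng.integers(0, 2, size=n).astype(float)

a = b.copy()                        # running value of W_{.,<k} x_{<k} + b
for k in range(n):
    naive = W[:, :k] @ x[:k] + b    # O(kd) recomputation from scratch
    assert np.allclose(a, naive)    # the O(d) running sum agrees with it
    a = a + W[:, k] * x[k]          # fold in x_k for the next step

print("incremental and naive computations match")
```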

SLIDES 57-62

Things to remember about NADE:

Uses the explicit representation of the joint distribution p(x) = ∏_{i=1}^{n} p(x_i | x_{<i})

Each node in the output layer corresponds to one factor in this explicit representation

Reduces the number of parameters by sharing weights in the neural network

Not designed for abstraction

Generation is slow because the model generates one pixel (or one random variable) at a time

Possible to speed up the computation by reusing some previous computations

SLIDE 63

Module 22.2: Masked Autoencoder Density Estimator (MADE)

SLIDES 64-66

[Figure: an autoencoder with inputs x1 … x4, hidden layers with weights W1 and W2, output weights V, and outputs that reconstruct x̂1 … x̂4 (or, after the tweak, predict p(x1), p(x2|x1), p(x3|x1, x2), p(x4|x1, x2, x3))]

Suppose the input x ∈ {0, 1}^n; then the output layer of an autoencoder also contains n units

Notice that the explicit factorization of the joint distribution p(x) also contains n factors: p(x) = ∏_{k=1}^{n} p(x_k | x_{<k})

Question: Can we tweak an autoencoder so that its output units predict the n conditional distributions instead of reconstructing the n inputs?

SLIDES 67-70

Note that this is not straightforward because we need to make sure that the kth output unit only depends on the previous k−1 inputs

In a standard autoencoder with fully connected layers the kth output unit obviously depends on all the input units

In simple words, there is a path from each of the input units to each of the output units

We cannot allow this if we want to predict the conditional distributions p(x_k | x_{<k}) (we need to ensure that we are only seeing the given variables x_{<k} and nothing else)

slide-71
SLIDE 71

19/24 x1 x2 x3 x4 ˆ x1 ˆ x2 ˆ x3 ˆ x4

W1 W2 V Masks

= MV = MW2 = MW1

x1 x2 x3 x4 p ( x

1

) p ( x

2

| x

1

) p ( x

3

| x

1

, x

2

) p ( x

4

| x

1

, x

2

, x

3

)

We could ensure this by masking some

  • f

the connections in the network to ensure that yk

  • nly

depends on x<k

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 22

slide-72
SLIDE 72

19/24 x1 x2 x3 x4 ˆ x1 ˆ x2 ˆ x3 ˆ x4

W1 W2 V Masks

= MV = MW2 = MW1

x1 x2 x3 x4

1 2 3 4 1 2 3 4

p ( x

1

) p ( x

2

| x

1

) p ( x

3

| x

1

, x

2

) p ( x

4

| x

1

, x

2

, x

3

)

We could ensure this by masking some

  • f

the connections in the network to ensure that yk

  • nly

depends on x<k We will start by assuming some

  • rdering
  • n

the inputs and just number them from 1 to n

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 22

slide-73
SLIDE 73

19/24 x1 x2 x3 x4 ˆ x1 ˆ x2 ˆ x3 ˆ x4

W1 W2 V Masks

= MV = MW2 = MW1

x1 x2 x3 x4

1 2 3 4 1 2 1 2 3 1 1 2 1 3 1 2 3 4

p ( x

1

) p ( x

2

| x

1

) p ( x

3

| x

1

, x

2

) p ( x

4

| x

1

, x

2

, x

3

)

We could ensure this by masking some

  • f

the connections in the network to ensure that yk

  • nly

depends on x<k We will start by assuming some

  • rdering
  • n

the inputs and just number them from 1 to n Now we will randomly assign each hidden unit a number between 1 to n-1 which indicates the number of inputs it will be connected to

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 22

slide-74
SLIDE 74

19/24 x1 x2 x3 x4 ˆ x1 ˆ x2 ˆ x3 ˆ x4

W1 W2 V Masks

= MV = MW2 = MW1

x1 x2 x3 x4

1 2 3 4 1 2 1 2 3 1 1 2 1 3 1 2 3 4

p ( x

1

) p ( x

2

| x

1

) p ( x

3

| x

1

, x

2

) p ( x

4

| x

1

, x

2

, x

3

)

We could ensure this by masking some

  • f

the connections in the network to ensure that yk

  • nly

depends on x<k We will start by assuming some

  • rdering
  • n

the inputs and just number them from 1 to n Now we will randomly assign each hidden unit a number between 1 to n-1 which indicates the number of inputs it will be connected to For example, if we assign a node the number 2 then it will be connected to the first two inputs


slide-75
SLIDE 75


We will do a similar assignment for all the hidden layers.
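The numbering steps described above can be sketched as follows (a minimal NumPy sketch; the layer sizes, variable names, and the use of a seeded random generator are illustrative assumptions, not values from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

n = 4            # number of inputs (x1..x4)
sizes = [5, 5]   # hidden layer sizes (illustrative)

# Inputs are numbered 1..n in the assumed ordering.
m_input = np.arange(1, n + 1)

# Each hidden unit gets a random number in {1, ..., n-1}: the number
# of ordered inputs it is allowed to depend on.
m_hidden = [rng.integers(1, n, size=s) for s in sizes]

print(m_input)      # [1 2 3 4]
print(m_hidden[0])  # random numbers in {1, 2, 3}
```

Note that `rng.integers(1, n)` draws from {1, ..., n−1}, matching the rule that no hidden unit may see all n inputs.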


slide-76
SLIDE 76


Let us see what this means


slide-77
SLIDE 77


For the first hidden layer this numbering is clear: it simply indicates the number of ordered inputs to which the node will be connected.


slide-78
SLIDE 78


Let us now focus on the highlighted node in the second layer, which has the number 2.


slide-79
SLIDE 79


This node is only allowed to depend on inputs x1 and x2 (since it is numbered 2).


slide-80
SLIDE 80


This means it should be connected only to those nodes in the previous hidden layer which have seen only x1 and x2.


slide-81
SLIDE 81


In other words, it should only have connections from those nodes which have been assigned a number ≤ 2.


slide-82
SLIDE 82


Now consider the node labeled 3 in the output layer


slide-83
SLIDE 83


This node is only allowed to see inputs x1 and x2, because it predicts p(x3|x1, x2) (and hence the conditioning variables should only be x1 and x2).


slide-84
SLIDE 84


By the same argument as on the previous slide, it should be connected only to those nodes in the previous hidden layer which have seen only x1 and x2.


slide-85
SLIDE 85


We can implement this by taking the weight matrices W1, W2 and V and applying appropriate masks to them so that the disallowed connections are dropped.


slide-86
SLIDE 86


For example, at layer 2 we can take the elementwise product of the 5×5 weight matrix W2 = [W2_kl] with a binary mask:

W2 ⊙ M^{W2}

where M^{W2}_{kl} = 1 if the number assigned to unit k of layer 2 is greater than or equal to the number assigned to unit l of layer 1, and 0 otherwise, so that the disallowed entries of W2 are zeroed out.


slide-87
SLIDE 87


The objective function for this network would again be a sum of cross entropies


slide-88
SLIDE 88


The network can be trained using backpropagation; the errors will only propagate along the active (unmasked) connections (similar to what happens in dropout).
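A minimal sketch of the masked forward pass and this cross-entropy objective (NumPy; the layer sizes, unit numbers, weight initialization, and omission of bias terms are illustrative assumptions, not the lecture's values):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

# Numbers assigned to inputs and hidden units (illustrative).
m0 = np.arange(1, n + 1)
m1 = np.array([2, 1, 3, 1, 2])
m2 = np.array([1, 2, 1, 3, 2])
MW1 = (m1[:, None] >= m0[None, :]).astype(float)
MW2 = (m2[:, None] >= m1[None, :]).astype(float)
MV = (m0[:, None] > m2[None, :]).astype(float)

W1 = rng.normal(scale=0.1, size=(5, n))
W2 = rng.normal(scale=0.1, size=(5, 5))
V = rng.normal(scale=0.1, size=(n, 5))

def forward(x):
    # Only unmasked connections contribute, so gradients also flow
    # only along these connections during backpropagation.
    h1 = np.tanh((W1 * MW1) @ x)
    h2 = np.tanh((W2 * MW2) @ h1)
    return 1.0 / (1.0 + np.exp(-(V * MV) @ h2))  # y[d] models p(x_d = 1 | x_<d)

def loss(x):
    # Sum of binary cross entropies, one term per variable.
    y = forward(x)
    return -np.sum(x * np.log(y) + (1 - x) * np.log(1 - y))

x = np.array([1.0, 0.0, 1.0, 1.0])
print(loss(x))
```

A sanity check of the autoregressive property: changing x4 must not change the predictions for x1, x2, x3, and since output 1 conditions on nothing (and this sketch has no bias terms), its prediction is always 0.5.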


slide-89
SLIDE 89


Similar to NADE, this model is not designed for abstraction but for generation


slide-90
SLIDE 90


How will you do generation in this model?


slide-91
SLIDE 91


Using the same iterative process that we used with NADE.


slide-92
SLIDE 92


First sample a value of x1.


slide-93
SLIDE 93


Now feed this value of x1 to the network and compute y2.


slide-94
SLIDE 94


Now sample x2 from Bernoulli(y2) and repeat the process until you have generated all variables up to xn.
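The generation loop described above can be sketched as follows (NumPy; the masks, unit numbers, and random untrained weights are illustrative; in practice the trained weights would be used):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
m0 = np.arange(1, n + 1)
m1 = np.array([2, 1, 3, 1, 2])  # illustrative hidden-unit numbers
m2 = np.array([1, 2, 1, 3, 2])
MW1 = (m1[:, None] >= m0[None, :]).astype(float)
MW2 = (m2[:, None] >= m1[None, :]).astype(float)
MV = (m0[:, None] > m2[None, :]).astype(float)
W1 = rng.normal(scale=0.1, size=(5, n))
W2 = rng.normal(scale=0.1, size=(5, 5))
V = rng.normal(scale=0.1, size=(n, 5))

def forward(x):
    h1 = np.tanh((W1 * MW1) @ x)
    h2 = np.tanh((W2 * MW2) @ h1)
    return 1.0 / (1.0 + np.exp(-(V * MV) @ h2))

# Generate one variable at a time: sample x_d from Bernoulli(y_d),
# feed the partially generated vector back in, and repeat.
x = np.zeros(n)
for d in range(n):
    y = forward(x)                     # y[d] = p(x_d = 1 | sampled x_<d)
    x[d] = float(rng.random() < y[d])  # sample x_d ~ Bernoulli(y[d])
print(x)
```

Thanks to the masks, the not-yet-sampled entries of x (still zero) cannot influence y[d], so the loop is a valid ancestral sampler.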

