
CS7015 (Deep Learning) : Lecture 20 Markov Chains, Gibbs Sampling for Training RBMs, Contrastive Divergence for Training RBMs
Mitesh M. Khapra, Department of Computer Science and Engineering, Indian Institute of Technology Madras


  1. CS7015 (Deep Learning) : Lecture 20 Markov Chains, Gibbs Sampling for Training RBMs, Contrastive Divergence for Training RBMs. Mitesh M. Khapra, Department of Computer Science and Engineering, Indian Institute of Technology Madras. (Slide 1/61)

  2. Module 20.1 : Markov Chains

  3. Let us first begin by restating our goals. Goal 1: given a random variable X ∈ R^n, we are interested in drawing samples from the joint distribution P(X). Goal 2: given a function f(X) defined over the random variable X, we are interested in computing the expectation E_{P(X)}[f(X)]. We will use Gibbs Sampling (a member of the Metropolis-Hastings class of algorithms) to achieve these goals. We will first understand the intuition behind Gibbs Sampling and then understand the math behind it.

  4. Suppose instead of a single random variable X ∈ R^n, we have a chain of random variables X_1, X_2, . . . , X_K, each X_i ∈ R^n. The i here corresponds to a time step. For example, X_i could be an n-dimensional vector containing the number of customers in a given set of n restaurants on day i. In our case, X_i could be a 1024-dimensional image sent by our friend on day i. For ease of illustration we will stick to the restaurant example and assume that instead of actual counts we are interested only in binary counts (high = 1, low = 0). Thus X_i ∈ {0, 1}^n.

  5. On day 1, let X_1 take on the value x_1 (x_1 is one of the 2^n possible vectors). On day 2, let X_2 take on the value x_2 (x_2 is again one of the 2^n possible vectors). One way of looking at this is that the state has transitioned from x_1 to x_2. Similarly, on day 3, if X_3 takes on the value x_3 then we can say that the state has transitioned from x_1 to x_2 to x_3. Finally, on day K, we can say that the state has transitioned from x_1 to x_2 to x_3 to . . . to x_K.

  6. We may now be interested in knowing the most likely value that the state will take on day i given the states on day 1 to day i − 1. More formally, we may be interested in the following distribution:

     P(X_i = x_i | X_1 = x_1, X_2 = x_2, . . . , X_{i−1} = x_{i−1})

     Now suppose the chain exhibits the following Markov property:

     P(X_i = x_i | X_1 = x_1, X_2 = x_2, . . . , X_{i−1} = x_{i−1}) = P(X_i = x_i | X_{i−1} = x_{i−1})

     In other words, given the previous state X_{i−1}, X_i is independent of all preceding states. Can we draw a graphical model to encode this independence assumption?

  7. In this graphical model, the random variables are X_1, X_2, . . . , X_k. We will have a node corresponding to each of these random variables. What will be the edges in the graph? Well, each node only depends on its predecessor, so we will just have an edge between successive nodes: X_1 → X_2 → · · · → X_k.

  8. This property (X_i ⊥ X_1, . . . , X_{i−2} | X_{i−1}) is called the Markov property, and the resulting chain X_1, X_2, . . . , X_k is called a Markov chain. Further, since we are considering discrete time steps, this is called a discrete time Markov chain. Further, since the X_i's take on discrete values, this is called a discrete time, discrete space Markov chain. Okay, but why are we interested in Markov chains? (We will get there soon! For now let us just focus on these definitions.)
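The Markov property above can be made concrete with a short simulation: to draw the next state we only need the current state, never the earlier history. A minimal sketch, using a hypothetical 3-state chain (l = 3 rather than the lecture's l = 2^n) with made-up transition probabilities:

```python
import random

random.seed(0)

# Hypothetical 3-state transition matrix (illustrative values only).
# T[a][b] = P(X_i = b | X_{i-1} = a); each row sums to 1.
T = [
    [0.2, 0.5, 0.3],
    [0.3, 0.6, 0.1],
    [0.4, 0.2, 0.4],
]

def simulate(x0, steps):
    """Simulate a discrete time, discrete space Markov chain.

    By the Markov property, the next state is drawn using only the
    current state's row of T, ignoring all earlier states.
    """
    chain = [x0]
    for _ in range(steps):
        current = chain[-1]
        nxt = random.choices(range(len(T)), weights=T[current])[0]
        chain.append(nxt)
    return chain

path = simulate(x0=0, steps=10)
print(path)  # a length-11 trajectory of states in {0, 1, 2}
```

Note that `simulate` never looks at `chain[:-1]` when sampling; that is exactly the conditional-independence statement on this slide.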

  9. Let us delve a bit deeper into Markov chains and define a few more quantities. Recall that each X_i ∈ {0, 1}^n; let us assume 2^n = l (i.e., X_i can take l values). How many values do we need to specify the distribution P(X_i = x_i | X_{i−1} = x_{i−1})? (l^2 values.) We can represent this as a matrix T ∈ R^{l×l} where the entry T_ab of the matrix denotes the probability of transitioning to state b from state a (i.e., P(X_i = b | X_{i−1} = a)), for example:

     X_{i−1}  X_i  T_ab
     1        1    0.05
     1        2    0.06
     ...      ...  ...
     1        l    0.02
     2        1    0.03
     2        2    0.07
     ...      ...  ...
     2        l    0.01
     ...      ...  ...
     l        1    0.10
     l        2    0.09
     ...      ...  ...
     l        l    0.21

     The matrix T is called the transition matrix.
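The flattened (a, b, T_ab) table above is just an l × l matrix written out row by row. A minimal sketch of the conversion, using a hypothetical complete table for l = 2 (the slide's table only shows scattered entries, so these values are made up):

```python
# Hypothetical complete table of (previous state a, next state b, T_ab)
# entries for l = 2 states; the probabilities are illustrative.
entries = [
    (0, 0, 0.05), (0, 1, 0.95),
    (1, 0, 0.30), (1, 1, 0.70),
]

l = 2
T = [[0.0] * l for _ in range(l)]
for a, b, p in entries:
    T[a][b] = p  # T[a][b] = P(X_i = b | X_{i-1} = a)

# Each row of T is a conditional distribution over the next state,
# so it must sum to 1 -- this is why specifying T takes l^2 numbers
# (l distributions, each over l states).
for row in T:
    assert abs(sum(row) - 1.0) < 1e-9

print(T)  # [[0.05, 0.95], [0.3, 0.7]]
```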

  10. We need to define this transition matrix, i.e., T_ab = P(X_i = b | X_{i−1} = a) ∀ a, b ∀ i. Why do we need to define this ∀ i? Well, because these transition probabilities may be different for different time steps. For example, the transition in the number of customers may be different from Friday to Saturday (weekend) as compared to from Sunday to Monday (weekday). Thus, for a Markov chain X_1, X_2, . . . , X_k we will have k such transition matrices T_1, T_2, . . . , T_k.

  11. However, for this discussion we will assume that the Markov chain is time homogeneous. What does that mean? It means that T_1 = T_2 = · · · = T_k = T. In other words, P(X_i = b | X_{i−1} = a) = T_ab ∀ a, b ∀ i. The transition matrix does not depend on the time i, and hence such a Markov chain is called time homogeneous.

  12. Now suppose the starting distribution at time step 0 is given by µ^0. Just to be clear, µ^0 is a 2^n dimensional vector such that µ^0_a = P(X_0 = a); µ^0_a is the probability that the random variable takes on the value a among all the possible 2^n values. Given µ^0 and T, how will you compute µ^k, where µ^k_a = P(X_k = a)? µ^k is again a 2^n dimensional vector whose a-th entry tells us the probability that X_k will take on the value a among all the possible 2^n values.

  13. Let us consider P(X_1 = b):

     P(X_1 = b) = Σ_a P(X_0 = a, X_1 = b)

     The above sum essentially captures all the paths of reaching X_1 = b irrespective of the value of X_0. Expanding the joint probability:

     P(X_1 = b) = Σ_a P(X_0 = a) P(X_1 = b | X_0 = a) = Σ_a µ^0_a T_ab

  14. Let us see if there is a more compact way of writing the distribution P(X_1) (i.e., of specifying P(X_1 = b) ∀ b). Let us consider a simple case when l = 3 (as opposed to 2^n). Thus, µ^0 ∈ R^3 and T ∈ R^{3×3}. What does the product µ^0 T give us? It gives us the distribution µ^1! (The b-th entry of this vector is Σ_a µ^0_a T_ab, which is P(X_1 = b).) For example:

     µ^0 T = [0.3  0.4  0.3] | 0.2  0.5  0.3 |
                             | 0.3  0.6  0.1 |
                             | 0.4  0.2  0.4 |
           = [0.3  0.45  0.25]
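The vector-matrix product on this slide can be checked by computing each entry (µ^0 T)_b = Σ_a µ^0_a T_ab explicitly. A minimal sketch using the slide's 3-state numbers:

```python
# The slide's 3-state example: starting distribution and transition matrix.
mu0 = [0.3, 0.4, 0.3]
T = [
    [0.2, 0.5, 0.3],
    [0.3, 0.6, 0.1],
    [0.4, 0.2, 0.4],
]

# (mu0 T)_b = sum_a mu0[a] * T[a][b] -- the b-th entry sums over all
# "paths" into state b, weighted by how likely each starting state a is.
mu1 = [sum(mu0[a] * T[a][b] for a in range(3)) for b in range(3)]

print([round(x, 2) for x in mu1])  # [0.3, 0.45, 0.25]
```

The result matches the slide: µ^1 = [0.3, 0.45, 0.25], and it is itself a valid distribution (its entries sum to 1).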

  15. Let us consider P(X_2 = b):

     P(X_2 = b) = Σ_a P(X_1 = a, X_2 = b)

     The above sum essentially captures all the paths of reaching X_2 = b irrespective of the value of X_1. Expanding the joint probability:

     P(X_2 = b) = Σ_a P(X_1 = a) P(X_2 = b | X_1 = a) = Σ_a µ^1_a T_ab

  16. Once again we can write P(X_2) compactly as P(X_2) = µ^1 T = (µ^0 T) T = µ^0 T^2. In general,

     P(X_k) = µ^0 T^k

     Thus the distribution at any time step can be computed by finding the appropriate element from the following series: µ^0 T^1, µ^0 T^2, µ^0 T^3, . . . , µ^0 T^k, . . . Note that this is still computationally expensive because it involves a product of µ^0 (of dimension 2^n) and T^k (of dimension 2^n × 2^n) (but later on we will see that we do not need this full product).
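In practice one would never form T^k explicitly: µ^0 T^k can be computed as k successive vector-matrix products, costing O(k·l^2) instead of the O(k·l^3) of repeated matrix-matrix multiplication. A minimal sketch, reusing the 3-state example from the previous slide:

```python
# 3-state example from the previous slide.
mu0 = [0.3, 0.4, 0.3]
T = [
    [0.2, 0.5, 0.3],
    [0.3, 0.6, 0.1],
    [0.4, 0.2, 0.4],
]

def step(mu, T):
    """One vector-matrix product: mu <- mu T."""
    l = len(mu)
    return [sum(mu[a] * T[a][b] for a in range(l)) for b in range(l)]

# mu0 T^k via k successive products; each iterate mu is the
# distribution of X at that time step.
mu = mu0
for k in range(5):
    mu = step(mu, T)

print([round(x, 4) for x in mu])  # the distribution of X_5
```

Each intermediate `mu` remains a probability distribution, since T's rows sum to 1.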

  17. If at a certain time step t, µ^t reaches a distribution π such that πT = π, then for all subsequent time steps µ^j = π (j ≥ t). π is then called the stationary distribution of the Markov chain. X_t, X_{t+1}, X_{t+2}, . . . will all follow the same distribution π. In other words, if we have X_t = x_t, X_{t+1} = x_{t+1}, X_{t+2} = x_{t+2} and so on, then we can think of x_t, x_{t+1}, x_{t+2} as samples drawn from the same distribution π (this is a crucial property and we will return to it soon).
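One way to find π for a small chain is simply to iterate µ ← µT until the distribution stops changing; the fixed point satisfies πT = π by construction. A minimal sketch on the 3-state example (this assumes the chain converges, which holds here because every entry of T is positive):

```python
# 3-state example transition matrix from the earlier slide.
T = [
    [0.2, 0.5, 0.3],
    [0.3, 0.6, 0.1],
    [0.4, 0.2, 0.4],
]

def step(mu):
    """One vector-matrix product: mu <- mu T."""
    return [sum(mu[a] * T[a][b] for a in range(3)) for b in range(3)]

mu = [1.0, 0.0, 0.0]  # an arbitrary starting distribution mu^0
for _ in range(1000):
    nxt = step(mu)
    if max(abs(x - y) for x, y in zip(mu, nxt)) < 1e-12:
        break  # mu has (numerically) stopped changing
    mu = nxt

pi = mu
print([round(x, 4) for x in pi])  # the stationary distribution
```

Once converged, `step(pi)` returns `pi` (up to floating-point error): applying T no longer moves the distribution, which is exactly the statement πT = π, and successive states of the chain are then samples from the same distribution π.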
