
CS7015 (Deep Learning): Lecture 20
Markov Chains, Gibbs Sampling for Training RBMs, Contrastive Divergence for Training RBMs
Mitesh M. Khapra
Department of Computer Science and Engineering, Indian Institute of Technology Madras

Module 20.1: Markov Chains

Let us begin by restating our goals.
Goal 1: Given a random variable $X \in \mathbb{R}^n$ (in our running example, $X \in \mathbb{R}^{1024}$), we are interested in drawing samples from the joint distribution $P(X)$.
Goal 2: Given a function $f(X)$ defined over the random variable $X$, we are interested in computing the expectation $E_{P(X)}[f(X)]$.
We will use Gibbs Sampling (a special case of the Metropolis-Hastings algorithm) to achieve these goals.
We will first understand the intuition behind Gibbs Sampling and then understand the math behind it.
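The slide does not say at this point how samples help with Goal 2. As background (a standard fact, not part of the lecture's derivation here), an expectation can be approximated by averaging $f$ over samples drawn from $P(X)$. The sketch below assumes a toy distribution of independent fair bits purely for illustration; `monte_carlo_expectation` and `sample_fair_bits` are hypothetical names.

```python
import random


def monte_carlo_expectation(sample_fn, f, num_samples, seed=0):
    """Approximate E[f(X)] by the average of f over num_samples draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        total += f(sample_fn(rng))
    return total / num_samples


def sample_fair_bits(rng, n=4):
    """Toy sampler (an assumption): n independent fair bits, X in {0, 1}^n."""
    return tuple(rng.randint(0, 1) for _ in range(n))


# f counts the number of ones; its true expectation for 4 fair bits is 2.
estimate = monte_carlo_expectation(sample_fair_bits, sum, 20000)
print(estimate)
```

For the distributions of interest in this lecture, drawing the samples is the hard part, which is why Goal 1 comes first.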

Suppose instead of a single random variable $X \in \mathbb{R}^n$, we have a chain of random variables $X_1, X_2, \dots, X_K$, each $X_i \in \mathbb{R}^n$.
The $i$ here corresponds to a time step.
For example, $X_i$ could be an $n$-dimensional vector containing the number of customers in a given set of $n$ restaurants on day $i$.
In our case, $X_i$ could be a 1024-dimensional image sent by our friend on day $i$.
For ease of illustration we will stick to the restaurant example and assume that instead of actual counts we are interested only in binary counts (high = 1, low = 0).
Thus $X_i \in \{0, 1\}^n$.
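To get a feel for the size of the state space $\{0, 1\}^n$, here is a small sketch that enumerates every binary state for a small $n$. The function name `all_binary_states` is an illustrative choice, not something from the lecture.

```python
from itertools import product


def all_binary_states(n):
    """Return every vector in {0, 1}^n as a tuple; there are 2**n of them."""
    return list(product((0, 1), repeat=n))


states = all_binary_states(3)
print(len(states))   # 2**3 states for n = 3 restaurants
print(states[:4])
```

Enumeration is only feasible for small $n$: with $n = 1024$ there are $2^{1024}$ states, far too many to list.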

On day 1, let $X_1$ take on the value $x_1$ ($x_1$ is one of the $2^n$ possible vectors).
On day 2, let $X_2$ take on the value $x_2$ ($x_2$ is again one of the $2^n$ possible vectors).
One way of looking at this is that the state has transitioned from $x_1$ to $x_2$.
Similarly, on day 3, if $X_3$ takes on the value $x_3$, then we can say that the state has transitioned from $x_1$ to $x_2$ to $x_3$.
Finally, on day $K$, we can say that the state has transitioned from $x_1$ to $x_2$ to $x_3$ to $\dots$ to $x_K$.

We may now be interested in knowing the most likely value that the state will take on day $i$ given the states on day 1 to day $i-1$.
More formally, we may be interested in the following distribution:
$$P(X_i = x_i \mid X_1 = x_1, X_2 = x_2, \dots, X_{i-1} = x_{i-1})$$
Now suppose the chain exhibits the following Markov property:
$$P(X_i = x_i \mid X_1 = x_1, X_2 = x_2, \dots, X_{i-1} = x_{i-1}) = P(X_i = x_i \mid X_{i-1} = x_{i-1})$$
In other words, given the previous state $X_{i-1}$, $X_i$ is independent of all the states before it ($X_1, \dots, X_{i-2}$).
Can we draw a graphical model to encode this independence assumption?
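Operationally, the Markov property means that a simulator only needs the current state to draw the next one. The sketch below illustrates this under some assumptions: states are labelled 0, 1, 2 (0-indexed), the conditional probabilities are stored row-wise in a nested list `T`, and the numbers are borrowed from the three-state example used later in this lecture. The function names are hypothetical.

```python
import random

# T[a][b] = P(next state = b | current state = a); values from the
# lecture's three-state example, with states relabelled 0, 1, 2.
T = [
    [0.2, 0.5, 0.3],
    [0.3, 0.6, 0.1],
    [0.4, 0.2, 0.4],
]


def sample_next_state(current, T, rng):
    """Draw the next state using only the current state (Markov property)."""
    return rng.choices(range(len(T)), weights=T[current])[0]


def simulate_chain(start, T, num_steps, seed=0):
    """Return the path [x_0, x_1, ..., x_num_steps] starting from `start`."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(num_steps):
        # Note that nothing earlier than path[-1] is ever consulted.
        path.append(sample_next_state(path[-1], T, rng))
    return path


path = simulate_chain(start=0, T=T, num_steps=10)
print(path)
```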

In this graphical model, the random variables are $X_1, X_2, \dots, X_k$.
We will have a node corresponding to each of these random variables.
What will be the edges in the graph?
Well, each node only depends on its predecessor, so we will just have an edge between successive nodes.

This property ($X_i \perp X_1^{i-2} \mid X_{i-1}$, where $X_1^{i-2}$ denotes $X_1, \dots, X_{i-2}$) is called the Markov property.
The resulting chain $X_1, X_2, \dots, X_k$ is called a Markov chain.
Further, since we are considering discrete time steps, this is called a discrete time Markov chain.
Further, since the $X_i$'s take on discrete values, this is called a discrete time, discrete space Markov chain.
Okay, but why are we interested in Markov chains? (We will get there soon! For now let us just focus on these definitions.)

Let us delve a bit deeper into Markov chains and define a few more quantities.
Recall that each $X_i \in \{0, 1\}^n$. Let us assume $2^n = l$ (i.e., $X_i$ can take $l$ values).
How many values do we need to specify the distribution $P(X_i = x_i \mid X_{i-1} = x_{i-1})$? ($l^2$)
We can represent this as a matrix $T \in \mathbb{R}^{l \times l}$, where the entry $T_{ab}$ of the matrix denotes the probability of transitioning to state $b$ from state $a$ (i.e., $P(X_i = b \mid X_{i-1} = a)$).
The matrix $T$ is called the transition matrix.

    a (value of X_{i-1})   b (value of X_i)   T_ab
    1                      1                  0.05
    1                      2                  0.06
    ...                    ...                ...
    1                      l                  0.02
    2                      1                  0.03
    2                      2                  0.07
    ...                    ...                ...
    2                      l                  0.01
    ...                    ...                ...
    l                      1                  0.1
    l                      2                  0.09
    ...                    ...                ...
    l                      l                  0.21
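Since row $a$ of $T$ is the conditional distribution $P(X_i = \cdot \mid X_{i-1} = a)$, every row must be non-negative and sum to 1. So although the matrix has $l^2$ entries, only $l(l-1)$ of them are free. A minimal sketch of that check follows. The helper name `is_transition_matrix` is hypothetical, and the $3 \times 3$ matrix is the one from the lecture's later three-state example.

```python
import math


def is_transition_matrix(T, tol=1e-9):
    """Check that T is square, non-negative, and that each row sums to 1."""
    l = len(T)
    for row in T:
        if len(row) != l:
            return False
        if any(p < 0 for p in row):
            return False
        if not math.isclose(sum(row), 1.0, abs_tol=tol):
            return False
    return True


T = [
    [0.2, 0.5, 0.3],
    [0.3, 0.6, 0.1],
    [0.4, 0.2, 0.4],
]

num_entries = sum(len(row) for row in T)   # l^2 stored values
num_free = len(T) * (len(T) - 1)           # each row loses one degree of freedom
print(is_transition_matrix(T), num_entries, num_free)
```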

We need to define this transition matrix $T_{ab}$, i.e., $P(X_i = b \mid X_{i-1} = a) \;\; \forall a, b \;\; \forall i$.
Why do we need to define this $\forall i$? Well, because these transition probabilities may be different for different time steps.
For example, the transition in the number of customers may be different from Friday to Saturday (weekend) as compared to from Sunday to Monday (weekday).
Thus, for a Markov chain $X_1, X_2, \dots, X_k$ we will have $k$ such transition matrices $T_1, T_2, \dots, T_k$.
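A chain with a separate transition matrix per time step can be sketched as follows. This assumes two states (0 = low, 1 = high) and made-up weekday and weekend matrices; all names and numbers are illustrative, not taken from the lecture.

```python
import random


def simulate_inhomogeneous(start, matrices, seed=0):
    """Simulate a chain in which step i uses its own matrix matrices[i]."""
    rng = random.Random(seed)
    path = [start]
    for T_i in matrices:
        current = path[-1]
        nxt = rng.choices(range(len(T_i)), weights=T_i[current])[0]
        path.append(nxt)
    return path


# Hypothetical matrices: busy days are likelier on the weekend.
T_weekday = [[0.9, 0.1], [0.5, 0.5]]
T_weekend = [[0.2, 0.8], [0.1, 0.9]]

# Four weekday transitions followed by two weekend transitions.
path = simulate_inhomogeneous(0, [T_weekday] * 4 + [T_weekend] * 2)
print(path)
```

Storing one matrix per step costs $k \cdot l^2$ values rather than $l^2$.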

However, for this discussion we will assume that the Markov chain is time homogeneous.
What does that mean? It means that
$$T_1 = T_2 = \dots = T_k = T$$
In other words,
$$P(X_i = b \mid X_{i-1} = a) = T_{ab} \quad \forall a, b \;\; \forall i$$
The transition matrix does not depend on the time $i$, and hence such a Markov chain is called time homogeneous.

Now suppose the starting distribution at time step 0 is given by $\mu^0$.
Just to be clear, $\mu^0$ is a $2^n$-dimensional vector such that $\mu^0_a = P(X_0 = a)$.
$\mu^0_a$ is the probability that the random variable takes on the value $a$ among all the possible $2^n$ values.
Given $\mu^0$ and $T$, how will you compute $\mu^k$, where $\mu^k_a = P(X_k = a)$?
$\mu^k$ is again a $2^n$-dimensional vector whose $a$-th entry tells us the probability that $X_k$ will take on the value $a$ among all the possible $2^n$ values.

Let us consider $P(X_1 = b)$:
$$P(X_1 = b) = \sum_a P(X_0 = a, X_1 = b)$$
The above sum essentially captures all the paths of reaching $X_1 = b$ irrespective of the value of $X_0$.
$$\begin{aligned} P(X_1 = b) &= \sum_a P(X_0 = a, X_1 = b) \\ &= \sum_a P(X_0 = a)\, P(X_1 = b \mid X_0 = a) \\ &= \sum_a \mu^0_a T_{ab} \end{aligned}$$
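The two steps of this derivation (factor the joint, then sum out $X_0$) can be traced numerically. The sketch below builds the joint table $P(X_0 = a, X_1 = b) = \mu^0_a T_{ab}$ and marginalises over $a$. It uses the three-state numbers from the lecture's example with 0-indexed states; the variable names are illustrative.

```python
mu0 = [0.3, 0.4, 0.3]          # mu0[a] = P(X_0 = a)
T = [
    [0.2, 0.5, 0.3],           # T[a][b] = P(X_1 = b | X_0 = a)
    [0.3, 0.6, 0.1],
    [0.4, 0.2, 0.4],
]
l = len(mu0)

# Chain rule: P(X_0 = a, X_1 = b) = P(X_0 = a) * P(X_1 = b | X_0 = a).
joint = [[mu0[a] * T[a][b] for b in range(l)] for a in range(l)]

# Marginalise over X_0: sum over all paths that end in X_1 = b.
marginal_x1 = [sum(joint[a][b] for a in range(l)) for b in range(l)]

total_mass = sum(sum(row) for row in joint)   # a joint distribution sums to 1
print(marginal_x1, total_mass)
```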

Let us see if there is a more compact way of writing the distribution $P(X_1)$ (i.e., of specifying $P(X_1 = b) \;\; \forall b$).
Let us consider a simple case when $l = 3$ (as opposed to $2^n$).
Thus, $\mu^0 \in \mathbb{R}^3$ and $T \in \mathbb{R}^{3 \times 3}$.
What does the product $\mu^0 T$ give us?
It gives us the distribution $\mu^1$! (The $b$-th entry of this vector is $\sum_a \mu^0_a T_{ab}$, which is $P(X_1 = b)$.)
$$\mu^0 T = \begin{bmatrix} 0.3 & 0.4 & 0.3 \end{bmatrix} \begin{bmatrix} 0.2 & 0.5 & 0.3 \\ 0.3 & 0.6 & 0.1 \\ 0.4 & 0.2 & 0.4 \end{bmatrix} = \begin{bmatrix} 0.3 & 0.45 & 0.25 \end{bmatrix}$$
[Figure: the three states of $X_0$ (with probabilities 0.3, 0.4, 0.3) connected to the three states of $X_1$, with edges labelled by the entries of $T$.]
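The row-vector-times-matrix product from this slide can be reproduced in a few lines. This is a plain-Python sketch with a hypothetical helper `vec_mat`; a numerical library would normally be used for anything larger.

```python
def vec_mat(mu, T):
    """Row vector times matrix: result[b] = sum over a of mu[a] * T[a][b]."""
    l = len(T)
    return [sum(mu[a] * T[a][b] for a in range(l)) for b in range(l)]


mu0 = [0.3, 0.4, 0.3]
T = [
    [0.2, 0.5, 0.3],
    [0.3, 0.6, 0.1],
    [0.4, 0.2, 0.4],
]

mu1 = vec_mat(mu0, T)
print([round(p, 4) for p in mu1])
```

Up to floating-point rounding, the result agrees with the slide's $[0.3 \;\; 0.45 \;\; 0.25]$, and its entries sum to 1 as a distribution should.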

Let us consider $P(X_2 = b)$:
$$P(X_2 = b) = \sum_a P(X_1 = a, X_2 = b)$$
The above sum essentially captures all the paths of reaching $X_2 = b$ irrespective of the value of $X_1$.
$$\begin{aligned} P(X_2 = b) &= \sum_a P(X_1 = a, X_2 = b) \\ &= \sum_a P(X_1 = a)\, P(X_2 = b \mid X_1 = a) \\ &= \sum_a \mu^1_a T_{ab} \end{aligned}$$

Once again we can write $P(X_2)$ compactly as
$$P(X_2) = \mu^1 T = (\mu^0 T) T = \mu^0 T^2$$
In general,
$$P(X_k) = \mu^0 T^k$$
Thus the distribution at any time step can be computed by picking the appropriate term of the following sequence:
$$\mu^0 T^1, \; \mu^0 T^2, \; \mu^0 T^3, \; \dots, \; \mu^0 T^k, \; \dots$$
Note that this is still computationally expensive because it involves a product of $\mu^0$ (of size $2^n$) and $T^k$ (of size $2^n \times 2^n$). (Later on we will see that we do not need this full product.)
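The identity $\mu^k = \mu^0 T^k$ can be checked numerically in two ways: by applying $T$ to the vector $k$ times, or by forming $T^k$ first. The sketch below does both for the three-state example. All helper names are hypothetical, and the naive matrix power is only meant for tiny $l$.

```python
def vec_mat(mu, T):
    """Row vector times matrix: result[b] = sum over a of mu[a] * T[a][b]."""
    l = len(T)
    return [sum(mu[a] * T[a][b] for a in range(l)) for b in range(l)]


def mat_mul(A, B):
    """Product of two square matrices of the same size."""
    n = len(A)
    return [[sum(A[i][m] * B[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(T, k):
    """T raised to the k-th power by repeated multiplication (k >= 0)."""
    n = len(T)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, T)
    return result


def distribution_after(mu0, T, k):
    """mu^k computed as ((mu^0 T) T) ... T, without ever forming T^k."""
    mu = list(mu0)
    for _ in range(k):
        mu = vec_mat(mu, T)
    return mu


mu0 = [0.3, 0.4, 0.3]
T = [[0.2, 0.5, 0.3], [0.3, 0.6, 0.1], [0.4, 0.2, 0.4]]

mu2_iterative = distribution_after(mu0, T, 2)
mu2_via_power = vec_mat(mu0, mat_pow(T, 2))
print(mu2_iterative, mu2_via_power)
```

The iterative version needs $k$ vector-matrix products, each costing on the order of $l^2$ operations, which is still prohibitive when $l = 2^n$.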

If at a certain time step $t$, $\mu^t$ reaches a distribution $\pi$ such that $\pi T = \pi$, then for all subsequent time steps
$$\mu^j = \pi \quad (j \geq t)$$
$\pi$ is then called the stationary distribution of the Markov chain.
$X_t, X_{t+1}, X_{t+2}, \dots$ will all follow the same distribution $\pi$.
In other words, if we have $X_t = x_t$, $X_{t+1} = x_{t+1}$, $X_{t+2} = x_{t+2}$ and so on, then we can think of $x_t, x_{t+1}, x_{t+2}, \dots$ as samples drawn from the same distribution $\pi$ (this is a crucial property and we will return to it soon).
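For the three-state example, a distribution satisfying $\pi T = \pi$ can be found numerically by applying $T$ repeatedly until the vector stops changing. This is a sketch under assumptions: convergence is not guaranteed for every chain, but this particular $T$ has all entries positive, which is sufficient. The function name, tolerance, and step limit are illustrative choices.

```python
def vec_mat(mu, T):
    """Row vector times matrix: result[b] = sum over a of mu[a] * T[a][b]."""
    l = len(T)
    return [sum(mu[a] * T[a][b] for a in range(l)) for b in range(l)]


def stationary_by_iteration(mu0, T, tol=1e-12, max_steps=10_000):
    """Apply T repeatedly until successive distributions differ by < tol."""
    mu = list(mu0)
    for step in range(1, max_steps + 1):
        nxt = vec_mat(mu, T)
        if max(abs(x - y) for x, y in zip(nxt, mu)) < tol:
            return nxt, step
        mu = nxt
    raise RuntimeError("distribution did not settle within max_steps")


T = [[0.2, 0.5, 0.3], [0.3, 0.6, 0.1], [0.4, 0.2, 0.4]]

pi, steps = stationary_by_iteration([1.0, 0.0, 0.0], T)
pi_other_start, _ = stationary_by_iteration([0.0, 0.0, 1.0], T)

# Residual of the defining equation pi T = pi.
residual = max(abs(x - y) for x, y in zip(vec_mat(pi, T), pi))
print(pi, steps, residual)
```

Solving $\pi T = \pi$ by hand for this matrix gives $\pi = (22/75, \; 36/75, \; 17/75)$, and the iteration reaches the same vector from either starting distribution.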
