
Slide 1

CS7015 (Deep Learning) : Lecture 19

Using joint distributions for classification and sampling, Latent Variables, Restricted Boltzmann Machines, Unsupervised Learning, Motivation for Sampling

Mitesh M. Khapra
Department of Computer Science and Engineering, Indian Institute of Technology Madras

Slide 2

Acknowledgments
• Probabilistic Graphical Models: Principles and Techniques, Daphne Koller and Nir Friedman
• An Introduction to Restricted Boltzmann Machines, Asja Fischer and Christian Igel

Slide 3

Module 19.1: Using joint distributions for classification and sampling

Slide 4

Now that we have some understanding of joint probability distributions and efficient ways of representing them, let us see some more practical examples where we can use these joint distributions

Slide 5

M1: An unexpected and necessary masterpiece
M2: Delightfully merged information and comedy
M3: Director’s first true masterpiece
M4: Sci-fi perfection, truly mesmerizing film.
M5: Waste of time and money
M6: Best Lame Historical Movie Ever

Consider a movie critic who writes reviews for movies. For simplicity, let us assume that he always writes reviews containing a maximum of 5 words. Further, let us assume that there are a total of 50 words in his vocabulary.

Each of the 5 words in his review can be treated as a random variable which takes one of the 50 values.

Given many such reviews written by the reviewer, we could learn the joint probability distribution P(X1, X2, . . . , X5).

Slide 6

M1: An unexpected and necessary masterpiece
M2: Delightfully merged information and comedy
M3: Director’s first true masterpiece
M4: Sci-fi perfection, truly mesmerizing film.
M5: Waste of time and money
M6: Best Lame Historical Movie Ever

Highlighted phrase: waste of time and money

In fact, we can even think of a very simple factorization for this model:

P(X1, X2, . . . , X5) = ∏i P(Xi | Xi−2, Xi−1)

In other words, we are assuming that the i-th word only depends on the previous 2 words and not anything before that. Let us consider one such factor, P(Xi = time | Xi−2 = waste, Xi−1 = of). We can estimate this as

count(waste of time) / count(waste of)

and the two counts mentioned above can be computed by going over all the reviews. We could similarly compute the probabilities of all such factors.
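As a concrete illustration (not part of the original slides), here is a minimal Python sketch of how such trigram factors could be estimated by counting; the helper name and the toy reviews are made up for the example.

```python
from collections import defaultdict

def estimate_trigram_probs(reviews):
    """Estimate P(Xi = w | Xi-2 = u, Xi-1 = v) as count(u v w) / count(u v)."""
    trigram_counts = defaultdict(int)
    bigram_counts = defaultdict(int)
    for review in reviews:
        words = review.lower().split()
        for i in range(2, len(words)):
            trigram_counts[(words[i - 2], words[i - 1], words[i])] += 1
            bigram_counts[(words[i - 2], words[i - 1])] += 1
    return {k: c / bigram_counts[k[:2]] for k, c in trigram_counts.items()}

reviews = ["waste of time and money", "waste of a great cast"]  # toy data
probs = estimate_trigram_probs(reviews)
print(probs[("waste", "of", "time")])  # count(waste of time) / count(waste of) = 0.5
```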

Slide 7

M7: More realistic than real life

w      P(Xi = w | Xi−2 = more, Xi−1 = realistic)   P(Xi = w | Xi−2 = realistic, Xi−1 = than)   P(Xi = w | Xi−2 = than, Xi−1 = real)
than   0.61                                        0.01                                        0.20
as     0.12                                        0.10                                        0.16
for    0.14                                        0.09                                        0.05
real   0.01                                        0.50                                        0.01
the    0.02                                        0.12                                        0.12
life   0.05                                        0.11                                        0.33

P(M7) = P(X1 = more) · P(X2 = realistic | X1 = more)
      · P(X3 = than | X1 = more, X2 = realistic)
      · P(X4 = real | X2 = realistic, X3 = than)
      · P(X5 = life | X3 = than, X4 = real)
      = 0.2 × 0.25 × 0.61 × 0.50 × 0.33 ≈ 0.005

Okay, so now what can we do with this joint distribution?
• Given a review, classify whether it was written by this reviewer
• Generate new reviews which would look like reviews written by this reviewer
How would you do this? By sampling from this distribution! What does that mean? Let us see!

Slide 8

w         P(X1 = w)   P(X2 = w | X1 = the)   P(Xi = w | Xi−2 = the, Xi−1 = movie)
the       0.62        0.01                   0.01
movie     0.10        0.40                   0.01
amazing   0.01        0.22                   0.01
useless   0.01        0.20                   0.03
was       0.01        0.00                   0.60
...       ...         ...                    ...

The movie was really amazing

How does the reviewer start his reviews (what is the first word that he chooses)? We could take the word which has the highest probability and put it as the first word in our review. Having selected this, what is the most likely second word that the reviewer uses? Having selected the first two words, what is the most likely third word that the reviewer uses? And so on...

Slide 9

(The probability table is the same as on the previous slide.)

The movie was really amazing

But there is a catch here! Selecting the most likely word at each time step will only give us the same review again and again! But we would like to generate different reviews. So instead of taking the max value, we can sample from this distribution. How? Let us see!

Slide 10

w            P(X1 = w)   P(X2 = w | X1 = the)   P(Xi = w | Xi−2 = the, Xi−1 = movie)
the          0.62        0.01                   0.01
movie        0.10        0.40                   0.01
amazing      0.01        0.22                   0.01
useless      0.01        0.20                   0.03
was          0.01        0.00                   0.60
is           0.01        0.00                   0.30
masterpiece  0.01        0.11                   0.01
I            0.21        0.00                   0.01
liked        0.01        0.01                   0.01
decent       0.01        0.02                   0.01

Suppose there are 10 words in the vocabulary. We have computed the probability distribution P(X1 = word); P(X1 = the) is the fraction of reviews having "the" as the first word. Similarly, we have computed P(X2 = word2 | X1 = word1) and P(X3 = word3 | X1 = word1, X2 = word2).

Slide 11

The movie . . .

Index   Word          P(Xi = w | Xi−2 = the, Xi−1 = movie)
0       the           0.01
1       movie         0.01
2       amazing       0.01
3       useless       0.03
4       was           0.60
5       is            0.30
6       masterpiece   0.01
7       I             0.01
8       liked         0.01
9       decent        0.01

Now consider that we want to generate the 3rd word in the review given the first 2 words of the review. We can think of the 10 words as forming a 10-sided dice where each side corresponds to a word. The probability of each side showing up is not uniform but as per the values given in the table. We can select the next word by rolling this dice and picking up the word which shows up. You can write a Python program to roll such a biased dice, for example as sketched below.
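For instance, a minimal Python sketch (illustration only) of rolling this biased 10-sided dice, using the probabilities from the table above:

```python
import random

words = ["the", "movie", "amazing", "useless", "was",
         "is", "masterpiece", "I", "liked", "decent"]
# P(X3 = w | X1 = the, X2 = movie), taken from the table above
probs = [0.01, 0.01, 0.01, 0.03, 0.60, 0.30, 0.01, 0.01, 0.01, 0.01]

# One roll of the biased dice: each word shows up in proportion to its probability.
next_word = random.choices(words, weights=probs, k=1)[0]
print("the movie", next_word)  # most often "the movie was" or "the movie is"
```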

Slide 12

Generated Reviews:
• the movie is liked decent
• I liked the amazing movie
• the movie is masterpiece
• the movie I liked useless

Now, at each timestep we do not pick the most likely word; instead, all words are possible depending on their probability (just as when rolling a biased dice or tossing a biased coin). Every run will now give us a different review! A small sketch of such a generation loop is shown below.
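A sketch of the full sampling-based generation loop might look as follows (illustration only; the probability tables p_x1, p_x2_given_x1 and p_xi_given_prev2 are hypothetical names for the distributions assumed to have been estimated from the reviewer's reviews):

```python
import random

def sample_from(dist):
    """dist: {word: probability}. One roll of the biased dice."""
    words = list(dist)
    return random.choices(words, weights=[dist[w] for w in words], k=1)[0]

def sample_review(p_x1, p_x2_given_x1, p_xi_given_prev2, length=5):
    """Generate a review by sampling every word instead of taking the argmax."""
    review = [sample_from(p_x1)]                           # first word ~ P(X1)
    review.append(sample_from(p_x2_given_x1[review[0]]))   # second word ~ P(X2 | X1)
    while len(review) < length:                            # remaining words ~ P(Xi | Xi-2, Xi-1)
        review.append(sample_from(p_xi_given_prev2[(review[-2], review[-1])]))
    return " ".join(review)
```

Because every call samples afresh, repeated calls produce different reviews, unlike the argmax strategy.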

Slide 13

Returning to our story....

Slide 14

M7: More realistic than real life

(The conditional probability table and the computation of P(M7) are the same as on Slide 7.)

Okay, so now what can we do with this joint distribution?
• Given a review, classify whether it was written by this reviewer
• Generate new reviews which would look like reviews written by this reviewer
• Correct noisy reviews or help in completing incomplete reviews, e.g., find

argmax over X5 of P(X1 = the, X2 = movie, X3 = was, X4 = amazingly, X5 = ?)

Slide 15

Let us take an example from another domain

Slide 16

Consider images which contain m × n pixels (say 32 × 32). Each pixel here is a random variable which can take values from 0 to 255 (colors). We thus have a total of 32 × 32 = 1024 random variables (X1, X2, ..., X1024). Together these pixels define the image, and different combinations of pixel values lead to different images. Given many such images, we want to learn the joint distribution P(X1, X2, ..., X1024).

Slide 17

We can assume each pixel is dependent only on its neighbors. In this case we could factorize the distribution over a Markov network as

P(X1, X2, ..., X1024) = (1/Z) ∏i φ(Di)

where Di is a set of variables which form a maximal clique (basically, a group of neighboring pixels).

Slide 18

Again, what can we do with this joint distribution?
• Given a new image, classify whether it is indeed a bedroom
• Generate new images which would look like bedrooms (say, if you are an interior designer)
• Correct noisy images or help in completing incomplete images

Slide 19

Such models, which try to estimate the probability P(X) from a large number of samples, are called generative models.

Slide 20

Module 19.2: The concept of a latent variable

Slide 21

We now introduce the concept of a latent variable. Recall that earlier we mentioned that the neighboring pixels in an image are dependent on each other. Why is it so? (Intuitively, because we expect them to have the same color, texture, etc.) Let us probe this intuition a bit more and try to formalize it.

Slide 22

Suppose we asked a friend to send us a good wallpaper and he/she thinks a bit about it and sends us this image. Why are all the pixels in the top portion of the image blue? (Because our friend decided to show us an image of the sky as opposed to mountains or green fields.) But then why blue and not black? (Because our friend decided to show us an image which depicts daytime as opposed to night time.) Okay, but why is it not cloudy (gray)? (Because our friend decided to show us an image which depicts a sunny day.) These decisions made by our friend (sky, sunny, daytime, etc.) are not explicitly known to us (they are hidden from us). We only observe the images, but what we observe depends on these latent (hidden) decisions.

Slide 23

(Example images: Latent Variable = daytime, Latent Variable = night, Latent Variable = cloudy)

So what exactly are we trying to say here? We are saying that there are certain underlying hidden (latent) characteristics which determine the pixels and their interactions. We could think of these as additional (latent) random variables in our distribution. These are latent because we do not observe them, unlike the pixels which are observable random variables. The pixels depend on the choice of these latent variables.

Slide 24

More formally, we now have visible (observed) variables or pixels (V = {V1, V2, V3, . . . , V1024}) and hidden variables (H = {H1, H2, ..., Hn}). Can you now think of a Markov network to represent the joint distribution P(V, H)? Our original Markov network suggested that the pixels were dependent on neighboring pixels (forming a clique). But now we could have a better Markov network involving these latent variables. This Markov network suggests that the pixels (observed variables) are dependent on the latent variables (which is exactly the intuition that we were trying to build in the previous slides). The interactions between the pixels are captured through the latent variables.

Slide 25

Before we move on to more formal definitions and equations, let us probe the idea of using latent variables a bit more. We will talk about two concepts: abstraction and generation.

Slide 26

First let us talk about abstraction. Suppose we are able to learn the joint distribution P(V, H). Using this distribution we can find

P(H|V) = P(V, H) / Σ_H P(V, H)

In other words, given an image, we can find the most likely latent configuration (H = h) that generated this image (of course, keeping the computational cost aside for now). What does this h capture? It captures a latent representation or abstraction of the image!

Slide 27

In other words, it captures the most important properties of the image. For example, if you were to describe the adjacent image you wouldn't say "I am looking at an image where pixel 1 is blue, pixel 2 is blue, ..., pixel 1024 is beige". Instead you would just say "I am looking at an image of a sunny beach with an ocean in the background and beige sand". This is exactly the abstraction captured by the vector h.

Slide 28

Under this abstraction, all these images would look very similar (i.e., they would have very similar latent configurations h). Even though in the original feature space (pixels) there is a significant difference between these images, in the latent space they would be very close to each other. This is very similar to the idea behind PCA and autoencoders.

Slide 29

Of course, we still need to figure out a way of computing P(H|V). In the case of PCA, learning such latent representations boiled down to learning the eigenvectors of X⊤X (using linear algebra). In the case of autoencoders, this boiled down to learning the parameters of the feedforward network (Wenc, Wdec) (using gradient descent). We still haven't seen how to learn the parameters of P(H, V) (we are far from it, but we will get there soon!).

Slide 30

Ok, I am just going to drag this a bit more! (Bear with me.) Remember that in practice we have no clue what these hidden variables are! Even in PCA, once we are given the new dimensions we have no clue what these dimensions actually mean. We cannot interpret them (for example, we cannot say dimension 1 corresponds to weight, dimension 2 corresponds to height, and so on!). Even here, we just assume there are some latent variables which capture the essence of the data, but we do not really know what these are (because no one ever tells us what these are). Only for illustration purposes did we assume that h1 corresponds to sunny/cloudy, h2 corresponds to beach, and so on.

Slide 31

Just to reiterate, remember that while sending us the wallpaper images our friend never told us what latent variables he/she considered. Maybe our friend had the following latent variables in mind: h1 = cheerful, h2 = romantic, and so on. In fact, it doesn't really matter what the interpretation of these latent variables is. All we care about is that they should help us learn a good abstraction of the data. How? (We will get there eventually.)

Slide 32

We will now talk about another interesting concept related to latent variables: generation. Once again, assume that we are able to learn the joint distribution P(V, H). Using this distribution we can find

P(V|H) = P(V, H) / Σ_V P(V, H)

Why is this interesting?

Slide 33

Well, I can now say "Create an image which is cloudy, has a beach and depicts daytime". Or, given h = [....], find the corresponding V which maximizes P(V|H). In other words, I can now generate images given certain latent variables. The hope is that I should be able to ask the model to generate very creative images given some latent configuration (we will come back to this later).

Slide 34

The story ahead... We have tried to understand the intuition behind latent variables and how they could potentially allow us to do abstraction and generation. We will now concretize these intuitions by developing equations (models) and learning algorithms. And of course, we will tie all this back to neural networks!

Slide 35

For the remainder of this discussion we will assume that all our variables take only boolean values.

Thus, the vector V will be a boolean vector ∈ {0, 1}^m (there are a total of 2^m values that V can take), and the vector H will be a boolean vector ∈ {0, 1}^n (there are a total of 2^n values that H can take).

Slide 36

Module 19.3: Restricted Boltzmann Machines

Slide 37

(Figure: a bipartite Markov network with visible units v1, v2, · · · , vm and hidden units h1, h2, · · · , hn)

We return to our Markov network containing hidden variables and visible variables. We will get rid of the image and just keep the hidden and visible variables. We have edges between each pair of (hidden, visible) variables. We do not have edges between (hidden, hidden) or (visible, visible) variables.

Slide 38

Earlier, we saw that given such a Markov network the joint probability distribution can be written as a product of factors. Can you tell how many factors there are in this case? Recall that factors correspond to maximal cliques. What are the maximal cliques in this case? Every pair of a visible and a hidden node forms a clique. How many such cliques do we have? (m × n)

Slide 39

So we can write the joint pdf as a product of the following factors:

P(V, H) = (1/Z) ∏i ∏j φij(vi, hj)

In fact, we can also add additional factors corresponding to the nodes and write

P(V, H) = (1/Z) ∏i ∏j φij(vi, hj) ∏i ψi(vi) ∏j ξj(hj)

It is legal to do this (i.e., add the factors ψi(vi) and ξj(hj)) as long as we ensure that Z is adjusted in a way that the resulting quantity is a probability distribution. Z is the partition function and is given by

Z = Σ_V Σ_H ∏i ∏j φij(vi, hj) ∏i ψi(vi) ∏j ξj(hj)

Slide 40

(The slide shows one possible instantiation of φ11(v1, h1) as a table over the four combinations of v1, h1 ∈ {0, 1}, and of ψ1(v1) as a table over v1 ∈ {0, 1}.)

Let us understand each of these factors in more detail. For example, φ11(v1, h1) is a factor which takes the values of v1 ∈ {0, 1} and h1 ∈ {0, 1} and returns a value indicating the affinity between these two variables. The adjoining table shows one such possible instantiation of the φ11 function. Similarly, ψ1(v1) takes the value of v1 ∈ {0, 1} and gives us a number which roughly indicates the possibility of v1 taking on the value 1 or 0. The adjoining table shows one such possible instantiation of the ψ1 function. A similar interpretation can be made for ξ1(h1).

Slide 41

Just to be sure that we understand this correctly, let us take a small example where |V| = 3 (i.e., V ∈ {0, 1}^3) and |H| = 2 (i.e., H ∈ {0, 1}^2).

Slide 42

(Figure: a small network with visible units v1, v2, v3 and hidden units h1, h2. The slide also shows one possible instantiation of the factors φ11, φ12, φ21, φ22, φ31, φ32 and of ψ1, ψ2, ψ3, ξ1, ξ2 as small tables over {0, 1}.)

Suppose we are now interested in P(V = < 0, 0, 0 >, H = < 1, 1 >). We can compute this as

P(V = < 0, 0, 0 >, H = < 1, 1 >) = (1/Z) φ11(0, 1) φ12(0, 1) φ21(0, 1) φ22(0, 1) φ31(0, 1) φ32(0, 1) ψ1(0) ψ2(0) ψ3(0) ξ1(1) ξ2(1)

and the partition function will be given by

Z = Σ_{v1=0}^{1} Σ_{v2=0}^{1} Σ_{v3=0}^{1} Σ_{h1=0}^{1} Σ_{h2=0}^{1} φ11(v1, h1) φ12(v1, h2) φ21(v2, h1) φ22(v2, h2) φ31(v3, h1) φ32(v3, h2) ψ1(v1) ψ2(v2) ψ3(v3) ξ1(h1) ξ2(h2)
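To make the computation concrete, here is a brute-force Python sketch that enumerates all 2^5 configurations of (v1, v2, v3, h1, h2); the factor values are made up, since the actual tables from the slide are not reproduced here:

```python
from itertools import product

# Hypothetical factor values, NOT the ones on the slide.
phi = {(i, j): {(0, 0): 20.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 10.0}
       for i in range(3) for j in range(2)}                         # phi_ij(v_i, h_j)
psi = [{0: 30.0, 1: 1.0}, {0: 100.0, 1: 1.0}, {0: 1.0, 1: 100.0}]   # psi_i(v_i)
xi = [{0: 10.0, 1: 1.0}, {0: 1.0, 1: 10.0}]                         # xi_j(h_j)

def unnormalized(v, h):
    """Product of all clique potentials for one configuration (V = v, H = h)."""
    score = 1.0
    for i in range(3):
        for j in range(2):
            score *= phi[(i, j)][(v[i], h[j])]
    for i in range(3):
        score *= psi[i][v[i]]
    for j in range(2):
        score *= xi[j][h[j]]
    return score

# Partition function: sum over all 2^3 * 2^2 = 32 configurations.
Z = sum(unnormalized(v, h)
        for v in product([0, 1], repeat=3)
        for h in product([0, 1], repeat=2))

print(unnormalized((0, 0, 0), (1, 1)) / Z)  # P(V = <0,0,0>, H = <1,1>)
```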

Slide 43

(Figure: an RBM with visible units v1, . . . , vm, V ∈ {0, 1}^m, with biases b1, . . . , bm; hidden units h1, . . . , hn, H ∈ {0, 1}^n, with biases c1, . . . , cn; and weights W ∈ R^{m×n}, with entries w1,1 through wm,n on the edges)

How do we learn these clique potentials φij(vi, hj), ψi(vi), ξj(hj)? Whenever we want to learn something, what do we introduce? (Parameters.) So we will introduce a parametric form for these clique potentials and then learn these parameters. The specific parametric form chosen by RBMs is

φij(vi, hj) = e^{wij vi hj},    ψi(vi) = e^{bi vi},    ξj(hj) = e^{cj hj}

Slide 44

With this parametric form, let us see what the joint distribution looks like:

P(V, H) = (1/Z) ∏i ∏j φij(vi, hj) ∏i ψi(vi) ∏j ξj(hj)
        = (1/Z) ∏i ∏j e^{wij vi hj} ∏i e^{bi vi} ∏j e^{cj hj}
        = (1/Z) e^{Σi Σj wij vi hj} · e^{Σi bi vi} · e^{Σj cj hj}
        = (1/Z) e^{Σi Σj wij vi hj + Σi bi vi + Σj cj hj}
        = (1/Z) e^{−E(V, H)}

where E(V, H) = −Σi Σj wij vi hj − Σi bi vi − Σj cj hj
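For readers who prefer code, a small NumPy sketch (an illustration, not part of the lecture) of this energy function and the corresponding unnormalized probability:

```python
import numpy as np

def energy(v, h, W, b, c):
    """E(V, H) = -v^T W h - b^T v - c^T h for boolean vectors v and h."""
    return -(v @ W @ h + b @ v + c @ h)

# Tiny example with m = 3 visible and n = 2 hidden units (random parameters).
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))
b = rng.normal(size=3)
c = rng.normal(size=2)
v = np.array([1, 0, 1])
h = np.array([0, 1])

# e^{-E(V,H)}; dividing by the partition function Z would give P(V, H).
print(np.exp(-energy(v, h, W, b, c)))
```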

Slide 45

E(V, H) = −Σi Σj wij vi hj − Σi bi vi − Σj cj hj

Because of the above form, we refer to these networks as (restricted) Boltzmann machines. The term comes from statistical mechanics, where the distribution of particles in a system over various possible states is given by

F(state) ∝ e^{−E/kT}

which is called the Boltzmann distribution or the Gibbs distribution.

Slide 46

Module 19.4: RBMs as Stochastic Neural Networks

Slide 47

But what is the connection between this and deep neural networks? We will get to it over the next few slides!

Slide 48

We will start by deriving a formula for P(V|H) and P(H|V). In particular, let us take the l-th visible unit and derive a formula for P(vl = 1|H). We will first define V−l as the state of all the visible units except the l-th unit. We now define the following quantities:

αl(H) = −Σ_{i=1}^{n} wil hi − bl

β(V−l, H) = −Σ_{i=1}^{n} Σ_{j=1, j≠l}^{m} wij hi vj − Σ_{j=1, j≠l}^{m} bj vj − Σ_{i=1}^{n} ci hi

Notice that E(V, H) = vl αl(H) + β(V−l, H), i.e., the part of the energy that involves vl has been separated out.
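A quick numerical check of this decomposition (illustration only; here W[i, j] connects visible unit i to hidden unit j, matching the W ∈ R^{m×n} convention of the figure, so the indices are transposed relative to the slide's wil):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 3                                 # m visible units, n hidden units
W = rng.normal(size=(m, n))
b = rng.normal(size=m)                      # visible biases
c = rng.normal(size=n)                      # hidden biases
V = rng.integers(0, 2, size=m)
H = rng.integers(0, 2, size=n)

def energy(V, H):
    return -(V @ W @ H + b @ V + c @ H)

l = 2                                       # pick one visible unit v_l
alpha_l = -(W[l] @ H) - b[l]                # alpha_l(H)
rest = np.arange(m) != l                    # all visible units except l
beta = -(V[rest] @ W[rest] @ H) - b[rest] @ V[rest] - c @ H

# The energy decomposes as E(V, H) = v_l * alpha_l(H) + beta(V_-l, H)
print(np.isclose(energy(V, H), V[l] * alpha_l + beta))  # True
```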

Slide 49

We can now write p(vl = 1|H) as

p(vl = 1|H) = p(vl = 1|V−l, H)
            = p(vl = 1, V−l, H) / p(V−l, H)
            = e^{−E(vl=1, V−l, H)} / (e^{−E(vl=1, V−l, H)} + e^{−E(vl=0, V−l, H)})
            = e^{−β(V−l, H) − 1·αl(H)} / (e^{−β(V−l, H) − 1·αl(H)} + e^{−β(V−l, H) − 0·αl(H)})
            = (e^{−β(V−l, H)} · e^{−αl(H)}) / (e^{−β(V−l, H)} · e^{−αl(H)} + e^{−β(V−l, H)})
            = e^{−αl(H)} / (e^{−αl(H)} + 1)
            = 1 / (1 + e^{αl(H)})
            = σ(−αl(H))
            = σ(Σ_{i=1}^{n} wil hi + bl)

Slide 50

Okay, so we arrived at

p(vl = 1|H) = σ(Σ_{i=1}^{n} wil hi + bl)

Similarly, we can show that

p(hl = 1|V) = σ(Σ_{i=1}^{m} wil vi + cl)

The RBM can thus be interpreted as a stochastic neural network, where the nodes and edges correspond to neurons and synaptic connections, respectively. The conditional probability of a single (hidden or visible) variable being 1 can be interpreted as the firing rate of a (stochastic) neuron with a sigmoid activation function.
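These conditionals translate directly into code; a minimal NumPy sketch (illustration only, with W ∈ R^{m×n} as in the figure) of one stochastic "firing" of the hidden and visible layers:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(v, W, c, rng):
    """p(h_j = 1 | V) = sigmoid(sum_i w_ij v_i + c_j); sample each h_j like a biased coin."""
    p = sigmoid(v @ W + c)                   # shape (n,)
    return (rng.random(p.shape) < p).astype(int), p

def sample_visible(h, W, b, rng):
    """p(v_i = 1 | H) = sigmoid(sum_j w_ij h_j + b_i)."""
    p = sigmoid(W @ h + b)                   # shape (m,)
    return (rng.random(p.shape) < p).astype(int), p

# Tiny example: m = 4 visible and n = 3 hidden units.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3)); b = rng.normal(size=4); c = rng.normal(size=3)
v = np.array([1, 0, 1, 1])
h, p_h = sample_hidden(v, W, c, rng)
v_new, p_v = sample_visible(h, W, b, rng)
print(h, v_new)
```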

Slide 51

Given this neural network view of RBMs, can you say something about what h is trying to learn? It is learning an abstract representation of V. This looks similar to autoencoders, but how do we train such an RBM? What is the objective function? We will see this in the next lecture!

Slide 52

Module 19.5: Unsupervised Learning with RBMs

Slide 53

So far, we have mainly dealt with supervised learning, where we are given {xi, yi}_{i=1}^{n} for training. In other words, for every training example we are given a label (or class) associated with it. Our job was then to learn a model which predicts ŷ such that the difference between y and ŷ is minimized.

Slide 54

But in the case of RBMs, our training data only contains x (for example, images). There is no explicit label (y) associated with the input. Of course, in addition to x we have the latent variables h, but we don't know what these h's are. We are interested in learning P(x, h), which we have parameterized as

P(V, H) = (1/Z) e^{Σi Σj wij vi hj + Σi bi vi + Σj cj hj} = (1/Z) e^{−E(V, H)}

Slide 55

What is the objective function that we should use? First note that if we have learnt P(x, h) we can compute P(x). What would we want P(X = x) to be for any x belonging to our training data? We would want it to be high. So now can you think of an objective function?

maximize ∏_{i=1}^{N} P(X = xi)

Or, equivalently, maximize the log-likelihood:

ln L(θ) = ln ∏_{i=1}^{N} p(xi|θ) = Σ_{i=1}^{N} ln p(xi|θ)

where θ are the parameters.

Slide 56

Okay, so we have the objective function now! What next? We need a learning algorithm. We can just use gradient descent if we are able to compute the gradient of the loss function w.r.t. the parameters. Let us see if we can do that.

Slide 57

Module 19.6: Computing the gradient of the log likelihood

Slide 58

We will just consider the loss for a single training example:

ln L(θ) = ln p(V|θ) = ln [ (1/Z) Σ_H e^{−E(V,H)} ]
        = ln Σ_H e^{−E(V,H)} − ln Σ_{V,H} e^{−E(V,H)}

∂ ln L(θ)/∂θ = ∂/∂θ [ ln Σ_H e^{−E(V,H)} − ln Σ_{V,H} e^{−E(V,H)} ]
             = − (1 / Σ_H e^{−E(V,H)}) Σ_H e^{−E(V,H)} ∂E(V,H)/∂θ + (1 / Σ_{V,H} e^{−E(V,H)}) Σ_{V,H} e^{−E(V,H)} ∂E(V,H)/∂θ
             = − Σ_H [ e^{−E(V,H)} / Σ_H e^{−E(V,H)} ] ∂E(V,H)/∂θ + Σ_{V,H} [ e^{−E(V,H)} / Σ_{V,H} e^{−E(V,H)} ] ∂E(V,H)/∂θ

Slide 59

Now,

e^{−E(V,H)} / Σ_{V,H} e^{−E(V,H)} = p(V, H)

e^{−E(V,H)} / Σ_H e^{−E(V,H)} = [ (1/Z) e^{−E(V,H)} ] / [ (1/Z) Σ_H e^{−E(V,H)} ] = p(V, H) / p(V) = p(H|V)

Therefore,

∂ ln L(θ)/∂θ = − Σ_H [ e^{−E(V,H)} / Σ_H e^{−E(V,H)} ] ∂E(V,H)/∂θ + Σ_{V,H} [ e^{−E(V,H)} / Σ_{V,H} e^{−E(V,H)} ] ∂E(V,H)/∂θ
             = − Σ_H p(H|V) ∂E(V,H)/∂θ + Σ_{V,H} p(V, H) ∂E(V,H)/∂θ

Slide 60

Okay, so we have

∂ ln L(θ)/∂θ = − Σ_H p(H|V) ∂E(V,H)/∂θ + Σ_{V,H} p(V, H) ∂E(V,H)/∂θ

Remember that θ is a collection of all the parameters in our model, i.e., wij, bi, cj for all i ∈ {1, . . . , m} and all j ∈ {1, . . . , n}. We will follow our usual recipe of computing the partial derivative w.r.t. one weight wij and then generalize to the gradient w.r.t. the entire weight matrix W.

Slide 61

∂ ln L(θ)/∂wij = − Σ_H p(H|V) ∂E(V,H)/∂wij + Σ_{V,H} p(V, H) ∂E(V,H)/∂wij
               = Σ_H p(H|V) vi hj − Σ_{V,H} p(V, H) vi hj
               = E_{p(H|V)}[vi hj] − E_{p(V,H)}[vi hj]

(since ∂E(V,H)/∂wij = −vi hj). We can thus write the gradient as the difference of two expectations.

Slide 62

∂ ln L(θ)/∂wij = E_{p(H|V)}[vi hj] − E_{p(V,H)}[vi hj]

How do we compute these expectations? The first expectation can actually be simplified (we will come back and simplify it later). However, the second expectation involves a sum over an exponential number of terms and is hence intractable in practice. So how do we deal with this?
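For a tiny RBM the two expectations can be computed exactly by enumeration, which also shows where the blow-up comes from: the first sum ranges over the 2^n hidden configurations only, while the second ranges over all 2^(m+n) joint configurations. A brute-force NumPy sketch (illustration only, made-up parameters):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
m, n = 4, 3                                  # m visible, n hidden units
W = rng.normal(size=(m, n)); b = rng.normal(size=m); c = rng.normal(size=n)

def unnorm(v, h):                            # e^{-E(V, H)}
    return np.exp(v @ W @ h + b @ v + c @ h)

configs_v = [np.array(v) for v in product([0, 1], repeat=m)]
configs_h = [np.array(h) for h in product([0, 1], repeat=n)]

i, j = 1, 2                                  # the weight w_ij we differentiate w.r.t.
V = np.array([1, 0, 1, 1])                   # one (made-up) training example

# First expectation: over p(H | V) -- only 2^n terms.
w_h = np.array([unnorm(V, h) for h in configs_h])
p_h_given_v = w_h / w_h.sum()
first = sum(p * V[i] * h[j] for p, h in zip(p_h_given_v, configs_h))

# Second expectation: over p(V, H) -- 2^(m+n) terms, infeasible for realistic m, n.
Z = sum(unnorm(v, h) for v in configs_v for h in configs_h)
second = sum(unnorm(v, h) / Z * v[i] * h[j] for v in configs_v for h in configs_h)

print(first - second)                        # d ln L / d w_ij for this example
```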

Slide 63

Module 19.7: Motivation for Sampling

Slide 64

∂ ln L(θ)/∂wij = E_{p(H|V)}[vi hj] − E_{p(V,H)}[vi hj]

The trick is to approximate the sum by using a few samples instead of an exponential number of samples. We will try to understand this with the help of an analogy.

Slide 65

Suppose you live in a city which has a population of 10M and you want to compute the average weight of this population. You can think of X as a random variable which denotes a person. The value assigned to this random variable can be any person from your population. For each person you have an associated value denoted by weight(X). You are then interested in computing the expected value of weight(X):

E[weight(X)] = Σ_{x∈P} p(x) weight(x)

Of course, it is going to be hard to get the weights of every person in the population, and hence in practice we approximate the above sum by sampling only a few subjects from the population (say 10,000):

E[weight(X)] ≈ Σ_{x∈P[:10000]} p(x) weight(x) / Σ_{x∈P[:10000]} p(x)

Further, you assume that P(X) = 1/N = 1/10K, i.e., every person in your population is equally likely, which gives

E[weight(X)] ≈ Σ_{x∈Persons[:10000]} weight(x) / 10^4
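The population analogy is easy to simulate; a small NumPy sketch (the weights are made up for illustration) showing that a uniform sample of 10,000 people already gives a good estimate of the population average:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical population of 10 million body weights (in kg).
population = rng.normal(loc=70.0, scale=12.0, size=10_000_000)

true_mean = population.mean()

# Approximate the expectation with a uniform random sample of 10,000 people.
sample = rng.choice(population, size=10_000, replace=False)
print(true_mean, sample.mean())  # the two values should be close
```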

Slide 66

E[X] = Σ_{x∈P} x p(x)

This looks easy, so why can't we do the same for our task? Why can't we simply approximate the sum by using some samples? What does that mean? It means that instead of considering all 2^{m+n} possible values of (v, h), let us just consider some samples from this population. Analogy: earlier we had 10M samples in the population from which we drew 10K samples; now we have 2^{m+n} samples in the population from which we need to draw a reasonable number of samples. Why is this not straightforward? Let us see!

Slide 67

For simplicity, first let us just focus on the visible variables (V ∈ {0, 1}^m) and see what it means to draw samples from P(V). Well, we know that V = (v1, v2, . . . , vm) where each vi ∈ {0, 1}. Suppose we decide to approximate the sum by 10K samples instead of the full 2^m samples. It is easy to create these samples by assigning values to each vi; for example, V = 11111 . . . 11111, V = 00000 . . . 0000, V = 00110011 . . . 00110011, . . . , V = 0101 . . . 0101 are all samples from this population. So which samples do we consider?

Slide 68

(Figure: sample images, some labelled Likely and others Unlikely)

Well, that's where the catch is! Unlike our population analogy, here we cannot assume that every sample is equally likely. Why? (Hint: consider the case that the visible variables correspond to pixels from natural images.) Clearly some images are more likely than the others! Hence, we cannot assume that all samples from the population (V ∈ {0, 1}^m) are equally likely.

Slide 69

(Figure: a uniform distribution vs. a multimodal distribution)

Let us see this in more detail. In our analogy, every person was equally likely, so we could just sample people uniformly at random. However, now if we sample uniformly at random, we will not get the true picture of the expected value. We need to draw more samples from the high-probability regions and fewer samples from the low-probability regions. In other words, each sample needs to be drawn in proportion to its probability and not uniformly.

Slide 70

∂ ln L(θ|V)/∂wij = E_{p(H|V)}[vi hj] − E_{p(V,H)}[vi hj]

Z = Σ_V Σ_H ∏i ∏j φij(vi, hj) ∏i ψi(vi) ∏j ξj(hj)

That is where the problem lies! To draw a sample (V, H), we need to know its probability P(V, H). And of course, we also need this P(V, H) to compute the expectation. But, unfortunately, computing P(V, H) is intractable because of the partition function Z. Hence, approximating the summation by using a few samples is not straightforward! (Or rather, drawing a few samples from the distribution is hard!)

Slide 71

The story so far
Conclusion: Okay, I get it that drawing samples from this distribution P is hard.
Question: Is it possible to draw samples from an easier distribution (say, Q), as long as I am sure that if I keep drawing samples from Q, eventually my samples will start looking as if they were drawn from P?
Answer: Well, if you can actually prove this, then why not? (And that's what we do in Gibbs sampling.)
