

SLIDE 1

CS7015 (Deep Learning) : Lecture 17

Recap of Probability Theory, Bayesian Networks, Conditional Independence in Bayesian Networks
Mitesh M. Khapra

Department of Computer Science and Engineering, Indian Institute of Technology Madras

SLIDE 2

Module 17.0: Recap of Probability Theory

SLIDE 3

We will start with a quick recap of some basic concepts from probability

SLIDE 4

[Figure: disjoint events A1, A2, A3, A4, A5 within the sample space Ω]

Axioms of Probability
For any event A, P(A) ≥ 0.
If A1, A2, ..., An are disjoint events (i.e., Ai ∩ Aj = ∅ for all i ≠ j), then P(∪_i Ai) = Σ_i P(Ai).
If Ω is the universal set containing all events, then P(Ω) = 1.

SLIDE 5

[Figure: sample space Ω partitioned into grades A, B, C; random variable G]

Random Variable (intuition)
Suppose a student can get one of 3 possible grades in a course: A, B, C. One way of interpreting this is that there are 3 possible events here. Another way of looking at this is that there is a random variable G which maps each student to one of the 3 possible values, and we are interested in P(G = g) where g ∈ {A, B, C}. Of course, both interpretations are conceptually equivalent.

SLIDE 6

[Figure: random variables G (grades A/B/C), H (height short/tall), A (age young/adult)]

Random Variable (intuition)
But the second one (using random variables) is more compact, especially when there are multiple attributes associated with a student (outcome): grade, height, age, etc. We could have one random variable corresponding to each attribute, and then ask for outcomes (or students) where Grade = g, Height = h, Age = a, and so on.

SLIDE 7

[Figure: random variables G, H, A mapping outcomes to values]

Random Variable (formal)
A random variable is a function which maps each outcome in Ω to a value. In the previous example, G (or fGrade) maps each student in Ω to a value: A, B or C. The event Grade = A is a shorthand for the event {ω ∈ Ω : fGrade(ω) = A}.

SLIDE 8

[Figure: discrete random variable G (grades A/B/C) vs. continuous random variables H (height, 120cm to 200cm) and W (weight, 45kg to 120kg)]

Random Variable (continuous v/s discrete)
A random variable can either take continuous values (for example, weight, height) or discrete values (for example, grade, nationality). For this discussion we will mainly focus on discrete random variables.

SLIDE 9

Marginal Distribution

G    P(G = g)
A    0.1
B    0.2
C    0.7

What do we mean by a marginal distribution over a random variable? Consider our random variable G for grades. Specifying the marginal distribution over G means specifying P(G = g) ∀ g ∈ {A, B, C}. We denote this marginal distribution compactly by P(G).

SLIDE 10

Joint Distribution

G    I       P(G = g, I = i)
A    High    0.3
A    Low     0.1
B    High    0.15
B    Low     0.15
C    High    0.1
C    Low     0.2

Consider two random variables G (grade) and I (intelligence ∈ {High, Low}). The joint distribution over these two random variables assigns probabilities to all events involving them: P(G = g, I = i) ∀ (g, i) ∈ {A, B, C} × {High, Low}. We denote this joint distribution compactly by P(G, I).
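Since the table is tiny, it is easy to play with in code. Here is a minimal sketch (mine, not from the lecture; the dict-of-tuples layout is just one convenient choice) that stores P(G, I) and checks that it is a valid distribution:

```python
# Joint distribution P(G, I) from the table above, keyed by (g, i).
joint_GI = {
    ("A", "High"): 0.30, ("A", "Low"): 0.10,
    ("B", "High"): 0.15, ("B", "Low"): 0.15,
    ("C", "High"): 0.10, ("C", "Low"): 0.20,
}

# A valid joint distribution must be non-negative and sum to 1.
assert all(p >= 0 for p in joint_GI.values())
assert abs(sum(joint_GI.values()) - 1.0) < 1e-9
```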

SLIDE 11

Conditional Distribution

G    P(G | I = H)        G    P(G | I = L)
A    0.6                 A    0.3
B    0.3                 B    0.4
C    0.1                 C    0.3

Consider two random variables G (grade) and I (intelligence). Suppose we are given the value of I (say, I = H). Then the conditional distribution P(G|I) is defined as

P(G = g | I = H) = P(G = g, I = H) / P(I = H)    ∀ g ∈ {A, B, C}

More compactly, P(G|I) = P(G, I) / P(I), or equivalently

P(G, I) [joint] = P(G|I) [conditional] × P(I) [marginal]
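Continuing the sketch from above (again illustrative, not the lecture's code; note it conditions the SLIDE 10 joint, which is a different example from the conditional tables shown on this slide), the conditional distribution falls out of the joint by dividing by a marginal:

```python
joint_GI = {
    ("A", "High"): 0.30, ("A", "Low"): 0.10,
    ("B", "High"): 0.15, ("B", "Low"): 0.15,
    ("C", "High"): 0.10, ("C", "Low"): 0.20,
}

def p_G_given_I(i):
    # P(I = i): marginalize G out of the joint.
    p_i = sum(p for (g, i2), p in joint_GI.items() if i2 == i)
    # P(G = g | I = i) = P(G = g, I = i) / P(I = i)
    return {g: p / p_i for (g, i2), p in joint_GI.items() if i2 == i}

print(p_G_given_I("High"))  # {'A': 0.545..., 'B': 0.272..., 'C': 0.181...}
```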

SLIDE 12

[Table: joint distribution P(X1, X2, ..., Xn), one row per assignment; the probabilities sum to 1]

Joint Distribution (n random variables)
The joint distribution of n random variables assigns probabilities to all events involving the n random variables. In other words, it assigns P(X1 = x1, X2 = x2, ..., Xn = xn) for all possible values that the variables Xi can take. If each random variable Xi can take two values, then the joint distribution will assign probabilities to the 2^n possible events.

SLIDE 13

Joint Distribution (n random variables)
The joint distribution over two random variables X1 and X2 can be written as

P(X1, X2) = P(X2|X1)P(X1) = P(X1|X2)P(X2)

Similarly, for n random variables,

P(X1, X2, ..., Xn) = P(X2, ..., Xn|X1)P(X1)
                   = P(X3, ..., Xn|X1, X2)P(X2|X1)P(X1)
                   = P(X4, ..., Xn|X1, X2, X3)P(X3|X2, X1)P(X2|X1)P(X1)
                   = P(X1) ∏_{i=2}^{n} P(Xi|X1, ..., Xi−1)    (chain rule)

SLIDE 14

From Joint Distributions to Marginal Distributions

A      B      P(A = a, B = b)
High   High   0.3
High   Low    0.25
Low    High   0.35
Low    Low    0.1

A      P(A = a)        B      P(B = b)
High   0.55            High   0.65
Low    0.45            Low    0.35

Suppose we are given a joint distribution over two random variables A, B. The marginal distributions of A and B can be computed as

P(A = a) = Σ_{∀b} P(A = a, B = b)        P(B = b) = Σ_{∀a} P(A = a, B = b)

More compactly written as P(A) = Σ_B P(A, B) and P(B) = Σ_A P(A, B).

SLIDE 15

[Tables: the joint and marginals of A, B from the previous slide]

What if there are n random variables?
Suppose we are given a joint distribution over n random variables X1, X2, ..., Xn. The marginal distribution over X1 can be computed as

P(X1 = x1) = Σ_{∀x2, x3, ..., xn} P(X1 = x1, X2 = x2, ..., Xn = xn)

More compactly written as

P(X1) = Σ_{X2, X3, ..., Xn} P(X1, X2, ..., Xn)
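In code, this sum is a one-liner over the joint's entries. A small sketch (illustrative; the joint's numbers are invented and chosen only so that they sum to 1):

```python
from itertools import product

# Hypothetical joint over three binary variables (X1, X2, X3).
probs = [0.10, 0.05, 0.20, 0.15, 0.05, 0.10, 0.05, 0.30]
joint = dict(zip(product((0, 1), repeat=3), probs))

# P(X1 = x1) = sum over all values of the remaining variables.
p_x1 = {x1: sum(p for assign, p in joint.items() if assign[0] == x1)
        for x1 in (0, 1)}
print(p_x1)  # {0: 0.5, 1: 0.5}
```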

SLIDE 16

Independence
Recall that by the chain rule of probability, P(X, Y) = P(X)P(Y|X). However, if X and Y are independent, then P(X, Y) = P(X)P(Y). Two random variables X and Y are said to be independent if P(X|Y) = P(X). We denote this as X ⊥ Y. In other words, knowing the value of Y does not change our belief about X. We would expect Grade to be dependent on Intelligence but independent of Weight.
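Independence is also easy to test numerically on a finite table: compare the joint against the product of its marginals. A sketch (my own, not from the lecture), applied to the grade/intelligence joint from SLIDE 10:

```python
def is_independent(joint, tol=1e-9):
    # joint maps (x, y) -> P(X = x, Y = y); checks P(X, Y) = P(X) P(Y).
    xs = {x for (x, _) in joint}
    ys = {y for (_, y) in joint}
    p_x = {x: sum(joint[(x, y)] for y in ys) for x in xs}
    p_y = {y: sum(joint[(x, y)] for x in xs) for y in ys}
    return all(abs(joint[(x, y)] - p_x[x] * p_y[y]) < tol
               for x in xs for y in ys)

# The grade/intelligence joint from SLIDE 10: G and I are clearly dependent.
g_i = {("A", "High"): 0.30, ("A", "Low"): 0.10,
       ("B", "High"): 0.15, ("B", "Low"): 0.15,
       ("C", "High"): 0.10, ("C", "Low"): 0.20}
print(is_independent(g_i))  # False
```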

SLIDE 17

Okay, we are now ready to move on to Bayesian Networks or Directed Graphical Models

SLIDE 18

Module 17.1: Why are we interested in Joint Distributions

SLIDE 19

[Figure: random variables Y (Oil), X1 (Salinity), X2 (Pressure), X3 (Depth), X4 (Biodiversity), X5 (Temperature), X6 (Density)]

In many real world applications, we have to deal with a large number of random variables. For example, an oil company may be interested in computing the probability of finding oil at a particular location. This may depend on various (random) variables. The company is interested in knowing the joint distribution P(Y, X1, X2, X3, X4, X5, X6).

SLIDE 20

[Figure: the same oil-exploration variables as above]

But why the joint distribution P(Y, X1, X2, X3, X4, X5, X6)? Aren't we just interested in P(Y|X1, X2, ..., Xn)? Well, if we know the joint distribution, we can find answers to a bunch of interesting questions. Let us see some such questions of interest.

SLIDE 21

[Figure: the same oil-exploration variables as above]

We can find the conditional distribution

P(Y | X1, ..., Xn) = P(Y, X1, ..., Xn) / Σ_Y P(Y, X1, ..., Xn)

We can find the marginal distribution

P(Y) = Σ_{X1, ..., Xn} P(Y, X1, X2, ..., Xn)

We can find the (conditional) independencies, for example whether P(Y, X1) = P(Y)P(X1).

SLIDE 22

Module 17.2: How do we represent a joint distribution

SLIDE 23

[Figure: the oil-exploration variables with their domains: Y (yes/no), X1 Salinity (high/low), X2 Pressure (high/low), X3 Depth (deep/shallow), X4 Biodiversity (high/low), X5 Temperature (high/low), X6 Density (high/low)]

Let us return to the case of n random variables. For simplicity, assume each of these variables can take binary values. To specify the joint distribution, we need to specify 2^n − 1 values. Why not 2^n? If we specify these 2^n − 1 values, we have an explicit representation for the joint distribution.

SLIDE 24

[Table: explicit joint distribution over X1, X2, ..., Xn, one probability per binary assignment, e.g. P(0, 0, ..., 0) = 0.01, ..., P(1, 1, ..., 1) = 0.002]

(Once the first 2^n − 1 values are specified, the last value is deterministic, as the values need to sum to 1.)

Challenges with the explicit representation
Computational: expensive to manipulate and too large to store
Cognitive: impossible to acquire so many numbers from a human
Statistical: need huge amounts of data to learn the parameters

SLIDE 25

Module 17.3: Can we represent the joint distribution more compactly?

SLIDE 26

I    S    P(I, S)
0    0    0.665
0    1    0.035
1    0    0.06
1    1    0.24

This distribution has 2^2 − 1 = 3 parameters. (Alternatively, the table has 4 rows, but the last row is deterministic given the first 3 rows, i.e., parameters.)

Consider the case of two random variables, Intelligence (I) and SAT Score (S). Assume that both are binary and take values from High (1) and Low (0). Here is one way of specifying the joint distribution. Of course, many such joint distributions are possible.

SLIDE 27

         i = 0    i = 1
P(I)     0.7      0.3
(no. of parameters = 1)

              s = 0    s = 1
P(S|I = 0)    0.95     0.05
P(S|I = 1)    0.2      0.8
(no. of parameters = 2)

What! So from 3 parameters we have gone to 6 parameters? Well, not really (remember that each row in the above tables has to sum to 1); the number of parameters is still 3. Note that there is a natural ordering between these two random variables: the SAT Score (S) presumably depends on the Intelligence (I). An alternate and even more natural way to represent the same distribution is

P(I, S) = P(I) × P(S|I)

Instead of specifying the 4 entries in P(I, S), we can specify 2 entries for P(I) and 4 entries for P(S|I).
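A quick sketch (illustrative, not from the slides) confirming that this conditional parameterization reproduces exactly the joint table of the previous slide:

```python
p_i = {0: 0.7, 1: 0.3}                       # P(I)
p_s_given_i = {0: {0: 0.95, 1: 0.05},        # P(S | I = 0)
               1: {0: 0.2,  1: 0.8}}         # P(S | I = 1)

# Reconstruct the joint via P(I, S) = P(I) * P(S | I).
joint_IS = {(i, s): p_i[i] * p_s_given_i[i][s]
            for i in (0, 1) for s in (0, 1)}
print(joint_IS)
# {(0, 0): 0.665, (0, 1): 0.035..., (1, 0): 0.06, (1, 1): 0.24}
```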

SLIDE 28

[Tables: P(I) and P(S|I) as on the previous slide]

What have we achieved so far? We were not able to reduce the number of parameters, but we have a more natural way of representing the distribution. This is known as conditional parameterization.

SLIDE 29

[Figure: nodes Intelligence, Grade, SAT]

Now consider a third random variable, Grade (G). Notice that none of these 3 variables is independent of the others. Grade and SAT Score are clearly correlated with Intelligence. Grade and SAT Score are also correlated, because we would expect P(G = 1|S = 1) > P(G = 1|S = 0).

SLIDE 30

[Figure: nodes Intelligence, Grade, SAT]

However, it is possible that the distribution satisfies a conditional independence. If we know that I = H, then it is possible that S = H does not give any extra information for determining G. In other words, if we know that the student is intelligent, we can make inferences about his grade without even knowing the SAT score. Formally, we assume that (S ⊥ G | I). Note that this is just an assumption.

SLIDE 31

[Figure: nodes Intelligence, Grade, SAT]

We could argue that in many cases (S ⊥ G | I) does not hold. For example, a student might be intelligent, but we also have to factor in his/her ability to write time-bound exams, in which case S and G are not independent given I (because the SAT score tells us about the ability to write time-bound exams). But for this discussion, we will assume (S ⊥ G | I).

SLIDE 32

Question
Now let's see the implication of this assumption. Does it simplify things in any way?

SLIDE 33

How many parameters do we need to specify P(I, G, S) explicitly? 2 × 2 × 3 − 1 = 11. What if we use conditional parameterization by following the chain rule?

P(I, G, S) = P(S, G|I)P(I)
           = P(S|G, I)P(G|I)P(I)
           = P(S|I)P(G|I)P(I)    since (S ⊥ G | I)

We need the following distributions to fully specify the joint distribution:

         i = 0    i = 1
P(I)     0.7      0.3
(no. of parameters = 1)

              s = 0    s = 1
P(S|I = 0)    0.95     0.05
P(S|I = 1)    0.2      0.8
(no. of parameters = 2)

              g = A    g = B    g = C
P(G|I = 0)    0.2      0.34     0.46
P(G|I = 1)    0.74     0.17     0.09
(no. of parameters = 4)

Total no. of parameters = 7
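As an illustrative check (my sketch, encoding the three tables above), the factorized form reconstructs the full 12-entry joint P(I, G, S) from only 7 free parameters:

```python
p_i  = {0: 0.7, 1: 0.3}
p_si = {0: {0: 0.95, 1: 0.05}, 1: {0: 0.2, 1: 0.8}}
p_gi = {0: {"A": 0.2, "B": 0.34, "C": 0.46},
        1: {"A": 0.74, "B": 0.17, "C": 0.09}}

# P(I, G, S) = P(I) P(G|I) P(S|I), using the assumption (S ⊥ G | I).
joint_IGS = {(i, g, s): p_i[i] * p_gi[i][g] * p_si[i][s]
             for i in (0, 1) for g in "ABC" for s in (0, 1)}

assert abs(sum(joint_IGS.values()) - 1.0) < 1e-9
# Free parameters: 1 (P(I)) + 2 (P(S|I)) + 4 (P(G|I)) = 7,
# versus 11 for the explicit 2 x 3 x 2 joint.
```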

SLIDE 34

[Tables: P(I), P(S|I), P(G|I) as on the previous slide]

The alternate parameterization is more natural than the explicit joint distribution. It is also more compact. And it is more modular: when we added G, we could just reuse the tables for P(I) and P(S|I).

SLIDE 35

Module 17.4: Can we use a graph to represent a joint distribution?

SLIDE 36

[Figure: node C with directed edges to X1, X2, X3, ..., Xn]

Suppose we have n random variables, all of which are independent given another random variable C. This is called the Naive Bayes model: it makes the naive assumption that all nC2 pairs (Xi, Xj) are independent given C. The joint distribution then factorizes as

P(C, X1, ..., Xn) = P(C) P(X1|C) P(X2|X1, C) P(X3|X2, X1, C) ...
                  = P(C) ∏_{i=1}^{n} P(Xi|C)    since Xi ⊥ Xj | C
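A minimal sketch (illustrative; the CPT values are invented) of how a Naive Bayes joint is assembled from P(C) and per-feature conditionals P(Xi|C):

```python
from math import prod

p_c = {0: 0.6, 1: 0.4}                        # hypothetical P(C)
p_x_given_c = [                               # one table P(Xi | C) per feature
    {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}},
    {0: {0: 0.8, 1: 0.2}, 1: {0: 0.5, 1: 0.5}},
    {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}},
]

def joint(c, xs):
    # P(C, X1, ..., Xn) = P(C) * prod_i P(Xi | C)
    return p_c[c] * prod(p_x_given_c[i][c][x] for i, x in enumerate(xs))

print(joint(1, (0, 1, 1)))  # 0.4 * 0.3 * 0.5 * 0.8 = 0.048
```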

SLIDE 37

[Figure: two small Bayesian networks, one over I (Intelligence) and G (Grade), one over I (Intelligence), G (Grade), S (SAT), alongside the Naive Bayes network C → X1, X2, ..., Xn]

Bayesian networks build on the intuitions that we developed for the Naive Bayes model, but they are not restricted to strong (naive) independence assumptions. We use graphs to represent the joint distribution. Nodes: random variables. Edges: indicate dependence.

SLIDE 38

[Figure: the student network with nodes D (Difficulty), I (Intelligence), G (Grade), S (SAT), L (Letter); edges D → G, I → G, I → S, G → L]

Let's revisit the student example. We will introduce a few more random variables and independence assumptions. The grade now depends on the student's Intelligence and the exam's Difficulty level. The SAT score depends on Intelligence. The recommendation Letter from the course instructor depends on the Grade.

SLIDE 39

[Figure: the student network D (Difficulty), I (Intelligence), G (Grade), S (SAT), L (Letter)]

The Bayesian network contains a node for each random variable. The edges denote the dependencies between the random variables. Each variable depends directly on its parents in the network.

SLIDE 40

[Figure: the student network]

The Bayesian network can be viewed as a data structure. It provides a skeleton for representing a joint distribution compactly by factorization. Let us see what this means.

SLIDE 41

[Figure: the student network annotated with its local probability tables]

P(D):  d0 = 0.6, d1 = 0.4
P(I):  i0 = 0.7, i1 = 0.3

P(G | I, D):
           g1     g2     g3
i0, d0     0.3    0.4    0.3
i0, d1     0.05   0.25   0.7
i1, d0     0.9    0.08   0.02
i1, d1     0.5    0.3    0.2

P(S | I):
       s0     s1
i0     0.95   0.05
i1     0.2    0.8

P(L | G):
       l0     l1
g1     0.1    0.9
g2     0.4    0.6
g3     0.99   0.01

Each node is associated with a local probability model: local, because it represents the dependencies of each variable on its parents. There are 5 such local probability models associated with the graph. Each variable (in general) is associated with a conditional probability distribution (conditional on its parents).

SLIDE 42

[Figure: the student network with the same local probability tables as above]

The graph gives us a natural factorization for the joint distribution. In this case,

P(I, D, G, S, L) = P(I) P(D) P(G|I, D) P(S|I) P(L|G)

For example,

P(I = 1, D = 0, G = B, S = 1, L = 0) = 0.3 × 0.6 × 0.08 × 0.8 × 0.4

The graph structure (nodes, edges) along with the conditional probability distributions is called a Bayesian Network.
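A hedged sketch (my encoding of the tables above, not code from the lecture) that evaluates exactly this factorization:

```python
# Local probability tables of the student network (from the slides).
p_d = {0: 0.6, 1: 0.4}
p_i = {0: 0.7, 1: 0.3}
p_g_id = {(0, 0): {1: 0.3,  2: 0.4,  3: 0.3},
          (0, 1): {1: 0.05, 2: 0.25, 3: 0.7},
          (1, 0): {1: 0.9,  2: 0.08, 3: 0.02},
          (1, 1): {1: 0.5,  2: 0.3,  3: 0.2}}   # key: (i, d); grades g1..g3
p_s_i = {0: {0: 0.95, 1: 0.05}, 1: {0: 0.2, 1: 0.8}}
p_l_g = {1: {0: 0.1, 1: 0.9}, 2: {0: 0.4, 1: 0.6}, 3: {0: 0.99, 1: 0.01}}

def joint(i, d, g, s, l):
    # P(I, D, G, S, L) = P(I) P(D) P(G|I,D) P(S|I) P(L|G)
    return p_i[i] * p_d[d] * p_g_id[(i, d)][g] * p_s_i[i][s] * p_l_g[g][l]

# The example from the slide: grade B corresponds to g2.
print(joint(i=1, d=0, g=2, s=1, l=0))  # 0.3*0.6*0.08*0.8*0.4 = 0.004608
```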

SLIDE 43

Module 17.5: Different types of reasoning in a Bayesian network

SLIDE 44

New Notation
We will denote P(I = 0) by P(i0). In general, we will denote P(I = 0, D = 1, G = B, S = 1, L = 0) by P(i0, d1, gB, s1, l0).

SLIDE 45

[Figure: the student network with its local probability tables]

Causal Reasoning
Here, we try to predict the downstream effects of various factors. Let us consider an example. What is the probability that a student will get a good recommendation letter, P(l1)?

P(l1) = Σ_{I∈{0,1}} Σ_{D∈{0,1}} Σ_{S∈{0,1}} Σ_{G∈{A,B,C}} P(I, D, G, S, l1)

SLIDE 46

P(l1) = Σ_{I∈{0,1}} Σ_{D∈{0,1}} Σ_{S∈{0,1}} Σ_{G∈{A,B,C}} P(I, D, G, S, l1)
      = Σ_I P(I) Σ_D P(D|I) Σ_S P(S|I, D) Σ_G P(G|I, D, S) · P(l1|G, I, D, S)    (chain rule)
      = Σ_I P(I) Σ_D P(D) Σ_S P(S|I) Σ_G P(G|I, D) · P(l1|G)    (using the independencies encoded in the network)

[Figure: the student network]

SLIDE 47

P(l1) = Σ_I P(I) Σ_D P(D) Σ_S P(S|I) Σ_G P(G|I, D) P(l1|G)
      = Σ_I P(I) Σ_D P(D) Σ_S P(S|I) [0.9 P(g1|I, D) + 0.6 P(g2|I, D) + 0.01 P(g3|I, D)]

Similarly, using the other tables, we can evaluate this expression:

P(l1) = 0.502

[Figure: the student network, with the tables P(L|G) and P(G|I, D)]
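This kind of inference is just brute-force enumeration over the joint. A sketch (my code; the CPT dicts are redefined so the snippet runs standalone, and it reproduces the 0.502 above as well as the conditional query used on the next slides):

```python
from itertools import product

# CPTs of the student network (same tables as in the earlier sketch).
p_d = {0: 0.6, 1: 0.4}
p_i = {0: 0.7, 1: 0.3}
p_g = {(0, 0): {1: 0.3, 2: 0.4, 3: 0.3},   (0, 1): {1: 0.05, 2: 0.25, 3: 0.7},
       (1, 0): {1: 0.9, 2: 0.08, 3: 0.02}, (1, 1): {1: 0.5,  2: 0.3,  3: 0.2}}
p_s = {0: {0: 0.95, 1: 0.05}, 1: {0: 0.2, 1: 0.8}}
p_l = {1: {0: 0.1, 1: 0.9}, 2: {0: 0.4, 1: 0.6}, 3: {0: 0.99, 1: 0.01}}

def joint(i, d, g, s, l):
    return p_i[i] * p_d[d] * p_g[(i, d)][g] * p_s[i][s] * p_l[g][l]

def prob(query):
    # P(event): sum the joint over all worlds consistent with `query`,
    # e.g. prob({'l': 1}) or prob({'l': 1, 'i': 0}).
    total = 0.0
    for i, d, g, s, l in product((0, 1), (0, 1), (1, 2, 3), (0, 1), (0, 1)):
        world = {'i': i, 'd': d, 'g': g, 's': s, 'l': l}
        if all(world[k] == v for k, v in query.items()):
            total += joint(i, d, g, s, l)
    return total

print(round(prob({'l': 1}), 3))                            # 0.502
print(round(prob({'l': 1, 'i': 0}) / prob({'i': 0}), 3))   # P(l1|i0) = 0.389
```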

SLIDE 48

[Figure: the student network with its local probability tables]

Causal Reasoning
Now what happens if we start adding information about the factors that could influence l1? What if someone reveals that the student is not intelligent? Intelligence will affect the score and hence the grade.

SLIDE 49

P(l1|i0) = P(l1, i0) / P(i0)

P(l1, i0) = Σ_{D∈{0,1}} Σ_{S∈{0,1}} Σ_{G∈{A,B,C}} P(i0, D, G, S, l1)

Dividing by P(i0) cancels the P(i0) factor in the joint, so

P(l1|i0) = Σ_D P(D) Σ_S P(S|i0) Σ_G P(G|D, i0) P(l1|G)
         = Σ_D P(D) Σ_S P(S|i0) [0.9 P(g1|D, i0) + 0.6 P(g2|D, i0) + 0.01 P(g3|D, i0)]

P(l1|i0) = 0.389

[Figure: the student network, with the tables P(L|G) and P(G|I, D)]

SLIDE 50

[Figure: the student network with its local probability tables]

Causal Reasoning
What if the course was easy? A not-so-intelligent student may still be able to get a good grade, and hence a good letter:

P(l1|i0, d0) = Σ_{G∈{A,B,C}} Σ_{S∈{0,1}} P(i0, d0, G, S, l1) / P(i0, d0) = 0.513 (increases)

SLIDE 51

[Figure: the student network with its local probability tables]

Evidential Reasoning
Here, we reason about causes by looking at their effects. What is the probability of the student being intelligent? P(i1) = 0.3. What is the probability of the course being difficult? P(d1) = 0.4. Now let us see what happens if we observe some effects.

SLIDE 52

P(i1) = 0.3        P(d1) = 0.4
P(i1|g3) = 0.079 (drops)        P(d1|g3) = 0.629 (increases)
P(i1|l0) = 0.14 (drops)
P(i1|l0, g3) = 0.079 (same as P(i1|g3))

[Figure: the student network]

Evidential Reasoning
What if someone tells us that the student secured a C grade? What if, instead of getting to know the grade, we get to know that the student got a poor recommendation letter? What if we know about the grade as well as the recommendation letter? The last case is interesting! (We will return to it later.)

SLIDE 53

P(i1) = 0.3
P(i1|g3) = 0.079 (drops)
P(i1|g3, d1) = 0.11 (improves)

[Figure: the student network]

Explaining Away
Here, we see how different causes of the same effect can interact. We already saw how knowing the grade influences our estimate of intelligence. What if we were told the course was difficult? Our belief in the student's intelligence improves. Why? Let us see.

SLIDE 54

P(i1) = 0.3
P(i1|g3) = 0.079        P(i1|g3, d1) = 0.11
P(i1|g2) = 0.175        P(i1|g2, d1) = 0.34

[Figure: the student network]

Explaining Away
Knowing that the course was difficult explains away the bad grade: "Oh! Maybe the course was just too difficult, and the student might have received a bad grade despite being intelligent!" The explaining-away effect can be even more dramatic. Let us consider the case when the grade was B.

SLIDE 55

P(d1) = 0.40
P(d1|g3) = 0.629
P(d1|s1, g3) = 0.76

[Figure: the student network]

Explaining Away
Suppose we know that the student had a high SAT score. What happens to our belief about the difficulty of the course? Knowing that the SAT score was high tells us that the student seems intelligent, and perhaps the reason why he scored a poor grade is that the course was difficult.

SLIDE 56

Module 17.6: Independencies encoded by a Bayesian network (Case 1: Node and its parents)

SLIDE 57

Why do we care about the independencies encoded in a Bayesian network? We saw that if two variables are independent, then the chain rule gets simplified, resulting in simpler factors, which in turn reduces the number of parameters. In the extreme case, we saw that in the Naive Bayes model each factor was very simple (just P(Xi|Y)), and as a result each factor added just 3 parameters. The more the independencies, the fewer the parameters and the lower the inference time. For example, if we want to compute the marginal P(S), then we just need to sum over the values of I and not over any other variables. Hence we are interested in finding the independencies encoded in a Bayesian network.

SLIDE 58

In general, given n random variables, we are interested in knowing whether
Xi ⊥ Xj
Xi ⊥ Xj | Z, where Z ⊆ {X1, X2, ..., Xn} \ {Xi, Xj}
Let us answer some of these questions for our student Bayesian network.

SLIDE 59

[Figure: the student network]

To understand this, let us return to our student example. First, let us see some independencies which clearly do not exist in the graph.
Is L ⊥ G? (No, by construction)
Is G ⊥ D? (No, by construction)
Is G ⊥ I? (No, by construction)
Is S ⊥ I? (No, by construction)
Rule? Rule: A node is not independent of its parents.

SLIDE 60

[Figure: the student network]

Let us focus on G and L. We already know that G is not independent of L. What if we know the value of I? Does G become independent of L? No (intuitively, the student may be intelligent or not, but ultimately the letter depends on the performance in the course). If we know the value of D, does G become independent of L? No (intuitively, the course may be easy or hard, but the letter would depend on the performance in the course). What if we know the value of S? Does G become independent of L? No: the instructor is not going to look at the SAT score but the grade. Rule? Rule: A node is not independent of its parents even when we are given the values of other variables.

SLIDE 61

[Figure: the student network]

Rule: A node is not independent of its parents even when we are given the values of other variables. The same argument can be made about the following pairs:
¬(G ⊥ D) (even when other variables are given)
¬(G ⊥ I) (even when other variables are given)
¬(S ⊥ I) (even when other variables are given)

SLIDE 62

Module 17.7: Independencies encoded by a Bayesian network (Case 2: Node and its non-parents)

SLIDE 63

[Figure: the student network]

Now let's look at the relation between a node and its non-parent nodes. Is L ⊥ S? No: knowing the SAT score tells us about I, which in turn tells us something about G and hence L. Hence we expect P(l1|s1) > P(l1|s0). Similarly, we can argue that L is not independent of D, nor of I.

SLIDE 64

[Figure: the student network]

But what if we know the value of G? Is (L ⊥ S) | G? Yes: the grade completely determines the recommendation letter. Once we know the grade, other variables do not add any information. Hence (L ⊥ S) | G. Similarly, we can argue (L ⊥ I) | G and (L ⊥ D) | G.

SLIDE 65

[Figure: the student network]

But, wait a minute! The instructor may also want to look at the SAT score in addition to the grade. Well, we "assumed" that the instructor relies only on the grade. That was our "belief" of how the world works, and hence we drew the network accordingly.

SLIDE 66

[Figure: a modified student network with additional dependencies]

Of course, we are free to change our assumptions. We may want to assume that the instructor also looks at the SAT score. But if that is the case, we have to change the network to reflect this dependence. And why just the SAT score? The instructor may even consult one of his colleagues and seek his/her opinion.

SLIDE 67

[Figure: the modified student network]

Remember: the graph is a reflection of our assumptions about how the world works. Our assumptions about dependencies are encoded in the graph. Once we build the graph, we freeze it and do all the reasoning and analysis (independence) on this graph. It is not fair to ask "what if" questions involving other factors (for example, what if the professor was in a bad mood?).

SLIDE 68

[Figure: Graph (a), the original student network; Graph (b), the same network with an extra edge from S to L]

If we believe Graph (a) is how the world works, then (L ⊥ S) | G. If we believe Graph (b) is how the world works, then ¬((L ⊥ S) | G). We will stick to Graph (a) for the discussion.

SLIDE 69

Let's return to our discussion of finding independence relations in the graph. So far we have seen three cases, as summarized in the next module.

SLIDE 70

Module 17.8: Independencies encoded by a Bayesian network (Case 3: Node and its descendants)

SLIDE 71

[Figure: the student network]

¬(G ⊥ D),  ¬(G ⊥ I),  ¬(S ⊥ I),  ¬(L ⊥ G)
A node is not independent of its parents.

¬((G ⊥ D, I) | S, L),  ¬((S ⊥ I) | D, G, L),  ¬((L ⊥ G) | D, I, S)
A node is not independent of its parents even when other variables are given.

(S ⊥ G) | I?  (L ⊥ D, I, S) | G?  (G ⊥ L) | D, I?
A node seems to be independent of other variables given its parents.

SLIDE 72

[Figure: the student network]

Let us inspect this last rule. Is (G ⊥ L) | D, I? If you know that d = 0 and i = 1, then you would expect the student to get a good grade. But now, if someone tells you that the student got a poor letter, your belief will change. So ¬((G ⊥ L) | D, I): the effect (letter) actually gives us information about the cause (grade).

SLIDE 73

[Figure: the student network]

¬(G ⊥ D),  ¬(G ⊥ I),  ¬(S ⊥ I),  ¬(L ⊥ G)
A node is not independent of its parents.

¬((G ⊥ D, I) | S, L),  ¬((S ⊥ I) | D, G, L),  ¬((L ⊥ G) | D, I, S)
A node is not independent of its parents even when other variables are given.

(S ⊥ G) | I,  (L ⊥ D, I, S) | G,  but ¬((G ⊥ L) | D, I)
Given its parents, a node is independent of all variables except its descendants.

SLIDE 74

Module 17.9: Bayesian Networks: Formal Semantics

SLIDE 75

We are now ready to formally define the semantics of a Bayesian network.

Bayesian Network Semantics: A Bayesian network structure G is a directed acyclic graph whose nodes represent random variables X1, X2, ..., Xn. Let Pa_G(Xi) denote the parents of Xi in G, and let NonDescendants(Xi) denote the variables in the graph that are not descendants of Xi. Then G encodes the following set of conditional independence assumptions, called the local independencies and denoted by Iℓ(G): for each variable Xi,

(Xi ⊥ NonDescendants(Xi) | Pa_G(Xi))

SLIDE 76

We will see some more formal definitions and then return to the question of independencies.

SLIDE 77

Module 17.10: I Maps

SLIDE 78

[Figure: the student network]

Let P be a joint distribution over X = {X1, X2, ..., Xn}. We define I(P) as the set of independence assumptions that hold in P. For example: I(P) = {(G ⊥ S | I, D), ...}. Each element of this set is of the form Xi ⊥ Xj | Z, with Z ⊆ X \ {Xi, Xj}. Let I(G) be the set of independence assumptions associated with a graph G.

SLIDE 79

[Figure: the student network]

We say that G is an I-map for P if I(G) ⊆ I(P): G does not mislead us about independencies in P. Any independence that G states must hold in P, but P can have additional independencies.

SLIDE 80

X    Y    P(X, Y)
0    0    0.08
0    1    0.32
1    0    0.12
1    1    0.48

Consider this joint distribution over X, Y. We need to find a G which is an I-map for this P. How do we find such a G?

SLIDE 81

[Table: the joint P(X, Y) from the previous slide]

Well, since there are only 2 variables here, the only possibilities are I(P) = {(X ⊥ Y)} or I(P) = ∅. From the table we can easily check that P(X, Y) = P(X)·P(Y), so I(P) = {(X ⊥ Y)}. Now, can you come up with a G which satisfies I(G) ⊆ I(P)?

SLIDE 82

[Figure: three candidate graphs. G1: X → Y. G2: X ← Y. G3: X and Y disconnected]
I(G1) = ∅,  I(G2) = ∅,  I(G3) = {(X ⊥ Y)}

Since we have only two variables, there are only 3 possibilities for G. Which of these is an I-map for P? Well, all three are I-maps for P: they all satisfy the condition I(G) ⊆ I(P).
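As a small sanity check (my own sketch), one can verify numerically that this P indeed satisfies X ⊥ Y, which is what makes even G3 an I-map:

```python
joint = {(0, 0): 0.08, (0, 1): 0.32, (1, 0): 0.12, (1, 1): 0.48}

p_x = {x: sum(p for (x2, _), p in joint.items() if x2 == x) for x in (0, 1)}
p_y = {y: sum(p for (_, y2), p in joint.items() if y2 == y) for y in (0, 1)}

# X ⊥ Y holds iff the joint equals the product of the marginals everywhere.
print(all(abs(joint[(x, y)] - p_x[x] * p_y[y]) < 1e-9
          for x in (0, 1) for y in (0, 1)))  # True
```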

SLIDE 83

[Table: the joint P(X, Y) from the previous slides]

Of course, this was just a toy example. In practice, we do not know P and hence can't compute I(P). We just make some assumptions about I(P) and then construct a G such that I(G) ⊆ I(P).

SLIDE 84

[Figure: the student network]

So why do we care about I-maps? If G is an I-map for a joint distribution P, then P factorizes over G. What does that mean? Well, it just means that P can be written as a product of factors, where each factor is a conditional probability distribution (c.p.d.) associated with a node of G.

SLIDE 85

Theorem. Let G be a BN structure over a set of random variables X and let P be a joint distribution over these variables. If G is an I-map for P, then P factorizes according to G. (Proof: exercise)

Theorem. Let G be a BN structure over a set of random variables X and let P be a joint distribution over these variables. If P factorizes according to G, then G is an I-map of P. (Proof: exercise)

SLIDE 86

Consider a set of random variables X1, X2, X3, X4, X5. There are many joint distributions possible, and each may entail different independence relations (for example, in some cases L could be independent of S; in some, not). Can you think of a G which will be an I-map for any distribution over these variables?

Answer: a complete graph. For example, the factorization entailed by the complete graph with ordering X3, X5, X1, X2, X4 is

P(X3) P(X5|X3) P(X1|X3, X5) P(X2|X1, X3, X5) P(X4|X1, X2, X3, X5)

which is just the chain rule of probability, and the chain rule holds for any distribution. Hence I(G) = ∅ ⊆ I(P) for every P.
