CS7015 (Deep Learning) : Lecture 17


CS7015 (Deep Learning) : Lecture 17. Recap of Probability Theory, Bayesian Networks, Conditional Independence in Bayesian Networks. Mitesh M. Khapra, Department of Computer Science and Engineering, Indian Institute of Technology Madras.


1. CS7015 (Deep Learning) : Lecture 17. Recap of Probability Theory, Bayesian Networks, Conditional Independence in Bayesian Networks. Mitesh M. Khapra, Department of Computer Science and Engineering, Indian Institute of Technology Madras.

2. Module 17.0: Recap of Probability Theory

3. We will start with a quick recap of some basic concepts from probability.

4. Axioms of Probability. For any event A, P(A) ≥ 0. If A1, A2, A3, ..., An are disjoint events (i.e., Ai ∩ Aj = ∅ ∀ i ≠ j), then
P(∪i Ai) = Σi P(Ai)
If Ω is the universal set containing all events, then P(Ω) = 1. (Figure: disjoint events A1, ..., A5 drawn as non-overlapping regions inside Ω.)
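The three axioms are easy to check mechanically for a finite sample space. Below is a minimal Python sketch (not from the lecture; the outcomes and weights are made up for illustration):

```python
# Made-up finite sample space: each outcome gets a probability.
omega = {"w1": 0.2, "w2": 0.5, "w3": 0.3}

def P(event):
    """P(A) = sum of the probabilities of the outcomes in A."""
    return sum(omega[w] for w in event)

# Axiom 1: P(A) >= 0 for any event A.
assert all(p >= 0.0 for p in omega.values())

# Axiom 2: P(Omega) = 1 for the universal set.
assert abs(P(omega.keys()) - 1.0) < 1e-12

# Axiom 3: for disjoint events, P(A1 ∪ A2) = P(A1) + P(A2).
A1, A2 = {"w1"}, {"w2", "w3"}            # A1 ∩ A2 = ∅
assert abs(P(A1 | A2) - (P(A1) + P(A2))) < 1e-12
```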

5. Random Variable (intuition). Suppose a student can get one of 3 possible grades in a course: A, B, C. One way of interpreting this is that there are 3 possible events here. Another way of looking at it is that there is a random variable G which maps each student to one of the 3 possible values, and we are interested in P(G = g) where g ∈ {A, B, C}. Of course, both interpretations are conceptually equivalent. (Figure: G maps outcomes in Ω to the grades A, B, C.)

6. Random Variable (intuition). But the second one (using random variables) is more compact, especially when there are multiple attributes associated with a student (outcome): grade, height, age, etc. We could have one random variable corresponding to each attribute, and then ask for outcomes (or students) where Grade = g, Height = h, Age = a, and so on. (Figure: random variables G (grades: A, B, C), H (height: Short, Tall) and A (age: Young, Adult), each mapping Ω to its values.)

7. Random Variable (formal). A random variable is a function which maps each outcome in Ω to a value. In the previous example, G (or f_grade) maps each student in Ω to a value: A, B or C. The event Grade = A is a shorthand for the event {ω ∈ Ω : f_grade(ω) = A}.
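Since a random variable is literally a function on Ω, it is natural to code it as a mapping from outcomes to values. A small sketch (hypothetical student roster, uniform distribution over students):

```python
# Hypothetical roster: the function f_grade maps each outcome (student) to a grade.
f_grade = {"s1": "A", "s2": "B", "s3": "B", "s4": "C"}

# The event "Grade = A" is shorthand for {ω ∈ Ω : f_grade(ω) = A}.
event_A = {w for w, g in f_grade.items() if g == "A"}

# With a uniform distribution over outcomes, P(G = A) = |event| / |Ω|.
print(event_A, len(event_A) / len(f_grade))   # {'s1'} 0.25
```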

8. Random Variable (continuous v/s discrete). A random variable can take either continuous values (for example, weight, height) or discrete values (for example, grade, nationality). For this discussion we will mainly focus on discrete random variables. (Figure: H maps Ω to heights from 120cm to 200cm, W to weights from 45kg to 120kg.)

9. Marginal Distribution. What do we mean by the marginal distribution over a random variable? Consider our random variable G for grades. Specifying the marginal distribution over G means specifying P(G = g) ∀ g ∈ {A, B, C}. We denote this marginal distribution compactly by P(G). For example:
G   P(G = g)
A   0.1
B   0.2
C   0.7
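In code, a marginal distribution is just a table with one probability per value of g; a tiny sketch using the numbers from the slide:

```python
# Marginal distribution P(G): one probability per value g ∈ {A, B, C}.
P_G = {"A": 0.1, "B": 0.2, "C": 0.7}

# A valid marginal must sum to 1 over all values of G.
assert abs(sum(P_G.values()) - 1.0) < 1e-12
```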

10. Joint Distribution. Consider two random variables G (grade) and I (intelligence ∈ {High, Low}). The joint distribution over these two random variables assigns probabilities to all events involving them:
P(G = g, I = i) ∀ (g, i) ∈ {A, B, C} × {H, L}
We denote this joint distribution compactly by P(G, I). For example:
G   I      P(G = g, I = i)
A   High   0.30
A   Low    0.10
B   High   0.15
B   Low    0.15
C   High   0.10
C   Low    0.20
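Similarly, the joint P(G, I) is a table with one probability per (g, i) pair; a sketch using the slide's numbers:

```python
# Joint distribution P(G, I): one probability per pair (g, i).
P_GI = {
    ("A", "High"): 0.30, ("A", "Low"): 0.10,
    ("B", "High"): 0.15, ("B", "Low"): 0.15,
    ("C", "High"): 0.10, ("C", "Low"): 0.20,
}

# The joint covers all events over both variables, so it also sums to 1.
assert abs(sum(P_GI.values()) - 1.0) < 1e-12
```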

11. Conditional Distribution. Consider two random variables G (grade) and I (intelligence). Suppose we are given the value of I (say, I = H); then the conditional distribution P(G | I) is defined as
P(G = g | I = H) = P(G = g, I = H) / P(I = H) ∀ g ∈ {A, B, C}
More compactly, P(G | I) = P(G, I) / P(I), or P(G, I) = P(G | I) · P(I), i.e., joint = conditional × marginal. For example:
G   P(G | I = H)
A   0.6
B   0.3
C   0.1

G   P(G | I = L)
A   0.3
B   0.4
C   0.3
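Conditioning amounts to restricting the joint table to the rows with I = H and renormalizing by P(I = H). A sketch that derives P(G | I = H) from the joint table of the previous slide (note that the conditional tables shown on this slide are a separate illustration, so the numbers need not match):

```python
# Joint P(G, I) from the previous slide.
P_GI = {
    ("A", "High"): 0.30, ("A", "Low"): 0.10,
    ("B", "High"): 0.15, ("B", "Low"): 0.15,
    ("C", "High"): 0.10, ("C", "Low"): 0.20,
}

# P(I = High): marginalize G out of the joint.
p_I_high = sum(p for (g, i), p in P_GI.items() if i == "High")

# P(G = g | I = High) = P(G = g, I = High) / P(I = High) for each g.
P_G_given_high = {g: P_GI[(g, "High")] / p_I_high for g in ("A", "B", "C")}
print(P_G_given_high)   # {'A': 0.545..., 'B': 0.272..., 'C': 0.181...}
```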

12. Joint Distribution (n random variables). The joint distribution of n random variables X1, ..., Xn assigns probabilities to all events involving the n random variables. In other words, it assigns
P(X1 = x1, X2 = x2, ..., Xn = xn)
for all possible values that each variable Xi can take, and these probabilities sum to 1. If each random variable Xi can take two values, then the joint distribution will assign probabilities to the 2^n possible events. (Figure: a table with one row per assignment (x1, ..., xn) and a column P(X1, X2, ..., Xn) whose entries sum to 1.)
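For n binary variables, the explicit table has one row per assignment; a quick sketch enumerating all 2^n events (the uniform probabilities are purely illustrative):

```python
from itertools import product

n = 3
# One row per assignment (x1, ..., xn): 2**n rows in total.
events = list(product((0, 1), repeat=n))
P = {e: 1 / 2**n for e in events}     # illustrative: uniform joint

print(len(events))                           # 8 == 2**3 events
assert abs(sum(P.values()) - 1.0) < 1e-12    # the probability column sums to 1
```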

13. Joint Distribution (n random variables). The joint distribution over two random variables X1 and X2 can be written as
P(X1, X2) = P(X2 | X1) P(X1) = P(X1 | X2) P(X2)
Similarly, for n random variables,
P(X1, X2, ..., Xn) = P(X2, ..., Xn | X1) P(X1)
                   = P(X3, ..., Xn | X1, X2) P(X2 | X1) P(X1)
                   = P(X4, ..., Xn | X1, X2, X3) P(X3 | X2, X1) P(X2 | X1) P(X1)
                   = P(X1) ∏(i=2 to n) P(Xi | X1, ..., Xi−1)   (chain rule)
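The chain rule can be verified numerically: build an arbitrary joint over three binary variables, form the conditionals, and multiply them back together. A sketch (random numbers, nothing from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary joint P(X1, X2, X3) over three binary variables, summing to 1.
P = rng.random((2, 2, 2))
P /= P.sum()

# Pieces of the chain-rule factorization.
P_x1 = P.sum(axis=(1, 2))                          # P(X1)
P_x2_given_x1 = P.sum(axis=2) / P_x1[:, None]      # P(X2 | X1)
P_x3_given_x12 = P / P.sum(axis=2, keepdims=True)  # P(X3 | X1, X2)

# Chain rule: P(X1, X2, X3) = P(X1) P(X2 | X1) P(X3 | X1, X2).
reconstructed = (P_x1[:, None, None]
                 * P_x2_given_x1[:, :, None]
                 * P_x3_given_x12)
assert np.allclose(P, reconstructed)
```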

14. From Joint Distributions to Marginal Distributions. Suppose we are given a joint distribution over two random variables A, B. The marginal distributions of A and B can be computed as
P(A = a) = Σ_b P(A = a, B = b)
P(B = b) = Σ_a P(A = a, B = b)
More compactly written as P(A) = Σ_B P(A, B) and P(B) = Σ_A P(A, B). For example:
A      B      P(A = a, B = b)
High   High   0.30
High   Low    0.25
Low    High   0.35
Low    Low    0.10

A      P(A = a)
High   0.55
Low    0.45

B      P(B = b)
High   0.65
Low    0.35
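Marginalizing the slide's table in code reproduces the P(A) and P(B) columns:

```python
# Joint P(A, B) from the slide.
P_AB = {
    ("High", "High"): 0.30, ("High", "Low"): 0.25,
    ("Low", "High"): 0.35, ("Low", "Low"): 0.10,
}

# P(A = a) = sum over b of P(A = a, B = b), and symmetrically for B.
P_A = {a: sum(p for (x, y), p in P_AB.items() if x == a) for a in ("High", "Low")}
P_B = {b: sum(p for (x, y), p in P_AB.items() if y == b) for b in ("High", "Low")}
print(P_A)  # {'High': 0.55, 'Low': 0.45}
print(P_B)  # {'High': 0.65, 'Low': 0.35}
```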

15. What if there are n random variables? Suppose we are given a joint distribution over n random variables X1, X2, ..., Xn. The marginal distribution over X1 can be computed as
P(X1 = x1) = Σ_{x2, x3, ..., xn} P(X1 = x1, X2 = x2, ..., Xn = xn)
More compactly written as
P(X1) = Σ_{X2, X3, ..., Xn} P(X1, X2, ..., Xn)
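With n variables it is convenient to store the joint as an n-dimensional array, so that summing out x2, ..., xn is a single reduction. A sketch with an arbitrary (made-up) joint over five binary variables:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 5
# An arbitrary joint P(X1, ..., X5), normalized so the 2**n entries sum to 1.
P = rng.random((2,) * n)
P /= P.sum()

# P(X1 = x1) = Σ over x2, ..., xn of P(x1, x2, ..., xn):
# sum out every axis except the first.
P_X1 = P.sum(axis=tuple(range(1, n)))
print(P_X1, P_X1.sum())   # a length-2 marginal that sums to 1
```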

16. Conditional Independence. Two random variables X and Y are said to be independent if P(X | Y) = P(X). We denote this as X ⊥⊥ Y. In other words, knowing the value of Y does not change our belief about X. Recall that by the chain rule of probability, P(X, Y) = P(X) P(Y | X); however, if X and Y are independent, then P(X, Y) = P(X) P(Y). For example, we would expect Grade to be dependent on Intelligence but independent of Weight.
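Independence can be tested by comparing the joint with the product of its marginals. A sketch with made-up numbers:

```python
import numpy as np

# A joint built as an outer product of marginals is independent by construction.
P_X = np.array([0.4, 0.6])
P_Y = np.array([0.7, 0.3])
P_XY = np.outer(P_X, P_Y)                       # P(X, Y) = P(X) P(Y)
assert np.allclose(P_XY, np.outer(P_XY.sum(1), P_XY.sum(0)))

# Perturbing it breaks the factorization, i.e., X and Y become dependent.
Q = P_XY + np.array([[0.05, -0.05], [-0.05, 0.05]])
assert not np.allclose(Q, np.outer(Q.sum(1), Q.sum(0)))
```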

17. Okay, we are now ready to move on to Bayesian Networks or Directed Graphical Models.

18. Module 17.1: Why are we interested in Joint Distributions?

19. In many real-world applications, we have to deal with a large number of random variables. For example, an oil company may be interested in computing the probability of finding oil at a particular location. This may depend on various (random) variables, so the company is interested in knowing the joint distribution P(Y, X1, X2, X3, X4, X5, X6). (Figure: Y = Oil, X1 = Salinity, X2 = Depth, X3 = Pressure, X4 = Temperature, X5 = Biodiversity, X6 = Density.)

20. But why the joint distribution? Aren't we just interested in P(Y | X1, X2, ..., Xn)? Well, if we know the joint distribution, we can find answers to a bunch of interesting questions. Let us see some such questions of interest.

21. We can find the conditional distribution,
P(Y | X1, ..., Xn) = P(Y, X1, ..., Xn) / Σ_Y P(Y, X1, ..., Xn)
We can find the marginal distribution,
P(Y) = Σ_{X1, ..., Xn} P(Y, X1, X2, ..., Xn)
And we can find the (conditional) independencies, e.g., check whether P(Y, X1) = P(Y) P(X1).
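Given the full joint as a 7-dimensional array (axis 0 for Y, axes 1 to 6 for X1 to X6), all three queries are array reductions. A sketch with a made-up joint, not the company's actual distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint P(Y, X1, ..., X6): axis 0 is Y, axes 1..6 are X1..X6,
# each variable binary.
P = rng.random((2,) * 7)
P /= P.sum()

# Marginal: P(Y) = sum over x1, ..., x6 of P(Y, X1, ..., X6).
P_Y = P.sum(axis=tuple(range(1, 7)))

# Conditional: P(Y | x1, ..., x6) = P(Y, x1, ..., x6) / sum_Y P(Y, x1, ..., x6).
x = (1, 0, 1, 1, 0, 0)                 # one particular observation of X1..X6
P_Y_given_x = P[(slice(None),) + x]
P_Y_given_x = P_Y_given_x / P_Y_given_x.sum()

# Independence check: is P(Y, X1) = P(Y) P(X1)?  (Almost certainly False here.)
P_YX1 = P.sum(axis=tuple(range(2, 7)))
independent = np.allclose(P_YX1, np.outer(P_Y, P_YX1.sum(axis=0)))
print(P_Y, P_Y_given_x, independent)
```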

22. Module 17.2: How do we represent a joint distribution?

23. Let us return to the case of n random variables. For simplicity, assume each of these variables takes binary values: Y (yes/no), X1 (high/low), X2 (high/low), X3 (deep/shallow), X4 (high/low), X5 (high/low), X6 (high/low). To specify the joint distribution P(Y, X1, X2, X3, X4, X5, X6), we need to specify 2^n − 1 values. Why not 2^n? Because the probabilities must sum to 1, so the last value is determined by the rest. If we specify these 2^n − 1 values, we have an explicit representation of the joint distribution.
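The cost of this explicit representation grows exponentially with n; a quick computation (for the oil example, n = 7 counting Y and X1 to X6):

```python
# Number of values needed to tabulate a joint over n binary variables:
# 2**n entries, minus 1 because they are constrained to sum to 1.
for n in (7, 10, 20, 30):
    print(n, 2**n - 1)   # 7 -> 127, 10 -> 1023, 20 -> 1048575, 30 -> 1073741823
```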
