
Directed Graphical Models + Undirected Graphical Models (Matt Gormley) - PowerPoint PPT Presentation



  1. 10-418 / 10-618 Machine Learning for Structured Data, Machine Learning Department, School of Computer Science, Carnegie Mellon University. Directed Graphical Models + Undirected Graphical Models. Matt Gormley, Lecture 7, Sep. 18, 2019.

  2. Q&A Q: How will I earn the 5% participation points? A: Very gradually. There will be a few aspects of the course (polls, surveys, meetings with the course staff) that we will attach participation points to. That said, we might not actually use the whole 5% that is being held out. 2

  3. Q&A Q: When should I prefer a directed graphical model to an undirected graphical model? A: As we’ll see today, the primary differences between them are: 1. the conditional independence assumptions they define 2. the normalization assumptions they make (Bayes Nets are locally normalized) (That said, we’ll also tie them together via a single framework: factor graphs.) There are also some practical differences (e.g. ease of learning) that result from the locally vs. globally normalized difference. 3

  4. Reminders • Homework 1: DAgger for seq2seq – Out: Thu, Sep. 12 – Due: Thu, Sep. 26 at 11:59pm

  5. SUPERVISED LEARNING FOR BAYES NETS

  6. Recipe for Closed-form MLE
     1. Assume data was generated i.i.d. from some model (i.e. write the generative story): x^(i) ~ p(x | θ)
     2. Write the log-likelihood: ℓ(θ) = log p(x^(1) | θ) + … + log p(x^(N) | θ)
     3. Compute partial derivatives (i.e. the gradient): ∂ℓ(θ)/∂θ_1 = …, ∂ℓ(θ)/∂θ_2 = …, …, ∂ℓ(θ)/∂θ_M = …
     4. Set the derivatives to zero and solve for θ: ∂ℓ(θ)/∂θ_m = 0 for all m ∈ {1, …, M}; θ_MLE is the solution to this system of M equations in M unknowns
     5. Compute the second derivative and check that ℓ(θ) is concave down at θ_MLE
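As a concrete illustration of the recipe (not on the slide), here is the closed-form MLE for a Bernoulli model with parameter θ, assuming N i.i.d. observations of which N_1 are ones:

```latex
% Step 1: generative story: x^{(i)} ~ Bernoulli(theta)
% Step 2: log-likelihood
\ell(\theta) = \sum_{i=1}^{N} \log p(x^{(i)} \mid \theta)
             = N_1 \log\theta + (N - N_1)\log(1-\theta)
% Step 3: partial derivative
\frac{\partial \ell(\theta)}{\partial \theta} = \frac{N_1}{\theta} - \frac{N - N_1}{1-\theta}
% Step 4: set to zero and solve
\frac{N_1}{\theta} = \frac{N - N_1}{1-\theta}
  \;\Rightarrow\; \theta_{\mathrm{MLE}} = \frac{N_1}{N}
% Step 5: the second derivative is negative everywhere, so ell is concave down
\frac{\partial^2 \ell(\theta)}{\partial \theta^2}
  = -\frac{N_1}{\theta^2} - \frac{N - N_1}{(1-\theta)^2} < 0
```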

  7. Machine Learning. [Diagram relating Domain Knowledge, Mathematical Modeling, ML, Optimization, and Combinatorial Optimization.] The data inspires the structures we want to predict. Our model defines a score for each structure; it also tells us what to optimize. Inference finds the { best structure, marginals, partition function } for a new observation. Learning tunes the parameters of the model. (Inference is usually called as a subroutine in learning.)

  8. Machine Learning. [Diagram: Data (the sentence "time flies like an arrow"), Model (a Bayes net over X1, …, X5), Objective, Inference, Learning.] (Inference is usually called as a subroutine in learning.)

  9. Learning Fully Observed BNs. [Bayes net over X1, …, X5.] p(X1, X2, X3, X4, X5) = p(X5 | X3) p(X4 | X2, X3) p(X3) p(X2 | X1) p(X1)

  10. Learning Fully Observed BNs. p(X1, X2, X3, X4, X5) = p(X5 | X3) p(X4 | X2, X3) p(X3) p(X2 | X1) p(X1)

  11. Learning Fully Observed BNs. p(X1, X2, X3, X4, X5) = p(X5 | X3) p(X4 | X2, X3) p(X3) p(X2 | X1) p(X1). How do we learn these conditional and marginal distributions for a Bayes Net?

  12. Learning Fully Observed BNs. Learning this fully observed Bayesian Network, p(X1, X2, X3, X4, X5) = p(X5 | X3) p(X4 | X2, X3) p(X3) p(X2 | X1) p(X1), is equivalent to learning five (small / simple) independent networks from the same data. [Diagram: the full network over X1, …, X5 split into one small network per factor.]

  13. Learning Fully Observed BNs. How do we learn these conditional and marginal distributions for a Bayes Net?
      θ* = argmax_θ log p(X1, X2, X3, X4, X5)
         = argmax_θ [ log p(X5 | X3, θ5) + log p(X4 | X2, X3, θ4) + log p(X3 | θ3) + log p(X2 | X1, θ2) + log p(X1 | θ1) ]
      Because the log-likelihood decomposes, each parameter can be estimated separately:
      θ1* = argmax_{θ1} log p(X1 | θ1)
      θ2* = argmax_{θ2} log p(X2 | X1, θ2)
      θ3* = argmax_{θ3} log p(X3 | θ3)
      θ4* = argmax_{θ4} log p(X4 | X2, X3, θ4)
      θ5* = argmax_{θ5} log p(X5 | X3, θ5)
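For discrete variables, each of these per-factor maximizations has a closed-form solution: count and normalize. Below is a minimal sketch (not from the slides); the tiny dataset and the 0-based variable indexing are hypothetical.

```python
from collections import Counter

# Hypothetical fully observed data for the 5-variable net above:
# each row is an assignment (x1, x2, x3, x4, x5) of binary values.
data = [
    (0, 1, 0, 1, 0),
    (1, 1, 0, 0, 0),
    (0, 0, 1, 1, 1),
    (1, 1, 1, 1, 0),
]

# Parent sets matching the factorization
# p(X1) p(X2|X1) p(X3) p(X4|X2,X3) p(X5|X3).
parents = {0: (), 1: (0,), 2: (), 3: (1, 2), 4: (2,)}

def mle_cpts(data, parents):
    """Estimate each CPT p(X_i | parents(X_i)) by counting and normalizing."""
    joint_counts = {i: Counter() for i in parents}   # counts of (parent values, x_i)
    parent_counts = {i: Counter() for i in parents}  # counts of parent values alone
    for row in data:
        for i, pa in parents.items():
            pa_vals = tuple(row[j] for j in pa)
            joint_counts[i][(pa_vals, row[i])] += 1
            parent_counts[i][pa_vals] += 1
    cpts = {i: {} for i in parents}
    for i in parents:
        for (pa_vals, xi), count in joint_counts[i].items():
            cpts[i][(pa_vals, xi)] = count / parent_counts[i][pa_vals]
    return cpts

cpts = mle_cpts(data, parents)
print(cpts[1])  # p(X2 | X1), keyed by ((x1,), x2)
```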

  14. Learning Fully Observed BNs

  15. INFERENCE FOR BAYESIAN NETWORKS

  16. A Few Problems for Bayes Nets. Suppose we already have the parameters of a Bayesian Network…
      1. How do we compute the probability of a specific assignment to the variables? P(T=t, H=h, A=a, C=c)
      2. How do we draw a sample from the joint distribution? t, h, a, c ∼ P(T, H, A, C)
      3. How do we compute marginal probabilities? P(A) = …
      4. How do we draw samples from a conditional distribution? t, h, a ∼ P(T, H, A | C = c)
      5. How do we compute conditional marginal probabilities? P(H | C = c) = …
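Problems 1 and 2 follow directly from the factorization. Here is a sketch in Python (not from the slides); the chain structure T → H → A → C and all CPT numbers are assumptions made purely for illustration.

```python
import random

# Hypothetical CPTs for a tiny chain T -> H -> A -> C (all binary);
# the numbers are made up for illustration only.
p_T = {1: 0.3, 0: 0.7}
p_H_given_T = {1: {1: 0.8, 0: 0.2}, 0: {1: 0.1, 0: 0.9}}   # p(H=h | T=t)
p_A_given_H = {1: {1: 0.7, 0: 0.3}, 0: {1: 0.05, 0: 0.95}}  # p(A=a | H=h)
p_C_given_A = {1: {1: 0.9, 0: 0.1}, 0: {1: 0.2, 0: 0.8}}    # p(C=c | A=a)

def joint_prob(t, h, a, c):
    """Problem 1: the probability of a full assignment is a product of local factors."""
    return p_T[t] * p_H_given_T[t][h] * p_A_given_H[h][a] * p_C_given_A[a][c]

def ancestral_sample():
    """Problem 2: sample each variable given its (already sampled) parents."""
    t = 1 if random.random() < p_T[1] else 0
    h = 1 if random.random() < p_H_given_T[t][1] else 0
    a = 1 if random.random() < p_A_given_H[h][1] else 0
    c = 1 if random.random() < p_C_given_A[a][1] else 0
    return t, h, a, c

print(joint_prob(1, 1, 0, 0))   # P(T=1, H=1, A=0, C=0)
print(ancestral_sample())       # one draw from the joint distribution
```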

  17. GRAPHICAL MODELS: DETERMINING CONDITIONAL INDEPENDENCIES

  18. What Independencies does a Bayes Net Model? In order for a Bayesian network to model a probability distribution, the following must be true: each variable is conditionally independent of all its non-descendants in the graph given the value of all its parents. This follows from P(X1, …, Xn) = ∏_{i=1}^{n} P(Xi | parents(Xi)) = ∏_{i=1}^{n} P(Xi | X1, …, X_{i-1}). But what else does it imply? Slide from William Cohen
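To make the factorization concrete for the five-variable network used earlier (an illustration, assuming the topological ordering X1, …, X5):

```latex
% Chain rule (always true):
P(X_1,\dots,X_5) = P(X_1)\,P(X_2 \mid X_1)\,P(X_3 \mid X_1, X_2)\,
                   P(X_4 \mid X_1, X_2, X_3)\,P(X_5 \mid X_1, X_2, X_3, X_4)
% Bayes net factorization (uses the conditional independence assumptions,
% e.g. X_3 independent of \{X_1, X_2\}, and X_5 independent of \{X_1, X_2, X_4\} given X_3):
P(X_1,\dots,X_5) = P(X_1)\,P(X_2 \mid X_1)\,P(X_3)\,P(X_4 \mid X_2, X_3)\,P(X_5 \mid X_3)
```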

  19. What Independencies does a Bayes Net Model? Three cases of interest… Cascade (X → Y → Z), Common Parent (X ← Y → Z), V-Structure (X → Y ← Z).

  20. What Independencies does a Bayes Net Model? Three cases of interest… Cascade (X → Y → Z): X ⊥ Z | Y. Common Parent (X ← Y → Z): X ⊥ Z | Y. V-Structure (X → Y ← Z): X and Z are not conditionally independent given Y. Knowing Y decouples X and Z in the first two cases, but couples X and Z in the v-structure.

  21. Whiteboard: proof of conditional independence, X ⊥ Z | Y, for the Common Parent case (X ← Y → Z). (The other two cases can be shown just as easily.)
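A sketch of that whiteboard argument, assuming the factorization p(x, y, z) = p(y) p(x | y) p(z | y) given by the common parent graph:

```latex
p(x, z \mid y) = \frac{p(x, y, z)}{p(y)}
               = \frac{p(y)\,p(x \mid y)\,p(z \mid y)}{p(y)}
               = p(x \mid y)\,p(z \mid y)
\quad\Rightarrow\quad X \perp Z \mid Y
```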

  22. The "Burglar Alarm" example. Your house has a twitchy burglar alarm that is also sometimes triggered by earthquakes. Earth arguably doesn't care whether your house is currently being burgled. While you are on vacation, one of your neighbors calls and tells you your home's burglar alarm is ringing. Uh oh! [Bayes net: Burglar → Alarm ← Earthquake, Alarm → PhoneCall.] Quiz: True or False? Burglar ⊥ Earthquake | PhoneCall. Slide from William Cohen
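The v-structure at the alarm is what makes the quiz interesting: conditioning on the phone call couples Burglar and Earthquake (explaining away). A brute-force enumeration sketch with made-up CPT numbers (not from the slide) shows the effect:

```python
from itertools import product

# Hypothetical CPTs for Burglar -> Alarm <- Earthquake, Alarm -> PhoneCall;
# all probabilities are invented purely to illustrate explaining away.
p_B = {1: 0.01, 0: 0.99}
p_E = {1: 0.02, 0: 0.98}
p_A = {(1, 1): 0.95, (1, 0): 0.90, (0, 1): 0.30, (0, 0): 0.01}  # p(A=1 | B, E)
p_C = {1: 0.80, 0: 0.05}                                        # p(C=1 | A)

def joint(b, e, a, c):
    pa = p_A[(b, e)] if a == 1 else 1 - p_A[(b, e)]
    pc = p_C[a] if c == 1 else 1 - p_C[a]
    return p_B[b] * p_E[e] * pa * pc

def cond_burglar(evidence):
    """P(Burglar=1 | evidence) by brute-force enumeration over all variables."""
    num, den = 0.0, 0.0
    for b, e, a, c in product([0, 1], repeat=4):
        assignment = {"B": b, "E": e, "A": a, "C": c}
        if any(assignment[k] != v for k, v in evidence.items()):
            continue
        p = joint(b, e, a, c)
        den += p
        if b == 1:
            num += p
    return num / den

print(cond_burglar({"C": 1}))          # P(B=1 | phone call)
print(cond_burglar({"C": 1, "E": 1}))  # P(B=1 | phone call, earthquake): lower, explaining away
```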

  23. Markov Blanket (Directed). Def: the co-parents of a node are the parents of its children. Def: the Markov Blanket of a node in a directed graphical model is the set containing the node's parents, children, and co-parents. [Example graph over X1, …, X13.]

  24. Markov Blanket (Directed). Def: the co-parents of a node are the parents of its children. Def: the Markov Blanket of a node in a directed graphical model is the set containing the node's parents, children, and co-parents. Example: the Markov Blanket of X6 is { X3, X4, X5, X8, X9, X10 }, comprising its parents, children, and co-parents in the example graph.

  25. Markov Blanket (Directed). Def: the co-parents of a node are the parents of its children. Def: the Markov Blanket of a node in a directed graphical model is the set containing the node's parents, children, and co-parents. Example: the Markov Blanket of X6 is { X3, X4, X5, X8, X9, X10 }. Theorem: a node is conditionally independent of every other node in the graph given its Markov blanket.
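A small sketch (not from the slides) of computing a node's Markov blanket from a parents-list representation, using the five-variable factorization from the earlier slides:

```python
# Directed graph as a map node -> list of parents, matching the earlier
# factorization p(X1) p(X2|X1) p(X3) p(X4|X2,X3) p(X5|X3).
parents = {
    "X1": [],
    "X2": ["X1"],
    "X3": [],
    "X4": ["X2", "X3"],
    "X5": ["X3"],
}

def markov_blanket(node, parents):
    """Parents, children, and co-parents (other parents of the node's children)."""
    children = [n for n, ps in parents.items() if node in ps]
    co_parents = {p for c in children for p in parents[c] if p != node}
    return set(parents[node]) | set(children) | co_parents

print(markov_blanket("X3", parents))  # {'X2', 'X4', 'X5'}
```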

  26. D-Separation. If variables X and Z are d-separated given a set of variables E, then X and Z are conditionally independent given the set E. Definition #1: Variables X and Z are d-separated given a set of evidence variables E iff every path from X to Z is "blocked". A path is "blocked" whenever:
      1. ∃ Y on the path s.t. Y ∈ E and Y is a "common parent" (X … ← Y → … Z)
      2. ∃ Y on the path s.t. Y ∈ E and Y is in a "cascade" (X … → Y → … Z)
      3. ∃ Y on the path s.t. neither Y nor any of Y's descendants is in E, and Y is in a "v-structure" (X … → Y ← … Z)

  27. D-Separation. If variables X and Z are d-separated given a set of variables E, then X and Z are conditionally independent given the set E. Definition #2: Variables X and Z are d-separated given a set of evidence variables E iff there is no path between X and Z in the undirected ancestral moral graph with E removed.
      1. Ancestral graph: keep only X, Z, E and their ancestors
      2. Moral graph: add an undirected edge between all pairs of each node's parents
      3. Undirected graph: convert all directed edges to undirected
      4. Givens removed: delete any nodes in E
      Example query: A ⫫ B | {D, E}. [Figures: the original graph over A, B, C, D, E, F and its ancestral, moral, undirected, and givens-removed versions.] A and B remain connected ⇒ not d-separated.
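A minimal sketch of Definition #2 as code (not from the slides), assuming a parents-list graph representation: build the undirected ancestral moral graph, remove the evidence nodes, and check connectivity.

```python
def d_separated(x, z, evidence, parents):
    """Test d-separation via the undirected ancestral moral graph (Definition #2)."""
    # 1. Ancestral graph: keep x, z, the evidence, and all their ancestors.
    keep, stack = set(), [x, z, *evidence]
    while stack:
        n = stack.pop()
        if n not in keep:
            keep.add(n)
            stack.extend(parents[n])
    # 2-3. Moralize (marry each node's parents) and drop edge directions.
    edges = set()
    for n in keep:
        ps = [p for p in parents[n] if p in keep]
        edges.update(frozenset((n, p)) for p in ps)
        edges.update(frozenset((p, q)) for p in ps for q in ps if p != q)
    # 4. Remove the given (evidence) nodes, then check connectivity by DFS.
    nodes = keep - set(evidence)
    adj = {n: set() for n in nodes}
    for e in edges:
        a, b = tuple(e)
        if a in nodes and b in nodes:
            adj[a].add(b)
            adj[b].add(a)
    seen, stack = {x}, [x]
    while stack:
        for m in adj[stack.pop()]:
            if m not in seen:
                seen.add(m)
                stack.append(m)
    return z not in seen

# Using the 5-variable net from the earlier slides:
parents = {"X1": [], "X2": ["X1"], "X3": [], "X4": ["X2", "X3"], "X5": ["X3"]}
print(d_separated("X1", "X5", ["X3"], parents))  # True: X3 blocks the only path
```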

  28. Learning Objectives: Bayesian Networks. You should be able to…
      1. Identify the conditional independence assumptions given by a generative story or a specification of a joint distribution
      2. Draw a Bayesian network given a set of conditional independence assumptions
      3. Define the joint distribution specified by a Bayesian network
      4. Use domain knowledge to construct a (simple) Bayesian network for a real-world modeling problem
      5. Depict familiar models as Bayesian networks
      6. Use d-separation to prove the existence of conditional independencies in a Bayesian network
      7. Employ a Markov blanket to identify conditional independence assumptions of a graphical model
      8. Develop a supervised learning algorithm for a Bayesian network

  29. TYPES OF GRAPHICAL MODELS

  30. Three Types of Graphical Models: Directed Graphical Model, Undirected Graphical Model, Factor Graph. [Diagram: the same set of variables drawn as each of the three types of graph.]
