  1. Undirected Graphical Models Aaron Courville, Université de Montréal

  2. (UNDIRECTED) GRAPHICAL MODELS Overview: • Directed versus undirected graphical models • Conditional independence • Energy function formalism • Maximum likelihood learning • Restricted Boltzmann Machine • Spike-and-slab RBM

  3. Probabilistic Graphical Models • Graphs endowed with a probability distribution - Nodes represent random variables and edges encode conditional independence assumptions • Graphical models express sets of conditional independencies via the graph structure (and conditional independence is useful) • The graph structure plus the associated parameters define the joint probability distribution over the set of nodes/variables. Graph theory + probability theory = probabilistic graphical models.

  4. Probabilistic Graphical Models • Graphical models come in two main flavors: 1. Directed graphical models (a.k.a. Bayes nets, belief networks): - Consist of a set of nodes with arrows (directed edges) between some of the nodes - Arrows encode factorized conditional probability distributions 2. Undirected graphical models (a.k.a. Markov random fields): - Consist of a set of nodes with undirected edges between some of the nodes - Edges (or, more accurately, the lack of edges) encode conditional independence. • Today, we will focus almost exclusively on undirected graphs.

  5. PROBABILITY REVIEW: CONDITIONAL INDEPENDENCE Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y, given the value of Z: for all (i, j, k), P(X = x_i, Y = y_j | Z = z_k) = P(X = x_i | Z = z_k) P(Y = y_j | Z = z_k), i.e. P(X, Y | Z) = P(X | Z) P(Y | Z). Or equivalently (by the product rule): P(X | Y, Z) = P(X | Z) and P(Y | X, Z) = P(Y | Z). Why? Recall from the probability product rule: P(X, Y, Z) = P(X | Y, Z) P(Y | Z) P(Z) = P(X | Z) P(Y | Z) P(Z). Example: P(Thunder | Rain, Lightning) = P(Thunder | Lightning).
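As a quick numeric illustration of the definition above (not from the slides), the sketch below builds a small joint table that satisfies X ⊥ Y | Z by construction and verifies P(X | Y, Z) = P(X | Z). The probability values are made up.

```python
import numpy as np

# Made-up conditionals for binary X, Y, Z, with X and Y independent given Z.
p_z = np.array([0.6, 0.4])                       # P(Z)
p_x_given_z = np.array([[0.9, 0.1],              # P(X | Z=0)
                        [0.3, 0.7]])             # P(X | Z=1)
p_y_given_z = np.array([[0.2, 0.8],
                        [0.5, 0.5]])

# Joint P(X, Y, Z) = P(X | Z) P(Y | Z) P(Z), indexed as [x, y, z].
joint = np.einsum('zx,zy,z->xyz', p_x_given_z, p_y_given_z, p_z)

# Check P(X | Y, Z) == P(X | Z) for every (x, y, z).
p_x_given_yz = joint / joint.sum(axis=0, keepdims=True)      # P(X | Y, Z)
p_xz = joint.sum(axis=1)                                     # P(X, Z)
p_x_given_z_check = p_xz / p_xz.sum(axis=0, keepdims=True)   # P(X | Z)
assert np.allclose(p_x_given_yz, p_x_given_z_check[:, None, :])
print("X is conditionally independent of Y given Z")
```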

  6. TYPES OF GRAPHICAL MODELS [Diagram: probabilistic models contain graphical models, which split into directed and undirected models.]

  7. REPRESENTING CONDITIONAL INDEPENDENCE Some conditional independencies cannot be represented by directed graphical models: ‣ Consider 4 variables: A, B, C, D ‣ How do we represent the pair of conditional independences (A ⊥ C | B, D) and (B ⊥ D | A, C)? [Diagram: candidate directed graphs over A, B, C, D encode only combinations such as (A ⊥ C), (A ⊥ C | B, D), or (B ⊥ D | A); the undirected 4-cycle model encodes both (A ⊥ C | B, D) and (B ⊥ D | A, C).]
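The undirected claim is easy to verify by enumeration for binary variables. The sketch below (made-up pairwise potentials, not from the slides) builds the 4-cycle MRF A-B-C-D-A and checks (A ⊥ C | B, D) directly; the symmetric check of (B ⊥ D | A, C) works the same way.

```python
import itertools
import numpy as np

# Made-up pairwise potentials for the 4-cycle: P ∝ φ1(A,B) φ2(B,C) φ3(C,D) φ4(D,A).
phi1 = np.array([[2.0, 1.0], [1.0, 3.0]])   # φ1(a, b)
phi2 = np.array([[1.0, 4.0], [2.0, 1.0]])   # φ2(b, c)
phi3 = np.array([[3.0, 1.0], [1.0, 2.0]])   # φ3(c, d)
phi4 = np.array([[1.0, 2.0], [5.0, 1.0]])   # φ4(d, a)

# Unnormalized joint over (a, b, c, d) by brute-force enumeration.
joint = np.zeros((2, 2, 2, 2))
for a, b, c, d in itertools.product(range(2), repeat=4):
    joint[a, b, c, d] = phi1[a, b] * phi2[b, c] * phi3[c, d] * phi4[d, a]
joint /= joint.sum()   # normalize by the partition function Z

# P(A, C | B, D) should equal P(A | B, D) P(C | B, D) for every (b, d).
for b, d in itertools.product(range(2), repeat=2):
    p_ac = joint[:, b, :, d] / joint[:, b, :, d].sum()   # P(A, C | b, d)
    p_a = p_ac.sum(axis=1)                               # P(A | b, d)
    p_c = p_ac.sum(axis=0)                               # P(C | b, d)
    assert np.allclose(p_ac, np.outer(p_a, p_c))
print("(A ⊥ C | B, D) holds in the 4-cycle MRF")
```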

  8. WHY UNDIRECTED GRAPHICAL MODELS? Sometimes it is awkward to model phenomena with directed models, e.g. variables arranged on a 2-d lattice such as image pixels. [Diagram: a 5×4 grid of variables X_11 ... X_45, drawn once with directed edges and once with undirected edges.] Image from “CRF as RNN Semantic Image Segmentation Live Demo” (http://www.robots.ox.ac.uk/~szheng/crfasrnndemo/)

  9. CONDITIONAL INDEPENDENCE PROPERTIES • Undirected graphical models: ‣ Conditional independence is encoded by simple graph separation. ‣ Formally, consider 3 sets of nodes A, B and C: we say x_A ⊥ x_B | x_C iff C separates A and B in the graph. ‣ C separates A and B in the graph: if we remove all nodes in C, there is no path from A to B in the graph. [Diagram: a 2-d lattice with a node set A separated from a node set B by a node set C.]
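The separation test in the last bullet can be implemented directly. The sketch below (an illustration, not from the slides) removes the nodes in C and runs a breadth-first search to see whether any node in A can still reach B.

```python
from collections import deque

def separates(adj, A, B, C):
    """Return True if removing the nodes in C leaves no path from A to B.
    adj maps each node to the set of its neighbours in an undirected graph."""
    blocked = set(C)
    frontier = deque(n for n in A if n not in blocked)
    seen = set(frontier)
    while frontier:
        node = frontier.popleft()
        if node in B:
            return False                      # found a path that avoids C
        for nbr in adj[node]:
            if nbr not in blocked and nbr not in seen:
                seen.add(nbr)
                frontier.append(nbr)
    return True

# Tiny chain x1 - x2 - x3: x2 separates x1 from x3.
chain = {'x1': {'x2'}, 'x2': {'x1', 'x3'}, 'x3': {'x2'}}
print(separates(chain, {'x1'}, {'x3'}, {'x2'}))   # True
print(separates(chain, {'x1'}, {'x3'}, set()))    # False
```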

  10. MARKOV BLANKET • Markov blanket: for a given node x, the Markov blanket is the smallest set of nodes which renders x conditionally independent of all other nodes in the graph. • Markov blanket of the 2-d lattice MRF: [Diagram: 5×4 lattice of variables X_11 ... X_45; the Markov blanket of X_23 is the set of its neighbours.]

  11. RELATING DIRECTED AND UNDIRECTED MODELS • Markov blanket of the 2-d lattice MRF: the neighbours of X_23. [Diagram: undirected 5×4 lattice with the neighbours of X_23 highlighted.] • Markov blanket of the 2-d causal MRF (the directed version): the parents of X_23, the children of X_23, and the parents of the children of X_23. [Diagram: directed 5×4 lattice with those three groups highlighted.]
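Both kinds of blanket can be read off the graph mechanically. The sketch below (an illustration, not from the slides) computes the undirected blanket from neighbour sets and the directed blanket (parents, children, and the children's other parents) from parent sets, on a made-up toy graph.

```python
def undirected_blanket(adj, node):
    """Markov blanket in an MRF: simply the neighbours of the node."""
    return set(adj[node])

def directed_blanket(parents, node):
    """Markov blanket in a Bayes net: parents, children, and the children's other parents."""
    children = {n for n, ps in parents.items() if node in ps}
    co_parents = {p for c in children for p in parents[c]} - {node}
    return set(parents[node]) | children | co_parents

# Toy directed graph: a -> c, b -> c, c -> d.
parents = {'a': set(), 'b': set(), 'c': {'a', 'b'}, 'd': {'c'}}
print(directed_blanket(parents, 'a'))   # {'c', 'b'}: child c plus co-parent b

# The same structure read as an undirected graph.
adj = {'a': {'c'}, 'b': {'c'}, 'c': {'a', 'b', 'd'}, 'd': {'c'}}
print(undirected_blanket(adj, 'a'))     # {'c'}
```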

  12. PARAMETERIZING DIRECTED GRAPHICAL MODELS Directed graphical models: • Parameterized by local conditional probability densities (CPDs), e.g. a node A with parent B carries the CPD P(A | B). • Joint distributions are given as products of CPDs: P(X_1, ..., X_N) = ∏_{i=1}^{N} P(X_i | X_parents(i))
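As a toy instance of this factorization (made-up numbers, not from the slides), the sketch below builds the joint over the two-node net B → A from P(B) and P(A | B) and checks that it sums to one.

```python
import numpy as np

# CPDs for the two-node Bayes net B -> A (binary variables, made-up numbers).
p_b = np.array([0.7, 0.3])                       # P(B)
p_a_given_b = np.array([[0.9, 0.1],              # P(A | B=0)
                        [0.4, 0.6]])             # P(A | B=1)

# Joint as the product of CPDs: P(A, B) = P(A | B) P(B), indexed as [a, b].
joint = np.einsum('ba,b->ab', p_a_given_b, p_b)
assert np.isclose(joint.sum(), 1.0)
print(joint)          # e.g. P(A=1, B=0) = 0.1 * 0.7 = 0.07
```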

  13. PARAMETERIZING MARKOV NETWORKS: FACTORS Undirected graphical models: • Parameterized by symmetric factors or potential functions, e.g. φ(A, B) on the edge A - B. - Generalizes both the CPD and the joint distribution. - Note: unlike the CPDs, the potential functions are not required to normalize. • Definition: Let C be a set of cliques. For each c ∈ C, we define a factor (also called potential function or clique potential) as a nonnegative function φ_c(x_c) → R, where x_c is the set of variables in clique c.

  14. PARAMETERIZING MARKOV NETWORKS: JOINT DISTRIBUTION • Joint distribution given by a normalized product of factors: P(x_1, ..., x_n) = (1/Z) ∏_{c ∈ C} φ_c(x_c) • Z is the partition function, the normalization constant: Z = Σ_{x_1,...,x_n} ∏_{c ∈ C} φ_c(x_c) • Our 4-variable example (the cycle A - B - C - D - A): P(a, b, c, d) = (1/Z) φ1(a, b) φ2(b, c) φ3(c, d) φ4(d, a), with Z = Σ_{a,b,c,d} φ1(a, b) φ2(b, c) φ3(c, d) φ4(d, a)
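For small discrete models the partition function can be computed by brute force, exactly as the formula reads. The sketch below (made-up potential tables, not from the slides) enumerates all configurations of the 4-variable cycle to get Z and the normalized joint.

```python
import itertools
import numpy as np

def joint_from_factors(factors, n_vars, n_states=2):
    """Brute-force P(x_1..x_n) = (1/Z) ∏_c φ_c(x_c) for a small discrete MRF.
    factors is a list of (scope, table) pairs, where scope lists the variable
    indices in the clique and table[x_scope] is the nonnegative potential value."""
    unnorm = np.zeros((n_states,) * n_vars)
    for x in itertools.product(range(n_states), repeat=n_vars):
        unnorm[x] = np.prod([table[tuple(x[i] for i in scope)]
                             for scope, table in factors])
    Z = unnorm.sum()          # partition function
    return unnorm / Z, Z

# The 4-variable cycle a - b - c - d - a with a made-up pairwise potential.
phi = np.array([[2.0, 1.0], [1.0, 3.0]])
factors = [((0, 1), phi), ((1, 2), phi), ((2, 3), phi), ((3, 0), phi)]
P, Z = joint_from_factors(factors, n_vars=4)
print(Z, P.sum())             # P sums to 1 by construction
```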

  15. CLIQUES AND MAXIMAL CLIQUES • What is a clique? A subset of nodes whose induced subgraph is complete. • A maximal clique is one where you cannot add any more nodes and remain a clique. [Diagram: two graphs over A, B, C, D with examples of maximal cliques highlighted.]
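For small graphs the maximal cliques can simply be enumerated with a library call. The sketch below uses networkx (assumed to be available; the slides do not use it) on a 4-cycle with one chord added.

```python
import networkx as nx

# 4-cycle A-B-C-D plus the chord B-D.
G = nx.Graph([('A', 'B'), ('B', 'C'), ('C', 'D'), ('D', 'A'), ('B', 'D')])

# nx.find_cliques enumerates the maximal cliques.
print(sorted(sorted(c) for c in nx.find_cliques(G)))
# [['A', 'B', 'D'], ['B', 'C', 'D']]
```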

  16. OF GRAPHS AND DISTRIBUTIONS • Interesting fact: any positive distribution whose conditional independencies can be represented with an undirected graph can be parameterized by a product of factors over the cliques of that graph (the Hammersley-Clifford theorem). [Diagram: the same example graphs with their maximal cliques highlighted.]

  17. TYPES OF GRAPHICAL MODELS [Diagram: the taxonomy again, probabilistic models containing directed and undirected graphical models, with a question mark over their intersection.]

  18. RELATING DIRECTED AND UNDIRECTED MODELS • What kind of probability models can be encoded by both a directed and an undirected graphical model? ➡ Answer: any probability model whose conditional independence relations are consistent with a chordal graph. • Chordal graph: all undirected cycles of four or more vertices have a chord. • Chord: an edge that is not part of the cycle but connects two vertices of the cycle. [Diagram: the 4-cycle over A, B, C, D is not chordal; the same cycle with a chord added is chordal.]
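Chordality is also easy to test programmatically. The sketch below again uses networkx (assumed tooling, not from the slides) to check the two example graphs.

```python
import networkx as nx

cycle = nx.cycle_graph(['A', 'B', 'C', 'D'])   # 4-cycle: not chordal
chorded = cycle.copy()
chorded.add_edge('B', 'D')                     # add a chord

print(nx.is_chordal(cycle))     # False
print(nx.is_chordal(chorded))   # True
```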

  19. TYPES OF GRAPHICAL MODELS [Diagram: the same taxonomy, now with chordal graphs in the intersection of directed and undirected models.]

  20. ENERGY-BASED MODELS • The undirected models that most interest us are energy-based models. • We reformulate the factor φ_c(x_c) in log-space: φ_c(x_c) = exp(−E_c(x_c)), or alternatively E_c(x_c) = −log φ_c(x_c), where E_c(x_c) ∈ R. • Energy-based formulation of the joint distribution: P(x_1, ..., x_n) = (1/Z) exp(−E(x_1, ..., x_n)) = (1/Z) exp(−Σ_{c ∈ C} E_c(x_c)), where E(x_1, ..., x_n) is called the energy function and Z = Σ_{x_1} ... Σ_{x_n} exp[−E(x_1, ..., x_n)].
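To make the change of variables concrete, the sketch below (same made-up potential as in the earlier examples, not from the slides) converts a factor table to an energy table via E_c = −log φ_c and checks that the energy-based joint matches the product-of-factors joint.

```python
import itertools
import numpy as np

phi = np.array([[2.0, 1.0], [1.0, 3.0]])     # a made-up pairwise potential
E = -np.log(phi)                             # clique energy E_c(x_c) = -log φ_c(x_c)

# Joint over the 4-cycle a-b-c-d-a, once via potentials and once via energies.
scopes = [(0, 1), (1, 2), (2, 3), (3, 0)]
p_phi = np.zeros((2,) * 4)
p_energy = np.zeros((2,) * 4)
for x in itertools.product(range(2), repeat=4):
    p_phi[x] = np.prod([phi[x[i], x[j]] for i, j in scopes])
    p_energy[x] = np.exp(-sum(E[x[i], x[j]] for i, j in scopes))
p_phi /= p_phi.sum()
p_energy /= p_energy.sum()
assert np.allclose(p_phi, p_energy)
print("product-of-factors and energy-based formulations agree")
```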

  21. LOG-LINEAR MODEL • Log-linear models are a type of energy-based model with a particular, linear, parametrization. • In log-linear models, the element E_c(x_c) of the energy function corresponding to clique c is composed of: 1. A parameter w_c 2. A feature of the observed data f_c(x_c) • The joint distribution is given by P(x_1, ..., x_n) = (1/Z) exp(−Σ_{c ∈ C} w_c f_c(x_c))
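A minimal sketch of a log-linear MRF, with a made-up model that is not from the slides: three binary variables on a chain, one "agreement" feature per edge. It follows the sign convention used on the maximum-likelihood slides below, log φ_c(x_c) = w_c f_c(x_c).

```python
import itertools
import numpy as np

# Three binary variables on a chain x0 - x1 - x2.
# One "agreement" feature per edge: f_c(x_c) = 1 if the two endpoints agree.
scopes = [(0, 1), (1, 2)]
def features(x):
    return np.array([1.0 if x[i] == x[j] else 0.0 for i, j in scopes])

w = np.array([1.5, -0.5])      # made-up clique weights w_c

# P(x) = (1/Z) exp(Σ_c w_c f_c(x_c)), computed by enumerating all 2^3 states.
states = list(itertools.product(range(2), repeat=3))
unnorm = np.array([np.exp(w @ features(x)) for x in states])
Z = unnorm.sum()
P = unnorm / Z
print(dict(zip(states, P.round(3))))
```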

  22. MAXIMUM LIKELIHOOD LEARNING • Maximum likelihood learning in the context of a fully observable MRF:
w_ML = argmax_w Σ_{i=1}^{D} log p(x^(i); w)
     = argmax_w Σ_{i=1}^{D} [ Σ_c log φ_c(x_c^(i); w_c) − log Z(w) ]
     = argmax_w [ Σ_{i=1}^{D} Σ_c log φ_c(x_c^(i); w_c) − D log Z(w) ]
     = argmax_w [ Σ_{i=1}^{D} Σ_c w_c f_c(x_c^(i)) − D log Z(w) ]   (log-linear model)
The first term decomposes over the cliques; log Z(w) does not decompose.

  23. MAXIMUM LIKELIHOOD LEARNING • In general, there is no closed-form solution for the optimal parameters.
log Z(w) = log Σ_x exp( Σ_c w_c f_c(x_c) )
• We can compute the gradient of the log partition function:
∂/∂w_c log Z(w) = ∂/∂w_c log Σ_x exp( Σ_{c'} w_{c'} f_{c'}(x_{c'}) )
                = [ Σ_x exp( Σ_{c'} w_{c'} f_{c'}(x_{c'}) ) f_c(x_c) ] / [ Σ_x exp( Σ_{c'} w_{c'} f_{c'}(x_{c'}) ) ]
                = E_{p(x; w)}[ f_c(x_c) ]
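The identity ∂ log Z/∂w_c = E_{p(x; w)}[f_c(x_c)] is easy to sanity-check by brute force on the toy log-linear chain from the previous sketch; the weights and finite-difference step size below are made up.

```python
import itertools
import numpy as np

scopes = [(0, 1), (1, 2)]
def features(x):
    return np.array([1.0 if x[i] == x[j] else 0.0 for i, j in scopes])

states = list(itertools.product(range(2), repeat=3))

def log_Z(w):
    return np.log(sum(np.exp(w @ features(x)) for x in states))

def model_expectation(w):
    """E_{p(x; w)}[f(x)] by enumerating all states."""
    unnorm = np.array([np.exp(w @ features(x)) for x in states])
    P = unnorm / unnorm.sum()
    return sum(p * features(x) for p, x in zip(P, states))

w = np.array([1.5, -0.5])
eps = 1e-5
finite_diff = np.array([(log_Z(w + eps * np.eye(2)[c]) - log_Z(w - eps * np.eye(2)[c]))
                        / (2 * eps) for c in range(2)])
print(finite_diff)               # ≈ model_expectation(w)
print(model_expectation(w))
```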

  24. MAXIMUM LIKELIHOOD LEARNING • The gradient of the log-likelihood:
∂/∂w_c Σ_{i=1}^{D} log p(x^(i); w) = ∂/∂w_c [ Σ_{i=1}^{D} Σ_{c'} w_{c'} f_{c'}(x_{c'}^(i)) − D log Z(w) ]
                                   = Σ_{i=1}^{D} f_c(x_c^(i)) − D ∂/∂w_c log Z(w)
                                   = D E_{p(data)}[ f_c(x_c) ] − D E_{p(x; w)}[ f_c(x_c) ]
The data term is often tractable (e.g. for fully observable x); the model term is often intractable, since it is an expectation under the model and in general requires summing over all configurations of x.
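Putting the two terms together for the same toy chain (made-up data and step size, not from the slides; everything is done by brute-force enumeration, which is only feasible because the state space is tiny), the sketch below runs a few gradient-ascent steps on the data expectation minus the model expectation and prints the learned weights and the final average log-likelihood.

```python
import itertools
import numpy as np

scopes = [(0, 1), (1, 2)]
def features(x):
    return np.array([1.0 if x[i] == x[j] else 0.0 for i, j in scopes])

states = list(itertools.product(range(2), repeat=3))
data = [(0, 0, 0), (1, 1, 1), (1, 1, 0), (0, 0, 0)]   # made-up fully observed dataset

def model_distribution(w):
    unnorm = np.array([np.exp(w @ features(x)) for x in states])
    return unnorm / unnorm.sum()

w = np.zeros(2)
data_expectation = np.mean([features(x) for x in data], axis=0)          # E_data[f]
for step in range(50):
    P = model_distribution(w)
    model_expectation = sum(p * features(x) for p, x in zip(P, states))  # E_model[f]
    grad = data_expectation - model_expectation   # (1/D) x gradient of the log-likelihood
    w += 0.5 * grad                               # gradient ascent, made-up step size
avg_ll = np.mean([np.log(model_distribution(w)[states.index(x)]) for x in data])
print(w, avg_ll)
```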
