Probabilistic Graphical Models, David Sontag, New York University (Lecture 8, March 22, 2012)



  1. Probabilistic Graphical Models. David Sontag, New York University. Lecture 8, March 22, 2012.

  2. Approximate marginal inference
Given the joint p(x_1, ..., x_n) represented as a graphical model, how do we perform marginal inference, e.g. to compute p(x_1)? We showed in Lecture 5 that doing this exactly is NP-hard. Nearly all approximate inference algorithms are either (1) Monte Carlo methods (e.g., likelihood reweighting, MCMC) or (2) variational algorithms (e.g., mean-field, TRW, loopy belief propagation). These next two lectures will be on variational methods.

  3. Variational methods
Goal: approximate a difficult distribution p(x) with a new distribution q(x) such that (1) p(x) and q(x) are "close", and (2) computation on q(x) is easy.
How should we measure distance between distributions? The Kullback-Leibler divergence (KL-divergence) between two distributions p and q is defined as

    D(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}

(it measures the expected number of extra bits required to describe samples from p(x) using a code based on q instead of p). As you showed in your homework, D(p||q) ≥ 0 for all p, q, with equality if and only if p = q. Notice that the KL-divergence is asymmetric.
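As a minimal numerical sketch (not from the slides), the following NumPy snippet evaluates D(p||q) for two made-up discrete distributions and illustrates the properties above: the divergence is non-negative, zero only when p = q, and asymmetric.

```python
# Sketch (not from the slides): D(p||q) for discrete distributions,
# illustrating non-negativity and asymmetry. The distributions are made up.
import numpy as np

def kl(p, q):
    """D(p||q) = sum_x p(x) log(p(x)/q(x)), in nats, for arrays that sum to 1."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                      # terms with p(x) = 0 contribute 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

print(kl(p, q))   # ~0.184 nats
print(kl(q, p))   # ~0.192 nats -- a different value: the divergence is asymmetric
print(kl(p, p))   # 0.0         -- equality iff p = q
```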

  4. KL-divergence (see Section 8.5 of K&F)

    D(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}

Suppose p is the true distribution we wish to do inference with. What is the difference between the solution to arg min_q D(p||q) (called the M-projection of p onto Q) and arg min_q D(q||p) (called the I-projection)? These two will differ only when q is minimized over a restricted set of probability distributions Q = {q_1, ...}, and in particular when p ∉ Q.

  5. KL-divergence – M-projection

    q^* = \arg\min_{q \in Q} D(p \| q), \qquad D(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}

For example, suppose that p(z) is a 2D Gaussian and Q is the set of all Gaussian distributions with diagonal covariance matrices:
[Figure (b): contour plot of p (green) and the M-projection q (red) over (z_1, z_2).]

  6. KL-divergence – I-projection

    q^* = \arg\min_{q \in Q} D(q \| p), \qquad D(q \| p) = \sum_x q(x) \log \frac{q(x)}{p(x)}

For example, suppose that p(z) is a 2D Gaussian and Q is the set of all Gaussian distributions with diagonal covariance matrices:
[Figure (a): contour plot of p (green) and the I-projection q (red) over (z_1, z_2).]

  7. KL-divergence (single Gaussian)
In this simple example, both the M-projection and the I-projection find an approximate q(x) that has the correct mean (i.e. E_p[z] = E_q[z]):
[Figure, panels (a) and (b): the contour plots from the previous two slides, shown side by side over (z_1, z_2).]
What if p(x) is multi-modal?
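The diagonal-Gaussian example above has closed forms for both projections, which the sketch below (not from the slides) evaluates numerically. Under the assumption Q = {Gaussians with diagonal covariance}, both projections keep the mean of p, but the M-projection matches the marginal variances Sigma_ii while the I-projection uses 1/(Sigma^{-1})_ii, which is narrower whenever the components of p are correlated. The particular mean and covariance are made up.

```python
# Sketch (not from the slides): M- and I-projections of a correlated 2D Gaussian
# p = N(mu, Sigma) onto Q = {Gaussians with diagonal covariance}. Both closed
# forms keep the mean mu; only the variances differ.
import numpy as np

mu = np.array([0.5, 0.5])
Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])          # strong positive correlation

Lambda = np.linalg.inv(Sigma)           # precision matrix

m_proj_var = np.diag(Sigma)             # M-projection variances (moment matching)
i_proj_var = 1.0 / np.diag(Lambda)      # I-projection variances

print("M-projection variances:", m_proj_var)   # [1.   1.  ]  -- as wide as p's marginals
print("I-projection variances:", i_proj_var)   # [0.19 0.19]  -- much narrower, as in panel (a)
```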

  8. KL-divergence – M-projection (mixture of Gaussians)

    q^* = \arg\min_{q \in Q} D(p \| q), \qquad D(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}

Now suppose that p(x) is a mixture of two 2D Gaussians and Q is the set of all 2D Gaussian distributions (with arbitrary covariance matrices):
[Figure: contours of p (blue) and q (red).]
The M-projection yields a distribution q(x) with the correct mean and covariance.
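Moment matching makes the M-projection easy to compute once the moments of p are available. A small sketch (not from the slides), with made-up mixture weights and components:

```python
# Sketch (not from the slides): the M-projection of a two-component Gaussian
# mixture onto the family of single (full-covariance) Gaussians is obtained by
# moment matching: q* has the mixture's mean and covariance.
import numpy as np

weights = np.array([0.5, 0.5])
means   = [np.array([-2.0, 0.0]), np.array([2.0, 0.0])]
covs    = [np.eye(2) * 0.5,       np.eye(2) * 0.5]

# Mean of the mixture: E_p[x] = sum_k w_k mu_k
mix_mean = sum(w * m for w, m in zip(weights, means))

# Covariance of the mixture: E_p[x x^T] - E_p[x] E_p[x]^T
second_moment = sum(w * (C + np.outer(m, m)) for w, m, C in zip(weights, means, covs))
mix_cov = second_moment - np.outer(mix_mean, mix_mean)

print("q* mean:", mix_mean)      # [0. 0.]
print("q* covariance:\n", mix_cov)
# [[4.5 0. ]
#  [0.  0.5]]  -- one broad Gaussian covering both modes, as in the slide's figure
```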

  9. KL-divergence – I-projection (mixture of Gaussians)

    q^* = \arg\min_{q \in Q} D(q \| p), \qquad D(q \| p) = \sum_x q(x) \log \frac{q(x)}{p(x)}

[Figure: contours of p (blue) and q (red); two equivalently good solutions!]
Unlike the M-projection, the I-projection does not necessarily yield the correct moments.

  10. Finding the M-projection is the same as exact inference
The M-projection is

    q^* = \arg\min_{q \in Q} D(p \| q), \qquad D(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}

Recall the definition of probability distributions in the exponential family:

    q(x; \eta) = h(x) \exp\{ \eta \cdot f(x) - \ln Z(\eta) \}

f(x) are called the sufficient statistics. In the exponential family, there is a one-to-one correspondence between distributions q(x; η) and marginal vectors E_q[f(x)].
Suppose that Q is an exponential family (p(x) can be arbitrary). It can be shown (see Thm 8.6) that the expected sufficient statistics with respect to q*(x) are exactly the corresponding marginals under p(x):

    E_{q^*}[f(x)] = E_p[f(x)]

Thus, solving for the M-projection is just as hard as the original inference problem.
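A discrete illustration of this moment-matching property (a sketch, not from the slides): take Q to be the fully factorized family over two binary variables, with sufficient statistics f(x) = (x_1, x_2). The M-projection of a made-up p is then the product of p's marginals, so computing q* amounts to computing those marginals, i.e. exact marginal inference in p.

```python
# Sketch (not from the slides): Thm 8.6 on a tiny discrete example. Q is the
# fully factorized exponential family with f(x) = (x1, x2); the M-projection q*
# is the product of p's marginals, and E_q*[f(x)] = E_p[f(x)].
import numpy as np

# p(x1, x2) as a 2x2 table, rows = x1, columns = x2 (made-up values)
p = np.array([[0.10, 0.30],
              [0.25, 0.35]])

p1 = p.sum(axis=1)          # marginal of x1
p2 = p.sum(axis=0)          # marginal of x2

q_star = np.outer(p1, p2)   # M-projection onto the fully factorized family

# Expected sufficient statistics E[f(x)] = (P(x1 = 1), P(x2 = 1))
print("E_p [f(x)] =", [p1[1], p2[1]])                                   # [0.6, 0.65]
print("E_q*[f(x)] =", [q_star.sum(axis=1)[1], q_star.sum(axis=0)[1]])   # identical
```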

  11. Most variational inference algorithms make use of the I-projection.

  12. Variational methods
Suppose that we have an arbitrary graphical model:

    p(x; \theta) = \frac{1}{Z(\theta)} \prod_{c \in C} \phi_c(x_c) = \exp\Big\{ \sum_{c \in C} \theta_c(x_c) - \ln Z(\theta) \Big\}

All of the approaches begin as follows:

    D(q \| p) = \sum_x q(x) \ln \frac{q(x)}{p(x)}
              = -\sum_x q(x) \ln p(x) - \sum_x q(x) \ln \frac{1}{q(x)}
              = -\sum_x q(x) \Big( \sum_{c \in C} \theta_c(x_c) - \ln Z(\theta) \Big) - H(q(x))
              = -\sum_x q(x) \sum_{c \in C} \theta_c(x_c) + \sum_x q(x) \ln Z(\theta) - H(q(x))
              = -\sum_{c \in C} E_q[\theta_c(x_c)] + \ln Z(\theta) - H(q(x)).

  13. Variational approach
Since D(q||p) ≥ 0, we have

    -\sum_{c \in C} E_q[\theta_c(x_c)] + \ln Z(\theta) - H(q(x)) \geq 0,

which implies that

    \ln Z(\theta) \geq \sum_{c \in C} E_q[\theta_c(x_c)] + H(q(x)).

Thus, any approximating distribution q(x) gives a lower bound on the log-partition function. Recall that D(q||p) = 0 if and only if p = q. Thus, if we allow ourselves to optimize over all distributions, we have:

    \ln Z(\theta) = \max_q \sum_{c \in C} E_q[\theta_c(x_c)] + H(q(x)).
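A brute-force check of this bound on a tiny model (a sketch, not from the slides; the chain, the potentials, and the factorized q are all made up): any q gives a value at most ln Z, and q = p attains it.

```python
# Sketch (not from the slides): check  ln Z(theta) >= sum_c E_q[theta_c] + H(q)
# on a small binary chain x1 - x2 - x3 with random pairwise potentials, and the
# equality at q = p. Everything is enumerated by brute force.
import itertools
import numpy as np

rng = np.random.default_rng(0)
theta_12 = rng.normal(size=(2, 2))     # theta_c(x_c) for c = {1,2}
theta_23 = rng.normal(size=(2, 2))     # theta_c(x_c) for c = {2,3}

def score(x):                          # sum_c theta_c(x_c)
    return theta_12[x[0], x[1]] + theta_23[x[1], x[2]]

states = list(itertools.product([0, 1], repeat=3))
logZ = np.log(sum(np.exp(score(x)) for x in states))
p = {x: np.exp(score(x) - logZ) for x in states}

def bound(q):                          # sum_c E_q[theta_c] + H(q)
    e_theta = sum(q[x] * score(x) for x in states)
    entropy = -sum(q[x] * np.log(q[x]) for x in states if q[x] > 0)
    return e_theta + entropy

# An arbitrary fully factorized q (made-up q_i(x_i = 1) values)
q1, q2, q3 = 0.3, 0.6, 0.8
q_fact = {x: (q1 if x[0] else 1 - q1) *
             (q2 if x[1] else 1 - q2) *
             (q3 if x[2] else 1 - q3) for x in states}

print("ln Z            =", logZ)
print("bound at q_fact =", bound(q_fact))   # <= ln Z
print("bound at q = p  =", bound(p))        # == ln Z (up to rounding)
```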

  14. Mean-field algorithms

    \ln Z(\theta) = \max_q \sum_{c \in C} E_q[\theta_c(x_c)] + H(q(x)).

Although this function is concave and thus in theory should be easy to optimize, we need some compact way of representing q(x). Mean-field algorithms assume a factored representation of the joint distribution:

    q(x) = \prod_{i \in V} q_i(x_i)

The objective function to use for variational inference then becomes:

    \max_{\{ q_i(x_i) \geq 0,\ \sum_{x_i} q_i(x_i) = 1 \}} \ \sum_{c \in C} \sum_{x_c} \theta_c(x_c) \prod_{i \in c} q_i(x_i) + \sum_{i \in V} H(q_i)

Key difficulties: (1) this is a highly non-convex optimization problem, and (2) the factored distribution is usually too coarse an approximation.
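A minimal sketch of naive mean-field coordinate ascent (not from the slides; the graph and potentials are made up). Each factor q_i is updated in turn from the standard fixed-point condition q_i(x_i) ∝ exp(Σ_{j∈N(i)} Σ_{x_j} q_j(x_j) θ_ij(x_i, x_j)), and each update can only increase the lower bound from the previous slide.

```python
# Sketch (not from the slides): naive mean-field coordinate ascent on a small
# binary pairwise model with q(x) = prod_i q_i(x_i). Each sweep monotonically
# increases the lower bound  sum_c E_q[theta_c] + sum_i H(q_i)  on ln Z.
import itertools
import numpy as np

rng = np.random.default_rng(1)
edges = [(0, 1), (1, 2), (0, 2)]                    # a 3-cycle on binary variables
theta = {e: rng.normal(size=(2, 2)) for e in edges} # theta_ij(x_i, x_j)

n = 3
q = np.full((n, 2), 0.5)                            # q_i(x_i), initialized uniform

def lower_bound(q):
    val = sum(q[i, a] * q[j, b] * theta[(i, j)][a, b]
              for (i, j) in edges for a in (0, 1) for b in (0, 1))
    val += -sum(q[i, a] * np.log(q[i, a]) for i in range(n) for a in (0, 1) if q[i, a] > 0)
    return val

for it in range(20):                                # coordinate ascent sweeps
    for i in range(n):
        log_q = np.zeros(2)
        for (a, b) in edges:
            if i == a:
                log_q += theta[(a, b)] @ q[b]       # sum_{x_j} theta_ij(x_i, x_j) q_j(x_j)
            elif i == b:
                log_q += theta[(a, b)].T @ q[a]
        q[i] = np.exp(log_q - log_q.max())
        q[i] /= q[i].sum()
    print(f"sweep {it}: lower bound = {lower_bound(q):.4f}")

# For reference, the exact log-partition function by enumeration:
logZ = np.log(sum(np.exp(sum(theta[(a, b)][x[a], x[b]] for (a, b) in edges))
                  for x in itertools.product([0, 1], repeat=n)))
print("exact ln Z =", logZ)
```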

  15. Convex relaxation

    \ln Z(\theta) = \max_q \sum_{c \in C} E_q[\theta_c(x_c)] + H(q(x)).

Assume that p(x) is in the exponential family, and let f(x) be its sufficient statistic vector. Let Q be the exponential family with sufficient statistics f(x), and define μ_q = E_q[f(x)] to be the marginals of q(x). We can re-write the objective as

    \ln Z(\theta) = \max_q \sum_{c \in C} \sum_{x_c} \theta_c(x_c)\, \mu_{q,c}(x_c) + H(\mu_q),

where we define H(μ_q) to be the entropy of the maximum-entropy distribution with marginals μ_q. Next, instead of optimizing over distributions q(x), optimize over valid marginal vectors μ. We obtain:

    \ln Z(\theta) = \max_{\mu \in M} \sum_{c \in C} \sum_{x_c} \theta_c(x_c)\, \mu_c(x_c) + H(\mu)
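To see this equality concretely, the sketch below (not from the slides) evaluates the objective at the optimum μ* = (marginals of p) for a made-up binary chain. It uses a fact not stated on this slide: on a tree, the maximum-entropy distribution with pairwise marginals μ has entropy Σ_edges H(μ_ij) − Σ_i (d_i − 1) H(μ_i), so H(μ*) is available in closed form.

```python
# Sketch (not from the slides): check  ln Z = theta . mu* + H(mu*)  at the
# optimum mu* = marginals of p, for a binary chain x1 - x2 - x3. On a tree,
# H(mu) decomposes as sum_edges H(mu_ij) - sum_i (d_i - 1) H(mu_i).
import itertools
import numpy as np

rng = np.random.default_rng(2)
theta_12 = rng.normal(size=(2, 2))                 # chain edge (x1, x2)
theta_23 = rng.normal(size=(2, 2))                 # chain edge (x2, x3)

states = list(itertools.product([0, 1], repeat=3))
score = lambda x: theta_12[x[0], x[1]] + theta_23[x[1], x[2]]
logZ = np.log(sum(np.exp(score(x)) for x in states))
p = np.array([np.exp(score(x) - logZ) for x in states]).reshape(2, 2, 2)

mu_12 = p.sum(axis=2)                              # pairwise marginal of (x1, x2)
mu_23 = p.sum(axis=0)                              # pairwise marginal of (x2, x3)
mu_2 = p.sum(axis=(0, 2))                          # node marginal of x2 (degree 2)

H = lambda t: -np.sum(t * np.log(t))               # entropy of a marginal table
H_mu = H(mu_12) + H(mu_23) - (2 - 1) * H(mu_2)     # tree decomposition of H(mu*)

theta_dot_mu = np.sum(theta_12 * mu_12) + np.sum(theta_23 * mu_23)
print("theta . mu* + H(mu*) =", theta_dot_mu + H_mu)
print("ln Z                 =", logZ)              # the two agree
```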

  16. Marginal polytope (same as from Lecture 7!)
[Figure (Wainwright & Jordan, '03): the marginal polytope for three binary variables X_1, X_2, X_3. Each integral assignment corresponds to a vertex μ, the stacked vector of node-assignment indicators ("assignment for X_i") and edge-assignment indicators ("edge assignment for X_i X_j"). Valid marginal probabilities are convex combinations of these vertices, e.g. μ = (1/2) μ' + (1/2) μ'' for two assignments μ', μ''.]
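A sketch (not from the slides) that builds these vertices explicitly for three binary variables with all three edges: each assignment x maps to the stacked vector of node and edge indicators, and any convex combination of those vectors is a valid marginal vector.

```python
# Sketch (not from the slides): vertices of the marginal polytope for three
# binary variables with edges {(1,2), (1,3), (2,3)}, as in the figure.
import itertools
import numpy as np

edges = [(0, 1), (0, 2), (1, 2)]

def vertex(x):
    """Indicator (overcomplete) sufficient-statistic vector for assignment x."""
    nodes = [1.0 if x[i] == a else 0.0 for i in range(3) for a in (0, 1)]
    pairs = [1.0 if (x[i], x[j]) == (a, b) else 0.0
             for (i, j) in edges for a in (0, 1) for b in (0, 1)]
    return np.array(nodes + pairs)

vertices = {x: vertex(x) for x in itertools.product([0, 1], repeat=3)}
for x, v in vertices.items():
    print(x, v.astype(int))

# A mixture of two assignments is a valid marginal vector:
mu = 0.5 * vertices[(0, 1, 1)] + 0.5 * vertices[(1, 1, 0)]
print("valid marginal vector:", mu)
```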

  17. Convex relaxation

    \ln Z(\theta) = \max_{\mu \in M} \sum_{c \in C} \sum_{x_c} \theta_c(x_c)\, \mu_c(x_c) + H(\mu)

We still haven't achieved anything, because (1) the marginal polytope M is complex to describe (in general, it has exponentially many vertices and facets), and (2) H(μ) is very difficult to compute or optimize over.
We now make two approximations: (1) we replace M with a relaxation of the marginal polytope, e.g. the local consistency constraints M_L, and (2) we replace H(μ) with a concave function H̃(μ) which upper bounds H(μ), i.e. H(μ) ≤ H̃(μ).
As a result, we obtain the following upper bound on the log-partition function, which is concave and easy to optimize:

    \ln Z(\theta) \leq \max_{\mu \in M_L} \sum_{c \in C} \sum_{x_c} \theta_c(x_c)\, \mu_c(x_c) + \tilde{H}(\mu)
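Why is M_L only a relaxation? A classic example (sketched below, not from the slides): on the 3-cycle, pseudomarginals that make (x_1, x_2) and (x_2, x_3) perfectly correlated but (x_1, x_3) perfectly anti-correlated satisfy every local consistency constraint, yet cannot come from any joint distribution. The snippet checks local consistency directly, then certifies non-membership in M with a small linear program over the 8 vertices from the previous slide; SciPy's linprog is assumed to be available.

```python
# Sketch (not from the slides): a vector in M_L \ M on the binary 3-cycle.
# Local consistency holds, but no joint distribution has these marginals.
import itertools
import numpy as np
from scipy.optimize import linprog

corr = np.array([[0.5, 0.0], [0.0, 0.5]])       # P(x_i = x_j) = 1
anti = np.array([[0.0, 0.5], [0.5, 0.0]])       # P(x_i != x_j) = 1
edges = {(0, 1): corr, (1, 2): corr, (0, 2): anti}
node = np.array([0.5, 0.5])                     # uniform node pseudomarginals

# Local consistency: each edge table sums to 1 and agrees with the node marginals.
for (i, j), tab in edges.items():
    assert np.isclose(tab.sum(), 1.0)
    assert np.allclose(tab.sum(axis=1), node) and np.allclose(tab.sum(axis=0), node)
print("locally consistent: yes (the vector is in M_L)")

# Stack node and edge pseudomarginals into one vector mu, and build the
# indicator vector of each of the 8 assignments (the vertices of M).
def stack(node_tabs, edge_tabs):
    return np.concatenate([np.ravel(t) for t in node_tabs] +
                          [np.ravel(edge_tabs[e]) for e in sorted(edge_tabs)])

mu = stack([node] * 3, edges)
verts = []
for x in itertools.product([0, 1], repeat=3):
    n_tabs = [np.eye(2)[x[i]] for i in range(3)]
    e_tabs = {(i, j): np.outer(np.eye(2)[x[i]], np.eye(2)[x[j]]) for (i, j) in edges}
    verts.append(stack(n_tabs, e_tabs))
V = np.array(verts).T                            # columns are vertices of M

# Is mu a convex combination of the vertices?  Look for lambda >= 0 with
# V lambda = mu and sum(lambda) = 1 (the objective is irrelevant: feasibility only).
A_eq = np.vstack([V, np.ones((1, 8))])
b_eq = np.concatenate([mu, [1.0]])
res = linprog(c=np.zeros(8), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * 8)
print("in the marginal polytope M:", res.success)   # False: mu is in M_L but not in M
```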
