Probabilistic Graphical Models, David Sontag, New York University


SLIDE 1

Probabilistic Graphical Models

David Sontag

New York University

Lecture 8, March 28, 2012

David Sontag (NYU) Graphical Models Lecture 8, March 28, 2012 1 / 14

SLIDE 2

From last lecture: Variational methods

Suppose that we have an arbitrary graphical model:

    p(x; \theta) = \frac{1}{Z(\theta)} \prod_{c \in C} \phi_c(x_c) = \exp\Big( \sum_{c \in C} \theta_c(x_c) - \ln Z(\theta) \Big)

Finding the approximating distribution q(x) \in Q that minimizes the I-projection to p(x), i.e. D(q \| p) = \sum_x q(x) \ln \frac{q(x)}{p(x)}, is equivalent to

    \max_{q \in Q} \sum_{c \in C} E_q[\theta_c(x_c)] + H(q(x))

where E_q[\theta_c(x_c)] = \sum_{x_c} q(x_c) \theta_c(x_c) and H(q(x)) is the entropy of q(x).

If p \in Q, the value of the objective at optimality is equal to \ln Z(\theta).

How should we approximate this? We need a compact way of representing q(x) and of finding the maximum.
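These identities are easy to check numerically by brute-force enumeration. The following is a minimal sketch (not from the lecture, with an invented single-clique model over two binary variables): at q = p the objective \sum_x q(x)\theta(x) + H(q) equals \ln Z(\theta), and at any other q it falls short by exactly D(q \| p).

```python
import math
from itertools import product

# Invented single-clique model over two binary variables (illustrative values).
theta = {(0, 0): 0.5, (0, 1): -1.0, (1, 0): -1.0, (1, 1): 2.0}
states = list(product([0, 1], repeat=2))

logZ = math.log(sum(math.exp(theta[x]) for x in states))
p = {x: math.exp(theta[x] - logZ) for x in states}  # p(x; theta)

def objective(q):
    """Variational objective: sum_x q(x) theta(x) + H(q)."""
    energy = sum(q[x] * theta[x] for x in states)
    entropy = -sum(q[x] * math.log(q[x]) for x in states if q[x] > 0)
    return energy + entropy

q_uniform = {x: 0.25 for x in states}

# At q = p the objective attains ln Z(theta) ...
assert abs(objective(p) - logZ) < 1e-9
# ... and elsewhere it equals ln Z(theta) - D(q || p), hence is smaller.
assert objective(q_uniform) < logZ
```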


SLIDE 3

From last lecture: Relaxation approaches

We showed two approximation methods, both making use of the local consistency constraints M_L on the marginal polytope:

1. Bethe free energy approximation (for pairwise MRFs):

    \max_{\mu \in M_L} \sum_{ij \in E} \sum_{x_i, x_j} \mu_{ij}(x_i, x_j) \theta_{ij}(x_i, x_j) + \sum_{i \in V} H(\mu_i) - \sum_{ij \in E} I(\mu_{ij})

   Not concave. Can use the concave-convex procedure to find local optima. Loopy BP, if it converges, finds a saddle point (often a local maximum).

2. Tree re-weighted approximation (for pairwise MRFs):

    (*) \max_{\mu \in M_L} \sum_{ij \in E} \sum_{x_i, x_j} \mu_{ij}(x_i, x_j) \theta_{ij}(x_i, x_j) + \sum_{i \in V} H(\mu_i) - \sum_{ij \in E} \rho_{ij} I(\mu_{ij})

   The \rho_{ij} are edge appearance probabilities (must be consistent with some set of spanning trees). This objective is concave! Find the global maximum using projected gradient ascent. Provides an upper bound on the log-partition function, i.e. \ln Z(\theta) \le (*).
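One way to sanity-check this family of objectives is the tree case: on a tree, all edge appearance probabilities can be set to \rho_{ij} = 1, the TRW objective coincides with the Bethe free energy, and at the true marginals it equals \ln Z(\theta) exactly. A small numerical check on an invented single-edge model (my own toy numbers, not from the lecture):

```python
import math
from itertools import product

# Invented single-edge pairwise MRF (only an edge potential, no singleton terms).
theta12 = {(0, 0): 1.0, (0, 1): -0.5, (1, 0): 0.3, (1, 1): 0.7}
pairs = list(product([0, 1], repeat=2))

logZ = math.log(sum(math.exp(theta12[x]) for x in pairs))
mu12 = {x: math.exp(theta12[x] - logZ) for x in pairs}  # true pairwise marginal
mu1 = {a: sum(mu12[(a, b)] for b in (0, 1)) for a in (0, 1)}
mu2 = {b: sum(mu12[(a, b)] for a in (0, 1)) for b in (0, 1)}

def H(dist):
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

# Mutual information: I(mu12) = H(mu1) + H(mu2) - H(mu12).
I12 = H(mu1) + H(mu2) - H(mu12)

# Bethe objective at the true marginals:
bethe = sum(mu12[x] * theta12[x] for x in pairs) + H(mu1) + H(mu2) - I12
assert abs(bethe - logZ) < 1e-9  # exact on a tree
```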


SLIDE 4

Two types of variational algorithms: Mean-field and relaxation

    \max_{q \in Q} \sum_{c \in C} \sum_{x_c} q(x_c) \theta_c(x_c) + H(q(x))

Although this function is concave and thus in theory should be easy to optimize, we need some compact way of representing q(x).

Relaxation algorithms work directly with pseudomarginals, which may not be consistent with any joint distribution. Mean-field algorithms assume a factored representation of the joint distribution, e.g.

    q(x) = \prod_{i \in V} q_i(x_i)    (called naive mean field)


SLIDE 5

Naive mean-field

Suppose that Q consists of all fully factored distributions, of the form q(x) = \prod_{i \in V} q_i(x_i).

We can use this to simplify

    \max_{q \in Q} \sum_{c \in C} \sum_{x_c} q(x_c) \theta_c(x_c) + H(q)

First, note that q(x_c) = \prod_{i \in c} q_i(x_i).

Next, notice that the joint entropy decomposes as a sum of local entropies:

    H(q) = -\sum_x q(x) \ln q(x)
         = -\sum_x q(x) \ln \prod_{i \in V} q_i(x_i)
         = -\sum_x q(x) \sum_{i \in V} \ln q_i(x_i)
         = -\sum_{i \in V} \sum_x q(x) \ln q_i(x_i)
         = -\sum_{i \in V} \sum_{x_i} q_i(x_i) \ln q_i(x_i) \sum_{x_{V \setminus i}} q(x_{V \setminus i} \mid x_i)
         = \sum_{i \in V} H(q_i).
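The telescoping derivation above is easy to verify numerically. A short sketch (not from the lecture) with three hypothetical binary variables and a fully factored q:

```python
import math
import random
from itertools import product

random.seed(0)

# Three hypothetical binary variables with fully factored q(x) = prod_i q_i(x_i).
qs = []
for _ in range(3):
    p1 = random.random()
    qs.append({0: 1 - p1, 1: p1})

def H(dist):
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

# Joint entropy computed by brute force over the product distribution ...
joint_H = 0.0
for x in product([0, 1], repeat=3):
    qx = math.prod(qs[i][xi] for i, xi in enumerate(x))
    joint_H -= qx * math.log(qx)

# ... equals the sum of local entropies, as in the derivation above.
assert abs(joint_H - sum(H(qi) for qi in qs)) < 1e-9
```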


SLIDE 6

Naive mean-field

Putting these together, we obtain the following variational objective:

    (*) \max_q \sum_{c \in C} \sum_{x_c} \theta_c(x_c) \prod_{i \in c} q_i(x_i) + \sum_{i \in V} H(q_i)

subject to the constraints

    q_i(x_i) \ge 0 \quad \forall i \in V, \; x_i \in \mathrm{Val}(X_i)
    \sum_{x_i \in \mathrm{Val}(X_i)} q_i(x_i) = 1 \quad \forall i \in V

Corresponds to optimizing over an inner bound on the marginal polytope, given by \mu_{ij}(x_i, x_j) = \mu_i(x_i) \mu_j(x_j) and the above constraints. We obtain a lower bound on the partition function, i.e. (*) \le \ln Z(\theta).
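The lower-bound property can be checked by brute force on a small model: for any fully factored q, the objective equals \ln Z(\theta) - D(q \| p) and so can never exceed \ln Z(\theta). A sketch on an invented triangle MRF (edge potentials drawn at random, not from the lecture):

```python
import math
import random
from itertools import product

random.seed(1)

# Invented pairwise MRF on a triangle of binary variables.
edges = [(0, 1), (1, 2), (0, 2)]
theta = {e: {xy: random.uniform(-1, 1) for xy in product([0, 1], repeat=2)}
         for e in edges}

def log_Z():
    return math.log(sum(
        math.exp(sum(theta[(i, j)][(x[i], x[j])] for (i, j) in edges))
        for x in product([0, 1], repeat=3)))

def nmf_objective(qs):
    """sum_ij sum_{xi,xj} theta_ij qi qj + sum_i H(qi): naive mean-field objective."""
    val = 0.0
    for (i, j) in edges:
        for (xi, xj) in product([0, 1], repeat=2):
            val += theta[(i, j)][(xi, xj)] * qs[i][xi] * qs[j][xj]
    for qi in qs:
        val -= sum(p * math.log(p) for p in qi.values() if p > 0)
    return val

lnZ = log_Z()
# Any fully factored q gives a lower bound on ln Z(theta).
for _ in range(100):
    qs = []
    for _ in range(3):
        p1 = random.uniform(0.01, 0.99)
        qs.append({0: 1 - p1, 1: p1})
    assert nmf_objective(qs) <= lnZ + 1e-9
```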


SLIDE 7

Naive mean-field for pairwise MRFs

How do we maximize the variational objective?

    (*) \max_q \sum_{ij \in E} \sum_{x_i, x_j} \theta_{ij}(x_i, x_j) q_i(x_i) q_j(x_j) - \sum_{i \in V} \sum_{x_i} q_i(x_i) \ln q_i(x_i)

This is a non-convex optimization problem, with many local maxima! Nonetheless, we can greedily maximize it using block coordinate ascent:

1. Iterate over each of the variables i \in V. For variable i,
2. Fully maximize (*) with respect to \{ q_i(x_i), \forall x_i \in \mathrm{Val}(X_i) \}.
3. Repeat until convergence.

Constructing the Lagrangian, taking the derivative, setting to zero, and solving yields the update (shown on blackboard):

    q_i(x_i) = \frac{1}{Z_i} \exp\Big( \theta_i(x_i) + \sum_{j \in N(i)} \sum_{x_j} q_j(x_j) \theta_{ij}(x_i, x_j) \Big)
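The update can be sketched in a few lines. The following toy implementation (potentials invented for illustration, not from the lecture) runs the block coordinate updates on a two-variable pairwise MRF until they stop changing:

```python
import math

# Invented two-variable toy MRF (singleton and pairwise potentials).
theta_i = {0: {0: 0.2, 1: -0.1}, 1: {0: 0.0, 1: 0.4}}
theta_ij = {(0, 0): 1.0, (0, 1): -1.0, (1, 0): -1.0, (1, 1): 1.0}

q = [{0: 0.5, 1: 0.5}, {0: 0.5, 1: 0.5}]  # start from the uniform point

def update(i):
    """q_i(x_i) ∝ exp(theta_i(x_i) + sum_{x_j} q_j(x_j) theta_ij(x_i, x_j))."""
    j = 1 - i  # the only neighbor in this two-node model
    scores = {}
    for xi in (0, 1):
        s = theta_i[i][xi]
        for xj in (0, 1):
            key = (xi, xj) if i == 0 else (xj, xi)
            s += q[j][xj] * theta_ij[key]
        scores[xi] = math.exp(s)
    Zi = sum(scores.values())  # the local normalizer Z_i
    q[i] = {xi: scores[xi] / Zi for xi in (0, 1)}

for _ in range(200):  # block coordinate sweeps
    update(0)
    update(1)

# At convergence the updates are a fixed point.
prev = dict(q[0])
update(0)
assert all(abs(q[0][x] - prev[x]) < 1e-6 for x in (0, 1))
```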


SLIDE 8

How accurate will the approximation be?

Consider a distribution which is an XOR of two binary variables A and B: p(a, b) = 0.5 - \epsilon if a \ne b and p(a, b) = \epsilon if a = b.

[Figure: contour plot of the variational objective as a function of Q(a^1) and Q(b^1), each ranging over [0, 1].]

Even for a single edge, mean field can give very wrong answers! Interestingly, once \epsilon > 0.1, mean field has a single maximum point at the uniform distribution (and is thus exact).
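This behavior is easy to reproduce by evaluating the naive mean-field objective on a grid over (Q(a^1), Q(b^1)). A sketch (not from the lecture), assuming \theta = \ln p so that \ln Z(\theta) = 0:

```python
import math

def nmf_objective(qa, qb, eps):
    """Naive MF objective for the XOR distribution, with theta = ln p."""
    F = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p = eps if a == b else 0.5 - eps
            F += (qa if a else 1 - qa) * (qb if b else 1 - qb) * math.log(p)
    for q in (qa, 1 - qa, qb, 1 - qb):  # entropy terms H(q_a) + H(q_b)
        if q > 0:
            F -= q * math.log(q)
    return F

def best_q(eps, steps=200):
    grid = [i / steps for i in range(1, steps)]
    return max(((qa, qb) for qa in grid for qb in grid),
               key=lambda t: nmf_objective(t[0], t[1], eps))

qa, qb = best_q(0.01)
# For small eps, mean field collapses to a near-deterministic corner ...
assert abs(qa - 0.5) > 0.3
qa, qb = best_q(0.2)
# ... while for larger eps the single maximum is the uniform distribution.
assert abs(qa - 0.5) < 0.02
```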


SLIDE 9

Structured mean-field approximations

Rather than assuming a fully factored distribution for q, we can use a structured approximation, such as a spanning tree. For example, for a factorial HMM, a good approximation may be a product of chain-structured models.


SLIDE 10

Obtaining true bounds on the marginals

Suppose we can obtain upper and lower bounds on the partition function. These can be used to obtain upper and lower bounds on marginals.

Let Z(\theta_{x_i}) denote the partition function of the distribution on X_{V \setminus i} where X_i = x_i, and suppose that L_{x_i} \le Z(\theta_{x_i}) \le U_{x_i}. Then,

    p(x_i; \theta) = \frac{\sum_{x_{V \setminus i}} \exp(\theta(x_{V \setminus i}, x_i))}{\sum_{\hat{x}_i} \sum_{x_{V \setminus i}} \exp(\theta(x_{V \setminus i}, \hat{x}_i))} = \frac{Z(\theta_{x_i})}{\sum_{\hat{x}_i} Z(\theta_{\hat{x}_i})} \le \frac{U_{x_i}}{\sum_{\hat{x}_i} L_{\hat{x}_i}}.

Similarly, p(x_i; \theta) \ge \frac{L_{x_i}}{\sum_{\hat{x}_i} U_{\hat{x}_i}}.
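A quick numerical sanity check of these bounds (not from the lecture), on an invented three-variable model where Z(\theta_{x_i}) can be computed exactly and the bounds L_{x_i}, U_{x_i} are taken to be \pm 10\% of the true value:

```python
import math
import random
from itertools import product

random.seed(2)

# Invented model over 3 binary variables, with a full table of theta values.
n = 3
theta = {x: random.uniform(-1, 1) for x in product([0, 1], repeat=n)}

def Z_clamped(i, xi):
    """Partition function of the distribution on X_{V \ i} with X_i = xi."""
    return sum(math.exp(theta[x]) for x in theta if x[i] == xi)

i = 0
exact = {xi: Z_clamped(i, xi) for xi in (0, 1)}
p_marg = {xi: exact[xi] / sum(exact.values()) for xi in (0, 1)}

# Hypothetical bounds L_{x_i} <= Z(theta_{x_i}) <= U_{x_i} (here: +/- 10%).
L = {xi: 0.9 * exact[xi] for xi in (0, 1)}
U = {xi: 1.1 * exact[xi] for xi in (0, 1)}

# The resulting interval always contains the true marginal.
for xi in (0, 1):
    lower = L[xi] / sum(U.values())
    upper = U[xi] / sum(L.values())
    assert lower <= p_marg[xi] <= upper
```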


SLIDE 11

Software packages

1. libDAI: http://www.libdai.org
   Mean-field, loopy sum-product BP, tree-reweighted BP, double-loop GBP

2. Infer.NET: http://research.microsoft.com/en-us/um/cambridge/projects/infernet/
   Mean-field, loopy sum-product BP. Also handles continuous variables.


SLIDE 12

Approximate marginal inference

Nearly all approximate marginal inference algorithms are either:

1. Variational algorithms (e.g., mean-field, TRW, loopy BP)
2. Monte Carlo methods (e.g., likelihood weighting, MCMC)

Unconditional sampling: how can one estimate marginals in a BN if there is no evidence?

Topologically sort the variables, forward sample (using the topological sort), and compute empirical marginals. Since these are independent samples, we can use a Chernoff bound to quantify accuracy. Small additive error with just a few samples! This doesn't contradict the hardness results, because these are unconditional queries.

Conditional sampling: what about computing p(X | e) = p(X, e)/p(e)?

Could try using forward sampling for both numerator and denominator, but in expectation would need at least 1/p(e) samples before \hat{p}(e) > 0. Thus, forward sampling won't work for conditional inference. We need new techniques.
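A minimal sketch of unconditional forward sampling (not from the lecture), on an invented two-node BN A \to B with illustrative CPT values:

```python
import random

random.seed(3)

# Invented tiny BN: A -> B.
pA = 0.3                       # p(A = 1)
pB_given_A = {0: 0.8, 1: 0.1}  # p(B = 1 | A = a)

def forward_sample():
    a = 1 if random.random() < pA else 0  # sample A first (topological order)
    b = 1 if random.random() < pB_given_A[a] else 0
    return a, b

N = 100_000
count_b = sum(forward_sample()[1] for _ in range(N))
estimate = count_b / N

# Exact unconditional marginal: p(B = 1) = sum_a p(a) p(B = 1 | a).
exact = (1 - pA) * pB_given_A[0] + pA * pB_given_A[1]

# Independent samples: a Chernoff/Hoeffding bound gives small additive error w.h.p.
assert abs(estimate - exact) < 0.01
```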


SLIDE 13

Recall from Lecture 4: Reducing satisfiability to marginal inference. Input: a 3-SAT formula with n variables Q_1, \ldots, Q_n and m clauses C_1, \ldots, C_m.

[Figure: Bayesian network for the reduction. Variables Q_1, \ldots, Q_n are parents of clause nodes C_1, \ldots, C_m, which are combined through gates A_1, \ldots, A_{m-2} into the final node X.]

    p(X = 1) = \sum_{q, c, a} p(Q = q, C = c, A = a, X = 1)

is equal to the number of satisfying assignments times \frac{1}{2^n}.

Thus, p(X = 1) > 0 if and only if the formula has a satisfying assignment. This shows that exact marginal inference is NP-hard.
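The identity p(X = 1) = (\#\text{satisfying assignments}) / 2^n can be checked by brute force on a tiny invented formula; the clause and gate nodes are deterministic, so summing them out leaves only the uniform prior over Q:

```python
from itertools import product

# Invented tiny 3-SAT formula: (Q1 v Q2 v ~Q3) AND (~Q1 v Q2 v Q3).
# Each clause is a list of (variable index, satisfying value) pairs.
clauses = [[(0, 1), (1, 1), (2, 0)], [(0, 0), (1, 1), (2, 1)]]
n = 3

def satisfies(q, clause):
    return any(q[i] == v for i, v in clause)

# Summing out the deterministic C, A, X nodes: each assignment q has prior
# 2^{-n} and contributes to p(X = 1) iff it satisfies every clause.
p_X1 = sum(1 / 2 ** n
           for q in product([0, 1], repeat=n)
           if all(satisfies(q, c) for c in clauses))

num_sat = sum(all(satisfies(q, c) for c in clauses)
              for q in product([0, 1], repeat=n))

assert abs(p_X1 - num_sat / 2 ** n) < 1e-12
assert (p_X1 > 0) == (num_sat > 0)  # marginal is positive iff satisfiable
```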


SLIDE 14

Recall from Lecture 4: Reducing satisfiability to approximate marginal inference.

Might there exist polynomial-time algorithms that can approximately answer marginal queries, i.e. for some \epsilon, find \rho such that \rho - \epsilon \le p(Y \mid E = e) \le \rho + \epsilon?

Suppose such an algorithm exists, for any \epsilon \in (0, 1/2). Consider the following:

1. Start with E = \{ X = 1 \}
2. For i = 1, \ldots, n:
3.     Let q_i = \arg\max_q p(Q_i = q \mid E)
4.     E \leftarrow E \cup \{ Q_i = q_i \}

At termination, E is a satisfying assignment (if one exists). Proof by induction:

- In iteration i, if there exists a satisfying assignment extending E for both q_i = 0 and q_i = 1, then the choice in line 3 does not matter.
- Otherwise, suppose there exists a satisfying assignment extending E for q_i = 1 but not for q_i = 0. Then p(Q_i = 1 \mid E) = 1 and p(Q_i = 0 \mid E) = 0. Even if approximate inference returned p(Q_i = 1 \mid E) = 0.501 and p(Q_i = 0 \mid E) = 0.499, we would still choose q_i = 1.

Thus, it is even NP-hard to approximately perform marginal inference!
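The greedy procedure can be simulated on a tiny invented formula (not from the lecture), using exact conditional marginals computed by brute-force enumeration as a stand-in for the hypothetical approximate-inference oracle:

```python
from itertools import product

# Invented 2-SAT-style formula: (Q1 v ~Q2) AND (Q2 v Q3) AND (~Q1 v ~Q3).
clauses = [[(0, 1), (1, 0)], [(1, 1), (2, 1)], [(0, 0), (2, 0)]]
n = 3

def satisfied(q):
    return all(any(q[i] == v for i, v in c) for c in clauses)

def p_Qi_given_sat(i, value, evidence):
    """p(Q_i = value | X = 1, evidence), with Q uniform a priori."""
    match = [q for q in product([0, 1], repeat=n)
             if satisfied(q) and all(q[j] == vj for j, vj in evidence.items())]
    if not match:
        return 0.0
    return sum(q[i] == value for q in match) / len(match)

# E starts as {X = 1}; here conditioning on X = 1 is the satisfied() filter.
evidence = {}
for i in range(n):
    qi = max((0, 1), key=lambda v: p_Qi_given_sat(i, v, evidence))
    evidence[i] = qi

# The greedy assignment satisfies the formula (one exists for this instance).
assert satisfied([evidence[i] for i in range(n)])
```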
