
Inference and Representation, David Sontag, New York University (PowerPoint presentation)



  1. Inference and Representation. David Sontag, New York University. Lecture 9, Nov. 11, 2014.

  2. Variational methods
  Suppose that we have an arbitrary graphical model:
  $$p(x; \theta) = \frac{1}{Z(\theta)} \prod_{c \in C} \phi_c(x_c) = \exp\Big( \sum_{c \in C} \theta_c(x_c) - \ln Z(\theta) \Big)$$
  All of the approaches begin as follows:
  $$\begin{aligned} D(q \,\|\, p) &= \sum_x q(x) \ln \frac{q(x)}{p(x)} \\ &= -\sum_x q(x) \ln p(x) + \sum_x q(x) \ln q(x) \\ &= -\sum_x q(x) \Big( \sum_{c \in C} \theta_c(x_c) - \ln Z(\theta) \Big) - H(q(x)) \\ &= -\sum_x q(x) \sum_{c \in C} \theta_c(x_c) + \sum_x q(x) \ln Z(\theta) - H(q(x)) \\ &= -\sum_{c \in C} E_q[\theta_c(x_c)] + \ln Z(\theta) - H(q(x)). \end{aligned}$$
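A minimal brute-force sketch of this identity on a tiny pairwise MRF (a 3-variable binary chain). The graph, the random potentials, and all names are illustrative assumptions, not taken from the lecture; the point is only to check numerically that $D(q\|p) = -\sum_c E_q[\theta_c(x_c)] + \ln Z(\theta) - H(q)$.

```python
# Assumed toy setup: 3 binary variables in a chain with random log-potentials.
import itertools
import numpy as np

rng = np.random.default_rng(0)
edges = [(0, 1), (1, 2)]                                   # chain x0 - x1 - x2
theta = {e: rng.normal(size=(2, 2)) for e in edges}        # log-potentials theta_c(x_c)

states = list(itertools.product([0, 1], repeat=3))

def score(x):
    """sum_c theta_c(x_c) for a full assignment x."""
    return sum(theta[(i, j)][x[i], x[j]] for (i, j) in edges)

# Exact distribution p(x) = exp(score(x) - ln Z)
log_unnorm = np.array([score(x) for x in states])
logZ = np.log(np.exp(log_unnorm).sum())
p = np.exp(log_unnorm - logZ)

# An arbitrary distribution q(x)
q = rng.random(len(states)); q /= q.sum()

kl = np.sum(q * (np.log(q) - np.log(p)))          # D(q || p) computed directly
expected_score = np.sum(q * log_unnorm)           # sum_c E_q[theta_c(x_c)]
entropy_q = -np.sum(q * np.log(q))                # H(q)
print(kl, -expected_score + logZ - entropy_q)     # the two quantities agree
```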

  3. The log-partition function
  Since $D(q \,\|\, p) \geq 0$, we have
  $$-\sum_{c \in C} E_q[\theta_c(x_c)] + \ln Z(\theta) - H(q(x)) \geq 0,$$
  which implies that
  $$\ln Z(\theta) \geq \sum_{c \in C} E_q[\theta_c(x_c)] + H(q(x)).$$
  Thus, any approximating distribution q(x) gives a lower bound on the log-partition function (for a BN, this is the log probability of the observed variables).
  Recall that $D(q \,\|\, p) = 0$ if and only if $p = q$. Thus, if we allow ourselves to optimize over all distributions, we have:
  $$\ln Z(\theta) = \max_q \sum_{c \in C} E_q[\theta_c(x_c)] + H(q(x)).$$
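The lower bound is easy to see numerically. The sketch below (again an assumed toy MRF, this time a triangle) evaluates $\sum_c E_q[\theta_c(x_c)] + H(q)$ for several random distributions $q$ and at $q = p$, where the bound is tight.

```python
# Assumed toy setup: 3 binary variables on a triangle with random log-potentials.
import itertools
import numpy as np

rng = np.random.default_rng(1)
edges = [(0, 1), (1, 2), (0, 2)]
theta = {e: rng.normal(size=(2, 2)) for e in edges}
states = list(itertools.product([0, 1], repeat=3))

log_unnorm = np.array([sum(theta[e][x[e[0]], x[e[1]]] for e in edges) for x in states])
logZ = np.log(np.exp(log_unnorm).sum())
p = np.exp(log_unnorm - logZ)

def bound(q):
    """sum_c E_q[theta_c(x_c)] + H(q): a lower bound on ln Z for any q."""
    return np.sum(q * log_unnorm) - np.sum(q * np.log(q))

for _ in range(5):                        # random q's give strict lower bounds
    q = rng.random(len(states)); q /= q.sum()
    assert bound(q) <= logZ + 1e-9
print(bound(p), logZ)                     # equal when q = p
```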

  4. Re-writing objective in terms of moments
  $$\begin{aligned} \ln Z(\theta) &= \max_q \sum_{c \in C} E_q[\theta_c(x_c)] + H(q(x)) \\ &= \max_q \sum_{c \in C} \sum_x q(x)\,\theta_c(x_c) + H(q(x)) \\ &= \max_q \sum_{c \in C} \sum_{x_c} q(x_c)\,\theta_c(x_c) + H(q(x)). \end{aligned}$$
  Assume that p(x) is in the exponential family, and let f(x) be its sufficient statistic vector.
  Define $\mu_q = E_q[f(x)]$ to be the marginals of q(x).
  We can re-write the objective as
  $$\ln Z(\theta) = \max_{\mu \in M} \; \max_{q :\, E_q[f(x)] = \mu} \; \sum_{c \in C} \sum_{x_c} \theta_c(x_c)\,\mu_c(x_c) + H(q(x)),$$
  where M, the marginal polytope, consists of all valid marginal vectors.

  5. Re-writing objective in terms of moments
  Next, push the max over q inside to obtain:
  $$\ln Z(\theta) = \max_{\mu \in M} \sum_{c \in C} \sum_{x_c} \theta_c(x_c)\,\mu_c(x_c) + H(\mu), \quad \text{where} \quad H(\mu) = \max_{q :\, E_q[f(x)] = \mu} H(q).$$
  For discrete random variables, the marginal polytope M is given by
  $$M = \Big\{ \mu \in \mathbb{R}^d \;\Big|\; \mu = \sum_{x \in \mathcal{X}^m} p(x) f(x) \text{ for some } p(x) \geq 0, \; \sum_{x \in \mathcal{X}^m} p(x) = 1 \Big\} = \mathrm{conv}\big\{ f(x),\; x \in \mathcal{X}^m \big\}$$
  (conv denotes the convex hull operation).
  For a discrete-variable MRF, the sufficient statistic vector f(x) is simply the concatenation of indicator functions for each clique of variables that appear together in a potential function.
  For example, if we have a pairwise MRF on binary variables with m = |V| variables and |E| edges, d = 2m + 4|E|.
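A small sketch of this sufficient statistic map for a pairwise binary MRF, with the graph and names chosen for illustration: concatenated singleton and edge indicators give a vector of dimension 2m + 4|E|, the assignments' feature vectors are the vertices of M, and any distribution over assignments yields a point E[f(x)] inside the polytope.

```python
# Assumed toy graph: m = 3 binary variables connected in a triangle.
import itertools
import numpy as np

m, edges = 3, [(0, 1), (1, 2), (0, 2)]

def f(x):
    """Indicators 1{x_i = v} for all i, v, then 1{(x_i, x_j) = (a, b)} for all edges."""
    single = [float(x[i] == v) for i in range(m) for v in (0, 1)]
    pair = [float((x[i], x[j]) == (a, b)) for (i, j) in edges for a in (0, 1) for b in (0, 1)]
    return np.array(single + pair)

vertices = np.array([f(x) for x in itertools.product([0, 1], repeat=m)])
print(vertices.shape)     # (2**m, 2*m + 4*|E|) -> (8, 18)

# Any distribution over assignments gives mu = E[f(x)], a point in the polytope.
rng = np.random.default_rng(0)
p = rng.random(2 ** m); p /= p.sum()
mu = p @ vertices         # convex combination of vertices: a valid marginal vector
print(mu[:6])             # singleton marginals mu_i(x_i); each variable's pair sums to 1
```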

  6. Marginal polytope for discrete MRFs
  [Figure (Wainwright & Jordan, '03): each vertex of the marginal polytope is the sufficient statistic vector $\mu$ of one joint assignment to $(X_1, X_2, X_3)$, built from one indicator block per variable and one per edge ($X_1X_2$, $X_1X_3$, $X_2X_3$). Convex combinations of assignment vectors, such as $\frac{1}{2}\mu + \frac{1}{2}\mu'$, give valid marginal probabilities.]

  7. Relaxation
  $$\ln Z(\theta) = \max_{\mu \in M} \sum_{c \in C} \sum_{x_c} \theta_c(x_c)\,\mu_c(x_c) + H(\mu)$$
  We still haven't achieved anything, because:
  1. The marginal polytope M is complex to describe (in general, exponentially many vertices and facets)
  2. $H(\mu)$ is very difficult to compute or optimize over
  We now make two approximations:
  1. We replace M with a relaxation of the marginal polytope, e.g. the local consistency constraints $M_L$
  2. We replace $H(\mu)$ with a function $\tilde{H}(\mu)$ which approximates $H(\mu)$

  8. Local consistency constraints
  Force every "cluster" of variables to choose a local assignment:
  $$\mu_i(x_i) \geq 0 \quad \forall i \in V, x_i \qquad\qquad \sum_{x_i} \mu_i(x_i) = 1 \quad \forall i \in V$$
  $$\mu_{ij}(x_i, x_j) \geq 0 \quad \forall ij \in E,\, x_i, x_j \qquad\qquad \sum_{x_i, x_j} \mu_{ij}(x_i, x_j) = 1 \quad \forall ij \in E$$
  Enforce that these local assignments are globally consistent:
  $$\mu_i(x_i) = \sum_{x_j} \mu_{ij}(x_i, x_j) \quad \forall ij \in E, x_i \qquad\qquad \mu_j(x_j) = \sum_{x_i} \mu_{ij}(x_i, x_j) \quad \forall ij \in E, x_j$$
  The local consistency polytope $M_L$ is defined by these constraints.
  Look familiar? These are the same local consistency constraints used in Lecture 6 for the linear programming relaxation of MAP inference!
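A minimal checker for these constraints, written as an illustrative sketch (the data structures and function name are assumptions, not the lecture's code): it tests nonnegativity, normalization, and agreement of edge marginals with singleton marginals.

```python
import numpy as np

def in_local_polytope(mu_i, mu_ij, edges, tol=1e-9):
    """mu_i: dict i -> array over x_i; mu_ij: dict (i, j) -> 2D array over (x_i, x_j)."""
    for i, m in mu_i.items():
        if (m < -tol).any() or abs(m.sum() - 1) > tol:        # mu_i >= 0, sums to 1
            return False
    for (i, j) in edges:
        m = mu_ij[(i, j)]
        if (m < -tol).any() or abs(m.sum() - 1) > tol:        # mu_ij >= 0, sums to 1
            return False
        if not np.allclose(m.sum(axis=1), mu_i[i], atol=tol): # sum_{x_j} mu_ij = mu_i
            return False
        if not np.allclose(m.sum(axis=0), mu_i[j], atol=tol): # sum_{x_i} mu_ij = mu_j
            return False
    return True

# Example: uniform pseudomarginals on a single edge pass the check.
mu_i = {0: np.array([0.5, 0.5]), 1: np.array([0.5, 0.5])}
mu_ij = {(0, 1): np.full((2, 2), 0.25)}
print(in_local_polytope(mu_i, mu_ij, [(0, 1)]))   # True
```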

  9. Local consistency constraints are exact for trees
  The marginal polytope depends on the specific sufficient statistic vector f(x).
  Theorem: The local consistency constraints exactly define the marginal polytope for a tree-structured MRF.
  Proof: Consider any pseudo-marginal vector $\mu \in M_L$. We will specify a distribution $p_T(x)$ for which $\mu_i(x_i)$ and $\mu_{ij}(x_i, x_j)$ are the singleton and pairwise marginals of the distribution $p_T$.
  Let $X_1$ be the root of the tree, and direct edges away from the root. Then,
  $$p_T(x) = \mu_1(x_1) \prod_{i \in V \setminus \{1\}} \frac{\mu_{i,\,pa(i)}(x_i, x_{pa(i)})}{\mu_{pa(i)}(x_{pa(i)})}.$$
  Because of the local consistency constraints, each term in the product can be interpreted as a conditional probability.
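The sketch below illustrates the construction on an assumed toy tree (a 3-node star): take the exact marginals of a tree-structured distribution, form $p_T$ as above, and check that it reproduces the original distribution.

```python
# Assumed toy tree: node 0 is the root, nodes 1 and 2 hang off it.
import itertools
import numpy as np

rng = np.random.default_rng(0)
parent = {1: 0, 2: 0}
theta = {(1, 0): rng.normal(size=(2, 2)), (2, 0): rng.normal(size=(2, 2))}

states = list(itertools.product([0, 1], repeat=3))
unnorm = np.array([np.exp(sum(theta[(i, pa)][x[i], x[pa]] for i, pa in parent.items()))
                   for x in states])
p = unnorm / unnorm.sum()

def marg(p, idx):
    """Brute-force marginal over the variables in the tuple idx."""
    out = {}
    for x, pr in zip(states, p):
        key = tuple(x[i] for i in idx)
        out[key] = out.get(key, 0.0) + pr
    return out

mu_root = marg(p, (0,))
mu_edge = {i: marg(p, (i, parent[i])) for i in parent}
mu_pa = {i: marg(p, (parent[i],)) for i in parent}

# Tree reparameterization p_T(x) = mu_1(x_1) * prod_i mu_{i,pa(i)} / mu_{pa(i)}
pT = np.array([
    mu_root[(x[0],)]
    * np.prod([mu_edge[i][(x[i], x[parent[i]])] / mu_pa[i][(x[parent[i]],)] for i in parent])
    for x in states
])
print(np.allclose(p, pT))   # True: on a tree, p_T recovers the original distribution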

  10. Example for non-tree models
  For non-trees, the local consistency constraints are an outer bound on the marginal polytope.
  Example of $\mu \in M_L \setminus M$ for an MRF on binary variables $X_1, X_2, X_3$ connected in a triangle: on every edge, set
  $$\mu_{ij}(x_i, x_j) = \begin{cases} 0.5 & \text{if } x_i \neq x_j \\ 0 & \text{if } x_i = x_j, \end{cases}$$
  i.e. each edge asserts that its endpoints disagree with probability 1.
  To see that this is not in M, note that it violates the following triangle inequality (valid for marginals of MRFs on binary variables):
  $$\sum_{x_1 \neq x_2} \mu_{1,2}(x_1, x_2) + \sum_{x_2 \neq x_3} \mu_{2,3}(x_2, x_3) + \sum_{x_1 \neq x_3} \mu_{1,3}(x_1, x_3) \leq 2.$$
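A quick numeric check of this counterexample (illustrative code, assuming the pseudomarginal above on all three edges): it passes the local consistency constraints yet the three disagreement probabilities sum to 3 > 2, so no joint distribution can have these marginals.

```python
import numpy as np

mu_edge = np.array([[0.0, 0.5],
                    [0.5, 0.0]])            # mu_ij(x_i, x_j), same on all three edges
mu_i = mu_edge.sum(axis=1)                  # [0.5, 0.5]: implied singleton marginals

# Local consistency holds: normalized and both row/column sums agree with mu_i.
print(np.isclose(mu_edge.sum(), 1.0), np.allclose(mu_edge.sum(axis=0), mu_i))

# Triangle inequality for binary MRFs:
#   sum_{x1 != x2} mu_12 + sum_{x2 != x3} mu_23 + sum_{x1 != x3} mu_13 <= 2
disagree = mu_edge[0, 1] + mu_edge[1, 0]    # P(x_i != x_j) on one edge = 1.0
print(3 * disagree)                          # 3.0 > 2: not in the marginal polytope
```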

  11. Maximum entropy (MaxEnt)
  Recall that $H(\mu) = \max_{q :\, E_q[f(x)] = \mu} H(q)$ is the entropy of the maximum entropy distribution with marginals $\mu$.
  This yields the optimization problem:
  $$\max_q \; H(q(x)) = -\sum_x q(x) \log q(x)$$
  $$\text{s.t.} \quad \sum_x q(x) f_i(x) = \alpha_i, \qquad \sum_x q(x) = 1$$
  (strictly concave w.r.t. q(x)).
  E.g., when doing inference in a pairwise MRF, the $\alpha_i$ will correspond to $\mu_l(x_l)$ and $\mu_{lk}(x_l, x_k)$ for all $(l, k) \in E$, $x_l, x_k$.

  12. What does the MaxEnt solution look like?
  To solve the MaxEnt problem, we form the Lagrangian:
  $$L = -\sum_x q(x) \log q(x) - \sum_i \lambda_i \Big( \sum_x q(x) f_i(x) - \alpha_i \Big) - \lambda_{sum} \Big( \sum_x q(x) - 1 \Big)$$
  Then, taking the derivative of the Lagrangian,
  $$\frac{\partial L}{\partial q(x)} = -1 - \log q(x) - \sum_i \lambda_i f_i(x) - \lambda_{sum},$$
  and setting it to zero, we obtain:
  $$q^*(x) = \exp\Big( -1 - \lambda_{sum} - \sum_i \lambda_i f_i(x) \Big) = e^{-1 - \lambda_{sum}}\, e^{-\sum_i \lambda_i f_i(x)}$$
  From the constraint $\sum_x q(x) = 1$ we obtain $e^{1 + \lambda_{sum}} = \sum_x e^{-\sum_i \lambda_i f_i(x)} = Z(\lambda)$.
  We conclude that the maximum entropy distribution has the form (substituting $\theta$ for $-\lambda$)
  $$q^*(x) = \frac{1}{Z(\theta)} \exp(\theta \cdot f(x))$$
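A small sketch of this conclusion, under assumed toy choices (a single variable with 3 states, one feature f(x) = x, target moment alpha = 1.3): solve the MaxEnt problem numerically and check that the solution has exponential family form, i.e. log q* is affine in f(x).

```python
import numpy as np
from scipy.optimize import minimize

f = np.array([0.0, 1.0, 2.0])     # sufficient statistic of each state (assumed)
alpha = 1.3                       # target moment, an arbitrary feasible value

def neg_entropy(q):
    return np.sum(q * np.log(q + 1e-12))   # minimize -H(q)

constraints = [
    {"type": "eq", "fun": lambda q: q.sum() - 1.0},     # normalization
    {"type": "eq", "fun": lambda q: q @ f - alpha},     # moment constraint
]
res = minimize(neg_entropy, x0=np.full(3, 1 / 3), bounds=[(1e-9, 1)] * 3,
               constraints=constraints, method="SLSQP")
q_star = res.x

# Exponential family check: log q*(x) = theta * x - A, so successive log gaps
# are both equal to the same theta (up to solver tolerance).
print(q_star, np.diff(np.log(q_star)))
```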

  13. Entropy for tree-structured models
  Suppose that p is a tree-structured distribution, so that we are optimizing only over marginals $\mu_{ij}(x_i, x_j)$ for $ij \in T$.
  We conclude from the previous slide that $\arg\max_{q :\, E_q[f(x)] = \mu} H(q)$ is a tree-structured MRF.
  The entropy of q as a function of its marginals can be shown to be
  $$H(\mu) = \sum_{i \in V} H(\mu_i) - \sum_{ij \in T} I(\mu_{ij}),$$
  where
  $$H(\mu_i) = -\sum_{x_i} \mu_i(x_i) \log \mu_i(x_i), \qquad I(\mu_{ij}) = \sum_{x_i, x_j} \mu_{ij}(x_i, x_j) \log \frac{\mu_{ij}(x_i, x_j)}{\mu_i(x_i)\,\mu_j(x_j)}.$$
  Can we use this for non-tree structured models?
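A brute-force check of this decomposition on an assumed toy tree (a 3-node binary chain with random potentials): the exact entropy of p equals the sum of singleton entropies minus the edge mutual informations.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
edges = [(0, 1), (1, 2)]                                   # a tree (chain)
theta = {e: rng.normal(size=(2, 2)) for e in edges}
states = list(itertools.product([0, 1], repeat=3))

unnorm = np.array([np.exp(sum(theta[e][x[e[0]], x[e[1]]] for e in edges)) for x in states])
p = unnorm / unnorm.sum()

def marginal(idx):
    """Brute-force marginal of p over the variables in the tuple idx."""
    out = np.zeros((2,) * len(idx))
    for x, pr in zip(states, p):
        out[tuple(x[i] for i in idx)] += pr
    return out

H_exact = -np.sum(p * np.log(p))

H_singles = sum(-np.sum(m * np.log(m)) for m in (marginal((i,)) for i in range(3)))
I_edges = 0.0
for (i, j) in edges:
    mij, mi, mj = marginal((i, j)), marginal((i,)), marginal((j,))
    I_edges += np.sum(mij * np.log(mij / np.outer(mi, mj)))

print(H_exact, H_singles - I_edges)   # the two agree on a tree
```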

  14. Bethe free energy approximation
  The Bethe entropy approximation is (for any graph)
  $$H_{bethe}(\mu) = \sum_{i \in V} H(\mu_i) - \sum_{ij \in E} I(\mu_{ij})$$
  This gives the following variational approximation:
  $$\max_{\mu \in M_L} \sum_{c \in C} \sum_{x_c} \theta_c(x_c)\,\mu_c(x_c) + H_{bethe}(\mu)$$
  For non-tree-structured models this is not concave, and is hard to maximize.
  Loopy belief propagation, if it converges, finds a saddle point!
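As a final illustrative sketch (assumed toy triangle MRF, not from the lecture), we can evaluate the Bethe objective at the exact marginals of a loopy graph and compare it with the true $\ln Z$: on a loopy graph the Bethe entropy is only an approximation, so the two generally differ, whereas on a tree they would coincide by the previous slide.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
edges = [(0, 1), (1, 2), (0, 2)]                           # loopy graph: a triangle
theta = {e: rng.normal(size=(2, 2)) for e in edges}
states = list(itertools.product([0, 1], repeat=3))

log_unnorm = np.array([sum(theta[e][x[e[0]], x[e[1]]] for e in edges) for x in states])
logZ = np.log(np.exp(log_unnorm).sum())
p = np.exp(log_unnorm - logZ)

def marginal(idx):
    out = np.zeros((2,) * len(idx))
    for x, pr in zip(states, p):
        out[tuple(x[i] for i in idx)] += pr
    return out

# Bethe objective: sum_c <theta_c, mu_c> + sum_i H(mu_i) - sum_ij I(mu_ij)
energy = sum(np.sum(theta[e] * marginal(e)) for e in edges)
H_bethe = sum(-np.sum(m * np.log(m)) for m in (marginal((i,)) for i in range(3)))
for (i, j) in edges:
    mij, mi, mj = marginal((i, j)), marginal((i,)), marginal((j,))
    H_bethe -= np.sum(mij * np.log(mij / np.outer(mi, mj)))

print(energy + H_bethe, logZ)   # close, but generally not equal on a loopy graph
```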
