

Slide 1: Inference and Representation

David Sontag, New York University
Lecture 9, Nov. 11, 2014


Slide 2: Variational methods

Suppose that we have an arbitrary graphical model:

$$p(x; \theta) = \frac{1}{Z(\theta)} \prod_{c \in C} \phi_c(x_c) = \exp\Big(\sum_{c \in C} \theta_c(x_c) - \ln Z(\theta)\Big)$$

All of the approaches begin as follows:

$$\begin{aligned}
D(q \| p) &= \sum_x q(x) \ln \frac{q(x)}{p(x)} = -\sum_x q(x) \ln p(x) - \sum_x q(x) \ln \frac{1}{q(x)} \\
&= -\sum_x q(x) \Big(\sum_{c \in C} \theta_c(x_c) - \ln Z(\theta)\Big) - H(q(x)) \\
&= -\sum_{c \in C} \sum_x q(x)\,\theta_c(x_c) + \sum_x q(x) \ln Z(\theta) - H(q(x)) \\
&= -\sum_{c \in C} \mathbb{E}_q[\theta_c(x_c)] + \ln Z(\theta) - H(q(x)).
\end{aligned}$$
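To make the decomposition concrete, here is a minimal brute-force sketch in Python (the 3-node chain model, its potentials, and the seed are illustrative assumptions, not from the lecture) verifying that $D(q \| p) = -\sum_c \mathbb{E}_q[\theta_c(x_c)] + \ln Z(\theta) - H(q)$ on an exhaustively enumerated model:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Toy pairwise model on a binary 3-node chain (assumed example):
# potentials theta_{12} and theta_{23}.
theta = {(0, 1): rng.normal(size=(2, 2)), (1, 2): rng.normal(size=(2, 2))}
states = list(itertools.product([0, 1], repeat=3))

def log_score(x):
    # sum_c theta_c(x_c) for a single joint assignment x
    return sum(t[x[i], x[j]] for (i, j), t in theta.items())

log_Z = np.log(sum(np.exp(log_score(x)) for x in states))
p = np.array([np.exp(log_score(x) - log_Z) for x in states])   # exact p(x; theta)

q = rng.random(len(states)); q /= q.sum()                      # an arbitrary q(x)

kl = float(np.sum(q * np.log(q / p)))                          # D(q || p)
expected_theta = sum(qx * log_score(x) for qx, x in zip(q, states))
entropy = float(-np.sum(q * np.log(q)))                        # H(q)

# The decomposition derived above:
assert np.isclose(kl, -expected_theta + log_Z - entropy)
```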


Slide 3: The log-partition function

Since $D(q \| p) \geq 0$, we have

$$-\sum_{c \in C} \mathbb{E}_q[\theta_c(x_c)] + \ln Z(\theta) - H(q(x)) \geq 0,$$

which implies that

$$\ln Z(\theta) \geq \sum_{c \in C} \mathbb{E}_q[\theta_c(x_c)] + H(q(x)).$$

Thus, any approximating distribution q(x) gives a lower bound on the log-partition function (for a Bayesian network, this is the log probability of the observed variables).

Recall that $D(q \| p) = 0$ if and only if $p = q$. Thus, if we allow ourselves to optimize over all distributions, we have:

$$\ln Z(\theta) = \max_q \sum_{c \in C} \mathbb{E}_q[\theta_c(x_c)] + H(q(x)).$$
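Continuing in the same spirit (again an assumed toy model, not the lecture's code), a quick numerical check that every distribution q gives a lower bound on ln Z(θ), with equality at q = p:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
theta = {(0, 1): rng.normal(size=(2, 2)), (1, 2): rng.normal(size=(2, 2))}
states = list(itertools.product([0, 1], repeat=3))
score = np.array([sum(t[x[i], x[j]] for (i, j), t in theta.items()) for x in states])
log_Z = np.log(np.exp(score).sum())

def lower_bound(q):
    # sum_c E_q[theta_c(x_c)] + H(q), by exhaustive enumeration
    return q @ score - q @ np.log(q)

for _ in range(5):                       # random q: always a lower bound
    q = rng.random(len(states)); q /= q.sum()
    assert lower_bound(q) <= log_Z + 1e-12

p = np.exp(score - log_Z)                # q = p: the bound is tight
assert np.isclose(lower_bound(p), log_Z)
```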


Slide 4: Re-writing objective in terms of moments

$$\ln Z(\theta) = \max_q \sum_{c \in C} \mathbb{E}_q[\theta_c(x_c)] + H(q(x)) = \max_q \sum_{c \in C} \sum_x q(x)\,\theta_c(x_c) + H(q(x)) = \max_q \sum_{c \in C} \sum_{x_c} q(x_c)\,\theta_c(x_c) + H(q(x)).$$

Assume that p(x) is in the exponential family, and let f(x) be its sufficient statistic vector.

Define $\mu_q = \mathbb{E}_q[f(x)]$ to be the marginals of q(x).

We can re-write the objective as

$$\ln Z(\theta) = \max_{\mu \in M} \ \max_{q : \mathbb{E}_q[f(x)] = \mu} \sum_{c \in C} \sum_{x_c} \theta_c(x_c)\,\mu_c(x_c) + H(q(x)),$$

where M, the marginal polytope, consists of all valid marginal vectors.


Slide 5: Re-writing objective in terms of moments (continued)

Next, push the max over q inside to obtain:

$$\ln Z(\theta) = \max_{\mu} \sum_{c \in C} \sum_{x_c} \theta_c(x_c)\,\mu_c(x_c) + H(\mu), \quad \text{where } H(\mu) = \max_{q : \mathbb{E}_q[f(x)] = \mu} H(q).$$

For discrete random variables, the marginal polytope M is given by

$$M = \Big\{ \mu \in \mathbb{R}^d \;\Big|\; \mu = \sum_{x \in \mathcal{X}^m} p(x) f(x) \ \text{for some } p(x) \geq 0, \ \sum_{x \in \mathcal{X}^m} p(x) = 1 \Big\} = \operatorname{conv}\{ f(x) : x \in \mathcal{X}^m \}$$

(conv denotes the convex hull operation).

For a discrete-variable MRF, the sufficient statistic vector f(x) is simply the concatenation of indicator functions for each clique of variables that appear together in a potential function.

For example, if we have a pairwise MRF on binary variables with m = |V| variables and |E| edges, then d = 2m + 4|E|.
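As an illustrative sketch (Python; the triangle graph is an assumed example), the sufficient statistic vector f(x) for a binary pairwise MRF can be built explicitly, confirming d = 2m + 4|E| and exhibiting the vertices whose convex hull is M:

```python
import itertools
import numpy as np

V = [0, 1, 2]                        # assumed example: binary MRF on a triangle
E = [(0, 1), (1, 2), (0, 2)]

def f(x):
    # Concatenated indicators: 2 per node (x_i = 0 or 1), 4 per edge.
    node = [int(x[i] == v) for i in V for v in (0, 1)]
    edge = [int((x[i], x[j]) == (a, b))
            for (i, j) in E for a in (0, 1) for b in (0, 1)]
    return np.array(node + edge)

# The vertices of the marginal polytope M = conv{f(x) : x in X^m}:
vertices = np.array([f(x) for x in itertools.product([0, 1], repeat=len(V))])
assert vertices.shape == (2 ** len(V), 2 * len(V) + 4 * len(E))  # d = 2m + 4|E| = 18

# Any distribution p over joint states yields a point mu in M:
p = np.full(len(vertices), 1 / len(vertices))
mu = p @ vertices      # uniform p: every mu_i(x_i) = 0.5, mu_ij(x_i, x_j) = 0.25
```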


Slide 6: Marginal polytope for discrete MRFs

[Figure (Wainwright & Jordan, '03): two vertices μ1 and μ2 of the marginal polytope for a binary MRF on X1, X2, X3. Each vertex stacks the indicator assignment for each variable (X1, X2, X3) with the edge assignments for X1X2, X1X3, X2X3 under a single joint assignment; the convex combination ½μ1 + ½μ2 again gives valid marginal probabilities.]


Slide 7: Relaxation

$$\ln Z(\theta) = \max_{\mu \in M} \sum_{c \in C} \sum_{x_c} \theta_c(x_c)\,\mu_c(x_c) + H(\mu)$$

We still haven't achieved anything, because:

1. The marginal polytope M is complex to describe (in general, it has exponentially many vertices and facets).
2. H(μ) is very difficult to compute or optimize over.

We now make two approximations:

1. We replace M with a relaxation of the marginal polytope, e.g. the local consistency constraints M_L.
2. We replace H(μ) with a function $\tilde{H}(\mu)$ which approximates H(μ).


Slide 8: Local consistency constraints

Force every "cluster" of variables to choose a local assignment:

$$\mu_i(x_i) \geq 0 \quad \forall i \in V, x_i \qquad\qquad \sum_{x_i} \mu_i(x_i) = 1 \quad \forall i \in V$$
$$\mu_{ij}(x_i, x_j) \geq 0 \quad \forall ij \in E, x_i, x_j \qquad\qquad \sum_{x_i, x_j} \mu_{ij}(x_i, x_j) = 1 \quad \forall ij \in E$$

Enforce that these local assignments are globally consistent:

$$\mu_i(x_i) = \sum_{x_j} \mu_{ij}(x_i, x_j) \quad \forall ij \in E, x_i \qquad\qquad \mu_j(x_j) = \sum_{x_i} \mu_{ij}(x_i, x_j) \quad \forall ij \in E, x_j$$

The local consistency polytope M_L is defined by these constraints.

Look familiar? These are the same local consistency constraints as used in Lecture 6 for the linear programming relaxation of MAP inference!
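A minimal membership check for M_L (a sketch; the function name and the uniform-triangle example are assumptions, not from the lecture): it simply verifies the nonnegativity, normalization, and marginalization constraints above.

```python
import numpy as np

def in_local_polytope(mu_node, mu_edge, tol=1e-9):
    """mu_node: {i: shape-(k,) array}; mu_edge: {(i, j): shape-(k, k) array}."""
    for mu_i in mu_node.values():          # nonnegativity and normalization
        if (mu_i < -tol).any() or not np.isclose(mu_i.sum(), 1.0):
            return False
    for (i, j), mu_ij in mu_edge.items():
        if (mu_ij < -tol).any() or not np.isclose(mu_ij.sum(), 1.0):
            return False
        # marginalization (global consistency) constraints
        if not np.allclose(mu_ij.sum(axis=1), mu_node[i], atol=tol):
            return False
        if not np.allclose(mu_ij.sum(axis=0), mu_node[j], atol=tol):
            return False
    return True

# Example: uniform pairwise pseudomarginals on a binary triangle lie in M_L.
nodes = {i: np.array([0.5, 0.5]) for i in range(3)}
edges = {e: np.full((2, 2), 0.25) for e in [(0, 1), (1, 2), (0, 2)]}
assert in_local_polytope(nodes, edges)
```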


Slide 9: Local consistency constraints are exact for trees

The marginal polytope depends on the specific sufficient statistic vector f(x).

Theorem: The local consistency constraints exactly define the marginal polytope for a tree-structured MRF.

Proof: Consider any pseudo-marginal vector μ ∈ M_L. We will specify a distribution p_T(x) for which μ_i(x_i) and μ_ij(x_i, x_j) are the singleton and pairwise marginals of the distribution p_T. Let X_1 be the root of the tree, and direct edges away from the root. Then

$$p_T(x) = \mu_1(x_1) \prod_{i \in V \setminus \{1\}} \frac{\mu_{i, \mathrm{pa}(i)}(x_i, x_{\mathrm{pa}(i)})}{\mu_{\mathrm{pa}(i)}(x_{\mathrm{pa}(i)})}.$$

Because of the local consistency constraints, each term in the product can be interpreted as a conditional probability.
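A numerical sketch of this construction (Python; the 3-node chain and the randomly generated locally consistent pseudomarginals are assumptions): form p_T as above and verify that its exact marginals reproduce μ.

```python
import numpy as np

rng = np.random.default_rng(2)

# Chain X1 - X2 - X3 (a tree), rooted at X1. Build locally consistent
# pseudomarginals by construction: overlapping singleton marginals must agree.
mu12 = rng.random((2, 2)); mu12 /= mu12.sum()
mu1, mu2 = mu12.sum(axis=1), mu12.sum(axis=0)
cond = rng.random((2, 2)); cond /= cond.sum(axis=1, keepdims=True)
mu23 = mu2[:, None] * cond            # row sums of mu23 equal mu2

def p_T(x1, x2, x3):
    # mu_1(x1) * [mu_12(x1,x2)/mu_1(x1)] * [mu_23(x2,x3)/mu_2(x2)], as on the slide
    return mu1[x1] * (mu12[x1, x2] / mu1[x1]) * (mu23[x2, x3] / mu2[x2])

joint = np.array([[[p_T(a, b, c) for c in (0, 1)] for b in (0, 1)] for a in (0, 1)])
assert np.isclose(joint.sum(), 1.0)
assert np.allclose(joint.sum(axis=(1, 2)), mu1)   # singleton marginal of X1
assert np.allclose(joint.sum(axis=2), mu12)       # pairwise marginal on (X1, X2)
assert np.allclose(joint.sum(axis=0), mu23)       # pairwise marginal on (X2, X3)
```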


Slide 10: Example for non-tree models

For non-trees, the local consistency constraints are an outer bound on the marginal polytope.

Example of μ ∈ M_L \ M for an MRF on binary variables X1, X2, X3 connected in a triangle, where every edge ij carries the pairwise pseudomarginal

$$\mu_{ij}(x_i, x_j) = \begin{pmatrix} 0 & 0.5 \\ 0.5 & 0 \end{pmatrix}$$

(rows indexed by $x_i \in \{0, 1\}$, columns by $x_j \in \{0, 1\}$), so each pair of neighboring variables disagrees with probability 1.

To see that this is not in M, note that it violates the following triangle inequality (valid for marginals of MRFs on binary variables):

$$\sum_{x_1 \neq x_2} \mu_{1,2}(x_1, x_2) + \sum_{x_2 \neq x_3} \mu_{2,3}(x_2, x_3) + \sum_{x_1 \neq x_3} \mu_{1,3}(x_1, x_3) \leq 2.$$
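An illustrative check (assumed Python sketch) that this pseudomarginal satisfies all local consistency constraints yet violates the triangle inequality:

```python
import numpy as np

mu_edge = np.array([[0.0, 0.5],
                    [0.5, 0.0]])                 # the same table on all three edges
mu_node = np.array([0.5, 0.5])

# Locally consistent: normalized, and both marginalizations give mu_node.
assert np.isclose(mu_edge.sum(), 1.0)
assert np.allclose(mu_edge.sum(axis=1), mu_node)
assert np.allclose(mu_edge.sum(axis=0), mu_node)

# Triangle inequality for binary MRFs: total disagreement mass is at most 2.
disagreement = mu_edge[0, 1] + mu_edge[1, 0]     # P(x_i != x_j) = 1 on each edge
assert 3 * disagreement > 2                      # 3 > 2, so mu is in M_L but not M
```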


Slide 11: Maximum entropy (MaxEnt)

Recall that $H(\mu) = \max_{q : \mathbb{E}_q[f(x)] = \mu} H(q)$ is the entropy of the maximum entropy distribution with marginals μ.

This yields the optimization problem:

$$\max_q \ H(q(x)) = -\sum_x q(x) \log q(x)$$
$$\text{s.t.} \quad \sum_x q(x) f_i(x) = \alpha_i, \qquad \sum_x q(x) = 1$$

(strictly concave w.r.t. q(x)).

E.g., when doing inference in a pairwise MRF, the α_i will correspond to μ_l(x_l) and μ_lk(x_l, x_k) for all (l, k) ∈ E, x_l, x_k.


Slide 12: What does the MaxEnt solution look like?

To solve the MaxEnt problem, we form the Lagrangian:

$$L = -\sum_x q(x) \log q(x) - \sum_i \lambda_i \Big( \sum_x q(x) f_i(x) - \alpha_i \Big) - \lambda_{\text{sum}} \Big( \sum_x q(x) - 1 \Big)$$

Then, taking the derivative of the Lagrangian,

$$\frac{\partial L}{\partial q(x)} = -1 - \log q(x) - \sum_i \lambda_i f_i(x) - \lambda_{\text{sum}},$$

and setting it to zero, we obtain:

$$q^*(x) = \exp\Big( -1 - \lambda_{\text{sum}} - \sum_i \lambda_i f_i(x) \Big) = e^{-1 - \lambda_{\text{sum}}}\, e^{-\sum_i \lambda_i f_i(x)}$$

From the constraint $\sum_x q(x) = 1$ we obtain

$$e^{1 + \lambda_{\text{sum}}} = \sum_x e^{-\sum_i \lambda_i f_i(x)} = Z(\lambda)$$

We conclude that the maximum entropy distribution has the form (substituting θ for −λ)

$$q^*(x) = \frac{1}{Z(\theta)} \exp(\theta \cdot f(x))$$


Slide 13: Entropy for tree-structured models

Suppose that p is a tree-structured distribution, so that we are optimizing only over marginals μ_ij(x_i, x_j) for ij ∈ T.

We conclude from the previous slide that $\arg\max_{q : \mathbb{E}_q[f(x)] = \mu} H(q)$ is a tree-structured MRF.

The entropy of q as a function of its marginals can be shown to be

$$H(\mu) = \sum_{i \in V} H(\mu_i) - \sum_{ij \in T} I(\mu_{ij})$$

where

$$H(\mu_i) = -\sum_{x_i} \mu_i(x_i) \log \mu_i(x_i), \qquad I(\mu_{ij}) = \sum_{x_i, x_j} \mu_{ij}(x_i, x_j) \log \frac{\mu_{ij}(x_i, x_j)}{\mu_i(x_i)\,\mu_j(x_j)}.$$

Can we use this for non-tree structured models?
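A sanity check of the decomposition (Python; the random chain-structured distribution is an assumption): on a tree, ∑_i H(μ_i) − ∑_{ij∈T} I(μ_ij) equals the exact joint entropy.

```python
import numpy as np

rng = np.random.default_rng(3)

# Random tree (chain) distribution p(x1,x2,x3) = p(x1) p(x2|x1) p(x3|x2).
p1 = rng.dirichlet(np.ones(2))
p2_given_1 = rng.dirichlet(np.ones(2), size=2)   # rows: p(x2 | x1)
p3_given_2 = rng.dirichlet(np.ones(2), size=2)   # rows: p(x3 | x2)
joint = p1[:, None, None] * p2_given_1[:, :, None] * p3_given_2[None, :, :]

def H(t):                                        # entropy of any marginal table
    t = t.ravel()
    return float(-(t * np.log(t)).sum())

def I(mu_ij, mu_i, mu_j):                        # mutual information of an edge
    return float((mu_ij * np.log(mu_ij / np.outer(mu_i, mu_j))).sum())

mu12, mu23 = joint.sum(axis=2), joint.sum(axis=0)
mu1, mu2, mu3 = joint.sum(axis=(1, 2)), joint.sum(axis=(0, 2)), joint.sum(axis=(0, 1))

tree_entropy = H(mu1) + H(mu2) + H(mu3) - I(mu12, mu1, mu2) - I(mu23, mu2, mu3)
assert np.isclose(tree_entropy, H(joint))        # matches the exact joint entropy
```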


Slide 14: Bethe free energy approximation

The Bethe entropy approximation is (for any graph)

$$H_{\text{bethe}}(\mu) = \sum_{i \in V} H(\mu_i) - \sum_{ij \in E} I(\mu_{ij})$$

This gives the following variational approximation:

$$\max_{\mu \in M_L} \sum_{c \in C} \sum_{x_c} \theta_c(x_c)\,\mu_c(x_c) + H_{\text{bethe}}(\mu)$$

For non tree-structured models this objective is not concave, and is hard to maximize.

Loopy belief propagation, if it converges, finds a saddle point!


Slide 15: Concave relaxation

Let $\tilde{H}(\mu)$ be an upper bound on H(μ), i.e. $H(\mu) \leq \tilde{H}(\mu)$.

As a result, we obtain the following upper bound on the log-partition function:

$$\ln Z(\theta) \leq \max_{\mu \in M_L} \sum_{c \in C} \sum_{x_c} \theta_c(x_c)\,\mu_c(x_c) + \tilde{H}(\mu)$$

An example of a concave entropy upper bound is the tree-reweighted approximation (Wainwright, Jaakkola & Willsky, '05), given by specifying a distribution over spanning trees of the graph.

[Figure: four spanning trees of the same small graph, illustrating a distribution over spanning trees.]

Letting {ρ_ij} denote the edge appearance probabilities, we have:

$$H_{\text{TRW}}(\mu) = \sum_{i \in V} H(\mu_i) - \sum_{ij \in E} \rho_{ij}\, I(\mu_{ij})$$
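A minimal sketch of H_TRW (Python; the triangle graph and the uniform distribution over its three spanning trees are assumed examples). Note that setting ρ_ij = 1 for every edge recovers the Bethe entropy from the previous slide:

```python
import numpy as np

def H(mu):                        # entropy of a marginal table
    mu = mu.ravel()
    return float(-(mu * np.log(mu)).sum())

def I(mu_ij, mu_i, mu_j):         # mutual information of an edge marginal
    return float((mu_ij * np.log(mu_ij / np.outer(mu_i, mu_j))).sum())

def H_trw(mu_node, mu_edge, rho):
    return sum(H(m) for m in mu_node.values()) - sum(
        rho[e] * I(mu_edge[e], mu_node[e[0]], mu_node[e[1]]) for e in mu_edge)

# Triangle graph: each of its 3 spanning trees drops one edge, so under the
# uniform distribution over trees every edge appears with probability 2/3.
edges = [(0, 1), (1, 2), (0, 2)]
rho = {e: 2 / 3 for e in edges}

mu_node = {i: np.array([0.4, 0.6]) for i in range(3)}
mu_edge = {e: np.outer([0.4, 0.6], [0.4, 0.6]) for e in edges}  # independent: I = 0
print(H_trw(mu_node, mu_edge, rho))   # here just the sum of the node entropies
```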


Slide 16: Comparison of LBP and TRW

We showed two approximation methods, both making use of the local consistency constraints M_L on the marginal polytope:

1. Bethe free energy approximation (for pairwise MRFs):

$$\max_{\mu \in M_L} \sum_{ij \in E} \sum_{x_i, x_j} \mu_{ij}(x_i, x_j)\,\theta_{ij}(x_i, x_j) + \sum_{i \in V} H(\mu_i) - \sum_{ij \in E} I(\mu_{ij})$$

   Not concave. Can use the concave-convex procedure to find local optima. Loopy BP, if it converges, finds a saddle point (often a local maximum).

2. Tree-reweighted approximation (for pairwise MRFs):

$$(*) \quad \max_{\mu \in M_L} \sum_{ij \in E} \sum_{x_i, x_j} \mu_{ij}(x_i, x_j)\,\theta_{ij}(x_i, x_j) + \sum_{i \in V} H(\mu_i) - \sum_{ij \in E} \rho_{ij}\, I(\mu_{ij})$$

   {ρ_ij} are edge appearance probabilities (they must be consistent with some set of spanning trees). This objective is concave! Find the global maximum using projected gradient ascent. Provides an upper bound on the log-partition function, i.e. ln Z(θ) ≤ (∗).


Slide 17: Two types of variational algorithms: Mean-field and relaxation

$$\max_{q \in Q} \sum_{c \in C} \sum_{x_c} q(x_c)\,\theta_c(x_c) + H(q(x)).$$

Although this function is concave and thus in theory should be easy to optimize, we need some compact way of representing q(x).

Relaxation algorithms work directly with pseudomarginals, which may not be consistent with any joint distribution.

Mean-field algorithms assume a factored representation of the joint distribution, e.g.

$$q(x) = \prod_{i \in V} q_i(x_i)$$

(called naive mean field).


Slide 18: Naive mean-field

Using the same notation as in the rest of the lecture, naive mean-field is:

$$(*) \quad \max_{\mu} \sum_{c \in C} \sum_{x_c} \theta_c(x_c)\,\mu_c(x_c) + \sum_{i \in V} H(\mu_i)$$

subject to

$$\mu_i(x_i) \geq 0 \quad \forall i \in V, x_i \in \mathrm{Val}(X_i), \qquad \sum_{x_i \in \mathrm{Val}(X_i)} \mu_i(x_i) = 1 \quad \forall i \in V, \qquad \mu_c(x_c) = \prod_{i \in c} \mu_i(x_i)$$

This corresponds to optimizing over an inner bound on the marginal polytope. We obtain a lower bound on the partition function, i.e. (∗) ≤ ln Z(θ).
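A minimal naive mean-field sketch (Python; the triangle model, coordinate-ascent update, and iteration count are assumptions in the spirit of the lecture, not its code), which also checks the lower-bound property (∗) ≤ ln Z(θ) against brute-force enumeration:

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)
n = 3
edges = [(0, 1), (1, 2), (0, 2)]                # assumed binary triangle MRF
theta = {e: rng.normal(size=(2, 2)) for e in edges}

# Exact ln Z by enumeration, for checking the bound.
states = list(itertools.product([0, 1], repeat=n))
log_Z = np.log(sum(np.exp(sum(t[x[a], x[b]] for (a, b), t in theta.items()))
                   for x in states))

# Naive mean field q(x) = prod_i q_i(x_i), optimized by coordinate ascent:
# q_i(x_i) is proportional to exp of the expected potentials touching node i.
q = rng.dirichlet(np.ones(2), size=n)
for _ in range(50):
    for i in range(n):
        field = np.zeros(2)
        for (a, b), t in theta.items():
            if a == i:
                field += t @ q[b]               # E_{q_b}[theta_ib(x_i, x_b)]
            elif b == i:
                field += t.T @ q[a]             # E_{q_a}[theta_ai(x_a, x_i)]
        q[i] = np.exp(field - field.max())
        q[i] /= q[i].sum()

# Mean-field objective = sum_e E_q[theta_e] + sum_i H(q_i): a lower bound on ln Z.
obj = sum(q[a] @ t @ q[b] for (a, b), t in theta.items())
obj += sum(-(qi * np.log(qi)).sum() for qi in q)
assert obj <= log_Z + 1e-9
print(f"mean-field bound {obj:.4f} <= ln Z {log_Z:.4f}")
```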


Slide 19: Obtaining true bounds on the marginals

Suppose we can obtain upper and lower bounds on the partition function. These can be used to obtain upper and lower bounds on marginals.

Let $Z(\theta_{x_i})$ denote the partition function of the distribution on $X_{V \setminus i}$ obtained by clamping $X_i = x_i$, and suppose that $L_{x_i} \leq Z(\theta_{x_i}) \leq U_{x_i}$. Then (writing $\theta(x)$ for $\sum_c \theta_c(x_c)$):

$$p(x_i; \theta) = \frac{\sum_{x_{V \setminus i}} \exp\big(\theta(x_{V \setminus i}, x_i)\big)}{\sum_{\hat{x}_i} \sum_{x_{V \setminus i}} \exp\big(\theta(x_{V \setminus i}, \hat{x}_i)\big)} = \frac{Z(\theta_{x_i})}{\sum_{\hat{x}_i} Z(\theta_{\hat{x}_i})} \leq \frac{U_{x_i}}{\sum_{\hat{x}_i} L_{\hat{x}_i}}.$$

Similarly,

$$p(x_i; \theta) \geq \frac{L_{x_i}}{\sum_{\hat{x}_i} U_{\hat{x}_i}}.$$
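An end-to-end illustrative check (assumed Python sketch, with crude ±10% bounds standing in for real upper and lower bounds on each clamped partition function) that the resulting marginal bounds are valid:

```python
import itertools
import numpy as np

rng = np.random.default_rng(5)
edges = [(0, 1), (1, 2), (0, 2)]
theta = {e: rng.normal(size=(2, 2)) for e in edges}
states = list(itertools.product([0, 1], repeat=3))
w = {x: np.exp(sum(t[x[a], x[b]] for (a, b), t in theta.items())) for x in states}

i = 0                                          # bound the marginal of X_0
Z_clamped = {v: sum(wx for x, wx in w.items() if x[i] == v) for v in (0, 1)}
L = {v: 0.9 * Z_clamped[v] for v in (0, 1)}    # assumed crude lower bounds
U = {v: 1.1 * Z_clamped[v] for v in (0, 1)}    # assumed crude upper bounds

Z = sum(w.values())
for v in (0, 1):
    p_v = Z_clamped[v] / Z                     # true marginal p(X_0 = v)
    assert L[v] / sum(U.values()) <= p_v <= U[v] / sum(L.values())
```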
