Probabilistic Graphical Models David Sontag New York University - PowerPoint PPT Presentation

Probabilistic Graphical Models David Sontag New York University Lecture 12, April 19, 2012 Acknowledgement : Partially based on slides by Eric Xing at CMU and Andrew McCallum at UMass Amherst David Sontag (NYU) Graphical Models Lecture 12, April 19, 2012 1 / 21

Today: learning undirected graphical models 1 Learning MRFs a. Feature-based (log-linear) representation of MRFs b. Maximum likelihood estimation c. Maximum entropy view 2 Getting around complexity of inference a. Using approximate inference (e.g., TRW) within learning b. Pseudo-likelihood 3 Conditional random fields David Sontag (NYU) Graphical Models Lecture 12, April 19, 2012 2 / 21

Recall: ML estimation in Bayesian networks Maximum likelihood estimation: max θ ℓ ( θ ; D ), where � ℓ ( θ ; D ) = log p ( D ; θ ) = log p ( x ; θ ) x ∈D � � � = log p ( x i | ˆ x pa ( i ) ) i ˆ x ∈D : x pa ( i ) x pa ( i ) =ˆ x pa ( i ) In Bayesian networks, we have the closed form ML solution: N x i , x pa ( i ) θ ML x i | x pa ( i ) = � x i N ˆ x i , x pa ( i ) ˆ where N x i , x pa ( i ) is the number of times that the (partial) assignment x i , x pa ( i ) is observed in the training data We were able to estimate each CPD independently because the objective decomposes by variable and parent assignment David Sontag (NYU) Graphical Models Lecture 12, April 19, 2012 3 / 21

Bad news for Markov networks The global normalization constant Z ( θ ) kills decomposability: � θ ML = arg max log p ( x ; θ ) θ x ∈D �� = arg max log φ c ( x c ; θ ) − log Z ( θ ) θ x ∈D c �� = arg max log φ c ( x c ; θ ) − |D| log Z ( θ ) θ x ∈D c The log-partition function prevents us from decomposing the objective into a sum over terms for each potential Solving for the parameters becomes much more complicated David Sontag (NYU) Graphical Models Lecture 12, April 19, 2012 4 / 21

What are the parameters? How do we parameterize φ c ( x c ; θ )? Use a log-linear parameterization: Introduce weights w ∈ R d that are used globally For each potential c , a vector-valued feature function f c ( x c ) ∈ R d Then, φ c ( x c ; w ) = exp( w · f c ( x c )) Example: discrete-valued MRF with only edge potentials, where each variable takes k states Let d = k 2 | E | , and let w i , j , x i , x j = log φ ij ( x i , x j ) Let f i , j ( x i , x j ) have a 1 in the dimension corresponding to ( i , j , x i , x j ) and 0 elsewhere The joint distribution is in the exponential family ! p ( x ; w ) = exp { w · f ( x ) − log Z ( w ) } , where f ( x ) = � c f c ( x c ) and Z ( w ) = � x exp { � c w · f c ( x c ) } This formulation allows for parameter sharing David Sontag (NYU) Graphical Models Lecture 12, April 19, 2012 5 / 21

Log-likelihood for log-linear models �� θ ML = arg max log φ c ( x c ; θ ) − |D| log Z ( θ ) θ c x ∈D �� = arg max w · f c ( x c ) − |D| log Z ( w ) w c x ∈D �� = arg max w · f c ( x c ) − |D| log Z ( w ) w c x ∈D The first term is linear in w The second term is also a function of w : � � � � log Z ( w ) = log exp w · f c ( x c ) x c David Sontag (NYU) Graphical Models Lecture 12, April 19, 2012 6 / 21

Log-likelihood for log-linear models � � � � log Z ( w ) = log exp w · f c ( x c ) x c log Z ( w ) does not decompose No closed form solution; even computing likelihood requires inference Recall Problem 4 (“Exponential families”) from Problem Set 2. Letting f ( x ) = � c f c ( x c ), you showed that � ∇ w log Z ( w ) = E p ( x ; w ) [ f ( x )] = E p ( x c ; w ) [ f c ( x c )] c Thus, the gradient of the log-partition function can be computed by inference , computing marginals with respect to the current parameters w We also claimed that the 2nd derivative of the log-partition function gives the second-order moments, i.e. ∇ 2 log Z ( w ) = cov[ f ( x )] Since covariance matrices are always positive semi-definite, this proves that log Z ( w ) is convex (so − log Z ( w ) is concave) David Sontag (NYU) Graphical Models Lecture 12, April 19, 2012 7 / 21

Solving the maximum likelihood problem in MRFs �� 1 � ℓ ( w ; D ) = |D| w · f c ( x c ) − log Z ( w ) x ∈D c First, note that the weights w are unconstrained, i.e. w ∈ R d The objective function is jointly concave. Apply any convex optimization method to learn! Can use gradient ascent, stochastic gradient ascent , quasi-Newton methods such as limited memory BFGS (L-BFGS) The gradient of the log-likelihood is: 1 d � � � ℓ ( w ; D ) = ( f c ( x c )) k − E p ( x c ; w ) [( f c ( x c )) k ] dw k |D| x ∈D c c 1 � � � = ( f c ( x c )) k − E p ( x c ; w ) [( f c ( x c )) k ] |D| c x ∈D c David Sontag (NYU) Graphical Models Lecture 12, April 19, 2012 8 / 21

The gradient of the log-likelihood 1 ∂ � � � ℓ ( w ; D ) = ( f c ( x c )) k − E p ( x c ; w ) [( f c ( x c )) k ] ∂ w k |D| c x ∈D c Difference of expectations! Consider the earlier pairwise MRF example. This then reduces to: � � ∂ 1 � ℓ ( w ; D ) = 1[ x i = ˆ x i , x j = ˆ x j ] − p (ˆ x i , ˆ x j ; w ) |D| ∂ w i , j , ˆ x i , ˆ x j x ∈D Setting derivative to zero, we see that for the maximum likelihood parameters w ML , we have 1 � x j ; w ML ) = p (ˆ x i , ˆ 1[ x i = ˆ x i , x j = ˆ x j ] |D| x ∈D for all edges ij ∈ E and states ˆ x i , ˆ x j Model marginals for each clique equal the empirical marginals! Called moment matching , and is a property of maximum likelihood learning in exponential families David Sontag (NYU) Graphical Models Lecture 12, April 19, 2012 9 / 21

Gradient ascent requires repeated marginal inference, which in many models is hard ! We will return to this shortly. David Sontag (NYU) Graphical Models Lecture 12, April 19, 2012 10 / 21

Maximum entropy (MaxEnt) We can approach the modeling task from an entirely different point of view Suppose we know some expectations with respect to a (fully general) distribution p ( x ): 1 � � (true) p ( x ) f i ( x ) , (empirical) f i ( x ) = α i |D| x x ∈D Assuming that the expectations are consistent with one another, there may exist many distributions which satisfy them. Which one should we select? The most uncertain or flexible one, i.e., the one with maximum entropy. This yields a new optimization problem: � max H ( p ( x )) = − p ( x ) log p ( x ) p x � s.t. p ( x ) f i ( x ) = α i x � p ( x ) = 1 (strictly concave w.r.t. p ( x ) ) x David Sontag (NYU) Graphical Models Lecture 12, April 19, 2012 11 / 21

What does the MaxEnt solution look like? To solve the MaxEnt problem, we form the Lagrangian: �� L = − p ( x ) log p ( x ) − λ i p ( x ) f i ( x ) − α i − µ p ( x ) − 1 x i x x Then, taking the derivative of the Lagrangian, ∂ L � ∂ p ( x ) = − 1 − log p ( x ) − λ i f i ( x ) − µ i And setting to zero, we obtain: � � p ∗ ( x ) = exp � = e − 1 − µ e − � i λ i f i ( x ) − 1 − µ − λ i f i ( x ) i x p ( x ) = 1 we obtain e 1+ µ = � i λ i f i ( x ) = Z ( λ ) x e − � From the constraint � We conclude that the maximum entropy distribution has the form (substituting w i = − λ i ) 1 p ∗ ( x ) = � Z ( w ) exp( w i f i ( x )) i David Sontag (NYU) Graphical Models Lecture 12, April 19, 2012 12 / 21

Equivalence of maximum likelihood and maximum entropy Feature constraints + MaxEnt ⇒ exponential family! We have seen a case of convex duality: In one case, we assume exponential family and show that ML implies model expectations must match empirical expectations In the other case, we assume model expectations must match empirical feature counts and show that MaxEnt implies exponential family distribution Can show that one is the dual of the other, and thus both obtain the same value of the objective at optimality (no duality gap) Besides providing insight into the ML solution, this also gives an alternative way to (approximately) solve the learning problem David Sontag (NYU) Graphical Models Lecture 12, April 19, 2012 13 / 21

How can we get around the complexity of inference during learning? David Sontag (NYU) Graphical Models Lecture 12, April 19, 2012 14 / 21

Monte Carlo methods Recall the original learning objective �� 1 � ℓ ( w ; D ) = |D| w · f c ( x c ) − log Z ( w ) x ∈D c Use any of the sampling approaches (e.g., Gibbs sampling) that we discussed in Lecture 9 All we need for learning (i.e., to compute the derivative of ℓ ( w , D )) are marginals of the distribution No need to ever estimate log Z ( w ) David Sontag (NYU) Graphical Models Lecture 12, April 19, 2012 15 / 21

Probabilistic Graphical Models David Sontag New York University - PowerPoint PPT Presentation

Probabilistic Graphical Models David Sontag New York University Lecture 12, April 19, 2012 Acknowledgement : Partially based on slides by Eric Xing at CMU and Andrew McCallum at UMass Amherst David Sontag (NYU) Graphical Models Lecture 12,

Probabilistic Graphical Models CMSC 678 UMBC Probabilistic Graphical Models A graph G that

Probabilistic Graphical Models Probabilistic Graphical Models Variable elimination Siamak

Graphical Models Graphical Models Bayesian Networks Siamak Ravanbakhsh Fall 2019 Previously on

Probabilistic Graphical Models Probabilistic Graphical Models introduction to learning Siamak

Computer Science Let me be provocative Probabilistic graphical models is how we do probabilistic

Probabilistic Graphical Models Probabilistic Graphical Models Undirected Models Fall 2019

Probabilistic Graphical Models Probabilistic Graphical Models parameter learning in undirected

Probabilistic Graphical Models Probabilistic Graphical Models Gaussian Network Models Fall 2019

CS 6782: Fall 2010 Probabilistic Graphical Models Guozhang Wang December 10, 2010 1

Probabilistic Graphical Models Probabilistic Graphical Models Relationship between the directed

Probabilistic Graphical Models Probabilistic Graphical Models Review of probability theory

Probabilistic Graphical Models Probabilistic Graphical Models Loopy BP and Bethe Free Energy

Probabilistic Graphical Models Probabilistic Graphical Models Structure learning in Bayesian

Probabilistic Graphical Models Probabilistic Graphical Models MAP inference Siamak Ravanbakhsh

Probabilistic Graphical Models Probabilistic Graphical Models Markov Chain Monte Carlo Inference

The Elimination Algorithm Probabilistic Graphical Models (10- Probabilistic Graphical Models

Alex Suciu Northeastern University Joint work with Thomas Koberda (U. Virginia) arxiv:1604.02010

LLL-reducing in quasi-linear time Damien Stehl e Joint work with A. Novocin & G. Villard

A proof of the model-independence of (, 1) -category theory joint with Dominic Verity CT2018,

Transport for the 1D Schr odinger equation via quasi-free systems (Collaboration with V.

Semantics for Probabilistic Programming Chris Heunen 1 / 21 Bayes law P ( A | B ) = P ( B | A

ADVANCED ECONOMETRICS I Theory (2/3) Instructor: Joaquim J. S. Ramalho E.mail:

Lesson 4. Iterated filtering: principles and practice Edward Ionides, Aaron A. King, and Kidus

Quasi-Realistic Heterotic String Vacua Left Right Symmetric Model Glyn Harries In collaboration

Probabilistic Graphical Models David Sontag New York University - PowerPoint PPT Presentation

Probabilistic Graphical Models David Sontag New York University Lecture 12, April 19, 2012 Acknowledgement : Partially based on slides by Eric Xing at CMU and Andrew McCallum at UMass Amherst David Sontag (NYU) Graphical Models Lecture 12,

Probabilistic Graphical Models CMSC 678 UMBC Probabilistic Graphical Models A graph G that

Probabilistic Graphical Models Probabilistic Graphical Models Variable elimination Siamak

Graphical Models Graphical Models Bayesian Networks Siamak Ravanbakhsh Fall 2019 Previously on

Probabilistic Graphical Models Probabilistic Graphical Models introduction to learning Siamak

Computer Science Let me be provocative Probabilistic graphical models is how we do probabilistic

Probabilistic Graphical Models Probabilistic Graphical Models Undirected Models Fall 2019

Probabilistic Graphical Models Probabilistic Graphical Models parameter learning in undirected

Probabilistic Graphical Models Probabilistic Graphical Models Gaussian Network Models Fall 2019

CS 6782: Fall 2010 Probabilistic Graphical Models Guozhang Wang December 10, 2010 1

Probabilistic Graphical Models Probabilistic Graphical Models Relationship between the directed

Probabilistic Graphical Models Probabilistic Graphical Models Review of probability theory

Probabilistic Graphical Models Probabilistic Graphical Models Loopy BP and Bethe Free Energy

Probabilistic Graphical Models Probabilistic Graphical Models Structure learning in Bayesian

Probabilistic Graphical Models Probabilistic Graphical Models MAP inference Siamak Ravanbakhsh

Probabilistic Graphical Models Probabilistic Graphical Models Markov Chain Monte Carlo Inference

The Elimination Algorithm Probabilistic Graphical Models (10- Probabilistic Graphical Models

Alex Suciu Northeastern University Joint work with Thomas Koberda (U. Virginia) arxiv:1604.02010

LLL-reducing in quasi-linear time Damien Stehl e Joint work with A. Novocin &amp; G. Villard

A proof of the model-independence of (, 1) -category theory joint with Dominic Verity CT2018,

Transport for the 1D Schr odinger equation via quasi-free systems (Collaboration with V.

Semantics for Probabilistic Programming Chris Heunen 1 / 21 Bayes law P ( A | B ) = P ( B | A

ADVANCED ECONOMETRICS I Theory (2/3) Instructor: Joaquim J. S. Ramalho E.mail:

Lesson 4. Iterated filtering: principles and practice Edward Ionides, Aaron A. King, and Kidus

Quasi-Realistic Heterotic String Vacua Left Right Symmetric Model Glyn Harries In collaboration

LLL-reducing in quasi-linear time Damien Stehl e Joint work with A. Novocin & G. Villard