SLIDE 1

Inference and Representation

David Sontag

New York University

Lecture 12, Dec. 2, 2014

Acknowledgements: Partially based on slides by Eric Xing at CMU and Andrew McCallum at UMass Amherst

David Sontag (NYU) Inference and Representation Lecture 12, Dec. 2, 2014 1 / 22

SLIDE 2

Today: learning undirected graphical models

1 Learning MRFs

  • a. Feature-based (log-linear) representation of MRFs
  • b. Maximum likelihood estimation
  • c. Maximum entropy view

2 Getting around complexity of inference

  • a. Using approximate inference (e.g., TRW) within learning
  • b. Pseudo-likelihood


SLIDE 3

Recall: ML estimation in Bayesian networks

Maximum likelihood estimation: $\max_\theta \ell(\theta; \mathcal{D})$, where

$$\ell(\theta; \mathcal{D}) = \log p(\mathcal{D}; \theta) = \sum_{x \in \mathcal{D}} \log p(x; \theta) = \sum_i \sum_{\hat{x}_{pa(i)}} \sum_{x \in \mathcal{D}:\, x_{pa(i)} = \hat{x}_{pa(i)}} \log p(x_i \mid \hat{x}_{pa(i)})$$

In Bayesian networks, we have the closed-form ML solution:

$$\theta^{ML}_{x_i \mid x_{pa(i)}} = \frac{N_{x_i, x_{pa(i)}}}{\sum_{\hat{x}_i} N_{\hat{x}_i, x_{pa(i)}}}$$

where $N_{x_i, x_{pa(i)}}$ is the number of times that the (partial) assignment $x_i, x_{pa(i)}$ is observed in the training data.

We were able to estimate each CPD independently because the objective decomposes by variable and parent assignment.
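In code, the closed-form count-based estimate is just a ratio of counts. A minimal sketch for a single binary variable with one parent (variable names and data are hypothetical, for illustration only):

```python
from collections import Counter

# Hypothetical 2-variable network X1 -> X2; data are tuples (x1, x2).
# The ML estimate of the CPD p(x2 | x1) can be computed independently of
# every other CPD because the log-likelihood decomposes.
def ml_cpd(data, k=2):
    """theta[x1][x2] = N(x2, x1) / sum over x2' of N(x2', x1)."""
    counts = Counter(data)  # N(x2, x1) for each joint setting
    theta = {}
    for x1 in range(k):
        total = sum(counts[(x1, x2)] for x2 in range(k))
        theta[x1] = {x2: counts[(x1, x2)] / total for x2 in range(k)}
    return theta

data = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0), (1, 1), (0, 0)]
theta = ml_cpd(data)
print(theta[0])  # p(x2 | x1=0) = {0: 0.75, 1: 0.25}
```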

SLIDE 4

Parameter estimation in Markov networks

How do we learn the parameters of an Ising model?

(figure: grid-structured Ising model with spins $x_i \in \{+1, -1\}$)

$$p(x_1, \cdots, x_n) = \frac{1}{Z} \exp\Big( \sum_{i<j} w_{i,j} x_i x_j - \sum_i u_i x_i \Big)$$

What about for a skip-chain CRF?

SLIDE 5

Bad news for Markov networks

The global normalization constant $Z(\theta)$ kills decomposability:

$$\theta^{ML} = \arg\max_\theta \log \prod_{x \in \mathcal{D}} p(x; \theta) = \arg\max_\theta \sum_{x \in \mathcal{D}} \Big( \sum_c \log \phi_c(x_c; \theta) - \log Z(\theta) \Big) = \arg\max_\theta \Big( \sum_{x \in \mathcal{D}} \sum_c \log \phi_c(x_c; \theta) \Big) - |\mathcal{D}| \log Z(\theta)$$

The log-partition function prevents us from decomposing the objective into a sum of terms, one per potential.

Solving for the parameters becomes much more complicated.

SLIDE 6

What are the parameters?

Parameterize $\phi_c(x_c; \theta)$ using a log-linear parameterization:
  • A single weight vector $w \in \mathbb{R}^d$ that is used globally
  • For each potential $c$, a vector-valued feature function $f_c(x_c) \in \mathbb{R}^d$
  • Then, $\phi_c(x_c; w) = \exp(w \cdot f_c(x_c))$

Example: a discrete-valued MRF with only edge potentials, where each variable takes $k$ states:
  • Let $d = k^2 |E|$, and let $w_{i,j,x_i,x_j} = \log \phi_{ij}(x_i, x_j)$
  • Let $f_{i,j}(x_i, x_j)$ have a 1 in the dimension corresponding to $(i, j, x_i, x_j)$ and 0 elsewhere

The joint distribution is in the exponential family! $p(x; w) = \exp\{w \cdot f(x) - \log Z(w)\}$, where $f(x) = \sum_c f_c(x_c)$ and $Z(w) = \sum_x \exp\{\sum_c w \cdot f_c(x_c)\}$.

This formulation allows for parameter sharing.
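A minimal sketch of this log-linear parameterization for the pairwise example (function names such as `score` and `prob` are my own; brute-force enumeration of $Z(w)$ is only feasible for tiny models):

```python
import itertools, math

# Pairwise MRF on n binary variables with edge set E, using indicator
# features: w[(i, j, xi, xj)] = log phi_ij(xi, xj), so that
# phi_c(x_c; w) = exp(w . f_c(x_c)) with f_c a one-hot vector.
def score(x, w, E):
    """w . f(x) = sum over edges of the weight indexed by (i, j, x_i, x_j)."""
    return sum(w[(i, j, x[i], x[j])] for (i, j) in E)

def log_Z(w, E, n):
    return math.log(sum(math.exp(score(x, w, E))
                        for x in itertools.product([0, 1], repeat=n)))

def prob(x, w, E, n):
    return math.exp(score(x, w, E) - log_Z(w, E, n))

n, E = 3, [(0, 1), (1, 2)]
# Potentials that favor neighboring variables agreeing:
w = {(i, j, a, b): (1.0 if a == b else 0.0)
     for (i, j) in E for a in (0, 1) for b in (0, 1)}
total = sum(prob(x, w, E, n) for x in itertools.product([0, 1], repeat=n))
print(round(total, 6))  # 1.0 -- probabilities sum to one by construction
```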

SLIDE 7

Log-likelihood for log-linear models

$$\theta^{ML} = \arg\max_\theta \Big( \sum_{x \in \mathcal{D}} \sum_c \log \phi_c(x_c; \theta) \Big) - |\mathcal{D}| \log Z(\theta)$$
$$= \arg\max_w \Big( \sum_{x \in \mathcal{D}} \sum_c w \cdot f_c(x_c) \Big) - |\mathcal{D}| \log Z(w)$$
$$= \arg\max_w \; w \cdot \Big( \sum_{x \in \mathcal{D}} \sum_c f_c(x_c) \Big) - |\mathcal{D}| \log Z(w)$$

The first term is linear in $w$. The second term is also a function of $w$:

$$\log Z(w) = \log \sum_x \exp\Big( w \cdot \sum_c f_c(x_c) \Big)$$

SLIDE 8

Log-likelihood for log-linear models

$$\log Z(w) = \log \sum_x \exp\Big( w \cdot \sum_c f_c(x_c) \Big)$$

$\log Z(w)$ does not decompose. There is no closed-form solution; even computing the likelihood requires inference.

Letting $f(x) = \sum_c f_c(x_c)$, we showed in Lecture 7 that:

$$\nabla_w \log Z(w) = \mathbb{E}_{p(x; w)}[f(x)] = \sum_c \mathbb{E}_{p(x_c; w)}[f_c(x_c)]$$

Thus, the gradient of the log-partition function can be computed by inference, computing marginals with respect to the current parameters $w$.

Similarly, one can show that the second derivative of the log-partition function gives the (centered) second-order moments, i.e.

$$\nabla^2 \log Z(w) = \mathrm{cov}_{p(x; w)}[f(x)]$$

Since covariance matrices are always positive semi-definite, this proves that $\log Z(w)$ is convex (so $-\log Z(w)$ is concave).
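The identity $\nabla_w \log Z(w) = \mathbb{E}_{p(x;w)}[f(x)]$ can be checked numerically on a toy model by comparing against finite differences (a sketch; the two indicator features below are an arbitrary choice for illustration):

```python
import itertools, math

# Toy model on 3 binary variables with a length-2 feature vector f(x).
def f(x):
    return [float(x[0] == x[1]), float(x[1] == x[2])]

def log_Z(w, n=3):
    return math.log(sum(math.exp(sum(wi * fi for wi, fi in zip(w, f(x))))
                        for x in itertools.product([0, 1], repeat=n)))

def expected_features(w, n=3):
    """E_{p(x;w)}[f(x)] by exact enumeration."""
    lZ = log_Z(w, n)
    exps = [0.0, 0.0]
    for x in itertools.product([0, 1], repeat=n):
        p = math.exp(sum(wi * fi for wi, fi in zip(w, f(x))) - lZ)
        for k in range(2):
            exps[k] += p * f(x)[k]
    return exps

w, eps = [0.3, -0.7], 1e-5
for k in range(2):
    w_hi = list(w); w_hi[k] += eps
    w_lo = list(w); w_lo[k] -= eps
    fd = (log_Z(w_hi) - log_Z(w_lo)) / (2 * eps)  # finite-difference gradient
    assert abs(fd - expected_features(w)[k]) < 1e-6
print("gradient of log Z matches expected features")
```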

SLIDE 9

Solving the maximum likelihood problem in MRFs

$$\ell(w; \mathcal{D}) = \frac{1}{|\mathcal{D}|} w \cdot \sum_{x \in \mathcal{D}} \sum_c f_c(x_c) - \log Z(w)$$

First, note that the weights $w$ are unconstrained, i.e. $w \in \mathbb{R}^d$.

The objective function is jointly concave. Apply any convex optimization method to learn! One can use gradient ascent, stochastic gradient ascent, or quasi-Newton methods such as limited-memory BFGS (L-BFGS).

Let's study some properties of the ML solution:

$$\frac{d}{dw_k} \ell(w; \mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} \sum_c (f_c(x_c))_k - \sum_c \mathbb{E}_{p(x_c; w)}[(f_c(x_c))_k] = \sum_c \Big( \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} (f_c(x_c))_k \Big) - \sum_c \mathbb{E}_{p(x_c; w)}[(f_c(x_c))_k]$$

SLIDE 10

The gradient of the log-likelihood

$$\frac{\partial}{\partial w_k} \ell(w; \mathcal{D}) = \sum_c \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} (f_c(x_c))_k - \sum_c \mathbb{E}_{p(x_c; w)}[(f_c(x_c))_k]$$

A difference of expectations! Consider the earlier pairwise MRF example. The gradient then reduces to:

$$\frac{\partial}{\partial w_{i,j,\hat{x}_i,\hat{x}_j}} \ell(w; \mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} 1[x_i = \hat{x}_i, x_j = \hat{x}_j] - p(\hat{x}_i, \hat{x}_j; w)$$

Setting the derivative to zero, we see that the maximum likelihood parameters $w^{ML}$ satisfy

$$p(\hat{x}_i, \hat{x}_j; w^{ML}) = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} 1[x_i = \hat{x}_i, x_j = \hat{x}_j]$$

for all edges $ij \in E$ and states $\hat{x}_i, \hat{x}_j$.

The model marginals at the ML solution equal the empirical marginals! This is called moment matching, and it is a property of maximum likelihood learning in exponential families.
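Moment matching can be observed directly by running gradient ascent on a tiny model with a single feature (a hypothetical toy setup; exact inference by enumeration stands in for the marginal computation):

```python
import itertools, math

# Two binary variables, one edge feature f(x) = 1[x_0 = x_1]. At the ML
# optimum, the model marginal E[f] should match the empirical frequency.
def model_moment(w):
    """E_{p(x;w)}[f(x)], computed by enumeration."""
    scores = {x: math.exp(w * float(x[0] == x[1]))
              for x in itertools.product([0, 1], repeat=2)}
    Z = sum(scores.values())
    return sum(s / Z * float(x[0] == x[1]) for x, s in scores.items())

data = [(0, 0), (1, 1), (0, 0), (0, 1)]
empirical = sum(x[0] == x[1] for x in data) / len(data)  # 0.75

w = 0.0
for _ in range(2000):
    # Gradient of the average log-likelihood: empirical minus model moment.
    w += 0.5 * (empirical - model_moment(w))
print(round(model_moment(w), 4))  # 0.75: model marginal matches empirical
```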

SLIDE 11

Gradient ascent requires repeated marginal inference, which in many models is hard!

We will return to this shortly.


SLIDE 12

Maximum entropy (MaxEnt)

We can approach the modeling task from an entirely different point of view. Suppose we know some expectations with respect to a (fully general) distribution $p(x)$:

$$\text{(true)} \quad \sum_x p(x) f_i(x), \qquad \text{(empirical)} \quad \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} f_i(x) = \alpha_i$$

Assuming that the expectations are consistent with one another, there may exist many distributions which satisfy them. Which one should we select? The most uncertain or flexible one, i.e., the one with maximum entropy. This yields a new optimization problem:

$$\max_p \; H(p(x)) = -\sum_x p(x) \log p(x)$$
$$\text{s.t.} \quad \sum_x p(x) f_i(x) = \alpha_i, \qquad \sum_x p(x) = 1$$

(strictly concave w.r.t. $p(x)$)

SLIDE 13

What does the MaxEnt solution look like? (c.f. Lec. 9)

To solve the MaxEnt problem, we form the Lagrangian:

$$L = -\sum_x p(x) \log p(x) - \sum_i \lambda_i \Big( \sum_x p(x) f_i(x) - \alpha_i \Big) - \mu \Big( \sum_x p(x) - 1 \Big)$$

Then, taking the derivative of the Lagrangian,

$$\frac{\partial L}{\partial p(x)} = -1 - \log p(x) - \sum_i \lambda_i f_i(x) - \mu$$

and setting it to zero, we obtain:

$$p^*(x) = \exp\Big( -1 - \mu - \sum_i \lambda_i f_i(x) \Big) = e^{-1-\mu} e^{-\sum_i \lambda_i f_i(x)}$$

From the constraint $\sum_x p(x) = 1$ we obtain

$$e^{1+\mu} = \sum_x e^{-\sum_i \lambda_i f_i(x)} = Z(\lambda)$$

We conclude that the maximum entropy distribution has the form (substituting $w_i = -\lambda_i$)

$$p^*(x) = \frac{1}{Z(w)} \exp\Big( \sum_i w_i f_i(x) \Big)$$
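The conclusion can be sanity-checked numerically: among distributions satisfying a given feature-expectation constraint, the exponential-family one has the largest entropy. A toy example with one indicator feature (the setup and all names are my own, for illustration):

```python
import math

# Distributions on two binary variables, constrained to E[1[x_0 = x_1]] = 0.75.
states = [(0, 0), (0, 1), (1, 0), (1, 1)]
f = lambda x: float(x[0] == x[1])

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

w = math.log(3.0)  # chosen so that E[f] = e^w / (e^w + 1) = 0.75
Z = sum(math.exp(w * f(x)) for x in states)
p_maxent = [math.exp(w * f(x)) / Z for x in states]
assert abs(sum(p * f(x) for p, x in zip(p_maxent, states)) - 0.75) < 1e-9

# Another feasible distribution (same constraint), but less uniform:
p_other = [0.75, 0.125, 0.125, 0.0]  # E[f] = 0.75 as well
print(entropy(p_maxent) > entropy(p_other))  # True
```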

SLIDE 14

Equivalence of maximum likelihood and maximum entropy

Feature constraints + MaxEnt ⇒ exponential family! We have seen a case of convex duality:
  • In one case, we assume an exponential family and show that ML implies the model expectations must match the empirical expectations
  • In the other case, we assume the model expectations must match the empirical feature counts and show that MaxEnt implies an exponential family distribution

One can show that each problem is the dual of the other, and thus both obtain the same value of the objective at optimality (no duality gap).

Besides providing insight into the ML solution, this also gives an alternative way to (approximately) solve the learning problem.

SLIDE 15

Chow-Liu algorithm for MRF structure learning

(figure: tree learned over object/scene labels, with high-degree hub nodes such as "sky", "floor", and "wall")

Recall the PS 3 problem on structure learning of tree-structured MRFs:

$$\max_T \max_{\theta_T} \sum_{x \in \mathcal{D}} \log p_T(x; \theta_T)$$

You used the fact that, for a fixed tree $T$, the maximum likelihood parameters, i.e.

$$\theta_T^{ML} = \arg\max_{\theta_T} \sum_{x \in \mathcal{D}} \log p_T(x; \theta_T),$$

satisfy $p_T(x_i, x_j; \theta_T^{ML}) = \hat{p}(x_i, x_j)$, the latter computed from the data $\mathcal{D}$.

For the special case of trees, the mapping $\mu \to \theta$ has a simple closed-form solution:

$$p_T(x) = \prod_{(i,j) \in T} \frac{p_T(x_i, x_j)}{p_T(x_i)\, p_T(x_j)} \prod_{j \in V} p_T(x_j)$$

SLIDE 16

Chow-Liu algorithm for MRF structure learning

This then gave the following optimization problem:

$$\max_T \sum_{x \in \mathcal{D}} \log \Bigg( \prod_{(i,j) \in T} \frac{\hat{p}(x_i, x_j)}{\hat{p}(x_i)\, \hat{p}(x_j)} \prod_{j \in V} \hat{p}(x_j) \Bigg)$$

which you solved using a maximum spanning tree algorithm.

For general graphs, solving the maximum entropy problem is itself intractable.
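A sketch of the resulting algorithm: the objective above reduces to weighting each candidate edge by the empirical mutual information $I(X_i; X_j)$ and taking a maximum spanning tree (Kruskal with union-find below; function names are my own):

```python
import math
from collections import Counter

def mutual_info(data, i, j):
    """Empirical mutual information between variables i and j."""
    n = len(data)
    pij = Counter((x[i], x[j]) for x in data)
    pi = Counter(x[i] for x in data)
    pj = Counter(x[j] for x in data)
    return sum((c / n) * math.log((c / n) / ((pi[a] / n) * (pj[b] / n)))
               for (a, b), c in pij.items())

def chow_liu(data, n_vars):
    parent = list(range(n_vars))
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]; u = parent[u]
        return u
    edges = sorted(((mutual_info(data, i, j), i, j)
                    for i in range(n_vars) for j in range(i + 1, n_vars)),
                   reverse=True)
    tree = []
    for mi, i, j in edges:  # Kruskal: greedily add heaviest edges without cycles
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    return tree

# x2 is a copy of x0, so the learned tree always contains the edge (0, 2).
data = [(0, 1, 0), (1, 0, 1), (0, 0, 0), (1, 1, 1), (0, 1, 0), (1, 0, 1)]
print(chow_liu(data, 3))
```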

SLIDE 17

How can we get around the complexity of inference during learning?


SLIDE 18

Monte Carlo methods

Recall the original learning objective:

$$\ell(w; \mathcal{D}) = \frac{1}{|\mathcal{D}|} w \cdot \sum_{x \in \mathcal{D}} \sum_c f_c(x_c) - \log Z(w)$$

Use any of the sampling approaches (e.g., Gibbs sampling) that we discussed in Lecture 9. All we need for learning (i.e., to compute the derivative of $\ell(w; \mathcal{D})$) are marginals of the distribution. There is no need to ever estimate $\log Z(w)$ itself.
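As a sketch of this idea (the model and names are illustrative, not from the lecture): a tiny Gibbs sampler estimating the edge marginal $\mathbb{E}[1[x_0 = x_1]]$ needed by the gradient, for a one-edge binary model where the exact value $e^w/(e^w+1)$ is available for comparison:

```python
import math, random

# Two binary variables with a single log-potential w when they agree.
def gibbs_edge_marginal(w, n_sweeps=20000, seed=0):
    rng = random.Random(seed)
    x = [0, 1]
    agree = 0
    for _ in range(n_sweeps):
        for i in (0, 1):  # one Gibbs sweep: resample each variable in turn
            other = x[1 - i]
            # p(x_i = x_other | x_other) is proportional to e^w vs e^0
            p_match = math.exp(w) / (math.exp(w) + 1.0)
            x[i] = other if rng.random() < p_match else 1 - other
        agree += (x[0] == x[1])
    return agree / n_sweeps

w = math.log(3.0)  # exact edge marginal is e^w / (e^w + 1) = 0.75
print(round(gibbs_edge_marginal(w), 2))  # close to 0.75
```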

SLIDE 19

Using approximations of the log-partition function

We can substitute the original learning objective

$$\ell(w; \mathcal{D}) = \frac{1}{|\mathcal{D}|} w \cdot \sum_{x \in \mathcal{D}} \sum_c f_c(x_c) - \log Z(w)$$

with one that uses a tractable approximation of the log-partition function:

$$\tilde{\ell}(w; \mathcal{D}) = \frac{1}{|\mathcal{D}|} w \cdot \sum_{x \in \mathcal{D}} \sum_c f_c(x_c) - \widetilde{\log Z}(w)$$

Recall from Lecture 9 that we came up with convex relaxations that provide an upper bound on the log-partition function, $\log Z(w) \le \widetilde{\log Z}(w)$ (e.g., tree-reweighted belief propagation, the log-determinant relaxation). Using this, we obtain a lower bound on the learning objective, $\ell(w; \mathcal{D}) \ge \tilde{\ell}(w; \mathcal{D})$.

Again, to compute the derivatives we only need pseudo-marginals from the variational inference algorithm.

SLIDE 20

Pseudo-likelihood

Alternatively, can we come up with a different objective function (i.e., a different estimator) which succeeds at learning while avoiding inference altogether?

The pseudo-likelihood method (Besag 1971) yields an exact solution if the data is generated by a model in our model family $p(x; \theta^*)$ and $|\mathcal{D}| \to \infty$ (i.e., it is consistent).

Note that, via the chain rule,

$$p(x; w) = \prod_i p(x_i \mid x_1, \ldots, x_{i-1}; w)$$

We consider the following approximation:

$$p(x; w) \approx \prod_i p(x_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n; w) = \prod_i p(x_i \mid x_{-i}; w)$$

where we have added conditioning on the additional variables $x_{i+1}, \ldots, x_n$.

SLIDE 21

Pseudo-likelihood

The pseudo-likelihood method replaces the likelihood,

$$\ell(\theta; \mathcal{D}) = \frac{1}{|\mathcal{D}|} \log p(\mathcal{D}; \theta) = \frac{1}{|\mathcal{D}|} \sum_{m=1}^{|\mathcal{D}|} \log p(x^m; \theta),$$

with the following approximation:

$$\ell_{PL}(w; \mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{m=1}^{|\mathcal{D}|} \sum_{i=1}^{n} \log p(x_i^m \mid x_{N(i)}^m; w)$$

(we replaced $x_{-i}$ with $x_{N(i)}$, the Markov blanket of $i$)

For example, suppose we have a pairwise MRF. Then,

$$p(x_i^m \mid x_{N(i)}^m; w) = \frac{1}{Z(x_{N(i)}^m; w)} e^{\sum_{j \in N(i)} \theta_{ij}(x_i^m, x_j^m)}, \qquad Z(x_{N(i)}^m; w) = \sum_{\hat{x}_i} e^{\sum_{j \in N(i)} \theta_{ij}(\hat{x}_i, x_j^m)}$$

More generally, using the log-linear parameterization, we have:

$$\log p(x_i^m \mid x_{N(i)}^m; w) = w \cdot \sum_{c:\, i \in c} f_c(x_c^m) - \log Z(x_{N(i)}^m; w)$$
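A sketch of the pseudo-likelihood objective for a small pairwise binary MRF (the data, potentials, and names below are hypothetical; note that each local partition function sums over a single variable only):

```python
import math

# theta[(i, j)][(xi, xj)] is the log-potential for edge (i, j);
# neighbors[i] lists the Markov blanket N(i).
def log_pseudolikelihood(data, theta, neighbors):
    def pot(i, j, xi, xj):
        return theta[(i, j)][(xi, xj)] if (i, j) in theta else theta[(j, i)][(xj, xi)]
    total = 0.0
    for x in data:
        for i in range(len(x)):
            score = sum(pot(i, j, x[i], x[j]) for j in neighbors[i])
            # Local partition function: a sum over x_i alone, not over all x.
            local_Z = sum(math.exp(sum(pot(i, j, xi, x[j]) for j in neighbors[i]))
                          for xi in (0, 1))
            total += score - math.log(local_Z)
    return total / len(data)

# Chain 0 - 1 - 2 with potentials that reward neighbors agreeing:
theta = {(0, 1): {(a, b): (0.5 if a == b else 0.0) for a in (0, 1) for b in (0, 1)},
         (1, 2): {(a, b): (0.5 if a == b else 0.0) for a in (0, 1) for b in (0, 1)}}
neighbors = {0: [1], 1: [0, 2], 2: [1]}
data = [(0, 0, 0), (1, 1, 1), (0, 0, 1)]
print(round(log_pseudolikelihood(data, theta, neighbors), 4))  # -1.5547
```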

SLIDE 22

Pseudo-likelihood

This objective only involves summation over $x_i$ and is tractable. It has many small partition functions (one for each variable and each setting of its neighbors) instead of one big one.

It is still concave in $w$ and thus has no local maxima.

Assuming the data is drawn from an MRF with parameters $w^*$, one can show that as the number of data points gets large, $w^{PL} \to w^*$.