  1. Probabilistic Graphical Models Lecture 11 – CRFs, Exponential Family CS/CNS/EE 155 Andreas Krause

  2. Announcements: Homework 2 due today. Project milestones due next Monday (Nov 9); about half the work should be done. 4 pages of writeup, NIPS format (http://nips.cc/PaperInformation/StyleFiles).

  3. So far: Markov Network representation (local/global Markov assumptions; separation; soundness and completeness of separation) and Markov Network inference (variable elimination and junction tree inference work exactly as in Bayes Nets). How about learning Markov Nets?

  4. MLE for Markov Nets: the log-likelihood of the data.
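The slide's equation did not survive extraction; a standard way to write this log-likelihood for a log-linear Markov network with parameters θ, clique features φ_i, and m i.i.d. samples (notation assumed here, consistent with the later slides) is:

```latex
\ell(\mathcal{D} \mid \theta)
  = \sum_{j=1}^{m} \sum_{i} \theta_i \, \phi_i\big(c_i[j]\big) \;-\; m \log Z(\theta),
\qquad
Z(\theta) = \sum_{x} \exp\Big( \sum_i \theta_i \, \phi_i(c_i) \Big).
```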

  5. Log-likelihood doesn't decompose: the log-likelihood l(D | θ) is a concave function of θ, but the log-partition function log Z(θ) doesn't decompose.

  6. Computing the derivative (illustrated on the example network over the variables C, D, I, G, S, L, J, H): computing P(c_i | θ) requires inference! Can optimize using conjugate gradient, etc.
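The derivative itself is missing from the extracted text; the standard form, using the fact that the gradient of the log-partition function gives expected feature counts, is:

```latex
\frac{\partial \log Z(\theta)}{\partial \theta_i} = \mathbb{E}_{\theta}\big[\phi_i(C_i)\big],
\qquad
\frac{\partial \ell(\mathcal{D} \mid \theta)}{\partial \theta_i}
  = \sum_{j=1}^{m} \phi_i\big(c_i[j]\big) - m\,\mathbb{E}_{\theta}\big[\phi_i(C_i)\big].
```

Evaluating the expectation E_θ[φ_i(C_i)] is exactly the computation of P(c_i | θ) that requires inference.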

  7. Alternative approach: Iterative Proportional Fitting (IPF). At the optimum, the empirical clique marginals must match the model's clique marginals; solve this fixed-point equation iteratively. Must recompute the parameters at every iteration.
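The fixed-point equation on the slide was lost in extraction; the usual IPF update, written here for table potentials ψ_i over cliques C_i (notation assumed), multiplies each potential by the ratio of empirical to current model marginals:

```latex
\psi_i^{(t+1)}(c_i) \;=\; \psi_i^{(t)}(c_i)\,
  \frac{\hat{P}(c_i)}{P_{\theta^{(t)}}(c_i)} .
```

Computing P_{θ^{(t)}}(c_i) again requires inference, which is why the parameters must be recomputed at every iteration.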

  8. Parameter learning for log-linear models: feature functions φ_i(C_i) defined over cliques. Log-linear model over an undirected graph G with feature functions φ_1(C_1), …, φ_k(C_k); the domains C_i can overlap. Joint distribution as below. How do we get the weights w_i?
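The joint distribution the slide refers to, in the standard log-linear form:

```latex
P(x \mid w) \;=\; \frac{1}{Z(w)} \exp\Big( \sum_{i=1}^{k} w_i \, \phi_i(c_i) \Big),
\qquad
Z(w) \;=\; \sum_{x} \exp\Big( \sum_{i=1}^{k} w_i \, \phi_i(c_i) \Big).
```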

  9. Optimizing parameters: gradient of the log-likelihood (see the sketch below). Thus, w is the MLE exactly when the empirical feature counts match the expected feature counts under the model (moment matching).
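A minimal sketch of this procedure, assuming a toy pairwise model over three binary variables that is small enough to compute the partition function by brute-force enumeration; the features, data, and numbers are illustrative and not from the lecture:

```python
import itertools
import numpy as np

# Toy log-linear model over three binary variables with two "agreement" features:
# phi_1(x) = [x1 == x2], phi_2(x) = [x2 == x3]. Small enough to enumerate Z exactly.
def features(x):
    return np.array([float(x[0] == x[1]), float(x[1] == x[2])])

STATES = list(itertools.product([0, 1], repeat=3))

def expected_features(w):
    # E_w[phi(X)] -- the "inference" step, done here by brute-force enumeration
    scores = np.array([w @ features(x) for x in STATES])
    probs = np.exp(scores - np.max(scores))
    probs /= probs.sum()
    return probs @ np.array([features(x) for x in STATES])

def mle_gradient_ascent(data, lr=0.5, iters=300):
    w = np.zeros(2)
    empirical = np.mean([features(x) for x in data], axis=0)
    for _ in range(iters):
        w += lr * (empirical - expected_features(w))  # moment-matching gradient
    return w

data = [(0, 0, 0), (1, 1, 1), (0, 1, 1), (1, 1, 0)]
w_hat = mle_gradient_ascent(data)
print(w_hat, expected_features(w_hat))  # model moments approach the empirical 0.75, 0.75
```

In a real Markov network, `expected_features` would be replaced by an inference routine (variable elimination or junction tree), since enumeration is exponential in the number of variables.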

  10. Regularization of parameters: put a prior on the parameters w.
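The extracted slide does not say which prior; a common choice (an assumption here, not necessarily the one used in lecture) is an independent zero-mean Gaussian prior on each weight, which turns MLE into L2-regularized MAP estimation:

```latex
P(w) = \prod_i \mathcal{N}(w_i;\, 0, \sigma^2)
\quad\Longrightarrow\quad
\hat{w}_{\mathrm{MAP}} = \arg\max_{w}\; \ell(\mathcal{D}\mid w) - \frac{1}{2\sigma^2}\,\lVert w\rVert_2^2 .
```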

  11. Summary: parameter learning in Markov Networks. MLE in BNs is easy (the score decomposes); MLE in MNs requires inference (the score doesn't decompose). Can optimize using gradient ascent or IPF.

  12. Generative vs. discriminative models. Often we want to predict Y for inputs X; the Bayes optimal classifier predicts according to P(Y | X). Generative model: model P(Y) and P(X | Y), then use Bayes' rule to compute P(Y | X). Discriminative model: model P(Y | X) directly! Don't model the distribution P(X) over the inputs X, so the model cannot "generate" sample inputs. Example: logistic regression.

  13. Example: Logistic Regression.
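The model itself was an equation on the slide; the standard logistic regression form for a binary label Y given features x is:

```latex
P(Y = 1 \mid x, w) \;=\; \frac{1}{1 + \exp\!\big( -(w_0 + w^{\top} x) \big)} .
```

This is a discriminative model in the sense of the previous slide: only the conditional P(Y | x) is parameterized, not the input distribution P(x).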

  14. Log-linear conditional random field: define a log-linear model over the outputs Y, with no assumptions about the inputs X. Feature functions φ_i(C_i, x) are defined over cliques and inputs. Joint distribution over the outputs (conditioned on x) as below.
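The distribution the slide refers to, in the standard CRF form:

```latex
P(y \mid x, w) \;=\; \frac{1}{Z(x, w)} \exp\Big( \sum_i w_i \, \phi_i(c_i, x) \Big),
\qquad
Z(x, w) \;=\; \sum_{y} \exp\Big( \sum_i w_i \, \phi_i(c_i, x) \Big).
```

Note that the partition function now depends on the input x.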

  15. Example: CRFs in NLP. Label variables Y_1, …, Y_12 sit above the input words X_1, …, X_12 of the text "Mrs. Greene spoke today in New York. Green chairs the finance committee". Classify each word as Person, Location, or Other.

  16. Example: CRFs in vision.

  17. Parameter learning for a log-linear CRF: maximize the conditional log-likelihood of the data; can maximize using conjugate gradient.

  18. Gradient of the conditional log-likelihood: the partial derivative (see below) requires one inference per training example. Can optimize using conjugate gradient.
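A sketch of that partial derivative, analogous to the unconditional case but with expectations taken under P(y | x^{(j)}, w) for each training example:

```latex
\frac{\partial}{\partial w_i} \sum_{j=1}^{m} \log P\big(y^{(j)} \mid x^{(j)}, w\big)
  \;=\; \sum_{j=1}^{m} \Big( \phi_i\big(c_i^{(j)}, x^{(j)}\big)
        \;-\; \mathbb{E}_{y \sim P(\cdot \mid x^{(j)}, w)}\big[\phi_i(c_i, x^{(j)})\big] \Big).
```

Each expectation involves its own partition function Z(x^{(j)}, w), hence one inference call per training example.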

  19. Exponential Family Distributions. The distributions used for log-linear models are a special case; more generally, an exponential family distribution has h(x) as base measure, w as natural parameters, φ(x) as sufficient statistics, and A(w) as log-partition function. Here x can be continuous (defined over any set).
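The density the slide defines, written out:

```latex
P(x \mid w) \;=\; h(x)\, \exp\big( w^{\top} \phi(x) - A(w) \big),
\qquad
A(w) \;=\; \log \int h(x)\, \exp\big( w^{\top} \phi(x) \big)\, dx .
```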

  20. Examples. Exponential family recap: h(x) base measure, w natural parameters, φ(x) sufficient statistics, A(w) log-partition function. The Gaussian distribution is one example; others include the Multinomial, Poisson, Exponential, Gamma, Weibull, chi-square, Dirichlet, Geometric, …
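For instance, the univariate Gaussian can be written in this form (a standard derivation, sketched here):

```latex
\mathcal{N}(x; \mu, \sigma^2):\quad
h(x) = \frac{1}{\sqrt{2\pi}},\quad
\phi(x) = \begin{pmatrix} x \\ x^2 \end{pmatrix},\quad
w = \begin{pmatrix} \mu/\sigma^2 \\ -1/(2\sigma^2) \end{pmatrix},\quad
A(w) = \frac{\mu^2}{2\sigma^2} + \log \sigma .
```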

  21. Moments and gradients: correspondence between moments and the log-partition function (just like in log-linear models). Can compute moments from derivatives, and derivatives from moments! MLE ⇔ moment matching.
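Concretely, the standard identities behind this correspondence are:

```latex
\nabla_w A(w) = \mathbb{E}_{w}\big[\phi(x)\big],
\qquad
\nabla_w^2 A(w) = \mathrm{Cov}_{w}\big[\phi(x)\big] \succeq 0 ,
```

so A(w) is convex, the log-likelihood is concave, and setting the gradient of the log-likelihood to zero gives the moment-matching condition E_w[φ(x)] = (1/m) Σ_j φ(x^{(j)}).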

  22. Recall: conjugate priors. Consider parametric families of prior distributions P(θ) = f(θ; λ), where λ is called the "hyperparameters" of the prior. A prior P(θ) = f(θ; λ) is called conjugate for a likelihood function P(D | θ) if P(θ | D) = f(θ; λ'): the posterior has the same parametric form, and the hyperparameters are updated based on the data D. Obvious questions (answered later): How to choose the hyperparameters? Why limit ourselves to conjugate priors?

  23. Conjugate priors in the Exponential Family: any exponential family likelihood has a conjugate prior.
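The extracted slide omits the formula; one standard parameterization of that conjugate prior (the notation λ, ν is assumed here) is itself an exponential family over the natural parameters w:

```latex
P(w \mid \lambda, \nu) \;\propto\; \exp\big( \lambda^{\top} w - \nu\, A(w) \big).
```

Observing data x^{(1)}, …, x^{(m)} then updates the hyperparameters to λ' = λ + Σ_j φ(x^{(j)}) and ν' = ν + m.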

  24. Maximum Entropy interpretation. Theorem: exponential family distributions maximize the entropy over all distributions satisfying the given constraints on the expected sufficient statistics E[φ(x)].

  25. Summary: exponential family. Distributions of the form P(x | w) = h(x) exp(w^T φ(x) - A(w)). Most common distributions are exponential family: Multinomial, Gaussian, Poisson, Exponential, Gamma, Weibull, chi-square, Dirichlet, Geometric, …, as well as log-linear Markov Networks. All exponential family distributions have a conjugate prior in the EF. Moments of the sufficient statistics = derivatives of the log-partition function. Maximum Entropy distributions ("most uncertain" distributions with specified expected sufficient statistics).

  26. Exponential family graphical models. So far we only defined graphical models over discrete variables, but we can define GMs over continuous distributions! For exponential family distributions, we can do much of what we discussed (VE, JT inference, parameter learning, etc.). Important example: Gaussian Networks.

  27. Gaussian distribution, with mean µ and standard deviation σ.
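The density, for reference:

```latex
\mathcal{N}(x; \mu, \sigma^2) \;=\; \frac{1}{\sqrt{2\pi}\,\sigma}\, \exp\!\Big( -\frac{(x-\mu)^2}{2\sigma^2} \Big).
```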

  28. Bivariate Gaussian distribution (3D surface plots of two example densities).

  29. Multivariate Gaussian distribution: joint distribution over n random variables P(X_1, …, X_n), with σ_jk = E[(X_j − µ_j)(X_k − µ_k)]. X_j and X_k independent ⇒ σ_jk = 0.
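The joint density the slide refers to:

```latex
P(x_1, \dots, x_n) \;=\; \frac{1}{(2\pi)^{n/2}\, \lvert \Sigma \rvert^{1/2}}
  \exp\!\Big( -\tfrac{1}{2} (x - \mu)^{\top} \Sigma^{-1} (x - \mu) \Big).
```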

  30. Marginalization. Suppose (X_1, …, X_n) ~ N(µ, Σ); what is P(X_1)? More generally, let A = {i_1, …, i_k} ⊆ {1, …, n} and write X_A = (X_{i_1}, …, X_{i_k}); then X_A ~ N(µ_A, Σ_AA).

  31. Conditioning. Suppose (X_1, …, X_n) ~ N(µ, Σ) and decompose it as (X_A, X_B); what is P(X_A | X_B)? P(X_A = x_A | X_B = x_B) = N(x_A; µ_{A|B}, Σ_{A|B}), where µ_{A|B} and Σ_{A|B} are given below. Computable using linear algebra!
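The "where" clause was lost in extraction; the standard Gaussian conditioning formulas are:

```latex
\mu_{A \mid B} \;=\; \mu_A + \Sigma_{AB}\, \Sigma_{BB}^{-1} (x_B - \mu_B),
\qquad
\Sigma_{A \mid B} \;=\; \Sigma_{AA} - \Sigma_{AB}\, \Sigma_{BB}^{-1}\, \Sigma_{BA}.
```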

  32. Conditioning (plot of a bivariate Gaussian sliced at X_1 = 0.75, giving P(X_2 | X_1 = 0.75)).

  33. Conditional linear Gaussians.
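The extracted slide has no body; for reference, a conditional linear Gaussian (the usual definition, assumed to match the lecture's) models a continuous variable Y given continuous parents x as a Gaussian whose mean is linear in x:

```latex
P(Y \mid x) \;=\; \mathcal{N}\big( \beta_0 + \beta^{\top} x,\; \sigma^2 \big).
```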

  34. Canonical representation of Gaussians.

  35. Canonical Representation: multivariate Gaussians are in the exponential family! Standard vs. canonical form: η = Σ^{-1} µ, Λ = Σ^{-1}. In standard form, marginalization is easy; we will see that in canonical form, multiplication/conditioning is easy!
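In this canonical (information) form, with the symbols η and Λ as reconstructed above, the density reads:

```latex
P(x) \;\propto\; \exp\!\Big( \eta^{\top} x - \tfrac{1}{2}\, x^{\top} \Lambda\, x \Big),
\qquad
\eta = \Sigma^{-1} \mu, \quad \Lambda = \Sigma^{-1}.
```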

  36. Gaussian Networks: zeros in the precision matrix Λ indicate missing edges in the log-linear model!
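A small numerical illustration of this point, assuming a three-variable chain X1 - X2 - X3; the matrix entries are made up for this sketch:

```python
import numpy as np

# Precision (information) matrix of a Gaussian chain X1 -- X2 -- X3:
# the (1,3) entry is zero, so there is no edge between X1 and X3.
Lam = np.array([[ 2.0, -1.0,  0.0],
                [-1.0,  2.0, -1.0],
                [ 0.0, -1.0,  2.0]])

Sigma = np.linalg.inv(Lam)          # covariance matrix
print(Sigma[0, 2])                  # nonzero: X1 and X3 are marginally dependent

# Conditioning on X2: the conditional covariance of (X1, X3) given X2 is the
# Schur complement below, and its off-diagonal entry is (numerically) zero.
A, B = [0, 2], [1]
cond_cov = (Sigma[np.ix_(A, A)]
            - Sigma[np.ix_(A, B)] @ np.linalg.inv(Sigma[np.ix_(B, B)]) @ Sigma[np.ix_(B, A)])
print(cond_cov[0, 1])               # ~0: X1 and X3 are conditionally independent given X2
```

The off-diagonal of `cond_cov` vanishes because it is the inverse of Λ restricted to {X1, X3}, whose off-diagonal entry is exactly the zero corresponding to the missing edge.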

  37. Inference in Gaussian Networks: can compute marginal distributions in O(n^3)! For large numbers n of variables this is still intractable, but if the Gaussian Network has low treewidth, we can use variable elimination / JT inference. Need to be able to multiply and marginalize factors!

  38. Multiplying factors in Gaussians.
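The rule the slide presumably shows (standard for canonical-form Gaussian factors; the notation is the same as above): multiplication simply adds the canonical parameters,

```latex
\exp\!\Big( \eta_1^{\top} x - \tfrac{1}{2} x^{\top} \Lambda_1 x \Big)
\cdot
\exp\!\Big( \eta_2^{\top} x - \tfrac{1}{2} x^{\top} \Lambda_2 x \Big)
\;=\;
\exp\!\Big( (\eta_1 + \eta_2)^{\top} x - \tfrac{1}{2} x^{\top} (\Lambda_1 + \Lambda_2)\, x \Big),
```

where factors defined over subsets of the variables are first padded with zeros to a common dimension.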

  39. Marginalizing in canonical form. Recall the conversion formulas η = Σ^{-1} µ and Λ = Σ^{-1}; the marginal distribution is given below.
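A sketch of that marginal over X_A in canonical form, obtained by converting to moment form and back (equivalently, a Schur complement of Λ):

```latex
\eta'_A \;=\; \eta_A - \Lambda_{AB}\, \Lambda_{BB}^{-1}\, \eta_B,
\qquad
\Lambda'_{AA} \;=\; \Lambda_{AA} - \Lambda_{AB}\, \Lambda_{BB}^{-1}\, \Lambda_{BA}.
```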

  40. Variable elimination: in Gaussian Markov Networks, variable elimination = Gaussian elimination (fast for low-bandwidth, i.e. low-treewidth, matrices).

  41. Tasks: read Koller & Friedman, Chapters 4.6.1 and 8.1-8.3.
