  1. Probabilistic Graphical Models Lecture 11 – CRFs, Exponential Family CS/CNS/EE 155 Andreas Krause

  2. Announcements: Homework 2 due today. Project milestones due next Monday (Nov 9); about half the work should be done. 4 pages of writeup, NIPS format (http://nips.cc/PaperInformation/StyleFiles).

  3. So far: Markov Network representation (local/global Markov assumptions; separation; soundness and completeness of separation) and Markov Network inference (variable elimination and junction tree inference work exactly as in Bayes Nets). How about learning Markov Nets?

  4. MLE for Markov Nets: the log-likelihood of the data.
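The slide's equation did not survive extraction; a standard way to write this log-likelihood for a log-linear Markov network with parameters θ, clique features φ_i, and m i.i.d. samples (notation assumed here, consistent with the later slides) is:

```latex
\ell(\mathcal{D} \mid \theta)
  = \sum_{j=1}^{m} \sum_{i} \theta_i \, \phi_i\big(c_i[j]\big) \;-\; m \log Z(\theta),
\qquad
Z(\theta) = \sum_{x} \exp\Big( \sum_i \theta_i \, \phi_i(c_i) \Big).
```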

  5. Log-likelihood doesn't decompose: the log-likelihood l(D | θ) is a concave function of θ, but the log-partition function log Z(θ) doesn't decompose.

  6. Computing the derivative (illustrated on the example network over the variables C, D, I, G, S, L, J, H): computing P(c_i | θ) requires inference! Can optimize using conjugate gradient, etc.
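The derivative itself is missing from the extracted text; the standard form, using the fact that the gradient of the log-partition function gives expected feature counts, is:

```latex
\frac{\partial \log Z(\theta)}{\partial \theta_i} = \mathbb{E}_{\theta}\big[\phi_i(C_i)\big],
\qquad
\frac{\partial \ell(\mathcal{D} \mid \theta)}{\partial \theta_i}
  = \sum_{j=1}^{m} \phi_i\big(c_i[j]\big) - m\,\mathbb{E}_{\theta}\big[\phi_i(C_i)\big].
```

Evaluating the expectation E_θ[φ_i(C_i)] is exactly the computation of P(c_i | θ) that requires inference.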

  7. Alternative approach: Iterative Proportional Fitting (IPF). At the optimum, the empirical clique marginals must match the model's clique marginals; solve this fixed-point equation iteratively. Must recompute the parameters at every iteration.
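The fixed-point equation on the slide was lost in extraction; the usual IPF update, written here for table potentials ψ_i over cliques C_i (notation assumed), multiplies each potential by the ratio of empirical to current model marginals:

```latex
\psi_i^{(t+1)}(c_i) \;=\; \psi_i^{(t)}(c_i)\,
  \frac{\hat{P}(c_i)}{P_{\theta^{(t)}}(c_i)} .
```

Computing P_{θ^{(t)}}(c_i) again requires inference, which is why the parameters must be recomputed at every iteration.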

  8. Parameter learning for log-linear models: feature functions φ_i(C_i) defined over cliques. Log-linear model over an undirected graph G with feature functions φ_1(C_1), …, φ_k(C_k); the domains C_i can overlap. Joint distribution as below. How do we get the weights w_i?
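The joint distribution the slide refers to, in the standard log-linear form:

```latex
P(x \mid w) \;=\; \frac{1}{Z(w)} \exp\Big( \sum_{i=1}^{k} w_i \, \phi_i(c_i) \Big),
\qquad
Z(w) \;=\; \sum_{x} \exp\Big( \sum_{i=1}^{k} w_i \, \phi_i(c_i) \Big).
```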

  9. Optimizing parameters: gradient of the log-likelihood (see the sketch below). Thus, w is the MLE exactly when the empirical feature counts match the expected feature counts under the model (moment matching).
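A minimal sketch of this procedure, assuming a toy pairwise model over three binary variables that is small enough to compute the partition function by brute-force enumeration; the features, data, and numbers are illustrative and not from the lecture:

```python
import itertools
import numpy as np

# Toy log-linear model over three binary variables with two "agreement" features:
# phi_1(x) = [x1 == x2], phi_2(x) = [x2 == x3]. Small enough to enumerate Z exactly.
def features(x):
    return np.array([float(x[0] == x[1]), float(x[1] == x[2])])

STATES = list(itertools.product([0, 1], repeat=3))

def expected_features(w):
    # E_w[phi(X)] -- the "inference" step, done here by brute-force enumeration
    scores = np.array([w @ features(x) for x in STATES])
    probs = np.exp(scores - np.max(scores))
    probs /= probs.sum()
    return probs @ np.array([features(x) for x in STATES])

def mle_gradient_ascent(data, lr=0.5, iters=300):
    w = np.zeros(2)
    empirical = np.mean([features(x) for x in data], axis=0)
    for _ in range(iters):
        w += lr * (empirical - expected_features(w))  # moment-matching gradient
    return w

data = [(0, 0, 0), (1, 1, 1), (0, 1, 1), (1, 1, 0)]
w_hat = mle_gradient_ascent(data)
print(w_hat, expected_features(w_hat))  # model moments approach the empirical 0.75, 0.75
```

In a real Markov network, `expected_features` would be replaced by an inference routine (variable elimination or junction tree), since enumeration is exponential in the number of variables.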

  10. Regularization of parameters: put a prior on the parameters w.
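The extracted slide does not say which prior; a common choice (an assumption here, not necessarily the one used in lecture) is an independent zero-mean Gaussian prior on each weight, which turns MLE into L2-regularized MAP estimation:

```latex
P(w) = \prod_i \mathcal{N}(w_i;\, 0, \sigma^2)
\quad\Longrightarrow\quad
\hat{w}_{\mathrm{MAP}} = \arg\max_{w}\; \ell(\mathcal{D}\mid w) - \frac{1}{2\sigma^2}\,\lVert w\rVert_2^2 .
```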

  11. Summary: parameter learning in Markov Networks. MLE in BNs is easy (the score decomposes); MLE in MNs requires inference (the score doesn't decompose). Can optimize using gradient ascent or IPF.

  12. Generative vs. discriminative models. Often we want to predict Y for inputs X; the Bayes optimal classifier predicts according to P(Y | X). Generative model: model P(Y) and P(X | Y), then use Bayes' rule to compute P(Y | X). Discriminative model: model P(Y | X) directly! Don't model the distribution P(X) over the inputs X, so the model cannot "generate" sample inputs. Example: logistic regression.

  13. Example: Logistic Regression.
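The model itself was an equation on the slide; the standard logistic regression form for a binary label Y given features x is:

```latex
P(Y = 1 \mid x, w) \;=\; \frac{1}{1 + \exp\!\big( -(w_0 + w^{\top} x) \big)} .
```

This is a discriminative model in the sense of the previous slide: only the conditional P(Y | x) is parameterized, not the input distribution P(x).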

  14. Log-linear conditional random field: define a log-linear model over the outputs Y, with no assumptions about the inputs X. Feature functions φ_i(C_i, x) are defined over cliques and inputs. Joint distribution over the outputs (conditioned on x) as below.
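The distribution the slide refers to, in the standard CRF form:

```latex
P(y \mid x, w) \;=\; \frac{1}{Z(x, w)} \exp\Big( \sum_i w_i \, \phi_i(c_i, x) \Big),
\qquad
Z(x, w) \;=\; \sum_{y} \exp\Big( \sum_i w_i \, \phi_i(c_i, x) \Big).
```

Note that the partition function now depends on the input x.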

  15. Example: CRFs in NLP. Label variables Y_1, …, Y_12 sit above the input words X_1, …, X_12 of the text "Mrs. Greene spoke today in New York. Green chairs the finance committee". Classify each word as Person, Location, or Other.

  16. Example: CRFs in vision.

  17. Parameter learning for a log-linear CRF: maximize the conditional log-likelihood of the data; can maximize using conjugate gradient.

  18. Gradient of the conditional log-likelihood: the partial derivative (see below) requires one inference per training example. Can optimize using conjugate gradient.
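A sketch of that partial derivative, analogous to the unconditional case but with expectations taken under P(y | x^{(j)}, w) for each training example:

```latex
\frac{\partial}{\partial w_i} \sum_{j=1}^{m} \log P\big(y^{(j)} \mid x^{(j)}, w\big)
  \;=\; \sum_{j=1}^{m} \Big( \phi_i\big(c_i^{(j)}, x^{(j)}\big)
        \;-\; \mathbb{E}_{y \sim P(\cdot \mid x^{(j)}, w)}\big[\phi_i(c_i, x^{(j)})\big] \Big).
```

Each expectation involves its own partition function Z(x^{(j)}, w), hence one inference call per training example.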

  19. Exponential Family Distributions. The distributions used for log-linear models are a special case; more generally, an exponential family distribution has h(x) as base measure, w as natural parameters, φ(x) as sufficient statistics, and A(w) as log-partition function. Here x can be continuous (defined over any set).
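The density the slide defines, written out:

```latex
P(x \mid w) \;=\; h(x)\, \exp\big( w^{\top} \phi(x) - A(w) \big),
\qquad
A(w) \;=\; \log \int h(x)\, \exp\big( w^{\top} \phi(x) \big)\, dx .
```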

  20. Examples. Exponential family recap: h(x) base measure, w natural parameters, φ(x) sufficient statistics, A(w) log-partition function. The Gaussian distribution is one example; others include the Multinomial, Poisson, Exponential, Gamma, Weibull, chi-square, Dirichlet, Geometric, …
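For instance, the univariate Gaussian can be written in this form (a standard derivation, sketched here):

```latex
\mathcal{N}(x; \mu, \sigma^2):\quad
h(x) = \frac{1}{\sqrt{2\pi}},\quad
\phi(x) = \begin{pmatrix} x \\ x^2 \end{pmatrix},\quad
w = \begin{pmatrix} \mu/\sigma^2 \\ -1/(2\sigma^2) \end{pmatrix},\quad
A(w) = \frac{\mu^2}{2\sigma^2} + \log \sigma .
```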

  21. Moments and gradients: correspondence between moments and the log-partition function (just like in log-linear models). Can compute moments from derivatives, and derivatives from moments! MLE ⇔ moment matching.
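Concretely, the standard identities behind this correspondence are:

```latex
\nabla_w A(w) = \mathbb{E}_{w}\big[\phi(x)\big],
\qquad
\nabla_w^2 A(w) = \mathrm{Cov}_{w}\big[\phi(x)\big] \succeq 0 ,
```

so A(w) is convex, the log-likelihood is concave, and setting the gradient of the log-likelihood to zero gives the moment-matching condition E_w[φ(x)] = (1/m) Σ_j φ(x^{(j)}).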

  22. Recall: conjugate priors. Consider parametric families of prior distributions P(θ) = f(θ; λ), where λ is called the "hyperparameters" of the prior. A prior P(θ) = f(θ; λ) is called conjugate for a likelihood function P(D | θ) if P(θ | D) = f(θ; λ'): the posterior has the same parametric form, and the hyperparameters are updated based on the data D. Obvious questions (answered later): How to choose the hyperparameters? Why limit ourselves to conjugate priors?

  23. Conjugate priors in the Exponential Family: any exponential family likelihood has a conjugate prior.
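The extracted slide omits the formula; one standard parameterization of that conjugate prior (the notation λ, ν is assumed here) is itself an exponential family over the natural parameters w:

```latex
P(w \mid \lambda, \nu) \;\propto\; \exp\big( \lambda^{\top} w - \nu\, A(w) \big).
```

Observing data x^{(1)}, …, x^{(m)} then updates the hyperparameters to λ' = λ + Σ_j φ(x^{(j)}) and ν' = ν + m.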

  24. Maximum Entropy interpretation. Theorem: exponential family distributions maximize the entropy over all distributions satisfying the given constraints on the expected sufficient statistics E[φ(x)].

  25. Summary: exponential family. Distributions of the form P(x | w) = h(x) exp(w^T φ(x) - A(w)). Most common distributions are exponential family: Multinomial, Gaussian, Poisson, Exponential, Gamma, Weibull, chi-square, Dirichlet, Geometric, …, as well as log-linear Markov Networks. All exponential family distributions have a conjugate prior in the EF. Moments of the sufficient statistics = derivatives of the log-partition function. Maximum Entropy distributions ("most uncertain" distributions with specified expected sufficient statistics).

  26. Exponential family graphical models. So far we only defined graphical models over discrete variables, but we can define GMs over continuous distributions! For exponential family distributions, we can do much of what we discussed (VE, JT inference, parameter learning, etc.). Important example: Gaussian Networks.

  27. Gaussian distribution, with mean µ and standard deviation σ.
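The density, for reference:

```latex
\mathcal{N}(x; \mu, \sigma^2) \;=\; \frac{1}{\sqrt{2\pi}\,\sigma}\, \exp\!\Big( -\frac{(x-\mu)^2}{2\sigma^2} \Big).
```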

  28. Bivariate Gaussian distribution (3D surface plots of two example densities).

  29. Multivariate Gaussian distribution: joint distribution over n random variables P(X_1, …, X_n), with σ_jk = E[(X_j − µ_j)(X_k − µ_k)]. X_j and X_k independent ⇒ σ_jk = 0.
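The joint density the slide refers to:

```latex
P(x_1, \dots, x_n) \;=\; \frac{1}{(2\pi)^{n/2}\, \lvert \Sigma \rvert^{1/2}}
  \exp\!\Big( -\tfrac{1}{2} (x - \mu)^{\top} \Sigma^{-1} (x - \mu) \Big).
```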

  30. Marginalization. Suppose (X_1, …, X_n) ~ N(µ, Σ); what is P(X_1)? More generally, let A = {i_1, …, i_k} ⊆ {1, …, n} and write X_A = (X_{i_1}, …, X_{i_k}); then X_A ~ N(µ_A, Σ_AA).

  31. Conditioning. Suppose (X_1, …, X_n) ~ N(µ, Σ) and decompose it as (X_A, X_B); what is P(X_A | X_B)? P(X_A = x_A | X_B = x_B) = N(x_A; µ_{A|B}, Σ_{A|B}), where µ_{A|B} and Σ_{A|B} are given below. Computable using linear algebra!
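The "where" clause was lost in extraction; the standard Gaussian conditioning formulas are:

```latex
\mu_{A \mid B} \;=\; \mu_A + \Sigma_{AB}\, \Sigma_{BB}^{-1} (x_B - \mu_B),
\qquad
\Sigma_{A \mid B} \;=\; \Sigma_{AA} - \Sigma_{AB}\, \Sigma_{BB}^{-1}\, \Sigma_{BA}.
```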

  32. Conditioning (plot of a bivariate Gaussian sliced at X_1 = 0.75, giving P(X_2 | X_1 = 0.75)).

  33. Conditional linear Gaussians.
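The extracted slide has no body; for reference, a conditional linear Gaussian (the usual definition, assumed to match the lecture's) models a continuous variable Y given continuous parents x as a Gaussian whose mean is linear in x:

```latex
P(Y \mid x) \;=\; \mathcal{N}\big( \beta_0 + \beta^{\top} x,\; \sigma^2 \big).
```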

  34. Canonical representation of Gaussians.

  35. Canonical Representation: multivariate Gaussians are in the exponential family! Standard vs. canonical form: η = Σ^{-1} µ, Λ = Σ^{-1}. In standard form, marginalization is easy; we will see that in canonical form, multiplication/conditioning is easy!
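In this canonical (information) form, with the symbols η and Λ as reconstructed above, the density reads:

```latex
P(x) \;\propto\; \exp\!\Big( \eta^{\top} x - \tfrac{1}{2}\, x^{\top} \Lambda\, x \Big),
\qquad
\eta = \Sigma^{-1} \mu, \quad \Lambda = \Sigma^{-1}.
```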

  36. Gaussian Networks: zeros in the precision matrix Λ indicate missing edges in the log-linear model!
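A small numerical illustration of this point, assuming a three-variable chain X1 - X2 - X3; the matrix entries are made up for this sketch:

```python
import numpy as np

# Precision (information) matrix of a Gaussian chain X1 -- X2 -- X3:
# the (1,3) entry is zero, so there is no edge between X1 and X3.
Lam = np.array([[ 2.0, -1.0,  0.0],
                [-1.0,  2.0, -1.0],
                [ 0.0, -1.0,  2.0]])

Sigma = np.linalg.inv(Lam)          # covariance matrix
print(Sigma[0, 2])                  # nonzero: X1 and X3 are marginally dependent

# Conditioning on X2: the conditional covariance of (X1, X3) given X2 is the
# Schur complement below, and its off-diagonal entry is (numerically) zero.
A, B = [0, 2], [1]
cond_cov = (Sigma[np.ix_(A, A)]
            - Sigma[np.ix_(A, B)] @ np.linalg.inv(Sigma[np.ix_(B, B)]) @ Sigma[np.ix_(B, A)])
print(cond_cov[0, 1])               # ~0: X1 and X3 are conditionally independent given X2
```

The off-diagonal of `cond_cov` vanishes because it is the inverse of Λ restricted to {X1, X3}, whose off-diagonal entry is exactly the zero corresponding to the missing edge.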

  37. Inference in Gaussian Networks: can compute marginal distributions in O(n^3)! For large numbers n of variables this is still intractable, but if the Gaussian Network has low treewidth, we can use variable elimination / JT inference. Need to be able to multiply and marginalize factors!

  38. Multiplying factors in Gaussians.
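The rule the slide presumably shows (standard for canonical-form Gaussian factors; the notation is the same as above): multiplication simply adds the canonical parameters,

```latex
\exp\!\Big( \eta_1^{\top} x - \tfrac{1}{2} x^{\top} \Lambda_1 x \Big)
\cdot
\exp\!\Big( \eta_2^{\top} x - \tfrac{1}{2} x^{\top} \Lambda_2 x \Big)
\;=\;
\exp\!\Big( (\eta_1 + \eta_2)^{\top} x - \tfrac{1}{2} x^{\top} (\Lambda_1 + \Lambda_2)\, x \Big),
```

where factors defined over subsets of the variables are first padded with zeros to a common dimension.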

  39. Marginalizing in canonical form. Recall the conversion formulas η = Σ^{-1} µ and Λ = Σ^{-1}; the marginal distribution is given below.
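A sketch of that marginal over X_A in canonical form, obtained by converting to moment form and back (equivalently, a Schur complement of Λ):

```latex
\eta'_A \;=\; \eta_A - \Lambda_{AB}\, \Lambda_{BB}^{-1}\, \eta_B,
\qquad
\Lambda'_{AA} \;=\; \Lambda_{AA} - \Lambda_{AB}\, \Lambda_{BB}^{-1}\, \Lambda_{BA}.
```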

  40. Variable elimination: in Gaussian Markov Networks, variable elimination = Gaussian elimination (fast for low-bandwidth, i.e. low-treewidth, matrices).

  41. Tasks: read Koller & Friedman, Chapters 4.6.1 and 8.1-8.3.
