Probabilistic Graphical Models – Lecture 10: Undirected Models


SLIDE 1

Probabilistic Graphical Models

Lecture 10 – Undirected Models

CS/CNS/EE 155 Andreas Krause

SLIDE 2

Announcements

  • Homework 2 due this Wednesday (Nov 4) in class
  • Project milestones due next Monday (Nov 9)
      • About half the work should be done
      • 4 pages of writeup, NIPS format: http://nips.cc/PaperInformation/StyleFiles

SLIDE 3

Markov Networks

(a.k.a. Markov Random Field, Gibbs Distribution, …)

A Markov Network consists of
  • an undirected graph, where each node represents a random variable
  • a collection of factors defined over cliques in the graph

Joint probability: a distribution factorizes over the undirected graph G if it can be written as a normalized product of these clique factors.

[Figure: example Markov network over variables X1, …, X9]
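For reference, the factorization just described has the standard Gibbs form, with one factor φi per clique Ci of G:

  P(X1, …, Xn) = (1/Z) ∏i φi(Ci),    Z = Σ over all joint assignments of ∏i φi(Ci)

Z is the partition function; a distribution factorizes over G exactly when it can be written in this form.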

SLIDE 4

Computing Joint Probabilities

  • Computing joint probabilities in BNs
  • Computing joint probabilities in Markov Nets
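In their standard forms, the two computations contrast as follows:

  • BNs: P(X1, …, Xn) = ∏i P(Xi | PaXi); look up and multiply the relevant CPT entries, and the result is already a probability.
  • Markov Nets: P(X1, …, Xn) = (1/Z) ∏i φi(Ci); multiplying the factors gives only an unnormalized value, so obtaining a probability additionally requires the partition function Z, a sum over all joint assignments.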

SLIDE 5

Local Markov Assumption for MN

The Markov Blanket MB(X) of a node X is the set of neighbors of X

Local Markov Assumption: X ⊥ EverythingElse | MB(X)

Iloc(G) = set of all local independences

G is called an I-map of distribution P if Iloc(G) ⊆ I(P)

[Figure: example Markov network over variables X1, …, X9]

SLIDE 6

Factorization Theorem for Markov Nets

True distribution P can be represented exactly as a Markov net (G, P)

  ⇒  Iloc(G) ⊆ I(P), i.e., G is an I-map of P (independence map)

SLIDE 7

Factorization Theorem for Markov Nets (Hammersley-Clifford Theorem)

Iloc(G) ⊆ I(P), i.e., G is an I-map of P (independence map), and P > 0

  ⇒  True distribution P can be represented exactly as a Markov net (G, P)

SLIDE 8

Global independencies

A trail X—X1—…—Xm—Y is called active for evidence E if none of X1, …, Xm is in E

Variables X and Y are called separated by E if there is no active trail for E connecting X and Y; write sep(X, Y | E)

I(G) = { X ⊥ Y | E : sep(X, Y | E) }

[Figure: example Markov network over variables X1, …, X9]
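As a concrete illustration of these definitions, sep(X, Y | E) can be tested by deleting the evidence nodes and checking reachability. The following is a minimal sketch (function and variable names are illustrative, not from the course materials):

```python
from collections import deque

def separated(graph, x, y, evidence):
    """Check sep(x, y | evidence) in an undirected graph.

    graph: dict mapping each node to a set of neighbor nodes.
    A trail x -- x1 -- ... -- xm -- y is active for evidence E only if
    none of x1, ..., xm is in E, so x and y are separated given E iff
    y is unreachable from x once the evidence nodes are removed.
    Assumes x and y are not themselves in evidence.
    """
    visited = {x}
    queue = deque([x])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr == y:
                return False      # found an active trail to y
            if nbr in evidence or nbr in visited:
                continue
            visited.add(nbr)
            queue.append(nbr)
    return True

# Tiny example: chain X1 -- X2 -- X3. Observing X2 separates X1 from X3.
chain = {"X1": {"X2"}, "X2": {"X1", "X3"}, "X3": {"X2"}}
print(separated(chain, "X1", "X3", {"X2"}))  # True
print(separated(chain, "X1", "X3", set()))   # False
```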

SLIDE 9

Soundness of separation

Know: for positive distributions P > 0,  Iloc(G) ⊆ I(P)  ⟺  P factorizes over G

Theorem (Soundness of separation): for positive distributions P > 0,

  Iloc(G) ⊆ I(P)  ⇒  I(G) ⊆ I(P)

Hence, separation captures only true independences. How about I(G) = I(P)?

SLIDE 10

Completeness of separation

Theorem: Completeness of separation

I(G) = I(P)

for “almost all” distributions P that factorize over G

“almost all”: except for a set of potential parameterizations of measure 0 (assuming no finite set of parameterizations has positive measure)

SLIDE 11

Minimal I-maps

For BNs: the minimal I-map is not unique

For MNs: for positive P, the minimal I-map is unique!

[Figure: example over variables E, B, A, J, M, shown with several different minimal I-map structures]

SLIDE 12

P-maps

Do P-maps always exist?

For BNs: no. How about Markov Nets?

SLIDE 13

Exact inference in MNs

Variable elimination and junction tree inference work exactly the same way!

Need to construct junction trees by obtaining a chordal graph through triangulation

SLIDE 14

Pairwise MNs

A pairwise MN is an MN where all factors are defined over single variables or pairs of variables

Can reduce any MN to a pairwise MN! (The standard construction introduces, for each larger factor, an auxiliary variable whose values range over the joint assignments of that factor's clique, linked to the original variables by pairwise consistency factors.)

[Figure: example pairwise Markov network over variables X1, …, X5]

SLIDE 15

Logarithmic representation

Can represent any positive distribution in log domain
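Concretely, the standard construction writes every (strictly positive) factor as an exponential:

  φi(Ci) = exp( −εi(Ci) ),   so that   P(X1, …, Xn) = (1/Z) exp( −Σi εi(Ci) )

with εi = −log φi (often called an energy); positivity of the factors is exactly what makes the logarithm well defined.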

SLIDE 16

Log-linear models

Feature functions φi(Di) defined over cliques; log-linear model over undirected graph G

  • Feature functions φ1(D1), …, φk(Dk)
  • Domains Di can overlap
  • Set of weights wi learnt from data
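Putting the pieces together, the log-linear model has the standard form

  P(X1, …, Xn ; w) = (1/Z(w)) exp( Σi wi φi(Di) ),    Z(w) = Σ over all joint assignments of exp( Σi wi φi(Di) )

so each factor is exp( wi φi(Di) ), and the weights wi are the only free parameters.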

SLIDE 17

Converting BNs to MNs

Theorem: The moralized Bayes net is a minimal Markov I-map

[Figure: example Bayes net over variables C, D, I, G, S, L, J, H]
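As a sketch of the moralization construction behind this theorem (helper names are illustrative, not the course's code): keep every BN edge with its direction dropped and connect (“marry”) the parents of each node, so that each family {X} ∪ Pa(X) becomes a clique.

```python
from itertools import combinations

def moralize(parents):
    """Moralize a Bayes net given as {node: list of parents}.

    Returns an undirected graph as {node: set of neighbors}:
    every directed edge is kept with its direction dropped, and the
    parents of each node are connected pairwise, so each family
    {X} U Pa(X) forms a clique in the resulting Markov network.
    """
    nodes = set(parents)
    for pa in parents.values():
        nodes.update(pa)
    graph = {n: set() for n in nodes}

    def add_edge(a, b):
        graph[a].add(b)
        graph[b].add(a)

    for child, pa in parents.items():
        for p in pa:
            add_edge(child, p)          # drop the edge direction
        for p, q in combinations(pa, 2):
            add_edge(p, q)              # marry co-parents

    return graph

# Example: the v-structure E -> A <- B becomes the triangle E -- A -- B.
print(moralize({"A": ["E", "B"], "E": [], "B": []}))
```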

SLIDE 18

Converting MNs to BNs

Theorem: The minimal Bayes net I-map for an MN must be chordal

[Figure: example network over variables X1, X2, X3, X6, X7, X8, X9]

SLIDE 19

So far

Markov Network Representation

  • Local/Global Markov assumptions; separation
  • Soundness and completeness of separation

Markov Network Inference

Variable elimination and Junction Tree inference work exactly as in Bayes Nets

How about Learning Markov Nets?

SLIDE 20

Parameter Learning for Bayes nets

SLIDE 21

Algorithm for BN MLE

SLIDE 22

MLE for Markov Nets

Log likelihood of the data
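For data D = {x(1), …, x(m)} and factors φi(Ci ; θ), the log-likelihood referred to here has the standard form

  l(D | θ) = Σj Σi log φi( ci[x(j)] ; θ )  −  m · log Z(θ)

The first term decomposes over factors and data points; the obstacle is the m · log Z(θ) term (next slide).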

SLIDE 23

Log-likelihood doesn’t decompose

Log-likelihood l(D | θ) is a concave function!

Log partition function log Z(θ) doesn’t decompose
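As a one-line reminder of why:

  log Z(θ) = log Σ over all joint assignments of ∏i φi( Ci ; θ )

the sum runs over every joint assignment, so it couples all parameters at once and cannot be split into independent per-factor terms (unlike the BN log-likelihood, which decomposes into per-CPT terms).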

SLIDE 24

Derivative of log-likelihood

SLIDE 25

Derivative of log-likelihood

SLIDE 26

Computing the derivative

Derivative: computing P(ci | θ) requires inference!

Can optimize using conjugate gradient, etc.

[Figure: example network over variables C, D, I, G, S, L, J, H]
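In the standard log-parameterization of table factors, θi(ci) = log φi(ci), the derivative takes the form

  ∂ l(D | θ) / ∂ θi(ci)  =  M̂[ci]  −  m · P(ci | θ)

where M̂[ci] counts how many data points assign ci to clique Ci, and P(ci | θ) is the model's clique marginal; it is this marginal that requires inference at every gradient evaluation.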

SLIDE 27

Alternative approach: Iterative Proportional Fitting (IPF)

At the optimum, it must hold that the model's clique marginals equal the empirical clique marginals

Solve the resulting fixed-point equation by iterating

Must recompute every iteration: each factor update changes the model, so its marginals have to be re-inferred
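In its usual form, the IPF update solves this fixed point one factor at a time:

  φi(t+1)(ci)  =  φi(t)(ci) · P̂(ci) / P(t)(ci)

where P̂(ci) is the empirical marginal of the clique assignment ci and P(t)(ci) is the same marginal under the current model; recomputing P(t)(ci) after each update is the inference step noted above.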

SLIDE 28

Parameter learning for log-linear models

Feature functions φi(Ci) defined over cliques; log-linear model over undirected graph G

  • Feature functions φ1(C1), …, φk(Ck)
  • Domains Ci can overlap

Joint distribution (as for log-linear models above):  P(X1, …, Xn ; w) = (1/Z(w)) exp( Σi wi φi(Ci) )

How do we get the weights wi?

SLIDE 29

Derivative of Log-likelihood 1

SLIDE 30

Derivative of Log-likelihood 2

SLIDE 31

Optimizing parameters

Gradient of the log-likelihood is zero  ⟺  expected feature counts under the model equal the empirical feature counts

Since the log-likelihood is concave, any such w is the MLE
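A minimal brute-force sketch of this optimization (the function name and setup are illustrative, not the course's code): it enumerates all joint assignments to get the model expectation, which is only feasible for toy problems, and follows the empirical-minus-expected feature-count gradient from the slides above.

```python
import numpy as np
from itertools import product

def fit_log_linear_weights(features, data, n_vars, n_states,
                           step=0.1, iters=2000):
    """MLE weights for a tiny log-linear MN by gradient ascent.

    features: list of functions f(x) -> float over full assignments x.
    data:     list of observed assignments (tuples of variable values).
    Gradient of the average log-likelihood w.r.t. w_i:
        E_data[f_i] - E_model[f_i],
    with the model expectation computed by brute-force enumeration.
    """
    w = np.zeros(len(features))
    all_x = list(product(range(n_states), repeat=n_vars))
    F = np.array([[f(x) for f in features] for x in all_x])   # assignments x features
    emp = np.mean([[f(x) for f in features] for x in data], axis=0)

    for _ in range(iters):
        logits = F @ w                    # unnormalized log-probabilities
        p = np.exp(logits - logits.max())
        p /= p.sum()                      # current model distribution
        model = p @ F                     # expected feature counts
        w += step * (emp - model)         # moment-matching gradient step
    return w

# Example: two binary variables with a single "agreement" feature.
feats = [lambda x: float(x[0] == x[1])]
data = [(0, 0), (1, 1), (0, 0), (1, 0)]   # agreement observed 3 times out of 4
print(fit_log_linear_weights(feats, data, n_vars=2, n_states=2))  # approx. log 3
```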

SLIDE 32

Regularization of parameters

Put prior on parameters w
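A common concrete choice is a zero-mean Gaussian prior on w, which turns maximum likelihood into MAP estimation:

  w ~ N(0, σ² I)   ⇒   maximize  l(D | w)  −  (1 / 2σ²) · Σi wi²

i.e., L2 regularization of the weights; a Laplace prior instead gives L1 regularization and encourages sparse weights.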

SLIDE 33

Summary: Parameter learning in MN

  • MLE in BN is easy (score decomposes)
  • MLE in MN requires inference (score doesn’t decompose)
  • Can optimize using gradient ascent or IPF

SLIDE 34

Tasks

Read Koller & Friedman Chapters 20.1-20.3, 4.6.1