
Markov random fields

Rasmus Waagepetersen December 30, 2019 (slides under construction)

1 / 36

Outline:

1. specification of joint distributions
2. conditional specifications
3. conditional auto-regression
4. Brook's factorization
5. conditional independence and graphs
6. Hammersley-Clifford
7. estimation for the Ising model
8. Bayesian image analysis
9. the Gibbs sampler (an MCMC algorithm)
10. phase transition for the Ising model

2 / 36

Specification of joint distributions

Consider a random vector (X1, . . . , Xn). How do we specify its joint distribution?

1. assume X1, . . . , Xn independent - but often not realistic
2. assume (X1, . . . , Xn) jointly normal and specify the mean vector and covariance matrix (a positive definite n × n matrix)
3. use a copula (e.g. transform the marginal distributions of a joint normal)
4. specify f(x1), f(x2|x1), f(x3|x1, x2), etc.
5. specify the full conditional distributions Xi|X−i, where X−i = (X1, . . . , Xi−1, Xi+1, . . . , Xn) - but what is then the joint distribution, and does it exist?

In this part of the course we will consider the fifth option.

3 / 36

Conditional auto-regressions

Suppose Xi|X−i is normal. An auto-regression is a natural candidate for the conditional distribution:

$$X_i \mid X_{-i} = x_{-i} \sim N\Big(\alpha_i + \sum_{l \neq i}\gamma_{il}x_l,\ \kappa_i\Big) \qquad (1)$$

Equivalent and more convenient:

$$X_i \mid X_{-i} = x_{-i} \sim N\Big(\mu_i - \sum_{l \neq i}\beta_{il}(x_l-\mu_l),\ \kappa_i\Big) \qquad (2)$$

Is this consistent with a multivariate normal distribution Nn(µ, Σ) for X?

4 / 36


Brook’s lemma

Consider two outcomes x and y of X, where X has joint density p with p(y) > 0. Brook's factorization:

$$\frac{p(x)}{p(y)} = \prod_{i=1}^{n}\frac{p(x_i \mid x_1,\dots,x_{i-1},y_{i+1},\dots,y_n)}{p(y_i \mid x_1,\dots,x_{i-1},y_{i+1},\dots,y_n)}$$

Note there are n! ways to factorize! If the conditional densities are consistent with a joint density, we can choose a fixed y and determine p(x) via p(x) ∝ p(x)/p(y), where the RHS is evaluated using Brook's factorization.

NB: strictly speaking we should write pi(·|x−i) to be able to distinguish the different conditional characteristics - but we will be lazy.
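As a quick sanity check (not on the slide), take n = 2; the two factors in the factorization telescope:

$$\frac{p(x_1\mid y_2)}{p(y_1\mid y_2)}\cdot\frac{p(x_2\mid x_1)}{p(y_2\mid x_1)} = \frac{p(x_1,y_2)}{p(y_1,y_2)}\cdot\frac{p(x_1,x_2)}{p(x_1,y_2)} = \frac{p(x)}{p(y)}$$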

5 / 36

Application to conditional normal specification

We let y = µ = (µ1, . . . , µn). Then

$$\log\frac{p(x_i \mid x_1,\dots,x_{i-1},\mu_{i+1},\dots,\mu_n)}{p(\mu_i \mid x_1,\dots,x_{i-1},\mu_{i+1},\dots,\mu_n)}
= -\frac{1}{2\kappa_i}\Big[\Big(x_i-\mu_i+\sum_{l=1}^{i-1}\beta_{il}(x_l-\mu_l)\Big)^2 - \Big(\sum_{l=1}^{i-1}\beta_{il}(x_l-\mu_l)\Big)^2\Big]$$
$$= -\frac{1}{2\kappa_i}\Big[(x_i-\mu_i)^2 + 2\sum_{l=1}^{i-1}\beta_{il}(x_i-\mu_i)(x_l-\mu_l)\Big]$$

So

$$\log p(x) = \log p(\mu) - \frac{1}{2}\sum_{i=1}^{n}\sum_{l=1}^{n}\frac{\beta_{il}}{\kappa_i}(x_i-\mu_i)(x_l-\mu_l)$$

with βii = 1.

6 / 36

This is formally equivalent to a multivariate Gaussian density with mean vector µ and precision matrix Q = Σ−1 = [qij]ij, where qij = βij/κi. It is a well-defined Gaussian density provided Q is symmetric and positive definite (in which case Σ = Q−1 is also symmetric and positive definite).

7 / 36

Conditional distribution of Xi for N(µ, Q−1)

$$p(x_i \mid x_{-i}) \propto \exp\Big(-\frac{1}{2}(x_i-\mu_i)^2 Q_{ii} - \sum_{k\neq i}(x_i-\mu_i)(x_k-\mu_k)Q_{ik}\Big)$$

For a normal distribution Y ∼ N(ξ, σ²),

$$p(y) \propto \exp\Big(-\frac{1}{2\sigma^2}y^2 + \frac{1}{\sigma^2}y\xi\Big)$$

Comparing the two expressions above we get

$$X_i \mid X_{-i} = x_{-i} \sim N\Big(\mu_i - \frac{1}{Q_{ii}}\sum_{k\neq i}Q_{ik}(x_k-\mu_k),\ Q_{ii}^{-1}\Big)$$

Thus the auto-regressions on slide 4 are in fact the general form of the conditional distributions for a multivariate normal distribution!
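As a quick numerical check (not part of the slides; all names below are illustrative), one can compare this precision-based full conditional with the usual covariance-based conditioning:

```r
## Sketch: compare the full-conditional formula above with covariance-based
## conditioning for a small random precision matrix.
set.seed(1)
n <- 4
A <- matrix(rnorm(n * n), n, n)
Q <- crossprod(A) + diag(n)         # symmetric positive definite precision
mu <- rnorm(n)
Sigma <- solve(Q)
i <- 2
x <- rnorm(n)                       # arbitrary values to condition on

## Full conditional from the precision matrix (formula on this slide)
cond_mean_Q <- mu[i] - sum(Q[i, -i] * (x[-i] - mu[-i])) / Q[i, i]
cond_var_Q  <- 1 / Q[i, i]

## Standard conditioning via the covariance matrix
S12 <- Sigma[i, -i, drop = FALSE]
S22 <- Sigma[-i, -i]
cond_mean_S <- mu[i] + S12 %*% solve(S22, x[-i] - mu[-i])
cond_var_S  <- Sigma[i, i] - S12 %*% solve(S22, t(S12))

c(cond_mean_Q, cond_mean_S)   # agree
c(cond_var_Q, cond_var_S)     # agree
```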

8 / 36


Example: Gaussian random field on 1D lattice

Consider the lattice V = {l | l = 1, . . . , L}. Define µi = 0, κi = βii = 1 and βij = β ⇔ |i − j| mod (L − 2) = 1 (so each site has exactly two neighbours, with sites 1 and L neighbours of each other). Q is obviously symmetric. Q is not positive definite if β = −1/2. Q is positive definite ⇔ |β| < 1/2 (exercise in the case L = 4: consider the determinant of Q).
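A small sketch (the helper name is mine, not course code) that builds this Q and checks positive definiteness numerically:

```r
## Sketch: build the circulant precision matrix for the 1D example and check
## positive definiteness by its smallest eigenvalue, for varying beta.
make_Q_1d <- function(L, beta) {
  Q <- diag(L)
  for (i in 1:L) for (j in 1:L) {
    if (i != j && (abs(i - j) %% (L - 2)) == 1) Q[i, j] <- beta
  }
  Q
}
min(eigen(make_Q_1d(10, 0.4))$values)    # > 0: positive definite
min(eigen(make_Q_1d(10, -0.5))$values)   # ~ 0: not positive definite
```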

9 / 36

Example: Gaussian random field on 2D lattice

Consider the lattice V = {(l, k) | l = 1, . . . , L, k = 1, . . . , K}. Indices i, j ∈ V now correspond to points (i1, i2) and (j1, j2). Define i, j ∈ V to be neighbours ⇔ |i1 − j1| + |i2 − j2| = 1 (i and j horizontal or vertical neighbours).

Tempting: define βii = 1 and βij = 1/#Ni, where #Ni is the number of neighbours (2, 3 or 4) of i, and κi = κ > 0.

Problem: the resulting Q is only positive semi-definite: x^TQx = 0 ⇔ x = a·1n for some a ∈ R. We can modify it by Q := Q + τI, where τ > 0. The modified Q is positive definite and we obtain the modified conditional distributions

$$X_i \mid X_{-i} = x_{-i} \sim N\Big(\mu_i - \frac{\kappa}{\kappa(1+\tau)}\cdot\frac{1}{\#N_i}\sum_{k \sim i}(x_k-\mu_k),\ \frac{\kappa}{1+\tau}\Big)$$
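A sketch of the same idea in code. Note that it uses the symmetric graph-Laplacian form Q = D − A (an assumption on my part, since the parametrization above leaves some freedom); this Q is positive semi-definite with the constant vector spanning its null space, and Q + τI is positive definite:

```r
## Sketch: graph-Laplacian precision Q = D - A on an L x K grid; check the null
## space and the effect of adding tau * I. Helper name is hypothetical.
grid_laplacian <- function(L, K) {
  n <- L * K
  idx <- function(l, k) (k - 1) * L + l       # column-major site index
  A <- matrix(0, n, n)
  for (l in 1:L) for (k in 1:K) {
    if (l < L) { A[idx(l, k), idx(l + 1, k)] <- 1; A[idx(l + 1, k), idx(l, k)] <- 1 }
    if (k < K) { A[idx(l, k), idx(l, k + 1)] <- 1; A[idx(l, k + 1), idx(l, k)] <- 1 }
  }
  diag(rowSums(A)) - A                        # D - A
}
Q <- grid_laplacian(4, 5)
range(Q %*% rep(1, nrow(Q)))                  # constant vector lies in the null space
min(eigen(Q)$values)                          # ~ 0: only positive semi-definite
min(eigen(Q + 0.1 * diag(nrow(Q)))$values)    # > 0 after adding tau * I
```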

10 / 36

Markov random fields

Let V denote a finite set of vertices and E a set of edges, where an element e of E is of the form {i, j} for i ≠ j ∈ V (i.e. an edge is an unordered pair of vertices). G = (V, E) is a graph. i, j ∈ V are neighbours, i ∼ j, if {i, j} ∈ E. A random vector X = (Xi)i∈V is a Markov random field with respect to G if p(xi|x−i) = p(xi|xNi), where Ni is the set of neighbours of i and, for x = (xl)l∈V and A ⊆ V, xA = (xi)i∈A. In other words, Xi and Xj are conditionally independent given X−{i,j} if i and j are not neighbours.

11 / 36

Hammersley-Clifford

Consider a positive density for X = (Xi)i∈V and a graph G = (V, E). Then the following statements are equivalent:

1. X is a MRF wrt G.
2. p(x) = ∏_{C⊆V} φC(xC) for interaction functions φC, where φC ≡ 1 unless C is a clique wrt G.

We can further introduce the constraint φC(xC) = 1 if xl = yl for some l ∈ C, for some fixed y. Then the interaction functions are uniquely determined by the full conditionals.

Notation: for ease of notation we often write i for {i}, and (xA, yB) denotes a vector with entries xi for i ∈ A and yj for j ∈ B, A ∩ B = ∅ (a convenient but not rigorous notation).

Clique: C ⊆ V is a clique if i ∼ j for all i, j ∈ C with i ≠ j.

12 / 36


Proof: 2. ⇒ 1.

$$p(x_i \mid x_{-i}) \propto \prod_{C\subseteq V:\, i\in C}\phi_C(x_C)$$

The RHS depends only on xj for j ∈ Ni: if l ∈ C, l ≠ i, is not a neighbour of i, then C cannot be a clique, so φC(xC) = 1 and the factor does not depend on xl.

13 / 36

Proof: 1. ⇒ 2.

We choose an arbitrary reference outcome y for X. We then define φ∅ = p(y) and, recursively,

$$\phi_C(x_C) = \begin{cases} 1 & \text{if } C \text{ is not a clique, or } x_l = y_l \text{ for some } l \in C,\\[4pt] \dfrac{p(x_C, y_{-C})}{\prod_{B\subset C}\phi_B(x_B)} & \text{otherwise.} \end{cases}$$

Let x = (xA, y−A) where xl ≠ yl for all l ∈ A. We show 2. by induction in the cardinality |A| of A. If |A| = 0 then x = y and p(y) = φ∅, so 2. holds.

Assume now that 2. holds for |A| = k − 1 where k ≤ |V| and consider A with |A| = k.

Assume A is a clique. Then, by construction,

$$p(x_A, y_{-A}) = \phi_A(x_A)\prod_{B\subset A}\phi_B(x_B)$$

and we are done, since for C ⊆ V which is not a subset of A we have φC((xA, y−A)C) = 1 by construction. NB: we don't need the induction hypothesis in this case.

14 / 36

Assume A is not a clique, i.e. there exist l, j ∈ A such that l ≁ j. Then

$$p(x_A, y_{-A}) = \frac{p(x_l \mid x_{A\setminus l}, y_{-A})}{p(y_l \mid x_{A\setminus l}, y_{-A})}\, p(x_{A\setminus l}, y_{-A}, y_l)
= \frac{p(x_l \mid x_{A\setminus\{l,j\}}, y_j, y_{-A})}{p(y_l \mid x_{A\setminus\{l,j\}}, y_j, y_{-A})}\, p(x_{A\setminus l}, y_{-A}, y_l)$$
$$= \frac{p(x_l, x_{A\setminus\{l,j\}}, y_j, y_{-A})}{p(y_l, x_{A\setminus\{l,j\}}, y_j, y_{-A})}\, p(x_{A\setminus l}, y_{-A}, y_l)
= \frac{\prod_{C\subseteq A\setminus j}\phi_C(x_C)}{\prod_{C\subseteq A\setminus\{l,j\}}\phi_C(x_C)}\prod_{C\subseteq A\setminus l}\phi_C(x_C)
= \prod_{C\subseteq A}\phi_C(x_C)$$

where the second "=" follows from 1. (l and j are not neighbours) and the fourth "=" from the induction hypothesis. Thus 2. also holds in this case.

15 / 36

Brooks vs. Hammersley-Clifford

Given the full conditionals we can use either Brook's lemma or H-C to identify the simultaneous (joint) distribution. However, Brook's factorization in principle yields n! solutions (possible non-uniqueness), and we need to check that the constructed p(x) is consistent with the given full conditionals.

For H-C, we can construct the interaction functions from the full conditionals following the proof of 1. ⇒ 2. For a given y these interaction functions, and hence p(·), are uniquely determined by the full conditionals. Moreover, we can easily check that the constructed interaction functions are consistent with the full conditionals since

$$p(x_i \mid x_{-i}) \propto \frac{p(x_i \mid x_{-i})}{p(y_i \mid x_{-i})} = \prod_{C:\, i\in C}\phi_C(x_C)$$

Both for Brook's and H-C, we need to check that the identified (unnormalized) simultaneous density indeed has finite integral!

16 / 36


Gaussian MRF

A Gaussian vector X = (Xl)l∈V is a MRF with respect to G if Qij = Qji differs from zero (for i ≠ j) only when {i, j} ∈ E. Letting y = (µl)l∈V, we have

$$\phi_\emptyset = (2\pi)^{-|V|/2}|Q|^{1/2}, \qquad \phi_l(x_l) = \exp\Big(-\frac{1}{2\kappa_l}(x_l-\mu_l)^2\Big), \qquad \phi_{\{i,j\}}(x_i,x_j) = \exp\big(-Q_{ij}(x_i-\mu_i)(x_j-\mu_j)\big)$$

and φC(xC) = 1 for |C| > 2. Note that φ{i,j} covers both pairs (i, j) and (j, i).

17 / 36

Auto-logistic model

Consider again the 2D rectangular lattice with horizontal/vertical neighbours. The only cliques are singletons and pairs of horizontal or vertical neighbours. Consider a stochastic vector X on {0, 1}^V with

$$p(x_i \mid x_{-i}) = \frac{\exp\big(\alpha x_i + \beta\sum_{j\in N_i}x_i x_j\big)}{1+\exp\big(\alpha+\beta\sum_{j\in N_i}x_j\big)}$$

Following the construction in the proof of Hammersley-Clifford with y = (0, . . . , 0) we obtain

$$\phi_i(x_i) = \exp(\alpha x_i), \qquad \phi_{\{i,j\}}(x_{\{i,j\}}) = \exp(\beta x_i x_j)$$

Hence the joint density is

$$p(x) = p(0)\exp\Big(\alpha\sum_{l\in V}x_l + \beta\sum_{\{i,j\}\in E}x_i x_j\Big)$$
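As a sketch (the helpers log_h and grid_edges are mine, not course code), one can verify numerically on a tiny lattice that this joint density has the stated logistic full conditionals:

```r
## Sketch: check that p(x) propto exp(alpha*sum(x) + beta*sum_{i~j} x_i x_j)
## yields the logistic full conditional stated above, on a 3 x 3 grid.
log_h <- function(x, alpha, beta, edges)
  alpha * sum(x) + beta * sum(x[edges[, 1]] * x[edges[, 2]])
grid_edges <- function(L, K) {                 # neighbour pairs on an L x K grid
  idx <- function(l, k) (k - 1) * L + l
  e <- NULL
  for (l in 1:L) for (k in 1:K) {
    if (l < L) e <- rbind(e, c(idx(l, k), idx(l + 1, k)))
    if (k < K) e <- rbind(e, c(idx(l, k), idx(l, k + 1)))
  }
  e
}
alpha <- -0.5; beta <- 0.8
edges <- grid_edges(3, 3)
x <- rbinom(9, 1, 0.5); i <- 5                 # an arbitrary configuration and site
x1 <- replace(x, i, 1); x0 <- replace(x, i, 0)
## conditional P(X_i = 1 | x_{-i}) from the joint ...
p_joint <- exp(log_h(x1, alpha, beta, edges)) /
           (exp(log_h(x0, alpha, beta, edges)) + exp(log_h(x1, alpha, beta, edges)))
## ... and from the logistic expression, with s = sum of neighbouring values
nbrs <- c(edges[edges[, 1] == i, 2], edges[edges[, 2] == i, 1])
s <- sum(x[nbrs])
p_logit <- exp(alpha + beta * s) / (1 + exp(alpha + beta * s))
c(p_joint, p_logit)                            # agree
```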

18 / 36

Unknown normalizing constant

Note that the sum defining

$$p(0) = \Big[\sum_{x\in S}\exp\Big(\alpha\sum_{l\in V}x_l + \beta\sum_{\{i,j\}\in E}x_i x_j\Big)\Big]^{-1}$$

has 2^{LK} terms! In general it is impossible to compute exactly. Hence we only know p(·) up to proportionality.
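Continuing the sketch from the previous slide, the normalizing constant can be brute-forced only for very small lattices:

```r
## Sketch: brute-force the normalizing constant on the same 3 x 3 lattice --
## 2^(L*K) = 512 terms here, hopeless for realistic image sizes.
configs <- as.matrix(expand.grid(rep(list(0:1), 9)))   # all 2^9 configurations
Z <- sum(exp(apply(configs, 1, log_h, alpha = alpha, beta = beta, edges = edges)))
p0 <- 1 / Z                                            # = p(0), since h(0) = 1
```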

19 / 36

Boundary conditions

◮ free boundary: pixels at the edges have only 2 or 3 neighbours
◮ fixed boundary: we condition on fixed values of the boundary pixels. Then all interior "random" pixels have 4 neighbours
◮ toroidal: edge pixels are neighbours of the pixels on the opposite edge, e.g. pixel (1, j) becomes a neighbour of pixel (L, j). Hence all pixels have 4 neighbours.

20 / 36


Symmetric case

Suppose all pixels have 4 neighbours (fixed or toroidal boundary). If Σ_{j∈Ni} xj = 2, we may want p(0|x−i) = p(1|x−i).

This is achieved with α = −2β.

21 / 36

Ising model

The auto-logistic model is another name for the very famous Ising model from statistical physics, where 0, 1 are replaced by −1, 1 representing the "spins" of elementary particles in a piece of iron. An equivalent form is

$$p(x) \propto \exp\Big(\tilde\alpha\sum_{l\in V}x_l + \tilde\beta\sum_{\{i,j\}\in E}1[x_i=x_j]\Big) \qquad (3)$$

That is, with β̃ > 0, the model assigns large probabilities to configurations x with many neighbours of equal value. If xi ∈ {0, 1} and all pixels have four neighbours, then (3) is equivalent to the auto-logistic model with α = α̃ − 4β̃ and β = 2β̃.

22 / 36

Simulation of MRF

Gaussian MRF: use sparse matrix Cholesky decomposition of precision matrix. General MRF: Markov chain Monte Carlo. Here we consider the so-called Gibbs sampler
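For the Gaussian case, a minimal sketch of precision-based simulation (dense chol() for illustration; for large lattices a sparse Cholesky, e.g. from the Matrix package, would be used), reusing the grid_laplacian helper from the slide-10 sketch:

```r
## Sketch: sample a Gaussian MRF N(mu, Q^{-1}) from the Cholesky factor of the
## precision matrix Q.
rGMRF <- function(n_samples, mu, Q) {
  R <- chol(Q)                            # upper triangular, Q = t(R) %*% R
  z <- matrix(rnorm(length(mu) * n_samples), nrow = length(mu))
  mu + backsolve(R, z)                    # each column: mu + R^{-1} z ~ N(mu, Q^{-1})
}
## Example: Laplacian variant of the 2D model from the earlier sketch, tau = 0.2
Q <- grid_laplacian(8, 8) + 0.2 * diag(64)
x <- rGMRF(1, rep(0, 64), Q)
image(matrix(x, 8, 8))                    # one realization, displayed as an image
```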

23 / 36

Gibbs sampler

Idea: generate a Markov chain X^1, X^2, . . . so that X^n converges to the distribution p of X.

A reasonable requirement: p is an invariant distribution for the Markov chain. That is, if X^i ∼ p then also X^{i+1} ∼ p. This is implied by reversibility:

P(X^i ∈ A, X^{i+1} ∈ B) = P(X^i ∈ B, X^{i+1} ∈ A) when X^i ∼ p

(set B equal to the sample space S of X; then reversibility implies P(X^i ∈ A) = P(X^{i+1} ∈ A)).

Gibbs sampler update: given X^i = x^i, pick l in V and let X^{i+1} = (X^i_{−l}, Y_l), where Y_l is sampled from the conditional distribution of X_l | X_{−l} = x^i_{−l}.

l can be chosen at random in V, or we can run through V in a systematic order.
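A minimal sketch of a systematic-scan Gibbs sampler for the auto-logistic model of slide 18 with free boundary (an illustration under my own parameter choices, not the course's reference implementation):

```r
## Sketch: single-site Gibbs updates for the auto-logistic model on an L x K grid.
gibbs_autologistic <- function(n_sweeps, L, K, alpha, beta, x = matrix(0, L, K)) {
  for (s in 1:n_sweeps) {
    for (l in 1:L) for (k in 1:K) {
      nb <- 0                                        # sum of neighbouring values
      if (l > 1) nb <- nb + x[l - 1, k]
      if (l < L) nb <- nb + x[l + 1, k]
      if (k > 1) nb <- nb + x[l, k - 1]
      if (k < K) nb <- nb + x[l, k + 1]
      p1 <- plogis(alpha + beta * nb)                # P(X_i = 1 | neighbours)
      x[l, k] <- rbinom(1, 1, p1)
    }
  }
  x
}
x <- gibbs_autologistic(200, 40, 40, alpha = -1.6, beta = 0.8)
image(x)   # with alpha = -2 * beta this is the symmetric case of slide 21
```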

24 / 36


Gibbs update is reversible

Let S = ∏_{l∈V} S_l be the sample space of X. Then

$$P(X^i\in A,\, X^{i+1}\in B) = \int_S\int_{S_l} 1[(x_{-l},y_l)\in B,\ x\in A]\, p(y_l\mid x_{-l})\, dy_l\, p(x)\, dx$$

Moreover, using a change of variable,

$$P(X^i\in B,\, X^{i+1}\in A) = \int_S\int_{S_l} 1[(x_{-l},y_l)\in A,\ x\in B]\, p(y_l\mid x_{-l})\, dy_l\, p(x)\, dx$$
$$= \int_S\int_{S_l} 1[x\in A,\ (x_{-l},y_l)\in B]\, p(x_l\mid x_{-l})\, dx_l\, p((x_{-l},y_l))\, dx_{-l}\, dy_l$$

These two integrals are equal since

$$p(x)\, p(y_l\mid x_{-l}) = p((x_{-l},y_l))\, p(x_l\mid x_{-l})$$

Under weak regularity conditions one can show that the Gibbs sampler Markov chain converges to p(·), i.e. X^1, X^2, . . . serves as a (correlated) random sample from p(·).

25 / 36

Estimation

Suppose we have observed a realization of the auto-logistic model. The likelihood is

$$p(x;\alpha,\beta) = p(0;\alpha,\beta)\exp\Big(\alpha\sum_{i\in V}x_i + \beta\sum_{i\sim j}x_i x_j\Big)$$

Problem: the normalizing constant

$$c(\alpha,\beta) = [p(0;\alpha,\beta)]^{-1} = \sum_{x\in S}\exp\Big(\alpha\sum_{i\in V}x_i + \beta\sum_{i\sim j}x_i x_j\Big)$$

cannot be evaluated exactly and is difficult to approximate numerically.

Normalizing constant: we can define

$$p(x;\alpha,\beta) \propto h(x) = \exp\Big(\alpha\sum_{i\in V}x_i + \beta\sum_{i\sim j}x_i x_j\Big)$$

If we normalize the RHS by dividing by c(α, β), then it becomes a probability density.

26 / 36

Besag’s pseudo-likelihood

The likelihood function for the auto-logistic model is intractable. Julian Besag therefore suggested to maximize instead the pseudo-likelihood

$$PL(\alpha,\beta) = \prod_{i\in V} p(x_i\mid x_{-i};\alpha,\beta)$$

The score of the log pseudo-likelihood is an unbiased estimating function (Bartlett identity), and one can show that the PL estimates are asymptotically normal. Computationally it is straightforward: formally equivalent to logistic regression.
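A sketch of the logistic-regression formulation (assuming a free boundary and that x is a 0/1 matrix, e.g. output of the Gibbs sampler sketch above; the helper neighbour_sum is mine, not course code):

```r
## Sketch: Besag's pseudo-likelihood for the auto-logistic model via glm --
## regress each pixel on the sum of its neighbours.
neighbour_sum <- function(x) {
  L <- nrow(x); K <- ncol(x)
  s <- matrix(0, L, K)
  s[-1, ] <- s[-1, ] + x[-L, ]    # neighbour above
  s[-L, ] <- s[-L, ] + x[-1, ]    # neighbour below
  s[, -1] <- s[, -1] + x[, -K]    # neighbour to the left
  s[, -K] <- s[, -K] + x[, -1]    # neighbour to the right
  s
}
fit <- glm(as.vector(x) ~ as.vector(neighbour_sum(x)), family = binomial)
coef(fit)   # (Intercept) ~ alpha, slope ~ beta (maximum pseudo-likelihood estimates)
```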

27 / 36

Bayesian image analysis

Consider a pixel image X = (Xi)i∈V where Xi represents the color/intensity of pixel i. Suppose we observe a "dirty image" Y where Yi = Xi + εi, and the εi are independent zero-mean noise terms with some density, εi ∼ f. We want to reconstruct X given the observation y of Y! Idea behind Bayesian image analysis: represent prior beliefs about X using a probability distribution and infer X using the posterior distribution of X|Y = y.

28 / 36


Pixel values continuous

Suppose Xi ∈ R. We believe neighbouring pixel values are similar. We might model this using the Gaussian MRF introduced on slide 10, i.e.

$$X_i \mid X_{-i} = x_{-i} \sim N\Big(\mu_i - \frac{1}{1+\tau}\cdot\frac{1}{\#N_i}\sum_{k\sim i}(x_k-\mu_k),\ \frac{1}{\kappa(1+\tau)}\Big)$$

That is, the conditional mean of Xi is essentially the average of the neighbours, plus some noise. Equivalently,

$$p(x) \propto \exp\big(-\tfrac{1}{2}(x-\mu)^\top(Q+\tau I)(x-\mu)\big)$$

Assume εi ∼ N(0, σ²). The posterior is

$$p(x\mid y) \propto p(y\mid x)\,p(x) \propto \exp\Big(-\frac{1}{2\sigma^2}\sum_{i\in V}(y_i-x_i)^2\Big)\,p(x) \qquad (4)$$

which is again a Gaussian MRF.
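Concretely (a short derivation along the lines of exercise 8, not spelled out on the slide), collecting the quadratic terms in (4) gives the posterior precision and mean:

$$Q_{\text{post}} = Q + \tau I + \sigma^{-2}I, \qquad \mu_{\text{post}} = Q_{\text{post}}^{-1}\big((Q+\tau I)\mu + \sigma^{-2}y\big),$$

i.e. X | Y = y ∼ N(µ_post, Q_post^{-1}); the off-diagonal pattern of Q_post equals that of Q + τI, so the neighbourhood structure is unchanged.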

29 / 36

Posterior is known exactly (we can evaluate normalizing constant). Note also: posterior Gaussian MRF is well-defined also with τ = 0 in which case it does not depend on µ. This is nice since we then do not need to specify µ.

30 / 36

Image segmentation

The image consists of two types of homogeneous regions (e.g. black or white). We may take Xi ∈ {0, 1} with 0 for black and 1 for white.

Homogeneity: most neighbouring pixel values are of the same type ⇒ use the Ising model as prior!

Assume again Gaussian noise. Then the posterior is

$$p(x\mid y) \propto \exp\Big(-\frac{1}{2\sigma^2}\sum_{i\in V}(y_i-x_i)^2 + \tilde\alpha\sum_{i\in V}x_i + \tilde\beta\sum_{i\sim j}1[x_i=x_j]\Big)$$

Again an MRF distribution! This time the normalizing constant is intractable, but we can at least simulate from the posterior using the Gibbs sampler. We may want to use the symmetric prior with α̃ = 0.
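For the Gibbs sampler, the single-pixel full conditional of this posterior follows directly (a short derivation assuming the Gaussian noise variance σ² as above; n_i(v) denotes the number of neighbours of pixel i with value v):

$$\frac{P(X_i=1\mid x_{-i},y)}{P(X_i=0\mid x_{-i},y)} = \exp\Big(\frac{2y_i-1}{2\sigma^2} + \tilde\alpha + \tilde\beta\big(n_i(1)-n_i(0)\big)\Big)$$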

31 / 36

Contingency tables and graphical models

Consider a K-way contingency table given by combinations of K factors, e.g. smoker (yes, no), lung cancer (yes, no), age (young, middle, old). Consider an individual/object which is classified according to random values of these factors; this leads to a random vector X that takes the value i = (i1, . . . , iK) if factor l takes the value il. E.g. the outcome could be (yes, no, middle) if the person is a middle-aged smoker without lung cancer.

Suppose we have n individuals. Then we can model the vector of counts N = (Ni)_{i ∈ ∏_{l=1}^{K} S_l} of individuals in each class given by a combination of factor levels using a multinomial distribution, N ∼ multinomial(n, (pi)i), where S_l denotes the set of levels of factor l.

Imposing a MRF structure allows us to study conditional independence properties of the various factors, and these can be visualized via the accompanying graph.

32 / 36


Exercises

1. Show that the two parametrizations (1) and (2) are equivalent under the condition that the matrix A with Aii = 1 and Aij = −γij is invertible. In other words, show that there is an invertible mapping between the parameter vectors (α1, . . . , αn, γ12, . . . , γ(n−1)n) and (µ1, . . . , µn, β12, . . . , β(n−1)n).
2. Verify Brook's lemma.
3. Show that the precision matrix, if it exists, is positive definite.
4. Identify the φC functions for the auto-logistic model.
5. Show that (3) is equivalent to the auto-logistic model in the case where all pixels have 4 neighbours. Hint: 1[xi = xj] = xixj + (1 − xi)(1 − xj) when xi, xj ∈ {0, 1}.
6. Auto-Poisson: suppose Xi|X−i = x−i is Poisson with mean exp(α + β Σ_{j∈Ni} xj). Find the joint distribution of X. Show that it is well-defined (meaning Σ_{x∈S} h(x) < ∞) when β ≤ 0, but not well-defined (Σ_{x∈S} h(x) = ∞) when β > 0, where h(·) denotes the unnormalized simultaneous density.

33 / 36

Exercises continued

7. How can you simulate a Gaussian MRF when the Cholesky decomposition Q = LL^T has been obtained for the precision matrix?
8. Show that the posterior distribution (4) is a Gaussian MRF. Hint: if Z ∼ Nn(ξ, K^{−1}), then p(z) ∝ exp(−(1/2) z^T K z + z^T K ξ).
9. Implement and run the Gibbs sampler for the Ising model (3) with xi ∈ {0, 1}. Use a fixed boundary with all boundary pixels equal to 1. Consider the symmetric case α̃ = 0 and the values β̃ = 0.4, 0.7, 0.9. What do you observe? (Some code is available on the webpage.)

34 / 36

Exercises continued

10.
  10.1 Show that the score function of the pseudo-likelihood is unbiased.
  10.2 Implement pseudo-likelihood estimation for the auto-logistic model when a fixed boundary condition is used (use the R procedure glm; some code is available on the webpage).
  10.3 Estimate α and β from the data set isingdata.txt using the fixed boundary condition. The data was generated from (3) with α̃ = 0 and β̃ = 0.4. Do your estimates of α and β seem reasonable compared to this?
11. The data set imageAnoisy.txt contains a binary (black/white) image corrupted by iid normal noise with mean zero and standard deviation 0.25. You can read and view it using temp=as.matrix(read.table("imageAnoisy.txt")); image(temp). Adapt the previously constructed Gibbs sampler to sample from the posterior distribution when the Ising model is used as a prior. Use toroidal edge correction and try out different β̃ values.

35 / 36

Conditional independence

Suppose X, Y, Z are random variables (or vectors). We define X and Y to be conditionally independent given Z if p(x, y|z) = p(x|z)p(y|z). The following statements are equivalent:

  • 1. p(x, y|z) = p(x|z)p(y|z)
  • 2. p(x, y, z) = f (x, z)g(y, z) for some functions f and g
  • 3. p(x|y, z) = p(x|z)
  • 4. p(y|x, z) = p(y|z)

(p(·) generic notation for (possibly conditional) probability densities)

36 / 36