SLIDE 1 LOCAL and GLOBAL INDEPENDENCE
in DISCRETE BAYESIAN GRAPHICAL MODELS
Jacek Wesołowski (GUS & Politechnika Warszawska, Warszawa), joint work with H. Massam (York Univ., Toronto). XLII Konferencja "STATYSTYKA MATEMATYCZNA", Będlewo, Nov. 28 - Dec. 2, 2016.
SLIDE 2 Plan
1. Introduction
2. Markovian structure imposed by DAGs
3. Local and global independence vs. HD law
SLIDE 3 Plan
1. Introduction
2. Markovian structure imposed by DAGs
3. Local and global independence vs. HD law
SLIDE 4 Discrete model
Let X = (X_v, v ∈ V) be a random vector assuming values in I = ×_{v∈V} I_v, where #(I_v) < ∞, v ∈ V. We write p(i) := P_X(i) = P(X = i), i ∈ I.
Let X_1, ..., X_n be iid with distribution P_X and let M_i = Σ_{j=1}^n I(X_j = i), i ∈ I. Then M = (M_i, i ∈ I) has a multinomial distribution, i.e.
P(M = m) = \binom{n}{m} ∏_{i∈I} p(i)^{m_i}, m = (m_i, i ∈ I), Σ_{i∈I} m_i = n.
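A minimal simulation sketch of this setup (an illustrative aside, not from the slides; the probability table `p` below is an arbitrary assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative p(i) on a 4-cell state space I (values are assumptions).
p = np.array([0.1, 0.2, 0.3, 0.4])
n = 1000

# n iid copies of X; M_i counts how many of the X_j fall in cell i.
X = rng.choice(len(p), size=n, p=p)
M = np.bincount(X, minlength=len(p))
print(M, M.sum())  # M ~ Multinomial(n, p); the counts sum to n
```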
SLIDE 5 Dirichlet law as an a priori distribution
The Bayesian approach means that one imposes some prior distribution on the parameter π = (p(i), i ∈ I). Since the only restrictions on π are p(i) ≥ 0, i ∈ I, and Σ_{i∈I} p(i) = 1, we need a probability measure supported on a unit simplex of the proper dimension.
A random vector (Y_1, ..., Y_r) has a (classical) Dirichlet distribution D(α_i, i = 1, ..., r) if the distribution of (Y_1, ..., Y_{r−1}) has the density
f(y_1, ..., y_{r−1}) = [Γ(α) / ∏_{i=1}^r Γ(α_i)] ∏_{i=1}^r y_i^{α_i − 1} I_{T_r}(y),
where α = Σ_{i=1}^r α_i and y_r = 1 − y_1 − ... − y_{r−1}.
SLIDE 6 Dirichlet conjugacy and moments
If π = (p(i), i ∈ I) has a Dirichlet distribution D(α_i, i ∈ I), then the a posteriori law is also Dirichlet: π|M ∼ D(α_i + M_i, i ∈ I).
Exercise: Prove conjugacy of the Dirichlet law using only the form of its joint moments
E ∏_{i∈I} p(i)^{r_i} = ∏_{i∈I} (α_i)_{r_i} / (α)_r,
where r = Σ_{i∈I} r_i and (a)_s = Γ(a+s)/Γ(a).
Note that in this case the moments uniquely determine the distribution.
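As a hedged numerical aside (not from the slides), the moment formula can be checked by Monte Carlo; the values of α and r below are arbitrary illustrative choices:

```python
import math
import numpy as np

rng = np.random.default_rng(1)

def poch(a, s):
    # (a)_s = Gamma(a+s)/Gamma(a), computed on the log scale for stability
    return math.exp(math.lgamma(a + s) - math.lgamma(a))

alpha = np.array([1.5, 2.0, 0.7, 3.0])   # illustrative Dirichlet weights
r = np.array([2, 1, 0, 3])               # illustrative moment exponents

# Monte Carlo estimate of E prod_i p(i)^{r_i} under D(alpha)
Y = rng.dirichlet(alpha, size=200_000)
mc = np.mean(np.prod(Y ** r, axis=1))

exact = np.prod([poch(a, s) for a, s in zip(alpha, r)]) / poch(alpha.sum(), r.sum())
print(mc, exact)  # the two numbers should agree to 2-3 digits
```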
SLIDE 7 Example
Let X = (X_1, X_2, X_3) assume values in I = {0, 1}^3. Obviously,
P(X = i) = P(X_1 = i_1) P(X_2 = i_2 | X_1 = i_1) P(X_3 = i_3 | X_1 = i_1, X_2 = i_2).
This is different from the Markov structure imposed by
P(X = i) = P(X_1 = i_1) P(X_2 = i_2 | X_1 = i_1) P(X_3 = i_3 | X_2 = i_2),
associated with the ordered graph 1 → 2 → 3, which is equivalent to
p(000)p(101) = p(100)p(001)   (1)
and
p(010)p(111) = p(110)p(011).   (2)
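An illustrative check (my addition, with arbitrary Dirichlet-sampled conditionals) that any joint built along 1 → 2 → 3 satisfies (1) and (2):

```python
import numpy as np

rng = np.random.default_rng(2)

# Arbitrary conditionals for the chain 1 -> 2 -> 3 on {0,1}^3.
p1 = rng.dirichlet([1, 1])            # P(X1 = .)
p2g1 = rng.dirichlet([1, 1], size=2)  # row k: P(X2 = . | X1 = k)
p3g2 = rng.dirichlet([1, 1], size=2)  # row k: P(X3 = . | X2 = k)

p = {(i1, i2, i3): p1[i1] * p2g1[i1][i2] * p3g2[i2][i3]
     for i1 in (0, 1) for i2 in (0, 1) for i3 in (0, 1)}

# Cross-product conditions (1) and (2), i.e. X1 independent of X3 given X2.
print(np.isclose(p[0,0,0] * p[1,0,1], p[1,0,0] * p[0,0,1]))  # (1): True
print(np.isclose(p[0,1,0] * p[1,1,1], p[1,1,0] * p[0,1,1]))  # (2): True
```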
SLIDE 8 Example, cont.
Conditions (1) and (2) are equivalent to each of the Markov structures imposed by the two other ordered graphs with skeleton 1 − 2 − 3:
1 ← 2 ← 3, i.e. P(X = i) = P(X_1 = i_1 | X_2 = i_2) P(X_2 = i_2 | X_3 = i_3) P(X_3 = i_3);
1 ← 2 → 3, i.e. P(X = i) = P(X_1 = i_1 | X_2 = i_2) P(X_2 = i_2) P(X_3 = i_3 | X_2 = i_2).
SLIDE 9 Example, cont.: prior on π
We seek a convenient prior on π, i.e. a probability measure on the (5-dimensional) manifold in [0, ∞)^8 described by the equations
x_1 + ... + x_8 = 1, x_1 x_2 = x_3 x_4, x_5 x_6 = x_7 x_8.
Some Dirichlet-like distribution would be fine!
SLIDE 10 Example, cont.: one more ordered graph
The graph 1 → 2 ← 3 introduces a different Markov structure:
P(X = i) = P(X_1 = i_1) P(X_2 = i_2 | X_1 = i_1, X_3 = i_3) P(X_3 = i_3).
Equivalently,
(p(000) + p(010))(p(101) + p(111)) = (p(100) + p(110))(p(001) + p(011)).
So we seek a probability measure on the (6-dimensional) manifold in [0, ∞)^8 defined by
x_1 + ... + x_8 = 1, (x_1 + x_5)(x_2 + x_6) = (x_3 + x_7)(x_4 + x_8).
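A companion sketch (again with arbitrary illustrative conditionals) checking that a joint built along 1 → 2 ← 3 satisfies the displayed sum constraint, which encodes marginal independence of X_1 and X_3:

```python
import numpy as np

rng = np.random.default_rng(3)

# Conditionals for the immoral DAG 1 -> 2 <- 3 (illustrative values).
p1 = rng.dirichlet([1, 1])                  # P(X1 = .)
p3 = rng.dirichlet([1, 1])                  # P(X3 = .)
p2g13 = rng.dirichlet([1, 1], size=(2, 2))  # P(X2 = . | X1, X3)

p = {(i1, i2, i3): p1[i1] * p2g13[i1][i3][i2] * p3[i3]
     for i1 in (0, 1) for i2 in (0, 1) for i3 in (0, 1)}

# The quadratic constraint of slide 10, i.e. X1 independent of X3.
lhs = (p[0,0,0] + p[0,1,0]) * (p[1,0,1] + p[1,1,1])
rhs = (p[1,0,0] + p[1,1,0]) * (p[0,0,1] + p[0,1,1])
print(np.isclose(lhs, rhs))  # True
```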
SLIDE 11 Plan
1. Introduction
2. Markovian structure imposed by DAGs
3. Local and global independence vs. HD law
SLIDE 12 DAG
For a graph G = (V, E) define a DAG (directed acyclic graph) with skeleton G by changing all unordered edges in E into arrows in an acyclic way. A DAG can be identified with a parent function p : V → 2^V defined by p(v) = {w ∈ V : w → v}, v ∈ V, and having the "acyclicity" property: ∀ k ≥ 1, {v} ∩ p^k(v) = ∅. We will also use another function, q : V → 2^V, defined by q(v) = {v} ∪ p(v), v ∈ V.
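A minimal sketch (not from the slides) of a DAG stored as its parent function, with the acyclicity property checked by iterating p setwise; the chain 1 → 2 → 3 is an illustrative choice:

```python
# A DAG as a parent function p : V -> 2^V (here the chain 1 -> 2 -> 3),
# plus q(v) = {v} ∪ p(v) and a check of the acyclicity property.
parents = {1: set(), 2: {1}, 3: {2}}

def q(v):
    return {v} | parents[v]

def is_acyclic(parents):
    # {v} ∩ p^k(v) must be empty for all k >= 1, where p acts on sets.
    for v in parents:
        reach = set(parents[v])
        for _ in range(len(parents)):
            if v in reach:
                return False
            reach |= {w for u in reach for w in parents[u]}
        if v in reach:
            return False
    return True

print(q(2), is_acyclic(parents))  # {1, 2} True
```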
SLIDE 13 p-Markov model
Let p be a DAG with a chordal skeleton G = (V, E). X (or π = (p(i), i ∈ I)) is called p-Markov iff
p(i) = P(X = i) = ∏_{v∈V} p^{v|p(v)}_{i_v|i_{p(v)}}, ∀ i ∈ I,
where p^{v|p(v)}_{i_v|i_{p(v)}} := P(X_v = i_v | X_{p(v)} = i_{p(v)}).
Note that
p^{v|p(v)}_{m|n} = p^{q(v)}_{(n,m)} / p^{p(v)}_n, m ∈ I_v, n ∈ I_{p(v)},
where p^A_n = Σ_{j∈I_{V\A}} p((j, n)) = P(X_A = n), n ∈ I_A, A ⊂ V.
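The factorization can be sketched in code as follows (my addition; the parent function is the chain again, and the conditional tables are illustrative Dirichlet draws):

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)

# Binary variables; a moral DAG given by its parent function.
parents = {1: (), 2: (1,), 3: (2,)}
V = sorted(parents)

# cond[v][n] = P(X_v = . | X_{p(v)} = n): one row per parent configuration n.
cond = {v: {n: rng.dirichlet([1, 1])
            for n in itertools.product((0, 1), repeat=len(parents[v]))}
        for v in V}

def joint(i):
    # p(i) = prod_v P(X_v = i_v | X_{p(v)} = i_{p(v)})
    x = dict(zip(V, i))
    return np.prod([cond[v][tuple(x[w] for w in parents[v])][x[v]] for v in V])

total = sum(joint(i) for i in itertools.product((0, 1), repeat=len(V)))
print(total)  # 1.0: the factorization defines a probability on I
```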
SLIDE 14 Moral DAGs
A DAG p with chordal skeleton G = (V, E) is moral if ∀ v ∈ V the subgraph induced in G by p(v) ⊂ V is complete. π is p′-Markov for a moral DAG p′ with a chordal skeleton G iff π is p-Markov with respect to any moral DAG p with the same skeleton G. The family of DAGs with skeleton 1 − 2 − 3 splits into the moral DAGs 1 → 2 → 3, 1 ← 2 ← 3, 1 ← 2 → 3 and the immoral DAG 1 → 2 ← 3.
SLIDE 15 Cliques and separators
Let G = (V, E) be a chordal graph. Any maximal induced complete subgraph is called a clique. Denote by C the set of cliques of G. A perfect ordering of cliques is a numbering C_1, ..., C_K of the elements of C such that
∀ j = 2, ..., K ∃ i < j : S_j := C_j ∩ (C_1 ∪ ... ∪ C_{j−1}) ⊂ C_i.
S = {(S_1 = ∅), S_j, j = 2, ..., K} is called the set of separators.
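A brute-force sketch (my addition) that finds perfect orderings and their separators for a small chordal graph; the chain 1 − 2 − 3 − 4 with cliques {1,2}, {2,3}, {3,4} is an illustrative example:

```python
from itertools import permutations

# Cliques of the chordal chain graph 1 - 2 - 3 - 4.
cliques = [{1, 2}, {2, 3}, {3, 4}]

def perfect_orderings(cliques):
    # o = (C1,...,CK) is perfect if each S_j = C_j ∩ (C_1 ∪ ... ∪ C_{j-1})
    # is contained in some earlier clique C_i, i < j.
    for o in permutations(cliques):
        seps, ok = [set()], True
        for j in range(1, len(o)):
            S = o[j] & set().union(*o[:j])
            if not any(S <= o[i] for i in range(j)):
                ok = False
                break
            seps.append(S)
        if ok:
            yield o, seps

for o, seps in perfect_orderings(cliques):
    print(o, seps)
# e.g. ({1,2}, {2,3}, {3,4}) with separators [set(), {2}, {3}]
```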
SLIDE 16 G-Markov model
For a chordal G = (V, E) we say that X (or π) is G-Markov if
p(i) = ∏_{C∈C} p^C_{i_C} / ∏_{S∈S} p^S_{i_S}, i ∈ I,
where p^A_{i_A} = P(X_A = i_A) and X_A = (X_v, v ∈ A), for A ⊂ V.
Equivalently, X (or π) is p-Markov for (any) moral DAG p with skeleton G, i.e.
p(i) = ∏_{v∈V} p^{v|p(v)}_{i_v|i_{p(v)}}, i ∈ I.
Equivalently, X_w ⊥ X_v | X_{V\{w,v}} whenever {w, v} ∉ E.
SLIDE 17 Dawid & Lauritzen, Ann. Statist. (1993)
Assume that π is G-Markov, where G = (V, E) is a chordal graph. We say that π has a hyper-Dirichlet distribution, HD(α^C_m, m ∈ I_C, C ∈ C), iff its moments are
E ∏_{i∈I} p(i)^{r_i} = ∏_{C∈C} ∏_{m∈I_C} (α^C_m)_{r^C_m} / ∏_{S∈S} ∏_{n∈I_S} (α^S_n)_{r^S_n},
where for S ∋ S ⊂ C ∈ C
α^S_n = Σ_{m∈I_{C\S}} α^C_{(m,n)}, n ∈ I_S, and r^A_m = Σ_{n∈I_{V\A}} r_{(m,n)}, m ∈ I_A.
SLIDE 18 HD distribution
Equivalently, for any moral DAG p (with skeleton G), in the decomposition
p(i) = ∏_{v∈V} p^{v|p(v)}_{i_v|i_{p(v)}}, i ∈ I,
the vectors of conditional probabilities (p^{v|p(v)}_{i_v|i_{p(v)}}, i_v ∈ I_v), i_{p(v)} ∈ I_{p(v)}, v ∈ V, are independent and have classical Dirichlet distributions D(α^{v|p(v)}_{i_v|i_{p(v)}}, i_v ∈ I_v).
Then ∀ C ∈ C and ∀ i_C ∈ I_C
α^C_{i_C} = α^{v|p(v)}_{i_v|i_{p(v)}} whenever C = {v} ∪ p(v) = q(v).
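A sketch of one draw of π along the moral DAG 1 → 2 → 3 (my addition; the clique hyperparameters aC1, aC2 are illustrative values chosen consistent on the separator, as the identification above requires):

```python
import numpy as np

rng = np.random.default_rng(5)

# HD hyperparameters on the cliques {1,2} and {2,3} of the chain 1 - 2 - 3.
# Consistency on the separator {2}: column sums of aC1 equal row sums of
# aC2, both (4, 4).
aC1 = np.array([[1., 2.], [3., 2.]])   # alpha^{C1}_{(i1, i2)}
aC2 = np.array([[1., 3.], [2., 2.]])   # alpha^{C2}_{(i2, i3)}

# Moral DAG 1 -> 2 -> 3: draw the conditional rows independently.
p1 = rng.dirichlet(aC1.sum(axis=1))                       # P(X1 = .)
p2g1 = np.array([rng.dirichlet(aC1[k]) for k in (0, 1)])  # rows ~ D(aC1[k,.])
p3g2 = np.array([rng.dirichlet(aC2[k]) for k in (0, 1)])  # rows ~ D(aC2[k,.])

pi = {(i1, i2, i3): p1[i1] * p2g1[i1, i2] * p3g2[i2, i3]
      for i1 in (0, 1) for i2 in (0, 1) for i3 in (0, 1)}
print(sum(pi.values()))  # 1.0; pi is one draw from the hyper-Dirichlet law
```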
SLIDE 19 Multinomial mixture
Let X_1, ..., X_m be observations on X and M = (Σ_{k=1}^m I(X_k = i), i ∈ I). The conditional law of M given π is a multinomial distribution with parameters m and π = (p(i), i ∈ I).
SLIDE 20 HD as a conjugate prior law
Th. If the a priori law of π is HD(α^C_m, m ∈ I_C, C ∈ C), then the posterior law of π|M is also hyper-Dirichlet, HD(α^C_m + M^C_m, m ∈ I_C, C ∈ C), where
M^C_m = Σ_{n∈I_{V\C}} M_{(m,n)}, m ∈ I_C.
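In code, the update is just an addition on clique margins; a sketch continuing the chain example (the count table M below is illustrative):

```python
import numpy as np

# Hyper-Dirichlet update for the chain 1 - 2 - 3: the posterior adds the
# clique-marginal counts M^C to the prior weights alpha^C.
aC1 = np.array([[1., 2.], [3., 2.]])   # prior alpha^{C1} on (i1, i2)
aC2 = np.array([[1., 3.], [2., 2.]])   # prior alpha^{C2} on (i2, i3)

# Illustrative count table M over I = {0,1}^3, axes (i1, i2, i3).
M = np.array([[[5, 1], [2, 7]],
              [[0, 4], [3, 8]]])

MC1 = M.sum(axis=2)   # M^{C1}_{(i1,i2)}: sum over i3
MC2 = M.sum(axis=0)   # M^{C2}_{(i2,i3)}: sum over i1

print(aC1 + MC1)      # posterior clique weights on {1,2}
print(aC2 + MC2)      # posterior clique weights on {2,3}
```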
SLIDE 21 Proof
The generalized Bayes rule reads
E[∏_{i∈I} p(i)^{r_i} | M = m] = E ∏_{i∈I} p(i)^{r_i + m_i} / E ∏_{i∈I} p(i)^{m_i}.
Apply the moment formula for the HD distribution in the numerator and denominator:
E[∏_{i∈I} p(i)^{r_i} | M = m] = ∏_{C∈C} ∏_{j∈I_C} (α^C_j)_{r^C_j + m^C_j} / (α^C_j)_{m^C_j} · ∏_{S∈S} ∏_{n∈I_S} (α^S_n)_{m^S_n} / (α^S_n)_{r^S_n + m^S_n},
where m^A_n = Σ_{j∈I_{V\A}} m_{(n,j)}, n ∈ I_A, A ⊂ V.
SLIDE 22 Proof, cont.
Since (a)_{b+c} / (a)_b = (a + b)_c, the last formula gives
E[∏_{i∈I} p(i)^{r_i} | M = m] = ∏_{C∈C} ∏_{j∈I_C} (α^C_j + m^C_j)_{r^C_j} / ∏_{S∈S} ∏_{n∈I_S} (α^S_n + m^S_n)_{r^S_n},
which is the moment formula of HD(α^C_j + m^C_j, j ∈ I_C, C ∈ C). ✷
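A quick numeric check (my aside) of the Pochhammer identity used in this step:

```python
import math

def poch(a, s):
    # (a)_s = Gamma(a+s)/Gamma(a)
    return math.exp(math.lgamma(a + s) - math.lgamma(a))

a, b, c = 2.5, 3.0, 4.0
print(poch(a, b + c) / poch(a, b), poch(a + b, c))  # equal values
```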
SLIDE 23 p-Dirichlet and P-Dirichlet distributions
Let p be a moral DAG with a chordal skeleton G = (V, E). A G-Markov random vector π has a p-Dirichlet law iff the random vectors (p^{v|p(v)}_{m|n}, m ∈ I_v), n ∈ I_{p(v)}, v ∈ V, have (classical) Dirichlet laws and are independent.
Let P be a family of moral DAGs with a chordal skeleton G = (V, E). We say that a G-Markov π has a P-Dirichlet distribution if it has a p-Dirichlet distribution ∀ p ∈ P.
SLIDE 24 HD as a special P-Dirichlet law
Let P be the family of all moral DAGs with the chordal skeleton G. If a G-Markov π has a P-Dirichlet distribution, then π has a HD distribution.
Question: Can we have a similar description of the HD law through a smaller family P?
SLIDE 25 p-perfect ordering of cliques
Let p be a moral DAG with a (chordal) skeleton G = (V, E). A perfect ordering of cliques o = (C_1, ..., C_K) is called p-perfect (notation: o_p) if
∀ ℓ = 1, ..., K ∃ v ∈ C_ℓ \ S_ℓ : S_ℓ = p(v).
Lemma. For any moral DAG p there exists a p-perfect ordering of cliques.
SLIDE 26 Pairing a separator with a clique
For S ∈ S and C ∈ C such that S ⊂ C we say that S and C are paired by a perfect ordering of cliques o = (C_1, ..., C_K) (notation: S →_o C) iff ∃ ℓ ∈ {1, ..., K} : S = S_ℓ and C = C_ℓ.
We say that a family P of moral DAGs (with a chordal skeleton G = (V, E)) is a pairing family if ∀ S ∈ S, C ∈ C such that S ⊂ C ∃ p ∈ P : S →_{o_p} C.
SLIDE 27 HD as a P-Dirichlet law
Th. (MW'16) Let P be a family of moral DAGs with a chordal skeleton G = (V, E). Assume that P is a pairing family with p(V) = S. If L is a P-Dirichlet law, then L is necessarily a hyper-Dirichlet distribution.
Of course, p(V) = {p(v), v ∈ V}.
SLIDE 28 Plan
1. Introduction
2. Markovian structure imposed by DAGs
3. Local and global independence vs. HD law
SLIDE 29 Independencies
If π (G-Markov wrt a chordal G = (V, E)) has a hyper-Dirichlet distribution, then for any moral DAG p with skeleton G:
the random vectors ((p^{v|p(v)}_{i|n}, i ∈ I_v), n ∈ I_{p(v)}), v ∈ V, are independent (global independence of parameters, notation: GI(p));
for an arbitrary fixed v ∈ V, the random vectors (p^{v|p(v)}_{i|n}, i ∈ I_v), n ∈ I_{p(v)}, are independent (local independence of parameters, notation: LI(p)).
SLIDE 30 Heckerman, Geiger and Chickering (1995); Geiger and Heckerman (1997)
Let G = (V, E) be a complete graph with d vertices. Any DAG (all such DAGs are moral) is uniquely determined by an ordering of the vertices: p = (v_1, ..., v_d) iff #p(v_j) = j − 1, j = 1, ..., d.
For a complete graph, they proved that (under some smoothness assumptions for densities) the independence conditions GI and LI wrt the DAGs p = (1, 2, 3, ..., d − 1, d) and p′ = (d, 1, 2, ..., d − 2, d − 1) imply that the distribution of π is necessarily classical Dirichlet, which is the hyper-Dirichlet law of a complete graph.
SLIDE 31 Separation and characterization of P-Dirichlet
We say that a family P of moral DAGs (with a chordal skeleton G = (V, E)) is a separating family if ∀ v ∈ V ∃ p, p′ ∈ P : p(v) ≠ p′(v).
Th. (MW'16) Let π be G-Markov, where G = (V, E) is chordal. Let P be a separating family of moral DAGs with skeleton G. If ∀ p ∈ P the independence conditions GI(p) and LI(p) hold, then π has a P-Dirichlet distribution.
SLIDE 32 Characterization of the hyper-Dirichlet law
Cor. 0 Let P be a pairing and separating family of moral DAGs (with a chordal skeleton G = (V, E)) satisfying p(V) = S. If ∀ p ∈ P the independence conditions GI(p) and LI(p) hold, then π has a hyper-Dirichlet distribution.
SLIDE 33 The case of a chain
For a chain G = 1 − 2 − ... − d consider two DAGs:
p = 1 → 2 → ... → d and p′ = 1 ← 2 ← ... ← d.
Cor. 1 If the random vectors (p^{j|j−1}_{ℓ|k}, ℓ ∈ I_j), k ∈ I_{j−1}, j = 1, ..., d (with I_0 = ∅), are jointly independent, and the random vectors (p^{j|j+1}_{ℓ|k}, ℓ ∈ I_j), k ∈ I_{j+1}, j = 1, ..., d (with I_{d+1} = ∅), are also jointly independent, then π has a hyper-Dirichlet distribution.
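The two families in Cor. 1 are the conditionals of the same joint read in the two directions; a sketch (my addition, with illustrative conditionals) recovering the reversed parameterization from a joint built along p:

```python
import numpy as np

rng = np.random.default_rng(6)

# Joint on {0,1}^3 built along p = 1 -> 2 -> 3 (illustrative conditionals).
p1 = rng.dirichlet([1, 1])
p2g1 = rng.dirichlet([1, 1], size=2)
p3g2 = rng.dirichlet([1, 1], size=2)
P = np.einsum('a,ab,bc->abc', p1, p2g1, p3g2)   # P[i1, i2, i3]

# Conditionals for the reversed DAG p' = 1 <- 2 <- 3 (second family in Cor. 1).
P12, P23 = P.sum(axis=2), P.sum(axis=0)
p1g2 = (P12 / P12.sum(axis=0)).T                # rows k: P(X1 = . | X2 = k)
p2g3 = (P23 / P23.sum(axis=0)).T                # rows k: P(X2 = . | X3 = k)
p3 = P23.sum(axis=0)                            # P(X3 = .)

# Both parameterizations recover the same joint law.
print(np.allclose(P, np.einsum('ba,cb,c->abc', p1g2, p2g3, p3)))  # True
```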
SLIDE 34 The case of a tree
Let T = (V, E) be a tree, i.e. a connected undirected graph without cycles. A vertex v ∈ V is a leaf if it has only one neighbour. Let L ⊂ V denote the set of leaves of the tree T. For a DAG p with skeleton T, a vertex v ∈ V is called a root if p(v) = ∅. If p is a moral DAG (with skeleton T), then the (unique) root vertex v determines the DAG uniquely (notation: p_v).
Cor. 2 Assume that π has the independence properties GI(p_v) and LI(p_v) ∀ v ∈ L. Then π has a hyper-Dirichlet distribution.
SLIDE 35 The case of a complete graph
Recall that any DAG with skeleton G being a complete graph (all such DAGs are moral) is uniquely determined by an ordering of the vertices: (v_1, ..., v_d) means that #p(v_j) = j − 1, j = 1, ..., d. For two such DAGs, p = (v_1, ..., v_d) and p′ = (v′_1, ..., v′_d), consider the condition
∀ j = 2, ..., d: p(v_j) ≠ p′(v′_j).   (3)
Cor. 3 Assume that π has the independence properties GI and LI wrt p and p′ satisfying (3). Then π has a classical Dirichlet distribution.
SLIDE 36 Heckerman, Geiger and Chickering (1995), revisited
For a complete graph, HGC assumed the independence conditions GI and LI wrt the DAGs p = (1, 2, 3, ..., d − 1, d) and p′ = (d, 1, 2, ..., d − 2, d − 1). Note that for j = 2, 3, ..., d we have d ∈ p′(v′_j) and d ∉ p(v_j), so that condition (3) is satisfied. Consequently, the HGC characterization is an immediate consequence of Cor. 3. Therefore it holds without the regularity assumptions made in HGC (1995).
SLIDE 37 Literature
[1] ANDERSSON, S.A., MADIGAN, D., PERLMAN, M.D. (1997) A characterization of Markov equivalence classes for acyclic digraphs. Ann. Statist. 25, 505-541.
[2] DAWID, A.P., LAURITZEN, S.L. (1993) Hyper-Markov laws in the statistical analysis of decomposable graphical models. Ann. Statist. 21, 1272-1317.
[3] GEIGER, D., HECKERMAN, D. (1997) A characterization of the Dirichlet distribution through global and local parameter independence. Ann. Statist. 25, 1344-1369.
[4] HECKERMAN, D., GEIGER, D., CHICKERING, D.M. (1995) Learning Bayesian networks: The combination of knowledge and statistical data. Mach. Learn. 20, 197-243.
[5] LAURITZEN, S.L. (1996) Graphical Models. Oxford Univ. Press.
[6] MASSAM, H., WESOŁOWSKI, J. (2016) A new prior for discrete DAG models with a restricted set of directions. Ann. Statist. 44, 1010-1037 (with Supplement).