Directed Graphical Models: Bayesian Networks
Probabilistic Graphical Models, Sharif University of Technology, Soleymani, Spring 2018

Basics
Multivariate distributions over a large number of variables; independence assumptions are useful
Independence and conditional independence relationships simplify the representation and reduce the complexity of inference
Bayesian networks enable us to incorporate domain knowledge and structure
Modular combination of heterogeneous parts
Combining data and knowledge (the Bayesian philosophy)
Conditional and marginal independence
X and Y are conditionally independent given Z (written X ⊥ Y | Z) if:
P(X, Y | Z) = P(X | Z) P(Y | Z)
i.e., ∀x ∈ Val(X), y ∈ Val(Y), z ∈ Val(Z): P(X = x, Y = y | Z = z) = P(X = x | Z = z) P(Y = y | Z = z)
Equivalently: P(X | Y, Z) = P(X | Z) and P(Y | X, Z) = P(Y | Z)
X and Y are marginally independent (written X ⊥ Y, i.e., X ⊥ Y | ∅) if:
P(X, Y) = P(X) P(Y)
Equivalently: P(X | Y) = P(X) and P(Y | X) = P(Y)
Bayesian network definition
Qualitative specification by a Directed Acyclic Graph (DAG)
Each node denotes a random variable; edges denote dependencies
X → Y shows a "direct influence" of X on Y (X is a parent of Y)
Quantitative specification by CPDs
The CPD for each node X_i defines P(X_i | Pa(X_i))
A Bayesian network represents a joint distribution over the variables (via the DAG and CPDs) compactly, in a factorized way:
P(X_1, …, X_n) = ∏_{i=1}^{n} P(X_i | Pa(X_i))
Burglary example
John does not perceive burglaries directly; John does not perceive minor earthquakes
Burglary example
Bayesian networks define a joint distribution (over the variables) in terms of the graph structure and the conditional probability distributions:
P(B, E, A, J, M) = P(B) P(E) P(A | B, E) P(J | A) P(M | A)
(B: Burglary, E: Earthquake, A: Alarm, J: JohnCalls, M: MaryCalls)
Burglary example: DAG + CPTs
CPDs as the quantitative specification
[Figure: the burglary DAG with CPTs P(A = t | B, E), P(J = t | A), and P(M = t | A)]
Burglary example: full joint probability
P(J, M, A, B, E) = P(J | A) P(M | A) P(A | B, E) P(B) P(E)
P(J = t, M = t, A = t, B = f, E = f)
= P(J = t | A = t) P(M = t | A = t) P(A = t | B = f, E = f) P(B = f) P(E = f)
= 0.9 × 0.7 × 0.001 × 0.999 × 0.998 ≈ 0.000628
Shorthands: J = t means JohnCalls = True, B = f means Burglary = False, etc.
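The multiplication above can be checked directly, a minimal sketch using only the CPT entries quoted on this slide (t = True, f = False):

```python
# Factorized joint for the burglary network: the probability of one full
# assignment is the product of the local CPD entries.
p_B_f = 0.999           # P(Burglary = f)
p_E_f = 0.998           # P(Earthquake = f)
p_A_t_given_ff = 0.001  # P(Alarm = t | B = f, E = f)
p_J_t_given_A_t = 0.9   # P(JohnCalls = t | A = t)
p_M_t_given_A_t = 0.7   # P(MaryCalls = t | A = t)

# P(J=t, M=t, A=t, B=f, E=f)
joint = p_J_t_given_A_t * p_M_t_given_A_t * p_A_t_given_ff * p_B_f * p_E_f
print(round(joint, 6))  # 0.000628
```

No table over all 2^5 assignments is ever built; five local numbers suffice.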
Burglary example: inference
Conditional probability distribution:
P(B = t | J = t, M = f) = P(J = t, M = f, B = t) / P(J = t, M = f)
= Σ_A Σ_E P(J = t, M = f, A, B = t, E) / Σ_B Σ_A Σ_E P(J = t, M = f, A, B, E)
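The sums above can be carried out by brute-force enumeration. A sketch follows; the CPT entries not quoted in these slides (e.g. P(A = t | B = t, E = t) = 0.95) are the standard textbook values for this example and are assumed here:

```python
from itertools import product

# Inference by enumeration in the burglary network.
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A=t | B, E), assumed
P_J = {True: 0.90, False: 0.05}                      # P(J=t | A)
P_M = {True: 0.70, False: 0.01}                      # P(M=t | A), assumed

def joint(j, m, a, b, e):
    """P(J=j, M=m, A=a, B=b, E=e) from the factorization."""
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    pj = P_J[a] if j else 1 - P_J[a]
    pm = P_M[a] if m else 1 - P_M[a]
    return pj * pm * pa * P_B[b] * P_E[e]

# P(B = t | J = t, M = f): sum out A, E in the numerator; A, B, E below.
num = sum(joint(True, False, a, True, e)
          for a, e in product([True, False], repeat=2))
den = sum(joint(True, False, a, b, e)
          for a, b, e in product([True, False], repeat=3))
print(num / den)
```

The posterior comes out well under one percent: a call from John alone, with Mary silent, is weak evidence of a burglary.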
Student example
Variables: Difficulty (D), Intelligence (I), Grade (G), SAT (S), Letter (L)
P(D = t) = 0.65    P(I = t) = 0.55
P(G | I, D):
  I  D   G=1   G=2   G=3
  f  f   0.3   0.4   0.3
  f  t   0.05  0.25  0.7
  t  f   0.9   0.08  0.02
  t  t   0.5   0.3   0.2
P(S = 1 | I):  I = f: 0.1,  I = t: 0.7
P(L = t | G):  G = 1: 0.9,  G = 2: 0.5,  G = 3: 0.05
Continuous variables example
Linear Gaussian:
X ~ N(0, 1)
Y | X ~ N(b + X, σ), with b = 0.5, σ = 0.1
[Figure: the density p(y | x)]
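A small simulation of this linear Gaussian CPD, assuming the second argument of N(·, ·) is the standard deviation (the slide does not say which convention it uses):

```python
import random

# X ~ N(0, 1); Y | X ~ N(b + X, sigma) with b = 0.5, sigma = 0.1.
# Marginally, Y is Gaussian with mean b and variance 1 + sigma^2.
random.seed(0)
b, sigma = 0.5, 0.1
xs = [random.gauss(0.0, 1.0) for _ in range(100_000)]
ys = [random.gauss(b + x, sigma) for x in xs]

mean_y = sum(ys) / len(ys)   # should be close to b = 0.5
print(round(mean_y, 2))
```

This illustrates the general fact that a linear Gaussian child of Gaussian parents is itself Gaussian, which is why such networks stay tractable.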
Missing edges
The joint distribution can always be represented by the chain rule:
P(X_1, …, X_n) = P(X_1) ∏_{i=2}^{n} P(X_i | X_1, …, X_{i−1})
This is equivalent to a graph in which all of X_1, …, X_{i−1} are parents of X_i
Missing edges imply conditional independencies: if we use a DAG that is not complete, then for every link we remove, some of the conditioning variables are dropped from the corresponding CPD
Compact representation
A CPT for a Boolean variable with k Boolean parents requires:
2^k rows, one for each combination of parent values
k = 0: a single row giving the prior probability
If each of the n variables has no more than k parents:
The full joint distribution requires 2^n − 1 numbers, while the Bayesian network requires at most n × 2^k numbers (linear in n)
⇒ Exponential reduction in the number of parameters
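The two counts above are easy to compare concretely, a quick sketch for n = 30 Boolean variables with at most k = 5 parents each:

```python
# Parameter counts for n Boolean variables, each with at most k parents.
def full_joint_params(n):
    # One number per assignment, minus one for normalization.
    return 2 ** n - 1

def bn_params(n, k):
    # At most 2^k rows per CPT, one CPT per variable.
    return n * 2 ** k

n, k = 30, 5
print(full_joint_params(n))  # 1073741823
print(bn_params(n, k))       # 960
```

Over a billion numbers collapse to under a thousand, which is the exponential reduction the slide refers to.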
Bayesian network semantics
Local independencies:
Each node is conditionally independent of its non-descendants given its parents:
X_i ⊥ NonDescendants(X_i) | Pa(X_i)
Are the local independencies all of the conditional independencies implied by a BN?
Factorization & independence
Let G be a graph over X_1, …, X_n. A distribution P factorizes over G if:
P(X_1, …, X_n) = ∏_{i=1}^{n} P(X_i | Pa(X_i))
Factorization ⇒ Independence
If P factorizes over G, then every variable in P is independent of its non-descendants given its parents (in the graph G)
Factorization according to G implies the associated conditional independencies
Independence ⇒ Factorization
If every variable in the distribution P is independent of its non-descendants given its parents (in the graph G), then P factorizes over G
The conditional independencies imply a factorization of the joint distribution (into a product of simpler terms)
Independence ⇒ factorization
Consider the chain rule:
P(X_1, …, X_n) = ∏_{i=1}^{n} P(X_i | X_1, …, X_{i−1})
We can simplify it through the conditional independence assumptions
Using X_i ⊥ NonDescendants(X_i) | Pa(X_i), we can show that
P(X_i | X_1, X_2, …, X_{i−1}) = P(X_i | Pa(X_i))
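The step can be written out in two lines. For variables listed in a topological order of the DAG, Pa(X_i) ⊆ {X_1, …, X_{i−1}} and the remaining predecessors are non-descendants of X_i, so:

```latex
\begin{align*}
P(X_1,\dots,X_n) &= \prod_{i=1}^{n} P(X_i \mid X_1,\dots,X_{i-1})
  && \text{(chain rule)}\\
&= \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}(X_i))
  && \text{(by } X_i \perp \mathrm{NonDescendants}(X_i) \mid \mathrm{Pa}(X_i)\text{)}
\end{align*}
```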
Equivalence Theorem
For a graph G:
- Let D1 denote the family of all distributions that satisfy the conditional independencies of G
- Let D2 denote the family of all distributions that factorize according to G
- Then D1 ≡ D2
Other independencies
Are there other independencies that hold for every distribution P that factorizes over G?
According to a graphical criterion called d-separation, we can read independencies off the graph
If P factorizes over G, can we read these independencies from the structure of G?
Basic structures
Head-to-tail X → Z → Y and tail-to-tail X ← Z → Y: X ⊥ Y | Z
Head-to-head (v-structure) X → Z ← Y: X ⊥ Y marginally, but not given Z (explaining away)
Explaining away
When we condition on Z, are X and Y independent? X and Y are marginally independent, but given Z they are conditionally dependent
This is called explaining away (e.g., the two-coins example)
For the v-structure X → Z ← Y: P(X, Y, Z) = P(X) P(Y) P(Z | X, Y)
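The two-coins example can be computed exactly. One concrete version (an assumption, since the slide does not spell it out): X and Y are independent fair coin flips, and Z = 1 when they land the same way, giving the v-structure X → Z ← Y:

```python
from itertools import product
from fractions import Fraction

# Exact probabilities over the four equally likely worlds (x, y).
half = Fraction(1, 2)
worlds = {(x, y): half * half for x, y in product('ht', repeat=2)}

def prob(pred):
    return sum(p for w, p in worlds.items() if pred(w))

same = lambda w: w[0] == w[1]    # the event Z = 1
xh = lambda w: w[0] == 'h'
yh = lambda w: w[1] == 'h'

# Marginal independence: P(X=h | Y=h) equals P(X=h).
p_x = prob(xh)
p_x_given_y = prob(lambda w: xh(w) and yh(w)) / prob(yh)
print(p_x, p_x_given_y)          # 1/2 1/2

# Conditioning on Z couples them: given Z=1, observing Y determines X.
p_x_given_z = prob(lambda w: xh(w) and same(w)) / prob(same)
p_x_given_zy = (prob(lambda w: xh(w) and yh(w) and same(w))
                / prob(lambda w: yh(w) and same(w)))
print(p_x_given_z, p_x_given_zy)  # 1/2 1
```

Observing Y "explains" the value of Z, so X jumps from 1/2 to certainty: that jump is the dependence created by conditioning on the common child.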
D-separation
Let A, B, C denote three disjoint sets of nodes. If A is d-separated from B by C, then A ⊥ B | C
A is d-separated from B by C if all undirected paths between A and B are blocked by C
Undirected path blocking
A path is blocked at a node Z if:
- the path is head-to-tail at Z (X → Z → Y) and Z ∈ C, or
- the path is tail-to-tail at Z (X ← Z → Y) and Z ∈ C, or
- the path is head-to-head at Z (a v-structure, X → Z ← Y), Z ∉ C, and none of Z's descendants is in C
Undirected path blocking
A ⊥ B | C if, in every trail (undirected path) between A and B:
- some node in the path is in C and the path does not meet head-to-head at that node, or
- the path contains a head-to-head node, and neither that node nor any of its descendants is in C
D-separation: active trail view
Definition: X and Y are d-separated in G given Z if there is no active trail in G between X and Y given Z
A trail between X and Y is active if:
- for every v-structure node V in the trail X … → V ← … Y, either V or one of its descendants is in Z, and
- no other node in the trail is in Z
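The blocking rules above are mechanical enough to code. A small sketch that enumerates undirected paths and applies them, run on the student network from the earlier slide (D → G ← I, I → S, G → L); node and function names are ours:

```python
# Graph as a dict: node -> list of parents.
parents = {'D': [], 'I': [], 'G': ['D', 'I'], 'S': ['I'], 'L': ['G']}

def children(g, v):
    return [u for u, ps in g.items() if v in ps]

def descendants(g, v):
    out, stack = set(), [v]
    while stack:
        for c in children(g, stack.pop()):
            if c not in out:
                out.add(c)
                stack.append(c)
    return out

def all_paths(g, x, y, path=None):
    """All simple undirected paths between x and y."""
    path = path or [x]
    if x == y:
        yield path
        return
    for nxt in set(g[x]) | set(children(g, x)):
        if nxt not in path:
            yield from all_paths(g, nxt, y, path + [nxt])

def blocked(g, path, z):
    """Is this path blocked by the conditioning set z?"""
    for a, b, c in zip(path, path[1:], path[2:]):
        head_to_head = b in children(g, a) and b in children(g, c)
        if head_to_head:
            if b not in z and not (descendants(g, b) & z):
                return True   # v-structure, nothing observed at or below it
        elif b in z:
            return True       # chain or fork with the middle node observed
    return False

def d_separated(g, x, y, z):
    return all(blocked(g, p, set(z)) for p in all_paths(g, x, y))

print(d_separated(parents, 'S', 'D', set()))   # True: v-structure at G blocks
print(d_separated(parents, 'S', 'D', {'G'}))   # False: observing G activates it
print(d_separated(parents, 'S', 'D', {'L'}))   # False: a descendant of G is observed
print(d_separated(parents, 'D', 'L', {'G'}))   # True: chain blocked at G
```

Enumerating all simple paths is exponential in general; a production implementation would use the linear-time Bayes-ball / reachability algorithm instead.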
D-separation: example
[Figure: the student network (Intelligence, Difficulty, Grade, Letter, Rank) with d-separation queries such as: R ⊥ G | I?  R ⊥ D | I?  R ⊥ D | G?  R ⊥ D | L?  R ⊥ L | G?  D ⊥ L | G?]
Markov Blanket in Bayesian Networks
A variable is conditionally independent of all other variables given its Markov blanket
The Markov blanket of a node consists of:
- its parents
- its children
- the co-parents of its children (the children's other parents)
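The three-part definition translates into a few lines of code, sketched here on the burglary network (B → A ← E, A → J, A → M):

```python
# Markov blanket = parents ∪ children ∪ co-parents of children.
parents = {'B': [], 'E': [], 'A': ['B', 'E'], 'J': ['A'], 'M': ['A']}

def markov_blanket(g, v):
    kids = {u for u, ps in g.items() if v in ps}
    co_parents = {p for u in kids for p in g[u]} - {v}
    return set(g[v]) | kids | co_parents

print(sorted(markov_blanket(parents, 'A')))  # ['B', 'E', 'J', 'M']
print(sorted(markov_blanket(parents, 'B')))  # ['A', 'E']
```

Note that E lands in the blanket of B even though there is no edge between them: it is a co-parent of B's child A, which is exactly the explaining-away dependence.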
D-Separation: soundness & completeness
Soundness: any conditional independence property that we derive from G must hold in every probability distribution that factorizes over G
Theorem: if P factorizes over G and d-sep_G(X, Y | Z), then P satisfies X ⊥ Y | Z
Weak completeness:
For almost all distributions P that factorize over G, if X ⊥ Y | Z holds in P, then X and Y are d-separated given Z in the graph G
There can be independencies in P that are not captured by the conditional independence properties of G
I-equivalence
Definition: two graphs G1 and G2 over a set of variables are I-equivalent if I(G1) = I(G2)
Most graphs have many I-equivalent variants
[Figure: three I-equivalent graphs over A, B, C, e.g. A → B → C, A ← B ← C, A ← B → C]
I-equivalence
If G1 and G2 have the same skeleton and the same set of immoralities (v-structures without a direct edge between the parents), then they are I-equivalent
[Figure: an immorality A → B ← C]
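This criterion is easy to check programmatically. A sketch over the three-node graphs from the previous slide (graphs as parent dicts, helper names ours):

```python
from itertools import combinations

def skeleton(g):
    """Set of undirected edges."""
    return {frozenset((u, p)) for u, ps in g.items() for p in ps}

def immoralities(g):
    """V-structures a -> child <- b with no edge between a and b."""
    out = set()
    for child, ps in g.items():
        for a, b in combinations(sorted(ps), 2):
            if frozenset((a, b)) not in skeleton(g):
                out.add((a, child, b))
    return out

def same_skeleton_and_immoralities(g1, g2):
    return skeleton(g1) == skeleton(g2) and immoralities(g1) == immoralities(g2)

chain_  = {'A': [], 'B': ['A'], 'C': ['B']}    # A -> B -> C
fork    = {'B': [], 'A': ['B'], 'C': ['B']}    # A <- B -> C
collide = {'A': [], 'C': [], 'B': ['A', 'C']}  # A -> B <- C (an immorality)

print(same_skeleton_and_immoralities(chain_, fork))     # True
print(same_skeleton_and_immoralities(chain_, collide))  # False
```

The chain and the fork pass the criterion (so they are I-equivalent), while the v-structure fails it, matching the basic-structures slide.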
I-map
I(G) = {(X ⊥ Y | Z) : d-sep_G(X, Y | Z)}
Definition: if P satisfies I(G), we say that G is an I-map (independency map) of P
Equivalently, I(G) ⊆ I(P), where I(P) = {(X ⊥ Y | Z) : P ⊨ (X ⊥ Y | Z)}
I(P): all conditional independence relations satisfied by P
Minimal I-map
When more independence relations are encoded in the graph:
⇒ sparser representation (fewer parameters)
⇒ a more informative and intuitive representation
We want a graph that captures as much of the structure (conditional independence relations) of P as possible
G is a minimal I-map for P if it is an I-map for P and the removal of any single edge from G renders it no longer an I-map
A minimal I-map may still fail to capture I(P)
Minimal I-map
The fact that G is a minimal I-map for P is far from a guarantee that G captures the independence structure of P
[Figure: a perfect map of a distribution P alongside two minimal I-maps of P; from the Koller and Friedman book]
Perfect map
G is a D-map for a distribution P if every conditional independence relation satisfied by the distribution is also in I(G)
G is a perfect map (P-map) for a distribution P if it is both an I-map and a D-map for that distribution
Theorem: not every distribution has a perfect map as a DAG
Example: a distribution P with the independencies I(P) = {A ⊥ C | {B, D}, B ⊥ D | {A, C}} cannot be perfectly represented by any Bayesian network
[Figure: two candidate DAGs over A, B, C, D, neither of which is a P-map]
Perfect map
A perfect map of a distribution is ideal, but it does not exist for many distributions
A distribution P can have many P-map graphs (all of them are I-equivalent)
A minimal I-map G for a distribution P is far from guaranteed to contain all of the independencies in P
Bayesian networks: summary
A Bayesian network is a pair (G, CPDs) where G is a DAG and the CPDs define a joint distribution P that factorizes over G
Each CPD is the conditional distribution P(X_i | Pa(X_i)) associated with the graph node X_i
We can express "causality", "generative schemes", "asymmetric influences", etc., between variables via a Bayesian network
We can read local and global independence relations off the graph structure via the d-separation criterion
Reference
Koller and Friedman, Probabilistic Graphical Models: Principles and Techniques, MIT Press.