Directed Graphical Models: Bayesian Networks
Probabilistic Graphical Models, Sharif University of Technology, Soleymani, Spring 2018

Basics
Multivariate distributions over a large number of variables; independence assumptions are useful
Independence and conditional independence relationships simplify the representation and reduce the complexity of inference
Bayesian networks enable us to incorporate domain knowledge and structure
Modular combination of heterogeneous parts
Combining data and knowledge (the Bayesian philosophy)
Conditional and marginal independence
X and Y are conditionally independent given Z (written X ⊥ Y | Z) if:
P(X, Y | Z) = P(X | Z) P(Y | Z)
i.e., ∀x ∈ Val(X), y ∈ Val(Y), z ∈ Val(Z): P(X = x, Y = y | Z = z) = P(X = x | Z = z) P(Y = y | Z = z)
Equivalently: P(X | Y, Z) = P(X | Z) and P(Y | X, Z) = P(Y | Z)
X and Y are marginally independent (written X ⊥ Y, i.e., X ⊥ Y | ∅) if:
P(X, Y) = P(X) P(Y)
Equivalently: P(X | Y) = P(X) and P(Y | X) = P(Y)
Bayesian network definition
Qualitative specification by a Directed Acyclic Graph (DAG)
Each node denotes a random variable; edges denote dependencies
X → Y shows a "direct influence" of X on Y (X is a parent of Y)
Quantitative specification by CPDs
The CPD for each node X_i defines P(X_i | Pa(X_i))
A Bayesian network represents a joint distribution over the variables (via the DAG and CPDs) compactly, in a factorized way:
P(X_1, …, X_n) = ∏_{i=1}^{n} P(X_i | Pa(X_i))
Burglary example
John does not perceive burglaries directly; John does not perceive minor earthquakes
Burglary example
Bayesian networks define a joint distribution (over the variables) in terms of the graph structure and the conditional probability distributions:
P(B, E, A, J, M) = P(B) P(E) P(A | B, E) P(J | A) P(M | A)
(B: Burglary, E: Earthquake, A: Alarm, J: JohnCalls, M: MaryCalls)
Burglary example: DAG + CPTs
CPDs as the quantitative specification
[Figure: the burglary DAG with CPTs P(A = t | B, E), P(J = t | A), and P(M = t | A)]
Burglary example: full joint probability
P(J, M, A, B, E) = P(J | A) P(M | A) P(A | B, E) P(B) P(E)
P(J = t, M = t, A = t, B = f, E = f)
= P(J = t | A = t) P(M = t | A = t) P(A = t | B = f, E = f) P(B = f) P(E = f)
= 0.9 × 0.7 × 0.001 × 0.999 × 0.998 ≈ 0.000628
Shorthands: J = t means JohnCalls = True, B = f means Burglary = False, etc.
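The multiplication above can be checked directly, a minimal sketch using only the CPT entries quoted on this slide (t = True, f = False):

```python
# Factorized joint for the burglary network: the probability of one full
# assignment is the product of the local CPD entries.
p_B_f = 0.999           # P(Burglary = f)
p_E_f = 0.998           # P(Earthquake = f)
p_A_t_given_ff = 0.001  # P(Alarm = t | B = f, E = f)
p_J_t_given_A_t = 0.9   # P(JohnCalls = t | A = t)
p_M_t_given_A_t = 0.7   # P(MaryCalls = t | A = t)

# P(J=t, M=t, A=t, B=f, E=f)
joint = p_J_t_given_A_t * p_M_t_given_A_t * p_A_t_given_ff * p_B_f * p_E_f
print(round(joint, 6))  # 0.000628
```

No table over all 2^5 assignments is ever built; five local numbers suffice.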
Burglary example: inference
Conditional probability distribution:
P(B = t | J = t, M = f) = P(J = t, M = f, B = t) / P(J = t, M = f)
= Σ_A Σ_E P(J = t, M = f, A, B = t, E) / Σ_B Σ_A Σ_E P(J = t, M = f, A, B, E)
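The sums above can be carried out by brute-force enumeration. A sketch follows; the CPT entries not quoted in these slides (e.g. P(A = t | B = t, E = t) = 0.95) are the standard textbook values for this example and are assumed here:

```python
from itertools import product

# Inference by enumeration in the burglary network.
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A=t | B, E), assumed
P_J = {True: 0.90, False: 0.05}                      # P(J=t | A)
P_M = {True: 0.70, False: 0.01}                      # P(M=t | A), assumed

def joint(j, m, a, b, e):
    """P(J=j, M=m, A=a, B=b, E=e) from the factorization."""
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    pj = P_J[a] if j else 1 - P_J[a]
    pm = P_M[a] if m else 1 - P_M[a]
    return pj * pm * pa * P_B[b] * P_E[e]

# P(B = t | J = t, M = f): sum out A, E in the numerator; A, B, E below.
num = sum(joint(True, False, a, True, e)
          for a, e in product([True, False], repeat=2))
den = sum(joint(True, False, a, b, e)
          for a, b, e in product([True, False], repeat=3))
print(num / den)
```

The posterior comes out well under one percent: a call from John alone, with Mary silent, is weak evidence of a burglary.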
Student example
Variables: Difficulty (D), Intelligence (I), Grade (G), SAT (S), Letter (L)
P(D = t) = 0.65    P(I = t) = 0.55
P(G | I, D):
  I  D   G=1   G=2   G=3
  f  f   0.3   0.4   0.3
  f  t   0.05  0.25  0.7
  t  f   0.9   0.08  0.02
  t  t   0.5   0.3   0.2
P(S = 1 | I):  I = f: 0.1,  I = t: 0.7
P(L = t | G):  G = 1: 0.9,  G = 2: 0.5,  G = 3: 0.05
Continuous variables example
Linear Gaussian:
X ~ N(0, 1)
Y | X ~ N(b + X, σ), with b = 0.5, σ = 0.1
[Figure: the density p(y | x)]
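A small simulation of this linear Gaussian CPD, assuming the second argument of N(·, ·) is the standard deviation (the slide does not say which convention it uses):

```python
import random

# X ~ N(0, 1); Y | X ~ N(b + X, sigma) with b = 0.5, sigma = 0.1.
# Marginally, Y is Gaussian with mean b and variance 1 + sigma^2.
random.seed(0)
b, sigma = 0.5, 0.1
xs = [random.gauss(0.0, 1.0) for _ in range(100_000)]
ys = [random.gauss(b + x, sigma) for x in xs]

mean_y = sum(ys) / len(ys)   # should be close to b = 0.5
print(round(mean_y, 2))
```

This illustrates the general fact that a linear Gaussian child of Gaussian parents is itself Gaussian, which is why such networks stay tractable.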
Missing edges
The joint distribution can always be represented by the chain rule:
P(X_1, …, X_n) = P(X_1) ∏_{i=2}^{n} P(X_i | X_1, …, X_{i−1})
This is equivalent to a graph in which all of X_1, …, X_{i−1} are parents of X_i
Missing edges imply conditional independencies: if we use a DAG that is not complete, then for every link we remove, some of the conditioning variables are dropped from the corresponding CPD
Compact representation
A CPT for a Boolean variable with k Boolean parents requires:
2^k rows, one for each combination of parent values
k = 0: a single row giving the prior probability
If each of the n variables has no more than k parents:
The full joint distribution requires 2^n − 1 numbers, while the Bayesian network requires at most n × 2^k numbers (linear in n)
⇒ Exponential reduction in the number of parameters
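The two counts above are easy to compare concretely, a quick sketch for n = 30 Boolean variables with at most k = 5 parents each:

```python
# Parameter counts for n Boolean variables, each with at most k parents.
def full_joint_params(n):
    # One number per assignment, minus one for normalization.
    return 2 ** n - 1

def bn_params(n, k):
    # At most 2^k rows per CPT, one CPT per variable.
    return n * 2 ** k

n, k = 30, 5
print(full_joint_params(n))  # 1073741823
print(bn_params(n, k))       # 960
```

Over a billion numbers collapse to under a thousand, which is the exponential reduction the slide refers to.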
Bayesian network semantics
Local independencies:
Each node is conditionally independent of its non-descendants given its parents:
X_i ⊥ NonDescendants(X_i) | Pa(X_i)
Are the local independencies all of the conditional independencies implied by a BN?
Factorization & independence
Let G be a graph over X_1, …, X_n. A distribution P factorizes over G if:
P(X_1, …, X_n) = ∏_{i=1}^{n} P(X_i | Pa(X_i))
Factorization ⇒ Independence
If P factorizes over G, then every variable in P is independent of its non-descendants given its parents (in the graph G)
Factorization according to G implies the associated conditional independencies
Independence ⇒ Factorization
If every variable in the distribution P is independent of its non-descendants given its parents (in the graph G), then P factorizes over G
The conditional independencies imply a factorization of the joint distribution (into a product of simpler terms)
Independence ⇒ factorization
Consider the chain rule:
P(X_1, …, X_n) = ∏_{i=1}^{n} P(X_i | X_1, …, X_{i−1})
We can simplify it through the conditional independence assumptions
Using X_i ⊥ NonDescendants(X_i) | Pa(X_i), we can show that
P(X_i | X_1, X_2, …, X_{i−1}) = P(X_i | Pa(X_i))
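The step can be written out in two lines. For variables listed in a topological order of the DAG, Pa(X_i) ⊆ {X_1, …, X_{i−1}} and the remaining predecessors are non-descendants of X_i, so:

```latex
\begin{align*}
P(X_1,\dots,X_n) &= \prod_{i=1}^{n} P(X_i \mid X_1,\dots,X_{i-1})
  && \text{(chain rule)}\\
&= \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}(X_i))
  && \text{(by } X_i \perp \mathrm{NonDescendants}(X_i) \mid \mathrm{Pa}(X_i)\text{)}
\end{align*}
```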
Equivalence Theorem
For a graph G:
- Let D1 denote the family of all distributions that satisfy the conditional independencies of G
- Let D2 denote the family of all distributions that factorize according to G
- Then D1 ≡ D2
Other independencies
Are there other independencies that hold for every distribution P that factorizes over G?
According to a graphical criterion called d-separation, we can read independencies off the graph
If P factorizes over G, can we read these independencies from the structure of G?
Basic structures
Head-to-tail X → Z → Y and tail-to-tail X ← Z → Y: X ⊥ Y | Z
Head-to-head (v-structure) X → Z ← Y: X ⊥ Y marginally, but not given Z (explaining away)
Explaining away
When we condition on Z, are X and Y independent? X and Y are marginally independent, but given Z they are conditionally dependent
This is called explaining away (e.g., the two-coins example)
For the v-structure X → Z ← Y: P(X, Y, Z) = P(X) P(Y) P(Z | X, Y)
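The two-coins example can be computed exactly. One concrete version (an assumption, since the slide does not spell it out): X and Y are independent fair coin flips, and Z = 1 when they land the same way, giving the v-structure X → Z ← Y:

```python
from itertools import product
from fractions import Fraction

# Exact probabilities over the four equally likely worlds (x, y).
half = Fraction(1, 2)
worlds = {(x, y): half * half for x, y in product('ht', repeat=2)}

def prob(pred):
    return sum(p for w, p in worlds.items() if pred(w))

same = lambda w: w[0] == w[1]    # the event Z = 1
xh = lambda w: w[0] == 'h'
yh = lambda w: w[1] == 'h'

# Marginal independence: P(X=h | Y=h) equals P(X=h).
p_x = prob(xh)
p_x_given_y = prob(lambda w: xh(w) and yh(w)) / prob(yh)
print(p_x, p_x_given_y)          # 1/2 1/2

# Conditioning on Z couples them: given Z=1, observing Y determines X.
p_x_given_z = prob(lambda w: xh(w) and same(w)) / prob(same)
p_x_given_zy = (prob(lambda w: xh(w) and yh(w) and same(w))
                / prob(lambda w: yh(w) and same(w)))
print(p_x_given_z, p_x_given_zy)  # 1/2 1
```

Observing Y "explains" the value of Z, so X jumps from 1/2 to certainty: that jump is the dependence created by conditioning on the common child.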
D-separation
Let A, B, C denote three disjoint sets of nodes. If A is d-separated from B by C, then A ⊥ B | C
A is d-separated from B by C if all undirected paths between A and B are blocked by C
Undirected path blocking
A path is blocked at a node Z if:
- the path is head-to-tail at Z (X → Z → Y) and Z ∈ C, or
- the path is tail-to-tail at Z (X ← Z → Y) and Z ∈ C, or
- the path is head-to-head at Z (a v-structure, X → Z ← Y), Z ∉ C, and none of Z's descendants is in C
Undirected path blocking
A ⊥ B | C if, in every trail (undirected path) between A and B:
- some node in the path is in C and the path does not meet head-to-head at that node, or
- the path contains a head-to-head node, and neither that node nor any of its descendants is in C
D-separation: active trail view
Definition: X and Y are d-separated in G given Z if there is no active trail in G between X and Y given Z
A trail between X and Y is active if:
- for every v-structure node V in the trail X … → V ← … Y, either V or one of its descendants is in Z, and
- no other node in the trail is in Z
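The blocking rules above are mechanical enough to code. A small sketch that enumerates undirected paths and applies them, run on the student network from the earlier slide (D → G ← I, I → S, G → L); node and function names are ours:

```python
# Graph as a dict: node -> list of parents.
parents = {'D': [], 'I': [], 'G': ['D', 'I'], 'S': ['I'], 'L': ['G']}

def children(g, v):
    return [u for u, ps in g.items() if v in ps]

def descendants(g, v):
    out, stack = set(), [v]
    while stack:
        for c in children(g, stack.pop()):
            if c not in out:
                out.add(c)
                stack.append(c)
    return out

def all_paths(g, x, y, path=None):
    """All simple undirected paths between x and y."""
    path = path or [x]
    if x == y:
        yield path
        return
    for nxt in set(g[x]) | set(children(g, x)):
        if nxt not in path:
            yield from all_paths(g, nxt, y, path + [nxt])

def blocked(g, path, z):
    """Is this path blocked by the conditioning set z?"""
    for a, b, c in zip(path, path[1:], path[2:]):
        head_to_head = b in children(g, a) and b in children(g, c)
        if head_to_head:
            if b not in z and not (descendants(g, b) & z):
                return True   # v-structure, nothing observed at or below it
        elif b in z:
            return True       # chain or fork with the middle node observed
    return False

def d_separated(g, x, y, z):
    return all(blocked(g, p, set(z)) for p in all_paths(g, x, y))

print(d_separated(parents, 'S', 'D', set()))   # True: v-structure at G blocks
print(d_separated(parents, 'S', 'D', {'G'}))   # False: observing G activates it
print(d_separated(parents, 'S', 'D', {'L'}))   # False: a descendant of G is observed
print(d_separated(parents, 'D', 'L', {'G'}))   # True: chain blocked at G
```

Enumerating all simple paths is exponential in general; a production implementation would use the linear-time Bayes-ball / reachability algorithm instead.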
D-separation: example
[Figure: the student network (Intelligence, Difficulty, Grade, Letter, Rank) with d-separation queries such as: R ⊥ G | I?  R ⊥ D | I?  R ⊥ D | G?  R ⊥ D | L?  R ⊥ L | G?  D ⊥ L | G?]
Markov Blanket in Bayesian Networks
A variable is conditionally independent of all other variables given its Markov blanket
The Markov blanket of a node consists of:
- its parents
- its children
- the co-parents of its children (the children's other parents)
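The three-part definition translates into a few lines of code, sketched here on the burglary network (B → A ← E, A → J, A → M):

```python
# Markov blanket = parents ∪ children ∪ co-parents of children.
parents = {'B': [], 'E': [], 'A': ['B', 'E'], 'J': ['A'], 'M': ['A']}

def markov_blanket(g, v):
    kids = {u for u, ps in g.items() if v in ps}
    co_parents = {p for u in kids for p in g[u]} - {v}
    return set(g[v]) | kids | co_parents

print(sorted(markov_blanket(parents, 'A')))  # ['B', 'E', 'J', 'M']
print(sorted(markov_blanket(parents, 'B')))  # ['A', 'E']
```

Note that E lands in the blanket of B even though there is no edge between them: it is a co-parent of B's child A, which is exactly the explaining-away dependence.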
D-Separation: soundness & completeness
Soundness: any conditional independence property that we derive from G must hold in every probability distribution that factorizes over G
Theorem: if P factorizes over G and d-sep_G(X, Y | Z), then P satisfies X ⊥ Y | Z
Weak completeness:
For almost all distributions P that factorize over G, if X ⊥ Y | Z holds in P, then X and Y are d-separated given Z in the graph G
There can be independencies in P that are not captured by the conditional independence properties of G
I-equivalence
Definition: two graphs G1 and G2 over a set of variables are I-equivalent if I(G1) = I(G2)
Most graphs have many I-equivalent variants
[Figure: three I-equivalent graphs over A, B, C, e.g. A → B → C, A ← B ← C, A ← B → C]
I-equivalence
If G1 and G2 have the same skeleton and the same set of immoralities (v-structures without a direct edge between the parents), then they are I-equivalent
[Figure: an immorality A → B ← C]
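This criterion is easy to check programmatically. A sketch over the three-node graphs from the previous slide (graphs as parent dicts, helper names ours):

```python
from itertools import combinations

def skeleton(g):
    """Set of undirected edges."""
    return {frozenset((u, p)) for u, ps in g.items() for p in ps}

def immoralities(g):
    """V-structures a -> child <- b with no edge between a and b."""
    out = set()
    for child, ps in g.items():
        for a, b in combinations(sorted(ps), 2):
            if frozenset((a, b)) not in skeleton(g):
                out.add((a, child, b))
    return out

def same_skeleton_and_immoralities(g1, g2):
    return skeleton(g1) == skeleton(g2) and immoralities(g1) == immoralities(g2)

chain_  = {'A': [], 'B': ['A'], 'C': ['B']}    # A -> B -> C
fork    = {'B': [], 'A': ['B'], 'C': ['B']}    # A <- B -> C
collide = {'A': [], 'C': [], 'B': ['A', 'C']}  # A -> B <- C (an immorality)

print(same_skeleton_and_immoralities(chain_, fork))     # True
print(same_skeleton_and_immoralities(chain_, collide))  # False
```

The chain and the fork pass the criterion (so they are I-equivalent), while the v-structure fails it, matching the basic-structures slide.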
I-map
I(G) = {(X ⊥ Y | Z) : d-sep_G(X, Y | Z)}
Definition: if P satisfies I(G), we say that G is an I-map (independency map) of P
Equivalently, I(G) ⊆ I(P), where I(P) = {(X ⊥ Y | Z) : P ⊨ (X ⊥ Y | Z)}
I(P): all conditional independence relations satisfied by P
Minimal I-map
When more independence relations are encoded in the graph:
⇒ sparser representation (fewer parameters)
⇒ a more informative and intuitive representation
We want a graph that captures as much of the structure (conditional independence relations) of P as possible
G is a minimal I-map for P if it is an I-map for P and the removal of any single edge from G renders it no longer an I-map
A minimal I-map may still fail to capture I(P)
Minimal I-map
The fact that G is a minimal I-map for P is far from a guarantee that G captures the independence structure of P
[Figure: a perfect map of a distribution P alongside two minimal I-maps of P; from the Koller and Friedman book]
Perfect map
G is a D-map for a distribution P if every conditional independence relation satisfied by the distribution is also in I(G)
G is a perfect map (P-map) for a distribution P if it is both an I-map and a D-map for that distribution
Theorem: not every distribution has a perfect map as a DAG
Example: a distribution P with the independencies I(P) = {A ⊥ C | {B, D}, B ⊥ D | {A, C}} cannot be perfectly represented by any Bayesian network
[Figure: two candidate DAGs over A, B, C, D, neither of which is a P-map]
Perfect map
A perfect map of a distribution is ideal, but it does not exist for many distributions
A distribution P can have many P-map graphs (all of them are I-equivalent)
A minimal I-map G for a distribution P is far from guaranteed to contain all of the independencies in P
Bayesian networks: summary
A Bayesian network is a pair (G, CPDs) where G is a DAG and the CPDs define a joint distribution P that factorizes over G
Each CPD is the conditional distribution P(X_i | Pa(X_i)) associated with the graph node X_i
We can express "causality", "generative schemes", "asymmetric influences", etc., between variables via a Bayesian network
We can read local and global independence relations off the graph structure via the d-separation criterion
Reference
Koller and Friedman, Probabilistic Graphical Models: Principles and Techniques, MIT Press.