Factor analysis & Exact inference for Gaussian networks
Probabilistic Graphical Models
Sharif University of Technology
Spring 2017
Soleymani
Multivariate Gaussian distribution

$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right\}$$

- The natural, canonical, or information parameterization of a Gaussian distribution arises from the quadratic form:
$$\mathcal{N}(x \mid \eta, \Lambda) \propto \exp\left\{ -\frac{1}{2} x^T \Lambda x + \eta^T x \right\}$$
$$\Lambda = \Sigma^{-1}, \qquad \eta = \Sigma^{-1} \mu$$
Joint Gaussian distribution: block elements

- If we partition the vector $x$ into $x_1$ and $x_2$:
$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}$$
$$p(x_1, x_2 \mid \mu, \Sigma) = \mathcal{N}\left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \,\middle|\, \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \right)$$
- $\Sigma_{21} = \Sigma_{12}^T$; $\Sigma_{11}$ and $\Sigma_{22}$ are symmetric.
Marginal and conditional of Gaussian

[Figure: contours of a joint density $p(y_1, y_2)$ sliced at $y_2 = 0.7$, with the marginal $p(y_1)$ and the conditional $p(y_1 \mid y_2 = 0.7)$. [Bishop]]

For a multivariate Gaussian distribution, all marginal and conditional distributions are also Gaussian.
Matrix inverse lemma

$$\begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1} = \begin{bmatrix} M & -MBD^{-1} \\ -D^{-1}CM & D^{-1} + D^{-1}CMBD^{-1} \end{bmatrix}, \qquad M = \left( A - BD^{-1}C \right)^{-1}$$
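As a quick numerical sanity check of the block-inversion formula, here is a minimal NumPy sketch (the matrix and the 2+3 partition are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# A random symmetric positive-definite 5x5 matrix, partitioned into 2+3 blocks.
X = rng.standard_normal((5, 5))
S = X @ X.T + 5 * np.eye(5)
A, B = S[:2, :2], S[:2, 2:]
C, D = S[2:, :2], S[2:, 2:]

# Blockwise inverse via the matrix inverse lemma.
Dinv = np.linalg.inv(D)
M = np.linalg.inv(A - B @ Dinv @ C)  # inverse of the Schur complement of D
S_inv = np.block([[M, -M @ B @ Dinv],
                  [-Dinv @ C @ M, Dinv + Dinv @ C @ M @ B @ Dinv]])

assert np.allclose(S_inv, np.linalg.inv(S))
```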
Precision matrix

- In many situations it is convenient to work with $\Lambda = \Sigma^{-1}$, known as the precision matrix:
$$\Lambda = \Sigma^{-1} = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}^{-1} = \begin{bmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{bmatrix}$$
- Relation between the inverse of a partitioned matrix and the inverses of its partitions (using the matrix inverse lemma):
$$\Lambda_{11} = \left( \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \right)^{-1}$$
$$\Lambda_{12} = -\left( \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \right)^{-1} \Sigma_{12} \Sigma_{22}^{-1}$$
- $\Lambda_{21} = \Lambda_{12}^T$; $\Lambda_{11}$ and $\Lambda_{22}$ are symmetric.
Marginal and conditional distributions based on block elements of $\Lambda$

- Conditional:
$$p(x_1 \mid x_2) = \mathcal{N}(x_1 \mid \mu_{1|2}, \Sigma_{1|2})$$
$$\mu_{1|2} = \mu_1 - \Lambda_{11}^{-1} \Lambda_{12} (x_2 - \mu_2), \qquad \Sigma_{1|2} = \Lambda_{11}^{-1}$$
- Marginal:
$$p(x_1) = \mathcal{N}(x_1 \mid \mu_1, \Sigma_1), \qquad \Sigma_1 = \left( \Lambda_{11} - \Lambda_{12} \Lambda_{22}^{-1} \Lambda_{21} \right)^{-1}$$
where $\Lambda = \Sigma^{-1}$ is partitioned into blocks $\Lambda_{11}, \Lambda_{12}, \Lambda_{21}, \Lambda_{22}$ as above.
Marginal and conditional distributions based on block elements of $\Sigma$

- Conditional distributions (a code sketch follows):
$$p(x_1 \mid x_2) = \mathcal{N}(x_1 \mid \mu_{1|2}, \Sigma_{1|2})$$
$$\mu_{1|2} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2), \qquad \Sigma_{1|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}$$
- Marginal distributions based on block elements of $\mu$ and $\Sigma$:
$$p(x_1) = \mathcal{N}(x_1 \mid \mu_1, \Sigma_{11}), \qquad p(x_2) = \mathcal{N}(x_2 \mid \mu_2, \Sigma_{22})$$
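These conditioning formulas translate directly into code. A minimal NumPy sketch (the function name and interface are illustrative, not from the slides):

```python
import numpy as np

def gaussian_conditional(mu, Sigma, idx1, idx2, x2):
    """Parameters of p(x1 | x2) for a joint Gaussian N(mu, Sigma).

    idx1, idx2: integer index arrays selecting the two blocks;
    x2: the observed value of the second block."""
    mu1, mu2 = mu[idx1], mu[idx2]
    S11 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S22 = Sigma[np.ix_(idx2, idx2)]
    S22_inv = np.linalg.inv(S22)
    mu_cond = mu1 + S12 @ S22_inv @ (x2 - mu2)   # mu_1|2
    Sigma_cond = S11 - S12 @ S22_inv @ S12.T     # Sigma_1|2
    return mu_cond, Sigma_cond
```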
Factor analysis

- Gaussian latent variable $z$ ($K$-dimensional):
  - a continuous latent variable
  - can be used for dimensionality reduction
- Observed variable $x$ ($D$-dimensional, with $K < D$):
$$p(z) = \mathcal{N}(z \mid 0, I), \qquad p(x \mid z) = \mathcal{N}(x \mid \mu + Wz, \Psi)$$
- $W$: the $D \times K$ factor loading matrix; $\Psi$: a diagonal covariance matrix
- $z \in \mathbb{R}^K$, $x \in \mathbb{R}^D$; parameters: $\mu$, $W$, $\Psi$

[Figure: the graphical model $z \to x$.]
Marginal distribution

$$x = \mu + Wz + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \Psi)$$

The product of Gaussian distributions is Gaussian, as is the marginal of a Gaussian, thus $p(x) = \int p(z) \, p(x \mid z) \, dz$ is Gaussian:

$$\mu_x = E[x] = E[\mu + Wz + \epsilon] = \mu + W E[z] = \mu$$
$$\Sigma_x = E\left[ (x - \mu)(x - \mu)^T \right] = E\left[ (Wz + \epsilon)(Wz + \epsilon)^T \right] = W E[zz^T] W^T + \Psi = WW^T + \Psi$$
$$\Rightarrow \quad p(x) = \mathcal{N}(x \mid \mu, WW^T + \Psi)$$

($\epsilon$ is independent of $z$, with $p(z) = \mathcal{N}(z \mid 0, I)$ and $p(x \mid z) = \mathcal{N}(x \mid \mu + Wz, \Psi)$.)
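The marginal covariance $WW^T + \Psi$ can be checked by simulation. A minimal sketch of the generative process (all sizes and parameter values are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, N = 5, 2, 200_000                      # illustrative sizes

mu = rng.standard_normal(D)
W = rng.standard_normal((D, K))              # factor loadings
psi = rng.uniform(0.1, 0.5, size=D)          # diagonal of Psi

# Generative process: x = mu + W z + eps, z ~ N(0, I), eps ~ N(0, Psi).
z = rng.standard_normal((N, K))
eps = rng.standard_normal((N, D)) * np.sqrt(psi)
x = mu + z @ W.T + eps

# The empirical covariance approaches W W^T + Psi as N grows.
emp_cov = np.cov(x, rowvar=False)
print(np.max(np.abs(emp_cov - (W @ W.T + np.diag(psi)))))
```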
Joint Gaussian distribution

$$\Sigma_{xz} = \mathrm{Cov}(x, z) = E\left[ (x - \mu) z^T \right] = E\left[ (Wz + \epsilon) z^T \right] = W$$
$$\Sigma_{xx} = WW^T + \Psi, \qquad \Sigma_{zz} = I$$
$$\Rightarrow \quad p\left( \begin{bmatrix} z \\ x \end{bmatrix} \right) = \mathcal{N}\left( \begin{bmatrix} z \\ x \end{bmatrix} \,\middle|\, \begin{bmatrix} 0 \\ \mu \end{bmatrix}, \begin{bmatrix} I & W^T \\ W & WW^T + \Psi \end{bmatrix} \right)$$
Conditional distributions

$$p(z \mid x) = \mathcal{N}(z \mid \mu_{z|x}, \Sigma_{z|x})$$
$$\mu_{z|x} = W^T \left( WW^T + \Psi \right)^{-1} (x - \mu)$$
$$\Sigma_{z|x} = I - W^T \left( WW^T + \Psi \right)^{-1} W$$
- As written, a $D \times D$ matrix must be inverted. Since $K < D$, it is preferable to use:
$$\mu_{z|x} = \left( I + W^T \Psi^{-1} W \right)^{-1} W^T \Psi^{-1} (x - \mu) = \Sigma_{z|x} W^T \Psi^{-1} (x - \mu)$$
$$\Sigma_{z|x} = \left( I + W^T \Psi^{-1} W \right)^{-1}$$
which follows from the matrix inversion lemma $\left( A - BD^{-1}C \right)^{-1} = A^{-1} + A^{-1}B\left( D - CA^{-1}B \right)^{-1}CA^{-1}$.
- Note that the posterior covariance does not depend on the observed data $x$, and computing the posterior mean is a linear operation in $x$.
- These are instances of the general formulas $\mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)$ and $\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$, applied to the joint covariance $\Sigma = \begin{bmatrix} I & W^T \\ W & WW^T + \Psi \end{bmatrix}$.
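A minimal NumPy sketch of the posterior computation using the cheaper $K \times K$ inversion (function name and interface are illustrative):

```python
import numpy as np

def fa_posterior(x, mu, W, psi):
    """Posterior p(z|x) = N(mu_z, Sigma_z) for factor analysis.

    psi: diagonal of Psi; only a K x K matrix is inverted."""
    K = W.shape[1]
    WtPinv = W.T / psi                               # W^T Psi^{-1} (Psi diagonal)
    Sigma_z = np.linalg.inv(np.eye(K) + WtPinv @ W)  # independent of x
    mu_z = Sigma_z @ (WtPinv @ (x - mu))             # linear in x
    return mu_z, Sigma_z
```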
Geometric illustration

[Figure: a low-dimensional manifold embedded in $(y_1, y_2, y_3)$ space, with $p(z)$ and $p(y \mid z)$. To generate data, first generate a point within the manifold, then add noise. [Jordan]]
FA example

- Data is a linear function of the low-dimensional latent coordinates, plus Gaussian noise:
$$p(z) = \mathcal{N}(z \mid 0, I), \qquad p(x \mid z) = \mathcal{N}(x \mid Wz + \mu, \Psi), \qquad p(x) = \mathcal{N}(x \mid \mu, WW^T + \Psi)$$

[Figure: samples of the latent $z$, the induced distribution in data space, and the resulting marginal $p(x)$. [Bishop]]
Factor analysis: dimensionality reduction

- FA is just a constrained Gaussian model:
  - if $\Psi$ were not diagonal, then we could model any Gaussian.
- FA is a low-rank parameterization of a multivariate Gaussian:
  - since $p(x) = \mathcal{N}(x \mid \mu, WW^T + \Psi)$, FA approximates the covariance matrix of the visible vector using the low-rank decomposition $WW^T$ and the diagonal matrix $\Psi$;
  - $WW^T + \Psi$ is the outer product of two low-rank matrices plus a diagonal matrix (i.e., $O(KD)$ parameters instead of $O(D^2)$).
- Given $\{x^{(1)}, \ldots, x^{(N)}\}$ (observations of the high-dimensional data), by learning from incomplete data we find a $W$ for transforming the data to a lower-dimensional space.
Incomplete likelihood

$$\ell(\theta; \mathcal{D}) = -\frac{N}{2} \log\left| WW^T + \Psi \right| - \frac{1}{2} \sum_{n=1}^{N} \left( x^{(n)} - \mu \right)^T \left( WW^T + \Psi \right)^{-1} \left( x^{(n)} - \mu \right)$$
$$= -\frac{N}{2} \log\left| WW^T + \Psi \right| - \frac{1}{2} \mathrm{tr}\left[ \left( WW^T + \Psi \right)^{-1} S \right], \qquad S = \sum_{n=1}^{N} \left( x^{(n)} - \mu \right)\left( x^{(n)} - \mu \right)^T$$
$$\mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x^{(n)}$$
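A minimal sketch of this log-likelihood (up to the additive constant $-\frac{ND}{2}\log 2\pi$, which the formula above also omits; names are illustrative):

```python
import numpy as np

def fa_loglik(X, mu, W, psi):
    """Incomplete-data log-likelihood of FA, up to an additive constant.

    X: (N, D) data matrix; psi: diagonal of Psi."""
    N = X.shape[0]
    C = W @ W.T + np.diag(psi)               # model covariance W W^T + Psi
    _, logdet = np.linalg.slogdet(C)
    Xc = X - mu
    S = Xc.T @ Xc                            # scatter matrix
    return -0.5 * N * logdet - 0.5 * np.trace(np.linalg.solve(C, S))
```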
E-step: expected sufficient statistics

$$E_{p(h \mid x, \theta^t)}\left[ \log p(x, h \mid \theta) \right] = \sum_{n=1}^{N} E_{p(z^{(n)} \mid x^{(n)}, \theta^t)}\left[ \log p(z^{(n)} \mid \theta) + \log p(x^{(n)} \mid z^{(n)}, \theta) \right]$$

- Expected sufficient statistics:
$$E\left[ \log p(x, h \mid \theta) \right] = -\frac{N}{2} \log|\Psi| - \frac{1}{2} \sum_{n=1}^{N} \mathrm{tr}\left( E\left[ z^{(n)} z^{(n)T} \right] \right) - \frac{1}{2} \sum_{n=1}^{N} \mathrm{tr}\left( E\left[ \left( x^{(n)} - Wz^{(n)} \right)\left( x^{(n)} - Wz^{(n)} \right)^T \right] \Psi^{-1} \right) + c$$
$$E\left[ \left( x^{(n)} - Wz^{(n)} \right)\left( x^{(n)} - Wz^{(n)} \right)^T \right] = x^{(n)} x^{(n)T} - W E\left[ z^{(n)} \right] x^{(n)T} - x^{(n)} E\left[ z^{(n)} \right]^T W^T + W E\left[ z^{(n)} z^{(n)T} \right] W^T$$
$$E_{p(z^{(n)} \mid x^{(n)}, \theta^t)}\left[ z^{(n)} \right] = \mu_{z|x^{(n)}} = \Sigma_{z|x} W^T \Psi^{-1} \left( x^{(n)} - \mu \right)$$
$$E_{p(z^{(n)} \mid x^{(n)}, \theta^t)}\left[ z^{(n)} z^{(n)T} \right] = \Sigma_{z|x} + \mu_{z|x^{(n)}} \mu_{z|x^{(n)}}^T$$
where $\Sigma_{z|x} = \left( I + W^T \Psi^{-1} W \right)^{-1}$ and $\mu_{z|x} = \Sigma_{z|x} W^T \Psi^{-1} (x - \mu)$.
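A minimal sketch of the E-step, computing the two expected sufficient statistics for all data points at once (names are illustrative; X is assumed centered, so $\mu = 0$):

```python
import numpy as np

def fa_e_step(X, W, psi):
    """E-step: E[z^(n)] and E[z^(n) z^(n)^T] under p(z | x, theta^t).

    X: (N, D) centered data; psi: diagonal of Psi.
    Returns Ez of shape (N, K) and Ezz of shape (N, K, K)."""
    K = W.shape[1]
    WtPinv = W.T / psi                               # W^T Psi^{-1}
    Sigma_z = np.linalg.inv(np.eye(K) + WtPinv @ W)  # shared posterior covariance
    Ez = X @ (Sigma_z @ WtPinv).T                    # row n is mu_{z|x^(n)}
    Ezz = Sigma_z[None, :, :] + Ez[:, :, None] * Ez[:, None, :]
    return Ez, Ezz
```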
M-Step

$$W^{t+1} = \left( \sum_{n=1}^{N} x^{(n)} E\left[ z^{(n)} \right]^T \right) \left( \sum_{n=1}^{N} E\left[ z^{(n)} z^{(n)T} \right] \right)^{-1}$$
$$\Psi^{t+1} = \frac{1}{N} \mathrm{diag}\left\{ \sum_{n=1}^{N} E\left[ \left( x^{(n)} - W^{t+1} z^{(n)} \right)\left( x^{(n)} - W^{t+1} z^{(n)} \right)^T \right] \right\} = \frac{1}{N} \mathrm{diag}\left\{ \sum_{n=1}^{N} x^{(n)} x^{(n)T} - W^{t+1} \sum_{n=1}^{N} E\left[ z^{(n)} \right] x^{(n)T} \right\}$$

(here $x^{(n)}$ denotes the centered observation, i.e., $\mu_{ML}$ has been subtracted).
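A minimal sketch of the M-step, together with a small EM loop combining it with the `fa_e_step` sketch above (all names are illustrative):

```python
import numpy as np

def fa_m_step(X, Ez, Ezz):
    """M-step: update W and the diagonal of Psi from expected statistics.

    X: (N, D) centered data."""
    N = X.shape[0]
    XtEz = X.T @ Ez                                  # sum_n x^(n) E[z^(n)]^T
    W_new = XtEz @ np.linalg.inv(Ezz.sum(axis=0))
    # diag(sum_n x x^T) and diag(W sum_n E[z] x^T), taken elementwise:
    psi_new = ((X * X).sum(axis=0) - (W_new * XtEz).sum(axis=1)) / N
    return W_new, psi_new

# A few EM iterations on centered data X (W, psi initialized arbitrarily):
# for _ in range(100):
#     Ez, Ezz = fa_e_step(X, W, psi)
#     W, psi = fa_m_step(X, Ez, Ezz)
```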
Unidentifiability

- $W$ only appears as the outer product $WW^T$, thus the model is invariant to rotations and axis flips of the latent space.
- $W$ can be replaced with $WR$ for any orthonormal matrix $R$, and the model, which depends on $W$ only through $WW^T$, remains the same.
- Thus, FA is an unidentifiable model:
  - the likelihood objective function on a data set will not have a unique maximum (an infinite number of parameter settings attain the maximum score);
  - learning is therefore not guaranteed to recover the same parameters each time.
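A one-line numerical check of this invariance (the sizes and the QR-based construction of a random orthonormal $R$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 2))                  # D = 5, K = 2

# A random orthonormal K x K matrix via QR decomposition.
R, _ = np.linalg.qr(rng.standard_normal((2, 2)))

# W and WR induce exactly the same marginal covariance W W^T + Psi,
# since (WR)(WR)^T = W R R^T W^T = W W^T.
assert np.allclose(W @ W.T, (W @ R) @ (W @ R).T)
```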
Probabilistic PCA (PPCA)

- Factor analysis: $\Psi$ is a general diagonal matrix.
- Probabilistic PCA: $\Psi = \sigma^2 I$ and $W$ is orthogonal.
- The posterior mean is not an orthogonal projection, since it is shrunk somewhat towards the prior mean. [Murphy]

[Figure: comparison with PCA. [Murphy]]
Exact inference for Gaussian networks
Multivariate Gaussian distribution

$$p(x) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}$$
$$p(x) \propto \exp\left\{ -\frac{1}{2} x^T \Lambda x + (\Lambda\mu)^T x \right\}, \qquad \Lambda = \Sigma^{-1}$$

- Directed model: linear-Gaussian model
- Undirected model: Gaussian MRF
- $p$ is normalizable (i.e., the normalization constant is finite) and defines a legal Gaussian distribution if and only if $\Lambda$ is positive definite.
Linear-Gaussian model

- Linear-Gaussian model for CPDs:
$$p(x_i \mid x_{\pi_i}) = \mathcal{N}\left( x_i \,\middle|\, \sum_{j \in \pi_i} w_{ij} x_j + b_i, \; v_i \right)$$
where $\pi_i$ denotes the parents of $X_i$.
- The joint distribution is Gaussian:
$$\ln p(x_1, \ldots, x_D) = -\sum_{i=1}^{D} \frac{1}{2v_i} \left( x_i - \sum_{j \in \pi_i} w_{ij} x_j - b_i \right)^2 + c$$
From linear-Gaussian model to joint multivariate distribution

- We can find the parameters of the multivariate Gaussian from the linear-Gaussian model.
- Mean and covariances (the $X_i$ are in topological order):
$$E[x_i] = \sum_{j \in \pi_i} w_{ij} E[x_j] + b_i$$
$$\mathrm{Cov}(x_j, x_i) = \sum_{k \in \pi_i} w_{ik} \, \mathrm{Cov}(x_j, x_k) + \delta_{ij} v_i$$
Multivariate Gaussian: directed model example

- Linear Gaussian:
$$p(x_1) = \mathcal{N}(x_1 \mid \mu_1, v_1), \qquad p(x_2 \mid x_1) = \mathcal{N}(x_2 \mid w x_1 + b, v_2)$$
with $\mu_1 = 2$, $v_1 = 0.5$, $w = 1$, $b = 0.5$, $v_2 = 0.2$.

[Figure: the CPD $p(x_2 \mid x_1)$ and the joint density $p(x_1, x_2)$.]
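The mean and covariance recursions on the previous slide can be implemented directly; a minimal sketch (function name is illustrative), checked on the two-node example above:

```python
import numpy as np

def linear_gaussian_to_joint(W, b, v):
    """Joint (mu, Sigma) of a linear-Gaussian Bayesian network.

    Nodes are in topological order: W[i, j] = w_ij (nonzero only for
    parents j < i), b[i] = offset, v[i] = conditional variance."""
    D = len(b)
    mu = np.zeros(D)
    Sigma = np.zeros((D, D))
    for i in range(D):
        mu[i] = W[i, :i] @ mu[:i] + b[i]
        for j in range(i):                        # Cov(x_j, x_i), j < i
            Sigma[j, i] = W[i, :i] @ Sigma[j, :i]
            Sigma[i, j] = Sigma[j, i]
        Sigma[i, i] = W[i, :i] @ Sigma[i, :i] + v[i]
    return mu, Sigma

# Two-node example: mu = (2, 2.5), Sigma = [[0.5, 0.5], [0.5, 0.7]].
mu, Sigma = linear_gaussian_to_joint(np.array([[0., 0.], [1., 0.]]),
                                     np.array([2., 0.5]),
                                     np.array([0.5, 0.2]))
```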
Gaussian Bayesian networks

- We define a Gaussian Bayesian network to be a Bayesian network all of whose variables are continuous, and where all of the CPDs are linear Gaussians.
- For Gaussian networks, the joint distribution has a compact representation:
  - the number of parameters is quadratic in the number of variables.
- Transformations from the network to the joint and back have a fairly simple and efficiently computable closed form.
Independencies in multivariate Gaussian

- $\Sigma^{-1}_{ij} = 0 \;\Leftrightarrow\; X_i \perp X_j \mid \mathcal{X} - \{X_i, X_j\}$
  - If $X_1, \ldots, X_D$ have a joint normal distribution $p(x) = \mathcal{N}(\mu, \Sigma)$, then $\Sigma^{-1}_{ij} = 0$ if and only if $p \models X_i \perp X_j \mid \mathcal{X} - \{X_i, X_j\}$.
- $\Sigma_{ij} = 0 \;\Leftrightarrow\; X_i \perp X_j$
  - If $X_1, \ldots, X_D$ have a joint normal distribution $\mathcal{N}(\mu, \Sigma)$, then $\Sigma_{ij} = 0$ if and only if $X_i \perp X_j$.
Sparsity in covariance matrix

[Figure: small example networks over $X_1, \ldots, X_4$; in one of them $\Sigma_{13} = \Sigma_{31} = 0$ while $(\Sigma^{-1})_{13} \neq 0$.]

If the parametrization is not degenerate, $\Sigma$ would be dense (i.e., $\forall i, j: \Sigma_{ij} \neq 0$).
Multivariate Gaussian: undirected model

- A Gaussian distribution can be represented by a fully connected graph with pairwise (edge) potentials: a Gaussian MRF.
- The overall energy has the form:
$$E(y) = \frac{1}{2} \sum_{i,j} (y_i - \mu_i) \, \Sigma^{-1}_{ij} \, (y_j - \mu_j)$$
with (log-)potentials
$$\psi_{ij}(y_i, y_j) = -\Sigma^{-1}_{ij} y_i y_j, \quad i < j, \qquad \psi_i(y_i) = -\frac{1}{2} \Sigma^{-1}_{ii} y_i^2 + y_i \sum_j \Sigma^{-1}_{ij} \mu_j$$
Multivariate Gaussian: undirected model

- $\Sigma^{-1}_{ij} = 0 \;\Leftrightarrow\; X_i \perp X_j \mid \mathcal{X} - \{X_i, X_j\}$
  - If $X_1, \ldots, X_D$ have a joint normal distribution $p(x) = \mathcal{N}(\mu, \Sigma)$, then $\Sigma^{-1}_{ij} = 0$ if and only if $p \models X_i \perp X_j \mid \mathcal{X} - \{X_i, X_j\}$.
- We can view the information matrix as directly defining a minimal I-map Markov network for the distribution,
  - whereby nonzero entries correspond to edges in the network.
Sparsity in precision matrix $\Sigma^{-1}$

[Figure: a small Markov network over four variables and the sparsity pattern of its precision matrix $\Sigma^{-1}$; zero entries correspond exactly to missing edges. A numeric illustration follows.]
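A minimal numeric illustration of this sparsity, assuming a chain $X_1 \to X_2 \to X_3$ with unit noise at every node: the covariance is dense, but the precision matrix is tridiagonal because $X_1 \perp X_3 \mid X_2$.

```python
import numpy as np

# Chain: x1 ~ N(0,1), x2 = x1 + e2, x3 = x2 + e3, with unit-variance noise.
# The resulting covariance is dense:
Sigma = np.array([[1., 1., 1.],
                  [1., 2., 2.],
                  [1., 2., 3.]])

# ...but the precision matrix is tridiagonal, with (Sigma^{-1})_{13} = 0:
Lambda = np.linalg.inv(Sigma)
print(np.round(Lambda, 6))
# [[ 2. -1.  0.]
#  [-1.  2. -1.]
#  [ 0. -1.  1.]]
```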
BP for continuous variables

[Figure: a small tree over $y_1, \ldots, y_4$ with messages $m_{32}(y_2)$, $m_{42}(y_2)$, and $m_{21}(y_1)$.]
Belief propagation: integral-product

[Figure: messages $m_{kj}(y_j)$ arriving at node $j$ and the outgoing message $m_{ji}(y_i)$.]

- Messages:
$$m_{ji}(y_i) = \int_{y_j} \psi_j(y_j) \, \psi_{ij}(y_i, y_j) \prod_{k \in \mathcal{N}(j) \setminus i} m_{kj}(y_j) \, dy_j$$
- Marginal probability function:
$$p(y_i) \propto \psi_i(y_i) \prod_{j \in \mathcal{N}(i)} m_{ji}(y_i)$$
BP for continuous variables

- Is there a finitely parameterized, closed form for the message and marginal functions?
- Is there an analytic formula for the message integral, phrased as an update of these parameters?
Canonical form properties

- The product of two canonical forms is in canonical form.
- The division of two canonical forms is also in canonical form.
- The marginalization of a canonical form onto a subset of its variables $Y$ results in a canonical form (when the block of $K$ over the variables being integrated out is positive definite).
- Instantiating a subset of variables results in a canonical form.
Canonical forms

$$C(X; K, h, g) = \exp\left\{ -\frac{1}{2} X^T K X + h^T X + g \right\}$$

A Gaussian $\mathcal{N}(\mu, \Sigma)$ corresponds to the canonical form with $K = \Sigma^{-1}$ and $h = \Sigma^{-1}\mu$; a general canonical form will be a Gaussian only when $K$ is positive definite.
Messages and marginals for Gaussian networks

- We use the canonical form as a finitely parameterized, closed form for the message and marginal functions.
- The message integral is phrased in the canonical form, and its parameters can be found from the parameters involved in the integral.
- The inference algorithm will be correct since we can show that it executes a marginalization step only on canonical forms for which this operation is well defined.
Gaussian Markov network: factors

- The graph topology can be specified by the structure of the matrix $K = \Sigma^{-1}$; i.e., the edge set $\{i, j\}$ includes all non-zero entries of $K$ for which $i > j$:
$$\psi_{ij}(y_i, y_j) = \exp\left\{ -\frac{1}{2} y_i K_{ij} y_j \right\}$$
$$\psi_i(y_i) = \exp\left\{ -\frac{1}{2} K_{ii} y_i^2 + h_{ii} y_i \right\} \propto \mathcal{N}\left( \mu_{ii}, K_{ii}^{-1} \right), \qquad h_{ii} = K_{ii} \mu_{ii}$$
- This is one form of parametrizing the factors (it is not uniquely defined).
Gaussian Markov network: messages

- If we assume a Gaussian MRF:
  - messages and marginal functions are all Gaussian;
  - updates will be in terms of updating the parameters $K$ and $h$.
- $\psi_i(y_i) \prod_{k \in \mathcal{N}(i) \setminus j} m_{ki}(y_i) \propto \mathcal{N}\left( \mu_{i \setminus j}, K_{i \setminus j}^{-1} \right)$
  - $K_{i \setminus j} = K_{ii} + \sum_{k \in \mathcal{N}(i) \setminus j} K_{k \to i}$
  - $h_{i \setminus j} = h_{ii} + \sum_{k \in \mathcal{N}(i) \setminus j} h_{k \to i}$
- $m_{ij}(y_j) = \int \psi_{i,j}(y_i, y_j) \, \psi_i(y_i) \prod_{k \in \mathcal{N}(i) \setminus j} m_{ki}(y_i) \, dy_i \propto \mathcal{N}\left( \mu_{i \to j}, K_{i \to j}^{-1} \right)$
  - $K_{i \to j} = -K_{ij} K_{i \setminus j}^{-1} K_{ij}$
  - $h_{i \to j} = -K_{ij} K_{i \setminus j}^{-1} h_{i \setminus j}$
Messages for the Gaussian networks

- Messages in the canonical form:
$$m_{ij}(y_j) = \exp\left\{ -\frac{1}{2} K_{i \to j} y_j^2 + h_{i \to j} y_j \right\}$$
$$K_{i \setminus j} = K_{ii} + \sum_{k \in \mathcal{N}(i) \setminus j} K_{k \to i}, \qquad h_{i \setminus j} = h_{ii} + \sum_{k \in \mathcal{N}(i) \setminus j} h_{k \to i}$$
$$K_{i \to j} = -K_{ij} K_{i \setminus j}^{-1} K_{ij}, \qquad h_{i \to j} = -K_{ij} K_{i \setminus j}^{-1} h_{i \setminus j}$$
Marginal distributions

$$p(y_i) \propto \psi_i(y_i) \prod_{j \in \mathcal{N}(i)} m_{ji}(y_i) \propto \mathcal{N}\left( \mu_i, K_i^{-1} \right)$$
$$K_i = K_{ii} + \sum_{j \in \mathcal{N}(i)} K_{j \to i}, \qquad h_i = h_{ii} + \sum_{j \in \mathcal{N}(i)} h_{j \to i}, \qquad \mu_i = K_i^{-1} h_i$$
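For a chain-structured Gaussian MRF these scalar updates reduce to one forward and one backward pass; a minimal sketch (function name and schedule are illustrative), whose marginals match the exact ones obtained by inverting $K$:

```python
import numpy as np

def gaussian_bp_chain(K, mu):
    """Scalar Gaussian BP on a chain x_1 - x_2 - ... - x_n.

    K: (tridiagonal) precision matrix Sigma^{-1}; mu: mean vector.
    Returns the marginal means and variances."""
    n = len(mu)
    h = K @ mu                                       # canonical linear term
    Kf = np.zeros(n); hf = np.zeros(n)               # forward messages i -> i+1
    Kb = np.zeros(n); hb = np.zeros(n)               # backward messages i -> i-1
    for i in range(n - 1):                           # forward pass
        Ka = K[i, i] + (Kf[i - 1] if i > 0 else 0.0)
        ha = h[i] + (hf[i - 1] if i > 0 else 0.0)
        Kf[i] = -K[i, i + 1] ** 2 / Ka               # K_{i -> i+1}
        hf[i] = -K[i, i + 1] * ha / Ka               # h_{i -> i+1}
    for i in range(n - 1, 0, -1):                    # backward pass
        Ka = K[i, i] + (Kb[i + 1] if i < n - 1 else 0.0)
        ha = h[i] + (hb[i + 1] if i < n - 1 else 0.0)
        Kb[i] = -K[i - 1, i] ** 2 / Ka               # K_{i -> i-1}
        hb[i] = -K[i - 1, i] * ha / Ka               # h_{i -> i-1}
    K_marg = K.diagonal().copy(); h_marg = h.copy()
    K_marg[1:] += Kf[:-1]; h_marg[1:] += hf[:-1]     # add incoming messages
    K_marg[:-1] += Kb[1:]; h_marg[:-1] += hb[1:]
    return h_marg / K_marg, 1.0 / K_marg             # means, variances

# Check on the tridiagonal chain precision from the sparsity example:
K = np.array([[2., -1., 0.], [-1., 2., -1.], [0., -1., 1.]])
mu = np.array([0., 1., 2.])
means, variances = gaussian_bp_chain(K, mu)          # means = mu,
assert np.allclose(variances, np.diag(np.linalg.inv(K)))  # variances = (1, 2, 3)
```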
Exact inference for Gaussian networks

- All exact inference algorithms can be adapted to Gaussian networks:
  - only the representation of factors and the implementation of the basic factor operations are different.
- Inference in Gaussian networks is computationally linear in the number of cliques, and at most cubic in the size of the largest clique.
- When the Gaussian has sufficiently low dimension, the naive approach to inference (forming the joint and marginalizing directly) may be sufficient.
- When we have a high-dimensional Gaussian distribution and the network has low tree-width, the message-passing algorithms can provide considerable savings.
References

- Bishop, C. M., Pattern Recognition and Machine Learning, Springer, 2006.
- Jordan, M. I., An Introduction to Probabilistic Graphical Models (draft).
- Murphy, K. P., Machine Learning: A Probabilistic Perspective, MIT Press, 2012.