Variational inference
Probabilistic Graphical Models, Sharif University of Technology, Spring 2016, Soleymani
Some slides are adapted from Xing's slides.

Inference queries
- Marginal probability (likelihood):
    P(e) = Σ_Y P(Y, e)
- Conditional probability (a posteriori belief):
    P(Y | e) = P(Y, e) / Σ_Y P(Y, e)
- Marginalized conditional probability (the unobserved variables are W = Y ∪ Z, and Z is marginalized out):
    P(Y | e) = Σ_Z P(Y, Z, e) / Σ_Y Σ_Z P(Y, Z, e)
- Most probable assignment for some variables of interest given an evidence X_E = e:
    MPA(Y | e) = argmax_y P(y | e)
Nodes: X = {X_1, ..., X_n};  Evidence: X_E = e;  Query variables: Y = X \ X_E
(A brute-force sketch of these queries on a small joint table follows below.)
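As a side illustration (not from the slides), a minimal sketch of answering these queries by brute-force enumeration on a small, made-up discrete joint distribution; the variable names and the table are hypothetical:

    import numpy as np

    # Hypothetical joint P(A, B, C) over three binary variables, stored as a 2x2x2 table.
    rng = np.random.default_rng(0)
    joint = rng.random((2, 2, 2))
    joint /= joint.sum()                 # normalize so the table is a valid distribution

    # Evidence: C = 1.  Marginal probability (likelihood): P(e) = sum over Y of P(Y, e).
    p_e = joint[:, :, 1].sum()

    # A-posteriori belief: P(A, B | C = 1) = P(A, B, C = 1) / P(C = 1).
    posterior_ab = joint[:, :, 1] / p_e

    # Marginalized conditional: P(A | C = 1), summing out the nuisance variable B as well.
    posterior_a = posterior_ab.sum(axis=1)

    # Most probable assignment of (A, B) given the evidence.
    mpa = np.unravel_index(posterior_ab.argmax(), posterior_ab.shape)
    print(p_e, posterior_a, mpa)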
Exact methods for inference
- Variable elimination
- Message passing (reuses shared terms):
  - Sum-product (belief propagation)
  - Max-product
  - Junction tree
Junction tree
- General algorithm on graphs with cycles
- Message passing on junction trees
[Figure: a junction tree whose nodes are cliques of the triangulated graph; messages are passed between neighboring cliques through their separators.]
Why approximate inference
- The computational complexity of the junction tree algorithm is at least exponential in the size of the largest elimination clique C (the largest clique in the triangulated graph).
  - The tree-width of an n × n grid is n, so even moderately sized grids are out of reach.
- For a distribution p associated with a complex graph, computing the marginal (or conditional) probability of arbitrary random variable(s) is therefore intractable.
Learning and inference
- Learning is itself an inference problem, or usually needs inference as a subroutine.
- In Bayesian inference, which is one of the principal foundations of machine learning, learning is just an inference problem over the parameters.
- In the maximum likelihood approach we also need inference when we have incomplete data or when we encounter an undirected model.
Approximate inference
- Approximate inference techniques:
  - Variational algorithms
    - Loopy belief propagation
    - Mean field approximation
    - Expectation propagation
  - Stochastic simulation / sampling methods
Variational methods
- "Variational": a general term for optimization-based formulations.
- Many problems can be expressed as an optimization problem in which the quantity being optimized is a functional.
- Variational inference is a deterministic framework that is widely used for approximate inference.
Variational inference methods
- Construct an approximation to the target distribution p, where this approximation takes a simpler form for inference:
  - Define a target class of distributions Q.
  - Search for an instance q* in Q that is the best approximation to p.
  - Answer queries using q* rather than p.
- Q: a given family of distributions
  - Simpler families for which solving the optimization problem is computationally tractable.
  - However, the family may not be sufficiently expressive to encode p.
Constrained optimization: setup
- Assume that we are interested in the posterior distribution
    p(Z | X, α) = p(Z, X | α) / ∫ p(Z, X | α) dZ
- The problem of computing the posterior is an instance of the more general problems that variational inference solves.
- Main idea:
  - Pick a family of distributions over the latent variables with its own variational parameters.
  - Find the setting of the parameters that makes q close to the posterior of interest.
  - Use q with the fitted parameters as an approximation of the posterior.
Observed variables: X = {x_1, ..., x_N};  Hidden variables: Z = {z_1, ..., z_M}
Approximation
- Goal: approximate a difficult distribution p(Z | X) with a new distribution q(Z) such that:
  - p(Z | X) and q(Z) are close, and
  - computation on q(Z) is easy.
- Typically, the true posterior is not in the variational family.
- How should we measure the distance between distributions?
  - The Kullback-Leibler divergence (KL divergence) between two distributions p and q.
KL divergence
- Kullback-Leibler divergence between p and q:
    KL(p || q) = ∫ p(x) log [ p(x) / q(x) ] dx
- A result from information theory: for any p and q,
    KL(p || q) ≥ 0
- KL(p || q) = 0 if and only if p ≡ q.
- KL is asymmetric.
(A small numeric check of these properties follows below.)
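A minimal numeric aside (not on the slide) checking these properties for discrete distributions; p and q below are made up:

    import numpy as np

    def kl(p, q):
        """Discrete KL(p || q) = sum_x p(x) log(p(x) / q(x)); assumes q > 0 wherever p > 0."""
        p, q = np.asarray(p, float), np.asarray(q, float)
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

    p = np.array([0.7, 0.2, 0.1])
    q = np.array([0.4, 0.4, 0.2])
    print(kl(p, q), kl(q, p))   # both are >= 0, and in general KL(p||q) != KL(q||p)
    print(kl(p, p))             # exactly 0 when the two distributions coincide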
How should we measure the distance of p and q?
- We wish to find a distribution q such that q is a "good" approximation to p.
- We can therefore use the KL divergence as a scoring function to choose a good q.
- But KL(p(Z|X) || q(Z)) ≠ KL(q(Z) || p(Z|X)), so the direction of the divergence matters.
M-projection vs. I-projection
- M-projection of p onto Q:
    q* = argmin_{q ∈ Q} KL(p || q)
- I-projection of p onto Q:
    q* = argmin_{q ∈ Q} KL(q || p)
- These two differ only when KL is minimized over a restricted set of probability distributions (when Q is not the set of all possible distributions q).
KL divergence: M-projection vs. I-projection
- Let p be a correlated 2D Gaussian and let q be a Gaussian distribution with a diagonal covariance matrix:
    M-projection:  q* = argmin_q ∫ p(z) log [ p(z) / q(z) ] dz
    I-projection:   q* = argmin_q ∫ q(z) log [ q(z) / p(z) ] dz
- In both cases the mean is matched, E_q[z] = E_p[z]; the two projections differ in the variances they choose.
[Figure from Bishop: p in green, the resulting q* in red for each projection.]
(A closed-form numeric comparison follows below.)
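A small numeric sketch (not from the slides) of this effect, using the standard closed forms for projecting a correlated Gaussian onto diagonal Gaussians: the M-projection matches the marginal variances Σ_ii, while the I-projection (mean field) uses 1/Λ_ii with Λ = Σ^{-1}, which under-estimates them; the numbers below are made up:

    import numpy as np

    # Hypothetical correlated 2D Gaussian p = N(mu, Sigma).
    mu = np.zeros(2)
    Sigma = np.array([[1.0, 0.9],
                      [0.9, 1.0]])
    Lam = np.linalg.inv(Sigma)          # precision matrix of p

    var_m = np.diag(Sigma)              # M-projection (moment matching): marginal variances
    var_i = 1.0 / np.diag(Lam)          # I-projection onto a diagonal Gaussian: 1 / Lambda_ii

    print("M-projection variances:", var_m)   # [1.0, 1.0]
    print("I-projection variances:", var_i)   # [0.19, 0.19], narrower than the true marginals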
KL divergence: M-projection vs. I-projection
- Let p be a mixture of two 2D Gaussians and let q be a single 2D Gaussian with an arbitrary covariance matrix:
    M-projection:  q* = argmin_q ∫ p(z) log [ p(z) / q(z) ] dz   ⟹   E_q[z] = E_p[z], Cov_q[z] = Cov_p[z] (moment matching; q covers both modes)
    I-projection:   q* = argmin_q ∫ q(z) log [ q(z) / p(z) ] dz   ⟹   two good solutions, each locked onto one of the modes
[Figure from Bishop: p in blue, q* in red.]
M-projection
- Computing KL(p || q) requires inference on p:
    KL(p || q) = Σ_z p(z) log [ p(z) / q(z) ] = -H(p) - E_p[ log q(z) ]
- When q is in the exponential family with sufficient statistics T(z), setting the gradient of KL(p || q) to zero gives the moment-matching condition:
    E_q[ T(z) ] = E_p[ T(z) ]
- Expectation propagation methods are based on minimizing KL(p || q).
- Moment projection: inference on p (which is difficult) is required!
I-projection
- KL(q || p) can be computed without performing inference on p:
    KL(q || p) = Σ_z q(z) log [ q(z) / p(z) ] = -H(q) - E_q[ log p(z) ]
- Most variational inference algorithms make use of KL(q || p).
- Computing expectations w.r.t. q is tractable (by choosing a suitable class of distributions for q):
  - we choose a restricted family of distributions such that the expectations can be evaluated and optimized efficiently,
  - and yet one which is still sufficiently flexible to give a good approximation.
Example of variational approximation
[Figure from Bishop: a variational approximation compared with the Laplace approximation to the same posterior.]
Evidence Lower Bound (ELBO)
    ln p(X) = L(q) + KL(q || p)
    L(q) = ∫ q(Z) ln [ p(X, Z) / q(Z) ] dZ
    KL(q || p) = -∫ q(Z) ln [ p(Z | X) / q(Z) ] dZ
- We can maximize the lower bound L(q):
  - equivalent to minimizing the KL divergence;
  - if we allow any possible choice for q(Z), the maximum of the lower bound occurs when the KL divergence vanishes,
  - which happens when q(Z) equals the posterior distribution p(Z | X).
- The difference between ln p(X) and the ELBO is the KL divergence; ln p(X) is the quantity that the ELBO bounds.
- We will also refer to L(q) as the energy functional F[p, q] later.
Observed: X = {x_1, ..., x_N};  Hidden: Z = {z_1, ..., z_M}
(A tiny numeric check of this decomposition follows below.)
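A tiny numeric check (not from the slides) of the decomposition ln p(X) = L(q) + KL(q || p(Z|X)), on a made-up model with one observed and one latent binary variable:

    import numpy as np

    # Hypothetical joint p(x, z); rows index z in {0, 1}, columns index x in {0, 1}.
    p_xz = np.array([[0.30, 0.10],
                     [0.20, 0.40]])
    x_obs = 1
    p_x = p_xz[:, x_obs].sum()                     # evidence p(x)
    post = p_xz[:, x_obs] / p_x                    # true posterior p(z | x)

    q = np.array([0.5, 0.5])                       # an arbitrary variational distribution over z

    elbo = np.sum(q * np.log(p_xz[:, x_obs] / q))  # L(q) = E_q[ln p(x, z) - ln q(z)]
    kl = np.sum(q * np.log(q / post))              # KL(q || p(z | x))
    print(np.log(p_x), elbo + kl)                  # the two numbers agree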
Evidence Lower Bound (ELBO)
- The ELBO is a lower bound on the (log) marginal likelihood.
- This quantity should increase monotonically with each iteration.
- We maximize the ELBO to find the parameters that give as tight a bound as possible on the marginal likelihood.
- The ELBO converges to a local optimum (in general only a local maximum, not the global one).
- Variational inference is closely related to EM.
Factorized distributions q
    q(Z) = Π_i q_i(Z_i)
    L(q) = ∫ Π_i q_i(Z_i) [ ln p(X, Z) - Σ_i ln q_i(Z_i) ] dZ
Coordinate ascent to optimize L(q): isolating the terms that depend on a single factor q_j,
    L_j(q_j) = ∫ q_j(Z_j) E_{i≠j}[ ln p(X, Z) ] dZ_j - ∫ q_j(Z_j) ln q_j(Z_j) dZ_j + const
where
    E_{i≠j}[ ln p(X, Z) ] = ∫ ln p(X, Z) Π_{i≠j} q_i(Z_i) dZ_i
The only restriction placed on the distributions is the factorization assumption.
Factorized distributions q: optimization
Add a Lagrange multiplier for the normalization constraint on q_j:
    L(q_j, λ) = L_j(q_j) + λ ( ∫ q_j(Z_j) dZ_j - 1 )
    ∂L / ∂q_j(Z_j) = E_{i≠j}[ ln p(X, Z) ] - ln q_j(Z_j) - 1 + λ = 0
    ⟹ q_j*(Z_j) ∝ exp( E_{i≠j}[ ln p(X, Z) ] )
    ⟹ q_j*(Z_j) ∝ exp( E_{i≠j}[ ln p(Z_j | Z_{-j}, X) ] )
- The above formula determines the form of the optimal q_j. We did not specify the form in advance; only the factorization has been assumed.
- Depending on that form, the optimal q_j(Z_j) might not be easy to work with. Nonetheless, for many models it is.
- Since we are replacing the neighboring values by their mean value, the method is known as mean field.
Example: Gaussian factorized distribution
- Apply the mean-field result q_j*(z_j) ∝ exp( E_{i≠j}[ ln p(z) ] ) to a correlated 2D Gaussian p(z) = N(z | μ, Λ^{-1}) with the factorized family q(z) = q_1(z_1) q_2(z_2).
Solution:
    q_1*(z_1) = N( z_1 | μ_1 - Λ_11^{-1} Λ_12 ( E[z_2] - μ_2 ),  Λ_11^{-1} )
    q_2*(z_2) = N( z_2 | μ_2 - Λ_22^{-1} Λ_21 ( E[z_1] - μ_1 ),  Λ_22^{-1} )
- The two updates are coupled through E[z_1] and E[z_2]; iterating them converges to E[z_i] = μ_i, so the means are captured exactly while the factor variances Λ_ii^{-1} under-estimate the true marginal variances (see the sketch below).
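A minimal numeric sketch iterating the two coupled updates above on a made-up correlated Gaussian (the parameter values are arbitrary):

    import numpy as np

    mu = np.array([1.0, -1.0])
    Lam = np.linalg.inv(np.array([[1.0, 0.8],
                                  [0.8, 1.0]]))     # precision matrix of p

    m = np.zeros(2)                                  # current estimates of E[z_1], E[z_2]
    for _ in range(50):                              # mean-field coordinate updates
        m[0] = mu[0] - Lam[0, 1] / Lam[0, 0] * (m[1] - mu[1])
        m[1] = mu[1] - Lam[1, 0] / Lam[1, 1] * (m[0] - mu[0])

    print(m)                      # converges to the true means mu
    print(1.0 / np.diag(Lam))     # factor variances: smaller than the true marginal variances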
Example: Bayesian mixtures of Gaussians
- For simplicity, assume that the data-generating variance is one (unit-variance components).
- Model:
    p(μ) = Π_{k=1..K} N( μ_k | μ_0, τ_0^{-1} )
    p( z_k^(n) = 1 | π ) = π_k
    p( x^(n) | z_k^(n) = 1, μ ) = N( x^(n) | μ_k, I )
- Generative process:
    For k = 1, ..., K: draw μ_k ~ N( μ_0, τ_0^{-1} )
    For n = 1, ..., N: draw z^(n) ~ Mult(π);  draw x^(n) ~ Π_{k=1..K} N( μ_k, I )^{ z_k^(n) }
[Plate diagram: z^(n) → x^(n) for n = 1, ..., N, with the component means μ_k, k = 1, ..., K, shared across data points.]
Example: Bayesian mixtures of Gaussians
Hidden variables: Z = { z^(1), ..., z^(N), μ_1, ..., μ_K };  Observed variables: X = { x^(1), ..., x^(N) }
    p( z^(1), ..., z^(N), μ_1, ..., μ_K | x^(1), ..., x^(N) )
      = Π_{k} p(μ_k) Π_{n} p(z^(n)) p( x^(n) | z^(n), μ_1, ..., μ_K )
        / ∫_{μ_1, ..., μ_K} Σ_{z^(1), ..., z^(N)} Π_{k} p(μ_k) Π_{n} p(z^(n)) p( x^(n) | z^(n), μ_1, ..., μ_K ) dμ
- The denominator is difficult to compute.
Example: Bayesian mixtures of Gaussians
- Consider a variational distribution which factorizes between the latent variables and the parameters:
    q( z^(1), ..., z^(N), μ_1, ..., μ_K ) = q( z^(1), ..., z^(N) ) q( μ_1, ..., μ_K )
- This is the only assumption we need to make in order to obtain a tractable, practical solution.
Example: Bayesian mixtures of Gaussians
    ln q*( z^(1), ..., z^(N) ) = E_{μ_1..μ_K}[ ln p(X, Z) ] + const
      = E_{μ_1..μ_K}[ ln p( z^(1..N), μ_{1..K}, x^(1..N) ) ] + const
      = E_{μ_1..μ_K}[ Σ_k ln p(μ_k) + Σ_n ln p( z^(n) ) + Σ_n ln p( x^(n) | z^(n), μ_{1..K} ) ] + const
where
    ln p( x^(n) | z^(n), μ_{1..K} ) = Σ_k z_k^(n) ln N( x^(n) | μ_k, I )
      = -(D/2) ln 2π - (1/2) Σ_k z_k^(n) ( x^(n) - μ_k )^T ( x^(n) - μ_k )
    ln p( z^(n) ) = Σ_k z_k^(n) ln π_k
Example: Bayesian mixtures of Gaussians
    ln q*( z^(1), ..., z^(N) ) = Σ_n ln q*( z^(n) )   ⟹   q*( z^(1), ..., z^(N) ) = Π_n q*( z^(n) )
    ln q*( z^(n) ) = E_{μ_1..μ_K}[ Σ_k z_k^(n) ln π_k - (1/2) Σ_k z_k^(n) ( x^(n) - μ_k )^T ( x^(n) - μ_k ) ] + const
Example: Bayesian mixtures of Gaussians
    ln q*( z^(n) ) = E_{μ_1..μ_K}[ Σ_k z_k^(n) ln π_k - (1/2) Σ_k z_k^(n) ( x^(n) - μ_k )^T ( x^(n) - μ_k ) ] + const
      = Σ_k z_k^(n) [ ln π_k + x^(n)T E[μ_k] - (1/2) E[ μ_k^T μ_k ] ] + const       (keeping only the terms that involve z^(n))
    ⟹ q*( z^(n) ) = Mult( r_1^(n), ..., r_K^(n) ),   E[ z_k^(n) ] = r_k^(n)
    r_k^(n) = exp( ln π_k + x^(n)T E[μ_k] - (1/2) E[ μ_k^T μ_k ] )
              / Σ_{j=1..K} exp( ln π_j + x^(n)T E[μ_j] - (1/2) E[ μ_j^T μ_j ] )
Example: Bayesian mixtures of Gaussians
    ln q*( μ_1, ..., μ_K ) = E_{z^(1..N)}[ ln p( z^(1..N), μ_{1..K}, x^(1..N) ) ] + const
      = E_{z^(1..N)}[ ln Π_k p(μ_k) Π_n p( z^(n) ) p( x^(n) | z^(n), μ_{1..K} ) ] + const
      = Σ_k ln p(μ_k) + E_{z^(1..N)}[ Σ_n ln p( x^(n) | z^(n), μ_{1..K} ) ] + const
where
    ln p( x^(n) | z^(n), μ_{1..K} ) = Σ_k z_k^(n) ln N( x^(n) | μ_k, I )
Example: Bayesian mixtures of Gaussians
      = Σ_k ln p(μ_k) + Σ_n Σ_k E[ z_k^(n) ] ln N( x^(n) | μ_k, I ) + const
    ⟹ q*( μ_1, ..., μ_K ) = Π_k q*( μ_k ),     q*( μ_k ) ∝ exp( ln p(μ_k) + Σ_n E[ z_k^(n) ] ln N( x^(n) | μ_k, I ) )
    ⟹ q*( μ_k ) = N( μ_k | m_k, τ_k^{-1} I )
       τ_k = τ_0 + Σ_n E[ z_k^(n) ]
       m_k = τ_k^{-1} ( τ_0 μ_0 + Σ_n E[ z_k^(n) ] x^(n) )
- The r_k^(n) updates and the ( m_k, τ_k ) updates are iterated until convergence (a compact implementation sketch follows below).
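A compact CAVI sketch implementing the r_k^(n) and (m_k, τ_k) updates above, assuming unit-variance components and known mixing weights π; the toy data and hyperparameters are made up:

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(+2, 1, (100, 2))])  # toy data
    N, D = X.shape
    K, pi = 2, np.full(2, 0.5)                       # known mixing weights
    mu0, tau0 = np.zeros(D), 1.0                     # prior N(mu_k | mu0, tau0^{-1} I)

    m = rng.normal(size=(K, D))                      # variational means of q(mu_k)
    tau = np.full(K, tau0)                           # variational precisions of q(mu_k)

    for _ in range(50):
        # Update q(z^(n)): responsibilities r[n, k] = E[z_k^(n)].
        E_mu_sq = np.sum(m ** 2, axis=1) + D / tau   # E[mu_k^T mu_k]
        log_r = np.log(pi) + X @ m.T - 0.5 * E_mu_sq
        log_r -= log_r.max(axis=1, keepdims=True)    # subtract the max for numerical stability
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)

        # Update q(mu_k) = N(m_k, tau_k^{-1} I).
        Nk = r.sum(axis=0)
        tau = tau0 + Nk
        m = (tau0 * mu0 + r.T @ X) / tau[:, None]

    print(m)   # the two variational means should land near (-2, -2) and (+2, +2)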
Variational posterior distribution
- In this example, the variational posterior distributions have the same functional form as the corresponding factors in the joint distribution.
  - This is a general result and is a consequence of the choice of conjugate distributions.
- There are general results for the whole class of conjugate-exponential models.
- The additional factorizations of the variational posterior distributions (over n and over k above) are a consequence of the interaction between the assumed factorization and the conditional independencies in p.
Mean field for exponential family
Assume each full conditional is in the exponential family:
    p( z_j | z_{-j}, x ) = h( z_j ) exp( η( z_{-j}, x )^T T( z_j ) - A( η( z_{-j}, x ) ) )
    ln p( z_j | z_{-j}, x ) = ln h( z_j ) + η( z_{-j}, x )^T T( z_j ) - A( η( z_{-j}, x ) )
- Mean-field variational inference is then straightforward:
    ln q*( z_j ) = E_{q_{-j}}[ ln p( z_j | z_{-j}, x ) ] + const
                 = ln h( z_j ) + E_{q_{-j}}[ η( z_{-j}, x ) ]^T T( z_j ) - E_{q_{-j}}[ A( η( z_{-j}, x ) ) ] + const
    ⟹ q*( z_j ) ∝ h( z_j ) exp( E_{q_{-j}}[ η( z_{-j}, x ) ]^T T( z_j ) )
- q*( z_j ) is in the same exponential family as the conditional.
Mean field for exponential family
- Give each hidden variable z_j a variational parameter ν_j, and put it in the same exponential family as its model conditional:
    q( Z ) = Π_j q( z_j | ν_j )
- Each iteration of coordinate ascent:
  - sets each natural variational parameter ν_j to the expectation of the natural conditional parameter for variable z_j:
    ν_j* = E_{q_{-j}}[ η( z_{-j}, x ) ]
Conjugate exponential model in learning problems
- When the complete-data likelihood is drawn from the exponential family with natural parameters η:
    p( X, Z | η ) = Π_{n=1..N} h( x^(n), z^(n) ) exp( η^T T( x^(n), z^(n) ) - A(η) )
- We shall also use a conjugate prior for η:
    p( η | ν_0, τ_0 ) = f( ν_0, τ_0 ) exp( η^T τ_0 - ν_0 A(η) )
    X = { x^(1), ..., x^(N) },  Z = { z^(1), ..., z^(N) }
Mean field for conjugate exponential model in learning problems
- Suppose q( Z, η ) = q( Z ) q( η ). Then the optimal factors are:
    q*( Z ) = Π_{n=1..N} q*( z^(n) ),     q*( z^(n) ) ∝ h( x^(n), z^(n) ) exp( E_q[η]^T T( x^(n), z^(n) ) )
    q*( η ) = f( ν_N, τ_N ) exp( η^T τ_N - ν_N A(η) )
    ν_N = ν_0 + N,     τ_N = τ_0 + Σ_{n=1..N} E_q[ T( x^(n), z^(n) ) ]
Variational Bayes
- Learning with incomplete data by the Bayesian approach:
  - For complete data, we can derive closed-form solutions to the Bayesian inference problem under suitable (conjugacy) assumptions.
  - In the case of incomplete data, these closed-form solutions do not exist, and so we need to resort to approximate inference.
- Variational Bayes EM (VBEM) provides a way to model uncertainty in the parameters as well as in the latent variables:
  - Bayesian estimation at a computational cost that is essentially the same as EM.
  - Thus, it often gives us the speed benefits of ML or MAP estimation but the statistical benefits of the Bayesian approach.
Variational Bayes learning
    ln p(D) = ln ∫ Σ_h p( D, h | θ ) p(θ) dθ
    ln p(D) ≥ ∫ Σ_h q( h, θ ) ln [ p( D, h, θ ) / q( h, θ ) ] dθ
Mean field:  q( h, θ ) = q_h(h) q_θ(θ)
    ln p(D) ≥ ∫ Σ_h q_h(h) q_θ(θ) ln [ p( D, h, θ ) / ( q_h(h) q_θ(θ) ) ] dθ
            = ∫ Σ_h q_h(h) q_θ(θ) ln p( D, h, θ ) dθ + H[ q_h ] + H[ q_θ ]
            ≡ F( D, q )
(In the earlier notation Z = h ∪ θ and X = D.)
Variational Bayes learning
    ln p(D) = F( D, q ) + KL( q( h, θ ) || p( h, θ | D ) )
- We want to find q* = argmax_q F( D, q ).
- We assume the factorization q( h, θ ) = q_h(h) q_θ(θ) and use block coordinate ascent to optimize the above problem.
Mean Field VB
- Initialization: randomly select a starting distribution q_θ^(1).
- Repeat:
  - E-step: given the distribution over the parameters, find the posterior of the hidden data:
      q_h^(t+1) = argmax_{q_h} F( D, q_h, q_θ^(t) )
  - M-step: given the posterior distributions, find the likely parameters:
      q_θ^(t+1) = argmax_{q_θ} F( D, q_h^(t+1), q_θ )
- Until convergence.
    F( D, q_h, q_θ ) = ∫ Σ_h q_h(h) q_θ(θ) ln p( D, h, θ ) dθ + H[ q_h ] + H[ q_θ ]
(A worked instance of this loop on a simple model follows below.)
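As a concrete (and deliberately simpler) instance of this alternation than the models in these slides, a minimal sketch of VB for a univariate Gaussian with unknown mean and precision, Gaussian-Gamma prior, and factorization q(mu) q(tau); the closed-form updates are the standard ones (Bishop, Sec. 10.1.3) and the data are made up:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(loc=2.0, scale=0.5, size=200)           # toy data
    N, xbar = len(x), x.mean()
    mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0                  # Gaussian-Gamma hyperparameters

    E_tau = a0 / b0                                         # initial guess for E_q[tau]
    for _ in range(30):
        # "E-like" step: update q(mu) = N(mu_N, 1 / lam_N).
        mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
        lam_N = (lam0 + N) * E_tau

        # "M-like" step: update q(tau) = Gamma(a_N, b_N).
        a_N = a0 + (N + 1) / 2.0
        E_sq = np.sum((x - mu_N) ** 2) + N / lam_N          # E_q[ sum_n (x_n - mu)^2 ]
        E_sq += lam0 * ((mu_N - mu0) ** 2 + 1.0 / lam_N)    # E_q[ lam0 (mu - mu0)^2 ]
        b_N = b0 + 0.5 * E_sq
        E_tau = a_N / b_N

    print(mu_N, 1.0 / E_tau)    # approximate posterior mean of mu and of the noise variance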
Local computation of ELBO for factorized p
Let p be a Gibbs distribution over factors φ ∈ F, with unnormalized measure p̃(X) = Π_{φ∈F} φ( U_φ ):
    p(X) = (1/Z) Π_{φ∈F} φ( U_φ )
    KL( q || p ) = -H(q) - E_q[ log (1/Z) Π_{φ∈F} φ( U_φ ) ]
                 = -H(q) - log (1/Z) - Σ_{φ∈F} E_q[ log φ( U_φ ) ]
                 = log Z - H(q) - Σ_{φ∈F} E_q[ log φ( U_φ ) ]
    ⟹ F[p̃, q] = H(q) + Σ_{φ∈F} E_q[ log φ( U_φ ) ]      (the energy functional;  log Z = F[p̃, q] + KL(q || p))
Naive mean field for factorized p
Naive mean field (i.e., a fully factored distribution q):
    q(X) = Π_{i=1..n} q_i( x_i )
    F[p̃, q] = Σ_{φ∈F} E_q[ log φ( U_φ ) ] + H(q)
    E_q[ log φ( U_φ ) ] = Σ_{u_φ ∈ Val(U_φ)} [ Π_{i ∈ C_φ} q_i( x_i ) ] log φ( u_φ )
    H(q) = Σ_{i=1..n} H[ q_i ]
- Thus F[p̃, q] can be rewritten simply as a sum of expectations, each one over a small set of variables.
    C_φ = { i | x_i ∈ Scope(φ) }
(A small numeric sketch of this decomposition follows below.)
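A small numeric sketch (made-up three-variable chain MRF) evaluating F[p̃, q] = Σ_φ E_q[log φ] + H(q) for a fully factored q, and checking that it never exceeds log Z:

    import numpy as np
    from itertools import product

    # Hypothetical pairwise factors over binary variables x0 - x1 - x2 (a chain).
    rng = np.random.default_rng(0)
    phi01 = rng.random((2, 2)) + 0.1
    phi12 = rng.random((2, 2)) + 0.1

    # Exact log partition function by enumeration (feasible here: only 8 joint states).
    logZ = np.log(sum(phi01[a, b] * phi12[b, c] for a, b, c in product(range(2), repeat=3)))

    # A fully factored q(x) = q0(x0) q1(x1) q2(x2).
    q = [np.array([0.6, 0.4]), np.array([0.5, 0.5]), np.array([0.3, 0.7])]

    def entropy(p):
        return -float(np.sum(p * np.log(p)))

    # Each expectation touches only the variables in the factor's scope.
    E_log_phi01 = np.sum(np.outer(q[0], q[1]) * np.log(phi01))
    E_log_phi12 = np.sum(np.outer(q[1], q[2]) * np.log(phi12))
    F = E_log_phi01 + E_log_phi12 + sum(entropy(qi) for qi in q)

    print(F, logZ)   # F <= logZ; the gap is exactly KL(q || p)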
Stationary point (fixed-point equations)
    q_i( x_i ) = (1/Z_i) exp{ Σ_{φ: i ∈ C_φ} Σ_{u_φ ∈ Val(U_φ \ X_i)} q( u_φ | x_i ) log φ( u_φ, x_i ) }
- Proof:
    F[p̃, q] = Σ_{φ∈F} E_q[ log φ( U_φ ) ] + Σ_i H[ q_i ]
    Lagrangian:  L( q_i, λ_i ) = F[p̃, q] + λ_i ( Σ_{x_i} q_i( x_i ) - 1 )
    ∂L / ∂q_i( x_i ) = 0   ⟹   q_i( x_i ) = (1/Z_i) exp{ Σ_{φ: i ∈ C_φ} Σ_{u_φ} q( u_φ | x_i ) log φ( u_φ, x_i ) }
- Update rule: we can optimize each q_i given values for the other factors.
Optimization by coordinate ascent for factorized p
    q_i( x_i ) = (1/Z_i) exp{ Σ_{φ: i ∈ C_φ} Σ_{u_φ} q( u_φ | x_i ) log φ( u_φ, x_i ) }
- The coordinate ascent algorithm repeatedly optimizes a single marginal at a time, given fixed choices for all of the others:
    While not converged:
        Iterate over each of the variables X_i ∈ X
        Maximize the objective function with respect to q_i( x_i ), for all x_i ∈ Val(X_i), by the above formula.
- All these terms involve expectations of variables other than X_i and do not depend on the choice of q_i( x_i ); this is block coordinate ascent.
Convergence properties
- F is concave in each q_i( x_i ).
- An update of q_i is guaranteed to increase (or not decrease) F.
- Mean-field iterations are guaranteed to converge:
  - Each step of the coordinate ascent procedure is monotonically non-decreasing in F.
  - Because F is bounded, the sequence of distributions represented by successive iterations of mean field must converge.
- At the convergence point, the fixed-point equations hold for all variables.
  - As a consequence, the convergence point is a stationary point of the energy functional subject to the constraints.
- The result of the mean-field approximation is a local maximum, and not necessarily a global one.
Local computation in naive mean field
- When updating q_i, we only need to reason about the variables that share a factor with X_i:
  - the expectations required to evaluate q_i involve only those variables lying in the Markov blanket of the node i;
  - the other terms get absorbed into the constant term.
- The optimization of q_i can therefore be expressed as a local computation at the node.
Variational methods: two perspectives
- Each algorithm can be explained from two perspectives:
  - Constrained optimization
  - Message-passing algorithm
    - as one way of solving the optimization problem
Example: Mean field for pairwise MRFs
    p(x) = (1/Z) exp{ Σ_{(i,j)∈E} θ_ij( x_i, x_j ) + Σ_{i∈V} θ_i( x_i ) }
    q* = argmax_{q∈Q} F[p̃, q]
Subject to:
    q(x) = Π_{i=1..n} q_i( x_i ),        Σ_{x_i ∈ Val(X_i)} q_i( x_i ) = 1
Example: Mean field for pairwise MRFs
- p: pairwise MRF
    p(x) = (1/Z) Π_{(i,j)∈E} ψ_ij( x_i, x_j ) Π_{i∈V} ψ_i( x_i ) = (1/Z) exp{ Σ_{(i,j)∈E} θ_ij( x_i, x_j ) + Σ_{i∈V} θ_i( x_i ) }
- Mean-field fixed-point update:
    q_i( x_i ) = (1/Z_i) exp{ θ_i( x_i ) + Σ_{j∈N(i)} Σ_{x_j} q_j( x_j ) θ_ij( x_i, x_j ) }
  equivalently,
    q_i( x_i ) ∝ ψ_i( x_i ) Π_{j∈N(i)} m_{j→i}( x_i ),       m_{j→i}( x_i ) ∝ exp{ Σ_{x_j} q_j( x_j ) θ_ij( x_i, x_j ) }
  with θ_i = ln ψ_i and θ_ij = ln ψ_ij.
(An implementation sketch of these updates follows below.)
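A minimal sketch of these updates on a made-up 4-node cycle with Ising-style potentials; exact marginals obtained by enumeration are printed for comparison:

    import numpy as np
    from itertools import product

    # Hypothetical pairwise MRF: binary nodes on a 4-cycle, theta_ij(x_i, x_j) = J*x_i*x_j, theta_i(x_i) = h_i*x_i.
    edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
    vals = np.array([-1.0, 1.0])
    J, h = 0.8, np.array([0.3, -0.2, 0.5, -0.4])

    # Naive mean field: q_i(x_i) ∝ exp{ theta_i(x_i) + sum_j sum_{x_j} q_j(x_j) theta_ij(x_i, x_j) }.
    q = np.full((4, 2), 0.5)
    for _ in range(100):
        for i in range(4):
            logits = h[i] * vals
            for (a, b) in edges:
                if i in (a, b):
                    j = b if i == a else a
                    logits += J * vals * np.sum(q[j] * vals)   # E_{q_j}[theta_ij(x_i, x_j)]
            q[i] = np.exp(logits - logits.max())
            q[i] /= q[i].sum()

    # Exact marginals by brute force, for comparison.
    p = np.zeros((4, 2))
    for assign in product(range(2), repeat=4):
        x = vals[list(assign)]
        w = np.exp(np.sum(h * x) + J * sum(x[a] * x[b] for a, b in edges))
        for i, s in enumerate(assign):
            p[i, s] += w
    p /= p.sum(axis=1, keepdims=True)

    print(q)   # mean-field marginals
    print(p)   # exact marginals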
Message passing: Mean field vs. BP for pairwise MRF
    p(x) = (1/Z) Π_{(i,j)∈E} ψ_ij( x_i, x_j ) Π_{i∈V} ψ_i( x_i )
- Mean field:
    q_i( x_i ) ∝ ψ_i( x_i ) Π_{j∈N(i)} m_{j→i}( x_i ),       m_{j→i}( x_i ) ∝ exp{ Σ_{x_j} q_j( x_j ) θ_ij( x_i, x_j ) }
- Belief propagation (sum-product):
    b_i( x_i ) ∝ ψ_i( x_i ) Π_{j∈N(i)} m_{j→i}( x_i ),       m_{j→i}( x_i ) ∝ Σ_{x_j} ψ_j( x_j ) ψ_ij( x_i, x_j ) Π_{k∈N(j)\i} m_{k→j}( x_j )
  with θ_ij( x_i, x_j ) = ln ψ_ij( x_i, x_j ).
Variational message passing
- Mean-field methods are all very similar:
  - just compute each node's full conditional, and average out the neighbors.
    p(x) = Π_i p( x_i | pa_i )
    ln q( x_i ) = E_{q_{-i}}[ Σ_{j ∈ {i} ∪ ch(i)} ln p( x_j | pa_j ) ] + const
- It is possible to derive a general-purpose set of update equations that work for any DGM for which all CPDs are in the exponential family, and for which all parent nodes have conjugate distributions.
- Updating nodes one at a time:
  - updating posterior beliefs using local operations at each node;
  - each update increases a lower bound on the log evidence (unless it is already at a local maximum).
Structured variational
- Mean field:
  - Naive mean field
  - Structured mean field
Structured Mean Field
- Naive mean field can lead to very poor approximations.
- We must use a richer class of distributions Q, which has greater expressive power (by capturing some of the dependencies in p):
  - use network structures of different complexity;
  - a subgraph of H_p over which exact computation of H[q] (and of the required expectations) is feasible.
- Example: for a grid H_p, a collection of independent chain structures.
  - Exact inference with such structures is linear.
Structured stationary point
    p(X) = (1/Z) Π_{j=1..J} φ_j( C_j )                 (target distribution)
    q(X) = (1/Z_q) Π_{k=1..K} ψ_k( D_k )               (structured approximating distribution)
    F[p̃, q] = Σ_{j=1..J} E_q[ ln φ_j( C_j ) ] - E_q[ ln q ]
             = Σ_{j=1..J} E_q[ ln φ_j( C_j ) ] - Σ_{k=1..K} E_q[ ln ψ_k( D_k ) ] + ln Z_q
Structured stationary point
- ψ_k is a stationary point of the energy functional iff:
    ψ_k( d_k ) ∝ exp{ Σ_j E_q[ log φ_j( C_j ) | d_k ] - Σ_{l ≠ k} E_q[ log ψ_l( D_l ) | d_k ] }
- We need to perform inference (in q) after each update step.
- ψ_k( d_k ) does not affect the right-hand side of the fixed-point equation defining its value.
Structured mean-field quality
- Both the quality and the computational complexity of the variational approximation depend on the structure of p and q.
- We want to be able to perform efficient inference in the approximating network:
  - we often select our network so that the resulting factorization leads to a tractable network (that is, one of low tree-width).
Loopy Belief Propagation (LBP)
- A fixed-point iteration procedure that tries to optimize an approximation of F[p̃, q] (equivalently, to minimize the Bethe free energy).
- Start with an initialization of all messages to one.
- While not converged, update the messages:
    m_{i→s}^{new}( x_i ) = Π_{u∈N(i)\s} m_{u→i}( x_i )
    m_{s→i}^{new}( x_i ) = Σ_{x_{N(s)\i}} f_s( x_{N(s)} ) Π_{j∈N(s)\i} m_{j→s}( x_j )
- At convergence, stationarity properties are guaranteed.
- LBP does not always converge, and even when it does, it may converge to the wrong answers.
Recall: Beliefs and messages in factor tree
[Figure: a fragment of a factor graph; variable node i sends m_{i→a} to factor a and receives m_{u→i} from its other neighboring factors u ∈ N(i)\a; factor a receives m_{j→a} from its other neighboring variables j ∈ N(a)\i.]
    m_{i→a}( x_i ) = Π_{u∈N(i)\a} m_{u→i}( x_i )
    m_{a→i}( x_i ) = Σ_{x_{N(a)\i}} f_a( x_{N(a)} ) Π_{j∈N(a)\i} m_{j→a}( x_j )
    b_i( x_i ) ∝ Π_{a∈N(i)} m_{a→i}( x_i )
    b_a( x_{N(a)} ) ∝ f_a( x_{N(a)} ) Π_{i∈N(a)} m_{i→a}( x_i )
LBP
- If BP is used on graphs with loops, messages may circulate indefinitely.
- But we can run it anyway and hope for the best.
- Stop message passing when:
  - a fixed number of iterations is reached, or
  - no significant change in the beliefs occurs.
- Empirically, a good approximation can be achievable:
  - if the solution is not oscillatory but converges, it usually is a good approximation.
LBP as a relaxation method
- Loopy belief propagation (LBP) optimizes approximate versions of the energy functional:
  - we approximate F[p̃, q] with F_Bethe[p̃, q];
  - it works directly with pseudo-marginals, which may not be consistent with any joint distribution.
- The fixed-point equations derived from the constrained energy optimization can be viewed as message passing over a graph.
Bethe approximation
- Pros:
  - An objective function F_Bethe[p̃, q] that is easier to compute and optimize.
- Cons:
  - It may or may not be well connected to F[p̃, q]; in general it could be greater than, equal to, or less than it.
- Optimization is over the individual beliefs b_φ( u_φ ) and b_i( x_i ):
  - for discrete beliefs, constrained optimization with Lagrange multipliers;
  - for continuous beliefs, there is not yet a general formula;
  - the iterations do not always converge.
LBP message-update rules
    m_{i→a}( x_i ) = Π_{u∈N(i)\a} m_{u→i}( x_i )
    m_{a→i}( x_i ) = Σ_{x_{N(a)\i}} f_a( x_{N(a)} ) Π_{j∈N(a)\i} m_{j→a}( x_j )
    b_i( x_i ) ∝ Π_{a∈N(i)} m_{a→i}( x_i )
    b_a( x_a ) ∝ f_a( x_a ) Π_{i∈N(a)} m_{i→a}( x_i )
- The stationary points of the Bethe optimization coincide with BP fixed points on the factor graph.
- Each message can be defined in terms of the other messages, allowing an easy iterative algorithm for solving the fixed-point equations.
- A "belief" is the approximation of the marginal probability.
(An implementation sketch of these message updates, in the pairwise-MRF form, follows below.)
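A minimal loopy BP sketch (not from the slides) for the pairwise form of these updates, on a made-up 4-node cycle; the beliefs are the normalized products of node potentials and incoming messages:

    import numpy as np

    # Hypothetical pairwise MRF: binary nodes on a 4-cycle with Ising-style potentials.
    edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
    vals = np.array([-1.0, 1.0])
    J, h = 0.8, np.array([0.3, -0.2, 0.5, -0.4])
    psi_i = [np.exp(h[i] * vals) for i in range(4)]              # node potentials psi_i(x_i)
    psi_ij = np.exp(J * np.outer(vals, vals))                    # shared edge potential psi_ij(x_i, x_j)

    nbrs = {i: [j for (a, b) in edges for j in (a, b) if i in (a, b) and j != i] for i in range(4)}
    msg = {(j, i): np.ones(2) for i in range(4) for j in nbrs[i]}  # messages m_{j->i}(x_i)

    for _ in range(100):
        new = {}
        for (j, i) in msg:
            incoming = psi_i[j].copy()          # psi_j(x_j) times messages into j, except the one from i
            for k in nbrs[j]:
                if k != i:
                    incoming = incoming * msg[(k, j)]
            m = psi_ij @ incoming               # sum over x_j of psi_ij(x_i, x_j) * incoming(x_j)
            new[(j, i)] = m / m.sum()           # normalize for numerical stability
        msg = new

    beliefs = []
    for i in range(4):
        b = psi_i[i].copy()
        for j in nbrs[i]:
            b = b * msg[(j, i)]
        beliefs.append(b / b.sum())
    print(np.array(beliefs))                    # approximate marginals b_i(x_i)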
Inference on trees: variational perspective
- For trees, a sequence of message propagations calibrates the tree in two passes:
  - the propagation process converges, and additional message passing does not change the beliefs.
- For general graphs, the process may not converge:
  - information from one pass on loopy graphs circulates and affects the next round;
  - the beliefs are not necessarily the marginal probabilities in p.
- Empirically, a good approximation is still achievable:
  - if the solution is not oscillatory but converges, it usually is a good approximation;
  - as cycles grow long, BP becomes exact (as exploited in coding).
References
- C. M. Bishop, "Pattern Recognition and Machine Learning", Sections 10.1-10.4.
- D. Koller and N. Friedman, "Probabilistic Graphical Models: Principles and Techniques", Sections 11.1-11.3, 11.5, 11.6.
Some extra (optional) slides on the theory of LBP follow, if you want to know it briefly.
Theory behind LBP
- LBP relaxes the following optimization problem:
    q* = argmax_{q ∈ M} E_q[ log p̃ ] + H(q)
  and uses the following optimization problem instead:
    q* = argmax_{q ∈ Q} F_Bethe[p̃, q]
    F_Bethe[p̃, q] = -Σ_{φ∈F} Σ_{u_φ} b_φ( u_φ ) log [ b_φ( u_φ ) / φ( u_φ ) ] + Σ_i ( d_i - 1 ) Σ_{x_i} b_i( x_i ) log b_i( x_i )
                   = Σ_{φ∈F} E_{b_φ}[ log φ( U_φ ) ] + H_Bethe
  Q: a family of (tractable) probability distributions
  d_i ≡ |{ φ : x_i ∈ Scope(φ) }|
  p(X) = (1/Z) p̃(X)
Theory behind LBP
- Indeed, we do not optimize q explicitly; we focus on the set of beliefs b = { b_φ( u_φ ), b_i( x_i ) } on factors and variables:
    b* = argmax_{b ∈ M_p} E_b[ log p̃ ] + H(b)
- Approximate objective:  F_Bethe = E_b[ log p̃ ] + H_Bethe(b)
- Relaxed feasible set:
    M_L = { b_φ ≥ 0, b_i ≥ 0 |  Σ_{x_i} b_i( x_i ) = 1,   Σ_{u_φ \ x_i} b_φ( u_φ ) = b_i( x_i ) }
- LBP is a fixed-point iteration procedure that tries to find b* ∈ M_L: locally consistent pseudo-marginals (a relaxation of the original set of consistency constraints). Both the objective and the constraint space are approximate.
Tree energy functional
- Consider a tree-structured distribution.
- b_φ and b_i denote the marginals on factors and variables:
    q(X) = Π_{φ} b_φ( U_φ ) / Π_{i} b_i( x_i )^{ d_i - 1 }
    H(q) = -Σ_{φ} Σ_{u_φ} b_φ( u_φ ) log b_φ( u_φ ) + Σ_{i} ( d_i - 1 ) Σ_{x_i} b_i( x_i ) log b_i( x_i )
    F = -Σ_{φ} Σ_{u_φ} b_φ( u_φ ) log [ b_φ( u_φ ) / φ( u_φ ) ] + Σ_{i} ( d_i - 1 ) Σ_{x_i} b_i( x_i ) log b_i( x_i )
- For a general graph, b_φ and b_i denote pseudo-marginals.
Bethe approximation of F[p̃, q] for general graphs
- For a general graph, choose F[p̃, q] ≈ F_Bethe( p̃, b ):
    H_Bethe = -Σ_{φ} Σ_{u_φ} b_φ( u_φ ) log b_φ( u_φ ) + Σ_{i} ( d_i - 1 ) Σ_{x_i} b_i( x_i ) log b_i( x_i )
    F_Bethe = -Σ_{φ} Σ_{u_φ} b_φ( u_φ ) log [ b_φ( u_φ ) / φ( u_φ ) ] + Σ_{i} ( d_i - 1 ) Σ_{x_i} b_i( x_i ) log b_i( x_i )
            = Σ_{φ} E_{b_φ}[ log φ( U_φ ) ] + H_Bethe
Minimizing the Bethe free energy
    L( b, λ ) = F_Bethe( p̃, b ) + Σ_i λ_i ( 1 - Σ_{x_i} b_i( x_i ) ) + Σ_φ Σ_{i∈N(φ)} Σ_{x_i} λ_{φi}( x_i ) ( b_i( x_i ) - Σ_{u_φ \ x_i} b_φ( u_φ ) )
- Stationary points:
    ∂L / ∂b_i( x_i ) = 0   ⟹   b_i( x_i ) ∝ exp{ ( 1 / ( d_i - 1 ) ) Σ_{φ∈N(i)} λ_{φi}( x_i ) }
    ∂L / ∂b_φ( u_φ ) = 0   ⟹   b_φ( u_φ ) ∝ exp{ log φ( u_φ ) + Σ_{i∈N(φ)} λ_{φi}( x_i ) }
Fixed point equations
    b_i( x_i ) ∝ exp{ ( 1 / ( d_i - 1 ) ) Σ_{φ∈N(i)} λ_{φi}( x_i ) }
    b_φ( u_φ ) ∝ exp{ log φ( u_φ ) + Σ_{i∈N(φ)} λ_{φi}( x_i ) }
    define  m_{φ→i}( x_i ) ≡ exp( λ_{φi}( x_i ) )
- Using the consistency constraint b_i( x_i ) = Σ_{u_φ \ x_i} b_φ( u_φ ), we obtain message updates of the BP form:
    m_{φ→i}( x_i ) ∝ Σ_{u_φ \ x_i} φ( u_φ ) Π_{j∈N(φ)\i} Π_{ψ∈N(j)\φ} m_{ψ→j}( x_j )