SLIDE 1

Variational inference

Probabilistic Graphical Models Sharif University of Technology Spring 2016 Soleymani

Some slides are adapted from Xing’s slides

SLIDE 2

Inference query

- Marginal probability (likelihood):
  P(x_V) = βˆ‘_Z P(Z, x_V)

- Conditional probability (a posteriori belief):
  P(Z | x_V) = P(Z, x_V) / βˆ‘_Z P(Z, x_V)

- Marginalized conditional probability:
  P(Y | x_V) = βˆ‘_W P(Y, W, x_V) / βˆ‘_Y βˆ‘_W P(Y, W, x_V)    (Z = Y βˆͺ W)

- Most probable assignment for some variables of interest given the evidence X_V = x_V:
  y* = argmax_Y P(Y | x_V)

Nodes: X = {X_1, …, X_n}    Evidence: X_V    Query variables: Z = X \ X_V

SLIDE 3

Exact methods for inference

- Variable elimination
- Message passing (reuse of shared terms):
  - Sum-product (belief propagation)
  - Max-product
  - Junction tree

SLIDE 4

Junction tree

- General algorithm for graphs with cycles
- Message passing on junction trees

[Figure: two cliques C_j and C_k connected by the separator S_jk, exchanging messages m_jk(S_jk) and m_kj(S_jk)]

SLIDE 5

Why approximate inference

- The computational complexity of the junction tree algorithm is exponential in the size of the largest elimination clique (the largest clique in the triangulated graph)
- For a distribution P associated with a complex graph, computing the marginal (or conditional) probability of arbitrary random variable(s) is intractable

Tree-width of an N Γ— N grid is N

SLIDE 6

Learning and inference

- Learning is also an inference problem, or usually needs inference
  - In Bayesian inference, which is one of the principal foundations of machine learning, learning is just an inference problem
  - In the maximum likelihood approach we also need inference when we have incomplete data or when we encounter an undirected model

SLIDE 7

Approximate inference

- Approximate inference techniques
  - Variational algorithms
    - Loopy belief propagation
    - Mean field approximation
    - Expectation propagation
  - Stochastic simulation / sampling methods

SLIDE 8

Variational methods

- "Variational": a general term for optimization-based formulations
- Many problems can be expressed as an optimization problem in which the quantity being optimized is a functional
- Variational inference is a deterministic framework that is widely used for approximate inference

SLIDE 9

Variational inference methods

- Constructing an approximation to the target distribution P, where the approximation takes a simpler form for inference:
  - We define a target class of distributions 𝒬
  - Search for an instance Q* in 𝒬 that is the best approximation to P
  - Queries are then answered using Q* rather than P
- 𝒬: a given family of distributions
  - Simpler families for which solving the optimization problem is computationally tractable
  - However, the family may not be sufficiently expressive to encode P

Constrained optimization

SLIDE 10

Setup

- Assume that we are interested in the posterior distribution
  P(Z | X, Ξ±) = P(Z, X | Ξ±) / ∫ P(Z, X | Ξ±) dZ
- The problem of computing the posterior is an instance of the more general class of problems that variational inference solves
- Main idea:
  - We pick a family of distributions over the latent variables with its own variational parameters
  - Then, find the setting of the parameters that makes Q close to the posterior of interest
  - Use Q with the fitted parameters as an approximation for the posterior

X = {x_1, …, x_n}: observed variables    Z = {z_1, …, z_m}: hidden variables

SLIDE 11

Approximation

- Goal: approximate a difficult distribution P(Z|X) with a new distribution Q(Z) such that:
  - P(Z|X) and Q(Z) are close
  - Computation on Q(Z) is easy
- Typically, the true posterior is not in the variational family.
- How should we measure the distance between distributions?
  - The Kullback-Leibler divergence (KL divergence) between two distributions P and Q

SLIDE 12

KL divergence

- Kullback-Leibler divergence between P and Q:
  KL(Pβ€–Q) = ∫ P(x) log [P(x)/Q(x)] dx
- A result from information theory: for any P and Q,
  KL(Pβ€–Q) β‰₯ 0
  - KL(Pβ€–Q) = 0 if and only if P ≑ Q
  - KL is asymmetric
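A quick numeric illustration of these properties (added here; not in the original deck), using two hand-picked discrete distributions:

```python
import numpy as np

def kl(p, q):
    """KL(p || q) = sum_x p(x) * log(p(x) / q(x)) for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0  # terms with p(x) = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.7, 0.2, 0.1]
q = [0.3, 0.4, 0.3]

print(kl(p, q))   # β‰ˆ 0.345  (non-negative)
print(kl(q, p))   # β‰ˆ 0.353  (a different value: KL is asymmetric)
print(kl(p, p))   # 0.0      (zero iff the two distributions coincide)
```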

SLIDE 13

How should we measure the distance between P and Q?

- We wish to find a distribution Q such that Q is a "good" approximation to P
- We can therefore use the KL divergence as a scoring function to choose a good Q
- But KL(P(Z|X) β€– Q(Z)) β‰  KL(Q(Z) β€– P(Z|X))

SLIDE 14

M-projection vs. I-projection

- M-projection of Q onto P:
  Q* = argmin_{Q ∈ 𝒬} KL(Pβ€–Q)
- I-projection of Q onto P:
  Q* = argmin_{Q ∈ 𝒬} KL(Qβ€–P)
- These two differ only when the minimization is over a restricted set of probability distributions (i.e., when P βˆ‰ 𝒬, the set of allowed Q distributions)

SLIDE 15

KL divergence: M-projection vs. I-projection

- Let P be a 2D Gaussian and Q be a Gaussian distribution with a diagonal covariance matrix (P: green, Q*: red) [Bishop]

M-projection:  Q* = argmin_Q ∫ P(z) log [P(z)/Q(z)] dz    β‡’  E_P[z] = E_Q[z]
I-projection:  Q* = argmin_Q ∫ Q(z) log [Q(z)/P(z)] dz    β‡’  E_P[z] = E_Q[z]

SLIDE 16

KL divergence: M-projection vs. I-projection

- Let P be a mixture of two 2D Gaussians and Q be a 2D Gaussian distribution with an arbitrary covariance matrix (P: blue, Q*: red) [Bishop]

M-projection:  Q* = argmin_Q ∫ P(z) log [P(z)/Q(z)] dz   β‡’  E_P[z] = E_Q[z],  Cov_P[z] = Cov_Q[z]
I-projection:  Q* = argmin_Q ∫ Q(z) log [Q(z)/P(z)] dz   β‡’  two good solutions (one per mode)!

SLIDE 17

M-projection

- Computing KL(Pβ€–Q) requires inference on P:
  KL(Pβ€–Q) = βˆ‘_z P(z) log [P(z)/Q(z)] = βˆ’H(P) βˆ’ E_P[log Q(z)]
- When Q is in the exponential family with sufficient statistics T(z), the M-projection is characterized by moment matching:
  E_P[T(z)] = E_Q[T(z)]
- Expectation Propagation methods are based on minimizing KL(Pβ€–Q)

Moment projection: inference on P (which is difficult) is required!

SLIDE 18

I-projection

- KL(Qβ€–P) can be computed without performing inference on P:
  KL(Qβ€–P) = ∫ Q(z) log [Q(z)/P(z)] dz = βˆ’H(Q) βˆ’ E_Q[log P(z)]
- Most variational inference algorithms make use of KL(Qβ€–P)
  - Computing expectations w.r.t. Q is tractable (by choosing a suitable class of distributions for Q)
  - We choose a restricted family of distributions such that the expectations can be evaluated and optimized efficiently,
  - and yet which is still sufficiently flexible to give a good approximation

SLIDE 19

Example of variational approximation

[Figure from Bishop: comparison of a variational approximation and the Laplace approximation to a distribution]

SLIDE 20

Evidence Lower Bound (ELBO)

ln P(X) = β„’(Q) + KL(Qβ€–P)

β„’(Q) = ∫ Q(Z) ln [P(X, Z)/Q(Z)] dZ
KL(Qβ€–P) = βˆ’βˆ« Q(Z) ln [P(Z|X)/Q(Z)] dZ

- We can maximize the lower bound β„’(Q)
  - equivalent to minimizing the KL divergence
  - if we allow any possible choice for Q(Z), the maximum of the lower bound occurs when the KL divergence vanishes,
  - which occurs when Q(Z) equals the posterior distribution P(Z|X)
- The difference between the ELBO and the KL divergence is ln P(X), which is exactly what the ELBO bounds

We will later also write β„’(Q) as the energy functional F[P, Q].    X = {x_1, …, x_n},  Z = {z_1, …, z_m}
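The decomposition above can be checked in two lines; this short derivation (standard, cf. Bishop Β§10.1, and not part of the original slide) makes the step explicit:

```latex
\begin{aligned}
\ln P(X) &= \int Q(Z)\,\ln P(X)\,dZ
  && \text{(since } \textstyle\int Q(Z)\,dZ = 1\text{)} \\
 &= \int Q(Z)\,\ln\frac{P(X,Z)}{P(Z\mid X)}\,dZ \\
 &= \underbrace{\int Q(Z)\,\ln\frac{P(X,Z)}{Q(Z)}\,dZ}_{\mathcal{L}(Q)}
  + \underbrace{\int Q(Z)\,\ln\frac{Q(Z)}{P(Z\mid X)}\,dZ}_{\mathrm{KL}(Q\,\Vert\,P)}
\end{aligned}
```

Because KL(Qβ€–P) β‰₯ 0, it follows immediately that ln P(X) β‰₯ β„’(Q).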

SLIDE 21

Evidence Lower Bound (ELBO)

- Lower bound on the marginal likelihood
- This quantity should increase monotonically with each iteration
- We maximize the ELBO to find the parameters that give as tight a bound as possible on the marginal likelihood
- The ELBO converges to a local maximum
- Variational inference is closely related to EM

SLIDE 22

Factorized distributions Q

The restriction on Q is in the form of a factorization assumption:

Q(Z) = ∏_i Q_i(Z_i)

β„’(Q) = ∫ ∏_i Q_i [ ln P(X, Z) βˆ’ βˆ‘_i ln Q_i ] dZ

Coordinate ascent to optimize β„’(Q): keep the terms that depend on Q_j,

β„’_j(Q) = ∫ Q_j [ ∫ ln P(X, Z) ∏_{iβ‰ j} Q_i dZ_i ] dZ_j βˆ’ ∫ Q_j ln Q_j dZ_j + const
       = ∫ Q_j E_{βˆ’j}[ln P(X, Z)] dZ_j βˆ’ ∫ Q_j ln Q_j dZ_j + const

where E_{βˆ’j}[ln P(X, Z)] = ∫ ln P(X, Z) ∏_{iβ‰ j} Q_i dZ_i

SLIDE 23

Factorized distributions Q: optimization

Maximize β„’_j(Q) over Q_j subject to normalization (Lagrange multiplier Ξ»):

L(Q_j, Ξ») = β„’_j(Q) + Ξ» ( βˆ‘_{Z_j} Q_j(Z_j) βˆ’ 1 )

dL / dQ_j(Z_j) = E_{βˆ’j}[log P(Z, X)] βˆ’ log Q_j(Z_j) βˆ’ 1 + Ξ» = 0

β‡’ Q*_j(Z_j) ∝ exp( E_{βˆ’j}[ln P(X, Z)] )
  Q*_j(Z_j) ∝ exp( E_{βˆ’j}[ln P(Z_j | Z_{βˆ’j}, X)] )

- The above formula determines the form of the optimal Q_j. We did not specify the form in advance; only the factorization has been assumed.
- Depending on that form, the optimal Q_j(Z_j) might not be easy to work with. Nonetheless, for many models it is.
- Since we are replacing the neighboring values by their mean value, the method is known as mean field.
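As a concrete instance of the update Q*_j(Z_j) ∝ exp(E_{βˆ’j}[ln P(X, Z)]), here is a minimal sketch (added; not from the slides) that runs the coordinate-ascent updates for a fully factorized approximation of an arbitrary two-variable discrete distribution:

```python
import numpy as np

# Toy joint P(Z1, Z2) over two ternary variables (arbitrary positive table, normalized).
rng = np.random.default_rng(0)
P = rng.random((3, 3)); P /= P.sum()
logP = np.log(P)

# Fully factorized approximation Q(Z1, Z2) = Q1(Z1) Q2(Z2), initialized uniformly.
Q1 = np.full(3, 1 / 3); Q2 = np.full(3, 1 / 3)

for _ in range(50):  # coordinate-ascent (mean-field) updates
    # Q1*(z1) ∝ exp( E_{Q2}[ log P(z1, Z2) ] )
    Q1 = np.exp(logP @ Q2); Q1 /= Q1.sum()
    # Q2*(z2) ∝ exp( E_{Q1}[ log P(Z1, z2) ] )
    Q2 = np.exp(logP.T @ Q1); Q2 /= Q2.sum()

# The ELBO E_Q[log P] + H(Q) never exceeds the log normalizer (0 here, since P is normalized).
Q = np.outer(Q1, Q2)
elbo = np.sum(Q * (logP - np.log(Q)))
print(Q1, Q2, elbo)   # elbo = -KL(Q||P) <= 0
```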

SLIDE 24

Example: Gaussian factorized distribution

SLIDE 25

Example: Gaussian factorized distribution

Solution:
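The equations of this example were lost in the transcript. For reference, the standard version of this example (Bishop Β§10.1.2, stated here as an assumption about what the slide showed) factorizes a 2-D Gaussian P(z) = 𝒩(z | ΞΌ, Λ⁻¹) as Q(z) = Q_1(z_1) Q_2(z_2), and the mean-field update gives

```latex
q_1^\star(z_1) = \mathcal{N}\!\left(z_1 \,\middle|\, m_1,\ \Lambda_{11}^{-1}\right),
\qquad
m_1 = \mu_1 - \Lambda_{11}^{-1}\Lambda_{12}\,\bigl(\mathbb{E}[z_2]-\mu_2\bigr),
```

and symmetrically for q_2*(z_2). At the fixed point E[z_1] = ΞΌ_1 and E[z_2] = ΞΌ_2, so the factorized solution captures the mean exactly but, being an I-projection, underestimates the variance along each coordinate.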

SLIDE 26

Example: Bayesian mixture of Gaussians

- For simplicity, assume that the data-generating variance is one (identity covariance).

- P(ΞΌ) = ∏_{k=1}^{K} 𝒩(ΞΌ_k | m_0, Ξ›_0⁻¹)
- P(z_k^(n) = 1 | Ο€) = Ο€_k
- P(x^(n) | z_k^(n) = 1, ΞΌ) = 𝒩(x^(n) | ΞΌ_k, I)

Generative process:
  For k = 1, …, K:  draw ΞΌ_k ~ 𝒩(m_0, Ξ›_0⁻¹)
  For n = 1, …, N:  draw z^(n) ~ Mult(Ο€);  draw x^(n) ~ ∏_{k=1}^{K} 𝒩(ΞΌ_k, I)^{z_k^(n)}

[Plate diagram: Ο€ β†’ z^(n) β†’ x^(n) ← ΞΌ_k;  n = 1, …, N;  k = 1, …, K]

SLIDE 27

Example: Bayesian mixture of Gaussians

Z = {z^(1), …, z^(N), ΞΌ_1, …, ΞΌ_K}    X = {x^(1), …, x^(N)}

P(z^(1), …, z^(N), ΞΌ_1, …, ΞΌ_K | x^(1), …, x^(N))
  = ∏_{k=1}^{K} P(ΞΌ_k) ∏_{n=1}^{N} P(z^(n)) P(x^(n) | z^(n), ΞΌ_1, …, ΞΌ_K)
    / ∫_{ΞΌ_1,…,ΞΌ_K} βˆ‘_{z^(1),…,z^(N)} ∏_{k=1}^{K} P(ΞΌ_k) ∏_{n=1}^{N} P(z^(n)) P(x^(n) | z^(n), ΞΌ_1, …, ΞΌ_K)

The denominator (the evidence) is difficult to compute.

SLIDE 28

Example: Bayesian mixture of Gaussians

- Consider a variational distribution which factorizes between the latent variables and the parameters:
  Q(z^(1), …, z^(N), ΞΌ_1, …, ΞΌ_K) = Q(z^(1), …, z^(N)) Q(ΞΌ_1, …, ΞΌ_K)
- This is the only assumption required in order to obtain a tractable practical solution

SLIDE 29

Example: Bayesian mixture of Gaussians

ln Q(z^(1), …, z^(N)) = E_{ΞΌ_1,…,ΞΌ_K}[ ln P(Z, X) ] + const
  = E_{ΞΌ_1,…,ΞΌ_K}[ ln P(z^(1), …, z^(N), ΞΌ_1, …, ΞΌ_K, x^(1), …, x^(N)) ] + const
  = E_{ΞΌ_1,…,ΞΌ_K}[ ln ( ∏_{k=1}^{K} P(ΞΌ_k) ∏_{n=1}^{N} P(z^(n)) P(x^(n) | z^(n), ΞΌ_1, …, ΞΌ_K) ) ] + const
  = E_{ΞΌ_1,…,ΞΌ_K}[ βˆ‘_{k=1}^{K} ln P(ΞΌ_k) + βˆ‘_{n=1}^{N} ln P(z^(n)) + βˆ‘_{n=1}^{N} ln P(x^(n) | z^(n), ΞΌ_1, …, ΞΌ_K) ] + const

where
  ln P(x^(n) | z^(n), ΞΌ_1, …, ΞΌ_K) = βˆ‘_{k=1}^{K} z_k^(n) ln 𝒩(x^(n) | ΞΌ_k, I)
    = βˆ’(d/2) ln 2Ο€ βˆ’ (1/2) βˆ‘_{k=1}^{K} z_k^(n) (x^(n) βˆ’ ΞΌ_k)α΅€ (x^(n) βˆ’ ΞΌ_k)
  ln P(z^(n)) = βˆ‘_{k=1}^{K} z_k^(n) ln Ο€_k

SLIDE 30

Example: Bayesian mixture of Gaussians

ln Q(z^(1), …, z^(N)) = βˆ‘_{n=1}^{N} ln Q(z^(n))   β‡’   Q(z^(1), …, z^(N)) = ∏_{n=1}^{N} Q(z^(n))

ln Q(z^(n)) = E_{ΞΌ_1,…,ΞΌ_K}[ βˆ‘_{k=1}^{K} z_k^(n) ln Ο€_k βˆ’ (1/2) βˆ‘_{k=1}^{K} z_k^(n) (x^(n) βˆ’ ΞΌ_k)α΅€ (x^(n) βˆ’ ΞΌ_k) ] + const

SLIDE 31

Example: Bayesian mixture of Gaussians

ln Q(z^(n)) = E_{ΞΌ_1,…,ΞΌ_K}[ βˆ‘_{k=1}^{K} z_k^(n) ln Ο€_k βˆ’ (1/2) βˆ‘_{k=1}^{K} z_k^(n) (x^(n) βˆ’ ΞΌ_k)α΅€ (x^(n) βˆ’ ΞΌ_k) ] + const
ln Q(z^(n)) = βˆ‘_{k=1}^{K} z_k^(n) [ ln Ο€_k + x^(n)α΅€ E[ΞΌ_k] βˆ’ (1/2) E[ΞΌ_kα΅€ ΞΌ_k] βˆ’ (1/2) x^(n)α΅€ x^(n) ] + const

β‡’ Q(z^(n)) = Mult(r_{n1}, …, r_{nK}),   E[z_k^(n)] = r_{nk}

r_{nk} = exp( ln Ο€_k + x^(n)α΅€ E[ΞΌ_k] βˆ’ (1/2) E[ΞΌ_kα΅€ ΞΌ_k] βˆ’ (1/2) x^(n)α΅€ x^(n) )
         / βˆ‘_{k'=1}^{K} exp( ln Ο€_{k'} + x^(n)α΅€ E[ΞΌ_{k'}] βˆ’ (1/2) E[ΞΌ_{k'}α΅€ ΞΌ_{k'}] βˆ’ (1/2) x^(n)α΅€ x^(n) )

SLIDE 32

Example: Bayesian mixture of Gaussians

ln Q(ΞΌ_1, …, ΞΌ_K) = E_{z^(1),…,z^(N)}[ ln P(z^(1), …, z^(N), ΞΌ_1, …, ΞΌ_K, x^(1), …, x^(N)) ] + const
  = E_{z^(1),…,z^(N)}[ ln ( ∏_{k=1}^{K} P(ΞΌ_k) ∏_{n=1}^{N} P(z^(n)) P(x^(n) | z^(n), ΞΌ_1, …, ΞΌ_K) ) ] + const
  = ln ∏_{k=1}^{K} P(ΞΌ_k) + E_{z^(1),…,z^(N)}[ βˆ‘_{n=1}^{N} ln P(x^(n) | z^(n), ΞΌ_1, …, ΞΌ_K) ] + const

ln P(x^(n) | z^(n), ΞΌ_1, …, ΞΌ_K) = βˆ‘_{k=1}^{K} z_k^(n) ln 𝒩(x^(n) | ΞΌ_k, I)

SLIDE 33

Example: Bayesian mixture of Gaussians

ln Q(ΞΌ_1, …, ΞΌ_K) = βˆ‘_{k=1}^{K} ln P(ΞΌ_k) + βˆ‘_{n=1}^{N} βˆ‘_{k=1}^{K} E[z_k^(n)] ln 𝒩(x^(n) | ΞΌ_k, I) + const

β‡’ Q(ΞΌ_1, …, ΞΌ_K) = ∏_{k=1}^{K} Q(ΞΌ_k)

Q(ΞΌ_k) ∝ exp( ln P(ΞΌ_k) + βˆ‘_{n=1}^{N} E[z_k^(n)] ln 𝒩(x^(n) | ΞΌ_k, I) )
β‡’ Q(ΞΌ_k) = 𝒩(ΞΌ_k | m_k, Ξ›_k⁻¹)

Ξ›_k = Ξ›_0 + βˆ‘_{n=1}^{N} E[z_k^(n)] I
m_k = Ξ›_k⁻¹ ( Ξ›_0 m_0 + βˆ‘_{n=1}^{N} E[z_k^(n)] x^(n) )

SLIDE 34

Variational posterior distribution

- In this example, the variational posterior distributions have the same functional form as the corresponding factors in the joint distribution
  - This is a general result and is a consequence of the choice of conjugate distributions
- There are general results for the class of conjugate-exponential models
- The additional factorizations of the variational posterior distributions are a consequence of the interaction between the assumed factorization and the conditional independencies in P

SLIDE 35

Mean field for exponential family

Suppose each complete conditional is in the exponential family:

P(z_j | z_{βˆ’j}, x) = h(z_j) exp( Ξ·(z_{βˆ’j}, x)α΅€ T(z_j) βˆ’ A(Ξ·(z_{βˆ’j}, x)) )
ln P(z_j | z_{βˆ’j}, x) = ln h(z_j) + Ξ·(z_{βˆ’j}, x)α΅€ T(z_j) βˆ’ A(Ξ·(z_{βˆ’j}, x))

- Mean field variational inference is then straightforward:

ln Q(z_j) = E_{Q_{βˆ’j}}[ log P(z_j | z_{βˆ’j}, x) ] + const
          = ln h(z_j) + E_{Q_{βˆ’j}}[ Ξ·(z_{βˆ’j}, x) ]α΅€ T(z_j) βˆ’ E_{Q_{βˆ’j}}[ A(Ξ·(z_{βˆ’j}, x)) ] + const

Q(z_j) ∝ h(z_j) exp( E_{Q_{βˆ’j}}[ Ξ·(z_{βˆ’j}, x) ]α΅€ T(z_j) )

- Q(z_j) is in the same exponential family as the conditional.

SLIDE 36

Mean field for exponential family

- Give each hidden variable z_j a variational parameter Ξ½_j, and put it in the same exponential family as its model conditional:
  Q(Z) = ∏_j Q(z_j | Ξ½_j)
- Each iteration of coordinate ascent sets each natural variational parameter Ξ½_j to the expectation of the natural conditional parameter for variable z_j:
  Ξ½_j* = E_{Q_{βˆ’j}}[ Ξ·(z_{βˆ’j}, x) ]

SLIDE 37

Conjugate exponential models in learning problems

- When the complete-data likelihood is in the exponential family with natural parameters Ξ·:

  P(X, Z | Ξ·) = ∏_{n=1}^{N} h(x^(n), z^(n)) exp( Ξ·α΅€ T(x^(n), z^(n)) βˆ’ A(Ξ·) )

- We shall also use a conjugate prior for Ξ·:

  P(Ξ· | Ξ½_0, Ο‡_0) = g(Ξ½_0, Ο‡_0) exp( Ξ½_0 Ξ·α΅€ Ο‡_0 βˆ’ Ξ½_0 A(Ξ·) )

Z = {z^(1), …, z^(N)}    X = {x^(1), …, x^(N)}

SLIDE 38

Mean field for conjugate exponential models in learning problems

- Suppose Q(Z, Ξ·) = Q(Z) Q(Ξ·):

  β‡’ Q(Z) = ∏_{n=1}^{N} Q(z^(n))

  Q*(z^(n)) = h(x^(n), z^(n)) exp( E_Ξ·[Ξ·]α΅€ T(x^(n), z^(n)) βˆ’ A(E_Ξ·[Ξ·]) )

  Q*(Ξ·) = g(Ξ½_N, Ο‡_N) exp( Ξ·α΅€ Ο‡_N βˆ’ Ξ½_N A(Ξ·) )
    Ξ½_N = Ξ½_0 + N
    Ο‡_N = Ο‡_0 + βˆ‘_{n=1}^{N} E_{z^(n)}[ T(x^(n), z^(n)) ]

SLIDE 39

Variational Bayes

- Learning with incomplete data by the Bayesian approach
  - For complete data, we can derive closed-form solutions to the Bayesian inference problem under some assumptions
  - In the case of incomplete data, these solutions do not exist, and so we need to resort to approximate inference
- Variational Bayes EM (VBEM) provides a way to model uncertainty in the parameters as well as in the latent variables
  - Bayesian estimation at a computational cost that is essentially the same as EM
  - Thus, it often gives us the speed benefits of ML or MAP estimation but the statistical benefits of the Bayesian approach

SLIDE 40

Variational Bayes learning

ln P(𝒟) = ln βˆ‘_β„‹ ∫ P(𝒟, β„‹ | ΞΈ) P(ΞΈ) dΞΈ

ln P(𝒟) β‰₯ βˆ‘_β„‹ ∫ Q(β„‹, ΞΈ) ln [ P(𝒟, β„‹, ΞΈ) / Q(β„‹, ΞΈ) ] dΞΈ

Mean field:  Q(β„‹, ΞΈ) = Q_β„‹(β„‹) Q_ΞΈ(ΞΈ)    (here Z = β„‹ βˆͺ {ΞΈ} and X = 𝒟)

ln P(𝒟) β‰₯ βˆ‘_β„‹ ∫ Q_β„‹(β„‹) Q_ΞΈ(ΞΈ) ln [ P(𝒟, β„‹, ΞΈ) / (Q_β„‹(β„‹) Q_ΞΈ(ΞΈ)) ] dΞΈ
        = βˆ‘_β„‹ ∫ Q_β„‹(β„‹) Q_ΞΈ(ΞΈ) ln P(𝒟, β„‹, ΞΈ) dΞΈ + H[Q_β„‹] + H[Q_ΞΈ]  ≑  F_𝒟(P, Q)

SLIDE 41

Variational Bayes learning

ln P(𝒟) = F_𝒟(P, Q) + KL( Q(β„‹, ΞΈ) β€– P(β„‹, ΞΈ | 𝒟) )

- We want to find Q* = argmax_Q F_𝒟(P, Q)
- We assume the factorization Q(β„‹, ΞΈ) = Q_β„‹(β„‹) Q_ΞΈ(ΞΈ) and use block coordinate ascent to optimize the above problem

SLIDE 42

Mean Field VB (VBEM)

- Initialization: randomly select a starting distribution Q_ΞΈ^(1)
- Repeat
  - E-step: given the parameter distribution, find the posterior of the hidden data
    Q_β„‹^(t+1) = argmax_{Q_β„‹} F_𝒟(P, Q_β„‹, Q_ΞΈ^(t))
  - M-step: given the posterior distributions, find the likely parameters
    Q_ΞΈ^(t+1) = argmax_{Q_ΞΈ} F_𝒟(P, Q_β„‹^(t+1), Q_ΞΈ)
- Until convergence

F_𝒟(P, Q_β„‹, Q_ΞΈ) = βˆ‘_β„‹ ∫ Q_β„‹(β„‹) Q_ΞΈ(ΞΈ) ln P(𝒟, β„‹, ΞΈ) dΞΈ + H[Q_β„‹] + H[Q_ΞΈ]

SLIDE 43

Local computation of the ELBO for a factorized P

P(x) = (1/Z) ∏_{aβˆˆβ„±} f_a(x_a)

KL(Qβ€–P) = βˆ’H(Q) βˆ’ E_Q[ log (1/Z) ∏_{aβˆˆβ„±} f_a(x_a) ]
        = βˆ’H(Q) βˆ’ log (1/Z) βˆ’ βˆ‘_{aβˆˆβ„±} E_Q[ log f_a(x_a) ]
        = log Z βˆ’ H(Q) βˆ’ βˆ‘_{aβˆˆβ„±} E_Q[ log f_a(x_a) ]

β„’(Q) = H(Q) + βˆ‘_{aβˆˆβ„±} E_Q[ log f_a(x_a) ]

SLIDE 44

NaΓ―ve mean field for a factorized P

NaΓ―ve mean field (i.e., fully factored distribution Q):

Q(x) = ∏_{j=1}^{n} Q_j(x_j)

β„’(Q) = βˆ‘_{aβˆˆβ„±} E_Q[ log f_a(x_a) ] + H(Q)

E_Q[ log f_a(x_a) ] = βˆ‘_{x_a ∈ Val(X_a)} ( ∏_{jβˆˆπ’©(a)} Q_j(x_j) ) log f_a(x_a)
H(Q) = βˆ‘_{j=1}^{n} H[Q_j]

- Thus, β„’(Q) can be rewritten simply as a sum of expectations, each one over a small set of variables

𝒩(a) = { j | X_j ∈ scope(f_a) }

SLIDE 45

Stationary point (fixed-point equations)

Q_j(x_j) = (1/Z_j) exp( βˆ‘_{a: jβˆˆπ’©(a)} βˆ‘_{x_a ∈ Val(X_a)} Q(x_a | x_j) log f_a(x_a) ),
where Q(x_a | x_j) = ∏_{kβˆˆπ’©(a), kβ‰ j} Q_k(x_k).

- Proof:
  β„’(Q) = βˆ‘_{j=1}^{n} β„’_j(Q),
  β„’_j(Q) = βˆ‘_{a: jβˆˆπ’©(a)} βˆ‘_{x_a} ( ∏_{kβˆˆπ’©(a)} Q_k(x_k) ) log f_a(x_a) + H[Q_j]

  L_j(Q, Ξ») = β„’_j(Q) + Ξ»_j ( βˆ‘_{x_j ∈ Val(X_j)} Q_j(x_j) βˆ’ 1 )

  βˆ‚L_j / βˆ‚Q_j(x_j) = 0  β‡’  Q_j(x_j) = e^{Ξ»_j βˆ’ 1} exp( βˆ‘_{a: X_jβˆˆπ’©(a)} βˆ‘_{x_a} Q(x_a | x_j) log f_a(x_a) )

Update rule: we can optimize each Q_j given the values of all the others.

SLIDE 46

Optimization by coordinate ascent for a factorized P

Q_j(x_j) = (1/Z_j) exp( βˆ‘_{a: X_jβˆˆπ’©(a)} βˆ‘_{x_a} Q(x_a | x_j) log f_a(x_a) )

- The coordinate ascent algorithm repeatedly optimizes a single marginal at a time, given fixed choices for all of the others.

  While not converged:
    Iterate over each of the variables j ∈ 𝒱
      Maximize the objective function with respect to Q_j(x_j), for all x_j ∈ Val(X_j), by the above formula.

All the terms on the right-hand side involve expectations over variables other than X_j and do not depend on the choice of Q_j(X_j) (block coordinate ascent).

SLIDE 47

Convergence properties

- β„’_j is concave in Q_j(X_j)
  - The update of Q_j is guaranteed to increase (or at least not decrease) β„’
- Mean field iterations are guaranteed to converge
  - Each step of the coordinate ascent procedure is monotonically non-decreasing in β„’
  - Because β„’ is bounded, the sequence of distributions produced by successive mean-field iterations must converge
- At the convergence point, the fixed-point equations hold for all variables
  - As a consequence, the convergence point is a stationary point of the energy functional subject to the constraints
- The result of the mean field approximation is a local maximum, and not necessarily a global one

SLIDE 48

Local computation in naΓ―ve mean field

- When updating Q_j, we only need to reason about the variables that share a factor with X_j
  - The expectations required to evaluate Q_j involve only those variables lying in the Markov blanket of node j
  - The other terms get absorbed into the constant term
- The optimization of Q_j can therefore be expressed as a local computation at the node

SLIDE 49

Variational methods: two perspectives

- Each algorithm can be explained from two perspectives:
  - Constrained optimization
  - Message-passing algorithm
    - as one way of solving the optimization problem

SLIDE 50

Example: Mean field for pairwise MRFs

P(x) = (1/Z) exp( βˆ‘_{(j,k)βˆˆβ„°} ΞΈ_{jk}(x_j, x_k) + βˆ‘_{jβˆˆπ’±} ΞΈ_j(x_j) )

Q* = argmax_{Q ∈ 𝒬} β„’(Q)

Subject to:
  Q(x) = ∏_{j=1}^{n} Q_j(x_j)
  βˆ‘_{x_j ∈ Val(X_j)} Q_j(x_j) = 1

SLIDE 51

Example: Mean field for pairwise MRFs

- P: pairwise MRF

P(x) = (1/Z) ∏_{(j,k)βˆˆβ„°} Ο•_{jk}(x_j, x_k) ∏_{jβˆˆπ’±} Ο•_j(x_j)
P(x) = (1/Z) exp( βˆ‘_{(j,k)βˆˆβ„°} ΞΈ_{jk}(x_j, x_k) + βˆ‘_{jβˆˆπ’±} ΞΈ_j(x_j) ),   ΞΈ_j = ln Ο•_j,  ΞΈ_{jk} = ln Ο•_{jk}

Mean field update:

Q_j(x_j) = (1/Z_j) exp( ΞΈ_j(x_j) + βˆ‘_{kβˆˆπ’©(j)} βˆ‘_{x_k} Q_k(x_k) ΞΈ_{jk}(x_j, x_k) )

β‡’ Q_j(x_j) ∝ Ο•_j(x_j) ∏_{kβˆˆπ’©(j)} m_{kβ†’j}(x_j),   where  m_{kβ†’j}(x_j) ∝ exp( βˆ‘_{x_k} Q_k(x_k) ΞΈ_{jk}(x_j, x_k) )

SLIDE 52

Message passing: Mean field vs. BP for pairwise MRFs

P(x) = (1/Z) ∏_{(j,k)βˆˆβ„°} Ο•_{jk}(x_j, x_k) ∏_{jβˆˆπ’±} Ο•_j(x_j)

- Mean field:
  Q_j(x_j) ∝ Ο•_j(x_j) ∏_{kβˆˆπ’©(j)} m_{kβ†’j}(x_j)
  m_{kβ†’j}(x_j) ∝ exp( βˆ‘_{x_k} Q_k(x_k) ΞΈ_{jk}(x_j, x_k) ),    ΞΈ_{jk} = ln Ο•_{jk}

- Belief propagation (sum-product):
  b_j(x_j) ∝ Ο•_j(x_j) ∏_{kβˆˆπ’©(j)} m_{kβ†’j}(x_j)
  m_{jβ†’k}(x_k) ∝ βˆ‘_{x_j} Ο•_j(x_j) Ο•_{jk}(x_j, x_k) ∏_{lβˆˆπ’©(j)\k} m_{lβ†’j}(x_j)

SLIDE 53

Variational message passing

- Mean field methods are all very similar
  - just compute each node's full conditional, and average out the neighbors

P(x) = ∏_j P(x_j | pa_j)
ln Q(x_j) = E_{Q_{βˆ’j}}[ βˆ‘_{i ∈ {j} βˆͺ Ch(j)} ln P(x_i | pa_i) ] + const

- It is possible to derive a general-purpose set of update equations that work for any DGM in which all CPDs are in the exponential family and all parent nodes have conjugate distributions
  - Updating nodes one at a time
  - Updating posterior beliefs using local operations at each node
  - Each update increases a lower bound on the log evidence (unless it is already at a local maximum)

SLIDE 54

Structured variational methods

- Mean field
  - NaΓ―ve mean field
  - Structured mean field

SLIDE 55

Structured mean field

- NaΓ―ve mean field can lead to very poor approximations
  - We must use a richer class of distributions 𝒬, which has greater expressive power (by capturing some of the dependencies in P)
- Use network structures of different complexity
  - A subgraph of the network of P over which exact computation of H[Q] is feasible
  - Example: for a grid network, a collection of independent chain structures
    - Exact inference with such structures is linear

SLIDE 56

Structured stationary point

P(x) = (1/Z) ∏_{k=1}^{K} Ο•_k(x_k)        Q(x) = (1/Z_Q) ∏_{j=1}^{J} ψ_j(x_j)

F(P, Q) = βˆ‘_{k=1}^{K} E_Q[ ln Ο•_k(x_k) ] βˆ’ E_Q[ ln Q ]
F(P, Q) = βˆ‘_{k=1}^{K} E_Q[ ln Ο•_k(x_k) ] βˆ’ βˆ‘_{j=1}^{J} E_Q[ ln ψ_j(x_j) ] + ln Z_Q

SLIDE 57

Structured stationary point

- ψ_j is a stationary point of the energy functional iff:

ψ_j(x_j) ∝ exp( E_Q[ log P(x) | x_j ] βˆ’ βˆ‘_{lβ‰ j} E_Q[ log ψ_l(x_l) | x_j ] )
ψ_j(x_j) ∝ exp( βˆ‘_k E_Q[ log Ο•_k(x_k) | x_j ] βˆ’ βˆ‘_{lβ‰ j} E_Q[ log ψ_l(x_l) | x_j ] )

- We need to perform inference in Q after each update step
- ψ_j(x_j) does not affect the right-hand side of the fixed-point equation defining its value

SLIDE 58

Structured mean-field quality

- Both the quality and the computational complexity of the variational approximation depend on the structures of P and Q
- We want to be able to perform efficient inference in the approximating network
  - We often select the network so that the resulting factorization leads to a tractable network (that is, one of low tree-width)

SLIDE 59

Loopy Belief Propagation (LBP)

- A fixed-point iteration procedure that tries to minimize an approximation of F(P, Q)
  - Start with an initialization of all messages to one
  - While not converged, keep passing the messages below
- At convergence, stationarity properties are guaranteed
- LBP does not always converge, and even when it does, it may converge to the wrong answers

m^{new}_{jβ†’a}(x_j) = ∏_{b βˆˆ 𝒩(j)\a} m_{bβ†’j}(x_j)
m^{new}_{aβ†’j}(x_j) = βˆ‘_{x_{𝒩(a)\j}} f_a(x_{𝒩(a)}) ∏_{k βˆˆ 𝒩(a)\j} m_{kβ†’a}(x_k)

SLIDE 60

Recall: Beliefs and messages in a factor tree

[Figure: a variable node X_j connected to factor nodes f_a and f_b, with messages m_{jβ†’a}(x_j), m_{bβ†’j}(x_j), and m_{kβ†’a}(x_k) from the other variables k βˆˆ 𝒩(a)\j]

m_{jβ†’a}(x_j) = ∏_{b βˆˆ 𝒩(j)\a} m_{bβ†’j}(x_j)
m_{aβ†’j}(x_j) = βˆ‘_{x_{𝒩(a)\j}} f_a(x_{𝒩(a)}) ∏_{k βˆˆ 𝒩(a)\j} m_{kβ†’a}(x_k)

b_j(x_j) ∝ ∏_{a βˆˆ 𝒩(j)} m_{aβ†’j}(x_j)
b_a(x_{𝒩(a)}) ∝ f_a(x_{𝒩(a)}) ∏_{j βˆˆ 𝒩(a)} m_{jβ†’a}(x_j)

SLIDE 61

LBP

- If BP is used on graphs with loops, messages may circulate indefinitely
  - But we can run it anyway and hope for the best
- Stop message passing when
  - a fixed number of iterations is reached,
  - or when no significant change in the beliefs occurs
- Empirically, a good approximation is often achievable
  - If the solution is not oscillatory but converges, it is usually a good approximation

SLIDE 62

LBP as a relaxation method

- Loopy Belief Propagation (LBP) optimizes approximate versions of the energy functional
  - We approximate F[P, Q] with F_Bethe[P, Q]
  - It works directly with pseudo-marginals, which may not be consistent with any joint distribution
- The fixed-point equations derived from the constrained energy minimization can be viewed as message passing over a graph

SLIDE 63

Bethe approximation

- Pros:
  - An objective function F_Bethe[P, Q] that is easier to compute and optimize
- Cons:
  - It may or may not be well connected to F[P, Q]
  - It could, in general, be greater than, equal to, or less than F[P, Q]
- Optimize over the beliefs b(x_a):
  - For discrete beliefs, constrained optimization with Lagrange multipliers
  - For continuous beliefs, there is not yet a general formula
  - Does not always converge

SLIDE 64

LBP message-update rules

m_{jβ†’a}(x_j) = ∏_{b βˆˆ 𝒩(j)\a} m_{bβ†’j}(x_j)
m_{aβ†’j}(x_j) = βˆ‘_{x_{𝒩(a)\j}} f_a(x_{𝒩(a)}) ∏_{k βˆˆ 𝒩(a)\j} m_{kβ†’a}(x_k)

b_j(x_j) ∝ f_j(x_j) ∏_{a βˆˆ 𝒩(j)} m_{aβ†’j}(x_j)
b_a(x_a) ∝ f_a(x_a) ∏_{j βˆˆ 𝒩(a)} ∏_{c βˆˆ 𝒩(j)\a} m_{cβ†’j}(x_j)

- The Bethe approximation is equal to BP on the factor graph
- Each message can be defined in terms of the other messages, allowing an easy iterative algorithm for solving the fixed-point equations
- A "belief" is the approximation of a marginal probability

SLIDE 65

Inference on trees: variational perspective

- For trees, a sequence of message propagations calibrates the tree in two passes
  - The propagation process converges, and additional message passing does not change the beliefs
- For general graphs, the process may not converge
  - Information from one pass on loopy graphs will circulate and affect the next round
  - Beliefs are not necessarily the marginal probabilities in P
- Empirically, a good approximation is still achievable
  - If the solution is not oscillatory but converges, it is usually a good approximation
  - As cycles grow long, BP becomes exact (e.g., in coding applications)

SLIDE 66

References

- C.M. Bishop, "Pattern Recognition and Machine Learning", Chapters 10.1–10.4.
- D. Koller and N. Friedman, "Probabilistic Graphical Models: Principles and Techniques", Chapters 11.1–11.3, 11.5, 11.6.

SLIDE 67

Some optional extra slides on the theory of LBP, for a brief look at it

SLIDE 68

Theory behind LBP

- LBP relaxes the following optimization problem:

  Q* = argmax_{Q ∈ β„³} E_Q[ log P̃ ] + H(Q)

- and uses the following optimization problem instead:

  Q* = argmax_{Q ∈ 𝒬} F_Bethe(P, Q)

  F_Bethe(P, Q) = βˆ’ βˆ‘_{aβˆˆβ„±} βˆ‘_{x_a} b_a(x_a) log [ b_a(x_a) / f_a(x_a) ] + βˆ‘_j (d_j βˆ’ 1) βˆ‘_{x_j} b_j(x_j) log b_j(x_j)
                = βˆ‘_{aβˆˆβ„±} E_{b_a}[ log f_a(x_a) ] + H_Bethe

𝒬: a family of (tractable) probability distributions
x_a ≑ { x_j | j ∈ scope(f_a) },    P(x) = (1/Z) P̃(x)

SLIDE 69

Theory behind LBP

- Indeed, we do not optimize Q explicitly; we focus on the set of beliefs on factors and variables, b = { b_a(x_a), b_j(x_j) }:

  b* = argmax_{b ∈ β„³_L} E_b[ log P̃ ] + H(b)

- approximate objective:  F_Bethe = E_b[ log P̃ ] + H_Bethe(b)
- relaxed feasible set:
  β„³_L = { b_j β‰₯ 0, b_a β‰₯ 0 | βˆ‘_{x_j} b_j(x_j) = 1,  βˆ‘_{x_a \ x_j} b_a(x_a) = b_j(x_j) }

- LBP is a fixed-point iteration procedure that tries to find b*

β„³_L: locally consistent pseudo-marginals (a relaxation of the original set of consistency constraints). Both the objective and the constraint space are approximate.

SLIDE 70

Tree energy functional

- Consider a tree-structured distribution
- b_a and b_j denote marginals on factors and variables:

Q(x) = ∏_a b_a(x_a) ∏_j b_j(x_j) / ∏_j b_j(x_j)^{d_j}

H(Q) = βˆ’ βˆ‘_a βˆ‘_{x_a} b_a(x_a) log b_a(x_a) + βˆ‘_j (d_j βˆ’ 1) βˆ‘_{x_j} b_j(x_j) log b_j(x_j)

F(P, Q) = βˆ’ βˆ‘_a βˆ‘_{x_a} b_a(x_a) log [ b_a(x_a) / f_a(x_a) ] + βˆ‘_j (d_j βˆ’ 1) βˆ‘_{x_j} b_j(x_j) log b_j(x_j)

For a general graph, b_a and b_j denote pseudo-marginals.

SLIDE 71

Bethe approximation of F(P, Q) for general graphs

- For a general graph, choose F(P, Q) = F_Bethe(P, Q):

H_Bethe = βˆ’ βˆ‘_a βˆ‘_{x_a} b_a(x_a) log b_a(x_a) + βˆ‘_j (d_j βˆ’ 1) βˆ‘_{x_j} b_j(x_j) log b_j(x_j)

F_Bethe = βˆ’ βˆ‘_a βˆ‘_{x_a} b_a(x_a) log [ b_a(x_a) / f_a(x_a) ] + βˆ‘_j (d_j βˆ’ 1) βˆ‘_{x_j} b_j(x_j) log b_j(x_j)
        = βˆ‘_a E_{b_a}[ log f_a(x_a) ] + H_Bethe

SLIDE 72

Minimizing the Bethe free energy

L(Q, Ξ») = F_Bethe(P, Q) + βˆ‘_j Ξ»_j ( 1 βˆ’ βˆ‘_{x_j} b_j(x_j) ) + βˆ‘_a βˆ‘_{jβˆˆπ’©(a)} βˆ‘_{x_j} Ξ»_{aj}(x_j) ( b_j(x_j) βˆ’ βˆ‘_{x_a \ x_j} b_a(x_a) )

- Stationary points:

βˆ‚L / βˆ‚b_j(x_j) = 0  β‡’  b_j(x_j) ∝ exp( (1/(d_j βˆ’ 1)) βˆ‘_{aβˆˆπ’©(j)} Ξ»_{aj}(x_j) )
βˆ‚L / βˆ‚b_a(x_a) = 0  β‡’  b_a(x_a) ∝ exp( log f_a(x_a) + βˆ‘_{jβˆˆπ’©(a)} Ξ»_{aj}(x_j) )

SLIDE 73

Fixed-point equations

b_j(x_j) = exp(βˆ’1) exp( (1/(d_j βˆ’ 1)) βˆ‘_{aβˆˆπ’©(j)} Ξ»_{aj}(x_j) )
b_a(x_a) = exp(βˆ’1 + Ξ»_j) exp( log f_a(x_a) + βˆ‘_{jβˆˆπ’©(a)} Ξ»_{aj}(x_j) )

Define the messages  m_{jβ†’a}(x_j) ≑ exp( Ξ»_{aj}(x_j) ).

- Using the consistency constraint b_j(x_j) = βˆ‘_{x_a \ x_j} b_a(x_a), we obtain:

m_{aβ†’j}(x_j) = βˆ‘_{x_a \ x_j} f_a(x_a) ∏_{kβˆˆπ’©(a)\j} ∏_{cβˆˆπ’©(k)\a} m_{cβ†’k}(x_k)