Slide 1

Divergence measures and message passing

Tom Minka Microsoft Research Cambridge, UK

with thanks to the Machine Learning and Perception Group

Slide 2

Message-Passing Algorithms

  • PEP: Power EP [Minka 04]
  • FBP: Fractional belief propagation [Wiegerinck,Heskes 02]
  • TRW: Tree-reweighted message passing [Wainwright,Jaakkola,Willsky 03]
  • EP: Expectation propagation [Minka 01]
  • BP: Loopy belief propagation [Frey,MacKay 97]
  • MF: Mean-field [Peterson,Anderson 87]

Slide 3

Outline

  • Example of message passing
  • Interpreting message passing
  • Divergence measures
  • Message passing from a divergence measure
  • Big picture
Slide 4

Outline

  • Example of message passing
  • Interpreting message passing
  • Divergence measures
  • Message passing from a divergence measure
  • Big picture
Slide 5

Estimation Problem

(Figure: a factor graph over variables x, y, z with factors a, b, c, d, e, f.)

Slide 6

Estimation Problem

(Figure: the same factor graph, now with observed values (1) and unknowns (?) attached to the variables.)

Slide 7

Estimation Problem

(Figure: the variables x, y, z with their connecting factors.)

Slide 8

Estimation Problem

Queries: the marginals, the normalizing constant, and the argmax. We want to answer these quickly.

Slide 9

Belief Propagation

(Figure: messages flowing between x, y, and z.)

Slide 10

Belief Propagation

(Figure: the final messages on the x, y, z graph.)

Slide 11

Belief Propagation

  • Marginals: exact vs. BP values (shown in the slide's figure)
  • Normalizing constant: 0.45 (exact) vs. 0.44 (BP)
  • Argmax: (0,0,0) (exact) vs. (0,0,0) (BP)

Slide 12

Outline

  • Example of message passing
  • Interpreting message passing
  • Divergence measures
  • Message passing from a divergence measure
  • Big picture
Slide 13

Message Passing = Distributed Optimization

  • Messages represent a simpler distribution q(x) that approximates p(x)
    – a distributed representation
  • Message passing = optimizing q to fit p
    – q stands in for p when answering queries
  • Parameters:
    – what type of distribution to construct (approximating family)
    – what cost to minimize (divergence measure)

Slide 14

How to make a message-passing algorithm

  1. Pick an approximating family (fully factorized, Gaussian, etc.)
  2. Pick a divergence measure
  3. Construct an optimizer for that measure, usually a fixed-point iteration (a minimal building block is sketched below)
  4. Distribute the optimization across factors
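To make step 3 concrete, here is a minimal sketch (mine, not from the talk) of the basic optimizer building block: projecting a tabulated, unnormalized distribution onto the Gaussian family. For KL(p||q), i.e. α = 1, the exact projection is moment matching; the grid and names are illustrative.

```python
import numpy as np

def project_to_gaussian(x, p):
    """Moment-matching projection of an unnormalized density p, tabulated
    on the uniform grid x, onto the Gaussian family. Moment matching is
    the exact minimizer of KL(p||q) over Gaussians (the alpha = 1 case)."""
    dx = x[1] - x[0]
    mass = p.sum() * dx                            # estimate of the normalizing constant
    mean = (x * p).sum() * dx / mass               # first moment
    var = ((x - mean) ** 2 * p).sum() * dx / mass  # central second moment
    return mass, mean, var

# Example: project a bimodal density onto a single Gaussian.
x = np.linspace(-10.0, 10.0, 2001)
p = np.exp(-0.5 * (x + 2) ** 2) + 0.5 * np.exp(-0.5 * (x - 3) ** 2 / 0.25)
print(project_to_gaussian(x, p))
```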
Slide 15

Outline

  • Example of message passing
  • Interpreting message passing
  • Divergence measures
  • Message passing from a divergence measure
  • Big picture
Slide 16

Kullback-Leibler (KL) divergence, for unnormalized distributions p and q:

  KL(p || q) = ∫ p(x) log( p(x) / q(x) ) dx + ∫ ( q(x) − p(x) ) dx

Alpha-divergence, for any real number α:

  Dα(p || q) = ( ∫ [ α p(x) + (1−α) q(x) − p(x)^α q(x)^(1−α) ] dx ) / ( α (1−α) )

Both divergences are asymmetric in (p, q) and convex.
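As a numerical companion to these definitions, here is a small sketch (my own, not from the talk) that evaluates Dα on a grid; the α → 0 and α → 1 limits, which recover KL(q||p) and KL(p||q), are handled explicitly:

```python
import numpy as np

def alpha_divergence(p, q, alpha, dx):
    """D_alpha(p||q) for unnormalized densities p, q tabulated on a uniform
    grid with spacing dx. The limits alpha -> 0 and alpha -> 1 are
    KL(q||p) and KL(p||q) respectively."""
    if alpha == 0.0:                      # limit: KL(q||p)
        return np.sum(q * np.log(q / p) + p - q) * dx
    if alpha == 1.0:                      # limit: KL(p||q)
        return np.sum(p * np.log(p / q) + q - p) * dx
    integrand = alpha * p + (1 - alpha) * q - p**alpha * q**(1 - alpha)
    return np.sum(integrand) * dx / (alpha * (1 - alpha))

# Sanity check: D_alpha approaches KL(p||q) as alpha -> 1.
x = np.linspace(-8.0, 8.0, 4001)
dx = x[1] - x[0]
p = np.exp(-0.5 * x**2)
q = 0.7 * np.exp(-0.5 * (x - 1) ** 2 / 4)
print(alpha_divergence(p, q, 0.999, dx), alpha_divergence(p, q, 1.0, dx))
```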

Slide 17

Examples of alpha-divergence
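The special cases presumably shown here are the standard ones (they appear in Minka's accompanying tech report); in LaTeX:

```latex
\begin{align*}
\lim_{\alpha \to 0} D_\alpha(p \,\|\, q) &= \mathrm{KL}(q \,\|\, p) \\
\lim_{\alpha \to 1} D_\alpha(p \,\|\, q) &= \mathrm{KL}(p \,\|\, q) \\
D_{1/2}(p \,\|\, q) &= 2 \int \bigl(\sqrt{p(x)} - \sqrt{q(x)}\bigr)^{2}\, dx
   && \text{(Hellinger distance)} \\
D_{2}(p \,\|\, q) &= \tfrac{1}{2} \int \frac{(p(x) - q(x))^{2}}{q(x)}\, dx
   && \text{($\chi^2$ divergence)}
\end{align*}
```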

Slide 18

Minimum alpha-divergence

(Figure: the Gaussian q that minimizes Dα(p||q), shown for α = −∞.)

Slide 19

Minimum alpha-divergence

(Figure: the Gaussian q that minimizes Dα(p||q), shown for α = 0.)

Slide 20

Minimum alpha-divergence

(Figure: the Gaussian q that minimizes Dα(p||q), shown for α = 0.5.)

Slide 21

Minimum alpha-divergence

(Figure: the Gaussian q that minimizes Dα(p||q), shown for α = 1.)

Slide 22

Minimum alpha-divergence

(Figure: the Gaussian q that minimizes Dα(p||q), shown for α = ∞.)

Slide 23

Properties of alpha-divergence

  • α ≤ 0 seeks the mode with the largest mass (not the tallest)
    – zero-forcing: p(x) = 0 forces q(x) = 0
    – underestimates the support of p
  • α ≥ 1 stretches to cover everything
    – inclusive: p(x) > 0 forces q(x) > 0
    – overestimates the support of p

[Frey,Patrascu,Jaakkola,Moran 00]
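The preceding figures can be reproduced numerically. Here is a brute-force sketch under my own choice of bimodal target (the talk's exact p is not given; SciPy is assumed available). As α grows, the fitted Gaussian should switch from hugging one mode to covering both:

```python
import numpy as np
from scipy.optimize import minimize

# Bimodal target: a wide bump and a narrower, taller bump.
x = np.linspace(-12.0, 12.0, 4001)
dx = x[1] - x[0]
p = np.exp(-0.5 * (x + 3) ** 2) + 2.0 * np.exp(-0.5 * (x - 3) ** 2 / 0.5)

def gauss(mass, mean, var):
    return mass * np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def d_alpha(q, alpha):
    integrand = alpha * p + (1 - alpha) * q - p**alpha * q**(1 - alpha)
    return np.sum(integrand) * dx / (alpha * (1 - alpha))

# Parametrize by (log mass, mean, log var) so the search is unconstrained.
for alpha in [-1.0, 0.5, 0.99]:   # avoid the exact alpha = 0, 1 limit cases
    t = minimize(lambda t: d_alpha(gauss(np.exp(t[0]), t[1], np.exp(t[2])), alpha),
                 x0=[0.0, 0.0, 0.0], method="Nelder-Mead").x
    print(f"alpha = {alpha:5.2f}: mean = {t[1]:6.2f}, var = {np.exp(t[2]):6.2f}")
```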

Slide 24

Structure of alpha space

(Figure: the α axis. Zero-forcing behavior lies at α ≤ 0; inclusive, zero-avoiding behavior at α ≥ 1. MF sits at α = 0; BP and EP at α = 1; TRW uses α > 1; FBP and PEP cover arbitrary α.)

Slide 25

Other properties

  • If q is an exact minimum of the alpha-divergence, the mass of q estimates p's normalizing constant: ∫ q(x) dx = ∫ p(x)^α q(x)^(1−α) dx
  • If α = 1: a Gaussian q matches the mean and variance of p
    – a fully factorized q matches the marginals of p
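A short derivation of the normalizing-constant property (my reconstruction; it matches the fixed-point condition in Minka's tech report): scale q as q = c·r with the shape r held fixed, and set the derivative of Dα with respect to c to zero:

```latex
\frac{\partial}{\partial c} D_\alpha(p \,\|\, c\,r)
 = \frac{(1-\alpha)\left[\int r(x)\,dx - c^{-\alpha}\int p(x)^{\alpha} r(x)^{1-\alpha}\,dx\right]}{\alpha(1-\alpha)} = 0
\;\Longrightarrow\;
\int q(x)\,dx = \int p(x)^{\alpha}\, q(x)^{1-\alpha}\,dx .
```

At α = 1 the right-hand side is ∫ p(x) dx, so the mass of q reproduces the normalizing constant of p exactly; for other α it gives an estimate.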

Slide 26

Two-node example

  • q is fully factorized and minimizes the α-divergence to p
  • q has correct marginals only for α = 1 (BP)

(Figure: a two-node graph, x connected to y. A numerical check follows below.)
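A small numerical check of this claim (my own construction; the slide's actual p is not shown): brute-force the best normalized factorized q for a strongly correlated two-node binary p and compare marginals.

```python
import numpy as np
from itertools import product

# Strongly correlated two-node binary distribution (illustrative values).
p = np.array([[0.45, 0.05],
              [0.05, 0.45]])

def d_alpha(p, q, alpha):
    """Alpha-divergence for discrete p, q, with the alpha -> 0 and
    alpha -> 1 KL limits handled explicitly."""
    if alpha == 0.0:
        return np.sum(q * np.log(q / p) + p - q)
    if alpha == 1.0:
        return np.sum(p * np.log(p / q) + q - p)
    s = np.sum(alpha * p + (1 - alpha) * q - p**alpha * q**(1 - alpha))
    return s / (alpha * (1 - alpha))

def best_factorized(alpha, grid=199):
    """Brute-force the best normalized factorized q(x,y) = qx(x) qy(y)."""
    ts = np.linspace(0.005, 0.995, grid)
    best = (np.inf, None)
    for a, b in product(ts, ts):          # a = q(x=0), b = q(y=0)
        q = np.outer([a, 1 - a], [b, 1 - b])
        best = min(best, (d_alpha(p, q, alpha), (a, b)))
    return best

for alpha in [0.0, 0.5, 1.0]:
    d, (a, b) = best_factorized(alpha)
    print(f"alpha={alpha}: q(x=0)={a:.2f}  (true marginal p(x=0)=0.50)")
```

With this p, α = 1 recovers the exact marginal 0.50, while α = 0 (MF) breaks symmetry toward one mode and reports a biased marginal.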

Slide 27

Two-node example

(Figure: fully factorized fits to a bimodal distribution for α = 1 (BP), α = 0 (MF), and α ≤ 0.5, each annotated with what it gets right ("good") and wrong ("bad") among: marginals, mass, zeros, capturing one peak, and peak heights. Matching the properties above, the α = 1 fit is good on marginals and mass but poor on zeros, while the small-α fits capture one peak and its zeros well but misestimate marginals and mass.)

Slide 28

Two-node example

(Figure: the fit for α = ∞ on the same bimodal distribution, annotated with its good and bad properties among zeros, marginals, and peak heights. As the most inclusive choice, it stretches to cover the entire support of p.)

Slide 29

Lessons

  • Neither method is inherently superior; it depends on what you care about
  • A factorized approximation does not imply matching marginals (that holds only for α = 1)
  • Adding y to the problem can change the estimated marginal for x (even though the true marginal of x is unchanged)

Slide 30

Outline

  • Example of message passing
  • Interpreting message passing
  • Divergence measures
  • Message passing from a divergence measure
  • Big picture
Slide 31

Distributed divergence minimization

Slide 32

Distributed divergence minimization

  • Write p as a product of factors: p(x) = ∏a fa(x)
  • Approximate the factors one by one: fa(x) → f̃a(x)
  • Multiply the approximate factors to get the approximation: q(x) = ∏a f̃a(x)

Slide 33

Global divergence to local divergence

  • Global divergence: D( ∏a fa(x) || ∏a f̃a(x) )
  • Local divergence: D( fa(x) q\a(x) || f̃a(x) q\a(x) ), where q\a(x) = ∏b≠a f̃b(x) is the product of all other approximate factors
Slide 34

Message passing

  • Messages are passed between factors
  • Messages are factor approximations: the message from factor a is f̃a
  • Factor a receives the other factors' messages, which define q\a:
    – minimize the local divergence to get a new f̃a
    – send f̃a to the other factors
    – repeat until convergence
  • This recipe produces all 6 algorithms (a schematic sketch follows below)
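A schematic of this loop in code (my paraphrase, not the talk's implementation): with a fully factorized family, discrete variables, and the α = 1 local projection, the recipe below reduces to loopy BP. All function and variable names are illustrative.

```python
import numpy as np

def message_passing(factors, n_vars, n_states, iters=50):
    """Fully factorized message passing on a discrete factor graph, phrased
    as repeated local divergence minimization with alpha = 1 (this choice
    reproduces loopy BP). `factors` is a list of (vars, table) pairs."""
    # msg[(a, i)]: factor a's current approximate factor on variable i
    msg = {(a, i): np.ones(n_states)
           for a, (vs, _) in enumerate(factors) for i in vs}

    def context(a, j):
        # Product of the other factors' messages into variable j:
        # the restriction of q_{-a} to variable j.
        out = np.ones(n_states)
        for b, (ws, _) in enumerate(factors):
            if b != a and j in ws:
                out = out * msg[(b, j)]
        return out

    for _ in range(iters):
        for a, (vs, table) in enumerate(factors):
            # Tilted distribution f_a(x) * q_{-a}(x) on this factor's variables.
            ctx = [context(a, j) for j in vs]
            tilted = table.astype(float)
            for ax in range(len(vs)):
                shape = [1] * len(vs)
                shape[ax] = n_states
                tilted = tilted * ctx[ax].reshape(shape)
            for ax, i in enumerate(vs):
                # alpha = 1 local projection: match the marginal on variable i,
                # then divide out the context to get the new message.
                other_axes = tuple(k for k in range(len(vs)) if k != ax)
                new = tilted.sum(axis=other_axes) / ctx[ax]
                msg[(a, i)] = new / new.sum()   # normalize for numerical stability

    # Belief for each variable: product of all incoming messages, normalized.
    beliefs = []
    for i in range(n_vars):
        b = np.ones(n_states)
        for a, (vs, _) in enumerate(factors):
            if i in vs:
                b = b * msg[(a, i)]
        beliefs.append(b / b.sum())
    return beliefs

# Example: three binary variables in a loop with attractive pairwise factors.
pair = np.array([[2.0, 1.0],
                 [1.0, 2.0]])
factors = [((0, 1), pair), ((1, 2), pair), ((0, 2), pair)]
for i, b in enumerate(message_passing(factors, n_vars=3, n_states=2)):
    print(f"q(x{i}) = {b}")
```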
Slide 35

Global divergence vs. local divergence

In general, local ≠ global:

  • but the results are similar
  • BP doesn't minimize the global KL, but comes close

(Figure: along the α axis, MF (α = 0) is the one case where local = global, so there is no loss from message passing; for other α, local ≠ global.)

Slide 36

Experiment

  • Which message-passing algorithm is best at minimizing the global Dα(p||q)?
  • Procedure:
    1. Run FBP with various values of the local αL
    2. Compute the global divergence DαG for various αG
    3. Find the best αL (i.e., the best algorithm) for each αG
Slide 37

Results

  • Average over 20 graphs with random singleton and pairwise potentials
  • Mixed potentials (~ (−1, 1)):
    – best αL = αG (the local divergence should match the global one)
    – FBP with the same α is best at minimizing Dα
  • BP is best at minimizing KL(p||q)
Slide 38

Outline

  • Example of message passing
  • Interpreting message passing
  • Divergence measures
  • Message passing from a divergence measure
  • Big picture
Slide 39

Hierarchy of algorithms

  • MF: fully factorized, KL(q||p)
  • Structured MF: exp family, KL(q||p)
  • BP: fully factorized, KL(p||q)
  • EP: exp family, KL(p||q)
  • FBP: fully factorized, Dα(p||q)
  • TRW: fully factorized, Dα(p||q) with α > 1
  • Power EP: exp family, Dα(p||q)
Slide 40

Matrix of algorithms

  divergence measure       fully factorized          exp family
  KL(q||p)                 MF                        Structured MF
  KL(p||q)                 BP                        EP
  Dα(p||q)                 FBP (TRW for α > 1)       Power EP

Open questions on the slide: other approximating families? (e.g., mixtures) Other divergences?

Slide 41

Other Message Passing Algorithms

Do they correspond to divergence measures?

  • Generalized belief propagation

[Yedidia,Freeman,Weiss 00]

  • Iterated conditional modes [Besag 86]
  • Max-product belief revision
  • TRW-max-product [Wainwright,Jaakkola,Willsky 02]
  • Laplace propagation [Smola,Vishwanathan,Eskin 03]
  • Penniless propagation [Cano,Moral,Salmerón 00]
  • Bound propagation [Leisink,Kappen 03]
Slide 42

Future work

  • Understand existing message-passing algorithms
  • Understand local vs. global divergence
  • New message-passing algorithms:
    – specialized divergence measures
    – richer approximating families
  • Other ways to minimize divergence