Divergence measures and message passing
Tom Minka, Microsoft Research Cambridge, UK


  1. Divergence measures and message passing Tom Minka Microsoft Research Cambridge, UK with thanks to the Machine Learning and Perception Group

  2. Message-Passing Algorithms • MF [Peterson,Anderson 87] Mean-field • BP [Frey,MacKay 97] Loopy belief propagation • EP [Minka 01] Expectation propagation • TRW [Wainwright,Jaakkola,Willsky 03] Tree-reweighted message passing • FBP [Wiegerinck,Heskes 02] Fractional belief propagation • PEP [Minka 04] Power EP

  3. Outline • Example of message passing • Interpreting message passing • Divergence measures • Message passing from a divergence measure • Big picture

  4. Outline • Example of message passing • Interpreting message passing • Divergence measures • Message passing from a divergence measure • Big picture

  5. Estimation Problem [figure: a graphical model over variables a, b, c, d, e, f, x, y, z]

  6. Estimation Problem [figure: the same model with a, b, c, d, e, f observed (values 0 or 1) and x, y, z unknown]

  7. Estimation Problem [figure: the model reduced to the unknown variables x, y, z]

  8. Estimation Problem Queries: the marginals p(x), p(y), p(z), the normalizing constant, and the most probable configuration (argmax), as reported on slide 11. Want to do these quickly
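
On a model this small, all three queries can be answered by brute-force enumeration, which is what message passing is meant to avoid at scale. A minimal Python sketch; the pairwise potentials below are made-up placeholders, since the slides' actual tables are not preserved in this transcript:

    import itertools
    import numpy as np

    # Hypothetical pairwise potentials over binary x, y, z (placeholders,
    # not the values from the slides).
    def f_xy(x, y): return np.exp(0.5 * (2*x - 1) * (2*y - 1))
    def f_yz(y, z): return np.exp(-0.3 * (2*y - 1) * (2*z - 1))
    def f_xz(x, z): return np.exp(0.8 * (2*x - 1) * (2*z - 1))

    def p_unnorm(x, y, z):
        # Unnormalized joint: product of the pairwise factors.
        return f_xy(x, y) * f_yz(y, z) * f_xz(x, z)

    states = list(itertools.product([0, 1], repeat=3))
    Z = sum(p_unnorm(*s) for s in states)               # normalizing constant
    p_x = [sum(p_unnorm(*s) for s in states if s[0] == v) / Z for v in (0, 1)]
    mode = max(states, key=lambda s: p_unnorm(*s))      # argmax configuration
    print("Z =", Z, " p(x) =", p_x, " argmax =", mode)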

  9. Belief Propagation [figure: messages passed along the edges between x, y, z]

  10. Belief Propagation [figure: final beliefs at x, y, z]

  11. Belief Propagation • Marginals: (exact) vs. (BP) [values shown in figure] • Normalizing constant: 0.45 (exact), 0.44 (BP) • Argmax: (0,0,0) (exact), (0,0,0) (BP)

  12. Outline • Example of message passing • Interpreting message passing • Divergence measures • Message passing from a divergence measure • Big picture

  13. Message Passing = Distributed Optimization • Messages represent a simpler distribution q(x) that approximates p(x) – A distributed representation • Message passing = optimizing q to fit p – q stands in for p when answering queries • Parameters: – What type of distribution to construct (approximating family) – What cost to minimize (divergence measure)

  14. How to make a message-passing algorithm 1. Pick an approximating family • fully-factorized, Gaussian, etc. 2. Pick a divergence measure 3. Construct an optimizer for that measure • usually fixed-point iteration 4. Distribute the optimization across factors

  15. Outline • Example of message passing • Interpreting message passing • Divergence measures • Message passing from a divergence measure • Big picture

  16. Let p, q be unnormalized distributions. Kullback-Leibler (KL) divergence: $\mathrm{KL}(p\,\|\,q) = \int p(x)\log\frac{p(x)}{q(x)}\,dx + \int\big(q(x)-p(x)\big)\,dx$. Alpha-divergence (α is any real number): $D_\alpha(p\,\|\,q) = \frac{1}{\alpha(1-\alpha)}\int\big(\alpha\,p(x) + (1-\alpha)\,q(x) - p(x)^\alpha q(x)^{1-\alpha}\big)\,dx$. Both are asymmetric and convex; KL(p||q) and KL(q||p) are the limits α → 1 and α → 0.
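
Both definitions can be checked numerically with simple grid quadrature. A sketch; the two unnormalized Gaussians are arbitrary test inputs, not from the talk:

    import numpy as np

    def kl(p, q, dx):
        # KL(p||q) for unnormalized densities sampled on a grid.
        return np.sum(p * np.log(p / q) + q - p) * dx

    def alpha_div(p, q, alpha, dx):
        # D_alpha(p||q); alpha = 0 and alpha = 1 are the KL limits.
        if alpha == 0:
            return kl(q, p, dx)
        if alpha == 1:
            return kl(p, q, dx)
        integrand = alpha*p + (1 - alpha)*q - p**alpha * q**(1 - alpha)
        return np.sum(integrand) * dx / (alpha * (1 - alpha))

    x = np.linspace(-10, 10, 2001)
    dx = x[1] - x[0]
    p = np.exp(-0.5 * (x - 1)**2)          # unnormalized test densities
    q = 1.5 * np.exp(-0.5 * x**2)
    for a in (-1, 0, 0.5, 1, 2):
        print("alpha =", a, " D =", alpha_div(p, q, a, dx))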

  17. Examples of alpha-divergence: $D_{-1}(p\|q) = \frac12\int\frac{(q-p)^2}{p}\,dx$, $\alpha \to 0$ gives $\mathrm{KL}(q\|p)$, $D_{1/2} = 2\int(\sqrt{p}-\sqrt{q})^2\,dx$ (Hellinger distance), $\alpha \to 1$ gives $\mathrm{KL}(p\|q)$, $D_2 = \frac12\int\frac{(p-q)^2}{q}\,dx$ ($\chi^2$ divergence)

  18. Minimum alpha-divergence: q is Gaussian, minimizes D_α(p||q), α = -∞

  19. Minimum alpha-divergence: q is Gaussian, minimizes D_α(p||q), α = 0

  20. Minimum alpha-divergence: q is Gaussian, minimizes D_α(p||q), α = 0.5

  21. Minimum alpha-divergence: q is Gaussian, minimizes D_α(p||q), α = 1

  22. Minimum alpha-divergence: q is Gaussian, minimizes D_α(p||q), α = ∞
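
The sweep on slides 18-22 can be reproduced numerically: fit a Gaussian q to a bimodal p by minimizing D_α(p||q) over the mean and variance, and watch the fit move from hugging one mode (small α) toward covering all the mass (large α). A sketch; the mixture p and the quadrature-plus-Nelder-Mead optimizer are my choices, not the talk's:

    import numpy as np
    from scipy.optimize import minimize

    x = np.linspace(-12, 12, 4001)
    dx = x[1] - x[0]

    def gauss(x, m, v):
        return np.exp(-0.5 * (x - m)**2 / v) / np.sqrt(2 * np.pi * v)

    # Hypothetical bimodal target (the talk's p is not given numerically).
    p = 0.7 * gauss(x, -2.0, 0.5) + 0.3 * gauss(x, 3.0, 1.0)

    def alpha_div(p, q, alpha):
        integrand = alpha*p + (1 - alpha)*q - p**alpha * q**(1 - alpha)
        return np.sum(integrand) * dx / (alpha * (1 - alpha))

    def best_gaussian(alpha):
        # Minimize D_alpha(p||q) over the mean and log-variance of q.
        obj = lambda t: alpha_div(p, gauss(x, t[0], np.exp(t[1])), alpha)
        m, logv = minimize(obj, x0=[0.0, 0.0], method="Nelder-Mead").x
        return m, np.exp(logv)

    for a in (-1, 0.5, 2, 10):   # alpha = 0 and 1 need the KL limit forms
        m, v = best_gaussian(a)
        print(f"alpha = {a}: mean = {m:.2f}, var = {v:.2f}")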

  23. Properties of alpha-divergence • α ≤ 0 seeks the mode with largest mass (not the tallest) – zero-forcing: p(x)=0 forces q(x)=0 – underestimates the support of p • α ≥ 1 stretches to cover everything – inclusive: p(x)>0 forces q(x)>0 – overestimates the support of p [Frey,Patrascu,Jaakkola,Moran 00]
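
Both regimes can be read off the term $\int p(x)^\alpha q(x)^{1-\alpha}\,dx$ in the divergence; a short check, stated for the open regions (the boundary cases α = 0 and α = 1 follow from the KL limits KL(q||p) and KL(p||q)):

    \[
    \alpha < 0:\quad p(x) = 0,\; q(x) > 0
      \;\Rightarrow\; p(x)^{\alpha} q(x)^{1-\alpha} = \infty
      \;\Rightarrow\; D_\alpha(p\,\|\,q) = \infty
      \quad\text{(so $q$ must vanish wherever $p$ does)}
    \]
    \[
    \alpha > 1:\quad p(x) > 0,\; q(x) = 0
      \;\Rightarrow\; q(x)^{1-\alpha} = \infty
      \;\Rightarrow\; D_\alpha(p\,\|\,q) = \infty
      \quad\text{(so $q$ must be positive wherever $p$ is)}
    \]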

  24. Structure of alpha space [figure: the real α axis with 0 and 1 marked; the region α ≤ 0 is zero-forcing, the region α ≥ 1 is inclusive (zero-avoiding); MF sits at α = 0, BP and EP at α = 1, TRW at α > 1, and FBP/PEP can use any α]

  25. Other properties • If q is an exact minimum of the alpha-divergence, the normalizing constants are related by $\int q(x)\,dx = \int p(x)^\alpha q(x)^{1-\alpha}\,dx$ • If α = 1: a Gaussian q matches the mean and variance of p – a fully factorized q matches the marginals of p
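
The α = 1 statements are the standard moment-matching property: for $q(x) = \exp(\theta^\top \phi(x))$ in an exponential family (with a constant feature in φ, so the total mass is also free to match), setting the gradient of the unnormalized KL to zero gives

    \[
    \frac{\partial}{\partial\theta}\,\mathrm{KL}(p\,\|\,q)
      = \frac{\partial}{\partial\theta}\!\left[\int p\log\frac{p}{q}\,dx
        + \int (q-p)\,dx\right]
      = \int (q - p)\,\phi\,dx = 0
    \;\Rightarrow\;
    \int q\,\phi\,dx = \int p\,\phi\,dx .
    \]

With φ(x) = (1, x, x²) this matches mass, mean, and variance; with per-variable indicator features it matches the marginals, as the slide states.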

  26. Two-node example [figure: two connected nodes x, y] • q is fully factorized, minimizes the α-divergence to p • q has correct marginals only for α = 1 (BP)
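
This claim is easy to verify by brute force: pick a small joint p over two binary variables, minimize D_α over scaled product distributions q, and compare marginals. A sketch; the bimodal table for p is made up, standing in for the slides' example:

    import numpy as np
    from scipy.optimize import minimize

    # Hypothetical bimodal joint over binary (x, y): mass on (0,0) and (1,1).
    p = np.array([[0.45, 0.05],
                  [0.10, 0.40]])

    def alpha_div(p, q, alpha):
        if alpha == 1:
            return np.sum(p * np.log(p / q) + q - p)   # KL(p||q) limit
        term = alpha*p + (1 - alpha)*q - p**alpha * q**(1 - alpha)
        return np.sum(term) / (alpha * (1 - alpha))

    def best_factorized(alpha):
        # q(x,y) = s * qx(x) * qy(y), parametrized to stay positive.
        def make_q(t):
            s = np.exp(t[0])
            a, b = 1/(1 + np.exp(-t[1])), 1/(1 + np.exp(-t[2]))
            return s * np.outer([1 - a, a], [1 - b, b])
        t = minimize(lambda t: alpha_div(p, make_q(t), alpha),
                     x0=[0.0, 0.0, 0.0], method="Nelder-Mead").x
        return make_q(t)

    for alpha in (0.5, 1.0, 2.0):
        q = best_factorized(alpha)
        print(f"alpha = {alpha}: q marginal of x = {q.sum(axis=1)},"
              f" p marginal = {p.sum(axis=1)}")

Only the α = 1 run should reproduce p's marginals, matching the slide.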

  27. Two-node example: bimodal distribution • α = 1 (BP): good on marginals and mass; bad on zeros and peak heights • α = 0 (MF), α ≤ 0.5: good on zeros and capturing one peak; bad on marginals and mass

  28. Two-node example: bimodal distribution • α = ∞: good on peak heights; bad on zeros and marginals

  29. Lessons • Neither method is inherently superior – depends on what you care about • A factorized approximation does not imply matching marginals (only for α = 1) • Adding y to the problem can change the estimated marginal for x (though the true marginal is unchanged)

  30. Outline • Example of message passing • Interpreting message passing • Divergence measures • Message passing from a divergence measure • Big picture

  31. Distributed divergence minimization

  32. Distributed divergence minimization • Write p as a product of factors: $p(x) = \prod_a f_a(x)$ • Approximate the factors one by one: $f_a(x) \to \tilde f_a(x)$ • Multiply to get the approximation: $q(x) = \prod_a \tilde f_a(x)$

  33. Global divergence to local divergence • Global divergence: $D\big(\prod_a f_a(x) \,\big\|\, \prod_a \tilde f_a(x)\big)$ • Local divergence: $D\big(f_a(x) \prod_{b\neq a} \tilde f_b(x) \,\big\|\, \tilde f_a(x) \prod_{b\neq a} \tilde f_b(x)\big)$

  34. Message passing • Messages are passed between factors • Messages are factor approximations: $\tilde f_a(x)$ • Factor a receives the messages $\tilde f_b$ from the other factors – Minimize local divergence to get $\tilde f_a$ – Send $\tilde f_a$ to the other factors – Repeat until convergence • Produces all 6 algorithms
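
For α = 1 and a fully factorized q, the local-divergence step has a closed form: project the factor times its incoming messages onto marginals, which is exactly the loopy BP update from slides 9-11. A minimal sketch on a pairwise binary model; the graph and potentials are placeholders:

    import numpy as np

    rng = np.random.default_rng(0)
    edges = [(0, 1), (1, 2), (0, 2)]       # a triangle, so the graph has a loop
    f = {e: np.exp(rng.uniform(-1, 1, (2, 2))) for e in edges}

    # Messages m[(edge, node)] play the role of the factor approximations.
    m = {(e, v): np.ones(2) for e in edges for v in e}

    def belief(i):
        b = np.ones(2)
        for e in edges:
            if i in e:
                b = b * m[(e, i)]
        return b

    for sweep in range(50):
        for e in edges:
            i, j = e
            cav_i = belief(i) / m[(e, i)]   # cavity: remove own approximation
            cav_j = belief(j) / m[(e, j)]
            tilted = f[e] * np.outer(cav_i, cav_j)
            mi = tilted.sum(axis=1) / cav_i # alpha=1 projection: marginalize
            mj = tilted.sum(axis=0) / cav_j
            m[(e, i)], m[(e, j)] = mi / mi.sum(), mj / mj.sum()

    for i in range(3):
        b = belief(i)
        print(f"q(x_{i}) =", b / b.sum())

Swapping the projection for other α values (raising the factor-to-approximation ratio to the power α before projecting) gives FBP and power EP; this sketch covers only the α = 1 / BP case.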

  35. Global divergence vs. local divergence [figure: on the α axis, local = global only at α = 0 (MF), where there is no loss from message passing; elsewhere local ≠ global] • In general, local ≠ global • but results are similar • BP doesn't minimize global KL, but comes close

  36. Experiment • Which message-passing algorithm is best at minimizing the global divergence D_α(p||q)? • Procedure: 1. Run FBP with various α_L 2. Compute the global divergence for various α_G 3. Find the best α_L (best algorithm) for each α_G

  37. Results • Average over 20 graphs with random singleton and pairwise potentials exp(θ_i x_i), exp(w_ij x_i x_j) • Mixed potentials (w ~ U(-1,1)): – best α_L = α_G (local should match global) – FBP with the same α is best at minimizing D_α • BP is best at minimizing KL

  38. Outline • Example of message passing • Interpreting message passing • Divergence measures • Message passing from a divergence measure • Big picture

  39. Hierarchy of algorithms • Power EP: exp family, D_α(p||q) • Structured MF: exp family, KL(q||p) • FBP: fully factorized, D_α(p||q) • EP: exp family, KL(p||q) • MF: fully factorized, KL(q||p) • TRW: fully factorized, D_α(p||q) with α > 1 • BP: fully factorized, KL(p||q) (Power EP is the most general; each algorithm below it restricts the family or the divergence)

  40. Matrix of algorithms (divergence measure × approximating family)
      KL(q||p):           MF (fully factorized)    Structured MF (exp family)
      KL(p||q):           BP (fully factorized)    EP (exp family)
      D_α(p||q), α > 1:   TRW (fully factorized)
      D_α(p||q):          FBP (fully factorized)   Power EP (exp family)
      Other families? (e.g. mixtures)  Other divergences?

  41. Other Message Passing Algorithms Do they correspond to divergence measures? • Generalized belief propagation [Yedidia,Freeman,Weiss 00] • Iterated conditional modes [Besag 86] • Max-product belief revision • TRW-max-product [Wainwright,Jaakkola,Willsky 02] • Laplace propagation [Smola,Vishwanathan,Eskin 03] • Penniless propagation [Cano,Moral,Salmerón 00] • Bound propagation [Leisink,Kappen 03]

  42. Future work • Understand existing message passing algorithms • Understand local vs. global divergence • New message passing algorithms: – Specialized divergence measures – Richer approximating families • Other ways to minimize divergence
