SLIDE 1

Variational Mean Field for Graphical Models

CS/CNS/EE 155

Baback Moghaddam

Machine Learning Group

baback@jpl.nasa.gov

SLIDE 2

Approximate Inference

  • Consider general UGs (i.e., not tree-structured)
  • All basic computations are intractable (for large G)
  • likelihoods & partition function
  • marginals & conditionals
  • finding modes
SLIDE 3

Taxonomy of Inference Methods

  • Exact : VE, JT, BP
  • Approximate
  • Stochastic (MC) : Gibbs, M-H, MC, SA
  • Deterministic
  • Cluster (~MP) : LBP, EP
  • Variational

SLIDE 4

Approximate Inference

  • Stochastic (Sampling)
  • Metropolis-Hastings, Gibbs, (Markov Chain) Monte Carlo, etc.
  • computationally expensive, but "exact" (in the limit)
  • Deterministic (Optimization)
  • Mean Field (MF), Loopy Belief Propagation (LBP)
  • Variational Bayes (VB), Expectation Propagation (EP)
  • computationally cheaper, but not exact (gives bounds)
SLIDE 5

Mean Field : Overview

  • General idea
  • approximate p(x) by a simpler factored distribution q(x)
  • minimize "distance" D(p||q)
  • e.g., Kullback-Leibler divergence
  • original G → (naïve) MF H0 → structured MF Hs
  • p(x) ∝ ∏_c φ_c(x_c)
  • naïve MF : q(x) = ∏_i q_i(x_i)
  • structured MF : q(x) ∝ q_A(x_A) q_B(x_B)
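The general idea can be made concrete with a short sketch in Python (the 3-variable model, its coupling strength, and all names are illustrative choices, not from the slides): build the fully factored q(x) from p's single-site marginals and confirm that the KL "distance" to p is strictly positive whenever p carries coupling that q cannot represent.

```python
import itertools
import math

# Toy 3-variable binary distribution p(x) with Ising-like pairwise coupling.
# We form the fully factored q(x) = prod_i q_i(x_i) from p's single-site
# marginals and measure the resulting KL divergence D(q||p).

def normalize(scores):
    z = sum(scores.values())
    return {x: s / z for x, s in scores.items()}

# unnormalized p: neighboring sites prefer to agree
p = normalize({x: math.exp(sum(0.8 * (1 if x[i] == x[i + 1] else -1)
                               for i in range(2)))
               for x in itertools.product([0, 1], repeat=3)})

# single-site marginals of p
marg = [sum(px for x, px in p.items() if x[i] == 1) for i in range(3)]

# fully factored q built from those marginals
q = {x: math.prod(marg[i] if x[i] == 1 else 1 - marg[i] for i in range(3))
     for x in itertools.product([0, 1], repeat=3)}

kl_qp = sum(qx * math.log(qx / p[x]) for x, qx in q.items() if qx > 0)
print(round(kl_qp, 4))  # strictly positive: q cannot capture the coupling
```

By symmetry the marginals here are all 0.5, so q is uniform while p is not; the positive KL is exactly the approximation error mean field trades for tractability.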

SLIDE 6

Mean Field : Overview

  • Naïve MF has roots in Statistical Mechanics (1890s)
  • physics of spin glasses (Ising), ferromagnetism, etc.
  • why is it called "Mean Field"? with full factorization: E[xi xj] = E[xi] E[xj]
  • Structured MF is more "modern"

Coupled HMM → Structured MF approximation (with tractable chains)

SLIDE 7

KL Projection D(Q||P)

  • Infer hidden h given visible v (clamp v nodes with δ's)
  • Variational : optimize KL globally
  • the right density form for Q "falls out"
  • this KL is easier since we're taking E[.] wrt the simpler Q
  • Q seeks the mode with the largest mass (not height), so it will tend to underestimate the support of P
  • P = 0 forces Q = 0

SLIDE 8

KL Projection D(P||Q)

  • Infer hidden h given visible v (clamp v nodes with δ's)
  • Expectation Propagation (EP) : optimize KL locally
  • this KL is harder since we're taking E[.] wrt P
  • no nice global solution for Q "falls out"
  • must sequentially tweak each qc (match moments)
  • Q covers all modes, so it overestimates the support of P
  • P > 0 forces Q > 0

SLIDE 9

α-divergences

  • The 2 basic KL divergences are special cases of the α-divergence Dα(p||q)
  • Dα(p||q) is non-negative, and 0 iff p = q
  • when α → -1 we get KL(P||Q)
  • when α → +1 we get KL(Q||P)
  • when α = 0, D0(P||Q) is proportional to Hellinger's distance (a metric)
  • So many variational approximations must exist, one for each α!
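Amari's α-divergence for discrete distributions makes the quoted special cases easy to check numerically. The sketch below uses the standard form Dα(p||q) = 4/(1−α²) [1 − Σ p^((1−α)/2) q^((1+α)/2)], with the endpoint values taken as limits; the distributions are arbitrary examples.

```python
import math

# Amari alpha-divergence for discrete distributions, checking the slide's
# special cases: alpha -> -1 gives KL(P||Q), alpha -> +1 gives KL(Q||P),
# and alpha = 0 is proportional to the squared Hellinger distance.

def d_alpha(p, q, a):
    assert abs(a) < 1  # the endpoints are defined as limits
    s = sum(pi ** ((1 - a) / 2) * qi ** ((1 + a) / 2) for pi, qi in zip(p, q))
    return 4.0 / (1 - a * a) * (1 - s)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

print(round(d_alpha(p, q, -0.9999), 4), round(kl(p, q), 4))  # nearly equal
print(round(d_alpha(p, q, +0.9999), 4), round(kl(q, p), 4))  # nearly equal
hell2 = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
print(round(d_alpha(p, q, 0.0), 4), round(2 * hell2, 4))     # exactly equal
```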

SLIDE 10

for more on α-divergences, see Shun-ichi Amari

SLIDE 11

for specific examples of α = ±1, see Chapter 10:

  • Variational Single Gaussian
  • Variational Linear Regression
  • Variational Mixture of Gaussians
  • Variational Logistic Regression
  • Expectation Propagation (α = -1)

SLIDE 12

Hierarchy of Algorithms

(based on α and structuring)

  • BP : fully factorized, KL(p||q)
  • EP : exp family, KL(p||q)
  • FBP : fully factorized, Dα(p||q)
  • Power EP : exp family, Dα(p||q)
  • MF : fully factorized, KL(q||p)
  • Structured MF : exp family, KL(q||p)
  • TRW : fully factorized, Dα(p||q) with α > 1

by Tom Minka

SLIDE 13

Variational MF

p(x) = (1/Z) ∏_c γ_c(x_c) = (1/Z) e^ψ(x),   where ψ(x) = Σ_c log γ_c(x_c)

Z = ∫ e^ψ(x) dx
  = ∫ Q(x) [ e^ψ(x) / Q(x) ] dx
  = E_Q[ e^ψ(x) / Q(x) ]

log Z = log E_Q[ e^ψ(x) / Q(x) ]
      ≥ sup_Q E_Q[ log( e^ψ(x) / Q(x) ) ]     (Jensen's)
      = sup_Q { E_Q[ψ(x)] + H[Q(x)] }

SLIDE 14

Variational MF

log Z ≥ sup_Q { E_Q[ψ(x)] + H[Q(x)] }

  • Equality is obtained for Q(x) = P(x) (all Q admissible)
  • Using any other Q yields a lower bound on log Z
  • The slack in this bound is the KL-divergence D(Q||P)
  • Goal: restrict Q to a tractable subclass Q, then optimize with sup_Q to tighten this bound
  • note we're (also) maximizing entropy H[Q]
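These three facts (lower bound for any Q, slack equal to D(Q||P), equality at Q = P) can be verified numerically on a tiny model; the random ψ over three binary variables below is purely illustrative.

```python
import itertools
import math
import random

# Check: log Z >= E_Q[psi(x)] + H[Q(x)], with slack KL(Q||P), on a small model.

random.seed(0)
xs = list(itertools.product([0, 1], repeat=3))
psi = {x: random.uniform(-1, 1) for x in xs}   # psi(x) = log of total potential
z = sum(math.exp(psi[x]) for x in xs)
logz = math.log(z)
p = {x: math.exp(psi[x]) / z for x in xs}

def bound(q):
    e_psi = sum(q[x] * psi[x] for x in xs)
    entropy = -sum(q[x] * math.log(q[x]) for x in xs if q[x] > 0)
    return e_psi + entropy

q_unif = {x: 1.0 / len(xs) for x in xs}        # an arbitrary admissible Q
b_unif = bound(q_unif)
kl_qp = sum(q_unif[x] * math.log(q_unif[x] / p[x]) for x in xs)

print(logz >= b_unif)                     # the bound holds for this Q
print(abs(logz - b_unif - kl_qp) < 1e-9)  # slack is exactly KL(Q||P)
print(abs(logz - bound(p)) < 1e-9)        # equality at Q = P
```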

SLIDE 15

Variational MF

log Z ≥ sup_Q { E_Q[ψ(x)] + H[Q(x)] }

Most common specialized family : "log-linear models"

  • linear in parameters θ (natural parameters of EFs)
  • clique potentials φ(x) (sufficient statistics of EFs)

ψ(x) = Σ_c θ_c φ_c(x_c) = θᵀ φ(x)

Fertile ground for plowing : Convex Analysis

SLIDE 16

Convex Analysis : "The Old Testament" and "The New Testament"

SLIDE 17

Variational MF for EF

log Z ≥ sup_Q { E_Q[ψ(x)] + H[Q(x)] }
log Z ≥ sup_Q { E_Q[θᵀφ(x)] + H[Q(x)] }     (EF notation)
log Z ≥ sup_Q { θᵀ E_Q[φ(x)] + H[Q(x)] }
A(θ)  ≥ sup_{µ ∈ M} { θᵀµ − A*(µ) }

M = set of all moment parameters realizable under subclass Q

SLIDE 18

Variational MF for EF

So it looks like we are just optimizing a concave function (a linear term minus the negative-entropy A*(µ)) over a convex set. Yet it is hard... Why?

  • 1. graph probability (being a measure) requires a very large number of marginalization constraints for consistency (leading to a typically beastly marginal polytope M in the discrete case)
  • e.g., a complete 7-node graph's polytope has over 10^8 facets! In fact, optimizing just the linear term alone can be hard
  • 2. exact computation of the entropy −A*(µ) is highly non-trivial (hence the famed Bethe & Kikuchi approximations)

SLIDE 19

Gibbs Sampling for Ising

  • Binary MRF G = (V,E) with pairwise clique potentials
  • 1. pick a node s at random
  • 2. sample u ~ Uniform(0,1)
  • 3. update node s
  • 4. goto step 1
  • a slower stochastic version of ICM
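A minimal sketch of this sampler (the 4-node cycle, the ±1 spin convention, and the parameter values are illustrative choices; the slide's own update formula was lost in extraction, so the conditional below is the standard Ising one):

```python
import math
import random

# Gibbs sampling for a binary (Ising) MRF with pairwise potentials,
# following the slide's recipe: pick a node, draw u ~ Uniform(0,1), update.

random.seed(1)
n = 4                                # 4-node cycle
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
theta_s = [0.2] * n                  # singleton potentials
theta_st = 0.5                       # uniform coupling (agreement favored)

nbrs = {s: [] for s in range(n)}
for s, t in edges:
    nbrs[s].append(t)
    nbrs[t].append(s)

x = [random.choice([-1, 1]) for _ in range(n)]
counts = [0.0] * n
sweeps = 20000
for it in range(sweeps * n):
    s = random.randrange(n)                          # 1. pick a node at random
    field = theta_s[s] + theta_st * sum(x[t] for t in nbrs[s])
    p_plus = 1.0 / (1.0 + math.exp(-2.0 * field))    # P(x_s = +1 | neighbors)
    u = random.random()                              # 2. u ~ Uniform(0,1)
    x[s] = 1 if u < p_plus else -1                   # 3. update node s
    for t in range(n):                               # accumulate E[x_t]
        counts[t] += x[t]

means = [c / (sweeps * n) for c in counts]
print([round(m, 2) for m in means])  # positive field + coupling push spins up
```

As the slide says, this is "exact" only in the limit: the estimates converge slowly, one stochastic site-update at a time.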

SLIDE 20

Naive MF for Ising

  • use a variational mean parameter at each site
  • 1. pick a node s at random
  • 2. update its parameter
  • 3. goto step 1
  • deterministic "loopy" message-passing
  • how well does it work? depends on θ
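The deterministic counterpart can be sketched the same way, assuming the standard naive-MF fixed-point update µ_s = tanh(θ_s + Σ_t θ_st µ_t) for ±1 spins (the slide's own formula was lost; the 4-node cycle and parameters are illustrative choices).

```python
import math

# Naive mean-field coordinate ascent for a small Ising model: each site keeps
# a variational mean parameter mu_s, updated from the "mean field" of its
# neighbors until the updates reach a fixed point.

n = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
theta_s = [0.2] * n
theta_st = 0.5

nbrs = {s: [] for s in range(n)}
for s, t in edges:
    nbrs[s].append(t)
    nbrs[t].append(s)

mu = [0.0] * n                       # mu_s = E_q[x_s], spins in {-1, +1}
for sweep in range(100):             # deterministic "message passing"
    for s in range(n):               # 1. pick a node  2. update its parameter
        mu[s] = math.tanh(theta_s[s] + theta_st * sum(mu[t] for t in nbrs[s]))

print([round(m, 3) for m in mu])     # all sites reach the same fixed point
```

Unlike the Gibbs version this converges in a handful of sweeps, but it returns a (possibly biased) fixed point rather than asymptotically exact samples.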
SLIDE 21

Graphical Models as EF

  • G(V,E) with nodes
  • sufficient stats : clique potentials (likewise for θ_st)
  • probability
  • log-partition
  • mean parameters
SLIDE 22

Variational Theorem for EF

  • For any mean parameter µ in the relative interior of M, where θ(µ) is the corresponding natural parameter,
  • the log-partition function has this variational representation: A(θ) = sup_µ { θᵀµ − A*(µ) }
  • this supremum is achieved at the moment-matching value of µ
  • (for µ not in the closure of M, A*(µ) = +∞)

SLIDE 23

Legendre-Fenchel Duality

  • Main Idea: (convex) functions can be "supported" (lower-bounded) by a continuum of lines (hyperplanes) whose intercepts create a conjugate dual of the original function (and vice versa)
  • conjugate dual of A : A*(µ) = sup_θ { θᵀµ − A(θ) }
  • conjugate dual of A* : A**(θ) = sup_µ { θᵀµ − A*(µ) }
  • Note that A** = A (iff A is convex)

SLIDE 24

Dual Map for EF

  • Two equivalent parameterizations of the EF
  • Bijective mapping between Ω and the interior of M
  • Mapping is defined by the gradients of A and its dual A*
  • Shape & complexity of M depends on X and the size and structure of G

SLIDE 25

Marginal Polytope

  • G(V,E) = graph with discrete nodes
  • Then M = convex hull of all φ(x)
  • equivalent to intersecting half-spaces aᵀµ ≥ b
  • difficult to characterize for large G
  • hence difficult to optimize over
  • interior of M is 1-to-1 with Ω

SLIDE 26

The Simplest Graph : a single Bernoulli node x

  • G(V,E) = a single Bernoulli node, φ(x) = x
  • density : p(x) = exp( θx − A(θ) )
  • log-partition : A(θ) = log(1 + e^θ)  (of course we knew this)
  • we know A* too, but let's solve for it variationally : A*(µ) = sup_θ { θµ − A(θ) }
  • differentiate : µ = e^θ / (1 + e^θ) at the stationary point
  • rearrange to θ = log( µ / (1−µ) ), substitute into A* : A*(µ) = µ log µ + (1−µ) log(1−µ)
  • Note: we found both the mean parameter and the lower bound using the variational method
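This single-node derivation can be checked end to end by brute force; the grid search over θ below stands in for the sup, and everything else follows the slide's Bernoulli setup.

```python
import math

# Recover the conjugate dual A*(mu) = sup_theta { theta*mu - A(theta) }
# numerically and compare with the closed form mu*log(mu) + (1-mu)*log(1-mu)
# (the negative entropy of a Bernoulli with mean mu).

def A(theta):
    return math.log(1.0 + math.exp(theta))   # Bernoulli log-partition

def A_star_numeric(mu, lo=-20.0, hi=20.0, steps=40000):
    # brute-force the sup over a fine theta grid
    return max(mu * th - A(th)
               for th in (lo + (hi - lo) * i / steps for i in range(steps + 1)))

mu = 0.3
closed = mu * math.log(mu) + (1 - mu) * math.log(1 - mu)
numeric = A_star_numeric(mu)
print(round(numeric, 6), round(closed, 6))   # they agree

# the maximizer is the stationary point mu = e^theta / (1 + e^theta),
# i.e. theta = log(mu / (1 - mu)):
theta_hat = math.log(mu / (1 - mu))
gap = abs(mu * theta_hat - A(theta_hat) - closed)
```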

SLIDE 27

The 2nd Simplest Graph : two connected Bernoulli nodes x1, x2

  • G(V,E) = 2 connected Bernoulli nodes
  • moments : µ1 = E[x1], µ2 = E[x2], µ12 = E[x1 x2]
  • variational problem, subject to the moment constraints
  • solve (it's still easy!)

SLIDE 28

The 3rd Simplest Graph : x1, x2, x3

  • 3 nodes : 16 constraints
  • # of constraints blows up real fast : 7 nodes → 200,000,000+ constraints
  • hard to keep track of valid µ's (i.e., the full shape and extent of M)
  • no more checking our results against closed-form expressions that we already knew in advance!
  • unless G remains a tree, the entropy A* will not decompose nicely, etc.

SLIDE 29

Variational MF for Ising

  • tractable subgraph H = (V, ∅)
  • fully-factored distribution
  • moment space
  • entropy is additive
  • variational problem for A(θ)
  • using coordinate ascent

SLIDE 30

Variational MF for Ising

  • M_tr is a non-convex inner approximation : M_tr ⊂ M
  • optimizing over M_tr must then yield a lower bound
  • what causes this funky curvature?

SLIDE 31

Factorization with Trees

  • suppose we have a tree G = (V,T)
  • useful factorization for trees : p(x) = ∏_s p(x_s) ∏_(s,t)∈T [ p(x_s, x_t) / ( p(x_s) p(x_t) ) ]
  • entropy becomes H = Σ_s H_s − Σ_(s,t) I_st
  • singleton terms : entropies H_s
  • pairwise terms : Mutual Information I_st
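The tree decomposition of entropy can be confirmed numerically on a 3-node chain x1 - x2 - x3 (the agreement potential below is an arbitrary positive example):

```python
import itertools
import math

# Check H(p) = sum_s H(x_s) - sum_(s,t) I(x_s; x_t) for a chain-structured p.

xs = list(itertools.product([0, 1], repeat=3))

def pot(a, b):                       # pairwise potential favoring agreement
    return 2.0 if a == b else 1.0

w = {x: pot(x[0], x[1]) * pot(x[1], x[2]) for x in xs}
z = sum(w.values())
p = {x: wx / z for x, wx in w.items()}

def H(dist):
    return -sum(v * math.log(v) for v in dist.values() if v > 0)

def marg(idx):                       # marginal of p over the given indices
    out = {}
    for x, px in p.items():
        key = tuple(x[i] for i in idx)
        out[key] = out.get(key, 0.0) + px
    return out

singles = [marg([i]) for i in range(3)]
pairs = {e: marg(list(e)) for e in [(0, 1), (1, 2)]}

def mutual_info(e):
    ps, pt, pst = singles[e[0]], singles[e[1]], pairs[e]
    return sum(v * math.log(v / (ps[(a,)] * pt[(b,)]))
               for (a, b), v in pst.items() if v > 0)

lhs = H(p)
rhs = sum(H(m) for m in singles) - sum(mutual_info(e) for e in pairs)
print(abs(lhs - rhs) < 1e-9)         # entropy decomposes exactly on a tree
```

Replacing the chain with a 3-cycle breaks the identity, which is exactly why the Bethe approximation on the next slide is only an approximation.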

SLIDE 32

Variational MF for Loopy Graphs

  • pretend entropy factorizes like a tree (Bethe approximation)
  • define pseudo-marginals, and impose these normalization and marginalization constraints
  • define the local polytope L(G) obeying these constraints
  • note that for any G : M(G) ⊆ L(G), with equality only for trees : M(G) = L(G)
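A classic concrete instance of the strict inclusion M(G) ⊂ L(G), sketched here in Python (a standard textbook example, not from the slides): on a 3-cycle, "perfect disagreement" pseudo-marginals satisfy every local constraint yet are realized by no joint distribution.

```python
import itertools

# Pseudo-marginals on a 3-cycle: uniform singletons, and pairwise tables
# putting all mass on disagreement. They lie in L(G) but not in M(G).

edges = [(0, 1), (1, 2), (2, 0)]
tau_s = {s: {0: 0.5, 1: 0.5} for s in range(3)}
tau_st = {e: {(a, b): (0.5 if a != b else 0.0)
              for a in (0, 1) for b in (0, 1)} for e in edges}

# local-polytope checks: normalization and edge-to-node marginalization
for e in edges:
    assert abs(sum(tau_st[e].values()) - 1.0) < 1e-12
    for a in (0, 1):
        assert abs(sum(tau_st[e][(a, b)] for b in (0, 1)) - tau_s[e[0]][a]) < 1e-12
        assert abs(sum(tau_st[e][(b, a)] for b in (0, 1)) - tau_s[e[1]][a]) < 1e-12

# any realizing joint must be supported on configurations where every edge
# disagrees -- but an odd cycle admits no such configuration
supports = [x for x in itertools.product([0, 1], repeat=3)
            if all(x[s] != x[t] for s, t in edges)]
print(supports)  # [] : empty support, so these pseudo-marginals are not in M(G)
```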

SLIDE 33

Variational MF for Loopy Graphs

  • L(G) is an outer polyhedral approximation
  • solving this Bethe Variational Problem we get the LBP equations!
  • so fixed points of LBP are the stationary points of the BVP
  • this not only illuminates what was originally an educated "hack" (LBP), but suggests new convergence conditions and improved algorithms (TRW)

SLIDE 34

see the ICML 2008 Tutorial

SLIDE 35

Summary

  • SMF can also be cast in terms of "Free Energy", etc.
  • Tightening the variational bound = minimizing KL divergence
  • Other schemes (e.g., "Variational Bayes") = SMF with additional conditioning (hidden, visible, parameter)
  • Solving the variational problem gives both µ and A(θ)
  • It helps to see problems through the lens of Variational Analysis

SLIDE 36

Matrix of Inference Methods

  • Discrete
  • Chain (online) : BP = forwards (exact); Boyen-Koller (ADF), beam search (deterministic approx.)
  • Low treewidth : VarElim, Jtree, recursive conditioning (exact)
  • High treewidth : Loopy BP, mean field, structured variational, EP, graph-cuts (deterministic approx.); Gibbs (stochastic approx.)
  • Gaussian
  • Chain (online) : BP = Kalman filter (exact)
  • Low treewidth : Jtree = sparse linear algebra (exact)
  • High treewidth : Loopy BP (deterministic approx.); Gibbs (stochastic approx.)
  • Other
  • Chain (online) : EKF, UKF, moment matching (ADF) (deterministic approx.); Particle filter (stochastic approx.)
  • Low treewidth : EP, EM, VB, NBP, Gibbs
  • High treewidth : EP, variational EM, VB, NBP, Gibbs

BP = Belief Propagation, EP = Expectation Propagation, ADF = Assumed Density Filtering, EKF = Extended Kalman Filter, UKF = Unscented Kalman Filter, VarElim = Variable Elimination, Jtree = Junction Tree, EM = Expectation Maximization, VB = Variational Bayes, NBP = Non-parametric BP

by Kevin Murphy

SLIDE 37

THE END