Expectation Consistent Approximate Inference (Ole Winther)

Expectation Consistent Approximate Inference

Ole Winther
Informatics and Mathematical Modelling, Technical University of Denmark, DK-2800 Lyngby, Denmark
owi@imm.dtu.dk

In collaboration with Manfred Opper
ISIS, School of Electronics and Computer Science, University of Southampton, SO17 1BJ, United Kingdom
mo@ecs.soton.ac.uk

Motivation
• Contemporary machine learning uses complex, flexible probabilistic models.
• Bayesian inference is typically intractable.
• Approximate polynomial-complexity methods are needed.
• VB, Bethe, EP and EC: use a tractable factorization of the original model.
• EC: expectation consistency between two distributions, e.g. discrete and Gaussian.

Exact Inference in Tree Graphs
Bethe: tree factorization, e.g.
$$p(x) = \frac{1}{Z} f_{12} f_{23} f_1 f_2 f_3$$
Write $p(x)$ in terms of the marginals $q_i(x_i)$ and $q_{ij}(x_i, x_j)$:
$$p(x) = \frac{q_{12}(x_1, x_2)\, q_{23}(x_2, x_3)}{q_2(x_2)}, \qquad Z = \frac{Z_{12} Z_{23}}{Z_2}$$
Message passing: effective inference for $p(x)$ discrete or Gaussian.

Bethe Approximation
Bethe approximation: treat $p(x)$, e.g.
$$p(x) = \frac{1}{Z} f_{12} f_{23} f_{13} f_1 f_2 f_3$$
as if it were a tree graph:
$$q(x) = \frac{q_{12}(x_1, x_2)\, q_{23}(x_2, x_3)\, q_{13}(x_1, x_3)}{q_1(x_1)\, q_2(x_2)\, q_3(x_3)}$$
Works extremely well in “sparse systems”, e.g. low-density decoding.
Disadvantage: over-counting, so $q(x)$ is not a density.

Variational Bayes (VB)
Minimize the KL-divergence in a restricted tractable family $q(x) = \prod_i q_i(x_i)$:
$$q_i(x_i) = \operatorname*{argmin}_{q_i(x_i)} \mathrm{KL}[q(x)\,\|\,p(x)] \propto \exp \langle \ln p(x) \rangle_{q \setminus q_i(x_i)}$$
Example, Gaussian: $q(x) = N(x; m_q, C_q)$ and $p(x) = N(x; m, C)$ give
$$m_q = m \quad\text{and}\quad (C_q)_{ij} = \frac{\delta_{ij}}{(C^{-1})_{ii}}$$
In general, (factorized) VB is reliable on the mean but under-estimates the width of the distribution (see e.g. MacKay, 2003; Opper & Winther, 2004). This is important for parameter estimation (see e.g. Minka & Lafferty).
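The Gaussian example can be checked numerically. The following is a minimal sketch, assuming NumPy; the function name `factorized_vb_gaussian` and the 2x2 covariance are illustrative choices, not part of the slides. It applies the closed-form result $(C_q)_{ii} = 1/(C^{-1})_{ii}$ and compares it with the true marginal variances $C_{ii}$.

```python
import numpy as np

def factorized_vb_gaussian(m, C):
    """Closed-form factorized VB fit to N(x; m, C): mean m, variances 1/(C^-1)_ii."""
    precision = np.linalg.inv(C)
    return m.copy(), 1.0 / np.diag(precision)

# Hypothetical strongly correlated target distribution.
m = np.array([0.5, -1.0])
C = np.array([[1.0, 0.9],
              [0.9, 1.0]])

m_q, var_q = factorized_vb_gaussian(m, C)
print("VB mean:          ", m_q)
print("VB variances:     ", var_q)       # 1 - 0.9**2 = 0.19 for each variable
print("true marginal var:", np.diag(C))  # 1.0 for each variable
```

The mean is reproduced exactly, while the VB variances shrink as the correlation grows, which is the under-estimation of width noted above.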

Motivating EC and Overview
We are looking for a tractable approximation that
• can handle “dense graphs” (better than Bethe+),
• can estimate correlations (better than VB).
Overview:
• Free energy
• Why it works: the central limit theorem
• Algorithmics and connection to EP
• Simulations, conclusions and outlook

Expectation Consistent (EC) free energy
Calculate the partition function
$$Z = \int dx\, f(x) = \int dx\, f_q(x)\, f_r(x)$$
Problem: $Z$ is intractable. The integral is not analytical and/or the summation is exponential in the number of variables $N$.
Introduce a tractable distribution
$$q(x) = \frac{1}{Z_q(\lambda_q)} f_q(x) \exp\left(\lambda_q^T g(x)\right)$$
where $Z_q$ can be calculated in polynomial time. Then
$$Z = Z_q \frac{Z}{Z_q} = Z_q \frac{\int dx\, f_r(x)\, f_q(x) \exp\left((\lambda_q - \lambda_q)^T g(x)\right)}{\int dx\, f_q(x) \exp\left(\lambda_q^T g(x)\right)} = Z_q \left\langle f_r(x) \exp\left(-\lambda_q^T g(x)\right) \right\rangle_q$$

Free energy
The free energy is exact:
$$-\ln Z = -\ln Z_q - \ln \left\langle f_r(x) \exp\left(-\lambda_q^T g(x)\right) \right\rangle_q$$
The variational approximation uses Jensen's inequality, $\ln \langle f(x) \rangle \geq \langle \ln f(x) \rangle$:
$$-\ln Z \leq -\ln Z_q - \langle \ln f_r(x) \rangle_q + \lambda_q^T \langle g(x) \rangle_q$$
Find $\lambda_q$ by minimizing the upper bound.
It is better to average over $f_r(x) \exp\left(-\lambda_q^T g(x)\right)$ approximately; more of the averaging is retained that way.
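The Jensen bound can be verified by brute force on a tiny model. This is a minimal sketch under stated assumptions: a three-spin Ising model with $f_q(x) = \prod_i e^{\theta_i x_i}$ and $f_r(x) = \exp(\frac{1}{2} x^T J x)$ (zero diagonal in $J$), and $g(x) = x$ only, so that $q$ is a product of independent spins with fields $\theta + \gamma$. The function names and parameter values are hypothetical.

```python
import itertools
import numpy as np

def exact_log_z(theta, J):
    """ln Z by enumerating all 2^N spin configurations x_i = +/-1."""
    states = np.array(list(itertools.product([-1.0, 1.0], repeat=len(theta))))
    log_w = states @ theta + 0.5 * np.einsum("si,ij,sj->s", states, J, states)
    return np.log(np.sum(np.exp(log_w)))

def variational_free_energy(theta, J, gamma):
    """Upper bound -ln Z_q - <ln f_r>_q + gamma^T <x>_q for factorized q."""
    m = np.tanh(theta + gamma)                      # <x_i>_q
    log_zq = np.sum(np.log(2.0 * np.cosh(theta + gamma)))
    return -log_zq - 0.5 * m @ J @ m + gamma @ m    # J has zero diagonal

theta = np.array([0.2, -0.1, 0.3])
J = 0.5 * (np.ones((3, 3)) - np.eye(3))

exact = -exact_log_z(theta, J)
bounds = [variational_free_energy(theta, J, g * np.ones(3)) for g in (-0.5, 0.0, 0.5)]
print("exact free energy:", exact)
print("upper bounds:     ", bounds)
```

Every choice of `gamma` gives a value at or above the exact free energy; minimizing over `gamma` tightens the bound but does not remove the gap.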

Expectation consistent approximation
Define $g(x)$ such that both
$$q(x) = \frac{1}{Z_q(\lambda_q)} f_q(x) \exp\left(\lambda_q^T g(x)\right)$$
$$r(x) = \frac{1}{Z_r(\lambda_r)} f_r(x) \exp\left(\lambda_r^T g(x)\right)$$
are tractable.
This excludes some models that are tractable in the variational approach (without further approximations).

Example I: the Ising model
Binary variables (spins) $x_i = \pm 1$ with pairwise interactions:
$$f_q(x) = \prod_i \Psi_i(x_i), \qquad \Psi_i(x_i) = \left[\delta(x_i + 1) + \delta(x_i - 1)\right] e^{\theta_i x_i}$$
$$f_r(x) = \exp\Big(\sum_{i>j} x_i J_{ij} x_j\Big) = \exp\left(\tfrac{1}{2} x^T J x\right)$$
E.g. set $g(x)$ to first and second order:
$$g(x) = \left(x_1, -\frac{x_1^2}{2}, x_2, -\frac{x_2^2}{2}, \ldots, x_N, -\frac{x_N^2}{2}\right)$$
$q(x)$ is a factorized binary distribution and $r(x)$ a multivariate Gaussian.
The interpretation of $g(x)$ will become clear shortly.

Bethe and EC factorization
$$Z_{\text{Bethe}} = \frac{Z_{12} Z_{23} Z_{13}}{Z_1 Z_2 Z_3}$$
$Z_{\text{EC}}$ will be similar in spirit:
$$Z_{\text{EC}} = \frac{Z_q Z_r}{Z_s}$$
where $s$ stands for “separator”.

Example II: Gaussian processes
Supervised learning: inputs $x_1, \ldots, x_N$ and targets $t_1, \ldots, t_N$.
Gaussian process prior over functions $y = (y(x_1), \ldots, y(x_N))$:
$$p(y) = \frac{1}{\sqrt{(2\pi)^N \det C}} \exp\left(-\tfrac{1}{2} y^T C^{-1} y\right)$$
Likelihood (observation model): $p(t \mid y(x))$, e.g. noise-free classification $p(t \mid y(x)) = \Theta(t\, y(x))$.
$$Z = \int dy \prod_i p(t_i \mid y(x_i))\, p(y)$$
Same structure as Example I: factorized and multivariate Gaussian (Opper & Winther, 2000; Minka, 2001).

Expectation Consistent (Helmholtz) Free Energy
Exchange the average with respect to $q(x)$ with one over a simpler distribution $s(x)$:
$$s(x) = \frac{1}{Z_s(\lambda_s)} \exp\left(\lambda_s^T g(x)\right)$$
Approximation:
$$\left\langle f_r(x) \exp\left(-\lambda_q^T g(x)\right) \right\rangle_q \approx \left\langle f_r(x) \exp\left(-\lambda_q^T g(x)\right) \right\rangle_s$$
The parameters $\lambda_q$, $\lambda_s$ are to be optimized in a suitable way:
$$-\ln Z \approx -\ln Z_q - \ln \left\langle f_r(x) \exp\left(-\lambda_q^T g(x)\right) \right\rangle_s$$
$$= -\ln \int dx\, f_q(x) \exp\left(\lambda_q^T g(x)\right) - \ln \int dx\, f_r(x) \exp\left((\lambda_s - \lambda_q)^T g(x)\right) + \ln \int dx\, \exp\left(\lambda_s^T g(x)\right)$$

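Determining the Parameters
Expectation consistency:
$$\frac{\partial \ln Z_{\text{EC}}}{\partial \lambda_q} = 0: \quad \langle g(x) \rangle_q = \langle g(x) \rangle_r$$
$$\frac{\partial \ln Z_{\text{EC}}}{\partial \lambda_s} = 0: \quad \langle g(x) \rangle_r = \langle g(x) \rangle_s$$
where
$$q(x) = \frac{1}{Z_q(\lambda_q)} f_q(x) \exp\left(\lambda_q^T g(x)\right)$$
$$r(x) = \frac{1}{Z_r(\lambda_r)} f_r(x) \exp\left(\lambda_r^T g(x)\right) \quad\text{with}\quad \lambda_r = \lambda_s - \lambda_q$$
$$s(x) = \frac{1}{Z_s(\lambda_s)} \exp\left(\lambda_s^T g(x)\right)$$
$$Z \approx \frac{Z_q Z_r}{Z_s}$$
The approximation is symmetric in $q(x)$ and $r(x)$; $s(x)$ is the “separator”.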

Why it Works
Neither $q$ nor $r$ is a good approximation to $p$, but marginal distributions and moments can be precise!
With $g(x) = \left(x_1, -\frac{x_1^2}{2}, \ldots, x_N, -\frac{x_N^2}{2}\right)$ and $\lambda = (\gamma_1, \Lambda_1, \ldots, \gamma_N, \Lambda_N)$:
$$q(x) = \prod_i q_i(x_i), \qquad q_i(x_i) \propto \Psi_i(x_i) \exp\left(\gamma_{q,i} x_i - \frac{\Lambda_{q,i} x_i^2}{2}\right)$$
The central limit theorem saves us: the details of the distribution of the marginalized variables are not important, only the first and second moments. This is the cavity method (Onsager 1936; Mezard, Parisi & Virasoro 1987).
Exact under some conditions: “dense models”, many variables, no dominating interactions and not too strong interactions.
Other complications remain, such as non-ergodicity (RSB).

Non-trivial estimates in EC
• Marginal distributions $q(x_i)$ (factorized moments):
$$q(x) \propto \prod_i \Psi_i(x_i) \exp\left(\gamma_q^T x - x^T \Lambda_q x / 2\right), \qquad q(x_i) \propto \Psi_i(x_i) \exp\left(\gamma_{q,i} x_i - x_i^2 \Lambda_{q,i} / 2\right)$$
• Correlations: $r(x)$ is a global Gaussian approximation,
$$r(x) \propto \exp\left(\gamma_r^T x - x^T (\Lambda_r - J) x / 2\right)$$
with covariance
$$C(x_i, x_j) = \langle x_i x_j \rangle_{r(x)} - \langle x_i \rangle_{r(x)} \langle x_j \rangle_{r(x)} = \left[(\Lambda_r - J)^{-1}\right]_{ij}$$
• The free energy $-\ln Z_{\text{EC}} \approx -\ln Z$. $Z$ is the marginal likelihood (or evidence) of the model.
• Supervised learning: predictive distribution and leave-one-out (Opper & Winther, 2000).

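Non-Convex Optimization
The partition function $Z(\lambda) = \int dx\, f(x) \exp\left(\lambda^T g(x)\right)$ is log-convex in $\lambda$; the Hessian of $\ln Z$ is a covariance matrix:
$$H = \frac{\partial^2 \ln Z}{\partial \lambda\, \partial \lambda^T} = \left\langle g(x)\, g(x)^T \right\rangle - \langle g(x) \rangle \langle g(x) \rangle^T$$
EC is a non-convex optimization problem, like Bethe and variational:
$$-\ln Z_{\text{EC}}(\lambda_q, \lambda_s) = -\ln Z_q(\lambda_q) - \ln Z_r(\lambda_s - \lambda_q) + \ln Z_s(\lambda_s)$$
$$= -\ln \int dx\, f_q(x) \exp\left(\lambda_q^T g(x)\right) - \ln \int dx\, f_r(x) \exp\left((\lambda_s - \lambda_q)^T g(x)\right) + \ln \int dx\, \exp\left(\lambda_s^T g(x)\right)$$
Optimize with a single loop (no convergence guarantee) or a double loop (slow).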

Single Loop: Objective
Expectation consistency
$$\langle g(x) \rangle_q = \langle g(x) \rangle_r = \langle g(x) \rangle_s$$
with
$$q(x) = \frac{1}{Z_q(\lambda_q)} f_q(x) \exp\left(\lambda_q^T g(x)\right)$$
$$r(x) = \frac{1}{Z_r(\lambda_r)} f_r(x) \exp\left(\lambda_r^T g(x)\right) \quad\text{with}\quad \lambda_r = \lambda_s - \lambda_q$$
$$s(x) = \frac{1}{Z_s(\lambda_s)} \exp\left(\lambda_s^T g(x)\right)$$
Send messages $r \to q \to r \to \ldots$ and make $s$ consistent.

Single Loop: Propagation Algorithms
1. Send messages from $r$ to $q$
• Calculate the separator $s(x)$. Solve for $\lambda_s$: $\langle g(x) \rangle_s = \mu_r(t) \equiv \langle g(x) \rangle_{r(x;t)}$
• Update $q(x)$: $\lambda_q(t+1) := \lambda_s - \lambda_r(t)$
2. Send messages from $q$ to $r$
• Calculate the separator $s(x)$. Solve for $\lambda_s$: $\langle g(x) \rangle_s = \mu_q(t+1) \equiv \langle g(x) \rangle_{q(x;t+1)}$
• Update $r(x)$: $\lambda_r(t+1) := \lambda_s - \lambda_q(t+1)$
Expectation Propagation (EP): sequential factor-by-factor update.

Single Loop Details
$q(x)$ non-Gaussian, factorized or on a spanning tree, and $r(x)$ multivariate Gaussian. Complexity $O(N^3)$.
Factorized moments, $g(x) = \left(x_1, -\frac{x_1^2}{2}, x_2, -\frac{x_2^2}{2}, \ldots, x_N, -\frac{x_N^2}{2}\right)$:
• Gaussian $s(x) = \prod_i s_i(x_i)$ with $s_i(x_i) \propto \exp\left(\gamma_{s,i} x_i - \Lambda_{s,i} x_i^2 / 2\right)$.
• Moment matching to the mean $m_i$ and variance $v_i$ of $q$ and $r$: $\gamma_{s,i} := m_i / v_i$ and $\Lambda_{s,i} := 1 / v_i$.
All second moments on a spanning tree:
• $q(x)$ moments can be inferred by (exact) message passing.
• $s(x)$ multivariate Gaussian on a spanning tree; solve using the tree decomposition of $Z$.
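The single-loop updates with factorized moments can be sketched for the Ising model of Example I. This is an illustrative implementation under stated assumptions, not the authors' code: NumPy, undamped updates, weak couplings so that $\Lambda_r - J$ stays positive definite, and a fixed iteration count instead of a convergence test. The names `ec_single_loop_ising` and `exact_means` and all parameter values are hypothetical. For binary spins $x_i^2 = 1$, so the mean of $q_i$ is $\tanh(\theta_i + \gamma_{q,i})$ and its variance is $1 - m_i^2$.

```python
import itertools
import numpy as np

def ec_single_loop_ising(theta, J, n_iter=200):
    """Undamped EC single loop: q factorized binary, r Gaussian, s factorized Gaussian."""
    n = len(theta)
    gamma_r = np.zeros(n)
    # Assumed initialization: start with Lambda_r - J positive definite.
    lam_r = np.full(n, 1.0 + max(np.linalg.eigvalsh(J).max(), 0.0))
    for _ in range(n_iter):
        # 1. Message r -> q: moments of r, match s, then lambda_q = lambda_s - lambda_r.
        chi = np.linalg.inv(np.diag(lam_r) - J)      # covariance of r, O(N^3)
        m_r, v_r = chi @ gamma_r, np.diag(chi)
        gamma_s, lam_s = m_r / v_r, 1.0 / v_r
        gamma_q, lam_q = gamma_s - gamma_r, lam_s - lam_r
        # 2. Message q -> r: moments of q, match s, then lambda_r = lambda_s - lambda_q.
        m_q = np.tanh(theta + gamma_q)               # lam_q drops out since x_i^2 = 1
        v_q = 1.0 - m_q ** 2
        gamma_s, lam_s = m_q / v_q, 1.0 / v_q
        gamma_r, lam_r = gamma_s - gamma_q, lam_s - lam_q
    chi = np.linalg.inv(np.diag(lam_r) - J)
    return m_q, chi @ gamma_r, chi

def exact_means(theta, J):
    """Exact <x_i> by enumerating all 2^N states (only feasible for small N)."""
    states = np.array(list(itertools.product([-1.0, 1.0], repeat=len(theta))))
    log_w = states @ theta + 0.5 * np.einsum("si,ij,sj->s", states, J, states)
    w = np.exp(log_w - log_w.max())
    return (w / w.sum()) @ states

theta = np.array([0.2, -0.1, 0.3, 0.0])
J = 0.1 * (np.ones((4, 4)) - np.eye(4))   # weak, dense couplings, zero diagonal

m_q, m_r, chi = ec_single_loop_ising(theta, J)
m_exact = exact_means(theta, J)
print("EC means (q):", m_q)
print("EC means (r):", m_r)
print("exact means: ", m_exact)
```

At a fixed point the means and variances of $q$ and $r$ agree, and the off-diagonal entries of `chi` give the correlation estimates $[(\Lambda_r - J)^{-1}]_{ij}$. For stronger couplings the undamped iteration may fail to converge, in which case damping or a double-loop scheme would be needed.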
