Expectation Consistent Approximate Inference Ole Winther - PowerPoint PPT Presentation


1. Expectation Consistent Approximate Inference
Ole Winther, Informatics and Mathematical Modelling, Technical University of Denmark, DK-2800 Lyngby, Denmark, owi@imm.dtu.dk
In collaboration with Manfred Opper, ISIS, School of Electronics and Computer Science, University of Southampton, SO17 1BJ, United Kingdom, mo@ecs.soton.ac.uk

2. Motivation
• Contemporary machine learning uses complex, flexible probabilistic models.
• Bayesian inference is typically intractable.
• Approximate methods of polynomial complexity are needed.
• VB, Bethe, EP and EC: use a tractable factorization of the original model.
• EC: Expectation Consistency between two distributions, e.g. discrete and Gaussian.

3. Exact Inference in Tree Graphs
Bethe – tree factorization, e.g. the chain
$p(x) = \frac{1}{Z} f_{12}\, f_{23}\, f_1\, f_2\, f_3$
Write $p(x)$ in terms of the marginals $q_i(x_i)$ and $q_{ij}(x_i, x_j)$:
$p(x) = \frac{q_{12}(x_1, x_2)\, q_{23}(x_2, x_3)}{q_2(x_2)}, \qquad Z = \frac{Z_{12}\, Z_{23}}{Z_2}$
Message-passing: effective inference for $p(x)$ discrete or Gaussian.
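The tree identity above can be checked numerically. Below is a minimal sketch for three binary variables on the chain 1-2-3; the factor values are made-up assumptions, and the marginals $q_{12}$, $q_{23}$, $q_2$ are the exact marginals of $p$.

```python
# Toy check of the tree identity p(x) = q12(x1,x2) q23(x2,x3) / q2(x2) on a chain 1-2-3.
import numpy as np
from itertools import product

f1 = {+1: 1.2, -1: 0.8}; f2 = {+1: 0.5, -1: 1.5}; f3 = {+1: 1.0, -1: 1.3}
f12 = lambda a, b: np.exp(0.7 * a * b)
f23 = lambda b, c: np.exp(-0.4 * b * c)

vals = [-1, 1]
p = {(a, b, c): f12(a, b) * f23(b, c) * f1[a] * f2[b] * f3[c]
     for a, b, c in product(vals, repeat=3)}
Z = sum(p.values())
p = {k: v / Z for k, v in p.items()}                       # normalized joint

q12 = {(a, b): sum(p[a, b, c] for c in vals) for a, b in product(vals, repeat=2)}
q23 = {(b, c): sum(p[a, b, c] for a in vals) for b, c in product(vals, repeat=2)}
q2 = {b: sum(p[a, b, c] for a, c in product(vals, repeat=2)) for b in vals}

for a, b, c in product(vals, repeat=3):
    assert np.isclose(p[a, b, c], q12[a, b] * q23[b, c] / q2[b])   # exact on a tree
print("tree factorization verified")
```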

4. Bethe Approximation
Bethe approximation – treat $p(x)$, e.g.
$p(x) = \frac{1}{Z} f_{12}\, f_{23}\, f_{13}\, f_1\, f_2\, f_3$
as if it were a tree graph:
$q(x) = \frac{q_{12}(x_1, x_2)\, q_{23}(x_2, x_3)\, q_{13}(x_1, x_3)}{q_1(x_1)\, q_2(x_2)\, q_3(x_3)}$
Works extremely well in "sparse systems", e.g. low-density decoding.
Disadvantage: over-counting – $q(x)$ is not a density.

5. Variational Bayes (VB)
Minimize the KL-divergence within a restricted tractable family $q(x) = \prod_i q_i(x_i)$:
$q_i(x_i) = \operatorname{argmin}\, \mathrm{KL}[q(x)\,\|\,p(x)] \propto \exp\big\langle \ln p(x) \big\rangle_{q \setminus q_i(x_i)}$
Gaussian example: $q(x) = \mathcal{N}(x; m_q, C_q)$ and $p(x) = \mathcal{N}(x; m, C)$ give
$m_q = m \quad \text{and} \quad (C_q)_{ij} = \delta_{ij}\, \frac{1}{(C^{-1})_{ii}}$
In general (factorized) VB is reliable on the mean, but under-estimates the width of the distribution (see e.g. MacKay, 2003; Opper & Winther, 2004). Important for parameter estimation (see e.g. Minka & Lafferty).
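A minimal numerical sketch of the Gaussian example above, assuming a particular two-dimensional covariance matrix for illustration: the factorized VB solution recovers the mean, but its variances $1/(C^{-1})_{ii}$ are smaller than the true marginal variances $C_{ii}$.

```python
# Factorized VB applied to a correlated Gaussian p(x) = N(x; m, C).
# The covariance matrix C below is an illustrative assumption, not from the slides.
import numpy as np

m = np.array([0.0, 1.0])                 # true mean
C = np.array([[1.0, 0.8],
              [0.8, 1.0]])               # true covariance (correlated)

C_inv = np.linalg.inv(C)

# Factorized VB solution from the slide: m_q = m, (C_q)_ii = 1 / (C^{-1})_ii
m_q = m.copy()
var_q = 1.0 / np.diag(C_inv)

print("true marginal variances:", np.diag(C))    # [1.0, 1.0]
print("VB variances           :", var_q)         # [0.36, 0.36] -- under-estimated
```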

6. Motivating EC and Overview
We are looking for a tractable approximation that
• can handle "dense graphs" (better than Bethe+),
• estimates correlations (better than VB).
Outline:
• Free energy
• Why it works – the central limit theorem
• Algorithmics and the connection to EP
• Simulations, conclusions and outlook

7. Expectation Consistent (EC) free energy
Calculate the partition function
$Z = \int dx\, f(x) = \int dx\, f_q(x)\, f_r(x)$
Problem: $Z$ intractable – the integral is not analytical and/or the summation is exponential in the number of variables $N$.
Introduce a tractable distribution
$q(x) = \frac{1}{Z_q(\lambda_q)}\, f_q(x) \exp\big(\lambda_q^T g(x)\big)$
where $Z_q$ can be calculated in polynomial time. Then
$Z = Z_q\, \frac{\int dx\, f_r(x)\, f_q(x) \exp\big(\lambda_q^T g(x)\big) \exp\big(-\lambda_q^T g(x)\big)}{\int dx\, f_q(x) \exp\big(\lambda_q^T g(x)\big)} = Z_q \Big\langle f_r(x) \exp\big(-\lambda_q^T g(x)\big) \Big\rangle_q$
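The identity $Z = Z_q \langle f_r(x) \exp(-\lambda_q^T g(x)) \rangle_q$ holds exactly for any $\lambda_q$. A tiny sketch for a single binary variable, with made-up factors (an illustrative assumption, not from the slides):

```python
# Numerical check of Z = Z_q * < f_r(x) exp(-lambda_q^T g(x)) >_q for one spin x = +/-1.
import numpy as np

f_q = {+1: 1.3, -1: 0.6}                  # assumed single-site factor
f_r = {+1: 0.9, -1: 1.7}                  # assumed "interaction" factor
g = {+1: np.array([+1.0, -0.5]), -1: np.array([-1.0, -0.5])}   # g(x) = (x, -x^2/2)
lam_q = np.array([0.4, 0.2])              # arbitrary

Z = sum(f_q[x] * f_r[x] for x in (+1, -1))
Z_q = sum(f_q[x] * np.exp(lam_q @ g[x]) for x in (+1, -1))
q = {x: f_q[x] * np.exp(lam_q @ g[x]) / Z_q for x in (+1, -1)}
avg = sum(q[x] * f_r[x] * np.exp(-lam_q @ g[x]) for x in (+1, -1))
print(Z, Z_q * avg)                        # the two numbers agree exactly
```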

8. Free energy
Free energy, exact:
$-\ln Z = -\ln Z_q - \ln \Big\langle f_r(x) \exp\big(-\lambda_q^T g(x)\big) \Big\rangle_q$
Variational approximation, use Jensen's inequality $\ln \langle f(x) \rangle \ge \langle \ln f(x) \rangle$:
$-\ln Z \le -\ln Z_q - \langle \ln f_r(x) \rangle_q + \lambda_q^T \langle g(x) \rangle_q$
Find $\lambda_q$ by minimizing the upper bound.
Better: average $f_r(x) \exp\big(-\lambda_q^T g(x)\big)$ approximately – that way more of the averaging is retained.
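A quick check of the Jensen upper bound for the same kind of one-variable toy (again with made-up factors and an arbitrary $\lambda_q$):

```python
# Verify -ln Z <= -ln Z_q - <ln f_r>_q + lambda_q^T <g>_q for one spin x = +/-1.
import numpy as np

f_q = {+1: 1.3, -1: 0.6}
f_r = {+1: 0.9, -1: 1.7}
g = {+1: np.array([+1.0, -0.5]), -1: np.array([-1.0, -0.5])}
lam_q = np.array([0.4, 0.2])

Z = sum(f_q[x] * f_r[x] for x in (+1, -1))
Z_q = sum(f_q[x] * np.exp(lam_q @ g[x]) for x in (+1, -1))
q = {x: f_q[x] * np.exp(lam_q @ g[x]) / Z_q for x in (+1, -1)}

exact = -np.log(Z)
bound = (-np.log(Z_q)
         - sum(q[x] * np.log(f_r[x]) for x in (+1, -1))
         + lam_q @ sum(q[x] * g[x] for x in (+1, -1)))
print(exact, bound)                        # bound >= exact for any lambda_q
```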

9. Expectation consistent approximation
Define $g(x)$ such that both
$q(x) = \frac{1}{Z_q(\lambda_q)}\, f_q(x) \exp\big(\lambda_q^T g(x)\big)$
$r(x) = \frac{1}{Z_r(\lambda_r)}\, f_r(x) \exp\big(\lambda_r^T g(x)\big)$
are tractable.
This excludes some models that are tractable in the variational approach (without further approximations).

10. Example I – the Ising model
Binary variables – spins – $x_i = \pm 1$ with pairwise interactions:
$f_q(x) = \prod_i \Psi_i(x_i), \qquad \Psi_i(x_i) = \big[\delta(x_i + 1) + \delta(x_i - 1)\big]\, e^{\theta_i x_i}$
$f_r(x) = \exp\Big(\sum_{i>j} x_i J_{ij} x_j\Big) = \exp\Big(\tfrac{1}{2} x^T J x\Big)$
E.g. set $g(x)$ to first and second order:
$g(x) = \Big(x_1, -\tfrac{x_1^2}{2}, x_2, -\tfrac{x_2^2}{2}, \ldots, x_N, -\tfrac{x_N^2}{2}\Big)$
$q(x)$ – a factorized binary distribution; $r(x)$ – a multivariate Gaussian. The interpretation of $g(x)$ will become clear shortly.
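A small sketch of this factorization in code; the fields $\theta$ and couplings $J$ below are illustrative assumptions:

```python
# Ising-model factorization: f_q is the product of single-site terms, f_r the pairwise
# coupling term, and g(x) stacks first- and second-order statistics.
import numpy as np

rng = np.random.default_rng(0)
N = 4
theta = rng.normal(scale=0.5, size=N)               # fields in Psi_i(x_i) = e^{theta_i x_i}
J = rng.normal(scale=0.5, size=(N, N))
J = np.triu(J, 1); J = J + J.T                       # symmetric couplings, zero diagonal

def f_q(x):
    """prod_i Psi_i(x_i) for x_i in {-1, +1}: the delta functions just restrict the support."""
    return np.exp(theta @ x)

def f_r(x):
    """exp(0.5 x^T J x) = exp(sum_{i>j} x_i J_ij x_j)."""
    return np.exp(0.5 * x @ J @ x)

def g(x):
    """g(x) = (x_1, -x_1^2/2, ..., x_N, -x_N^2/2), the statistics shared by q, r and s."""
    return np.stack([x, -0.5 * x**2], axis=1).ravel()

x = np.array([1.0, -1.0, 1.0, 1.0])                  # an example spin configuration
print(f_q(x) * f_r(x), g(x))
```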

11. Bethe and EC factorization
$Z_{\mathrm{Bethe}} = \frac{Z_{12}\, Z_{23}\, Z_{13}}{Z_1\, Z_2\, Z_3}$
$Z_{\mathrm{EC}}$ will be similar in spirit:
$Z_{\mathrm{EC}} = \frac{Z_q\, Z_r}{Z_{s(\mathrm{eparator})}}$

12. Example II – Gaussian processes
Supervised learning: inputs $x_1, \ldots, x_N$ and targets $t_1, \ldots, t_N$.
Gaussian process prior over functions $y = (y(x_1), \ldots, y(x_N))$:
$p(y) = \frac{1}{\sqrt{(2\pi)^N \det C}} \exp\Big(-\tfrac{1}{2} y^T C^{-1} y\Big)$
Likelihood / observation model $p(t \,|\, y(x))$, e.g. noise-free classification $p(t \,|\, y(x)) = \Theta(t\, y(x))$:
$Z = \int dy\, \prod_i p(t_i \,|\, y(x_i))\, p(y)$
Same structure as Example I – factorized and multivariate Gaussian (Opper & Winther, 2000; Minka, 2001).
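A sketch of this factorized-times-Gaussian structure for GP classification. The RBF kernel, the toy data and the naive Monte Carlo estimate of $Z$ are all illustrative assumptions, used only to make the structure concrete:

```python
# GP classification structure: Gaussian prior N(0, C) times factorized step-function
# likelihoods Theta(t_i y_i). Z is estimated by naive sampling from the prior.
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(10, 1))                  # toy inputs
t = np.sign(X[:, 0])                                   # toy +/-1 targets

def rbf_kernel(X, lengthscale=1.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

C = rbf_kernel(X) + 1e-6 * np.eye(len(X))              # GP prior covariance

# Z = E_{y ~ N(0, C)} [ prod_i Theta(t_i y_i) ], estimated by sampling the prior.
L = np.linalg.cholesky(C)
y = (L @ rng.standard_normal((len(X), 20000))).T        # samples from p(y)
Z_mc = np.mean(np.all(t * y > 0, axis=1))
print("Monte Carlo estimate of Z:", Z_mc)
```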

13. Expectation Consistent (Helmholtz) Free Energy
Exchange the average w.r.t. $q(x)$ with one over a simpler distribution $s(x)$:
$s(x) = \frac{1}{Z_s(\lambda_s)} \exp\big(\lambda_s^T g(x)\big)$
Approximation:
$\Big\langle f_r(x) \exp\big(-\lambda_q^T g(x)\big) \Big\rangle_q \approx \Big\langle f_r(x) \exp\big(-\lambda_q^T g(x)\big) \Big\rangle_s$
Parameters $\lambda_q, \lambda_s$ are to be optimized in a suitable way:
$-\ln Z \approx -\ln Z_q - \ln \Big\langle f_r(x) \exp\big(-\lambda_q^T g(x)\big) \Big\rangle_s$
$= -\ln \int dx\, f_q(x) \exp\big(\lambda_q^T g(x)\big) - \ln \int dx\, f_r(x) \exp\big((\lambda_s - \lambda_q)^T g(x)\big) + \ln \int dx\, \exp\big(\lambda_s^T g(x)\big)$

14. Determining the Parameters
Expectation consistency:
$\frac{\partial \ln Z_{\mathrm{EC}}}{\partial \lambda_q} = 0: \quad \langle g(x) \rangle_q = \langle g(x) \rangle_r$
$\frac{\partial \ln Z_{\mathrm{EC}}}{\partial \lambda_s} = 0: \quad \langle g(x) \rangle_r = \langle g(x) \rangle_s$
where
$q(x) = \frac{1}{Z_q(\lambda_q)}\, f_q(x) \exp\big(\lambda_q^T g(x)\big)$
$r(x) = \frac{1}{Z_r(\lambda_r)}\, f_r(x) \exp\big(\lambda_r^T g(x)\big), \quad \text{with } \lambda_r = \lambda_s - \lambda_q$
$s(x) = \frac{1}{Z_s(\lambda_s)} \exp\big(\lambda_s^T g(x)\big)$
$Z \approx \frac{Z_q\, Z_r}{Z_s}$
The approximation is symmetric in $q(x)$ and $r(x)$; $s(x)$ is the "separator".

15. Why it Works
Neither $q$ nor $r$ is a good approximation to $p$, but marginal distributions and moments can be precise!
With $g(x) = \big(x_1, -\tfrac{x_1^2}{2}, \ldots, x_N, -\tfrac{x_N^2}{2}\big)$ and $\lambda = (\gamma_1, \Lambda_1, \ldots, \gamma_N, \Lambda_N)$:
$q(x) = \prod_i q_i(x_i), \qquad q_i(x_i) \propto \Psi_i(x_i) \exp\big(\gamma_{q,i} x_i - \Lambda_{q,i} x_i^2 / 2\big)$
The central limit theorem saves us: the details of the distribution of the marginalized variables are not important, only the first and second moments. Cavity method (Onsager, 1936; Mezard, Parisi & Virasoro, 1987).
Exact under some conditions: "dense models", many variables, no dominating interactions and not too strong interactions. Other complications such as non-ergodicity (RSB).

16. Non-trivial estimates in EC
• Marginal distributions $q(x_i)$ (factorized moments):
$q(x) \propto \prod_i \Psi_i(x_i) \exp\big(\gamma_q^T x - x^T \Lambda_q x/2\big), \qquad q(x_i) \propto \Psi_i(x_i) \exp\big(\gamma_{q,i} x_i - x_i^2 \Lambda_{q,i}/2\big)$
• Correlations: $r(x)$ is a global Gaussian approximation (see the sketch below)
$r(x) \propto \exp\big(\gamma_r^T x - x^T (\Lambda_r - J) x/2\big)$
with covariance $C(x_i, x_j) = \langle x_i x_j \rangle_{r(x)} - \langle x_i \rangle_{r(x)} \langle x_j \rangle_{r(x)} = \big[(\Lambda_r - J)^{-1}\big]_{ij}$.
• The free energy $-\ln Z_{\mathrm{EC}} \approx -\ln Z$; $Z$ is the marginal likelihood (or evidence) of the model.
• Supervised learning: predictive distribution and leave-one-out (Opper & Winther, 2000).
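A tiny sketch of the correlation estimate: given site parameters $\Lambda_r$ (made-up values here, as if returned by an EC fit) and couplings $J$, the covariance estimate is $(\Lambda_r - J)^{-1}$.

```python
# Covariance estimate from the Gaussian r(x): C = (Lambda_r - J)^{-1}.
import numpy as np

J = np.array([[0.0, 0.3, -0.2],
              [0.3, 0.0, 0.1],
              [-0.2, 0.1, 0.0]])           # assumed couplings
Lam_r = np.array([1.4, 1.2, 1.5])          # assumed site parameters from an EC fit

Cov = np.linalg.inv(np.diag(Lam_r) - J)    # C(x_i, x_j) = [(Lambda_r - J)^{-1}]_ij
print(np.round(Cov, 3))
```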

17. Non-Convex Optimization
The partition function $Z(\lambda) = \int dx\, f(x) \exp\big(\lambda^T g(x)\big)$ is convex in $\lambda$:
$H = \frac{\partial^2 \ln Z}{\partial \lambda\, \partial \lambda^T} = \big\langle g(x)\, g(x)^T \big\rangle - \langle g(x) \rangle \langle g(x) \rangle^T$
EC is a non-convex optimization – like Bethe and variational:
$-\ln Z_{\mathrm{EC}}(\lambda_q, \lambda_s) = -\ln Z_q(\lambda_q) - \ln Z_r(\lambda_s - \lambda_q) + \ln Z_s(\lambda_s)$
$= -\ln \int dx\, f_q(x) \exp\big(\lambda_q^T g(x)\big) - \ln \int dx\, f_r(x) \exp\big((\lambda_s - \lambda_q)^T g(x)\big) + \ln \int dx\, \exp\big(\lambda_s^T g(x)\big)$
Optimize with a single loop (no guarantee of convergence) or a double loop (slow).

18. Single Loop – Objective
Expectation consistency, $\langle g(x) \rangle_q = \langle g(x) \rangle_r = \langle g(x) \rangle_s$, with
$q(x) = \frac{1}{Z_q(\lambda_q)}\, f_q(x) \exp\big(\lambda_q^T g(x)\big)$
$r(x) = \frac{1}{Z_r(\lambda_r)}\, f_r(x) \exp\big(\lambda_r^T g(x)\big), \quad \lambda_r = \lambda_s - \lambda_q$
$s(x) = \frac{1}{Z_s(\lambda_s)} \exp\big(\lambda_s^T g(x)\big)$
Send messages $r \to q \to r \to \ldots$ and make $s$ consistent.

19. Single Loop – Propagation Algorithms
1. Send messages from $r$ to $q$:
• Calculate the separator $s(x)$. Solve for $\lambda_s$: $\langle g(x) \rangle_s = \mu_r(t) \equiv \langle g(x) \rangle_{r(x; t)}$
• Update $q(x)$: $\lambda_q(t+1) := \lambda_s - \lambda_r(t)$
2. Send messages from $q$ to $r$:
• Calculate the separator $s(x)$. Solve for $\lambda_s$: $\langle g(x) \rangle_s = \mu_q(t+1) \equiv \langle g(x) \rangle_{q(x; t+1)}$
• Update $r(x)$: $\lambda_r(t+1) := \lambda_s - \lambda_q(t+1)$
Expectation Propagation (EP): sequential factor-by-factor update. (A runnable sketch of this loop for the Ising example follows below.)
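A minimal single-loop sketch for the Ising example, using the parallel $r \to q \to r$ updates through the Gaussian separator (the sequential EP variant instead updates factor by factor). Problem size, coupling scale and damping are illustrative assumptions, and convergence is not guaranteed, as the slides note:

```python
# Single-loop EC for the Ising example: factorized binary q, multivariate Gaussian r,
# Gaussian separator s, moment matching gamma_s = m/v, Lambda_s = 1/v.
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
N = 8
theta = rng.normal(scale=0.3, size=N)                # fields theta_i
J = rng.normal(scale=0.1, size=(N, N))
J = np.triu(J, 1); J = J + J.T                       # symmetric couplings J_ij, J_ii = 0

def separator(mean, second):
    """Match the Gaussian separator s to given first/second moments: Lam = 1/v, gam = m/v."""
    var = second - mean**2
    return mean / var, 1.0 / var                     # (gamma_s, Lambda_s)

# Initialize from q with lambda_q = 0: q_i(x_i) ~ exp(theta_i x_i) on x_i = +/-1.
gam_q = np.zeros(N); Lam_q = np.zeros(N)
m_q = np.tanh(theta + gam_q)
gam_s, Lam_s = separator(m_q, np.ones(N))            # spins: <x_i^2>_q = 1
gam_r, Lam_r = gam_s - gam_q, Lam_s - Lam_q

eta = 0.5                                            # damping, for robustness
for it in range(200):
    # r is Gaussian with precision (Lam_r - J); compute its first/second moments.
    A = np.diag(Lam_r) - J
    Cov = np.linalg.inv(A)
    m_r = Cov @ gam_r
    second_r = np.diag(Cov) + m_r**2
    # r -> q: match separator to r's moments, then lambda_q := lambda_s - lambda_r.
    gam_s, Lam_s = separator(m_r, second_r)
    gam_q = (1 - eta) * gam_q + eta * (gam_s - gam_r)
    Lam_q = (1 - eta) * Lam_q + eta * (Lam_s - Lam_r)
    # q -> r: q_i(x_i) ~ exp((theta_i + gam_q_i) x_i) on x_i = +/-1 (the Lam_q term is constant).
    m_q = np.tanh(theta + gam_q)
    gam_s, Lam_s = separator(m_q, np.ones(N))
    gam_r = (1 - eta) * gam_r + eta * (gam_s - gam_q)
    Lam_r = (1 - eta) * Lam_r + eta * (Lam_s - Lam_q)

# Compare EC means with exact enumeration (feasible only because N is tiny).
states = np.array(list(product([-1.0, 1.0], repeat=N)))
logw = states @ theta + 0.5 * np.einsum('si,ij,sj->s', states, J, states)
w = np.exp(logw - logw.max()); w /= w.sum()
print("exact means:", np.round(w @ states, 3))
print("EC means   :", np.round(np.tanh(theta + gam_q), 3))
```

The `separator` function is exactly the moment matching of slide 20, $\gamma_{s,i} = m_i/v_i$ and $\Lambda_{s,i} = 1/v_i$; the covariance of the fitted $r$ gives the correlation estimate $(\Lambda_r - J)^{-1}$ of slide 16.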

20. Single Loop Details
$q(x)$ non-Gaussian, factorized or on a spanning tree, and $r(x)$ multivariate Gaussian. Complexity $O(N^3)$.
Factorized moments, $g(x) = \big(x_1, -\tfrac{x_1^2}{2}, x_2, -\tfrac{x_2^2}{2}, \ldots, x_N, -\tfrac{x_N^2}{2}\big)$:
Gaussian $s(x) = \prod_i s_i(x_i)$ with $s_i(x_i) \propto \exp\big(\gamma_{s,i} x_i - \Lambda_{s,i} x_i^2/2\big)$.
Moment matching to the mean and variance of $q$ and $r$: $\gamma_{s,i} := m_i / v_i$ and $\Lambda_{s,i} := 1 / v_i$.
All second moments on a spanning tree: $q(x)$ moments can be inferred by (exact) message passing; $s(x)$ is a multivariate Gaussian on a spanning tree, solved using a tree-decomposition of $Z$.
