Variational Inference


  1. Variational inference. Probabilistic Graphical Models, Sharif University of Technology, Spring 2016, Soleymani. Some slides are adapted from Xing's slides.

  2. Inference queries
     Nodes: X = {X_1, …, X_n}; evidence variables: X_V; query variables: Z = X \ X_V
     • Marginal probability (likelihood): P(x_v) = Σ_Z P(Z, x_v)
     • Conditional probability (a posteriori belief): P(Z | x_v) = P(Z, x_v) / Σ_Z P(Z, x_v)
     • Marginalized conditional probability: P(Y | x_v) = Σ_W P(Y, W, x_v) / Σ_Y Σ_W P(Y, W, x_v), where Z = Y ∪ W
     • Most probable assignment for some variables of interest given the evidence X_V = x_v: y* = argmax_Y P(Y | x_v)
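
To ground these queries, here is a tiny brute-force sketch (toy numbers and variable names of my own, not from the slides) that answers each kind of query by summing a tabulated joint distribution over three binary variables.

    import numpy as np

    # Toy joint P(X1, X2, X3) over three binary variables, stored as a 2x2x2 table.
    rng = np.random.default_rng(0)
    P = rng.random((2, 2, 2))
    P /= P.sum()                                   # normalize so the table is a distribution

    # Evidence: X3 = 1.  Query variables: Z = {X1, X2}.
    p_evidence = P[:, :, 1].sum()                  # marginal likelihood P(x3 = 1)
    posterior = P[:, :, 1] / p_evidence            # a posteriori belief P(X1, X2 | x3 = 1)
    p_x1 = posterior.sum(axis=1)                   # marginalized conditional P(X1 | x3 = 1), with W = {X2}
    x_star = np.unravel_index(posterior.argmax(), posterior.shape)   # most probable assignment

    print(p_evidence, p_x1, x_star)

Exact inference by enumeration like this costs time exponential in the number of variables, which is what motivates the structured and approximate methods on the following slides.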

  3. Exact methods for inference
     • Variable elimination
     • Message passing (exploiting shared terms)
       • Sum-product (belief propagation)
       • Max-product
     • Junction tree

  4. Junction tree
     • A general algorithm for graphs with cycles
     • Message passing on junction trees
     [Figure: neighboring cliques C_i and C_j linked by the separator S_ij exchange messages m_ij and m_ji.]

  5. Why approximate inference?
     • The computational complexity of the junction tree algorithm is at least K^C, where C is the size of the largest elimination clique (the largest clique in the triangulated graph) and K is the number of states per variable.
       • The tree-width of an N × N grid is N.
     • For a distribution P associated with a complex graph, computing the marginal (or conditional) probability of arbitrary random variable(s) is intractable.

  6. Learning and inference
     • Learning is itself an inference problem, or usually requires inference:
       • In Bayesian inference, one of the principal foundations of machine learning, learning is just an inference problem.
       • In the maximum likelihood approach, we also need inference when we have incomplete data or when we work with an undirected model.

  7. Approximate inference
     • Approximate inference techniques:
       • Variational algorithms
         • Loopy belief propagation
         • Mean field approximation
         • Expectation propagation
       • Stochastic simulation / sampling methods

  8. Variational methods
     • "Variational" is a general term for optimization-based formulations.
     • Many problems can be expressed as optimization problems in which the quantity being optimized is a functional.
     • Variational inference is a deterministic framework that is widely used for approximate inference.

  9. Variational inference methods
     • Construct an approximation to the target distribution P that takes a simpler form for inference:
       • Define a target class of distributions 𝒬.
       • Search for an instance Q* in 𝒬 that is the best approximation to P.
       • Answer queries using Q* rather than P.
     • This is a constrained optimization problem:
       • 𝒬 is a given family of distributions.
       • Simpler families make the optimization problem computationally tractable.
       • However, the family may not be sufficiently expressive to encode P.

  10. Setup
     • Assume that we are interested in the posterior distribution
         P(Z | X, α) = P(Z, X | α) / ∫ P(Z, X | α) dZ
       where X = {x_1, …, x_n} are the observed variables and Z = {z_1, …, z_m} are the hidden variables.
     • Computing the posterior is an instance of the more general problems that variational inference solves.
     • Main idea:
       • Pick a family of distributions over the latent variables with its own variational parameters.
       • Find the setting of the parameters that makes Q close to the posterior of interest.
       • Use Q with the fitted parameters as an approximation to the posterior.

  11. Approximation
     • Goal: approximate a difficult distribution P(Z | X) with a new distribution Q(Z) such that:
       • P(Z | X) and Q(Z) are close
       • Computation on Q(Z) is easy
     • Typically, the true posterior is not in the variational family.
     • How should we measure the distance between distributions?
       • With the Kullback-Leibler (KL) divergence between the two distributions P and Q.

  12. KL divergence
     • Kullback-Leibler divergence between P and Q:
         KL(P||Q) = ∫ P(x) log [P(x) / Q(x)] dx
     • A result from information theory: for any P and Q, KL(P||Q) ≥ 0.
       • KL(P||Q) = 0 if and only if P ≡ Q.
       • The KL divergence is asymmetric.
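
As a concrete illustration (a minimal sketch with made-up distributions, not from the slides), the following code evaluates the discrete KL divergence and shows both properties: the value is nonnegative, and swapping the arguments changes it.

    import numpy as np

    def kl(p, q):
        """Discrete KL divergence KL(p || q) = sum_x p(x) * log(p(x) / q(x))."""
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        mask = p > 0                     # terms with p(x) = 0 contribute nothing
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

    p = np.array([0.7, 0.2, 0.1])
    q = np.array([0.4, 0.4, 0.2])
    print(kl(p, q), kl(q, p))            # both are >= 0 and, in general, not equal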

  13. How should we measure the distance between P and Q?
     • We wish to find a distribution Q that is a "good" approximation to P.
     • We can therefore use the KL divergence as a scoring function for choosing a good Q.
     • But KL(P(Z|X) || Q(Z)) ≠ KL(Q(Z) || P(Z|X)), so the direction of the divergence matters.

  14. M-projection vs. I-projection
     • M-projection of Q onto P: Q* = argmin_{Q ∈ 𝒬} KL(P||Q)
     • I-projection of Q onto P: Q* = argmin_{Q ∈ 𝒬} KL(Q||P)
     • These two differ only when the minimization is over a restricted set of probability distributions 𝒬 (i.e., when P ∉ 𝒬).

  15. KL divergence: M-projection vs. I-projection
     • Let P be a correlated 2D Gaussian and Q be a Gaussian distribution with a diagonal covariance matrix:
       • M-projection: Q* = argmin_Q ∫ P(z) log [P(z) / Q(z)] dz
       • I-projection: Q* = argmin_Q ∫ Q(z) log [Q(z) / P(z)] dz
     • In both cases the means match: E_P[z] = E_Q[z].
     [Figure from Bishop: P shown in green, Q* in red, for the two projections.]
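
Both projections have closed forms in this Gaussian case. The sketch below (an illustration, not from the slides) compares them numerically; it assumes the standard result that the factorized I-projection of a Gaussian keeps the true means but uses the diagonal of the precision matrix for its precisions (Bishop, Sec. 10.1.2), whereas the M-projection simply matches the marginal moments.

    import numpy as np

    # Target P: a correlated 2D Gaussian.
    Sigma = np.array([[1.0, 0.9],
                      [0.9, 1.0]])              # covariance of P
    Lambda = np.linalg.inv(Sigma)               # precision matrix of P

    # M-projection onto diagonal Gaussians: moment matching keeps the marginal variances.
    var_m = np.diag(Sigma)                      # [1.0, 1.0]

    # I-projection onto factorized Gaussians: factor precisions equal Lambda_ii,
    # so the variances 1 / Lambda_ii underestimate the marginal variances.
    var_i = 1.0 / np.diag(Lambda)               # [0.19, 0.19]

    print("M-projection variances:", var_m)
    print("I-projection variances:", var_i)

In words: the reverse-KL (I-projection) solution is too compact along the directions in which P is correlated, while the forward-KL (M-projection) solution covers the marginals.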

  16. KL divergence: M-projection vs. I-projection
     • Let P be a mixture of two 2D Gaussians and Q be a single 2D Gaussian with an arbitrary covariance matrix:
       • M-projection: Q* = argmin_Q ∫ P(z) log [P(z) / Q(z)] dz. It matches the moments of the mixture: E_P[z] = E_Q[z] and Cov_P[z] = Cov_Q[z].
       • I-projection: Q* = argmin_Q ∫ Q(z) log [Q(z) / P(z)] dz. It concentrates on a single component, so there are two good solutions!
     [Figure from Bishop: P shown in blue, Q* in red.]
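
A one-dimensional version of the same effect, as a rough sketch with made-up numbers (not from the slides): moment matching (the M-projection) for a well-separated two-component mixture places the Gaussian's mean between the two modes and inflates its variance, putting much of its mass where the mixture itself has little.

    import numpy as np

    # P: an equal-weight mixture of N(-3, 1) and N(+3, 1) in one dimension.
    weights = np.array([0.5, 0.5])
    means = np.array([-3.0, 3.0])
    var = 1.0

    # M-projection onto Gaussians = moment matching.
    m_mean = np.sum(weights * means)                               # 0.0, between the two modes
    m_var = np.sum(weights * (var + means**2)) - m_mean**2         # 10.0, much wider than either component

    print("M-projection:", m_mean, m_var)
    # An I-projection (reverse-KL) Gaussian would instead sit on one of the two
    # components, roughly N(-3, 1) or N(+3, 1): the "two good solutions" above.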

  17. M-projection
     • Computing KL(P||Q) requires inference on P:
         KL(P||Q) = Σ_z P(z) log [P(z) / Q(z)] = -H_P - E_P[log Q(z)]
       Evaluating the expectation under P, i.e., performing inference on P, is exactly what is difficult.
     • When Q is in the exponential family with sufficient statistics T(z), the M-projection matches the expected sufficient statistics: E_P[T(z)] = E_Q[T(z)]. Hence the name moment projection.
     • Expectation Propagation methods are based on minimizing KL(P||Q).

  18. I-projection
     • KL(Q||P) can be computed without performing inference on P:
         KL(Q||P) = ∫ Q(z) log [Q(z) / P(z)] dz = -H_Q - E_Q[log P(z)]
     • Most variational inference algorithms make use of KL(Q||P):
       • Computing expectations w.r.t. Q is tractable (by choosing a suitable class of distributions for Q).
       • We choose a restricted family of distributions such that the expectations can be evaluated and optimized efficiently, yet which is still sufficiently flexible to give a good approximation.
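
The reason expectations under Q suffice is that, up to the constant ln P(X), the objective only involves the joint P(X, Z), which we can evaluate pointwise. Below is a hedged Monte Carlo sketch (the model and the variational family are illustrative toy choices, not from the slides) that estimates the negative ELBO, E_Q[ln Q(z)] - E_Q[ln P(x, z)], using samples from Q alone.

    import numpy as np

    rng = np.random.default_rng(0)

    def log_joint(z):
        # Log of the (possibly unnormalized) joint P(x, z); here a stand-in
        # standard-Gaussian term plays the role of a real model's joint density.
        return -0.5 * z**2

    def neg_elbo(m, s, n_samples=100_000):
        """Estimate E_Q[ln Q(z)] - E_Q[ln P(x, z)] with z ~ Q = N(m, s^2)."""
        z = m + s * rng.standard_normal(n_samples)
        log_q = -0.5 * ((z - m) / s) ** 2 - np.log(s) - 0.5 * np.log(2.0 * np.pi)
        return float(np.mean(log_q - log_joint(z)))

    # The estimate decreases as Q gets closer to the target's shape:
    print(neg_elbo(2.0, 0.3), neg_elbo(0.0, 1.0))

Minimizing this quantity over the variational parameters (m, s) is equivalent to minimizing KL(Q||P), since the two differ only by the constant ln P(X).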

  19. Example of variational approximation
     [Figure from Bishop comparing a variational approximation with the Laplace approximation of the same distribution.]

  20. Evidence Lower Bound (ELBO)
     • With observed variables X = {x_1, …, x_n} and hidden variables Z = {z_1, …, z_m}:
         ln P(X) = ℒ(Q) + KL(Q||P)
         ℒ(Q) = ∫ Q(Z) ln [P(X, Z) / Q(Z)] dZ
         KL(Q||P) = -∫ Q(Z) ln [P(Z|X) / Q(Z)] dZ
     • ℒ(Q) will also be denoted F[P, Q] later.
     • We can maximize the lower bound ℒ(Q):
       • This is equivalent to minimizing the KL divergence.
       • If we allow any possible choice for Q(Z), the maximum of the lower bound occurs when the KL divergence vanishes, which happens when Q(Z) equals the posterior P(Z|X).
     • The gap between ln P(X), the quantity the ELBO bounds, and ℒ(Q) is exactly the KL divergence.
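
As a sanity check on this decomposition (a small sketch with toy numbers, not from the slides): for a discrete hidden variable and any valid Q(Z), the ELBO and the KL divergence to the posterior sum exactly to ln P(X).

    import numpy as np

    # Toy joint P(x_obs, Z) over a discrete hidden variable with three states.
    joint = np.array([0.10, 0.25, 0.05])         # P(x_obs, z) for z = 0, 1, 2
    log_px = np.log(joint.sum())                 # ln P(x_obs)
    posterior = joint / joint.sum()              # P(z | x_obs)

    q = np.array([0.2, 0.5, 0.3])                # an arbitrary variational distribution Q(z)

    elbo = np.sum(q * (np.log(joint) - np.log(q)))        # L(Q) = E_Q[ln P(x, z) - ln Q(z)]
    kl = np.sum(q * (np.log(q) - np.log(posterior)))      # KL(Q || P(z | x))

    print(elbo + kl, log_px)                     # the two numbers coincide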

  21. Evidence Lower Bound (ELBO)
     • The ELBO is a lower bound on the (log) marginal likelihood.
     • This quantity should increase monotonically with each iteration.
     • We maximize the ELBO to find the parameters that give as tight a bound as possible on the marginal likelihood.
     • The ELBO converges to a local optimum.
     • Variational inference is closely related to EM.

  22. Factorized distributions
     • Restrict the family of distributions Q by a factorization assumption:
         Q(Z) = Π_i Q_i(Z_i)
     • The bound then becomes
         ℒ(Q) = ∫ [Π_i Q_i(Z_i)] [ln P(X, Z) - Σ_i ln Q_i(Z_i)] dZ
     • Coordinate ascent to optimize ℒ(Q): keeping the factors {Q_i}_{i≠j} fixed, the terms that depend on Q_j are
         ℒ_j(Q) = ∫ Q_j(Z_j) [∫ ln P(X, Z) Π_{i≠j} Q_i(Z_i) dZ_i] dZ_j - ∫ Q_j(Z_j) ln Q_j(Z_j) dZ_j + const
       ⇒ ℒ_j(Q) = ∫ Q_j(Z_j) E_{-j}[ln P(X, Z)] dZ_j - ∫ Q_j(Z_j) ln Q_j(Z_j) dZ_j + const
       where E_{-j}[ln P(X, Z)] = ∫ ln P(X, Z) Π_{i≠j} Q_i(Z_i) dZ_i
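
To make the coordinate ascent concrete, here is a minimal sketch (an illustration, not from the slides) for the classic case in which P(Z) is a correlated 2D Gaussian and Q factorizes as Q_1(Z_1) Q_2(Z_2); the closed-form factor updates assume the standard result in Bishop, Sec. 10.1.2.

    import numpy as np

    # Target: P(z) = N(mu, inv(Lambda)) with correlated components.
    mu = np.array([1.0, -1.0])
    Lambda = np.array([[2.0, 0.9],
                       [0.9, 2.0]])              # precision matrix of P

    # Mean-field factors Q_i(z_i) = N(m_i, 1 / Lambda_ii); only the means are coupled.
    m = np.zeros(2)                              # initialize the variational means
    for _ in range(50):                          # coordinate ascent on the ELBO
        # Update Q_1 with Q_2 held fixed (E_{-1}[ln P(z)] is quadratic in z_1).
        m[0] = mu[0] - (Lambda[0, 1] / Lambda[0, 0]) * (m[1] - mu[1])
        # Update Q_2 with Q_1 held fixed.
        m[1] = mu[1] - (Lambda[1, 0] / Lambda[1, 1]) * (m[0] - mu[0])

    print("variational means:", m)                       # converge to mu
    print("variational variances:", 1.0 / np.diag(Lambda))

Each pass through the loop applies the update Q_j ∝ exp(E_{-j}[ln P(X, Z)]) that follows from maximizing ℒ_j(Q) above; since no step can decrease the bound, the ELBO increases monotonically until it reaches a local optimum.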
