Probabilistic Graphical Models, Lecture 15: Inference as Optimization (PowerPoint PPT presentation)
SLIDE 1

Probabilistic Graphical Models

Lecture 15 – Inference as Optimization

CS/CNS/EE 155 Andreas Krause

SLIDE 2

Announcements

Homework 3 due next Monday (Nov 23)
Project poster session on Friday, December 4 (tentative)
Final writeup (8 pages, NIPS format) due Dec 9

SLIDE 3

Approximate inference

Three major classes of general-purpose approaches:

Message passing
  E.g.: Loopy Belief Propagation

Inference as optimization
  Approximate the posterior distribution by a simple distribution
  Mean field / structured mean field

Sampling-based inference
  Importance sampling, particle filtering
  Gibbs sampling, MCMC

Many other alternatives (often for special cases)

SLIDE 4

Loopy BP on arbitrary pairwise MNs

What if we apply BP to a graph with loops?

Apply BP and hope for the best…

Will not generally converge…
If it converges, it will not necessarily give the correct marginals
However, in practice, the answers are often still useful!

[Figure: loopy Markov network over variables C, D, I, G, S, L, J, H]
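As a concrete sketch of "apply BP and hope for the best", here is loopy BP on a tiny pairwise Markov network with a cycle. The triangle model and its potentials are made up for illustration (they are not the lecture's example), and on loopy graphs the normalized beliefs are only pseudo-marginals:

```python
import math

# Toy pairwise Markov network WITH a loop: a triangle over three binary variables.
node_pot = [[1.0, 2.0], [2.0, 1.0], [1.0, 1.0]]          # phi_i(x_i)
edge_pot = {(0, 1): [[2.0, 1.0], [1.0, 2.0]],            # phi_ij(x_i, x_j)
            (1, 2): [[2.0, 1.0], [1.0, 2.0]],
            (0, 2): [[2.0, 1.0], [1.0, 2.0]]}

def pot(i, j, xi, xj):
    # edge potentials are stored once per undirected edge
    return edge_pot[(i, j)][xi][xj] if (i, j) in edge_pot else edge_pot[(j, i)][xj][xi]

def neighbors(i):
    return [b if a == i else a for (a, b) in edge_pot if i in (a, b)]

# one message per directed edge, initialized uniform
msgs = {}
for (a, b) in edge_pot:
    msgs[(a, b)] = [1.0, 1.0]
    msgs[(b, a)] = [1.0, 1.0]

for _ in range(50):                        # iterate and hope for convergence
    new = {}
    for (i, j) in msgs:
        m = []
        for xj in (0, 1):
            # sum over x_i of phi_i * phi_ij * incoming messages except j's
            total = 0.0
            for xi in (0, 1):
                prod = node_pot[i][xi] * pot(i, j, xi, xj)
                for k in neighbors(i):
                    if k != j:
                        prod *= msgs[(k, i)][xi]
                total += prod
            m.append(total)
        z = sum(m)
        new[(i, j)] = [v / z for v in m]   # normalize to avoid under/overflow
    msgs = new

# pseudo-marginals: belief_i(x_i) proportional to phi_i times incoming messages
beliefs = []
for i in range(3):
    b = [node_pot[i][xi] * math.prod(msgs[(k, i)][xi] for k in neighbors(i))
         for xi in (0, 1)]
    z = sum(b)
    beliefs.append([v / z for v in b])
```

Even when this converges, comparing `beliefs` against brute-force marginals would show approximation error; that is the price of running BP on a graph with loops.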

SLIDE 5

Approximate inference

Three major classes of general-purpose approaches:

Message passing
  E.g.: Loopy Belief Propagation (today!)

Inference as optimization
  Approximate the posterior distribution by a simple distribution
  Mean field / structured mean field
  Assumed density filtering / expectation propagation

Sampling-based inference
  Importance sampling, particle filtering
  Gibbs sampling, MCMC

Many other alternatives (often for special cases)

SLIDE 6

Variational approximation

Graphical model with an intractable (high-treewidth) joint distribution P(X1,…,Xn)
Want to compute posterior distributions
Computing the posterior exactly is intractable
Key idea: approximate the posterior with a simpler distribution that is as close to P as possible

SLIDE 7

Why should we hope that we can find a simple approximation?

Prior distribution is complicated

Need to describe all possible states of the world (and relationships between variables)

Posterior distribution is often simple:

Have made many observations ⇒ less uncertainty
Variables can become “almost independent”

For now: Represent posterior as undirected model (and instantiate observations)

[Figure: Markov network over variables C, D, I, G, S, L, J, H with observations instantiated]

SLIDE 8

Variational approximation

Key idea: Approximate posterior with simpler distribution that’s as close as possible to P

What is a “simple” distribution? What does “as close as possible” mean?

Simple = efficient inference

Typically: factorized (fully independent, chain, tree, …)
Gaussian approximation

As close as possible = KL divergence (typically)

Other distance measures can be used too, but more challenging to compute

SLIDE 9

Kullback-Leibler (KL) divergence

A distance measure between distributions. Properties:

D(P || Q) ≥ 0
D(P || Q) = 0 iff P(x) = Q(x) almost everywhere
In general, D(P || Q) ≠ D(Q || P)

P determines where the difference is important
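The asymmetry is easy to see numerically. A minimal sketch with two made-up discrete distributions, using the standard definition D(P || Q) = Σx P(x) ln [P(x) / Q(x)]:

```python
import math

def kl(p, q):
    """D(p || q) = sum_x p(x) * ln(p(x)/q(x)); terms with p(x)=0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# two illustrative distributions over three outcomes
p = [0.5, 0.4, 0.1]
q = [0.8, 0.1, 0.1]

d_pq = kl(p, q)   # D(P || Q)
d_qp = kl(q, p)   # D(Q || P): generally a different number
```

Both divergences are nonnegative and vanish only when the distributions agree, but `d_pq != d_qp`, which is why the two directions behave so differently as objectives.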

SLIDE 10

Finding simple approximate distributions

KL divergence is not symmetric, so we must choose a direction. P: true distribution; Q: our approximation.

D(P || Q)
  The “right” way
  Q chosen to “support” P
  Often intractable to compute

D(Q || P)
  The “reverse” way
  Underestimates support (overconfident)
  Will be tractable to compute

Both are special cases of the α-divergence

SLIDE 11

“Simple” distributions

Simplest distribution: Q fully factorized

Q(X1,…,Xn) = ∏i Qi(Xi)

M = {Q : Q fully factorized} = {Q : Q(X) = ∏i Qi(Xi)}

Can also find more structured approximations:
Chains: Q(X1,…,Xn) = ∏i Qi(Xi | Xi−1)
Trees
Any distribution on which one can do efficient inference


SLIDE 12

Mean field approximation: the “right” way

SLIDE 13

Mean field approximation: the “reverse” way

SLIDE 14

Reverse KL for the fully factorized case

SLIDE 15

KL and the partition function

Suppose P(X1,…,Xn) = Z⁻¹ ∏i φi(Ci) is a Markov network.

Theorem: ln Z = F[P; Q] + D(Q || P)

Hereby, F[P; Q] is the following energy functional:

F[P; Q] = Σi E_Q[ln φi(Ci)] + H(Q)

SLIDE 16

Reverse KL vs. log-partition function

Maximizing the energy functional ⇔ minimizing the reverse KL divergence D(Q || P)
Corollary: the energy functional is a lower bound on the log-partition function, F[P; Q] ≤ ln Z (since D(Q || P) ≥ 0)
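The corollary can be checked numerically on a tiny model. The two-variable network and the factorized Q below are made-up illustrations; brute-force enumeration gives the exact ln Z, and F[P;Q] = E_Q[Σ ln φ] + H(Q) must stay below it:

```python
import math, itertools

# Tiny two-variable binary Markov network (illustrative potentials).
node_pot = [[1.0, 2.0], [2.0, 1.0]]            # phi_0(x0), phi_1(x1)
edge_pot = [[3.0, 1.0], [1.0, 3.0]]            # phi_01(x0, x1)

# exact partition function by brute-force enumeration
Z = sum(node_pot[0][a] * node_pot[1][b] * edge_pot[a][b]
        for a, b in itertools.product((0, 1), repeat=2))

Q = [[0.6, 0.4], [0.7, 0.3]]                   # an arbitrary factorized Q

# E_Q[sum of log potentials] under the factorized Q
expected_log_pot = sum(
    Q[0][a] * Q[1][b] * (math.log(node_pot[0][a]) +
                         math.log(node_pot[1][b]) +
                         math.log(edge_pot[a][b]))
    for a, b in itertools.product((0, 1), repeat=2))

# entropy of a factorized Q is the sum of per-variable entropies
entropy = -sum(qx * math.log(qx) for marg in Q for qx in marg)

F = expected_log_pot + entropy                 # energy functional F[P;Q]
```

Any other factorized Q would also satisfy F ≤ ln Z; mean field simply searches for the Q that pushes F as high as possible.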

SLIDE 17

Optimizing for mean field approximation

Want to solve max_{Q ∈ M} F[P; Q], subject to Σ_{xi} Qi(xi) = 1 for each i
Solved via Lagrange multipliers
Theorem: Q is a stationary point iff for each i and xi:

Qi(xi) = (1/Zi) exp( Σ_{j: Xi ∈ Cj} E_Q[ln φj(Cj) | xi] )

SLIDE 18

Fixed point iteration for MF

Initialize the factors Qi(0) arbitrarily; t = 0
Until converged, do:
  t ← t + 1
  For each variable i and each assignment xi, do:
    Qi(t)(xi) ← (1/Zi) exp( Σ_{j: Xi ∈ Cj} E_{Q(t)}[ln φj(Cj) | xi] )

Guaranteed to converge!
Gives both an approximate distribution Q and a lower bound on ln Z
Can get stuck in a local optimum
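The fixed-point iteration above can be sketched for a pairwise Markov network. The three-variable chain and its potentials are invented for illustration; each update normalizes the exponentiated expected log-potentials touching one variable:

```python
import math

# Toy pairwise Markov network: three binary variables in a chain 0 - 1 - 2.
node_pot = [[1.0, 2.0], [1.0, 1.0], [3.0, 1.0]]          # phi_i(x_i)
edge_pot = {(0, 1): [[2.0, 1.0], [1.0, 2.0]],            # phi_ij(x_i, x_j)
            (1, 2): [[2.0, 1.0], [1.0, 2.0]]}

def mean_field(node_pot, edge_pot, iters=100):
    n = len(node_pot)
    Q = [[0.5, 0.5] for _ in range(n)]                   # arbitrary init
    for _ in range(iters):
        for i in range(n):                               # update each Q_i in turn
            scores = []
            for xi in (0, 1):
                # expected log of every factor that touches X_i
                s = math.log(node_pot[i][xi])
                for (a, b), pot in edge_pot.items():
                    if a == i:
                        s += sum(Q[b][xj] * math.log(pot[xi][xj]) for xj in (0, 1))
                    elif b == i:
                        s += sum(Q[a][xj] * math.log(pot[xj][xi]) for xj in (0, 1))
                scores.append(math.exp(s))
            z = sum(scores)                              # local normalizer Z_i
            Q[i] = [v / z for v in scores]
    return Q

Q = mean_field(node_pot, edge_pot)
```

Each pass only ever improves the energy functional, which is why the iteration converges, though possibly to a local optimum depending on the initialization.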

SLIDE 19

Computing updates

Need to compute the updates Qi(xi)
Must compute expected log-potentials: E_Q[ln φj(Cj) | xi]

SLIDE 20

Example iteration

[Figure: Markov network over variables C, D, I, G, S, L, J, H]

SLIDE 21

Structured mean field

Goal of variational inference: approximate a complex distribution by a simple distribution

[Figure: true distribution vs. fully-factorized mean field vs. structured mean field]

SLIDE 22

Structured mean-field approximations

Can get better approximations using structured approximations
Only need to be able to compute the energy functional
Can do so whenever we can perform efficient inference in Q (e.g., chains, trees, low-treewidth models)

Update equations look similar to those for the fully factorized case (see reading)

SLIDE 23

Example: Factorial HMM

Simultaneous tracking and camera registration
State space decomposed into object location and camera parameters

Mei and Porikli ‘08

SLIDE 24

Variational approximations for FHMMs

Approximate posterior by independent chains

SLIDE 25

Summary: Variational inference

Approximate a complex (intractable) distribution by a simpler distribution that is “as close as possible”
Simple = tractable (efficient inference)
Closeness = reverse KL (efficient to compute)
Interpretation: optimize a lower bound on the log-partition function

Implies upper bound on event probabilities

Efficient algorithm that is guaranteed to converge (in contrast to Loopy BP), but possibly to a local optimum

SLIDE 26

Approximate inference

Three major classes of general-purpose approaches:

Message passing
  E.g.: Loopy Belief Propagation (today!)

Inference as optimization
  Approximate the posterior distribution by a simple distribution
  Mean field / structured mean field
  Assumed density filtering / expectation propagation

Sampling-based inference
  Importance sampling, particle filtering
  Gibbs sampling, MCMC

Many other alternatives (often for special cases)

SLIDE 27

KL-divergence the “right” way:

Find the distribution Q* ∈ M minimizing D(P || Q). In some applications, D(P || Q) can be computed.

Important example: assumed density filtering in DBNs

SLIDE 28

Recall: Dynamic Bayesian Networks

At every time step, we have a Bayesian network
Variables at each time step t are called a “slice” St
“Temporal” edges connect St+1 with St

[Figure: three slices with variables A, B, C, D, E in each slice]

SLIDE 29

Flow of influence in DBNs

Can we do efficient filtering in DBNs?

[Figure: DBN over four time steps with acceleration (At), speed (St), and location (Lt) variables]

SLIDE 30

Approximate inference in DBNs?

Want to find a tractable approximation to the marginals that is as close to the true marginals as possible

[Figure: DBN over A, B, C, D with true marginals vs. approximate marginals At, Bt, Ct, Dt]
SLIDE 31

Assumed Density Filtering

Assume the distribution P(St) for slice t factorizes
P(St+1) is fully connected
Want to compute the best approximation Q* for P(St+1)

Q* = argmin D(P || Q)

[Figure: two slices over At, Bt, Ct, Dt and At+1, Bt+1, Ct+1, Dt+1]

SLIDE 32

Assumed Density Filtering

[Figure: two slices over At, Bt, Ct, Dt and At+1, Bt+1, Ct+1, Dt+1]

SLIDE 33

Recall: Bayesian filtering

Start with P(X1). At time t:

Assume we have P(Xt | y1:t−1)
Condition: P(Xt | y1:t)
Prediction: P(Xt+1, Xt | y1:t)
Marginalization: P(Xt+1 | y1:t)

[Figure: HMM with hidden states X1…X6 and observations Y1…Y6]
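The condition / predict / marginalize recursion can be sketched for a discrete HMM. The two-state transition and observation models below are made-up numbers, not from the lecture:

```python
# Discrete Bayesian filtering (forward recursion) on a toy 2-state HMM.
T = [[0.9, 0.1],   # T[x][x'] = P(X_{t+1} = x' | X_t = x)
     [0.2, 0.8]]
O = [[0.7, 0.3],   # O[x][y] = P(Y_t = y | X_t = x)
     [0.1, 0.9]]
belief = [0.5, 0.5]                     # prior P(X_1)

def filter_step(belief, y):
    # Condition: P(X_t | y_1:t) is proportional to P(y | X_t) P(X_t | y_1:t-1)
    cond = [b * O[x][y] for x, b in enumerate(belief)]
    z = sum(cond)
    cond = [c / z for c in cond]
    # Predict + marginalize: P(X_{t+1} | y_1:t) = sum_x P(X_{t+1} | x) P(x | y_1:t)
    return [sum(cond[x] * T[x][xn] for x in range(2)) for xn in range(2)]

for y in (0, 0, 1):                     # a short illustrative observation sequence
    belief = filter_step(belief, y)
```

For an HMM the belief state stays small, so this is exact; the problem ADF addresses is that in a factored DBN the analogous belief state blows up.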

SLIDE 34

Assumed Density Filtering

Start with P(S1). At every time step t, maintain a tractable approximation Qt with Qt(St) ≈ P(St | O1:t−1)
Condition on the observation Ot of St: compute Qt(St | Ot)
Predict: multiply in the transition model to get Qt(St+1, St | Ot) = Qt(St | Ot) P(St+1 | St)
Marginalize out St

The result is intractable (the transition connects all variables in St+1)
Approximate Qt(St+1 | Ot) by Q* s.t. Q* = argminQ D(Qt(St+1) || Q(St+1))
This is done by matching moments: for discrete models, ensure that Qt+1(st+1) = Qt(st+1 | ot)
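The moment-matching step can be illustrated in the simplest case: projecting a joint over two binary variables onto the fully factorized family. The joint table is invented; minimizing D(P || Q) over factorized Q just copies the single-variable marginals of P:

```python
import math

# Toy joint over two binary variables A, B (illustrative numbers).
P = [[0.35, 0.15],      # P[a][b] = P(A = a, B = b)
     [0.05, 0.45]]

qa = [sum(row) for row in P]                              # marginal P(A)
qb = [sum(P[a][b] for a in (0, 1)) for b in (0, 1)]       # marginal P(B)
Q = [[qa[a] * qb[b] for b in (0, 1)] for a in (0, 1)]     # moment-matched projection

def dkl(P, Q):
    """D(P || Q) for 2x2 joint tables."""
    return sum(P[a][b] * math.log(P[a][b] / Q[a][b])
               for a in (0, 1) for b in (0, 1))

# any other factorized candidate is at least as far from P in D(P || .)
other = [[0.6 * 0.3, 0.6 * 0.7], [0.4 * 0.3, 0.4 * 0.7]]
```

This is the D(P || Q) direction of the KL divergence, which is exactly why ADF uses the "right" way: for factorized (more generally, exponential-family) Q, the minimizer is obtained in closed form by matching moments.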

SLIDE 35

Summary of Assumed Density Filtering

Variational inference technique for dynamic Bayesian networks
Find a tractable approximation for each time slice that minimizes the KL divergence (in the “right” way)
Can show that the errors don’t accumulate too much

Examples:
Tractable inference in DBNs
Unscented Kalman filter

SLIDE 36

Summary: Inference as optimization

Approximate an intractable distribution by a tractable one
Optimize the parameters of the approximating distribution to make the approximation as tight as possible
Common distance measure: KL divergence (in both directions)

Both directions are special cases of the α-divergence

Can get upper bounds on event probabilities, etc.