

SLIDE 1

Probabilistic Graphical Models

Lecture 17 – EM

CS/CNS/EE 155 Andreas Krause

SLIDE 2

Announcements

Project poster session on Thursday Dec 3, 4-6pm in Annenberg 2nd floor atrium!

Easels, poster boards and cookies will be provided!

Final writeup (8 pages NIPS format) due Dec 9

SLIDE 3

Approximate inference

Three major classes of general-purpose approaches:

Message passing

E.g.: Loopy Belief Propagation (today!)

Inference as optimization

Approximate the posterior distribution by a simple distribution:
Mean field / structured mean field
Assumed density filtering / expectation propagation

Sampling-based inference

Importance sampling, particle filtering
Gibbs sampling, MCMC

Many other alternatives (often for special cases)

SLIDE 4

Sample approximations of expectations

x1, …, xN samples from RV X

Law of large numbers: (1/N) Σi f(xi) → E[f(X)] as N → ∞

Here, the convergence is with probability 1 (almost sure convergence)

Finite samples: for finite N the estimate is still unbiased, and its variance shrinks as 1/N (see the sketch below)
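
A minimal numerical sketch of this estimator (the choice f(x) = x² with X ~ N(0, 1) is mine, not the slide's):

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_estimate(f, sampler, n):
    """Sample approximation of E[f(X)]: average f over n i.i.d. samples."""
    return np.mean(f(sampler(n)))

# Estimate E[X^2] for X ~ N(0, 1); the true value is 1.
for n in [10, 1_000, 100_000]:
    est = mc_estimate(lambda x: x ** 2, rng.standard_normal, n)
    print(n, est)  # error shrinks roughly like 1/sqrt(n)
```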

SLIDE 5

Monte Carlo sampling from a BN

Sort variables in topological ordering X1, …, Xn

For i = 1 to n do

Sample xi ~ P(Xi | X1=x1, …, Xi-1=xi-1)

Works even with high-treewidth models!

[Figure: Bayesian network over variables C, D, I, G, S, L, J, H]
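
A minimal sketch of the procedure on a three-variable toy network (the CPT numbers are mine, not the slide's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy CPTs for a network D -> G <- I, already in topological order D, I, G.
P_D = np.array([0.6, 0.4])                 # P(D)
P_I = np.array([0.7, 0.3])                 # P(I)
P_G = {                                    # P(G | D, I); each row sums to 1
    (0, 0): np.array([0.3, 0.7]),
    (0, 1): np.array([0.05, 0.95]),
    (1, 0): np.array([0.7, 0.3]),
    (1, 1): np.array([0.5, 0.5]),
}

def forward_sample():
    # Each variable is drawn given its already-sampled parents; in a BN this
    # equals sampling from P(Xi | X1 = x1, ..., Xi-1 = xi-1).
    d = int(rng.choice(2, p=P_D))
    i = int(rng.choice(2, p=P_I))
    g = int(rng.choice(2, p=P_G[(d, i)]))
    return d, i, g

print([forward_sample() for _ in range(5)])
```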

SLIDE 6

Computing probabilities through sampling

Want to estimate probabilities: draw N samples from the BN

Marginals: fraction of samples consistent with the query

Conditionals: rejection sampling keeps only the samples consistent with the evidence

Rejection sampling is problematic for rare events

[Figure: Bayesian network over variables C, D, I, G, S, L, J, H]
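
A minimal sketch of estimating a marginal and a conditional by sampling (a toy two-variable model of my own; it also shows why rejection hurts when evidence is rare):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy BN X -> Y with P(X=1) = 0.1 and P(Y=1 | X) = 0.9 / 0.2.
def sample_xy():
    x = rng.random() < 0.1
    y = rng.random() < (0.9 if x else 0.2)
    return x, y

samples = [sample_xy() for _ in range(100_000)]

# Marginal P(Y=1): fraction of samples with y = True.
p_y = np.mean([y for _, y in samples])

# Conditional P(Y=1 | X=1) by rejection: keep only samples with x = True,
# then average y over the kept samples.
kept = [y for x, y in samples if x]
p_y_given_x = np.mean(kept)

# With rare evidence (say P(X=1) = 1e-6) nearly all samples would be
# rejected -- the problem flagged on the slide.
print(p_y, len(kept), p_y_given_x)
```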

SLIDE 7

Sampling from intractable distributions

Given unnormalized distribution Q(X), with P(X) ∝ Q(X)

Q(X) is efficient to evaluate, but the normalizer is intractable

For example, Q(X) = ∏j ψj(Cj) for a Markov network with clique potentials ψj

Want to sample from P(X)

Ingenious idea: Can create a Markov chain that is efficient to simulate and that has stationary distribution P(X)

SLIDE 8

Markov Chain Monte Carlo

Given an unnormalized distribution Q(x)

Want to design a Markov chain with stationary distribution π(x) = (1/Z) Q(x)

Need to specify the transition probabilities P(x' | x)!

SLIDE 9

Designing Markov Chains

1) Proposal distribution R(X’ | X)

Given Xt = x, sample "proposal" x' ~ R(X' | X = x)

Performance of the algorithm will strongly depend on R

2) Acceptance distribution:

Suppose Xt = x

With probability α = min{1, [Q(x') R(x | x')] / [Q(x) R(x' | x)]}, set Xt+1 = x'

With probability 1 − α, set Xt+1 = x

Theorem [Metropolis, Hastings]: The stationary distribution is π(x) = (1/Z) Q(x)

Proof: Markov chain satisfies detailed balance condition!
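
A minimal sketch of Metropolis-Hastings (my choices: a 1-D unnormalized target and a symmetric Gaussian random-walk proposal, for which the acceptance ratio simplifies to Q(x')/Q(x)):

```python
import numpy as np

rng = np.random.default_rng(0)

def Q(x):
    # Unnormalized target density: a two-bump mixture shape.
    return np.exp(-0.5 * (x - 2) ** 2) + 0.5 * np.exp(-0.5 * (x + 2) ** 2)

def metropolis_hastings(n_steps, step=1.0):
    x = 0.0
    chain = []
    for _ in range(n_steps):
        x_prop = x + step * rng.standard_normal()  # symmetric proposal R
        # For symmetric R, the R terms cancel in the acceptance ratio.
        alpha = min(1.0, Q(x_prop) / Q(x))
        if rng.random() < alpha:
            x = x_prop                             # accept: move to x'
        chain.append(x)                            # on reject: stay at x
    return np.array(chain)

chain = metropolis_hastings(50_000)
print(chain.mean())  # sample mean under the (normalized) stationary density
```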

SLIDE 10

Gibbs sampling

Start with initial assignment x(0) to all variables

For t = 1 to ∞ do

Set x(t) = x(t-1)

For each variable Xi:

Set vi = values of all variables in x(t) except xi

Sample xi(t) from P(Xi | vi)

Gibbs sampling satisfies the detailed balance equation for P

Can efficiently compute the conditional distributions P(Xi | vi) for graphical models (only the Markov blanket of Xi matters)
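
A minimal sketch of Gibbs sampling on a toy pairwise model over two ±1 variables (the model and weight are mine, not the lecture's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy pairwise model: unnormalized Q(x1, x2) = exp(w * x1 * x2), xi in {-1,+1}.
w = 1.5

def gibbs(n_steps):
    x = np.array([1, -1])          # x(0): initial assignment
    samples = []
    for _ in range(n_steps):
        for i in range(2):
            # Conditional P(Xi = +1 | other) follows from the local term only.
            other = x[1 - i]
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * w * other))
            x[i] = 1 if rng.random() < p_plus else -1
        samples.append(x.copy())
    return np.array(samples)

samples = gibbs(20_000)
print(np.mean(samples[:, 0] == samples[:, 1]))  # agreement prob. under pi
```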

SLIDE 11

Summary of Sampling

Randomized approximate inference for computing expectations, (conditional) probabilities, etc.

Exact in the limit

But may need ridiculously many samples

Can even directly sample from intractable distributions

Disguise the distribution as the stationary distribution of a Markov chain

Famous example: Gibbs sampling

SLIDE 12

Summary of approximate inference

Deterministic and randomized approaches

Deterministic:

Loopy BP
Mean field inference
Assumed density filtering

Randomized:

Forward sampling
Markov Chain Monte Carlo
Gibbs sampling

SLIDE 13

Recall: The “light” side

Assumed:

everything fully observable
low treewidth
no hidden variables

Then everything is nice

Efficient exact inference in large models
Optimal parameter estimation without local minima
Can even solve some structure learning tasks exactly

SLIDE 14

The “dark” side

In the real world, these assumptions are often violated.

Still want to use graphical models to solve interesting problems.

[Figure: states of the world, sensor measurements, … are represented by a graphical model]

SLIDE 15

Remaining Challenges

Inference

Approximate inference for high-treewidth models

Learning

Dealing with missing data

Representation

Dealing with hidden variables

SLIDE 16

Learning general BNs

                   Known structure   Unknown structure
Fully observable   Easy!             Hard
Missing data       ?                 ?

SLIDE 17

Dealing with missing data

So far, have assumed all variables are observed in each training example

In practice, often have missing data

Some variables may never be observed

Missing variables may be different for each example

[Figure: Bayesian network over variables C, D, I, G, S, L, J, H]

SLIDE 18

Gaussian Mixture Modeling
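
The slide's figures and formulas did not survive extraction; for reference, the standard K-component Gaussian mixture density (notation mine), with hidden component indicator z, is:

```latex
\[
p(x \mid \theta) = \sum_{k=1}^{K} \pi_k \,\mathcal{N}(x \mid \mu_k, \Sigma_k),
\qquad \pi_k = P(z = k), \quad \sum_{k=1}^{K} \pi_k = 1
\]
```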

SLIDE 19

Learning with missing data

Suppose X are the observed variables, Z the hidden variables

Training data: x(1), x(2), …, x(N)

Marginal likelihood: log P(D | θ) = Σj log Σz P(x(j), z | θ)

The marginal likelihood doesn't decompose: the log of a sum over z does not split into per-CPT terms

SLIDE 20

Intuition: EM Algorithm

Iterative algorithm for parameter learning in case of missing data

EM Algorithm:

Expectation Step: "Hallucinate" hidden values

Maximization Step: Train model as if data were fully observed

Repeat

Will converge to local maximum

SLIDE 21

E-Step:

x: observed data; z: hidden data

"Hallucinate" missing values by computing a distribution over the hidden variables using the current parameter estimate.

For each example x(j), compute: Q(t+1)(z | x(j)) = P(z | x(j), θ(t))
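
For concreteness, in the Gaussian-mixture running example the E-step is just Bayes' rule over the hidden component indicator (the symbols πk, μk, Σk are my notation, not the slide's):

```latex
\[
Q^{(t+1)}(z = k \mid x^{(j)})
  = \frac{\pi_k^{(t)}\, \mathcal{N}(x^{(j)} \mid \mu_k^{(t)}, \Sigma_k^{(t)})}
         {\sum_{k'} \pi_{k'}^{(t)}\, \mathcal{N}(x^{(j)} \mid \mu_{k'}^{(t)}, \Sigma_{k'}^{(t)})}
\]
```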

SLIDE 22

Towards M-step: Jensen inequality

Marginal likelihood doesn't decompose.

Theorem [Jensen's inequality]: For any distribution P(z) and (nonnegative) function f(z):

log Σz P(z) f(z) ≥ Σz P(z) log f(z)

SLIDE 23

Lower-bounding marginal likelihood

Jensen's inequality, applied with the E-step distribution Q(t+1)(z | x(j)) = P(z | x(j), θ(t)), lower-bounds each term log P(x(j) | θ); the bound is reconstructed below.
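
A reconstruction of the bound (standard EM algebra; the slide's own equations were lost in extraction):

```latex
\begin{align*}
\log P(x^{(j)} \mid \theta)
  &= \log \sum_{z} Q^{(t+1)}(z \mid x^{(j)})\,
     \frac{P(x^{(j)}, z \mid \theta)}{Q^{(t+1)}(z \mid x^{(j)})} \\
  &\ge \sum_{z} Q^{(t+1)}(z \mid x^{(j)})\,
     \log \frac{P(x^{(j)}, z \mid \theta)}{Q^{(t+1)}(z \mid x^{(j)})}
\end{align*}
```

At θ = θ(t), the ratio inside the log equals P(x(j) | θ(t)) for every z, so the bound is tight there; this is what makes each EM iteration improve the marginal likelihood.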

SLIDE 24

Lower bound on marginal likelihood

Bound on the marginal likelihood with hidden variables:

log P(D | θ) ≥ Σj Σz Q(t+1)(z | x(j)) log P(x(j), z | θ) + (terms constant in θ)

Recall the likelihood in the fully observable case: Σj log P(x(j), z(j) | θ)

The lower bound can be interpreted as the likelihood of a "weighted" data set: each example x(j) is completed with every value z, weighted by Q(t+1)(z | x(j))

SLIDE 25

M-step: Maximize lower bound

Lower bound: Σj Σz Q(t+1)(z | x(j)) log P(x(j), z | θ)

Choose θ(t+1) to maximize the lower bound

Use expected sufficient statistics (counts). Will see:

Whenever we used Count(x,z) in the fully observable case, replace it by E_Q(t+1)[Count(x,z)]
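
Concretely, for a Bayes-net CPT entry (u a parent assignment; a standard formula, restated here because the slide's equations were lost):

```latex
\[
\theta^{(t+1)}_{x \mid u}
  = \frac{\mathbb{E}_{Q^{(t+1)}}[\mathrm{Count}(x, u)]}
         {\sum_{x'} \mathbb{E}_{Q^{(t+1)}}[\mathrm{Count}(x', u)]}
\]
```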

SLIDE 26

Coordinate Ascent Interpretation

Define the energy functional: for any distribution Q and parameters θ,

F(Q, θ) = Σj Σz Q(z | x(j)) log [ P(x(j), z | θ) / Q(z | x(j)) ]

The EM algorithm performs coordinate ascent on F:
E-step: Q(t+1) = argmaxQ F(Q, θ(t))
M-step: θ(t+1) = argmaxθ F(Q(t+1), θ)

Monotonically converges to a local maximum
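
A standard identity (not in the surviving text, but the usual justification for the E-step) rewrites F as marginal likelihood minus a KL divergence:

```latex
\[
F(Q, \theta)
  = \sum_{j}\Big( \log P(x^{(j)} \mid \theta)
      - \mathrm{KL}\big(Q(z \mid x^{(j)}) \,\big\|\, P(z \mid x^{(j)}, \theta)\big)\Big)
\]
```

For fixed θ, only the KL term depends on Q, so the E-step choice Q(z | x(j)) = P(z | x(j), θ(t)) maximizes F; for fixed Q, maximizing over θ is exactly the M-step.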

SLIDE 27

EM for Gaussian Mixtures
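
A minimal runnable sketch of EM for a two-component Gaussian mixture (assuming NumPy and SciPy; the synthetic data and all settings are my choices, not the lecture's):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Synthetic 2-D data from two blobs centered at (-2, -2) and (2, 2).
K, N, D = 2, 500, 2
X = np.vstack([rng.normal(-2, 1, (N // 2, D)),
               rng.normal(2, 1, (N // 2, D))])

pi = np.full(K, 1.0 / K)           # mixing weights
mu = rng.normal(size=(K, D))       # component means (random init)
Sigma = np.stack([np.eye(D)] * K)  # component covariances

for _ in range(50):
    # E-step: responsibilities r[j, k] = Q(z_j = k | x_j, current params).
    r = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                  for k in range(K)], axis=1)
    r /= r.sum(axis=1, keepdims=True)

    # M-step: weighted-data MLE, i.e. counts replaced by expected counts.
    Nk = r.sum(axis=0)             # expected number of points per component
    pi = Nk / N
    mu = (r.T @ X) / Nk[:, None]
    for k in range(K):
        Xc = X - mu[k]
        Sigma[k] = (r[:, k, None] * Xc).T @ Xc / Nk[k] + 1e-6 * np.eye(D)

print(pi)  # should approach [0.5, 0.5]
print(mu)  # should approach the true centers (-2, -2) and (2, 2)
```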

SLIDE 28

EM Iterations [by Andrew Moore]

SLIDE 29

EM in Bayes Nets

Complete data likelihood: with all variables observed, the likelihood factorizes over the CPTs, e.g. P(e, b, a, j, m | θ) = P(e) P(b) P(a | e, b) P(j | a) P(m | a)

[Figure: alarm network over E, B, A, J, M]

SLIDE 30

EM in Bayes Nets

Incomplete data likelihood: with unobserved variables we must marginalize, e.g. P(j, m | θ) = Σe,b,a P(e, b, a, j, m | θ), and the log-likelihood no longer factorizes over the CPTs

[Figure: alarm network over E, B, A, J, M]

SLIDE 31

E-Step for BNs

Need to compute Q(t+1)(z | x(j)) = P(z | x(j), θ(t))

For fixed z, x: can compute P(z | x, θ(t)) using inference

Naively specifying the full distribution over all hidden variables would be intractable; the M-step only requires its family marginals (expected counts)

[Figure: alarm network over E, B, A, J, M]

SLIDE 32

M-step for BNs

Can optimize each CPT independently!

MLE in the fully observed case: θ(x | u) = Count(x, u) / Count(u), where u is a parent assignment

MLE with hidden data: same formula with expected counts, θ(x | u) = E_Q[Count(x, u)] / E_Q[Count(u)]

SLIDE 33

Computing expected counts

Suppose we observe O = o

Variables A are hidden
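
A reconstruction of the computation (standard; the indexing notation is mine, since the slide's equations were lost): each expected count is a sum of posterior family marginals, one inference query per example:

```latex
\[
\mathbb{E}_{Q^{(t+1)}}\big[\mathrm{Count}(x_i, u)\big]
  = \sum_{j=1}^{N} P\big(X_i = x_i,\ \mathrm{Pa}_{X_i} = u \,\big|\, O = o^{(j)},\ \theta^{(t)}\big)
\]
```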

SLIDE 34

Learning general BNs

                   Known structure   Unknown structure
Fully observable   Easy!             Hard
Missing data       Now: EM           ?

SLIDE 35

Structure learning with hidden data

Fully observable case:

Score(D; G) = likelihood of data under most likely parameters

Decomposes over families: Score(D; G) = Σi FamScorei(Xi | PaXi)

Can recompute score efficiently after adding/removing edges

Incomplete data case:

Score(D; G) = lower bound from EM

Does not decompose over families

Search is very expensive

Structure-EM: Iterate

Computing expected counts

Multiple iterations of structure search for fixed expected counts

Guaranteed to monotonically improve likelihood score

SLIDE 36

Hidden variable discovery

Sometimes, “invention” of a hidden variable can drastically simplify model

SLIDE 37

Learning general BNs

                   Known structure   Unknown structure
Fully observable   Easy!             Hard
Missing data       EM                Structure-EM