

SLIDE 1

Probabilistic Graphical Models

Lecture 17 – EM

CS/CNS/EE 155 Andreas Krause

SLIDE 2

Announcements

Project poster session on Thursday Dec 3, 4-6pm in Annenberg 2nd floor atrium!

Easels, poster boards and cookies will be provided!

Final writeup (8 pages NIPS format) due Dec 9

SLIDE 3

Approximate inference

Three major classes of general-purpose approaches:

Message passing

E.g.: Loopy Belief Propagation (today!)

Inference as optimization

Approximate the posterior distribution by a simple distribution:
Mean field / structured mean field
Assumed density filtering / expectation propagation

Sampling-based inference

Importance sampling, particle filtering
Gibbs sampling, MCMC

Many other alternatives (often for special cases)

SLIDE 4

Sample approximations of expectations

x1, …, xN samples from RV X

Law of large numbers: (1/N) Σi f(xi) → E[f(X)] as N → ∞

Here, the convergence is with probability 1 (almost sure convergence)

Finite samples: for finite N the estimate is still unbiased, and its variance shrinks as 1/N (see the sketch below)
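
A minimal numerical sketch of this estimator (the choice f(x) = x² with X ~ N(0, 1) is mine, not the slide's):

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_estimate(f, sampler, n):
    """Sample approximation of E[f(X)]: average f over n i.i.d. samples."""
    return np.mean(f(sampler(n)))

# Estimate E[X^2] for X ~ N(0, 1); the true value is 1.
for n in [10, 1_000, 100_000]:
    est = mc_estimate(lambda x: x ** 2, rng.standard_normal, n)
    print(n, est)  # error shrinks roughly like 1/sqrt(n)
```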

SLIDE 5

Monte Carlo sampling from a BN

Sort variables in topological ordering X1, …, Xn

For i = 1 to n do

Sample xi ~ P(Xi | X1=x1, …, Xi-1=xi-1)

Works even with high-treewidth models!

[Figure: Bayesian network over variables C, D, I, G, S, L, J, H]
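
A minimal sketch of the procedure on a three-variable toy network (the CPT numbers are mine, not the slide's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy CPTs for a network D -> G <- I, already in topological order D, I, G.
P_D = np.array([0.6, 0.4])                 # P(D)
P_I = np.array([0.7, 0.3])                 # P(I)
P_G = {                                    # P(G | D, I); each row sums to 1
    (0, 0): np.array([0.3, 0.7]),
    (0, 1): np.array([0.05, 0.95]),
    (1, 0): np.array([0.7, 0.3]),
    (1, 1): np.array([0.5, 0.5]),
}

def forward_sample():
    # Each variable is drawn given its already-sampled parents; in a BN this
    # equals sampling from P(Xi | X1 = x1, ..., Xi-1 = xi-1).
    d = int(rng.choice(2, p=P_D))
    i = int(rng.choice(2, p=P_I))
    g = int(rng.choice(2, p=P_G[(d, i)]))
    return d, i, g

print([forward_sample() for _ in range(5)])
```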

SLIDE 6

Computing probabilities through sampling

Want to estimate probabilities: draw N samples from the BN

Marginals: fraction of samples consistent with the query

Conditionals: rejection sampling keeps only the samples consistent with the evidence

Rejection sampling is problematic for rare events

[Figure: Bayesian network over variables C, D, I, G, S, L, J, H]
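
A minimal sketch of estimating a marginal and a conditional by sampling (a toy two-variable model of my own; it also shows why rejection hurts when evidence is rare):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy BN X -> Y with P(X=1) = 0.1 and P(Y=1 | X) = 0.9 / 0.2.
def sample_xy():
    x = rng.random() < 0.1
    y = rng.random() < (0.9 if x else 0.2)
    return x, y

samples = [sample_xy() for _ in range(100_000)]

# Marginal P(Y=1): fraction of samples with y = True.
p_y = np.mean([y for _, y in samples])

# Conditional P(Y=1 | X=1) by rejection: keep only samples with x = True,
# then average y over the kept samples.
kept = [y for x, y in samples if x]
p_y_given_x = np.mean(kept)

# With rare evidence (say P(X=1) = 1e-6) nearly all samples would be
# rejected -- the problem flagged on the slide.
print(p_y, len(kept), p_y_given_x)
```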

SLIDE 7

Sampling from intractable distributions

Given unnormalized distribution Q(X), with P(X) ∝ Q(X)

Q(X) is efficient to evaluate, but the normalizer is intractable

For example, Q(X) = ∏j ψj(Cj) for a Markov network with clique potentials ψj

Want to sample from P(X)

Ingenious idea: Can create a Markov chain that is efficient to simulate and that has stationary distribution P(X)

SLIDE 8

Markov Chain Monte Carlo

Given an unnormalized distribution Q(x)

Want to design a Markov chain with stationary distribution π(x) = (1/Z) Q(x)

Need to specify the transition probabilities P(x' | x)!

SLIDE 9

Designing Markov Chains

1) Proposal distribution R(X’ | X)

Given Xt = x, sample "proposal" x' ~ R(X' | X = x)

Performance of the algorithm will strongly depend on R

2) Acceptance distribution:

Suppose Xt = x

With probability α = min{1, [Q(x') R(x | x')] / [Q(x) R(x' | x)]}, set Xt+1 = x'

With probability 1 − α, set Xt+1 = x

Theorem [Metropolis, Hastings]: The stationary distribution is π(x) = (1/Z) Q(x)

Proof: Markov chain satisfies detailed balance condition!
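
A minimal sketch of Metropolis-Hastings (my choices: a 1-D unnormalized target and a symmetric Gaussian random-walk proposal, for which the acceptance ratio simplifies to Q(x')/Q(x)):

```python
import numpy as np

rng = np.random.default_rng(0)

def Q(x):
    # Unnormalized target density: a two-bump mixture shape.
    return np.exp(-0.5 * (x - 2) ** 2) + 0.5 * np.exp(-0.5 * (x + 2) ** 2)

def metropolis_hastings(n_steps, step=1.0):
    x = 0.0
    chain = []
    for _ in range(n_steps):
        x_prop = x + step * rng.standard_normal()  # symmetric proposal R
        # For symmetric R, the R terms cancel in the acceptance ratio.
        alpha = min(1.0, Q(x_prop) / Q(x))
        if rng.random() < alpha:
            x = x_prop                             # accept: move to x'
        chain.append(x)                            # on reject: stay at x
    return np.array(chain)

chain = metropolis_hastings(50_000)
print(chain.mean())  # sample mean under the (normalized) stationary density
```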

SLIDE 10

Gibbs sampling

Start with initial assignment x(0) to all variables

For t = 1 to ∞ do

Set x(t) = x(t-1)

For each variable Xi:

Set vi = values of all variables in x(t) except xi

Sample xi(t) from P(Xi | vi)

Gibbs sampling satisfies the detailed balance equation for P

Can efficiently compute the conditional distributions P(Xi | vi) for graphical models (only the Markov blanket of Xi matters)
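
A minimal sketch of Gibbs sampling on a toy pairwise model over two ±1 variables (the model and weight are mine, not the lecture's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy pairwise model: unnormalized Q(x1, x2) = exp(w * x1 * x2), xi in {-1,+1}.
w = 1.5

def gibbs(n_steps):
    x = np.array([1, -1])          # x(0): initial assignment
    samples = []
    for _ in range(n_steps):
        for i in range(2):
            # Conditional P(Xi = +1 | other) follows from the local term only.
            other = x[1 - i]
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * w * other))
            x[i] = 1 if rng.random() < p_plus else -1
        samples.append(x.copy())
    return np.array(samples)

samples = gibbs(20_000)
print(np.mean(samples[:, 0] == samples[:, 1]))  # agreement prob. under pi
```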

SLIDE 11

Summary of Sampling

Randomized approximate inference for computing expectations, (conditional) probabilities, etc.

Exact in the limit

But may need ridiculously many samples

Can even directly sample from intractable distributions

Disguise the distribution as the stationary distribution of a Markov chain

Famous example: Gibbs sampling

SLIDE 12

Summary of approximate inference

Deterministic and randomized approaches

Deterministic:

Loopy BP
Mean field inference
Assumed density filtering

Randomized:

Forward sampling
Markov Chain Monte Carlo
Gibbs sampling

SLIDE 13

Recall: The “light” side

Assumed:

everything fully observable
low treewidth
no hidden variables

Then everything is nice

Efficient exact inference in large models
Optimal parameter estimation without local minima
Can even solve some structure learning tasks exactly

SLIDE 14

The “dark” side

In the real world, these assumptions are often violated.

Still want to use graphical models to solve interesting problems.

[Figure: states of the world, sensor measurements, … are represented by a graphical model]

SLIDE 15

Remaining Challenges

Inference

Approximate inference for high-treewidth models

Learning

Dealing with missing data

Representation

Dealing with hidden variables

SLIDE 16

Learning general BNs

                   Known structure   Unknown structure
Fully observable   Easy!             Hard
Missing data       ?                 ?

SLIDE 17

Dealing with missing data

So far, have assumed all variables are observed in each training example

In practice, often have missing data

Some variables may never be observed

Missing variables may be different for each example

[Figure: Bayesian network over variables C, D, I, G, S, L, J, H]

SLIDE 18

Gaussian Mixture Modeling
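
The slide's figures and formulas did not survive extraction; for reference, the standard K-component Gaussian mixture density (notation mine), with hidden component indicator z, is:

```latex
\[
p(x \mid \theta) = \sum_{k=1}^{K} \pi_k \,\mathcal{N}(x \mid \mu_k, \Sigma_k),
\qquad \pi_k = P(z = k), \quad \sum_{k=1}^{K} \pi_k = 1
\]
```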

SLIDE 19

Learning with missing data

Suppose X are the observed variables, Z the hidden variables

Training data: x(1), x(2), …, x(N)

Marginal likelihood: log P(D | θ) = Σj log Σz P(x(j), z | θ)

The marginal likelihood doesn't decompose: the log of a sum over z does not split into per-CPT terms

SLIDE 20

Intuition: EM Algorithm

Iterative algorithm for parameter learning in case of missing data

EM Algorithm:

Expectation Step: "Hallucinate" hidden values

Maximization Step: Train model as if data were fully observed

Repeat

Will converge to local maximum

SLIDE 21

E-Step:

x: observed data; z: hidden data

"Hallucinate" missing values by computing a distribution over the hidden variables using the current parameter estimate.

For each example x(j), compute: Q(t+1)(z | x(j)) = P(z | x(j), θ(t))
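
For concreteness, in the Gaussian-mixture running example the E-step is just Bayes' rule over the hidden component indicator (the symbols πk, μk, Σk are my notation, not the slide's):

```latex
\[
Q^{(t+1)}(z = k \mid x^{(j)})
  = \frac{\pi_k^{(t)}\, \mathcal{N}(x^{(j)} \mid \mu_k^{(t)}, \Sigma_k^{(t)})}
         {\sum_{k'} \pi_{k'}^{(t)}\, \mathcal{N}(x^{(j)} \mid \mu_{k'}^{(t)}, \Sigma_{k'}^{(t)})}
\]
```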

SLIDE 22

Towards M-step: Jensen inequality

Marginal likelihood doesn't decompose.

Theorem [Jensen's inequality]: For any distribution P(z) and (nonnegative) function f(z):

log Σz P(z) f(z) ≥ Σz P(z) log f(z)

SLIDE 23

Lower-bounding marginal likelihood

Jensen's inequality, applied with the E-step distribution Q(t+1)(z | x(j)) = P(z | x(j), θ(t)), lower-bounds each term log P(x(j) | θ); the bound is reconstructed below.
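
A reconstruction of the bound (standard EM algebra; the slide's own equations were lost in extraction):

```latex
\begin{align*}
\log P(x^{(j)} \mid \theta)
  &= \log \sum_{z} Q^{(t+1)}(z \mid x^{(j)})\,
     \frac{P(x^{(j)}, z \mid \theta)}{Q^{(t+1)}(z \mid x^{(j)})} \\
  &\ge \sum_{z} Q^{(t+1)}(z \mid x^{(j)})\,
     \log \frac{P(x^{(j)}, z \mid \theta)}{Q^{(t+1)}(z \mid x^{(j)})}
\end{align*}
```

At θ = θ(t), the ratio inside the log equals P(x(j) | θ(t)) for every z, so the bound is tight there; this is what makes each EM iteration improve the marginal likelihood.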

SLIDE 24

Lower bound on marginal likelihood

Bound on the marginal likelihood with hidden variables:

log P(D | θ) ≥ Σj Σz Q(t+1)(z | x(j)) log P(x(j), z | θ) + (terms constant in θ)

Recall the likelihood in the fully observable case: Σj log P(x(j), z(j) | θ)

The lower bound can be interpreted as the likelihood of a "weighted" data set: each example x(j) is completed with every value z, weighted by Q(t+1)(z | x(j))

SLIDE 25

M-step: Maximize lower bound

Lower bound: Σj Σz Q(t+1)(z | x(j)) log P(x(j), z | θ)

Choose θ(t+1) to maximize the lower bound

Use expected sufficient statistics (counts). Will see:

Whenever we used Count(x,z) in the fully observable case, replace it by E_Q(t+1)[Count(x,z)]
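
Concretely, for a Bayes-net CPT entry (u a parent assignment; a standard formula, restated here because the slide's equations were lost):

```latex
\[
\theta^{(t+1)}_{x \mid u}
  = \frac{\mathbb{E}_{Q^{(t+1)}}[\mathrm{Count}(x, u)]}
         {\sum_{x'} \mathbb{E}_{Q^{(t+1)}}[\mathrm{Count}(x', u)]}
\]
```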

SLIDE 26

Coordinate Ascent Interpretation

Define the energy functional: for any distribution Q and parameters θ,

F(Q, θ) = Σj Σz Q(z | x(j)) log [ P(x(j), z | θ) / Q(z | x(j)) ]

The EM algorithm performs coordinate ascent on F:
E-step: Q(t+1) = argmaxQ F(Q, θ(t))
M-step: θ(t+1) = argmaxθ F(Q(t+1), θ)

Monotonically converges to a local maximum
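
A standard identity (not in the surviving text, but the usual justification for the E-step) rewrites F as marginal likelihood minus a KL divergence:

```latex
\[
F(Q, \theta)
  = \sum_{j}\Big( \log P(x^{(j)} \mid \theta)
      - \mathrm{KL}\big(Q(z \mid x^{(j)}) \,\big\|\, P(z \mid x^{(j)}, \theta)\big)\Big)
\]
```

For fixed θ, only the KL term depends on Q, so the E-step choice Q(z | x(j)) = P(z | x(j), θ(t)) maximizes F; for fixed Q, maximizing over θ is exactly the M-step.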

SLIDE 27

EM for Gaussian Mixtures
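
A minimal runnable sketch of EM for a two-component Gaussian mixture (assuming NumPy and SciPy; the synthetic data and all settings are my choices, not the lecture's):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Synthetic 2-D data from two blobs centered at (-2, -2) and (2, 2).
K, N, D = 2, 500, 2
X = np.vstack([rng.normal(-2, 1, (N // 2, D)),
               rng.normal(2, 1, (N // 2, D))])

pi = np.full(K, 1.0 / K)           # mixing weights
mu = rng.normal(size=(K, D))       # component means (random init)
Sigma = np.stack([np.eye(D)] * K)  # component covariances

for _ in range(50):
    # E-step: responsibilities r[j, k] = Q(z_j = k | x_j, current params).
    r = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                  for k in range(K)], axis=1)
    r /= r.sum(axis=1, keepdims=True)

    # M-step: weighted-data MLE, i.e. counts replaced by expected counts.
    Nk = r.sum(axis=0)             # expected number of points per component
    pi = Nk / N
    mu = (r.T @ X) / Nk[:, None]
    for k in range(K):
        Xc = X - mu[k]
        Sigma[k] = (r[:, k, None] * Xc).T @ Xc / Nk[k] + 1e-6 * np.eye(D)

print(pi)  # should approach [0.5, 0.5]
print(mu)  # should approach the true centers (-2, -2) and (2, 2)
```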

SLIDE 28

EM Iterations [by Andrew Moore]

SLIDE 29

EM in Bayes Nets

Complete data likelihood: with all variables observed, the likelihood factorizes over the CPTs, e.g. P(e, b, a, j, m | θ) = P(e) P(b) P(a | e, b) P(j | a) P(m | a)

[Figure: alarm network over E, B, A, J, M]

SLIDE 30

EM in Bayes Nets

Incomplete data likelihood: with unobserved variables we must marginalize, e.g. P(j, m | θ) = Σe,b,a P(e, b, a, j, m | θ), and the log-likelihood no longer factorizes over the CPTs

[Figure: alarm network over E, B, A, J, M]

SLIDE 31

E-Step for BNs

Need to compute Q(t+1)(z | x(j)) = P(z | x(j), θ(t))

For fixed z, x: can compute P(z | x, θ(t)) using inference

Naively specifying the full distribution over all hidden variables would be intractable; the M-step only requires its family marginals (expected counts)

[Figure: alarm network over E, B, A, J, M]

SLIDE 32

M-step for BNs

Can optimize each CPT independently!

MLE in the fully observed case: θ(x | u) = Count(x, u) / Count(u), where u is a parent assignment

MLE with hidden data: same formula with expected counts, θ(x | u) = E_Q[Count(x, u)] / E_Q[Count(u)]

SLIDE 33

Computing expected counts

Suppose we observe O = o

Variables A are hidden
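
A reconstruction of the computation (standard; the indexing notation is mine, since the slide's equations were lost): each expected count is a sum of posterior family marginals, one inference query per example:

```latex
\[
\mathbb{E}_{Q^{(t+1)}}\big[\mathrm{Count}(x_i, u)\big]
  = \sum_{j=1}^{N} P\big(X_i = x_i,\ \mathrm{Pa}_{X_i} = u \,\big|\, O = o^{(j)},\ \theta^{(t)}\big)
\]
```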

SLIDE 34

Learning general BNs

                   Known structure   Unknown structure
Fully observable   Easy!             Hard
Missing data       Now: EM           ?

SLIDE 35

Structure learning with hidden data

Fully observable case:

Score(D; G) = likelihood of data under most likely parameters

Decomposes over families: Score(D; G) = Σi FamScorei(Xi | PaXi)

Can recompute score efficiently after adding/removing edges

Incomplete data case:

Score(D; G) = lower bound from EM

Does not decompose over families

Search is very expensive

Structure-EM: Iterate

Computing expected counts

Multiple iterations of structure search for fixed expected counts

Guaranteed to monotonically improve likelihood score

SLIDE 36

Hidden variable discovery

Sometimes, “invention” of a hidden variable can drastically simplify model

SLIDE 37

Learning general BNs

                   Known structure   Unknown structure
Fully observable   Easy!             Hard
Missing data       EM                Structure-EM