Learning in Graphical Models
SLIDE 1

Learning in Graphical Models

  • Problem Dimensions

– Model

  • Bayes Nets
  • Markov Nets

– Structure

  • Known
  • Unknown (structure learning)

– Data

  • Complete
  • Incomplete (missing values or hidden variables)
SLIDE 2

Expectation-Maximization

  • Last time:

– Basics of EM
– Learning a mixture of Gaussians (k-means)

  • This time:

– Short story justifying EM

  • Slides based on lecture notes from Andrew Ng

– Applying EM for semi-supervised document classification
– Homework #4

SLIDE 3

10,000 foot level EM

  • Guess some parameters, then

– Use your parameters to get a distribution over hidden variables
– Re-estimate the parameters as if your distribution over hidden variables is correct

  • Seems magical. When/why does this work?
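To make the loop above concrete, here is a minimal sketch (mine, not from the slides) of EM for a 1-D mixture of two Gaussians; the name em_gmm_1d and every detail are illustrative only:

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """Minimal EM for a two-component 1-D Gaussian mixture (illustrative sketch)."""
    # Guess some parameters
    pi = 0.5                                   # mixing weight of component 1
    mu = np.array([x.min(), x.max()], dtype=float)
    var = np.array([x.var(), x.var()])
    for _ in range(n_iter):
        # E-step: use the current parameters to get a distribution over the hidden component
        p1 = pi * np.exp(-(x - mu[0])**2 / (2 * var[0])) / np.sqrt(2 * np.pi * var[0])
        p2 = (1 - pi) * np.exp(-(x - mu[1])**2 / (2 * var[1])) / np.sqrt(2 * np.pi * var[1])
        r = p1 / (p1 + p2)                     # responsibility of component 1 for each point
        # M-step: re-estimate the parameters as if those responsibilities were correct
        pi = r.mean()
        mu = np.array([(r * x).sum() / r.sum(), ((1 - r) * x).sum() / (1 - r).sum()])
        var = np.array([(r * (x - mu[0])**2).sum() / r.sum(),
                        ((1 - r) * (x - mu[1])**2).sum() / (1 - r).sum()])
    return pi, mu, var
```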
SLIDE 4

Jensen’s Inequality

  • For f convex, E[f(X)] >= f(E[X])
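A quick worked instance (my example, not on the slide), using the convex function f(x) = x². Note that for a concave f such as log the inequality reverses, which is the direction the EM derivation uses, and equality holds exactly when X is constant with probability 1:

```latex
\mathbb{E}[f(X)] \;\ge\; f\big(\mathbb{E}[X]\big) \quad \text{for convex } f.
\qquad
\text{Example: } X \in \{0, 2\} \text{ each w.p. } \tfrac12,\; f(x) = x^2:\;
\mathbb{E}[X^2] = 2 \;\ge\; \big(\mathbb{E}[X]\big)^2 = 1.
```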
SLIDE 5

Maximizing likelihood

  • x(i) = data, z(i) = hidden vars, θ = parameters
  • This lower bound is easier to maximize, but

– What is Q? What good is maximizing a lower bound?
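For reference, the bound in question (reconstructed from the standard derivation in Ng's notes; the slide's equations did not survive the transcript): introduce any distribution Q_i over z^(i) and apply Jensen's inequality to the concave log:

```latex
\begin{aligned}
\ell(\theta) &= \sum_i \log p\big(x^{(i)};\theta\big)
             = \sum_i \log \sum_{z^{(i)}} Q_i\big(z^{(i)}\big)\,
               \frac{p\big(x^{(i)}, z^{(i)};\theta\big)}{Q_i\big(z^{(i)}\big)} \\
            &\ge \sum_i \sum_{z^{(i)}} Q_i\big(z^{(i)}\big)\,
               \log \frac{p\big(x^{(i)}, z^{(i)};\theta\big)}{Q_i\big(z^{(i)}\big)}
\end{aligned}
```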

SLIDE 6

What do we use for Q?

  • EM: Given a guess θ_old for θ, improve it
  • Idea: choose Q such that our lower bound equals the true log likelihood at θ_old

SLIDE 7

Ensure the bound is tight at θ_old

  • When does Jensen’s inequality hold exactly?
SLIDE 8

Ensure the bound is tight at θ_old

  • When does Jensen’s inequality hold exactly?
  • Sufficient that p(x(i), z(i); θ_old) / Q(z(i)) be constant with respect to z(i)
  • Thus, choose Q(z(i)) = p(z(i) | x(i); θ_old)
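Spelling out why that choice works (standard algebra, not verbatim from the slide): requiring the ratio to be constant in z(i) means Q is proportional to the joint, and normalizing gives the posterior:

```latex
Q\big(z^{(i)}\big)
  = \frac{p\big(x^{(i)}, z^{(i)};\theta_{\mathrm{old}}\big)}
         {\sum_{z} p\big(x^{(i)}, z;\theta_{\mathrm{old}}\big)}
  = p\big(z^{(i)} \mid x^{(i)};\theta_{\mathrm{old}}\big)
```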
SLIDE 9

Putting it together
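The slide's equations are not in this transcript; the standard summary they correspond to (per Ng's notes) is:

```latex
\begin{aligned}
\text{E-step:}\quad & Q_i\big(z^{(i)}\big) := p\big(z^{(i)} \mid x^{(i)};\theta\big) \\
\text{M-step:}\quad & \theta := \arg\max_{\theta}\;
   \sum_i \sum_{z^{(i)}} Q_i\big(z^{(i)}\big)\,
   \log \frac{p\big(x^{(i)}, z^{(i)};\theta\big)}{Q_i\big(z^{(i)}\big)}
\end{aligned}
```

Because the bound is tight at the current θ and the M-step can only raise the bound, the true log likelihood ℓ(θ) never decreases from one iteration to the next.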

SLIDE 10

For exponential family

  • E step:

– Use n to estimate expected sufficient statistics

  • ver complete data
  • M step

– Set n+1 = ML parameters given sufficient statistics

  • (Or MAP parameters)
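In symbols (standard exponential-family form, added here for concreteness; the slide itself gives only the words):

```latex
p(x, z;\theta) = h(x, z)\,\exp\!\big(\eta(\theta)^{\top} T(x, z) - A(\theta)\big),
\qquad
\bar{T} := \sum_i \mathbb{E}_{z \sim p(z \mid x^{(i)};\,\theta_n)}\big[T\big(x^{(i)}, z\big)\big]
```

The M-step then sets θ_{n+1} by the same moment-matching equations as complete-data maximum likelihood, with the expected statistics in place of observed ones.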
SLIDE 11

EM in practice

  • Local maxima

– Random re-starts (see the sketch after this list), simulated annealing…

  • Variants

– Generalized EM: increase (not necessarily maximize) the likelihood in each step
– Approximate E-step (e.g., sampling)
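The restart wrapper referenced above, as a minimal sketch (mine; run_em is a hypothetical single-run EM routine that takes the data and an RNG and returns (params, log_likelihood)):

```python
import numpy as np

def em_with_restarts(x, run_em, n_restarts=10, seed=0):
    """Run EM from several random initializations and keep the best local maximum."""
    rng = np.random.default_rng(seed)
    best_params, best_ll = None, -np.inf
    for _ in range(n_restarts):
        params, ll = run_em(x, rng)    # each call starts from a fresh random initialization
        if ll > best_ll:               # keep the run that reached the highest log likelihood
            best_params, best_ll = params, ll
    return best_params, best_ll
```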

SLIDE 12

Semi-supervised Learning

  • Unlabeled data abounds in the world

– Web, measurements, etc.

  • Labeled data is expensive

– Image classification, natural language processing, speech recognition, etc. all require large #s of labels

  • Idea: use unlabeled data to help with learning
SLIDE 13


Supervised Learning

Learn a function from x = (x1, …, xd) to y ∈ {0, 1} given labeled examples (x, y)

[Figure: labeled points of two classes plotted on axes x1, x2, with a "?" marking a point to classify]

SLIDE 14


Semi-Supervised Learning (SSL)

Learn a function from x = (x1, …, xd) to y ∈ {0, 1} given labeled examples (x, y) and unlabeled examples (x)

[Figure: labeled and unlabeled points plotted on axes x1, x2]

SLIDE 15

SSL in Graphical Models

  • Graphical Model describes how data (x, y) is generated

  • Missing Data: y
  • So use EM
SLIDE 16

Example: Document classification with Naïve Bayes

  • xi = count of word i in document
  • cj = document class (sports, politics, etc.)
  • xit = count of word i in docs of class t
  • M classes, W = |X| words

(from Semi-supervised Text Classification Using EM, Nigam, et al.)
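For reference, the generative model these quantities parameterize (standard multinomial Naive Bayes over word counts x_i; document-length terms omitted):

```latex
p(x, c_t;\theta) \;\propto\; P(c_t)\,\prod_{i=1}^{W} P(w_i \mid c_t)^{\,x_i}
```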

SLIDE 17

Semi-supervised Training

  • Initialize θ ignoring missing data (i.e., train on the labeled documents only)
  • E-step:

– E[xit] = count of word i in docs of class t in the training set + E[count of word i in docs of class t in the unlabeled data]
– E[#ct] = count of docs of class t in the training set + E[count of docs of class t in the unlabeled data]

  • M-step:

– Set θ according to the expected statistics above, i.e.:

  • P(wi | ct) = (E[xit] + 1) / (W + Σi E[xit])
  • P(ct) = (E[#ct] + 1) / (#docs + M)
SLIDE 18

Semi-supervised Learning

SLIDE 19

When does semi-supervised learning work?

  • When a better model of P(x) -> a better model of P(y | x)
  • Can’t use purely discriminative models
  • Accurate modeling assumptions are key

– Consider: negative class

SLIDE 20

Good example

SLIDE 21

Issue: negative class

SLIDE 22

Negative

  • NB*, EM* represent the negative class with the optimal number of model classes (ci’s)
SLIDE 23

Problem: local maxima

  • “Deterministic Annealing”
  • Slowly increase the annealing parameter β
  • Results: works, but can end up confusing classes (next slide)
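One common way to realize this schedule (my assumption; the slide does not define the symbol, which I take to be β): temper the E-step posterior by β and renormalize, so a small β keeps the class distribution nearly uniform and β = 1 recovers standard EM:

```latex
Q\big(z^{(i)}\big) \;\propto\; \Big[\,p\big(z^{(i)}, x^{(i)};\theta\big)\Big]^{\beta},
\qquad \beta \text{ increased slowly from } \approx 0 \text{ toward } 1
```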

SLIDE 24

Annealing performance

SLIDE 25

Homework #4 (1 of 3)

  • What if we don’t know the target classes in advance?
  • Example: Google Sets
  • Wait until query time to run EM? Slow.
  • Strategy: Learn a NB model in advance, obtain a mapping from examples -> “classes”

  • Then at “query time” compare examples
SLIDE 26

Homework #4 (2 of 3)

  • Classify noun phrases based on context in text

– E.g., “___ prime minister”, “CEO of ___”

  • Model noun phrases (NPs) as P(z | w), e.g.:

P(z = 1 | Canada) = 0.14, P(z = 2 | Canada) = 0.01, …, P(z = N | Canada) = 0.06

  • Experiment with different N
  • Query time input: “seeds” (e.g., Algeria, UK)
  • Output: ranked list of other NPs, using KL divergence
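A sketch of the query-time comparison step (mine, not the homework code; the direction of the KL divergence and the averaging over seeds are assumptions):

```python
import numpy as np

def rank_by_kl(seed_dists, candidate_dists, eps=1e-12):
    """Rank candidate NPs by KL(seed || candidate), averaged over the seed NPs.
    seed_dists: (S, N) array of P(z | seed); candidate_dists: dict name -> (N,) array."""
    scores = {}
    for name, q in candidate_dists.items():
        q = np.clip(q, eps, None)
        kls = [np.sum(p * np.log(np.clip(p, eps, None) / q)) for p in seed_dists]
        scores[name] = float(np.mean(kls))
    return sorted(scores, key=scores.get)       # smallest average divergence first
```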

SLIDE 27

Homework #4 (3 of 3)

  • Code: written in Java
  • You write ~5 lines

– (important ones)

  • Run some experiments
  • Homework also has a few written exercises

– Sampling

SLIDE 28

Road Map

  • Basics of Probability and Statistical Estimation
  • Bayesian Networks
  • Markov Networks
  • Inference
  • Learning

– Parameters, Structure, EM

  • HMMs
  • Something else?

– Candidates: Active Learning, Decision Theory, Statistical Relational Models…
– Role of Probabilistic Models in the Financial Crisis?