

slide-1
SLIDE 1

Introduction to Latent Sequences & Expectation Maximization

CMSC 473/673 UMBC

slide-2
SLIDE 2

Recap from last time (and the first unit)…

slide-3
SLIDE 3

N-gram Language Models

predict the next word given some context…

π‘ž π‘₯𝑗 π‘₯π‘—βˆ’3, π‘₯π‘—βˆ’2, π‘₯π‘—βˆ’1) ∝ π‘‘π‘π‘£π‘œπ‘’(π‘₯π‘—βˆ’3, π‘₯π‘—βˆ’2, π‘₯π‘—βˆ’1, π‘₯𝑗)

w_{i-3} w_{i-2} w_{i-1} w_i: compute beliefs about what is likely…
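A minimal sketch (not from the slides) of this count-based estimate on a tiny made-up corpus; the function name p_next and the toy sentences are illustrative only:

```python
from collections import Counter

# Toy corpus; in practice the counts come from a much larger training set.
tokens = "colorless green ideas sleep furiously . green ideas sleep soundly .".split()

counts_4 = Counter(zip(tokens, tokens[1:], tokens[2:], tokens[3:]))  # count(w_{i-3}, w_{i-2}, w_{i-1}, w_i)
counts_3 = Counter(zip(tokens, tokens[1:], tokens[2:]))              # count(w_{i-3}, w_{i-2}, w_{i-1})

def p_next(word, context):
    """Relative-frequency estimate of p(w_i | w_{i-3}, w_{i-2}, w_{i-1}); no smoothing."""
    denom = counts_3[context]
    return counts_4[context + (word,)] / denom if denom else 0.0

print(p_next("furiously", ("green", "ideas", "sleep")))  # 0.5 in this toy corpus
```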

slide-4
SLIDE 4

Maxent Language Models

predict the next word given some context… w_{i-3} w_{i-2} w_{i-1} w_i: compute beliefs about what is likely…

π‘ž π‘₯𝑗 π‘₯π‘—βˆ’3, π‘₯π‘—βˆ’2, π‘₯π‘—βˆ’1) = softmax(πœ„ β‹… 𝑔(π‘₯π‘—βˆ’3, π‘₯π‘—βˆ’2, π‘₯π‘—βˆ’1, π‘₯𝑗))

slide-5
SLIDE 5

Neural Language Models

predict the next word given some context… w_{i-3} w_{i-2} w_{i-1} w_i: compute beliefs about what is likely…

π‘ž π‘₯𝑗 π‘₯π‘—βˆ’3, π‘₯π‘—βˆ’2, π‘₯π‘—βˆ’1) = softmax(πœ„π‘₯𝑗 β‹… π’ˆ(π‘₯π‘—βˆ’3, π‘₯π‘—βˆ’2, π‘₯π‘—βˆ’1))

create/use β€œdistributed representations”: embed each context word as e_{i-3}, e_{i-2}, e_{i-1}; combine these representations with a function C = f(Β·) (e.g., a matrix-vector product); score each candidate word via its vector e_w and output parameters ΞΈ_{w_i}

slide-6
SLIDE 6

(Some) Properties of Embeddings

Capture β€œlike” (similar) words
Capture relationships

Mikolov et al. (2013)

vector(β€˜king’) – vector(β€˜man’) + vector(β€˜woman’) β‰ˆ vector(β€˜queen’) vector(β€˜Paris’) - vector(β€˜France’) + vector(β€˜Italy’) β‰ˆ vector(β€˜Rome’)

slide-7
SLIDE 7

Learn more in:

  • Your project
  • Paper (673)
  • Other classes (478/678)

Four kinds of vector models

Sparse vector representations

1. Mutual-information weighted word co-occurrence matrices

Dense vector representations:

2. Singular value decomposition / Latent Semantic Analysis
3. Neural-network-inspired models (skip-grams, CBOW)
4. Brown clusters

slide-8
SLIDE 8

Shared Intuition

Model the meaning of a word by β€œembedding” it in a vector space. The meaning of a word is a vector of numbers. Contrast: in many computational linguistic applications, word meaning is represented by a vocabulary index (β€œword number 545”) or the string itself.

slide-9
SLIDE 9

Intrinsic Evaluation: Cosine Similarity

Divide the dot product by the lengths of the two vectors: this is the cosine of the angle between them. Are the vectors parallel?

+1: vectors point in the same direction
 0: vectors are orthogonal
βˆ’1: vectors point in opposite directions
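A small sketch of the cosine computation described above, in plain Python:

```python
import math

def cosine(u, v):
    """Dot product of u and v divided by the product of their lengths (Euclidean norms)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 2.0], [2.0, 4.0]))    # +1.0: same direction
print(cosine([1.0, 0.0], [0.0, 3.0]))    #  0.0: orthogonal
print(cosine([1.0, 2.0], [-1.0, -2.0]))  # -1.0: opposite directions
```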

slide-10
SLIDE 10

Course Recap So Far

Basics of Probability: requirements to be a distribution (β€œproportional to”, ∝); definitions of conditional probability, joint probability, and independence; Bayes rule, (probability) chain rule

slide-11
SLIDE 11

Course Recap So Far

Basics of Probability: requirements to be a distribution (β€œproportional to”, ∝); definitions of conditional probability, joint probability, and independence; Bayes rule, (probability) chain rule

Basics of language modeling: the goal is to model (be able to predict) and give a score to language (whole sequences of characters or words); simple count-based models; smoothing (and why we need it): Laplace (add-Ξ»), interpolation, backoff; evaluation: perplexity

slide-12
SLIDE 12

Course Recap So Far

Basics of Probability: requirements to be a distribution (β€œproportional to”, ∝); definitions of conditional probability, joint probability, and independence; Bayes rule, (probability) chain rule

Basics of language modeling: the goal is to model (be able to predict) and give a score to language (whole sequences of characters or words); simple count-based models; smoothing (and why we need it): Laplace (add-Ξ»), interpolation, backoff; evaluation: perplexity

Tasks and classification (use Bayes rule!): posterior decoding vs. the noisy channel model; evaluations: accuracy, precision, recall, and FΞ² (F1) scores; NaΓ―ve Bayes (given the label, generate/explain each feature independently) and its connection to language modeling

slide-13
SLIDE 13

Course Recap So Far

Basics of Probability: requirements to be a distribution (β€œproportional to”, ∝); definitions of conditional probability, joint probability, and independence; Bayes rule, (probability) chain rule

Basics of language modeling: the goal is to model (be able to predict) and give a score to language (whole sequences of characters or words); simple count-based models; smoothing (and why we need it): Laplace (add-Ξ»), interpolation, backoff; evaluation: perplexity

Tasks and classification (use Bayes rule!): posterior decoding vs. the noisy channel model; evaluations: accuracy, precision, recall, and FΞ² (F1) scores; NaΓ―ve Bayes (given the label, generate/explain each feature independently) and its connection to language modeling

Maximum entropy models: meanings of feature functions and weights; use for language modeling or conditional classification (β€œposterior in one go”); how to learn the weights: gradient descent

slide-14
SLIDE 14

Course Recap So Far

Basics of Probability: requirements to be a distribution (β€œproportional to”, ∝); definitions of conditional probability, joint probability, and independence; Bayes rule, (probability) chain rule

Basics of language modeling: the goal is to model (be able to predict) and give a score to language (whole sequences of characters or words); simple count-based models; smoothing (and why we need it): Laplace (add-Ξ»), interpolation, backoff; evaluation: perplexity

Tasks and classification (use Bayes rule!): posterior decoding vs. the noisy channel model; evaluations: accuracy, precision, recall, and FΞ² (F1) scores; NaΓ―ve Bayes (given the label, generate/explain each feature independently) and its connection to language modeling

Maximum entropy models: meanings of feature functions and weights; use for language modeling or conditional classification (β€œposterior in one go”); how to learn the weights: gradient descent

Distributed representations & neural language models: what embeddings are and what their motivation is; a common way to evaluate: cosine similarity

slide-15
SLIDE 15

Course Recap So Far

Basics of Probability: requirements to be a distribution (β€œproportional to”, ∝); definitions of conditional probability, joint probability, and independence; Bayes rule, (probability) chain rule

Basics of language modeling: the goal is to model (be able to predict) and give a score to language (whole sequences of characters or words); simple count-based models; smoothing (and why we need it): Laplace (add-Ξ»), interpolation, backoff; evaluation: perplexity

Tasks and classification (use Bayes rule!): posterior decoding vs. the noisy channel model; evaluations: accuracy, precision, recall, and FΞ² (F1) scores; NaΓ―ve Bayes (given the label, generate/explain each feature independently) and its connection to language modeling

Maximum entropy models: meanings of feature functions and weights; use for language modeling or conditional classification (β€œposterior in one go”); how to learn the weights: gradient descent

Distributed representations & neural language models: what embeddings are and what their motivation is; a common way to evaluate: cosine similarity

slide-16
SLIDE 16

LATENT SEQUENCES AND LATENT VARIABLE MODELS

slide-17
SLIDE 17

Is Language Modeling β€œLatent?”

p(Colorless green ideas sleep furiously) = p(Colorless) * p(green | Colorless) * p(ideas | Colorless green) * p(sleep | green ideas) * p(furiously | ideas sleep)

slide-18
SLIDE 18

Is Language Modeling β€œLatent?” Most* of What We’ve Discussed: Not Really

p(Colorless green ideas sleep furiously) = p(Colorless) * p(green | Colorless) * p(ideas | Colorless green) * p(sleep | green ideas) * p(furiously | ideas sleep)

these values are unknown but the generation process (explanation) is transparent

*Neural language modeling as an exception

slide-19
SLIDE 19

ATTACK

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

Is Document Classification β€œLatent?”

slide-20
SLIDE 20

ATTACK

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

Is Document Classification β€œLatent?” As We’ve Discussed

argmax_y ∏_j p(x_j | y) βˆ— p(y)

argmax_y [exp(ΞΈ β‹… f(x, y)) / Z(x)] βˆ— p(y)

argmax_y exp(ΞΈ β‹… f(x, y))

slide-21
SLIDE 21

ATTACK

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

Is Document Classification β€œLatent?” As We’ve Discussed: Not Really

argmax_y ∏_j p(x_j | y) βˆ— p(y)

argmax_y [exp(ΞΈ β‹… f(x, y)) / Z(x)] βˆ— p(y)

argmax_y exp(ΞΈ β‹… f(x, y))

these values are unknown but the generation process (explanation) is transparent
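To make the first (NaΓ―ve Bayes) decision rule concrete, a toy sketch; the labels, words, and probabilities are invented for illustration:

```python
import math

# Hypothetical class-conditional word probabilities p(x_j | y) and priors p(y).
p_word = {
    "ATTACK": {"shot": 0.03, "wounded": 0.02, "mayor": 0.005},
    "OTHER":  {"shot": 0.002, "wounded": 0.001, "mayor": 0.004},
}
p_label = {"ATTACK": 0.3, "OTHER": 0.7}

def classify(words):
    """argmax_y p(y) * prod_j p(x_j | y), computed in log space to avoid underflow."""
    def score(y):
        return math.log(p_label[y]) + sum(math.log(p_word[y].get(w, 1e-6)) for w in words)
    return max(p_label, key=score)

print(classify(["shot", "wounded", "mayor"]))  # ATTACK
```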

slide-22
SLIDE 22

Ambiguity β†’ Part of Speech Tagging

British Left Waffles on Falkland Islands

(i)  British/Adjective   Left/Noun   Waffles/Verb
(ii) British/Noun   Left/Verb   Waffles/Noun

slide-23
SLIDE 23
(diagram: levels of linguistic structure above the observed text: orthography, morphology, lexemes, syntax, semantics, pragmatics, discourse)

Adapted from Jason Eisner, Noah Smith
slide-24
SLIDE 24

Adapted from Jason Eisner, Noah Smith

Latent Modeling

explain what you see/annotate (the observed text) with things β€œof importance” you don’t see: orthography, morphology, lexemes, syntax, semantics, pragmatics, discourse
slide-25
SLIDE 25

Latent Sequence Models: Part of Speech

p(British Left Waffles on Falkland Islands)

slide-26
SLIDE 26

Latent Sequence Models: Part of Speech

p(British Left Waffles on Falkland Islands)

(i):  Adjective Noun Verb Prep Noun Noun
(ii): Noun Verb Noun Prep Noun Noun

slide-27
SLIDE 27

Latent Sequence Models: Part of Speech

p(British Left Waffles on Falkland Islands)

1. Explain this sentence as a sequence of (likely?) latent (unseen) tags (labels)

(i):  Adjective Noun Verb Prep Noun Noun
(ii): Noun Verb Noun Prep Noun Noun

slide-28
SLIDE 28

Latent Sequence Models: Part of Speech

p(British Left Waffles on Falkland Islands)

1. Explain this sentence as a sequence of (likely?) latent (unseen) tags (labels)
2. Produce a tag sequence for this sentence

(i):  Adjective Noun Verb Prep Noun Noun
(ii): Noun Verb Noun Prep Noun Noun

slide-29
SLIDE 29

Noisy Channel Model

Decode Rerank

π‘ž π‘Œ 𝑍) ∝ π‘ž 𝑍 π‘Œ) βˆ— π‘ž(π‘Œ)

possible (clean)

  • utput
  • bserved

(noisy) text translation/ decode model (clean) language model
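A toy sketch of the decode/rerank idea, scoring candidate clean outputs by channel model times language model as in the proportionality above; all candidates and numbers are made up:

```python
# Hypothetical candidate clean outputs for one observed noisy input, with
# made-up channel-model scores p(observed | clean) and language-model scores p(clean).
candidates = {
    "the cat is on the chair": {"channel": 0.020, "lm": 1.0e-3},
    "the cat is on the char":  {"channel": 0.025, "lm": 1.0e-6},
}

def decode(cands):
    """Pick argmax over candidates of p(observed | clean) * p(clean)."""
    return max(cands, key=lambda y: cands[y]["channel"] * cands[y]["lm"])

print(decode(candidates))  # the language model rescues the sensible candidate
```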

slide-30
SLIDE 30

Latent Sequence Model: Machine Translation

Decode Rerank

π‘ž π‘Œ 𝑍) ∝ π‘ž 𝑍 π‘Œ) βˆ— π‘ž(π‘Œ)

possible (clean)

  • utput
  • bserved

(noisy) text

translation/ decode model

(clean) language model

slide-31
SLIDE 31

Latent Sequence Model: Machine Translation

Le chat est sur la chaise.

Eddie Izzard, β€œDress to Kill” (MPAA: R) https://www.youtube.com/watch?v=x1sQkEfAdfY

slide-32
SLIDE 32

Latent Sequence Model: Machine Translation

The cat is on the chair. Le chat est sur la chaise.

Eddie Izzard, β€œDress to Kill” (MPAA: R) https://www.youtube.com/watch?v=x1sQkEfAdfY

slide-33
SLIDE 33

Latent Sequence Model: Machine Translation

The cat is on the chair. Le chat est sur la chaise.

Eddie Izzard, β€œDress to Kill” (MPAA: R) https://www.youtube.com/watch?v=x1sQkEfAdfY

How do you know what words translate as? Learn the translations!

slide-34
SLIDE 34

Latent Sequence Model: Machine Translation

The cat is on the chair. Le chat est sur la chaise.

Eddie Izzard, β€œDress to Kill” (MPAA: R) https://www.youtube.com/watch?v=x1sQkEfAdfY

How do you know what words translate as? Learn the translations! How? Learn a β€œreverse” latent alignment model p(French words, alignments | English words)

slide-35
SLIDE 35

Latent Sequence Model: Machine Translation

The cat is on the chair. Le chat est sur la chaise.

Eddie Izzard, β€œDress to Kill” (MPAA: R) https://www.youtube.com/watch?v=x1sQkEfAdfY

How do you know what words translate as? Learn the translations! How? Learn a β€œreverse” latent alignment model p(French words, alignments | English words) Alignment? Words can have different meaning/senses

slide-36
SLIDE 36

Latent Sequence Model: Machine Translation

The cat is on the chair. Le chat est sur la chaise.

Eddie Izzard, β€œDress to Kill” (MPAA: R) https://www.youtube.com/watch?v=x1sQkEfAdfY

How do you know what words translate as? Learn the translations! How? Learn a β€œreverse” latent alignment model p(French words, alignments | English words)

π‘ž English French) ∝ π‘ž French English) βˆ— π‘ž(English)

Why Reverse? Alignment? Words can have different meaning/senses

slide-37
SLIDE 37

How to Learn With Latent Variables (Sequences)

Expectation Maximization

slide-38
SLIDE 38

Example: Unigram Language Modeling

π‘ž π‘₯1, π‘₯2, … , π‘₯𝑂 = π‘ž π‘₯1 π‘ž π‘₯2 β‹― π‘ž π‘₯𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗

slide-39
SLIDE 39

Example: Unigram Language Modeling

π‘ž π‘₯1, π‘₯2, … , π‘₯𝑂 = π‘ž π‘₯1 π‘ž π‘₯2 β‹― π‘ž π‘₯𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗

maximize (log-)likelihood to learn the probability parameters
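For this fully observed case, maximizing the (log-)likelihood reduces to relative-frequency counting; a tiny sketch with a made-up word list:

```python
from collections import Counter

words = "the cat sat on the mat".split()   # toy observed data
counts = Counter(words)
total = sum(counts.values())

# Maximum-likelihood unigram parameters: p(w) = count(w) / total.
p = {w: c / total for w, c in counts.items()}
print(p["the"])  # 2/6
```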

slide-40
SLIDE 40

Example: Unigram Language Modeling with Hidden Class

π‘ž π‘₯1, π‘₯2, … , π‘₯𝑂 = π‘ž π‘₯1 π‘ž π‘₯2 β‹― π‘ž π‘₯𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗 π‘ž 𝑨1, π‘₯1, 𝑨2, π‘₯2, … , 𝑨𝑂, π‘₯𝑂 = π‘ž 𝑨1 π‘ž π‘₯1|𝑨1 β‹― π‘ž 𝑨𝑂 π‘ž π‘₯𝑂|𝑨𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗|𝑨𝑗 π‘ž 𝑨𝑗

add complexity to better explain what we see

slide-41
SLIDE 41

Example: Unigram Language Modeling with Hidden Class

π‘ž π‘₯1, π‘₯2, … , π‘₯𝑂 = π‘ž π‘₯1 π‘ž π‘₯2 β‹― π‘ž π‘₯𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗 π‘ž 𝑨1, π‘₯1, 𝑨2, π‘₯2, … , 𝑨𝑂, π‘₯𝑂 = π‘ž 𝑨1 π‘ž π‘₯1|𝑨1 β‹― π‘ž 𝑨𝑂 π‘ž π‘₯𝑂|𝑨𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗|𝑨𝑗 π‘ž 𝑨𝑗

add complexity to better explain what we see

examples of latent classes z:

  • part of speech tag
  • topic (β€œsports” vs. β€œpolitics”)
slide-42
SLIDE 42

Example: Unigram Language Modeling with Hidden Class

π‘ž π‘₯1, π‘₯2, … , π‘₯𝑂 = π‘ž π‘₯1 π‘ž π‘₯2 β‹― π‘ž π‘₯𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗 π‘ž 𝑨1, π‘₯1, 𝑨2, π‘₯2, … , 𝑨𝑂, π‘₯𝑂 = π‘ž 𝑨1 π‘ž π‘₯1|𝑨1 β‹― π‘ž 𝑨𝑂 π‘ž π‘₯𝑂|𝑨𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗|𝑨𝑗 π‘ž 𝑨𝑗

add complexity to better explain what we see

goal: maximize (log-)likelihood

slide-43
SLIDE 43

Example: Unigram Language Modeling with Hidden Class

π‘ž π‘₯1, π‘₯2, … , π‘₯𝑂 = π‘ž π‘₯1 π‘ž π‘₯2 β‹― π‘ž π‘₯𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗 π‘ž 𝑨1, π‘₯1, 𝑨2, π‘₯2, … , 𝑨𝑂, π‘₯𝑂 = π‘ž 𝑨1 π‘ž π‘₯1|𝑨1 β‹― π‘ž 𝑨𝑂 π‘ž π‘₯𝑂|𝑨𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗|𝑨𝑗 π‘ž 𝑨𝑗

we don’t actually observe these z values; we just see the words w

add complexity to better explain what we see

goal: maximize (log-)likelihood

slide-44
SLIDE 44

Example: Unigram Language Modeling with Hidden Class

π‘ž π‘₯1, π‘₯2, … , π‘₯𝑂 = π‘ž π‘₯1 π‘ž π‘₯2 β‹― π‘ž π‘₯𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗 π‘ž 𝑨1, π‘₯1, 𝑨2, π‘₯2, … , 𝑨𝑂, π‘₯𝑂 = π‘ž 𝑨1 π‘ž π‘₯1|𝑨1 β‹― π‘ž 𝑨𝑂 π‘ž π‘₯𝑂|𝑨𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗|𝑨𝑗 π‘ž 𝑨𝑗

we don’t actually observe these z values; we just see the words w

add complexity to better explain what we see

goal: maximize (log-)likelihood

if we did observe z, estimating the probability parameters would be easy… but we don’t! :(

slide-45
SLIDE 45

Example: Unigram Language Modeling with Hidden Class

π‘ž π‘₯1, π‘₯2, … , π‘₯𝑂 = π‘ž π‘₯1 π‘ž π‘₯2 β‹― π‘ž π‘₯𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗 π‘ž 𝑨1, π‘₯1, 𝑨2, π‘₯2, … , 𝑨𝑂, π‘₯𝑂 = π‘ž 𝑨1 π‘ž π‘₯1|𝑨1 β‹― π‘ž 𝑨𝑂 π‘ž π‘₯𝑂|𝑨𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗|𝑨𝑗 π‘ž 𝑨𝑗

we don’t actually observe these z values; we just see the words w

add complexity to better explain what we see

goal: maximize (log-)likelihood

if we knew the probability parameters then we could estimate z and evaluate likelihood… but we don’t! :(
if we did observe z, estimating the probability parameters would be easy… but we don’t! :(

slide-46
SLIDE 46

Example: Unigram Language Modeling with Hidden Class

π‘ž 𝑨1, π‘₯1, 𝑨2, π‘₯2, … , 𝑨𝑂, π‘₯𝑂 = π‘ž 𝑨1 π‘ž π‘₯1|𝑨1 β‹― π‘ž 𝑨𝑂 π‘ž π‘₯𝑂|𝑨𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗|𝑨𝑗 π‘ž 𝑨𝑗

we don’t actually observe these z values goal: maximize marginalized (log-)likelihood

slide-47
SLIDE 47

Example: Unigram Language Modeling with Hidden Class

π‘ž 𝑨1, π‘₯1, 𝑨2, π‘₯2, … , 𝑨𝑂, π‘₯𝑂 = π‘ž 𝑨1 π‘ž π‘₯1|𝑨1 β‹― π‘ž 𝑨𝑂 π‘ž π‘₯𝑂|𝑨𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗|𝑨𝑗 π‘ž 𝑨𝑗

we don’t actually observe these z values goal: maximize marginalized (log-)likelihood w

slide-48
SLIDE 48

Example: Unigram Language Modeling with Hidden Class

π‘ž 𝑨1, π‘₯1, 𝑨2, π‘₯2, … , 𝑨𝑂, π‘₯𝑂 = π‘ž 𝑨1 π‘ž π‘₯1|𝑨1 β‹― π‘ž 𝑨𝑂 π‘ž π‘₯𝑂|𝑨𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗|𝑨𝑗 π‘ž 𝑨𝑗

we don’t actually observe these z values goal: maximize marginalized (log-)likelihood w

z=1 & w z=2 & w z=3 & w z=4 & w

slide-49
SLIDE 49

Marginal(ized) Probability

p(w) = p(z = 1, w) + p(z = 2, w) + p(z = 3, w) + p(z = 4, w)

(diagram: the event w split into z=1 & w, z=2 & w, z=3 & w, z=4 & w)

(assuming z can take on only four possible values…)

slide-50
SLIDE 50

Marginal(ized) Probability

p(w) = p(z = 1, w) + p(z = 2, w) + p(z = 3, w) + p(z = 4, w) = Ξ£_{z=1}^{4} p(z, w)

(diagram: the event w split into z=1 & w, z=2 & w, z=3 & w, z=4 & w)

(assuming z can take on only four possible values…)

slide-51
SLIDE 51

Marginal(ized) Probability

p(w) = Ξ£_z p(z, w)

(diagram: the event w split into z=1 & w, z=2 & w, z=3 & w, z=4 & w)

slide-52
SLIDE 52

Marginal(ized) Probability

p(w) = Ξ£_z p(z, w) = Ξ£_z p(z) p(w | z)

(diagram: the event w split into z=1 & w, z=2 & w, z=3 & w, z=4 & w)

slide-53
SLIDE 53

Example: Unigram Language Modeling with Hidden Class

π‘ž 𝑨1, π‘₯1, 𝑨2, π‘₯2, … , 𝑨𝑂, π‘₯𝑂 = π‘ž 𝑨1 π‘ž π‘₯1|𝑨1 β‹― π‘ž 𝑨𝑂 π‘ž π‘₯𝑂|𝑨𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗|𝑨𝑗 π‘ž 𝑨𝑗

we don’t actually observe these z values goal: maximize marginalized (log-)likelihood w

π‘ž π‘₯1, π‘₯2, … , π‘₯𝑂 = ෍

𝑨1

π‘ž(𝑨1, π‘₯1) ෍

𝑨2

π‘ž(𝑨2, π‘₯2) β‹― ෍

𝑨𝑂

π‘ž(𝑨𝑂, π‘₯𝑂)

z=1 & w z=2 & w z=3 & w z=4 & w
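A small sketch of evaluating that marginalized (log-)likelihood for a two-class model; the class and word probabilities below are invented:

```python
import math

p_z = {1: 0.6, 2: 0.4}                                   # hypothetical p(z)
p_w_given_z = {                                          # hypothetical p(w | z)
    1: {"goal": 0.20, "team": 0.10, "vote": 0.01},
    2: {"goal": 0.05, "team": 0.02, "vote": 0.30},
}

def marginal_log_likelihood(words):
    """log p(w_1..w_N) = sum_j log sum_z p(z) p(w_j | z)."""
    return sum(math.log(sum(p_z[z] * p_w_given_z[z][w] for z in p_z))
               for w in words)

print(marginal_log_likelihood(["goal", "team", "vote"]))
```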

slide-54
SLIDE 54

Example: Unigram Language Modeling with Hidden Class

π‘ž 𝑨1, π‘₯1, 𝑨2, π‘₯2, … , 𝑨𝑂, π‘₯𝑂 = π‘ž 𝑨1 π‘ž π‘₯1|𝑨1 β‹― π‘ž 𝑨𝑂 π‘ž π‘₯𝑂|𝑨𝑂

goal: maximize marginalized (log-)likelihood w

π‘ž π‘₯1, π‘₯2, … , π‘₯𝑂 = ෍

𝑨1

π‘ž(𝑨1, π‘₯1) ෍

𝑨2

π‘ž(𝑨2, π‘₯2) β‹― ෍

𝑨𝑂

π‘ž(𝑨𝑂, π‘₯𝑂)

if we did observe z, estimating the probability parameters would be easy… but we don’t! :(
if we knew the probability parameters then we could estimate z and evaluate likelihood… but we don’t! :(

slide-55
SLIDE 55

http://blog.innotas.com/wp-content/uploads/2015/08/chicken-or-egg-cropped1.jpg

slide-56
SLIDE 56

http://blog.innotas.com/wp-content/uploads/2015/08/chicken-or-egg-cropped1.jpg

if we did observe z, estimating the probability parameters would be easy… but we don’t! :(
if we knew the probability parameters then we could estimate z and evaluate likelihood… but we don’t! :(

slide-57
SLIDE 57

http://blog.innotas.com/wp-content/uploads/2015/08/chicken-or-egg-cropped1.jpg

if we did observe z, estimating the probability parameters would be easy… but we don’t! :(
if we knew the probability parameters then we could estimate z and evaluate likelihood… but we don’t! :(

slide-58
SLIDE 58

Expectation Maximization (EM)

Two-step, iterative algorithm:

  • 0. Assume some value for your parameters
  • 1. E-step: count under uncertainty (compute expectations)
  • 2. M-step: maximize log-likelihood, assuming these uncertain counts
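A minimal sketch of this loop for the class-based unigram model from the previous slides; the toy corpus and initialization are made up, and a real implementation would add smoothing and a convergence check:

```python
from collections import defaultdict

def em(corpus, p_z, p_w_given_z, iterations=20):
    """EM for the class-based unigram model: p(z, w) = p(z) p(w | z)."""
    vocab = set(corpus)
    for _ in range(iterations):
        # E-step: fractional counts, weighting each (z, w) event by the posterior p(z | w).
        count_z, count_zw = defaultdict(float), defaultdict(float)
        for w in corpus:
            norm = sum(p_z[z] * p_w_given_z[z][w] for z in p_z)
            for z in p_z:
                posterior = p_z[z] * p_w_given_z[z][w] / norm
                count_z[z] += posterior
                count_zw[(z, w)] += posterior
        # M-step: re-estimate the parameters from these expected counts.
        total = sum(count_z.values())
        p_z = {z: count_z[z] / total for z in count_z}
        p_w_given_z = {z: {w: count_zw[(z, w)] / count_z[z] for w in vocab} for z in count_z}
    return p_z, p_w_given_z

corpus = "goal team goal goal vote vote".split()
p_z_init = {1: 0.5, 2: 0.5}
p_w_init = {1: {"goal": 0.5, "team": 0.3, "vote": 0.2},
            2: {"goal": 0.2, "team": 0.1, "vote": 0.7}}
print(em(corpus, p_z_init, p_w_init))
```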

slide-59
SLIDE 59

Expectation Maximization (EM): E-step

Two-step, iterative algorithm:

  • 0. Assume some value for your parameters
  • 1. E-step: count under uncertainty, assuming these parameters
  • 2. M-step: maximize log-likelihood, assuming these uncertain counts

count(z_j, w_j)      p(z_j)

slide-60
SLIDE 60

Expectation Maximization (EM): E-step

Two-step, iterative algorithm:

  • 0. Assume some value for your parameters
  • 1. E-step: count under uncertainty, assuming these parameters
  • 2. M-step: maximize log-likelihood, assuming these uncertain counts

count(z_j, w_j)      p(z_j)

We’ve already seen this type of counting, when computing the gradient in maxent models.

slide-61
SLIDE 61

Expectation Maximization (EM): M-step

Two-step, iterative algorithm:

  • 0. Assume some value for your parameters
  • 1. E-step: count under uncertainty, assuming these parameters
  • 2. M-step: maximize log-likelihood, assuming these uncertain counts

p^(t+1)(z)  ←  estimated counts computed under p^(t)(z)

slide-62
SLIDE 62

EM Math

max_ΞΈ  𝔼_{z ~ p_{ΞΈ^(t)}(β‹… | w)} [ log p_ΞΈ(z, w) ]

the average log-likelihood of our complete data (z, w), averaged across all z and according to how likely our current model thinks z is

slide-63
SLIDE 63

EM Math

max_ΞΈ  𝔼_{z ~ p_{ΞΈ^(t)}(β‹… | w)} [ log p_ΞΈ(z, w) ]

maximize the average log-likelihood of our complete data (z, w), averaged across all z and according to how likely our current model thinks z is

slide-64
SLIDE 64

EM Math

max_ΞΈ  𝔼_{z ~ p_{ΞΈ^(t)}(β‹… | w)} [ log p_ΞΈ(z, w) ]

maximize the average log-likelihood of our complete data (z, w), averaged across all z and according to how likely our current model thinks z is

slide-65
SLIDE 65

EM Math

max_ΞΈ  𝔼_{z ~ p_{ΞΈ^(t)}(β‹… | w)} [ log p_ΞΈ(z, w) ]

p_{ΞΈ^(t)}(β‹… | w): posterior distribution      ΞΈ^(t): current parameters

maximize the average log-likelihood of our complete data (z, w), averaged across all z and according to how likely our current model thinks z is

slide-66
SLIDE 66

EM Math

max_ΞΈ  𝔼_{z ~ p_{ΞΈ^(t)}(β‹… | w)} [ log p_ΞΈ(z, w) ]

ΞΈ^(t): current parameters      ΞΈ: new parameters      p_{ΞΈ^(t)}(β‹… | w): posterior distribution

maximize the average log-likelihood of our complete data (z, w), averaged across all z and according to how likely our current model thinks z is

slide-67
SLIDE 67

EM Math

max_ΞΈ  𝔼_{z ~ p_{ΞΈ^(t)}(β‹… | w)} [ log p_ΞΈ(z, w) ]

E-step: count under uncertainty      M-step: maximize log-likelihood

ΞΈ^(t): current parameters      ΞΈ: new parameters      p_{ΞΈ^(t)}(β‹… | w): posterior distribution

maximize the average log-likelihood of our complete data (z, w), averaged across all z and according to how likely our current model thinks z is

slide-68
SLIDE 68

Three Coins/Unigram With Class Example

Imagine three coins: flip the 1st coin (penny); if heads, flip the 2nd coin (dollar coin); if tails, flip the 3rd coin (dime)

slide-69
SLIDE 69

Three Coins/Unigram With Class Example

Imagine three coins: flip the 1st coin (penny); if heads, flip the 2nd coin (dollar coin); if tails, flip the 3rd coin (dime)

we only observe these second-coin flips (record the heads vs. tails outcome); we don’t observe the penny flip

slide-70
SLIDE 70

Three Coins/Unigram With Class Example

Imagine three coins: flip the 1st coin (penny); if heads, flip the 2nd coin (dollar coin); if tails, flip the 3rd coin (dime)

observed: a, b, e, etc.; β€œWe run the code” vs. β€œThe run failed”
unobserved: vowel or consonant? part of speech?

slide-71
SLIDE 71

Three Coins/Unigram With Class Example

Imagine three coins: flip the 1st coin (penny); if heads, flip the 2nd coin (dollar coin); if tails, flip the 3rd coin (dime)

p(heads) = Ξ»      p(tails) = 1 βˆ’ Ξ»      (penny)
p(heads) = Ξ³      p(tails) = 1 βˆ’ Ξ³      (dollar coin)
p(heads) = ψ      p(tails) = 1 βˆ’ Οˆ      (dime)

slide-72
SLIDE 72

Three Coins/Unigram With Class Example

Imagine three coins

p(heads) = Ξ»      p(tails) = 1 βˆ’ Ξ»      (penny)
p(heads) = Ξ³      p(tails) = 1 βˆ’ Ξ³      (dollar coin)
p(heads) = ψ      p(tails) = 1 βˆ’ Οˆ      (dime)

Three parameters to estimate: Ξ», Ξ³, and ψ

slide-73
SLIDE 73

Three Coins/Unigram With Class Example

If all flips were observed

penny:           H H T H T H
2nd/3rd coin:    H T H T T T

p(heads) = Ξ»      p(tails) = 1 βˆ’ Ξ»      (penny)
p(heads) = Ξ³      p(tails) = 1 βˆ’ Ξ³      (dollar coin)
p(heads) = ψ      p(tails) = 1 βˆ’ Οˆ      (dime)

slide-74
SLIDE 74

Three Coins/Unigram With Class Example

If all flips were observed

penny:           H H T H T H
2nd/3rd coin:    H T H T T T

Ξ»: p(heads) = 4/6      p(tails) = 2/6      (penny)
Ξ³: p(heads) = 1/4      p(tails) = 3/4      (dollar coin)
ψ: p(heads) = 1/2      p(tails) = 1/2      (dime)

slide-75
SLIDE 75

Three Coins/Unigram With Class Example

But not all flips are observed β†’ set parameter values

penny (not observed):         H H T H T H
2nd/3rd coin (observed):      H T H T T T

p(heads) = Ξ» = .6      p(tails) = .4      (penny)
p(heads) = .8          p(tails) = .2      (dollar coin)
p(heads) = .6          p(tails) = .4      (dime)

slide-76
SLIDE 76

Three Coins/Unigram With Class Example

But not all flips are observed β†’ set parameter values

penny (not observed):         H H T H T H
2nd/3rd coin (observed):      H T H T T T

p(heads) = Ξ» = .6, p(tails) = .4 (penny);   p(heads) = .8, p(tails) = .2 (dollar coin);   p(heads) = .6, p(tails) = .4 (dime)

Use these values to compute posteriors:

p(heads | observed item H) = p(heads & H) / p(H)

p(heads | observed item T) = p(heads & T) / p(T)

slide-77
SLIDE 77

Three Coins/Unigram With Class Example

But not all flips are observed β†’ set parameter values

penny (not observed):         H H T H T H
2nd/3rd coin (observed):      H T H T T T

p(heads) = Ξ» = .6, p(tails) = .4 (penny);   p(heads) = .8, p(tails) = .2 (dollar coin);   p(heads) = .6, p(tails) = .4 (dime)

Use these values to compute posteriors:

p(heads | observed item H) = p(H | heads) p(heads) / p(H)

(rewrite the joint using Bayes rule; the denominator p(H) is the marginal likelihood)

slide-78
SLIDE 78

Three Coins/Unigram With Class Example

But not all flips are observed β†’ set parameter values

penny (not observed):         H H T H T H
2nd/3rd coin (observed):      H T H T T T

p(heads) = Ξ» = .6, p(tails) = .4 (penny);   p(heads) = .8, p(tails) = .2 (dollar coin);   p(heads) = .6, p(tails) = .4 (dime)

Use these values to compute posteriors:

p(heads | observed item H) = p(H | heads) p(heads) / p(H)

p(H | heads) = .8      p(T | heads) = .2

slide-79
SLIDE 79

Three Coins/Unigram With Class Example

But not all flips are observed β†’ set parameter values

penny (not observed):         H H T H T H
2nd/3rd coin (observed):      H T H T T T

p(heads) = Ξ» = .6, p(tails) = .4 (penny);   p(heads) = .8, p(tails) = .2 (dollar coin);   p(heads) = .6, p(tails) = .4 (dime)

Use these values to compute posteriors:

p(H) = p(H | heads) βˆ— p(heads) + p(H | tails) βˆ— p(tails) = .8 βˆ— .6 + .6 βˆ— .4

p(heads | observed item H) = p(H | heads) p(heads) / p(H)      with p(H | heads) = .8, p(T | heads) = .2

slide-80
SLIDE 80

Three Coins/Unigram With Class Example

penny (not observed):         H H T H T H
2nd/3rd coin (observed):      H T H T T T

Use posteriors to update parameters

p(heads | obs. H) = p(H | heads) p(heads) / p(H) = (.8 βˆ— .6) / (.8 βˆ— .6 + .6 βˆ— .4) β‰ˆ 0.667

p(heads | obs. T) = p(T | heads) p(heads) / p(T) = (.2 βˆ— .6) / (.2 βˆ— .6 + .4 βˆ— .4) β‰ˆ 0.429

(in general, p(heads | obs. H) and p(heads | obs. T) do NOT sum to 1)

slide-81
SLIDE 81

Three Coins/Unigram With Class Example

penny (not observed):         H H T H T H
2nd/3rd coin (observed):      H T H T T T

Use posteriors to update parameters

fully observed setting:   p(heads) = (# heads from penny) / (# total flips of penny)

our setting (partially observed):   p(heads) = (# expected heads from penny) / (# total flips of penny)

p(heads | obs. H) = (.8 βˆ— .6) / (.8 βˆ— .6 + .6 βˆ— .4) β‰ˆ 0.667      p(heads | obs. T) = (.2 βˆ— .6) / (.2 βˆ— .6 + .4 βˆ— .4) β‰ˆ 0.429

(in general, p(heads | obs. H) and p(heads | obs. T) do NOT sum to 1)

slide-82
SLIDE 82

Three Coins/Unigram With Class Example

penny (not observed):         H H T H T H
2nd/3rd coin (observed):      H T H T T T

Use posteriors to update parameters

our setting (partially observed):

p^(t+1)(heads) = (# expected heads from penny) / (# total flips of penny) = 𝔼_{p^(t)}[# heads from penny] / (# total flips of penny)

p(heads | obs. H) = (.8 βˆ— .6) / (.8 βˆ— .6 + .6 βˆ— .4) β‰ˆ 0.667      p(heads | obs. T) = (.2 βˆ— .6) / (.2 βˆ— .6 + .4 βˆ— .4) β‰ˆ 0.429

slide-83
SLIDE 83

Three Coins/Unigram With Class Example

penny (not observed):         H H T H T H
2nd/3rd coin (observed):      H T H T T T

Use posteriors to update parameters

our setting (partially observed):

p^(t+1)(heads) = (# expected heads from penny) / (# total flips of penny)
              = 𝔼_{p^(t)}[# heads from penny] / (# total flips of penny)
              = (2 βˆ— p(heads | obs. H) + 4 βˆ— p(heads | obs. T)) / 6 β‰ˆ 0.508

p(heads | obs. H) = (.8 βˆ— .6) / (.8 βˆ— .6 + .6 βˆ— .4) β‰ˆ 0.667      p(heads | obs. T) = (.2 βˆ— .6) / (.2 βˆ— .6 + .4 βˆ— .4) β‰ˆ 0.429

slide-84
SLIDE 84

Expectation Maximization (EM)

Two-step, iterative algorithm:

  • 0. Assume some value for your parameters
  • 1. E-step: count under uncertainty (compute expectations)
  • 2. M-step: maximize log-likelihood, assuming these uncertain counts

slide-85
SLIDE 85

Related to EM

Latent clustering:

  • K-means: https://www.csee.umbc.edu/courses/undergraduate/473/f19/kmeans/
  • Gaussian mixture modeling