

SLIDE 1

Part-of-Speech Tagging: HMM & structured perceptron

CMSC 723 / LING 723 / INST 725 MARINE CARPUAT

marine@cs.umd.edu

SLIDE 2

Last time…

  • What are parts of speech (POS)?

– Equivalence classes or categories of words
– Open class vs. closed class
– Nouns, Verbs, Adjectives, Adverbs (English)

  • What is POS tagging?

– Assigning POS tags to words in context
– Penn Treebank

  • How to POS tag text automatically?

– Multiclass classification vs. sequence labeling

SLIDE 3

Today

  • 2 approaches to POS tagging

– Hidden Markov Models
– Structured Perceptron

SLIDE 4

Hidden Markov Models

  • Common approach to sequence labeling
  • A finite state machine with probabilistic

transitions

  • Markov Assumption

– the next state depends only on the current state and is independent of the previous history
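In symbols (a standard statement of the assumption, not shown on the slide): P(qt+1 | q1, q2, …, qt) = P(qt+1 | qt)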

SLIDE 5

HMM: Formal Specification

  • Q: a finite set of N states

– Q = {q0, q1, q2, q3, …}

  • N  N Transition probability matrix A = [aij]

– aij = P(qj|qi), Σj aij = 1 for each i

  • Sequence of observations O = o1, o2, ... oT

– Each drawn from a given set of symbols (vocabulary V)

  • N  |V| Emission probability matrix, B = [bit]

– bit = bi(ot) = P(ot|qi), Σt bit = 1 for each i

  • Start and end states

– An explicit start state q0, or alternatively, a prior distribution over start states: {π1, π2, π3, …}, Σi πi = 1
– The set of final states: qF

Markov Assumption
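As a concrete illustration (not from the slides), the parameters above can be stored as a prior vector, a transition matrix, and an emission matrix; the numbers below are hypothetical placeholders:

import numpy as np

# Hypothetical 3-state HMM over a 3-symbol vocabulary; all values are made-up placeholders.
pi = np.array([0.5, 0.2, 0.3])              # prior over start states, sums to 1
A = np.array([[0.6, 0.3, 0.1],              # A[i, j] = P(q_j | q_i); each row sums to 1
              [0.2, 0.5, 0.3],
              [0.4, 0.2, 0.4]])
B = np.array([[0.7, 0.1, 0.2],              # B[i, k] = P(symbol k | q_i); each row sums to 1
              [0.1, 0.6, 0.3],
              [0.3, 0.3, 0.4]])

assert np.isclose(pi.sum(), 1.0)
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)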

SLIDE 6

Stock Market HMM

[Figure: stock market HMM with states Bull, Bear, Static; priors π1=0.5, π2=0.2, π3=0.3]

States? ✓  Transitions? ✓  Vocabulary? ✓  Emissions? ✓  Priors? ✓

SLIDE 7

HMMs: Three Problems

  • Likelihood: Given an HMM λ = (A, B, Π) and a sequence of observed events O, find P(O|λ)

  • Decoding: Given an HMM λ = (A, B, Π) and an observation sequence O, find the most likely (hidden) state sequence

  • Learning: Given a set of observation sequences and the set of states Q in λ, compute the parameters A and B

SLIDE 8

HMM Problem #1: Likelihood

SLIDE 9

Computing Likelihood

t:  1  2  3  4  5  6
O:  ↑  ↓  ↔  ↑  ↓  ↔      (model: λstock)

Assuming λstock models the stock market, how likely are we to observe the sequence of outputs?

π1=0.5 π2=0.2 π3=0.3

SLIDE 10

Computing Likelihood

  • First try:

– Sum over all possible ways in which we could generate O from λ

Takes O(N^T) time to compute!
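For example, with the N = 3 stock-market states and the T = 6 observations from the previous slide, brute-force enumeration already sums over 3^6 = 729 state sequences; the count grows exponentially with T.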

SLIDE 11

Forward Algorithm

  • Use an N  T trellis or chart [αtj]
  • Forward probabilities: αtj or αt(j)

= P(being in state j after seeing t observations) = P(o1, o2, ... ot, qt=j)

  • Each cell = ∑ extensions of all paths from other cells

αt(j) = ∑i αt-1(i) aij bj(ot)

– αt-1(i): forward path probability until (t-1)
– aij: transition probability of going from state i to j
– bj(ot): probability of emitting symbol ot in state j

  • P(O|λ) = ∑i αT(i)
SLIDE 12

Forward Algorithm: Formal Definition

  • Initialization
  • Recursion
  • Termination
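The slide's formulas are not captured in this transcript; as a rough sketch (assuming the numpy-style pi, A, B arrays from the earlier illustration, with observations given as symbol indices), the three steps look like:

import numpy as np

def forward(pi, A, B, obs):
    # Likelihood P(O | lambda) computed with the forward trellis.
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                  # initialization: alpha_1(j) = pi_j * b_j(o_1)
    for t in range(1, T):                         # recursion: alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
        for j in range(N):
            alpha[t, j] = (alpha[t - 1] * A[:, j]).sum() * B[j, obs[t]]
    return alpha[T - 1].sum()                     # termination: P(O | lambda) = sum_i alpha_T(i)

Filling the trellis takes O(N²·T) time, versus the exponential cost of the brute-force sum.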
SLIDE 13

Forward Algorithm

O = ↑ ↓ ↑      find P(O|λstock)

SLIDE 14

Forward Algorithm

[Trellis: states Bear, Bull, Static (rows) × time steps t=1, 2, 3 (columns); observations ↑ ↓ ↑]

SLIDE 15

Forward Algorithm: Initialization

α1(Bull), α1(Bear), α1(Static)

[Trellis: states Bear, Bull, Static × time steps t=1, 2, 3; observations ↑ ↓ ↑]

t=1 cells: 0.2×0.7 = 0.14    0.5×0.1 = 0.05    0.3×0.3 = 0.09

SLIDE 16

Forward Algorithm: Recursion

0.140.60.1=0.0084

α1(Bull)aBullBullbBull(↓)

.... and so on

time

↑ ↓ ↑

t=1 t=2 t=3

0.20.7=0 .14 0.50.1 =0.05 0.30.3 =0.09 0.0145

Bear Bull Static

states

SLIDE 17

Forward Algorithm: Recursion

[Trellis: states Bear, Bull, Static × time steps t=1, 2, 3; observations ↑ ↓ ↑
 t=1 cells: 0.2×0.7 = 0.14, 0.5×0.1 = 0.05, 0.3×0.3 = 0.09; one t=2 cell is 0.0145, the remaining cells are left as ?]

Work through the rest of these numbers… What's the asymptotic complexity of this algorithm?

SLIDE 18

HMM Problem #2: Decoding

SLIDE 19

Decoding

Given λstock as our model and O as our observations, what are the most likely states the market went through to produce O?

t:  1  2  3  4  5  6
O:  ↑  ↓  ↔  ↑  ↓  ↔      (model: λstock)

π1=0.5 π2=0.2 π3=0.3

SLIDE 20

Decoding

  • “Decoding” because states are hidden
  • First try:

– Compute P(O) for all possible state sequences, then choose sequence with highest probability

SLIDE 21

Viterbi Algorithm

  • “Decoding” = computing most likely state

sequence

– Another dynamic programming algorithm
– Efficient: polynomial vs. exponential (brute force)

  • Same idea as the forward algorithm

– Store intermediate computation results in a trellis
– Build new cells from existing cells

SLIDE 22

Viterbi Algorithm

  • Use an N  T trellis [vtj]

– Just like in forward algorithm

  • vtj or vt(j)

= P(in state j after seeing t observations and passing through the most likely state sequence so far) = P(q1, q2, ... qt-1, qt=j, o1, o2, ... ot)

  • Each cell = extension of most likely path from other cells

vt(j) = maxi vt-1(i) aij bj(ot)

– vt-1(i): Viterbi probability until (t-1)
– aij: transition probability of going from state i to j
– bj(ot): probability of emitting symbol ot in state j

  • P = maxi vT(i)
SLIDE 23

Viterbi vs. Forward

  • Maximization instead of summation over previous paths
  • This algorithm is still missing something!

– In the forward algorithm, we only care about the probabilities
– What's different here?

  • We need to store the most likely path (transition):

– Use “backpointers” to keep track of the most likely transition
– At the end, follow the chain of backpointers to recover the most likely state sequence

SLIDE 24

Viterbi Algorithm: Formal Definition

  • Initialization
  • Recursion
  • Termination
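As with the forward algorithm, the slide's formulas are not in the transcript; a minimal sketch under the same assumptions, with backpointers added so the best state sequence can be recovered:

import numpy as np

def viterbi(pi, A, B, obs):
    # Most likely hidden state sequence and its probability.
    N, T = len(pi), len(obs)
    v = np.zeros((T, N))
    bp = np.zeros((T, N), dtype=int)              # backpointers: best previous state for each cell
    v[0] = pi * B[:, obs[0]]                      # initialization
    for t in range(1, T):                         # recursion: max instead of sum
        for j in range(N):
            scores = v[t - 1] * A[:, j]
            bp[t, j] = scores.argmax()
            v[t, j] = scores[bp[t, j]] * B[j, obs[t]]
    path = [int(v[T - 1].argmax())]               # termination: best final state
    for t in range(T - 1, 0, -1):                 # follow the chain of backpointers
        path.append(int(bp[t, path[-1]]))
    return list(reversed(path)), v[T - 1].max()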
SLIDE 25

Viterbi Algorithm

O = ↑ ↓ ↑      find the most likely state sequence given λstock

SLIDE 26

Viterbi Algorithm

[Trellis: states Bear, Bull, Static (rows) × time steps t=1, 2, 3 (columns); observations ↑ ↓ ↑]

SLIDE 27

Viterbi Algorithm: Initialization

v1(Bull), v1(Bear), v1(Static)

[Trellis: states Bear, Bull, Static × time steps t=1, 2, 3; observations ↑ ↓ ↑]

t=1 cells: 0.2×0.7 = 0.14    0.5×0.1 = 0.05    0.3×0.3 = 0.09

SLIDE 28

Viterbi Algorithm: Recursion

0.140.60.1=0.0084

Max

α1(Bull)aBullBullbBull(↓)

time

↑ ↓ ↑

t=1 t=2 t=3

0.20.7=0 .14 0.50.1 =0.05 0.30.3 =0.09 0.0084

Bear Bull Static

states

SLIDE 29

Viterbi Algorithm: Recursion

.... and so on (store a backpointer to the best incoming transition at each cell)

[Trellis: states Bear, Bull, Static × time steps t=1, 2, 3; observations ↑ ↓ ↑
 t=1 cells: 0.2×0.7 = 0.14, 0.5×0.1 = 0.05, 0.3×0.3 = 0.09; t=2 cell: 0.0084]

SLIDE 30

Viterbi Algorithm: Recursion

[Trellis: states Bear, Bull, Static × time steps t=1, 2, 3; observations ↑ ↓ ↑
 t=1 cells: 0.2×0.7 = 0.14, 0.5×0.1 = 0.05, 0.3×0.3 = 0.09; t=2 cell: 0.0084, the remaining cells are left as ?]

Work through the rest of the algorithm…

SLIDE 31

POS Tagging with HMMs

SLIDE 32

HMM for POS tagging: intuition

Credit: Jordan Boyd-Graber

SLIDE 33

HMM for POS tagging: intuition

Credit: Jordan Boyd-Graber

SLIDE 34

HMMs: Three Problems

  • Likelihood: Given an HMM λ = (A, B, Π) and a sequence of observed events O, find P(O|λ)

  • Decoding: Given an HMM λ = (A, B, Π) and an observation sequence O, find the most likely (hidden) state sequence

  • Learning: Given a set of observation sequences and the set of states Q in λ, compute the parameters A and B

SLIDE 35

HMM Problem #3: Learning

SLIDE 36

Learning HMMs for POS tagging is a supervised task

  • A POS tagged corpus tells us the hidden states!
  • We can compute Maximum Likelihood Estimates

(MLEs) for the various parameters

– MLE = fancy way of saying “count and divide”

  • These parameter estimates maximize the

likelihood of the data being generated by the model

SLIDE 37

Supervised Training

  • Transition Probabilities

– Any P(ti | ti-1) = C(ti-1, ti) / C(ti-1), from the tagged data
– Example: for P(NN|VB)

  • count how many times a noun follows a verb
  • divide by the total number of times you see a verb
SLIDE 38

Supervised Training

  • Emission Probabilities

– Any P(wi | ti) = C(wi, ti) / C(ti), from the tagged data
– For P(bank|NN)

  • count how many times bank is tagged as a noun
  • divide by how many times anything is tagged as a

noun

SLIDE 39

Supervised Training

  • Priors

– Any P(q1 = ti) = πi = C(ti)/N, from the tagged data
– For πNN, count the number of times NN occurs and divide by the total number of tags
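Putting the last three slides together, the count-and-divide estimates can be sketched as follows (function and variable names are assumptions, not code from the lecture):

from collections import Counter

def train_hmm(tagged_sentences):
    # tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    trans = Counter()        # C(t_{i-1}, t_i)
    emit = Counter()         # C(w_i, t_i)
    tag_count = Counter()    # C(t_i)
    prev_count = Counter()   # C(t_{i-1}) counted as a transition history
    for sent in tagged_sentences:
        tags = [t for _, t in sent]
        for word, tag in sent:
            tag_count[tag] += 1
            emit[(tag, word)] += 1
        for t_prev, t in zip(tags, tags[1:]):
            trans[(t_prev, t)] += 1
            prev_count[t_prev] += 1
    total_tags = sum(tag_count.values())
    P_T = {(a, b): c / prev_count[a] for (a, b), c in trans.items()}   # P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})
    P_E = {(t, w): c / tag_count[t] for (t, w), c in emit.items()}     # P(w_i | t_i) = C(w_i, t_i) / C(t_i)
    priors = {t: c / total_tags for t, c in tag_count.items()}         # pi_i = C(t_i) / N
    return P_T, P_E, priors

For instance, P_T[("VB", "NN")] is C(VB, NN) / C(VB), exactly the P(NN|VB) recipe above.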

SLIDE 40

HMMs: Three Problems

  • Likelihood: Given an HMM λ = (A, B, Π) and a sequence of observed events O, find P(O|λ)

  • Decoding: Given an HMM λ = (A, B, Π) and an observation sequence O, find the most likely (hidden) state sequence

  • Learning: Given a set of observation sequences and the set of states Q in λ, compute the parameters A and B

SLIDE 41

Prediction Problems

  • Given x, predict y

A book review: “Oh, man I love this book!” / “This book is so boring...”
→ Is it positive? yes / no
Binary Prediction (2 choices)

A tweet: “On the way to the park!” / “公園に行くなう!”
→ Its language? English / Japanese
Multi-class Prediction (several choices)

A sentence: “I read a book”
→ Its syntactic parse [tree over “I read a book” with labels S, NP, VP, DET, NN, N, VBD]
Structured Prediction (millions of choices)

SLIDE 42

Approaches to POS tagging

Classifiers
– Multiclass classification problem
– Logistic Regression
– Models context using lots of features

Generative Models
– Structured prediction problem (sequence labeling)
– Hidden Markov Models
– Model transitions between states/POS

Structured perceptron → classification with lots of features over structured models!
SLIDE 43

Let’s restructure HMMs with features…

  • Given a sentence X, predict its part of

speech sequence Y

Natural language processing ( NLP ) is a field of computer science

JJ NN NN -LRB- NN -RRB- VBZ DT NN IN NN NN

SLIDE 44

Let’s restructure HMMs with features…

  • POS→POS transition probabilities
  • POS→Word emission probabilities

words:  natural language processing ( nlp ) ...
tags:   <s> JJ NN NN LRB NN RRB ... </s>

PT(JJ|<s>) PT(NN|JJ) PT(NN|NN) …    PE(natural|JJ) PE(language|NN) PE(processing|NN) …

P(Y) ≈ ∏i=1..I+1 PT(yi | yi−1)
P(X | Y) ≈ ∏i=1..I PE(xi | yi)

SLIDE 45

Restructuring HMM With Features

Normal HMM:

P(X, Y) = ∏i=1..I PE(xi | yi) × ∏i=1..I+1 PT(yi | yi−1)

SLIDE 46

Restructuring HMM With Features

Normal HMM:

P(X, Y) = ∏i=1..I PE(xi | yi) × ∏i=1..I+1 PT(yi | yi−1)

Log Likelihood:

log P(X, Y) = ∑i=1..I log PE(xi | yi) + ∑i=1..I+1 log PT(yi | yi−1)

SLIDE 47

Restructuring HMM With Features

Normal HMM:

P(X, Y) = ∏i=1..I PE(xi | yi) × ∏i=1..I+1 PT(yi | yi−1)

Log Likelihood:

log P(X, Y) = ∑i=1..I log PE(xi | yi) + ∑i=1..I+1 log PT(yi | yi−1)

Score:

S(X, Y) = ∑i=1..I wE,yi,xi + ∑i=1..I+1 wT,yi−1,yi

SLIDE 48

Restructuring HMM With Features

Normal HMM:

P(X, Y) = ∏i=1..I PE(xi | yi) × ∏i=1..I+1 PT(yi | yi−1)

Log Likelihood:

log P(X, Y) = ∑i=1..I log PE(xi | yi) + ∑i=1..I+1 log PT(yi | yi−1)

Score:

S(X, Y) = ∑i=1..I wE,yi,xi + ∑i=1..I+1 wT,yi−1,yi

When:

wE,yi,xi = log PE(xi | yi)
wT,yi−1,yi = log PT(yi | yi−1)

then  log P(X, Y) = S(X, Y)

SLIDE 49

Example

φ( I visited Nara → PRP VBD NNP ) =    (correct tagging, Y1)

φ( I visited Nara → NNP VBD NNP ) =    (incorrect tagging, Y2)

φT,<S>,PRP(X,Y1) = 1   φT,PRP,VBD(X,Y1) = 1   φT,VBD,NNP(X,Y1) = 1   φT,NNP,</S>(X,Y1) = 1
φE,PRP,”I”(X,Y1) = 1   φE,VBD,”visited”(X,Y1) = 1   φE,NNP,”Nara”(X,Y1) = 1
φCAPS,PRP(X,Y1) = 1   φCAPS,NNP(X,Y1) = 1   φSUF,VBD,”...ed”(X,Y1) = 1

φT,<S>,NNP(X,Y2) = 1   φT,NNP,VBD(X,Y2) = 1   φT,VBD,NNP(X,Y2) = 1   φT,NNP,</S>(X,Y2) = 1
φE,NNP,”I”(X,Y2) = 1   φE,VBD,”visited”(X,Y2) = 1   φE,NNP,”Nara”(X,Y2) = 1
φCAPS,NNP(X,Y2) = 2   φSUF,VBD,”...ed”(X,Y2) = 1
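A hypothetical feature extractor producing indicator counts like these (feature names are tuples rather than the φ subscripts on the slide) might look as follows:

from collections import Counter

def create_features(words, tags):
    # Transition, emission, capitalization, and suffix features for one candidate tagging.
    phi = Counter()
    padded = ["<S>"] + list(tags) + ["</S>"]
    for prev, cur in zip(padded, padded[1:]):
        phi[("T", prev, cur)] += 1               # transition features (phi_T)
    for word, tag in zip(words, tags):
        phi[("E", tag, word)] += 1               # emission features (phi_E)
        if word[0].isupper():
            phi[("CAPS", tag)] += 1              # capitalization feature (phi_CAPS)
        if word.endswith("ed"):
            phi[("SUF", tag, "...ed")] += 1      # suffix feature (phi_SUF)
    return phi

Calling create_features("I visited Nara".split(), ["PRP", "VBD", "NNP"]) fires ("T", "<S>", "PRP"), ("E", "VBD", "visited"), ("CAPS", "NNP"), ("SUF", "VBD", "...ed"), and so on, matching the Y1 column above.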

SLIDE 50

How to decode?

  • We must find the POS sequence that satisfies:

Ŷ = argmaxY ∑j wj φj(X, Y)

  • Solution: the Viterbi algorithm

SLIDE 51

HMM Viterbi Algorithm

  • Forward step, calculate the best path to a

node

  • Find the path to each node with the lowest

negative log probability

  • Backward step, reproduce the path
SLIDE 52

Forward Step: Part 1

  • First, calculate transition from <S> and

emission of the first word for every POS

[Graph: start node 0:<S> with edges to candidate nodes 1:NN, 1:JJ, 1:VB, 1:PRP, 1:NNP for the first word “I”]

best_score[“1 NN”]  = -log PT(NN|<S>)  + -log PE(I | NN)
best_score[“1 JJ”]  = -log PT(JJ|<S>)  + -log PE(I | JJ)
best_score[“1 VB”]  = -log PT(VB|<S>)  + -log PE(I | VB)
best_score[“1 PRP”] = -log PT(PRP|<S>) + -log PE(I | PRP)
best_score[“1 NNP”] = -log PT(NNP|<S>) + -log PE(I | NNP)

SLIDE 53

Forward Step: Middle Parts

  • For middle words, calculate the minimum

score for all possible previous POS tags

[Graph: nodes 1:NN, 1:JJ, 1:VB, 1:PRP, 1:NNP (word “I”) with edges into nodes 2:NN, 2:JJ, 2:VB, 2:PRP, 2:NNP (word “visited”)]

best_score[“2 NN”] = min(
    best_score[“1 NN”]  + -log PT(NN|NN)  + -log PE(visited | NN),
    best_score[“1 JJ”]  + -log PT(NN|JJ)  + -log PE(visited | NN),
    best_score[“1 VB”]  + -log PT(NN|VB)  + -log PE(visited | NN),
    best_score[“1 PRP”] + -log PT(NN|PRP) + -log PE(visited | NN),
    best_score[“1 NNP”] + -log PT(NN|NNP) + -log PE(visited | NN),
    ... )

best_score[“2 JJ”] = min(
    best_score[“1 NN”]  + -log PT(JJ|NN) + -log PE(visited | JJ),
    best_score[“1 JJ”]  + -log PT(JJ|JJ) + -log PE(visited | JJ),
    best_score[“1 VB”]  + -log PT(JJ|VB) + -log PE(visited | JJ),
    ... )

SLIDE 54

HMM Viterbi with Features

  • Same as probabilities, use feature weights

[Graph: start node 0:<S> with edges to candidate nodes 1:NN, 1:JJ, 1:VB, 1:PRP, 1:NNP for the first word “I”]

best_score[“1 NN”]  = wT,<S>,NN  + wE,NN,I
best_score[“1 JJ”]  = wT,<S>,JJ  + wE,JJ,I
best_score[“1 VB”]  = wT,<S>,VB  + wE,VB,I
best_score[“1 PRP”] = wT,<S>,PRP + wE,PRP,I
best_score[“1 NNP”] = wT,<S>,NNP + wE,NNP,I

SLIDE 55

HMM Viterbi with Features

  • Can add additional features

[Graph: start node 0:<S> with edges to candidate nodes 1:NN, 1:JJ, 1:VB, 1:PRP, 1:NNP for the first word “I”]

best_score[“1 NN”]  = wT,<S>,NN  + wE,NN,I  + wCAPS,NN
best_score[“1 JJ”]  = wT,<S>,JJ  + wE,JJ,I  + wCAPS,JJ
best_score[“1 VB”]  = wT,<S>,VB  + wE,VB,I  + wCAPS,VB
best_score[“1 PRP”] = wT,<S>,PRP + wE,PRP,I + wCAPS,PRP
best_score[“1 NNP”] = wT,<S>,NNP + wE,NNP,I + wCAPS,NNP
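With a weight map defaulting to zero, each best_score line above is just the sum of the weights of the features that fire; a small hypothetical sketch (maximizing the score, in line with the argmax decoding rule on a later slide):

from collections import defaultdict

w = defaultdict(float)                    # hypothetical learned weights; unseen features score 0.0

def score_first(tag, word):
    # Score for starting the sequence with `tag` emitting `word`.
    s = w[("T", "<S>", tag)] + w[("E", tag, word)]
    if word[0].isupper():
        s += w[("CAPS", tag)]             # the extra capitalization feature added on this slide
    return s

best_tag = max(["NN", "JJ", "VB", "PRP", "NNP"], key=lambda t: score_first(t, "I"))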

SLIDE 56

Learning in the Structured Perceptron

  • Remember the perceptron algorithm
  • If there is a mistake:
  • Update weights to:

– increase the score of positive examples
– decrease the score of negative examples

  • What is positive/negative in structured

perceptron?

w ← w + y φ(x)

SLIDE 57

Learning in the Structured Perceptron

  • Positive example, correct feature vector:
  • Negative example, incorrect feature vector:

φ( I visited Nara → PRP VBD NNP )      (correct)

φ( I visited Nara → NNP VBD NNP )      (incorrect)

SLIDE 58

Choosing an Incorrect Feature Vector

  • There are many incorrect feature vectors!

φ( I visited Nara → NNP VBD NNP )

φ( I visited Nara → PRP VBD NN )

φ( I visited Nara → PRP VB NNP )

SLIDE 59

Choosing an Incorrect Feature Vector

  • Answer: We update using the incorrect

answer with the highest score

  • Our update rule becomes:
  • Y' is the correct answer
  • Note: If highest scoring answer is correct, no change

Ŷ = argmaxY ∑j wj φj(X, Y)

w ← w + φ(X, Y′) − φ(X, Ŷ)
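Concretely, for the SLIDE 49 example: features that fire only in the correct tagging (e.g., φT,<S>,PRP and φE,PRP,”I”) get their weights increased, features that fire only in the highest-scoring incorrect tagging (e.g., φT,<S>,NNP and φE,NNP,”I”) get their weights decreased, and features shared by both taggings (e.g., φE,VBD,”visited”) cancel out and are left unchanged.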

SLIDE 60

Structured Perceptron Algorithm

  • create map w

for I iterations
    for each labeled pair X, Y_prime in the data
        Y_hat = hmm_viterbi(w, X)
        phi_prime = create_features(X, Y_prime)
        phi_hat = create_features(X, Y_hat)
        w += phi_prime - phi_hat
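A runnable version of this loop, reusing the hypothetical create_features sketch above and decoding with transition and emission feature weights only (all helper names are assumptions, not code from the lecture):

from collections import defaultdict

def hmm_viterbi(w, words, tagset):
    # Best tag sequence under the current feature weights (T and E features only, for brevity).
    best = [dict() for _ in words]
    back = [dict() for _ in words]
    for tag in tagset:
        best[0][tag] = w[("T", "<S>", tag)] + w[("E", tag, words[0])]
    for i in range(1, len(words)):
        for tag in tagset:
            scores = {p: best[i - 1][p] + w[("T", p, tag)] + w[("E", tag, words[i])] for p in tagset}
            back[i][tag] = max(scores, key=scores.get)
            best[i][tag] = scores[back[i][tag]]
    final = {t: best[-1][t] + w[("T", t, "</S>")] for t in tagset}
    tags = [max(final, key=final.get)]
    for i in range(len(words) - 1, 0, -1):        # follow backpointers
        tags.append(back[i][tags[-1]])
    return list(reversed(tags))

def train_structured_perceptron(data, tagset, iterations=5):
    # data: list of (words, gold_tags) pairs; returns the learned weight map.
    w = defaultdict(float)
    for _ in range(iterations):
        for words, y_prime in data:
            y_hat = hmm_viterbi(w, words, tagset)
            phi_prime = create_features(words, y_prime)
            phi_hat = create_features(words, y_hat)
            for f in set(phi_prime) | set(phi_hat):
                w[f] += phi_prime[f] - phi_hat[f]   # no-op when y_hat == y_prime
    return w

Scoring the extra CAPS and SUF features during decoding would follow the same pattern as on SLIDE 55.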

SLIDE 61

Recap: POS tagging…

  • A structured prediction task

– Hidden Markov Models

  • Decoding with Viterbi algorithm
  • Supervised training: counts-based estimates

– Structured Perceptron

  • Decoding with Viterbi algorithm
  • Supervised training: perceptron algorithm with

structured weight updates

  • First step toward modeling syntactic structure