Part-of-Speech Tagging: HMM & structured perceptron
CMSC 723 / LING 723 / INST 725 MARINE CARPUAT
marine@cs.umd.edu
Last time: What are parts of speech (POS)?
– Equivalence classes or categories of words
– Open class vs. closed class
– Nouns, Verbs, Adjectives, Adverbs (English)
– Assigning POS tags to words in context
– Penn Treebank
– Multiclass classification vs. sequence labeling
– the next state depends only on the current state, independent of the previous history
– States: Q = {q0, q1, q2, q3, …}
– Transition probabilities: aij = P(qj | qi), with ∑j aij = 1 for every state i
– Observations: each drawn from a given set of symbols (vocabulary V)
– Emission probabilities: bi(ot) = P(ot | qi), with ∑o bi(o) = 1 for every state i
– An explicit start state q0, or alternatively a prior distribution over start states: {π1, π2, π3, …}, with ∑i πi = 1
– The set of final states: qF
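As a concrete sketch, the components above can be stored as plain dictionaries. The prior values follow the slides (0.5, 0.2, 0.3), but their assignment to named states and all transition/emission numbers here are illustrative, not the slides' full model.

```python
# Sketch of HMM parameters as dictionaries; numbers are illustrative.
states = ["Bull", "Bear", "Static"]
vocab = ["up", "down", "flat"]  # stand-ins for the arrows in the slides

pi = {"Bull": 0.5, "Bear": 0.2, "Static": 0.3}  # prior over start states

# A[i][j] = P(next state j | current state i); each row sums to 1
A = {
    "Bull":   {"Bull": 0.6, "Bear": 0.2, "Static": 0.2},
    "Bear":   {"Bull": 0.5, "Bear": 0.3, "Static": 0.2},
    "Static": {"Bull": 0.4, "Bear": 0.1, "Static": 0.5},
}

# B[i][o] = P(observation o | state i); each row sums to 1
B = {
    "Bull":   {"up": 0.7, "down": 0.1, "flat": 0.2},
    "Bear":   {"up": 0.1, "down": 0.6, "flat": 0.3},
    "Static": {"up": 0.3, "down": 0.3, "flat": 0.4},
}

# The Markov constraints from the definition above
assert abs(sum(pi.values()) - 1.0) < 1e-9
for q in states:
    assert abs(sum(A[q].values()) - 1.0) < 1e-9
    assert abs(sum(B[q].values()) - 1.0) < 1e-9
```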
Markov Assumption
π1=0.5 π2=0.2 π3=0.3
States? ✓ Transitions? ✓ Vocabulary? ✓ Emissions? ✓ Priors? ✓
– Likelihood: given a model λ and a sequence of observed events O, find P(O|λ)
– Decoding: given O and λ, find the most likely (hidden) state sequence
– Learning: given O and the set of states Q in λ, compute the parameters A and B
t: 1 2 3 4 5 6
O: ↑ ↓ ↔ ↑ ↓ ↔
(model: λstock)
Assuming λstock models the stock market, how likely are we to observe the sequence of outputs?
π1=0.5 π2=0.2 π3=0.3
– Sum over all possible ways in which we could generate O from λ
Takes O(N^T) time to compute by brute force!
αt(j) = P(being in state j after seeing the first t observations) = P(o1, o2, ... ot, qt=j)
αt(j) = ∑i αt-1(i) aij bj(ot)
– αt-1(i): forward path probability until (t-1)
– aij: transition probability of going from state i to j
– bj(ot): probability of emitting symbol ot in state j
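The recurrence above translates almost line for line into code. This is a generic sketch assuming dictionary-shaped parameters pi/A/B as in the earlier sketch, not an implementation from the slides:

```python
def forward(obs, states, pi, A, B):
    """Forward algorithm: computes P(O | lambda).

    alpha_t(j) = P(o_1 .. o_t, q_t = j), built left to right:
      alpha_1(j) = pi_j * b_j(o_1)
      alpha_t(j) = (sum_i alpha_{t-1}(i) * a_ij) * b_j(o_t)
    """
    # t = 1 base case: prior times emission
    alpha = [{j: pi[j] * B[j][obs[0]] for j in states}]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        # new cell for state j: sum over all predecessor states i
        alpha.append({
            j: sum(prev[i] * A[i][j] for i in states) * B[j][obs[t]]
            for j in states
        })
    # marginalize over the final state
    return sum(alpha[-1].values())
```

Each of the T time steps sums over N predecessors for each of N target states, which is where the O(N²T) running time comes from, versus O(N^T) for brute-force enumeration of state sequences.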
O = ↑ ↓ ↑: find P(O|λstock)
time
↑ ↓ ↑
t=1 t=2 t=3
Bear Bull Static
α1(Bull) α1(Bear) α1(Static)
time
↑ ↓ ↑
t=1 t=2 t=3
0.2·0.7 = 0.14   0.5·0.1 = 0.05   0.3·0.3 = 0.09
Bear Bull Static
α1(Bull) · aBull,Bull · bBull(↓) = 0.14 · 0.6 · 0.1 = 0.0084
α2(Bull) = ∑i α1(i) · ai,Bull · bBull(↓)
time
↑ ↓ ↑
t=1 t=2 t=3
0.2·0.7 = 0.14   0.5·0.1 = 0.05   0.3·0.3 = 0.09   0.0145
Bear Bull Static
time
↑ ↓ ↑
t=1 t=2 t=3
0.2·0.7 = 0.14   0.5·0.1 = 0.05   0.3·0.3 = 0.09   0.0145   ? ? ? ? ?
Bear Bull Static
Work through the rest of these numbers… What’s the asymptotic complexity of this algorithm?
Given λstock as our model and O as our observations, what are the most likely states the market went through to produce O?
t: 1 2 3 4 5 6
O: ↑ ↓ ↔ ↑ ↓ ↔
(model: λstock)
π1=0.5 π2=0.2 π3=0.3
– Naïve approach: compute P(O) for all possible state sequences, then choose the sequence with the highest probability
– Another dynamic programming algorithm – Efficient: polynomial vs. exponential (brute force)
– Store intermediate computation results in a trellis – Build new cells from existing cells
– Just like in forward algorithm
vt(j) = P(in state j after seeing t observations and passing through the most likely state sequence so far) = P(q1, q2, ... qt-1, qt=j, o1, o2, ... ot)
vt(j) = maxi vt-1(i) aij bj(ot)
– vt-1(i): Viterbi probability until (t-1)
– aij: transition probability of going from state i to j
– bj(ot): probability of emitting symbol ot in state j
– In the forward algorithm, we only care about the probabilities
– What's different here?
– Use “backpointers” to keep track of most likely transition – At the end, follow the chain of backpointers to recover the most likely state sequence
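A sketch of the Viterbi recurrence with backpointers, in the same style as the forward sketch (hypothetical pi/A/B dictionaries): the trellis is identical, with max replacing sum, plus a record of which predecessor achieved each max.

```python
def viterbi(obs, states, pi, A, B):
    """Viterbi algorithm: most likely state sequence for obs.

    v_t(j) = max_i v_{t-1}(i) * a_ij * b_j(o_t)
    Backpointers record the argmax so the best path can be recovered.
    """
    v = [{j: pi[j] * B[j][obs[0]] for j in states}]
    back = [{}]  # back[t][j] = best predecessor of state j at time t
    for t in range(1, len(obs)):
        v.append({})
        back.append({})
        for j in states:
            # best predecessor state i for arriving in j at time t
            best = max(states, key=lambda i: v[t - 1][i] * A[i][j])
            v[t][j] = v[t - 1][best] * A[best][j] * B[j][obs[t]]
            back[t][j] = best
    # pick the best final state, then follow backpointers to the start
    last = max(states, key=lambda j: v[-1][j])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, v[-1][last]
```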
O = ↑ ↓ ↑: find the most likely state sequence given λstock
time
↑ ↓ ↑
t=1 t=2 t=3
Bear Bull Static
v1(Bull) v1(Bear) v1(Static)
time
↑ ↓ ↑
t=1 t=2 t=3
0.2·0.7 = 0.14   0.5·0.1 = 0.05   0.3·0.3 = 0.09
Bear Bull Static
v1(Bull) · aBull,Bull · bBull(↓) = 0.14 · 0.6 · 0.1 = 0.0084
v2(Bull) = maxi v1(i) · ai,Bull · bBull(↓)
time
↑ ↓ ↑
t=1 t=2 t=3
0.2·0.7 = 0.14   0.5·0.1 = 0.05   0.3·0.3 = 0.09   0.0084
Bear Bull Static
time
↑ ↓ ↑
t=1 t=2 t=3
0.2·0.7 = 0.14   0.5·0.1 = 0.05   0.3·0.3 = 0.09   0.0084
Bear Bull Static
time
↑ ↓ ↑
t=1 t=2 t=3
Bear Bull Static
0.2·0.7 = 0.14   0.5·0.1 = 0.05   0.3·0.3 = 0.09   0.0084   ? ? ? ? ?
Work through the rest of the algorithm…
Credit: Jordan Boyd-Graber
– Likelihood: given a model λ and a sequence of observed events O, find P(O|λ)
– Decoding: given O and λ, find the most likely (hidden) state sequence
– Learning: given O and the set of states Q in λ, compute the parameters A and B
– Maximum likelihood estimates (MLEs) for the various parameters
– MLE = fancy way of saying "count and divide"
– MLE maximizes the likelihood of the data being generated by the model
– Any P(ti | ti-1) = C(ti-1, ti) / C(ti-1), from the tagged data
– Example: P(NN|VB) = C(VB, NN) / C(VB)
– Any P(wi | ti) = C(wi, ti) / C(ti), from the tagged data
– Example: P(bank|NN) = C(bank, NN) / C(NN), counting how often "bank" is tagged as a noun
– Any P(q1 = ti) = πi = C(ti)/N, from the tagged data
– For πNN, count the number of times NN occurs and divide by the total number of tags (states)
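The count-and-divide estimates above can be sketched as follows; the `<s>`/`</s>` sentence-boundary handling is one common convention, assumed here rather than taken from the slides:

```python
from collections import Counter

def mle_estimate(tagged_sentences):
    """'Count and divide' MLEs from tagged data:
      P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})
      P(w_i | t_i)     = C(w_i, t_i)     / C(t_i)
    Each sentence is a list of (word, tag) pairs.
    """
    trans, emit = Counter(), Counter()
    tag_count, prev_count = Counter(), Counter()
    for sent in tagged_sentences:
        prev = "<s>"
        for word, tag in sent:
            trans[(prev, tag)] += 1   # C(t_{i-1}, t_i)
            prev_count[prev] += 1     # C(t_{i-1})
            emit[(word, tag)] += 1    # C(w_i, t_i)
            tag_count[tag] += 1       # C(t_i)
            prev = tag
        trans[(prev, "</s>")] += 1    # transition into the end state
        prev_count[prev] += 1
    P_T = {k: c / prev_count[k[0]] for k, c in trans.items()}
    P_E = {k: c / tag_count[k[1]] for k, c in emit.items()}
    return P_T, P_E
```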
– Likelihood: given a model λ and a sequence of observed events O, find P(O|λ)
– Decoding: given O and λ, find the most likely (hidden) state sequence
– Learning: given O and the set of states Q in λ, compute the parameters A and B
A book review: "Oh, man I love this book!" / "This book is so boring..."
Is it positive? yes / no
Binary Prediction (2 choices)
A tweet: "On the way to the park!" / "公園に行くなう!" (Japanese for the same)
Its language: English / Japanese
Multi-class Prediction (several choices)
A sentence: "I read a book" → its syntactic parse
Structured Prediction (millions of choices)
"I read a book" → (S (NP (N I)) (VP (VBD read) (NP (DET a) (NN book))))
Classifiers:
– Multiclass classification problem
– Logistic regression
– Model context using lots of features

Generative models:
– Structured prediction problem (sequence labeling)
– Hidden Markov Models
– Model transitions between states/POS

Structured perceptron → classification with lots of features
Natural language processing ( NLP ) is a field of computer science
JJ NN NN -LRB- NN -RRB- VBZ DT NN IN NN NN
natural language processing ( nlp ) ... <s> JJ NN NN LRB NN RRB ... </s>
PT(JJ|<s>) · PT(NN|JJ) · PT(NN|NN) · … · PE(natural|JJ) · PE(language|NN) · PE(processing|NN) · …
HMM as a product of probabilities:
P(Z) ≈ ∏j=1..J+1 PT(zj | zj−1)
P(Y | Z) ≈ ∏j=1..J PE(yj | zj)

Normal HMM:
P(Y, Z) = ∏j=1..J PE(yj | zj) · ∏j=1..J+1 PT(zj | zj−1)

Log likelihood:
log P(Y, Z) = ∑j=1..J log PE(yj | zj) + ∑j=1..J+1 log PT(zj | zj−1)

Score:
S(Y, Z) = ∑j=1..J wE,zj,yj + ∑j=1..J+1 wT,zj−1,zj

When:
wE,zj,yj = log PE(yj | zj) and wT,zj−1,zj = log PT(zj | zj−1)
then log P(Y, Z) = S(Y, Z)
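The identity above, that setting each weight to the corresponding log probability makes the score equal the HMM log-likelihood, can be checked numerically. The probabilities below are made up for illustration:

```python
import math

# Illustrative transition and emission probabilities (not from the slides)
P_T = {("<s>", "PRP"): 0.4, ("PRP", "VBD"): 0.6, ("VBD", "</s>"): 0.5}
P_E = {("PRP", "I"): 0.3, ("VBD", "visited"): 0.1}

words, tags = ["I", "visited"], ["PRP", "VBD"]

# HMM log-likelihood: log transitions (incl. boundaries) plus log emissions
trans = list(zip(["<s>"] + tags, tags + ["</s>"]))
loglik = sum(math.log(P_T[p]) for p in trans) + \
         sum(math.log(P_E[(t, word)]) for t, word in zip(tags, words))

# Perceptron-style score with each weight set to the log probability
w_T = {k: math.log(v) for k, v in P_T.items()}
w_E = {k: math.log(v) for k, v in P_E.items()}
score = sum(w_T[p] for p in trans) + \
        sum(w_E[(t, word)] for t, word in zip(tags, words))

assert abs(loglik - score) < 1e-12  # log P = S when weights are log probs
```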
I visited Nara → Y1 = PRP VBD NNP
I visited Nara → Y2 = NNP VBD NNP

Features for Y1:
φT,<S>,PRP(X,Y1) = 1   φT,PRP,VBD(X,Y1) = 1   φT,VBD,NNP(X,Y1) = 1   φT,NNP,</S>(X,Y1) = 1
φE,PRP,"I"(X,Y1) = 1   φE,VBD,"visited"(X,Y1) = 1   φE,NNP,"Nara"(X,Y1) = 1
φCAPS,PRP(X,Y1) = 1   φCAPS,NNP(X,Y1) = 1   φSUF,VBD,"...ed"(X,Y1) = 1

Features for Y2:
φT,<S>,NNP(X,Y2) = 1   φT,NNP,VBD(X,Y2) = 1   φT,VBD,NNP(X,Y2) = 1   φT,NNP,</S>(X,Y2) = 1
φE,NNP,"I"(X,Y2) = 1   φE,VBD,"visited"(X,Y2) = 1   φE,NNP,"Nara"(X,Y2) = 1
φCAPS,NNP(X,Y2) = 2   φSUF,VBD,"...ed"(X,Y2) = 1
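A sketch of a feature extractor producing the φ counts above; tuple keys like ("CAPS", tag) are an illustrative encoding of the slides' φCAPS,tag names:

```python
from collections import Counter

def create_features(words, tags):
    """Count-valued feature vector phi(X, Y): transition and emission
    features, plus capitalization and '-ed' suffix features."""
    phi = Counter()
    prev = "<S>"
    for word, tag in zip(words, tags):
        phi[("T", prev, tag)] += 1           # transition feature
        phi[("E", tag, word)] += 1           # emission feature
        if word[0].isupper():
            phi[("CAPS", tag)] += 1          # capitalized word under this tag
        if word.endswith("ed"):
            phi[("SUF", tag, "...ed")] += 1  # -ed suffix feature
        prev = tag
    phi[("T", prev, "</S>")] += 1            # transition into the end state
    return phi
```

For example, on "I visited Nara" with tags NNP VBD NNP this yields φCAPS,NNP = 2 and φSUF,VBD,"...ed" = 1, matching the Y2 column above.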
Ẑ = argmaxZ ∑j wj ϕj(Y, Z)
HMM Viterbi with negative log probabilities:
1:NN 1:JJ 1:VB
1:PRP 1:NNP
…
0:<S>
I
best_score["1 NN"]  = -log PT(NN|<S>)  + -log PE(I | NN)
best_score["1 JJ"]  = -log PT(JJ|<S>)  + -log PE(I | JJ)
best_score["1 VB"]  = -log PT(VB|<S>)  + -log PE(I | VB)
best_score["1 PRP"] = -log PT(PRP|<S>) + -log PE(I | PRP)
best_score["1 NNP"] = -log PT(NNP|<S>) + -log PE(I | NNP)
1:NN 1:JJ 1:VB
1:PRP 1:NNP
… I
best_score["2 NN"] = min(
  best_score["1 NN"]  + -log PT(NN|NN)  + -log PE(visited | NN),
  best_score["1 JJ"]  + -log PT(NN|JJ)  + -log PE(visited | NN),
  best_score["1 VB"]  + -log PT(NN|VB)  + -log PE(visited | NN),
  best_score["1 PRP"] + -log PT(NN|PRP) + -log PE(visited | NN),
  best_score["1 NNP"] + -log PT(NN|NNP) + -log PE(visited | NN),
  ... )
2:NN 2:JJ 2:VB
2:PRP 2:NNP
… visited
best_score["2 JJ"] = min(
  best_score["1 NN"] + -log PT(JJ|NN) + -log PE(visited | JJ),
  best_score["1 JJ"] + -log PT(JJ|JJ) + -log PE(visited | JJ),
  best_score["1 VB"] + -log PT(JJ|VB) + -log PE(visited | JJ),
  ... )
1:NN 1:JJ 1:VB
1:PRP 1:NNP
…
0:<S>
I
best_score["1 NN"]  = wT,<S>,NN  + wE,NN,I
best_score["1 JJ"]  = wT,<S>,JJ  + wE,JJ,I
best_score["1 VB"]  = wT,<S>,VB  + wE,VB,I
best_score["1 PRP"] = wT,<S>,PRP + wE,PRP,I
best_score["1 NNP"] = wT,<S>,NNP + wE,NNP,I
1:NN 1:JJ 1:VB
1:PRP 1:NNP
…
0:<S>
I
best_score["1 NN"]  = wT,<S>,NN  + wE,NN,I  + wCAPS,NN
best_score["1 JJ"]  = wT,<S>,JJ  + wE,JJ,I  + wCAPS,JJ
best_score["1 VB"]  = wT,<S>,VB  + wE,VB,I  + wCAPS,VB
best_score["1 PRP"] = wT,<S>,PRP + wE,PRP,I + wCAPS,PRP
best_score["1 NNP"] = wT,<S>,NNP + wE,NNP,I + wCAPS,NNP
How do we learn the weights? The perceptron:
– increase the score of positive examples
– decrease the score of negative examples
I visited Nara PRP VBD NNP
I visited Nara NNP VBD NNP
I visited Nara NNP VBD NNP
I visited Nara PRP VBD NN
I visited Nara PRP VB NNP
Ẑ = argmaxZ ∑j wj ϕj(Y, Z)
for I iterations
    for each labeled pair X, Y_prime in the data
        Y_hat = hmm_viterbi(w, X)
        phi_prime = create_features(X, Y_prime)
        phi_hat = create_features(X, Y_hat)
        w += phi_prime - phi_hat
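Fleshing out the pseudocode above as a self-contained toy sketch: hmm_viterbi is replaced here by brute-force enumeration of tag sequences (fine for a three-word example, exponential in general), and create_features uses only transition and emission features. All names are hypothetical.

```python
import itertools
from collections import Counter

def create_features(words, tags):
    # phi(X, Y): transition and emission feature counts only
    phi = Counter()
    prev = "<S>"
    for word, tag in zip(words, tags):
        phi[("T", prev, tag)] += 1
        phi[("E", tag, word)] += 1
        prev = tag
    phi[("T", prev, "</S>")] += 1
    return phi

def decode(w, words, tagset):
    # Brute-force stand-in for hmm_viterbi(w, X): argmax_Y w . phi(X, Y)
    return max(itertools.product(tagset, repeat=len(words)),
               key=lambda tags: sum(w[f] * c
                                    for f, c in create_features(words, tags).items()))

def train(data, tagset, iters=5):
    # Structured perceptron: w += phi(X, Y_prime) - phi(X, Y_hat) on mistakes
    w = Counter()
    for _ in range(iters):
        for words, y_prime in data:
            y_hat = decode(w, words, tagset)
            if y_hat != tuple(y_prime):
                for f, c in create_features(words, y_prime).items():
                    w[f] += c
                for f, c in create_features(words, y_hat).items():
                    w[f] -= c
    return w
```

On the single example "I visited Nara" / PRP VBD NNP, one mistake-driven update already makes the gold sequence the highest-scoring one.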
– Hidden Markov Models
– Structured Perceptron
– structured weight updates (promote gold features, demote predicted features)