CS11-747 Neural Networks for NLP
Structured Prediction Basics
Graham Neubig
Site https://phontron.com/class/nn4nlp2019/
A Prediction Problem

Many NLP problems can be framed as predicting an output for an input sentence such as "I hate this movie" or "I love this movie". The output might be:
- a sentiment rating on a scale: very good / good / neutral / bad / very bad
- a binary sentiment label: positive / negative
- a part-of-speech tag sequence: I/PRP hate/VBP this/DT movie/NN
- a translation: "I hate this movie" -> "kono eiga ga kirai"
Types of Prediction

In the last two cases the output is structured, and we would like to exploit that structure to learn efficiently. For example, tag sequences that differ in a single position are similar: PRP VBP DT NN vs. PRP VBP VBP NN.

Examples of structured outputs:
- Part-of-speech tagging: I/PRP hate/VBP this/DT movie/NN
- Named entity recognition: The/O movie/O featured/O Keanu/B-PER Reeves/I-PER
A Simple Local Prediction Model: Independent Classification

The simplest prediction model treats each tag as an independent multi-class classification problem: for "I hate this movie", run a separate classifier at each position and predict PRP, VBP, DT, NN, with no decision depending on any other decision.
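The independent scheme above can be sketched in a few lines. The tag set and score vectors here are hypothetical toy values, not the lecture's actual BiLSTM model; the point is only that each position's argmax is taken in isolation.

```python
import numpy as np

# A minimal sketch of local, independent classification: each word gets
# a score vector over the tag set, and we take the argmax at each
# position with no interaction between decisions.
TAGS = ["PRP", "VBP", "DT", "NN"]

def tag_independently(score_vectors):
    """Pick the best tag for each position independently."""
    return [TAGS[int(np.argmax(s))] for s in score_vectors]

# Toy scores for "I hate this movie" (one row per word).
scores = np.array([
    [2.0, 0.1, 0.0, 0.3],   # "I"     -> PRP
    [0.2, 1.8, 0.1, 0.9],   # "hate"  -> VBP
    [0.0, 0.2, 2.5, 0.1],   # "this"  -> DT
    [0.1, 0.7, 0.2, 2.2],   # "movie" -> NN
])
print(tag_independently(scores))  # ['PRP', 'VBP', 'DT', 'NN']
```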
Why is independent prediction a problem? Consider the ambiguous sentence "time flies like an arrow", which supports several tag sequences:
- NN VBZ IN DT NN (time moves similarly to an arrow)
- NN NNS VB DT NN ("time flies" are fond of arrows)
- VB NNS IN DT NN (please measure the time of flies similarly to how an arrow would)
- NN NNS IN DT NN ("time flies" that are similar to an arrow)
Picking each word's maximum-frequency tag, or any purely local classifier, cannot distinguish these readings (this is like a seq2seq model with hard attention on a single word).
Models with Local Dependencies

We can instead condition each decision on the previous tags, feeding the tag chosen for one word of "I hate this movie" into the classifier for the next word. During training, teacher forcing feeds in the gold-standard previous tags. (Code: bilstm-tagger.py, bilstm-variant-tagger.py -teacher)
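A toy sketch of the teacher-forcing step (hypothetical score tables, not the course's bilstm-tagger.py): the classifier at position j conditions on the *gold* tag at position j-1, never on its own prediction.

```python
import numpy as np

TAGS = ["<s>", "PRP", "VBP", "DT", "NN"]
rng = np.random.default_rng(0)
# Hypothetical score table: W[prev_tag_id][word_id] -> vector over TAGS.
W = rng.normal(size=(len(TAGS), 4, len(TAGS)))

def teacher_forced_scores(word_ids, gold_tags):
    """Yield a score vector per position, conditioning on the gold history."""
    prev = 0  # index of "<s>"
    for j, w in enumerate(word_ids):
        yield W[prev, w]                  # scores given gold previous tag
        prev = TAGS.index(gold_tags[j])   # feed in the GOLD tag, not argmax

word_ids = [0, 1, 2, 3]                   # "I hate this movie" as word ids
gold = ["PRP", "VBP", "DT", "NN"]
all_scores = list(teacher_forced_scores(word_ids, gold))
print(len(all_scores))  # 4 score vectors, one per position
```

At test time no gold tags exist, so `prev` would have to come from the model's own argmax, which is exactly where the train/test mismatch discussed next comes from.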
Teacher Forcing and Exposure Bias

Under teacher forcing the model only ever sees gold histories during training, but at test time we may make mistakes that propagate: if "He" in "He hates this movie" is mistagged (e.g. as a noun), the error can cascade, dragging the following predictions toward wrong tags such as NNS NNS NNS. The model never sees its own mistakes during training, and cannot deal with them at test time.
Local Normalization vs. Global Normalization

Locally normalized models: each decision made by the model has a probability that adds to one over the tag vocabulary V:

P(Y \mid X) = \prod_{j=1}^{|Y|} \frac{e^{S(y_j \mid X, y_1, \ldots, y_{j-1})}}{\sum_{\tilde{y}_j \in V} e^{S(\tilde{y}_j \mid X, y_1, \ldots, y_{j-1})}}

Globally normalized models (also called energy-based models): each sentence has a score, which is not normalized over a particular decision but over all possible output sequences:

P(Y \mid X) = \frac{e^{\sum_{j=1}^{|Y|} S(y_j \mid X, y_1, \ldots, y_{j-1})}}{\sum_{\tilde{Y} \in V^*} e^{\sum_{j=1}^{|\tilde{Y}|} S(\tilde{y}_j \mid X, \tilde{y}_1, \ldots, \tilde{y}_{j-1})}}
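A small numeric sketch (toy scores, not from the lecture) of the locally normalized case: each step's scores are softmaxed over the tag vocabulary, and the sequence probability is the product of the chosen tags' per-step probabilities.

```python
import numpy as np

def softmax(s):
    """Normalize a score vector so it sums to one."""
    e = np.exp(s - s.max())
    return e / e.sum()

# Toy per-step scores over a 3-tag vocabulary V.
step_scores = [np.array([2.0, 0.1, 0.0]),
               np.array([0.3, 1.5, 0.2]),
               np.array([0.1, 0.2, 2.0])]
y = [0, 1, 2]  # chosen tag index at each step

p = 1.0
for s, yj in zip(step_scores, y):
    p *= softmax(s)[yj]  # each factor is normalized locally, at one step
print(p)
```

A globally normalized model would instead exponentiate the *sum* of the un-normalized scores along the whole sequence and divide by a sum over all sequences, which is why its partition function is expensive.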
The Label Bias Problem (Lafferty et al. 2001)

A locally normalized model cannot reduce the score of a bad hypothesis directly. In the classic example, after reading "r" a transition to "i" looks OK, so P(i|r) = 1; a transition to "o" looks horrible given the input, but if it is the only option from that state, P(o|r) = 1 anyway. The model therefore prefers sequences that pass through states with few outgoing decisions, regardless of the input. Globally normalized models avoid this by scoring whole hypotheses against each other (more on training them in a bit).
The Structured Perceptron Algorithm

Find the one-best hypothesis under the current model; if its score is better than the reference's, adjust parameters to fix this: increase the score of the reference and decrease the score of the one-best (here, via an SGD update). This corresponds to minimizing the structured perceptron loss:

\ell_{percept}(X, Y) = \max(0, S(\hat{Y} \mid X; \theta) - S(Y \mid X; \theta))

where \hat{Y} = \operatorname{argmax}_{\tilde{Y} \neq Y} S(\tilde{Y} \mid X; \theta) is the one-best non-reference candidate: we must do prediction during training. The gradient is

\frac{\partial \ell_{percept}(X, Y; \theta)}{\partial \theta} =
\begin{cases}
\frac{\partial S(\hat{Y} \mid X; \theta)}{\partial \theta} - \frac{\partial S(Y \mid X; \theta)}{\partial \theta} & \text{if } S(\hat{Y} \mid X; \theta) \ge S(Y \mid X; \theta) \\
0 & \text{otherwise}
\end{cases}
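One perceptron update can be sketched over a tiny, explicitly enumerated candidate set (hypothetical features; a real tagger would use argmax search such as Viterbi rather than enumeration).

```python
import numpy as np

def feats(y_seq):
    """Toy feature vector: counts of each tag id in a 3-tag set."""
    v = np.zeros(3)
    for t in y_seq:
        v[t] += 1
    return v

def score(y_seq, theta):
    """Linear score S(Y|X; theta) as a dot product with the features."""
    return theta @ feats(y_seq)

theta = np.zeros(3)
gold = [0, 1, 2]
candidates = [[0, 0, 0], [1, 1, 1], [0, 1, 2]]

# One-best among the non-gold candidates.
others = [c for c in candidates if c != gold]
y_hat = max(others, key=lambda c: score(c, theta))

# Perceptron update: if the one-best scores at least as high as the
# reference, raise the reference's score and lower the one-best's.
if score(y_hat, theta) >= score(gold, theta):
    theta += feats(gold) - feats(y_hat)

print(score(gold, theta) > score(y_hat, theta))  # True after the update
```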
Connection to Probabilistic Models

The perceptron loss penalizes only the single best competitor, while the global negative log-likelihood (as in a CRF) normalizes over all hypotheses:

\ell_{percept}(X, Y) = \max(0, S(\hat{Y} \mid X; \theta) - S(Y \mid X; \theta))

\ell_{global}(X, Y; \theta) = -\log \frac{e^{S(Y \mid X)}}{\sum_{\tilde{Y}} e^{S(\tilde{Y} \mid X)}}

An intermediate option sums perceptron losses over all hypotheses:

\ell_{global\text{-}percept}(X, Y) = \sum_{\tilde{Y}} \max(0, S(\tilde{Y} \mid X; \theta) - S(Y \mid X; \theta))
Tradeoffs: globally normalized models have a big output space, so training is hard (it requires summing or searching over all sequences). Locally normalized training is simple, but suffers from exposure bias and label bias. A common compromise is to pre-train with MLE, then fine-tune with a more complicated algorithm.
Perceptron and Hinge Loss

The hinge loss adds a margin m to the perceptron loss, requiring the correct answer to beat every competitor by at least m:

\ell_{hinge}(x, y; \theta) = \max(0, m + S(\hat{y} \mid x; \theta) - S(y \mid x; \theta))

This can be applied at each position of the tagger for "I hate this movie" (a hinge loss per tag). In DyNet, it is a one-line change:

loss = dy.pickneglogsoftmax(score, answer)
  ->
loss = dy.hinge(score, answer, m=1)
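A plain-NumPy numeric sketch of the two losses being swapped above (toy scores; it mirrors the pickneglogsoftmax-to-hinge change conceptually, without DyNet):

```python
import numpy as np

scores = np.array([1.0, 2.5, 0.3])  # classifier scores over tags
answer = 1                           # index of the correct tag
m = 1.0                              # margin

# Negative log-softmax loss: -log softmax(scores)[answer].
nll = -scores[answer] + np.logaddexp.reduce(scores)

# Hinge loss: correct tag must beat the best wrong tag by margin m.
wrong = np.delete(scores, answer)
hinge = max(0.0, m + wrong.max() - scores[answer])

print(nll, hinge)
```

Note the qualitative difference: here the correct tag already beats all others by more than m, so the hinge loss is exactly zero and contributes no gradient, while the log-softmax loss stays positive and keeps pushing the scores apart.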
Cost-Augmented Hinge

Some mistakes are worse than others: a single wrong decision can make a mistake much worse for downstream applications. The cost-augmented hinge assigns a cost to each incorrect decision and sets the margin equal to this cost:

\ell_{ca\text{-}hinge}(x, y; \theta) = \max(0, cost(\hat{y}, y) + S(\hat{y} \mid x; \theta) - S(y \mid x; \theta))

Costs over structured outputs (when lengths are identical):

cost_{zero\text{-}one}(\hat{Y}, Y) = \delta(\hat{Y} \neq Y)

cost_{hamming}(\hat{Y}, Y) = \sum_{j=1}^{|Y|} \delta(\hat{y}_j \neq y_j)

Cost-augmented decoding finds the hypothesis with the biggest violation, i.e. the largest sum of cost and score:

\hat{Y} = \operatorname{argmax}_{\tilde{Y} \neq Y} \; cost(\tilde{Y}, Y) + S(\tilde{Y} \mid X; \theta)

\ell_{ca\text{-}hinge}(X, Y; \theta) = \max(0, cost(\hat{Y}, Y) + S(\hat{Y} \mid X; \theta) - S(Y \mid X; \theta))
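Cost-augmented decoding can be sketched over an enumerated candidate set (hypothetical scores; real decoders search rather than enumerate): pick the sequence maximizing Hamming cost plus model score, then compute the ca-hinge loss.

```python
def hamming(y_hat, y):
    """Number of positions where the two tag sequences differ."""
    return sum(a != b for a, b in zip(y_hat, y))

gold = ["PRP", "VBP", "DT", "NN"]
candidates = {                          # hypothetical model scores S(Y~|X)
    ("PRP", "VBP", "DT", "NN"): 4.0,    # the gold sequence itself
    ("NN",  "VBP", "DT", "NN"): 3.8,
    ("NN",  "NNS", "DT", "NN"): 3.0,
}
gold_score = candidates[tuple(gold)]

# argmax over non-gold candidates of cost(Y~, Y) + S(Y~|X).
y_hat, violation = max(
    ((y, hamming(y, gold) + s) for y, s in candidates.items()
     if list(y) != gold),
    key=lambda t: t[1])

ca_hinge = max(0.0, violation - gold_score)
print(list(y_hat), ca_hinge)
```

Note that the lower-scoring candidate wins the cost-augmented argmax here because its larger Hamming cost outweighs its score deficit; that is the point of seeking the biggest violation.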
If the cost decomposes over decisions so it can be calculated easily (as Hamming cost does), we can consider the loss in search: while decoding "I hate this movie", add +1 to the running score of a hypothesis each time it chooses an incorrect tag, so that high-cost hypotheses surface as violations.
Exposure-Reducing Training: Sampling Mistakes

An alternative to pure teacher forcing samples wrong decisions and feeds them in, so the model learns to behave well even in states reached by its own mistakes (if done from the start of training, then, it's not as stable as teacher forcing w/ MLE). Useful refinements: anneal from teacher forcing toward sampling over the course of training, and use a "dynamic oracle" that specifies the best action from any state, not only gold states (e.g. Goldberg and Nivre 2013).
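The annealing idea can be sketched as a per-step coin flip whose bias grows over training (the function name and linear schedule are illustrative, not from the lecture):

```python
import random

def choose_feed_in(gold_tag, model_tag, step, total_steps, rng):
    """Feed the model's own prediction with probability that anneals
    from 0 (pure teacher forcing) toward 1 over training."""
    p_sample = min(1.0, step / total_steps)  # linear annealing schedule
    return model_tag if rng.random() < p_sample else gold_tag

rng = random.Random(0)
# At step 0 we always teacher-force; near the end we mostly sample.
early = choose_feed_in("VBP", "VB", step=0, total_steps=100, rng=rng)
print(early)  # "VBP": p_sample = 0 at step 0, so the gold tag is fed
```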
[Figure: tagging "I hate this movie"; at each step the loss is computed against the gold tag (PRP, VBP, DT, NN), while the sampled tag (here NN, VB, DT, NN) is fed to the next step's classifier.]
Another option is to drop out the previous-prediction inputs sometimes during training (cf. Gal and Ghahramani 2015), reducing the model's reliance on its previous predictions, while still using them.
[Figure: the tagger for "I hate this movie" with the connections carrying previous predictions sometimes dropped out (x), trained with maximum likelihood; e.g. an MLE run feeding PRP NN DT NN vs. the gold sequence PRP VBP DT NN.]