CS11-747 Neural Networks for NLP
Structured Prediction with Local Dependencies
Xuezhe Ma (Max)
Site https://phontron.com/class/nn4nlp2017/
POS tagging:  I hate this movie → PRP VBP DT NN
NER:          The movie featured Keanu Reeves → O O O B-PER I-PER
[Figure: one classifier per word over "<s> I hate this movie <s>", producing PRP VBP DT NN]
Local normalization, independent decisions:
P(Y|X) = ∏_{i=1}^{n} P(y_i | X)

Local normalization, conditioning on previous decisions:
P(Y|X) = ∏_{i=1}^{n} P(y_i | X, y_{<i})
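The independent-decision model can be sketched concretely. Everything here (the tag set and the per-word distributions) is made up for illustration, not the output of a real tagger; a neural model would produce the distributions from an encoder over the sentence.

```python
# Local classification with independent decisions: P(Y|X) = prod_i P(y_i | X).
# One (hypothetical) tag distribution per word of "I hate this movie".
probs = [
    {"PRP": 0.90, "VBP": 0.05, "DT": 0.03, "NN": 0.02},  # I
    {"PRP": 0.10, "VBP": 0.70, "DT": 0.05, "NN": 0.15},  # hate
    {"PRP": 0.05, "VBP": 0.05, "DT": 0.80, "NN": 0.10},  # this
    {"PRP": 0.02, "VBP": 0.08, "DT": 0.10, "NN": 0.80},  # movie
]

def sequence_prob(tags):
    """P(Y|X) as a product of independent per-position probabilities."""
    p = 1.0
    for dist, tag in zip(probs, tags):
        p *= dist[tag]
    return p

# Because every position is normalized independently, greedy per-word
# argmax already gives the most probable sequence under this model.
best = [max(dist, key=dist.get) for dist in probs]
```

Note what the independence buys and costs: decoding is trivial (per-word argmax), but the model cannot express that, say, VBP rarely follows VBP.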
Globally normalized model: score the output sequence as a whole, normalizing over all possible output sequences rather than over each particular decision:
P(Y|X) = exp(S(Y, X)) / Σ_{Y'} exp(S(Y', X)) = ψ(Y, X) / Σ_{Y'} ψ(Y', X)
where ψ(Y, X) are potential functions.
General form of globally normalized model (y_1 … y_n may all interact with each other and with x):
P(Y|X) = ψ(Y, X) / Σ_{Y'} ψ(Y', X)

First-order linear-chain CRF (each y_i interacts only with its neighbors y_{i-1}, y_{i+1}, and x):
P(Y|X) = ∏_{i=1}^{n} ψ_i(y_{i-1}, y_i, X) / Σ_{Y'} ∏_{i=1}^{n} ψ_i(y'_{i-1}, y'_i, X)
ψ_i(y_{i-1}, y_i, X) = exp( W_{y_{i-1},y_i}^T G(X, i) + b_{y_{i-1},y_i} )
where G(X, i) is the feature vector (e.g. a neural encoding) for position i, and W and b are tag-pair-specific weights and biases.
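As a sketch of these potentials in code, assuming hand-made feature vectors standing in for G(X, i) and hypothetical tag-pair parameters W and b (in a real model, G would come from a neural encoder and W, b would be learned):

```python
import math

TAGS = ["NN", "VB"]

def potential(W, b, g, y_prev, y):
    """psi_i(y_prev, y, X) = exp(W_{y_prev,y}^T g + b_{y_prev,y})."""
    score = sum(w * x for w, x in zip(W[(y_prev, y)], g)) + b[(y_prev, y)]
    return math.exp(score)

def sequence_potential(W, b, G, tags, start="<S>"):
    """Unnormalized sequence score: the product of potentials along the chain."""
    prod, prev = 1.0, start
    for g, y in zip(G, tags):
        prod *= potential(W, b, g, prev, y)
        prev = y
    return prod

# Toy parameters: a 2-dim feature vector per position, one weight vector
# and bias per tag pair (all numbers invented for illustration).
W = {(p, y): [0.5, -0.2] for p in ["<S>"] + TAGS for y in TAGS}
b = {(p, y): 0.1 for p in ["<S>"] + TAGS for y in TAGS}
G = [[1.0, 0.0], [0.0, 1.0]]  # one feature vector per position
```

Dividing such a product by the sum of products over all tag sequences gives exactly the linear-chain CRF probability above.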
[Figure: linear-chain CRF over "<s> I hate this movie <s>" with tags PRP VBP DT NN]
P(Y|X) = ∏_{i=1}^{n} ψ_i(y_{i-1}, y_i, X) / Σ_{Y'} ∏_{i=1}^{n} ψ_i(y'_{i-1}, y'_i, X) = ∏_{i=1}^{n} ψ_i(y_{i-1}, y_i, X) / Z(X)

where the partition function is
Z(X) = Σ_Y ∏_{i=1}^{n} ψ_i(y_{i-1}, y_i, X)
Decoding: y* = argmax_Y P(Y|X). Naively this means searching the output space of Y, which grows exponentially with the length of the input sequence. But with first-order dependencies, the score of a sequence y_1, …, y_{i-1}, y_i can be computed incrementally: everything the model needs to know about the prefix is summarized in y_{i-1}, so dynamic programming (Viterbi) applies.
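The dynamic-programming search can be sketched as follows; T and S are hypothetical log-scale transition and emission score tables (a neural CRF would compute them from the sentence), and <S>/</S> are boundary symbols.

```python
def viterbi(words, tags, T, S):
    """argmax over tag sequences of sum_i T[(y_prev, y_i)] + S[(w_i, y_i)].

    T[(prev, tag)] and S[(word, tag)] are log-scale scores (hypothetical
    tables here). Runs in O(n * |tags|^2) instead of O(|tags|^n).
    """
    best = {"<S>": 0.0}   # best score of any path ending in this tag
    backptrs = []         # one back-pointer table per position
    for w in words:
        scores, ptrs = {}, {}
        for tag in tags:
            cand = {prev: s + T[(prev, tag)] + S[(w, tag)]
                    for prev, s in best.items()}
            ptrs[tag] = max(cand, key=cand.get)
            scores[tag] = cand[ptrs[tag]]
        backptrs.append(ptrs)
        best = scores
    # close the sequence with the transition into </S>
    final = {prev: s + T[(prev, "</S>")] for prev, s in best.items()}
    tag = max(final, key=final.get)
    path = []
    for ptrs in reversed(backptrs):
        path.append(tag)
        tag = ptrs[tag]
    return list(reversed(path))
```

Replacing the max over previous tags with a log-sum-exp in the same loop computes the partition function instead of the best path.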
Computing the partition function:
Z(X) = Σ_Y ∏_{i=1}^{n} ψ_i(y_{i-1}, y_i, X)

Define the forward variable
α_t(y) = Σ_{y_1,…,y_{t-1}} ∏_{i=1}^{t-1} ψ_i(y_{i-1}, y_i, X) · ψ_t(y_{t-1}, y_t = y, X)

Then
α_t(y) = Σ_{y_{t-1}} ψ_t(y_{t-1}, y_t = y, X) · Σ_{y_1,…,y_{t-2}} ∏_{i=1}^{t-2} ψ_i(y_{i-1}, y_i, X) · ψ_{t-1}(y_{t-2}, y_{t-1}, X)
       = Σ_{y_{t-1}} ψ_t(y_{t-1}, y_t = y, X) · α_{t-1}(y_{t-1})
• First, calculate transition from <S> and emission of the first word for every POS:

[Lattice: 0:<S> → 1:NN, 1:JJ, 1:VB, 1:LRB, 1:RRB, … over the word "natural"]

score["1 NN"]  = T(NN|<S>)  + S(natural | NN)
score["1 JJ"]  = T(JJ|<S>)  + S(natural | JJ)
score["1 VB"]  = T(VB|<S>)  + S(natural | VB)
score["1 LRB"] = T(LRB|<S>) + S(natural | LRB)
score["1 RRB"] = T(RRB|<S>) + S(natural | RRB)
• For middle words, calculate the scores for all possible previous POS tags:

[Lattice: 1:NN, 1:JJ, 1:VB, 1:LRB, 1:RRB, … → 2:NN, 2:JJ, 2:VB, 2:LRB, 2:RRB, … over the word "language"]

score["2 NN"] = log_sum_exp(
    score["1 NN"]  + T(NN|NN)  + S(language | NN),
    score["1 JJ"]  + T(NN|JJ)  + S(language | NN),
    score["1 VB"]  + T(NN|VB)  + S(language | NN),
    score["1 LRB"] + T(NN|LRB) + S(language | NN),
    score["1 RRB"] + T(NN|RRB) + S(language | NN),
    ...)
score["2 JJ"] = log_sum_exp(
    score["1 NN"]  + T(JJ|NN)  + S(language | JJ),
    score["1 JJ"]  + T(JJ|JJ)  + S(language | JJ),
    score["1 VB"]  + T(JJ|VB)  + S(language | JJ),
    ...)
log_sum_exp(x, y) = log(exp(x) + exp(y))
• Finish up the sentence with the sentence final symbol:

[Lattice: I:NN, I:JJ, I:VB, I:LRB, I:RRB, … over the word "science" → I+1:</S>]

score["I+1 </S>"] = log_sum_exp(
    score["I NN"]  + T(</S>|NN),
    score["I JJ"]  + T(</S>|JJ),
    score["I VB"]  + T(</S>|VB),
    score["I LRB"] + T(</S>|LRB),
    score["I RRB"] + T(</S>|RRB),
    ...)
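The three steps above can be sketched end-to-end. T and S are the same style of hypothetical log-score tables as in the lattice example; log_sum_exp is the numerically stable implementation.

```python
import math

def log_sum_exp(xs):
    """log(sum_i exp(x_i)), computed stably by factoring out the max."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def forward_log_z(words, tags, T, S):
    """log Z(X) by the forward algorithm over log-scale transition scores
    T[(prev, tag)] and emission scores S[(word, tag)] (hypothetical tables)."""
    # step 1: transition from <S> plus emission of the first word
    score = {t: T[("<S>", t)] + S[(words[0], t)] for t in tags}
    # step 2: middle words, log_sum_exp over all possible previous tags
    for w in words[1:]:
        score = {t: log_sum_exp([score[p] + T[(p, t)] + S[(w, t)]
                                 for p in tags])
                 for t in tags}
    # step 3: finish with the sentence-final symbol
    return log_sum_exp([score[p] + T[(p, "</S>")] for p in tags])
```

Swapping log_sum_exp for max in the same recursion recovers the Viterbi score, which is why the two algorithms are usually implemented together.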
Training maximizes log P(Y|X; θ) (i.e. minimizes loss(X, Y; θ) = −log P(Y|X; θ)); the gradient is the gold score gradient minus the expected score gradient under the model:
∂ log P(Y|X; θ)/∂θ = ∂S(X, Y)/∂θ − E_{P(Y'|X; θ)}[∂S(X, Y')/∂θ]
The expectation decomposes over pairwise marginals, computed from the forward variable α and its backward counterpart β:
P(y_{i-1} = y', y_i = y | X; θ) = α_{i-1}(y') · ψ_i(y', y, X) · β_i(y) / Z(X)
The BLSTM-CNN-CRF model:
• A truly end-to-end model, requiring no feature engineering or data pre-processing.
• A CNN to model character-level information, e.g. the prefix or suffix of a word.
• A bi-directional LSTM to model word-level information.
• A CRF on top to consider the correlation between labels.
• Character embeddings are initialized uniformly in [−√(3/dim), +√(3/dim)], where dim = 30.
Model         | POS Acc. (Dev/Test) | NER Dev (P / R / F1)  | NER Test (P / R / F1)
BRNN          | 96.56 / 96.76       | 92.04 / 89.13 / 90.56 | 87.05 / 83.88 / 85.44
BLSTM         | 96.88 / 96.93       | 92.31 / 90.85 / 91.57 | 87.77 / 86.23 / 87.00
BLSTM-CNN     | 97.34 / 97.33       | 92.52 / 93.64 / 93.07 | 88.53 / 90.21 / 89.36
BLSTM-CNN-CRF | 97.46 / 97.55       | 94.85 / 94.63 / 94.74 | 91.35 / 91.06 / 91.21
Do different reward functions impact our decisions? Is ℓ_1(Y) = ℓ_2(Y)?

[Figure: two predictors with different reward functions; the reward is the amount of money we get]
Train with a task-specific reward R by minimizing the expected negative reward under the model:
loss(x, y; θ) = E_{P(Y|X=x; θ)}[−R(Y, y)]

Alternatively, train with maximum likelihood toward the exponentiated payoff distribution:
q(Y|y*; τ) = exp(R(Y, y*)/τ) / Σ_{Y'} exp(R(Y', y*)/τ)
This can be shown to approximately maximize reward; see Norouzi et al. (2016) and Ma et al. (2017).
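The exponentiated payoff distribution can be sketched with a hypothetical Hamming-style reward (number of matching tags) over an explicitly enumerated candidate set; real systems sample from q rather than enumerating all outputs.

```python
import math

def hamming_reward(y, y_star):
    """Hypothetical reward: the number of positions where y matches y*."""
    return sum(a == b for a, b in zip(y, y_star))

def exp_payoff(candidates, y_star, tau=1.0):
    """q(y | y*; tau) = exp(R(y, y*)/tau) / sum_y' exp(R(y', y*)/tau)."""
    logits = [hamming_reward(y, y_star) / tau for y in candidates]
    m = max(logits)                       # subtract max for stability
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return [w / z for w in weights]

y_star = ["PRP", "VBP", "DT", "NN"]
candidates = [y_star,
              ["PRP", "NN", "DT", "NN"],
              ["NN", "NN", "NN", "NN"]]
q = exp_payoff(candidates, y_star, tau=1.0)
# gold gets the most mass; near-misses still get some, unlike one-hot MLE
```

The temperature τ interpolates between one-hot MLE (τ → 0) and a uniform target (τ → ∞).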
I hate this movie
MLE:    PRP VBP DT NN
sample: PRP NN DT NN