CS11-747 Neural Networks for NLP
Structured Perceptron/ Margin Methods
Graham Neubig
Site https://phontron.com/class/nn4nlp2020/
Types of Prediction
Two classes (binary classification)
I hate this movie -> positive / negative
I hate this movie -> PRP VBP DT NN
I hate this movie -> kono eiga ga kirai
I hate this movie -> very good / good / neutral / bad / very bad
Covered already Covered today
by the model has a probability that adds to one
based models): each sentence has a score, which is not normalized over a particular decision
Locally normalized:
$$P(Y \mid X) = \prod_{j=1}^{|Y|} \frac{e^{S(y_j \mid X, y_1, \ldots, y_{j-1})}}{\sum_{\tilde{y}_j \in V} e^{S(\tilde{y}_j \mid X, y_1, \ldots, y_{j-1})}}$$
Globally normalized:
$$P(Y \mid X) = \frac{e^{S(X, Y)}}{\sum_{\tilde{Y} \in V^*} e^{S(X, \tilde{Y})}}$$
programming, e.g. linear chain CRF, CFG
hypothesis space
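As an illustration of the dynamic-programming option, the sketch below computes the log partition function of a toy linear-chain model in two ways: by enumerating every tag sequence, and with the forward algorithm. It rests on assumptions that are not in the slides: sequences have a fixed length (rather than ranging over all of $V^*$), the score decomposes into per-position and tag-bigram terms, and the names `emit`, `trans`, and the toy numbers are made up for the example.

```python
import itertools
import math

def _logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def sequence_score(emit, trans, tags):
    """Unnormalized score S(X, Y): per-position scores plus tag-bigram scores."""
    total = sum(emit[j][t] for j, t in enumerate(tags))
    total += sum(trans[a][b] for a, b in zip(tags, tags[1:]))
    return total

def log_partition_brute_force(emit, trans):
    """log of the sum of exp(S) over every tag sequence (exponential time)."""
    n, k = len(emit), len(emit[0])
    all_scores = [
        sequence_score(emit, trans, y)
        for y in itertools.product(range(k), repeat=n)
    ]
    return _logsumexp(all_scores)

def log_partition_forward(emit, trans):
    """Same quantity via the forward algorithm, in O(n * k^2) time."""
    k = len(emit[0])
    alpha = list(emit[0])  # alpha[b]: log-sum of prefix scores ending in tag b
    for j in range(1, len(emit)):
        alpha = [
            emit[j][b] + _logsumexp([alpha[a] + trans[a][b] for a in range(k)])
            for b in range(k)
        ]
    return _logsumexp(alpha)

# Toy scores (hypothetical): 3 positions, 3 tags.
emit = [[0.5, 1.3, -0.2], [0.1, 0.0, 2.0], [1.0, -1.0, 0.3]]
trans = [[0.2, -0.5, 0.0], [0.0, 0.3, -0.1], [0.4, 0.0, -0.3]]

log_z = log_partition_forward(emit, trans)
# Globally normalized probability of one particular tag sequence.
prob = math.exp(sequence_score(emit, trans, (1, 2, 0)) - log_z)
```

Enumeration is only feasible for tiny examples; the forward recursion is what makes exact normalization practical when the score factorizes this way.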
$$\sum_{\tilde{Y} \in V^*} e^{S(X, \tilde{Y})}$$
distribution
true distribution
hypothesis
$$\hat{Y} = \operatorname*{argmax}_{Y} P(Y \mid X)$$
$$P(Y \mid X) = \frac{e^{S(X, Y)}}{\sum_{\tilde{Y} \in V^*} e^{S(X, \tilde{Y})}}$$
$$\hat{Y} = \operatorname*{argmax}_{Y} S(X, Y)$$
adjust parameters to fix this
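Find one best:
$$\hat{Y} = \operatorname*{argmax}_{\tilde{Y} \neq Y} S(\tilde{Y} \mid X; \theta)$$
If score better than reference:
$$S(\hat{Y} \mid X; \theta) \geq S(Y \mid X; \theta)$$
Increase score of reference, decrease score of one-best (here, SGD update), with step size $\alpha$:
$$\theta \leftarrow \theta + \alpha \left( \frac{\partial S(Y \mid X; \theta)}{\partial \theta} - \frac{\partial S(\hat{Y} \mid X; \theta)}{\partial \theta} \right)$$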
loss function!
$$\ell_{\text{percept}}(X, Y) = \max(0, S(\hat{Y} \mid X; \theta) - S(Y \mid X; \theta))$$
candidate: must do prediction during training
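$$\frac{\partial \ell_{\text{percept}}(X, Y; \theta)}{\partial \theta} = \begin{cases} \frac{\partial S(\hat{Y} \mid X; \theta)}{\partial \theta} - \frac{\partial S(Y \mid X; \theta)}{\partial \theta} & \text{if } S(\hat{Y} \mid X; \theta) \geq S(Y \mid X; \theta) \\ 0 & \text{otherwise} \end{cases}$$
A gradient descent step therefore moves $\theta$ in the direction $\frac{\partial S(Y \mid X; \theta)}{\partial \theta} - \frac{\partial S(\hat{Y} \mid X; \theta)}{\partial \theta}$.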
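The perceptron update can be sketched on a toy tagger. Everything specific here is an assumption made for illustration: the score is a linear function of hand-picked features (so the gradient of the score is just the feature vector), the one-best search is brute force over all tag sequences, and the names `features`, `one_best`, and `perceptron_step` are hypothetical. A neural scorer would replace the feature arithmetic with backpropagation.

```python
import itertools
from collections import defaultdict

TAGS = ["PRP", "VBP", "DT", "NN"]

def features(words, tags):
    """Sparse feature counts: word/tag pairs and tag bigrams (illustrative choice)."""
    feats = defaultdict(float)
    prev = "<s>"
    for word, tag in zip(words, tags):
        feats[("emit", word, tag)] += 1.0
        feats[("trans", prev, tag)] += 1.0
        prev = tag
    return feats

def score(weights, words, tags):
    """S(Y | X; theta) as a dot product of weights and feature counts."""
    return sum(weights[f] * v for f, v in features(words, tags).items())

def one_best(weights, words, gold):
    """Highest-scoring tag sequence other than the reference (brute force)."""
    candidates = (
        y for y in itertools.product(TAGS, repeat=len(words)) if y != gold
    )
    return max(candidates, key=lambda y: score(weights, words, y))

def perceptron_step(weights, words, gold, lr=1.0):
    """Apply one update if needed; return the perceptron loss before updating."""
    y_hat = one_best(weights, words, gold)
    s_hat = score(weights, words, y_hat)
    s_gold = score(weights, words, gold)
    if s_hat >= s_gold:
        # Increase the score of the reference, decrease that of the one-best.
        for f, v in features(words, gold).items():
            weights[f] += lr * v
        for f, v in features(words, y_hat).items():
            weights[f] -= lr * v
    return max(0.0, s_hat - s_gold)

weights = defaultdict(float)
words = ("I", "hate", "this", "movie")
gold = ("PRP", "VBP", "DT", "NN")
for _ in range(5):
    perceptron_step(weights, words, gold)
```

Note that every training step runs a search for the one-best hypothesis, which is the cost the slide points to with "must do prediction during training".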
probabilistic models
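$$\ell_{\text{percept}}(X, Y) = \max(0, S(\hat{Y} \mid X; \theta) - S(Y \mid X; \theta))$$
$$\ell_{\text{global}}(X, Y; \theta) = -\log \frac{e^{S(Y \mid X)}}{\sum_{\tilde{Y}} e^{S(\tilde{Y} \mid X)}}$$
$$\ell_{\text{global-percept}}(X, Y) = \sum_{\tilde{Y}} \max(0, S(\tilde{Y} \mid X; \theta) - S(Y \mid X; \theta))$$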
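To compare the three losses side by side, here is a small sketch that evaluates each one on an explicitly enumerated hypothesis set. The hypothesis strings and scores are invented for the example, and a real output space would be far too large to list like this.

```python
import math

def percept_loss(scores, gold):
    """max(0, S(Y_hat) - S(Y)), with Y_hat the best-scoring hypothesis."""
    best = max(scores.values())
    return max(0.0, best - scores[gold])

def global_loss(scores, gold):
    """Negative log of the softmax probability of the reference."""
    m = max(scores.values())
    log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return log_z - scores[gold]

def global_percept_loss(scores, gold):
    """Sum over all hypotheses of max(0, S(Y_tilde) - S(Y))."""
    return sum(max(0.0, s - scores[gold]) for s in scores.values())

# Hypothetical scores for four candidate tag sequences.
scores = {
    "PRP VBP DT NN": 2.0,
    "NN VBP DT NN": 2.5,
    "PRP VB DT NN": 3.0,
    "NN NN NN NN": -1.0,
}
gold = "PRP VBP DT NN"

losses = {
    "percept": percept_loss(scores, gold),
    "global": global_loss(scores, gold),
    "global-percept": global_percept_loss(scores, gold),
}
```

The perceptron loss only looks at the single best competitor, while the other two take every hypothesis into account.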
big output space; training is hard
but suffers from exposure bias
then fine-tune with more complicated algorithm
Perceptron / Hinge
$$\ell_{\text{hinge}}(x, y; \theta) = \max(0, m + S(\hat{y} \mid x; \theta) - S(y \mid x; \theta))$$
[Figure: "I hate this movie" tagged PRP VBP DT NN, with a hinge loss applied at each of the four decisions]
e.g. in DyNet:
loss = dy.pickneglogsoftmax(score, answer)
↓
loss = dy.hinge(score, answer, m=1)
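For readers without DyNet at hand, the margin-based loss can be written out in a few lines of plain Python. This is a sketch of the formula on the slide, not DyNet's implementation, and it assumes that the competing label is the highest-scoring label other than the correct one; the function names are hypothetical.

```python
def hinge_loss(scores, answer, margin=1.0):
    """max(0, m + S(y_hat) - S(y)), with y_hat the best incorrect label."""
    rival = max(s for i, s in enumerate(scores) if i != answer)
    return max(0.0, margin + rival - scores[answer])

def perceptron_loss(scores, answer):
    """The same loss with no margin."""
    return hinge_loss(scores, answer, margin=0.0)

scores = [2.0, 1.5, -1.0]  # hypothetical scores for three labels
answer = 0

with_margin = hinge_loss(scores, answer)         # correct, but by less than the margin
without_margin = perceptron_loss(scores, answer)  # correct, so no penalty
```

The example shows the difference the margin makes: the correct label wins, so the perceptron loss is zero, yet the hinge loss is still positive because the win is narrower than the margin.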
mistake much worse for downstream apps
incorrect decision, and sets margin equal to this
$$\ell_{\text{ca-hinge}}(x, y; \theta) = \max(0, \text{cost}(\hat{y}, y) + S(\hat{y} \mid x; \theta) - S(y \mid x; \theta))$$
(lengths are identical)
$$\text{cost}_{\text{zero-one}}(\hat{Y}, Y) = \delta(\hat{Y} \neq Y)$$
$$\text{cost}_{\text{hamming}}(\hat{Y}, Y) = \sum_{j=1}^{|Y|} \delta(\hat{y}_j \neq y_j)$$
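Both cost functions are short enough to state directly in code. The sketch below follows the two definitions above; the function names and the example tag sequences are chosen for illustration, and the Hamming version rejects sequences of different lengths since the definition assumes they match.

```python
def cost_zero_one(y_hat, y):
    """1 if the sequences differ anywhere, 0 if they are identical."""
    return 0 if tuple(y_hat) == tuple(y) else 1

def cost_hamming(y_hat, y):
    """Number of positions whose labels differ (lengths must be identical)."""
    if len(y_hat) != len(y):
        raise ValueError("Hamming cost needs sequences of identical length")
    return sum(1 for a, b in zip(y_hat, y) if a != b)

gold = ["PRP", "VBP", "DT", "NN"]
guess = ["NN", "VB", "DT", "NN"]

zero_one = cost_zero_one(guess, gold)
hamming = cost_hamming(guess, gold)
```

Zero-one cost treats every wrong sequence alike, whereas Hamming cost grows with the number of wrong tags.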
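violation
$$\hat{Y} = \operatorname*{argmax}_{\tilde{Y} \neq Y} \text{cost}(\tilde{Y}, Y) + S(\tilde{Y} \mid X; \theta)$$
$$\ell_{\text{ca-hinge}}(X, Y; \theta) = \max(0, \text{cost}(\hat{Y}, Y) + S(\hat{Y} \mid X; \theta) - S(Y \mid X; \theta))$$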
calculated easily, we can consider loss in search.
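With Hamming cost the augmentation is easy to fold into search, because the cost splits into one term per position. The sketch below assumes a simplified model in which each position is scored independently (no tag-to-tag interactions), and unlike the argmax above it lets the search return the reference itself, in which case the loss is zero. The scores and function names are hypothetical.

```python
TAGS = ["PRP", "VBP", "DT", "NN"]

def cost_augmented_decode(scores, gold):
    """Per position, add +1 to every incorrect tag's score, then take the best."""
    y_hat = []
    for pos_scores, gold_tag in zip(scores, gold):
        augmented = {
            t: pos_scores[t] + (0.0 if t == gold_tag else 1.0) for t in TAGS
        }
        y_hat.append(max(TAGS, key=augmented.get))
    return y_hat

def ca_hinge(scores, gold):
    """Cost-augmented hinge loss with Hamming cost."""
    y_hat = cost_augmented_decode(scores, gold)
    cost = sum(1 for a, b in zip(y_hat, gold) if a != b)
    s_hat = sum(p[t] for p, t in zip(scores, y_hat))
    s_gold = sum(p[t] for p, t in zip(scores, gold))
    return max(0.0, cost + s_hat - s_gold)

# Hypothetical per-position scores for "I hate this movie".
scores = [
    {"PRP": 1.3, "VBP": 0.0, "DT": 0.0, "NN": 0.5},
    {"PRP": 0.0, "VBP": 3.0, "DT": 0.0, "NN": 0.0},
    {"PRP": 0.0, "VBP": 0.0, "DT": 2.5, "NN": 0.0},
    {"PRP": 0.0, "VBP": 0.0, "DT": 0.0, "NN": 2.0},
]
gold = ["PRP", "VBP", "DT", "NN"]

y_hat = cost_augmented_decode(scores, gold)
loss = ca_hinge(scores, gold)
```

In this example plain decoding would already return the reference, but the first position wins by less than the cost of an error, so cost-augmented decoding picks the wrong tag there and the loss stays positive.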
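[Figure: cost-augmented decoding for "I hate this movie"; candidate tags (NN, VBP, PRP, DT, …) carry scores such as 0.5 and 1.3, and +1 is added to the scores of incorrect tags]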
then, it’s not as stable as teacher forcing w/ MLE)
samples wrong decisions and feeds them in
annealing
“dynamic oracle” (e.g. Goldberg and Nivre 2013)
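The idea of feeding sampled decisions back into the model, with an annealed sampling rate, can be sketched as follows. This is a simplified illustration, often described under the name scheduled sampling, and not a full training loop: `predict_next` stands in for the model and is a hypothetical callable, the linear annealing schedule and `max_prob` value are arbitrary choices, and the loss (not shown) would still be computed against the reference tag at every step.

```python
import random

def rollout_inputs(words, reference, predict_next, sample_prob, rng):
    """Return the previous-tag input fed to the model at each step."""
    prev = "<s>"
    fed = []
    for word, ref_tag in zip(words, reference):
        fed.append(prev)
        guess = predict_next(prev, word)
        # With probability sample_prob feed the model's own (possibly wrong)
        # decision to the next step, otherwise feed the reference tag.
        prev = guess if rng.random() < sample_prob else ref_tag
    return fed

def annealed_prob(epoch, total_epochs, max_prob=0.5):
    """Start from pure teacher forcing and raise the sampling rate linearly."""
    return max_prob * min(1.0, epoch / max(1, total_epochs - 1))

words = ["I", "hate", "this", "movie"]
reference = ["PRP", "VBP", "DT", "NN"]

def always_nn(prev_tag, word):
    """Toy stand-in for a model that predicts NN everywhere."""
    return "NN"

rng = random.Random(0)
teacher_forced = rollout_inputs(words, reference, always_nn, 0.0, rng)
fully_sampled = rollout_inputs(words, reference, always_nn, 1.0, rng)
```

Starting with a sampling rate of zero keeps early training as stable as teacher forcing, and raising it later exposes the model to its own mistakes.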
[Figure: "I hate this movie" processed step by step; each step computes a score, the loss is taken against the reference tag (PRP, VBP, DT, NN), and a sampled tag (NN, VB, DT, NN) is fed to the next step]
sometimes during training (Gal and Ghahramani 2015)
predictions, while still using them
[Figure: a classifier at each step predicts PRP VBP DT NN for "I hate this movie"; some of the inputs carrying previous predictions are dropped (marked x)]
maximum likelihood
[Figure: "I hate this movie" with reference tags PRP VBP DT NN, and a sampled tag sequence PRP NN DT NN used for MLE training]