SLIDE 1

CS11-747 Neural Networks for NLP

Structured Perceptron/ Margin Methods

Graham Neubig

Site https://phontron.com/class/nn4nlp2020/

SLIDE 2

Types of Prediction

  • Two classes (binary classification)

    I hate this movie → positive / negative

  • Multiple classes (multi-class classification)

    I hate this movie → very good / good / neutral / bad / very bad

  • Exponential/infinite labels (structured prediction)

    POS tagging:  I hate this movie → PRP VBP DT NN
    Translation:  I hate this movie → kono eiga ga kirai

SLIDE 3

Many Varieties of Structured Prediction!

  • Models:
      • RNN-based decoders
      • Convolutional/self-attentional decoders
      • CRFs w/ local factors
  • Training algorithms:
      • Maximum likelihood w/ teacher forcing
      • Sequence-level likelihood
      • Structured perceptron, structured large margin
      • Reinforcement learning/minimum risk training
      • Sampling corruptions of data

    (The likelihood-based training algorithms were covered already; the structured perceptron, margin methods, and sampling corruptions of data are covered today.)

SLIDE 4

Reminder: Globally Normalized Models

  • Locally normalized models: each decision made by the model has a probability that adds to one

    P(Y | X) = \prod_{j=1}^{|Y|} \frac{e^{S(y_j | X, y_1, \ldots, y_{j-1})}}{\sum_{\tilde{y}_j \in V} e^{S(\tilde{y}_j | X, y_1, \ldots, y_{j-1})}}

  • Globally normalized models (a.k.a. energy-based models): each sentence has a score, which is not normalized over a particular decision

    P(Y | X) = \frac{e^{S(X, Y)}}{\sum_{\tilde{Y} \in V^*} e^{S(X, \tilde{Y})}}
SLIDE 5

Globally Normalized Likelihood

SLIDE 6

Difficulties Training Globally Normalized Models

  • The partition function (the denominator below) is problematic: it sums over an exponentially large space of outputs
  • Two options for calculating the partition function:
      • Structure the model to allow enumeration via dynamic programming, e.g. linear-chain CRF, CFG
      • Estimate the partition function by sub-sampling the hypothesis space

    P(Y | X) = \frac{e^{S(X, Y)}}{\sum_{\tilde{Y} \in V^*} e^{S(X, \tilde{Y})}}
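To make the dynamic-programming option concrete, here is a minimal sketch (not from the slides; the array names and layout are illustrative assumptions) of computing the log partition function log Z for a linear-chain model with the forward algorithm:

    import numpy as np
    from scipy.special import logsumexp

    def log_partition(emit, trans):
        """Forward algorithm for a linear-chain model.

        emit:  (T, K) array, emit[t, k] = score of tag k at position t
        trans: (K, K) array, trans[i, j] = score of tag j following tag i
        Returns log Z, summing over all K^T possible tag sequences.
        """
        alpha = emit[0].copy()                 # log-scores of length-1 prefixes
        for t in range(1, len(emit)):
            # alpha[j] = logsumexp_i(alpha[i] + trans[i, j]) + emit[t, j]
            alpha = logsumexp(alpha[:, None] + trans, axis=0) + emit[t]
        return logsumexp(alpha)                # finally, sum over the last tag

With log Z in hand, the globally normalized negative log-likelihood is simply log Z - S(X, Y), which is the exactly trainable linear-chain CRF case mentioned above.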
SLIDE 7

Two Methods for Approximation

  • Sampling:
      • Sample k hypotheses according to the probability distribution
      • + Unbiased estimator: as k gets large, the estimate approaches the true distribution
      • - High variance: what if we get low-probability samples?
  • Beam search:
      • Search for the k best hypotheses
      • - Biased estimator: may result in systematic differences from the true distribution
      • + Lower variance: more likely to get high-probability outputs

  (A small sketch of both estimators follows below.)
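As a rough sketch of how the two estimators plug in (the function and argument names are assumptions, not from the slides): given k scored hypotheses, beam search just sums the k best, while sampling uses an importance-weighted average under the proposal distribution that generated the samples.

    import numpy as np
    from scipy.special import logsumexp

    def approx_log_partition(scores, proposal_logprobs=None):
        """Approximate log Z from k scored hypotheses.

        scores: S(X, Y~) for each of the k hypotheses.
        proposal_logprobs: log q(Y~) for each sampled hypothesis; if None,
        the hypotheses are assumed to be a beam's k-best, and summing them
        gives a biased (lower-bound) but low-variance estimate.
        """
        s = np.asarray(scores, dtype=float)
        if proposal_logprobs is None:
            return logsumexp(s)                    # beam: sum the k best only
        q = np.asarray(proposal_logprobs, dtype=float)
        # sampling: Z = E_q[exp(S) / q], so average the importance weights
        return logsumexp(s - q) - np.log(len(s))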
SLIDE 8

Un-normalized Models: Structured Perceptron

SLIDE 9

Normalization often Not Necessary for Inference!

  • At inference time, we often just want the best hypothesis:

    \hat{Y} = \operatorname{argmax}_Y P(Y | X)

  • For a globally normalized model, P(Y | X) = e^{S(X, Y)} / \sum_{\tilde{Y} \in V^*} e^{S(X, \tilde{Y})}, and the denominator does not depend on Y
  • If that's all we need, no need for normalization! The same best hypothesis is found by maximizing the unnormalized score:

    \hat{Y} = \operatorname{argmax}_Y S(X, Y)
SLIDE 10

The Structured Perceptron Algorithm

  • An extremely simple way of training (non-probabilistic) global models
  • Find the one best, and if its score is better than that of the correct answer, adjust parameters to fix this:

    \hat{Y} = \operatorname{argmax}_{\tilde{Y} \neq Y} S(\tilde{Y} | X; \theta)
    if S(\hat{Y} | X; \theta) \geq S(Y | X; \theta) then
        \theta \leftarrow \theta + \alpha \left( \frac{\partial S(Y | X; \theta)}{\partial \theta} - \frac{\partial S(\hat{Y} | X; \theta)}{\partial \theta} \right)
    end if

    (Find the one best; if its score is at least that of the reference, increase the score of the reference and decrease the score of the one-best; here, as an SGD update.)
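For the classic linear case S(Y | X; w) = w · f(X, Y), the gradients in the update above are just feature vectors, and the algorithm reduces to the sketch below; `decode` and `features` are hypothetical helpers, not from the slides.

    import numpy as np

    def perceptron_update(w, X, Y, decode, features, alpha=1.0):
        """One structured perceptron step for a linear model.

        decode(w, X) returns the current one-best structure; features(X, Y)
        returns a feature vector, so dS/dw = features(X, Y).
        """
        Y_hat = decode(w, X)     # find the one best
        if Y_hat != Y:           # the one best scores at least as high as gold
            w = w + alpha * (features(X, Y) - features(X, Y_hat))
        return w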

SLIDE 11

Structured Perceptron Loss

  • Structured perceptron can also be expressed as a loss function!

    \ell_{percept}(X, Y) = \max(0, S(\hat{Y} | X; \theta) - S(Y | X; \theta))

  • The resulting gradient looks like the perceptron algorithm:

    \frac{\partial \ell_{percept}(X, Y; \theta)}{\partial \theta} =
      \begin{cases}
        \frac{\partial S(\hat{Y} | X; \theta)}{\partial \theta} - \frac{\partial S(Y | X; \theta)}{\partial \theta} & \text{if } S(\hat{Y} | X; \theta) \geq S(Y | X; \theta) \\
        0 & \text{otherwise}
      \end{cases}

  • This is a normal loss function, so it can be used in NNs
  • But! It requires finding the argmax in addition to the true candidate: we must do prediction during training
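Since it is a normal loss function, it is one line in an autograd framework; here is a hedged PyTorch-style sketch (the class examples use DyNet, so this framing is an assumption):

    import torch

    def percept_loss(score_hat, score_gold):
        """Structured perceptron loss: max(0, S(Y_hat) - S(Y_gold)).

        score_hat and score_gold are scalar tensors the network assigns to
        the argmax and gold structures; backprop recovers the gradient above.
        """
        return torch.clamp(score_hat - score_gold, min=0.0)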
SLIDE 12

Contrasting Perceptron and Global Normalization

  • Globally normalized probabilistic model:

    \ell_{global}(X, Y; \theta) = -\log \frac{e^{S(Y | X)}}{\sum_{\tilde{Y}} e^{S(\tilde{Y} | X)}}

  • Structured perceptron:

    \ell_{percept}(X, Y) = \max(0, S(\hat{Y} | X; \theta) - S(Y | X; \theta))

  • Global structured perceptron?

    \ell_{global\text{-}percept}(X, Y) = \sum_{\tilde{Y}} \max(0, S(\tilde{Y} | X; \theta) - S(Y | X; \theta))

  • This has the same computational problems as globally normalized probabilistic models

SLIDE 13

Structured Training and Pre-training

  • Neural network models have lots of parameters and a big output space; training is hard
  • Tradeoffs between training algorithms:
      • Selecting just one negative example is inefficient
      • Teacher forcing efficiently updates all parameters, but suffers from exposure bias
  • Thus, it is common to pre-train with teacher forcing, then fine-tune with a more complicated algorithm

SLIDE 14

Hinge Loss and Cost-sensitive Training

SLIDE 15

Perceptron and Uncertainty

  • Which is better, the dotted or the dashed decision boundary? (figure of two separating boundaries omitted)
  • Both have zero perceptron loss! The perceptron loss cannot prefer the boundary with the larger margin, which motivates adding a margin explicitly.
SLIDE 16

Adding a “Margin” with Hinge Loss

  • Penalize when the incorrect answer is within margin m of the correct one:

    \ell_{hinge}(x, y; \theta) = \max(0, m + S(\hat{y} | x; \theta) - S(y | x; \theta))

    (figure comparing the perceptron and hinge loss curves omitted)
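A minimal NumPy sketch of the multi-class version (the names and score-vector framing are illustrative assumptions); setting m = 0 recovers the perceptron loss:

    import numpy as np

    def hinge_loss(scores, gold, m=1.0):
        """max(0, m + best wrong score - gold score) over a score vector."""
        wrong = np.delete(scores, gold)     # scores of all incorrect answers
        return max(0.0, m + wrong.max() - scores[gold])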

SLIDE 17

Hinge Loss for Any Classifier!

  • We can swap cross-entropy for hinge loss anytime

    (figure: a tagger over “I hate this movie”, with a hinge loss computed at each of the four tag decisions PRP VBP DT NN)

  • e.g. in DyNet:

    loss = dy.pickneglogsoftmax(score, answer)   # cross-entropy loss
    loss = dy.hinge(score, answer, m=1)          # drop-in hinge replacement

SLIDE 18

Cost-augmented Hinge

  • Sometimes some decisions are worse than others
      • e.g. a VB -> VBP mistake is not so bad; a VB -> NN mistake is much worse for downstream applications
  • Cost-augmented hinge defines a cost for each incorrect decision, and sets the margin equal to this cost:

    \ell_{ca\text{-}hinge}(x, y; \theta) = \max(0, \mathrm{cost}(\hat{y}, y) + S(\hat{y} | x; \theta) - S(y | x; \theta))
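Continuing the NumPy sketch above, the only change is that each candidate's margin comes from a cost vector (an assumed input with cost[gold] = 0):

    import numpy as np

    def ca_hinge_loss(scores, gold, cost):
        """Cost-augmented hinge: each wrong answer's margin equals its cost,
        e.g. small for a VB -> VBP confusion, large for VB -> NN."""
        aug = scores + cost             # cost-augmented scores
        y_hat = int(np.argmax(aug))     # the largest margin violator
        return max(0.0, aug[y_hat] - scores[gold])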

SLIDE 19

Costs over Sequences

  • Zero-one loss: 1 if the sentences differ, zero otherwise

    \mathrm{cost}_{zero\text{-}one}(\hat{Y}, Y) = \delta(\hat{Y} \neq Y)

  • Hamming loss: 1 for every differing element (assumes the lengths are identical)

    \mathrm{cost}_{hamming}(\hat{Y}, Y) = \sum_{j=1}^{|Y|} \delta(\hat{y}_j \neq y_j)

  • Other losses: edit distance, 1-BLEU, etc.
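Both costs are a few lines in Python (a sketch, treating sequences as lists of tags):

    def cost_zero_one(Y_hat, Y):
        """1 if the sequences differ anywhere, 0 otherwise."""
        return float(Y_hat != Y)

    def cost_hamming(Y_hat, Y):
        """Count of differing positions (assumes equal lengths)."""
        return float(sum(a != b for a, b in zip(Y_hat, Y)))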

SLIDE 20

Structured Hinge Loss

  • Hinge loss over the sequence with the largest margin violation:

    \hat{Y} = \operatorname{argmax}_{\tilde{Y} \neq Y} \left[ \mathrm{cost}(\tilde{Y}, Y) + S(\tilde{Y} | X; \theta) \right]

    \ell_{ca\text{-}hinge}(X, Y; \theta) = \max(0, \mathrm{cost}(\hat{Y}, Y) + S(\hat{Y} | X; \theta) - S(Y | X; \theta))

  • Problem: How do we find the argmax above?
  • Solution: In some cases, where the loss can be calculated easily, we can consider the loss during search.

SLIDE 21

Cost-Augmented Decoding for Hamming Loss

  • Hamming loss is decomposable over each word
  • Solution: add a score equal to the cost (here, +1) to each incorrect choice during search (a sketch follows below)

    (figure: tagging “I hate this movie”; candidate tags such as NN, VBP, PRP, DT receive model scores at each position, and +1 is added to the score of every incorrect tag during search)
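Here is a minimal sketch of this trick for a model with only local per-position scores (`emit` and the greedy search are simplifying assumptions; the same score adjustment works inside Viterbi):

    import numpy as np

    def cost_augmented_decode(emit, gold, cost=1.0):
        """Cost-augmented search for the Hamming cost.

        emit: (T, K) per-position tag scores; gold: length-T gold tag ids.
        Because Hamming cost decomposes per position, adding `cost` to every
        incorrect tag's score makes ordinary search return the argmax of
        cost(Y~, Y) + S(Y~ | X).
        """
        aug = emit + cost                          # +cost on every tag...
        aug[np.arange(len(gold)), gold] -= cost    # ...then remove it on the gold tag
        return aug.argmax(axis=1)                  # cost-augmented one-best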

SLIDE 22

Simpler Remedies to Exposure Bias

SLIDE 23

What’s Wrong w/ Structured Hinge Loss?

  • It may work, but…
  • Considers fewer hypotheses, so unstable
  • Requires decoding, so slow
  • Generally must resort to pre-training (and even then, it’s not as stable as teacher forcing w/ MLE)

SLIDE 24

Solution 1: Sample Mistakes in Training (Ross et al. 2010)

  • DAgger (also known as “scheduled sampling”, etc.) randomly samples wrong decisions and feeds them into the model during training
  • Start with no mistakes, and then gradually introduce them using annealing (a sketch follows below)
  • How to choose the next tag? Use the gold standard, or create a “dynamic oracle” (e.g. Goldberg and Nivre 2013)

    (figure: a tagger over “I hate this movie”; at each step a loss is computed against the gold tags PRP VBP DT NN, but the tag fed to the next step is sampled from the model, e.g. NN, VB, DT, NN)
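A hedged sketch of the sampling decision (in a real decoder this happens step by step inside the unrolled network; `model_sample` is an assumed callback):

    import numpy as np

    def choose_next_inputs(gold_tags, model_sample, p_mistake):
        """Pick the tag fed to each next decoder step, scheduled-sampling style.

        With probability p_mistake, annealed upward from 0 during training,
        feed the model's own sampled tag instead of the gold one;
        model_sample(t) returns the model's sampled tag at step t.
        """
        fed = []
        for t, gold in enumerate(gold_tags):
            if np.random.rand() < p_mistake:
                fed.append(model_sample(t))   # feed a possibly wrong decision
            else:
                fed.append(gold)              # feed the gold tag
        return fed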

SLIDE 25

Solution 2: Drop Out Inputs

  • Basic idea: Simply don’t input the previous decision sometimes during training (Gal and Ghahramani 2015)
  • Helps ensure that the model doesn’t rely too heavily on predictions, while still using them (a sketch follows below)

    (figure: a tagger over “I hate this movie” predicting PRP VBP DT NN, with some of the previous-prediction inputs to the classifier dropped, marked “x”)
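A minimal sketch of withholding the previous decision (the p_drop value and the function shape are illustrative assumptions):

    import numpy as np

    def drop_prev_input(prev_embedding, p_drop=0.2, training=True):
        """With probability p_drop, zero out the previous-prediction input
        during training so the model cannot rely on it always being there."""
        if training and np.random.rand() < p_drop:
            return np.zeros_like(prev_embedding)
        return prev_embedding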

SLIDE 26

Solution 3: Corrupt Training Data

  • Reward-augmented maximum likelihood (Norouzi et al. 2016)
  • Basic idea: randomly sample incorrect training data, then train w/ maximum likelihood (a sketch follows below)
  • The sampling probability is proportional to the goodness of the output
  • Can be shown to approximately minimize risk (next class)

    (figure: the gold tags PRP VBP DT NN for “I hate this movie” are corrupted by sampling into PRP NN DT NN, which is then trained on with MLE)
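A rough sketch of the corruption step under a Hamming-style reward (the temperature tau, the edit cap, and this particular sampling scheme are illustrative assumptions, not the paper's exact procedure):

    import numpy as np

    def raml_corrupt(gold_tags, n_tags, tau=1.0, max_edits=2):
        """Sample a corrupted training target: the number of Hamming edits is
        drawn proportionally to exp(-edits / tau), so outputs closer to the
        gold (higher reward) are sampled more often; the result is then
        trained on with plain maximum likelihood."""
        edits = np.arange(max_edits + 1)
        p = np.exp(-edits / tau)
        k = int(np.random.choice(edits, p=p / p.sum()))  # number of corruptions
        out = list(gold_tags)
        for j in np.random.choice(len(out), size=k, replace=False):
            out[j] = int(np.random.randint(n_tags))      # replace with a random tag
        return out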

SLIDE 27

Questions?