Machine Learning 2 DS 4420 - Spring 2020 Structured prediction, II
Byron C Wallace
Today
- From HMMs to MEMMs to CRFs
Structured output spaces
“Play Kanye West”
x1 x2 x3
y1 y2 y3
Structured output spaces
Source: http://cocodataset.org/
Space of problems
Given      Predict                             Type?
An image   Contains a cat?                     Classification
An image   Coordinates that outline all cats   Structured prediction
A tweet    Names in the tweet                  Structured prediction
A tweet    Sentiment in tweet                  Classification
A generative model of sequences
$$P(X_1 = x_1, \ldots, X_n = x_n, Y_1 = y_1, \ldots, Y_n = y_n) = \prod_{i=1}^{n+1} P(y_i \mid y_{i-1}) \prod_{i=1}^{n} P(x_i \mid y_i)$$
Transition probability: $P(y_i \mid y_{i-1})$
Emission probability: $P(x_i \mid y_i)$
Graphical Model (HMMs): a chain of label nodes y0, y1, ..., y5 with observation nodes x1, ..., x5
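The joint probability of a tagged sequence under an HMM can be computed directly from the two products. The sketch below is only an illustration: the tag set, the `START`/`STOP` symbols standing in for y_0 and y_{n+1}, and every probability value are made-up assumptions, not taken from the lecture.

```python
# Hypothetical toy HMM; every number below is an illustrative assumption.
START, STOP = "<s>", "</s>"

# transition[prev][cur] = P(y_i = cur | y_{i-1} = prev)
transition = {
    START: {"N": 0.7, "V": 0.3},
    "N": {"N": 0.2, "V": 0.6, STOP: 0.2},
    "V": {"N": 0.5, "V": 0.1, STOP: 0.4},
}

# emission[tag][word] = P(x_i = word | y_i = tag)
emission = {
    "N": {"cat": 0.6, "sat": 0.1, "the": 0.3},
    "V": {"cat": 0.1, "sat": 0.8, "the": 0.1},
}


def hmm_joint(words, tags):
    """P(x_1..x_n, y_1..y_n): n + 1 transition factors times n emission factors."""
    if len(words) != len(tags):
        raise ValueError("words and tags must have the same length")
    path = [START] + list(tags) + [STOP]  # y_0 = START, y_{n+1} = STOP
    prob = 1.0
    for prev, cur in zip(path, path[1:]):  # transition factors
        prob *= transition[prev].get(cur, 0.0)
    for word, tag in zip(words, tags):  # emission factors
        prob *= emission[tag].get(word, 0.0)
    return prob


joint = hmm_joint(["cat", "sat"], ["N", "V"])
```

Here `joint` multiplies three transition factors (START to N, N to V, V to STOP) and two emission factors, mirroring the n+1 and n limits of the two products.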
Limitations to HMMs
- We are restricted to features that have a coherent “generative
story”
- Why bother “modeling” x — it’s given! What we really care about
is p(y|x)
Generative v discriminative
Generative
- Model joint distribution P(x,y)
- Can generate new “examples”
- To predict y, use Bayes’ rule
Discriminative
- Model conditional distribution P(y|x)
- Not as amenable to semi-supervised settings; cannot readily “generate” new samples
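To make "use Bayes’ rule" concrete, here is a minimal sketch of predicting y from a joint distribution. The joint table and its values are invented for illustration; a real generative model would estimate them from data.

```python
# Made-up joint distribution P(x, y) over a single word x and its tag y.
joint = {
    ("cat", "N"): 0.30, ("cat", "V"): 0.05,
    ("sat", "N"): 0.05, ("sat", "V"): 0.40,
    ("the", "N"): 0.15, ("the", "V"): 0.05,
}


def posterior(x):
    """P(y | x) = P(x, y) / sum_{y'} P(x, y')."""
    scores = {y: p for (xi, y), p in joint.items() if xi == x}
    evidence = sum(scores.values())  # this is P(x)
    return {y: p / evidence for y, p in scores.items()}


def predict(x):
    """Return the tag with the highest posterior probability."""
    post = posterior(x)
    return max(post, key=post.get)
```

A discriminative model skips the joint table and parameterizes `posterior` directly.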
Enter Max Entropy Markov Models (MEMMs)
- These extend standard log-linear models to capture structure in
the outputs.
- A bit like the structured perceptron we introduced last time, but they explicitly model conditional probabilities of labels.
Log-Linear Models
$$p(y \mid x, w) = \frac{\exp(w \cdot \phi(x, y))}{\sum_{y' \in \mathcal{Y}} \exp(w \cdot \phi(x, y'))}$$
$w \cdot \phi(x, y)$ measures plausibility of y given x
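A log-linear model is a softmax over the scores w · φ(x, y). The sketch below assumes a hypothetical two-label set and a hypothetical sparse feature map `phi`; neither comes from the lecture.

```python
import math

LABELS = ["N", "V"]  # hypothetical label set Y


def phi(x, y):
    """Hypothetical sparse feature map: feature name -> value."""
    return {
        f"word={x},label={y}": 1.0,
        f"suffix={x[-2:]},label={y}": 1.0,
        f"label={y}": 1.0,
    }


def score(w, x, y):
    """Inner product w . phi(x, y); missing weights count as zero."""
    return sum(w.get(name, 0.0) * value for name, value in phi(x, y).items())


def p_y_given_x(w, x):
    """p(y | x, w) for every y in LABELS, via a numerically stable softmax."""
    scores = {y: score(w, x, y) for y in LABELS}
    top = max(scores.values())  # subtract the max before exponentiating
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}


w = {"word=sat,label=V": 2.0, "suffix=at,label=N": 0.5}
probs = p_y_given_x(w, "sat")
```

With all weights at zero every label gets the same probability; raising the weight of a feature that fires for (x, y) raises p(y | x, w).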
Log-likelihood
$$LL(w) = \sum_i \log p(y_i \mid x_i, w)$$
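The log-likelihood is the sum of per-example log-probabilities. A minimal sketch, assuming a hypothetical one-feature map `phi` and a made-up three-example dataset:

```python
import math

LABELS = ["N", "V"]  # hypothetical label set


def phi(x, y):
    """Hypothetical feature map with a single word/label indicator."""
    return {f"word={x},label={y}": 1.0}


def log_prob(w, x, y):
    """log p(y | x, w) = w . phi(x, y) - log sum_{y'} exp(w . phi(x, y'))."""
    def score(label):
        return sum(w.get(k, 0.0) * v for k, v in phi(x, label).items())

    scores = [score(label) for label in LABELS]
    top = max(scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return score(y) - log_z


def log_likelihood(w, data):
    """LL(w) = sum_i log p(y_i | x_i, w)."""
    return sum(log_prob(w, x, y) for x, y in data)


data = [("cat", "N"), ("sat", "V"), ("cat", "N")]  # made-up training set
ll_zero = log_likelihood({}, data)
ll_fit = log_likelihood({"word=cat,label=N": 1.0, "word=sat,label=V": 1.0}, data)
```

Training searches for the w that maximizes this quantity; weights that favor the observed labels give a higher (less negative) value than the all-zero vector.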
Back to sequences
Want to model conditional probability of sequence of y:
$$p(y_1, y_2, \ldots, y_m \mid x_1, x_2, \ldots, x_m)$$
$$= \prod_{i=1}^{m} p(y_i \mid y_1, \ldots, y_{i-1}, x_1, \ldots, x_m)$$
$$= \prod_{i=1}^{m} p(y_i \mid y_{i-1}, x_1, \ldots, x_m)$$
The first equality is the chain rule; the second assumes each label depends only on the previous label and the input (a first-order Markov assumption).
i here used to denote “index”
MEMMs
Combine
$$p(y_1, y_2, \ldots, y_m \mid x_1, x_2, \ldots, x_m) = \prod_{i=1}^{m} p(y_i \mid y_{i-1}, x_1, \ldots, x_m)$$
With
$$p(y \mid x, w) = \frac{\exp(w \cdot \phi(x, y))}{\sum_{y' \in \mathcal{Y}} \exp(w \cdot \phi(x, y'))}$$
For:
$$p(y_i \mid y_{i-1}, x_1, \ldots, x_m, w) = \frac{\exp(w \cdot \phi(x_1, \ldots, x_m, y_{i-1}, y_i))}{\sum_{y' \in \mathcal{Y}} \exp(w \cdot \phi(x_1, \ldots, x_m, y_{i-1}, y'))}$$
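Putting the two pieces together, an MEMM scores a whole tag sequence as a product of locally normalized log-linear distributions. The sketch below is illustrative only: the tag set, the `START` symbol used for y_0, the feature map, and the weights are all assumptions.

```python
import itertools
import math

TAGS = ["N", "V"]  # hypothetical tag set
START = "<s>"      # assumed value for y_0


def phi(words, i, prev_tag, tag):
    """Hypothetical features over the input, the position, and the tag pair."""
    return {
        f"word={words[i]},tag={tag}": 1.0,
        f"prev={prev_tag},tag={tag}": 1.0,
    }


def local_probs(w, words, i, prev_tag):
    """p(y_i = tag | y_{i-1} = prev_tag, x_1..x_m, w) for every tag."""
    scores = {
        tag: sum(w.get(k, 0.0) * v for k, v in phi(words, i, prev_tag, tag).items())
        for tag in TAGS
    }
    top = max(scores.values())
    exps = {tag: math.exp(s - top) for tag, s in scores.items()}
    z = sum(exps.values())
    return {tag: e / z for tag, e in exps.items()}


def sequence_prob(w, words, tags):
    """p(y_1..y_m | x_1..x_m) as a product of local probabilities."""
    prob, prev = 1.0, START
    for i, tag in enumerate(tags):
        prob *= local_probs(w, words, i, prev)[tag]
        prev = tag
    return prob


w = {"word=cat,tag=N": 1.5, "word=sat,tag=V": 1.5, "prev=N,tag=V": 0.5}
words = ["cat", "sat"]
# Because every step is normalized, the sequence probabilities sum to one.
total = sum(sequence_prob(w, words, list(t)) for t in itertools.product(TAGS, repeat=2))
```

Note that `phi` may look at any part of `words`, not just `words[i]`, which is the flexibility an HMM's emission model lacks.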
Training
What will our training examples look like?
$$p(y_i \mid y_{i-1}, x_1, \ldots, x_m, w) = \frac{\exp(w \cdot \phi(x_1, \ldots, x_m, y_{i-1}, y_i))}{\sum_{y' \in \mathcal{Y}} \exp(w \cdot \phi(x_1, \ldots, x_m, y_{i-1}, y'))}$$
$$\{y_i, [y_{i-1}, x_1, \ldots, x_m]\}$$
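One way to read the example format above: each position of a tagged sentence yields one classification example whose target is y_i and whose context is the gold previous tag plus the full input. The sketch below assumes a `START` placeholder for y_0 and BIO-style tags; both are illustrative choices, not from the lecture.

```python
START = "<s>"  # assumed placeholder for y_0


def training_examples(words, tags):
    """Yield one (y_i, (y_{i-1}, x_1..x_m, i)) pair per position i."""
    prev = START
    for i, tag in enumerate(tags):
        # The index i is kept so a feature function knows the current word.
        yield tag, (prev, tuple(words), i)
        prev = tag  # the gold tag becomes the next example's y_{i-1}


sentence = ["Mike", "Pompeo", "met", "Abiy", "Ahmed"]
gold = ["B-PER", "I-PER", "O", "B-PER", "I-PER"]  # hypothetical annotation
examples = list(training_examples(sentence, gold))
```

Each of the five resulting pairs can then be fed to an ordinary log-linear classifier trainer.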
Predicting
$$\hat{y} = \arg\max_{y \in \mathcal{Y}} p(y \mid x, w)$$
Viterbi (see last lecture)!
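For an MEMM, Viterbi runs over the local conditional log-probabilities instead of HMM transition and emission scores. This is a minimal sketch under assumed names: the tag set, `START`, the feature map `phi`, and the weights are hypothetical.

```python
import math

TAGS = ["N", "V"]  # hypothetical tag set
START = "<s>"      # assumed value for y_0


def phi(words, i, prev_tag, tag):
    """Hypothetical feature map over the input, the position, and the tag pair."""
    return {
        f"word={words[i]},tag={tag}": 1.0,
        f"prev={prev_tag},tag={tag}": 1.0,
    }


def local_log_probs(w, words, i, prev_tag):
    """log p(y_i = tag | y_{i-1} = prev_tag, x_1..x_m, w) for every tag."""
    scores = {
        tag: sum(w.get(k, 0.0) * v for k, v in phi(words, i, prev_tag, tag).items())
        for tag in TAGS
    }
    top = max(scores.values())
    log_z = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return {tag: s - log_z for tag, s in scores.items()}


def viterbi(w, words):
    """Return the most probable tag sequence and its log-probability."""
    if not words:
        return [], 0.0
    # best[i][tag] = (log-prob of the best path ending in tag at i, backpointer)
    first = local_log_probs(w, words, 0, START)
    best = [{tag: (first[tag], None) for tag in TAGS}]
    for i in range(1, len(words)):
        column = {}
        for prev in TAGS:
            step = local_log_probs(w, words, i, prev)
            for tag in TAGS:
                cand = best[i - 1][prev][0] + step[tag]
                if tag not in column or cand > column[tag][0]:
                    column[tag] = (cand, prev)
        best.append(column)
    # Pick the best final tag, then follow backpointers to the start.
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    log_prob = best[-1][tag][0]
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1], log_prob


w = {"word=cat,tag=N": 1.5, "word=sat,tag=V": 1.5, "prev=N,tag=V": 0.5}
path, log_prob = viterbi(w, ["cat", "sat"])
```

The table has one column per word and one row per tag, so decoding costs on the order of m · |TAGS|² local evaluations rather than enumerating all |TAGS|^m sequences.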
HMMs v MEMMs
HMM:
$$p(y_i \mid y_{i-1})\, p(x_i \mid y_i)$$
MEMM:
$$p(y_i \mid y_{i-1}, x_1, \ldots, x_m, w) = \frac{\exp(w \cdot \phi(x_1, \ldots, x_m, y_{i-1}, y_i))}{\sum_{y' \in \mathcal{Y}} \exp(w \cdot \phi(x_1, \ldots, x_m, y_{i-1}, y'))}$$
$\phi$ permits richer representations!
Feature engineering
Yesterday secretary of state Mike Pompeo met with Ethiopia’s Prime Minister Abiy Ahmed
Consider NER: What are some potential features here?
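As one possible answer, the sketch below builds a sparse φ for position i from the current word, its neighbors, its capitalization, a suffix, and the previous tag. The feature names and the BIO tags in the usage line are hypothetical choices made for illustration.

```python
def ner_features(words, i, prev_tag, tag):
    """Hypothetical phi(x_1..x_m, y_{i-1}, y_i) for NER at position i."""
    word = words[i]
    prev_word = words[i - 1] if i > 0 else "<s>"
    next_word = words[i + 1] if i + 1 < len(words) else "</s>"
    observations = {
        f"word={word.lower()}": 1.0,
        f"is_capitalized={word[:1].isupper()}": 1.0,
        f"prev_word={prev_word.lower()}": 1.0,
        f"next_word={next_word.lower()}": 1.0,
        f"suffix3={word[-3:].lower()}": 1.0,
        f"prev_tag={prev_tag}": 1.0,
    }
    # Conjoin every observation with the candidate tag so weights are tag-specific.
    return {f"{name}&tag={tag}": value for name, value in observations.items()}


sentence = (
    "Yesterday secretary of state Mike Pompeo met with "
    "Ethiopia’s Prime Minister Abiy Ahmed"
).split()
feats = ner_features(sentence, 5, "B-PER", "I-PER")  # features for "Pompeo"
```

Features such as `prev_word` and `next_word` look at words other than x_i, which has no coherent generative story in an HMM but is unproblematic here.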
Feature engineering
Suppose we have some deep neural network that yields embeddings for each word; we could stack a MEMM on top of this. What would reasonable features be here?
) = exp(w · φ(x1, ..., xm, yi1, yi)) P
y02Y exp(w · φ(x1, ..., xm, yi1, y0))
The “label bias” problem
$$p(y_i \mid y_{i-1}, x_1, \ldots, x_m) = \frac{\exp\big(w \cdot \phi(x_1, \ldots, x_m, y_{i-1}, y_i)\big)}{\sum_{y' \in \mathcal{Y}} \exp\big(w \cdot \phi(x_1, \ldots, x_m, y_{i-1}, y')\big)}$$
Locally re-normalized; states are competing against each other
McCallum et al., 2001
In an extreme case, a particular y may always place all mass on some other y', ignoring the local input!
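The extreme case can be sketched in a few lines. This is an illustration only: the function name `local_transition_probs` and the single-successor setup are assumptions made here, not part of the lecture. When a previous label has exactly one allowed successor, the local softmax gives that successor probability 1 regardless of the score the current input assigns to it.

```python
import math

def local_transition_probs(scores):
    """Softmax over the successors of one previous label (MEMM-style)."""
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - m) for label, s in scores.items()}
    z = sum(exps.values())
    return {label: e / z for label, e in exps.items()}

# Hypothetical previous label with a single allowed successor "V".
# Whatever score the current input gives this edge, it gets all the mass.
for score_from_input in (-50.0, 0.0, 50.0):
    probs = local_transition_probs({"V": score_from_input})
    print(score_from_input, probs["V"])
```

The input-dependent score cancels in the ratio, which is exactly the sense in which the local input is ignored.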
Figure from Awni Hannun, https://awni.github.io/label-bias/
Example
Example from Awni Hannun, https://awni.github.io/label-bias/
Label set $\mathcal{Y}$; vocabulary $V$ = [cat, sat, the]
x: cat sat
y: N V
$p(N, V \mid \text{cat}, \text{sat}) = 0.1 \cdot 1.0 = 0.1$
$p(A, N \mid \text{cat}, \text{sat}) = 0.9 \cdot 0.3 = 0.27$
Why is this happening?
$p(N, V \mid \text{cat}, \text{sat}) = 0.1 \cdot 1.0 = 0.1$
$p(A, N \mid \text{cat}, \text{sat}) = 0.9 \cdot 0.3 = 0.27$
“cat” rarely seen as first word; poorly calibrated. But the mass has to go somewhere! Why? Transitions are locally normalized
Example from Awni Hannun, https://awni.github.io/label-bias/
Hypothetical unnormalized edge scores
Both are low! We are not confident about what to do with cat
$\text{score}(A, N \mid \text{cat}, \text{sat}) = 5 + 21 = 26$
$\text{score}(N, V \mid \text{cat}, \text{sat}) = 3 + 100 = 103$
Reversed!
Note: Why add instead of multiply here?
Example from Awni Hannun, https://awni.github.io/label-bias/
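One way to read the note about adding: if each edge score is treated as a log-potential (an interpretation assumed here), then adding scores along a path is the same as multiplying the exponentiated potentials. A small numeric check using the N, V path scores:

```python
import math

# Edge scores for the path N, V, read as log-potentials (assumption).
first_edge, second_edge = 3.0, 100.0

product_of_potentials = math.exp(first_edge) * math.exp(second_edge)
potential_of_sum = math.exp(first_edge + second_edge)

# The two agree up to floating-point rounding.
print(math.isclose(product_of_potentials, potential_of_sum, rel_tol=1e-12))
```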
Label bias
- Because transitions are locally normalized, MEMMs prefer low-entropy states
- Difficult to “recover” from mistakes
Global scores
$s(x, y) = \sum_i s(y_i, x_i, y_{i-1})$
Why can’t we just maximize this, period? No competition between different labels!
Global normalization
$$p(y \mid x) = \frac{\exp\{\sum_i s(y_i, x_i, y_{i-1})\}}{\sum_{y'} \exp\{\sum_i s(y'_i, x_i, y'_{i-1})\}}$$
This is a linear-chain Conditional Random Field (CRF)
MEMMs vs CRFs
MEMMs locally normalize, chain together transition probabilities:
$$p(y \mid x) = \prod_{i=1}^{m} p(y_i \mid y_{i-1}, x_1, \ldots, x_m)$$
$$p(y_i \mid y_{i-1}, x_1, \ldots, x_m) = \frac{\exp\big(w \cdot \phi(x_1, \ldots, x_m, y_{i-1}, y_i)\big)}{\sum_{y' \in \mathcal{Y}} \exp\big(w \cdot \phi(x_1, \ldots, x_m, y_{i-1}, y')\big)}$$
CRFs globally normalize
$$p(y \mid x) = \frac{\exp\{\sum_i s(y_i, x_i, y_{i-1})\}}{\sum_{y'} \exp\{\sum_i s(y'_i, x_i, y'_{i-1})\}}$$
Example from Awni Hannun, https://awni.github.io/label-bias/
$$p(y \mid x) = \frac{\exp\{\sum_i s(y_i, x_i, y_{i-1})\}}{\sum_{y'} \exp\{\sum_i s(y'_i, x_i, y'_{i-1})\}}$$
Label set $\mathcal{Y}$; vocabulary $V$ = [cat, sat, the]; x = cat sat
For simplicity assume no self-loops
Candidate label sequences: (A, N), (A, V), (N, V), (N, A)
$Z(x) = \exp(5 + 21) + \exp(5 + 20) + \exp(3 + 100) + \exp(3 + 0) \approx \exp(103)$
$p(N, V \mid \text{cat}, \text{sat}) \approx 1$
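The contrast between the two normalizations can be checked by brute force on this example. The sketch below rests on assumptions: the first score in each sum is taken as a start-to-label edge for "cat" and the second as a label-to-label edge for "sat", and the dictionary names are made up for illustration. The locally normalized numbers it produces are not the 0.27 and 0.1 shown earlier, since those come from the source example's own transition probabilities; only the qualitative reversal is the point.

```python
import math

# Hypothetical edge scores read off the terms of Z(x); self-loops excluded.
start_scores = {"A": 5.0, "N": 3.0}
edge_scores = {("A", "N"): 21.0, ("A", "V"): 20.0,
               ("N", "V"): 100.0, ("N", "A"): 0.0}

def softmax(scores):
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    z = sum(exps.values())
    return {k: v / z for k, v in exps.items()}

# Global (CRF) normalization: a single softmax over whole label sequences.
path_scores = {(a, b): start_scores[a] + s for (a, b), s in edge_scores.items()}
crf_prob = softmax(path_scores)

# Local (MEMM-style) normalization: a softmax at every step, then multiply.
p_first = softmax(start_scores)
memm_prob = {}
for first in start_scores:
    outgoing = {nxt: s for (prev, nxt), s in edge_scores.items() if prev == first}
    for nxt, p in softmax(outgoing).items():
        memm_prob[(first, nxt)] = p_first[first] * p

print("CRF argmax: ", max(crf_prob, key=crf_prob.get))
print("MEMM argmax:", max(memm_prob, key=memm_prob.get))
```

Under these assumed scores the globally normalized model puts essentially all mass on N, V, while the locally normalized one still prefers A, N because the weak first step caps everything that follows.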
Regularization
- This suggests maybe not great calibration; scores too large? exp(103) is really big!
- Important to regularize parameters
Prediction
- Do we actually need to compute Z if we just want to make a prediction?
- No; we just need argmax over y'. How can we compute efficiently? Dynamic programming (from last time)
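A minimal Viterbi sketch for the argmax, under assumptions: scores are stored in plain dictionaries, `edge_scores[i]` already reflects the input at position i + 1, and any missing edge is treated as disallowed. The function name and data layout are illustrative, not the lecture's implementation.

```python
def viterbi(start_scores, edge_scores, labels):
    """Highest-scoring label sequence for a linear chain.

    start_scores[y]        : score of label y at position 0
    edge_scores[i][(a, b)] : score of moving a -> b into position i + 1
    Missing entries are treated as disallowed (score -inf).
    """
    neg_inf = float("-inf")
    best = {y: start_scores.get(y, neg_inf) for y in labels}
    backpointers = []
    for scores in edge_scores:
        new_best, pointers = {}, {}
        for b in labels:
            # Best previous label for ending in b at this position.
            prev = max(labels, key=lambda a: best[a] + scores.get((a, b), neg_inf))
            new_best[b] = best[prev] + scores.get((prev, b), neg_inf)
            pointers[b] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final label.
    last = max(labels, key=lambda y: best[y])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return list(reversed(path)), best[last]

# Usage on the cat/sat scores (assumed start and edge scores).
labels = ["A", "N", "V"]
start = {"A": 5.0, "N": 3.0}
edges = [{("A", "N"): 21.0, ("A", "V"): 20.0, ("N", "V"): 100.0, ("N", "A"): 0.0}]
print(viterbi(start, edges, labels))
```

Z never appears: dividing every path score by the same constant does not change which path is largest.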
Parameter estimation for Linear-Chain CRFs (board)
Example: OCR
https://pystruct.github.io/auto_examples/plot_letters.html
(Notebook)
Beyond linear-chains
Logistic Regression -> (SEQUENCE) -> Linear-chain CRFs -> (General GRAPHS) -> General CRFs
Figure from Sutton and McCallum, 2011
Beyond linear-chains
Figure from Hugo Larochelle, http://info.usherbrooke.ca/hlarochelle/neural_networks/content.html
Grid structure (pixels in image); general pairwise structure (webpages sharing a link)
$$p(y \mid X) = \frac{1}{Z(X)} \prod_f \Psi_f(y, X)$$
Training general CRFs
Figure from Hugo Larochelle, http://info.usherbrooke.ca/hlarochelle/neural_networks/content.html
$$\frac{\partial \big(-\log p(y^{(t)} \mid X^{(t)})\big)}{\partial \theta} = -\Big( \underbrace{\sum_f \frac{\partial}{\partial \theta} \log \Psi_f(y^{(t)}, X^{(t)})}_{\text{make } y^{(t)} \text{ more likely}} - \underbrace{\mathbb{E}_y\Big[ \sum_f \frac{\partial}{\partial \theta} \log \Psi_f(y, X^{(t)}) \,\Big|\, X^{(t)} \Big]}_{\text{make everything less likely}} \Big)$$
Looks similar to what we had for linear-chain, but can no longer use dynamic programming to efficiently take expectation over y
* Here we denote parameters θ instead of w
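To make the two terms concrete, here is a brute-force sketch on a tiny non-chain model. Everything in it is an assumption for illustration: three binary nodes on a cycle, one shared table `theta` with log Ψ_f(y) = theta[(y_u, y_v)] for each edge f = (u, v), and no dependence on X. With this parameterization the gradient of the negative log-likelihood is the expected pair counts minus the observed pair counts, and the expectation enumerates every labeling, which is what becomes infeasible as the graph grows.

```python
import math
from itertools import product

LABELS = (0, 1)
EDGES = [(0, 1), (1, 2), (0, 2)]  # a 3-node cycle, so not a chain

def log_score(theta, y):
    # Sum of log-factors: one shared table entry per edge.
    return sum(theta[(y[u], y[v])] for u, v in EDGES)

def counts(y):
    # How often each label pair occurs on the edges of labeling y.
    c = {pair: 0.0 for pair in product(LABELS, repeat=2)}
    for u, v in EDGES:
        c[(y[u], y[v])] += 1.0
    return c

def nll_and_grad(theta, y_obs):
    all_y = list(product(LABELS, repeat=3))  # |labels| ** nodes terms
    log_scores = [log_score(theta, y) for y in all_y]
    m = max(log_scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in log_scores))
    observed = counts(y_obs)
    expected = {pair: 0.0 for pair in theta}
    for y, s in zip(all_y, log_scores):
        p = math.exp(s - log_z)  # p(y) under the current theta
        for pair, c in counts(y).items():
            expected[pair] += p * c
    nll = log_z - log_score(theta, y_obs)
    # Expected counts push everything down; observed counts push y_obs up.
    grad = {pair: expected[pair] - observed[pair] for pair in theta}
    return nll, grad

theta = {(0, 0): 0.5, (0, 1): -0.2, (1, 0): 0.1, (1, 1): 0.3}
nll, grad = nll_and_grad(theta, (0, 1, 1))
print(nll, grad)
```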
Summary: Structured prediction
- When labels y are correlated (and where, for a given instance, x and y are both tensors), structured prediction models attempt to exploit this
- Hidden Markov Models (HMMs) are a generative approach that models P(x, y)
- Structured perceptrons, MEMMs, and CRFs are conditional models of p(y|x)
- For all, we use dynamic programming (Viterbi) for efficient argmaxing, and variants of this to efficiently compute normalization constants, etc.
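One such variant is the forward algorithm, which replaces Viterbi's max with a log-sum-exp to obtain log Z for a linear chain. The sketch below is a minimal version under assumptions: dictionary-based scores, with missing edges treated as disallowed, and function names chosen here for illustration.

```python
import math

def logsumexp(values):
    m = max(values)
    if m == float("-inf"):  # every term is disallowed
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_partition(start_scores, edge_scores, labels):
    """log Z for a linear chain, summing over all label sequences."""
    neg_inf = float("-inf")
    # alpha[y]: log of the total exponentiated score of prefixes ending in y.
    alpha = {y: start_scores.get(y, neg_inf) for y in labels}
    for scores in edge_scores:
        alpha = {b: logsumexp([alpha[a] + scores.get((a, b), neg_inf)
                               for a in labels])
                 for b in labels}
    return logsumexp(list(alpha.values()))

# Usage on the cat/sat scores (assumed start and edge scores).
labels = ["A", "N", "V"]
start = {"A": 5.0, "N": 3.0}
edges = [{("A", "N"): 21.0, ("A", "V"): 20.0, ("N", "V"): 100.0, ("N", "A"): 0.0}]
print(log_partition(start, edges, labels))
```

The cost is linear in sequence length and quadratic in the number of labels, rather than exponential as in direct enumeration.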