Machine Learning 2, DS 4420 - Spring 2020: Structured prediction, II


slide-1
SLIDE 1

Machine Learning 2

DS 4420 - Spring 2020

Structured prediction, II

Byron C Wallace

slide-2
SLIDE 2

Today

  • From HMMs to MEMMs to CRFs
slide-3
SLIDE 3

Structured output spaces

“Play Kanye West”

[Figure: the input words $x_1, x_2, x_3$, each paired with an output label $y_1, y_2, y_3$.]
slide-4
SLIDE 4

Structured output spaces

Source: http://cocodataset.org/

slide-5
SLIDE 5

Space of problems

Given       Predict                              Type?
An image    Contains a cat?                      Classification
An image    Coordinates that outline all cats    Structured prediction
A tweet     Names in the tweet                   Structured prediction
A tweet     Sentiment in tweet                   Classification

slide-6
SLIDE 6

A generative model of sequences

$$P(X_1 = x_1, \ldots, X_n = x_n, Y_1 = y_1, \ldots, Y_n = y_n) = \prod_{i=1}^{n+1} P(y_i \mid y_{i-1}) \prod_{i=1}^{n} P(x_i \mid y_i)$$

Transition probability: $P(y_i \mid y_{i-1})$

Emission probability: $P(x_i \mid y_i)$
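A minimal sketch of how this factorization is evaluated for one tagged sentence. The tag names, the start/stop symbols standing in for $y_0$ and $y_{n+1}$, and every probability in the tables are made-up assumptions for illustration, not values from the lecture.

```python
# Hypothetical toy HMM: tags, words, and probabilities are invented for illustration.
START, STOP = "<s>", "</s>"

transition = {  # transition[prev][cur] = P(y_i = cur | y_{i-1} = prev)
    START: {"O": 0.8, "ARTIST": 0.2},
    "O": {"O": 0.5, "ARTIST": 0.4, STOP: 0.1},
    "ARTIST": {"O": 0.1, "ARTIST": 0.5, STOP: 0.4},
}
emission = {  # emission[tag][word] = P(x_i = word | y_i = tag)
    "O": {"play": 0.9, "kanye": 0.05, "west": 0.05},
    "ARTIST": {"play": 0.1, "kanye": 0.5, "west": 0.4},
}


def hmm_joint(words, tags):
    """Return P(x_1..x_n, y_1..y_n) under the HMM factorization."""
    assert len(words) == len(tags)
    path = [START] + list(tags) + [STOP]  # y_0 .. y_{n+1}
    prob = 1.0
    # n+1 transition factors: P(y_i | y_{i-1}) for i = 1..n+1
    for prev, cur in zip(path, path[1:]):
        prob *= transition[prev].get(cur, 0.0)
    # n emission factors: P(x_i | y_i) for i = 1..n
    for word, tag in zip(words, tags):
        prob *= emission[tag].get(word, 0.0)
    return prob


joint = hmm_joint(["play", "kanye", "west"], ["O", "ARTIST", "ARTIST"])
```

The two loops mirror the two products: one factor per transition (including the final step into the stop symbol) and one factor per emitted word.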

slide-8
SLIDE 8

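Graphical Model (HMMs)

[Figure: label nodes $y_0, y_1, \ldots, y_5$ linked in a chain; observed nodes $x_1, \ldots, x_5$, each attached to the label node with the same index.]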

slide-9
SLIDE 9

Limitations to HMMs

  • We are restricted to features that have a coherent “generative story”

  • Why bother “modeling” x? It’s given! What we really care about is p(y|x)

slide-11
SLIDE 11

Generative v discriminative

Generative
  • Model joint distribution P(x,y)
  • Can generate new “examples”
  • To predict y, use Bayes’ rule

Discriminative
  • Model conditional distribution P(y|x)
  • Not as amenable to semi-supervised settings; cannot readily “generate” new samples
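
For reference, the Bayes' rule step for a generative model can be written out as follows (a standard identity, not taken from the slide). Because the denominator does not depend on y, predicting with the conditional is the same as maximizing the joint:

$$p(y \mid x) = \frac{p(x, y)}{\sum_{y'} p(x, y')}, \qquad \arg\max_{y} p(y \mid x) = \arg\max_{y} p(x, y)$$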

slide-12
SLIDE 12

Enter Max Entropy Markov Models (MEMMs)

  • These extend standard log-linear models to capture structure in the outputs.

  • A bit like the structured perceptron we introduced last time, but they explicitly model conditional probabilities of labels.

slide-14
SLIDE 14

Log-Linear Models

$$p(y \mid x, w) = \frac{\exp(w \cdot \phi(x, y))}{\sum_{y' \in \mathcal{Y}} \exp(w \cdot \phi(x, y'))}$$

The score $w \cdot \phi(x, y)$ measures plausibility of y given x
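
A small sketch of the log-linear model in code. The feature map `phi`, the label set, and the weights are hypothetical choices made only to have something concrete to normalize; the slide does not specify them.

```python
import math


def phi(x, y):
    """Hypothetical feature map: a (word, label) indicator plus a capitalization cue."""
    return {
        f"word={x.lower()}|label={y}": 1.0,
        f"is_capitalized={x[0].isupper()}|label={y}": 1.0,
    }


def score(w, x, y):
    """w . phi(x, y), with w stored as a sparse dict of feature weights."""
    return sum(w.get(name, 0.0) * value for name, value in phi(x, y).items())


def log_linear(w, x, labels):
    """Return {y: p(y | x, w)} by normalizing exp(score) over all labels."""
    scores = {y: score(w, x, y) for y in labels}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}


labels = ["PER", "O"]
w = {"is_capitalized=True|label=PER": 2.0, "word=the|label=O": 1.5}  # made-up weights
probs = log_linear(w, "Kanye", labels)
```

With these weights the only active feature for "Kanye" is the capitalization cue paired with PER, so PER receives the larger share of the probability mass.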

slide-16
SLIDE 16

Log-likelihood

$$p(y \mid x, w) = \frac{\exp(w \cdot \phi(x, y))}{\sum_{y' \in \mathcal{Y}} \exp(w \cdot \phi(x, y'))}$$

$$LL(w) = \sum_i \log p(y_i \mid x_i, w)$$
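
For reference, differentiating this log-likelihood gives the familiar "observed minus expected features" form. This is a standard result for log-linear models and is not shown on the slide:

$$\frac{\partial LL(w)}{\partial w} = \sum_i \Big( \phi(x_i, y_i) - \sum_{y' \in \mathcal{Y}} p(y' \mid x_i, w)\, \phi(x_i, y') \Big)$$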
slide-19
SLIDE 19

Back to sequences

Want to model conditional probability of sequence of y:

$$\begin{aligned}
p(y_1, y_2, \ldots, y_m \mid x_1, x_2, \ldots, x_m)
&= \prod_{i=1}^{m} p(y_i \mid y_1, \ldots, y_{i-1}, x_1, \ldots, x_m) \\
&= \prod_{i=1}^{m} p(y_i \mid y_{i-1}, x_1, \ldots, x_m)
\end{aligned}$$

The first equality is the chain rule; the second is the (first-order) Markov assumption.

i here used to denote “index”
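
As a concrete case, take m = 3, write $x_{1:3}$ for $x_1, x_2, x_3$, and treat $y_0$ as a fixed start symbol (the shorthand and the start symbol are assumptions added here, not slide content). The chain rule gives the first line exactly; the Markov assumption only changes the last factor:

$$\begin{aligned}
p(y_1, y_2, y_3 \mid x_{1:3}) &= p(y_1 \mid x_{1:3})\, p(y_2 \mid y_1, x_{1:3})\, p(y_3 \mid y_1, y_2, x_{1:3}) \\
&\approx p(y_1 \mid y_0, x_{1:3})\, p(y_2 \mid y_1, x_{1:3})\, p(y_3 \mid y_2, x_{1:3})
\end{aligned}$$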

slide-23
SLIDE 23

MEMMs

Combine

$$p(y_1, y_2, \ldots, y_m \mid x_1, x_2, \ldots, x_m) = \prod_{i=1}^{m} p(y_i \mid y_{i-1}, x_1, \ldots, x_m)$$

With

$$p(y \mid x, w) = \frac{\exp(w \cdot \phi(x, y))}{\sum_{y' \in \mathcal{Y}} \exp(w \cdot \phi(x, y'))}$$

For:

$$p(y_i \mid y_{i-1}, x_1, \ldots, x_m, w) = \frac{\exp(w \cdot \phi(x_1, \ldots, x_m, y_{i-1}, y_i))}{\sum_{y' \in \mathcal{Y}} \exp(w \cdot \phi(x_1, \ldots, x_m, y_{i-1}, y'))}$$
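
A rough sketch of the MEMM as code: a locally normalized log-linear model per position, multiplied along the sequence. The feature map, label set, start symbol, and weights are all hypothetical and chosen only for illustration.

```python
import math

LABELS = ["O", "PER"]


def phi(words, i, prev_label, label):
    """Hypothetical features over the whole input, the position, and the label pair."""
    word = words[i]
    return {
        f"word={word.lower()}|label={label}": 1.0,
        f"capitalized={word[0].isupper()}|label={label}": 1.0,
        f"prev={prev_label}|label={label}": 1.0,
    }


def local_distribution(w, words, i, prev_label):
    """p(y_i | y_{i-1}, x_1..x_m, w), normalized over the candidate labels y'."""
    scores = {
        y: sum(w.get(k, 0.0) * v for k, v in phi(words, i, prev_label, y).items())
        for y in LABELS
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}


def sequence_probability(w, words, labels, start="<s>"):
    """p(y_1..y_m | x_1..x_m) as the product of the local distributions."""
    prob, prev = 1.0, start
    for i, label in enumerate(labels):
        prob *= local_distribution(w, words, i, prev)[label]
        prev = label
    return prob


w = {"capitalized=True|label=PER": 2.0, "prev=PER|label=PER": 1.0}  # made-up weights
words = ["met", "Abiy", "Ahmed"]
p = sequence_probability(w, words, ["O", "PER", "PER"])
```

Because every local distribution sums to one, the probabilities of all label sequences for a given input also sum to one, with no global normalizer needed.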

slide-25
SLIDE 25

Training

What will our training examples look like?

$$p(y_i \mid y_{i-1}, x_1, \ldots, x_m, w) = \frac{\exp(w \cdot \phi(x_1, \ldots, x_m, y_{i-1}, y_i))}{\sum_{y' \in \mathcal{Y}} \exp(w \cdot \phi(x_1, \ldots, x_m, y_{i-1}, y'))}$$

One example per position i:

$$\{y_i, [y_{i-1}, x_1, \ldots, x_m]\}$$
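
A small sketch of how one labeled sequence could be unrolled into such examples. The dictionary layout, the start symbol used for $y_0$, and the tag names are assumptions for illustration.

```python
def training_examples(words, labels, start="<s>"):
    """Turn one labeled sequence into one (y_i, [y_{i-1}, x_1..x_m]) pair per position."""
    examples = []
    prev = start  # assumed start symbol standing in for y_0
    for i, label in enumerate(labels):
        context = {"prev_label": prev, "words": list(words), "position": i}
        examples.append((label, context))
        prev = label  # at training time the gold previous label is used
    return examples


examples = training_examples(["Play", "Kanye", "West"], ["O", "ARTIST", "ARTIST"])
```

Each pair can then be treated as an ordinary multi-class example for the log-linear model, which is why MEMM training reduces to fitting a single classifier.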
slide-27
SLIDE 27

Predicting

$$\hat{y} = \arg\max_{y \in \mathcal{Y}} p(y \mid x, w)$$

Viterbi (see last lecture)!
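
A sketch of Viterbi decoding over locally normalized scores. It assumes a caller-supplied function `local_log_prob` returning $\log p(y_i \mid y_{i-1}, x_1, \ldots, x_m)$; that function, the toy scoring rules, and the label set are hypothetical stand-ins rather than the lecture's implementation.

```python
import math


def viterbi(words, labels, local_log_prob, start="<s>"):
    """Return the label sequence maximizing the sum of local log-probabilities."""
    if not words:
        return []
    # best[i][y] = (score of the best path ending in y at position i, backpointer)
    best = [{y: (local_log_prob(words, 0, start, y), None) for y in labels}]
    for i in range(1, len(words)):
        column = {}
        for y in labels:
            column[y] = max(
                ((best[i - 1][prev][0] + local_log_prob(words, i, prev, y), prev)
                 for prev in labels),
                key=lambda pair: pair[0],
            )
        best.append(column)
    # Follow backpointers from the best final label.
    last = max(labels, key=lambda y: best[-1][y][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]


def toy_local_log_prob(words, i, prev, label):
    """Made-up local model: capitalization suggests PER, and PER tends to follow PER."""
    score = {"O": 0.0, "PER": 0.0}
    if words[i][0].isupper():
        score["PER"] += 2.0
    else:
        score["O"] += 2.0
    if prev == "PER":
        score["PER"] += 1.0
    log_z = math.log(sum(math.exp(s) for s in score.values()))
    return score[label] - log_z


words = ["met", "Abiy", "Ahmed"]
labels = ["O", "PER"]
path = viterbi(words, labels, toy_local_log_prob)
```

The dynamic program visits each (position, label, previous label) triple once, so decoding costs on the order of m times the squared number of labels instead of enumerating every sequence.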

slide-28
SLIDE 28

HMMs v MEMMs

HMM

$$p(y_i \mid y_{i-1})\, p(x_i \mid y_i)$$

MEMM

$$p(y_i \mid y_{i-1}, x_1, \ldots, x_m, w) = \frac{\exp(w \cdot \phi(x_1, \ldots, x_m, y_{i-1}, y_i))}{\sum_{y' \in \mathcal{Y}} \exp(w \cdot \phi(x_1, \ldots, x_m, y_{i-1}, y'))}$$

$\phi$ permits richer representations!
slide-29
SLIDE 29

Feature engineering

$\phi$ permits richer representations!

MEMM

$$p(y_i \mid y_{i-1}, x_1, \ldots, x_m, w) = \frac{\exp(w \cdot \phi(x_1, \ldots, x_m, y_{i-1}, y_i))}{\sum_{y' \in \mathcal{Y}} \exp(w \cdot \phi(x_1, \ldots, x_m, y_{i-1}, y'))}$$

Yesterday secretary of state Mike Pompeo met with Ethiopia’s Prime Minister Abiy Ahmed

Consider NER: What are some potential features here?
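
One possible answer, sketched as code. The specific cues below (capitalization, neighboring words, a possessive suffix, the previous label) are illustrative guesses at useful NER features, not a list given in the lecture.

```python
def ner_features(words, i, prev_label, label):
    """Hypothetical feature map phi(x_1..x_m, y_{i-1}, y_i) for NER, as a sparse dict."""
    word = words[i]
    prev_word = words[i - 1].lower() if i > 0 else "<s>"
    next_word = words[i + 1].lower() if i + 1 < len(words) else "</s>"
    cues = {
        "word": word.lower(),
        "is_capitalized": word[0].isupper(),
        "has_possessive": word.lower().endswith(("'s", "’s")),
        "prev_word": prev_word,    # e.g. "minister" right before a name
        "next_word": next_word,
        "prev_label": prev_label,  # the Markov dependency on y_{i-1}
    }
    # Conjoin every cue with the candidate label so each pair gets its own weight.
    return {f"{name}={value}|label={label}": 1.0 for name, value in cues.items()}


sentence = ("Yesterday secretary of state Mike Pompeo met with "
            "Ethiopia’s Prime Minister Abiy Ahmed").split()
feats = ner_features(sentence, sentence.index("Abiy"), "O", "PER")
```

Features such as the neighboring words look at parts of the input other than $x_i$, which an HMM emission $p(x_i \mid y_i)$ cannot express directly.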

slide-30
SLIDE 30

Feature engineering

Suppose we have some deep neural network that yields embeddings for each word; we could stack a MEMM on top of this. What would reasonable features be here?

$$p(y_i \mid y_{i-1}, x_1, \ldots, x_m, w) = \frac{\exp(w \cdot \phi(x_1, \ldots, x_m, y_{i-1}, y_i))}{\sum_{y' \in \mathcal{Y}} \exp(w \cdot \phi(x_1, \ldots, x_m, y_{i-1}, y'))}$$
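
One possible answer, offered only as a sketch: copy the word's embedding into a label-specific block, so that each label gets its own weight vector over embedding dimensions, and append a one-hot indicator of the label transition. The function name, the layout of the vector, and the start symbol are assumptions.

```python
def embedding_features(embeddings, i, prev_label, label, labels, start="<s>"):
    """Hypothetical phi built from a word embedding and the label pair."""
    dim = len(embeddings[i])
    # Embedding block: e_i placed in the slot for the candidate label, zeros elsewhere.
    block = [0.0] * (dim * len(labels))
    offset = labels.index(label) * dim
    block[offset:offset + dim] = embeddings[i]
    # Transition block: one-hot over (previous label, candidate label) pairs.
    prev_options = [start] + list(labels)
    transitions = [0.0] * (len(prev_options) * len(labels))
    transitions[prev_options.index(prev_label) * len(labels) + labels.index(label)] = 1.0
    return block + transitions


labels = ["O", "PER"]
embeddings = [[0.1, 0.2], [0.3, 0.4]]  # made-up 2-dimensional embeddings for two words
f = embedding_features(embeddings, 1, "O", "PER", labels)
```

With this layout, $w \cdot \phi$ amounts to a per-label linear layer on the embedding plus a learned transition score.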
slide-32
SLIDE 32

The “label bias” problem

$$p(y_i \mid y_{i-1}, x_1, \ldots, x_m) = \frac{\exp(w \cdot \phi(x_1, \ldots, x_m, y_{i-1}, y_i))}{\sum_{y' \in \mathcal{Y}} \exp(w \cdot \phi(x_1, \ldots, x_m, y_{i-1}, y'))}$$

Locally re-normalized; states are competing against each other

McCallum et al., 2001

slide-33
SLIDE 33

The “label bias” problem

$$p(y_i \mid y_{i-1}, x_1, \ldots, x_m) = \frac{\exp(w \cdot \phi(x_1, \ldots, x_m, y_{i-1}, y_i))}{\sum_{y' \in \mathcal{Y}} \exp(w \cdot \phi(x_1, \ldots, x_m, y_{i-1}, y'))}$$

Locally re-normalized; states are competing against each other.

In an extreme case, a particular y may always place all mass on some other y’, ignoring the local input!
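The extreme case can be sketched numerically. The state names and scores below are hypothetical; the point is only that a per-state softmax must hand out exactly one unit of probability, however weak the evidence from the input is.

```python
import math

def local_softmax(scores):
    """Normalize the outgoing scores of one state so that they sum to 1."""
    m = max(scores.values())
    exps = {label: math.exp(s - m) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

# Hypothetical state with a single allowed successor y'. Whatever score the
# current input gives that transition, normalization turns it into probability 1.
weak_evidence = local_softmax({"y_prime": -50.0})
strong_evidence = local_softmax({"y_prime": 50.0})

# A state with two successors still has to spread exactly one unit of mass,
# even though both scores are very low.
two_way = local_softmax({"a": -50.0, "b": -51.0})
```

Both `weak_evidence` and `strong_evidence` assign all mass to `y_prime`, so the observation has no influence on that step.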

slide-34
SLIDE 34

The “label bias” problem

Figure from Awni Hannun, https://awni.github.io/label-bias/

slide-35
SLIDE 35

Example from Awni Hannun, https://awni.github.io/label-bias/

$\mathcal{Y}$

Example

V

slide-36
SLIDE 36

Example from Awni Hannun, https://awni.github.io/label-bias/

x: cat sat

y: N V

V: [cat, sat, the]

slide-37
SLIDE 37

Example from Awni Hannun, https://awni.github.io/label-bias/

x: cat sat

y: N V

V: [cat, sat, the]

p(N, V|cat, sat) = 0.1 · 1.0 = 0.1

???

slide-38
SLIDE 38

Example from Awni Hannun, https://awni.github.io/label-bias/

x: cat sat

y: N V

V: [cat, sat, the]

p(N, V|cat, sat) = 0.1 · 1.0 = 0.1

slide-39
SLIDE 39

Example from Awni Hannun, https://awni.github.io/label-bias/

x: cat sat

y: N V

V: [cat, sat, the]

p(N, V|cat, sat) = 0.1 · 1.0 = 0.1

p(A, N|cat, sat) = 0.9 · 0.3 = 0.27

???

slide-40
SLIDE 40

Example from Awni Hannun, https://awni.github.io/label-bias/

x: cat sat

y: N V

V: [cat, sat, the]

p(N, V|cat, sat) = 0.1 · 1.0 = 0.1

p(A, N|cat, sat) = 0.9 · 0.3 = 0.27

slide-41
SLIDE 41

Example from Awni Hannun, https://awni.github.io/label-bias/

Why is this happening?

p(N, V|cat, sat) = 0.1 · 1.0 = 0.1

p(A, N|cat, sat) = 0.9 · 0.3 = 0.27
slide-42
SLIDE 42

Example from Awni Hannun, https://awni.github.io/label-bias/

Why is this happening?

p(N, V|cat, sat) = 0.1 · 1.0 = 0.1

p(A, N|cat, sat) = 0.9 · 0.3 = 0.27

“cat” rarely seen as first word; poorly calibrated. But the mass has to go somewhere! Why?

slide-43
SLIDE 43

Example from Awni Hannun, https://awni.github.io/label-bias/

Why is this happening?

p(N, V|cat, sat) = 0.1 · 1.0 = 0.1

p(A, N|cat, sat) = 0.9 · 0.3 = 0.27

“cat” rarely seen as first word; poorly calibrated. But the mass has to go somewhere! Why?

Transitions are locally normalized

slide-44
SLIDE 44

Example from Awni Hannun, https://awni.github.io/label-bias/

Hypothetical unnormalized edge scores

slide-45
SLIDE 45

Example from Awni Hannun, https://awni.github.io/label-bias/

Hypothetical unnormalized edge scores

Both are low! We are not confident about what to do with “cat”.

slide-46
SLIDE 46

Example from Awni Hannun, https://awni.github.io/label-bias/

Hypothetical unnormalized edge scores

score(A, N|cat, sat) = 5 + 21 = 26

score(N, V|cat, sat) = 3 + 100 = 103
slide-47
SLIDE 47

Example from Awni Hannun, https://awni.github.io/label-bias/

Hypothetical unnormalized edge scores

score(A, N|cat, sat) = 5 + 21 = 26

score(N, V|cat, sat) = 3 + 100 = 103

Reversed!

slide-48
SLIDE 48

Example from Awni Hannun, https://awni.github.io/label-bias/

Hypothetical unnormalized edge scores

score(A, N|cat, sat) = 5 + 21 = 26

score(N, V|cat, sat) = 3 + 100 = 103

Reversed!

Note: Why add instead of multiply here?
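One way to answer the note, assuming the edge scores are read as log-potentials (the form a globally normalized model exponentiates): adding scores and then exponentiating is the same as multiplying the exponentiated edge potentials, just as a locally normalized model multiplies probabilities.

```latex
\exp\{s_1 + s_2\} = \exp\{s_1\}\,\exp\{s_2\},
\qquad \text{e.g.}\quad \exp\{5 + 21\} = \exp\{5\}\,\exp\{21\}.
```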

slide-49
SLIDE 49

Example from Awni Hannun, https://awni.github.io/label-bias/

Label bias

  • Because transitions are locally normalized, MEMMs prefer low-entropy states

  • Difficult to “recover” from mistakes
slide-50
SLIDE 50

$$s(x, y) = \sum_i s(y_i, x_i, y_{i-1})$$

Global scores

slide-51
SLIDE 51

$$s(x, y) = \sum_i s(y_i, x_i, y_{i-1})$$

Global scores

Why can’t we just maximize this, period?

slide-52
SLIDE 52

$$s(x, y) = \sum_i s(y_i, x_i, y_{i-1})$$

Global scores

Why can’t we just maximize this, period?

No competition between different labels!

slide-53
SLIDE 53

Global normalization

$$p(y \mid x) = \frac{\exp\{\sum_i s(y_i, x_i, y_{i-1})\}}{\sum_{y'} \exp\{\sum_i s(y'_i, x_i, y'_{i-1})\}}$$
slide-54
SLIDE 54

Global normalization

$$p(y \mid x) = \frac{\exp\{\sum_i s(y_i, x_i, y_{i-1})\}}{\sum_{y'} \exp\{\sum_i s(y'_i, x_i, y'_{i-1})\}}$$

This is a linear-chain Conditional Random Field (CRF)
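A small sketch of the normalizer in this model. It compares brute-force enumeration of every label sequence with a forward recursion in log space; `toy_score`, the label set, and the input words are hypothetical stand-ins chosen only to make the example runnable.

```python
import itertools
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def sequence_score(score, labels_seq, xs, start="<s>"):
    """Global score s(x, y) = sum_i s(y_i, x_i, y_{i-1})."""
    total, prev = 0.0, start
    for x, y in zip(xs, labels_seq):
        total += score(y, x, prev)
        prev = y
    return total

def log_partition_brute_force(score, labels, xs):
    """log Z(x) by enumerating every label sequence (exponential in len(xs))."""
    return logsumexp([sequence_score(score, seq, xs)
                      for seq in itertools.product(labels, repeat=len(xs))])

def log_partition_forward(score, labels, xs, start="<s>"):
    """log Z(x) with the forward recursion, O(len(xs) * len(labels)^2)."""
    alpha = {y: score(y, xs[0], start) for y in labels}
    for x in xs[1:]:
        alpha = {y: logsumexp([alpha[p] + score(y, x, p) for p in labels])
                 for y in labels}
    return logsumexp(list(alpha.values()))

def toy_score(y, x, prev):
    """Hypothetical edge score s(y_i, x_i, y_{i-1}); not from the lecture."""
    return 1.5 * x.startswith(y.lower()) + 0.5 * (y != prev)

labels = ["A", "N", "V"]
xs = ["new", "apple", "vanishes"]
log_z_slow = log_partition_brute_force(toy_score, labels, xs)
log_z_fast = log_partition_forward(toy_score, labels, xs)
y = ("N", "A", "V")
log_p = sequence_score(toy_score, y, xs) - log_z_fast  # log p(y | x)
```

The two values of log Z agree; only the forward version remains practical as sequences get longer.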

slide-55
SLIDE 55

MEMMs vs CRFs

MEMMs locally normalize, chain together transition probabilities:

$$p(y \mid x) = \prod_{i=1}^{m} p(y_i \mid y_{i-1}, x_1, \ldots, x_m)$$

$$p(y_i \mid y_{i-1}, x_1, \ldots, x_m) = \frac{\exp(w \cdot \phi(x_1, \ldots, x_m, y_{i-1}, y_i))}{\sum_{y' \in \mathcal{Y}} \exp(w \cdot \phi(x_1, \ldots, x_m, y_{i-1}, y'))}$$
slide-56
SLIDE 56

MEMMs vs CRFs

MEMMs locally normalize, chain together transition probabilities:

$$p(y \mid x) = \prod_{i=1}^{m} p(y_i \mid y_{i-1}, x_1, \ldots, x_m)$$

$$p(y_i \mid y_{i-1}, x_1, \ldots, x_m) = \frac{\exp(w \cdot \phi(x_1, \ldots, x_m, y_{i-1}, y_i))}{\sum_{y' \in \mathcal{Y}} \exp(w \cdot \phi(x_1, \ldots, x_m, y_{i-1}, y'))}$$

CRFs globally normalize

$$p(y \mid x) = \frac{\exp\{\sum_i s(y_i, x_i, y_{i-1})\}}{\sum_{y'} \exp\{\sum_i s(y'_i, x_i, y'_{i-1})\}}$$
slide-57
SLIDE 57

Example from Awni Hannun, https://awni.github.io/label-bias/

$$p(y \mid x) = \frac{\exp\{\sum_i s(y_i, x_i, y_{i-1})\}}{\sum_{y'} \exp\{\sum_i s(y'_i, x_i, y'_{i-1})\}}$$

$\mathcal{Y}$

V: [cat, sat, the]

x = cat sat

Label sequences: (A, N), (A, V), (N, V), (N, A)

For simplicity assume no self-loops

$$Z(x) = \exp(5 + 21) + \exp(5 + 20) + \exp(3 + 100) + \exp(3 + 0)$$
slide-58
SLIDE 58

Example from Awni Hannun, https://awni.github.io/label-bias/

$$p(y \mid x) = \frac{\exp\{\sum_i s(y_i, x_i, y_{i-1})\}}{\sum_{y'} \exp\{\sum_i s(y'_i, x_i, y'_{i-1})\}}$$

$\mathcal{Y}$

V: [cat, sat, the]

x = cat sat

Label sequences: (A, N), (A, V), (N, V), (N, A)

For simplicity assume no self-loops

$$Z(x) = \exp(5 + 21) + \exp(5 + 20) + \exp(3 + 100) + \exp(3 + 0)$$

$$p(N, V) \approx 1$$
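The arithmetic can be checked with a short script. It takes the four path scores from Z(x), reading the last term as the (N, A) path with score 3 + 0, and compares global normalization with a step-by-step local normalization of the same scores. Treating these scores as the pre-normalization scores of a locally normalized model is an assumption made here for illustration, so the local probabilities need not match the 0.9 · 0.3 quoted earlier; only the ranking is of interest.

```python
import math

# Unnormalized edge scores read off Z(x): first edge (given "cat"),
# second edge (given "sat"); no self-loops.
first = {"A": 5.0, "N": 3.0}
second = {("A", "N"): 21.0, ("A", "V"): 20.0, ("N", "V"): 100.0, ("N", "A"): 0.0}

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

# Global normalization (CRF): one partition function over whole paths.
path_score = {(a, b): first[a] + s for (a, b), s in second.items()}
log_z = logsumexp(list(path_score.values()))
p_global = {path: math.exp(s - log_z) for path, s in path_score.items()}

# Local normalization (MEMM-style): renormalize each step separately.
log_z_first = logsumexp(list(first.values()))
p_local = {}
for (a, b), s in second.items():
    log_z_a = logsumexp([t for (prev, _), t in second.items() if prev == a])
    p_local[(a, b)] = math.exp(first[a] - log_z_first) * math.exp(s - log_z_a)

best_global = max(p_global, key=p_global.get)  # path preferred by the CRF
best_local = max(p_local, key=p_local.get)     # path preferred after local normalization
```

Under these assumptions the globally normalized model prefers (N, V) while the locally normalized one prefers (A, N), which is the reversal the label-bias slides describe.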
slide-59
SLIDE 59

Label sequences: (A, N), (A, V), (N, V), (N, A)

$$Z(x) = \exp(5 + 21) + \exp(5 + 20) + \exp(3 + 100) + \exp(3 + 0)$$

$$p(N, V) \approx 1$$

Regularization

  • This suggests maybe not great calibration; scores too large? exp(103) is really big!

  • Important to regularize parameters
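The slide does not say which regularizer is meant. One common choice, shown here as an assumption, is an L2 penalty on the weights added to the negative log-likelihood, where λ is a hypothetical hyperparameter and (x^(t), y^(t)) are training pairs:

```latex
\mathcal{L}(w) = -\sum_{t} \log p\big(y^{(t)} \mid x^{(t)}; w\big) + \frac{\lambda}{2}\,\lVert w \rVert_2^2
```

Penalizing large weights keeps the edge scores, and hence terms like exp(103), from growing without bound.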
slide-60
SLIDE 60

Label sequences: (A, N), (A, V), (N, V), (N, A)

$$Z(x) = \exp(5 + 21) + \exp(5 + 20) + \exp(3 + 100) + \exp(3 + 0)$$

$$p(N, V) \approx 1$$

Prediction

  • Do we actually need to compute Z if we just want to make a prediction?

slide-61
SLIDE 61

Label sequences: (A, N), (A, V), (N, V), (N, A)

$$Z(x) = \exp(5 + 21) + \exp(5 + 20) + \exp(3 + 100) + \exp(3 + 0)$$

$$p(N, V) \approx 1$$

Prediction

  • Do we actually need to compute Z if we just want to make a prediction?

  • No; we just need argmax over y’. How can we compute it efficiently?

slide-62
SLIDE 62

Prediction

  • How can we compute the argmax over y' efficiently? Dynamic programming (from last time)
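As a rough illustration of the dynamic-programming answer, here is a small Viterbi sketch for a linear-chain model. The score layout (`unary[t][k]` for label k at position t, `pairwise[j][k]` for a transition from j to k) and the function names are assumptions made for this example, not the notation used on the board. A brute-force search over all K^T sequences is included only as a sanity check.

```python
import itertools

def sequence_score(y, unary, pairwise):
    """Total score of one label sequence: unary terms plus transition terms."""
    s = sum(unary[t][y[t]] for t in range(len(y)))
    return s + sum(pairwise[y[t - 1]][y[t]] for t in range(1, len(y)))

def viterbi(unary, pairwise):
    """Return (best label sequence, its score) in O(T * K^2) time."""
    T, K = len(unary), len(unary[0])
    best = [list(unary[0])]  # best[t][k]: top score of any prefix ending in label k
    back = []                # back[t-1][k]: label at t-1 that achieves best[t][k]
    for t in range(1, T):
        row, ptr = [], []
        for k in range(K):
            j = max(range(K), key=lambda j: best[-1][j] + pairwise[j][k])
            row.append(best[-1][j] + pairwise[j][k] + unary[t][k])
            ptr.append(j)
        best.append(row)
        back.append(ptr)
    # Follow back-pointers from the best final label.
    k = max(range(K), key=lambda k: best[-1][k])
    path = [k]
    for ptr in reversed(back):
        k = ptr[k]
        path.append(k)
    path.reverse()
    return path, max(best[-1])

def brute_force(unary, pairwise):
    """Exhaustive argmax over all K^T sequences (for checking only)."""
    T, K = len(unary), len(unary[0])
    y = max(itertools.product(range(K), repeat=T),
            key=lambda y: sequence_score(y, unary, pairwise))
    return list(y), sequence_score(y, unary, pairwise)
```

With made-up scores such as `unary = [[1.0, 0.0], [0.0, 0.5], [0.2, 0.1]]` and `pairwise = [[0.0, 2.0], [1.0, 0.0]]`, both functions should agree on the same best sequence.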

slide-63
SLIDE 63

Parameter estimation for Linear-Chain CRFs (board)

slide-64
SLIDE 64

Example: OCR

https://pystruct.github.io/auto_examples/plot_letters.html

(Notebook)

slide-65
SLIDE 65

Beyond linear-chains

[Figure: Logistic Regression → (sequence) → Linear-chain CRFs → (general graphs) → General CRFs. From Sutton and McCallum, 2011]

slide-68
SLIDE 68

Beyond linear-chains

Figure from Hugo Larochelle, http://info.usherbrooke.ca/hlarochelle/neural_networks/content.html

  • Grid structure (pixels in an image)

  • General pairwise structure (webpages sharing a link)

$$p(y \mid X) = \frac{1}{Z(X)} \prod_f \Psi_f(y, X)$$
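To make the factorized form concrete, the sketch below evaluates p(y|X) by brute force on a tiny, made-up model: three binary labels connected in a cycle (so not a chain), with one unary factor per node and one pairwise factor per edge. The graph, the observations `x`, the coupling strength, and all function names are hypothetical choices for illustration only.

```python
import itertools
import math

edges = [(0, 1), (1, 2), (2, 0)]   # hypothetical 3-node cycle
x = [2.0, -1.0, 1.0]               # made-up observations, one per node
coupling = 1.0                     # made-up reward for neighbors that agree

def unary_factor(i, y):
    """Psi for node i: exp(x_i) if the label is 1, else exp(0)."""
    return math.exp(x[i] * y[i])

def pairwise_factor(i, j, y):
    """Psi for edge (i, j): larger when the two labels agree."""
    return math.exp(coupling if y[i] == y[j] else 0.0)

def unnormalized(y):
    """Product of all factors for one joint labeling y."""
    nodes = math.prod(unary_factor(i, y) for i in range(len(x)))
    pairs = math.prod(pairwise_factor(i, j, y) for i, j in edges)
    return nodes * pairs

# Z(X) sums the factor product over every joint labeling (2^3 here).
labelings = list(itertools.product([0, 1], repeat=len(x)))
Z = sum(unnormalized(y) for y in labelings)
p = {y: unnormalized(y) / Z for y in labelings}
most_likely = max(p, key=p.get)
```

Enumeration is only feasible because the example has eight labelings; the number of terms in Z(X) grows exponentially with the number of labels.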

slide-69
SLIDE 69

Training general CRFs

Figure from Hugo Larochelle, http://info.usherbrooke.ca/hlarochelle/neural_networks/content.html

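$$\frac{\partial \left( -\log p(y^{(t)} \mid X^{(t)}) \right)}{\partial \theta} = -\Bigg( \underbrace{\sum_f \frac{\partial}{\partial \theta} \log \Psi_f(y^{(t)}, X^{(t)})}_{\text{make } y^{(t)} \text{ more likely}} - \underbrace{\mathbb{E}_y \Big[ \sum_f \frac{\partial}{\partial \theta} \log \Psi_f(y, X^{(t)}) \,\Big|\, X^{(t)} \Big]}_{\text{make everything less likely}} \Bigg)$$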

This looks similar to what we had for linear-chain CRFs, but we can no longer use dynamic programming to efficiently take the expectation over y.

* Here we denote the parameters by θ instead of w.
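The gradient on this slide can be checked numerically on a toy model. The sketch below assumes log-linear factors, so that the sum over factors of log Ψ_f(y, X) equals θ · F(y, X) for a summed feature vector F; under that assumption the gradient of the negative log-likelihood is the expected features minus the observed features. The cycle graph, the features, and the data are all made up, and the expectation is computed by brute-force enumeration, which is exactly the step that does not scale to general graphs.

```python
import itertools
import math

edges = [(0, 1), (1, 2), (2, 0)]   # hypothetical 3-node cycle
x = [2.0, -1.0, 1.0]               # made-up observations
y_obs = (1, 0, 1)                  # made-up gold labeling

def features(y):
    """Summed factor features F(y, x): [unary evidence, number of agreeing edges]."""
    unary = sum(x_i * y_i for x_i, y_i in zip(x, y))
    pair = sum(1.0 for i, j in edges if y[i] == y[j])
    return [unary, pair]

def score(theta, y):
    return sum(t * f for t, f in zip(theta, features(y)))

labelings = list(itertools.product([0, 1], repeat=len(x)))

def nll(theta):
    """-log p(y_obs | x) = log Z - score(y_obs)."""
    log_Z = math.log(sum(math.exp(score(theta, y)) for y in labelings))
    return log_Z - score(theta, y_obs)

def nll_gradient(theta):
    """Expected features under the model minus features of the observed labeling."""
    Z = sum(math.exp(score(theta, y)) for y in labelings)
    expected = [0.0, 0.0]
    for y in labelings:
        prob = math.exp(score(theta, y)) / Z
        for k, f in enumerate(features(y)):
            expected[k] += prob * f
    return [e - o for e, o in zip(expected, features(y_obs))]

def numeric_gradient(theta, eps=1e-6):
    """Central finite differences, used only to check nll_gradient."""
    grad = []
    for k in range(len(theta)):
        up, down = list(theta), list(theta)
        up[k] += eps
        down[k] -= eps
        grad.append((nll(up) - nll(down)) / (2 * eps))
    return grad
```

Comparing `nll_gradient(theta)` with `numeric_gradient(theta)` at any parameter vector, for example `[0.3, -0.2]`, is a quick way to confirm the sign conventions in the formula.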

slide-70
SLIDE 70

Summary: Structured prediction

  • When the labels y are correlated (and, for a given instance, x and y are both tensors), structured prediction models attempt to exploit this correlation

slide-71
SLIDE 71

Summary: Structured prediction

  • Hidden Markov Models (HMMs) are a generative approach that models P(x, y)

slide-72
SLIDE 72

Summary: Structured prediction

  • Structured perceptrons, MEMMs, and CRFs are conditional models: they model p(y|x)

slide-73
SLIDE 73

Summary: Structured prediction

  • For all of these, we use dynamic programming (Viterbi) to compute the argmax efficiently, and variants of it to efficiently compute normalization constants, etc.
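One such variant can be sketched as follows: replacing the max in the Viterbi recursion with a log-sum-exp gives the forward algorithm, which computes the log normalization constant of a linear-chain model in O(T · K²) time. The score layout (`unary[t][k]`, `pairwise[j][k]`) and the function names are assumptions for this example; the brute-force version is included only to check the result on small inputs.

```python
import itertools
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_partition(unary, pairwise):
    """Forward algorithm: the Viterbi recursion with max replaced by log-sum-exp."""
    K = len(unary[0])
    alpha = list(unary[0])  # alpha[k]: log-sum of scores of all prefixes ending in k
    for t in range(1, len(unary)):
        alpha = [
            unary[t][k] + log_sum_exp([alpha[j] + pairwise[j][k] for j in range(K)])
            for k in range(K)
        ]
    return log_sum_exp(alpha)

def log_partition_brute_force(unary, pairwise):
    """Sum over all K^T sequences explicitly (for checking only)."""
    T, K = len(unary), len(unary[0])
    scores = []
    for y in itertools.product(range(K), repeat=T):
        s = sum(unary[t][y[t]] for t in range(T))
        s += sum(pairwise[y[t - 1]][y[t]] for t in range(1, T))
        scores.append(s)
    return log_sum_exp(scores)
```

On a short chain the two functions should return the same value up to floating-point error, while only `log_partition` remains practical as the sequence length grows.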