Algorithms for NLP, CS 11-711, Fall 2020. Lecture 8: Viterbi, discriminative sequence labeling, NER



slide-1
SLIDE 1

Emma Strubell

Algorithms for NLP

CS 11-711 · Fall 2020

Lecture 8: Viterbi, discriminative sequence labeling, NER

slide-2
SLIDE 2

Announcements

■ Project 1 is due tomorrow! You may submit up to 3 days late (out of a budget of 5 total for the semester).

■ No recitation tomorrow (Friday). Do your homework.

slide-3
SLIDE 3

Recap

Hidden Markov models (HMMs)

[Figure: a three-state HMM with states VB (1), MD (2), and NN (3). Every pair of states, including each state with itself, is linked by a transition probability a_ij (a11, a12, a13, a21, a22, a23, a31, a32, a33). Each state carries an emission distribution over the whole vocabulary: B1 = P("aardvark" | VB), ..., P("will" | VB), ..., P("the" | VB), ..., P("back" | VB), ..., P("zebra" | VB); B2 is the same list conditioned on MD; B3 is the same list conditioned on NN.]

slide-4
SLIDE 4

Recap

Hidden Markov models (HMMs)

$Q = q_1 q_2 \ldots q_N$: a set of $N$ states.

$A = a_{11} \ldots a_{ij} \ldots a_{NN}$: a transition probability matrix $A$, each $a_{ij}$ representing the probability of moving from state $i$ to state $j$, s.t. $\sum_{j=1}^{N} a_{ij} = 1 \;\; \forall i$.

$O = o_1 o_2 \ldots o_T$: a sequence of $T$ observations, each one drawn from a vocabulary $V = v_1, v_2, \ldots, v_V$.

$B = b_i(o_t)$: a sequence of observation likelihoods, also called emission probabilities, each expressing the probability of an observation $o_t$ being generated from a state $q_i$.

$\pi = \pi_1, \pi_2, \ldots, \pi_N$: an initial probability distribution over states. $\pi_i$ is the probability that the Markov chain will start in state $i$. Some states $j$ may have $\pi_j = 0$, meaning that they cannot be initial states. Also, $\sum_{i=1}^{N} \pi_i = 1$.
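As a minimal sketch of how these components might be stored, assuming plain Python dictionaries: the names `states`, `A`, `B`, `pi`, `is_distribution`, and `validate_hmm` are illustrative rather than taken from the lecture, and every probability is made up. The check mirrors the sum-to-one constraints on A and π, and applies the same constraint to each state's emission distribution.

```python
import math

# Toy tag set; all numbers below are invented for illustration only.
states = ["MD", "VB", "NN"]

# A[i][j] = P(next state j | current state i); each row must sum to 1.
A = {
    "MD": {"MD": 0.1, "VB": 0.8, "NN": 0.1},
    "VB": {"MD": 0.1, "VB": 0.1, "NN": 0.8},
    "NN": {"MD": 0.4, "VB": 0.3, "NN": 0.3},
}

# B[i][w] = P(word w | state i) over a tiny three-word vocabulary.
B = {
    "MD": {"will": 0.9, "back": 0.05, "bill": 0.05},
    "VB": {"will": 0.1, "back": 0.6, "bill": 0.3},
    "NN": {"will": 0.2, "back": 0.3, "bill": 0.5},
}

# pi[i] = P(first state is i).
pi = {"MD": 0.3, "VB": 0.2, "NN": 0.5}


def is_distribution(probs, tol=1e-9):
    """True if all values are non-negative and sum to 1 (within tol)."""
    values = list(probs.values())
    return all(v >= 0 for v in values) and math.isclose(sum(values), 1.0, abs_tol=tol)


def validate_hmm(states, A, B, pi):
    """Check the sum-to-one constraints on pi, each row of A, and each row of B."""
    return (
        is_distribution(pi)
        and all(is_distribution(A[s]) for s in states)
        and all(is_distribution(B[s]) for s in states)
    )


print(validate_hmm(states, A, B, pi))
```

Nested dictionaries keep the sketch readable; a dense matrix indexed by integer state ids would be the more usual choice at real vocabulary sizes.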

slide-7
SLIDE 7

Recap

Hidden Markov models (HMMs)

The three HMM problems and the algorithm for each:

■ Likelihood: Forward

■ Decoding: Viterbi

■ Learning: Forward-backward; Baum-Welch

slide-9
SLIDE 9

HMM tagging as decoding

■ Decoding: Given as input an HMM λ = (A, B) and a sequence of observations $O = o_1, o_2, \ldots, o_n$, find the most probable sequence of states $Q = q_1, q_2, \ldots, q_n$.

[Figure: a chain of hidden states $q_1, q_2, \ldots, q_n$, each emitting the observation $o_1, o_2, \ldots, o_n$ below it.]

For tagging, the observations are the words $w_1^n$ and the hidden states are the tags $t_1^n$:

$$\hat{t}_1^n = \operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n)$$
slide-14
SLIDE 14

HMM tagging as decoding

■ Decoding: Given as input an HMM λ = (A, B) and a sequence of observations $O = o_1, o_2, \ldots, o_n$, find the most probable sequence of states $Q = q_1, q_2, \ldots, q_n$.

$$
\begin{aligned}
\hat{t}_1^n &= \operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n) \\
&= \operatorname*{argmax}_{t_1^n} \frac{P(w_1^n \mid t_1^n)\, P(t_1^n)}{P(w_1^n)} \\
&= \operatorname*{argmax}_{t_1^n} P(w_1^n \mid t_1^n)\, P(t_1^n)
\end{aligned}
$$

Simplifying assumptions:

$$P(w_1^n \mid t_1^n) \approx \prod_{i=1}^{n} P(w_i \mid t_i)$$

$$P(t_1^n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})$$
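Under the two simplifying assumptions above, scoring one candidate tag sequence reduces to a product of per-position emission and transition terms. The sketch below is an illustration, not code from the lecture: `joint_probability`, `A`, `B`, and `pi` are hypothetical names, the numbers are made up, and `pi` is assumed to play the role of $P(t_1 \mid t_0)$ for the first tag.

```python
import math

# Toy model; all probabilities are invented for illustration.
A = {  # A[prev][cur] = P(t_i = cur | t_{i-1} = prev)
    "MD": {"MD": 0.1, "VB": 0.8, "NN": 0.1},
    "VB": {"MD": 0.1, "VB": 0.1, "NN": 0.8},
    "NN": {"MD": 0.4, "VB": 0.3, "NN": 0.3},
}
B = {  # B[tag][word] = P(w_i = word | t_i = tag)
    "MD": {"will": 0.9, "back": 0.05, "bill": 0.05},
    "VB": {"will": 0.1, "back": 0.6, "bill": 0.3},
    "NN": {"will": 0.2, "back": 0.3, "bill": 0.5},
}
pi = {"MD": 0.3, "VB": 0.2, "NN": 0.5}  # assumed stand-in for P(t_1 | t_0)


def joint_probability(words, tags, A, B, pi):
    """Approximate P(words, tags) as prod_i P(w_i | t_i) * P(t_i | t_{i-1})."""
    prob = 1.0
    prev = None
    for word, tag in zip(words, tags):
        transition = pi[tag] if prev is None else A[prev][tag]
        prob *= transition * B[tag][word]
        prev = tag
    return prob


score = joint_probability(["will", "back", "bill"], ["MD", "VB", "NN"], A, B, pi)
print(score)
```

Decoding then amounts to finding the tag sequence that maximizes this score.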
slide-16
SLIDE 16

HMM tagging as decoding

■ Decoding: Given as input an HMM λ = (A, B) and a sequence of observations $O = o_1, o_2, \ldots, o_n$, find the most probable sequence of states $Q = q_1, q_2, \ldots, q_n$.

$$\hat{t}_1^n = \operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n) \approx \operatorname*{argmax}_{t_1^n} \prod_{i=1}^{n} \underbrace{P(w_i \mid t_i)}_{\text{emission, } B}\; \underbrace{P(t_i \mid t_{i-1})}_{\text{transition, } A}$$

How many possible choices?
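One way to make the question concrete: with N tags and n words there are N^n candidate tag sequences, so exhaustive search grows exponentially with sentence length. The sketch below enumerates them for a toy model; `brute_force_decode` and the model dictionaries are hypothetical names with made-up numbers, and a non-empty input is assumed.

```python
import math
from itertools import product

# Toy model; all probabilities are invented for illustration.
states = ["MD", "VB", "NN"]
A = {
    "MD": {"MD": 0.1, "VB": 0.8, "NN": 0.1},
    "VB": {"MD": 0.1, "VB": 0.1, "NN": 0.8},
    "NN": {"MD": 0.4, "VB": 0.3, "NN": 0.3},
}
B = {
    "MD": {"will": 0.9, "back": 0.05, "bill": 0.05},
    "VB": {"will": 0.1, "back": 0.6, "bill": 0.3},
    "NN": {"will": 0.2, "back": 0.3, "bill": 0.5},
}
pi = {"MD": 0.3, "VB": 0.2, "NN": 0.5}


def brute_force_decode(observations, states, A, B, pi):
    """Score every tag sequence; return (best_path, best_prob, n_candidates).

    Assumes observations is non-empty.
    """
    best_path, best_prob, n_candidates = None, -1.0, 0
    for tags in product(states, repeat=len(observations)):
        n_candidates += 1
        prob = pi[tags[0]] * B[tags[0]][observations[0]]
        for i in range(1, len(tags)):
            prob *= A[tags[i - 1]][tags[i]] * B[tags[i]][observations[i]]
        if prob > best_prob:
            best_path, best_prob = list(tags), prob
    return best_path, best_prob, n_candidates


path, prob, n_candidates = brute_force_decode(["will", "back", "bill"], states, A, B, pi)
print(path, prob, n_candidates)  # 3 tags, 3 words: 3**3 = 27 candidates
```

Even at this size the loop visits 27 sequences; a 45-tag set on a 20-word sentence would need 45^20, which is why a dynamic program is used instead.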

slide-17
SLIDE 17

Part-of-speech tagging example

Slide credit: Noah Smith

slide-18
SLIDE 18

The Viterbi algorithm

[Figure: Viterbi lattice for the sentence "Janet will back the bill". Each word position has a column of candidate tag states drawn from NNP, MD, VB, JJ, NN, RB, and DT.]

slide-22
SLIDE 22

The Viterbi algorithm

$$v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t)$$

■ $v_{t-1}(i)$: previous Viterbi path probability

■ $a_{ij}$: transition probability

■ $b_j(o_t)$: state observation likelihood

[Figure: Viterbi lattice for the sentence "Janet will back the bill", with candidate tag states (NNP, MD, VB, JJ, NN, RB, DT) at each word position.]
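A single application of the recurrence $v_t(j) = \max_i v_{t-1}(i)\, a_{ij}\, b_j(o_t)$ can be sketched as one column update of the lattice. The function name `viterbi_step`, the dictionary layout, and all numbers are assumptions made for illustration.

```python
import math

# Toy model; all probabilities are invented for illustration.
states = ["MD", "VB", "NN"]
A = {  # A[i][j] = a_ij, transition probability from state i to state j
    "MD": {"MD": 0.1, "VB": 0.8, "NN": 0.1},
    "VB": {"MD": 0.1, "VB": 0.1, "NN": 0.8},
    "NN": {"MD": 0.4, "VB": 0.3, "NN": 0.3},
}
B = {  # B[j][o] = b_j(o), state observation likelihood
    "MD": {"will": 0.9, "back": 0.05, "bill": 0.05},
    "VB": {"will": 0.1, "back": 0.6, "bill": 0.3},
    "NN": {"will": 0.2, "back": 0.3, "bill": 0.5},
}


def viterbi_step(prev_v, word, states, A, B):
    """Compute v_t(j) and the best predecessor of j for every state j."""
    new_v, backpointer = {}, {}
    for j in states:
        # The emission term does not depend on i, so it can stay outside the max.
        best_i = max(states, key=lambda i: prev_v[i] * A[i][j])
        new_v[j] = prev_v[best_i] * A[best_i][j] * B[j][word]
        backpointer[j] = best_i
    return new_v, backpointer


# v_1 for the first word (assumed already computed as pi_s * b_s(o_1)).
prev_v = {"MD": 0.27, "VB": 0.02, "NN": 0.10}
new_v, backpointer = viterbi_step(prev_v, "back", states, A, B)
print(new_v, backpointer)
```

Each step costs N^2 multiplications, so filling the whole lattice for T observations takes on the order of N^2 · T work.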

slide-27
SLIDE 27

The Viterbi algorithm

function VITERBI(observations of len T, state-graph of len N) returns best-path, path-prob
  create a path probability matrix viterbi[N, T]
  for each state s from 1 to N do ; initialization step
    viterbi[s, 1] ← π_s * b_s(o_1)
    backpointer[s, 1] ← 0
  for each time step t from 2 to T do ; recursion step
    for each state s from 1 to N do
      viterbi[s, t] ← max_{s'=1..N} viterbi[s', t-1] * a_{s',s} * b_s(o_t)
      backpointer[s, t] ← argmax_{s'=1..N} viterbi[s', t-1] * a_{s',s} * b_s(o_t)
  bestpathprob ← max_{s=1..N} viterbi[s, T] ; termination step
  bestpathpointer ← argmax_{s=1..N} viterbi[s, T] ; termination step
  bestpath ← the path starting at state bestpathpointer, that follows backpointer[] to states back in time
  return bestpath, bestpathprob

initialization, recursion, termination

$$v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t)$$
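The three steps of the pseudocode can be sketched in Python roughly as follows. This is a minimal illustration, not the course's reference implementation: the function name `viterbi`, the 0-based indexing, and the toy two-state HMM at the bottom are assumptions introduced here.

```python
def viterbi(obs, pi, A, B):
    """Return (best_path, best_prob) for an HMM.

    obs: list of observation indices, length T
    pi:  pi[s] = initial probability of state s
    A:   A[i][j] = probability of moving from state i to state j
    B:   B[s][o] = probability that state s emits observation o
    """
    N, T = len(pi), len(obs)
    v = [[0.0] * T for _ in range(N)]   # path probability matrix viterbi[N, T]
    back = [[0] * T for _ in range(N)]  # backpointers

    # initialization step
    for s in range(N):
        v[s][0] = pi[s] * B[s][obs[0]]

    # recursion step
    for t in range(1, T):
        for s in range(N):
            # b_s(o_t) does not depend on the predecessor, so the argmax
            # can be taken over v[s'][t-1] * a[s'][s] alone
            best_prev = max(range(N), key=lambda sp: v[sp][t - 1] * A[sp][s])
            v[s][t] = v[best_prev][t - 1] * A[best_prev][s] * B[s][obs[t]]
            back[s][t] = best_prev

    # termination step
    last = max(range(N), key=lambda s: v[s][T - 1])
    best_prob = v[last][T - 1]

    # follow backpointers back in time
    path = [last]
    for t in range(T - 1, 0, -1):
        path.append(back[path[-1]][t])
    path.reverse()
    return path, best_prob


# Toy example (hypothetical numbers): 2 states, 3 observation symbols.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
path, prob = viterbi([0, 1, 2], pi, A, B)
print(path, prob)
```

For long sequences the products underflow, so a practical variant would usually add log probabilities instead of multiplying.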

slide-28
SLIDE 28

The Viterbi algorithm

[Figure: tag lattice for "Janet will back the bill" over the candidate tags NNP, MD, VB, JJ, NN, RB, DT]

$$v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t)$$

        NNP     MD      VB      JJ      NN      RB      DT
<s>     0.2767  0.0006  0.0031  0.0453  0.0449  0.0510  0.2026
NNP     0.3777  0.0110  0.0009  0.0084  0.0584  0.0090  0.0025
MD      0.0008  0.0002  0.7968  0.0005  0.0008  0.1698  0.0041
VB      0.0322  0.0005  0.0050  0.0837  0.0615  0.0514  0.2231
JJ      0.0366  0.0004  0.0001  0.0733  0.4509  0.0036  0.0036
NN      0.0096  0.0176  0.0014  0.0086  0.1216  0.0177  0.0068
RB      0.0068  0.0102  0.1011  0.1012  0.0120  0.0728  0.0479
DT      0.1147  0.0021  0.0002  0.2157  0.4744  0.0102  0.0017

Figure 8.7 The A transition probabilities P(t_i | t_{i-1}) computed from the WSJ corpus without ...

        Janet     will      back      the       bill
NNP     0.000032  0         0         0.000048  0
MD      0         0.308431  0         0         0
VB      0         0.000028  0.000672  0         0.000028
JJ      0         0         0.000340  0         0
NN      0         0.000200  0.000223  0         0.002337
RB      0         0         0.010446  0         0
DT      0         0         0         0.506099  0

Figure 8.8 Observation likelihoods B computed from the WSJ corpus without smoothing, ...

slide-31
SLIDE 31

[Figure: Viterbi trellis for "Janet will back the bill" (observations o1 ... o5 along the time axis t), with states q1 = NNP, q2 = MD, q3 = VB, q4 = JJ, q5 = NN, q6 = RB, q7 = DT, a start node, and the backtrace marked]

Initial probabilities π:
P(NNP|start) = .28
P(MD|start) = .0006
P(VB|start) = .0031
P(JJ|start) = .045

Column 1 (Janet):
v1(1) = .28 * .000032 = .000009
v1(2) = .0006 * 0 = 0
v1(3) = .0031 * 0 = 0
v1(4) = .045 * 0 = 0
v1(5), v1(6), v1(7): values not shown

Column 2 (will), candidates for v2(2):
v1(1) * P(MD|NNP) = .000009 * .01 = 9e-8
v1(2) * P(MD|MD) = 0
v1(3) * P(MD|VB) = 0
v1(4) * P(MD|JJ) = 0

v2(2) = max * .308 = 2.772e-8
v2(3) = max * .000028 = 2.5e-13
v2(5) = max * .0002 = .0000000001
v2(1), v2(4), v2(6), v2(7): values not shown

Column 3 (back):
v3(3) = max * .00067
v3(4) = max * .00034
v3(5) = max * .000223
v3(6) = max * .0104
(incoming arcs shown include * P(NN|NN) and * P(RB|NN))
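As a rough check on the trellis, the first two columns can be recomputed from the unrounded table entries (π from the <s> row of the transition table, emissions from the observation-likelihood table). The variable names are chosen here for illustration. Because the slide rounds intermediate values (.28, .000009, .01, .308), its numbers differ slightly from the unrounded ones printed below.

```python
tags = ["NNP", "MD", "VB", "JJ", "NN", "RB", "DT"]

# <s> row of the transition table: initial probabilities pi
pi = [0.2767, 0.0006, 0.0031, 0.0453, 0.0449, 0.0510, 0.2026]
# NNP row of the transition table: P(next tag | NNP)
a_from_nnp = [0.3777, 0.0110, 0.0009, 0.0084, 0.0584, 0.0090, 0.0025]
# emission probabilities for the first two words, one entry per tag
b_janet = [0.000032, 0, 0, 0, 0, 0, 0]
b_will = [0, 0.308431, 0.000028, 0, 0.000200, 0, 0]

# column 1: v1(j) = pi_j * b_j(Janet)
v1 = [p * b for p, b in zip(pi, b_janet)]

# column 2: only NNP has a nonzero v1, so the max over previous states
# reduces to v1(NNP) * a(NNP -> j) * b_j(will)
v2 = [v1[0] * a * b for a, b in zip(a_from_nnp, b_will)]

for tag, x, y in zip(tags, v1, v2):
    print(f"{tag:>3}  v1={x:.3e}  v2={y:.3e}")
```

The largest entry in the second column is the MD cell, which agrees with the trellis, where "will" is most probably a modal after NNP.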

slide-33
SLIDE 33

The Viterbi algorithm

Computational complexity in N and T?
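One way to approach the question is to count the work done by each loop of the pseudocode. This is a sketch and is not taken from the slides: initialization touches N cells, the recursion fills N cells for each of the remaining T-1 time steps with a max over N predecessors per cell, and termination scans N cells.

```latex
\underbrace{N}_{\text{initialization}}
\;+\; \underbrace{(T-1)\cdot N \cdot N}_{\text{recursion}}
\;+\; \underbrace{N}_{\text{termination}}
\;=\; O(N^2 T) \ \text{time},
\qquad
O(NT) \ \text{space for the two } N \times T \text{ tables}
```

Following the backpointers adds only O(T) steps, so it does not change the total.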

slide-34
SLIDE 34

Beam search

[Figure: tag lattice for "Janet will back the bill" over the candidate tags NNP, MD, VB, JJ, NN, RB, DT]
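The slide shows only the lattice, so what follows is a hedged sketch of one common way to apply beam search to HMM decoding: run the Viterbi recursion, but keep only the `beam` highest-scoring states from each column. The function name `beam_viterbi`, its parameters, and the toy HMM are assumptions made for illustration.

```python
def beam_viterbi(obs, pi, A, B, beam):
    """Viterbi-style decoding that keeps at most `beam` states per time step."""
    N = len(pi)
    # state -> (score of best surviving path ending here, that path)
    scores = {s: (pi[s] * B[s][obs[0]], [s]) for s in range(N)}
    for o in obs[1:]:
        # prune: keep only the `beam` best states from the previous column
        kept = sorted(scores.items(), key=lambda kv: kv[1][0], reverse=True)[:beam]
        new_scores = {}
        for s in range(N):
            # best surviving predecessor for state s
            prev, (p, path) = max(kept, key=lambda kv: kv[1][0] * A[kv[0]][s])
            new_scores[s] = (p * A[prev][s] * B[s][o], path + [s])
        scores = new_scores
    best_state = max(scores, key=lambda s: scores[s][0])
    prob, path = scores[best_state]
    return path, prob


# Toy example (hypothetical numbers): 2 states, 3 observation symbols.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
print(beam_viterbi([0, 1, 2], pi, A, B, beam=1))
print(beam_viterbi([0, 1, 2], pi, A, B, beam=2))
```

With `beam` equal to the number of states this is exact Viterbi. Smaller beams consider only `beam` predecessors per cell, which saves time but can prune the path that would have turned out best.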

slide-35
SLIDE 35

Hidden Markov models

Forward
Viterbi
Forward-backward; Baum-Welch

slide-37
SLIDE 37

The forward algorithm

[Figure: tag lattice for "Janet will back the bill" over the candidate tags NNP, MD, VB, JJ, NN, RB, DT]

■ Just sum instead of max!

$$\alpha_t(j) = \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(o_t)$$
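The recursion translates almost line for line into code. The sketch below assumes the same list-based HMM layout as a typical Viterbi implementation (`pi`, `A`, `B` as nested lists); the function name `forward` and the toy numbers are illustrative choices, not part of the lecture.

```python
def forward(obs, pi, A, B):
    """Return P(obs) under the HMM by summing over all state paths."""
    N = len(pi)
    # initialization: alpha_1(j) = pi_j * b_j(o_1)
    alpha = [pi[s] * B[s][obs[0]] for s in range(N)]
    # recursion: alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
            for j in range(N)
        ]
    # termination: sum over the final column
    return sum(alpha)


# Toy example (hypothetical numbers): 2 states, 3 observation symbols.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
likelihood = forward([0, 1, 2], pi, A, B)
print(likelihood)
```

Only the previous column is kept here, since computing the likelihood needs no backpointers.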
slide-44
SLIDE 44

Problems with HMMs

■ HMMs have a lot in common with Naive Bayes classifiers, n-gram LMs:
  ■ Need smoothing to improve generalization
  ■ How to handle unknown words?
  ■ Would like to easily add arbitrary features that might help model probabilities.
■ Previously we solved some of these problems by training a discriminative model like logistic regression to predict rather than count.
■ How to use logistic regression for sequence labeling?

slide-49
SLIDE 49

Maximum entropy Markov models (MEMMs)

■ Simply predict the label for each word using logistic regression, using the label assigned to the previous word as a feature (along with any other useful features)!

[Figure: tagging "<s> Janet will back the bill" left to right. Janet = w_{i-2}, tagged NNP (t_{i-2}); will = w_{i-1}, tagged MD (t_{i-1}); back = w_i, candidate tag VB?; the = w_{i+1}]

Lexical  f_{i±{0,1,2,3}}, (m_{i-2,i-1}), (m_{i-1,i}), (m_{i-1,i+1}), (m_{i,i+1}), (m_{i+1,i+2}), (m_{i-2,i-1,i}), (m_{i-1,i,i+1}), (m_{i,i+1,i+2}), (m_{i-2,i-1,i+1}), (m_{i-1,i+1,i+2})
POS      p_{i-{3,2,1}}, a_{i+{0,1,2,3}}, (p_{i-2,i-1}), (a_{i+1,i+2}), (p_{i-1}, a_{i+1}), (p_{i-2}, p_{i-1}, a_i), (p_{i-2}, p_{i-1}, a_{i+1}), (p_{i-1}, a_i, a_{i+1}), (p_{i-1}, a_{i+1}, a_{i+2})
Affix    c_{:1}, c_{:2}, c_{:3}, c_{n:}, c_{n-1:}, c_{n-2:}, c_{n-3:}
Binary   initial uppercase, all uppercase/lowercase, contains 1/2+ capital(s) not at the beginning, contains a (period/number/hyphen)

Table from: Choi and Palmer. Fast and Robust Part-of-Speech Tagging using Dynamic Model Selection. ACL 2012.

$$P(t_i \mid w_i, t_{i-1}) = \frac{\exp\left(\theta \cdot f(t_i, w_{i-j}^{i+j}, t_{i-k}^{i-1})\right)}{\sum_{t' \in \mathcal{Y}} \exp\left(\theta \cdot f(t', w_{i-j}^{i+j}, t_{i-k}^{i-1})\right)}$$

$$\hat{T} = \operatorname{argmax}_{T} \prod_{i} P(t_i \mid w_{i-j}^{i+j}, t_{i-k}^{i-1})$$
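The local softmax can be sketched with a handful of indicator features. Everything named below (`features`, `local_probs`, `greedy_decode`, the three-tag set, and the hand-set weights) is hypothetical and chosen for illustration. Greedy left-to-right decoding is used for brevity; finding the argmax over whole tag sequences would instead run Viterbi over these local probabilities.

```python
import math

TAGS = ["NNP", "MD", "VB"]


def features(tag, words, i, prev_tag):
    """Hypothetical indicator features f(t_i, w_i, t_{i-1})."""
    return [f"word={words[i]}|tag={tag}", f"prev={prev_tag}|tag={tag}"]


def local_probs(theta, words, i, prev_tag):
    """P(t_i | w_i, t_{i-1}): softmax over tags of theta . f."""
    scores = {
        t: sum(theta.get(f, 0.0) for f in features(t, words, i, prev_tag))
        for t in TAGS
    }
    z = sum(math.exp(s) for s in scores.values())
    return {t: math.exp(s) / z for t, s in scores.items()}


def greedy_decode(theta, words):
    """Tag left to right, feeding each predicted tag into the next step."""
    prev, out = "<s>", []
    for i in range(len(words)):
        probs = local_probs(theta, words, i, prev)
        prev = max(probs, key=probs.get)
        out.append(prev)
    return out


# Hand-set weights for illustration only (not learned from data).
theta = {
    "word=Janet|tag=NNP": 2.0,
    "word=will|tag=MD": 1.0,
    "prev=NNP|tag=MD": 1.0,
    "word=back|tag=VB": 0.5,
    "prev=MD|tag=VB": 2.0,
}
words = ["Janet", "will", "back"]
print(greedy_decode(theta, words))
```

Because features are arbitrary functions of the word window and previous tags, adding a new one (a suffix, a capitalization flag) only means emitting another string from `features`.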
slide-50
SLIDE 50

Part-of-speech tagging features

w_i contains a particular prefix (from all prefixes of length ≤ 4)
w_i contains a particular suffix (from all suffixes of length ≤ 4)
w_i contains a number
w_i contains an upper-case letter
w_i contains a hyphen
w_i is all upper case
w_i's word shape
w_i's short word shape
w_i is upper case and has a digit and a dash (like CFC-12)
w_i is upper case and followed within 3 words by Co., Inc., etc.

slide-52
SLIDE 52

Features for NER

■ Example extracted features for the token L’Occitane

prefix(w_i) = L       suffix(w_i) = tane
prefix(w_i) = L’      suffix(w_i) = ane
prefix(w_i) = L’O     suffix(w_i) = ne
prefix(w_i) = L’Oc    suffix(w_i) = e
word-shape(w_i) = X’Xxxxxxxx
short-word-shape(w_i) = X’Xx
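These features are simple string functions. The sketch below assumes the usual shape convention (uppercase to X, lowercase to x, digit to d, other characters kept, and the short shape collapsing repeated shape characters); the helper names are illustrative.

```python
import re


def word_shape(word):
    """Map uppercase to X, lowercase to x, digits to d; keep other characters."""
    out = []
    for ch in word:
        if ch.isupper():
            out.append("X")
        elif ch.islower():
            out.append("x")
        elif ch.isdigit():
            out.append("d")
        else:
            out.append(ch)
    return "".join(out)


def short_word_shape(word):
    """Word shape with each run of the same shape character collapsed to one."""
    return re.sub(r"(.)\1+", r"\1", word_shape(word))


def affixes(word, max_len=4):
    """All prefixes (shortest first) and suffixes (longest first) of length <= max_len."""
    n = min(max_len, len(word))
    prefixes = [word[:k] for k in range(1, n + 1)]
    suffixes = [word[-k:] for k in range(n, 0, -1)]
    return prefixes, suffixes


token = "L'Occitane"
print(word_shape(token), short_word_shape(token), affixes(token))
```

Shape features let the model generalize to unseen tokens: a word it has never seen can still share the shape `Xx` or `X-d` with many training tokens.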

slide-63
SLIDE 63

Label bias problem

■ Greediness: Transitions leaving a given state compete only against each other, rather than against all other transitions in the model.
■ Leads to locally optimal decisions that are not globally optimal.
■ All locally normalized models (vs. globally normalized) suffer from this problem.

[Figure: tagging "the old man the boat". the = DT, old = ?? (NN? ADJ?), man = ?? (NN? VB?), the = DT, boat = NN]

Are HMMs subject to the label bias problem?
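A small numerical illustration of the first bullet, using made-up scores: when the transitions leaving each state are normalized on their own, only score differences within that state matter, so how well or badly the next observation fits overall is lost. The state names, scores, and the helper `local_normalize` are all hypothetical.

```python
import math


def local_normalize(scores):
    """Softmax over the transitions leaving one state."""
    z = sum(math.exp(s) for s in scores.values())
    return {nxt: math.exp(s) / z for nxt, s in scores.items()}


# Hypothetical scores for transitions out of two states, given the next word.
# State "A" has a single successor; state "B" has two.
fits_well = {"A": {"X": 5.0}, "B": {"X": 5.0, "Y": 4.0}}
fits_badly = {"A": {"X": -5.0}, "B": {"X": -5.0, "Y": -6.0}}

for name, table in [("fits well", fits_well), ("fits badly", fits_badly)]:
    for state, scores in table.items():
        print(name, state, local_normalize(scores))
```

State A passes on probability 1 in both cases, and state B produces the same distribution in both cases, even though the raw scores say the second observation is a much worse match.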

slide-68
SLIDE 68

Conditional random fields (CRFs)

■ CRFs: Globally-normalized discriminative sequence labeling models!
■ A family of graphical models where the probability of a configuration of variables is proportional to a product of scores across pairs (or cliques) of variables (suitable for structured prediction, not just sequence labeling).
■ In NLP, usually we mean a linear-chain CRF where pairs of variables are labels for adjacent tokens.
■ Semi-Markov CRFs [Sarawagi and Cohen 2004] are also useful for sequence labeling in NLP: feature functions, normalization over possible segments.

slide-69
SLIDE 69

Conditional random fields (CRFs)

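[Figure: relationships among models.
Generative: Naive Bayes → (SEQUENCE) → HMMs → (General GRAPHS) → Generative directed models
Discriminative: Logistic Regression → (SEQUENCE) → Linear-chain CRFs → (General GRAPHS) → General CRFs
Each generative model is linked to the discriminative model below it by CONDITIONAL.]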

Image from: Sutton and McCallum. An Introduction to Conditional Random Fields for Relational Learning. In Introduction to Statistical Relational Learning. 2012.

slide-75
SLIDE 75

Conditional random fields (CRFs)

26

■ CRFs: Globally-normalized discriminative sequence labeling models!

$$P(y \mid w) = \frac{\exp(\Psi(w, y))}{\sum_{y' \in \mathcal{Y}(w)} \exp(\Psi(w, y'))}$$

(y' is an entire sequence)

$$\Psi(w, y) = \sum_{m=1}^{M+1} \theta \cdot f(w, y_m, y_{m-1}, m)$$

(logistic regression, plus weights over pairs of labels)

■ Decoding: Viterbi algorithm.

■ Likelihood: Forward algorithm.

■ Training: Forward-Backward (get backward for free w/ autodiff!)
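
The decoding and likelihood computations for a linear-chain CRF can be sketched in a few lines. This is a minimal illustration, not the course's implementation: it assumes the local score θ · f has already been split into a per-token table `emit[m][k]` and a label-pair table `trans[j][k]` (both hypothetical names), and it leaves out the start and stop transitions for brevity. The numbers are made up.

```python
import math
from itertools import product


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def sequence_score(emit, trans, labels):
    """Psi(w, y): per-token scores plus scores for adjacent label pairs."""
    total = emit[0][labels[0]]
    for m in range(1, len(labels)):
        total += trans[labels[m - 1]][labels[m]] + emit[m][labels[m]]
    return total


def log_partition(emit, trans):
    """Forward algorithm: log of the sum of exp(Psi) over all label sequences."""
    num_labels = len(trans)
    alpha = list(emit[0])
    for m in range(1, len(emit)):
        alpha = [
            emit[m][k]
            + logsumexp([alpha[j] + trans[j][k] for j in range(num_labels)])
            for k in range(num_labels)
        ]
    return logsumexp(alpha)


def viterbi(emit, trans):
    """Viterbi algorithm: the highest-scoring label sequence and its score."""
    num_labels = len(trans)
    best = list(emit[0])
    backpointers = []
    for m in range(1, len(emit)):
        new_best, pointers = [], []
        for k in range(num_labels):
            j = max(range(num_labels), key=lambda prev: best[prev] + trans[prev][k])
            new_best.append(best[j] + trans[j][k] + emit[m][k])
            pointers.append(j)
        best = new_best
        backpointers.append(pointers)
    last = max(range(num_labels), key=lambda k: best[k])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


# Toy scores for 3 tokens and 2 labels (made-up numbers).
emit = [[1.0, 0.0], [0.2, 0.9], [0.0, 1.5]]
trans = [[0.5, -0.3], [-1.0, 0.8]]

path, score = viterbi(emit, trans)
log_z = log_partition(emit, trans)
log_prob = score - log_z  # log P(y | w) of the best sequence

# Brute-force check over all 2**3 sequences (only feasible at toy sizes).
all_sequences = list(product(range(2), repeat=3))
brute_log_z = logsumexp([sequence_score(emit, trans, y) for y in all_sequences])
```

The two dynamic programs differ only in replacing the log-sum-exp with a max (plus backpointers). For training, differentiating `log_partition` with respect to the scores yields the marginals that the backward pass would otherwise compute, which is why an autodiff framework makes a hand-written backward algorithm unnecessary.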

slide-76
SLIDE 76

Sequence labeling in practice: Named entity recognition

27

slide-77
SLIDE 77

Named entity recognition (NER)

28

Citing high fuel prices, [ORG United Airlines] said [TIME Friday] it has increased fares by [MONEY $6] per round trip on flights to some cities also served by lower-cost carriers. [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said. [ORG United], a unit of [ORG UAL Corp.], said the increase took effect [TIME Thursday] and applies to most routes where it competes against discount carriers, such as [LOC Chicago] to [LOC Dallas] and [LOC Denver] to [LOC San Francisco].

slide-78
SLIDE 78

Named entity recognition: typical tags

29

Type                 | Tag | Sample Categories            | Example sentences
People               | PER | people, characters           | Turing is a giant of computer science.
Organization         | ORG | companies, sports teams      | The IPCC warned about the cyclone.
Location             | LOC | regions, mountains, seas     | The Mt. Sanitas loop is in Sunshine Canyon.
Geo-Political Entity | GPE | countries, states, provinces | Palo Alto is raising the fees for parking.
Facility             | FAC | bridges, buildings, airports | Consider the Golden Gate Bridge.
Vehicles             | VEH | planes, trains, automobiles  | It was a classic Ford Falcon.

Figure 18.1 A list of generic named entity types with the kinds of entities they refer to.

slide-79
SLIDE 79

Named entity recognition (NER)

30

Example from PubTator: https://www.ncbi.nlm.nih.gov/research/pubtator/?view=publication&pmid=32939514

[Figure: annotated PubTator abstract. Highlighted entity types: chemical, disease, species, gene.]

slide-81
SLIDE 81

Ambiguity in NER

31

Name          | Possible Categories
Washington    | Person, Location, Political Entity, Organization, Vehicle
Downing St.   | Location, Organization
IRA           | Person, Organization, Monetary Instrument
Louis Vuitton | Person, Organization, Commercial Product

PER

[ORG Washington] went up 2 games to 1 in the four-game series.

Blair arrived in [LOC Washington] for what may well be his last state visit.

In June, [GPE Washington] passed a primary seatbelt law.

The [VEH Washington] had proved to be a leaky ship, every passage I made...

slide-82
SLIDE 82

NER as sequence labeling

32

[ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said.

Words       | IOB Label | IO Label
American    | B-ORG     | I-ORG
Airlines    | I-ORG     | I-ORG
,           | O         | O
a           | O         | O
unit        | O         | O
of          | O         | O
AMR         | B-ORG     | I-ORG
Corp.       | I-ORG     | I-ORG
,           | O         | O
immediately | O         | O
matched     | O         | O

IOB (or, BIO); also, IOBES / BILOU:

Cu      | B-Material
foils   | L-Material
were    | O
placed  | U-Operation
inside  | O
a       | O
quartz  | B-Apparatus
tube    | I-Apparatus
furnace | L-Apparatus
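
The tagging schemes above differ only in how they mark span boundaries, which a short conversion routine makes concrete. This is an illustrative sketch: the function name `spans_to_tags` and the `(start, end, label)` span format with an exclusive end index are assumptions made here, not part of the lecture.

```python
def spans_to_tags(num_tokens, spans, scheme="BIO"):
    """Convert (start, end, label) spans, end exclusive, into per-token tags.

    scheme is one of "IO", "BIO", or "BILOU".
    """
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        # Start by marking every token in the span as inside.
        for i in range(start, end):
            tags[i] = "I-" + label
        if scheme == "IO":
            continue
        # BIO and BILOU both mark the first token of a span.
        tags[start] = "B-" + label
        if scheme == "BILOU":
            if end - start == 1:
                tags[start] = "U-" + label  # unit-length span
            else:
                tags[end - 1] = "L-" + label  # last token of the span
    return tags


tokens = ["Cu", "foils", "were", "placed", "inside", "a", "quartz", "tube", "furnace"]
spans = [(0, 2, "Material"), (3, 4, "Operation"), (6, 9, "Apparatus")]

bilou = spans_to_tags(len(tokens), spans, scheme="BILOU")
bio = spans_to_tags(len(tokens), spans, scheme="BIO")
io = spans_to_tags(len(tokens), spans, scheme="IO")
```

Note that the IO scheme cannot separate two adjacent entities of the same type, since both would be tagged as one unbroken run of I- labels; the B- tag in BIO and the L-/U- tags in BILOU exist to make those boundaries explicit.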

slide-83
SLIDE 83

Features for NER

33

identity of w_i, identity of neighboring words

embeddings for w_i, embeddings for neighboring words

part of speech of w_i, part of speech of neighboring words

base-phrase syntactic chunk label of w_i and neighboring words

presence of w_i in a gazetteer

w_i contains a particular prefix (from all prefixes of length ≤ 4)

w_i contains a particular suffix (from all suffixes of length ≤ 4)

w_i is all upper case

word shape of w_i, word shape of neighboring words

short word shape of w_i, short word shape of neighboring words

presence of hyphen
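
Several of the features listed above are simple string functions of the token. The sketch below assumes the common word-shape convention (uppercase letters map to X, lowercase to x, digits to d, other characters kept as-is; the short shape collapses repeated symbols). The function names and the feature-string format are hypothetical, and the embedding, part-of-speech, and chunk features are omitted because they need external models.

```python
from itertools import groupby


def word_shape(word):
    """Map uppercase to X, lowercase to x, digits to d; keep other characters."""
    shape = []
    for ch in word:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)
    return "".join(shape)


def short_word_shape(word):
    """Word shape with runs of the same symbol collapsed to one symbol."""
    return "".join(symbol for symbol, _ in groupby(word_shape(word)))


def token_features(tokens, i, gazetteer=frozenset()):
    """Return a set of string-valued features for tokens[i]."""
    word = tokens[i]
    features = {
        "word=" + word,
        "shape=" + word_shape(word),
        "short_shape=" + short_word_shape(word),
    }
    if word.isupper():
        features.add("all_upper")
    if "-" in word:
        features.add("has_hyphen")
    if word in gazetteer:
        features.add("in_gazetteer")
    # All prefixes and suffixes of length at most 4.
    for n in range(1, min(4, len(word)) + 1):
        features.add("prefix=" + word[:n])
        features.add("suffix=" + word[-n:])
    # Identity of neighboring words, when they exist.
    if i > 0:
        features.add("prev_word=" + tokens[i - 1])
    if i + 1 < len(tokens):
        features.add("next_word=" + tokens[i + 1])
    return features


example = token_features(["United", "Airlines", "said"], 0)
```

For instance, `word_shape("DC10-30")` gives `XXdd-dd` and `short_word_shape("DC10-30")` gives `Xd-d`, so tokens with different characters but the same capitalization and digit pattern share a feature.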

slide-86
SLIDE 86

Evaluating named entity recognition

34

■ NER is typically evaluated using segment-level F1 score, meaning at the level of multi-token entities, not single words.

American | Airlines | said | Friday | …
B-ORG    | O        | O    | O
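
Segment-level scoring can be sketched as follows: extract entity spans from the gold and predicted tag sequences, then count a prediction as correct only if its boundaries and type both match a gold span exactly. The function names are hypothetical, the gold tags are assumed here by analogy with the annotated airline example earlier in the lecture, and the tag row above is treated as a hypothetical system prediction.

```python
def tags_to_spans(tags):
    """Extract (start, end, label) spans from BIO tags; end is exclusive."""
    spans = set()
    start, label = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a final span
        continues = tag.startswith("I-") and tag[2:] == label
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        # A B- tag always opens a span; a stray I- tag opens one leniently.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


def segment_f1(gold_tags, predicted_tags):
    """Return (precision, recall, F1) over exactly matching entity spans."""
    gold = tags_to_spans(gold_tags)
    predicted = tags_to_spans(predicted_tags)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Tokens: American Airlines said Friday
gold = ["B-ORG", "I-ORG", "O", "B-TIME"]   # assumed gold annotation
predicted = ["B-ORG", "O", "O", "O"]       # hypothetical system output

scores = segment_f1(gold, predicted)
token_accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
```

Under these assumptions the prediction gets half of the token tags right, yet its segment-level F1 is zero, because the predicted one-token ORG span does not match the two-token gold span and the TIME entity is missed entirely.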

slide-87
SLIDE 87

Neural sequence labeling

35

■ Next class: CRFs, neural network models for sequence labeling, and how to combine them.

Image from: Lample et al. Neural Architectures for Named Entity Recognition. NAACL 2016.

slide-88
SLIDE 88

Announcements

36

■ Project 1 is due tomorrow! You may submit up to 3 days late (out of a budget of 5 total for the semester).

■ No recitation tomorrow (Friday). Do your homework.