L101: Introduction to Structured Prediction Ryan Cotterell What is - - PowerPoint PPT Presentation



slide-1
SLIDE 1

L101: Introduction to Structured Prediction

Ryan Cotterell

slide-2
SLIDE 2

What is structured prediction?

  • It’s just multi-class classification!
  • Definition: Structured: something in the problem is exponentially large
  • Definition: Structured Prediction: the output space of the prediction problem is exponentially large
slide-3
SLIDE 3

Recall Logistic Regression

  • Goal: to construct a probability distribution
  • The major question: What if |Y| is really, really big?
  • Can we find an efficient algorithm for computing that sum?

$$p(y \mid x) = \frac{\exp\{\mathrm{score}(y, x)\}}{\sum_{y' \in \mathcal{Y}} \exp\{\mathrm{score}(y', x)\}}$$

We will define this later
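When the label set is small, the normalizing sum can simply be computed term by term. The following is a minimal sketch of that case; the label names and score values are made up for illustration and do not come from the lecture.

```python
import math

def softmax_distribution(scores):
    """Turn a dict {label: score} into a dict {label: probability}."""
    # Subtract the max score before exponentiating for numerical stability.
    m = max(scores.values())
    exp_scores = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exp_scores.values())  # the normalizing sum over all of Y
    return {y: e / z for y, e in exp_scores.items()}

# Hypothetical scores for a 3-class problem (values are illustrative only).
toy_scores = {"positive": 2.0, "negative": 0.5, "neutral": -1.0}
probs = softmax_distribution(toy_scores)
```

The loop over labels costs one term per element of Y, which is exactly what becomes infeasible once |Y| is exponentially large.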

slide-4
SLIDE 4

Structured Prediction in a Meme

  • Sentiment Analysis: Is sentiment positive or negative? $|\mathcal{Y}| = 2$
  • Movie Genre Prediction: Which genre is this script? $|\mathcal{Y}| = n$
  • Part-of-Speech Tagging: This sentence has which part-of-speech-tag sequence? $|\mathcal{Y}| = 2^n$
slide-5
SLIDE 5

Predict Trees!

  • Predict dependency parses from raw text
  • Classic problem in NLP
slide-6
SLIDE 6

Predict Subsets!

  • Determinantal Point Processes
  • A distribution over subsets
slide-7
SLIDE 7

Why isn’t Structured Prediction just Statistics?

  • Computer scientists develop combinatorial algorithms professionally
  • Minimum spanning tree, shortest path problems, maximum flow, LP relaxations
  • Structured prediction is the intersection of algorithms and high-dimensional statistics

[Figure: Statistics and (Theoretical) Computer Science, with structured prediction at their intersection]

slide-8
SLIDE 8

Deep Dive into Discriminative Tagging

  • Assign each word in a sentence a coarse-grained grammatical category
  • Noun, Verb, Adjective, Adverb, Determiner, etc…
  • Arguably, the simplest structured prediction problem in NLP
slide-9
SLIDE 9

Back in 2001…

slide-10
SLIDE 10

What is a score function for tagging?

  • An arbitrary function that takes a word sequence and a tag sequence as input and tells you how good they are together

$$\mathrm{score}(w, t) = \text{``goodness''}(w, t)$$

$$p(t \mid w) \propto \exp\{\mathrm{score}(w, t)\}$$
slide-11
SLIDE 11

Your score function can be any function!

  • Linear function (dot product of a weight vector and a feature function):

$$\mathrm{score}(w, t) = \theta \cdot f(w, t)$$

  • Non-linear function (neural network):

$$\mathrm{score}(w, t) = \text{Neural-Network}(w, t)$$
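To make the linear case concrete, here is a small sketch of a score of the form θ · f(w, t). The feature function `features`, the feature names, the start symbol `"<s>"`, and the weights in `theta` are all hypothetical choices for illustration; the lecture does not specify them.

```python
def features(w, t):
    """Hypothetical feature function f(w, t): sparse counts of
    (word, tag) events and (previous tag, tag) events."""
    counts = {}
    prev = "<s>"  # assumed start-of-tagging symbol
    for word, tag in zip(w, t):
        for key in (("emit", word, tag), ("trans", prev, tag)):
            counts[key] = counts.get(key, 0) + 1
        prev = tag
    return counts

def linear_score(theta, w, t):
    """score(w, t) = theta . f(w, t), with theta stored as a sparse dict."""
    return sum(theta.get(k, 0.0) * v for k, v in features(w, t).items())

# Made-up weights: only two features have non-zero weight.
theta = {("emit", "flies", "V"): 1.5, ("trans", "N", "V"): 0.7}
w = ["Time", "flies"]
s = linear_score(theta, w, ["N", "V"])
```

A neural score function would replace `linear_score` with any other function of the pair (w, t); nothing else in the probabilistic model would need to change.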

slide-12
SLIDE 12

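[Figure: tagging lattice for the sentence w = "Time flies like an arrow", with the candidate tags N (Noun), V (Verb), A (Adv), D (Det) at every position. The path N, V, A, D, N is scored as a whole: score(w, N, V, A, D, N)]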

slide-13
SLIDE 13

How do we normalize?

  • Why is this hard?
  • There are exponentially many summands: $|\mathcal{T}^n| = |\mathcal{T}|^n$
  • $\mathcal{T}$ is the set of tags (typically about 12)
  • The naïve algorithm runs in $O(|\mathcal{T}|^n)$
  • The normalizer is termed the partition function (Zustandssumme)

$$p(t \mid w) = \frac{\exp\{\mathrm{score}(w, t)\}}{\sum_{t' \in \mathcal{T}^n} \exp\{\mathrm{score}(w, t')\}}$$

$$Z(w) = \sum_{t' \in \mathcal{T}^n} \exp\{\mathrm{score}(w, t')\}$$
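The naïve normalizer can be written down directly by enumerating every tag sequence. The sketch below assumes a toy four-tag set and a made-up global score function `toy_score`; both are illustrative stand-ins, not the lecture's model.

```python
import itertools
import math

TAGS = ["N", "V", "A", "D"]  # toy tag set; real tag sets have about 12 tags

def toy_score(w, t):
    """Hypothetical global score: any function of the whole pair (w, t) is allowed."""
    return sum(0.1 * len(word) * (TAGS.index(tag) + 1) for word, tag in zip(w, t))

def brute_force_Z(w, score):
    """Sum exp(score) over all |T|^n tag sequences."""
    return sum(math.exp(score(w, t)) for t in itertools.product(TAGS, repeat=len(w)))

def brute_force_prob(w, t, score):
    """p(t | w) computed with the brute-force normalizer."""
    return math.exp(score(w, t)) / brute_force_Z(w, score)

w = ["Time", "flies", "like", "an", "arrow"]
num_sequences = len(TAGS) ** len(w)  # 4^5 = 1024 summands for this sentence
```

With 12 tags and a 20-word sentence the same enumeration would need 12^20 terms, which is why the unstructured formulation does not scale.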
slide-14
SLIDE 14

What if we assume structure?

  • Does the problem get any easier?
  • Yes! And we’ll see how
  • Structured prediction is about exploiting combinatorial structure for great good

slide-15
SLIDE 15

Example of Structure: Additively Decomposable Score Functions

$$\mathrm{score}(w, t) = \sum_{i=1}^{n} \mathrm{score}(w, t_{i-1}, t_i)$$

  • $w$ is a word sequence (sentence) of length $n$
  • $t$ is a tag sequence of length $n$
  • $(t_{i-1}, t_i)$ is a tag bigram
  • $t_0$ is the start-of-tagging symbol
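The decomposition can be sketched in a few lines. The lookup table inside `bigram_score`, the name `START`, and the extra position argument `i` (which a real local score would use to look at the word being tagged) are assumptions made for this illustration.

```python
START = "<s>"  # assumed name for the start-of-tagging symbol t_0

def bigram_score(w, i, prev_tag, tag):
    """Hypothetical local score for placing `tag` at 0-based word position i
    right after `prev_tag`. This toy version ignores w and i."""
    table = {("<s>", "N"): 1.0, ("N", "V"): 2.0, ("V", "A"): 0.5,
             ("A", "D"): 1.5, ("D", "N"): 2.5}
    return table.get((prev_tag, tag), 0.0)

def sequence_score(w, t):
    """score(w, t) = sum over i of score(w, t_{i-1}, t_i), with t_0 = START."""
    padded = [START] + list(t)
    return sum(bigram_score(w, i, padded[i], padded[i + 1]) for i in range(len(t)))

w = ["Time", "flies", "like", "an", "arrow"]
total = sequence_score(w, ["N", "V", "A", "D", "N"])
```

Because the global score only ever looks at adjacent tag pairs, any computation over all sequences can be reorganized position by position.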
slide-16
SLIDE 16

[Figure: the same tagging lattice for w = "Time flies like an arrow" (tags: Noun, Verb, Adv, Det). The arcs along the path N, V, A, D, N are scored individually: score(w, N, V), score(w, V, A), score(w, A, D), score(w, D, N)]
slide-17
SLIDE 17

Additively Decomposable Score Functions

$$\mathrm{score}(w, N, V, A, D, N) = \mathrm{score}(w, N, V) + \mathrm{score}(w, V, A) + \mathrm{score}(w, A, D) + \mathrm{score}(w, D, N)$$

Probability Distribution:

$$p(t \mid w) \propto \exp\{\mathrm{score}(w, t)\} = \exp\left(\sum_{i=1}^{n} \mathrm{score}(w, t_{i-1}, t_i)\right) = \prod_{i=1}^{n} \exp\{\mathrm{score}(w, t_{i-1}, t_i)\}$$
slide-18
SLIDE 18

How do we Compute Z?

$$
\begin{aligned}
Z(w) &= \sum_{t_1^n \in \mathcal{T}^n} \prod_{i=1}^{n} \exp\{\mathrm{score}(w, t_{i-1}, t_i)\} \\
&= \sum_{t_1^{n-1} \in \mathcal{T}^{n-1}} \sum_{t_n \in \mathcal{T}} \prod_{i=1}^{n} \exp\{\mathrm{score}(w, t_{i-1}, t_i)\} \\
&= \sum_{t_1^{n-1} \in \mathcal{T}^{n-1}} \prod_{i=1}^{n-1} \exp\{\mathrm{score}(w, t_{i-1}, t_i)\} \times \sum_{t_n \in \mathcal{T}} \exp\{\mathrm{score}(w, t_{n-1}, t_n)\} \\
&= \sum_{t_1 \in \mathcal{T}} \exp\{\mathrm{score}(w, t_0, t_1)\} \times \sum_{t_2 \in \mathcal{T}} \exp\{\mathrm{score}(w, t_1, t_2)\} \times \cdots \times \sum_{t_n \in \mathcal{T}} \exp\{\mathrm{score}(w, t_{n-1}, t_n)\} \\
&= \sum_{t_1 \in \mathcal{T}} \exp\{\mathrm{score}(w, t_0, t_1)\} \times \beta(w, t_1) \\
&= \beta(w, t_0)
\end{aligned}
$$
slide-19
SLIDE 19

A Simple Dynamic Program

  • You’ve seen this before! It’s just the backward algorithm from HMMs
  • Same algorithm, because the lattice structure is the same
  • Runtime complexity? Space complexity?

$$
\begin{aligned}
\beta(w, t_n) &= 1 \\
\beta(w, t_i) &= \sum_{t_{i+1} \in \mathcal{T}} \exp\{\mathrm{score}(w, t_i, t_{i+1})\} \times \beta(w, t_{i+1})
\end{aligned}
$$
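The recursion can be sketched as follows and checked against brute-force enumeration. The tag set, the start symbol, and the local score `local_score` (including its explicit 1-based position argument `i`) are hypothetical. Under these assumptions the dynamic program takes O(n |T|^2) time, and O(|T|) space if only Z is needed.

```python
import itertools
import math

TAGS = ["N", "V", "A", "D"]  # toy tag set
START = "<s>"                # assumed start-of-tagging symbol t_0

def local_score(w, i, prev_tag, tag):
    """Hypothetical bigram score for `tag` at 1-based position i; numbers are arbitrary."""
    bonus = 0.3 * len(w[i - 1]) if tag == "N" else 0.0
    return bonus + (0.5 if prev_tag != tag else 0.0)

def backward_Z(w):
    """Compute Z(w) = beta(w, t_0) with the backward recursion."""
    n = len(w)
    beta = {t: 1.0 for t in TAGS}  # base case: beta(w, t_n) = 1
    for i in range(n - 1, 0, -1):  # fill beta(w, t_i) for i = n-1, ..., 1
        beta = {t: sum(math.exp(local_score(w, i + 1, t, u)) * beta[u] for u in TAGS)
                for t in TAGS}
    # Final step from the start symbol gives beta(w, t_0).
    return sum(math.exp(local_score(w, 1, START, u)) * beta[u] for u in TAGS)

def brute_force_Z(w):
    """Reference implementation: enumerate all |T|^n tag sequences."""
    total = 0.0
    for t in itertools.product(TAGS, repeat=len(w)):
        padded = (START,) + t
        total += math.exp(sum(local_score(w, i, padded[i - 1], padded[i])
                              for i in range(1, len(w) + 1)))
    return total
```

Keeping only the current column of β values is enough for Z; storing every column would be needed if marginals were wanted as well.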
slide-20
SLIDE 20

Graphical Representation

  • That was a lot of algebra, huh!
  • To gain insight into this problem, let’s consider a related problem: finding the highest-scoring tagging
  • Note that multiplication distributes over max (for non-negative values) just as it does over sum, so the same derivation holds!

$$
\begin{aligned}
\gamma(w, t_n) &= 1 \\
\gamma(w, t_i) &= \max_{t_{i+1} \in \mathcal{T}} \exp\{\mathrm{score}(w, t_i, t_{i+1})\} \times \gamma(w, t_{i+1})
\end{aligned}
$$
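A sketch of this max recursion with back-pointers is given below. It works in log space, so the product of exponentiated scores becomes a sum of scores and the base case 1 becomes 0. The tag set, start symbol, and `local_score` are hypothetical choices for illustration.

```python
TAGS = ["N", "V", "A", "D"]  # toy tag set
START = "<s>"                # assumed start-of-tagging symbol t_0

def local_score(w, i, prev_tag, tag):
    """Hypothetical bigram score for `tag` at 1-based position i; numbers are arbitrary."""
    bonus = 0.3 * len(w[i - 1]) if tag == "N" else 0.0
    return bonus + (0.5 if prev_tag != tag else 0.0)

def viterbi(w):
    """Return (best total score, best tag sequence), computed in log space."""
    n = len(w)
    gamma = {t: 0.0 for t in TAGS}  # log gamma(w, t_n) = log 1 = 0
    back = []  # one dict per position i = n-1, ..., 1: tag at i -> best tag at i+1
    for i in range(n - 1, 0, -1):
        new_gamma, pointers = {}, {}
        for t in TAGS:
            best = max(TAGS, key=lambda u: local_score(w, i + 1, t, u) + gamma[u])
            pointers[t] = best
            new_gamma[t] = local_score(w, i + 1, t, best) + gamma[best]
        gamma = new_gamma
        back.append(pointers)
    # Choose the first tag given the start symbol, then follow the pointers forward.
    first = max(TAGS, key=lambda u: local_score(w, 1, START, u) + gamma[u])
    best_score = local_score(w, 1, START, first) + gamma[first]
    tags = [first]
    for pointers in reversed(back):
        tags.append(pointers[tags[-1]])
    return best_score, tags
```

The only change relative to the sum recursion is that the sum over the next tag is replaced by a max, plus the bookkeeping needed to recover the argmax sequence.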
slide-21
SLIDE 21

[Figure: tagging lattice for w = "Time flies like an arrow" (tags: Noun, Verb, Adv, Det). Every arc has a weight score(w, t, t')]

slide-22
SLIDE 22

[Figure: the same tagging lattice for w = "Time flies like an arrow" (tags: Noun, Verb, Adv, Det), with the arcs along the path N, V, A, D, N labeled score(w, N, V), score(w, V, A), score(w, A, D), score(w, D, N)]

slide-23
SLIDE 23

Shortest Path in a Graph

  • Find the shortest path in a (weighted) graph
  • A very common problem in computer science
  • In every computer science textbook
  • We just reduced (MAP) inference to a problem you already knew!

Bellman-Ford, Viterbi, Dijkstra

slide-24
SLIDE 24

One Problem, Many Algorithms

  • Tagging lattices are acyclic, so we don’t have to worry about special solutions for cycles
  • Dealing with cycles is a focal point in algorithms classes
  • Dijkstra’s algorithm can incorporate a heuristic to make it faster
  • This is known as A* search
slide-25
SLIDE 25

What does “short” mean anyway?

  • You just need to redefine “short” to be a bit more abstract
  • The computation of the partition function Z is a shortest-path problem in the abstract

  • Can we go more general?
  • We will parameterize the dynamic program by a semiring
  • This gives us a family of algorithms
slide-26
SLIDE 26

Introduction to Semirings

  • A semiring is an algebraic structure
  • The above is tedious! So, let’s come up with a better intuition
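For reference, the standard textbook definition of a semiring can be written out as below. The symbols $\mathbb{K}$, $\bar{0}$ and $\bar{1}$ are notation assumed here, not taken from the slides.

```latex
A semiring is a tuple $(\mathbb{K}, \oplus, \otimes, \bar{0}, \bar{1})$ such that,
for all $a, b, c \in \mathbb{K}$:
\begin{align*}
(a \oplus b) \oplus c &= a \oplus (b \oplus c), &
a \oplus b &= b \oplus a, &
a \oplus \bar{0} &= a
  && \text{($\oplus$ is a commutative monoid)} \\
(a \otimes b) \otimes c &= a \otimes (b \otimes c), &
a \otimes \bar{1} &= a, &
\bar{1} \otimes a &= a
  && \text{($\otimes$ is a monoid)} \\
a \otimes (b \oplus c) &= (a \otimes b) \oplus (a \otimes c), &
(a \oplus b) \otimes c &= (a \otimes c) \oplus (b \otimes c)
  &&&& \text{(distributivity)} \\
\bar{0} \otimes a &= \bar{0}, &
a \otimes \bar{0} &= \bar{0}
  &&&& \text{($\bar{0}$ annihilates)}
\end{align*}
```

Distributivity is the axiom that does the work in the tagging derivations: it is what allows an inner sum (or max) to be pulled out of a product.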

slide-27
SLIDE 27

A Semiring Is an Object Where This Derivation Holds

$$
\begin{aligned}
\beta(w, t_0) &= \bigoplus_{t_1^n \in \mathcal{T}^n} \bigotimes_{i=1}^{n} \exp\{\mathrm{score}(w, t_{i-1}, t_i)\} \\
&= \bigoplus_{t_1^{n-1} \in \mathcal{T}^{n-1}} \bigoplus_{t_n \in \mathcal{T}} \bigotimes_{i=1}^{n} \exp\{\mathrm{score}(w, t_{i-1}, t_i)\} \\
&= \bigoplus_{t_1^{n-1} \in \mathcal{T}^{n-1}} \bigotimes_{i=1}^{n-1} \exp\{\mathrm{score}(w, t_{i-1}, t_i)\} \otimes \bigoplus_{t_n \in \mathcal{T}} \exp\{\mathrm{score}(w, t_{n-1}, t_n)\} \\
&= \bigoplus_{t_1 \in \mathcal{T}} \exp\{\mathrm{score}(w, t_0, t_1)\} \otimes \bigoplus_{t_2 \in \mathcal{T}} \exp\{\mathrm{score}(w, t_1, t_2)\} \otimes \cdots \otimes \bigoplus_{t_n \in \mathcal{T}} \exp\{\mathrm{score}(w, t_{n-1}, t_n)\} \\
&= \bigoplus_{t_1 \in \mathcal{T}} \exp\{\mathrm{score}(w, t_0, t_1)\} \otimes \beta(w, t_1)
\end{aligned}
$$

e.g. $\bigoplus = \max$, $\bigotimes = \prod$
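One way to see the "family of algorithms" is to write the backward recursion once and pass the semiring in as a parameter. The sketch below is an illustration under assumptions: the `Semiring` tuple, the three instances, the toy tag set, and `local_score` are all names and values invented here.

```python
import math
from collections import namedtuple

# A semiring is represented as (plus, times, zero, one); this layout is the sketch's own.
Semiring = namedtuple("Semiring", ["plus", "times", "zero", "one"])

SUM_PRODUCT = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0.0, 1.0)  # partition function
MAX_PRODUCT = Semiring(max, lambda a, b: a * b, 0.0, 1.0)  # Viterbi, non-negative weights only
MIN_PLUS = Semiring(min, lambda a, b: a + b, math.inf, 0.0)  # shortest path over costs

TAGS = ["N", "V", "A", "D"]  # toy tag set
START = "<s>"                # assumed start-of-tagging symbol t_0

def local_score(w, i, prev_tag, tag):
    """Hypothetical bigram score for `tag` at 1-based position i; numbers are arbitrary."""
    bonus = 0.3 * len(w[i - 1]) if tag == "N" else 0.0
    return bonus + (0.5 if prev_tag != tag else 0.0)

def backward(w, semiring, weight):
    """Generic backward recursion returning beta(w, t_0) in the given semiring.
    weight(w, i, prev_tag, tag) must return an element of that semiring."""
    beta = {t: semiring.one for t in TAGS}  # base case: beta(w, t_n) = one
    for i in range(len(w) - 1, 0, -1):
        new_beta = {}
        for t in TAGS:
            acc = semiring.zero
            for u in TAGS:
                acc = semiring.plus(acc, semiring.times(weight(w, i + 1, t, u), beta[u]))
            new_beta[t] = acc
        beta = new_beta
    acc = semiring.zero
    for u in TAGS:
        acc = semiring.plus(acc, semiring.times(weight(w, 1, START, u), beta[u]))
    return acc

def exp_weight(w, i, prev_tag, tag):
    return math.exp(local_score(w, i, prev_tag, tag))

def cost(w, i, prev_tag, tag):
    return -local_score(w, i, prev_tag, tag)  # low cost corresponds to high score

w = ["Time", "flies", "like", "an", "arrow"]
Z = backward(w, SUM_PRODUCT, exp_weight)      # sum over all taggings
best = backward(w, MAX_PRODUCT, exp_weight)   # weight of the best tagging
shortest = backward(w, MIN_PLUS, cost)        # cost of the cheapest path
```

The three calls share every line of `backward`; only the meaning of plus, times, zero, and one changes, which is why the best tagging's log-weight and the negated shortest-path cost agree.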
slide-28
SLIDE 28

Backward Algorithm

$$
\begin{aligned}
Z(w) &= \sum_{t_1^n \in \mathcal{T}^n} \prod_{i=1}^{n} \exp\{\mathrm{score}(w, t_{i-1}, t_i)\} \\
&= \sum_{t_1^{n-1} \in \mathcal{T}^{n-1}} \sum_{t_n \in \mathcal{T}} \prod_{i=1}^{n} \exp\{\mathrm{score}(w, t_{i-1}, t_i)\} \\
&= \sum_{t_1^{n-1} \in \mathcal{T}^{n-1}} \prod_{i=1}^{n-1} \exp\{\mathrm{score}(w, t_{i-1}, t_i)\} \times \sum_{t_n \in \mathcal{T}} \exp\{\mathrm{score}(w, t_{n-1}, t_n)\} \\
&= \sum_{t_1 \in \mathcal{T}} \exp\{\mathrm{score}(w, t_0, t_1)\} \times \sum_{t_2 \in \mathcal{T}} \exp\{\mathrm{score}(w, t_1, t_2)\} \times \cdots \times \sum_{t_n \in \mathcal{T}} \exp\{\mathrm{score}(w, t_{n-1}, t_n)\} \\
&= \sum_{t_1 \in \mathcal{T}} \exp\{\mathrm{score}(w, t_0, t_1)\} \times \beta(w, t_1) \\
&= \beta(w, t_0)
\end{aligned}
$$
slide-29
SLIDE 29

Viterbi Algorithm

$$
\begin{aligned}
\beta(w, t_0) &= \bigoplus_{t_1^n \in \mathcal{T}^n} \bigotimes_{i=1}^{n} \exp\{\mathrm{score}(w, t_{i-1}, t_i)\} \\
&= \bigoplus_{t_1^{n-1} \in \mathcal{T}^{n-1}} \bigoplus_{t_n \in \mathcal{T}} \bigotimes_{i=1}^{n} \exp\{\mathrm{score}(w, t_{i-1}, t_i)\} \\
&= \bigoplus_{t_1^{n-1} \in \mathcal{T}^{n-1}} \bigotimes_{i=1}^{n-1} \exp\{\mathrm{score}(w, t_{i-1}, t_i)\} \otimes \bigoplus_{t_n \in \mathcal{T}} \exp\{\mathrm{score}(w, t_{n-1}, t_n)\} \\
&= \bigoplus_{t_1 \in \mathcal{T}} \exp\{\mathrm{score}(w, t_0, t_1)\} \otimes \bigoplus_{t_2 \in \mathcal{T}} \exp\{\mathrm{score}(w, t_1, t_2)\} \otimes \cdots \otimes \bigoplus_{t_n \in \mathcal{T}} \exp\{\mathrm{score}(w, t_{n-1}, t_n)\} \\
&= \bigoplus_{t_1 \in \mathcal{T}} \exp\{\mathrm{score}(w, t_0, t_1)\} \otimes \beta(w, t_1)
\end{aligned}
$$

e.g. $\bigoplus = \max$, $\bigotimes = \prod$

slide-30
SLIDE 30

Examples of Semirings

https://www.aclweb.org/anthology/C08-5001.pdf

slide-31
SLIDE 31

Conclusion

  • Deep-dive into discriminative tagging with CRFs
  • Generalized the idea of a shortest-path problem to semirings
  • You already knew the content of this lecture from your algorithms class!

slide-32
SLIDE 32

Fin