L101: Introduction to Structured Prediction




  1. L101: Introduction to Structured Prediction Ryan Cotterell

  2. What is structured prediction?
  • "It's just multi-class classification!"
  • Definition: Structured — something in the problem is exponentially large
  • Definition: Structured Prediction — the output space of the prediction problem is exponentially large

  3. Recall Logistic Regression
  • Goal: construct a probability distribution
      p(y | x) = exp{score(y, x)} / Σ_{y′ ∈ Y} exp{score(y′, x)}
  • (We will define the score function later)
  • The major question: what if |Y| is really, really big?
  • Can we find an efficient algorithm for computing that sum?
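When |Y| is small, the normalization above can be computed directly. A minimal sketch (the label set and scores here are hypothetical, just to make the formula concrete):

```python
import math

def softmax_distribution(scores):
    """p(y | x) = exp(score(y, x)) / sum_{y'} exp(score(y', x))."""
    # Subtract the max score for numerical stability (does not change p).
    m = max(scores.values())
    exp_scores = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exp_scores.values())  # the normalizer
    return {y: e / z for y, e in exp_scores.items()}

# Hypothetical scores for a 3-way classification problem.
p = softmax_distribution({"pos": 2.0, "neg": 0.5, "neutral": -1.0})
```

The sum over Y costs O(|Y|) here; the rest of the lecture is about what happens when that sum has exponentially many terms.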

  4. Structured Prediction in a Meme
  • Sentiment Analysis: |Y| = 2 — is sentiment positive or negative?
  • Movie Genre Prediction: |Y| = n — which genre is this script?
  • Part-of-Speech Tagging: |Y| = 2^n — this sentence has which part-of-speech-tag sequence?
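The three sizes above can be made concrete for a sentence of n = 20 words (the genre count is a hypothetical placeholder):

```python
# Output-space sizes for the three problems, for a sentence of n words.
n = 20
sentiment_labels = 2      # positive vs. negative: constant
genre_labels = 12         # hypothetical genre inventory: grows like n, not with it
tag_sequences = 2 ** n    # even a binary tagset yields 2^n whole-sentence labelings
```

Already at 20 words a binary tagset produces over a million candidate outputs, which is why enumeration is off the table.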

  5. Predict Trees! • Predict dependency parses from raw text • Classic problem in NLP

  6. Predict Subsets! • Determinantal Point Processes • A distribution over subsets
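A determinantal point process assigns each subset S of a ground set a probability proportional to det(L_S), the determinant of a kernel matrix L restricted to the rows and columns in S. A brute-force sketch for a tiny, hypothetical 2×2 kernel (real DPP code avoids enumerating all 2^n subsets):

```python
import itertools

def det(m):
    """Determinant by Laplace expansion (fine for tiny matrices)."""
    if not m:
        return 1.0  # empty subset: det of the 0x0 matrix is 1
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for j in range(len(m)):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += ((-1) ** j) * m[0][j] * det(minor)
    return total

def dpp_probabilities(L):
    """P(S) ∝ det(L_S): a distribution over all subsets of the ground set."""
    n = len(L)
    unnorm = {}
    for r in range(n + 1):
        for S in itertools.combinations(range(n), r):
            sub = [[L[i][j] for j in S] for i in S]
            unnorm[S] = det(sub)
    z = sum(unnorm.values())  # equals det(L + I)
    return {S: v / z for S, v in unnorm.items()}

# Hypothetical kernel: similar items (large off-diagonal) rarely co-occur.
L = [[1.0, 0.9], [0.9, 1.0]]
p = dpp_probabilities(L)
```

Note how the large off-diagonal entry shrinks det(L_S) for the pair {0, 1}: DPPs place low probability on subsets of similar items, which is their appeal for diverse selection.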

  7. Why isn't Structured Prediction just Statistics?
  • Computer scientists develop combinatorial algorithms professionally
  • Minimum spanning tree, shortest-path problems, maximum flow, LP relaxations
  • Structured prediction is the intersection of algorithms and high-dimensional statistics
  [Venn diagram: (Theoretical) Statistics ∩ Computer Science]

  8. Deep Dive into Discriminative Tagging • Assign each word in a sentence a coarse-grained grammatical category • Noun, Verb, Adjective, Adverb, Determiner, etc… • Arguably, the simplest structured prediction problem in NLP

  9. Back in 2001…

  10. What is a score function for tagging?
  • An arbitrary function that takes a word sequence and a tag sequence as input and tells you how good they are together
      score(w, t) = "goodness"(w, t)
      p(t | w) ∝ exp{score(w, t)}

  11. Your score function can be any function!
  • Linear function (dot product of a weight vector and a feature function):
      score(w, t) = θ · f(w, t)
  • Non-linear function (neural network):
      score(w, t) = NeuralNetwork(w, t)
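A minimal sketch of the linear case: score(w, t) = θ · f(w, t) as a sparse dot product. The feature templates and weights here are hypothetical, chosen only to illustrate the shape of f:

```python
def features(words, tags):
    """Hypothetical feature function: counts of (word, tag) and (tag, tag) pairs."""
    f = {}
    for w, t in zip(words, tags):
        f[("emit", w, t)] = f.get(("emit", w, t), 0) + 1
    for t1, t2 in zip(tags, tags[1:]):
        f[("trans", t1, t2)] = f.get(("trans", t1, t2), 0) + 1
    return f

def linear_score(theta, words, tags):
    """score(w, t) = θ · f(w, t): sum weights over the active features."""
    return sum(theta.get(k, 0.0) * v for k, v in features(words, tags).items())

# Hypothetical weights favouring 'flies' as a verb after a noun.
theta = {("emit", "flies", "V"): 2.0, ("trans", "N", "V"): 1.0}
s = linear_score(theta, ["Time", "flies"], ["N", "V"])
```

A neural scorer simply replaces this dot product with any network mapping (w, t) to a real number; the probability model on top is unchanged.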

  12. [Lattice figure: the sentence "Time flies like an arrow", with the candidate tags {Noun, Verb, Adjective, Determiner} stacked under each word; one path through the lattice, e.g. score(w, N, V, A, D, N), scores a complete tag sequence.]

  13. How do we normalize?
  • Why is this hard? There are an exponential number of summands:
      p(t | w) = exp{score(t, w)} / Σ_{t′ ∈ Tⁿ} exp{score(t′, w)}
  • T is the set of tags (typically about 12)
  • |Tⁿ| = |T|ⁿ
  • The naïve algorithm runs in O(|T|ⁿ)
  • The normalizer is termed the partition function (Zustandssumme):
      Z(w) = Σ_{t′ ∈ Tⁿ} exp{score(t′, w)}
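For a tiny tagset and a short sentence, the naïve O(|T|ⁿ) sum can be written out directly. A sketch with a toy 4-tag set and hypothetical weights (real taggers never enumerate like this):

```python
import itertools
import math

TAGS = ["N", "V", "A", "D"]  # toy tagset; real coarse tagsets have ~12 tags

def score(tags, words, theta):
    """A hypothetical score that adds one weight per (word, tag) pair."""
    return sum(theta.get((w, t), 0.0) for w, t in zip(words, tags))

def partition_function(words, theta):
    """Z(w) = Σ_{t′ ∈ Tⁿ} exp{score(t′, w)} by brute force: |T|^n summands."""
    return sum(
        math.exp(score(tags, words, theta))
        for tags in itertools.product(TAGS, repeat=len(words))
    )

words = ["Time", "flies", "like", "an", "arrow"]
theta = {("flies", "V"): 1.5, ("an", "D"): 2.0}  # hypothetical weights
Z = partition_function(words, theta)  # enumerates 4^5 = 1024 sequences
```

Because this particular score decomposes over positions, Z also factorizes into a product of per-position sums — exactly the structure that dynamic programming (the forward algorithm) exploits to replace the exponential enumeration with a polynomial-time computation.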
