
Models w/ Latent Random Variables (Chunting Zhou) - PowerPoint PPT Presentation



  1. CS11-747 Neural Networks for NLP: Models w/ Latent Random Variables. Chunting Zhou. Site: https://phontron.com/class/nn4nlp2019/ (with slides from Graham Neubig)

  2. Discriminative vs. Generative Models • Discriminative model: calculate the probability of output given input P(Y|X) • Generative model: calculate the probability of a variable P(X), or multiple variables P(X,Y) • Which of the following models are discriminative vs. generative? • Standard BiLSTM POS tagger • Globally normalized CRF POS tagger • Language model

  3. Types of Variables • Observed vs. Latent: • Observed: something that we can see from our data, e.g. X or Y • Latent: a variable that we assume exists, but whose value we aren't given • Deterministic vs. Random: • Deterministic: variables that are calculated directly according to some deterministic function • Random (stochastic): variables that obey a probability distribution, and may take any of several (or infinitely many) values

  4. Quiz: What Types of Variables? • In an attentional sequence-to-sequence model using MLE/teacher forcing, are the following variables observed or latent? deterministic or random? • The input word ids f • The encoder hidden states h • The attention values a • The output word ids e

  5. Latent Variable Models • A latent variable model (LVM) is a probability distribution p(x, z; θ) over two sets of variables x, z, where the x variables are observed in a dataset at learning time and the z variables are latent.
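A minimal concrete instance of this definition, sketched as a toy two-component Gaussian mixture: z is a discrete latent component, x is the observed value, and θ collects the mixture weights, means, and standard deviation. All parameter values below are hypothetical placeholders, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# theta: mixture weights, component means, shared std (hypothetical values)
pi = np.array([0.3, 0.7])      # p(z): prior over the discrete latent z
mu = np.array([-2.0, 2.0])     # mean of p(x | z) for each component
sigma = 1.0

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def joint(x, z):
    """p(x, z; theta) = p(z) * p(x | z; theta)."""
    return pi[z] * gaussian_pdf(x, mu[z], sigma)

# Ancestral sampling from the generative story: first z, then x given z
z = rng.choice(2, p=pi)
x = rng.normal(mu[z], sigma)
print(z, x, joint(x, z))
```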

  6-8. What is a Latent Random Variable Model? • Older latent variable models: • Topic models (unsupervised) • Hidden Markov Models (unsupervised tagging) • Tree-structured models (unsupervised parsing)

  9. Why Latent Variable Models? • Some variables are not observed naturally and we want to model / infer these hidden variables, e.g. the topics of an article • Specify structural relationships among unknown variables to learn interpretable structure: inject inductive bias / prior knowledge

  10. Deep Structured Latent Variable Models • Specify structure, but interpretable structure is often discrete: e.g. POS tags, dependency parse trees • There is always a tradeoff between interpretability and flexibility: model constraints vs. model capacity

  11. Examples of Deep Latent Variable Models • Variational Autoencoders (VAEs) • Generative Adversarial Networks (GANs) • Flow-based generative models

  12. Variational Auto-encoders (Kingma and Welling 2014)

  13. A Latent Variable Model • We observe an output x (assume a continuous vector for now) • We have a latent variable z generated from a Gaussian: z ∼ N(0, I) • We have a function f, parameterized by Θ, that maps from z to x, where this function is usually a neural net: x = f(z; Θ)
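A minimal NumPy sketch of this generative story, with a tiny two-layer net standing in for f. The layer sizes and random weights are placeholders, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, d_h, d_x = 2, 16, 4  # latent, hidden, and output sizes (arbitrary)

# Theta: randomly initialized weights standing in for a trained decoder
W1, b1 = rng.normal(size=(d_h, d_z)), np.zeros(d_h)
W2, b2 = rng.normal(size=(d_x, d_h)), np.zeros(d_x)

def f(z):
    """Deterministic neural net mapping the latent z to an observation x."""
    h = np.tanh(W1 @ z + b1)
    return W2 @ h + b2

z = rng.standard_normal(d_z)  # z ~ N(0, I)
x = f(z)                      # x = f(z; Theta)
print(x)
```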

  14. An Example (Doersch 2016) [Figure: a neural net f maps samples of the latent z to observations x]

  15. A Probabilistic Perspective on the Variational Auto-Encoder • For each datapoint i: • Draw latent variables z_i ∼ p(z) (prior) • Draw data point x_i ∼ p_θ(x | z) (likelihood) • Joint probability distribution over data and latent variables: p(x, z) = p(z) p_θ(x | z)
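In code, this factorization is just a sum of two log densities. A sketch assuming a standard normal prior p(z) and a unit-variance Gaussian likelihood p_θ(x | z) centered at a decoder output f(z); the linear decoder here is a placeholder for a real neural net.

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, d_x = 2, 4
W = rng.normal(size=(d_x, d_z))  # placeholder decoder parameters (theta)

def f(z):
    return W @ z  # decoder mean; a real model would use a neural net

def log_normal(v, mean):
    """Log density of N(mean, I) evaluated at v."""
    return -0.5 * (np.sum((v - mean) ** 2) + v.size * np.log(2 * np.pi))

def log_joint(x, z):
    log_pz = log_normal(z, np.zeros_like(z))  # log p(z), the prior
    log_px_given_z = log_normal(x, f(z))      # log p_theta(x | z)
    return log_pz + log_px_given_z

x, z = rng.standard_normal(d_x), rng.standard_normal(d_z)
print(log_joint(x, z))
```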

  16. What is Our Loss Function? • We would like to maximize the corpus log likelihood: log P(X) = Σ_{x ∈ X} log P(x; θ) • For a single example, the marginal likelihood is: P(x; θ) = ∫ P(x | z; θ) P(z) dz • We can approximate this by sampling z's and averaging: P(x; θ) ≈ (1 / |S(x)|) Σ_{z ∈ S(x)} P(x | z; θ), where S(x) := {z′ : z′ ∼ P(z)}
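A sketch of that sampling approximation under the same toy Gaussian-decoder assumption as above (placeholder parameters). Note that drawing z from the prior is a high-variance estimator in practice, which is the standard motivation for better inference techniques such as the VAE's.

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, d_x = 2, 4
W = rng.normal(size=(d_x, d_z))  # placeholder decoder parameters (theta)

def log_px_given_z(x, z):
    """log P(x | z; theta) under a unit-variance Gaussian around W @ z."""
    diff = x - W @ z
    return -0.5 * (diff @ diff + d_x * np.log(2 * np.pi))

def marginal_likelihood(x, num_samples=10_000):
    """Monte Carlo estimate: P(x) ~= mean over z ~ P(z) of P(x | z)."""
    zs = rng.standard_normal((num_samples, d_z))  # S(x): samples from the prior
    return np.mean([np.exp(log_px_given_z(x, z)) for z in zs])

x = rng.standard_normal(d_x)  # a stand-in for an observed datapoint
print(marginal_likelihood(x))
```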
