  1. CS 533: Natural Language Processing
     Latent-Variable Generative Models and the Expectation Maximization (EM) Algorithm
     Karl Stratos, Rutgers University

  2. Motivation: Bag-Of-Words (BOW) Document Model
     ◮ Fixed-length documents x ∈ V^T
     ◮ BOW parameters: a word distribution p_W over V defining
           p_X(x) = \prod_{t=1}^T p_W(x_t)
     ◮ Model's generative story: any word in any document is generated independently.
     ◮ What if the true generative story underlying the data is different?
           x^{(1)} = (a, a, a, a, a, a, a, a, a, a)        V = {a, b}
           x^{(2)} = (b, b, b, b, b, b, b, b, b, b)        T = 10
     ◮ MLE: p_X(x^{(1)}) = p_X(x^{(2)}) = (1/2)^{10}
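
     As a concrete check of the last bullet, here is a minimal Python sketch (the names and code are illustrative, not from the slides) that computes the BOW probability of the two toy documents under the MLE word distribution p_W(a) = p_W(b) = 1/2:

         # Illustrative sketch: BOW probability is a product of per-word probabilities.
         p_W = {"a": 0.5, "b": 0.5}  # MLE word distribution on the toy corpus

         def bow_prob(doc):
             prob = 1.0
             for w in doc:
                 prob *= p_W[w]
             return prob

         x1 = ["a"] * 10
         x2 = ["b"] * 10
         print(bow_prob(x1), bow_prob(x2))  # both equal (1/2)**10 ≈ 0.000977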

  3. Latent-Variable BOW (LV-BOW) Document Model
     ◮ LV-BOW parameters:
       ◮ p_Z: "topic" distribution over {1, ..., K}
       ◮ p_{W|Z}: conditional word distributions over V defining, for every z ∈ {1, ..., K},
           p_{X|Z}(x \mid z) = \prod_{t=1}^T p_{W|Z}(x_t \mid z)
           p_X(x) = \sum_{z=1}^K p_Z(z) \times p_{X|Z}(x \mid z)
     ◮ Model's generative story: for each document, a topic is generated, and conditioned on that topic the words are generated independently.
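
     A minimal sketch of the LV-BOW marginal probability, again with illustrative names and with parameter values chosen to match the example on the next slide:

         # Illustrative sketch: LV-BOW marginalizes the topic out of the joint distribution.
         p_Z = {1: 0.5, 2: 0.5}                        # topic distribution
         p_W_given_Z = {1: {"a": 1.0, "b": 0.0},       # topic 1 emits only a
                        2: {"a": 0.0, "b": 1.0}}       # topic 2 emits only b

         def lvbow_prob(doc):
             # p_X(x) = sum_z p_Z(z) * prod_t p_{W|Z}(x_t | z)
             total = 0.0
             for z, pz in p_Z.items():
                 cond = 1.0
                 for w in doc:
                     cond *= p_W_given_Z[z][w]
                 total += pz * cond
             return total

         print(lvbow_prob(["a"] * 10))  # 0.5, much larger than (1/2)**10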

  4. Back to the Example
           x^{(1)} = (a, a, a, a, a, a, a, a, a, a)        V = {a, b}
           x^{(2)} = (b, b, b, b, b, b, b, b, b, b)        T = 10
     ◮ K = 2 with p_Z(1) = p_Z(2) = 1/2
     ◮ p_{W|Z}(a \mid 1) = p_{W|Z}(b \mid 2) = 1
     ◮ p_X(x^{(1)}) = p_X(x^{(2)}) = 1/2 ≫ (1/2)^{10}
     Key idea: introduce a latent variable Z to model the true generative process more faithfully.

  5. The Latent-Variable Generative Model Paradigm
     Model. Joint distribution over X and Z:
           p_{XZ}(x, z) = p_Z(z) \times p_{X|Z}(x \mid z)
     Learning. We don't observe Z!
           \max_{p_{XZ}} \; \mathbb{E}_{x \sim \text{pop}_X}\Big[\log \underbrace{\textstyle\sum_{z \in \mathcal{Z}} p_{XZ}(x, z)}_{p_X(x)}\Big]

  6. The Learning Problem
     ◮ How can we solve
           \max_{p_{XZ}} \; \mathbb{E}_{x \sim \text{pop}_X}\Big[\log \sum_{z \in \mathcal{Z}} p_{XZ}(x, z)\Big] ?
     ◮ Specifically for LV-BOW, given N documents x^{(1)}, ..., x^{(N)} ∈ V^T, how can we learn a topic distribution p_Z and a conditional word distribution p_{W|Z} that maximize
           \sum_{i=1}^N \log \Big( \sum_{z \in \mathcal{Z}} p_Z(z) \times \prod_{t=1}^T p_{W|Z}(x^{(i)}_t \mid z) \Big) ?

  7. A Proposed Algorithm
     1. Initialize p_Z and p_{W|Z} as random distributions.
     2. Repeat until convergence:
        2.1 For i = 1, ..., N compute the conditional posterior distribution
               p_{Z|X}(z \mid x^{(i)}) = \frac{p_Z(z) \times \prod_{t=1}^T p_{W|Z}(x^{(i)}_t \mid z)}{\sum_{z'=1}^K p_Z(z') \times \prod_{t=1}^T p_{W|Z}(x^{(i)}_t \mid z')}
        2.2 Update the model parameters by
               p_Z(z) = \frac{\sum_{i=1}^N p_{Z|X}(z \mid x^{(i)})}{\sum_{z'=1}^K \sum_{i=1}^N p_{Z|X}(z' \mid x^{(i)})}
               p_{W|Z}(w \mid z) = \frac{\sum_{i=1}^N p_{Z|X}(z \mid x^{(i)}) \times \text{count}(w \mid x^{(i)})}{\sum_{w' \in V} \sum_{i=1}^N p_{Z|X}(z \mid x^{(i)}) \times \text{count}(w' \mid x^{(i)})}
            where count(w | x^{(i)}) is the number of times w ∈ V appears in x^{(i)}.
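
     Below is a minimal NumPy sketch of this procedure; the code and names are illustrative assumptions, not the implementation shown on the following "Code" slides:

         import numpy as np

         def em_lvbow(docs, V, K, iters=50, seed=0):
             rng = np.random.default_rng(seed)
             word_id = {w: j for j, w in enumerate(V)}
             # counts[i, j] = count(V[j] | x^(i)): times word j appears in document i.
             counts = np.zeros((len(docs), len(V)))
             for i, doc in enumerate(docs):
                 for w in doc:
                     counts[i, word_id[w]] += 1
             # Step 1: initialize p_Z and p_{W|Z} as random distributions.
             p_Z = rng.dirichlet(np.ones(K))
             p_WZ = rng.dirichlet(np.ones(len(V)), size=K)   # row z is p_{W|Z}(. | z)
             for _ in range(iters):
                 # Step 2.1 (E-step): q[i, z] ∝ p_Z(z) * prod_t p_{W|Z}(x_t^(i) | z),
                 # computed in log space for numerical stability.
                 log_joint = np.log(p_Z + 1e-12) + counts @ np.log(p_WZ + 1e-12).T
                 q = np.exp(log_joint - log_joint.max(axis=1, keepdims=True))
                 q /= q.sum(axis=1, keepdims=True)
                 # Step 2.2 (M-step): reestimate parameters from posterior-weighted counts.
                 p_Z = q.sum(axis=0) / q.sum()
                 expected = q.T @ counts                     # expected word counts per topic
                 p_WZ = expected / expected.sum(axis=1, keepdims=True)
             return p_Z, p_WZ

         # The toy corpus from the earlier slides: with K = 2, EM separates the two "topics".
         docs = [["a"] * 10, ["b"] * 10]
         print(em_lvbow(docs, V=["a", "b"], K=2))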

  8. Code

  9. Code in Action

  10. Code in Action: Bad Initialization

  11. Another Example
      (Figure: parameter values at initialization and after convergence.)

  12. Again Possible to Get Stuck in a Local Optimum
      (Figure: parameter values at initialization and after convergence.)

  13. Why Does It Work?
      ◮ A special case of the expectation maximization (EM) algorithm adapted for LV-BOW
      ◮ EM is an extremely important and general concept
      ◮ Another special case: variational autoencoders (VAEs, next class)

  14. Setting
      ◮ Original problem: difficult to optimize (nonconvex)
            \max_{p_{XZ}} \; \mathbb{E}_{x \sim \text{pop}_X}\Big[\log \sum_{z \in \mathcal{Z}} p_{XZ}(x, z)\Big]
      ◮ Alternative problem: easy to optimize (often concave)
            \max_{p_{XZ}} \; \mathbb{E}_{x \sim \text{pop}_X, \, z \sim q_{Z|X}(\cdot \mid x)}\big[\log p_{XZ}(x, z)\big]
        where q_{Z|X} is some arbitrary posterior distribution that is easy to compute

  15. Solving the Alternative Problem
      ◮ Many models we have considered (LV-BOW, HMM, PCFG) can be written as
            p_{XZ}(x, z) = \prod_{(\tau, a) \in \mathcal{E}} p_\tau(a)^{\text{count}_\tau(a \mid x, z)}
      ◮ \mathcal{E} is a set of possible event type-value pairs.
      ◮ count_τ(a | x, z) is the number of times the event τ = a happens in (x, z).
      ◮ The model has a distribution p_τ over the possible values of each type τ.
      ◮ Examples:
            p_{XZ}((a, a, a, b, b), 2) = p_Z(2) \times p_{W|Z}(a \mid 2)^3 \times p_{W|Z}(b \mid 2)^2    (LV-BOW)
            p_{XZ}((\text{La}, \text{La}, \text{La}), (N, N, N)) = o(\text{La} \mid N)^3 \times t(N \mid *) \times t(N \mid N)^2 \times t(\text{STOP} \mid N)    (HMM)
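
      A minimal sketch of this factorization for the LV-BOW example above; the helper names and the particular parameter values are assumptions for illustration:

          from collections import Counter

          # Illustrative parameter values for the LV-BOW example p_XZ((a,a,a,b,b), 2).
          p_Z = {1: 0.5, 2: 0.5}
          p_W_given_Z = {1: {"a": 0.9, "b": 0.1}, 2: {"a": 0.4, "b": 0.6}}

          def event_counts(x, z):
              """Count each event (type, value) occurring in (x, z) under LV-BOW."""
              counts = Counter({("Z", z): 1})              # the topic event fires once
              for w in x:
                  counts[("W|Z=%d" % z, w)] += 1           # one word event per position
              return counts

          def joint_prob(x, z):
              """p_XZ(x, z) = product over events of p_tau(a) raised to its count."""
              prob = 1.0
              for (tau, a), c in event_counts(x, z).items():
                  p_tau = p_Z if tau == "Z" else p_W_given_Z[z]
                  prob *= p_tau[a] ** c
              return prob

          print(joint_prob(("a", "a", "a", "b", "b"), 2))  # 0.5 * 0.4**3 * 0.6**2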

  16. Closed-Form Solution
      If x^{(1)}, ..., x^{(N)} ∼ pop_X are iid samples,
            \max_{p_{XZ}} \; \mathbb{E}_{x \sim \text{pop}_X, \, z \sim q_{Z|X}(\cdot \mid x)}\big[\log p_{XZ}(x, z)\big]
            \approx \max_{p_{XZ}} \sum_{i=1}^N \sum_z q_{Z|X}(z \mid x^{(i)}) \log p_{XZ}(x^{(i)}, z)
            = \max_{p_\tau} \sum_{i=1}^N \sum_z q_{Z|X}(z \mid x^{(i)}) \sum_{(\tau, a) \in \mathcal{E}} \text{count}_\tau(a \mid x^{(i)}, z) \log p_\tau(a)
            = \max_{p_\tau} \sum_{(\tau, a) \in \mathcal{E}} \Big( \sum_{i=1}^N \sum_z q_{Z|X}(z \mid x^{(i)}) \, \text{count}_\tau(a \mid x^{(i)}, z) \Big) \log p_\tau(a)
      MLE solution!
            p_\tau(a) = \frac{\sum_{i=1}^N \sum_z q_{Z|X}(z \mid x^{(i)}) \, \text{count}_\tau(a \mid x^{(i)}, z)}{\sum_{a'} \sum_{i=1}^N \sum_z q_{Z|X}(z \mid x^{(i)}) \, \text{count}_\tau(a' \mid x^{(i)}, z)}
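
      A minimal sketch of this update for a single event type τ, computed from posterior-weighted counts; the helper names and the made-up posteriors below are assumptions for illustration:

          from collections import defaultdict

          def mle_update(docs, q, count_fn):
              """p_tau(a) proportional to sum_i sum_z q(z | x^(i)) * count_tau(a | x^(i), z)."""
              weighted = defaultdict(float)
              for i, x in enumerate(docs):
                  for z, q_z in q[i].items():              # q[i]: posterior over topics for doc i
                      for a, c in count_fn(x, z).items():
                          weighted[a] += q_z * c
              total = sum(weighted.values())
              return {a: v / total for a, v in weighted.items()}

          # Example: the topic type, where count_Z(a | x, z) is 1 if a == z and 0 otherwise.
          docs = [["a"] * 10, ["b"] * 10]
          q = [{1: 0.9, 2: 0.1}, {1: 0.2, 2: 0.8}]         # made-up posteriors
          print(mle_update(docs, q, lambda x, z: {z: 1}))  # {1: 0.55, 2: 0.45}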

  17. This is How We Derived the LV-BOW EM Updates
      Using q_{Z|X} = p_{Z|X}, and noting that count_Z(z | x^{(i)}, z') = 1 if z' = z (and 0 otherwise), while count_{W|Z}((z, w) | x^{(i)}, z') = count(w | x^{(i)}) if z' = z (and 0 otherwise):
            p_Z(z) = \frac{\sum_{i=1}^N \sum_{z'} p_{Z|X}(z' \mid x^{(i)}) \, \text{count}_Z(z \mid x^{(i)}, z')}{\sum_{z''} \sum_{i=1}^N \sum_{z'} p_{Z|X}(z' \mid x^{(i)}) \, \text{count}_Z(z'' \mid x^{(i)}, z')}
                   = \frac{\sum_{i=1}^N p_{Z|X}(z \mid x^{(i)})}{\sum_{z''} \sum_{i=1}^N p_{Z|X}(z'' \mid x^{(i)})}
            p_{W|Z}(w \mid z) = \frac{\sum_{i=1}^N \sum_{z'} p_{Z|X}(z' \mid x^{(i)}) \, \text{count}_{W|Z}((z, w) \mid x^{(i)}, z')}{\sum_{w' \in V} \sum_{i=1}^N \sum_{z'} p_{Z|X}(z' \mid x^{(i)}) \, \text{count}_{W|Z}((z, w') \mid x^{(i)}, z')}
                              = \frac{\sum_{i=1}^N p_{Z|X}(z \mid x^{(i)}) \, \text{count}(w \mid x^{(i)})}{\sum_{w' \in V} \sum_{i=1}^N p_{Z|X}(z \mid x^{(i)}) \, \text{count}(w' \mid x^{(i)})}

  18. Game Plan
      ◮ So we have established that it is often easy to solve the alternative problem
            \max_{p_{XZ}} \; \mathbb{E}_{x \sim \text{pop}_X, \, z \sim q_{Z|X}(\cdot \mid x)}\big[\log p_{XZ}(x, z)\big]
        where q_{Z|X} is any posterior distribution that is easy to compute.
      ◮ On the next slide, we relate the original log likelihood objective to this quantity.

  19. ELBO: Evidence Lower Bound
      For any q_{Z|X} we define
            ELBO(p_{XZ}, q_{Z|X}) = \mathbb{E}_{x \sim \text{pop}_X, \, z \sim q_{Z|X}(\cdot \mid x)}\big[\log p_{XZ}(x, z)\big] + H(q_{Z|X})
      where H(q_{Z|X}) = \mathbb{E}_{x \sim \text{pop}_X, \, z \sim q_{Z|X}(\cdot \mid x)}\big[-\log q_{Z|X}(z \mid x)\big].
      Claim. For all q_{Z|X},
            ELBO(p_{XZ}, q_{Z|X}) \leq \mathbb{E}_{x \sim \text{pop}_X}\Big[\log \sum_{z \in \mathcal{Z}} p_{XZ}(x, z)\Big]
      with equality iff q_{Z|X} = p_{Z|X}. (Proof on board; a sketch is given below.)
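
      The proof is only referenced above ("on board"); the following is a sketch of the standard argument for a fixed x (the claim follows by taking the expectation over pop_X):

          \begin{align*}
          \log \sum_{z \in \mathcal{Z}} p_{XZ}(x, z)
            &= \log \sum_{z \in \mathcal{Z}} q_{Z|X}(z \mid x) \, \frac{p_{XZ}(x, z)}{q_{Z|X}(z \mid x)} \\
            &\geq \sum_{z \in \mathcal{Z}} q_{Z|X}(z \mid x) \log \frac{p_{XZ}(x, z)}{q_{Z|X}(z \mid x)}
               && \text{(Jensen's inequality, since $\log$ is concave)} \\
            &= \mathbb{E}_{z \sim q_{Z|X}(\cdot \mid x)}\big[\log p_{XZ}(x, z)\big]
               - \mathbb{E}_{z \sim q_{Z|X}(\cdot \mid x)}\big[\log q_{Z|X}(z \mid x)\big],
          \end{align*}

      and the gap between the two sides equals D_{KL}(q_{Z|X}(· | x) ‖ p_{Z|X}(· | x)), which is zero iff q_{Z|X}(· | x) = p_{Z|X}(· | x).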

  20. EM: Coordinate Ascent on ELBO
      Input: sampling access to pop_X, definition of p_{XZ}
      Output: local optimum of
            \max_{p_{XZ}} \; \mathbb{E}_{x \sim \text{pop}_X}\Big[\log \sum_{z \in \mathcal{Z}} p_{XZ}(x, z)\Big]
      1. Initialize p_{XZ} (e.g., random distribution).
      2. Repeat until convergence:
            q_{Z|X} \leftarrow \arg\max_{\bar{q}_{Z|X}} \; ELBO(p_{XZ}, \bar{q}_{Z|X})
            p_{XZ} \leftarrow \arg\max_{\bar{p}_{XZ}} \; ELBO(\bar{p}_{XZ}, q_{Z|X})
      3. Return p_{XZ}

  21. Equivalently
      Input: sampling access to pop_X, definition of p_{XZ}
      Output: local optimum of
            \max_{p_{XZ}} \; \mathbb{E}_{x \sim \text{pop}_X}\Big[\log \sum_{z \in \mathcal{Z}} p_{XZ}(x, z)\Big]
      1. Initialize p_{XZ} (e.g., random distribution).
      2. Repeat until convergence:
            p_{XZ} \leftarrow \arg\max_{\bar{p}_{XZ}} \; \mathbb{E}_{x \sim \text{pop}_X, \, z \sim p_{Z|X}(\cdot \mid x)}\big[\log \bar{p}_{XZ}(x, z)\big]
      3. Return p_{XZ}

  22. EM Can Only Increase the Objective (Or Leave It Unchanged)
      (Figure: a bar diagram of one EM iteration, showing ELBO(p_{XZ}, q_{Z|X}) rising to LL(p_{XZ}) = ELBO(p_{XZ}, p_{Z|X}) after the E-step, and then to ELBO(p'_{XZ}, q'_{Z|X}) ≤ LL(p'_{XZ}) after the M-step.)
            LL(p_{XZ}) = \mathbb{E}_{x \sim \text{pop}_X}\Big[\log \sum_{z \in \mathcal{Z}} p_{XZ}(x, z)\Big]
            ELBO(p_{XZ}, q_{Z|X}) = LL(p_{XZ}) - D_{KL}(q_{Z|X} \,\|\, p_{Z|X})
                                  = \mathbb{E}_{x \sim \text{pop}_X, \, z \sim q_{Z|X}(\cdot \mid x)}\big[\log p_{XZ}(x, z)\big] + H(q_{Z|X})
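
      Written out as a chain (a sketch of the argument suggested by the figure, using the identity above): if the E-step sets q_{Z|X} = p_{Z|X} and the M-step then produces p'_{XZ},

          \begin{align*}
          LL(p_{XZ}) &= ELBO(p_{XZ}, q_{Z|X})
              && \text{(E-step: } q_{Z|X} = p_{Z|X} \text{ makes the KL term zero)} \\
            &\leq ELBO(p'_{XZ}, q_{Z|X})
              && \text{(M-step: } p'_{XZ} \text{ maximizes } ELBO(\cdot, q_{Z|X})\text{)} \\
            &\leq LL(p'_{XZ})
              && \text{(ELBO} \leq \text{LL, since the KL divergence is nonnegative)}
          \end{align*}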

  23. EM Can Only Increase the Objective (Or Leave It Unchanged)
      (Figure from https://media.nature.com/full/nature-assets/nbt/journal/v26/n8/extref/nbt1406-S1.pdf)

  24. Sample Version
      Input: N iid samples from pop_X, definition of p_{XZ}
      Output: local optimum of
            \max_{p_{XZ}} \frac{1}{N} \sum_{i=1}^N \log \sum_{z \in \mathcal{Z}} p_{XZ}(x^{(i)}, z)
      1. Initialize p_{XZ} (e.g., random distribution).
      2. Repeat until convergence:
            p_{XZ} \leftarrow \arg\max_{\bar{p}_{XZ}} \sum_{i=1}^N \sum_{z \in \mathcal{Z}} p_{Z|X}(z \mid x^{(i)}) \log \bar{p}_{XZ}(x^{(i)}, z)
      3. Return p_{XZ}
