Lecture 3: Comparing frequentist and Bayesian estimation techniques

  1. CS598JHM: Advanced NLP (Spring 2013)
     http://courses.engr.illinois.edu/cs598jhm/
     Lecture 3: Comparing frequentist and Bayesian estimation techniques
     Julia Hockenmaier, juliahmr@illinois.edu, 3324 Siebel Center
     Office hours: by appointment

  2. Text classification
     The task: binary classification (e.g. sentiment analysis).
     Assign a (sentiment) label L_i ∈ {+, −} to a document W_i = (w_i1, ..., w_iN_i).
       W_1 = "This is an amazing product: great battery life, amazing features and it's cheap."
       W_2 = "How awful. It's buggy, saps power and is way too expensive."
     The data: a set D of D documents, with or without labels.
     The model: Naive Bayes.
     We will use a frequentist model and a Bayesian model, and compare supervised and
     unsupervised estimation techniques for them.

  3. A Naive Bayes model
     The task: assign a (sentiment) label L_i ∈ {+, −} to document W_i.
       W_1 = "This is an amazing product: great battery life, amazing features and it's cheap."
       W_2 = "How awful. It's buggy, saps power and is way too expensive."
     The model:
       L_i = argmax_L P(L | W_i) = argmax_L P(W_i | L) P(L)
     Assume W_i is a "bag of words":
       W_1 = {an: 1, and: 1, amazing: 2, battery: 1, cheap: 1, features: 1, great: 1, ...}
       W_2 = {awful: 1, and: 1, buggy: 1, expensive: 1, ...}
     P(W_i | L) is a multinomial distribution over a vocabulary of V words,
     with parameter θ_L = (θ_1, ..., θ_V):  W_i ∼ Multinomial(θ_L)
     P(L) is a Bernoulli distribution:  L ∼ Bernoulli(π)
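A minimal Python sketch of this decision rule, assuming the class-conditional word distributions theta[label] and the prior pi are already estimated (estimation follows on the next slides); the names and the tiny probability floor for unseen words are illustrative, not from the slides:

    import math
    from collections import Counter

    def classify(doc_words, pi, theta):
        """Return the label maximizing P(L) * prod_j P(w_j | L)^count(w_j), in log space."""
        counts = Counter(doc_words)                 # bag-of-words representation
        prior = {"+": pi, "-": 1.0 - pi}            # L ~ Bernoulli(pi)
        scores = {}
        for label in ("+", "-"):
            score = math.log(prior[label])
            for word, n in counts.items():
                # unseen words get a tiny floor here; plain MLE would assign them probability 0
                score += n * math.log(theta[label].get(word, 1e-12))
            scores[label] = score
        return max(scores, key=scores.get)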

  4. The frequentist (maximum-likelihood) model

  5. The frequentist model
     The frequentist model has specific parameters θ_L and π:
       L_i = argmax_L P(W_i | θ_L) P(L | π)
     P(W_i | θ_L) is a multinomial over V words with parameter θ_L = (θ_1, ..., θ_V):
       W_i ∼ Multinomial(θ_L)
     P(L | π) is a Bernoulli distribution with parameter π:
       L ∼ Bernoulli(π)

  6. The frequentist model
     [Plate diagram: π generates each label L_i; θ_L (one copy per label, 2 in total)
     generates each word w_ij; plates over the N_i words per document and the N documents.]

  7. Supervised MLE
     The data is labeled: we have a set D of D documents W_1 ... W_D with N words in total.
     - Each document W_i has N_i words.
     - D_+ documents (subset D_+) have a positive label and contain N_+ words.
     - D_− documents (subset D_−) have a negative label and contain N_− words.
     - Each word w_i appears N_+(w_i) times in D_+ and N_−(w_i) times in D_−.
     - Each word w_i appears N_j(w_i) times in document W_j.
     MLE: relative frequency estimation
     - Labels: L ∼ Bernoulli(π) with π = D_+ / D
     - Words:  W_i | + ∼ Multinomial(θ_+) with θ_{i+} = N_+(w_i) / N_+
     - Words:  W_i | − ∼ Multinomial(θ_−) with θ_{i−} = N_−(w_i) / N_−
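A sketch of these relative-frequency estimates in Python, assuming the labeled data is given as a list of (label, word-list) pairs; the data format and function name are illustrative:

    from collections import Counter

    def supervised_mle(labeled_docs):
        """labeled_docs: list of (label, words) pairs with label in {'+', '-'}.
        Returns pi = P(L = +) and per-label word distributions theta[label][word]."""
        doc_counts = Counter(label for label, _ in labeled_docs)
        word_counts = {"+": Counter(), "-": Counter()}
        for label, words in labeled_docs:
            word_counts[label].update(words)
        pi = doc_counts["+"] / len(labeled_docs)                      # pi = D_+ / D
        theta = {}
        for label, counts in word_counts.items():
            total = sum(counts.values())                              # N_+ or N_-
            theta[label] = {w: c / total for w, c in counts.items()}  # N_L(w_i) / N_L
        return pi, theta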

  8. Inference with MLE
     The inference task: given a new document W_{i+1}, what is its label L_{i+1}?
     Recall: the word w_j occurs N_{i+1}(w_j) times in W_{i+1}.
       P(L = + | W_{i+1}) ∝ P(+) P(W_{i+1} | +) = π ∏_{j=1..V} θ_{+j}^{N_{i+1}(w_j)}
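Putting the two sketches above together on a tiny made-up dataset (purely illustrative):

    docs = [("+", "amazing product great battery".split()),
            ("-", "awful buggy expensive".split()),
            ("+", "great features cheap amazing".split())]
    pi, theta = supervised_mle(docs)
    print(classify("great cheap battery".split(), pi, theta))   # prints '+'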

  9. Unsupervised MLE
     The data is unlabeled: we have a set D of D documents W_1 ... W_D with N words in total.
     - Each document W_i has N_i words.
     - Each vocabulary word w_i (of w_1, ..., w_V) appears N_j(w_i) times in document W_j.
     EM algorithm: "expected relative frequency estimation"
     Initialization: pick initial π^(0), θ_+^(0), θ_−^(0)
     Iterate:
     - Labels: L ∼ Bernoulli(π) with π^(t) = ⟨N_+⟩^(t−1) / ⟨N⟩^(t−1)
     - Words:  W_i | + ∼ Multinomial(θ_+) with θ_{i+}^(t) = ⟨N_+(w_i)⟩^(t−1) / ⟨W_+⟩^(t−1)
     - Words:  W_i | − ∼ Multinomial(θ_−) with θ_{i−}^(t) = ⟨N_−(w_i)⟩^(t−1) / ⟨W_−⟩^(t−1)

  10. Maximum Likelihood estimation
     With complete (= labeled) data D = {⟨X_i, Z_i⟩}, maximize the complete likelihood p(X, Z | θ):
       θ* = argmax_θ ∏_i p(X_i, Z_i | θ)
     or, equivalently,
       θ* = argmax_θ ∑_i ln p(X_i, Z_i | θ)

  11. Maximum Likelihood estimation
     With incomplete (= unlabeled) data D = {⟨X_i, ?⟩}, maximize the incomplete (marginal)
     likelihood p(X | θ):
       θ* = argmax_θ ∑_i ln p(X_i | θ)
          = argmax_θ ∑_i ln( ∑_Z p(Z | X_i, θ') · p(X_i, Z | θ) / p(Z | X_i, θ') )
          = argmax_θ ∑_i ln( E_{Z|X_i,θ'}[ p(X_i, Z | θ) / p(Z | X_i, θ') ] )
     p(Z | X, θ): the posterior probability of Z (X = our data)
     E_{Z|X_i,θ'}[·]: the expectation with respect to p(Z | X_i, θ')
     Find parameters θ_new that maximize the expected log-likelihood of the joint
     p(Z, X | θ_new) under p(Z | X, θ_old). This requires an iterative approach.
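A brief aside (standard EM reasoning, not spelled out on this slide) on why this leads to an iterative procedure: with q(Z) = p(Z | X_i, θ_old), Jensen's inequality gives a lower bound on the marginal log-likelihood whose θ-dependent part is exactly the expected complete log-likelihood maximized on the next slide:

    \ln p(X_i \mid \theta)
      = \ln \sum_Z q(Z)\, \frac{p(X_i, Z \mid \theta)}{q(Z)}
      \;\ge\; \sum_Z q(Z) \ln \frac{p(X_i, Z \mid \theta)}{q(Z)}
      = \sum_Z p(Z \mid X_i, \theta_{\mathrm{old}}) \ln p(X_i, Z \mid \theta)
        \;-\; \sum_Z q(Z) \ln q(Z)

The second sum does not depend on θ, so maximizing the bound over θ amounts to maximizing the expected log-likelihood term.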

  12. The EM algorithm
     1. Initialization: choose initial parameters θ_old.
     2. Expectation step: compute p(Z | X, θ_old) (= the posterior of the latent variables Z).
     3. Maximization step: compute θ_new, the parameters that maximize the expected
        log-likelihood of the joint p(X, Z | θ) under p(Z | X, θ_old):
          θ_new = argmax_θ ∑_Z p(Z | X, θ_old) ln p(X, Z | θ)
     4. Check for convergence. Stop, or set θ_old := θ_new and go to 2.

  13. The EM algorithm
     The classes we find may not correspond to the classes we are interested in.
     - Seed knowledge (e.g. a few positive and negative words) may help.
     We are not guaranteed to find a global optimum, and may get stuck in a local optimum.
     - Initialization matters.

  14. In our example...
     Initialization: pick (random) π_A, π_B = (1 − π_A), θ_A, θ_B
     E-step:
       Set N_A, N_B, N_A(w_1), ..., N_A(w_V), N_B(w_1), ..., N_B(w_V) := 0
       For each document W_i:
         Compute P(L_i = A | W_i, π_A, π_B, θ_A, θ_B) ∝ π_A ∏_j P(w_ij | θ_A)
         and     P(L_i = B | W_i, π_A, π_B, θ_A, θ_B) ∝ π_B ∏_j P(w_ij | θ_B)
         Update  N_A += P(L_i = A | W_i, ...)   and   N_B += P(L_i = B | W_i, ...)
         For all words w_ij in W_i:
           N_A(w_ij) += P(L_i = A | W_i, ...)   and   N_B(w_ij) += P(L_i = B | W_i, ...)
     M-step:
       π_A := N_A / (N_A + N_B)                 π_B := N_B / (N_A + N_B)
       θ_A(w_i) := N_A(w_i) / ∑_j N_A(w_j)      θ_B(w_i) := N_B(w_i) / ∑_j N_B(w_j)
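A compact Python sketch of this E/M loop for the two-class Naive Bayes mixture; the function name, random initialization, fixed iteration count, and the tiny smoothing floor (added only to avoid log(0)) are illustrative choices, not from the slides:

    import math, random
    from collections import Counter

    def em(docs, vocab, iters=50, seed=0):
        """docs: list of word lists; vocab: set of all words occurring in docs.
        Returns (pi_A, theta_A, theta_B) for a two-class Naive Bayes mixture."""
        rng = random.Random(seed)
        pi_A = 0.5
        theta = {lab: {w: rng.random() for w in vocab} for lab in "AB"}
        for lab in "AB":                                    # normalize the random initialization
            z = sum(theta[lab].values())
            theta[lab] = {w: p / z for w, p in theta[lab].items()}
        for _ in range(iters):
            N = {"A": 0.0, "B": 0.0}                        # expected document counts
            Nw = {"A": Counter(), "B": Counter()}           # expected word counts
            for words in docs:                              # E-step: P(L_i | W_i, pi, theta)
                logp = {"A": math.log(pi_A), "B": math.log(1.0 - pi_A)}
                for w in words:
                    for lab in "AB":
                        logp[lab] += math.log(theta[lab][w])
                m = max(logp.values())
                p = {lab: math.exp(lp - m) for lab, lp in logp.items()}
                z = p["A"] + p["B"]
                for lab in "AB":
                    gamma = p[lab] / z                      # soft label for this document
                    N[lab] += gamma
                    for w in words:
                        Nw[lab][w] += gamma
            pi_A = N["A"] / (N["A"] + N["B"])               # M-step: expected relative frequencies
            for lab in "AB":
                total = sum(Nw[lab].values())
                theta[lab] = {w: (Nw[lab][w] + 1e-10) / (total + 1e-10 * len(vocab))
                              for w in vocab}
        return pi_A, theta["A"], theta["B"]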

  15. The Bayesian model

  16. The Bayesian model
     The Bayesian model has priors Dir(γ) and Beta(α, β), with hyperparameters
     γ = (γ_1, ..., γ_V) and α, β.
     It does not have specific θ_L and π, but integrates them out:
       L_i = argmax_L ∫∫ P(W_i | θ_L) P(θ_L; γ_L, D) P(L | π) P(π; α, β, D) dθ_L dπ
           = argmax_L ∫ P(W_i | θ_L) P(θ_L; γ_L, D) dθ_L · ∫ P(L | π) P(π; α, β, D) dπ
           = argmax_L P(W_i | γ_L, D) P(L | α, β, D)
     P(W_i | θ_L) is a multinomial with parameter θ_L = (θ_1, ..., θ_V);
     P(θ_L; γ_L) is a Dirichlet with hyperparameter γ_L = (γ_1, ..., γ_V):
       θ_L ∼ Dirichlet(γ_L)    W_i ∼ Multinomial(θ_L)
     P(L | π) is a Bernoulli with parameter π, drawn from a Beta prior:
       π ∼ Beta(α, β)    L ∼ Bernoulli(π)
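For a single word under the positive class these integrals have a closed form (the standard Dirichlet-multinomial posterior predictive, stated here as background and used on slide 18), writing γ_0 = ∑_j γ_j:

    P(w_j \mid +, \mathcal{D})
      = \int P(w_j \mid \theta_+) \, p(\theta_+ \mid \gamma, \mathcal{D}) \, d\theta_+
      = \mathbb{E}[\theta_{+j} \mid \mathcal{D}]
      = \frac{N_+(w_j) + \gamma_j}{N_+ + \gamma_0}

Analogously, the Beta prior gives P(L = + | D) = (D_+ + α) / (D + α + β).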

  17. The Bayesian model
     [Plate diagram: hyperparameters α, β generate π, which generates each label L_i;
     hyperparameter γ generates θ_L (one copy per label, 2 in total), which generates each
     word w_ij; plates over the N_i words per document and the N documents.]

  18. Bayesian: supervised
     The data is labeled: we have a set D of D documents W_1 ... W_D with N words in total.
     - Each document W_i has N_i words.
     - D_+ documents (subset D_+) have a positive label and contain N_+ words.
     - D_− documents (subset D_−) have a negative label and contain N_− words.
     - Each word w_i appears N_+(w_i) times in D_+ and N_−(w_i) times in D_−.
     - Each word w_j appears N_i(w_j) times in W_i.
     Bayesian estimation (with γ_0 = ∑_i γ_i):
       P(L = + | D) = (D_+ + α) / (D + α + β)
       P(w_i | +, D) = (N_+(w_i) + γ_i) / (N_+ + γ_0)
       P(W_i | +, D) = ∏_j P(w_j | +, D)^{N_i(w_j)}
       P(L_i = + | W_i, D) ∝ [(D_+ + α) / (D + α + β)] ∏_j P(w_j | +, D)^{N_i(w_j)}
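A sketch of these smoothed estimates in Python, reusing the (label, word-list) data format from the MLE sketch above; the symmetric hyperparameter defaults (alpha = beta = 1, a single gamma shared by all words) are illustrative assumptions:

    import math
    from collections import Counter

    def bayesian_supervised(labeled_docs, vocab, alpha=1.0, beta=1.0, gamma=1.0):
        """Posterior-predictive label prior and word probabilities (Beta/Dirichlet priors)."""
        D = len(labeled_docs)
        D_pos = sum(1 for label, _ in labeled_docs if label == "+")
        counts = {"+": Counter(), "-": Counter()}
        for label, words in labeled_docs:
            counts[label].update(words)
        gamma_0 = gamma * len(vocab)
        p_label = {"+": (D_pos + alpha) / (D + alpha + beta),
                   "-": (D - D_pos + beta) / (D + alpha + beta)}
        p_word = {}
        for label in ("+", "-"):
            N_label = sum(counts[label].values())
            p_word[label] = {w: (counts[label][w] + gamma) / (N_label + gamma_0) for w in vocab}
        return p_label, p_word

    def classify_bayes(doc_words, p_label, p_word):
        """argmax_L P(L | D) * prod_j P(w_j | L, D)^count(w_j), in log space."""
        scores = {}
        for label in ("+", "-"):
            score = math.log(p_label[label])
            for w, n in Counter(doc_words).items():
                score += n * math.log(p_word[label][w])   # assumes w is in vocab
            scores[label] = score
        return max(scores, key=scores.get)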

  19. Bayesian: unsupervised
     We need to approximate an integral/expectation:
       p(L_i = + | W_i) ∝ ∫∫ p(W_i | +, θ_+) p(θ_+; γ, D) p(L = + | π) p(π; α, β, D) dθ_+ dπ
                        ∝ ∫ p(W_i | +, θ_+) p(θ_+; γ, D) dθ_+ · ∫ p(L = + | π) p(π; α, β, D) dπ
                        ∝ p(W_i | +, γ, D) p(L_i = + | α, β, D)

  20. Approximating expectations
       E[f(x)] = ∫ f(x) p(x) dx
               = lim_{N→∞} (1/N) ∑_{i=1}^{N} f(x^(i))   for x^(1), ..., x^(N) drawn from p(x)
               ≈ (1/T) ∑_{i=1}^{T} f(x^(i))             for x^(1), ..., x^(T) drawn from p(x)
     We can approximate the expectation of f(x), ⟨f(x)⟩ = ∫ f(x) p(x) dx, by sampling a finite
     number of points x^(1), ..., x^(T) according to p(x), evaluating f(x^(i)) for each of them,
     and computing the average.
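As a small illustrative example (not from the slides), estimating E[x] under a Beta(2, 5) distribution by Monte Carlo and comparing to the exact mean 2/7:

    import random

    random.seed(0)
    T = 100_000
    samples = [random.betavariate(2, 5) for _ in range(T)]   # x^(i) ~ Beta(2, 5)
    estimate = sum(samples) / T                              # (1/T) * sum_i f(x^(i)) with f(x) = x
    print(estimate, 2 / 7)                                   # the two values should be close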

  21. Markov Chain Monte Carlo
     A multivariate distribution p(x) = p(x_1, ..., x_k) with discrete x_i has only a finite
     number of possible outcomes.
     Markov Chain Monte Carlo methods construct a Markov chain whose states are the outcomes
     of p(x), such that the probability of visiting state x_j is p(x_j).
     We sample from p(x) by visiting a sequence of states from this Markov chain.

  22. Gibbs sampling
     Our states: one label assignment L_1, ..., L_N to each of our N documents: x = (L_1, ..., L_N)
     Our transitions: we go from one label assignment x = (+, +, −, +, −, ..., +)
     to another y = (−, +, +, +, ..., +).
     Our intermediate steps: we generate label Y_i conditioned on Y_1 ... Y_{i−1} and X_{i+1} ... X_N.
     Call the label assignment Y_1, ..., Y_{i−1}, X_{i+1}, ..., X_N  L^(−i).
     We need to compute P(Y_i | D, L^(−i), α, β, γ).
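A sketch of a collapsed Gibbs sampler over the document labels, one common way to realize this conditional (with π and θ integrated out, the document term becomes a Dirichlet-multinomial ratio of Gamma functions); the function name, initialization, and iteration count are illustrative:

    import math, random
    from collections import Counter

    def gibbs_labels(docs, vocab, alpha=1.0, beta=1.0, gamma=1.0, iters=200, seed=0):
        """docs: list of word lists. Resamples each L_i from P(L_i | D, L^(-i), alpha, beta, gamma)."""
        rng = random.Random(seed)
        V = len(vocab)
        labels = [rng.choice("+-") for _ in docs]
        n_docs = Counter(labels)                       # number of documents per label
        n_words = {"+": Counter(), "-": Counter()}     # word counts per label
        for lab, words in zip(labels, docs):
            n_words[lab].update(words)

        def log_pred(words, lab):
            # log P(W_i | L_i = lab, all other docs): Dirichlet-multinomial predictive
            n_lab = sum(n_words[lab].values())
            lp = math.lgamma(n_lab + gamma * V) - math.lgamma(n_lab + len(words) + gamma * V)
            for w, k in Counter(words).items():
                lp += math.lgamma(n_words[lab][w] + k + gamma) - math.lgamma(n_words[lab][w] + gamma)
            return lp

        for _ in range(iters):
            for i, words in enumerate(docs):
                old = labels[i]                        # remove document i from the counts
                n_docs[old] -= 1
                n_words[old].subtract(words)
                logp = {lab: math.log(n_docs[lab] + pseudo) + log_pred(words, lab)
                        for lab, pseudo in (("+", alpha), ("-", beta))}
                m = max(logp.values())
                p_plus = math.exp(logp["+"] - m)
                p_plus /= p_plus + math.exp(logp["-"] - m)
                new = "+" if rng.random() < p_plus else "-"
                labels[i] = new                        # add document i back with its new label
                n_docs[new] += 1
                n_words[new].update(words)
        return labels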
