Lecture 3: Comparing frequentist and Bayesian estimation techniques


SLIDE 1

CS598JHM: Advanced NLP (Spring 2013)
http://courses.engr.illinois.edu/cs598jhm/

Lecture 3: Comparing frequentist and Bayesian estimation techniques

Julia Hockenmaier
juliahmr@illinois.edu
3324 Siebel Center
Office hours: by appointment

SLIDE 2

Text classification

The task: binary classification (e.g. sentiment analysis)

Assign (sentiment) label Li ∈ { +,−} to a document Wi=(wi1...wiN).

W1= “This is an amazing product: great battery life, amazing features and it’s cheap.” W2= “How awful. It’s buggy, saps power and is way too expensive.”

The data: a set D of documents, with or without labels.
The model: Naive Bayes.
We will use a frequentist model and a Bayesian model, and compare supervised and unsupervised estimation techniques for them.

SLIDE 3

A Naive Bayes model

The task: Assign (sentiment) label Li ∈ {+,−} to document Wi.

W1= “This is an amazing product: great battery life, amazing features and it’s cheap.” W2= “How awful. It’s buggy, saps power and is way too expensive.”

The model: Li = argmax_L P(L | Wi) = argmax_L P(Wi | L) P(L)

Assume Wi is a "bag of words":

W1 = {an: 1, and: 1, amazing: 2, battery: 1, cheap: 1, features: 1, great: 1, …}
W2 = {awful: 1, and: 1, buggy: 1, expensive: 1, …}

P(Wi | L) is a multinomial distribution: Wi ∼ Multinomial(θL)
With a vocabulary of V words, θL = (θ1, …, θV)
P(L) is a Bernoulli distribution: L ∼ Bernoulli(π)
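To make the generative story concrete, here is a minimal sketch (NumPy; the toy vocabulary and the π and θ values are made up for illustration) that first draws a label L ∼ Bernoulli(π) and then draws a bag of words W ∼ Multinomial(θL):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["amazing", "great", "cheap", "awful", "buggy", "expensive"]   # toy vocabulary
pi = 0.5                                                               # P(L = +), made-up value
theta = {
    "+": np.array([0.3, 0.25, 0.25, 0.05, 0.05, 0.1]),                 # theta_+ over the vocabulary
    "-": np.array([0.05, 0.05, 0.1, 0.3, 0.25, 0.25]),                 # theta_- over the vocabulary
}

def generate_document(n_words):
    """Sample a label L ~ Bernoulli(pi), then a bag of n_words words W ~ Multinomial(theta_L)."""
    label = "+" if rng.random() < pi else "-"
    counts = rng.multinomial(n_words, theta[label])                     # word counts for one document
    return label, dict(zip(vocab, counts))

print(generate_document(10))
```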

SLIDE 4

The frequentist (maximum-likelihood) model

SLIDE 5

The frequentist model

The frequentist model has specific parameters θL and π:
Li = argmax_L P(Wi | θL) P(L | π)

P(Wi | θL) is a multinomial over V words with parameter θL = (θ1, …, θV): Wi ∼ Multinomial(θL)
P(L | π) is a Bernoulli distribution with parameter π: L ∼ Bernoulli(π)

SLIDE 6

The frequentist model

[Plate diagram of the frequentist model: label Li ∼ Bernoulli(π); each word wij of document Wi (Ni words per document, N documents) is drawn from Multinomial(θL), with one θL for each of the 2 labels.]

SLIDE 7

Supervised MLE

The data is labeled:

  • We have a set D of D documents W1...WD with N words in total
  • Each document Wi has Ni words
  • D+ documents (the subset D+) have a positive label and N+ words
  • D− documents (the subset D−) have a negative label and N− words
  • Each word wi appears N+(wi) times in D+ and N−(wi) times in D−
  • Each word wi appears Nj(wi) times in document Wj

MLE: relative frequency estimation

  • Labels: L ∼ Bernoulli(π) with π = D+/D
  • Words: Wi |+ ∼ Multinomial(θ+) with θi+ = N+(wi)/N+
  • Words: Wi |− ∼ Multinomial(θ−) with θi− = N−(wi)/N−
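A small sketch of these relative-frequency estimates in code (the document representation as (label, Counter) pairs and all names are illustrative, not from the slides):

```python
from collections import Counter

def mle_estimate(docs):
    """docs: list of (label, Counter-of-word-counts) pairs with label in {'+', '-'}.
    Returns pi = D+/D and theta[L][w] = N_L(w)/N_L (unsmoothed relative frequencies)."""
    d_pos = sum(1 for label, _ in docs if label == "+")
    pi = d_pos / len(docs)                        # P(L = +) = D+ / D
    counts = {"+": Counter(), "-": Counter()}
    for label, bag in docs:
        counts[label].update(bag)                 # accumulate N_L(w)
    theta = {}
    for label, c in counts.items():
        total = sum(c.values())                   # N_L = total number of words with label L
        theta[label] = {w: n / total for w, n in c.items()}
    return pi, theta
```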

SLIDE 8

Inference with MLE

The inference task: Given a new document Wi+1, what is its label Li+1? Recall: the word wj occurs Ni+1(wj) times in Wi+1.

P(L = + | Wi+1) ∝ P(+) P(Wi+1 | +) = π ∏j=1..V θ+j^Ni+1(wj)
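The same decision rule as a code sketch, computed in log space to avoid underflow. Note that under the unsmoothed MLE an unseen word gets probability zero, which this sketch handles by giving the whole class a score of −∞ (one of the problems the Bayesian priors address later):

```python
import math

def classify_mle(bag, pi, theta):
    """Return the label maximizing P(L) * prod_j theta_L[w_j]^count(w_j), in log space.
    bag: dict (or Counter) mapping the new document's words to their counts.
    pi, theta: MLE parameters, e.g. as returned by the mle_estimate sketch above."""
    scores = {}
    for label, prior in (("+", pi), ("-", 1.0 - pi)):
        log_p = math.log(prior)
        for w, n in bag.items():
            p = theta[label].get(w, 0.0)
            if p > 0.0:
                log_p += n * math.log(p)          # + N_{i+1}(w_j) * log theta_{L,j}
            else:
                log_p = float("-inf")             # unseen word: zero probability under the MLE
                break
        scores[label] = log_p
    return max(scores, key=scores.get)
```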

SLIDE 9

Unsupervised MLE

The data is unlabeled:

  • We have a set D of D documents W1...WD with N words in total
  • Each document Wi has Ni words
  • Each word wi (from the vocabulary w1...wV) appears Nj(wi) times in document Wj

EM algorithm: “expected relative frequency estimation”

Initialization: pick initial π(0), θ+(0), θ−(0)
Iterate (for t = 1, 2, …):

  • Labels: L ∼ Bernoulli(π) with π(t) = 〈D+〉(t−1)/D
  • Words: Wi |+ ∼ Multinomial(θ+) with θi+(t) = 〈N+(wi)〉(t−1) / 〈N+〉(t−1)
  • Words: Wi |− ∼ Multinomial(θ−) with θi−(t) = 〈N−(wi)〉(t−1) / 〈N−〉(t−1)

SLIDE 10

Maximum Likelihood estimation

With complete (= labeled) data D = {〈Xi, Zi〉}, maximize the complete likelihood p(X, Z | θ):

θ* = argmax_θ ∏i p(Xi, Zi | θ) = argmax_θ ∑i ln p(Xi, Zi | θ)

SLIDE 11

Maximum Likelihood estimation

With incomplete (= unlabeled) data D = {〈Xi, ?〉}, maximize the incomplete (marginal) likelihood p(X | θ):

θ* = argmax_θ ∑i ln p(Xi | θ)
   = argmax_θ ∑i ln( ∑Z p(Xi, Z | θ) p(Z | Xi, θ′) )
   = argmax_θ ∑i ln( E_{Z|Xi,θ′}[ p(Xi, Z | θ) ] )

p(Z | X, θ′): the posterior probability of Z (X = our data)
E_{Z|Xi,θ′}[ p(Xi, Z | θ) ]: the expectation of p(Xi, Z | θ) w.r.t. p(Z | Xi, θ′)

Find parameters θnew that maximize the expected log-likelihood of the joint p(Z, X | θnew) under p(Z | X, θold).
This requires an iterative approach.

SLIDE 12

The EM algorithm

  • 1. Initialization: choose initial parameters θold.
  • 2. Expectation step: compute p(Z | X, θold), the posterior of the latent variables Z.
  • 3. Maximization step: compute the θnew that maximizes the expected log-likelihood of the joint p(Z, X | θnew) under p(Z | X, θold):

    θnew = argmax_θ ∑Z p(Z | X, θold) ln p(X, Z | θ)

  • 4. Check for convergence: stop, or set θold := θnew and go to step 2.

SLIDE 13

The EM algorithm

The classes we find may not correspond to the classes we would be interested in.

Seed knowledge (e.g. a few positive and negative words) may help

We are not guaranteed to find a global optimum, and may get stuck in a local optimum.

Initialization matters

SLIDE 14

In our example...

Initialization: pick (random) πA, πB = (1 − πA), θA, θB

E-step:
Set NA, NB, NA(w1), …, NA(wV), NB(w1), …, NB(wV) := 0
For each document Wi:
  • Compute P(Li = A | Wi, πA, πB, θA, θB) ∝ πA ∏j P(wij | θA)
    and P(Li = B | Wi, πA, πB, θA, θB) ∝ πB ∏j P(wij | θB)
  • Update NA += P(Li = A | Wi, πA, πB, θA, θB)
    and NB += P(Li = B | Wi, πA, πB, θA, θB)
  • For all words wij in Wi:
    NA(wij) += P(Li = A | Wi, πA, πB, θA, θB)
    NB(wij) += P(Li = B | Wi, πA, πB, θA, θB)

M-step (a code sketch of this loop follows below):
  • πA := NA/(NA + NB)
  • πB := NB/(NA + NB)
  • θA(wi) := NA(wi) / ∑j NA(wj)
  • θB(wi) := NB(wi) / ∑j NB(wj)
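A compact sketch of this E-step/M-step loop for the two-class model, in plain Python (illustrative names; it assumes every word in the documents appears in vocab, and a practical implementation would add smoothing so no θ collapses to zero):

```python
import math, random
from collections import Counter

def em_naive_bayes(docs, vocab, iters=50, seed=0):
    """docs: list of Counters (unlabeled bags of words). Returns (pi_A, theta_A, theta_B)."""
    rng = random.Random(seed)
    pi_A = rng.uniform(0.25, 0.75)
    theta = {c: {w: rng.random() for w in vocab} for c in "AB"}
    for c in "AB":                                    # normalize the random initial multinomials
        z = sum(theta[c].values())
        theta[c] = {w: p / z for w, p in theta[c].items()}
    for _ in range(iters):
        # E-step: responsibility P(L_i = A | W_i) for every document, accumulated as expected counts
        n = {"A": 0.0, "B": 0.0}
        n_w = {"A": Counter(), "B": Counter()}
        for bag in docs:
            log_a = math.log(pi_A) + sum(k * math.log(theta["A"][w]) for w, k in bag.items())
            log_b = math.log(1 - pi_A) + sum(k * math.log(theta["B"][w]) for w, k in bag.items())
            m = max(log_a, log_b)
            p_a = math.exp(log_a - m) / (math.exp(log_a - m) + math.exp(log_b - m))
            for c, p in (("A", p_a), ("B", 1 - p_a)):
                n[c] += p                             # expected document counts N_A, N_B
                for w, k in bag.items():
                    n_w[c][w] += p * k                # expected word counts N_A(w), N_B(w)
        # M-step: re-estimate the parameters from the expected counts
        pi_A = n["A"] / (n["A"] + n["B"])
        for c in "AB":
            total = sum(n_w[c].values())
            theta[c] = {w: n_w[c][w] / total for w in vocab}
    return pi_A, theta["A"], theta["B"]
```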

SLIDE 15

The Bayesian model

SLIDE 16

The Bayesian model

The Bayesian model has priors Dir(γ) and Beta(α, β) with hyperparameters γ = (γ1, …, γV) and α, β.
It does not have specific θL and π, but integrates them out:

Li = argmax_L ∫∫ P(Wi | θL) P(θL; γL, D) P(L | π) P(π; α, β, D) dθL dπ
   = argmax_L ∫ P(Wi | θL) P(θL; γL, D) dθL · ∫ P(L | π) P(π; α, β, D) dπ
   = argmax_L P(Wi | γL, D) P(L | α, β, D)

P(Wi | θL) is a multinomial with parameter θL = (θ1, …, θV);
P(θL; γL) is a Dirichlet with hyperparameter γL = (γ1, …, γV):
  θL ∼ Dirichlet(γL)
  Wi ∼ Multinomial(θL)
P(L | π) is a Bernoulli with parameter π, drawn from a Beta prior:
  π ∼ Beta(α, β)
  L ∼ Bernoulli(π)
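As a sanity check (standard Beta-Bernoulli and Dirichlet-multinomial algebra, not spelled out on the slides), both integrals have closed forms, and they yield exactly the counts-plus-hyperparameters estimates used on the following slides:

```latex
% Posterior predictive for the label (Beta-Bernoulli):
P(L = + \mid \mathcal{D})
  = \int_0^1 \pi \,\mathrm{Beta}(\pi;\ \alpha + D_+,\ \beta + D_-)\, d\pi
  = \frac{D_+ + \alpha}{D + \alpha + \beta}

% Posterior predictive for a word (Dirichlet-multinomial), with \gamma_0 = \sum_v \gamma_v:
P(w \mid +, \mathcal{D})
  = \int \theta_{+,w}\,\mathrm{Dir}(\theta_+;\ \gamma + N_+(\cdot))\, d\theta_+
  = \frac{N_+(w) + \gamma_w}{N_+ + \gamma_0}
```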

SLIDE 17

The Bayesian model

[Plate diagram of the Bayesian model: π ∼ Beta(α, β) and Li ∼ Bernoulli(π); θL ∼ Dirichlet(γ) for each of the 2 labels; each word wij of document Wi (Ni words per document, N documents) is drawn from Multinomial(θLi).]

SLIDE 18

Bayesian: supervised

The data is labeled:

  • We have a set D of D documents W1...WD with N words in total
  • Each document Wi has Ni words
  • D+ documents (the subset D+) have a positive label and N+ words
  • D− documents (the subset D−) have a negative label and N− words
  • Each word wi appears N+(wi) times in D+ and N−(wi) times in D−
  • Each word wj appears Ni(wj) times in document Wi

Bayesian estimation

P(L = + | D) = (D+ + α)/(D + α + β)
P(wi | +, D) = (N+(wi) + γi)/(N+ + γ0)
P(Wi | +, D) = ∏j P(wj | +, D)^Ni(wj)

P(Li = + | Wi, D) ∝ [(D+ + α)/(D + α + β)] ∏j P(wj | +, D)^Ni(wj)
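A sketch of these smoothed estimates and the resulting decision rule (symmetric γ over the vocabulary and all names are illustrative; it assumes every document word is in vocab):

```python
import math
from collections import Counter

def bayes_posterior_predictive(docs, vocab, alpha=1.0, beta=1.0, gamma=0.5):
    """docs: list of (label, Counter) pairs with label in {'+', '-'}.
    Returns P(L=+|D) and P(w|L,D) under a Beta(alpha, beta) prior on pi and a
    symmetric Dirichlet(gamma) prior on each theta_L."""
    d_pos = sum(1 for label, _ in docs if label == "+")
    p_pos = (d_pos + alpha) / (len(docs) + alpha + beta)        # (D+ + α)/(D + α + β)
    counts = {"+": Counter(), "-": Counter()}
    for label, bag in docs:
        counts[label].update(bag)
    gamma0 = gamma * len(vocab)                                  # γ0 = Σ_v γ_v
    p_word = {
        label: {w: (c[w] + gamma) / (sum(c.values()) + gamma0) for w in vocab}
        for label, c in counts.items()
    }
    return p_pos, p_word

def classify_bayes(bag, p_pos, p_word):
    """argmax_L P(L|D) * prod_w P(w|L,D)^count(w), computed in log space."""
    scores = {}
    for label, prior in (("+", p_pos), ("-", 1.0 - p_pos)):
        scores[label] = math.log(prior) + sum(n * math.log(p_word[label][w]) for w, n in bag.items())
    return max(scores, key=scores.get)
```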

SLIDE 19

Bayesian: unsupervised

We need to approximate an integral/expectation:

p(Li = + | Wi) ∝ ∫∫ p(Wi | +, θ+) p(θ+; γ, D) p(L = + | π) p(π; α, β, D) dθ+ dπ
             ∝ ∫ p(Wi | +, θ+) p(θ+; γ, D) dθ+ · ∫ p(L = + | π) p(π; α, β, D) dπ
             ∝ p(Wi | +, γ, D) p(Li = + | α, β, D)

SLIDE 20

Approximating expectations

E[f(x)] = ∫ f(x) p(x) dx = lim_{N→∞} (1/N) ∑_{i=1..N} f(x(i))   for x(1) … x(N) drawn from p(x)
        ≈ (1/T) ∑_{i=1..T} f(x(i))   for x(1) … x(T) drawn from p(x)

We can approximate the expectation of f(x), 〈f(x)〉 = ∫f(x)p(x)dx, by sampling a finite number of points x(1), ..., x(T) according to p(x), evaluating f(x(i)) for each of them, and computing the average.
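A tiny numerical illustration of this idea (NumPy; the target f(x) = x² and p(x) = Normal(0, 1) are made up, and the exact answer is E[x²] = 1):

```python
import numpy as np

rng = np.random.default_rng(0)

# Approximate E[f(x)] with f(x) = x^2 and x ~ Normal(0, 1); the exact value is 1.
samples = rng.normal(0.0, 1.0, size=100_000)    # x^(1) ... x^(T) drawn from p(x)
estimate = np.mean(samples ** 2)                # (1/T) * sum_t f(x^(t))
print(estimate)                                 # close to 1.0 for large T
```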

SLIDE 21

Markov Chain Monte Carlo

A multivariate distribution p(x) = p(x1, …, xk) with discrete xi has only a finite number of possible outcomes.
Markov Chain Monte Carlo methods construct a Markov chain whose states are the outcomes of p(x).
The probability of visiting state xj is p(xj).
We sample from p(x) by visiting a sequence of states from this Markov chain.

SLIDE 22

Gibbs sampling

Our states: one label assignment L1, …, LN for our N documents: x = (L1, …, LN)
Our transitions: we go from one label assignment x = (+, +, −, +, −, …, +) to another y = (−, +, +, +, …, +)
Our intermediate steps: we generate label Yi conditioned on Y1...Yi−1 and Xi+1...XN
Call the label assignment Y1...Yi−1, Xi+1...XN  L(−i)
We need to compute P(Yi | D, L(−i), α, β, γ)

SLIDE 23

Gibbs sampling

We visit states according to transition probabilities P(y | x):
we go from state x = (x1, …, xk) to state y = (y1, …, yk) in k steps:

(x1, x2, …, xi, …, xk−1, xk) = x = x(t)
(y1, x2, …, xi, …, xk−1, xk)
(y1, y2, …, xi, …, xk−1, xk)
…
(y1, y2, …, yi, …, xk−1, xk)
…
(y1, y2, …, yi, …, yk−1, xk)
(y1, y2, …, yi, …, yk−1, yk) = y = x(t+1)

SLIDE 24

Gibbs sampling

We will visit a sequence of states according to the transition probabilities P(y | x).
That is, we will go from state x = (x1, …, xk) to state y = (y1, …, yk) with probability P(y | x).

For i = 1...k: pick a value for yi by sampling from P(Yi | y1, …, yi−1, xi+1, …, xk)

P(Yi = yi | y1, …, yi−1, xi+1, …, xk) = P(y1, …, yi−1, yi, xi+1, …, xk) / P(y1, …, yi−1, xi+1, …, xk)
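In code, one full Gibbs sweep over the k components looks roughly like this (a sketch only; the per-component conditional sampler is left abstract):

```python
def gibbs_sweep(x, sample_conditional):
    """One transition x -> y of the Gibbs chain over k components.
    sample_conditional(i, state) must return a sample of Y_i from
    P(Y_i | y_1..y_{i-1}, x_{i+1}..x_k); `state` holds exactly that conditioning context."""
    y = list(x)
    for i in range(len(y)):
        y[i] = sample_conditional(i, y)   # at this point y = (y_1..y_{i-1}, x_i..x_k)
    return y
```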

SLIDE 25

Gibbs sampling

For us, p(x) = p(D, L, π, θ+, θ−; α, β, γ).
π, θ+, θ− are real-valued, but they disappear because we integrate them out:

P(Lj = + | L(−j); α, β) = (D+(−j) + α) / (D − 1 + α + β)
P(wk = y | D+(−j); γ) = (N+(−j)(y) + γy) / (N+(−j) + γ0)

(D+(−j) is the number of positive documents excluding Wj, N+(−j)(y) the count of word y in those documents, and N+(−j) their total word count.)

SLIDE 26

Gibbs sampling

P(Lj = + | D, L(−j); α, β, γ)   [prob. that Dj is a pos. review]
  ∝ P(Wj | +, D+(−j); γ)   [prob. that a pos. review generates Dj]
    · P(Lj = + | L(−j); α, β)   [prob. of a pos. review]

with
P(Lj = + | L(−j); α, β) = (D+(−j) + α) / (D − 1 + α + β)
P(wk = y | D+(−j); γ) = (N+(−j)(y) + γy) / (N+(−j) + γ0)

SLIDE 27

The Gibbs sampler

Initialize:
  • Define priors α, β, γ. Assign initial labels L(0) to the documents.

Iterate, for each iteration t = 1...T:
  For every document Wi (with current label x = Li(t−1)):
  • (Temporarily) remove its word counts Ni(wj) from its class x: Nx\i(t−1)(wj) = Nx(t−1)(wj) − Ni(wj)
  • (Temporarily) remove Wi from the documents in its class x: Dx\i(t−1) = Dx(t−1) − 1
  • Assign a new label x′ = Li(t) to Wi, sampled with P(L | Wi, L1(t)...Li−1(t), Li+1(t−1)...LD(t−1); α, β, γ)
  • Add Wi to the documents in class x′
  • Add its word counts Ni(wj) to the word counts for class x′

Final estimate:
  • Use (some of) the snapshots L(1)...L(T) to estimate P(+), P(wi | +), P(wi | −)
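Putting slides 25 to 27 together, a rough sketch of this collapsed Gibbs sampler in plain Python (illustrative names; a symmetric γ; the per-document likelihood is scored with the per-word predictive from the previous slides). A real run would keep snapshots L(1)...L(T) after burn-in rather than just the final assignment:

```python
import math, random
from collections import Counter

def gibbs_naive_bayes(docs, vocab, alpha=1.0, beta=1.0, gamma=0.5, iters=200, seed=0):
    """docs: list of Counters (unlabeled bags of words, all words assumed to be in vocab).
    Returns the label assignment after the final iteration."""
    rng = random.Random(seed)
    gamma0 = gamma * len(vocab)                               # symmetric Dirichlet: gamma_0 = V * gamma
    labels = [rng.choice("+-") for _ in docs]                 # initial assignment L^(0)
    d = Counter(labels)                                       # documents per class
    n = {c: Counter() for c in "+-"}                          # word counts per class
    tot = {c: 0 for c in "+-"}                                # total words per class
    for bag, c in zip(docs, labels):
        n[c].update(bag); tot[c] += sum(bag.values())
    for _ in range(iters):
        for j, bag in enumerate(docs):
            size = sum(bag.values())
            c = labels[j]                                     # remove document j from its class
            d[c] -= 1; tot[c] -= size
            for w, k in bag.items(): n[c][w] -= k
            logp = {}
            for cand, prior_hp in (("+", alpha), ("-", beta)):
                lp = math.log(d[cand] + prior_hp)             # P(L_j = cand | L^(-j)); shared normalizer omitted
                for w, k in bag.items():                      # P(W_j | cand, D^(-j)) via the per-word predictive
                    lp += k * math.log((n[cand][w] + gamma) / (tot[cand] + gamma0))
                logp[cand] = lp
            m = max(logp.values())
            p_pos = math.exp(logp["+"] - m) / (math.exp(logp["+"] - m) + math.exp(logp["-"] - m))
            c = "+" if rng.random() < p_pos else "-"          # sample the new label L_j^(t)
            labels[j] = c
            d[c] += 1; tot[c] += size
            for w, k in bag.items(): n[c][w] += k
    return labels
```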

SLIDE 28

Estimation

The model (in all four settings):
  • Labels: L ∼ Bernoulli(π)
  • Words: Wi | L ∼ Multinomial(θL)

Frequentist, supervised (relative frequency estimation):
  • Labels: π = D+/D
  • Words: θi+ = N+(wi)/N+

Frequentist, unsupervised (Expectation Maximization; at each iteration t):
  • Labels: π(t) = E[D+](t−1)/D
  • Words: θi+(t) = E[N+(wi)](t−1)/E[N+](t−1)

Bayesian, supervised (with priors):
  • Labels: π = (D+ + α)/(D + α + β)
  • Words: θi+ = (N+(wi) + γi)/(N+ + γ0)

Bayesian, unsupervised (Gibbs sampling; for each ministep i at each iteration t):
  • Labels: πi = (D+(−i) + α)/(D − 1 + α + β)
  • Words: θi+ = (N+(−i)(wi) + γi)/(N+(−i) + γ0)