SLIDE 1

SI425 : NLP

Set 6: Logistic Regression

Fall 2020 : Chambers

SLIDE 2

Last time

  • Naive Bayes Classifier

Given X, what is the most probable Y?

$$Y_{\text{new}} \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i \mid Y = y_k)$$
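As a refresher, here is a minimal sketch of this decision rule in Python, assuming a log-prior and per-feature log-likelihood table were already estimated from counts (the dictionary names and the smoothing constant are illustrative, not from the slides):

```python
import math

def naive_bayes_predict(ngrams, classes, log_prior, log_likelihood):
    """Pick the class y_k maximizing P(Y=y_k) * prod_i P(X_i | Y=y_k), computed in log space."""
    best_class, best_score = None, -math.inf
    for y in classes:
        # Unseen n-grams fall back to a tiny smoothed probability.
        score = log_prior[y] + sum(log_likelihood.get((x, y), math.log(1e-10)) for x in ngrams)
        if score > best_score:
            best_class, best_score = y, score
    return best_class
```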

SLIDE 3

Problems with Naive Bayes

$$Y_{\text{new}} \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i \mid Y = y_k)$$

  • It assumes all n-grams are independent of each other. Wrong!
  • Example: Shakespeare has unique unigrams like: doth, till, morrow, oft, shall, methinks
  • Each unigram votes for Shakespeare, making the prediction overconfident.
  • Analogy: ask your 10 friends for an opinion and they all vote the same way, which seems confident, but their opinions already mutually informed each other from prior conversations.
SLIDE 4

Alternative to Naive Bayes?

  • We want a model that doesn’t assume independence between the inputs.
  • Ideally, give weight to an n-gram that helps improve accuracy, but give it less weight if other n-grams overlap with that same correct prediction.

  • Solution: Logistic Regression
  • Maximum Entropy (MaxEnt)
  • Multinomial logistic regression
  • Log-linear model
  • Neural network (single layer)
SLIDE 5

Let’s talk about features

  • All inputs to Logistic Regression are features.
  • So far we’ve counted n-grams, so think of each n-gram as a feature.
  • Define a feature function $f_i(x)$ over the text x:
    • Each unique n-gram has a feature index i
    • The function’s value is that n-gram’s count in x

SLIDE 6

Feature Example

x1 = “the lady doth protest too much methinks” - Shakespeare
x2 = “it was the best of times it was the worst of times” - Dickens

$f_7$ is the unigram ‘the’; $f_{238}$ is the bigram ‘the best’

$f_7(x_1) = 1$      $f_{238}(x_1) = 0$
$f_7(x_2) = 2$      $f_{238}(x_2) = 1$
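A minimal sketch of such a feature function in Python; the n-gram-to-index mapping below is made up purely to mirror the example above (index 7 for ‘the’, 238 for ‘the best’):

```python
from collections import Counter

def ngram_features(text, feature_index, n_values=(1, 2)):
    """Map text to {feature id: count} for the unigrams/bigrams listed in feature_index."""
    tokens = text.lower().split()
    counts = Counter()
    for n in n_values:
        for j in range(len(tokens) - n + 1):
            gram = " ".join(tokens[j:j + n])
            if gram in feature_index:
                counts[feature_index[gram]] += 1
    return counts

# Hypothetical feature indices: 7 = unigram 'the', 238 = bigram 'the best'
feature_index = {"the": 7, "the best": 238}
x2 = "it was the best of times it was the worst of times"
print(ngram_features(x2, feature_index))   # Counter({7: 2, 238: 1})
```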

SLIDE 7

Weights

  • Once you have features, you just need weights.
  • We want a score for each class label:

$$\text{score}(x, c) = \sum_i w_{i,c} \, f_i(x)$$

Example: $f_1(x) = 1$, $f_2(x) = 2$, $f_3(x) = 1$

                  w1      w2      w3     score(x, c)
  Shakespeare    1.31    0.49   -0.82    1.47
  Dickens       -0.23    0.72    0.10    1.31
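A small sketch of this score computation in Python, using sparse dictionaries and the numbers from the table above:

```python
def score(features, weights):
    """Dot product of feature counts and class weights: sum_i w_{i,c} * f_i(x)."""
    return sum(weights.get(i, 0.0) * count for i, count in features.items())

# Numbers from the slide: f1(x)=1, f2(x)=2, f3(x)=1
features = {1: 1, 2: 2, 3: 1}
class_weights = {
    "Shakespeare": {1: 1.31, 2: 0.49, 3: -0.82},
    "Dickens":     {1: -0.23, 2: 0.72, 3: 0.10},
}
for c, w in class_weights.items():
    print(c, round(score(features, w), 2))   # Shakespeare 1.47, Dickens 1.31
```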

SLIDE 8

Weights

But we want probabilities, right?

$$\text{score}(x, c) = \sum_i w_{i,c} \, f_i(x)$$

Shakespeare: 1.47        Dickens: 1.31

A first attempt is to normalize the raw scores:

$$P(c \mid x) = \frac{\sum_i w_{i,c} \, f_i(x)}{Z}, \qquad Z = \sum_c \sum_i w_{i,c} \, f_i(x)$$

And for easier math later, and nice [0,1] probabilities, exponentiate the scores with exp(x):

$$P(c \mid x) = \frac{\exp\bigl(\sum_i w_{i,c} \, f_i(x)\bigr)}{Z}, \qquad Z = \sum_c \exp\Bigl(\sum_i w_{i,c} \, f_i(x)\Bigr)$$
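A minimal sketch of this normalization over the two scores above (subtracting the max score before exponentiating is a standard numerical-stability trick, not something from the slides):

```python
import math

def softmax(scores):
    """P(c|x) = exp(score_c) / sum over c' of exp(score_c'), shifted by the max for stability."""
    max_s = max(scores.values())
    exps = {c: math.exp(s - max_s) for c, s in scores.items()}
    Z = sum(exps.values())
    return {c: e / Z for c, e in exps.items()}

scores = {"Shakespeare": 1.47, "Dickens": 1.31}   # the scores from the previous slide
print(softmax(scores))   # {'Shakespeare': 0.54, 'Dickens': 0.46} approximately
```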

SLIDE 9

Logistic Regression

  • Logistic Regression is just a vector of weights multiplied by your n-gram vector of counts.
  • (and normalize to get probabilities)

$$P(c \mid x) = \frac{1}{Z} \exp\Bigl(\sum_i w_{i,c} \, f_i(x)\Bigr)$$

SLIDE 10

Logistic Regression

“it was the best of times it was the worst of times” - Dickens

[Slide table: the feature-count vector f(x) for this sentence over unigrams such as it, was, the, best, times, he, she, pizza, worst, shown next to a learned Dickens weight vector w and a Shakespeare weight vector w.]

Where do these weights come from?

SLIDE 11

Learning in Logistic Regression

  • We need to learn the weights
  • Goal: choose weights that give the “best results”, or the weights that give the “least error”
  • Loss function: measures how wrong our predictions are

$$\text{Loss}(y) = -\sum_{k=1}^{K} 1\{y = k\} \log p(y = k \mid x)$$

Example: $\text{Loss}(\text{dickens}) = -\log p(\text{dickens} \mid x)$, which is 0.0 when $p(y \mid x) = 1.0$
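A small sketch of this loss for one example, assuming we already have a dictionary of predicted class probabilities:

```python
import math

def loss(probs, true_label):
    """Loss(y) = -log p(y = true_label | x); only the correct class's probability matters."""
    return -math.log(probs[true_label])

probs = {"Shakespeare": 0.54, "Dickens": 0.46}   # e.g. output of the earlier softmax sketch
print(loss(probs, "Dickens"))   # about 0.78; it would be 0.0 if p(dickens|x) were 1.0
```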

SLIDE 12

Learning in Logistic Regression

  • Goal: choose weights that give the “least error”
  • Choose weights that give probabilities close to 1.0 to each of the correct labels.

$$\text{Loss}(y) = -\sum_{k=1}^{K} 1\{y = k\} \log p(y = k \mid x)$$

But how???

SLIDE 13

Learning in Logistic Regression

  • Gradient descent: how to update the weights

  1. Find the slope of each weight $w_i$
     • Take its partial derivative, of course!
  2. Move downhill, against the slope.
  3. Update all weights.
  4. Recalculate the loss function.
  5. Repeat.

SLIDE 14

Learning in Logistic Regression

  • Gradient descent: how to update the weights

Another description with lots of hand waving:

  • 1. Initialize the weights randomly
  • 2. Compute probabilities for all data
  • 3. Jiggle the weights up and down based on mistakes
  • 4. Repeat
SLIDE 15

Learning in Logistic Regression

  • Weight updates

$$\frac{\partial L}{\partial w_k} = \bigl(p(y = k \mid x) - 1\{y = k\}\bigr)\, x_k$$

$$\hat{w}_k = w_k - \alpha \frac{\partial L}{\partial w_k}$$

  • It’s easier than it looks. Compare your probability to the correct answer, then update the weight based on how far off your probability was.
  • In the gradient: $p(y = k \mid x)$ is the logistic regression probability, $1\{y = k\}$ is 1 or 0 (the correct answer), and $x_k$ is the feature value.
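Putting the gradient and the update rule together, a toy gradient descent loop might look like this sketch; the features, class names, and learning rate are all made up for illustration:

```python
import math

def softmax(scores):
    max_s = max(scores.values())
    exps = {c: math.exp(s - max_s) for c, s in scores.items()}
    Z = sum(exps.values())
    return {c: e / Z for c, e in exps.items()}

def sgd_step(weights, features, true_label, alpha=0.1):
    """One update per weight: w_{i,c} -= alpha * (p(c|x) - 1{c == true_label}) * f_i(x)."""
    scores = {c: sum(w.get(i, 0.0) * f for i, f in features.items())
              for c, w in weights.items()}
    probs = softmax(scores)
    for c, w in weights.items():
        error = probs[c] - (1.0 if c == true_label else 0.0)
        for i, f in features.items():
            w[i] = w.get(i, 0.0) - alpha * error * f
    return probs

# Toy data: unigram counts for one Dickens sentence, weights initialized to zero (empty dicts)
features = {"it": 2, "was": 2, "the": 2, "best": 1, "worst": 1}
weights = {"Dickens": {}, "Shakespeare": {}}
for step in range(20):
    probs = sgd_step(weights, features, "Dickens")
print(round(probs["Dickens"], 3))   # climbs toward 1.0 as the updates repeat
```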

SLIDE 16

Summary: Logistic Regression

  • Optimizes P( Y | X ) directly
  • You define the features (usually n-gram counts)
  • It learns a vector of weights for each Y value
  • Gradient descent, update weights based on error
  • Multiply the feature vector by the weight vector
  • Output is P(Y=y | X) after normalizing
  • Choose the most probable Y
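To tie the summary together, here is the whole pipeline as one compact sketch; the vocabulary and weight values are invented for illustration:

```python
import math
from collections import Counter

def predict(text, weights):
    """Features -> class scores -> softmax probabilities -> most probable class."""
    f = Counter(text.lower().split())                     # unigram counts as features
    scores = {c: sum(w.get(g, 0.0) * n for g, n in f.items())
              for c, w in weights.items()}
    m = max(scores.values())
    exps = {c: math.exp(s - m) for c, s in scores.items()}
    Z = sum(exps.values())
    probs = {c: e / Z for c, e in exps.items()}
    return max(probs, key=probs.get), probs

weights = {"Shakespeare": {"doth": 1.5, "methinks": 1.2},
           "Dickens":     {"times": 0.8, "worst": 0.6}}
print(predict("the lady doth protest too much methinks", weights))
# ('Shakespeare', {'Shakespeare': ~0.94, 'Dickens': ~0.06})
```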