

  1. SI425 : NLP Set 6 Logistic Regression Fall 2020 : Chambers

  2. Last time • Naive Bayes Classifier: given X, what is the most probable Y? $Y_{new} \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i \mid Y = y_k)$

  3. Problems with Naive Bayes $Y \leftarrow \arg\max_{y_k} P(Y = y_k)\, P(X \mid Y = y_k)$ • It assumes all n-grams are independent of each other. Wrong! • Example: Shakespeare has distinctive unigrams like doth, till, morrow, oft, shall, methinks. • Each unigram votes for Shakespeare, so the prediction becomes over-confident. • It is like asking your 10 friends for an opinion and having them all vote the same way: it seems confident, but their opinions already informed each other through prior conversations.

  4. Alternative to Naive Bayes? • We want a model that doesn’t assume independence between the inputs. • Ideally, give weight to an n-gram that helps improve accuracy, but give it less weight if other n-grams already cover the same correct prediction. • Solution: Logistic Regression, also known as Maximum Entropy (MaxEnt), multinomial logistic regression, a log-linear model, or a single-layer neural network.

  5. Let’s talk about features • All inputs to Logistic Regression are features. • So far we’ve counted n-grams, so think of each n-gram as a feature. • Define a feature function over the text x: $f_i(x)$. • Each unique n-gram has a feature index i. • The function’s value is that n-gram’s count in x.
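
A minimal Python sketch of this idea, assuming whitespace tokenization and only unigram/bigram features; the helper names build_feature_index and extract_features are illustrative, not from the slides:

```python
def build_feature_index(corpus):
    """Assign every unique unigram and bigram in the corpus a feature index i."""
    index = {}
    for text in corpus:
        tokens = text.lower().split()
        for gram in [(t,) for t in tokens] + list(zip(tokens, tokens[1:])):
            if gram not in index:
                index[gram] = len(index)
    return index

def extract_features(text, feature_index):
    """f_i(x): the count of n-gram i in text x, returned as {i: count}.
    Features that never occur in x are simply absent (count 0)."""
    tokens = text.lower().split()
    counts = {}
    for gram in [(t,) for t in tokens] + list(zip(tokens, tokens[1:])):
        i = feature_index.get(gram)
        if i is not None:
            counts[i] = counts.get(i, 0) + 1
    return counts
```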

  6. Feature Example x1 = "the lady doth protest too much methinks" (Shakespeare) x2 = "it was the best of times it was the worst of times" (Dickens) f7 is the unigram ‘the’, f238 is the bigram ‘the best’: $f_7(x_1) = 1$, $f_{238}(x_1) = 0$, $f_7(x_2) = 2$, $f_{238}(x_2) = 1$
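
Running the sketch above with the slide’s feature indices 7 and 238 (hand-picked here to match the example) reproduces these counts:

```python
# indices chosen to match the slide: f7 = unigram 'the', f238 = bigram 'the best'
feature_index = {("the",): 7, ("the", "best"): 238}

x1 = "the lady doth protest too much methinks"                 # Shakespeare
x2 = "it was the best of times it was the worst of times"      # Dickens

print(extract_features(x1, feature_index))  # {7: 1}          i.e. f7(x1)=1, f238(x1)=0
print(extract_features(x2, feature_index))  # {7: 2, 238: 1}  i.e. f7(x2)=2, f238(x2)=1
```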

  7. Weights • Once you have features, you just need weights. • We want a score for each class label.

                     Shakespeare   Dickens
        f1(x) = 1       1.31        -0.23
        f2(x) = 2       0.49         0.72
        f3(x) = 1      -0.82         0.10
        score(x, c)     1.47         1.31

     $score(x, c) = \sum_i w_{i,c}\, f_i(x)$
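
A small sketch that reproduces the two scores in this table; the numbers are the slide’s example weights, and the variable names are illustrative:

```python
# feature values and per-class weights from the slide's example
f = {1: 1, 2: 2, 3: 1}
w = {
    "shakespeare": {1: 1.31, 2: 0.49, 3: -0.82},
    "dickens":     {1: -0.23, 2: 0.72, 3: 0.10},
}

def score(features, class_weights):
    """score(x, c) = sum over i of w_{i,c} * f_i(x)"""
    return sum(class_weights[i] * value for i, value in features.items())

print(score(f, w["shakespeare"]))  # ~1.47
print(score(f, w["dickens"]))      # ~1.31
```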

  8. Weights $score(x, c) = \sum_i w_{i,c}\, f_i(x)$ gives Shakespeare 1.47 and Dickens 1.31. But we want probabilities, right? First attempt: $P(c \mid x) = \frac{\sum_i w_{i,c} f_i(x)}{Z}$ with $Z = \sum_c \sum_i w_{i,c} f_i(x)$. And for easier math later (and a nice [0,1] range), wrap the score in exp(): $P(c \mid x) = \frac{\exp\!\left(\sum_i w_{i,c} f_i(x)\right)}{Z}$ with $Z = \sum_c \exp\!\left(\sum_i w_{i,c} f_i(x)\right)$.
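
A softmax sketch over the two scores from the previous slide (the printed probabilities are approximate):

```python
import math

scores = {"shakespeare": 1.47, "dickens": 1.31}

# P(c | x) = exp(score(x, c)) / Z, where Z sums exp(score) over all classes
Z = sum(math.exp(s) for s in scores.values())
probs = {c: math.exp(s) / Z for c, s in scores.items()}
print(probs)  # roughly {'shakespeare': 0.54, 'dickens': 0.46}
```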

  9. Logistic Regression • Logistic Regression is just a vector of weights multiplied by your n-gram vector of counts: $P(c \mid x) = \frac{1}{Z} \exp\!\left(\sum_i w_{i,c} f_i(x)\right)$ • (and normalize to get probabilities)

  10. Logistic Regression "it was the best of times it was the worst of times" (Dickens)

                       it     was    the    best   of     he     she    times  pizza  ok     worst
        f(x)           2      1      2      1      2      0      0      2      0      0      1
        Dickens w      -0.1   0.05   0.0    0.42   0.12   0.3    0.2    1.1    -1.5   -0.2   0.3
        Shakespeare w  0.03   0.21   -0.03  -0.32  0.01   0.23   0.41   -0.2   -2.1   0      0.18

      Where do these weights come from?
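
As a sanity check, a sketch that multiplies this feature vector by each weight vector; the resulting scores are not shown on the slide, so treat them as an illustration only:

```python
features      = [2, 1, 2, 1, 2, 0, 0, 2, 0, 0, 1]   # counts for "it was the best of times ..."
w_dickens     = [-0.1, 0.05, 0.0, 0.42, 0.12, 0.3, 0.2, 1.1, -1.5, -0.2, 0.3]
w_shakespeare = [0.03, 0.21, -0.03, -0.32, 0.01, 0.23, 0.41, -0.2, -2.1, 0, 0.18]

dot = lambda w, x: sum(wi * xi for wi, xi in zip(w, x))
print(dot(w_dickens, features))      # ~3.01  -> the Dickens score is higher, as we'd hope
print(dot(w_shakespeare, features))  # ~-0.31
```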

  11. Learning in Logistic Regression • We need to learn the weights. • Goal: choose weights that give the “best results”, i.e. the weights that give the “least error”. • Loss function: measures how wrong our predictions are: $Loss(y) = -\sum_{k=1}^{K} 1\{y = k\} \log p(y = k \mid x)$ • Example! $Loss(dickens) = -\log p(dickens \mid x)$, which is 0.0 when $p(y \mid x) = 1.0$.
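
A tiny sketch of this loss, reusing the (approximate) probabilities from the softmax example above:

```python
import math

def loss(probs, correct_label):
    """Cross-entropy loss: -log p(correct label | x). It is 0.0 when that probability is 1.0."""
    return -math.log(probs[correct_label])

probs = {"shakespeare": 0.54, "dickens": 0.46}   # example values from earlier
print(loss(probs, "dickens"))   # ~0.78; shrinks toward 0.0 as p(dickens | x) approaches 1.0
```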

  12. Learning in Logistic Regression • Goal: choose weights that give the “least error”: $Loss(y) = -\sum_{k=1}^{K} 1\{y = k\} \log p(y = k \mid x)$ • Choose weights that give probabilities close to 1.0 to each of the correct labels. But how???

  13. Learning in Logistic Regression • Gradient descent: how to update the weights. 1. Find the slope of each $w_i$ (take its partial derivative, of course!) 2. Move downhill along that slope. 3. Update all weights. 4. Recalculate the loss function. 5. Repeat.

  14. Learning in Logistic Regression • Gradient descent : how to update the weights Another description with lots of hand waving: 1. Initialize the weights randomly 2. Compute probabilities for all data 3. Jiggle the weights up and down based on mistakes 4. Repeat

  15. Learning in Logistic Regression • Weight updates: $\frac{\partial L}{\partial w_k} = \left(p(y = k \mid x) - 1\{y = k\}\right) x_k$, where $x_k$ is the feature value and $1\{y = k\}$ is 1 or 0. • The logistic regression update: $w_k = w_k - \alpha \frac{\partial L}{\partial w_k}$ • It’s easier than it looks. Compare your probability to the correct answer. Update the weight based on how far off your probability was.
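
A compact sketch of this update rule as a training loop for the two-author example; the class names, zero initialization (the slides suggest random initialization), the learning rate, and the epoch count are all assumptions for illustration:

```python
import math

CLASSES = ["shakespeare", "dickens"]

def probabilities(weights, features):
    """Softmax over per-class scores: P(c | x) = exp(score(x, c)) / Z."""
    scores = {c: sum(weights[c].get(i, 0.0) * v for i, v in features.items()) for c in CLASSES}
    Z = sum(math.exp(s) for s in scores.values())
    return {c: math.exp(scores[c]) / Z for c in CLASSES}

def train(data, alpha=0.1, epochs=50):
    """data: list of (feature dict, correct label) pairs. Weights start at 0.0 for simplicity."""
    weights = {c: {} for c in CLASSES}
    for _ in range(epochs):
        for features, label in data:
            probs = probabilities(weights, features)
            for c in CLASSES:
                error = probs[c] - (1.0 if c == label else 0.0)    # p(y=k|x) - 1{y=k}
                for i, value in features.items():
                    # dL/dw_k = error * x_k, then w_k = w_k - alpha * dL/dw_k
                    weights[c][i] = weights[c].get(i, 0.0) - alpha * error * value
    return weights
```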

  16. Summary: Logistic Regression • Optimizes P(Y | X) directly • You define the features (usually n-gram counts) • It learns a vector of weights for each Y value (gradient descent; update weights based on error) • Multiply the feature vector by the weight vector • Output is P(Y=y | X) after normalizing • Choose the most probable Y
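
Putting the earlier sketches (build_feature_index, extract_features, train, probabilities) together, a hypothetical end-to-end run on the two example sentences; with only two toy training texts the output is a sanity check, not a real evaluation:

```python
x1 = "the lady doth protest too much methinks"                 # Shakespeare
x2 = "it was the best of times it was the worst of times"      # Dickens

feature_index = build_feature_index([x1, x2])
train_data = [(extract_features(x1, feature_index), "shakespeare"),
              (extract_features(x2, feature_index), "dickens")]

weights = train(train_data)
probs = probabilities(weights, extract_features("the best of times", feature_index))
print(max(probs, key=probs.get))   # expected: 'dickens' (its n-grams overlap the Dickens sentence)
```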
