SLIDE 1

SI425 : NLP

Set 6: Logistic Regression

Fall 2020 : Chambers

SLIDE 2

Last time

  • Naive Bayes Classifier

Given X, what is the most probable Y?

$$Y_{\text{new}} \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i \mid Y = y_k)$$
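As a refresher, here is a minimal sketch of this decision rule in Python, assuming a log-prior and per-feature log-likelihood table were already estimated from counts (the dictionary names and the smoothing constant are illustrative, not from the slides):

```python
import math

def naive_bayes_predict(ngrams, classes, log_prior, log_likelihood):
    """Pick the class y_k maximizing P(Y=y_k) * prod_i P(X_i | Y=y_k), computed in log space."""
    best_class, best_score = None, -math.inf
    for y in classes:
        # Unseen n-grams fall back to a tiny smoothed probability.
        score = log_prior[y] + sum(log_likelihood.get((x, y), math.log(1e-10)) for x in ngrams)
        if score > best_score:
            best_class, best_score = y, score
    return best_class
```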

SLIDE 3

Problems with Naive Bayes

$$Y_{\text{new}} \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i \mid Y = y_k)$$

  • It assumes all n-grams are independent of each other. Wrong!
  • Example: Shakespeare has unique unigrams like: doth, till, morrow, oft, shall, methinks
  • Each unigram votes for Shakespeare, making the prediction overconfident.
  • Analogy: ask your 10 friends for an opinion and they all vote the same way, which seems confident, but their opinions already mutually informed each other from prior conversations.
SLIDE 4

Alternative to Naive Bayes?

  • We want a model that doesn’t assume independence between the inputs.
  • Ideally, give weight to an n-gram that helps improve accuracy, but give it less weight if other n-grams overlap with that same correct prediction.

  • Solution: Logistic Regression
  • Maximum Entropy (MaxEnt)
  • Multinomial logistic regression
  • Log-linear model
  • Neural network (single layer)
SLIDE 5

Let’s talk about features

  • All inputs to Logistic Regression are features.
  • So far we’ve counted n-grams, so think of each n-gram as a feature.
  • Define a feature function $f_i(x)$ over the text x:
    • Each unique n-gram has a feature index i
    • The function’s value is that n-gram’s count in x

SLIDE 6

Feature Example

x1 = “the lady doth protest too much methinks” - Shakespeare
x2 = “it was the best of times it was the worst of times” - Dickens

$f_7$ is the unigram ‘the’; $f_{238}$ is the bigram ‘the best’

$f_7(x_1) = 1$      $f_{238}(x_1) = 0$
$f_7(x_2) = 2$      $f_{238}(x_2) = 1$
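A minimal sketch of such a feature function in Python; the n-gram-to-index mapping below is made up purely to mirror the example above (index 7 for ‘the’, 238 for ‘the best’):

```python
from collections import Counter

def ngram_features(text, feature_index, n_values=(1, 2)):
    """Map text to {feature id: count} for the unigrams/bigrams listed in feature_index."""
    tokens = text.lower().split()
    counts = Counter()
    for n in n_values:
        for j in range(len(tokens) - n + 1):
            gram = " ".join(tokens[j:j + n])
            if gram in feature_index:
                counts[feature_index[gram]] += 1
    return counts

# Hypothetical feature indices: 7 = unigram 'the', 238 = bigram 'the best'
feature_index = {"the": 7, "the best": 238}
x2 = "it was the best of times it was the worst of times"
print(ngram_features(x2, feature_index))   # Counter({7: 2, 238: 1})
```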

SLIDE 7

Weights

  • Once you have features, you just need weights.
  • We want a score for each class label:

$$\text{score}(x, c) = \sum_i w_{i,c} \, f_i(x)$$

Example: $f_1(x) = 1$, $f_2(x) = 2$, $f_3(x) = 1$

                  w1      w2      w3     score(x, c)
  Shakespeare    1.31    0.49   -0.82    1.47
  Dickens       -0.23    0.72    0.10    1.31
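A small sketch of this score computation in Python, using sparse dictionaries and the numbers from the table above:

```python
def score(features, weights):
    """Dot product of feature counts and class weights: sum_i w_{i,c} * f_i(x)."""
    return sum(weights.get(i, 0.0) * count for i, count in features.items())

# Numbers from the slide: f1(x)=1, f2(x)=2, f3(x)=1
features = {1: 1, 2: 2, 3: 1}
class_weights = {
    "Shakespeare": {1: 1.31, 2: 0.49, 3: -0.82},
    "Dickens":     {1: -0.23, 2: 0.72, 3: 0.10},
}
for c, w in class_weights.items():
    print(c, round(score(features, w), 2))   # Shakespeare 1.47, Dickens 1.31
```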

SLIDE 8

Weights

But we want probabilities, right?

$$\text{score}(x, c) = \sum_i w_{i,c} \, f_i(x)$$

Shakespeare: 1.47        Dickens: 1.31

A first attempt is to normalize the raw scores:

$$P(c \mid x) = \frac{\sum_i w_{i,c} \, f_i(x)}{Z}, \qquad Z = \sum_c \sum_i w_{i,c} \, f_i(x)$$

And for easier math later, and nice [0,1] probabilities, exponentiate the scores with exp(x):

$$P(c \mid x) = \frac{\exp\bigl(\sum_i w_{i,c} \, f_i(x)\bigr)}{Z}, \qquad Z = \sum_c \exp\Bigl(\sum_i w_{i,c} \, f_i(x)\Bigr)$$
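A minimal sketch of this normalization over the two scores above (subtracting the max score before exponentiating is a standard numerical-stability trick, not something from the slides):

```python
import math

def softmax(scores):
    """P(c|x) = exp(score_c) / sum over c' of exp(score_c'), shifted by the max for stability."""
    max_s = max(scores.values())
    exps = {c: math.exp(s - max_s) for c, s in scores.items()}
    Z = sum(exps.values())
    return {c: e / Z for c, e in exps.items()}

scores = {"Shakespeare": 1.47, "Dickens": 1.31}   # the scores from the previous slide
print(softmax(scores))   # {'Shakespeare': 0.54, 'Dickens': 0.46} approximately
```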

SLIDE 9

Logistic Regression

  • Logistic Regression is just a vector of weights multiplied by your n-gram vector of counts.
  • (and normalize to get probabilities)

$$P(c \mid x) = \frac{1}{Z} \exp\Bigl(\sum_i w_{i,c} \, f_i(x)\Bigr)$$

SLIDE 10

Logistic Regression

“it was the best of times it was the worst of times” - Dickens

[Slide table: the feature-count vector f(x) for this sentence over unigrams such as it, was, the, best, times, he, she, pizza, worst, shown next to a learned Dickens weight vector w and a Shakespeare weight vector w.]

Where do these weights come from?

SLIDE 11

Learning in Logistic Regression

  • We need to learn the weights
  • Goal: choose weights that give the “best results”, or the weights that give the “least error”
  • Loss function: measures how wrong our predictions are

$$\text{Loss}(y) = -\sum_{k=1}^{K} 1\{y = k\} \log p(y = k \mid x)$$

Example: $\text{Loss}(\text{dickens}) = -\log p(\text{dickens} \mid x)$, which is 0.0 when $p(y \mid x) = 1.0$
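A small sketch of this loss for one example, assuming we already have a dictionary of predicted class probabilities:

```python
import math

def loss(probs, true_label):
    """Loss(y) = -log p(y = true_label | x); only the correct class's probability matters."""
    return -math.log(probs[true_label])

probs = {"Shakespeare": 0.54, "Dickens": 0.46}   # e.g. output of the earlier softmax sketch
print(loss(probs, "Dickens"))   # about 0.78; it would be 0.0 if p(dickens|x) were 1.0
```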

SLIDE 12

Learning in Logistic Regression

  • Goal: choose weights that give the “least error”
  • Choose weights that give probabilities close to 1.0 to each of the correct labels.

$$\text{Loss}(y) = -\sum_{k=1}^{K} 1\{y = k\} \log p(y = k \mid x)$$

But how???

SLIDE 13

Learning in Logistic Regression

  • Gradient descent: how to update the weights

  1. Find the slope of each weight $w_i$
     • Take its partial derivative, of course!
  2. Move downhill, against the slope.
  3. Update all weights.
  4. Recalculate the loss function.
  5. Repeat.

SLIDE 14

Learning in Logistic Regression

  • Gradient descent: how to update the weights

Another description with lots of hand waving:

  • 1. Initialize the weights randomly
  • 2. Compute probabilities for all data
  • 3. Jiggle the weights up and down based on mistakes
  • 4. Repeat
SLIDE 15

Learning in Logistic Regression

  • Weight updates

$$\frac{\partial L}{\partial w_k} = \bigl(p(y = k \mid x) - 1\{y = k\}\bigr)\, x_k$$

$$\hat{w}_k = w_k - \alpha \frac{\partial L}{\partial w_k}$$

  • It’s easier than it looks. Compare your probability to the correct answer, then update the weight based on how far off your probability was.
  • In the gradient: $p(y = k \mid x)$ is the logistic regression probability, $1\{y = k\}$ is 1 or 0 (the correct answer), and $x_k$ is the feature value.
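Putting the gradient and the update rule together, a toy gradient descent loop might look like this sketch; the features, class names, and learning rate are all made up for illustration:

```python
import math

def softmax(scores):
    max_s = max(scores.values())
    exps = {c: math.exp(s - max_s) for c, s in scores.items()}
    Z = sum(exps.values())
    return {c: e / Z for c, e in exps.items()}

def sgd_step(weights, features, true_label, alpha=0.1):
    """One update per weight: w_{i,c} -= alpha * (p(c|x) - 1{c == true_label}) * f_i(x)."""
    scores = {c: sum(w.get(i, 0.0) * f for i, f in features.items())
              for c, w in weights.items()}
    probs = softmax(scores)
    for c, w in weights.items():
        error = probs[c] - (1.0 if c == true_label else 0.0)
        for i, f in features.items():
            w[i] = w.get(i, 0.0) - alpha * error * f
    return probs

# Toy data: unigram counts for one Dickens sentence, weights initialized to zero (empty dicts)
features = {"it": 2, "was": 2, "the": 2, "best": 1, "worst": 1}
weights = {"Dickens": {}, "Shakespeare": {}}
for step in range(20):
    probs = sgd_step(weights, features, "Dickens")
print(round(probs["Dickens"], 3))   # climbs toward 1.0 as the updates repeat
```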

SLIDE 16

Summary: Logistic Regression

  • Optimizes P( Y | X ) directly
  • You define the features (usually n-gram counts)
  • It learns a vector of weights for each Y value
  • Gradient descent, update weights based on error
  • Multiply the feature vector by the weight vector
  • Output is P(Y=y | X) after normalizing
  • Choose the most probable Y
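To tie the summary together, here is the whole pipeline as one compact sketch; the vocabulary and weight values are invented for illustration:

```python
import math
from collections import Counter

def predict(text, weights):
    """Features -> class scores -> softmax probabilities -> most probable class."""
    f = Counter(text.lower().split())                     # unigram counts as features
    scores = {c: sum(w.get(g, 0.0) * n for g, n in f.items())
              for c, w in weights.items()}
    m = max(scores.values())
    exps = {c: math.exp(s - m) for c, s in scores.items()}
    Z = sum(exps.values())
    probs = {c: e / Z for c, e in exps.items()}
    return max(probs, key=probs.get), probs

weights = {"Shakespeare": {"doth": 1.5, "methinks": 1.2},
           "Dickens":     {"times": 0.8, "worst": 0.6}}
print(predict("the lady doth protest too much methinks", weights))
# ('Shakespeare', {'Shakespeare': ~0.94, 'Dickens': ~0.06})
```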