

SLIDE 1

Emma Strubell

Algorithms for NLP

CS 11-711 · Fall 2020

Lecture 2: Linear text classification

SLIDE 2

Let’s try this again…

Emma (she/her) · Yulia (she/her) · Bob (he/him) · Sanket (he/him) · Han (he/him) · Jiateng (he/him)

SLIDE 3

Outline

■ Basic representations of text data for classification
■ Four linear classifiers
  ■ Naïve Bayes
  ■ Perceptron
  ■ Large-margin (support vector machine; SVM)
  ■ Logistic regression

SLIDE 4

Text classification

Problem definition

■ Given a text w = (w1, w2, …, wT) ∈ V*
■ Choose a label y ∈ Y
■ For example:
  ■ Sentiment analysis: Y = { positive, negative, neutral }
  ■ Toxic comment classification: Y = { toxic, non-toxic }
  ■ Language identification: Y = { Mandarin, English, Spanish, … }

Example: "The drinks were strong but the fish tacos were bland"
w = (w1, w2, …, w10), y = negative

SLIDE 5

How to represent text for classification?

One choice of R: bag-of-words

■ Sequence length T can be different for every sentence/document
■ The bag-of-words is a fixed-length vector of word counts
■ Length of x is equal to the size of the vocabulary, V
■ For each x there may be many possible w (the representation ignores word order)

The drinks were strong but the fish tacos were bland

x = word counts indexed by the vocabulary (aardvark, …, zyther): bland 1, but 1, fish 1, strong 1, tacos 1, the 2, were 2; all other entries 0.
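A concrete sketch (not from the slides) of building such a vector in Python; the toy vocabulary and whitespace tokenizer are simplifying assumptions for illustration:

from collections import Counter

def bag_of_words(text, vocab):
    # Count how many times each vocabulary word appears in the text.
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]

vocab = ["bland", "but", "fish", "strong", "tacos", "the", "were"]  # toy vocabulary
x = bag_of_words("The drinks were strong but the fish tacos were bland", vocab)
print(x)  # [1, 1, 1, 1, 1, 2, 2]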

SLIDE 6

Linear classification on bag-of-words

■ Let ψ(x, y) score the compatibility of bag-of-words x and label y. Then:

  ŷ = argmax_y ψ(x, y)

■ In a linear classifier this scoring function has the simple form:

  ψ(x, y) = θ · f(x, y) = Σ_j θj × fj(x, y),

where θ is a vector of weights, and f is a feature function.
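A minimal sketch of the prediction rule, assuming a scoring helper compute_score(x, y, weights) like the one defined later on slide 10:

def predict(x, labels, weights):
    # Return the label y whose score θ·f(x, y) is highest.
    return max(labels, key=lambda y: compute_score(x, y, weights))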

SLIDE 7

Feature functions

■ In classification, the feature function is usually a simple combination of x and y, such as:

  fj(x, y) = x_fantastic if y = positive, 0 otherwise

■ If we have K labels, this corresponds to column vectors that look like:

  f(x, y = 1) = [ x0, x1, …, x|V|, 0, 0, …, 0 ]ᵀ

  where the trailing zeros span the remaining (K − 1) × V positions.

SLIDE 8

Feature functions

■ In classification, the feature function is usually a simple combination of x and y, such as:

  fj(x, y) = x_fantastic if y = positive, 0 otherwise

■ If we have K labels, this corresponds to column vectors that look like:

  f(x, y = 1) = [ x0, x1, …, x|V|, 0, 0, …, 0 ]ᵀ
  f(x, y = 2) = [ 0, …, 0, x0, x1, …, x|V|, 0, 0, …, 0 ]ᵀ

  where the leading zeros of f(x, y = 2) span V positions and the trailing zeros span (K − 2) × V positions.

SLIDE 9

Feature functions

■ In classification, the feature function is usually a simple combination of x and y, such as:

  fj(x, y) = x_bland if y = negative, 0 otherwise

■ If we have K labels, this corresponds to column vectors that look like:

  f(x, y = 1) = [ x0, x1, …, x|V|, 0, 0, …, 0 ]ᵀ
  f(x, y = 2) = [ 0, …, 0, x0, x1, …, x|V|, 0, 0, …, 0 ]ᵀ
  f(x, y = K) = [ 0, 0, …, 0, x0, x1, …, x|V| ]ᵀ

  where the leading zeros of f(x, y = K) span (K − 1) × V positions.
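The Python slides that follow call a feature_function that is never shown; one plausible sketch, assuming sparse dict-based features keyed by (label, word) pairs plus a per-label offset, is:

def feature_function(x, y):
    # x: dict mapping words to counts (bag-of-words); y: a label.
    # Conjoin every word count with the label, plus an offset feature for the label itself.
    features = {(y, word): count for word, count in x.items()}
    features[(y, "__offset__")] = 1
    return features

With weights stored in a dict keyed the same way (for example collections.defaultdict(float)), the compute_score function on the next slide can use this directly.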

SLIDE 10

Linear classification in Python

def compute_score(x, y, weights):
    total = 0
    for feature, count in feature_function(x, y).items():
        total += weights[feature] * count
    return total

x = the bag-of-words counts for "The drinks were strong but the fish tacos were bland" (as on slide 5); θ = a weight vector of length K × V.
SLIDE 11

Linear classification in Python

import numpy as np

def compute_score(x, y, weights):
    # Here feature_function(x, y) is assumed to return a dense vector of length K × V,
    # so the score is a single dot product with the weight vector θ.
    return np.dot(weights, feature_function(x, y))

x = the bag-of-words counts as on slide 5; θ = a weight vector of length K × V.
SLIDE 12

Ok, but how to obtain θ ?

■ The learning problem is to find the right weights θ.
■ The rest of this lecture will cover four supervised learning algorithms:
  ■ Naïve Bayes
  ■ Perceptron
  ■ Large-margin (support vector machine)
  ■ Logistic regression
■ All these methods assume a labeled dataset of N examples: {(x(i), y(i))}, i = 1, …, N.

SLIDE 13

Probabilistic classification

■ Naïve Bayes is a probabilistic classifier. It takes the following strategy:
  ■ Define a probability model p(x, y)
  ■ Estimate the parameters of the probability model by maximum likelihood, i.e. by maximizing the likelihood of the dataset
  ■ Set the scoring function equal to the log-probability:

    ψ(x, y) = log p(x, y) = log p(y | x) + C,

  where C = log p(x) is constant in y. This ensures that:

    ŷ = argmax_y ψ(x, y) = argmax_y p(y | x).

SLIDE 14

A probability model for text classification

■ First, assume each instance ((x, y) pair) is independent of the others:

  p(x(1:N), y(1:N)) = Π_{i=1..N} p(x(i), y(i))

■ Apply the chain rule of probability:

  p(x, y) = p(x | y) × p(y)

■ Define the parametric form of each probability:

  p(y) = Categorical(μ),  p(x | y) = Multinomial(φ, T)

■ The multinomial is a distribution over vectors of counts
■ The parameters μ and φ are vectors of probabilities
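To make the generative story concrete, here is a small numpy sketch that samples a label and then a count vector; the parameter values are made up, and φ is taken per label (matching the φy,j estimates on slide 17):

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, 0.3, 0.2])            # p(y) for K = 3 labels
phi = np.array([[0.7, 0.2, 0.1],          # p(word | y), one row per label, V = 3 words
                [0.2, 0.6, 0.2],
                [0.1, 0.1, 0.8]])

y = rng.choice(len(mu), p=mu)             # y ~ Categorical(mu)
x = rng.multinomial(10, phi[y])           # x ~ Multinomial(phi_y, T) with T = 10 tokens
print(y, x)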

SLIDE 15

The multinomial distribution

■ Suppose the word bland has probability φj. What is the probability that this word appears 3 times?
■ Each word's probability is exponentiated by its count:

  Multinomial(x; φ, T) = [ (Σ_{j=1..V} xj)! / Π_{j=1..V} (xj!) ] Π_{j=1..V} φj^xj

■ The coefficient is the count of the number of possible orderings of x. Crucially, it does not depend on the frequency parameter φ.
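A direct transcription of this formula for checking small cases, in plain Python; the function name is just for illustration:

from math import factorial

def multinomial_pmf(x, phi):
    # x: list of counts per word; phi: list of word probabilities (same length).
    coefficient = factorial(sum(x))
    for count in x:
        coefficient //= factorial(count)        # number of possible orderings of x
    prob = 1.0
    for count, p in zip(x, phi):
        prob *= p ** count                      # each probability exponentiated by its count
    return coefficient * prob

print(multinomial_pmf([2, 1], [0.5, 0.5]))      # 3!/(2!*1!) * 0.5**3 = 0.375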

SLIDE 16

Naïve Bayes text classification

■ Naïve Bayes can be formulated in our linear classification framework by setting θ equal to the log parameters:

  ψ(x, y) = θ · f(x, y) = log p(x | y) + log p(y),

  θ = [ log φy1,w1, log φy1,w2, …, log φy1,wV, log μy1, log φy2,w1, …, log φy2,wV, log μy2, …, log φyK,wV, log μyK ]ᵀ

  a vector of length K × (V + 1), where f(x, y) is extended to include an "offset" 1 for each possible label after the word counts.
slide-17
SLIDE 17

Estimating Naïve Bayes

■ In relative frequency estimation, the parameters are set to empirical frequencies:

  φ̂y,j = count(y, j) / Σ_{j′=1..V} count(y, j′) = Σ_{i: y(i)=y} xj(i) / Σ_{j′=1..V} Σ_{i: y(i)=y} xj′(i)

  μ̂y = count(y) / Σ_{y′} count(y′)

■ This turns out to be identical to the maximum likelihood estimate (yay):

  φ̂, μ̂ = argmax_{φ,μ} Π_{i=1..N} p(x(i), y(i)) = argmax_{φ,μ} Σ_{i=1..N} log p(x(i), y(i))

SLIDE 18

Smoothing, bias, variance

■ To deal with low counts, it can be helpful to smooth probabilities:

  φ̂y,j = (α + count(y, j)) / (Vα + Σ_{j′=1..V} count(y, j′))

■ Smoothing introduces bias, moving the parameters away from their maximum-likelihood estimates.
■ But, it corrects variance, the extent to which the parameters depend on the idiosyncrasies of a finite dataset.
■ The smoothing term α is a hyperparameter that must be tuned on a development set.
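A sketch of the estimator with add-α smoothing, assuming the training data is an N × V count matrix X and integer labels y in 0..K−1 (α = 0 recovers the unsmoothed relative-frequency estimate):

import numpy as np

def estimate_nb(X, y, K, alpha=1.0):
    # X: N x V array of word counts; y: length-N array of integer labels.
    V = X.shape[1]
    phi = np.zeros((K, V))
    mu = np.zeros(K)
    for k in range(K):
        counts = X[y == k].sum(axis=0)                  # count(y=k, j) for every word j
        phi[k] = (alpha + counts) / (V * alpha + counts.sum())
        mu[k] = (y == k).sum()
    return phi, mu / mu.sum()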

SLIDE 19

Too naïve?

■ Naïve Bayes is so called because:
  ■ Bayes rule is used to convert the observation probability p(x | y) into the label probability p(y | x).
  ■ The multinomial distribution naïvely ignores dependencies between words, and treats each word as equally informative. Suppose naïve and Bayes always occur together. Should we really count them both independently for classification?
■ Discriminative classifiers avoid this problem by not attempting to model the generative probability p(x).

SLIDE 20

The perceptron classifier

■ A simple learning rule:
  ■ Run the current classifier on an instance in the training data, obtaining ŷ = argmax_y ψ(x(i), y).
  ■ If the prediction is incorrect:
    1. Increase the weights for the features of the true label y(i)
    2. Decrease the weights for the features of the predicted label ŷ

      θ ← θ + f(x(i), y(i)) − f(x(i), ŷ)

  ■ Repeat until all training instances are correctly classified (or you run out of time)
■ If the dataset is linearly separable — if there is some θ that correctly labels all the training instances — then this method is guaranteed to find it.
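A sketch of the full training loop, reusing the dict-based feature_function and compute_score assumed on the earlier slides:

from collections import defaultdict

def perceptron(data, labels, epochs=10):
    # data: list of (x, y) pairs; labels: the set of possible labels Y.
    theta = defaultdict(float)
    for _ in range(epochs):
        for x, y_true in data:
            y_hat = max(labels, key=lambda y: compute_score(x, y, theta))
            if y_hat != y_true:
                # Increase weights for the true label's features,
                # decrease weights for the predicted label's features.
                for feat, count in feature_function(x, y_true).items():
                    theta[feat] += count
                for feat, count in feature_function(x, y_hat).items():
                    theta[feat] -= count
    return theta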

SLIDE 21

Loss functions

■ Many classifiers can be viewed as minimizing a loss function on the weights. Such a function should have two properties:
  ■ It should be a good proxy for the accuracy of the classifier.
  ■ It should be easy to optimize.
■ Do you see why 1 − accuracy is not a good loss function?

  ℓ0-1(θ; x(i), y(i)) = 0 if y(i) = argmax_y θ · f(x(i), y), 1 otherwise
SLIDE 22

Perceptron as gradient descent

■ The perceptron can be viewed as optimizing the loss function:

  ℓperceptron(θ; x(i), y(i)) = −θ · f(x(i), y(i)) + max_{y′≠y(i)} θ · f(x(i), y′)

■ The gradient of the perceptron loss is part of the perceptron update:

  ∂/∂θ ℓperceptron = −f(x(i), y(i)) + f(x(i), ŷ)

  θ(t+1) ← θ(t) − ∂/∂θ ℓperceptron = θ(t) + f(x(i), y(i)) − f(x(i), ŷ)

Gradient descent!

SLIDE 23

Perceptron vs. Naïve Bayes

■ Both the Naïve Bayes and perceptron loss functions are convex, making them relatively easy to optimize. However, NB can be optimized in closed form, while the perceptron requires iterating over the dataset multiple times.
■ NB can suffer infinite loss on a single example, since the logarithm of 0 probability is −inf; some examples will be over-emphasized, others will be under-emphasized.
■ NB assumes the observed features are conditionally independent given the label; performance depends on the extent to which this holds.
■ The perceptron treats all correct answers equally: even if θ gives the correct answer by only a very small margin, the loss is still 0.

SLIDE 24

Large margin learning

■ For better generalization, the correct label should outscore all other labels by a large margin:

  γ(θ; x(i), y(i)) = θ · f(x(i), y(i)) − max_{y′≠y(i)} θ · f(x(i), y′)

■ The margin can be incorporated into a margin loss:

  ℓMARGIN(θ; x(i), y(i)) = 0 if γ(θ; x(i), y(i)) ≥ 1, 1 − γ(θ; x(i), y(i)) otherwise
                        = max(0, 1 − γ(θ; x(i), y(i)))

SLIDE 25

Large margin learning

■ Margin loss can be minimized using a learning rule similar to the perceptron. First, let's generalize the notion of classification error using a cost function:

  c(y(i), y) = 1 if y(i) ≠ y, 0 otherwise

■ Using the cost function, we define the online support vector machine (SVM) classification rule:

  ŷ = argmax_{y∈Y} θ · f(x(i), y) + c(y(i), y)    (cost-augmented decoding)

  θ(t) ← (1 − λ)θ(t−1) + f(x(i), y(i)) − f(x(i), ŷ)    (the (1 − λ) factor acts as regularization)
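A sketch of one online SVM step with cost-augmented decoding, written with dense numpy feature vectors (the vector-valued feature_function of slide 11); λ and the helper names are assumptions:

import numpy as np

def svm_step(theta, x, y_true, labels, feature_function, lam=0.01):
    # Cost-augmented decoding: every incorrect label gets +1 added to its score.
    def augmented_score(y):
        return theta @ feature_function(x, y) + (0.0 if y == y_true else 1.0)
    y_hat = max(labels, key=augmented_score)
    # Shrink the weights (regularization), then apply the perceptron-style correction.
    return (1 - lam) * theta + feature_function(x, y_true) - feature_function(x, y_hat)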

SLIDE 26

Large margin learning


SLIDE 27

Large margin learning


SLIDE 28

Logistic regression

■ Perceptron and large margin classification are discriminative: they learn to discriminate correct and incorrect labels.
■ Naïve Bayes is probabilistic: it assigns calibrated confidence scores to its predictions.
■ Logistic regression is both discriminative and probabilistic. It directly computes the conditional probability of the label:

  p(y | x; θ) = exp(θ · f(x, y)) / Σ_{y′∈Y} exp(θ · f(x, y′))

■ Exponentiation ensures that the probabilities are non-negative.
■ Normalization ensures that the probabilities sum to one.
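This conditional probability is a softmax over the label scores; a minimal numpy sketch (feature encoding as on slide 11), with the maximum score subtracted for numerical stability:

import numpy as np

def label_probabilities(x, labels, theta, feature_function):
    scores = np.array([theta @ feature_function(x, y) for y in labels])
    scores -= scores.max()            # stabilize before exponentiating
    probs = np.exp(scores)
    return probs / probs.sum()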

SLIDE 29

Learning logistic regression

■ Two equivalent views of logistic regression learning:
  ■ Maximization of the conditional log-likelihood:

    log p(y(1:N) | x(1:N); θ) = Σ_{i=1..N} log p(y(i) | x(i); θ)
                              = Σ_{i=1..N} [ θ · f(x(i), y(i)) − log Σ_{y′∈Y} exp(θ · f(x(i), y′)) ]

    where p(y | x; θ) = exp(θ · f(x, y)) / Σ_{y′∈Y} exp(θ · f(x, y′))

  ■ Minimization of the logistic loss:

    ℓLogReg(θ; x(i), y(i)) = −θ · f(x(i), y(i)) + log Σ_{y′∈Y} exp(θ · f(x(i), y′))

    (Compare to the perceptron loss!)
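The logistic loss is the negative true-label score plus a log-sum-exp over all label scores; a small sketch, computing the log-sum-exp stably by hand (the helper names are assumptions):

import numpy as np

def logistic_loss(x, y_true, labels, theta, feature_function):
    scores = np.array([theta @ feature_function(x, y) for y in labels])
    m = scores.max()
    log_z = m + np.log(np.exp(scores - m).sum())   # log Σ exp(θ·f(x, y'))
    return -(theta @ feature_function(x, y_true)) + log_z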

SLIDE 30

Loss functions

[Figure: 0/1 loss, margin loss, and logistic loss plotted as functions of θ · f(x(i), y(i)) − θ · f(x(i), ŷ).]

SLIDE 31

High-dimensional classification

■ Possible problems:
  ■ What if the number of features in f(x, y) is larger than the number of training instances?
  ■ What happens in logistic regression if a feature appears only with one label? What will its weight be?
■ These problems relate to the variance of the classifier — its sensitivity to idiosyncratic features of the training data.

SLIDE 32

Regularization

■ Learning can often be made more robust by regularization: penalizing large weights. E.g.:

  min_θ Σ_{i=1..N} ℓLogReg(θ; x(i), y(i)) + λ||θ||₂²,   where ||θ||₂² = Σ_j θj²

■ The scalar λ controls the strength of regularization (a hyperparameter).
■ The support vector machine classifier combines regularization with the large margin loss.
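Reusing the logistic_loss sketch from slide 29, the regularized training objective adds one extra term; λ here is just an illustrative value:

def regularized_objective(data, labels, theta, feature_function, lam=0.1):
    # Sum of per-example logistic losses plus the squared L2 norm of the weights.
    loss = sum(logistic_loss(x, y, labels, theta, feature_function) for x, y in data)
    return loss + lam * float(theta @ theta)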

SLIDE 33

Regularization


SLIDE 34

Gradient descent

■ Logistic regression, perceptron and large margin classification all learn by minimizing a loss function. A general strategy for minimization is gradient descent:

  θ(t+1) ← θ(t) − η ∂/∂θ Σ_{i=1..N} ℓ(θ(t); x(i), y(i)),

■ where η ∈ ℝ+ is the learning rate.

SLIDE 35

Stochastic gradient descent

■ Computing the gradient over all instances (batched) is expensive.
■ Stochastic gradient descent (SGD) approximates the gradient by its value on a single instance:

  ∂/∂θ Σ_{i=1..N} ℓ(θ(t); x(i), y(i)) ≈ N × ∂/∂θ ℓ(θ(t); x(i), y(i)),

  where (x(i), y(i)) is sampled at random from the training set.
■ Minibatch gradient descent approximates the gradient by its value on a small number of instances. This is well suited to modern high-throughput hardware (e.g. GPUs and TPUs) and is commonly used in deep learning.
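A sketch of single-instance SGD on the logistic loss, reusing the label_probabilities helper from the slide 28 sketch; the gradient used is the standard expected-features-minus-observed-features form:

import numpy as np

def sgd_logreg(data, labels, theta, feature_function, eta=0.1, steps=1000, seed=0):
    # data: list of (x, y) pairs; theta: initial weight vector (numpy array).
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        x, y_true = data[rng.integers(len(data))]     # sample one instance at random
        probs = label_probabilities(x, labels, theta, feature_function)
        # Gradient of the logistic loss: expected feature vector minus observed features.
        grad = sum(p * feature_function(x, y) for p, y in zip(probs, labels))
        grad = grad - feature_function(x, y_true)
        theta = theta - eta * grad
    return theta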

SLIDE 36

Linear classification: summary

Classifier             Pros                                         Cons
Naïve Bayes            simple, probabilistic, fast                  not very accurate
Perceptron             simple, accurate                             not probabilistic, may overfit
Large margin           error-driven learning, can be regularized    not probabilistic
Logistic regression    error-driven learning, regularized           more difficult to implement

SLIDE 37

Announcements

■ Please fill out the survey posted to Piazza so that we can get to know you better! Due next Friday (9/11).
■ No recitation this Friday 9/4. Please use office hours or post to Piazza if you have questions!
■ Working on pushing the recitation back to start at 3pm to avoid current conflict w/ Colloquium.