Text Classification (Diyi Yang)


SLIDE 1

CS 4650/7650: Natural Language Processing

Text Classification

Diyi Yang

Some slides borrowed from Jacob Eisenstein (was at GT) and Dan Jurafsky at Stanford

SLIDE 2

TA Office Hours

- Ian Stewart: Tuesdays, 2-4pm, Coda C1106
- Jiaao Chen: Thursdays, 2-4pm, Coda C1008
- Nihal Singh: Fridays, 9-11am, Coda C1008
- Jingfeng Yang: Mondays, 10am-12pm, Coda 14th common area

SLIDE 3

Sign Up for Piazza

https://piazza.com/gatech/spring2020/cs7650cs4650/home

SLIDE 4

Staff Mailing List

cs4650-7650-s20-staff@googlegroups.com

SLIDE 5

Waiting List

SLIDE 6

Your Homework 1

- Due date: Jan 15th, 3:00pm EST

SLIDE 7

- Other questions?

SLIDE 8

Very Quick Review on Probabilities

- Event space (e.g., 𝒳, 𝒴); in this class, usually discrete
- Random variables (e.g., X, Y)
- A random variable X takes value x ∈ 𝒳 with probability P(X = x), written P(x) for short

SLIDE 9

Very Quick Review on Probabilities

- Joint probability: P(X = x, Y = y)
- Conditional probability: P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y)

SLIDE 10

Very Quick Review on Probabilities

- Always true:

  P(X = x, Y = y) = P(X = x | Y = y) · P(Y = y) = P(Y = y | X = x) · P(X = x)

- Sometimes true (only when X and Y are independent):

  P(X = x, Y = y) = P(X = x) · P(Y = y)

SLIDE 11

Very Quick Review on Probabilities

! " = !! !! !%" !

¡ The number of ways to select k words out of n given words (“unordered samples without

replacement”)

& &', &), … , &" = &! &'! &)! ⋯ &"!

¡ Here, &, &', &) … , &" are all non-negative integers, and &' + &) + &- + ⋯ &" = & ¡ The number of ways to split n distinct words into k distinct groups of sizes n1, . . . , nk, respectively
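
The two counting formulas are easy to sanity-check directly. A minimal Python sketch (the function names are mine, not from the slides):

    from math import factorial, prod

    def binomial(n, k):
        # number of ways to choose k items out of n, unordered, without replacement
        return factorial(n) // (factorial(k) * factorial(n - k))

    def multinomial_coefficient(counts):
        # number of ways to split sum(counts) distinct items into groups of these sizes
        return factorial(sum(counts)) // prod(factorial(c) for c in counts)

    print(binomial(5, 2))                      # 10
    print(multinomial_coefficient([2, 2, 1]))  # 30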

SLIDE 12

Classification

- A mapping h from input data x (drawn from instance space 𝒳) to a label y from some enumerable output space 𝒴
- 𝒳 = set of all documents
- 𝒴 = {English, Mandarin, Greek, …}
- x = a single document
- y = ancient Greek

SLIDE 13

Movie Ratings

SLIDE 14

Customer Review

SLIDE 15

Political Opinion Mining

SLIDE 16

Female or Male Author?

SLIDE 17

Is This Spam?

SLIDE 18

What Is the Subject of This Article?

SLIDE 19

This Class

- Basic representations of text data for classification
- Three linear classifiers:
  - Naïve Bayes
  - Perceptron
  - Logistic regression

SLIDE 20

The Text Classification Problem

- Given a text w = (w1, w2, …, wT) ∈ 𝒱*, predict a label y ∈ 𝒴

SLIDE 21

Some Direct Text Classification Applications

Task                        𝒳       𝒴
Language identification     text    {English, Mandarin, Greek, …}
Spam classification         email   {spam, not spam}
Authorship attribution      text    {J.K. Rowling, James Joyce, …}
Genre classification        novel   {detective, romance, gothic, …}
Sentiment classification    text    {positive, negative, neutral, mixed}

SLIDE 22

Some Direct Text Classification Applications

Task                        𝒳       𝒴
Language identification     text    {English, Mandarin, Greek, …}
Spam classification         email   {spam, not spam}
Authorship attribution      text    {J.K. Rowling, James Joyce, …}
Genre classification        novel   {detective, romance, gothic, …}
Sentiment classification    text    {positive, negative, neutral, mixed}

Indirectly, methods from text classification apply to a huge range of settings in natural language processing, and will appear again and again throughout the course.

SLIDE 23

Bag-of-Words

SLIDE 24

The Bag-of-Words

- One challenge is that the sequential representation (w1, w2, …, wT) may have a different length T for every document.
- The bag-of-words is a fixed-length representation, which consists of a vector of word counts:

  x_j = count of word j in the document

- The length of x is equal to the size of the vocabulary V
- For each x, there may be many possible w, depending on word order.
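
As a rough illustration (not from the slides), a minimal Python sketch that maps a variable-length token sequence to a fixed-length count vector over an assumed toy vocabulary; out-of-vocabulary words are simply dropped:

    from collections import Counter

    def bag_of_words(tokens, vocab):
        # map a token sequence of any length to a fixed-length vector of word counts
        counts = Counter(tokens)
        return [counts[word] for word in vocab]

    vocab = ["the", "whale", "sea", "happy"]
    print(bag_of_words("the whale and the sea".split(), vocab))  # [2, 1, 1, 0]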

SLIDE 25

Linear Classification on the Bag of Words

- Let Ψ(x, y) score the compatibility of bag-of-words x and label y, then

  ŷ = argmax_y Ψ(x, y)

- In a linear classifier, this scoring function has a simple form:

  Ψ(x, y) = θ · f(x, y) = Σ_j θ_j · f_j(x, y)

- where θ is a vector of weights, and f is a feature function
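
A minimal sketch of the scoring rule and the argmax prediction; f stands for a feature function such as the one on the next slide, and the function names are mine:

    import numpy as np

    def score(theta, f, x, y):
        # Psi(x, y) = theta . f(x, y)
        return theta @ f(x, y)

    def predict(theta, f, x, labels):
        # y_hat = argmax_y Psi(x, y)
        return max(labels, key=lambda y: score(theta, f, x, y))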

SLIDE 26

Feature Functions

- In classification, the feature function is usually a simple combination of x and y, such as:

  f_j(x, y) = x_whale,  if y = FICTION
              0,        otherwise
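
One common way to realize such label-specific features is to give each label its own copy of the bag-of-words inside one long feature vector; a sketch under that assumption (the toy labels and counts are mine):

    import numpy as np

    def feature_function(x, y, labels):
        # copy the bag-of-words vector x into the block of f that corresponds to
        # label y; all other blocks stay zero
        x = np.asarray(x, dtype=float)
        f = np.zeros(len(x) * len(labels))
        offset = labels.index(y) * len(x)
        f[offset:offset + len(x)] = x
        return f

    labels = ["FICTION", "NEWS"]
    print(feature_function([2, 1, 1, 0], "NEWS", labels))  # [0. 0. 0. 0. 2. 1. 1. 0.]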

SLIDE 27

Summary and Next Steps

- To summarize, our classification function is:

  ŷ = argmax_y θ · f(x, y)

  where x is the bag-of-words representation, and f is a feature function

- The learning problem is to find the right weights θ, assuming a labeled dataset {(x^(i), y^(i))}, i = 1, …, N

SLIDE 28

Probabilistic Classification

- Naïve Bayes is a probabilistic classifier. It takes the following strategy:
  - Define a probability model P(x, y)
  - Estimate the parameters of the probability model by maximum likelihood, that is, by maximizing the likelihood of the dataset

SLIDE 29

A Probability Model for Text Classification

- First, assume each instance is independent of the others

  P(x^(1:N), y^(1:N)) = ∏_{i=1}^{N} P(x^(i), y^(i))

- Apply the chain rule of probability

  P(x, y) = P(x | y) · P(y)

- Define the parametric form of each probability

  P(y) = Categorical(μ)
  P(x | y) = Multinomial(φ_y)

- The multinomial is a distribution over vectors of counts
- The parameters μ and φ are vectors of probabilities
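
As a rough illustration of this generative story (the toy parameters are mine, not from the slides), sampling one labeled document as a count vector:

    import numpy as np

    rng = np.random.default_rng(0)
    labels = ["FICTION", "NEWS"]
    mu = np.array([0.6, 0.4])             # P(y): Categorical parameters
    phi = np.array([[0.5, 0.3, 0.2],      # P(x | y = FICTION): word probabilities
                    [0.2, 0.2, 0.6]])     # P(x | y = NEWS)

    y = rng.choice(len(labels), p=mu)     # draw a label
    x = rng.multinomial(10, phi[y])       # draw a 10-token document as counts
    print(labels[y], x)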

SLIDE 30

The Multinomial Distribution

- Suppose the word whale has probability φ_whale
- What is the probability that this word appears 3 times?

SLIDE 31

The Multinomial Distribution

Each word's probability is exponentiated by its count:

- Multinomial(x; φ) = ( (Σ_j x_j)! / ∏_j x_j! ) · ∏_j φ_j^{x_j}

SLIDE 32

The Multinomial Distribution

Each word's probability is exponentiated by its count:

- Multinomial(x; φ) = ( (Σ_j x_j)! / ∏_j x_j! ) · ∏_j φ_j^{x_j}

- The coefficient is the number of possible orderings of the count vector x.

SLIDE 33

The Multinomial Distribution

Each word's probability is exponentiated by its count:

- Multinomial(x; φ) = ( (Σ_j x_j)! / ∏_j x_j! ) · ∏_j φ_j^{x_j}

- The coefficient is the number of possible orderings of the count vector x.
- Crucially, it does not depend on the frequency parameter φ
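
A direct transcription of this probability mass function as a Python sketch; in practice you would work in log space (see the underflow discussion below):

    from math import factorial, prod

    def multinomial_pmf(x, phi):
        # coefficient: (sum_j x_j)! / prod_j x_j!  -- does not depend on phi
        coefficient = factorial(sum(x)) / prod(factorial(c) for c in x)
        # each word's probability exponentiated by its count
        return coefficient * prod(p ** c for p, c in zip(phi, x))

    print(multinomial_pmf([3, 1, 6], [0.2, 0.3, 0.5]))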

SLIDE 34

Estimating Naïve Bayes

- In relative frequency estimation, the parameters are set to empirical frequencies:

  φ̂_{y,j} = count(y, j) / Σ_{j'} count(y, j')

- This turns out to be identical to the maximum likelihood estimate.
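
A minimal sketch of the relative frequency estimate for φ, assuming documents arrive as a bag-of-words count matrix and labels as an integer array (the names are mine):

    import numpy as np

    def estimate_phi(X, y, num_labels):
        # X: (num_docs, vocab_size) count matrix; y: integer label per document
        phi = np.zeros((num_labels, X.shape[1]))
        for label in range(num_labels):
            counts = X[y == label].sum(axis=0)
            phi[label] = counts / counts.sum()  # empirical word frequencies for this label
        return phi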

SLIDE 35

Quick Question (1)

Multiplying lots of small probabilities (all are under 1) can lead to numerical underflow …

SLIDE 36

Quick Question (1)

Multiplying lots of small probabilities (all are under 1) can lead to numerical underflow …

- Solution: work with log probabilities, so the product of word probabilities becomes a sum:

  log ∏_j φ_j^{x_j} = Σ_j x_j · log φ_j
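
A minimal sketch of that sum of log probabilities:

    import numpy as np

    def log_multinomial_likelihood(x, phi):
        # sum_j x_j * log(phi_j): a sum of logs instead of a product of small numbers
        return float(np.dot(x, np.log(phi)))

    print(log_multinomial_likelihood([3, 1, 6], [0.2, 0.3, 0.5]))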

SLIDE 37

Low Count Issue

- What if we have seen no training documents with the word fantastic and classified in the topic positive?

  P̂("fantastic" | positive) = count("fantastic", positive) / Σ_{w ∈ V} count(w, positive) = 0

- Zero probabilities cannot be conditioned away

SLIDE 38

Smoothing

- To deal with low counts, it can be helpful to smooth the probabilities:

  P̂(w | y) = (count(w, y) + α) / (Σ_{w' ∈ V} count(w', y) + α|V|)

- The smoothing term α is a hyperparameter, which must be tuned on a development set
- Laplace (add-1) smoothing, with α = 1, is widely used
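
A sketch of the smoothed estimate, as a small variation on the earlier relative-frequency sketch; the default α = 1 gives Laplace smoothing (the names are mine):

    import numpy as np

    def estimate_phi_smoothed(X, y, num_labels, alpha=1.0):
        # add-alpha smoothing of the word probabilities
        phi = np.zeros((num_labels, X.shape[1]))
        for label in range(num_labels):
            counts = X[y == label].sum(axis=0) + alpha
            phi[label] = counts / counts.sum()  # denominator grows by alpha * vocab size
        return phi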

SLIDE 39

Too Naïve?

- Naïve Bayes is so called because:
  - Bayes' rule is used to convert the observation probability P(x | y) into the label probability P(y | x)
  - The multinomial distribution naively ignores dependencies between words, and treats every word as equally informative
- Discriminative classifiers avoid this problem by not attempting to model the "generative" probability P(x)

SLIDE 40

The Perceptron Classifier

- Error-driven learning, rather than an independence assumption

SLIDE 41

The Perceptron Classifier

- A simple learning rule:
  - Run the current classifier on an instance in the training data, obtaining

      ŷ = argmax_y Ψ(x^(i), y)

  - If the prediction is incorrect:
    - Increase the weights for the features of the true label
    - Decrease the weights for the features of the predicted label

      θ ← θ + f(x^(i), y^(i)) − f(x^(i), ŷ)

  - Repeat until all training instances are correctly classified, or you run out of time
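
A minimal Python sketch of this learning rule, assuming a feature function f(x, y) as before (the names and the epoch limit are mine):

    import numpy as np

    def perceptron_train(data, f, labels, dim, max_epochs=10):
        # data: list of (x, y) training pairs; f(x, y) returns a vector of length dim
        theta = np.zeros(dim)
        for _ in range(max_epochs):
            mistakes = 0
            for x, y in data:
                y_hat = max(labels, key=lambda label: theta @ f(x, label))
                if y_hat != y:
                    theta += f(x, y) - f(x, y_hat)  # promote true label, demote prediction
                    mistakes += 1
            if mistakes == 0:  # every training instance classified correctly
                break
        return theta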

SLIDE 42

The Perceptron Classifier (Online Learning)

SLIDE 43

Loss Function

- Many classifiers can be viewed as minimizing a loss function on the weights.
- Such a function should have two properties:
  - It should be a good proxy for the accuracy of the classifier
  - It should be easy to optimize

SLIDE 44

Perceptron as Gradient Descent

- The perceptron can be viewed as optimizing the loss function

  ℓ_perceptron(θ; x^(i), y^(i)) = max_y θ · f(x^(i), y) − θ · f(x^(i), y^(i))

SLIDE 45

Perceptron as Gradient Descent

- The perceptron can be viewed as optimizing the loss function above
- The gradient of the perceptron loss is part of the perceptron update:

  ∇_θ ℓ_perceptron = f(x^(i), ŷ) − f(x^(i), y^(i))

SLIDE 46

Logistic Regression

- Perceptron classification is discriminative: it learns to discriminate correct and incorrect labels
- Naïve Bayes is probabilistic: it assigns calibrated confidence scores to its predictions
- Logistic regression is both discriminative and probabilistic. It directly computes the conditional probability of the label:

  P(y | x; θ) = exp(θ · f(x, y)) / Σ_{y' ∈ 𝒴} exp(θ · f(x, y'))

SLIDE 47

Logistic Regression

- Logistic regression is both discriminative and probabilistic. It directly computes the conditional probability of the label (shown above).
- Exponentiation ensures that the probabilities are non-negative.

SLIDE 48

Logistic Regression

- Logistic regression is both discriminative and probabilistic. It directly computes the conditional probability of the label (shown above).
- Exponentiation ensures that the probabilities are non-negative.
- Normalization ensures that the probabilities sum to one.
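
A minimal sketch of this softmax computation; subtracting the maximum score is a standard stability trick, not something the slides require:

    import numpy as np

    def label_probabilities(theta, f, x, labels):
        # P(y | x) = exp(theta . f(x, y)) / sum_y' exp(theta . f(x, y'))
        scores = np.array([theta @ f(x, y) for y in labels])
        scores -= scores.max()        # stabilize before exponentiating
        exp_scores = np.exp(scores)
        return exp_scores / exp_scores.sum()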

SLIDE 49

Learning Logistic Regression

- Maximization of the conditional log-likelihood

SLIDE 50

Learning Logistic Regression

- Maximization of the conditional log-likelihood

  ℓ(θ) = Σ_{i=1}^{N} log P(y^(i) | x^(i); θ)

- Equivalently, minimization of the negative log-likelihood (the logistic loss)

SLIDE 51

Regularization

- Learning can often be made more robust by regularization: penalizing large weights

  min_θ  −ℓ(θ) + (λ/2) ‖θ‖²

- where the scalar λ controls the strength of regularization, and ‖θ‖² is the squared norm of the weights

SLIDE 52

Gradient Descent (Batch Optimization)

- Logistic regression and the perceptron both learn by minimizing a loss function. A general strategy for minimization is gradient descent:

  θ^(t+1) ← θ^(t) − η^(t) ∇_θ L(θ^(t))

- where η^(t) ∈ ℝ₊ is the learning rate at iteration t
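
A minimal sketch of batch gradient descent, assuming a function grad(theta) that returns the gradient of the loss over all training instances (the names and defaults are mine):

    import numpy as np

    def gradient_descent(grad, theta0, learning_rate=0.1, num_iterations=100):
        # grad(theta): gradient of the loss summed over ALL training instances
        theta = np.array(theta0, dtype=float)
        for _ in range(num_iterations):
            theta = theta - learning_rate * grad(theta)
        return theta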

SLIDE 53

Stochastic Gradient Descent (Online Optimization)

- Computing the gradient over all instances is expensive
- Stochastic gradient descent approximates the gradient by its value on a single instance:

  θ^(t+1) ← θ^(t) − η^(t) ∇_θ ℓ(θ^(t); x^(i), y^(i))

- where (x^(i), y^(i)) is sampled at random from the training set

- Convergence is still theoretically guaranteed!

SLIDE 54

Online Optimization

- Gradient descent computes the gradient over all instances
- Stochastic gradient descent approximates the gradient by its value on a single instance
- Minibatch gradient descent approximates the gradient by its value on a small number of instances. This is well suited to GPU architectures and widely used in deep learning.
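
A minimal sketch of minibatch stochastic gradient descent; the batch size, learning rate, and epoch count are illustrative defaults, not values from the slides:

    import numpy as np

    def minibatch_sgd(grad, data, theta0, learning_rate=0.1, batch_size=32, epochs=5, seed=0):
        # grad(theta, batch): gradient of the loss on a small batch of (x, y) pairs
        rng = np.random.default_rng(seed)
        theta = np.array(theta0, dtype=float)
        for _ in range(epochs):
            order = rng.permutation(len(data))
            for start in range(0, len(data), batch_size):
                batch = [data[i] for i in order[start:start + batch_size]]
                theta = theta - learning_rate * grad(theta, batch)
        return theta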

SLIDE 55

Generalized Gradient Descent

SLIDE 56

Summary of Linear Classification

Classifier            Pros                                   Cons
Naive Bayes           Simple, probabilistic, fast,           Not very accurate
                      closed-form solution
Perceptron            Simple, accurate                       Not probabilistic, may overfit
Logistic Regression   Error-driven learning, regularized     More difficult to implement
