slide-1
SLIDE 1

Text Classification

CMPT 413/825: Natural Language Processing

Fall 2020, 2020-09-18

SFU NatLangLab

Adapted from slides from Anoop Sarkar, Danqi Chen and Karthik Narasimhan

1

slide-2
SLIDE 2

Announcements

  • Remaining lectures on language modeling (LM) on Canvas
  • Initial grades for the HW0 programming section released
  • You have until 11:59 Friday to resubmit / address any comments in the feedback
  • We aim to have final grades for HW0 out next week
  • For those who do not have a group (or are in single-student groups), we have created a Piazza group through which you can contact each other.

2

slide-3
SLIDE 3

Why classify?

  • Authorship attribution
  • Language detection
  • News categorization
  • Spam detection
  • Sentiment analysis

3

slide-4
SLIDE 4

Other Examples

4

  • Prepositional phrase attachment
  • Intent detection

slide-5
SLIDE 5

Classification: The Task

  • Inputs:
  • A document d
  • A set of classes C = {c1, c2, c3, … , cm}
  • Output:
  • Predicted class c for document d

Examples:

    “Movie was terrible” → Classify → Negative
    “Amazing acting” → Classify → Positive

5

slide-6
SLIDE 6

Rule-based classification

  • Combinations of features on words in the document and on meta-data



 IF there exists word w in document d such that w in [good, great, extra-ordinary, …], 
 THEN output Positive 
 
 IF email address ends in [ithelpdesk.com, makemoney.com, spinthewheel.com, …]
 THEN output SPAM

  • Simple, can be very accurate
  • But: rules may be hard to define (and some even unknown to us!)
  • Expensive
  • Not easily generalizable

6
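
As a rough illustration, such hand-written rules are just a few lines of code. A minimal sketch in Python; the cue words and sender domains are only the examples from the slide, not a real rule set:

    POSITIVE_WORDS = {"good", "great", "extraordinary"}          # illustrative cue words
    SPAM_DOMAINS = ("ithelpdesk.com", "makemoney.com", "spinthewheel.com")

    def classify_review(document: str) -> str:
        # IF there exists a word w in d with w in POSITIVE_WORDS THEN output Positive
        words = set(document.lower().split())
        return "Positive" if words & POSITIVE_WORDS else "Unknown"

    def classify_email(sender_address: str) -> str:
        # IF the sender address ends in a known spam domain THEN output SPAM
        return "SPAM" if sender_address.lower().endswith(SPAM_DOMAINS) else "OK"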

slide-7
SLIDE 7

Supervised Learning: Let’s use statistics!

  • Data-driven approach

Let the machine figure out the best patterns to use!

  • Inputs:
  • Set of m classes C = {c1, c2, …, cm}
  • Set of n ‘labeled’ documents: {(d1, c1), (d2, c2), …, (dn, cn)}
  • Output: Trained classifier, F : d → c
  • What form should F take?
  • How to learn F?

7

slide-8
SLIDE 8

Recall: general guidelines for model building

Two steps to building a probability model:

  • 1. Define the model
  • What independence assumptions do we make?
  • What are the model parameters (probability values)?
  • 2. Estimate the model parameters (training/learning)

  • What form should F take?
  • How to learn F?

8
slide-9
SLIDE 9

Types of supervised classifiers

  • Naive Bayes
  • Logistic regression
  • Support vector machines
  • k-nearest neighbors

9

slide-10
SLIDE 10

Naive Bayes Classifier: General setting

  • Let the input x be represented as features fj, 1 ≤ j ≤ r
  • Let y be the output classification
  • We can have a simple classification model using Bayes rule:

    P(y|x) = P(y) ⋅ P(x|y) / P(x)        (posterior ∝ prior ⋅ likelihood)

  • Make strong (naive) conditional independence assumptions:

    P(x|y) = ∏_{j=1..r} P(fj|y)    so that    P(y|x) ∝ P(y) ⋅ ∏_{j=1..r} P(fj|y)

10

slide-11
SLIDE 11

Naive Bayes classifier for text classification

  • For text classification: the input x is a document d = (w1, …, wk)
  • Use as our features the words wj, 1 ≤ j ≤ |V|, where V is our vocabulary
  • c is the output classification
  • Predicting the best class, the maximum a posteriori (MAP) estimate:

    cMAP = arg max_{c∈C} P(c|d) = arg max_{c∈C} P(c)P(d|c) / P(d) = arg max_{c∈C} P(c)P(d|c)

  • P(c): prior probability of class c
  • P(d|c): conditional probability of generating document d from class c

11

slide-12
SLIDE 12

How to represent P(d | c)?

  • Option 1: represent the entire sequence of words, P(w1, w2, w3, …, wk|c)
  • (too many sequences!)
  • Option 2: Bag of words
  • Assume the position of each word is irrelevant (both absolute and relative)
  • Probability of each word is conditionally independent given the class c:

    P(w1, w2, w3, …, wk|c) = P(w1|c)P(w2|c)…P(wk|c)

12
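
A bag-of-words representation is just a word-count table. A minimal sketch in Python, assuming a naive lowercase/whitespace tokenizer purely for illustration:

    from collections import Counter

    def bag_of_words(document: str) -> Counter:
        # Lowercase, split on whitespace, and count occurrences;
        # word positions (absolute and relative) are thrown away.
        return Counter(document.lower().split())

    print(bag_of_words("Movie was terrible terrible"))
    # e.g. Counter({'terrible': 2, 'movie': 1, 'was': 1})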

slide-13
SLIDE 13

Bag of words

Example document:

    “I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet!”

Bag-of-words counts (word: count):

    it: 6, I: 5, the: 4, to: 3, and: 3, seen: 2, yet: 1, would: 1, whimsical: 1, times: 1, sweet: 1, satirical: 1, adventure: 1, genre: 1, fairy: 1, humor: 1, have: 1, great: 1, …

13

slide-14
SLIDE 14

Predicting with Naive Bayes

  • Note that k is the number of tokens (words) in the document, and the index i is the position of a token.
  • Once we assume that the position of each word is irrelevant and that the words are conditionally independent given the class c, we have:

    P(d|c) = P(w1, w2, w3, …, wk|c) = P(w1|c)P(w2|c)…P(wk|c)

  • The maximum a posteriori (MAP) estimate is now:

    cMAP = arg max_{c∈C} P(c)P(d|c) = arg max_{c∈C} P̂(c) ∏_{i=1..k} P̂(wi|c)

  • P̂ is used to indicate an estimated probability

14
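
A minimal sketch of this prediction step in Python, assuming the estimates P̂(c) and P̂(w|c) have already been computed and are stored in plain dictionaries (the variable names are just illustrative):

    def predict(doc_tokens, prior, likelihood, classes):
        # Return arg max over c of P(c) * prod_i P(w_i|c).
        # prior[c] holds the estimated P̂(c); likelihood[c][w] holds P̂(w|c).
        best_class, best_score = None, -1.0
        for c in classes:
            score = prior[c]
            for w in doc_tokens:
                # Unsmoothed estimate: a zero here wipes out the whole product
                # (the data-sparsity problem discussed on a later slide).
                score *= likelihood[c].get(w, 0.0)
            if score > best_score:
                best_class, best_score = c, score
        return best_class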

slide-15
SLIDE 15

Naive Bayes as a generative model

Generate the entire data set one document at a time

15

slide-16
SLIDE 16

Naive Bayes as a generative model

Sample a category

16

d1 c = Science

P(c)

slide-17
SLIDE 17

Naive Bayes as a generative model

Sample words

17

d1 c = Science w1 = Scientists

P(c)

P(w1|c)

slide-18
SLIDE 18

Naive Bayes as a generative model

18

d1: c = Science, w1 = Scientists, w2 = have, w3 = discovered

P(c)

P(w1|c) P(w2|c) P(w3|c)

Generate the entire data set one document at a time

slide-19
SLIDE 19

Naive Bayes as a generative model

Generate the entire data set one document at a time

19

d1: c = Science, w1 = Scientists, w2 = have, w3 = discovered
d2: c = Environment, w1 = Global, w2 = warming, w3 = has
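
A hedged sketch of this generative story in Python, sampling a class from P(c) and then words from P(w|c); the two toy classes and their word distributions are made up for illustration:

    import random

    # Toy class prior P(c) and per-class word distributions P(w|c), purely illustrative.
    p_class = {"Science": 0.5, "Environment": 0.5}
    p_word = {
        "Science":     {"Scientists": 0.4, "have": 0.3, "discovered": 0.3},
        "Environment": {"Global": 0.4, "warming": 0.3, "has": 0.3},
    }

    def generate_document(length=3):
        # Sample a category c ~ P(c), then each word w_i ~ P(w|c) independently.
        c = random.choices(list(p_class), weights=list(p_class.values()))[0]
        words = random.choices(list(p_word[c]), weights=list(p_word[c].values()), k=length)
        return c, " ".join(words)

    print(generate_document())   # e.g. ('Science', 'Scientists have discovered')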

slide-20
SLIDE 20

Estimating probabilities

  • Maximum likelihood estimates from counts in the training data:

    P̂(c) = Nc / N    (fraction of the N training documents that have class c)
    P̂(wi|c) = count(wi, c) / count(c)    (count(c) = total word tokens in documents of class c)

20

slide-21
SLIDE 21

Data sparsity

  • Given a review document, d = “…. most amazing movie ever …”
  • What about when count(‘amazing’, positive) = 0?
  • Implies P(‘amazing’ | positive) = 0

    cMAP = arg max_{c∈C} P̂(c) ∏_{i=1..k} P̂(wi|c) = arg max_{c∈C} P̂(c) ⋅ 0 = arg max_{c∈C} 0

    Can’t determine the best c!

21

slide-22
SLIDE 22

Solution: Smoothing!

Laplace (add-1) smoothing:

    P̂(w|c) = (count(w, c) + 1) / (count(c) + |V|)

  • Simple, easy to use
  • Effective in practice

22
slide-23
SLIDE 23

Overall process

  • Input: Set of n annotated documents {(di, ci)}, i = 1, …, n
  • A. Compute vocabulary V of all words
  • B. Calculate the class priors P̂(c)
  • C. Calculate the (smoothed) word likelihoods P̂(w|c)
  • D. (Prediction) Given document d = (w1, w2, …, wk), output arg max_{c∈C} P̂(c) ∏_{i=1..k} P̂(wi|c)

Variants (the name is based on the distribution assumed for the features, P(fi|y) → P(wi|c)):
  • Multinomial Naive Bayes: normal counts (0, 1, 2, …) for each document
  • Binary (Multinomial) NB / Bernoulli NB: binarized counts (0/1) for each document

23
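
A compact sketch of steps A–C for the multinomial variant with add-1 smoothing (a reading of the slides, not the course's reference implementation):

    from collections import Counter, defaultdict

    def train_nb(docs):
        # docs: list of (token_list, class) pairs.
        class_counts = Counter(c for _, c in docs)
        vocab = {w for tokens, _ in docs for w in tokens}               # A. vocabulary V
        prior = {c: n / len(docs) for c, n in class_counts.items()}     # B. P̂(c) = Nc / N
        word_counts = defaultdict(Counter)
        for tokens, c in docs:
            word_counts[c].update(tokens)
        likelihood = {}                                                 # C. P̂(w|c), add-1 smoothed
        for c in class_counts:
            total = sum(word_counts[c].values())
            likelihood[c] = {w: (word_counts[c][w] + 1) / (total + len(vocab)) for w in vocab}
        return vocab, prior, likelihood

Step D is then the prediction sketch shown after Slide 14; on the Slide 24 training data this reproduces, for example, P̂(Chinese|c) = 6/14 = 3/7 and P̂(Chinese|j) = 2/9.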

slide-24
SLIDE 24

Naive Bayes Example

Smoothing with α = 1:

    P̂(w|c) = (count(w, c) + 1) / (count(c) + |V|)        P̂(c) = Nc / N

Training and test data:

    Doc   Words                                 Class
    1     Chinese Beijing Chinese               c
    2     Chinese Chinese Shanghai              c
    3     Chinese Macao                         c
    4     Tokyo Japan Chinese                   j
    Test
    5     Chinese Chinese Chinese Tokyo Japan   ?

Priors:

    P(c) = 3/4        P(j) = 1/4

Conditional probabilities:

    P(Chinese|c) = (5+1) / (8+6) = 6/14 = 3/7
    P(Tokyo|c)   = (0+1) / (8+6) = 1/14
    P(Japan|c)   = (0+1) / (8+6) = 1/14
    P(Chinese|j) = (1+1) / (3+6) = 2/9
    P(Tokyo|j)   = (1+1) / (3+6) = 2/9
    P(Japan|j)   = (1+1) / (3+6) = 2/9

Choosing a class:

    P(c|d5) ∝ 3/4 ⋅ (3/7)^3 ⋅ 1/14 ⋅ 1/14 ≈ 0.0003
    P(j|d5) ∝ 1/4 ⋅ (2/9)^3 ⋅ 2/9 ⋅ 2/9 ≈ 0.0001

24
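
The numbers above can be checked directly; a minimal sketch that recomputes the two un-normalized posteriors from the table:

    # Priors 3/4 and 1/4; add-1 smoothed likelihoods from the table above.
    p_c = 3/4 * (3/7)**3 * (1/14) * (1/14)   # P(c) * P(Chinese|c)^3 * P(Tokyo|c) * P(Japan|c)
    p_j = 1/4 * (2/9)**3 * (2/9) * (2/9)     # P(j) * P(Chinese|j)^3 * P(Tokyo|j) * P(Japan|j)
    print(round(p_c, 4), round(p_j, 4))      # 0.0003 0.0001 -> class c wins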

slide-25
SLIDE 25

Some details

25

  • Vocabulary is important
  • Tokenization matters: it can affect your vocabulary
  • Tokenization = how you break your sentence up into tokens / words
  • Make sure you are consistent with your tokenization!
  • Special multi-word tokens: NOT_happy
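
One common way to produce such tokens (a hedged sketch, not necessarily how the course assignments expect it): prefix every word between a negation cue and the next punctuation mark with NOT_.

    import re

    NEGATION_CUES = {"not", "no", "never"}   # illustrative list of negation words

    def add_not_tokens(text):
        # Prefix words after a negation cue with NOT_ until the next punctuation mark.
        tokens, negate = [], False
        for tok in re.findall(r"[\w']+|[.,!?;]", text.lower()):
            if tok in ".,!?;":
                negate = False
                continue
            tokens.append("NOT_" + tok if negate else tok)
            if tok in NEGATION_CUES or tok.endswith("n't"):
                negate = True
        return tokens

    print(add_not_tokens("I am not happy with it. It was fine."))
    # ['i', 'am', 'not', 'NOT_happy', 'NOT_with', 'NOT_it', 'it', 'was', 'fine']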
slide-26
SLIDE 26

Some details

26

  • Vocabulary is important
  • Tokenization matters: it can affect your vocabulary
  • Tokenization = how you break your sentence up into tokens / words
  • Make sure you are consistent with your tokenization!
  • Handling unknown words in the test data that are not in your training vocabulary?
  • Remove them from your test document! Just ignore them.
  • Handling stop words (common words like a, the that may not be useful)?
  • Remove them from the training data!
  • Better to use:
  • Modified counts (tf-idf) that down-weight frequent, unimportant words
  • Better models!
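
A small sketch of the two filters above; the stop-word list is illustrative only, not a standard list:

    STOP_WORDS = {"a", "an", "the", "is", "to"}   # illustrative only

    def remove_stop_words(tokens):
        # Applied to training documents.
        return [w for w in tokens if w not in STOP_WORDS]

    def drop_unknown_words(tokens, vocab):
        # Applied to test documents: ignore words never seen in training.
        return [w for w in tokens if w in vocab]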
slide-27
SLIDE 27

Features

  • In general, Naive Bayes can use any set of features, not just words
  • URLs, email addresses, Capitalization, …
  • Domain knowledge can be crucial to performance

Top features for Spam detection

27

slide-28
SLIDE 28

Naive Bayes and Language Models

  • If features = bag of words, NB gives a per-class unigram language model!
  • For class c, assigning each word: P(w|c)
  • Assigning a sentence: P(s|c) = ∏_{w∈s} P(w|c)

Example with positive and negative sentiments: P(s|pos) = 0.0000005

28

slide-29
SLIDE 29

Naive Bayes as a language model

  • Which class assigns the higher probability to s = “I love this fun film”?

    word     Model pos   Model neg
    I          0.1         0.2
    love       0.1         0.001
    this       0.01        0.01
    fun        0.05        0.005
    film       0.1         0.1

    P(s|pos) = 0.1 ⋅ 0.1 ⋅ 0.01 ⋅ 0.05 ⋅ 0.1 = 0.0000005
    P(s|neg) = 0.2 ⋅ 0.001 ⋅ 0.01 ⋅ 0.005 ⋅ 0.1 = 0.000000001

    P(s|pos) > P(s|neg)

29
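
A small sketch that scores the sentence under each class's unigram model using the table above:

    pos = {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}
    neg = {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}

    def sentence_prob(sentence, model):
        # P(s|c) = product over words in s of P(w|c)
        p = 1.0
        for w in sentence.split():
            p *= model[w]
        return p

    s = "I love this fun film"
    print(sentence_prob(s, pos), sentence_prob(s, neg))   # about 5e-07 vs 1e-09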

slide-30
SLIDE 30

Advantages of Naive Bayes

  • Very fast, low storage requirements
  • Robust to irrelevant features: irrelevant features cancel each other out without affecting results
  • Very good in domains with many equally important features: decision trees suffer from fragmentation in such cases, especially with little data
  • Optimal if the independence assumptions hold: if the assumed independence is correct, then it is the Bayes optimal classifier for the problem
  • A good, dependable baseline for text classification
  • But we will see other classifiers that give better accuracy

30

slide-31
SLIDE 31

Failings of Naive Bayes (1)

  • Independence assumptions are too strong


  • XOR problem: Naive Bayes cannot learn the correct decision boundary
  • Both variables are jointly required to predict the class

Independence assumption broken!

31

slide-32
SLIDE 32

Failings of Naive Bayes (2)

  • Class imbalance:
  • One or more classes have more instances than others
  • Data skew causes NB to prefer one class over the other
  • 100 documents with class=MA and “Boston” occurring once each
  • 10 documents with class=BC and “Vancouver” occurring once each
  • New document d: “Boston Boston Vancouver Vancouver Vancouver”

    P(class = MA | d) > P(class = BC | d)

  • Does not handle rare classes well
  • Okay if the test distribution follows the training distribution and you don’t care about the rare classes
  • Low macro-average metrics
  • Re-weight classes if needed

32

slide-33
SLIDE 33

When to use Naive Bayes

  • Small data sizes:
  • Naive Bayes is great! (high bias)
  • Rule-based classifiers might work well too
  • Medium size datasets:
  • More advanced classifiers might perform better (e.g. SVM, logistic regression)
  • Large datasets:
  • Naive Bayes becomes competitive again (although most classifiers work well)

33

slide-34
SLIDE 34

Practical text classification

  • Domain knowledge is crucial to selecting good features
  • Handle class imbalance by re-weighting classes
  • Use log scale operations instead of multiplying probabilities
  • Since log(xy) = log(x) + log(y), it is better to sum logs of probabilities than to multiply the probabilities:

    cMAP = arg max_{cj∈C} [ log P(cj) + Σ_{i=1..k} log P(xi|cj) ]

  • The class with the highest un-normalized log probability score is still the most probable
  • The model is now just a max of a sum of weights

34
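
A minimal log-space variant of the prediction step (assuming log_prior and log_likelihood hold the logarithms of the earlier P̂(c) and P̂(w|c) estimates; unknown words are simply skipped):

    def predict_log(tokens, log_prior, log_likelihood):
        # arg max over c of log P(c) + sum_i log P(w_i|c).
        # Summing logs avoids floating-point underflow from multiplying many small
        # probabilities; the class with the highest un-normalized score is unchanged.
        scores = {c: log_prior[c] + sum(lp[w] for w in tokens if w in lp)
                  for c, lp in log_likelihood.items()}
        return max(scores, key=scores.get)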

slide-35
SLIDE 35

35