Text Classification
Fall 2020 2020-09-18 CMPT 413/825: Natural Language Processing
SFU NatLangLab
Adapted from slides from Anoop Sarkar, Danqi Chen and Karthik Narasimhan
Announcements
- Remaining lectures on language modeling (LM)
- Based on the feedback, a Piazza group has been created through which you can contact each other.
"Movie was terrible" → Classify → Negative
"Amazing acting" → Classify → Positive
IF there exists a word w in document d such that w ∈ {good, great, extra-ordinary, …}, THEN output Positive.
IF the email address ends in {ithelpdesk.com, makemoney.com, spinthewheel.com, …}, THEN output SPAM.
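A minimal sketch of the two hand-written rules above; the word lists are from the slide, while the function names and the fallback labels are illustrative assumptions:

```python
# Hand-written rule-based classifiers (a sketch; names are illustrative).
POSITIVE_WORDS = {"good", "great", "extra-ordinary"}
SPAM_DOMAINS = ("ithelpdesk.com", "makemoney.com", "spinthewheel.com")

def classify_sentiment(document: str) -> str:
    # IF any word w in d is in the positive list, THEN output Positive
    words = document.lower().split()
    return "Positive" if any(w in POSITIVE_WORDS for w in words) else "Unknown"

def classify_email(sender: str) -> str:
    # IF the address ends in a known spam domain, THEN output SPAM
    return "SPAM" if sender.endswith(SPAM_DOMAINS) else "OK"
```

Rules like these are brittle: every new pattern must be added by hand, which motivates the learning approach below.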
Let the machine figure out the best patterns to use!
The set of possible classes: C = {c_1, …, c_r}
Bayes rule: P(c|d) = P(c) P(d|c) / P(d)
(posterior ∝ prior × likelihood)
c_MAP = argmax_{c∈C} P(c|d) = argmax_{c∈C} P(c) P(d|c)

This is the maximum a posteriori (MAP) estimate; the denominator P(d) is the same for every class, so it can be dropped from the argmax.

Naive Bayes assumptions: the document is a bag of words — only word counts matter (both absolute and relative), not positions — and the words are conditionally independent given the class.
I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet!

[Bag-of-words word counts for the review: it 6, I 5, the 4, to 3, and 3, seen 2, then yet, would, whimsical, times, sweet, satirical, adventure, genre, fairy, humor, great, … with count 1]
Note that k is the number of tokens (words) in the document, and the index i is the position of the token.

c_MAP = argmax_{c∈C} P(c) P(d|c) = argmax_{c∈C} P(c) ∏_{i=1}^{k} P(x_i|c)

P̂ is used to indicate an estimated probability.
Generative story (Naive Bayes as a generative model): first pick a class c, then draw each word from P(w|c).

d1: c = Science; w1 = Scientists, w2 = have, w3 = discovered — generated with probabilities P(w1|c) P(w2|c) P(w3|c)
d2: c = Environment; w1 = Global, w2 = warming, w3 = has
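The generative story can be sketched as a tiny sampler: pick a class from the prior, then draw each word independently from that class's word distribution. All the probability values below are made up for illustration.

```python
import random

# Toy generative story: class from prior, then words i.i.d. from P(w|c).
# The priors and word distributions here are illustrative, not estimated.
random.seed(0)

prior = {"Science": 0.5, "Environment": 0.5}
word_dist = {
    "Science":     {"Scientists": 0.4, "have": 0.3, "discovered": 0.3},
    "Environment": {"Global": 0.4, "warming": 0.3, "has": 0.3},
}

def generate_doc(length=3):
    # Draw the class, then draw `length` words conditioned on it.
    c = random.choices(list(prior), weights=list(prior.values()))[0]
    words = random.choices(list(word_dist[c]),
                           weights=list(word_dist[c].values()), k=length)
    return c, words
```

Training a Naive Bayes classifier is just estimating these two tables (the prior and the per-class word distributions) from counts.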
c_NB = argmax_{c∈C} P(c) ∏_{i=1}^{k} P̂(x_i|c)

Problem with maximum likelihood estimates: if a word never occurs with a class in the training data, then P̂(x_i|c) = 0, and the whole product for that class becomes 0.
Worked example with add-one smoothing:

          Doc  Words                                 Class
Training  1    Chinese Beijing Chinese               c
          2    Chinese Chinese Shanghai              c
          3    Chinese Macao                         c
          4    Tokyo Japan Chinese                   j
Test      5    Chinese Chinese Chinese Tokyo Japan   ?

Estimates: P̂(w|c) = (count(w,c) + 1) / (count(c) + |V|),  P̂(c) = N_c / N

Priors: P(c) = 3/4, P(j) = 1/4

Conditional probabilities:
  P(Chinese|c) = (5+1)/(8+6) = 6/14 = 3/7
  P(Tokyo|c)   = (0+1)/(8+6) = 1/14
  P(Japan|c)   = (0+1)/(8+6) = 1/14
  P(Chinese|j) = (1+1)/(3+6) = 2/9
  P(Tokyo|j)   = (1+1)/(3+6) = 2/9
  P(Japan|j)   = (1+1)/(3+6) = 2/9

Choosing a class:
  P(c|d5) ∝ 3/4 × (3/7)^3 × 1/14 × 1/14 ≈ 0.0003
  P(j|d5) ∝ 1/4 × (2/9)^3 × 2/9 × 2/9   ≈ 0.0001

so the classifier predicts class c.
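The worked example above can be reproduced in a few lines; this sketch mirrors the add-one smoothing estimates exactly (variable names are my own), using exact fractions so the numbers match the slide:

```python
from collections import Counter
from fractions import Fraction
from math import prod

# Training data and test document from the worked example.
train = [("Chinese Beijing Chinese", "c"),
         ("Chinese Chinese Shanghai", "c"),
         ("Chinese Macao", "c"),
         ("Tokyo Japan Chinese", "j")]
test_doc = "Chinese Chinese Chinese Tokyo Japan".split()

vocab = {w for doc, _ in train for w in doc.split()}   # |V| = 6
counts = {c: Counter() for c in ("c", "j")}
for doc, c in train:
    counts[c].update(doc.split())

def p_word(w, c):
    # P̂(w|c) = (count(w,c) + 1) / (count(c) + |V|)  [add-one smoothing]
    return Fraction(counts[c][w] + 1, sum(counts[c].values()) + len(vocab))

prior = {"c": Fraction(3, 4), "j": Fraction(1, 4)}     # P̂(c) = N_c / N

# P(class|d5) ∝ P(class) ∏_i P̂(w_i|class)
score = {c: prior[c] * prod(p_word(w, c) for w in test_doc)
         for c in ("c", "j")}
```

Here score["c"] ≈ 0.0003 and score["j"] ≈ 0.0001, matching the slide's numbers, so class c wins.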
Smoothing with α = 1 (add-one / Laplace smoothing)
Better to use features that are informative for the class: frequent, unimportant words (e.g. stop words) carry little signal. Example: top features for spam detection.
Example with positive and negative sentiments — score each class with P(c) ∏_{w∈s} P(w|c):

word   P(w|+)   P(w|−)
I      0.1      0.2
love   0.1      0.001
this   0.01     0.01
fun    0.05     0.005
film   0.1      0.1
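A quick check of the two class scores using the table above. The sentence "I love this fun film" and the equal class priors are assumptions for illustration; the slide gives only the per-word probabilities:

```python
from math import prod

# Per-word probabilities from the table above.
p_pos = {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}
p_neg = {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}

s = "I love this fun film".split()
score_pos = prod(p_pos[w] for w in s)   # ∏_{w∈s} P(w|+)
score_neg = prod(p_neg[w] for w in s)   # ∏_{w∈s} P(w|−)
```

score_pos ≈ 5e-7 and score_neg ≈ 1e-9, so with equal priors the positive class wins (driven almost entirely by P(love|+) vs P(love|−)).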
Naive Bayes does not handle rare classes well. This is acceptable if the class distribution at test time matches training and you don't care about the rare classes; re-weight classes if needed.
Since log(xy) = log(x) + log(y):

c_MAP = argmax_{c_j∈C} [ log P(c_j) + ∑_{i=1}^{k} log P(x_i|c_j) ]
It is better to sum logs of probabilities than to multiply the probabilities themselves: the product of many small numbers underflows floating-point arithmetic, while the log-space sum stays well-scaled, and the argmax is unchanged because log is monotonic.
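The underflow problem is easy to demonstrate; the document length and token probability below are arbitrary assumptions chosen to trigger it:

```python
import math

# 100 tokens, each with probability 1e-5 (illustrative values).
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p              # 1e-500 underflows to 0.0 in a float64

log_score = sum(math.log(p) for p in probs)   # finite: 100 * log(1e-5)
```

The naive product collapses to exactly 0.0, which would make every class score 0 and the argmax meaningless; the log-space score remains a perfectly ordinary finite number.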