SLIDE 1
SFU NatLangLab
Natural Language Processing
Angel Xuan Chang
angelxuanchang.github.io/nlp-class
adapted from lecture slides by Anoop Sarkar, Simon Fraser University
2020-01-23
SLIDE 2
SLIDE 3
◮ Classification tasks in NLP
◮ Naive Bayes Classifier
◮ Log-linear models
SLIDE 4
Sentiment classification: Movie reviews
◮ neg: unbelievably disappointing
◮ pos: Full of zany characters and richly applied satire, and some great plot twists
◮ pos: this is the greatest screwball comedy ever filmed
◮ neg: It was pathetic. The worst part about it was the boxing scenes.
SLIDE 5
Intent Detection
◮ ADDR CHANGE: I just moved and want to change my address.
◮ ADDR CHANGE: Please help me update my address.
◮ FILE CLAIM: I just got into a terrible accident and I want to file a claim.
◮ CLOSE ACCOUNT: I’m moving and I want to disconnect my service.
SLIDE 6
Prepositional Phrases
◮ noun attach: I bought the shirt with pockets
◮ verb attach: I bought the shirt with my credit card
◮ noun attach: I washed the shirt with mud
◮ verb attach: I washed the shirt with soap
◮ Attachment depends on the meaning of the entire sentence; it needs world knowledge, etc.
◮ Maybe there is a simpler solution: we can attempt to solve it using heuristics or associations between words
SLIDE 7
Ambiguity Resolution: Prepositional Phrases in English
◮ Learning Prepositional Phrase Attachment: Annotated Data

v          n1           p     n2        Attachment
join       board        as    director  V
is         chairman     of    N.V.      N
using      crocidolite  in    filters   V
bring      attention    to    problem   V
is         asbestos     in    products  N
making     paper        for   filters   N
including  three        with  cancer    N
. . .      . . .        . . . . . .     . . .
SLIDE 8
Prepositional Phrase Attachment
Method                             Accuracy
Always noun attachment             59.0
Most likely for each preposition   72.2
Average human (4 head words only)  88.2
Average human (whole sentence)     93.2
SLIDE 9
Back-off Smoothing
◮ Random variable a represents the attachment: a = n1 (noun attach) or a = v (verb attach), a two-class classification
◮ We want to compute the probability of noun attachment: p(a = n1 | v, n1, p, n2)
◮ The probability of verb attachment is 1 − p(a = n1 | v, n1, p, n2)
SLIDE 10
Back-off Smoothing
1. If f(v, n1, p, n2) > 0 and p̂ ≠ 0.5:
   p̂(a = n1 | v, n1, p, n2) = f(a = n1, v, n1, p, n2) / f(v, n1, p, n2)
2. Else if f(v, n1, p) + f(v, p, n2) + f(n1, p, n2) > 0 and p̂ ≠ 0.5:
   p̂(a = n1 | v, n1, p, n2) = [f(a = n1, v, n1, p) + f(a = n1, v, p, n2) + f(a = n1, n1, p, n2)] / [f(v, n1, p) + f(v, p, n2) + f(n1, p, n2)]
3. Else if f(v, p) + f(n1, p) + f(p, n2) > 0:
   p̂(a = n1 | v, n1, p, n2) = [f(a = n1, v, p) + f(a = n1, n1, p) + f(a = n1, p, n2)] / [f(v, p) + f(n1, p) + f(p, n2)]
4. Else if f(p) > 0 (choose the attachment based on the preposition alone):
   p̂(a = n1 | v, n1, p, n2) = f(a = n1, p) / f(p)
5. Else: p̂(a = n1 | v, n1, p, n2) = 1.0 (default to noun attachment)
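As a concrete sketch, here is the back-off procedure above in Python, assuming the counts f(·) have been collected into Counter dictionaries keyed by word tuples (all function and variable names are illustrative, not from the slides):

from collections import Counter

def train_pp_counts(examples):
    # examples: iterable of (v, n1, p, n2, attach) tuples, attach in {'N', 'V'}
    f = Counter()    # f(context): count of each back-off context
    f_n = Counter()  # f(a = n1, context): same contexts, noun attachments only
    for v, n1, p, n2, attach in examples:
        contexts = [(v, n1, p, n2),
                    (v, n1, p), (v, p, n2), (n1, p, n2),
                    (v, p), (n1, p), (p, n2),
                    (p,)]
        for c in contexts:
            f[c] += 1
            if attach == 'N':
                f_n[c] += 1
    return f, f_n

def p_noun_attach(f, f_n, v, n1, p, n2):
    # Back-off levels 1-4: the quadruple, then triples, then pairs, then p alone.
    levels = [[(v, n1, p, n2)],
              [(v, n1, p), (v, p, n2), (n1, p, n2)],
              [(v, p), (n1, p), (p, n2)],
              [(p,)]]
    for depth, contexts in enumerate(levels):
        denom = sum(f[c] for c in contexts)
        if denom == 0:
            continue                  # no counts at this level: back off
        p_hat = sum(f_n[c] for c in contexts) / denom
        if depth < 2 and p_hat == 0.5:
            continue                  # tie at levels 1-2: back off further
        return p_hat
    return 1.0                        # level 5: default to noun attachment

Noun attachment would then be chosen whenever the returned probability is at least 0.5.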
SLIDE 11
Prepositional Phrase Attachment: Results
◮ Collins and Brooks (1995): 84.5% accuracy with the use of some limited word classes for dates, numbers, etc.
◮ Toutanova, Manning, and Ng (2004): sophisticated smoothing model for PP attachment; 86.18% with words and stems, 87.54% with word classes
◮ Merlo, Crocker, and Berthouzoz (1997): test on multiple PPs, generalizing disambiguation of 1 PP to 2-3 PPs; 1 PP: 84.3%, 2 PPs: 69.6%, 3 PPs: 43.6%
SLIDE 12
Natural Language Processing
Angel Xuan Chang
angelxuanchang.github.io/nlp-class
adapted from lecture slides by Anoop Sarkar, Danqi Chen, and Karthik Narasimhan
Simon Fraser University
January 23, 2020
Part 2: Probabilistic Classifiers
SLIDE 13
Classification Task
◮ Input:
  ◮ A document d
  ◮ A set of classes C = {c1, c2, . . . , cm}
◮ Output: Predicted class c for document d
◮ Example:
  ◮ neg: unbelievably disappointing
  ◮ pos: this is the greatest screwball comedy ever filmed
SLIDE 14
Supervised learning: Let’s use statistics!
◮ Inputs:
  ◮ Set of m classes C = {c1, c2, . . . , cm}
  ◮ Set of n labeled documents: {(d1, c1), (d2, c2), . . . , (dn, cn)}
◮ Output: Trained classifier F : d → c
◮ What form should F take?
◮ How to learn F?
SLIDE 15
Types of supervised classifiers
SLIDE 16
◮ Classification tasks in NLP
◮ Naive Bayes Classifier
◮ Log-linear models
SLIDE 17
Naive Bayes Classifier
◮ x is the input, represented as d independent features fj, 1 ≤ j ≤ d
◮ y is the output classification
◮ P(y | x) = P(y) · P(x | y) / P(x)   (Bayes rule)
◮ P(x | y) = ∏_{j=1}^{d} P(fj | y)   (naive independence assumption)
◮ P(y | x) ∝ P(y) · ∏_{j=1}^{d} P(fj | y)
◮ We can ignore P(x) in the above equation because it is a constant scaling factor for each y
SLIDE 18
Naive Bayes Classifier for text classification
◮ For text classification: input x = document d = (w1, . . . , wk)
◮ Use as our features the words wj, 1 ≤ j ≤ |V|, where V is our vocabulary
◮ c is the output classification
◮ Assume that the position of each word is irrelevant and that the words are conditionally independent given class c:
  P(w1, w2, . . . , wk | c) = P(w1 | c) P(w2 | c) · · · P(wk | c)
◮ Maximum a posteriori estimate:
  cMAP = arg max_c P(c) P(d | c) = arg max_c P̂(c) ∏_{i=1}^{k} P̂(wi | c)
SLIDE 19
Bag of words
SLIDE 20
Estimating probabilities
Maximum likelihood estimate:
  P̂(cj) = Count(cj) / n
  P̂(wi | cj) = Count(wi, cj) / Σ_{w∈V} Count(w, cj)
Smoothing (add-α):
  P̂(wi | cj) = (Count(wi, cj) + α) / Σ_{w∈V} (Count(w, cj) + α)
SLIDE 21
Overall process
Input: Set of labeled documents {(di, ci)}_{i=1}^{n}
◮ Compute vocabulary V of all words
◮ Calculate P̂(cj) = Count(cj) / n
◮ Calculate P̂(wi | cj) = (Count(wi, cj) + α) / Σ_{w∈V} (Count(w, cj) + α)
◮ Prediction: given document d = (w1, . . . , wk),
  cMAP = arg max_c P̂(c) ∏_{i=1}^{k} P̂(wi | c)
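A minimal end-to-end sketch of this process, assuming documents arrive pre-tokenized as lists of words (the class name, and the choice to skip unseen words at prediction time, are assumptions of the sketch); it computes the arg max in log space, as the summary slide later recommends:

import math
from collections import Counter, defaultdict

class NaiveBayes:
    def __init__(self, alpha=1.0):
        self.alpha = alpha                         # add-alpha smoothing constant

    def train(self, docs, labels):
        # docs: list of token lists; labels: parallel list of class labels.
        n = len(docs)
        self.vocab = {w for d in docs for w in d}  # vocabulary V
        class_counts = Counter(labels)
        word_counts = defaultdict(Counter)         # class -> word -> count
        for d, c in zip(docs, labels):
            word_counts[c].update(d)
        self.log_prior = {c: math.log(k / n) for c, k in class_counts.items()}
        self.log_lik = {}
        for c in class_counts:
            total = sum(word_counts[c].values()) + self.alpha * len(self.vocab)
            self.log_lik[c] = {w: math.log((word_counts[c][w] + self.alpha) / total)
                               for w in self.vocab}

    def predict(self, doc):
        # MAP class: argmax_c  log P(c) + sum_i log P(wi | c).
        return max(self.log_prior,
                   key=lambda c: self.log_prior[c] + sum(
                       self.log_lik[c][w] for w in doc if w in self.vocab))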
SLIDE 22
Naive Bayes Example
SLIDE 23
Tokenization
Tokenization matters - it can affect your vocabulary
◮ aren’t can become: aren’t | arent | are n’t | aren t, depending on the tokenizer (see the sketch below)
◮ Emails, URLs, phone numbers, dates, emoticons
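As a small illustration, each variant in the first bullet corresponds to a different tokenization rule (the specific regular expressions here are assumptions, not a prescribed tokenizer):

import re

text = "aren't"

print(text.split())                            # ["aren't"]: keep clitics attached
print(text.replace("'", "").split())           # ['arent']: delete apostrophes
print(re.sub(r"n't\b", " n't", text).split())  # ['are', "n't"]: Treebank-style split
print(re.findall(r"[a-z]+", text))             # ['aren', 't']: letters only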
SLIDE 24
Features
◮ Remember: Naive Bayes can use any set of features
◮ Capitalization, subword features (ends with -ing), etc.
◮ Domain knowledge is crucial for performance
Top features for spam detection
[Alqatawna et al, IJCNSS 2015]
SLIDE 25
Evaluation
◮ Table of predictions vs. true labels (binary classification)
◮ Ideally we want predictions that agree with the true labels (true positives and true negatives)
SLIDE 26
Evaluation Metrics
Accuracy = (TP + TN) / Total = 200 / 250 = 80%
SLIDE 28
Precision and Recall
◮ Precision = TP / (TP + FP): of the items predicted positive, the fraction that are truly positive
◮ Recall = TP / (TP + FN): of the truly positive items, the fraction that are predicted positive
SLIDE 29
Precision and Recall
(figure from Wikipedia)
SLIDE 30
F-Score
◮ Weighted harmonic mean of precision P and recall R: Fβ = (1 + β²) · P · R / (β² · P + R)
◮ F1 (β = 1): F1 = 2 · P · R / (P + R)
SLIDE 31
Choosing Beta
◮ β > 1 weights recall more heavily (e.g., F2); β < 1 weights precision more heavily (e.g., F0.5)
SLIDE 32
Aggregating scores
◮ We have Precision, Recall, F1 for each class
◮ How do we combine them into an overall score?
◮ Macro-average: compute the metric for each class, then average the per-class scores
◮ Micro-average: pool the predictions for all classes and evaluate jointly
SLIDE 33
Macro vs Micro average
◮ Macro-averaged precision: (0.5 + 0.9)/2 = 0.7
◮ Micro-averaged precision: 100/120 ≈ 0.83
◮ The micro-averaged score is dominated by the score on common classes
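The slide's numbers follow from per-class confusion counts. The sketch below uses counts chosen to reproduce them (class 1: TP=10, FP=10; class 2: TP=90, FP=10; these counts are an assumption consistent with the per-class precisions of 0.5 and 0.9):

# Per-class confusion counts (assumed, chosen to match the slide's numbers).
classes = {
    "c1": {"tp": 10, "fp": 10},   # precision 0.5
    "c2": {"tp": 90, "fp": 10},   # precision 0.9
}

def precision(tp, fp):
    return tp / (tp + fp)

# Macro-average: compute per class, then average the per-class scores.
macro = sum(precision(c["tp"], c["fp"]) for c in classes.values()) / len(classes)

# Micro-average: pool the counts across classes, then compute once.
tp = sum(c["tp"] for c in classes.values())
fp = sum(c["fp"] for c in classes.values())
micro = precision(tp, fp)

print(macro)  # 0.7
print(micro)  # 0.833... (dominated by the common class c2)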
SLIDE 34
Validation
◮ Choose a metric: Precision / Recall / F1
◮ Optimize for the metric on a Validation (aka Development) set
◮ Finally, evaluate on an ‘unseen’ Test set
◮ Cross-validation:
  ◮ Repeatedly sample several train-val splits
  ◮ Reduces bias due to sampling errors
SLIDE 35
Advantages of Naive Bayes
◮ Very fast, low storage requirements
◮ Robust to irrelevant features
◮ Very good in domains with many equally important features
◮ Optimal if the independence assumptions hold
◮ A good, dependable baseline for text classification
SLIDE 36
When to use Naive Bayes
◮ Small data sizes: Naive Bayes is great! Rule-based classifiers can work well too
◮ Medium-sized datasets: more advanced classifiers might perform better (SVM, logistic regression)
◮ Large datasets: Naive Bayes becomes competitive again (most learned classifiers will work well)
SLIDE 37
Failings of Naive Bayes (1)
Independence assumptions are too strong
◮ XOR problem: Naive Bayes cannot learn the decision boundary
◮ Both variables are jointly required to predict the class; the independence assumption is broken!
SLIDE 38
Failings of Naive Bayes (2)
Class Imbalance
◮ One or more classes have more instances than others
◮ Data skew causes NB to prefer one class over the other
SLIDE 39
Failings of Naive Bayes (3)
Weight magnitude errors
◮ Classes with larger weights are preferred
◮ 10 documents with class=MA and “Boston” occurring once each
◮ 10 documents with class=CA and “San Francisco” occurring once each
◮ New document d: “Boston Boston Boston San Francisco San Francisco”
  P(class = CA | d) > P(class = MA | d)
◮ “San Francisco” is two tokens, so each occurrence contributes twice to the CA score, giving the CA class the larger total weight
SLIDE 40
Naive Bayes Summary
◮ Domain knowledge is crucial to selecting good features
◮ Handle class imbalance by re-weighting the classes
◮ Use log-scale operations instead of multiplying probabilities:
  cNB = arg max_{cj ∈ C} [ log P(cj) + Σ_i log P(xi | cj) ]
◮ The model is now just a max over a sum of weights
SLIDE 41
◮ Classification tasks in NLP
◮ Naive Bayes Classifier
◮ Log-linear models
SLIDE 42
Log-linear model
◮ The model classifies input x into output labels y ∈ Y
◮ Let there be m features fk(x, y), for k = 1, . . . , m
◮ Define a parameter vector v ∈ R^m
◮ Each (x, y) pair is mapped to a score: s(x, y) = Σ_k vk · fk(x, y)
◮ Using inner-product notation: s(x, y) = v · f(x, y)
◮ To get a probability from the score, renormalize:
  Pr(y | x; v) = exp(s(x, y)) / Σ_{y′∈Y} exp(s(x, y′))
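A direct transcription of this renormalization into Python (names are illustrative; f(x, y) is assumed to return a dense list of feature values, one per weight):

import math

def prob(v, f, x, y, labels):
    # Pr(y | x; v) for a log-linear model.
    def score(y_):                    # s(x, y) = v . f(x, y)
        return sum(vk * fk for vk, fk in zip(v, f(x, y_)))
    scores = {y_: score(y_) for y_ in labels}
    # Subtract the max score before exponentiating, for numerical stability.
    m = max(scores.values())
    z = sum(math.exp(s - m) for s in scores.values())
    return math.exp(scores[y] - m) / z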
SLIDE 43
Log-linear model
◮ The name ‘log-linear model’ comes from:
  log Pr(y | x; v) = v · f(x, y) − log Σ_{y′} exp(v · f(x, y′))
                     (linear term)  (normalization term)
◮ Once the weights v are learned, we can perform predictions using these features
◮ The goal: find v that maximizes the log-likelihood L(v) of the labeled training set containing (xi, yi) for i = 1 . . . n:
  L(v) = Σ_i log Pr(yi | xi; v) = Σ_i v · f(xi, yi) − Σ_i log Σ_{y′} exp(v · f(xi, y′))
SLIDE 44
Log-linear model
◮ Maximize:
  L(v) = Σ_i v · f(xi, yi) − Σ_i log Σ_{y′} exp(v · f(xi, y′))
◮ Calculate the gradient:
  dL(v)/dv = Σ_i f(xi, yi) − Σ_i [1 / Σ_{y′′} exp(v · f(xi, y′′))] Σ_{y′} f(xi, y′) exp(v · f(xi, y′))
           = Σ_i f(xi, yi) − Σ_i Σ_{y′} f(xi, y′) · exp(v · f(xi, y′)) / Σ_{y′′} exp(v · f(xi, y′′))
           = Σ_i f(xi, yi) − Σ_i Σ_{y′} f(xi, y′) · Pr(y′ | xi; v)
             (observed counts)  (expected counts)
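The last line, observed counts minus expected counts, translates directly into code. A self-contained sketch (again assuming f(x, y) returns a dense list of feature values; all names are illustrative):

import math

def gradient(v, f, data, labels):
    # dL/dv = sum_i f(xi, yi) - sum_i sum_y' f(xi, y') Pr(y' | xi; v)
    grad = [0.0] * len(v)
    for x, y in data:
        # Observed counts: f(xi, yi).
        for k, fk in enumerate(f(x, y)):
            grad[k] += fk
        # Model distribution Pr(y' | xi; v), computed stably.
        scores = {yp: sum(vk * fk for vk, fk in zip(v, f(x, yp))) for yp in labels}
        mx = max(scores.values())
        z = sum(math.exp(s - mx) for s in scores.values())
        # Expected counts: f(xi, y') weighted by Pr(y' | xi; v).
        for yp in labels:
            p = math.exp(scores[yp] - mx) / z
            for k, fk in enumerate(f(x, yp)):
                grad[k] -= p * fk
    return grad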
SLIDE 45
Gradient ascent
◮ Init: v(0) = 0
◮ t ← 0
◮ Iterate until convergence:
  ◮ Calculate: Δ = dL(v)/dv evaluated at v = v(t)
  ◮ Find β∗ = arg max_β L(v(t) + βΔ)   (a line search along the gradient direction)
  ◮ Set v(t+1) ← v(t) + β∗Δ
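A sketch of this loop, with one simplification: the exact line search for β∗ is replaced by a fixed step size (grad_fn stands for any gradient routine, e.g. the gradient sketch above):

def gradient_ascent(grad_fn, dim, step=0.1, tol=1e-6, max_iters=1000):
    # Maximize L(v) by repeatedly stepping in the direction of its gradient.
    v = [0.0] * dim                               # v(0) = 0
    for _ in range(max_iters):
        delta = grad_fn(v)                        # dL/dv at v(t)
        v = [vk + step * dk for vk, dk in zip(v, delta)]
        if max(abs(dk) for dk in delta) < tol:    # convergence test
            break
    return v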
SLIDE 46
Learning the weights v: Generalized Iterative Scaling
f# = max_{x,y} Σ_j fj(x, y)   (the maximum possible total feature value; needed for scaling)
Initialize v(0)
For each iteration t:
  expected[j] ← 0 for j = 1 .. number of features
  For i = 1 to |training data|:
    For each label y and each feature fj:
      expected[j] += fj(xi, y) · P(y | xi; v(t)) / |training data|
  For each feature fj:
    observed[j] = Σ_{(x,y)} fj(x, y) · c(x, y) / |training data|   (c(x, y): count of (x, y) in the training data)
  For each feature fj:
    v(t+1)_j ← v(t)_j · (observed[j] / expected[j])^(1/f#)
cf. Goodman, NIPS ’01
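Read as code, one GIS round might look like the sketch below. Two caveats: the slide's multiplicative update applies to weights of the form μj = exp(vj), so on the log-space weights used here it becomes additive; and GIS formally assumes a correction feature so that feature totals always equal f#, which is omitted here. All names are illustrative:

import math

def gis_iteration(v, f, data, labels, f_sharp):
    # v: current log-space weights; f(x, y): dense feature list;
    # f_sharp: max over (x, y) of sum_j fj(x, y), as defined above.
    n = len(data)
    observed = [0.0] * len(v)
    expected = [0.0] * len(v)
    for x, y in data:
        for j, fj in enumerate(f(x, y)):
            observed[j] += fj / n                 # empirical feature counts
        scores = {yp: sum(vk * fk for vk, fk in zip(v, f(x, yp))) for yp in labels}
        mx = max(scores.values())
        z = sum(math.exp(s - mx) for s in scores.values())
        for yp in labels:
            p = math.exp(scores[yp] - mx) / z     # P(yp | x; v)
            for j, fj in enumerate(f(x, yp)):
                expected[j] += p * fj / n         # model's expected counts
    # v_j <- v_j + (1/f#) log(observed[j] / expected[j]);
    # features that never fire in the data are left unchanged.
    return [vj + math.log(o / e) / f_sharp if o > 0 else vj
            for vj, o, e in zip(v, observed, expected)]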
SLIDE 47