

SLIDE 1

SFU NatLangLab

Natural Language Processing

Angel Xuan Chang angelxuanchang.github.io/nlp-class adapted from lecture slides from Anoop Sarkar

Simon Fraser University

2020-01-23

SLIDE 2

Natural Language Processing

Angel Xuan Chang angelxuanchang.github.io/nlp-class adapted from lecture slides from Anoop Sarkar

Simon Fraser University

January 23, 2020 Part 1: Classification tasks in NLP

SLIDE 3

◮ Classification tasks in NLP
◮ Naive Bayes Classifier
◮ Log linear models

SLIDE 4

Sentiment classification: Movie reviews

◮ neg: unbelievably disappointing
◮ pos: Full of zany characters and richly applied satire, and some great plot twists
◮ pos: this is the greatest screwball comedy ever filmed
◮ neg: It was pathetic. The worst part about it was the boxing scenes.

SLIDE 5

Intent Detection

◮ ADDR CHANGE: I just moved and want to change my address.
◮ ADDR CHANGE: Please help me update my address.
◮ FILE CLAIM: I just got into a terrible accident and I want to file a claim.
◮ CLOSE ACCOUNT: I’m moving and I want to disconnect my service.

SLIDE 6

Prepositional Phrases

◮ noun attach: I bought the shirt with pockets
◮ verb attach: I bought the shirt with my credit card
◮ noun attach: I washed the shirt with mud
◮ verb attach: I washed the shirt with soap
◮ Attachment depends on the meaning of the entire sentence – needs world knowledge, etc.
◮ Maybe there is a simpler solution: we can attempt to solve it using heuristics or associations between words

SLIDE 7

Ambiguity Resolution: Prepositional Phrases in English

◮ Learning Prepositional Phrase Attachment: Annotated Data

v           n1           p      n2         Attachment
join        board        as     director   V
is          chairman     of     N.V.       N
using       crocidolite  in     filters    V
bring       attention    to     problem    V
is          asbestos     in     products   N
making      paper        for    filters    N
including   three        with   cancer     N
...         ...          ...    ...        ...

SLIDE 8

Prepositional Phrase Attachment

Method                                Accuracy
Always noun attachment                59.0
Most likely for each preposition      72.2
Average Human (4 head words only)     88.2
Average Human (whole sentence)        93.2

SLIDE 9

Back-off Smoothing

◮ Random variable a represents the attachment.
◮ a = n1 or a = v (two-class classification)
◮ We want to compute the probability of noun attachment: p(a = n1 | v, n1, p, n2).
◮ The probability of verb attachment is 1 − p(a = n1 | v, n1, p, n2).

SLIDE 10

Back-off Smoothing

  • 1. If f(v, n1, p, n2) > 0:
       p̂(a = n1 | v, n1, p, n2) = f(a = n1, v, n1, p, n2) / f(v, n1, p, n2)

  • 2. Else if f(v, n1, p) + f(v, p, n2) + f(n1, p, n2) > 0:
       p̂(a = n1 | v, n1, p, n2) = [f(a = n1, v, n1, p) + f(a = n1, v, p, n2) + f(a = n1, n1, p, n2)] / [f(v, n1, p) + f(v, p, n2) + f(n1, p, n2)]

  • 3. Else if f(v, p) + f(n1, p) + f(p, n2) > 0:
       p̂(a = n1 | v, n1, p, n2) = [f(a = n1, v, p) + f(a = n1, n1, p) + f(a = n1, p, n2)] / [f(v, p) + f(n1, p) + f(p, n2)]

  • 4. Else if f(p) > 0 (choose the attachment based on the preposition alone):
       p̂(a = n1 | v, n1, p, n2) = f(a = n1, p) / f(p)

  • 5. Else: p̂(a = n1 | v, n1, p, n2) = 1.0 (default to noun attachment)

Noun attachment is chosen whenever p̂(a = n1 | v, n1, p, n2) ≥ 0.5 (see the sketch below).
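A minimal sketch of this back-off estimator in Python, with the counts f(·) stored in a Counter keyed by tuples; the data layout and function names are illustrative rather than anything prescribed by the slides.

    from collections import Counter

    def train_counts(examples):
        """examples: iterable of (v, n1, p, n2, attach) tuples, attach in {"N", "V"}."""
        f = Counter()
        for v, n1, p, n2, a in examples:
            # count every context used by the back-off levels, plus noun-attachment counts
            for ctx in [(v, n1, p, n2), (v, n1, p), (v, p, n2), (n1, p, n2),
                        (v, p), (n1, p), (p, n2), (p,)]:
                f[ctx] += 1
                if a == "N":
                    f[("N",) + ctx] += 1
        return f

    def p_noun_attach(f, v, n1, p, n2):
        """Back-off estimate of p(a = n1 | v, n1, p, n2); attach to the noun if >= 0.5."""
        levels = [
            [(v, n1, p, n2)],                        # 1. full 4-tuple
            [(v, n1, p), (v, p, n2), (n1, p, n2)],   # 2. triples
            [(v, p), (n1, p), (p, n2)],              # 3. pairs
            [(p,)],                                  # 4. preposition alone
        ]
        for ctxs in levels:
            denom = sum(f[c] for c in ctxs)
            if denom > 0:
                return sum(f[("N",) + c] for c in ctxs) / denom
        return 1.0                                   # 5. default: noun attachment

For example, p_noun_attach(train_counts(data), "bought", "shirt", "with", "pockets") would return the backed-off estimate for that 4-tuple.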

SLIDE 11

Prepositional Phrase Attachment: Results

◮ Results (Collins and Brooks 1995): 84.5% accuracy with the use of some limited word classes for dates, numbers, etc.
◮ Toutanova, Manning, and Ng (2004): use a more sophisticated smoothing model for PP attachment; 86.18% with words & stems, 87.54% with word classes.
◮ Merlo, Crocker and Berthouzoz (1997): test on multiple PPs, generalizing disambiguation of 1 PP to 2-3 PPs; 1 PP: 84.3%, 2 PPs: 69.6%, 3 PPs: 43.6%.

SLIDE 12

Natural Language Processing

Angel Xuan Chang angelxuanchang.github.io/nlp-class adapted from lecture slides from Anoop Sarkar, Danqi Chen and Karthik Narasimhan

Simon Fraser University

January 23, 2020 Part 2: Probabilistic Classifiers

SLIDE 13

Classification Task

◮ Input:
  ◮ A document d
  ◮ A set of classes C = {c1, c2, . . . , cm}
◮ Output: Predicted class c for document d
◮ Example:
  ◮ neg: unbelievably disappointing
  ◮ pos: this is the greatest screwball comedy ever filmed

SLIDE 14

Supervised learning: Let’s use statistics!

◮ Inputs:
  ◮ A set of m classes C = {c1, c2, . . . , cm}
  ◮ A set of n labeled documents: {(d1, c1), (d2, c2), . . . , (dn, cn)}
◮ Output: Trained classifier F : d → c
◮ What form should F take?
◮ How to learn F?

SLIDE 15

Types of supervised classifiers

SLIDE 16

◮ Classification tasks in NLP
◮ Naive Bayes Classifier
◮ Log linear models

SLIDE 17

Naive Bayes Classifier

◮ x is the input, represented as d independent features fj, 1 ≤ j ≤ d
◮ y is the output classification
◮ P(y | x) = P(y) · P(x | y) / P(x)   (Bayes rule)
◮ P(x | y) = ∏_{j=1}^{d} P(fj | y)
◮ P(y | x) ∝ P(y) · ∏_{j=1}^{d} P(fj | y)
◮ We can ignore P(x) in the equation above because it is a constant scaling factor for each y (a toy numeric sketch follows below).
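To make the factored form concrete, here is a toy scoring sketch with invented probabilities; none of these numbers or feature names come from the slides.

    # Toy illustration of P(y | x) ∝ P(y) · Π_j P(f_j | y), with invented numbers.
    priors = {"pos": 0.5, "neg": 0.5}
    likelihoods = {                      # P(feature | class), made up for illustration
        "pos": {"great": 0.20, "boring": 0.02},
        "neg": {"great": 0.05, "boring": 0.15},
    }

    features = ["great", "boring"]       # the observed features of one input x
    scores = {}
    for y in priors:
        score = priors[y]
        for f in features:
            score *= likelihoods[y][f]
        scores[y] = score

    Z = sum(scores.values())             # the ignored P(x) reappears only as this normalizer
    posterior = {y: s / Z for y, s in scores.items()}
    print(posterior)                     # {'pos': 0.348..., 'neg': 0.652...}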

SLIDE 18

Naive Bayes Classifier for text classification

◮ For text classification: the input x = document d = (w1, . . . , wk)
◮ Use as our features the words wj, 1 ≤ j ≤ |V|, where V is our vocabulary
◮ c is the output classification
◮ Assume that the position of each word is irrelevant and that the words are conditionally independent given the class c:
  P(w1, w2, . . . , wk | c) = P(w1 | c) P(w2 | c) . . . P(wk | c)
◮ Maximum a posteriori estimate:
  cMAP = argmax_c P(c) P(d | c) = argmax_c P̂(c) ∏_{i=1}^{k} P̂(wi | c)

SLIDE 19

Bag of words

SLIDE 20

Estimating probabilities

Maximum likelihood estimate

P̂(cj) = Count(cj) / n

P̂(wi | cj) = Count(wi, cj) / Σ_{w∈V} Count(w, cj)

Smoothing

P̂(wi | cj) = (Count(wi, cj) + α) / Σ_{w∈V} (Count(w, cj) + α)
SLIDE 21

Overall process

Input: Set of labeled documents {(di, ci)} for i = 1 . . . n

◮ Compute the vocabulary V of all words
◮ Calculate P̂(cj) = Count(cj) / n
◮ Calculate P̂(wi | cj) = (Count(wi, cj) + α) / Σ_{w∈V} (Count(w, cj) + α)
◮ Prediction: given a document d = (w1, . . . , wk),
  cMAP = argmax_c P̂(c) ∏_{i=1}^{k} P̂(wi | c)
  (see the sketch below)
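A compact sketch of this training-and-prediction loop, assuming documents arrive already tokenized into lists of words; the class name, the add-α default, and the helper names are illustrative choices.

    import math
    from collections import Counter, defaultdict

    class NaiveBayes:
        def __init__(self, alpha=1.0):
            self.alpha = alpha                       # add-alpha smoothing

        def fit(self, docs, labels):
            """docs: list of token lists; labels: list of class labels."""
            self.vocab = {w for d in docs for w in d}
            self.class_counts = Counter(labels)
            self.word_counts = defaultdict(Counter)  # Count(w, c)
            for d, c in zip(docs, labels):
                self.word_counts[c].update(d)
            self.log_prior = {c: math.log(n / len(labels))
                              for c, n in self.class_counts.items()}
            return self

        def _log_lik(self, w, c):
            num = self.word_counts[c][w] + self.alpha
            den = sum(self.word_counts[c].values()) + self.alpha * len(self.vocab)
            return math.log(num / den)

        def predict(self, doc):
            # c_MAP = argmax_c [ log P(c) + sum_i log P(w_i | c) ]
            # (log space, as in the Naive Bayes Summary slide later)
            def score(c):
                return self.log_prior[c] + sum(self._log_lik(w, c)
                                               for w in doc if w in self.vocab)
            return max(self.class_counts, key=score)

    # usage:
    # nb = NaiveBayes().fit([["great", "fun"], ["boring", "plot"]], ["pos", "neg"])
    # nb.predict(["great", "plot"])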

SLIDE 22

Naive Bayes Example

SLIDE 23

Tokenization

Tokenization matters - it can affect your vocabulary

◮ aren’t can become: aren’t | arent | are n’t | aren t
◮ Emails, URLs, phone numbers, dates, emoticons
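A small sketch showing how three simple tokenizers produce different vocabularies from the same string; the example sentence and the rules are invented for illustration.

    import re

    text = "Aren't the plot twists great? Email me at hi@example.com"

    # Three simple tokenizers; each yields a different vocabulary.
    whitespace = text.lower().split()
    letters_only = re.findall(r"[a-z]+", text.lower())
    split_clitics = text.lower().replace("n't", " n't").split()

    print(whitespace)     # ["aren't", 'the', ..., 'hi@example.com']
    print(letters_only)   # ['aren', 't', 'the', ...]
    print(split_clitics)  # ['are', "n't", 'the', ...]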

SLIDE 24

Features

◮ Remember: Naive Bayes can use any set of features
◮ Capitalization, subword features (e.g. ends with -ing), etc.
◮ Domain knowledge is crucial for performance

Top features for spam detection

[Alqatawna et al, IJCNSS 2015]

SLIDE 25

Evaluation

◮ Table of predictions (binary classification)
◮ Ideally we want all predictions to be true positives or true negatives

SLIDE 26

Evaluation Metrics

Accuracy = (TP + TN) / Total = 200 / 250 = 80%

SLIDE 27

Evaluation Metrics

Accuracy = (TP + TN) / Total = 200 / 250 = 80%

SLIDE 28

Precision and Recall

SLIDE 29

Precision and Recall

from Wikipedia

SLIDE 30

F-Score

SLIDE 31

Choosing Beta

SLIDE 32

Aggregating scores

◮ We have Precision, Recall, and F1 for each class
◮ How do we combine them for an overall score?
  ◮ Macro-average: compute the metric for each class, then average
  ◮ Micro-average: collect predictions for all classes and jointly evaluate

SLIDE 33

Macro vs Micro average

◮ Macro-averaged precision: (0.5 + 0.9) / 2 = 0.7
◮ Micro-averaged precision: 100 / 120 = 0.83
◮ The micro-averaged score is dominated by the score on the common classes
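A small sketch of per-class precision/recall/F1 and the macro/micro averages in plain Python; the function names are illustrative.

    from collections import Counter

    def per_class_prf(gold, pred):
        """Per-class precision / recall / F1 from parallel lists of gold and predicted labels."""
        tp, fp, fn = Counter(), Counter(), Counter()
        for g, p in zip(gold, pred):
            if g == p:
                tp[g] += 1
            else:
                fp[p] += 1      # predicted p, but it was wrong
                fn[g] += 1      # missed an instance of g
        scores = {}
        for c in set(gold) | set(pred):
            prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
            rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
            f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
            scores[c] = (prec, rec, f1)
        return scores, tp, fp

    def macro_micro_precision(gold, pred):
        scores, tp, fp = per_class_prf(gold, pred)
        macro = sum(p for p, _, _ in scores.values()) / len(scores)      # average of per-class precision
        micro = sum(tp.values()) / (sum(tp.values()) + sum(fp.values())) # pools every decision
        return macro, micro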

SLIDE 34

Validation

◮ Choose a metric: Precision / Recall / F1
◮ Optimize for the metric on the Validation (aka Development) set
◮ Finally evaluate on the ‘unseen’ Test set
◮ Cross-validation (see the sketch below):
  ◮ Repeatedly sample several train-val splits
  ◮ Reduces bias due to sampling errors
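A minimal k-fold cross-validation sketch, reusing the NaiveBayes class from the earlier sketch; the fold construction and the choice of accuracy as the metric are illustrative.

    import random

    def cross_validate(docs, labels, k=5, alpha=1.0, seed=0):
        """Average accuracy over k folds, using the NaiveBayes class sketched earlier."""
        idx = list(range(len(docs)))
        random.Random(seed).shuffle(idx)
        folds = [idx[i::k] for i in range(k)]
        accs = []
        for held_out in folds:
            held = set(held_out)
            train_idx = [i for i in idx if i not in held]
            nb = NaiveBayes(alpha=alpha).fit([docs[i] for i in train_idx],
                                             [labels[i] for i in train_idx])
            correct = sum(nb.predict(docs[i]) == labels[i] for i in held_out)
            accs.append(correct / len(held_out))
        return sum(accs) / len(accs)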

SLIDE 35

Advantages of Naive Bayes

◮ Very fast, low storage requirements
◮ Robust to irrelevant features
◮ Very good in domains with many equally important features
◮ Optimal if the independence assumptions hold
◮ A good, dependable baseline for text classification

SLIDE 36

When to use Naive Bayes

◮ Small data sizes: Naive Bayes is great! Rule-based classifiers can work well too
◮ Medium-sized datasets: more advanced classifiers might perform better (SVM, logistic regression)
◮ Large datasets: Naive Bayes becomes competitive again (most learned classifiers will work well)

SLIDE 37

Failings of Naive Bayes (1)

Independence assumptions are too strong

◮ XOR problem: Naive Bayes cannot learn the decision boundary
◮ Both variables are jointly required to predict the class. The independence assumption is broken!

SLIDE 38

Failings of Naive Bayes (2)

Class Imbalance

◮ One or more classes have more instances than the others
◮ Data skew causes NB to prefer one class over the other

SLIDE 39

Failings of Naive Bayes (3)

Weight magnitude errors

◮ Classes with larger weights are preferred
◮ 10 documents with class=MA and “Boston” occurring once each
◮ 10 documents with class=CA and “San Francisco” occurring once each
◮ New document d: “Boston Boston Boston San Francisco San Francisco”
  P(class = CA | d) > P(class = MA | d)

SLIDE 40

Naive Bayes Summary

◮ Domain knowledge is crucial to selecting good features
◮ Handle class imbalance by re-weighting the classes
◮ Use log-scale operations instead of multiplying probabilities:
  cNB = argmax_{cj ∈ C} [ log P(cj) + Σ_i log P(xi | cj) ]
◮ The model is now just a max over a sum of weights

SLIDE 41

◮ Classification tasks in NLP
◮ Naive Bayes Classifier
◮ Log linear models

SLIDE 42

Log linear model

◮ The model classifies the input x into output labels y ∈ Y
◮ Let there be m features, fk(x, y) for k = 1, . . . , m
◮ Define a parameter vector v ∈ Rm
◮ Each (x, y) pair is mapped to a score: s(x, y) = Σ_k vk · fk(x, y)
◮ Using inner product notation: v · f(x, y) = Σ_k vk · fk(x, y), so s(x, y) = v · f(x, y)
◮ To get a probability from the score, renormalize:
  Pr(y | x; v) = exp(s(x, y)) / Σ_{y′∈Y} exp(s(x, y′))
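A small sketch of this score-then-renormalize step, representing f(x, y) as a dict of feature values and v as a dict of weights; the representation and helper names are assumptions made for illustration.

    import math

    def score(v, feats):
        """s(x, y) = v · f(x, y), with features given as a dict {name: value}."""
        return sum(v.get(k, 0.0) * val for k, val in feats.items())

    def probs(v, feature_fn, x, labels):
        """Pr(y | x; v) = exp(s(x, y)) / sum_y' exp(s(x, y'))."""
        scores = {y: score(v, feature_fn(x, y)) for y in labels}
        m = max(scores.values())                 # subtract the max for numerical stability
        exps = {y: math.exp(s - m) for y, s in scores.items()}
        Z = sum(exps.values())
        return {y: e / Z for y, e in exps.items()}

    # toy usage:
    # def feature_fn(x, y): return {w + "_" + y: 1.0 for w in x}
    # probs({"great_pos": 1.2, "great_neg": -0.3}, feature_fn, ["great"], ["pos", "neg"])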
SLIDE 43

Log linear model

◮ The name ‘log-linear model’ comes from:
  log Pr(y | x; v) = v · f(x, y) − log Σ_{y′} exp(v · f(x, y′))
                     (linear term)  (normalization term)
◮ Once the weights v are learned, we can perform predictions using these features.
◮ The goal: find the v that maximizes the log-likelihood L(v) of the labeled training set containing (xi, yi) for i = 1 . . . n:
  L(v) = Σ_i log Pr(yi | xi; v) = Σ_i v · f(xi, yi) − Σ_i log Σ_{y′} exp(v · f(xi, y′))
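A short sketch of the objective itself, reusing the probs helper from the previous sketch.

    import math

    def log_likelihood(v, feature_fn, data, labels):
        """L(v) = sum_i log Pr(y_i | x_i; v), with data a list of (x, y) pairs."""
        return sum(math.log(probs(v, feature_fn, x, labels)[y]) for x, y in data)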
SLIDE 44

Log linear model

◮ Maximize:
  L(v) = Σ_i v · f(xi, yi) − Σ_i log Σ_{y′} exp(v · f(xi, y′))

◮ Calculate the gradient:
  dL(v)/dv = Σ_i f(xi, yi) − Σ_i [ 1 / Σ_{y′′} exp(v · f(xi, y′′)) ] · Σ_{y′} f(xi, y′) · exp(v · f(xi, y′))
           = Σ_i f(xi, yi) − Σ_i Σ_{y′} f(xi, y′) · exp(v · f(xi, y′)) / Σ_{y′′} exp(v · f(xi, y′′))
           = Σ_i f(xi, yi)                               (observed counts)
             − Σ_i Σ_{y′} f(xi, y′) · Pr(y′ | xi; v)     (expected counts)
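A sketch of this observed-minus-expected-counts gradient, again reusing the probs helper; the dict-of-floats representation of the gradient is an illustrative choice.

    from collections import defaultdict

    def gradient(v, feature_fn, data, labels):
        """dL/dv = sum_i f(x_i, y_i) - sum_i sum_y' f(x_i, y') Pr(y' | x_i; v)."""
        grad = defaultdict(float)
        for x, y in data:
            for k, val in feature_fn(x, y).items():        # observed counts
                grad[k] += val
            p = probs(v, feature_fn, x, labels)             # softmax helper from above
            for y_prime in labels:                          # expected counts
                for k, val in feature_fn(x, y_prime).items():
                    grad[k] -= p[y_prime] * val
        return grad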
SLIDE 45

Gradient ascent

◮ Init: v(0) = 0
◮ t ← 0
◮ Iterate until convergence:
  ◮ Calculate ∆ = dL(v)/dv evaluated at v = v(t)
  ◮ Find β∗ = argmax_β L(v(t) + β∆)
  ◮ Set v(t+1) ← v(t) + β∗∆
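A sketch of the ascent loop, using a fixed step size in place of the line search over β for simplicity, and reusing the gradient helper above.

    from collections import defaultdict

    def train(feature_fn, data, labels, step=0.1, iters=100):
        v = defaultdict(float)                  # v(0) = 0
        for t in range(iters):
            delta = gradient(v, feature_fn, data, labels)
            for k, g in delta.items():          # v(t+1) = v(t) + step * delta
                v[k] += step * g
            # the slide instead picks the step via a line search:
            # beta* = argmax_beta L(v + beta * delta)
        return dict(v)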

SLIDE 46

Learning the weights v: Generalized Iterative Scaling

f# = max_{x,y} Σ_j fj(x, y)   (the maximum possible feature value; needed for scaling)

Initialize v(0)
For each iteration t:
    expected[j] ← 0 for j = 1 .. number of features
    For i = 1 to |training data|:
        For each feature fj:
            expected[j] += fj(xi, yi) · P(yi | xi; v(t))
    For each feature fj(x, y):
        observed[j] = fj(x, y) · c(x, y) / |training data|
    For each feature fj(x, y):
        v(t+1)_j ← v(t)_j · ( observed[j] / expected[j] )^(1/f#)

(cf. Goodman, NIPS ’01)
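A sketch of generalized iterative scaling using the additive form of the update on the log-linear weights, v_j += (1/f#) · log(observed[j] / expected[j]), which corresponds to the multiplicative update above when the weights are kept in exponentiated form. Expected counts are taken over all labels under the current model, non-negative features are assumed, the usual slack feature is omitted, and the probs helper from the earlier sketch is reused.

    import math
    from collections import defaultdict

    def gis(feature_fn, data, labels, iters=50):
        """GIS sketch: v_j += (1/f#) * log(observed[j] / expected[j]) each iteration."""
        f_sharp = max(sum(feature_fn(x, y).values()) for x, _ in data for y in labels)
        observed = defaultdict(float)
        for x, y in data:                                   # empirical feature counts
            for k, val in feature_fn(x, y).items():
                observed[k] += val
        v = defaultdict(float)
        for _ in range(iters):
            expected = defaultdict(float)                   # expected counts under the model
            for x, _ in data:
                p = probs(v, feature_fn, x, labels)         # softmax helper from earlier
                for y in labels:
                    for k, val in feature_fn(x, y).items():
                        expected[k] += p[y] * val
            for k in observed:
                if expected[k] > 0:
                    v[k] += (1.0 / f_sharp) * math.log(observed[k] / expected[k])
        return dict(v)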
SLIDE 47

Acknowledgements

Many slides borrowed or inspired from lecture notes by Anoop Sarkar, Danqi Chen, Karthik Narasimhan, Dan Jurafsky, Michael Collins, Chris Dyer, Kevin Knight, Chris Manning, Philipp Koehn, Adam Lopez, Graham Neubig, Richard Socher and Luke Zettlemoyer from their NLP course materials. All mistakes are my own. A big thank you to all the students who read through these notes and helped me improve them.