Natural Language Processing (CSE 517): Text Classification (II)


SLIDE 1

Natural Language Processing (CSE 517): Text Classification (II)

Noah Smith

© 2016 University of Washington, nasmith@cs.washington.edu

February 1, 2016

SLIDE 2

Quick Review: Text Classification

Input: a piece of text x ∈ V†, usually a document (r.v. X)
Output: a label from a finite set L (r.v. L)

Standard line of attack:

  1. Human experts label some data.
  2. Feed the data to a supervised machine learning algorithm that constructs an automatic classifier classify : V† → L.
  3. Apply classify to as much data as you want!

We covered naïve Bayes, reviewed multinomial logistic regression, and, briefly, the perceptron.

SLIDE 3

Multinomial Logistic Regression as “Log Loss”

p(L = ℓ | x) = exp(w · φ(x, ℓ)) / Σ_{ℓ′∈L} exp(w · φ(x, ℓ′))

MLE can be rewritten as a minimization problem:

ŵ = argmin_w Σ_{i=1}^{n} [ log Σ_{ℓ′∈L} exp(w · φ(x_i, ℓ′)) − w · φ(x_i, ℓ_i) ]

The first term (the log-normalizer over all labels) is the "fear" term; the second (the score of the correct label) is the "hope" term.
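To make the loss concrete, here is a minimal Python/NumPy sketch (mine, not from the original slides) of the per-example log loss, assuming the feature vectors φ(x, ℓ) have already been computed for each label; the names phi_by_label and gold are illustrative placeholders.

```python
import numpy as np

def log_loss(w, phi_by_label, gold):
    """Per-example log loss for multinomial logistic regression.

    w            -- weight vector, shape (d,)
    phi_by_label -- dict mapping each label to its feature vector φ(x, label), shape (d,)
    gold         -- the correct label for this example
    """
    scores = {label: float(w @ phi) for label, phi in phi_by_label.items()}
    # "fear" term: the log-normalizer over all labels (computed stably)
    m = max(scores.values())
    log_z = m + float(np.log(sum(np.exp(s - m) for s in scores.values())))
    # "hope" term: the score of the correct label
    return log_z - scores[gold]
```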

Recall from lecture 3:

◮ Be wise and regularize!
◮ Solve with batch or stochastic gradient methods.
◮ w_j has an interpretation.

SLIDE 4

Log Loss and Hinge Loss for (x, ℓ)

log loss:   log Σ_{ℓ′∈L} exp(w · φ(x, ℓ′)) − w · φ(x, ℓ)

hinge loss:   max_{ℓ′∈L} w · φ(x, ℓ′) − w · φ(x, ℓ)

In the binary case, where “score” is the linear score of the correct label:

[Plot: the losses as functions of the score, for scores in roughly [−4, 4].]

In purple is the hinge loss, in blue is the log loss, and in red is the "zero-one" loss (error).
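For the binary picture above, a small sketch (mine, not the slides') of the two curves, under the assumption that "score" is the amount by which the correct label outscores the incorrect one:

```python
import numpy as np

def binary_log_loss(score):
    # log(1 + e^{-score}): smooth, strictly positive even when the correct label wins
    return float(np.log1p(np.exp(-score)))

def binary_hinge_loss(score):
    # max(0, -score): exactly zero as soon as the correct label outscores the other
    return max(0.0, -score)

for s in (-2.0, 0.0, 2.0):
    print(s, binary_log_loss(s), binary_hinge_loss(s))
```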

SLIDE 5

Minimizing Hinge Loss: Perceptron

min_w Σ_{i=1}^{n} [ max_{ℓ′∈L} w · φ(x_i, ℓ′) − w · φ(x_i, ℓ_i) ]

Stochastic subgradient descent on the above is called the perceptron algorithm.

◮ For t ∈ {1, . . . , T}:
  ◮ Pick i_t uniformly at random from {1, . . . , n}.
  ◮ ℓ̂_{i_t} ← argmax_{ℓ∈L} w · φ(x_{i_t}, ℓ)
  ◮ w ← w − α (φ(x_{i_t}, ℓ̂_{i_t}) − φ(x_{i_t}, ℓ_{i_t}))
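A minimal sketch of that loop in Python/NumPy (not from the slides), assuming a feature function phi(x, label) that returns a NumPy vector; examples, labels, and dim are placeholder names.

```python
import numpy as np

def perceptron(examples, labels, phi, dim, steps=1000, alpha=1.0, seed=0):
    """Stochastic subgradient descent on the multiclass hinge loss above.

    examples -- list of (x, gold_label) pairs
    labels   -- the finite label set L
    phi      -- feature function: phi(x, label) -> np.ndarray of shape (dim,)
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(dim)
    for _ in range(steps):
        x, gold = examples[rng.integers(len(examples))]           # pick i_t uniformly at random
        pred = max(labels, key=lambda l: float(w @ phi(x, l)))    # current best-scoring label
        w -= alpha * (phi(x, pred) - phi(x, gold))                # zero update when pred == gold
    return w
```

With α = 1 this is the classic multiclass perceptron update: nothing changes when the prediction is already correct.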

SLIDE 6

Error Costs

Suppose that not all mistakes are equally bad. E.g., false positives vs. false negatives in spam detection.

SLIDE 7

Error Costs

Suppose that not all mistakes are equally bad. E.g., false positives vs. false negatives in spam detection. Let cost(ℓ, ℓ′) quantify the “badness” of substituting ℓ′ for correct label ℓ.

SLIDE 8

Error Costs

Suppose that not all mistakes are equally bad. E.g., false positives vs. false negatives in spam detection. Let cost(ℓ, ℓ′) quantify the "badness" of substituting ℓ′ for correct label ℓ.

Intuition: estimate the scoring function so that score(ℓ_i) − score(ℓ̂) ∝ cost(ℓ_i, ℓ̂).

SLIDE 9

General Hinge Loss for (x, ℓ)

max_{ℓ′∈L} [ w · φ(x, ℓ′) + cost(ℓ, ℓ′) ] − w · φ(x, ℓ)

In the binary case, with cost(−1, 1) = 1:

[Plot: the general hinge loss −x + max(x, 1) = max(0, 1 − x) as a function of the score x, alongside the zero-one loss.]

In blue is the general hinge loss; in red is the “zero-one” loss (error).
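A minimal sketch (not from the slides) of the cost-augmented hinge for one example; the cost function at the end is only an illustration of the spam/ham asymmetry mentioned earlier, with made-up numbers.

```python
import numpy as np

def general_hinge(w, phi_by_label, gold, cost):
    """Cost-augmented hinge loss for a single example.

    phi_by_label -- dict: label -> φ(x, label) as a NumPy vector
    cost         -- function with cost(gold, gold) == 0 and cost(gold, other) >= 0
    """
    scores = {label: float(w @ phi) for label, phi in phi_by_label.items()}
    worst = max(scores[label] + cost(gold, label) for label in scores)  # cost-augmented competitor
    return worst - scores[gold]  # >= 0, since cost(gold, gold) == 0

# Illustrative cost: calling ham "spam" (a false positive) is 5x worse than the reverse.
def spam_cost(gold, pred):
    if gold == pred:
        return 0.0
    return 5.0 if (gold, pred) == ("ham", "spam") else 1.0
```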

SLIDE 10

Support Vector Machines

A different motivation for the generalized hinge:

ŵ = Σ_{i=1}^{n} Σ_{ℓ∈L} α_{i,ℓ} · φ(x_i, ℓ)

where only a small number of the α_{i,ℓ} are nonzero.

SLIDE 11

Support Vector Machines

A different motivation for the generalized hinge:

ŵ = Σ_{i=1}^{n} Σ_{ℓ∈L} α_{i,ℓ} · φ(x_i, ℓ)

where only a small number of the α_{i,ℓ} are nonzero. Those φ(x_i, ℓ) are called "support vectors" because they "support" the decision boundary.

ŵ · φ(x, ℓ′) = Σ_{(i,ℓ)∈S} α_{i,ℓ} · (φ(x_i, ℓ) · φ(x, ℓ′))

See Crammer and Singer (2001) for the multiclass version.

SLIDE 12

Support Vector Machines

A different motivation for the generalized hinge:

ŵ = Σ_{i=1}^{n} Σ_{ℓ∈L} α_{i,ℓ} · φ(x_i, ℓ)

where only a small number of the α_{i,ℓ} are nonzero. Those φ(x_i, ℓ) are called "support vectors" because they "support" the decision boundary.

ŵ · φ(x, ℓ′) = Σ_{(i,ℓ)∈S} α_{i,ℓ} · (φ(x_i, ℓ) · φ(x, ℓ′))

See Crammer and Singer (2001) for the multiclass version.

Really good tool: SVMlight, http://svmlight.joachims.org
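A minimal sketch (not from the slides) of that score computation from the support vectors alone; passing a kernel in place of the plain dot product gives the kernelized variant discussed below.

```python
import numpy as np

def svm_score(support, x, label, phi, dot=np.dot):
    """ŵ · φ(x, label), computed from the support vectors alone.

    support -- list of (alpha, phi_i) pairs with nonzero α_{i,ℓ}, where phi_i = φ(x_i, ℓ)
    dot     -- inner product; swap in a kernel K(·, ·) for a kernelized SVM
    """
    phi_x = phi(x, label)
    return sum(alpha * float(dot(phi_i, phi_x)) for alpha, phi_i in support)
```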

SLIDE 13

Support Vector Machines: Remarks

◮ Regularization is critical; squared ℓ2 is most common, and often used in (yet another) motivation around the idea of "maximizing margin" around the hyperplane separator.

SLIDE 14

Support Vector Machines: Remarks

◮ Regularization is critical; squared ℓ2 is most common, and often used in (yet another) motivation around the idea of "maximizing margin" around the hyperplane separator.

◮ Often, instead of linear models that explicitly calculate w · φ, these methods are "kernelized" and rearrange all calculations to involve inner products between φ vectors.

◮ Example (sketched in code after this list):

  K_linear(v, w) = v · w
  K_polynomial(v, w) = (v · w + 1)^p
  K_Gaussian(v, w) = exp(−‖v − w‖²₂ / (2σ²))

◮ Linear kernels are most common in NLP.
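As referenced above, a minimal sketch (not from the slides) of the three kernels; p and sigma are illustrative defaults.

```python
import numpy as np

def k_linear(v, w):
    return float(v @ w)

def k_polynomial(v, w, p=2):
    return float(v @ w + 1.0) ** p

def k_gaussian(v, w, sigma=1.0):
    return float(np.exp(-np.sum((v - w) ** 2) / (2.0 * sigma ** 2)))
```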

SLIDE 15

General Remarks

◮ Text classification: many problems, all solved with supervised learners.

◮ Lexicon features can provide problem-specific guidance.

SLIDE 16

General Remarks

◮ Text classification: many problems, all solved with supervised learners.

◮ Lexicon features can provide problem-specific guidance.

◮ Naïve Bayes, log-linear, and SVM are all linear methods that tend to work reasonably well, with good features and smoothing/regularization.

◮ You should have a basic understanding of the tradeoffs in choosing among them.

SLIDE 17

General Remarks

◮ Text classification: many problems, all solved with supervised learners.

◮ Lexicon features can provide problem-specific guidance.

◮ Naïve Bayes, log-linear, and SVM are all linear methods that tend to work reasonably well, with good features and smoothing/regularization.

◮ You should have a basic understanding of the tradeoffs in choosing among them.

◮ Rumor: random forests are widely used in industry when performance matters more than interpretability.

SLIDE 18

General Remarks

◮ Text classification: many problems, all solved with supervised learners.

◮ Lexicon features can provide problem-specific guidance.

◮ Naïve Bayes, log-linear, and SVM are all linear methods that tend to work reasonably well, with good features and smoothing/regularization.

◮ You should have a basic understanding of the tradeoffs in choosing among them.

◮ Rumor: random forests are widely used in industry when performance matters more than interpretability.

◮ Lots of papers about neural networks, but with hyperparameter tuning applied fairly to linear models, the advantage is not clear (Yogatama et al., 2015).

SLIDE 19

Readings and Reminders

◮ Jurafsky and Martin (2015); Collins (2011)
◮ Submit a suggestion for an exam question by Friday at 5pm.

SLIDE 20

References I

Michael Collins. The naive Bayes model, maximum-likelihood estimation, and the EM algorithm, 2011. URL http://www.cs.columbia.edu/~mcollins/em.pdf.

Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2(5):265–292, 2001.

Daniel Jurafsky and James H. Martin. Classification: Naive Bayes, logistic regression, sentiment (draft chapter), 2015. URL https://web.stanford.edu/~jurafsky/slp3/7.pdf.

Dani Yogatama, Lingpeng Kong, and Noah A. Smith. Bayesian optimization of text representations. In Proc. of EMNLP, 2015. URL http://www.aclweb.org/anthology/D/D15/D15-1251.pdf.
