Text classification II
CE-324: Modern Information Retrieval
Sharif University of Technology
- M. Soleymani
Fall 2018
Some slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
Outline
} Vector space classification
} Rocchio
} Linear classifiers
} SVM
} kNN
2
Features
} Supervised learning classifiers can use any sort of feature
} URL, email address, punctuation, capitalization, dictionaries, network features
} In the simplest bag-of-words view of documents:
} We use only word features
} We use all of the words in the text (not a subset)
3
The bag of words representation
I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet.
4
The bag of words representation
great      2
love       2
recommend  1
laugh      1
happy      1
...        ...
5
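A minimal sketch of how such a term-count (bag-of-words) vector could be built with Python's standard library; the tokenizer and the example text are illustrative, not part of the original slides.

```python
from collections import Counter
import re

def bag_of_words(text):
    """Lowercase, split on non-letter characters, and count term occurrences."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

# Illustrative snippet of the review above
review = "I love this movie! It's sweet ... I would recommend it to just about anyone."
counts = bag_of_words(review)
print(counts["love"], counts["recommend"])  # term frequencies used as features
```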
6
Recall: vector space representation
} Each doc is a vector
} One component for each term (= word).
} Terms are axes
} Usually normalize vectors to unit length.
} High-dimensional vector space:
} 10,000+ dimensions, or even 100,000+
} Docs are vectors in this space
} How can we do classification in this space?
Sec.14.1
7
Classification using vector spaces
} Training set: a set of docs, each labeled with its class (e.g., topic)
} This set corresponds to a labeled set of points (or, equivalently, vectors) in the vector space
} Premise 1: Docs in the same class form a contiguous region of space
} Premise 2: Docs from different classes don't overlap (much)
} We define surfaces to delineate classes in the space
Sec.14.1
8
Documents in a vector space
Government Science Arts
Sec.14.1
9
Test document of what class?
Government Science Arts
Sec.14.1
10
Test document of what class?
Government Science Arts
Is this similarity hypothesis true in general?
Our main topic today is how to find good separators
Sec.14.1
Relevance feedback relation to classification
11
} In relevance feedback, the user marks docs as relevant/non-relevant.
} Relevant/non-relevant can be viewed as classes or categories.
} For each doc, the user decides which of these two classes is correct.
} Relevance feedback is a form of text classification.
Rocchio for text classification
} Relevance feedback methods can be adapted for text categorization
} Relevance feedback can be viewed as 2-class classification
} Use standard tf-idf weighted vectors to represent text docs
} For training docs in each category, compute a prototype as the centroid of the vectors of the training docs in the category.
} Prototype = centroid of members of class
} Assign test docs to the category with the closest prototype vector, based on cosine similarity.
12
Sec.14.2
Definition of centroid
$\vec{\mu}(c) = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{v}(d)$
} $D_c$: docs that belong to class $c$
} $\vec{v}(d)$: vector space representation of doc $d$
} Centroid will in general not be a unit vector, even when the inputs are unit vectors.
13
Sec.14.2
Rocchio algorithm
14
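Since the algorithm slide itself is only a figure, here is a minimal sketch of Rocchio training and classification under the definitions above, assuming documents are already tf-idf vectors stored as numpy arrays; the function and variable names are illustrative.

```python
import numpy as np

def train_rocchio(docs, labels):
    """Compute one centroid (prototype) per class from the training vectors."""
    centroids = {}
    for c in set(labels):
        members = np.array([d for d, y in zip(docs, labels) if y == c])
        centroids[c] = members.mean(axis=0)   # the centroid need not be unit length
    return centroids

def classify_rocchio(doc, centroids):
    """Assign the class whose centroid has the highest cosine similarity."""
    def cosine(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
    return max(centroids, key=lambda c: cosine(doc, centroids[c]))
```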
Rocchio: example
15
} We will see that Rocchio finds linear boundaries between classes
Government Science Arts
Illustration of Rocchio: text classification
16
Sec.14.2
17
Rocchio properties
} Forms a simple generalization of the examples in each class (a prototype).
} Prototype vector does not need to be normalized.
} Classification is based on similarity to class prototypes.
} Does not guarantee classifications are consistent with the given training data.
Sec.14.2
18
Rocchio anomaly
} Prototype models have problems with polymorphic (disjunctive) categories.
Sec.14.2
Rocchio classification: summary
} Rocchio forms a simple representation for each class:
} Centroid/prototype
} Classification is based on similarity to the prototype
} It does not guarantee that classifications are consistent with the given training data
} It is little used outside text classification
} It has been used quite effectively for text classification
} But in general worse than many other classifiers
} Rocchio does not handle nonconvex, multimodal classes correctly.
19
Sec.14.2
Linear classifiers
20
} Assumption: The classes are linearly separable.
} Classification decision: $\sum_{i=1}^{M} w_i x_i + w_0 > 0$?
} First, we only consider binary classifiers.
} Geometrically, this corresponds to a line (2D), a plane (3D), or a hyperplane (higher dimensionalities) as decision boundary.
} Find the parameters $w_0, w_1, \dots, w_M$ based on the training set.
} Methods for finding these parameters: Perceptron, Rocchio, …
21
Separation by hyperplanes
} A simplifying assumption is linear separability:
} in 2 dimensions, can separate classes by a line
} in higher dimensions, need hyperplanes
Sec.14.4
Two-class Rocchio as a linear classifier
} Line or hyperplane defined by: $\vec{w}^T\vec{d} = b$
} For Rocchio, set:
$\vec{w} = \vec{\mu}(c_1) - \vec{\mu}(c_2)$
$b = \frac{1}{2}\left(\|\vec{\mu}(c_1)\|^2 - \|\vec{\mu}(c_2)\|^2\right)$
} Equivalently, assign $\vec{d}$ to $c_1$ if $w_0 + \sum_{i=1}^{M} w_i d_i = w_0 + \vec{w}^T\vec{d} \geq 0$ (with $w_0 = -b$)
22
Sec.14.2
23
Linear classifier: example
} Class: "interest" (as in interest rate)
} Example features of a linear classifier:
} [Table of term weights $w_i$ and terms $t_i$: positive weights for terms indicating the class, negative weights for terms against it]
} To classify, find the dot product of the feature vector and the weights
Sec.14.4
Linear classifier: example
24
} Class "interest" in Reuters-21578
} $d_1$: "rate discount dlrs world"
} $d_2$: "prime dlrs"
} $\vec{w}^T\vec{d}_1 = 0.07 \Rightarrow d_1$ is assigned to the "interest" class
} $\vec{w}^T\vec{d}_2 = -0.01 \Rightarrow d_2$ is not assigned to this class
} $w_0 = 0$
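A brief sketch of this classification step as a sparse dot product. The weights below are illustrative stand-ins (not taken from the slides), chosen so the resulting scores match the two dot products quoted above.

```python
# Illustrative weight vector for the "interest" class (not the slides' actual weights)
weights = {"rate": 0.67, "discount": 0.46, "prime": 0.70,
           "dlrs": -0.71, "world": -0.35}
w0 = 0.0

def score(doc_tokens, weights, w0):
    """Dot product of the (binary) feature vector with the weight vector, plus bias."""
    return w0 + sum(weights.get(t, 0.0) for t in doc_tokens)

d1 = "rate discount dlrs world".split()
d2 = "prime dlrs".split()
print(round(score(d1, weights, w0), 2))  # 0.07  -> assigned to "interest"
print(round(score(d2, weights, w0), 2))  # -0.01 -> not assigned
```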
Naïve Bayes as a linear classifier
25
} Decide $c_1$ if:
$P(c_1) \prod_{i=1}^{M} P(t_i \mid c_1)^{tf_{i,d}} > P(c_2) \prod_{i=1}^{M} P(t_i \mid c_2)^{tf_{i,d}}$
} Taking logs:
$\log P(c_1) + \sum_{i=1}^{M} tf_{i,d} \log P(t_i \mid c_1) > \log P(c_2) + \sum_{i=1}^{M} tf_{i,d} \log P(t_i \mid c_2)$
} So Naïve Bayes is a linear classifier with:
$w_i = \log \frac{\hat{P}(t_i \mid c_1)}{\hat{P}(t_i \mid c_2)}, \quad x_i = tf_{i,d}, \quad w_0 = \log \frac{\hat{P}(c_1)}{\hat{P}(c_2)}$
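A small sketch of how the NB parameters above could be materialized as linear weights; it assumes the smoothed conditional estimates and class priors are already available as dictionaries, and the names are illustrative.

```python
import math

def nb_as_linear(p_t_c1, p_t_c2, prior_c1, prior_c2, vocab):
    """Turn Naive Bayes parameters into linear weights: w_i = log P(t|c1)/P(t|c2)."""
    w = {t: math.log(p_t_c1[t] / p_t_c2[t]) for t in vocab}
    w0 = math.log(prior_c1 / prior_c2)
    return w, w0

def nb_decide(tf, w, w0):
    """Decide c1 iff w0 + sum_i tf_i * w_i > 0, i.e. the log-odds is positive."""
    return w0 + sum(tf.get(t, 0) * w_t for t, w_t in w.items()) > 0
```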
26
Linear programming / Perceptron
Find a, b, c such that:
  ax + by > c for red points
  ax + by < c for blue points
Sec.14.4
Perceptron
27
} If example $\mathbf{x}^{(i)}$ is misclassified:
$\mathbf{w}_{t+1} = \mathbf{w}_t + \mathbf{x}^{(i)} y^{(i)}$
} Perceptron convergence theorem: for linearly separable data
} If training data are linearly separable, the single-sample perceptron is guaranteed to find a solution in a finite number of steps
Initialize $\mathbf{w} \leftarrow 0$, $t \leftarrow 0$
repeat
  $t \leftarrow t + 1$
  $j \leftarrow t \bmod N$
  if $\mathbf{x}^{(j)}$ is misclassified then $\mathbf{w} \leftarrow \mathbf{w} + \mathbf{x}^{(j)} y^{(j)}$
until all patterns are properly classified
} The learning rate $\eta$ can be set to 1 and the proof still works
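A runnable sketch of this single-sample perceptron loop, assuming numpy feature matrices, labels in {−1, +1}, and a bias folded into the weight vector; it illustrates the update rule above rather than reproducing the instructors' code.

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Single-sample perceptron: cycle through examples and update on mistakes."""
    X = np.hstack([X, np.ones((len(X), 1))])   # fold the bias into the weights
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:             # misclassified (or on the boundary)
                w += yi * xi                   # w <- w + x * y  (learning rate 1)
                mistakes += 1
        if mistakes == 0:                      # converged: all patterns classified
            break
    return w
```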
28
Linear classifiers
} Many common text classifiers are linear classifiers
} Classifiers more powerful than linear often don't perform better on text problems. Why?
} Despite the similarity among linear classifiers, there are noticeable performance differences between them
} For separable problems, there is an infinite number of separating hyperplanes.
} Different training methods pick different hyperplanes.
} Also different strategies for non-separable problems
Sec.14.4
29
Which hyperplane?
In general, lots of possible solutions
Sec.14.4
30
Which hyperplane?
} Lots of possible solutions
} Some methods find a separating hyperplane, but not the optimal one
} Which points should influence optimality?
} All points
} E.g., Rocchio
} Only "difficult points" close to the decision boundary
} E.g., Support Vector Machine (SVM)
Sec.14.4
31
Linear classifiers: Which Hyperplane?
} Some methods find a separating hyperplane, but not the optimal one
} E.g., perceptron
} A Support Vector Machine (SVM) finds an optimal* solution.
} Maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary
} One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions
32
Support Vector Machine (SVM)
[Figure: support vectors and the maximized margin; an alternative hyperplane would have a narrower margin]
} SVMs maximize the margin around the separating hyperplane.
} A.k.a. large margin classifiers
} The decision function is fully specified by a subset of training samples, the support vectors.
} Solving SVMs is a quadratic programming problem
} Seen by many as the most successful current text classification method*
*but other discriminative methods perform about as well
33
Another intuition
} If you have to place a fat separator between classes, you have fewer choices, and so the capacity of the model has been decreased
34
Maximum Margin: Formalization
} $\mathbf{w}$: decision hyperplane normal vector
} $\mathbf{x}_i$: data point $i$
} $y_i$: class of data point $i$ (+1 or −1)
} Classifier is: $f(\mathbf{x}_i) = \mathrm{sign}(\mathbf{w}^T\mathbf{x}_i + b)$
} Functional margin of $\mathbf{x}_i$ is: $y_i(\mathbf{w}^T\mathbf{x}_i + b)$
} The functional margin of a dataset is twice the minimum functional margin for any point
} The factor of 2 comes from measuring the whole width of the margin
} Problem: we can increase this margin simply by scaling $\mathbf{w}$, $b$…
35
Geometric Margin
} Distance from an example to the separator is $r = y \, \frac{\mathbf{w}^T\mathbf{x} + b}{\|\mathbf{w}\|}$
} Examples closest to the hyperplane are support vectors.
} Margin $\rho$ of the separator is the width of separation between support vectors of the classes.
36
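A small numerical sketch of the two margin notions just defined, for a given hyperplane (w, b) and labeled data; numpy arrays and labels in {−1, +1} are assumed, and the example points are purely illustrative.

```python
import numpy as np

def functional_margins(w, b, X, y):
    """Per-example functional margin y_i * (w^T x_i + b)."""
    return y * (X @ w + b)

def geometric_margin(w, b, X, y):
    """Smallest signed distance to the hyperplane: min_i y_i (w^T x_i + b) / ||w||."""
    return functional_margins(w, b, X, y).min() / np.linalg.norm(w)

# Example: points on either side of the line x1 + x2 = 0
X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0]])
y = np.array([1, 1, -1])
w, b = np.array([1.0, 1.0]), 0.0
print(geometric_margin(w, b, X, y))  # ~1.414: distance of the closest point
```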
Linear SVM Mathematically
The linearly separable case
} Assume that the functional margin of each data item is at least 1; then the following two constraints follow for a training set $\{(\mathbf{x}_i, y_i)\}$:
$\mathbf{w}^T\mathbf{x}_i + b \geq 1 \quad \text{if } y_i = 1$
$\mathbf{w}^T\mathbf{x}_i + b \leq -1 \quad \text{if } y_i = -1$
} For support vectors, the inequality becomes an equality
} Then, since each example's distance from the hyperplane is $r = y \, \frac{\mathbf{w}^T\mathbf{x} + b}{\|\mathbf{w}\|}$
} The margin is: $\rho = \frac{2}{\|\mathbf{w}\|}$
37
Linear Support Vector Machine (SVM)
} Hyperplane: $\mathbf{w}^T\mathbf{x} + b = 0$
} Extra scale constraint: $\min_{i=1,\dots,n} |\mathbf{w}^T\mathbf{x}_i + b| = 1$
} This implies: $\mathbf{w}^T(\mathbf{x}_a - \mathbf{x}_b) = 2$, so $\rho = \|\mathbf{x}_a - \mathbf{x}_b\|_2 = 2/\|\mathbf{w}\|_2$
[Figure: the hyperplanes $\mathbf{w}^T\mathbf{x} + b = 0, +1, -1$ and the margin $\rho$]
38
Linear SVMs Mathematically (cont.)
} Then we can formulate the quadratic optimization problem:
Find $\mathbf{w}$ and $b$ such that $\rho = \frac{2}{\|\mathbf{w}\|}$ is maximized, and for all $\{(\mathbf{x}_i, y_i)\}$:
$\mathbf{w}^T\mathbf{x}_i + b \geq 1$ if $y_i = 1$;  $\mathbf{w}^T\mathbf{x}_i + b \leq -1$ if $y_i = -1$
} A better formulation (min $\|\mathbf{w}\|$ = max $1/\|\mathbf{w}\|$):
Find $\mathbf{w}$ and $b$ such that $\Phi(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T\mathbf{w}$ is minimized, and for all $\{(\mathbf{x}_i, y_i)\}$: $y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1$
39
Solving the Optimization Problem
} This is now optimizing a quadratic function subject to linear constraints
} Quadratic optimization problems are a well-known class of mathematical programming problems, and many (intricate) algorithms exist for solving them (with many special ones built for SVMs)
Find $\mathbf{w}$ and $b$ such that $\Phi(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T\mathbf{w}$ is minimized, and for all $\{(\mathbf{x}_i, y_i)\}$: $y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1$
40
Soft Margin Classification
} If the training data is not linearly separable, slack variables $\xi_i$ can be added to allow misclassification of difficult or noisy examples.
} Allow some errors
} Let some points be moved to where they belong, at a cost
} Still, try to minimize training set errors, and to place the hyperplane "far" from each class (large margin)
[Figure: misclassified points with slack variables $\xi_i$, $\xi_j$]
41
Soft Margin Classification Mathematically
} The old formulation:
Find $\mathbf{w}$ and $b$ such that $\Phi(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T\mathbf{w}$ is minimized, and for all $\{(\mathbf{x}_i, y_i)\}$: $y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1$
} The new formulation incorporating slack variables:
Find $\mathbf{w}$ and $b$ such that $\Phi(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_i \xi_i$ is minimized, and for all $\{(\mathbf{x}_i, y_i)\}$: $y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$ for all $i$
} Parameter C can be viewed as a way to control overfitting
} A regularization term
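As a sketch of what this objective looks like in code: for fixed (w, b), the optimal slack $\xi_i$ equals the hinge loss $\max(0, 1 - y_i(\mathbf{w}^T\mathbf{x}_i + b))$, so the soft-margin objective can be evaluated directly. Numpy arrays and labels in {−1, +1} are assumed; this is an illustration, not a solver.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C=1.0):
    """1/2 ||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + b)), slacks written as hinge losses."""
    margins = y * (X @ w + b)
    slacks = np.maximum(0.0, 1.0 - margins)
    return 0.5 * w @ w + C * slacks.sum()
```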
42
Summary
} Support vector machines (SVM)
} Choose hyperplane based on support vectors
} Support vector = “critical” point close to decision boundary
} Perhaps best performing text classifier
} But there are other methods that perform about as well as SVM, such as regularized logistic regression (Zhang & Oles 2001)
} Partly popular due to availability of good software
} SVMlight is accurate and fast – and free (for research)
} Now lots of good software: libsvm, TinySVM, …
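For completeness, a hedged example of the same pipeline with scikit-learn (a widely used package, though not one named on the slide): tf-idf vectors fed to a linear SVM. The toy corpus and labels are made up purely for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy corpus: label 1 = "interest" class, 0 = other (illustrative data only)
docs = ["rate discount dlrs world", "prime dlrs", "interest rates rise", "world group year"]
labels = [1, 0, 1, 0]

vectorizer = TfidfVectorizer()          # standard tf-idf weighted vectors
X = vectorizer.fit_transform(docs)

clf = LinearSVC(C=1.0)                  # C controls the slack penalty / regularization
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(["discount rate"])))
```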
43
Linear classifiers: binary and multiclass classification
} Consider 2-class problems
} Deciding between two classes, perhaps government and non-government
} Multi-class
} How do we define (and find) the separating surface?
} How do we decide which region a test doc is in?
Sec.14.4
44
More than two classes
} One-of classification (multi-class classification)
} Classes are mutually exclusive.
} Each doc belongs to exactly one class
} Any-of classification
} Classes are not mutually exclusive.
} A doc can belong to 0, 1, or >1 classes.
} For simplicity, decompose into K binary problems
} Quite common for docs
Sec.14.5
45
Set of binary classifiers: any of
} Build a separator between each class and its complementary set (docs from all other classes).
} Given a test doc, evaluate it for membership in each class.
} Apply the decision criterion of the classifiers independently
} It works, although considering dependencies between categories may be more accurate
Sec.14.5
46
Multi-class: set of binary classifiers
} Build a separator between each class and its complementary set (docs from all other classes).
} Given a test doc, evaluate it for membership in each class.
} Assign the doc to the class with:
} maximum score
} maximum confidence
} maximum probability
Sec.14.5
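A minimal one-vs-rest sketch covering both cases above; it assumes per-class binary scorers that return a real-valued score such as $\mathbf{w}_c^T\mathbf{x} + b_c$, and the names are illustrative.

```python
def any_of(doc_vec, scorers):
    """Any-of: apply each binary classifier independently; keep every class with a positive score."""
    return [c for c, score in scorers.items() if score(doc_vec) > 0]

def one_of(doc_vec, scorers):
    """One-of: assign the single class with the maximum score."""
    return max(scorers, key=lambda c: scorers[c](doc_vec))
```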
47
k Nearest Neighbor Classification
} kNN = k Nearest Neighbor
} To classify a document d:
} Define the k-neighborhood as the k nearest neighbors of d
} Pick the majority class label in the k-neighborhood
Sec.14.3
48
Nearest-Neighbor (1NN) classifier
} Learning phase:
} Just storing the representations of the training examples in D.
} Does not explicitly compute category prototypes.
} Testing instance x (under 1NN):
} Compute similarity between x and all examples in D.
} Assign x the category of the most similar example in D.
} Rationale of kNN: contiguity hypothesis
} We expect a test doc d to have the same label as the training docs located in the local region surrounding d.
Sec.14.3
49
Test Document = Science
Government Science Arts
Sec.14.1
50
k Nearest Neighbor (kNN) classifier
} 1NN: subject to errors due to
} A single atypical example.
} Noise (i.e., an error) in the category label of a single training example.
} More robust alternative:
} Find the k most-similar examples
} Return the majority category of these k examples.
Sec.14.3
51
kNN example: k=6
Government Science Arts
P(science| )?
Sec.14.3
52
kNN decision boundaries
Government Science Arts
Boundaries are in principle arbitrary surfaces (polyhedral)
kNN gives locally defined decision boundaries between classes – far away points do not influence each classification decision (unlike Rocchio, etc.)
Sec.14.3
53
k Nearest Neighbor
} Using only the closest example (1NN) is subject to errors due to:
} A single atypical example.
} Noise (i.e., an error) in the category label of a single training example.
} More robust: find the k most-similar examples and return the majority category of these k
} k is typically odd to avoid ties; 3 and 5 are most common
Sec.14.3
1NN: Voronoi tessellation
54
The decision boundaries between classes are piecewise linear.
kNN algorithm
55
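The algorithm slide itself is a figure; here is a brief sketch of kNN classification in the cosine-similarity setting used in these slides, assuming unit-normalized tf-idf vectors stored as numpy arrays and illustrative names.

```python
import numpy as np
from collections import Counter

def knn_classify(test_vec, train_vecs, train_labels, k=3):
    """Score training docs by cosine similarity (dot product of unit vectors),
    take the k most similar, and return the majority class among them."""
    sims = train_vecs @ test_vec                  # cosine similarity if vectors are unit length
    top_k = np.argsort(-sims)[:k]                 # indices of the k nearest neighbors
    votes = Counter(train_labels[i] for i in top_k)
    return votes.most_common(1)[0][0]
```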
Time complexity of kNN
56
} kNN test time is proportional to the size of the training set!
} kNN is inefficient for very large training sets.
57
Similarity metrics
} The nearest neighbor method depends on a similarity (or distance) metric.
} Euclidean distance: simplest for continuous vector spaces.
} Hamming distance: simplest for binary instance spaces.
} Number of feature values that differ
} For text, cosine similarity of tf-idf weighted vectors is typically most effective.
Sec.14.3
58
Illustration of kNN (k=3) for text vector space
Sec.14.3
59
3-NN vs. Rocchio
} Nearest Neighbor tends to handle polymorphic categories better than Rocchio/NB.
60
Nearest neighbor with inverted index
} Naively, finding the nearest neighbors requires a linear search through the |D| docs in the collection
} Similar to determining the k best retrievals using the test doc as a query to a database of training docs.
} Use standard vector space inverted index methods to find the k nearest neighbors.
} Testing time: O(B|Vt|)
} Typically B << |D| if a large list of stopwords is used.
Sec.14.3
B is the average number of training docs in which at least one word of the test document appears
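A sketch of the idea, assuming a simple term-to-postings dict and tf-idf weight dicts (illustrative names only): only the B documents sharing at least one term with the test doc are ever scored.

```python
from collections import defaultdict
import heapq

def knn_with_inverted_index(test_doc_weights, index, doc_weights, k=3):
    """Score only training docs that share a term with the test doc, then take the top k.
    index: term -> list of doc ids; doc_weights: doc id -> {term: tf-idf weight}."""
    scores = defaultdict(float)
    for term, w in test_doc_weights.items():
        for doc_id in index.get(term, []):
            scores[doc_id] += w * doc_weights[doc_id].get(term, 0.0)
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
```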
A nonlinear problem
} Linear classifiers do badly on this task
} kNN will do very well (assuming enough training data)
61
Sec.14.4
Overfitting example
62
Bias vs. capacity – notions and terminology
} Consider asking a botanist: Is an object a tree?
} Too much capacity, low bias
} Botanist who memorizes
} Will always say "no" to a new object (e.g., different # of leaves)
} Not enough capacity, high bias
} Lazy botanist
} Says "yes" if the object is green
} You want the middle ground
63
(Example due to C. Burges)
Sec.14.6
64
Choosing the correct model capacity
Sec.14.6
kNN vs. linear classifiers
} Bias/variance tradeoff
} Variance ≈ capacity
} kNN has high variance and low bias.
} Infinite memory
} Rocchio has low variance and high bias.
} Linear decision surface between classes
65
Sec.14.6
66
kNN: summary
} No training phase necessary
} Actually: we always preprocess the training set, so in reality the training time of kNN is linear.
} May be expensive at test time
} kNN is very accurate if the training set is large.
} In most cases it's more accurate than linear classifiers
} Optimality result: asymptotically zero error if the Bayes rate is zero.
} Scales well with a large number of classes
} Don't need to train C classifiers for C classes
} Classes can influence each other
} Small changes to one class can have a ripple effect
Sec.14.3
67
Resources
} IIR, Chapter 14, 15.1, 15.2.1.