SLIDE 1

Text classification II

CE-324: Modern Information Retrieval

Sharif University of Technology

  • M. Soleymani

Fall 2017

Some slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

SLIDE 2

Outline

 Vector space classification

 Rocchio
 Linear classifiers
 kNN

SLIDE 3

Standing queries

 The path from IR to text classification:
 You have an information need to monitor, say:
 Unrest in the Niger delta region
 You want to rerun an appropriate query periodically to find new news items on this topic
 You will be sent new documents that are found
 I.e., it’s not ranking but classification (relevant vs. not relevant)
 Such queries are called standing queries
 Long used by “information professionals”
 A modern mass instantiation is Google Alerts
 Standing queries are (hand-written) text classifiers

Ch. 13

SLIDE 4


Recall: vector space representation

 Each doc is a vector

 One component for each term (= word).

 Terms are axes

 Usually normalize vectors to unit length.

 High-dimensional vector space:

 10,000+ dimensions, or even 100,000+  Docs are vectors in this space

 How can we do classification in this space?

Sec.14.1

SLIDE 5

Classification using vector spaces

 Training set: a set of docs, each labeled with its class (e.g.,

topic)

 This set corresponds to a labeled set of points (or, equivalently,

vectors) in the vector space

 Premise 1: Docs in the same class form a contiguous

regions of space

 Premise 2: Docs from different classes don’t overlap

(much)

 We define surfaces to delineate classes in the space

Sec.14.1

SLIDE 6

Documents in a vector space

(Figure: documents plotted in the vector space, grouped into classes Government, Science, Arts)

Sec.14.1

SLIDE 7

Test document of what class?

(Figure: a test document among the class regions Government, Science, Arts)

Sec.14.1

SLIDE 8

Test document of what class?

(Figure: a test document among the class regions Government, Science, Arts)

Is this similarity hypothesis true in general?

Our main topic today is how to find good separators

Sec.14.1


SLIDE 9

Relevance feedback relation to classification

 In relevance feedback, the user marks docs as relevant/non-relevant.
 Relevant/non-relevant can be viewed as classes or categories.
 For each doc, the user decides which of these two classes is correct.
 Relevance feedback is a form of text classification.

SLIDE 10

Rocchio for text classification

 Relevance feedback methods can be adapted for text categorization
 Relevance feedback can be viewed as 2-class classification
 Use standard tf-idf weighted vectors to represent text docs
 For training docs in each category, compute a prototype as centroid of the vectors of the training docs in the category.
 Prototype = centroid of members of class
 Assign test docs to the category with the closest prototype vector based on cosine similarity.

Sec.14.2

SLIDE 11

Definition of centroid

$$\vec{\mu}(c) = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{v}(d)$$

 $D_c$: set of docs that belong to class $c$
 $\vec{v}(d)$: vector space representation of doc $d$
 The centroid will in general not be a unit vector even when the inputs are unit vectors.

Sec.14.2

SLIDE 12

Rocchio algorithm

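The algorithm on this slide appears only as a figure in the original deck. As a stand-in, here is a minimal Python sketch of Rocchio training and classification as just described (numpy-based; all names are illustrative):

```python
import numpy as np

def train_rocchio(X, y):
    """Compute one centroid (prototype) per class.
    X: (n_docs, n_terms) tf-idf matrix; y: array of class labels."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def apply_rocchio(prototypes, d):
    """Assign doc vector d to the class whose prototype is most cosine-similar."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return max(prototypes, key=lambda c: cos(prototypes[c], d))
```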

SLIDE 13

Rocchio: example

 We will see that Rocchio finds linear boundaries between classes

(Figure: documents in classes Government, Science, Arts)

SLIDE 14

Illustration of Rocchio: text classification


Sec.14.2

SLIDE 15

Rocchio properties

 Forms a simple generalization of the examples in each class (a prototype).
 Prototype vector does not need to be normalized.
 Classification is based on similarity to class prototypes.
 Does not guarantee classifications are consistent with the given training data.

Sec.14.2

SLIDE 16

Rocchio anomaly

 Prototype models have problems with polymorphic (disjunctive) categories.

Sec.14.2

SLIDE 17

Rocchio classification: summary

 Rocchio forms a simple representation for each class:
 Centroid/prototype
 Classification is based on similarity to the prototype
 It does not guarantee that classifications are consistent with the given training data
 It is little used outside text classification
 It has been used quite effectively for text classification
 But in general worse than many other classifiers
 Rocchio does not handle nonconvex, multimodal classes correctly.

Sec.14.2

SLIDE 18

Linear classifiers

 Assumption: the classes are linearly separable.
 Classification decision: $\sum_{j=1}^{n} w_j x_j + w_0 > 0$?
 First, we only consider binary classifiers.
 Geometrically, this corresponds to a line (2D), a plane (3D), or a hyperplane (higher dimensionalities) as decision boundary.
 Find the parameters $w_0, w_1, \ldots, w_n$ based on the training set.
 Methods for finding these parameters: Perceptron, Rocchio, …

SLIDE 19

Separation by hyperplanes

 A simplifying assumption is linear separability:
 in 2 dimensions, can separate classes by a line
 in higher dimensions, need hyperplanes

Sec.14.4

SLIDE 20

Two-class Rocchio as a linear classifier

 Line or hyperplane defined by:

$$\sum_{j=1}^{N} w_j d_j = \vec{w}^{T}\vec{d} = w_0$$

 For Rocchio, set:

$$\vec{w} = \vec{\mu}(c_1) - \vec{\mu}(c_2)$$

$$w_0 = \frac{1}{2}\left(\|\vec{\mu}(c_1)\|^2 - \|\vec{\mu}(c_2)\|^2\right)$$

 Assign a doc $d$ to $c_1$ iff $\vec{w}^{T}\vec{d} \geq w_0$, i.e. iff $d$ is closer to $\vec{\mu}(c_1)$ than to $\vec{\mu}(c_2)$.

Sec.14.2

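A small sketch of these two formulas in code (an illustrative helper, not from the slides):

```python
import numpy as np

def rocchio_hyperplane(mu1, mu2):
    """Return (w, w0) such that doc vector d is assigned to class 1
    iff w @ d >= w0, i.e. iff d is closer to mu1 than to mu2."""
    w = mu1 - mu2
    w0 = 0.5 * (np.dot(mu1, mu1) - np.dot(mu2, mu2))
    return w, w0
```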

SLIDE 21

Linear classifier: example

 Class: “interest” (as in interest rate)
 Example features of a linear classifier:

  wi    ti              wi     ti
  0.70  prime          −0.71  dlrs
  0.67  rate           −0.35  world
  0.63  interest       −0.33  sees
  0.60  rates          −0.25  year
  0.46  discount       −0.24  group
  0.43  bundesbank     −0.24  dlr

 To classify, find dot product of feature vector and weights
Sec.14.4

SLIDE 22

Linear classifier: example

 Class “interest” in Reuters-21578
 $d_1$: “rate discount dlrs world”
 $d_2$: “prime dlrs”
 $\vec{w}^{T}\vec{d}_1 = 0.07 \Rightarrow$ $d_1$ is assigned to the “interest” class
 $\vec{w}^{T}\vec{d}_2 = -0.01 \Rightarrow$ $d_2$ is not assigned to this class
 (here $w_0 = 0$)
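A quick numeric check of this example (a sketch assuming a binary bag-of-words representation of the two documents; weights taken from the previous slide):

```python
w = {"prime": 0.70, "rate": 0.67, "interest": 0.63, "rates": 0.60,
     "discount": 0.46, "bundesbank": 0.43, "dlrs": -0.71, "world": -0.35,
     "sees": -0.33, "year": -0.25, "group": -0.24, "dlr": -0.24}

def score(doc, w0=0.0):
    # dot product of the doc's binary term vector with the weight vector
    return w0 + sum(w.get(t, 0.0) for t in doc.split())

print(round(score("rate discount dlrs world"), 2))  # 0.07  -> assigned to "interest"
print(round(score("prime dlrs"), 2))                # -0.01 -> not assigned
```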

SLIDE 23

Naïve Bayes as a linear classifier

Decide $c_1$ iff:

$$P(c_1) \prod_{j=1}^{N} P(t_j \mid c_1)^{tf_{j,d}} > P(c_2) \prod_{j=1}^{N} P(t_j \mid c_2)^{tf_{j,d}}$$

Taking logs, this is equivalent to:

$$\log P(c_1) + \sum_{j=1}^{N} tf_{j,d} \log P(t_j \mid c_1) > \log P(c_2) + \sum_{j=1}^{N} tf_{j,d} \log P(t_j \mid c_2)$$

i.e., a linear classifier with:

$$w_j = \log \frac{P(t_j \mid c_1)}{P(t_j \mid c_2)}, \qquad x_j = tf_{j,d}, \qquad w_0 = \log \frac{P(c_1)}{P(c_2)}$$
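A minimal sketch of extracting these linear weights from multinomial NB estimates (add-one smoothing and MLE priors are assumptions of this sketch; names are illustrative):

```python
import numpy as np

def nb_linear_weights(X, y, c1, c2, alpha=1.0):
    """X: (n_docs, n_terms) term-count matrix; y: class labels.
    Returns (w, w0) such that doc d goes to c1 iff w @ tf_d + w0 > 0."""
    X1, X2 = X[y == c1], X[y == c2]
    V = X.shape[1]
    p1 = (X1.sum(axis=0) + alpha) / (X1.sum() + alpha * V)  # P(t_j | c1)
    p2 = (X2.sum(axis=0) + alpha) / (X2.sum() + alpha * V)  # P(t_j | c2)
    w = np.log(p1 / p2)
    w0 = np.log(len(X1) / len(X2))  # log prior ratio
    return w, w0
```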

SLIDE 24

Linear programming / Perceptron

Find a, b, c, such that:
 ax + by > c for red points
 ax + by < c for blue points

Sec.14.4

SLIDE 25

Which hyperplane?

In general, lots of possible solutions for a,b,c.

Sec.14.4

SLIDE 26

Which hyperplane?

 Lots of possible solutions for a, b, c.
 Some methods find a separating hyperplane, but not the optimal one [according to some criterion of expected goodness]
 Which points should influence optimality?
 All points
 E.g., Rocchio
 Only “difficult points” close to decision boundary
 E.g., Support Vector Machine (SVM)

Sec.14.4

SLIDE 27

Support Vector Machine (SVM)

 SVMs maximize the margin around the separating hyperplane.
 A.k.a. large margin classifiers
 Solving SVMs is a quadratic programming problem
 Seen by many as the most successful current text classification method*

*but other discriminative methods often perform very similarly

(Figure: separating hyperplane with its support vectors; the maximized margin, contrasted with a narrower margin)

Sec. 15.1

SLIDE 28

Linear classifiers

 Many common text classifiers are linear classifiers
 Classifiers more powerful than linear often don’t perform better on text problems. Why?
 Despite the similarity of linear classifiers, noticeable performance differences between them
 For separable problems, there is an infinite number of separating hyperplanes.
 Different training methods pick different hyperplanes.
 Also different strategies for non-separable problems

Sec.14.4

SLIDE 29

Linear classifiers: binary and multiclass classification

 Consider 2-class problems
 Deciding between two classes, perhaps, government and non-government
 Multi-class
 How do we define (and find) the separating surface?
 How do we decide which region a test doc is in?

Sec.14.4

SLIDE 30

More than two classes

 One-of classification (multi-class classification)
 Classes are mutually exclusive.
 Each doc belongs to exactly one class
 Any-of classification
 Classes are not mutually exclusive.
 A doc can belong to 0, 1, or >1 classes.
 For simplicity, decompose into K binary problems
 Quite common for docs

Sec.14.5

SLIDE 31

Set of binary classifiers: any of

 Build a separator between each class and its complementary set (docs from all other classes).
 Given test doc, evaluate it for membership in each class.
 Apply decision criterion of classifiers independently
 It works, although considering dependencies between categories may be more accurate

Sec.14.5

SLIDE 32

Multi-class: set of binary classifiers

 Build a separator between each class and its complementary set (docs from all other classes).
 Given test doc, evaluate it for membership in each class.
 Assign doc to class with:
 maximum score
 maximum confidence
 maximum probability

Sec.14.5

SLIDE 33

k Nearest Neighbor Classification

 kNN = k Nearest Neighbor
 To classify a document d:
 Define k-neighborhood as the k nearest neighbors of d
 Pick the majority class label in the k-neighborhood

Sec.14.3

SLIDE 34

Nearest-Neighbor (1NN) classifier

 Learning phase:
 Just storing the representations of the training examples in D.
 Does not explicitly compute category prototypes.
 Testing instance $x$ (under 1NN):
 Compute similarity between $x$ and all examples in D.
 Assign $x$ the category of the most similar example in D.
 Rationale of kNN: contiguity hypothesis
 We expect a test doc $d$ to have the same label as the training docs located in the local region surrounding $d$.

Sec.14.3

SLIDE 35

Test Document = Science

(Figure: the test document falls in the Science region among the classes Government, Science, Arts)

Sec.14.1

SLIDE 36

k Nearest Neighbor (kNN) classifier

 1NN: subject to errors due to
 A single atypical example.
 Noise (i.e., an error) in the category label of a single training example.
 More robust alternative:
 find the k most-similar examples
 return the majority category of these k examples.

Sec.14.3

SLIDE 37

kNN example: k=6

(Figure: a test document with its k=6 neighborhood among the classes Government, Science, Arts; what is P(science | test doc)?)

Sec.14.3

SLIDE 38

kNN decision boundaries

(Figure: kNN decision boundaries between Government, Science, Arts)

Boundaries are in principle arbitrary surfaces (polyhedral)

kNN gives locally defined decision boundaries between classes – far away points do not influence each classification decision (unlike Rocchio, etc.)

Sec.14.3

SLIDE 39

1NN: Voronoi tessellation


The decision boundaries between classes are piecewise linear.

SLIDE 40

kNN algorithm

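The original slide shows the algorithm as an image; in its place, a minimal Python sketch of kNN with cosine similarity over tf-idf vectors (illustrative, not the slide's exact pseudocode):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, d, k=3):
    """Return the majority label among the k training docs most cosine-similar to d."""
    Xn = X_train / np.linalg.norm(X_train, axis=1, keepdims=True)
    sims = Xn @ (d / np.linalg.norm(d))
    top_k = np.argsort(sims)[-k:]  # indices of the k nearest neighbors
    return Counter(y_train[i] for i in top_k).most_common(1)[0][0]
```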

SLIDE 41

Time complexity of kNN

 kNN test time is proportional to the size of the training set!
 kNN is inefficient for very large training sets.

SLIDE 42

Similarity metrics

 Nearest neighbor method depends on a similarity (or distance) metric.
 Euclidean distance: simplest for continuous vector space.
 Hamming distance: simplest for binary instance space.
 number of feature values that differ
 For text, cosine similarity of tf-idf weighted vectors is typically most effective.

Sec.14.3

SLIDE 43

Illustration of kNN (k=3) for text vector space

Sec.14.3

SLIDE 44

3-NN vs. Rocchio

 Nearest Neighbor tends to handle polymorphic categories better than Rocchio/NB.

SLIDE 45

Nearest neighbor with inverted index

 Naively, finding nearest neighbors requires a linear search through the |D| docs in the collection
 Similar to determining the k best retrievals using the test doc as a query to a database of training docs.
 Use standard vector space inverted index methods to find the k nearest neighbors.
 Testing time: O(B|V_t|), where V_t is the vocabulary of the test doc and B is the average number of training docs in which at least one word of the test doc appears
 Typically B << |D| if a large list of stopwords is used.

Sec.14.3

SLIDE 46

A nonlinear problem

 Linear classifiers do badly on this task
 kNN will do very well (assuming enough training data)

Sec.14.4

SLIDE 47

Overfitting example


SLIDE 48

kNN: summary

 No training phase necessary
 Actually: we always preprocess the training set, so in reality training time of kNN is linear.
 May be expensive at test time
 kNN is very accurate if training set is large.
 In most cases it’s more accurate than linear classifiers
 Optimality result: asymptotically zero error if Bayes rate is zero.
 But kNN can be very inaccurate if training set is small.
 Scales well with large number of classes
 Don’t need to train C classifiers for C classes
 Classes can influence each other
 Small changes to one class can have ripple effect

Sec.14.3

SLIDE 49

Choosing the correct model capacity

Sec.14.6

SLIDE 50

Linear classifiers for doc classification

 We typically encounter high-dimensional spaces in text applications.
 With increased dimensionality, the likelihood of linear separability increases rapidly
 Many of the best-known text classification algorithms are linear.
 More powerful nonlinear learning methods are more sensitive to noise in the training data.
 Nonlinear learning methods sometimes perform better if the training set is large, but by no means in all cases.

SLIDE 51

Which classifier do I use for a given text classification problem?

 Is there a learning method that is optimal for all text classification problems?
 No, because there is a tradeoff between the complexity of the classifier and its performance on new data points.
 Factors to take into account:
 How much training data is available?
 How simple/complex is the problem?
 How noisy is the data?
 How stable is the problem over time?
 For an unstable problem, it’s better to use a simple and robust classifier.

SLIDE 52

Reuters collection


 Only about 10 out of 118 categories are large

Common categories (#train, #test)

  • Earn (2877, 1087)
  • Acquisitions (1650, 179)
  • Money-fx (538, 179)
  • Grain (433, 149)
  • Crude (389, 189)
  • Trade (369,119)
  • Interest (347, 131)
  • Ship (197, 89)
  • Wheat (212, 71)
  • Corn (182, 56)
SLIDE 53

Evaluating classification

 Evaluation must be done on test data that are independent of the training data
 training and test sets are disjoint.
 Measures: precision, recall, F1, accuracy
 F1 allows us to trade off precision against recall (harmonic mean of P and R).

SLIDE 54

Precision P and recall R

 Precision P = tp/(tp + fp)
 Recall R = tp/(tp + fn)

                                      actually in the class   not actually in the class
 predicted to be in the class                 tp                         fp
 predicted not to be in the class             fn                         tn

SLIDE 55

Good practice department: Make a confusion matrix

 This (i, j) entry means 53 of the docs actually in class i were put in class j by the classifier.
 In a perfect classification, only the diagonal has non-zero entries
 Look at common confusions and how they might be addressed

(Figure: confusion matrix $c_{ij}$; rows = actual class, columns = class assigned by classifier, with an example entry of 53)

Sec. 15.2.4

SLIDE 56

Per class evaluation measures

 Recall: fraction of docs in class i classified correctly:

$$\frac{c_{ii}}{\sum_j c_{ij}}$$

 Precision: fraction of docs assigned class i that are actually about class i:

$$\frac{c_{ii}}{\sum_j c_{ji}}$$

 Accuracy (1 − error rate): fraction of docs classified correctly:

$$\frac{\sum_i c_{ii}}{\sum_i \sum_j c_{ij}}$$

Sec. 15.2.4
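The same three formulas as a short sketch over a confusion matrix C (rows = actual class, columns = assigned class; illustrative code, not from the slides):

```python
import numpy as np

def per_class_metrics(C):
    """C[i, j] = number of docs of actual class i assigned to class j."""
    recall = np.diag(C) / C.sum(axis=1)     # c_ii / sum_j c_ij
    precision = np.diag(C) / C.sum(axis=0)  # c_ii / sum_j c_ji
    accuracy = np.trace(C) / C.sum()        # sum_i c_ii / sum_ij c_ij
    return precision, recall, accuracy
```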
SLIDE 57

Averaging: macro vs. micro

 We now have an evaluation measure (F1) for one class.
 But we also want a single number that shows aggregate performance over all classes

SLIDE 58

Micro- vs. Macro-Averaging

 If we have more than one class, how do we combine multiple performance measures into one quantity?
 Macroaveraging: compute performance for each class, then average.
 Compute F1 for each of the C classes
 Average these C numbers
 Microaveraging: collect decisions for all classes, aggregate them, and then compute the measure.
 Compute TP, FP, FN for each of the C classes
 Sum these C numbers (e.g., all TP to get aggregate TP)
 Compute F1 for aggregate TP, FP, FN

Sec. 15.2.4
SLIDE 59

Micro- vs. Macro-Averaging: Example

Class 1:
                 Truth: yes   Truth: no
Classifier: yes      10          10
Classifier: no       10         970

Class 2:
                 Truth: yes   Truth: no
Classifier: yes      90          10
Classifier: no       10         890

Pooled (micro-average) table:
                 Truth: yes   Truth: no
Classifier: yes     100          20
Classifier: no       20        1860

 Macroaveraged precision: (0.5 + 0.9)/2 = 0.7
 Microaveraged precision: 100/120 ≈ 0.83
 Microaveraged score is dominated by score on common classes

Sec. 15.2.4
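Reproducing these two numbers (a sketch with the (tp, fp) pairs hard-coded from the tables above):

```python
counts = [(10, 10), (90, 10)]  # (tp, fp) for class 1 and class 2

macro = sum(tp / (tp + fp) for tp, fp in counts) / len(counts)
micro = sum(tp for tp, _ in counts) / sum(tp + fp for tp, fp in counts)
print(macro, micro)  # 0.7 0.8333...
```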
SLIDE 60

Evaluation measure: F1
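The body of this slide is a figure in the original; the standard definition, the harmonic mean of P and R already mentioned on the evaluation slide, is:

$$F_1 = \frac{2PR}{P + R}$$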

SLIDE 61

Amount of data?

 Little amount of data
 Stick to less powerful classifiers (i.e., linear ones)
 Naïve Bayes should do well in such circumstances (Ng and Jordan 2002 NIPS)
 The practical answer is to get more labeled data as soon as you can
 Reasonable amount of data
 We can use all our clever classifiers
 Huge amount of data
 Expensive methods like SVMs (train time) or kNN (test time) are quite impractical
 Naïve Bayes can come back into its own again!
 Or other advanced methods with linear training/test complexity
 With enough data the choice of classifier may not matter much, and the best choice may be unclear

Sec. 15.3.1
SLIDE 62

How many categories?

 A few (well separated ones)?
 Easy!
 A zillion closely related ones?
 Think: Yahoo! Directory
 Quickly gets difficult!
 May need a hybrid automatic/manual solution

Sec. 15.3.2
SLIDE 63

Yahoo! Hierarchy

(Figure: fragment of the Yahoo! hierarchy under www.yahoo.com/Science — top-level categories such as agriculture, biology, physics, CS, space, with subcategories like dairy, crops, agronomy, forestry, botany, evolution, cell, magnetism, relativity, AI, HCI, craft, missions, courses)

SLIDE 64

How can one tweak performance?

 Aim to exploit any domain-specific useful features that give special meanings or that zone the data
 Aim to collapse things that would be treated as different but shouldn’t be.
 E.g., ISBNs, part numbers, chemical formulas
 Does putting in “hacks” help?
 You bet!
 Feature design and non-linear weighting is very important in the performance of real-world systems

Sec. 15.3.2
SLIDE 65

Upweighting

 You can get a lot of value by differentially weighting contributions from different document zones (a minimal sketch follows below).
 That is, you count a word as two instances when you see it in, say, the abstract
 Upweighting title words helps (Cohen & Singer 1996)
 Doubling the weighting on the title words is a good rule of thumb
 Upweighting the first sentence of each paragraph helps (Murata, 1999)
 Upweighting sentences that contain title words helps (Ko et al., 2002)

Sec. 15.3.2
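A minimal sketch of the simplest form of zone upweighting, repeating title tokens before counting (docs as (title, body) strings are an assumption of this sketch):

```python
from collections import Counter

def zone_weighted_counts(title, body, title_weight=2):
    """Count terms with title tokens repeated title_weight times,
    so tf (and hence tf-idf) upweights the title zone."""
    tokens = title.lower().split() * title_weight + body.lower().split()
    return Counter(tokens)
```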
SLIDE 66

Two techniques for zones

1. Have a completely separate set of features/parameters for different zones like the title
2. Use the same features (pooling/tying their parameters) across zones, but upweight the contribution of different zones

Commonly the second method is more successful: it costs you nothing in terms of sparsifying the data, but can give a very useful performance boost.
Which is best is a contingent fact about the data.

Sec. 15.3.2
SLIDE 67

Does stemming/lowercasing/… help?

 As always, it’s hard to tell, and empirical evaluation is normally the gold standard.
 But note that the role of tools like stemming is rather different for TextCat vs. IR:
 For IR, you want to improve recall
 For TextCat, with sufficient training data, stemming does no good.
 It only helps in compensating for data sparseness

Sec. 15.3.2
SLIDE 68

Resources

 IIR, Chapter 14

Ch. 14