SLIDE 1

Text classification II

CE-324: Modern Information Retrieval

Sharif University of Technology

  • M. Soleymani

Fall 2017

Some slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

SLIDE 2

Outline

 Vector space classification

 Rocchio
 Linear classifiers
 kNN

SLIDE 3

Standing queries

 The path from IR to text classification:
 You have an information need to monitor, say:
 Unrest in the Niger delta region
 You want to rerun an appropriate query periodically to find new news items on this topic
 You will be sent new documents that are found
 I.e., it’s not ranking but classification (relevant vs. not relevant)
 Such queries are called standing queries
 Long used by “information professionals”
 A modern mass instantiation is Google Alerts
 Standing queries are (hand-written) text classifiers

Ch. 13

SLIDE 4


Recall: vector space representation

 Each doc is a vector

 One component for each term (= word).

 Terms are axes

 Usually normalize vectors to unit length.

 High-dimensional vector space:

 10,000+ dimensions, or even 100,000+  Docs are vectors in this space

 How can we do classification in this space?

Sec.14.1

SLIDE 5

Classification using vector spaces

 Training set: a set of docs, each labeled with its class (e.g.,

topic)

 This set corresponds to a labeled set of points (or, equivalently,

vectors) in the vector space

 Premise 1: Docs in the same class form a contiguous

regions of space

 Premise 2: Docs from different classes don’t overlap

(much)

 We define surfaces to delineate classes in the space

Sec.14.1

SLIDE 6

Documents in a vector space

(Figure: documents plotted in the vector space, grouped into classes Government, Science, Arts)

Sec.14.1

SLIDE 7

Test document of what class?

(Figure: a test document among the class regions Government, Science, Arts)

Sec.14.1

SLIDE 8

Test document of what class?

(Figure: a test document among the class regions Government, Science, Arts)

Is this similarity hypothesis true in general?

Our main topic today is how to find good separators

Sec.14.1


SLIDE 9

Relevance feedback relation to classification

 In relevance feedback, the user marks docs as relevant/non-relevant.
 Relevant/non-relevant can be viewed as classes or categories.
 For each doc, the user decides which of these two classes is correct.
 Relevance feedback is a form of text classification.

SLIDE 10

Rocchio for text classification

 Relevance feedback methods can be adapted for text categorization
 Relevance feedback can be viewed as 2-class classification
 Use standard tf-idf weighted vectors to represent text docs
 For training docs in each category, compute a prototype as centroid of the vectors of the training docs in the category.
 Prototype = centroid of members of class
 Assign test docs to the category with the closest prototype vector based on cosine similarity.

Sec.14.2

SLIDE 11

Definition of centroid

$$\vec{\mu}(c) = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{v}(d)$$

 $D_c$: set of docs that belong to class $c$
 $\vec{v}(d)$: vector space representation of doc $d$
 The centroid will in general not be a unit vector even when the inputs are unit vectors.

Sec.14.2

SLIDE 12

Rocchio algorithm

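The algorithm on this slide appears only as a figure in the original deck. As a stand-in, here is a minimal Python sketch of Rocchio training and classification as just described (numpy-based; all names are illustrative):

```python
import numpy as np

def train_rocchio(X, y):
    """Compute one centroid (prototype) per class.
    X: (n_docs, n_terms) tf-idf matrix; y: array of class labels."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def apply_rocchio(prototypes, d):
    """Assign doc vector d to the class whose prototype is most cosine-similar."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return max(prototypes, key=lambda c: cos(prototypes[c], d))
```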

SLIDE 13

Rocchio: example

 We will see that Rocchio finds linear boundaries between classes

(Figure: documents in classes Government, Science, Arts)

SLIDE 14

Illustration of Rocchio: text classification


Sec.14.2

SLIDE 15

Rocchio properties

 Forms a simple generalization of the examples in each class (a prototype).
 Prototype vector does not need to be normalized.
 Classification is based on similarity to class prototypes.
 Does not guarantee classifications are consistent with the given training data.

Sec.14.2

SLIDE 16

Rocchio anomaly

 Prototype models have problems with polymorphic (disjunctive) categories.

Sec.14.2

SLIDE 17

Rocchio classification: summary

 Rocchio forms a simple representation for each class:
 Centroid/prototype
 Classification is based on similarity to the prototype
 It does not guarantee that classifications are consistent with the given training data
 It is little used outside text classification
 It has been used quite effectively for text classification
 But in general worse than many other classifiers
 Rocchio does not handle nonconvex, multimodal classes correctly.

Sec.14.2

SLIDE 18

Linear classifiers

 Assumption: the classes are linearly separable.
 Classification decision: $\sum_{j=1}^{n} w_j x_j + w_0 > 0$?
 First, we only consider binary classifiers.
 Geometrically, this corresponds to a line (2D), a plane (3D), or a hyperplane (higher dimensionalities) as decision boundary.
 Find the parameters $w_0, w_1, \ldots, w_n$ based on the training set.
 Methods for finding these parameters: Perceptron, Rocchio, …

SLIDE 19

Separation by hyperplanes

 A simplifying assumption is linear separability:
 in 2 dimensions, can separate classes by a line
 in higher dimensions, need hyperplanes

Sec.14.4

SLIDE 20

Two-class Rocchio as a linear classifier

 Line or hyperplane defined by:

$$\sum_{j=1}^{N} w_j d_j = \vec{w}^{T}\vec{d} = w_0$$

 For Rocchio, set:

$$\vec{w} = \vec{\mu}(c_1) - \vec{\mu}(c_2)$$

$$w_0 = \frac{1}{2}\left(\|\vec{\mu}(c_1)\|^2 - \|\vec{\mu}(c_2)\|^2\right)$$

 Assign a doc $d$ to $c_1$ iff $\vec{w}^{T}\vec{d} \geq w_0$, i.e. iff $d$ is closer to $\vec{\mu}(c_1)$ than to $\vec{\mu}(c_2)$.

Sec.14.2

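A small sketch of these two formulas in code (an illustrative helper, not from the slides):

```python
import numpy as np

def rocchio_hyperplane(mu1, mu2):
    """Return (w, w0) such that doc vector d is assigned to class 1
    iff w @ d >= w0, i.e. iff d is closer to mu1 than to mu2."""
    w = mu1 - mu2
    w0 = 0.5 * (np.dot(mu1, mu1) - np.dot(mu2, mu2))
    return w, w0
```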

SLIDE 21

Linear classifier: example

 Class: “interest” (as in interest rate)
 Example features of a linear classifier:

  wi    ti              wi     ti
  0.70  prime          −0.71  dlrs
  0.67  rate           −0.35  world
  0.63  interest       −0.33  sees
  0.60  rates          −0.25  year
  0.46  discount       −0.24  group
  0.43  bundesbank     −0.24  dlr

 To classify, find dot product of feature vector and weights
Sec.14.4

SLIDE 22

Linear classifier: example

 Class “interest” in Reuters-21578
 $d_1$: “rate discount dlrs world”
 $d_2$: “prime dlrs”
 $\vec{w}^{T}\vec{d}_1 = 0.07 \Rightarrow$ $d_1$ is assigned to the “interest” class
 $\vec{w}^{T}\vec{d}_2 = -0.01 \Rightarrow$ $d_2$ is not assigned to this class
 (here $w_0 = 0$)
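A quick numeric check of this example (a sketch assuming a binary bag-of-words representation of the two documents; weights taken from the previous slide):

```python
w = {"prime": 0.70, "rate": 0.67, "interest": 0.63, "rates": 0.60,
     "discount": 0.46, "bundesbank": 0.43, "dlrs": -0.71, "world": -0.35,
     "sees": -0.33, "year": -0.25, "group": -0.24, "dlr": -0.24}

def score(doc, w0=0.0):
    # dot product of the doc's binary term vector with the weight vector
    return w0 + sum(w.get(t, 0.0) for t in doc.split())

print(round(score("rate discount dlrs world"), 2))  # 0.07  -> assigned to "interest"
print(round(score("prime dlrs"), 2))                # -0.01 -> not assigned
```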

SLIDE 23

Naïve Bayes as a linear classifier

Decide $c_1$ iff:

$$P(c_1) \prod_{j=1}^{N} P(t_j \mid c_1)^{tf_{j,d}} > P(c_2) \prod_{j=1}^{N} P(t_j \mid c_2)^{tf_{j,d}}$$

Taking logs, this is equivalent to:

$$\log P(c_1) + \sum_{j=1}^{N} tf_{j,d} \log P(t_j \mid c_1) > \log P(c_2) + \sum_{j=1}^{N} tf_{j,d} \log P(t_j \mid c_2)$$

i.e., a linear classifier with:

$$w_j = \log \frac{P(t_j \mid c_1)}{P(t_j \mid c_2)}, \qquad x_j = tf_{j,d}, \qquad w_0 = \log \frac{P(c_1)}{P(c_2)}$$
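A minimal sketch of extracting these linear weights from multinomial NB estimates (add-one smoothing and MLE priors are assumptions of this sketch; names are illustrative):

```python
import numpy as np

def nb_linear_weights(X, y, c1, c2, alpha=1.0):
    """X: (n_docs, n_terms) term-count matrix; y: class labels.
    Returns (w, w0) such that doc d goes to c1 iff w @ tf_d + w0 > 0."""
    X1, X2 = X[y == c1], X[y == c2]
    V = X.shape[1]
    p1 = (X1.sum(axis=0) + alpha) / (X1.sum() + alpha * V)  # P(t_j | c1)
    p2 = (X2.sum(axis=0) + alpha) / (X2.sum() + alpha * V)  # P(t_j | c2)
    w = np.log(p1 / p2)
    w0 = np.log(len(X1) / len(X2))  # log prior ratio
    return w, w0
```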

SLIDE 24

Linear programming / Perceptron

Find a, b, c, such that:
 ax + by > c for red points
 ax + by < c for blue points

Sec.14.4

SLIDE 25

Which hyperplane?

In general, lots of possible solutions for a,b,c.

Sec.14.4

SLIDE 26

Which hyperplane?

 Lots of possible solutions for a, b, c.
 Some methods find a separating hyperplane, but not the optimal one [according to some criterion of expected goodness]
 Which points should influence optimality?
 All points
 E.g., Rocchio
 Only “difficult points” close to decision boundary
 E.g., Support Vector Machine (SVM)

Sec.14.4

SLIDE 27

Support Vector Machine (SVM)

 SVMs maximize the margin around the separating hyperplane.
 A.k.a. large margin classifiers
 Solving SVMs is a quadratic programming problem
 Seen by many as the most successful current text classification method*

*but other discriminative methods often perform very similarly

(Figure: separating hyperplane with its support vectors; the maximized margin, contrasted with a narrower margin)

Sec. 15.1

SLIDE 28

Linear classifiers

 Many common text classifiers are linear classifiers
 Classifiers more powerful than linear often don’t perform better on text problems. Why?
 Despite the similarity of linear classifiers, noticeable performance differences between them
 For separable problems, there is an infinite number of separating hyperplanes.
 Different training methods pick different hyperplanes.
 Also different strategies for non-separable problems

Sec.14.4

SLIDE 29

Linear classifiers: binary and multiclass classification

 Consider 2-class problems
 Deciding between two classes, perhaps, government and non-government
 Multi-class
 How do we define (and find) the separating surface?
 How do we decide which region a test doc is in?

Sec.14.4

SLIDE 30

More than two classes

 One-of classification (multi-class classification)
 Classes are mutually exclusive.
 Each doc belongs to exactly one class
 Any-of classification
 Classes are not mutually exclusive.
 A doc can belong to 0, 1, or >1 classes.
 For simplicity, decompose into K binary problems
 Quite common for docs

Sec.14.5

SLIDE 31

Set of binary classifiers: any of

 Build a separator between each class and its complementary set (docs from all other classes).
 Given test doc, evaluate it for membership in each class.
 Apply decision criterion of classifiers independently
 It works, although considering dependencies between categories may be more accurate

Sec.14.5

SLIDE 32

Multi-class: set of binary classifiers

 Build a separator between each class and its complementary set (docs from all other classes).
 Given test doc, evaluate it for membership in each class.
 Assign doc to class with:
 maximum score
 maximum confidence
 maximum probability

Sec.14.5

SLIDE 33

k Nearest Neighbor Classification

 kNN = k Nearest Neighbor
 To classify a document d:
 Define k-neighborhood as the k nearest neighbors of d
 Pick the majority class label in the k-neighborhood

Sec.14.3

SLIDE 34

Nearest-Neighbor (1NN) classifier

 Learning phase:
 Just storing the representations of the training examples in D.
 Does not explicitly compute category prototypes.
 Testing instance $x$ (under 1NN):
 Compute similarity between $x$ and all examples in D.
 Assign $x$ the category of the most similar example in D.
 Rationale of kNN: contiguity hypothesis
 We expect a test doc $d$ to have the same label as the training docs located in the local region surrounding $d$.

Sec.14.3

SLIDE 35

Test Document = Science

(Figure: the test document falls in the Science region among the classes Government, Science, Arts)

Sec.14.1

SLIDE 36

k Nearest Neighbor (kNN) classifier

 1NN: subject to errors due to
 A single atypical example.
 Noise (i.e., an error) in the category label of a single training example.
 More robust alternative:
 find the k most-similar examples
 return the majority category of these k examples.

Sec.14.3

SLIDE 37

kNN example: k=6

(Figure: a test document with its k=6 neighborhood among the classes Government, Science, Arts; what is P(science | test doc)?)

Sec.14.3

SLIDE 38

kNN decision boundaries

(Figure: kNN decision boundaries between Government, Science, Arts)

Boundaries are in principle arbitrary surfaces (polyhedral)

kNN gives locally defined decision boundaries between classes – far away points do not influence each classification decision (unlike Rocchio, etc.)

Sec.14.3

SLIDE 39

1NN: Voronoi tessellation


The decision boundaries between classes are piecewise linear.

SLIDE 40

kNN algorithm

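The original slide shows the algorithm as an image; in its place, a minimal Python sketch of kNN with cosine similarity over tf-idf vectors (illustrative, not the slide's exact pseudocode):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, d, k=3):
    """Return the majority label among the k training docs most cosine-similar to d."""
    Xn = X_train / np.linalg.norm(X_train, axis=1, keepdims=True)
    sims = Xn @ (d / np.linalg.norm(d))
    top_k = np.argsort(sims)[-k:]  # indices of the k nearest neighbors
    return Counter(y_train[i] for i in top_k).most_common(1)[0][0]
```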

SLIDE 41

Time complexity of kNN

 kNN test time is proportional to the size of the training set!
 kNN is inefficient for very large training sets.

SLIDE 42

Similarity metrics

 Nearest neighbor method depends on a similarity (or distance) metric.
 Euclidean distance: simplest for continuous vector space.
 Hamming distance: simplest for binary instance space.
 number of feature values that differ
 For text, cosine similarity of tf-idf weighted vectors is typically most effective.

Sec.14.3

SLIDE 43

Illustration of kNN (k=3) for text vector space

Sec.14.3

SLIDE 44

3-NN vs. Rocchio

 Nearest Neighbor tends to handle polymorphic categories better than Rocchio/NB.

SLIDE 45

Nearest neighbor with inverted index

 Naively, finding nearest neighbors requires a linear search through the |D| docs in the collection
 Similar to determining the k best retrievals using the test doc as a query to a database of training docs.
 Use standard vector space inverted index methods to find the k nearest neighbors.
 Testing time: O(B|V_t|), where V_t is the vocabulary of the test doc and B is the average number of training docs in which at least one word of the test doc appears
 Typically B << |D| if a large list of stopwords is used.

Sec.14.3

SLIDE 46

A nonlinear problem

 Linear classifiers do badly on this task
 kNN will do very well (assuming enough training data)

Sec.14.4

SLIDE 47

Overfitting example


SLIDE 48

kNN: summary

 No training phase necessary
 Actually: we always preprocess the training set, so in reality training time of kNN is linear.
 May be expensive at test time
 kNN is very accurate if training set is large.
 In most cases it’s more accurate than linear classifiers
 Optimality result: asymptotically zero error if Bayes rate is zero.
 But kNN can be very inaccurate if training set is small.
 Scales well with large number of classes
 Don’t need to train C classifiers for C classes
 Classes can influence each other
 Small changes to one class can have ripple effect

Sec.14.3

SLIDE 49

Choosing the correct model capacity

Sec.14.6

SLIDE 50

Linear classifiers for doc classification

 We typically encounter high-dimensional spaces in text applications.
 With increased dimensionality, the likelihood of linear separability increases rapidly
 Many of the best-known text classification algorithms are linear.
 More powerful nonlinear learning methods are more sensitive to noise in the training data.
 Nonlinear learning methods sometimes perform better if the training set is large, but by no means in all cases.

SLIDE 51

Which classifier do I use for a given text classification problem?

 Is there a learning method that is optimal for all text classification problems?
 No, because there is a tradeoff between the complexity of the classifier and its performance on new data points.
 Factors to take into account:
 How much training data is available?
 How simple/complex is the problem?
 How noisy is the data?
 How stable is the problem over time?
 For an unstable problem, it’s better to use a simple and robust classifier.

SLIDE 52

Reuters collection


 Only about 10 out of 118 categories are large

Common categories (#train, #test)

  • Earn (2877, 1087)
  • Acquisitions (1650, 179)
  • Money-fx (538, 179)
  • Grain (433, 149)
  • Crude (389, 189)
  • Trade (369,119)
  • Interest (347, 131)
  • Ship (197, 89)
  • Wheat (212, 71)
  • Corn (182, 56)
SLIDE 53

Evaluating classification

 Evaluation must be done on test data that are independent of the training data
 training and test sets are disjoint.
 Measures: precision, recall, F1, accuracy
 F1 allows us to trade off precision against recall (harmonic mean of P and R).

SLIDE 54

Precision P and recall R

 Precision P = tp/(tp + fp)
 Recall R = tp/(tp + fn)

                                      actually in the class   not actually in the class
 predicted to be in the class                 tp                         fp
 predicted not to be in the class             fn                         tn

SLIDE 55

Good practice department: Make a confusion matrix

 This (i, j) entry means 53 of the docs actually in class i were put in class j by the classifier.
 In a perfect classification, only the diagonal has non-zero entries
 Look at common confusions and how they might be addressed

(Figure: confusion matrix $c_{ij}$; rows = actual class, columns = class assigned by classifier, with an example entry of 53)

Sec. 15.2.4

SLIDE 56

Per class evaluation measures

 Recall: fraction of docs in class i classified correctly:

$$\frac{c_{ii}}{\sum_j c_{ij}}$$

 Precision: fraction of docs assigned class i that are actually about class i:

$$\frac{c_{ii}}{\sum_j c_{ji}}$$

 Accuracy (1 − error rate): fraction of docs classified correctly:

$$\frac{\sum_i c_{ii}}{\sum_i \sum_j c_{ij}}$$

Sec. 15.2.4
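The same three formulas as a short sketch over a confusion matrix C (rows = actual class, columns = assigned class; illustrative code, not from the slides):

```python
import numpy as np

def per_class_metrics(C):
    """C[i, j] = number of docs of actual class i assigned to class j."""
    recall = np.diag(C) / C.sum(axis=1)     # c_ii / sum_j c_ij
    precision = np.diag(C) / C.sum(axis=0)  # c_ii / sum_j c_ji
    accuracy = np.trace(C) / C.sum()        # sum_i c_ii / sum_ij c_ij
    return precision, recall, accuracy
```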
SLIDE 57

Averaging: macro vs. micro

 We now have an evaluation measure (F1) for one class.
 But we also want a single number that shows aggregate performance over all classes

SLIDE 58

Micro- vs. Macro-Averaging

 If we have more than one class, how do we combine multiple performance measures into one quantity?
 Macroaveraging: compute performance for each class, then average.
 Compute F1 for each of the C classes
 Average these C numbers
 Microaveraging: collect decisions for all classes, aggregate them, and then compute the measure.
 Compute TP, FP, FN for each of the C classes
 Sum these C numbers (e.g., all TP to get aggregate TP)
 Compute F1 for aggregate TP, FP, FN

Sec. 15.2.4
SLIDE 59

Micro- vs. Macro-Averaging: Example

Class 1:
                 Truth: yes   Truth: no
Classifier: yes      10          10
Classifier: no       10         970

Class 2:
                 Truth: yes   Truth: no
Classifier: yes      90          10
Classifier: no       10         890

Pooled (micro-average) table:
                 Truth: yes   Truth: no
Classifier: yes     100          20
Classifier: no       20        1860

 Macroaveraged precision: (0.5 + 0.9)/2 = 0.7
 Microaveraged precision: 100/120 ≈ 0.83
 Microaveraged score is dominated by score on common classes

Sec. 15.2.4
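Reproducing these two numbers (a sketch with the (tp, fp) pairs hard-coded from the tables above):

```python
counts = [(10, 10), (90, 10)]  # (tp, fp) for class 1 and class 2

macro = sum(tp / (tp + fp) for tp, fp in counts) / len(counts)
micro = sum(tp for tp, _ in counts) / sum(tp + fp for tp, fp in counts)
print(macro, micro)  # 0.7 0.8333...
```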
SLIDE 60

Evaluation measure: F1
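The body of this slide is a figure in the original; the standard definition, the harmonic mean of P and R already mentioned on the evaluation slide, is:

$$F_1 = \frac{2PR}{P + R}$$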

SLIDE 61

Amount of data?

 Little amount of data
 Stick to less powerful classifiers (i.e., linear ones)
 Naïve Bayes should do well in such circumstances (Ng and Jordan 2002 NIPS)
 The practical answer is to get more labeled data as soon as you can
 Reasonable amount of data
 We can use all our clever classifiers
 Huge amount of data
 Expensive methods like SVMs (train time) or kNN (test time) are quite impractical
 Naïve Bayes can come back into its own again!
 Or other advanced methods with linear training/test complexity
 With enough data the choice of classifier may not matter much, and the best choice may be unclear

Sec. 15.3.1
SLIDE 62

How many categories?

 A few (well separated ones)?
 Easy!
 A zillion closely related ones?
 Think: Yahoo! Directory
 Quickly gets difficult!
 May need a hybrid automatic/manual solution

Sec. 15.3.2
SLIDE 63

Yahoo! Hierarchy

(Figure: fragment of the Yahoo! hierarchy under www.yahoo.com/Science — top-level categories such as agriculture, biology, physics, CS, space, with subcategories like dairy, crops, agronomy, forestry, botany, evolution, cell, magnetism, relativity, AI, HCI, craft, missions, courses)

SLIDE 64

How can one tweak performance?

 Aim to exploit any domain-specific useful features that give special meanings or that zone the data
 Aim to collapse things that would be treated as different but shouldn’t be.
 E.g., ISBNs, part numbers, chemical formulas
 Does putting in “hacks” help?
 You bet!
 Feature design and non-linear weighting is very important in the performance of real-world systems

Sec. 15.3.2
SLIDE 65

Upweighting

 You can get a lot of value by differentially weighting contributions from different document zones (a minimal sketch follows below).
 That is, you count a word as two instances when you see it in, say, the abstract
 Upweighting title words helps (Cohen & Singer 1996)
 Doubling the weighting on the title words is a good rule of thumb
 Upweighting the first sentence of each paragraph helps (Murata, 1999)
 Upweighting sentences that contain title words helps (Ko et al., 2002)

Sec. 15.3.2
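A minimal sketch of the simplest form of zone upweighting, repeating title tokens before counting (docs as (title, body) strings are an assumption of this sketch):

```python
from collections import Counter

def zone_weighted_counts(title, body, title_weight=2):
    """Count terms with title tokens repeated title_weight times,
    so tf (and hence tf-idf) upweights the title zone."""
    tokens = title.lower().split() * title_weight + body.lower().split()
    return Counter(tokens)
```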
SLIDE 66

Two techniques for zones

1. Have a completely separate set of features/parameters for different zones like the title
2. Use the same features (pooling/tying their parameters) across zones, but upweight the contribution of different zones

Commonly the second method is more successful: it costs you nothing in terms of sparsifying the data, but can give a very useful performance boost.
Which is best is a contingent fact about the data.

Sec. 15.3.2
SLIDE 67

Does stemming/lowercasing/… help?

 As always, it’s hard to tell, and empirical evaluation is normally the gold standard.
 But note that the role of tools like stemming is rather different for TextCat vs. IR:
 For IR, you want to improve recall
 For TextCat, with sufficient training data, stemming does no good.
 It only helps in compensating for data sparseness

Sec. 15.3.2
SLIDE 68

Resources

 IIR, Chapter 14

Ch. 14