Text classification II
CE-324: Modern Information Retrieval
Sharif University of Technology
- M. Soleymani
Fall 2017
Some slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
Outline: vector space classification, Rocchio, linear classifiers, kNN
You have an information need to monitor, say:
Unrest in the Niger delta region
You want to rerun an appropriate query periodically to find new news
You will be sent new documents that are found
I.e., it’s not ranking but classification (relevant vs. not relevant)
Long used by “information professionals”
A modern mass instantiation is Google Alerts
One component for each term (= word).
Terms are axes
Usually normalize vectors to unit length.
10,000+ dimensions, or even 100,000+
Docs are vectors in this space
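As a small sketch of the bullet above, a doc can be held as a sparse term-weight dict and scaled to unit length (the term names and weights here are made up for illustration):

```python
import math

def unit_normalize(vec):
    """Scale a sparse term-weight vector to unit (L2) length so that
    dot products between docs become cosine similarities."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else vec

doc = {"rate": 2.0, "prime": 1.0}   # hypothetical tf-idf weights
unit = unit_normalize(doc)          # now has L2 length 1
```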
Sec.14.1
This set corresponds to a labeled set of points (or, equivalently, vectors) in the vector space
Relevant/non-relevant can be viewed as classes or categories.
Relevance feedback can be viewed as 2-class classification
Use standard tf-idf weighted vectors to represent text docs
Prototype = centroid of members of class
Sec.14.2
Prototype vector does not need to be normalized.
Centroid/prototype: classification is based on similarity to the prototype
It has been used quite effectively for text classification, but is in general worse than many other classifiers
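A minimal sketch of Rocchio classification as described above, with centroid prototypes and cosine similarity (the classes, terms, and weights are invented for illustration):

```python
import math

def centroid(vectors):
    """Componentwise mean of a list of sparse term-weight vectors."""
    total = {}
    for v in vectors:
        for t, w in v.items():
            total[t] = total.get(t, 0.0) + w
    n = len(vectors)
    return {t: w / n for t, w in total.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rocchio_classify(prototypes, doc):
    """Assign doc to the class whose centroid/prototype is most similar."""
    return max(prototypes, key=lambda c: cosine(prototypes[c], doc))

# Hypothetical tiny training set of tf-idf-like vectors per class
train = {
    "china": [{"beijing": 1.0, "china": 1.0}, {"china": 1.0, "shanghai": 1.0}],
    "uk":    [{"london": 1.0, "uk": 1.0}, {"uk": 1.0, "parliament": 1.0}],
}
prototypes = {c: centroid(vs) for c, vs in train.items()}
label = rocchio_classify(prototypes, {"china": 1.0, "beijing": 0.5})
# label == "china"
```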
Geometrically, this corresponds to a line (2D), a plane (3D), or a hyperplane in higher dimensions
Methods for finding these parameters: Perceptron, Rocchio, …
In 2 dimensions, classes can be separated by a line
In higher dimensions, we need hyperplanes
Sec.14.4
Decision rule: assign the doc to the class iff ∑_{j=1}^{N} w_j x_j > θ
To classify, find dot product of feature vector and weights
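The dot-product decision above can be sketched in a few lines; the term weights for the class “interest” below are invented placeholders, not the actual learned weights:

```python
def linear_classify(weights, theta, doc):
    """Score = dot product of the weight vector with the doc's feature
    values; assign to the class iff the score exceeds threshold theta."""
    score = sum(w * doc.get(t, 0.0) for t, w in weights.items())
    return score > theta

# Hypothetical weights for class "interest" (positive terms pull the
# doc into the class, negative terms push it out)
w = {"rate": 0.67, "prime": 0.46, "interest": 0.37,
     "dlrs": -0.71, "world": -0.35}

in_class = linear_classify(w, 0.0, {"rate": 1.0, "prime": 1.0})   # score 1.13
out_class = linear_classify(w, 0.0, {"dlrs": 1.0, "world": 1.0})  # score -1.06
```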
Example: class “interest” in Reuters-21578
Test docs: d1: “rate discount dlrs world”, d2: “prime dlrs”
All points: e.g., Rocchio
Only “difficult points” close to the decision boundary: e.g., Support Vector Machine (SVM)
A.k.a. large margin classifiers
*but other discriminative methods can be used as well
For separable problems, there is an infinite number of separating hyperplanes
Different training methods pick different hyperplanes.
Also different strategies for non-separable problems
Deciding between two classes, perhaps, government and non-government
How do we define (and find) the separating surface? How do we decide which region a test doc is in?
Classes are mutually exclusive: each doc belongs to exactly one class
Classes are not mutually exclusive: a doc can belong to 0, 1, or >1 classes
For simplicity, decompose into K binary problems
Quite common for docs
Sec.14.5
It works, although considering dependencies between categories may help
Assign the doc to the class with:
maximum score
maximum confidence
maximum probability
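Given per-class scores from K binary classifiers, both decision modes above reduce to a few lines (class names and score values here are hypothetical):

```python
def one_of_classify(scores):
    """One-of (mutually exclusive classes): pick the single class
    whose binary classifier gives the maximum score."""
    return max(scores, key=scores.get)

def any_of_classify(scores, threshold=0.0):
    """Any-of: run each binary classifier independently and keep
    every class whose score clears its threshold."""
    return [c for c, s in scores.items() if s > threshold]

scores = {"grain": 1.3, "wheat": 0.4, "trade": -0.2}  # hypothetical scores
best = one_of_classify(scores)       # -> "grain"
labels = any_of_classify(scores)     # -> ["grain", "wheat"]
```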
Sec.14.3
Just storing the representations of the training examples in D. Does not explicitly compute category prototypes.
Compute similarity between x and all examples in D. Assign x the category of the most similar example in D.
We expect a test doc 𝑑 to have the same label as the training docs located in the local region surrounding 𝑑
A single atypical example
Noise (i.e., an error) in the category label of a single training example
Find the k most-similar examples
Return the majority category of these k examples
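The two steps above can be sketched directly; the labeled training vectors are invented for illustration:

```python
import math
from collections import Counter

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(train, doc, k=3):
    """Find the k training docs most similar to doc, then return the
    majority label among those k neighbors."""
    ranked = sorted(train, key=lambda ex: cosine(ex[0], doc), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled training vectors
train = [
    ({"china": 1.0, "beijing": 1.0}, "china"),
    ({"china": 1.0, "taipei": 1.0}, "china"),
    ({"uk": 1.0, "london": 1.0}, "uk"),
    ({"uk": 1.0, "parliament": 1.0}, "uk"),
]
label = knn_classify(train, {"china": 1.0}, k=3)   # -> "china"
```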
kNN is inefficient for very large training sets.
number of feature values that differ
Similar to determining the k best retrievals using the test doc as a query
Typically B << |D| if a large list of stopwords is used.
No training phase necessary
Actually: we always preprocess the training set, so in reality training time of kNN is linear
May be expensive at test time
kNN is very accurate if training set is large
In most cases it’s more accurate than linear classifiers
Optimality result: asymptotically zero error if Bayes rate is zero
But kNN can be very inaccurate if training set is small.
Scales well with large number of classes
Don’t need to train C classifiers for C classes
Classes can influence each other
Small changes to one class can have ripple effect
Sec.14.6
More powerful nonlinear learning methods are more sensitive to noise in the training data
No, because there is a tradeoff between the complexity of the classifier and the amount of training data needed to fit it reliably
How much training data is available?
How simple/complex is the problem?
How noisy is the data?
How stable is the problem over time?
For an unstable problem, it’s better to use a simple and robust classifier
Only about 10 out of 118 categories are large
training and test sets are disjoint.
F1 allows us to trade off precision against recall (harmonic mean of precision and recall)
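The harmonic mean in the bullet above is a one-liner; the second call shows how it punishes an imbalance between precision and recall:

```python
def f1(precision, recall):
    """F1 = harmonic mean of precision and recall; it is high only
    when both components are high."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

balanced = f1(0.5, 0.5)    # -> 0.5
skewed = f1(1.0, 0.2)      # low recall drags F1 down to about 0.33
```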
In a perfect classification, only the diagonal has non-zero entries
Look at common confusions and how they might be addressed
For a confusion matrix with entries c_{ij} = number of docs of true class i assigned to class j:
Accuracy = ∑_i c_{ii} / ∑_i ∑_j c_{ij}
Precision for class i = c_{ii} / ∑_j c_{ji}
Recall for class i = c_{ii} / ∑_j c_{ij}
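As a sketch, per-class precision and recall can be read straight off a confusion matrix, assuming rows are true classes and columns are predicted classes (the 2×2 matrix below is invented):

```python
def per_class_prf(c):
    """Given confusion matrix c where c[i][j] = # docs of true class i
    assigned to class j, return {class: (precision, recall)}."""
    n = len(c)
    stats = {}
    for i in range(n):
        tp = c[i][i]
        assigned_i = sum(c[j][i] for j in range(n))   # column sum
        true_i = sum(c[i][j] for j in range(n))       # row sum
        p = tp / assigned_i if assigned_i else 0.0
        r = tp / true_i if true_i else 0.0
        stats[i] = (p, r)
    return stats

c = [[9, 1],    # hypothetical 2-class confusion matrix
     [3, 7]]
stats = per_class_prf(c)
# class 0: P = 9/12, R = 9/10; class 1: P = 7/8, R = 7/10
```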
Macroaveraging: compute F1 for each of the C classes, then average these C numbers
Microaveraging: compute TP, FP, FN for each of the C classes
Sum these C numbers (e.g., all TP to get aggregate TP)
Compute F1 for aggregate TP, FP, FN
Class 1:
                  Truth: yes   Truth: no
Classifier: yes       10           10
Classifier: no        10          970

Class 2:
                  Truth: yes   Truth: no
Classifier: yes       90           10
Classifier: no        10          890

Pooled (sum of the two tables):
                  Truth: yes   Truth: no
Classifier: yes      100           20
Classifier: no        20         1860
Macroaveraged precision: (0.5 + 0.9)/2 = 0.7
Microaveraged precision: 100/120 ≈ 0.83
Microaveraged score is dominated by the score on common classes
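The two averages can be checked in a few lines, using the TP/FP counts from the slide’s contingency tables (the `precision` helper is just for this sketch):

```python
def precision(tp, fp):
    return tp / (tp + fp)

# Per-class counts from the two contingency tables above
classes = [
    {"tp": 10, "fp": 10},   # class 1: P = 10/20 = 0.5
    {"tp": 90, "fp": 10},   # class 2: P = 90/100 = 0.9
]

# Macroaverage: average the per-class precisions
macro = sum(precision(c["tp"], c["fp"]) for c in classes) / len(classes)

# Microaverage: pool the counts first, then compute precision once
tp = sum(c["tp"] for c in classes)   # 100
fp = sum(c["fp"] for c in classes)   # 20
micro = precision(tp, fp)            # 100/120
```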
stick to less powerful classifiers (i.e. linear ones)
Naïve Bayes should do well in such circumstances (Ng and Jordan 2002)
The practical answer is to get more labeled data as soon as you can
We can use all our clever classifiers
Expensive methods like SVMs (train time) or kNN (test time) are quite impractical
Naïve Bayes can come back into its own again!
Or other advanced methods with linear training/test complexity
With enough data the choice of classifier may not matter much, and the best choice may be unclear
Easy!
Think: Yahoo! Directory. Quickly gets difficult!
May need a hybrid automatic/manual solution
E.g., ISBNs, part numbers, chemical formulas
You bet!
Feature design and non-linear weighting is very important in the performance of real-world systems
Upweighting title words helps (Cohen & Singer 1996)
Doubling the weighting on the title words is a good rule of thumb
Upweighting the first sentence of each paragraph helps
Upweighting sentences that contain title words helps (Ko et al.)
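The rule of thumb above (double the weight of title words) can be sketched as a simple weighted-count step; the boost factor and token lists are illustrative:

```python
def weighted_counts(title_tokens, body_tokens, title_boost=2.0):
    """Term counts where title-word occurrences are upweighted,
    a rule-of-thumb for zoned documents (title vs. body)."""
    counts = {}
    for t in body_tokens:
        counts[t] = counts.get(t, 0.0) + 1.0
    for t in title_tokens:
        counts[t] = counts.get(t, 0.0) + title_boost
    return counts

counts = weighted_counts(["trade"], ["trade", "wheat", "trade"])
# "trade": 2 body occurrences + a boost of 2 for the title -> 4.0
```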
For IR, you want to improve recall
For text classification, it only helps in compensating for data sparseness