INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/
IR 20/25: Linear Classifiers and Flat clustering
Paul Ginsparg
Cornell University, Ithaca, NY
10 Nov 2011
180/1000 / (180/1000 + 20/1000)
1. Intro: vector space classification
2. Very simple vector space classification: Rocchio
3. kNN
The training set is given as part of the input in text classification. It is interactively created in relevance feedback.
The centroid is the average of all documents in the class.
µ(c_j) = (1/|D_j|) Σ_{d ∈ D_j} v(d)
We can interpret the centroid as the prototype of the class.
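As a concrete illustration, a minimal Rocchio classifier can be sketched as below (toy 2-D vectors and class names invented for the example; the slides' setting would use tf-idf document vectors, and Euclidean distance stands in for whatever distance/similarity measure is chosen):

```python
import math

def centroid(vectors):
    """Componentwise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[m] for v in vectors) / n for m in range(len(vectors[0]))]

def rocchio_train(labeled_docs):
    """labeled_docs: dict mapping class label -> list of document vectors.
    Training = computing one centroid (prototype) per class."""
    return {label: centroid(docs) for label, docs in labeled_docs.items()}

def rocchio_classify(centroids, x):
    """Assign x to the class whose centroid is nearest."""
    return min(centroids, key=lambda label: math.dist(centroids[label], x))

# Toy example: two well-separated classes in 2D.
training = {
    "china": [[1.0, 0.1], [0.9, 0.0], [1.1, 0.2]],
    "uk":    [[0.0, 1.0], [0.1, 0.9], [0.2, 1.1]],
}
prototypes = rocchio_train(training)
print(rocchio_classify(prototypes, [0.95, 0.1]))   # "china"
```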
(Figure: documents of two classes, a and b, in the vector space.)
We expect a test document d to have the same label as the training documents located in the local region surrounding d.
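That intuition is easy to sketch in code (a toy illustration with invented 2-D points, Euclidean distance, and simple majority voting; not the slides' Reuters setup):

```python
import math
from collections import Counter

def knn_classify(k, training, x):
    """training: list of (vector, label) pairs.
    Return the majority label among the k nearest neighbors of x."""
    neighbors = sorted(training, key=lambda pair: math.dist(pair[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

training = [([1.0, 0.0], "a"), ([0.9, 0.1], "a"), ([0.8, 0.0], "a"),
            ([0.0, 1.0], "b"), ([0.1, 0.9], "b")]
print(knn_classify(3, training, [0.7, 0.2]))   # "a": all 3 nearest neighbors are "a"
```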
But linear preprocessing of documents is as expensive as training Naive Bayes. You will always preprocess the training set, so in reality the training time of kNN is linear in the size of the training set.
A linear classifier computes a linear combination Σᵢ wᵢxᵢ of the feature values.
Classification decision: Σᵢ wᵢxᵢ > θ?
Classification decision based on majority of k nearest neighbors. The decision boundaries between classes are piecewise linear . . . but kNN is not a linear classifier that can be described as Σ_{i=1}^{M} wᵢxᵢ = θ.
This is for the class interest in Reuters-21578. For simplicity, assume a simple 0/1 vector representation.

ti           wi      x1i  x2i     ti      wi      x1i  x2i
prime        0.70    0    1       dlrs    -0.71   1    1
rate         0.67    1    0       world   -0.35   1    0
interest     0.63    0    0       sees    -0.33   0    0
rates        0.60    0    0       year    -0.25   0    0
discount     0.46    1    0       group   -0.24   0    0
bundesbank   0.43    0    0       dlr     -0.24   0    0

x1: "rate discount dlrs world"; x2: "prime dlrs". Exercise: Which class is x1 assigned to? Which class is x2 assigned to? We assign document d1 "rate discount dlrs world" to interest since w · x1 = 0.67 · 1 + 0.46 · 1 + (−0.71) · 1 + (−0.35) · 1 = 0.07 > 0 = b. We assign d2 "prime dlrs" to the complement class (not in interest) since w · x2 = 0.70 · 1 + (−0.71) · 1 = −0.01 ≤ b. (dlr and world have negative weights because they are indicators for the competing class currency.)
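The arithmetic in this worked example can be checked directly (the weight dictionary below transcribes only the terms occurring in the two example documents, with b = 0 as above):

```python
# Weights for class "interest" (terms relevant to the two example documents).
w = {"prime": 0.70, "rate": 0.67, "discount": 0.46, "dlrs": -0.71, "world": -0.35}
b = 0.0

def score(doc):
    """Dot product w . x for a 0/1 term-presence vector."""
    return sum(w.get(t, 0.0) for t in doc.split())

d1 = "rate discount dlrs world"
d2 = "prime dlrs"
print(round(score(d1), 2), score(d1) > b)   # 0.07 True  -> assign to "interest"
print(round(score(d2), 2), score(d2) > b)   # -0.01 False -> complement class
```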
Huge differences in performance on test documents
How much training data is available?
How simple/complex is the problem (linear vs. nonlinear decision boundary)?
How noisy is the problem?
How stable is the problem over time?
For an unstable problem, it’s better to use a simple and robust classifier.
Classes are mutually exclusive. Each document belongs to exactly one class. Example: language of a document (assumption: no document contains multiple languages)
Run each classifier separately.
Rank classifiers (e.g., according to score).
Pick the class with the highest score.
A document can be a member of 0, 1, or many classes.
A decision on one class leaves decisions open on all other classes.
A type of "independence" (but not statistical independence).
Example: topic classification.
Usually: make decisions on the region, on the subject area, on the industry, and so on "independently".
Simply run each two-class classifier separately on the test document and assign the document accordingly.
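A sketch of that procedure, with hypothetical per-class weight vectors and thresholds (the numbers are invented for illustration, loosely echoing the interest/currency example earlier):

```python
# One independent two-class (binary) linear classifier per class:
# (weights, threshold). A document receives every class whose classifier fires.
classifiers = {
    "interest": ({"rate": 0.67, "discount": 0.46, "dlrs": -0.71}, 0.0),
    "currency": ({"dlr": 0.5, "dlrs": 0.5, "rate": -0.2}, 0.3),
}

def any_of_classify(doc):
    terms = doc.split()
    labels = []
    for label, (w, theta) in classifiers.items():
        if sum(w.get(t, 0.0) for t in terms) > theta:
            labels.append(label)
    return labels   # 0, 1, or many classes

print(any_of_classify("rate discount"))   # ['interest']
```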
However, there are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of documents, . . .
Application                What is clustered?        Benefit                                                        Example
Search result clustering   search results            more effective information presentation to user               next slide
Scatter-Gather             (subsets of) collection   alternative user interface: "search without typing"           two slides ahead
Collection clustering      collection                effective information presentation for exploratory browsing   McKeown et al. 2002, news.google.com
Cluster-based retrieval    collection                higher efficiency: faster search                              Salton 1971
A collection of news stories is clustered ("scattered") into eight clusters (top row). The user manually gathers three of them into a smaller collection, 'International Stories', and performs another scattering. The process repeats until a small cluster with relevant documents is found (e.g., Trinidad).
Examples: Cartia Themescapes; Google News.
Cluster docs in the collection a priori. When a query matches a doc d, also return other docs in the cluster containing d.
Because clustering groups together docs containing “car” with those containing “automobile”. Both types of documents contain words like “parts”, “dealer”, “mercedes”, “road trip”.
But how do we formalize this?
Initially, we will assume the number of clusters K is given.
Example: avoid very small and very large clusters
Flat: usually start with a random (partial) partitioning of docs into groups and refine iteratively. Main algorithm: K-means.
Hierarchical: create a hierarchy, either bottom-up (agglomerative) or top-down (divisive).
More common and easier to do
Soft clustering makes more sense for applications like creating browsable hierarchies. You may want to put a pair of sneakers in two clusters: sports apparel and shoes. You can only do that with a soft clustering approach.
Today: flat, hard clustering.
Next time: hierarchical, hard clustering.
Exhaustive enumeration of all partitions is not tractable.
Reassignment: assign each vector to its closest centroid.
Recomputation: recompute each centroid as the average of the vectors that were assigned to it in reassignment.
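The two alternating steps can be sketched as follows (a minimal illustration on toy 2-D points; it naively seeds with the first K points and runs a fixed number of iterations rather than testing for convergence):

```python
import math

def kmeans(points, k, iters=10):
    centroids = [list(p) for p in points[:k]]   # naive seed selection
    clusters = []
    for _ in range(iters):
        # Reassignment: each point goes to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        # Recomputation: each centroid becomes the mean of its cluster.
        for j, cl in enumerate(clusters):
            if cl:
                centroids[j] = [sum(coord) / len(cl) for coord in zip(*cl)]
    return centroids, clusters

pts = [[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [2.0, 2.0], [2.2, 2.0], [2.0, 2.2]]
cents, cls = kmeans(pts, 2)
print(sorted(len(c) for c in cls))   # [3, 3]
```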
µ(ω_k) = (1/|ω_k|) Σ_{x ∈ ω_k} x
(Figure sequence: worked K-means example — documents (b) and centroids (×) over successive reassignment/recomputation iterations until convergence.)
RSS = Σ_{k=1}^{K} RSS_k — the residual sum of squares (the "goodness" measure), where for a candidate center v of cluster ω_k:
RSS_k(v) = Σ_{x ∈ ω_k} |v − x|² = Σ_{x ∈ ω_k} Σ_{m=1}^{M} (v_m − x_m)²
Setting the partial derivatives to zero:
∂RSS_k(v)/∂v_m = Σ_{x ∈ ω_k} 2(v_m − x_m) = 0  ⇒  v_m = (1/|ω_k|) Σ_{x ∈ ω_k} x_m
The last line is the componentwise definition of the centroid! We minimize RSS_k when the old centroid is replaced with the new centroid. RSS, the sum of the RSS_k, must then also decrease during recomputation.
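A quick numeric check of this argument (a toy 2-D cluster; any non-centroid center, such as the arbitrary old one below, gives a strictly larger RSS_k than the mean):

```python
def rss_k(v, cluster):
    """Sum of squared distances from center v to the cluster's points."""
    return sum(sum((vm - xm) ** 2 for vm, xm in zip(v, x)) for x in cluster)

cluster = [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]]
old_center = [0.0, 0.0]   # some previous (non-optimal) centroid
new_center = [sum(coord) / len(cluster) for coord in zip(*cluster)]

print(new_center)                                              # [1.0, 1.0]
print(rss_k(old_center, cluster), rss_k(new_center, cluster))  # 14.0 8.0
```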
(Figure: a suboptimal clustering resulting from bad seed choice.)
Select seeds not randomly, but using some heuristic (e.g., filter the document space).
Use hierarchical clustering to find good seeds (next class).
Select i (e.g., i = 10) different sets of seeds, do a K-means clustering for each, and select the clustering with lowest RSS.