INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze’s, linked from http://informationretrieval.org/
IR 20/25: Linear Classifiers and Flat clustering
Paul Ginsparg
Cornell University, Ithaca, NY
11 Nov 2009
http://michaelnielsen.org/blog/write-your-first-mapreduce-program-in-20-minutes/
http://michaelnielsen.org/blog/lecture-course-the-google-technology-stack/
But linear preprocessing of documents is as expensive as training Naive Bayes. You will always preprocess the training set, so in reality the training time of kNN is linear.
A linear classifier computes a linear combination or weighted sum ∑_i w_i x_i of the feature values.
Classification decision: ∑_i w_i x_i > θ?
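The decision rule can be sketched in a few lines of Python (the weights and threshold below are hypothetical, chosen only to illustrate the inequality):

```python
# A minimal sketch of the two-class linear classifier decision rule.
# weights and theta are invented for illustration, not from the slides.

def classify(weights, x, theta):
    """Return True (assign to the class) iff sum_i w_i * x_i > theta."""
    score = sum(w * xi for w, xi in zip(weights, x))
    return score > theta

weights = [0.6, -0.4, 0.2]  # one hypothetical weight per feature
theta = 0.1                 # hypothetical threshold
print(classify(weights, [1, 0, 1], theta))  # 0.8 > 0.1  -> True
print(classify(weights, [0, 1, 0], theta))  # -0.4 > 0.1 -> False
```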
Classification decision based on majority of k nearest neighbors.
The decision boundaries between classes are piecewise linear . . .
. . . but they are not linear classifiers that can be described as ∑_{i=1}^{M} w_i x_i = θ.
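The majority-vote rule can be sketched as follows (toy 2-D data and labels invented for illustration, not the slides' implementation):

```python
from collections import Counter

def knn_classify(train, query, k):
    """Majority vote among the k training points closest to query
    (Euclidean distance). Toy sketch with made-up data."""
    by_dist = sorted(train,
                     key=lambda item: sum((a - b) ** 2
                                          for a, b in zip(item[0], query)))
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

train = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"),
         ((1.0, 1.0), "B"), ((0.9, 1.1), "B")]
print(knn_classify(train, (0.2, 0.1), 3))  # A (two of the 3 nearest are A)
```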
ti          wi      x1i  x2i     ti      wi      x1i  x2i
prime       0.70    0    1       dlrs    −0.71   1    1
rate        0.67    1    0       world   −0.35   1    0
interest    0.63    0    0       sees    −0.33   0    0
rates       0.60    0    0       year    −0.25   0    0
discount    0.46    1    0       group   −0.24   0    0
bundesbank  0.43    0    0       dlr     −0.24   0    0

This is for the class interest in Reuters-21578.
For simplicity: assume a simple 0/1 vector representation.
x1: “rate discount dlrs world”
x2: “prime dlrs”
Exercise: Which class is x1 assigned to? Which class is x2 assigned to?
We assign document d1 “rate discount dlrs world” to interest since
w · d1 = 0.67 · 1 + 0.46 · 1 + (−0.71) · 1 + (−0.35) · 1 = 0.07 > 0 = b.
We assign d2 “prime dlrs” to the complement class (not in interest) since
w · d2 = 0.70 − 0.71 = −0.01 ≤ b.
(dlr and world have negative weights because they are indicators for the competing class currency.)
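The worked example can be checked mechanically. The sketch below keeps only the per-term weights the two example documents need; terms absent from a document contribute 0 to the weighted sum.

```python
# Mechanical check of the worked example for the class "interest".
weights = {"prime": 0.70, "rate": 0.67, "interest": 0.63, "rates": 0.60,
           "discount": 0.46, "bundesbank": 0.43,
           "dlrs": -0.71, "world": -0.35}

def score(doc):
    """w · d for a 0/1 document vector given as a whitespace-separated string."""
    return sum(weights.get(t, 0.0) for t in doc.split())

b = 0.0
print(round(score("rate discount dlrs world"), 2))  # 0.07  -> interest
print(round(score("prime dlrs"), 2))                # -0.01 -> not interest
```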
Huge differences in performance on test documents
How much training data is available?
How simple/complex is the problem? (linear vs. nonlinear decision boundary)
How noisy is the problem?
How stable is the problem over time?
For an unstable problem, it’s better to use a simple and robust classifier.
Classes are mutually exclusive. Each document belongs to exactly one class. Example: language of a document (assumption: no document contains multiple languages)
Run each classifier separately.
Rank classifiers (e.g., according to score).
Pick the class with the highest score.
A document can be a member of 0, 1, or many classes.
A decision on one class leaves decisions open on all other classes.
A type of “independence” (but not statistical independence).
Example: topic classification.
Usually: make decisions on the region, on the subject area, on the industry, and so on “independently”.
Simply run each two-class classifier separately on the test document and assign the document accordingly.
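Both set-ups can be sketched with hypothetical per-class scores (class names and numbers invented for illustration): one-of picks the single highest-scoring class, any-of decides each class independently.

```python
# Hypothetical scores from four two-class classifiers on one test document.
scores = {"UK": 0.6, "China": 0.2, "poultry": 0.7, "coffee": -0.3}

# One-of: classes are mutually exclusive -> pick the highest-scoring class.
one_of = max(scores, key=scores.get)

# Any-of: decide each class independently -> keep every positive-scoring class.
any_of = sorted(c for c, s in scores.items() if s > 0)

print(one_of)  # poultry
print(any_of)  # ['China', 'UK', 'poultry']
```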
However, there are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of documents, . . .
Application                What is clustered?        Benefit                                                       Example
Search result clustering   search results            more effective information presentation to user               next slide
Scatter-Gather             (subsets of) collection   alternative user interface: “search without typing”           two slides ahead
Collection clustering      collection                effective information presentation for exploratory browsing   McKeown et al. 2002, news.google.com
Cluster-based retrieval    collection                higher efficiency: faster search                              Salton 1971
A collection of news stories is clustered (“scattered”) into eight clusters (top row). The user manually gathers three of these into a smaller collection ‘International Stories’ and performs another scattering. The process repeats until a small cluster with relevant documents is found (e.g., Trinidad).
Cartia Themescapes Google News
Cluster docs in collection a priori When a query matches a doc d, also return other docs in the cluster containing d
Because clustering groups together docs containing “car” with those containing “automobile”. Both types of documents contain words like “parts”, “dealer”, “mercedes”, “road trip”.
But how do we formalize this?
Initially, we will assume the number of clusters K is given.
Example: avoid very small and very large clusters
Flat algorithms:
Usually start with a random (partial) partitioning of docs into groups.
Refine iteratively.
Main algorithm: K-means.
Hierarchical algorithms:
Create a hierarchy.
Bottom-up, agglomerative.
Top-down, divisive.
Hard clustering: each document belongs to exactly one cluster. More common and easier to do.
Soft clustering: a document can belong to more than one cluster. Makes more sense for applications like creating browsable hierarchies.
You may want to put a pair of sneakers in two clusters: sports apparel and shoes.
You can only do that with a soft clustering approach.
Today: flat, hard clustering. Next time: hierarchical, hard clustering.
Exhaustively enumerating all possible clusterings is not tractable.
Reassignment: assign each vector to its closest centroid.
Recomputation: recompute each centroid as the average of the vectors that were assigned to it in reassignment.
Centroid: μ(ω_k) = (1/|ω_k|) ∑_{x ∈ ω_k} x
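The two alternating steps can be sketched as a toy implementation (random choice of documents as seeds and the small 2-D data set are illustrative assumptions, not the slides' code):

```python
import random

def kmeans(vectors, k, iters=20, seed=0):
    """Toy K-means sketch: alternate reassignment and recomputation."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)  # pick k random docs as seeds
    for _ in range(iters):
        # Reassignment: each vector goes to its closest centroid.
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k),
                          key=lambda j: sum((a - b) ** 2
                                            for a, b in zip(v, centroids[j])))
            clusters[nearest].append(v)
        # Recomputation: each centroid becomes the mean of its cluster.
        for j, cl in enumerate(clusters):
            if cl:
                centroids[j] = tuple(sum(d) / len(cl) for d in zip(*cl))
    return centroids, clusters

pts = [(0.0, 0.0), (0.0, 1.0), (4.0, 4.0), (5.0, 4.0)]
centroids, clusters = kmeans(pts, 2)
print(sorted(centroids))  # [(0.0, 0.5), (4.5, 4.0)]
```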
[Figure sequence: K-means worked example. Documents are shown as dots and centroids as ×; successive slides alternate reassignment of documents to their closest centroid with recomputation of the centroids, until assignments no longer change.]
RSS = ∑_{k=1}^{K} RSS_k is the residual sum of squares (the “goodness” measure).
RSS_k(v) = ∑_{x ∈ ω_k} |v − x|² = ∑_{x ∈ ω_k} ∑_{m=1}^{M} (v_m − x_m)²
∂RSS_k(v)/∂v_m = ∑_{x ∈ ω_k} 2(v_m − x_m) = 0
⇒ v_m = (1/|ω_k|) ∑_{x ∈ ω_k} x_m
The last line is the componentwise definition of the centroid! We minimize RSS_k when the old centroid is replaced with the new centroid. RSS, the sum of the RSS_k, must then also decrease during recomputation.
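A small numeric check of the derivation (toy cluster with invented coordinates): the centroid gives a smaller RSS_k than any other choice of v.

```python
# Toy cluster; the derivation says the centroid minimizes RSS_k.
cluster = [(1.0, 2.0), (3.0, 2.0), (2.0, 5.0)]

def rss_k(v, cluster):
    """Sum of squared distances from v to every vector in the cluster."""
    return sum(sum((vm - xm) ** 2 for vm, xm in zip(v, x)) for x in cluster)

centroid = tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
print(centroid)                  # (2.0, 3.0)
print(rss_k(centroid, cluster))  # 8.0
print(rss_k((1.0, 1.0), cluster) > rss_k(centroid, cluster))  # True
```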
Select seeds not randomly, but using some heuristic (e.g., filter out outliers, or find a set of seeds that has good coverage of the document space).
Use hierarchical clustering to find good seeds (next class).
Select i (e.g., i = 10) different sets of seeds, do a K-means clustering for each, and select the clustering with lowest RSS.
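The third heuristic can be sketched in 1-D (toy data and a deliberately simplified K-means, purely illustrative): run the algorithm from i = 10 random seed sets and keep the run with lowest RSS.

```python
import random

def rss(centroids, vectors):
    """Residual sum of squares of 1-D points w.r.t. their closest centroid."""
    return sum(min((v - c) ** 2 for c in centroids) for v in vectors)

def kmeans_1d(vectors, k, seeds, iters=10):
    """Simplified 1-D K-means run from a given set of seeds."""
    centroids = list(seeds)
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for v in vectors:
            clusters[min(range(k), key=lambda j: (v - centroids[j]) ** 2)].append(v)
        centroids = [sum(cl) / len(cl) if cl else c
                     for cl, c in zip(clusters, centroids)]
    return centroids

data = [0.0, 0.5, 1.0, 9.0, 9.5, 10.0]
rng = random.Random(1)
# i = 10 different random seed sets; keep the clustering with lowest RSS.
runs = [kmeans_1d(data, 2, rng.sample(data, 2)) for _ in range(10)]
best = min(runs, key=lambda c: rss(c, data))
print(sorted(best))  # [0.5, 9.5]
```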