INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/
IR 20/26: Linear Classifiers and Flat clustering
Paul Ginsparg
Cornell University, Ithaca, NY
10 Nov 2009
http://michaelnielsen.org/blog/write-your-first-mapreduce-program-in-20-minutes/
http://michaelnielsen.org/blog/lecture-course-the-google-technology-stack/
$$\frac{N!}{(N-m)!} = N(N-1)\cdots(N-m+1) \approx N^m ,$$
so the number of ways to choose $m$ of $N$ is
$$\binom{N}{m} = \frac{N!}{m!(N-m)!} \approx \frac{N^m}{m!} ,$$
and in the limit $N \to \infty$ with $\mu = Np$ held fixed, the binomial distribution becomes
$$p(m) = e^{-\mu}\,\frac{\mu^m}{m!} ,$$
known as a Poisson distribution. (It is properly normalized: $\sum_{m=0}^{\infty} p(m) = e^{-\mu} \sum_{m=0}^{\infty} \frac{\mu^m}{m!} = e^{-\mu} \cdot e^{\mu} = 1$.)
[Figure: plot of the Poisson distribution p(m) for m = 0, . . . , 30]
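A quick numerical check of this limit (a minimal sketch; the values μ = 10 and N = 10000 are assumed for illustration):

```python
from math import comb, exp

mu = 10.0        # assumed example mean
N = 10_000       # a "large" N for the binomial
p = mu / N       # success probability, so that N*p = mu

# Build the Poisson terms p(m) = e^{-mu} mu^m / m! iteratively
# (each term is the previous one times mu/m, avoiding big factorials).
term = exp(-mu)                  # p(0)
poisson = {0: term}
for m in range(1, 31):
    term *= mu / m
    poisson[m] = term

# The binomial P(m) approaches the Poisson limit for large N.
for m in (5, 10, 15):
    binom = comb(N, m) * p**m * (1 - p)**(N - m)
    print(m, binom, poisson[m])  # the two columns agree closely

# Normalization: the Poisson terms sum to ~1 (tail beyond m = 30 is negligible for mu = 10).
print(sum(poisson.values()))
```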
But linear preprocessing of documents is as expensive as training Naive Bayes. You will always preprocess the training set, so in practice the training time of kNN is linear in the size of the training set.
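For concreteness, a minimal sketch of the kNN decision rule (assumptions: dense vectors stored as lists of floats, a toy training list of (vector, label) pairs; all names are illustrative):

```python
from collections import Counter
import math

def euclidean(u, v):
    """Euclidean distance between two dense vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def knn_classify(train, x, k=3):
    """Majority vote among the k training vectors closest to x.
    train is a list of (vector, label) pairs; "training" kNN amounts
    to storing (preprocessing) this list."""
    nearest = sorted(train, key=lambda pair: euclidean(pair[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```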
A linear classifier computes a weighted sum $\sum_i w_i x_i$ of the feature values.
Classification decision: is $\sum_i w_i x_i > \theta$?
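As a sketch, the decision rule is a single comparison (names are illustrative):

```python
def linear_classify(w, x, theta):
    """Return True iff the weighted sum sum_i w_i * x_i exceeds theta."""
    return sum(wi * xi for wi, xi in zip(w, x)) > theta
```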
[Figure: kNN classification with decision boundaries]
Classification decision based on majority of k nearest neighbors. The decision boundaries between classes are piecewise linear . . . but they are not linear classifiers that can be described as $\sum_{i=1}^{M} w_i x_i = \theta$.
ti          | wi    | x1i | x2i  ||  ti     | wi     | x1i | x2i
prime       | 0.70  | 0   | 1    ||  dlrs   | −0.71  | 1   | 1
rate        | 0.67  | 1   | 0    ||  world  | −0.35  | 1   | 0
interest    | 0.63  | 0   | 0    ||  sees   | −0.33  | 0   | 0
rates       | 0.60  | 0   | 0    ||  year   | −0.25  | 0   | 0
discount    | 0.46  | 1   | 0    ||  group  | −0.24  | 0   | 0
bundesbank  | 0.43  | 0   | 0    ||  dlr    | −0.24  | 0   | 0
This is for the class interest in Reuters-21578. For simplicity, assume a simple 0/1 vector representation:
x1: "rate discount dlrs world"
x2: "prime dlrs"
Exercise: Which class is x1 assigned to? Which class is x2 assigned to?
We assign x1 "rate discount dlrs world" to interest since
$\vec w^{T} \vec x_1 = 0.67 \cdot 1 + 0.46 \cdot 1 + (−0.71) \cdot 1 + (−0.35) \cdot 1 = 0.07 > 0 = b$.
We assign x2 "prime dlrs" to the complement class (not in interest) since
$\vec w^{T} \vec x_2 = 0.70 \cdot 1 + (−0.71) \cdot 1 = −0.01 \le b$.
(dlr and world have negative weights because they are indicators for the competing class currency.)
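The two decisions can be replayed directly from the table (a sketch; score just sums the weights of the terms occurring in the document):

```python
# Weights for the class "interest" (from the table above); b is the threshold.
w = {"prime": 0.70, "rate": 0.67, "interest": 0.63, "rates": 0.60,
     "discount": 0.46, "bundesbank": 0.43,
     "dlrs": -0.71, "world": -0.35, "sees": -0.33, "year": -0.25,
     "group": -0.24, "dlr": -0.24}
b = 0.0

def score(doc):
    """Weighted sum over the 0/1 term vector of doc."""
    return sum(w.get(term, 0.0) for term in doc.split())

print(score("rate discount dlrs world"))  # ~0.07  > b  -> interest
print(score("prime dlrs"))                # ~-0.01 <= b -> not interest
```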
Different learning methods can produce huge differences in performance on test documents.
How much training data is available?
How simple/complex is the problem? (linear vs. nonlinear decision boundary)
How noisy is the problem?
How stable is the problem over time?
For an unstable problem, it's better to use a simple and robust classifier.
Classes are mutually exclusive. Each document belongs to exactly one class. Example: language of a document (assumption: no document contains multiple languages)
Run each classifier separately.
Rank classifiers (e.g., according to score).
Pick the class with the highest score.
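A sketch of this one-of scheme, assuming each two-class classifier exposes a real-valued scoring function (names illustrative):

```python
def one_of_classify(scorers, x):
    """One-of (multiclass): score x under every classifier and
    pick the single class with the highest score.
    scorers maps class name -> scoring function."""
    return max(scorers, key=lambda c: scorers[c](x))
```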
A document can be a member of 0, 1, or many classes.
A decision on one class leaves decisions open on all other classes.
A type of "independence" (but not statistical independence).
Example: topic classification.
Usually: make decisions on the region, on the subject area, on the industry, and so on "independently".
Simply run each two-class classifier separately on the test document and assign the document accordingly.
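A matching sketch for the any-of case, assuming each two-class classifier exposes a boolean decision function (names illustrative):

```python
def any_of_classify(deciders, x):
    """Any-of (multilabel): run each two-class classifier independently
    and assign x to every class whose classifier accepts it.
    deciders maps class name -> boolean decision function."""
    return [c for c, decide in deciders.items() if decide(x)]
```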
[Figure: a 2D data set with a clear cluster structure]
However, there are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of documents, . . .
Application              | What is clustered?      | Benefit                                                      | Example
Search result clustering | search results          | more effective information presentation to user              |
Scatter-Gather           | (subsets of) collection | alternative user interface: "search without typing"          |
Collection clustering    | collection              | effective information presentation for exploratory browsing  | McKeown et al. 2002, news.google.com
Cluster-based retrieval  | collection              | higher efficiency: faster search                             | Salton 1971
Cartia ThemeScapes
Google News
Cluster docs in collection a priori.
When a query matches a doc d, also return other docs in the cluster containing d.
Because clustering groups together docs containing “car” with those containing “automobile”. Both types of documents contain words like “parts”, “dealer”, “mercedes”, “road trip”.
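A minimal sketch of this scheme, assuming the clustering has been precomputed as two maps (names illustrative):

```python
def cluster_based_retrieve(query_matches, doc_to_cluster, clusters):
    """For each doc d matching the query, also return the other docs
    in the cluster containing d.
    doc_to_cluster: doc id -> cluster id; clusters: cluster id -> set of doc ids."""
    results = set(query_matches)
    for d in query_matches:
        results |= clusters[doc_to_cluster[d]]
    return results
```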
But how do we formalize this?
Initially, we will assume the number of clusters K is given.
Example: avoid very small and very large clusters
Flat algorithms:
Usually start with a random (partial) partitioning of docs into groups.
Refine iteratively.
Main algorithm: K-means.
Hierarchical algorithms:
Create a hierarchy.
Bottom-up, agglomerative.
Top-down, divisive.
Hard clustering: each document belongs to exactly one cluster. More common and easier to do.
Soft clustering: a document can belong to more than one cluster. Makes more sense for applications like creating browsable hierarchies.
You may want to put a pair of sneakers in two clusters: sports apparel and shoes.
You can only do that with a soft clustering approach.
Today: flat, hard clustering. Next time: hierarchical, hard clustering.
Exhaustive enumeration of all possible clusterings is not tractable.
Reassignment: assign each vector to its closest centroid.
Recomputation: recompute each centroid as the average of the vectors that were assigned to it in reassignment.
The centroid of a cluster $\omega_k$ is the mean of its members: $\vec\mu(\omega_k) = \frac{1}{|\omega_k|} \sum_{\vec x \in \omega_k} \vec x$.
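A minimal K-means sketch built from exactly these two steps (assumptions: documents are dense vectors given as lists of floats, seeds are chosen at random, distance is squared Euclidean):

```python
import random

def dist2(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(docs, K, iters=100):
    """Flat clustering: alternate reassignment and centroid
    recomputation until the assignment stops changing."""
    centroids = random.sample(docs, K)   # random seeds
    assign = None
    for _ in range(iters):
        # Reassignment: each vector goes to its closest centroid.
        new_assign = [min(range(K), key=lambda k: dist2(x, centroids[k]))
                      for x in docs]
        if new_assign == assign:         # converged
            break
        assign = new_assign
        # Recomputation: each centroid becomes the mean of its members.
        for k in range(K):
            members = [x for x, a in zip(docs, assign) if a == k]
            if members:                  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centroids
```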
[Figures: worked K-means example with K = 2 on 20 points, alternating reassignment (points move to the closest × centroid) and recomputation (centroids move to the cluster means) over several iterations until convergence]
RSS = sum of all squared distances between document vector and closest centroid
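As code (reusing dist2 from the K-means sketch above):

```python
def rss(docs, assign, centroids):
    """Residual sum of squares: total squared distance of each
    document vector to the centroid it is assigned to."""
    return sum(dist2(x, centroids[a]) for x, a in zip(docs, assign))
```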
$\mathrm{RSS} = \sum_{k=1}^{K} \mathrm{RSS}_k$ — the residual sum of squares (the "goodness" measure), where
$$\mathrm{RSS}_k(\vec v) = \sum_{\vec x \in \omega_k} |\vec v - \vec x|^2 = \sum_{\vec x \in \omega_k} \sum_{m=1}^{M} (v_m - x_m)^2 .$$
Setting the partial derivatives to zero:
$$\frac{\partial \mathrm{RSS}_k(\vec v)}{\partial v_m} = \sum_{\vec x \in \omega_k} 2(v_m - x_m) = 0 \quad\Rightarrow\quad v_m = \frac{1}{|\omega_k|} \sum_{\vec x \in \omega_k} x_m .$$
The last line is the componentwise definition of the centroid! We minimize RSS_k when the old centroid is replaced with the new centroid during recomputation.
[Figure: a configuration of points for which a bad choice of initial seeds yields a suboptimal K-means clustering]
Select seeds not randomly, but using some heuristic (e.g., filter out outliers, or find a set of seeds with good coverage of the document space).
Use hierarchical clustering to find good seeds (next class).
Select i (e.g., i = 10) different sets of seeds, do a K-means clustering for each, and select the clustering with lowest RSS (sketched below).
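The last heuristic as a sketch, reusing kmeans and rss from the sketches above:

```python
def kmeans_best_of(docs, K, restarts=10):
    """Run K-means from several random seed sets and keep the
    clustering with the lowest RSS."""
    best = None
    for _ in range(restarts):
        assign, centroids = kmeans(docs, K)
        r = rss(docs, assign, centroids)
        if best is None or r < best[0]:
            best = (r, assign, centroids)
    return best   # (RSS, assignment, centroids) of the best run
```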