SLIDE 1

Clustering

CE-324: Modern Information Retrieval

Sharif University of Technology

M. Soleymani

Fall 2018

Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

SLIDE 2

What is clustering?

• Clustering: grouping a set of objects into clusters of similar objects
  • Docs within a cluster should be similar.
  • Docs from different clusters should be dissimilar.
• The commonest form of unsupervised learning
  • Unsupervised learning: learning from raw data, as opposed to supervised learning, where a classification of examples is given
• A common and important task that finds many applications in IR and other places

(Ch. 16)

SLIDE 3

A data set with clear cluster structure

• How would you design an algorithm for finding the three clusters in this case?

(Ch. 16)

SLIDE 4

Applications of clustering in IR

• For better navigation of search results
  • Effective “user recall” will be higher
• Whole corpus analysis/navigation
  • Better user interface: search without typing
• For improving recall in search applications
  • Better search results (like pseudo relevance feedback)
• For speeding up vector space retrieval
  • Cluster-based retrieval gives faster search

(Sec. 16.1)

SLIDE 5

Applications of clustering in IR

SLIDE 6

Search result clustering

SLIDE 7

yippy.com – grouping search results

SLIDE 8

Clustering the collection

• Cluster-based navigation is an interesting alternative to keyword searching (i.e., the standard IR paradigm)
• Users may prefer browsing over searching when they are unsure about which terms to use
• Well suited to a collection of news stories
  • News reading is not really search, but rather a process of selecting a subset of stories about recent events

SLIDE 9

Yahoo! Hierarchy isn’t clustering but is the kind of output you want from clustering

[Figure: the www.yahoo.com/Science hierarchy, with top-level categories such as agriculture, biology, physics, CS, and space, each split into subcategories (e.g., dairy, crops, agronomy, forestry; AI, HCI, courses; botany, cell, evolution; magnetism, relativity; craft, missions).]

SLIDE 10

Google News: automatic clustering gives an effective news presentation metaphor

SLIDE 11

SLIDE 12

To improve efficiency and effectiveness of the search system

• Improve language modeling: replace the collection model used for smoothing by a model derived from the doc’s cluster
• Clustering can speed up search (via an inexact algorithm)
• Clustering can improve recall

SLIDE 13

For improving search recall

• Cluster hypothesis: Docs in the same cluster behave similarly with respect to relevance to information needs
• Therefore, to improve search recall:
  • Cluster docs in the corpus a priori
  • When a query matches a doc d, also return other docs in the cluster containing d
• Query “car”: also return docs containing “automobile”
  • Because clustering grouped together docs containing “car” with those containing “automobile”. Why might this happen?

(Sec. 16.1)

SLIDE 14

Issues for clustering

• Representation for clustering
  • Doc representation
    • Vector space? Normalization? (Centroids aren’t length normalized)
  • Need a notion of similarity/distance
• How many clusters?
  • Fixed a priori? Completely data driven?
  • Avoid “trivial” clusters – too large or small
    • Too large: for navigation purposes you’ve wasted an extra user click without whittling down the set of docs much

(Sec. 16.2)

SLIDE 15

Notion of similarity/distance

• Ideal: semantic similarity
• Practical: term-statistical similarity
  • We will use cosine similarity.
• For many algorithms, it is easier to think in terms of a distance (rather than a similarity)
  • We will mostly speak of Euclidean distance
    • But real implementations use cosine similarity

SLIDE 16

Clustering algorithms categorization

• Flat algorithms (e.g., k-means)
  • Usually start with a random (partial) partitioning
  • Refine it iteratively
• Hierarchical algorithms
  • Bottom-up, agglomerative
  • Top-down, divisive

SLIDE 17

Hard vs. soft clustering

• Hard clustering: Each doc belongs to exactly one cluster
  • More common and easier to do
• Soft clustering: A doc can belong to more than one cluster.

SLIDE 18

Partitioning algorithms

• Construct a partition of N docs into K clusters
  • Given: a set of docs and the number K
  • Find: a partition of the docs into K clusters that optimizes the chosen partitioning criterion
• Finding a global optimum is intractable for many clustering objective functions
  • Effective heuristic methods: K-means and K-medoids algorithms

SLIDE 19

K-means

• Assumes docs are real-valued vectors x^{(1)}, …, x^{(N)}.
• Clusters based on centroids (aka the center of gravity or mean) of the points in a cluster:

  \mu_k = \frac{1}{|C_k|} \sum_{x^{(i)} \in C_k} x^{(i)}

• K-means cost function:

  J(C) = \sum_{k=1}^{K} \sum_{x^{(i)} \in C_k} \bigl\| x^{(i)} - \mu_k \bigr\|^2

  where C = {C_1, C_2, …, C_K} and C_k is the set of data points assigned to the k-th cluster.

(Sec. 16.4)

SLIDE 20

K-means algorithm

Select K random points {\mu_1, \mu_2, …, \mu_K} as the clusters’ initial centroids.
Until clustering converges (or other stopping criterion):
    For each doc x^{(i)}:
        Assign x^{(i)} to the cluster C_k such that dist(x^{(i)}, \mu_k) is minimal.
    For each cluster C_k:
        \mu_k = \frac{1}{|C_k|} \sum_{x^{(i)} \in C_k} x^{(i)}

Reassignment of instances to clusters is based on distance to the current cluster centroids (can equivalently be in terms of similarities).

(Sec. 16.4)
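The following is not from the slides: a minimal NumPy sketch of the loop above, using Euclidean distance and K randomly chosen docs as seeds. The function name kmeans() and its arguments are illustrative.

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Plain K-means: X is an (N, M) array of doc vectors, K the number of clusters."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)].copy()  # seeds = K random docs
    assign = np.zeros(len(X), dtype=int)
    for it in range(max_iter):
        # Reassignment: each doc goes to the cluster with the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if it > 0 and np.array_equal(new_assign, assign):
            break  # doc partition unchanged -> converged
        assign = new_assign
        # Recomputation: each centroid becomes the mean of its current members.
        for k in range(K):
            members = X[assign == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
    # Cost J(C): sum of squared distances of docs to their centroids.
    cost = sum(np.sum((X[assign == k] - centroids[k]) ** 2) for k in range(K))
    return assign, centroids, cost

labels, mus, J = kmeans(np.random.rand(100, 5), K=3)  # toy usage on random vectors
```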

SLIDE 21

[Bishop]

SLIDE 22

SLIDE 23

Termination conditions

• Several possibilities for the termination condition, e.g.:
  • A fixed number of iterations
  • Doc partition unchanged
  • J < θ: the cost function falls below a threshold
  • ΔJ < θ: the decrease in the cost function (in two successive iterations) falls below a threshold

(Sec. 16.4)

SLIDE 24

Convergence of K-means

• The K-means algorithm always reaches a fixed point in which the clusters don’t change.
• We must use tie-breaking when a sample has the same distance from two or more clusters (e.g., by assigning it to the lowest-index cluster)

(Sec. 16.4)

SLIDE 25

K-means decreases J(C) in each iteration (before convergence)

• First, reassignment monotonically decreases J(C), since each vector is assigned to the closest centroid.
• Second, recomputation monotonically decreases each \sum_{x^{(i)} \in C_k} \| x^{(i)} - \mu_k \|^2:
  • \sum_{x^{(i)} \in C_k} \| x^{(i)} - \mu_k \|^2 reaches its minimum for \mu_k = \frac{1}{|C_k|} \sum_{x^{(i)} \in C_k} x^{(i)}
• K-means typically converges quickly

(Sec. 16.4)
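A one-line check of the second claim (not on the slides): setting the gradient of the within-cluster cost with respect to \mu_k to zero yields the centroid.

```latex
% Minimizing the within-cluster cost over \mu_k:
\nabla_{\mu_k} \sum_{x^{(i)} \in C_k} \bigl\| x^{(i)} - \mu_k \bigr\|^2
  = -2 \sum_{x^{(i)} \in C_k} \bigl( x^{(i)} - \mu_k \bigr) = 0
  \quad\Longrightarrow\quad
  \mu_k = \frac{1}{|C_k|} \sum_{x^{(i)} \in C_k} x^{(i)} .
```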

SLIDE 26

Time complexity of K-means

• Computing the distance between two docs: O(M)
  • M is the dimensionality of the vectors.
• Reassigning clusters: O(KN) distance computations ⇒ O(KNM).
• Computing centroids: each doc gets added once to some centroid: O(NM).
• Assume these two steps are each done once for I iterations: O(IKNM).

(Sec. 16.4)

SLIDE 27

Seed choice

• Results can vary based on the random selection of initial centroids.
  • Some initializations get a poor convergence rate, or converge to a sub-optimal clustering
• Exclude outliers from the seed set
• Try out multiple starting points and choose the clustering with the lowest cost
• Select good seeds using a heuristic (e.g., the doc least similar to any existing mean)
• Obtain seeds from another method such as hierarchical clustering

Example showing sensitivity to seeds: if you start with B and E as centroids you converge to {A,B,C} and {D,E,F}; if you start with D and F, you converge to {A,B,D,E} and {C,F}.

(Sec. 16.4)
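A small sketch of the “try out multiple starting points” idea; it reuses the hypothetical kmeans() function from the slide 20 sketch and keeps the run with the lowest cost J.

```python
def kmeans_restarts(X, K, n_restarts=10):
    """Run K-means n_restarts times and keep the clustering with the lowest cost J."""
    best = None
    for seed in range(n_restarts):
        assign, centroids, cost = kmeans(X, K, seed=seed)  # kmeans() from the slide 20 sketch
        if best is None or cost < best[2]:
            best = (assign, centroids, cost)
    return best
```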

SLIDE 28

K-means issues, variations, etc.

• Computes the centroid only after all points are reassigned
  • Instead, we can recompute the centroid after every assignment
  • This can improve the speed of convergence of K-means
• Assumes clusters are spherical in vector space
  • Sensitive to coordinate changes, weighting, etc.
• Disjoint and exhaustive
  • Doesn’t have a notion of “outliers” by default, but outlier filtering can be added

Dhillon et al., ICDM 2002 – a variation to fix some issues with small document clusters

(Sec. 16.4)

SLIDE 29

How many clusters?

• The number of clusters K is given
  • Partition N docs into a predetermined number of clusters
• Finding the “right” number is part of the problem
  • Given docs, partition them into an “appropriate” number of subsets.
  • E.g., for query results the ideal value of K is not known up front, though the UI may impose limits.

SLIDE 30

How many clusters?

[Figure: the same data set shown clustered as two, four, or six clusters]

SLIDE 31

Selecting k

SLIDE 32

K not specified in advance

• Tradeoff between having better focus within each cluster and having too many clusters
• Solve an optimization problem: penalize having lots of clusters
  • Application dependent
  • E.g., a compressed summary of the search results list.

  k^{*} = \arg\min_{k} \bigl[ J_{\min}(k) + \lambda k \bigr]

  where J_min(k) is the minimum value of J({C_1, C_2, …, C_k}) obtained in, e.g., 100 runs of k-means with k clusters (with different initializations).
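A sketch of this selection rule, assuming the hypothetical kmeans_restarts() helper from the slide 27 sketch; lam is an illustrative penalty weight that would be tuned per application.

```python
def select_k(X, k_max=10, lam=1.0, n_restarts=10):
    """Pick k* = argmin_k [ J_min(k) + lam * k ] over k = 1 .. k_max."""
    best_k, best_score = None, float("inf")
    for k in range(1, k_max + 1):
        _, _, j_min = kmeans_restarts(X, k, n_restarts=n_restarts)  # lowest J over restarts
        score = j_min + lam * k  # penalized cost
        if score < best_score:
            best_k, best_score = k, score
    return best_k
```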

SLIDE 33

Penalize lots of clusters

• Benefit for a doc: cosine similarity to its centroid
• Total Benefit: sum of the individual doc Benefits.
  • Why is there always a clustering of Total Benefit N? (With N singleton clusters, each doc has similarity 1 to its own centroid.)
• For each cluster, we have a Cost C.
• For K clusters, the Total Cost is KC.
• Value of a clustering = Total Benefit − Total Cost.
• Find the clustering of highest value, over all choices of K.
  • Total Benefit increases with increasing K.
  • But we can stop when it doesn’t increase by “much”. The Cost term enforces this.

SLIDE 34

What is a good clustering?

• Internal criterion:
  • intra-class (that is, intra-cluster) similarity is high
  • inter-class similarity is low
• The measured quality of a clustering depends on both the doc representation and the similarity measure

(Sec. 16.3)

SLIDE 35

External criteria for clustering quality

• Quality: the ability to discover some or all of the patterns in gold-standard data
• Assesses a clustering with respect to ground truth … requires labeled data
• The gold standard has J classes c_1, …, c_J, while the clustering produces K clusters ω_1, ω_2, …, ω_K.

(Sec. 16.3)

SLIDE 36

External criteria

• Purity
• Rand Index
• F measure

SLIDE 37

Rand Index (RI)

RI measures the fraction of correct decisions over pairs of points:

Number of point pairs               Same cluster in clustering    Different clusters in clustering
Same class in ground truth          20 (TP)                       24 (FN)
Different classes in ground truth   20 (FP)                       72 (TN)

RI = (20 + 72) / (20 + 24 + 20 + 72) ≈ 0.68

(Sec. 16.3)

SLIDE 38

Rand index and F-measure

  RI = \frac{TP + TN}{TP + FP + FN + TN}

Compare with standard Precision and Recall:

  P = \frac{TP}{TP + FP} \qquad R = \frac{TP}{TP + FN}

People also define and use a cluster F-measure, which is probably a better measure:

  F_\beta = \frac{(\beta^2 + 1)\, P\, R}{\beta^2 P + R}

(Sec. 16.3)
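A small sketch (not from the slides) that evaluates these pairwise measures, checked against the counts in the slide 37 table; the function name is illustrative.

```python
def pairwise_measures(tp, fp, fn, tn, beta=1.0):
    """Precision, recall, Rand index, and F_beta computed over pairs of points."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    ri = (tp + tn) / (tp + fp + fn + tn)
    f = (beta**2 + 1) * p * r / (beta**2 * p + r)
    return p, r, ri, f

# Counts from the table on slide 37: TP=20, FP=20, FN=24, TN=72.
print(pairwise_measures(20, 20, 24, 72))  # RI ≈ 0.676, i.e. the 0.68 on the slide
```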

SLIDE 39

External evaluation of cluster quality

• Purity: the ratio between the size of the dominant class in cluster ω_i and the size of cluster ω_i:

  \mathrm{purity}(\omega_i) = \frac{1}{|\omega_i|} \max_{j = 1, \dots, J} |\omega_i \cap c_j|

  \mathrm{TotalPurity}(\omega_1, \dots, \omega_K)
    = \frac{1}{N} \sum_{i=1}^{K} \max_{j = 1, \dots, J} |\omega_i \cap c_j|
    = \sum_{i=1}^{K} \mathrm{purity}(\omega_i) \cdot \frac{|\omega_i|}{N}

• Biased, because having N clusters (one doc each) maximizes purity

(Sec. 16.3)

SLIDE 40

Purity example

[Figure: three example clusters (Cluster I, Cluster II, Cluster III) over docs from three gold classes]

Cluster I:   Purity = 1/6 · max(5, 1, 0) = 5/6
Cluster II:  Purity = 1/6 · max(1, 4, 1) = 4/6
Cluster III: Purity = 1/5 · max(2, 0, 3) = 3/5

Total Purity = 1/17 · (5 + 4 + 3) = 12/17

(Sec. 16.3)
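A short sketch of purity computed from per-cluster class counts; the nested list below encodes the example on this slide, and the function name is illustrative.

```python
def purity(cluster_class_counts):
    """Total purity, given for each cluster the number of its docs in each gold class."""
    n = sum(sum(counts) for counts in cluster_class_counts)
    return sum(max(counts) for counts in cluster_class_counts) / n

# Example from slide 40: per-cluster class counts for Clusters I, II, III.
counts = [[5, 1, 0], [1, 4, 1], [2, 0, 3]]
print(purity(counts))  # 12/17 ≈ 0.71
```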

SLIDE 41

Hierarchical clustering

• Build a tree-based hierarchical taxonomy (dendrogram) from a set of docs.
• Outputs a hierarchy, a structure that is more informative
• Does not require a pre-specified number of clusters
• But hierarchical methods have lower efficiency.

[Figure: an example taxonomy — animal splits into vertebrate (fish, reptile, amphib., mammal) and invertebrate (worm, insect, crustacean).]

(Ch. 17)

SLIDE 42

Hierarchical clustering: Approaches

• Divisive (top-down) approach: recursive application of a partitional clustering algorithm.
• Agglomerative (bottom-up) approach:
  • treat each doc as a singleton cluster at the outset
  • then successively merge (or agglomerate) pairs of clusters until all clusters have been merged into a single cluster

SLIDE 43

Hierarchical clustering

SLIDE 44

Dendrogram: hierarchical clustering

• Each merge is represented by a horizontal line
  • The y-coordinate of the horizontal line is the similarity of the two clusters that were merged
• A clustering is obtained by cutting the dendrogram at a desired level:
  • Each connected component forms a cluster.

SLIDE 45

Hierarchical Agglomerative Clustering (HAC)

• Starts with each doc in a separate cluster
  • then repeatedly joins the closest pair of clusters, until there is only one cluster.
• The history of merging forms a binary tree or hierarchy.

Note: the resulting clusters are still “hard” and induce a partition.

(Sec. 17.1)

SLIDE 46

Example: Hierarchical Agglomerative Clustering (HAC)

[Figure: an HAC dendrogram built over seven points labeled 1–7]

SLIDE 47

SLIDE 48

How to cut the hierarchy?

• Cut at a prespecified level of similarity
• Cut where the gap between two successive combination similarities is largest.
• Prespecify the number of clusters K and select the cutting point that produces K clusters.
• Find the number of clusters as: k^{*} = \arg\min_{k} \bigl[ J_{\min}(k) + \lambda k \bigr]
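A sketch of the first and third cutting strategies using SciPy’s hierarchical-clustering utilities; the data, the distance threshold, and the cluster count are placeholders (note that SciPy cuts on dissimilarity rather than similarity).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 5)            # placeholder doc vectors
Z = linkage(X, method='complete')    # build the dendrogram (complete link, Euclidean)

labels_by_height = fcluster(Z, t=0.9, criterion='distance')  # cut at a prespecified dissimilarity level
labels_by_count  = fcluster(Z, t=4,   criterion='maxclust')  # cut so that at most K = 4 clusters remain
```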

SLIDE 49

Closest pair of clusters

• Many variants for defining the closest pair of clusters
  • Single-link: similarity of the most similar pair
  • Complete-link: similarity of the “furthest” points, i.e., the least similar pair
  • Centroid: clusters whose centroids (centers of gravity) are the most similar
  • Average-link: average similarity between pairs of elements

(Sec. 17.2)

SLIDE 50

Distances between cluster pairs

[Figure: illustration of the four linkage criteria — single-link, complete-link, centroid, and average-link]

SLIDE 51

Single link

• Use the maximum similarity of pairs:

  sim(C_i, C_j) = \max_{x \in C_i,\, y \in C_j} sim(x, y)

• Can result in “straggly” (long and thin) clusters due to the chaining effect.
• After merging C_i and C_j, the similarity of the resulting cluster to another cluster C_k:

  sim(C_i \cup C_j, C_k) = \max\bigl( sim(C_i, C_k),\, sim(C_j, C_k) \bigr)

(Sec. 17.2)

SLIDE 52

Single link: example

(Sec. 17.2)

SLIDE 53

Complete link

• Use the minimum similarity of pairs:

  sim(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} sim(x, y)

• Makes “tighter,” spherical clusters that are typically preferable.
• After merging C_i and C_j, the similarity of the resulting cluster to another cluster C_k:

  sim(C_i \cup C_j, C_k) = \min\bigl( sim(C_i, C_k),\, sim(C_j, C_k) \bigr)

(Sec. 17.2)
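A naive sketch (not the slides’ code) of HAC driven by the merge rules on slides 51 and 53: single link combines similarities with max, complete link with min. The hac() function and its arguments are illustrative, and the tiny similarity matrix at the end is made up.

```python
def hac(sim, link="single"):
    """Naive HAC over a precomputed symmetric similarity matrix sim (N x N).
    link="single" merges with max (slide 51); link="complete" merges with min (slide 53).
    Returns the sequence of merges as (cluster_a, cluster_b, similarity)."""
    combine = max if link == "single" else min
    n = len(sim)
    S = {i: {j: sim[i][j] for j in range(n) if j != i} for i in range(n)}
    merges = []
    while len(S) > 1:
        # Closest pair of current clusters = the pair with the highest similarity.
        a, b = max(((i, j) for i in S for j in S[i]), key=lambda ij: S[ij[0]][ij[1]])
        merges.append((a, b, S[a][b]))
        # sim(a ∪ b, k) = combine(sim(a, k), sim(b, k)) for every other cluster k.
        for k in S:
            if k not in (a, b):
                S[a][k] = S[k][a] = combine(S[a][k], S[b][k])
        del S[b]
        for k in S:
            S[k].pop(b, None)
    return merges

sims = [[1.0, 0.9, 0.2], [0.9, 1.0, 0.6], [0.2, 0.6, 1.0]]
print(hac(sims, link="single"))    # [(0, 1, 0.9), (0, 2, 0.6)]
print(hac(sims, link="complete"))  # [(0, 1, 0.9), (0, 2, 0.2)]
```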

SLIDE 54

Complete link: example

(Sec. 17.2)

SLIDE 55

Graph-theoretic interpretations

• s_k: similarity of the two clusters merged in step k
• G(s_k): graph that links all data points with a similarity of at least s_k.
• The clusters after step k in:
  • single-link clustering are the connected components of G(s_k)
  • complete-link clustering are maximal cliques of G(s_k).
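A small illustration of this interpretation using networkx; the similarity matrix and the threshold s_k are made up for the example.

```python
import networkx as nx

# Toy symmetric similarity matrix over 4 points, and a threshold s_k.
sim = [[1.0, 0.9, 0.8, 0.1],
       [0.9, 1.0, 0.2, 0.1],
       [0.8, 0.2, 1.0, 0.1],
       [0.1, 0.1, 0.1, 1.0]]
s_k = 0.5

# G(s_k): link all pairs of points with similarity >= s_k.
G = nx.Graph()
G.add_nodes_from(range(4))
for i in range(4):
    for j in range(i + 1, 4):
        if sim[i][j] >= s_k:
            G.add_edge(i, j)

print(list(nx.connected_components(G)))  # single-link view, e.g. [{0, 1, 2}, {3}]
print(list(nx.find_cliques(G)))          # complete-link view (maximal cliques), e.g. [[0, 1], [0, 2], [3]]
```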

SLIDE 56

Noise

SLIDE 57

Group average

• Similarity of two clusters = average similarity of all pairs within the merged cluster:

  sim(C_i, C_j) = \frac{1}{(|C_i| + |C_j|)(|C_i| + |C_j| - 1)} \sum_{x \in C_i \cup C_j} \; \sum_{\substack{y \in C_i \cup C_j \\ y \neq x}} sim(x, y)

• Compromise between single and complete link.

(Sec. 17.3)

SLIDE 58

Computing group average similarity

• When we use the dot product as the similarity measure, we can compute the similarity of two clusters in constant time if we maintain the sum of vectors in each cluster:

  s_j = \sum_{x \in C_j} x

• Compute the similarity of clusters:

  sim(C_i, C_j) = \frac{(s_i + s_j) \cdot (s_i + s_j) - (|C_i| + |C_j|)}{(|C_i| + |C_j|)\,(|C_i| + |C_j| - 1)}

If the lengths of the doc vectors are assumed to be one, GAAC requires: (i) documents represented as vectors, (ii) length normalization of the vectors, so that self-similarities are 1.0, and (iii) the dot product as the similarity measure between vectors and sums of vectors.

(Sec. 17.3)
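A NumPy check (not from the slides) that the constant-time formula above matches the brute-force average over all pairs, assuming unit-length vectors; the cluster sizes and dimensionality are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
Ci = rng.normal(size=(3, 4)); Ci /= np.linalg.norm(Ci, axis=1, keepdims=True)  # 3 unit vectors
Cj = rng.normal(size=(2, 4)); Cj /= np.linalg.norm(Cj, axis=1, keepdims=True)  # 2 unit vectors

# Constant-time formula from the maintained sum vectors s_i and s_j.
s = Ci.sum(axis=0) + Cj.sum(axis=0)
n = len(Ci) + len(Cj)
fast = (s @ s - n) / (n * (n - 1))

# Brute force: average dot product over all ordered pairs x != y in the merged cluster.
merged = np.vstack([Ci, Cj])
brute = np.mean([merged[a] @ merged[b] for a in range(n) for b in range(n) if a != b])

print(np.isclose(fast, brute))  # True
```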

SLIDE 59

Centroid Clustering

SLIDE 60

Inversions in centroid clustering

SLIDE 61

Computational complexity

• First iteration: all HAC methods need to compute the similarity of all pairs of instances, O(N²) similarity computations.
• N−1 merging iterations: compute the distance between the most recently created cluster and all other existing ones.
  • In order to maintain an overall O(N²) performance, computing the similarity to each other cluster must be done in constant time.
  • We can find the closest pair in O(N log N).
• Thus, the overall complexity is O(N³) if done naively, or O(N² log N) if done more cleverly using priority queues.

(Sec. 17.2.1)

SLIDE 62

SLIDE 63

Cluster labeling

• Cluster-internal labeling: depends on the cluster itself
  • Title of the doc closest to the centroid
    • Titles are easier to read than a list of terms
    • However, a single doc is unlikely to be representative of all docs in a cluster
  • A list of terms with high weights in the cluster centroid
• Differential cluster labeling: compare the distribution of terms in one cluster with that of other clusters.
  • We can use measures such as “Mutual Information” (a code sketch follows this slide):

  I(C_k; X_i) = \sum_{x_i \in \{0,1\}} \sum_{c_k \in \{0,1\}} P(x_i, c_k) \log \frac{P(x_i, c_k)}{P(x_i)\, P(c_k)}
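A sketch of this mutual-information score for a single (term, cluster) pair, computed from the four joint document counts; the function name and the example counts are illustrative (natural log is used; another base only rescales the score).

```python
import math

def term_cluster_mi(n11, n10, n01, n00):
    """MI between a term-occurrence indicator and a cluster-membership indicator.
    n11: docs in the cluster containing the term, n10: docs outside the cluster with the term,
    n01: docs in the cluster without the term,    n00: docs outside the cluster without the term."""
    n = n11 + n10 + n01 + n00
    mi = 0.0
    # For each of the four (term, cluster) outcomes: joint count, term-marginal count, cluster-marginal count.
    for n_xc, n_x, n_c in [(n11, n11 + n10, n11 + n01),
                           (n10, n11 + n10, n10 + n00),
                           (n01, n01 + n00, n11 + n01),
                           (n00, n01 + n00, n10 + n00)]:
        if n_xc > 0:
            mi += (n_xc / n) * math.log((n * n_xc) / (n_x * n_c))
    return mi

# Illustrative counts for one term and one cluster:
print(term_cluster_mi(n11=40, n10=10, n01=60, n00=890))
```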

SLIDE 64

Cluster Labeling

SLIDE 65

Final word and resources

• In clustering, clusters are inferred from the data without human input (unsupervised learning)
• However, in practice, it’s a bit less clear
  • There are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of docs, …
• Resources
  • IIR 16, except 16.5
  • IIR 17, except 17.5