

slide-1
SLIDE 1

Clustering

CE-324: Modern Information Retrieval

Sharif University of Technology

  • M. Soleymani

Spring 2020

Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

slide-2
SLIDE 2

What is clustering?

} Clustering: grouping a set of objects into subsets (clusters) of similar objects

} Docs within a cluster should be similar.
} Docs from different clusters should be dissimilar.

} The commonest form of unsupervised learning

} Unsupervised learning
} learning from raw data, as opposed to supervised data where a classification of examples is given

} A common and important task that finds many applications in IR and other places

  • Ch. 16


slide-3
SLIDE 3

A data set with clear cluster structure

} How would you design an algorithm for finding the three clusters in this case?

  • Ch. 16


slide-4
SLIDE 4

Applications of clustering in IR

} For better navigation of search results

} Effective “user recall” will be higher

} Whole corpus analysis/navigation

} Better user interface: search without typing

} For improving recall in search applications

} Better search results (like pseudo RF)

} For speeding up vector space retrieval

} Cluster-based retrieval gives faster search

  • Sec. 16.1


slide-5
SLIDE 5

Applications of clustering in IR


slide-6
SLIDE 6

Search result clustering


slide-7
SLIDE 7


yippy.com – grouping search results

slide-8
SLIDE 8

Clustering the collection

} Cluster-based navigation is an interesting alternative to keyword searching (i.e., the standard IR paradigm)

} Users may prefer browsing over searching when they are unsure about which terms to use

} Well suited to a collection of news stories

} News reading is not really search, but rather a process of selecting a subset of stories about recent events

slide-9
SLIDE 9

Google News: automatic clustering gives an effective news presentation metaphor


slide-10
SLIDE 10


slide-11
SLIDE 11

To improve efficiency and effectiveness of the search system

} Improve language modeling: replace the collection model used for smoothing by a model derived from the doc's cluster
} Clustering can speed up search (via an inexact algorithm)
} Clustering can improve recall

slide-12
SLIDE 12

For improving search recall

} Cluster hypothesis: docs in the same cluster behave similarly with respect to relevance to information needs

} Therefore, to improve search recall:

} Cluster docs in the corpus a priori
} When a query matches a doc 𝑑, also return other docs in the cluster containing 𝑑

} Query car: also return docs containing automobile

} Because clustering grouped together docs containing car with those containing automobile.

Why might this happen?

  • Sec. 16.1


slide-13
SLIDE 13

Issues for clustering

} Representation for clustering
} Doc representation
} Vector space? Normalization?
} Centroids aren't length normalized

} Need a notion of similarity/distance

} How many clusters?
} Fixed a priori?
} Completely data driven?

} Avoid “trivial” clusters - too large or small
¨ too large: for navigation purposes you've wasted an extra user click without whittling down the set of docs much

  • Sec. 16.2


slide-14
SLIDE 14

Notion of similarity/distance

} Ideal: semantic similarity
} Practical: term-statistical similarity

} We will use cosine similarity.

} For many algorithms, it is easier to think in terms of a distance (rather than a similarity)

} We will mostly speak of Euclidean distance

¨ But real implementations use cosine similarity


slide-15
SLIDE 15

Clustering algorithms categorization

} Flat algorithms (k-means)

} Usually start with a random (partial) partitioning
} Refine it iteratively

} Hierarchical algorithms

} Bottom-up, agglomerative
} Top-down, divisive

slide-16
SLIDE 16

Hard vs. soft clustering

} Hard clustering: Each doc belongs to exactly one cluster

} More common and easier to do

} Soft clustering: A doc can belong to more than one cluster.


slide-17
SLIDE 17

Partitioning algorithms

} Construct a partition of 𝑁 docs into 𝐾 clusters
} Given: a set of docs and the number 𝐾
} Find: a partition of the docs into 𝐾 clusters that optimizes the chosen partitioning criterion

} Finding a global optimum is intractable for many clustering objective functions

} Effective heuristic methods: K-means and K-medoids algorithms


slide-18
SLIDE 18

K-means Clustering

} Input: data $\{\boldsymbol{y}^{(1)}, \ldots, \boldsymbol{y}^{(N)}\}$ and the number of clusters $K$
} Output: clusters $\mathcal{D}_1, \ldots, \mathcal{D}_K$
} Optimization problem:

$$J(\mathcal{D}) = \sum_{k=1}^{K} \sum_{\boldsymbol{y}^{(i)} \in \mathcal{D}_k} \left\| \boldsymbol{y}^{(i)} - \boldsymbol{d}_k \right\|^2$$

} This is an NP-hard problem in general.

slide-19
SLIDE 19

K-means

} Assumes docs are real-valued vectors $\boldsymbol{y}^{(1)}, \ldots, \boldsymbol{y}^{(N)}$
} Clusters based on centroids (aka the center of gravity or mean) of the points in a cluster:

$$\boldsymbol{\mu}_k = \frac{1}{|\mathcal{D}_k|} \sum_{\boldsymbol{y}^{(i)} \in \mathcal{D}_k} \boldsymbol{y}^{(i)}$$

} K-means cost function:

$$J(\mathcal{D}) = \sum_{k=1}^{K} \sum_{\boldsymbol{y}^{(i)} \in \mathcal{D}_k} \left\| \boldsymbol{y}^{(i)} - \boldsymbol{\mu}_k \right\|^2$$

  • Sec. 16.4

$\mathcal{D} = \{\mathcal{D}_1, \mathcal{D}_2, \ldots, \mathcal{D}_K\}$, where $\mathcal{D}_k$ is the set of data points assigned to the $k$-th cluster

slide-20
SLIDE 20

K-means algorithm

Select K random points $\{\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \ldots, \boldsymbol{\mu}_K\}$ as the clusters' initial centroids.
Until clustering converges (or another stopping criterion is met):
    For each doc $\boldsymbol{y}^{(i)}$:
        Assign $\boldsymbol{y}^{(i)}$ to the cluster $\mathcal{D}_k$ such that $dist(\boldsymbol{y}^{(i)}, \boldsymbol{\mu}_k)$ is minimal.
    For each cluster $\mathcal{D}_k$:
        $\boldsymbol{\mu}_k = \frac{1}{|\mathcal{D}_k|} \sum_{\boldsymbol{y}^{(i)} \in \mathcal{D}_k} \boldsymbol{y}^{(i)}$

  • Sec. 16.4

Reassignment of instances to clusters is based on distance to the current cluster centroids (can equivalently be in terms of similarities)

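A minimal NumPy sketch of the algorithm above, for concreteness; the function and variable names (kmeans, Y, K) and the toy data are illustrative rather than from the slides, and a real IR implementation would use sparse tf-idf vectors with cosine similarity instead of dense Euclidean distance.

```python
import numpy as np

def kmeans(Y, K, max_iter=100, tol=1e-6, seed=0):
    """Lloyd-style K-means on the rows of Y; returns (assignments, centroids, cost)."""
    rng = np.random.default_rng(seed)
    # Select K random points as the clusters' initial centroids.
    centroids = Y[rng.choice(len(Y), size=K, replace=False)]
    prev_cost = np.inf
    for _ in range(max_iter):
        # Reassignment step: each doc goes to the cluster with the nearest centroid.
        dists = np.linalg.norm(Y[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Recomputation step: each centroid becomes the mean of its assigned docs.
        for k in range(K):
            members = Y[assign == k]
            if len(members) > 0:              # keep the old centroid if a cluster empties
                centroids[k] = members.mean(axis=0)
        # Cost J(D): sum of squared distances to the assigned centroids.
        cost = ((Y - centroids[assign]) ** 2).sum()
        if prev_cost - cost < tol:            # stop when the decrease falls below a threshold
            break
        prev_cost = cost
    return assign, centroids, cost

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Three well-separated 2-D blobs, so three clusters should be recovered.
    Y = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
                   for c in ([0, 0], [4, 0], [2, 3])])
    assign, centroids, cost = kmeans(Y, K=3)
    print("final cost:", round(cost, 2), "cluster sizes:", np.bincount(assign))
```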

slide-21
SLIDE 21


[Bishop]

slide-22
SLIDE 22


slide-23
SLIDE 23

Termination conditions

} Several possibilities for the termination condition, e.g.,
} A fixed number of iterations
} Doc partition unchanged
} $J < \theta$: the cost function falls below a threshold
} $\Delta J < \theta$: the decrease in the cost function (in two successive iterations) falls below a threshold

  • Sec. 16.4


slide-24
SLIDE 24

Convergence of K-means

} The K-means algorithm always reaches a fixed point in which the clusters no longer change.
} We must use tie-breaking when a sample is at the same distance from two or more clusters (e.g., assign it to the cluster with the lowest index)

  • Sec. 16.4


slide-25
SLIDE 25

K-means decreases $J(\mathcal{D})$ in each iteration (before convergence)

} First, reassignment monotonically decreases $J(\mathcal{D})$, since each vector is assigned to the closest centroid.
} Second, recomputation monotonically decreases each $\sum_{\boldsymbol{y}^{(i)} \in \mathcal{D}_k} \|\boldsymbol{y}^{(i)} - \boldsymbol{\mu}_k\|^2$:
} $\sum_{\boldsymbol{y}^{(i)} \in \mathcal{D}_k} \|\boldsymbol{y}^{(i)} - \boldsymbol{\mu}_k\|^2$ reaches its minimum for $\boldsymbol{\mu}_k = \frac{1}{|\mathcal{D}_k|} \sum_{\boldsymbol{y}^{(i)} \in \mathcal{D}_k} \boldsymbol{y}^{(i)}$

} K-means typically converges quickly

  • Sec. 16.4


slide-26
SLIDE 26

Time complexity of K-means

} Computing the distance between two docs: $O(M)$
} $M$ is the dimensionality of the vectors.
} Reassigning clusters: $O(KN)$ distance computations ⇒ $O(KNM)$.
} Computing centroids: each doc gets added once to some centroid: $O(NM)$.
} Assume these two steps are each done once for $I$ iterations: $O(IKNM)$.

  • Sec. 16.4


slide-27
SLIDE 27

Seed choice

} Results can vary based on the random selection of initial centroids.
} Some initializations get a poor convergence rate, or convergence to a sub-optimal clustering
} Exclude outliers from the seed set
} Try out multiple starting points and choose the clustering with the lowest cost
} Select good seeds using a heuristic (e.g., the doc least similar to any existing mean; see the sketch below)
} Obtain seeds from another method such as hierarchical clustering

If you start with B and E as centroids, you converge to {A,B,C} and {D,E,F}. If you start with D and F, you converge to {A,B,D,E} and {C,F}.

Example showing sensitivity to seeds

  • Sec. 16.4

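A hedged sketch of the "least similar to any existing mean" seeding heuristic mentioned above (a furthest-first style traversal); it assumes length-normalized doc vectors with cosine similarity, and the names (furthest_first_seeds, Y) are illustrative.

```python
import numpy as np

def furthest_first_seeds(Y, K, first=0):
    """Pick K seed indices: each new seed is the doc least similar to the seeds so far."""
    seeds = [first]                                   # start from an arbitrary doc
    sims = Y @ Y[first]                               # cosine similarity to the first seed
    for _ in range(K - 1):
        nxt = int(np.argmin(sims))                    # doc least similar to any chosen seed
        seeds.append(nxt)
        sims = np.maximum(sims, Y @ Y[nxt])           # track max similarity to the seed set
    return seeds

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Y = rng.normal(size=(100, 5))
    Y /= np.linalg.norm(Y, axis=1, keepdims=True)     # length-normalize the vectors
    print(furthest_first_seeds(Y, K=3))
```

In practice one would also exclude outliers before running this, since the doc least similar to the current seeds is often an outlier.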

slide-28
SLIDE 28

K-means issues, variations, etc.

} Computes the centroid after all points are re-assigned

} Instead, we can re-compute the centroid after every assignment
} It can improve the speed of convergence of K-means

} Assumes clusters are spherical in vector space

} Sensitive to coordinate changes, weighting etc.

} Disjoint and exhaustive

} Doesn’t have a notion of “outliers” by default

} But can add outlier filtering

  • Sec. 16.4

Dhillon et al. ICDM 2002 – variation to fix some issues with small document clusters


slide-29
SLIDE 29

How many clusters?

} Number of clusters $K$ is given
} Partition $N$ docs into a predetermined number of clusters
} Finding the "right" number is part of the problem
} Given docs, partition them into an "appropriate" number of subsets.
} E.g., for query results - the ideal value of K is not known up front - though the UI may impose limits.

slide-30
SLIDE 30

How many clusters?

(Figure: the same data set partitioned into two, four, and six clusters - how many clusters?)

slide-31
SLIDE 31

Selecting k

} Is it possible by assessing the cost function for different numbers of clusters?
} Keep adding clusters until adding more no longer decreases the error significantly (e.g., find the knee of the cost curve)

slide-32
SLIDE 32

K not specified in advance

} Tradeoff between having better focus within each cluster and having too many clusters
} Solve an optimization problem: penalize having lots of clusters

} application dependent

} e.g., compressed summary of search results list.

$$k^* = \arg\min_{k} \left[ J_{\min}(k) + \lambda k \right]$$

$J_{\min}(k)$: the minimum value of $J(\{\mathcal{D}_1, \mathcal{D}_2, \ldots, \mathcal{D}_k\})$ obtained in, e.g., 100 runs of k-means (with different initializations)
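A minimal sketch of this model-selection rule, assuming the kmeans() sketch from the K-means slide is available; the penalty weight lam and the number of restarts are illustrative knobs, not values from the slides.

```python
def select_k(Y, kmeans_fn, k_values, lam=1.0, restarts=10):
    """Pick k* = argmin_k [ J_min(k) + lam * k ], with J_min(k) taken over several restarts."""
    j_min = {k: min(kmeans_fn(Y, K=k, seed=s)[2] for s in range(restarts))
             for k in k_values}
    penalized = {k: j_min[k] + lam * k for k in k_values}
    return min(penalized, key=penalized.get), j_min, penalized

# Hypothetical usage: k_star, _, _ = select_k(Y, kmeans, k_values=range(1, 11), lam=5.0)
```

The penalty weight $\lambda$ controls the tradeoff above: larger values favor fewer clusters.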

slide-33
SLIDE 33

What is a good clustering?

} Internal criterion:
} intra-class (that is, intra-cluster) similarity is high
} inter-class similarity is low

} The measured quality of a clustering depends on both the doc representation and the similarity measure

  • Sec. 16.3


slide-34
SLIDE 34

External criteria for clustering quality

} Quality: ability to discover some or all of the patterns in gold standard data
} Assesses a clustering with respect to ground truth … requires labeled data
} Gold standard: classes $c_1, \ldots, c_J$, while the clustering produces K clusters $\omega_1, \omega_2, \ldots, \omega_K$

  • Sec. 16.3


slide-35
SLIDE 35

External criteria

} Purity
} Rand Index
} F measure
} NMI

slide-36
SLIDE 36
Cluster I: Purity = (1/6) × max(5, 1, 0) = 5/6
Cluster II: Purity = (1/6) × max(1, 4, 1) = 4/6
Cluster III: Purity = (1/5) × max(2, 0, 3) = 3/5

Purity example

  • Sec. 16.3


Total Purity = (1/17) × (5 + 4 + 3) = 12/17
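A quick NumPy check of the purity numbers in this example; the rows of the (illustrative) contingency matrix are clusters, the columns are gold classes, and the entries are the doc counts read off the figure.

```python
import numpy as np

counts = np.array([[5, 1, 0],    # Cluster I
                   [1, 4, 1],    # Cluster II
                   [2, 0, 3]])   # Cluster III

per_cluster = counts.max(axis=1) / counts.sum(axis=1)    # purity of each cluster
total = counts.max(axis=1).sum() / counts.sum()          # total purity

print(per_cluster)   # [0.833... 0.666... 0.6]  i.e. 5/6, 4/6, 3/5
print(total)         # 0.705...                 i.e. 12/17
```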

slide-37
SLIDE 37

External evaluation of cluster quality

} Purity: ratio between the size of the dominant class in cluster $\omega_i$ and the size of cluster $\omega_i$

$$\mathrm{purity}(\omega_i) = \frac{1}{|\omega_i|} \max_{j=1,\ldots,J} |\omega_i \cap c_j|$$

$$\mathrm{TotalPurity}(\omega_1, \ldots, \omega_K) = \frac{1}{N} \sum_{i=1}^{K} \max_{j=1,\ldots,J} |\omega_i \cap c_j| = \sum_{i=1}^{K} \mathrm{purity}(\omega_i) \times \frac{|\omega_i|}{N}$$

} Biased because having $N$ clusters (one per doc) maximizes purity

  • Sec. 16.3


slide-38
SLIDE 38

Rand Index (RI)

Number of point pairs | Same cluster in clustering | Different clusters in clustering
Same class in ground truth | TP = 20 | FN = 24
Different classes in ground truth | FP = 20 | TN = 72

  • Sec. 16.3

RI measures the fraction of correct decisions over pairs of points: RI = (20 + 72) / 136 ≈ 0.68


slide-39
SLIDE 39

Rand index and F-measure

TP P TP FP = +

TP TN RI TP FP TN FN + = + + +

Compare with standard Precision and Recall: People also define and use a cluster F-measure, which is probably a better measure.

  • Sec. 16.3

39

𝐺

j = 𝛾8 + 1 𝑄𝑆

𝛾8𝑄 + 𝑆

𝑆 = 𝑈𝑄 𝑈𝑄 + 𝐺𝑂
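A short check of these pair-based metrics on the counts from the Rand-index slide (TP = 20, FP = 20, FN = 24, TN = 72); the function name is illustrative.

```python
def pair_metrics(tp, fp, fn, tn, beta=1.0):
    """Rand index, pairwise precision/recall, and F_beta from pair-decision counts."""
    ri = (tp + tn) / (tp + fp + fn + tn)
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f = (beta**2 + 1) * p * r / (beta**2 * p + r)
    return ri, p, r, f

ri, p, r, f1 = pair_metrics(tp=20, fp=20, fn=24, tn=72)
print(round(ri, 2), round(p, 2), round(r, 2), round(f1, 2))   # 0.68 0.5 0.45 0.48
```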

slide-40
SLIDE 40

Normalized Mutual Information (NMI)

$$I(\Omega; C) = \sum_{k} \sum_{j} P(\omega_k, c_j) \log \frac{P(\omega_k, c_j)}{P(\omega_k)\,P(c_j)}$$

$$P(\omega_k, c_j) = \frac{\#\{\text{data assigned to } \omega_k \text{ whose desired class is } c_j\}}{N}$$

$$P(\omega_k) = \frac{\#\{\text{data assigned to } \omega_k\}}{N} \qquad P(c_j) = \frac{\#\{\text{data with desired class } c_j\}}{N}$$

$$NMI(\Omega; C) = \frac{I(\Omega; C)}{\left[ H(\Omega) + H(C) \right] / 2}$$
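A hedged NumPy sketch of NMI computed from flat label arrays (one cluster id and one gold-class id per doc); the names nmi, clusters, classes and the tiny example are illustrative.

```python
import numpy as np

def nmi(clusters, classes):
    clusters, classes = np.asarray(clusters), np.asarray(classes)
    # Joint distribution P(omega_k, c_j) estimated from counts.
    p_joint = np.array([[np.mean((clusters == w) & (classes == c))
                         for c in np.unique(classes)]
                        for w in np.unique(clusters)])
    p_w = p_joint.sum(axis=1)           # P(omega_k)
    p_c = p_joint.sum(axis=0)           # P(c_j)
    mask = p_joint > 0                  # skip empty cells in the sum
    mi = (p_joint[mask] * np.log(p_joint[mask] / np.outer(p_w, p_c)[mask])).sum()
    h_w = -(p_w * np.log(p_w)).sum()    # H(Omega)
    h_c = -(p_c * np.log(p_c)).sum()    # H(C)
    return mi / ((h_w + h_c) / 2)

print(round(nmi([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 1, 2]), 3))
```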

slide-41
SLIDE 41

Hierarchical clustering

} Build a tree-based hierarchical taxonomy (dendrogram) from a set of docs.
} Outputs a hierarchy, a structure that is more informative
} Does not require a pre-specified number of clusters
} But hierarchical algorithms are less efficient.

(Example taxonomy: animal splits into vertebrate {fish, reptile, amphib., mammal} and invertebrate {worm, insect, crustacean})

  • Ch. 17


slide-42
SLIDE 42

Hierarchical clustering: Approaches

} Divisive approach: recursive application of a partitional clustering algorithm.
} Agglomerative approach:
} treat each doc as a singleton cluster at the outset
} then successively merge (or agglomerate) pairs of clusters until all clusters have been merged into a single cluster

slide-43
SLIDE 43

Hierarchical clustering

} Flat clustering may not discover enough structure


slide-46
SLIDE 46

Hierarchical Agglomerative Clustering (HAC)

} Starts with each doc in a separate cluster
} then repeatedly joins the closest pair of clusters, until there is only one cluster.
} The history of merging forms a binary tree or hierarchy.
} keeping track of the order in which clusters were merged results in a "hierarchy"

  • Sec. 17.1

Note: the resulting clusters are still “hard” and induce a partition


slide-47
SLIDE 47

Example

} Hierarchical Agglomerative Clustering (HAC)


slide-48
SLIDE 48

Dendrogram: hierarchical clustering

} Each merge is represented by a horizontal line
} The y-coordinate of the horizontal line is the similarity of the two clusters that were merged
} Clustering obtained by cutting the dendrogram at a desired level:
} Each connected component forms a cluster.

slide-49
SLIDE 49


slide-50
SLIDE 50

How to cut the hierarchy?

} Cut where the gap between two successive combination similarities is largest.
} Cut at a prespecified level of similarity
} Prespecify the number of clusters K and select the cutting point that produces K clusters.
} Find the number of clusters as: $k^* = \arg\min_{k} \left[ J_{\min}(k) + \lambda k \right]$

slide-51
SLIDE 51

Works even when we only access similarities

} Hierarchical methods work even when the data are not defined by feature vectors but by their relationships or similarities

slide-52
SLIDE 52

Closest pair of clusters

} Many variants to defining closest pair of clusters

} Single-link

} Similarity of the most similar pair (single-link)

} Complete-link

} Similarity of the “furthest” points, the least similar pair

} Centroid

} Clusters whose centroids (centers of gravity) are the most similar

} Average-link

} Average similarities between pairs of elements

  • Sec. 17.2


slide-53
SLIDE 53

Distances between cluster pairs

(Figure: the distance between a pair of clusters under the single-link, complete-link, centroid, and average-link criteria)
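A hedged SciPy sketch comparing these linkage criteria on toy 2-D data; the data, the number of clusters, and the variable names are illustrative, while linkage and fcluster are the standard scipy.cluster.hierarchy entry points.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Three loose blobs of points.
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(20, 2))
               for c in ([0, 0], [3, 0], [1.5, 2.5])])

for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)                      # merge history (the dendrogram)
    labels = fcluster(Z, t=3, criterion="maxclust")    # cut the dendrogram into 3 clusters
    print(method, np.bincount(labels)[1:])             # cluster sizes per linkage criterion
```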

slide-54
SLIDE 54

Single link

} Use maximum similarity of pairs:

$$sim(C_i, C_j) = \max_{x \in C_i,\ y \in C_j} sim(x, y)$$

} Can result in "straggly" (long and thin) clusters
} due to the chaining effect.
} After merging $C_i$ and $C_j$, the similarity of the resulting cluster to another cluster, $C_k$:

$$sim\big(C_i \cup C_j,\ C_k\big) = \max\big(sim(C_i, C_k),\ sim(C_j, C_k)\big)$$

  • Sec. 17.2


slide-55
SLIDE 55

Single link: example

  • Sec. 17.2


slide-56
SLIDE 56

Complete link

} Use minimum similarity of pairs:

$$sim(C_i, C_j) = \min_{x \in C_i,\ y \in C_j} sim(x, y)$$

} Makes "tighter," spherical clusters that are typically preferable.
} After merging $C_i$ and $C_j$, the similarity of the resulting cluster to another cluster, $C_k$:

$$sim\big(C_i \cup C_j,\ C_k\big) = \min\big(sim(C_i, C_k),\ sim(C_j, C_k)\big)$$

  • Sec. 17.2


slide-57
SLIDE 57

Complete link: example

  • Sec. 17.2


slide-58
SLIDE 58

Graph-theoretic interpretations

} $s_k$: similarity of the two clusters merged in step $k$
} $G(s_k)$: graph that links all data points with a similarity of at least $s_k$
} The clusters after step $k$ in:
} single-link clustering are the connected components of $G(s_k)$
} complete-link clustering are the maximal cliques of $G(s_k)$

slide-59
SLIDE 59

Noise


slide-60
SLIDE 60

Group average

} Similarity of two clusters = average similarity of all pairs within the merged cluster.
} Compromise between single and complete link.

$$sim(C_i, C_j) = \frac{1}{|C_i \cup C_j|\,\big(|C_i \cup C_j| - 1\big)} \sum_{\vec{x} \in C_i \cup C_j} \ \sum_{\substack{\vec{y} \in C_i \cup C_j \\ \vec{y} \neq \vec{x}}} sim(\vec{x}, \vec{y})$$

  • Sec. 17.3


slide-61
SLIDE 61

Computing group average similarity

} When we use the dot product as the similarity measure, we can compute the similarity of two clusters in constant time if we maintain the sum of vectors in each cluster.

} Compute similarity of clusters:

$$\vec{s}_j = \sum_{\vec{x} \in C_j} \vec{x}$$

$$sim(C_i, C_j) = \frac{(\vec{s}_i + \vec{s}_j) \cdot (\vec{s}_i + \vec{s}_j) - \big(|C_i| + |C_j|\big)}{\big(|C_i| + |C_j|\big)\big(|C_i| + |C_j| - 1\big)}$$

  • Sec. 17.3

If the lengths of the doc vectors are assumed to be one, GAAC requires: (i) documents represented as vectors, (ii) length normalization of vectors, so that self-similarities are 1.0, and (iii) the dot product as the similarity measure between vectors and sums of vectors.
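A minimal NumPy sketch of this constant-time trick, assuming length-normalized vectors and dot-product similarity; the names (group_average_sim, A, B) are illustrative. A brute-force average over all pairs is included to check that the two values agree.

```python
import numpy as np

def group_average_sim(sum_i, n_i, sum_j, n_j):
    """Group-average similarity of two clusters from their sum vectors and sizes."""
    s = sum_i + sum_j
    n = n_i + n_j
    return (s @ s - n) / (n * (n - 1))    # subtract the n self-similarities (each equals 1.0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.normal(size=(4, 3)); A /= np.linalg.norm(A, axis=1, keepdims=True)
    B = rng.normal(size=(3, 3)); B /= np.linalg.norm(B, axis=1, keepdims=True)

    fast = group_average_sim(A.sum(axis=0), len(A), B.sum(axis=0), len(B))

    # Brute force: average sim over all ordered pairs of distinct vectors in the merged cluster.
    M = np.vstack([A, B])
    S = M @ M.T
    brute = (S.sum() - np.trace(S)) / (len(M) * (len(M) - 1))
    print(round(fast, 6), round(brute, 6))   # the two values should agree
```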

slide-62
SLIDE 62

Centroid Clustering


slide-63
SLIDE 63

Inversions in centroid clustering


slide-64
SLIDE 64

Computational complexity

} First iteration: all HAC methods need to compute the similarity of all pairs of instances: $O(N^2)$ similarity computations.
} $N-1$ merging iterations: compute the distance between the most recently created cluster and all other existing clusters.
} In order to maintain an overall $O(N^2)$ performance, computing the similarity to each other cluster must be done in constant time.
} We can find the closest pair in $O(N \log N)$
} Thus, the overall complexity is $O(N^3)$ if done naively, or $O(N^2 \log N)$ if done more cleverly using priority queues

  • Sec. 17.2.1


slide-65
SLIDE 65


slide-66
SLIDE 66

Hierarchical clustering methods


slide-67
SLIDE 67

Cluster labeling

} Cluster internal labeling: depends on the cluster itself
} Title of the doc closest to the centroid
} Titles are easier to read than a list of terms
} However, a single doc is unlikely to be representative of all docs in a cluster
} A list of terms with high weights in the cluster centroid

} Differential cluster labeling: comparing the distribution of terms in one cluster with that of the other clusters.
} We can use measures such as Mutual Information:

$$I(C_k; X_i) = \sum_{x_i \in \{0,1\}} \ \sum_{c_k \in \{0,1\}} P(x_i, c_k) \log \frac{P(x_i, c_k)}{P(x_i)\,P(c_k)}$$
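A hedged sketch of this differential labeling idea: for each term, compute the MI between the binary indicators "term present in the doc" and "doc belongs to the cluster", then label the cluster with the highest-MI terms. The toy vocabulary and doc-term matrix are illustrative.

```python
import numpy as np

def term_cluster_mi(presence, in_cluster):
    """MI between two binary indicator vectors defined over the same docs."""
    mi = 0.0
    for x in (0, 1):
        for c in (0, 1):
            p_xc = np.mean((presence == x) & (in_cluster == c))
            p_x, p_c = np.mean(presence == x), np.mean(in_cluster == c)
            if p_xc > 0:
                mi += p_xc * np.log(p_xc / (p_x * p_c))
    return mi

if __name__ == "__main__":
    vocab = ["car", "automobile", "football", "league"]
    # Rows = docs, columns = term presence (0/1); the first three docs form the cluster.
    X = np.array([[1, 1, 0, 0],
                  [1, 0, 0, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]])
    in_cluster = np.array([1, 1, 1, 0, 0, 0])
    scores = {t: term_cluster_mi(X[:, i], in_cluster) for i, t in enumerate(vocab)}
    for term in sorted(scores, key=scores.get, reverse=True):
        print(term, round(scores[term], 3))   # high-MI terms are label candidates
```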

slide-68
SLIDE 68

Cluster Labeling


slide-69
SLIDE 69

Final word and resources

} In clustering, clusters are inferred from the data without human input (unsupervised learning)
} However, in practice, it's a bit less clear
} there are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of docs, . . .

} Resources
} IIR 16 except 16.5
} IIR 17 except 17.5