Evaluation of Hierarchical Clustering Algorithms for Document Datasets

Evaluation of Hierarchical Clustering Algorithms for Document Datasets. Paper by Ying Zhao and George Karypis, University of Minnesota (2002). CS 6501 Paper Presentation, April 6, 2016. Matthew Hawthorn, Nikhil Mascarenhas, Shannon Mitchell.


slide-1
SLIDE 1

Evaluation of Hierarchical Clustering Algorithms for Document Datasets

Paper by Ying Zhao and George Karypis, University of Minnesota (2002)
CS 6501 Paper Presentation, April 6, 2016
Matthew Hawthorn, Nikhil Mascarenhas, Shannon Mitchell

slide-2
SLIDE 2

Motivation

  • Hierarchical clustering of documents

○ Intuitive; provides clusterings at different levels of granularity.

  • Two major approaches

○ Partitional
○ Agglomerative

  • The general view was that partitional algorithms are inferior to agglomerative ones
  • The authors ran an experiment to compare these approaches
  • They also defined a new hybrid algorithm, the “constrained agglomerative algorithm”
slide-4
SLIDE 4

Hierarchical Clustering: Partitional Algorithms

Top-down

  • Start with one cluster containing all documents
  • Start at the root, divide down to the leaves
  • Split the cluster which most improves the criterion function
  • Complexity: O(n log n)
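
The top-down procedure above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the cosine 2-means used for each bisection, the seeding rule, and the use of the sum of composite-vector norms as the criterion are all assumptions, and the names `bisect`, `gain`, and `partitional_tree` are hypothetical. It returns only the k leaves; recording each split would give the tree.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def unit(v):
    n = norm(v)
    return [x / n for x in v] if n else list(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def composite(docs):
    # Component-wise sum of the document vectors in a cluster.
    return [sum(col) for col in zip(*docs)]

def bisect(docs, iters=10):
    """Split one cluster in two with a tiny cosine 2-means (assumed method)."""
    seed_a = docs[0]
    seed_b = min(docs, key=lambda d: dot(d, seed_a))  # least similar to seed_a
    centers = [unit(seed_a), unit(seed_b)]
    parts = [docs, []]
    for _ in range(iters):
        new = [[], []]
        for d in docs:
            side = 0 if dot(d, centers[0]) >= dot(d, centers[1]) else 1
            new[side].append(d)
        if not new[0] or not new[1]:
            break  # degenerate split, keep the previous assignment
        parts = new
        centers = [unit(composite(p)) for p in parts]
    return parts

def gain(parts, docs):
    # Improvement of the (assumed) criterion: sum of composite-vector norms.
    return sum(norm(composite(p)) for p in parts if p) - norm(composite(docs))

def partitional_tree(docs, k):
    """Top-down: start from one cluster and bisect until k leaves exist."""
    leaves = [[unit(d) for d in docs]]
    while len(leaves) < k:
        best = None
        for i, cluster in enumerate(leaves):
            if len(cluster) < 2:
                continue
            parts = bisect(cluster)
            if not parts[1]:
                continue  # this cluster cannot be split further
            g = gain(parts, cluster)
            if best is None or g > best[0]:
                best = (g, i, parts)
        if best is None:
            break
        _, i, parts = best
        leaves[i:i + 1] = parts  # replace the chosen leaf by its two children
    return leaves
```

Trying a bisection of every leaf at every step is done here only for clarity; it is more work than a practical implementation would do.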
slide-13
SLIDE 13

Hierarchical Clustering: Agglomerative Algorithms

Bottom-up

  • Each document starts as its own cluster
  • Start at the leaves, merge up to the root
  • Complexity:

○ O(n^2 log n) when caching of intermediate values of the objective is possible
○ O(n^3) otherwise
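
As a rough illustration of the bottom-up procedure, here is a minimal sketch of the uncached variant. Group-average linkage over cosine distance is an assumption made for the example, and `agglomerate` is a hypothetical name. Every pairwise cluster distance is recomputed before each merge, which is what makes this the slow case.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def average_link(c1, c2, docs):
    # Mean pairwise distance between the members of two clusters.
    total = sum(cosine_distance(docs[i], docs[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerate(docs):
    """Start from singletons and merge the closest pair until one root remains.

    Returns the merges in order, each as (left_members, right_members),
    where members are tuples of document indices.
    """
    clusters = [(i,) for i in range(len(docs))]
    merges = []
    while len(clusters) > 1:
        pairs = [
            (average_link(clusters[a], clusters[b], docs), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        ]
        _, a, b = min(pairs)  # closest pair of clusters
        merges.append((clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return merges
```

Keeping the pairwise values in a priority queue and updating them after each merge, where the objective allows it, is what brings the cost down to the cached bound.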

slide-23
SLIDE 23

Criterion Functions

Global criterion functions drive the clustering process.

  • Internal functions: consider only the documents within a cluster
  • External functions: consider how the various clusters differ from each other
  • Graph-based functions: construct a graph which represents the relationships between documents
  • Hybrid functions: simultaneously consider internal and external criterion functions

slide-24
SLIDE 24

Internal Criterion Functions

  • m: number of terms
  • n: number of documents
  • k: number of clusters
  • S1, S2, ..., Sk: each one of the k clusters
  • n1, n2, ..., nk: size of each cluster
  • d1, d2, ..., dn: TF-IDF vector for a document
  • DA: sum of all vectors in cluster A
  • CA: centroid vector of cluster A
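
The notation above is shared by the internal and external criterion functions. The sketch below makes DA and CA concrete and shows one internal-style and one external-style score built from them. The two scoring functions are illustrative stand-ins written from the verbal descriptions of internal and external functions, and are not necessarily the exact Internal-1, Internal-2, or External formulas evaluated in the paper. The function names are hypothetical.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

def composite(cluster):
    """D_A: component-wise sum of the document vectors in cluster A."""
    return [sum(col) for col in zip(*cluster)]

def centroid(cluster):
    """C_A = D_A / n_A."""
    return [x / len(cluster) for x in composite(cluster)]

def internal_score(clusters):
    # Looks only inside each cluster: similarity of every document
    # to the centroid of its own cluster (higher = more cohesive).
    return sum(cosine(d, centroid(S)) for S in clusters for d in S)

def external_score(clusters):
    # Looks across clusters: size-weighted similarity of each cluster
    # centroid to the centroid of the whole collection
    # (lower = clusters are more distinct from the collection as a whole).
    everything = [d for S in clusters for d in S]
    C = centroid(everything)
    return sum(len(S) * cosine(centroid(S), C) for S in clusters)
```

For unit-length document vectors, the internal score of a cluster equals the length of its composite vector DA, so it can be computed without visiting each document pair.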

slide-25
SLIDE 25

External Criterion Functions


slide-26
SLIDE 26

Traditional Agglomerative Clustering Criteria

  • Single-linkage: minimum distance (authors’ abbreviation: ‘slink’)
  • Complete-linkage: maximum distance (authors’ abbreviation: ‘clink’)
  • Group average: average of distances (authors’ abbreviation: ‘UPGMA’)
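
The three traditional criteria differ only in how they reduce the pairwise distances between two clusters to a single number. A minimal sketch, assuming Euclidean distance purely for the example (the presentation's vector space model uses cosine similarity):

```python
import math

def pairwise(A, B):
    # All distances between a member of A and a member of B.
    return [math.dist(a, b) for a in A for b in B]

def single_link(A, B):
    """slink: distance between the two closest members."""
    return min(pairwise(A, B))

def complete_link(A, B):
    """clink: distance between the two farthest members."""
    return max(pairwise(A, B))

def group_average(A, B):
    """UPGMA: mean of all pairwise distances."""
    d = pairwise(A, B)
    return sum(d) / len(d)
```

For example, with A = [(0, 0), (1, 0)] and B = [(3, 0), (5, 0)] the pairwise distances are 3, 5, 2, and 4, so the three criteria give 2, 5, and 3.5.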

slide-27
SLIDE 27

Hierarchical Clustering: Constrained Agglomerative

  • Hybrid technique
  • Constrains agglomerative clustering by initializing it with an intermediate hierarchical partitional clustering
  • More likely to avoid the early merge mistakes of agglomerative techniques
  • But takes advantage of the ease with which agglomerative techniques find small and cohesive clusters

slide-28
SLIDE 28

Hierarchical Clustering: Constrained Agglomerative

1. Find k clusters using partitional clustering

slide-33
SLIDE 33

Hierarchical Clustering: Constrained Agglomerative

2. Cluster the documents in these clusters using agglomerative clustering

slide-39
SLIDE 39

Hierarchical Clustering: Constrained Agglomerative

3. Cluster the k clusters using agglomerative clustering
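
The three steps can be sketched as follows. This is a minimal illustration under stated assumptions: step 1 is taken as already done, so the partition is passed in as lists of document indices; group-average linkage over Euclidean distance stands in for whatever criterion is actually used; and the names `agglomerate` and `constrained_agglomerative` are hypothetical. Trees are nested pairs whose leaves are document indices.

```python
import math
from itertools import combinations

def leaves(tree):
    # A tree is either a document index or a pair (left_subtree, right_subtree).
    if isinstance(tree, int):
        return (tree,)
    return leaves(tree[0]) + leaves(tree[1])

def avg_dist(a, b, docs):
    # Group-average distance between two sets of document indices.
    total = sum(math.dist(docs[i], docs[j]) for i in a for j in b)
    return total / (len(a) * len(b))

def agglomerate(trees, docs):
    """Merge the given subtrees bottom-up into a single tree."""
    trees = list(trees)
    while len(trees) > 1:
        a, b = min(
            combinations(range(len(trees)), 2),
            key=lambda p: avg_dist(leaves(trees[p[0]]), leaves(trees[p[1]]), docs),
        )
        merged = (trees[a], trees[b])
        trees = [t for i, t in enumerate(trees) if i not in (a, b)]
        trees.append(merged)
    return trees[0]

def constrained_agglomerative(docs, partition):
    # Step 1 (assumed done elsewhere): `partition` holds the k partitional clusters.
    # Step 2: build one agglomerative subtree inside each cluster.
    subtrees = [agglomerate(cluster, docs) for cluster in partition]
    # Step 3: agglomerate the k subtrees into the final tree.
    return agglomerate(subtrees, docs)
```

Because step 2 never sees documents from another cluster, no merge below the k subtree roots can cross a partition boundary, which is the constraint the method is named for.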

slide-46
SLIDE 46

Computational Complexity

  • Partitional clustering of data into k clusters: less than O(n log(n)), the cost of an entire partitional clustering

○ log(n) levels
○ O(n) comparison and reassignment operations at each level
○ n log(n) for the full tree; truncating at k clusters costs less

slide-48
SLIDE 48

Computational Complexity

  • Agglomerative clustering of docs in the k clusters:

O(k (n/k)^2 log(n/k))

○ k clusters, each of size ≈ n/k
○ Cost to cluster one cluster: O(size^2 log(size)) ≈ (n/k)^2 log(n/k)
○ Total over the k clusters: k (n/k)^2 log(n/k)

slide-49
SLIDE 49

Computational Complexity

  • Agglomerative clustering of the k clusters themselves:

O(k^2 log(k))

○ k clusters to merge; cost to cluster them agglomeratively = O(k^2 log(k))

slide-52
SLIDE 52

Computational Complexity

  • Putting it all together:

O(n log(n)) + O(k (n/k)^2 log(n/k)) + O(k^2 log(k))

○ O(n log(n)): initial partitional clustering
○ O(k (n/k)^2 log(n/k)): agglomerative clustering within the initial clusters
○ O(k^2 log(k)): agglomerative clustering between the initial clusters

slide-53
SLIDE 53

Computational Complexity

  • Putting it all together:

O(n log(n)) + O(k (n/k)^2 log(n/k)) + O(k^2 log(k))

The middle term, O(k (n/k)^2 log(n/k)), is the dominant term for reasonable choices of k

slide-55
SLIDE 55

Computational Complexity

  • Putting it all together:

O(k (n/k)^2 log(n/k))

  • If we let k ≈ √n, this reduces to:

O(n^(3/2) log(n))

  • Or in general, if k ≈ n^α with 0 < α < 1, complexity is:

O(n^(α + 2(1-α)) log(n)) = O(n^(2-α) log(n))

  • Better than agglomerative: O(n^2 log(n))
  • Worse than partitional: O(n log(n))
  • But on average it produces slightly better clusterings than either
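
For reference, the substitution behind the general bound can be worked out in a few lines. This is a sketch of the arithmetic only, taking k = n^α exactly rather than approximately.

```latex
\begin{align*}
k\left(\frac{n}{k}\right)^{2}\log\frac{n}{k}
  &= \frac{n^{2}}{k}\,\log\frac{n}{k} \\
  &= n^{2-\alpha}\,(1-\alpha)\log n
     && \text{substituting } k = n^{\alpha} \\
  &= O\!\left(n^{2-\alpha}\log n\right),
     && 0 < \alpha < 1 . \\[4pt]
\alpha = \tfrac{1}{2}:\quad
  & O\!\left(n^{3/2}\log n\right).
\end{align*}
```

The same substitution turns the between-cluster term k^2 log(k) into α n^(2α) log(n), which grows more slowly than n^(2-α) log(n) whenever α < 2/3. That is one way to read "reasonable choices of k".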
slide-56
SLIDE 56

Evaluation: Experimental Design

12 document collections were analyzed with each of the hierarchical methods

  • Partitional

○ Criterion functions: Internal-1, Internal-2, External, Hybrid (Internal-1), Hybrid (Internal-2), Graph-Based

  • Agglomerative

○ Criterion functions: Internal-1, Internal-2, External, Hybrid (Internal-1), Hybrid (Internal-2), Graph-Based, Single Link (slink), Complete Link (clink), Group Average (UPGMA)

  • Constrained Agglomerative

○ Criterion functions: Internal-1, Internal-2, External, Hybrid (Internal-1), Hybrid (Internal-2), Graph-Based
○ Number of initial clusters: 10, 20, n/40, n/20

slide-57
SLIDE 57

Document Collections (12)

slide-58
SLIDE 58

Vector Space Model

Model design:

  • TF-IDF term weighting
  • Normalized by document length
  • Cosine similarity
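
A minimal sketch of this model follows. The exact weighting is an assumption: tf × log(n/df) is one common TF-IDF variant, and "normalized by document length" is read here as scaling each vector to unit length, so that cosine similarity reduces to a dot product. `tfidf_vectors` is a hypothetical name.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one unit-length {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: tf[t] * math.log(n / df[t]) for t in tf}
        length = math.sqrt(sum(w * w for w in vec.values()))
        if length:
            vec = {t: w / length for t, w in vec.items()}
        vectors.append(vec)
    return vectors

def cosine(u, v):
    # For unit-length vectors the cosine is just the dot product.
    return sum(w * v.get(t, 0.0) for t, w in u.items())
```

Sparse dictionaries are used because a document contains only a small fraction of the m terms in the collection.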
slide-61
SLIDE 61

FScore Metric

  • FScore for a class Lr and a cluster Si: how well does the cluster align with the class?
  • Define F for class Lr as the maximum over all clusters Si in the clustering tree
  • FScore for the entire clustering: F(Lr) summed across classes, weighted by class size
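
The description above can be sketched in code. The per-pair score is assumed to be the usual harmonic mean of precision and recall, F = 2RP / (R + P), with recall = n_ri / n_r and precision = n_ri / n_i, where n_ri is the number of documents of class Lr in cluster Si. The function name and input layout are hypothetical.

```python
from collections import Counter

def fscore(classes, tree_clusters):
    """Overall FScore of a hierarchical clustering.

    classes: list giving the class label of each document (index = document id).
    tree_clusters: list of clusters, one per node of the tree, each a set of document ids.
    """
    n = len(classes)
    total = 0.0
    for label, n_r in Counter(classes).items():
        best = 0.0
        for cluster in tree_clusters:
            n_ri = sum(1 for d in cluster if classes[d] == label)
            if n_ri == 0:
                continue
            recall = n_ri / n_r
            precision = n_ri / len(cluster)
            best = max(best, 2 * recall * precision / (recall + precision))
        # F(Lr) is the best match anywhere in the tree, weighted by class size.
        total += (n_r / n) * best
    return total
```

A tree that contains, for every class, one node holding exactly that class scores 1.0, regardless of what the other nodes look like.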

slide-62
SLIDE 62

Results: Agglomerative vs. Partitional

[Figure: results by hierarchical method; plot not recovered]

slide-63
SLIDE 63

Results: Constrained Agglomerative

[Table: FScore, constrained vs. partitional vs. agglomerative. Columns: Criterion Functions; Best Partitional; Best Agglomerative; Constrained Agglomerative (by number of initial partitions). Values not recovered]

slide-64
SLIDE 64

Conclusion

  • Zhao and Karypis did a thorough comparison of hierarchical clustering methods on large document collections
  • Partitional algorithms consistently outperformed agglomerative methods
  • Constrained agglomerative methods outperformed the partitional methods in many cases

slide-65
SLIDE 65

Thank You

slide-66
SLIDE 66

Clustering Method Comparisons

Partitional

  • Complexity: O(n log n)

Agglomerative

  • Complexity: O(n^2 log n) with caching in a binary heap
  • Complexity: O(n^3) if the similarity function is not cacheable

Constrained Agglomerative

  • Complexity: O(k (n/k)^2 log(n/k) + k^2 log k)
  • = O(n^(3/2) log n) when the number of partitional clusters ≈ √n

Remarks

  • Limited studies show agglomerative methods outperform k-means with small datasets
  • Initial merging may contain errors, which can multiply during agglomeration
  • Partitional cluster constraint prevents initial merging errors (merging across cluster boundaries)
  • Suited for large datasets due to low computational requirements
  • Easy to group documents in small, cohesive clusters
  • Common belief (2001) that k-means methods are inferior to agglomerative methods