Evaluation of Hierarchical Clustering Algorithms for Document Datasets

Evaluation of Hierarchical Clustering Algorithms for Document Datasets. Paper by Ying Zhao and George Karypis, University of Minnesota (2002). CS 6501 Paper Presentation, April 6, 2016. Presented by Matthew Hawthorn, Nikhil Mascarenhas, and Shannon Mitchell.


  2. Motivation ● Hierarchical clustering of documents ○ Intuitive; gives clusterings at different levels of granularity ● Two major approaches ○ Partitional ○ Agglomerative ● The general view was that partitional algorithms are inferior ● The authors ran experiments to compare the two approaches ● They also defined a new hybrid algorithm, the “constrained agglomerative algorithm”

  4. Hierarchical Clustering: Partitional Algorithms (top-down) - Start with one cluster containing all documents - Start at the root and divide down to the leaves - At each step, split the cluster whose split most improves the criterion function - Complexity: O(n log n)
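
The top-down procedure on this slide can be sketched as repeated bisection. This is a minimal illustration, not the authors' implementation: as assumptions, it uses the squared distance to the centroid as a stand-in criterion function, a deterministic 2-means split, and hypothetical names (`bisect`, `repeated_bisection`). It also tries every cluster at every step, so it makes no claim to the O(n log n) bound.

```python
from typing import List, Tuple

Vector = List[float]

def centroid(points: List[Vector]) -> Vector:
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))

def sse(points: List[Vector]) -> float:
    """Stand-in criterion (an assumption): squared error to the centroid."""
    c = centroid(points)
    return sum(sq_dist(p, c) for p in points)

def bisect(points: List[Vector], iters: int = 10) -> Tuple[List[Vector], List[Vector]]:
    """Split one cluster in two with a few 2-means iterations."""
    # Deterministic seeds: the first point and the point farthest from it.
    seed_a = points[0]
    seed_b = max(points, key=lambda p: sq_dist(p, seed_a))
    left, right = [], []
    for _ in range(iters):
        left = [p for p in points if sq_dist(p, seed_a) <= sq_dist(p, seed_b)]
        right = [p for p in points if sq_dist(p, seed_a) > sq_dist(p, seed_b)]
        if not left or not right:
            break
        seed_a, seed_b = centroid(left), centroid(right)
    return left, right

def repeated_bisection(points: List[Vector], k: int) -> List[List[Vector]]:
    """Start with one cluster and split until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        best = None  # (gain, index, left, right)
        for i, c in enumerate(clusters):
            if len(c) < 2:
                continue
            left, right = bisect(c)
            if not left or not right:
                continue  # cluster could not be split (e.g. identical points)
            gain = sse(c) - sse(left) - sse(right)
            if best is None or gain > best[0]:
                best = (gain, i, left, right)
        if best is None:
            break  # nothing left to split
        _, i, left, right = best
        clusters[i:i + 1] = [left, right]  # replace the parent by its two children
    return clusters

docs = [[0, 0], [0, 1], [10, 10], [10, 11], [50, 50]]
clusters = repeated_bisection(docs, 3)
```

Recording each `(parent, left, right)` split instead of only the final clusters would give the full tree from root to leaves.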

  13. Hierarchical Clustering: Agglomerative Algorithms (bottom-up) - Each document starts as its own cluster - Start at the leaves and merge up to the root - Complexity: O(n² log n) when intermediate values of the objective can be cached, O(n³) otherwise
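
A naive version of the bottom-up procedure might look like the sketch below. As assumptions, it merges by average pairwise Euclidean distance and uses the hypothetical names `average_distance` and `agglomerate`. It recomputes every pairwise value in each round, so it corresponds to the uncached case on the slide rather than the cached one.

```python
from itertools import combinations
from math import dist

def average_distance(a, b):
    """Average pairwise Euclidean distance between two clusters."""
    return sum(dist(p, q) for p in a for q in b) / (len(a) * len(b))

def agglomerate(points):
    """Return the merge history as a list of (cluster_a, cluster_b) pairs."""
    clusters = [[tuple(p)] for p in points]  # each document is its own cluster
    merges = []
    while len(clusters) > 1:
        # Recompute every pairwise value from scratch (no caching).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return merges

history = agglomerate([(0, 0), (0, 1), (5, 5)])
```

The merge history read in order goes from the leaves to the root; the last entry is the final merge that produces the root.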

  23. Criterion Functions: global criterion functions drive the clustering process. ● Internal functions: consider only the documents within a cluster ● External functions: consider how the various clusters differ from each other ● Hybrid functions: simultaneously consider internal and external criterion functions ● Graph-based functions: construct a graph that represents the relationships between documents

  24. Internal Criterion Functions. Notation: m = number of terms; n = number of documents; k = number of clusters; S_1, S_2, ..., S_k = the k clusters; n_1, n_2, ..., n_k = size of each cluster; d_1, d_2, ..., d_n = tf-idf vector of each document; D_A = sum of all document vectors in cluster A; C_A = centroid vector of cluster A
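
The criterion formulas themselves are not part of this transcript, so the sketch below only illustrates the notation. `composite` and `centroid` follow the slide's definitions of D_A and C_A. The function `i2` is an assumption: it uses the form commonly cited for the authors' I2 criterion, the sum over clusters of the norm of D_r, and should be checked against the paper before being relied on.

```python
from math import sqrt

def composite(cluster):
    """D_A: component-wise sum of all document vectors in the cluster."""
    return [sum(col) for col in zip(*cluster)]

def centroid(cluster):
    """C_A: the composite vector divided by the cluster size."""
    return [x / len(cluster) for x in composite(cluster)]

def norm(v):
    return sqrt(sum(x * x for x in v))

def i2(clusters):
    """Assumed I2-style internal criterion: sum over clusters of ||D_r||."""
    return sum(norm(composite(c)) for c in clusters)

# Two toy clusterings of the same three unit-length vectors.
cohesive = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0]]]
mixed = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]]]
```

Under this assumed form, a larger value indicates more cohesive clusters, so `i2(cohesive)` exceeds `i2(mixed)`.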

  25. External Criterion Functions. Notation: m = number of terms; n = number of documents; k = number of clusters; S_1, S_2, ..., S_k = the k clusters; n_1, n_2, ..., n_k = size of each cluster; d_1, d_2, ..., d_n = tf-idf vector of each document; D_A = sum of all document vectors in cluster A; C_A = centroid vector of cluster A

  26. Traditional Agglomerative Clustering Criteria (with the authors' abbreviations) ● Single-linkage ('slink'): minimum distance ● Group average ('UPGMA'): average of distances ● Complete-linkage ('clink'): maximum distance
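
The three criteria differ only in how they reduce the pairwise distances between two clusters. A minimal sketch, assuming Euclidean distance and reusing the authors' abbreviations as function names:

```python
from math import dist

def pairwise(a, b):
    """All distances between a point of cluster a and a point of cluster b."""
    return [dist(p, q) for p in a for q in b]

def slink(a, b):
    """Single-linkage: minimum distance."""
    return min(pairwise(a, b))

def upgma(a, b):
    """Group average: average of distances."""
    d = pairwise(a, b)
    return sum(d) / len(d)

def clink(a, b):
    """Complete-linkage: maximum distance."""
    return max(pairwise(a, b))

a = [(0, 0), (0, 1)]
b = [(0, 3), (0, 5)]
print(slink(a, b), upgma(a, b), clink(a, b))  # prints 2.0 3.5 5.0
```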

  27. Hierarchical Clustering: Constrained Agglomerative ● Hybrid technique ● Constrains agglomerative clustering by initializing with intermediate hierarchical partitional clustering ● More likely to avoid early merge mistakes of agglomerative techniques ● But takes advantage of the ease with which agglomerative techniques find small and cohesive clusters

  28. Hierarchical Clustering: Constrained Agglomerative 1. Find k clusters using partitional clustering

  33. Hierarchical Clustering: Constrained Agglomerative 2. Cluster the documents in these clusters using agglomerative clustering

  39. Hierarchical Clustering: Constrained Agglomerative 3. Cluster the k clusters using agglomerative clustering
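
Steps 2 and 3 of the constrained agglomerative method can be sketched as follows. As assumptions, the k clusters from step 1 are passed in as an already computed partition (any partitional method would do), merging uses average pairwise Euclidean distance, and `Node`, `agglomerate`, and `constrained_agglomerative` are hypothetical names.

```python
from dataclasses import dataclass
from itertools import combinations
from math import dist
from typing import List, Optional, Tuple

Point = Tuple[float, ...]

@dataclass
class Node:
    points: List[Point]              # all documents under this node
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def average_distance(a: Node, b: Node) -> float:
    total = sum(dist(p, q) for p in a.points for q in b.points)
    return total / (len(a.points) * len(b.points))

def agglomerate(nodes: List[Node]) -> Node:
    """Merge the given (non-empty) list of nodes bottom-up into one root."""
    nodes = list(nodes)
    while len(nodes) > 1:
        i, j = min(
            combinations(range(len(nodes)), 2),
            key=lambda ij: average_distance(nodes[ij[0]], nodes[ij[1]]),
        )
        merged = Node(nodes[i].points + nodes[j].points, nodes[i], nodes[j])
        nodes = [n for idx, n in enumerate(nodes) if idx not in (i, j)] + [merged]
    return nodes[0]

def constrained_agglomerative(partition: List[List[Point]]) -> Node:
    """partition: the k non-empty clusters from step 1 (assumed given)."""
    # Step 2: build one agglomerative subtree inside each constraint cluster.
    subtrees = [agglomerate([Node([p]) for p in cluster]) for cluster in partition]
    # Step 3: agglomerate the k subtrees into the final tree.
    return agglomerate(subtrees)

partition = [
    [(0.0, 0.0), (0.0, 1.0)],
    [(10.0, 10.0), (10.0, 11.0)],
    [(11.0, 10.0)],
]
root = constrained_agglomerative(partition)
```

Because documents from different constraint clusters can only meet in step 3, an early merge mistake stays confined to a single constraint cluster.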

  44. Computational Complexity ● Partitional clustering of the data into k clusters costs less than O(n log n), which is the cost of an entire partitional clustering: log(n) levels, with O(n) comparison and reassignment operations at each level
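
To give a feel for the gap between the bounds quoted in these slides, the sketch below evaluates n log n against n² log n for one collection size. It ignores constant factors and uses base-2 logarithms as an assumption, so the numbers are growth-rate illustrations, not operation counts measured on any implementation.

```python
from math import log2

def partitional_bound(n: int) -> float:
    """log2(n) levels times n operations per level."""
    return n * log2(n)

def agglomerative_bound(n: int) -> float:
    """The cached agglomerative bound, n^2 log n."""
    return n ** 2 * log2(n)

n = 1024
print(partitional_bound(n))    # prints 10240.0
print(agglomerative_bound(n))  # prints 10485760.0
```

The ratio between the two bounds is exactly n, so the gap widens linearly with the size of the collection.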
