
Cluster Validity

Erin Wirch & Wenbo Wang, 10/14/2010


  1. Cluster Validity
     Erin Wirch & Wenbo Wang
     Oct. 28, 2010

  2. Outline
     ◮ Hypothesis Testing
       ◮ Random Position Hypothesis
       ◮ Random Graph Hypothesis
       ◮ Random Label Hypothesis
     ◮ Relative Criteria
       ◮ Methodology
       ◮ Clustering Indices - Hard Clustering
     ◮ Questions

  3. Agenda
     ◮ Hypothesis Testing
       ◮ Review of Hypothesis Testing
       ◮ Random Position Hypothesis
       ◮ Random Graph Hypothesis
       ◮ Random Label Hypothesis
     ◮ Relative Criteria

  4. Review of Hypothesis Testing
     ◮ Test a parameter against a specific value
     ◮ Begin with $H_0$ and $H_1$ as the null and alternative hypotheses
     ◮ Power function: $W(\theta) = P(q \in D_p \mid \theta \in \Theta_1)$
     ◮ Goal: make the correct decision

  5. Hypothesis Testing in Cluster Validity
     ◮ Test whether the data of X possess a random structure
     ◮ First step: generate data to model a random structure
     ◮ Second step: define a statistic and compare results from our data set and a reference set
     ◮ Three methods exist to generate the population under the randomness hypothesis
     ◮ Choose the best method for the situation

  6. Random Position Hypothesis
     ◮ Suitable for ratio data
     ◮ Requirement: “All the arrangements of N vectors in a specific region of the l-dimensional space are equally likely to occur.”
     ◮ This can be accomplished by inserting points at random in the region according to a uniform distribution
     ◮ Can be used with internal or external criteria
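The two-step test (generate random reference data, then compare a statistic) can be sketched as a small Monte Carlo experiment under the random position hypothesis. This is only an illustrative sketch: the statistic (mean nearest-neighbour distance), the helper names, the bounding box, the sample data, and the number of reference sets `r` are all assumptions, not taken from the slides.

```python
import math
import random


def mean_nn_distance(points):
    """Illustrative statistic q: mean distance to the nearest neighbour."""
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, other)
                     for j, other in enumerate(points) if j != i)
    return total / len(points)


def uniform_reference(n, bounds, rng):
    """Insert n points uniformly at random in the box given by `bounds`."""
    return [tuple(rng.uniform(lo, hi) for lo, hi in bounds) for _ in range(n)]


def random_position_test(points, bounds, r=100, seed=0):
    """Return q for the data and the fraction of reference sets with q_i <= q."""
    rng = random.Random(seed)
    q = mean_nn_distance(points)
    reference = [mean_nn_distance(uniform_reference(len(points), bounds, rng))
                 for _ in range(r)]
    fraction_below = sum(q_i <= q for q_i in reference) / r
    return q, fraction_below


# Hypothetical data: two tight groups inside the square [0, 10] x [0, 10].
data_rng = random.Random(1)
data = ([(2 + data_rng.gauss(0, 0.1), 2 + data_rng.gauss(0, 0.1))
         for _ in range(20)]
        + [(8 + data_rng.gauss(0, 0.1), 8 + data_rng.gauss(0, 0.1))
           for _ in range(20)])
q, fraction_below = random_position_test(data, bounds=[(0, 10), (0, 10)])
```

A small `fraction_below` means that few uniformly generated reference sets are as tightly packed as the data, which counts as evidence against the randomness hypothesis for this particular statistic.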

  7. External Criteria
     ◮ Impose clustering algorithm on X a priori based on intuitions
     ◮ Evaluate resulting clustering structure in terms of independently drawn structure

  8. Internal Criteria
     ◮ Evaluate clustering structure in terms of vectors in X
     ◮ Example: proximity matrix

  9. Random Graph Hypothesis
     ◮ Suitable when only internal information is available
     ◮ Definition: A is an N×N symmetric matrix with zero diagonal elements
     ◮ $A(i,j)$ only gives information about dissimilarity between $x_i$ and $x_j$
     ◮ Thus comparing dissimilarities is meaningless

  10. Random Graph Hypothesis, cont’d
     ◮ Let $A_i$ be an N×N rank-order matrix without ties
     ◮ Reference population consists of matrices $A_i$ generated by randomly inserting integers in the range $[1, N(N-1)/2]$
     ◮ $H_0$ is rejected if q is too large or too small
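One member of the reference population described above can be generated as follows. This is a minimal sketch under the assumption that "without ties" means each integer in $[1, N(N-1)/2]$ is used exactly once above the diagonal; the function name is hypothetical.

```python
import random


def random_rank_order_matrix(n, rng):
    """Symmetric n x n matrix, zero diagonal, ranks 1..n(n-1)/2 without ties."""
    ranks = list(range(1, n * (n - 1) // 2 + 1))
    rng.shuffle(ranks)  # every assignment of ranks to pairs is equally likely
    a = [[0] * n for _ in range(n)]
    k = 0
    for i in range(n):
        for j in range(i + 1, n):
            a[i][j] = a[j][i] = ranks[k]  # mirror to keep the matrix symmetric
            k += 1
    return a


example = random_rank_order_matrix(5, random.Random(0))
```

Repeating the call with the same `rng` yields further matrices $A_i$, on each of which the chosen statistic q can be evaluated.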

  11. Random Label Hypothesis
     ◮ Consider all possible partitions, $P'$, of X in m groups
     ◮ Assume that all possible mappings are equally likely
     ◮ A statistic q can be defined to measure the degree to which information in X matches a specific partition
     ◮ Use q to test the degree of match between the proximity matrix $P$ and a given partition $\mathcal{P}$ against the $q_i$’s corresponding to random partitions
     ◮ $H_0$ is rejected if q is too large or too small
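The comparison of q against the $q_i$'s of random partitions can be sketched as a label-shuffling experiment. The choices below are assumptions made for illustration: q is a Hubert-style sum of distances over pairs that fall in different groups, random partitions are produced by shuffling the given labels (which keeps the group sizes fixed), and the data and function names are hypothetical.

```python
import math
import random


def gamma_stat(points, labels):
    """(1/M) * sum over pairs i<j of d(x_i, x_j) * [labels differ]."""
    n = len(points)
    m_pairs = n * (n - 1) // 2
    total = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):
            if labels[i] != labels[j]:
                total += math.dist(points[i], points[j])
    return total / m_pairs


def random_label_test(points, labels, r=200, seed=0):
    """Return q and the fraction of shuffled labelings with q_i >= q."""
    rng = random.Random(seed)
    q = gamma_stat(points, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(r):
        rng.shuffle(shuffled)  # one random labeling of the same points
        if gamma_stat(points, shuffled) >= q:
            hits += 1
    return q, hits / r


# Hypothetical data: two well-separated groups of ten points each.
points = ([(0.1 * k, 0.0) for k in range(10)]
          + [(10 + 0.1 * k, 10.0) for k in range(10)])
labels = [0] * 10 + [1] * 10
q, fraction_above = random_label_test(points, labels)
```

A small `fraction_above` indicates that random labelings rarely match the data as well as the tested partition does.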

  12. Methodology
     ◮ Goal: choose the parameters A of a specific clustering algorithm that best fit the data set X
     ◮ Parameter set A:
       ◮ the estimated number of clusters m
       ◮ the initial estimates of the parameter vectors related to each cluster

  13. Method I
     ◮ The number of clusters m is not pre-determined in the algorithm
     ◮ Criterion: the clustering structure is captured by a wide range of A
     Figure: (a) 2-D clusters from normal distributions with means $[0, 0]^T$, $[8, 4]^T$ and $[8, 0]^T$ and covariance matrices $1.5I$. (b) Clustering result (number of clusters m) of the binary morphology algorithm with respect to different resolution parameters r
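The "wide range of A" criterion can be read as looking for the value of m that stays constant over the longest run of consecutive parameter settings. The sketch below assumes that reading; the function name and the sample values of m are hypothetical and are not read off the figure.

```python
from itertools import groupby


def most_stable_cluster_count(m_per_setting):
    """Return the m that persists over the longest run of consecutive settings."""
    best_m, best_length = None, 0
    for m, run in groupby(m_per_setting):
        length = len(list(run))
        if length > best_length:  # ties keep the earliest run
            best_m, best_length = m, length
    return best_m


# Hypothetical cluster counts for increasing resolution parameter r.
m_values = [7, 5, 3, 3, 3, 3, 2, 1, 1]
stable_m = most_stable_cluster_count(m_values)  # 3: the longest plateau
```

In practice one may want to ignore a trivial plateau such as m = 1; that refinement is left out here.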

  14. Method I (cont’d)
     ◮ Comparison with a data set of larger variance:
     Figure: (a) 2-D clusters from normal distributions with means $[0, 0]^T$, $[8, 4]^T$ and $[8, 0]^T$ and covariance matrices $2.5I$. (b) Clustering result (number of clusters m) of the binary morphology algorithm with respect to different resolution parameters r

  15. Method II
     ◮ The number of clusters m is pre-determined in the algorithm
     ◮ Criterion: choose the best clustering index q for m in the range $[m_{min}, m_{max}]$
     ◮ If q shows no trend with respect to m, vary the parameters A for each m and choose the best A
     ◮ If q shows a trend with respect to m, choose the m where a significant local change of q happens
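One possible way to locate a "significant local change" of q is to look for the largest absolute second difference of q over m. This is an assumed heuristic for illustration, not the rule used in the slides; the function name and the index values are hypothetical.

```python
def knee_by_second_difference(m_values, q_values):
    """Return the interior m with the largest |q[k-1] - 2*q[k] + q[k+1]|."""
    best_m, best_score = None, float("-inf")
    for k in range(1, len(q_values) - 1):
        score = abs(q_values[k - 1] - 2 * q_values[k] + q_values[k + 1])
        if score > best_score:
            best_m, best_score = m_values[k], score
    return best_m


# Hypothetical index values: a steep decrease that flattens after m = 4.
ms = [2, 3, 4, 5, 6, 7, 8]
qs = [10.0, 7.0, 4.0, 3.8, 3.6, 3.5, 3.4]
knee_m = knee_by_second_difference(ms, qs)  # 4
```

When q shows no sharp turn, the scores are all small and the returned m carries little meaning, which matches the case where the index reveals no clustering structure.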

  16. Method II (cont’d)
     Figure: data set generated from 4 well-separated normal distributions (feature size $l \in \{2, 4, 6, 8\}$) (a) N = 50 (b) N = 100 (c) N = 150 (d) N = 200. The sharp turns indicate the clustering structure

  17. Method II (cont’d)
     Figure: data set generated from 4 poorly separated uniform distributions (feature size $l \in \{2, 4, 6, 8\}$) (a) N = 50 (b) N = 100 (c) N = 150 (d) N = 200. No sharp turn is exhibited

  18. Hard Clustering Indices
     ◮ The modified Hubert Γ statistic: correlation between the proximity matrix P and the cluster distance matrix Q
       ◮ $P(i,j) = d(x_i, x_j)$, $Q(i,j) = d(c_{x_i}, c_{x_j})$, where $c_{x_i}$ is the representative of the cluster containing $x_i$
       ◮ With $X = P$, $Y = Q$ and $M = N(N-1)/2$:
         $\Gamma = \frac{1}{M} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} X(i,j)\, Y(i,j)$  (1)
     ◮ The Dunn and Dunn-like indices
       ◮ Dissimilarity function between two clusters: $d(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} d(x, y)$
       ◮ Diameter of a cluster C: $\mathrm{diam}(C) = \max_{x, y \in C} d(x, y)$
       ◮ Dunn index:
         $D_m = \min_{i=1,\dots,m} \left\{ \min_{j=i+1,\dots,m} \left( \frac{d(C_i, C_j)}{\max_{k=1,\dots,m} \mathrm{diam}(C_k)} \right) \right\}$  (2)
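Both indices can be computed directly from their definitions. The sketch below is illustrative only: it assumes Euclidean distance for d, uses cluster centroids as the representatives $c_{x_i}$, and requires at least two clusters with at least one cluster holding two or more points. The function names and the toy data are hypothetical.

```python
import math
from itertools import combinations


def group_by_label(points, labels):
    """Collect the points of each cluster in a dict keyed by label."""
    groups = {}
    for p, label in zip(points, labels):
        groups.setdefault(label, []).append(p)
    return groups


def modified_hubert_gamma(points, labels):
    """Gamma = (1/M) * sum_{i<j} P(i,j) * Q(i,j), centroids as representatives."""
    groups = group_by_label(points, labels)
    centroids = {label: tuple(sum(coord) / len(members)
                              for coord in zip(*members))
                 for label, members in groups.items()}
    n = len(points)
    m_pairs = n * (n - 1) // 2
    total = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):
            p_ij = math.dist(points[i], points[j])
            q_ij = math.dist(centroids[labels[i]], centroids[labels[j]])
            total += p_ij * q_ij
    return total / m_pairs


def dunn_index(points, labels):
    """Smallest between-cluster distance over the largest cluster diameter."""
    clusters = list(group_by_label(points, labels).values())
    min_separation = min(math.dist(x, y)
                         for a, b in combinations(clusters, 2)
                         for x in a for y in b)
    max_diameter = max(math.dist(x, y)
                       for c in clusters
                       for x, y in combinations(c, 2))
    return min_separation / max_diameter


# Toy data: two clusters of two points each, three units apart.
pts = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (3.0, 1.0)]
labs = [0, 0, 1, 1]
gamma = modified_hubert_gamma(pts, labs)
dunn = dunn_index(pts, labs)  # 3.0: separation 3, largest diameter 1
```

Larger values of either index point to compact, well-separated clusters; both loops are quadratic in N, so this form only suits small data sets.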
