

SLIDE 1

Stability Analysis For Unsupervised Learning

  • Dr. Derek Greene

Insight @ UCD April 2014

derek.greene@ucd.ie

SLIDE 2

Introduction

Insight Machine Learning Workshop 2

Cluster Validation: A quantitative means of assessing the quality of a clustering.

[Diagram: Dataset → Pre-Processing → Clustering Algorithm → Model → Clustering → Cluster Validation, with feedback to earlier stages]

  • Q. How many clusters k in a given dataset?

  • A common application is in model selection, i.e. identifying optimal algorithm parameter values.

[Figure: example clusterings of the same dataset with k = 2 and k = 3]

SLIDE 3

Common Validation Strategies


External validation

  • Assess agreement between a test clustering and a “gold standard” clustering.
  • Approaches: count pairwise agreements; match corresponding clusters; information-theoretic agreement.

  • Examples: Jaccard, Rand Index, Normalised Mutual Information

✓ Useful for developing and verifying algorithms.
x Not directly applicable in real unsupervised tasks.

Internal validation

  • Compare solutions based on the goodness of fit between a clustering and the raw data.

  • Approaches: Intra-cluster similarity, inter-cluster separation, …
  • Examples: Dunn’s index, DB index, Silhouette score

✓ Does not require a gold standard clustering.
x Can only make comparisons between clusterings generated using the same model/metric.
x Often makes assumptions about cluster structure.
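As a concrete illustration of an internal index, here is a minimal plain-Python silhouette computation. This is an illustrative sketch, not code from the slides: each point's score compares its mean intra-cluster distance a against the mean distance b to the nearest other cluster, and singleton clusters are simply skipped.

```python
def silhouette_score(points, labels):
    """Mean silhouette over all points: s(i) = (b - a) / max(a, b),
    where a = mean distance to the point's own cluster and
    b = mean distance to the nearest other cluster."""
    def dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5

    cluster_ids = set(labels)
    scores = []
    for i, p in enumerate(points):
        same = [dist(p, points[j]) for j in range(len(points))
                if labels[j] == labels[i] and j != i]
        if not same:
            continue  # silhouette is undefined for singleton clusters
        a = sum(same) / len(same)
        b = min(
            sum(dist(p, points[j]) for j in range(len(points))
                if labels[j] == c) / labels.count(c)
            for c in cluster_ids if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two tight, well-separated clusters score close to +1.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = [0, 0, 0, 1, 1, 1]
print(silhouette_score(points, labels))
```

Note the structural assumption this makes visible: silhouette rewards compact, well-separated clusters, which is exactly the kind of bias flagged above.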

SLIDE 4

Stability Analysis

  • Use an approach analogous to cross-validation…
  • Evaluate similarity for each model over multiple runs, and select the model resulting in the highest level of stability.


Stability: The tendency of a clustering algorithm to produce similar clusterings on data originating from the same source.

[Diagram: Original Dataset → Clustering Model → Base Clusterings → Stability Criterion → stability score]

Model Evaluation

  • High level of similarity between the collection of clusterings ⇒ the model is appropriate for the dataset.

[Diagram: Original Dataset → Collection of Clusterings → Model Stability]

SLIDE 5

Measuring Stability

Stability Analysis Based on Resampling (Levine & Domany, 2001)

  • Evaluate the pairwise similarity between a collection of clusterings of resampled data.


For each k ∈ [kmin, kmax]:
  1. Apply the algorithm to generate clusterings on random samples of the complete dataset.
  2. Assess pairwise similarity between each pair of clusterings using an external validation index.
  3. Stability(k) = mean pairwise similarity.

➡ Select the model resulting in maximum stability.
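The procedure above can be sketched in plain Python. This is an illustrative toy implementation, not the authors' code: it assumes a small Lloyd's k-means as the clustering algorithm, the Rand index as the external validation index, and compares each pair of subsample clusterings only on the points the two subsamples share.

```python
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=50, seed=0):
    """Toy Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist2(p, centers[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster empties
                centers[j] = tuple(sum(c) / len(members)
                                   for c in zip(*members))
    return labels

def rand_index(a, b):
    """Fraction of item pairs on which two labelings agree."""
    agree = total = 0
    for i in range(len(a)):
        for j in range(i + 1, len(a)):
            total += 1
            if (a[i] == a[j]) == (b[i] == b[j]):
                agree += 1
    return agree / total

def stability(points, k, runs=8, frac=0.8, seed=0):
    """Mean pairwise Rand index over clusterings of random subsamples,
    each pair compared on the points the two subsamples share."""
    rng = random.Random(seed)
    n, m = len(points), int(frac * len(points))
    clusterings = []
    for r in range(runs):
        idx = rng.sample(range(n), m)
        labels = kmeans([points[i] for i in idx], k, seed=r)
        clusterings.append(dict(zip(idx, labels)))
    sims = []
    for r in range(runs):
        for s in range(r + 1, runs):
            shared = sorted(set(clusterings[r]) & set(clusterings[s]))
            sims.append(rand_index([clusterings[r][i] for i in shared],
                                   [clusterings[s][i] for i in shared]))
    return sum(sims) / len(sims)

# Two well-separated blobs: k = 2 should be highly stable.
points = [(x * 0.2, y * 0.2) for x in range(5) for y in range(4)] + \
         [(10 + x * 0.2, 10 + y * 0.2) for x in range(5) for y in range(4)]
print(stability(points, k=2))
```

In practice one would substitute a real clustering algorithm and a corrected index such as the adjusted Rand index, since the raw Rand index tends to inflate with larger k.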

SLIDE 6

Measuring Stability

Prediction-Based Validation (Tibshirani et al., 2001)

  • Assess the degree to which a model allows us to construct a classifier on a training set that will successfully predict a clustering of the test set.


For each k ∈ [kmin, kmax]:
  • Randomly generate τ training/test splits.
  • For each split:
    1. Apply the clustering algorithm to the training set.
    2. Predict assignments for the test set.
    3. Apply the clustering algorithm to the test set.
    4. Evaluate classification accuracy.
  • Stability(k) = mean classification accuracy.

➡ Select the model resulting in maximum stability.
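A minimal sketch of the steps above, again assuming a toy k-means, with nearest-centroid prediction for step 2 and the Rand index as the agreement measure (the Rand index sidesteps the cluster-label matching that raw classification accuracy would require); this is illustrative, not the authors' implementation.

```python
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=50, seed=0):
    """Toy Lloyd's k-means; returns labels and the final centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist2(p, centers[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = tuple(sum(c) / len(members)
                                   for c in zip(*members))
    return labels, centers

def rand_index(a, b):
    """Pairwise agreement between two labelings; invariant to label names."""
    agree = total = 0
    for i in range(len(a)):
        for j in range(i + 1, len(a)):
            total += 1
            if (a[i] == a[j]) == (b[i] == b[j]):
                agree += 1
    return agree / total

def prediction_stability(points, k, splits=5, seed=0):
    rng = random.Random(seed)
    scores = []
    for s in range(splits):
        shuffled = points[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        train, test = shuffled[:half], shuffled[half:]
        # 1. Apply the clustering algorithm to the training set.
        _, centers = kmeans(train, k, seed=s)
        # 2. Predict test assignments from the nearest training centroid.
        predicted = [min(range(k), key=lambda j: dist2(p, centers[j]))
                     for p in test]
        # 3. Apply the clustering algorithm to the test set independently.
        test_labels, _ = kmeans(test, k, seed=s + 100)
        # 4. Score agreement between prediction and test clustering.
        scores.append(rand_index(predicted, test_labels))
    return sum(scores) / len(scores)

# Two well-separated blobs: the k = 2 clustering is highly predictable.
points = [(x * 0.2, y * 0.2) for x in range(5) for y in range(4)] + \
         [(10 + x * 0.2, 10 + y * 0.2) for x in range(5) for y in range(4)]
print(prediction_stability(points, k=2))
```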

SLIDE 7

Prediction-Based Validation


Example of applying prediction-based validation to examine the suitability of k = 2 for a small synthetic dataset:

[Figure: (a) full dataset with cluster means µ1 and µ2; (b) training clustering Ca; (c) test clustering Cb; (d) predicted clustering Pb]

SLIDE 8

Prediction-Based Validation


Example of applying prediction-based validation to examine the suitability of k = 2 and k = 3 for a corpus of newsgroup documents:

[Figure: PCA projections (PC1 vs. PC2) of (a) training and (b) testing clusterings for k = 2, and (c) training and (d) testing clusterings for k = 3]

SLIDE 9

Stability Analysis in Topic Modeling

  • Q. How many topics are in an unlabelled text corpus?
  • Proposal:
  • Generate topics on samples of the corpus.
  • Use stability analysis, but take a term-centric approach to agreement, focusing on the highest-ranked terms for each topic.

  • Higher agreement between terms ➢ A more stable topic model.

Run 1:

  Rank  Topic 1     Topic 2
  1     oil         win
  2     bank        players
  3     election    minister
  4     policy      party
  5     government  ireland
  6     match       club
  7     senate      year
  8     democracy   election
  9     firm        coalition
  10    team        first

Run 2:

  Rank  Topic 1   Topic 2
  1     cup       first
  2     labour    sales
  3     growth    year
  4     team      minister
  5     senate    firm
  6     minister  match
  7     ireland   coalition
  8     players   team
  9     year      election
  10    economy   policy

Low agreement between the top-ranked terms of Run 1 and Run 2 ⇒ low stability for k = 2.

SLIDE 10

Stability Analysis in Topic Modeling


Run 1:

  Rank  Topic 1  Topic 2  Topic 3
  1     growth   game     labour
  2     company  ireland  election
  3     market   win      vote
  4     economy  cup      party
  5     bank     goal     government
  6     year     match    coalition
  7     firm     team     minister
  8     sales    first    policy
  9     shares   club     democracy
  10    oil      players  first

Run 2:

  Rank  Topic 1  Topic 2  Topic 3
  1     game     growth   labour
  2     win      company  election
  3     ireland  market   government
  4     cup      economy  party
  5     match    bank     vote
  6     team     shares   policy
  7     first    year     minister
  8     players  firm     democracy
  9     club     sales    senate
  10    goal     oil      coalition

High agreement between the top-ranked terms of Run 1 and Run 2 ⇒ high stability for k = 3.
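The term-centric agreement can be sketched as an average Jaccard similarity between matched top-term lists. This is an illustrative sketch rather than the exact published procedure: topics from the two runs are paired greedily here, whereas an optimal matching could be used instead. The term lists are taken from the k = 3 example above.

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def topic_agreement(run1, run2, top=10):
    """Average Jaccard similarity between the top-`top` term lists of two
    topic models, after greedily pairing each topic in run1 with its most
    similar unused topic in run2."""
    unused = list(range(len(run2)))
    scores = []
    for terms in run1:
        best = max(unused, key=lambda j: jaccard(terms[:top], run2[j][:top]))
        scores.append(jaccard(terms[:top], run2[best][:top]))
        unused.remove(best)
    return sum(scores) / len(scores)

# Top-ranked terms from the two k = 3 runs shown above: both runs
# recover the same three themes (economy, sport, politics).
run1 = [
    ["growth", "company", "market", "economy", "bank",
     "year", "firm", "sales", "shares", "oil"],
    ["game", "ireland", "win", "cup", "goal",
     "match", "team", "first", "club", "players"],
    ["labour", "election", "vote", "party", "government",
     "coalition", "minister", "policy", "democracy", "first"],
]
run2 = [
    ["game", "win", "ireland", "cup", "match",
     "team", "first", "players", "club", "goal"],
    ["growth", "company", "market", "economy", "bank",
     "shares", "year", "firm", "sales", "oil"],
    ["labour", "election", "government", "party", "vote",
     "policy", "minister", "democracy", "senate", "coalition"],
]
print(topic_agreement(run1, run2))
```

The k = 3 runs score close to 1, whereas repeating this on the k = 2 runs would give a much lower score, matching the low/high stability verdicts on these slides.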

SLIDE 11

Summary

  • Common strategies for model selection in clustering often fail or exhibit strong biases.

  • Analogous to cross-validation in supervised learning, stability analysis can be applied to choose between models for clustering.

✓ Does not exhibit the biases of classical measures.
✓ Can be used to compare the output of different algorithms run on different representations.

  • Drawbacks? Can be computationally expensive…

✓ We can measure the stability of “weak” models.
✓ Information from multiple runs can subsequently be used to build ensemble models.


SLIDE 12

References

  • Levine, E. & Domany, E. (2001). Resampling method for unsupervised estimation of cluster validity. Neural Computation, 13.
  • Tibshirani, R., Walther, G., Botstein, D. & Brown, P. (2001). Cluster validation by prediction strength. Tech. rep., Dept. Statistics, Stanford University.
  • Lange, T., Roth, V., Braun, M.L. & Buhmann, J.M. (2004). Stability-based validation of clustering solutions. Neural Computation, 16.
  • Greene, D. & Cunningham, P. (2006). Efficient prediction-based validation for document clustering. In Proc. 17th European Conference on Machine Learning (ECML’06).
