
Cluster Validity
Erin Wirch & Wenbo Wang
Oct. 28, 2010 (slide header date: 10/14/2010)

Outline
◮ Hypothesis Testing
◮ Random Position Hypothesis
◮ Random Graph Hypothesis
◮ Random Label Hypothesis
◮ Relative Criteria
◮ Methodology
◮ Clustering Indices - Hard Clustering
◮ Questions

Agenda
◮ Hypothesis Testing
  ◮ Review of Hypothesis Testing
  ◮ Random Position Hypothesis
  ◮ Random Graph Hypothesis
  ◮ Random Label Hypothesis
◮ Relative Criteria

Review of Hypothesis Testing
◮ Test a parameter against a specific value
◮ Begin with $H_0$ and $H_1$ as the null and alternative hypotheses
◮ Power function:
  $$W(\theta) = P(q \in D_p \mid \theta \in \Theta_1)$$
◮ Goal: make the correct decision

Hypothesis Testing in Cluster Validity
◮ Test whether the data of X possess a random structure
◮ First step: generate data to model a random structure
◮ Second step: define a statistic and compare results from our data set and a reference set
◮ Three methods exist to generate the population under the randomness hypothesis
◮ Choose the best method for the situation

Random Position Hypothesis
◮ Suitable for ratio data
◮ Requirement: “All the arrangements of N vectors in a specific region of the l-dimensional space are equally likely to occur.”
◮ This can be accomplished by randomly inserting points in the region according to a uniform distribution
◮ Can be used with internal or external criteria
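The two steps described above (generate a random reference population, then compare a statistic) can be sketched for the random position hypothesis as follows. This is only an illustration under stated assumptions: the region is taken to be the unit hypercube, the statistic q is the mean nearest-neighbour distance, and the function names are hypothetical; none of these choices come from the slides.

```python
import math
import random

def mean_nn_distance(points):
    """Statistic q (an assumed choice): average distance to the nearest neighbour."""
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, other) for j, other in enumerate(points) if j != i)
    return total / len(points)

def uniform_points(n, dim, rng):
    """Insert n points uniformly at random in the unit hypercube (assumed region)."""
    return [tuple(rng.random() for _ in range(dim)) for _ in range(n)]

def random_position_test(points, n_reference=200, seed=0):
    """Compare q of the data with q of uniformly generated reference sets.

    Returns (q, p_low), where p_low is the fraction of reference sets whose
    statistic is less than or equal to q. A small p_low suggests the data are
    packed more tightly than random positions would be.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    q = mean_nn_distance(points)
    reference = [mean_nn_distance(uniform_points(n, dim, rng))
                 for _ in range(n_reference)]
    p_low = sum(r <= q for r in reference) / n_reference
    return q, p_low
```

The data are assumed to be rescaled into the unit hypercube before the test; a different region would need its own uniform sampler.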

External Criteria
◮ A structure is imposed on X a priori, based on intuition about the data
◮ Evaluate the resulting clustering structure in terms of this independently drawn structure

Internal Criteria
◮ Evaluate the clustering structure in terms of the vectors in X
◮ Example: proximity matrix

Random Graph Hypothesis
◮ Suitable when only internal information is available
◮ Definition: the N×N rank order matrix A is a symmetric matrix with zero diagonal elements
◮ A(i, j) only gives ordinal information about the dissimilarity between $x_i$ and $x_j$
◮ Thus comparing the magnitudes of dissimilarities is meaningless

Random Graph Hypothesis, cont'd
◮ Let $A_i$ be an N×N rank order matrix without ties
◮ The reference population consists of matrices $A_i$ generated by randomly inserting integers in the range $[1, N(N-1)/2]$
◮ $H_0$ is rejected if q is too large or too small
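One member of the reference population described above can be generated as in the following sketch. It assumes that "without ties" means each integer in [1, N(N-1)/2] is used exactly once above the diagonal; the function name is hypothetical.

```python
import random

def random_rank_order_matrix(n, rng):
    """Build a symmetric n x n rank order matrix with zero diagonal.

    The n(n-1)/2 entries above the diagonal are a random permutation of
    1 .. n(n-1)/2, so no two pairs share a rank (no ties).
    """
    n_pairs = n * (n - 1) // 2
    ranks = list(range(1, n_pairs + 1))
    rng.shuffle(ranks)
    a = [[0] * n for _ in range(n)]
    k = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            a[i][j] = a[j][i] = ranks[k]  # mirror to keep the matrix symmetric
            k += 1
    return a
```

Repeating this call many times and computing the chosen statistic on each matrix gives the reference values against which q for the observed data is compared.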

Random Label Hypothesis
◮ Consider all possible partitions, P′, of X in m groups
◮ Assume that all possible mappings are equally likely
◮ A statistic q can be defined to measure the degree to which the information in X matches a specific partition
◮ Use q to test the degree of match between the proximity matrix P and a given partition 𝒫 against the $q_i$'s corresponding to random partitions
◮ $H_0$ is rejected if q is too large or too small
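A label-permutation test in the spirit of the random label hypothesis might look like the sketch below. Two assumptions are made that the slides do not state: the statistic is a Γ-type sum of distances over pairs with different labels, and random partitions are produced by shuffling the given labels, which keeps the group sizes fixed rather than sampling all mappings uniformly. The function names are hypothetical.

```python
import math
import random

def gamma_stat(points, labels):
    """Assumed statistic: mean over all pairs of d(x_i, x_j) * [labels differ]."""
    n = len(points)
    total = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):
            if labels[i] != labels[j]:
                total += math.dist(points[i], points[j])
    return total / (n * (n - 1) / 2)

def random_label_test(points, labels, n_perm=500, seed=0):
    """Return (q, fraction of shuffled labelings with statistic >= q)."""
    rng = random.Random(seed)
    q = gamma_stat(points, labels)
    shuffled = list(labels)
    at_least_q = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # random relabeling, group sizes preserved
        if gamma_stat(points, shuffled) >= q:
            at_least_q += 1
    return q, at_least_q / n_perm
```

A small returned fraction means few random labelings match the data as well as the tested partition, which is evidence against $H_0$.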

Methodology
◮ Goal: choose the parameters A of a specific clustering algorithm that best fit the data set X
◮ Parameter set A:
  ◮ the estimated number of clusters m
  ◮ the initial estimates of the parameter vectors related to each cluster

Method I
◮ the number of clusters m is not pre-determined in the algorithm
◮ criterion: the clustering structure is captured by a wide range of A
Figure: (a) 2-D clusters from normal distributions with means $[0, 0]^T$, $[8, 4]^T$ and $[8, 0]^T$ and covariance matrices $1.5I$. (b) clustering result (number of clusters m) of the binary morphology algorithm for different values of the resolution parameter r

Method I (cont'd)
◮ Comparison with a data set that has a larger variance:
Figure: (a) 2-D clusters from normal distributions with means $[0, 0]^T$, $[8, 4]^T$ and $[8, 0]^T$ and covariance matrices $2.5I$. (b) clustering result (number of clusters m) of the binary morphology algorithm for different values of the resolution parameter r
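The Method I criterion (prefer the structure that persists over a wide range of the parameter) can be sketched as picking the value of m that stays constant over the widest parameter interval. The input format and the function name are assumptions for illustration; how m is obtained for each parameter value is left to the clustering algorithm.

```python
def most_stable_m(results):
    """Pick the m that is constant over the widest range of the parameter.

    results: list of (parameter_value, m) pairs sorted by parameter_value,
    e.g. the number of clusters found for each resolution parameter r.
    """
    best_m, best_span = None, -1.0
    start = 0
    for idx in range(1, len(results) + 1):
        # A run ends at the end of the list or when m changes.
        if idx == len(results) or results[idx][1] != results[start][1]:
            span = results[idx - 1][0] - results[start][0]
            if span > best_span:
                best_m, best_span = results[start][1], span
            start = idx
    return best_m
```

In practice the trivial answers (a single cluster, or one cluster per point) often cover wide parameter ranges too, so they may need to be filtered out before applying this rule.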

Method II
◮ the number of clusters m is pre-determined in the algorithm
◮ criterion: choose the best value of the clustering index q for m in the range $[m_{min}, m_{max}]$
◮ if q shows no trend with respect to m, vary the parameters A for each m and choose the best A
◮ if q shows a trend with respect to m, choose the m at which a significant local change of q occurs
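One simple way to locate a "significant local change" of q is to look for the largest absolute second difference of q over consecutive values of m. This rule is an assumption chosen for illustration, not the method of the slides, and the function name is hypothetical.

```python
def knee_by_second_difference(q_by_m):
    """Return the m with the largest |q(m-1) - 2 q(m) + q(m+1)|.

    q_by_m: dict mapping consecutive integer values of m to the index q.
    Needs at least three values of m; returns None otherwise.
    """
    ms = sorted(q_by_m)
    best_m, best_change = None, -1.0
    for prev, cur, nxt in zip(ms, ms[1:], ms[2:]):
        change = abs(q_by_m[prev] - 2 * q_by_m[cur] + q_by_m[nxt])
        if change > best_change:
            best_m, best_change = cur, change
    return best_m
```

For example, with q = {2: 10, 3: 6, 4: 2, 5: 1.8, 6: 1.7} the curve drops steeply until m = 4 and then flattens, so the function returns 4.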

Method II (cont'd)
Figure: data sets generated from 4 well-separated normal distributions (feature size l ∈ {2, 4, 6, 8}): (a) N = 50, (b) N = 100, (c) N = 150, (d) N = 200. The sharp turns indicate the clustering structure.

Method II (cont'd)
Figure: data sets generated from 4 poorly separated uniform distributions (feature size l ∈ {2, 4, 6, 8}): (a) N = 50, (b) N = 100, (c) N = 150, (d) N = 200. No sharp turn is exhibited.

Hard Clustering Indices
◮ The modified Hubert Γ statistic: correlation between the proximity matrix P and the cluster distance matrix Q
  ◮ $P(i,j) = d(x_i, x_j)$, $Q(i,j) = d(c_{x_i}, c_{x_j})$, where $c_{x_i}$ is the representative of the cluster containing $x_i$
  $$\Gamma = \frac{1}{M} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} X(i,j)\, Y(i,j) \qquad (1)$$
  with $X = P$, $Y = Q$, and $M = N(N-1)/2$ the number of pairs
◮ The Dunn and Dunn-like indices
  ◮ dissimilarity function between two clusters: $d(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} d(x, y)$
  ◮ diameter of a cluster C: $\mathrm{diam}(C) = \max_{x, y \in C} d(x, y)$
  ◮ Dunn index:
  $$D_m = \min_{i=1,\dots,m} \left\{ \min_{j=i+1,\dots,m} \left( \frac{d(C_i, C_j)}{\max_{k=1,\dots,m} \mathrm{diam}(C_k)} \right) \right\} \qquad (2)$$
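Both indices can be computed directly from their definitions, as in the sketch below. It assumes Euclidean distance for d and cluster means as the representatives $c_{x_i}$; neither choice is fixed by the slides, and the function names are hypothetical. The Dunn index as written needs at least two clusters and at least one cluster with more than one point.

```python
import math

def centroid(cluster):
    """Mean vector of a cluster (assumed choice of cluster representative)."""
    dim = len(cluster[0])
    return tuple(sum(p[k] for p in cluster) / len(cluster) for k in range(dim))

def modified_hubert_gamma(points, labels):
    """Equation (1) with X = P (point distances) and Y = Q (representative distances)."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    reps = {lab: centroid(members) for lab, members in groups.items()}
    n = len(points)
    total = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):
            p_ij = math.dist(points[i], points[j])
            q_ij = math.dist(reps[labels[i]], reps[labels[j]])
            total += p_ij * q_ij
    return total / (n * (n - 1) / 2)

def dunn_index(clusters):
    """Equation (2): smallest between-cluster distance over largest diameter."""
    min_sep = min(
        math.dist(x, y)
        for i in range(len(clusters) - 1)
        for j in range(i + 1, len(clusters))
        for x in clusters[i]
        for y in clusters[j]
    )
    max_diam = max(math.dist(x, y) for c in clusters for x in c for y in c)
    return min_sep / max_diam
```

Large values of either index point to compact, well-separated clusters, which is why both are candidates for the index q in the methods above.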
