
SLIDE 1

Cutting the dendrogram through permutation tests

Dario Bruzzese (dbruzzes@unina.it), Department of Preventive Medical Sciences, University of Naples, Italy
Domenico Vistocco (vistocco@unicas.it), Department of Economics, University of Cassino, Italy

Dario Bruzzese, Domenico Vistocco () Compstat 2010 1 / 19

SLIDE 2

La Carte

1. Motivation
2. The stairstep-like permutation procedure
   Notation
   The outline
3. Some results
   Real datasets
   Synthetic dataset
4. ToDo List

SLIDE 4

Motivation

Automatically determine the optimal cut-off level of a dendrogram.
Explore partitions different from those allowed by a horizontal cut.

The rep1HighNoise dataset

Yeung KY, Medvedovic M, Bumgarner RE: Clustering gene-expression data with repeated measurements. Genome Biology, 2003, 4:R34

n = 200, p = 20

[Figure: dendrogram with a horizontal cut, k = 3]

[Figure: dendrogram with an alternative cut, k = 3]

SLIDE 9

Notation

Let:

$n$ be the number of objects to classify;
$C^k_L$ and $C^k_R$ be the two classes merged at level $k$ ($k = 1, \dots, n-1$);
$h(C^k_L \cup C^k_R)$ be the height necessary to merge $C^k_L$ and $C^k_R$;
$h(C^k_j)$ be the height at which $C^k_j$ has been obtained ($j \in \{L, R\}$).

[Figure: dendrogram highlighting, for k = 1, 2, 3, the classes $C^k_L$ and $C^k_R$, the merge height $h(C^k_L \cup C^k_R)$, and the heights $h(C^k_L)$ and $h(C^k_R)$]
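The notation can be made concrete with a small data structure. The sketch below is only an illustration: the `Node` class, the `merge` helper and the toy heights are hypothetical and do not come from the slides. Each node stores the height at which its class was obtained, so `node.height` plays the role of $h(C^k_L \cup C^k_R)$, while `node.left.height` and `node.right.height` play the roles of $h(C^k_L)$ and $h(C^k_R)$.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Node:
    """A dendrogram node: a leaf (single object) or the merge of two classes."""
    height: float                    # height at which this class was obtained (0 for a leaf)
    members: Tuple[int, ...]         # indices of the objects in the class
    left: Optional["Node"] = None    # C_L (None for a leaf)
    right: Optional["Node"] = None   # C_R (None for a leaf)


def merge(left: Node, right: Node, height: float) -> Node:
    """Build the class C_L ∪ C_R obtained by merging two classes at `height`."""
    return Node(height, left.members + right.members, left, right)


# Toy dendrogram on n = 4 objects, hence n - 1 = 3 merges (heights are made up).
leaves = [Node(0.0, (i,)) for i in range(4)]
c_left = merge(leaves[0], leaves[1], 1.0)
c_right = merge(leaves[2], leaves[3], 2.5)
root = merge(c_left, c_right, 6.0)

print(root.height)                           # h(C_L ∪ C_R) for the top merge
print(root.left.height, root.right.height)   # h(C_L) and h(C_R)
```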

SLIDE 26

The algorithm - Pseudo Code

Input: A dataset and its related dendrogram
Output: A partition of the dataset

initialization:
    aggregationLevelsToVisit ← $h(C^1_L \cup C^1_R)$
    permClusters ← [ ]
    i ← 1
repeat
    if $C^i_L \equiv C^i_R$ then
        add $C^i_L \cup C^i_R$ to permClusters
    else
        add $h(C^i_L)$ and $h(C^i_R)$ to aggregationLevelsToVisit
        sort aggregationLevelsToVisit in descending order
    end
    remove the first element from aggregationLevelsToVisit
    i ← i + 1
until aggregationLevelsToVisit is empty
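A minimal Python sketch of this loop follows. It is not the authors' R code. The `Node` class and the `same_cluster` callback (a stand-in for the permutation test of $H_0$) are hypothetical, and keeping a singleton as its own cluster is an assumption the pseudocode does not spell out. A max-heap replaces the repeated descending sort, and the current level is popped before the test rather than removed after it, which visits the levels in the same order.

```python
import heapq
import itertools
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    height: float
    members: Tuple[int, ...]
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def stairstep_cut(root: Node, same_cluster: Callable[[Node], bool]) -> List[Tuple[int, ...]]:
    """Visit merges from the highest height down; keep a class when H0 is not rejected."""
    tie = itertools.count()                        # tie-breaker so the heap never compares Nodes
    to_visit = [(-root.height, next(tie), root)]   # negated heights give descending order
    perm_clusters: List[Tuple[int, ...]] = []
    while to_visit:                                # until aggregationLevelsToVisit is empty
        _, _, node = heapq.heappop(to_visit)       # take the highest remaining level
        if node.left is None or node.right is None:
            perm_clusters.append(node.members)     # assumption: a singleton stays on its own
        elif same_cluster(node):
            perm_clusters.append(node.members)     # H0 not rejected: keep C_L ∪ C_R
        else:
            for child in (node.left, node.right):  # H0 rejected: visit both sub-classes later
                heapq.heappush(to_visit, (-child.height, next(tie), child))
    return perm_clusters


def merge(left: Node, right: Node, height: float) -> Node:
    return Node(height, left.members + right.members, left, right)


# Toy dendrogram (made-up heights) and a toy rule standing in for the permutation test.
leaf = [Node(0.0, (i,)) for i in range(5)]
a = merge(leaf[0], leaf[1], 1.0)
b = merge(leaf[2], leaf[3], 2.0)
c = merge(b, leaf[4], 7.0)
root = merge(a, c, 9.0)

clusters = stairstep_cut(root, same_cluster=lambda node: node.height < 5.0)
print(clusters)  # [(2, 3), (0, 1), (4,)]
```

In practice `same_cluster` would run the permutation test on the two children and return True when the p-value exceeds the chosen significance level.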

SLIDE 31

The algorithm - The outline

Initialization (i ← 0)
    aggregationLevelsToVisit: $h(C^1_L \cup C^1_R)$
    permClusters: [ ]

Iteration i ← 1
    level visited: $h(C^1_L \cup C^1_R)$
    clusters to compare: $C^1_L$, $C^1_R$
    $H_0 : C^1_L \equiv C^1_R$ → reject
    aggregationLevelsToVisit: $h(C^1_L \cup C^1_R)$, $h(C^1_R)$, $h(C^1_L)$
    after removing the first element: $h(C^1_R)$, $h(C^1_L)$
    permClusters: [ ]

Iteration i ← 2
    level visited: $h(C^1_R)$
    clusters to compare: $C^2_L$, $C^2_R$
    $H_0 : C^2_L \equiv C^2_R$ → reject
    aggregationLevelsToVisit: $h(C^1_R)$, $h(C^1_L)$, $h(C^2_R)$, $h(C^2_L)$
    after removing the first element: $h(C^1_L)$, $h(C^2_R)$, $h(C^2_L)$
    permClusters: [ ]

Iteration i ← 3
    level visited: $h(C^1_L)$
    clusters to compare: $C^3_L$, $C^3_R$
    $H_0 : C^3_L \equiv C^3_R$ → reject
    aggregationLevelsToVisit after the update: $h(C^3_R)$, $h(C^2_R)$, $h(C^2_L)$, $h(C^3_L)$
    permClusters: [ ]

Iteration i ← 4
    level visited: $h(C^3_R)$
    clusters to compare: $C^4_L$, $C^4_R$
    $H_0 : C^4_L \equiv C^4_R$ → accept
    aggregationLevelsToVisit: $h(C^3_R)$, $h(C^2_R)$, $h(C^2_L)$, $h(C^3_L)$
    permClusters: $C^4_L \cup C^4_R \Leftrightarrow C^3_R$

Iteration i ← 9
    aggregationLevelsToVisit: [ ]
    permClusters: $C^3_L$, $C^3_R$, $C^2_L$, $C^4_L$, $C^4_R$

SLIDE 48

The algorithm - The permutation Test

For each aggregation level $k$, a permutation test is designed to test the null hypothesis that the two groups $C^k_L$ and $C^k_R$ really belong to the same cluster, i.e.:

$H_0 : C^k_L \equiv C^k_R$

Under this null hypothesis, mixing up (permuting) the statistical units of $C^k_L$ and $C^k_R$ should not alter the aggregation process that results in their merging.

SLIDE 49

The algorithm - The permutation Test

For each $k$, the difference between $\max_{j \in \{L,R\}} h(C^k_j)$ and $\min_{j \in \{L,R\}} h(C^k_j)$ can be considered as the minimum cost necessary to merge the two classes.

The difference between $h(C^k_L \cup C^k_R)$ and $\max_{j \in \{L,R\}} h(C^k_j)$ can instead be considered as the cost actually incurred for merging $C^k_L$ and $C^k_R$.

The ratio between these two differences:

$$\mathrm{cost}(C^k_L \cup C^k_R) = \frac{\max_{j \in \{L,R\}} h(C^k_j) - \min_{j \in \{L,R\}} h(C^k_j)}{h(C^k_L \cup C^k_R) - \max_{j \in \{L,R\}} h(C^k_j)}$$

is thus a measure that characterizes the aggregation process resulting in the new class $C^k_L \cup C^k_R$.

[Figure: dendrogram marking $\min_j h(C^3_j)$, $\max_j h(C^3_j)$ and $h(C^3_L \cup C^3_R)$]
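The cost ratio is simple arithmetic on three heights, as the sketch below illustrates. The function name `merge_cost` and the numeric heights are hypothetical. The slides do not say what to do when the merge height equals the larger sub-class height, so raising an error in that case is an assumption of this sketch.

```python
def merge_cost(h_left: float, h_right: float, h_union: float) -> float:
    """cost(C_L ∪ C_R) = (max h(C_j) - min h(C_j)) / (h(C_L ∪ C_R) - max h(C_j))."""
    h_max, h_min = max(h_left, h_right), min(h_left, h_right)
    denominator = h_union - h_max
    if denominator <= 0:
        # Case not covered by the slides: merging costs nothing beyond the taller sub-class.
        raise ValueError("h_union must exceed both sub-class heights")
    return (h_max - h_min) / denominator


# Made-up heights: h(C_L) = 1.0, h(C_R) = 2.5, h(C_L ∪ C_R) = 6.0,
# so the ratio is (2.5 - 1.0) / (6.0 - 2.5) = 1.5 / 3.5.
print(merge_cost(h_left=1.0, h_right=2.5, h_union=6.0))
```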

SLIDE 52

The algorithm - The permutation Test

[Figure: the clusters $C^3_L$ and $C^3_R$, the permuted groups $mC^3_L$ and $mC^3_R$, and the heights $h(mC^3_L)$ and $h(mC^3_R)$]

The ratio:

$$\mathrm{cost}(mC^k_L \cup mC^k_R) = \frac{\max_{j \in \{L,R\}} h(mC^k_j) - \min_{j \in \{L,R\}} h(mC^k_j)}{h(C^k_L \cup C^k_R) - \max_{j \in \{L,R\}} h(mC^k_j)}$$

is thus a measure that characterizes the aggregation process resulting in the new (potential) class $mC^k_L \cup mC^k_R$.

Under $H_0$, the aggregation process resulting in the new cluster $C^k_L \cup C^k_R$ should be very similar to the one that potentially produces $mC^k_L \cup mC^k_R$; thus the two values $\mathrm{cost}(mC^k_L \cup mC^k_R)$ and $\mathrm{cost}(C^k_L \cup C^k_R)$ should be close to each other.

SLIDE 56

The algorithm - The permutation Test

The permutation procedure is repeated $M$ times, and each time a new couple $mC^k_L$, $mC^k_R$ is obtained. The Monte Carlo p-value is thus computed as:

$$p = \frac{\#\{\mathrm{cost}(mC^k_L \cup mC^k_R) \le \mathrm{cost}(C^k_L \cup C^k_R)\} + 1}{M + 1}$$
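The Monte Carlo p-value is a counting rule over the $M$ permuted costs, which the sketch below spells out. The function name `monte_carlo_p_value` and the example costs are hypothetical. How each permuted cost is obtained (permuting the units and recomputing the heights) is left out.

```python
from typing import Sequence


def monte_carlo_p_value(observed_cost: float, permuted_costs: Sequence[float]) -> float:
    """p = (#{permuted cost <= observed cost} + 1) / (M + 1), with M = len(permuted_costs)."""
    count = sum(1 for cost in permuted_costs if cost <= observed_cost)
    return (count + 1) / (len(permuted_costs) + 1)


# Made-up costs from M = 4 permutations; only 0.2 and 0.4 are <= 0.43.
print(monte_carlo_p_value(0.43, [0.2, 0.9, 1.3, 0.4]))  # (2 + 1) / (4 + 1) = 0.6
```

The +1 in numerator and denominator keeps the p-value strictly positive, so the smallest attainable value is 1 / (M + 1).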

SLIDE 58

Some results - Real datasets

The yeast galactose dataset

Ideker T, Thorsson V, Ranish JA, Christmas R, Buhler J, Eng JK, Bumgarner RE, Goodlett DR, Aebersold R, Hood L: Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science, 2001, 292:929-934.

n = 205, p = 80

% of misclassification = 1.5

SLIDE 60

Some results - Real datasets

The diabetes dataset

Banfield JD, Raftery AE: Model-based Gaussian and non-Gaussian clustering. Biometrics, 1993, 49:803-821.

n = 145, p = 3

% of misclassification = 15.2

SLIDE 62

Some results - Synthetic dataset

Qiu W.-L., Joe H. (2009). clusterGeneration: random cluster generation (with specified degree of separation). R package version 1.2.7.

different numbers of clusters (k = 2, 3, 4, 5, 6, 7)
separation index = 0.01
different numbers of variables (p = 5, 10, 15)
100 replications for each combination of k and p

SLIDE 63

Some results - Synthetic dataset (p=5)

SLIDE 64

Some results - Synthetic dataset (p=10)

SLIDE 65

Some results - Synthetic dataset (p=15)

SLIDE 67

ToDo List

Statistical issues
    Introducing a penalty term in the permutation test step
    Quality measures of the obtained partition
    Multiple Testing Problem (???)

Computational issues
    Profiling and optimizing the R code
        ◮ use of compiled code
        ◮ deploying a package