Cutting the dendrogram through permutation tests


slide-1
SLIDE 1

Cutting the dendrogram through permutation tests

Dario Bruzzese (dbruzzes@unina.it), Department of Preventive Medical Sciences, University of Naples, Italy
Domenico Vistocco (vistocco@unicas.it), Department of Economics, University of Cassino, Italy

Compstat 2010

slide-2
SLIDE 2

La Carte

1 Motivation
2 The stairstep-like permutation procedure
    Notation
    The outline
3 Some results
    Real datasets
    Synthetic dataset
4 ToDo List

slide-4
SLIDE 4

Motivation

  • Automatically determine the optimal cut-off level of a dendrogram
  • Explore partitions different from those allowed by a horizontal cut

slide-5
SLIDE 5

Motivation

The rep1HighNoise dataset (n = 200, p = 20)

Yeung KY, Medvedovic M, Bumgarner RE: Clustering gene-expression data with repeated measurements. Genome Biology, 2003, 4:R34.

slide-6
SLIDE 6

Motivation

Horizontal cut (k = 3)

slide-7
SLIDE 7

Motivation

An alternative cut (k = 3)

slide-19
SLIDE 19

Notation

Let:

  • $n$ be the number of objects to classify;
  • $C^k_L$ and $C^k_R$ be the two classes merged at level $k$ ($k = 1, \dots, n-1$);
  • $h(C^k_L \cup C^k_R)$ be the height necessary to merge $C^k_L$ and $C^k_R$;
  • $h(C^k_j)$ be the height at which $C^k_j$ has been obtained ($j \in \{L, R\}$).
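The notation can be made concrete on a toy dendrogram. The sketch below is only an illustration: the `Node` class, the `leaf` and `merge` helpers, the object names and the heights are all hypothetical, and it assumes that a single object sits at height 0.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """A dendrogram node (hypothetical helper); leaves have height 0 and no children."""
    members: List[str]
    height: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def leaf(name: str) -> Node:
    """A single object to classify."""
    return Node(members=[name])


def merge(left: Node, right: Node, height: float) -> Node:
    """Build C_L ∪ C_R and record h(C_L ∪ C_R), the height needed to merge them."""
    return Node(members=left.members + right.members,
                height=height, left=left, right=right)


# Toy dendrogram on n = 4 objects, hence n - 1 = 3 merges (made-up heights).
ab = merge(leaf("a"), leaf("b"), height=1.0)
cd = merge(leaf("c"), leaf("d"), height=2.5)
root = merge(ab, cd, height=6.0)

h_union = root.height                   # h(C_L ∪ C_R) for the top merge
h_children = {"L": root.left.height,    # h(C_L): height at which C_L was obtained
              "R": root.right.height}   # h(C_R): height at which C_R was obtained
print(h_union, h_children)
```

Here the top merge plays the role of level $k = 1$, with $C^1_L$ = {a, b} and $C^1_R$ = {c, d}.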

slide-30
SLIDE 30

The algorithm - Pseudo Code

Input: A dataset and its related dendrogram
Output: A partition of the dataset

initialization:
    aggregationLevelsToVisit ← $h(C^1_L \cup C^1_R)$
    permClusters ← [ ]
    i ← 1
repeat
    if $C^i_L \equiv C^i_R$ then
        add $C^i_L \cup C^i_R$ to permClusters
    else
        add $h(C^i_L)$ and $h(C^i_R)$ to aggregationLevelsToVisit
        sort aggregationLevelsToVisit in descending order
    end
    remove the first element from aggregationLevelsToVisit
    i ← i+1
until aggregationLevelsToVisit is empty
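A minimal Python sketch of this traversal is given below; it is not the authors' R code. It makes several assumptions: the `Node` class and the `cut_dendrogram` name are hypothetical, the permutation test is passed in as a callable `same_cluster` that returns True when $H_0$ is accepted, a heap replaces the explicit descending sort, and a single object is kept as its own cluster.

```python
import heapq
from itertools import count
from typing import Callable, List, Optional


class Node:
    """Dendrogram node (hypothetical helper); a leaf has no children and height 0."""

    def __init__(self, members: List[str], height: float = 0.0,
                 left: Optional["Node"] = None, right: Optional["Node"] = None):
        self.members = list(members)
        self.height = height
        self.left = left
        self.right = right


def cut_dendrogram(root: Node,
                   same_cluster: Callable[[Node, Node], bool]) -> List[List[str]]:
    """Visit merge levels from the highest down, splitting whenever H0 is rejected."""
    tie = count()  # tie-breaker so that nodes themselves are never compared
    to_visit = [(-root.height, next(tie), root)]   # aggregationLevelsToVisit
    perm_clusters: List[List[str]] = []            # permClusters
    while to_visit:                                # until the list is empty
        _, _, node = heapq.heappop(to_visit)       # highest remaining level
        if node.left is None or node.right is None:
            perm_clusters.append(node.members)     # assumption: a leaf is its own cluster
        elif same_cluster(node.left, node.right):  # H0 accepted: keep the merge
            perm_clusters.append(node.members)
        else:                                      # H0 rejected: visit both children
            for child in (node.left, node.right):
                heapq.heappush(to_visit, (-child.height, next(tie), child))
    return perm_clusters


# Toy dendrogram with made-up heights.
ab = Node(["a", "b"], 1.0, Node(["a"]), Node(["b"]))
cd = Node(["c", "d"], 2.5, Node(["c"]), Node(["d"]))
root = Node(["a", "b", "c", "d"], 6.0, ab, cd)


def stand_in_test(left: Node, right: Node) -> bool:
    """Placeholder for the permutation test: accept H0 only for pairs of single objects."""
    return len(left.members) + len(right.members) <= 2


partition = cut_dendrogram(root, stand_in_test)
print(partition)
```

Because the decision is taken node by node, the resulting partition need not correspond to any single horizontal cut of the dendrogram.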

slide-31
SLIDE 31

The algorithm - The outline

Initialization (i ← 0)

  • aggregationLevelsToVisit: $h(C^1_L \cup C^1_R)$
  • permClusters: empty

slide-32
SLIDE 32

The algorithm - The outline

Iteration i ← 1

  • Level visited: $h(C^1_L \cup C^1_R)$
  • Clusters to compare: $C^1_L$ and $C^1_R$
  • $H_0 : C^1_L \equiv C^1_R$ → reject
  • aggregationLevelsToVisit after adding the two heights: $h(C^1_L \cup C^1_R)$, $h(C^1_R)$, $h(C^1_L)$
  • aggregationLevelsToVisit after removing the first element: $h(C^1_R)$, $h(C^1_L)$
  • permClusters: empty

slide-36
SLIDE 36

The algorithm - The outline

Iteration i ← 2

  • Level visited: $h(C^1_R)$
  • Clusters to compare: $C^2_L$ and $C^2_R$
  • $H_0 : C^2_L \equiv C^2_R$ → reject
  • aggregationLevelsToVisit after adding the two heights: $h(C^1_R)$, $h(C^1_L)$, $h(C^2_R)$, $h(C^2_L)$
  • aggregationLevelsToVisit after removing the first element: $h(C^1_L)$, $h(C^2_R)$, $h(C^2_L)$
  • permClusters: empty

slide-40
SLIDE 40

The algorithm - The outline

Iteration i ← 3

  • Level visited: $h(C^1_L)$
  • Clusters to compare: $C^3_L$ and $C^3_R$
  • $H_0 : C^3_L \equiv C^3_R$ → reject
  • aggregationLevelsToVisit after the update: $h(C^3_R)$, $h(C^2_R)$, $h(C^2_L)$, $h(C^3_L)$
  • permClusters: empty

slide-43
SLIDE 43

The algorithm - The outline

Iteration i ← 4

  • Level visited: $h(C^3_R)$
  • aggregationLevelsToVisit: $h(C^3_R)$, $h(C^2_R)$, $h(C^2_L)$, $h(C^3_L)$
  • Clusters to compare: $C^4_L$ and $C^4_R$
  • $H_0 : C^4_L \equiv C^4_R$ → accept
  • permClusters: $C^4_L \cup C^4_R \Leftrightarrow C^3_R$

slide-46
SLIDE 46

The algorithm - The outline

Iteration i ← 9

  • aggregationLevelsToVisit: empty
  • permClusters: $C^3_L$, $C^3_R$, $C^2_L$, $C^4_L$, $C^4_R$

slide-48
SLIDE 48

The algorithm - The permutation Test

For each aggregation level $k$, a permutation test is designed to test the null hypothesis that the two groups $C^k_L$ and $C^k_R$ really belong to the same cluster, i.e.:

$H_0 : C^k_L \equiv C^k_R$

Under this null hypothesis, mixing up (permuting) the statistical units of $C^k_L$ and $C^k_R$ should not alter the aggregation process that results in their merging.
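One way to picture the mixing step is sketched below. It is an assumption-laden illustration: the slides do not say how the units are reassigned, so the sketch assumes the two permuted groups keep the original group sizes, and `permute_groups` is a hypothetical helper name.

```python
import random
from typing import List, Sequence, Tuple


def permute_groups(left: Sequence[str], right: Sequence[str],
                   rng: random.Random) -> Tuple[List[str], List[str]]:
    """Pool the units of C_L and C_R, shuffle them, and split them back.

    Assumption (not stated in the slides): the permuted groups mC_L and mC_R
    keep the sizes of the original groups.
    """
    pooled = list(left) + list(right)
    rng.shuffle(pooled)
    return pooled[:len(left)], pooled[len(left):]


rng = random.Random(0)  # fixed seed only to make the example repeatable
m_left, m_right = permute_groups(["a", "b", "c"], ["d", "e"], rng)
print(m_left, m_right)
```

Each permuted group would then be clustered again to obtain its own merge height.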

slide-51
SLIDE 51

The algorithm - The permutation Test

For each $k$, the difference between $\max_{j \in \{L,R\}} h(C^k_j)$ and $\min_{j \in \{L,R\}} h(C^k_j)$ can be considered as the minimum cost necessary to merge the two classes.

The difference between $h(C^k_L \cup C^k_R)$ and $\max_{j \in \{L,R\}} h(C^k_j)$ can instead be considered as the cost actually incurred for merging $C^k_L$ and $C^k_R$.

The ratio between these two differences:

$$\mathrm{cost}\left(C^k_L \cup C^k_R\right) = \frac{\max_{j \in \{L,R\}} h(C^k_j) - \min_{j \in \{L,R\}} h(C^k_j)}{h(C^k_L \cup C^k_R) - \max_{j \in \{L,R\}} h(C^k_j)}$$

is thus a measure that characterizes the aggregation process resulting in the new class $C^k_L \cup C^k_R$.
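The ratio can be checked numerically with a small sketch. The function name `merge_cost` and the three heights are made up for illustration, and raising an error when the denominator is zero is an assumption, since the slides do not discuss that case.

```python
def merge_cost(h_left: float, h_right: float, h_union: float) -> float:
    """Ratio between the minimum merging cost and the cost actually incurred.

    h_left, h_right: heights at which C_L and C_R were obtained.
    h_union: height necessary to merge C_L and C_R.
    """
    h_max = max(h_left, h_right)
    h_min = min(h_left, h_right)
    incurred = h_union - h_max          # cost actually incurred for the merge
    if incurred == 0:
        # Assumption: this degenerate case is not covered in the slides.
        raise ValueError("h_union equals the larger child height; ratio undefined")
    return (h_max - h_min) / incurred   # minimum cost / incurred cost


# Made-up heights: h(C_L) = 1.0, h(C_R) = 2.5, h(C_L ∪ C_R) = 6.0.
cost = merge_cost(1.0, 2.5, 6.0)        # (2.5 - 1.0) / (6.0 - 2.5)
print(cost)
```

With these heights the numerator is 1.5 and the denominator is 3.5, so the ratio is 3/7.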

slide-53
SLIDE 53

The algorithm - The permutation Test

[Figure: the units of $C^3_L$ and $C^3_R$ are mixed to form $mC^3_L$ and $mC^3_R$, whose heights are $h(mC^3_L)$ and $h(mC^3_R)$.]

slide-54
SLIDE 54

The algorithm - The permutation Test

The ratio:

$$\mathrm{cost}\left(mC^k_L \cup mC^k_R\right) = \frac{\max_{j \in \{L,R\}} h(mC^k_j) - \min_{j \in \{L,R\}} h(mC^k_j)}{h(C^k_L \cup C^k_R) - \max_{j \in \{L,R\}} h(mC^k_j)}$$

is thus a measure that characterizes the aggregation process resulting in the new (potential) class $mC^k_L \cup mC^k_R$.

slide-55
SLIDE 55

The algorithm - The permutation Test

Under $H_0$, the aggregation process resulting in the new cluster $C^k_L \cup C^k_R$ should be very similar to the one that potentially produces $mC^k_L \cup mC^k_R$; thus the two values $\mathrm{cost}(mC^k_L \cup mC^k_R)$ and $\mathrm{cost}(C^k_L \cup C^k_R)$ should be close.

slide-56
SLIDE 56

The algorithm - The permutation Test

The permutation procedure is repeated $M$ times, and each time a new couple $mC^k_L$, $mC^k_R$ is obtained. The Monte Carlo p-value is thus computed as:

$$p = \frac{\#\left\{\mathrm{cost}(mC^k_L \cup mC^k_R) \le \mathrm{cost}(C^k_L \cup C^k_R)\right\} + 1}{M + 1}$$
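The p-value formula translates directly into a few lines of Python. In this sketch the function name `monte_carlo_p_value` and the cost values are hypothetical; the direction of the inequality (≤) follows the slide.

```python
from typing import Sequence


def monte_carlo_p_value(permuted_costs: Sequence[float], observed_cost: float) -> float:
    """(number of permuted costs <= observed cost, plus 1) divided by (M + 1)."""
    m = len(permuted_costs)                                  # M permutations
    hits = sum(1 for c in permuted_costs if c <= observed_cost)
    return (hits + 1) / (m + 1)


# Made-up costs from M = 4 permutations and an observed cost of 0.5.
p = monte_carlo_p_value([0.2, 0.5, 0.9, 1.4], observed_cost=0.5)  # (2 + 1) / (4 + 1)
print(p)
```

The "+ 1" in the numerator and the denominator counts the observed configuration itself, so the p-value can never be exactly zero.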

slide-58
SLIDE 58

Some results - Real datasets

The yeast galactose dataset (n = 205, p = 80)

Ideker T, Thorsson V, Ranish JA, Christmas R, Buhler J, Eng JK, Bumgarner RE, Goodlett DR, Aebersold R, Hood L: Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science, 2001, 292:929-934.

slide-59
SLIDE 59

Some results - Real datasets

The yeast galactose dataset: % of misclassification = 1.5

slide-60
SLIDE 60

Some results - Real datasets

The diabetes dataset (n = 145, p = 3)

Banfield JD, Raftery AE: Model-based Gaussian and Non-Gaussian Clustering. Biometrics, 1993, 49:803-821.

slide-61
SLIDE 61

Some results - Real datasets

The diabetes dataset: % of misclassification = 15.2

slide-62
SLIDE 62

Some results - Synthetic dataset

Qiu W.-L., Joe H. (2009). clusterGeneration: random cluster generation (with specified degree of separation). R package version 1.2.7.

  • different number of clusters (k = 2, 3, 4, 5, 6, 7)
  • separation index = 0.01
  • different number of variables (p = 5, 10, 15)
  • 100 replications for each combination of k and p
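The size of this simulation design can be enumerated with a short sketch. Only the grid is built here; the data generation itself (done with the clusterGeneration R package in the slides) is not reproduced, and the variable names are illustrative.

```python
from itertools import product

ks = [2, 3, 4, 5, 6, 7]      # number of clusters
ps = [5, 10, 15]             # number of variables
replications = 100           # replications per (k, p) combination
separation_index = 0.01      # fixed for every generated dataset

# One entry per synthetic dataset: (k, p, replication index).
design = [(k, p, r) for k, p in product(ks, ps) for r in range(replications)]
print(len(design))           # 6 * 3 * 100 datasets in total
```

The grid therefore covers 18 combinations of k and p, for 1800 synthetic datasets overall.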

slide-63
SLIDE 63

Some results - Synthetic dataset (p=5)


slide-64
SLIDE 64

Some results - Synthetic dataset (p=10)


slide-65
SLIDE 65

Some results - Synthetic dataset (p=15)


slide-67
SLIDE 67

ToDo List

Statistical issues

  • Introducing a penalty term in the permutation test step
  • Quality measures of the obtained partition
  • Multiple Testing Problem (???)

Computational issues

  • Profiling and optimizing the R code
      ◮ use of compiled code
      ◮ deploying a package