SLIDE 1

Stairstep-like dendrogram cut: a permutation test approach

Dario Bruzzese (dbruzzes@unina.it), Umberto Giani (ugiani@unina.it): Department of Preventive Medical Sciences, University of Naples, Italy

Domenico Vistocco (vistocco@unicas.it): Department of Economics, University of Cassino, Italy

D. Bruzzese, U. Giani, D. Vistocco (Stairstep-like dendrogram cut, Sismec 2009)

SLIDE 2

La Carte

1. Motivation
2. The stairstep-like permutation procedure
     • Notation
     • The outline
     • The Core
3. Some results
     • Real datasets
     • Synthetic dataset
4. ToDo List

SLIDE 4

Motivation

  • Automatically determine the optimal cut-off level of a dendrogram
  • Explore partitions different from those allowed by a horizontal cut

The rep1HighNoise dataset

Yeung KY, Medvedovic M, Bumgarner RE: Clustering gene-expression data with repeated measurements. Genome Biology, 2003, 4:R34

n = 200, p = 20

It is a synthetic data set with error distributions derived from real array data.

[Figures: dendrogram of the rep1HighNoise dataset with the following cuts]

  • Horizontal cut, k = 3 (green clusters)
  • An alternative cut, k = 3 (rainbow clusters)
  • Horizontal cut, k = 4 (blue clusters)
  • An alternative cut, k = 4 (rainbow clusters)
  • An alternative cut, k = 5 (rainbow clusters)

SLIDE 18

Notation

Let:

  • $n$ be the number of objects to classify;
  • $C^k_L$ and $C^k_R$ be the two classes merged at level $k$ ($k = 1, \dots, n-1$);
  • $h(C^k_L \cup C^k_R)$ be the height necessary to merge $C^k_L$ and $C^k_R$;
  • $h(C^k_j)$ be the height at which $C^k_j$ has been obtained ($j \in \{L, R\}$).

[Figure: dendrogram highlighting, for k = 1, 2, 3, the classes $C^k_L$ and $C^k_R$ and the heights $h(C^k_L \cup C^k_R)$, $h(C^k_L)$ and $h(C^k_R)$]

SLIDE 35

The algorithm - Pseudo Code

Input: A dataset and its related dendrogram
Output: A partition of the dataset

initialization:
    aggregationLevelsToVisit ← h(C^1_L ∪ C^1_R)
    permClusters ← [ ]
    i ← 1
repeat
    if C^i_L ≡ C^i_R then
        add C^i_L ∪ C^i_R to permClusters
    else
        add h(C^i_L) and h(C^i_R) to aggregationLevelsToVisit
        sort aggregationLevelsToVisit in descending order
    end
    remove the first element from aggregationLevelsToVisit
    i ← i+1
until aggregationLevelsToVisit is empty
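The traversal in the pseudo code can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' R implementation: the `Node` class, the `leaf` and `merge` helpers and the `same_cluster` callback (a stand-in for the permutation test) are hypothetical names, a leaf is assumed to be a cluster of its own, and a max-heap replaces the explicit sort in descending order.

```python
import heapq
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    """Dendrogram node (hypothetical structure): a leaf or a merge of two classes."""
    members: List[int]
    height: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def leaf(i: int) -> Node:
    return Node(members=[i])


def merge(left: Node, right: Node, height: float) -> Node:
    return Node(left.members + right.members, height, left, right)


def stairstep_cut(root: Node, same_cluster: Callable[[Node], bool]) -> List[List[int]]:
    """Visit nodes from the highest merge downward. A node is kept whole when
    same_cluster(node) accepts H0 for its two classes; otherwise both classes are visited."""
    perm_clusters: List[List[int]] = []
    # heapq is a min-heap, so heights are negated; the counter breaks ties
    to_visit = [(-root.height, 0, root)]
    counter = 1
    while to_visit:
        _, _, node = heapq.heappop(to_visit)
        # Assumption: a leaf cannot be split further, so it is kept as a cluster
        if node.left is None or same_cluster(node):
            perm_clusters.append(sorted(node.members))
        else:
            for child in (node.left, node.right):
                heapq.heappush(to_visit, (-child.height, counter, child))
                counter += 1
    return perm_clusters


# Toy dendrogram on five objects (illustrative heights)
a = merge(leaf(0), leaf(1), 1.0)
b = merge(leaf(2), leaf(3), 1.5)
c = merge(b, leaf(4), 4.0)
root = merge(a, c, 10.0)

# Stand-in for the permutation test: H0 is accepted only at the node of height 4.0
clusters = stairstep_cut(root, lambda node: node.height == 4.0)
print(clusters)
```

In this toy run the node at height 4.0 is kept as one cluster while the node at height 1.0 is split, a partition that no horizontal cut of the same tree can produce.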

SLIDE 40

The algorithm - The outline

[Figure: the dendrogram, with the node visited at each step highlighted]

Initialization, i ← 0
    aggregationLevelsToVisit: h(C^1_L ∪ C^1_R)
    permClusters: [ ]

Iteration i ← 1
    node visited: h(C^1_L ∪ C^1_R)
    clusters to compare: C^1_L and C^1_R
    H0 : C^1_L ≡ C^1_R → reject
    aggregationLevelsToVisit: h(C^1_L ∪ C^1_R), h(C^1_R), h(C^1_L)
    after removing the first element: h(C^1_R), h(C^1_L)
    permClusters: [ ]

Iteration i ← 2
    node visited: h(C^1_R)
    clusters to compare: C^2_L and C^2_R
    H0 : C^2_L ≡ C^2_R → reject
    aggregationLevelsToVisit: h(C^1_R), h(C^1_L), h(C^2_R), h(C^2_L)
    after removing the first element: h(C^1_L), h(C^2_R), h(C^2_L)
    permClusters: [ ]

Iteration i ← 3
    node visited: h(C^1_L)
    clusters to compare: C^3_L and C^3_R
    H0 : C^3_L ≡ C^3_R → reject
    aggregationLevelsToVisit: h(C^3_R), h(C^2_R), h(C^2_L), h(C^3_L)
    permClusters: [ ]

Iteration i ← 4
    node visited: h(C^3_R)
    clusters to compare: C^4_L and C^4_R
    H0 : C^4_L ≡ C^4_R → accept
    aggregationLevelsToVisit: h(C^3_R), h(C^2_R), h(C^2_L), h(C^3_L)
    permClusters: C^4_L ∪ C^4_R ⇔ C^3_R

Iteration i ← 9
    aggregationLevelsToVisit: [ ]
    permClusters: C^3_L, C^3_R, C^2_L, C^4_L, C^4_R

SLIDE 56

The algorithm - The permutation Test

Let:

  • $n$ be the number of objects to classify;
  • $C^k_L$ and $C^k_R$ be the two classes merged at level $k$ ($k = 1, \dots, n-1$);
  • $h(C^k_L \cup C^k_R)$ be the height necessary to merge $C^k_L$ and $C^k_R$, with $h(C^1_L \cup C^1_R) \equiv h(\mathrm{tree})$;
  • $h(C^k_j)$ be the height at which $C^k_j$ has been obtained ($j \in \{L, R\}$).

For each $k$, the difference between $\max_{j \in \{L,R\}} h(C^k_j)$ and $\min_{j \in \{L,R\}} h(C^k_j)$ can be considered as the minimum cost necessary to merge the two classes.

The difference between $h(C^k_L \cup C^k_R)$ and $\max_{j \in \{L,R\}} h(C^k_j)$ can instead be considered as the cost actually incurred for merging $C^k_L$ and $C^k_R$.

The ratio between these two costs,

$$\mathrm{cost}(C^k_L \cup C^k_R) = \frac{\max_{j \in \{L,R\}} h(C^k_j) - \min_{j \in \{L,R\}} h(C^k_j)}{h(C^k_L \cup C^k_R) - \max_{j \in \{L,R\}} h(C^k_j)},$$

is thus a measure that characterizes the aggregation process resulting in the new class $C^k_L \cup C^k_R$.

[Figure: dendrogram marking $\max_j h(C^3_j)$, $\min_j h(C^3_j)$ and $h(C^3_L \cup C^3_R)$]
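As a small worked example of the ratio between the two costs, the function below computes it from the three heights involved. It is only a sketch: the name `merge_cost` is hypothetical, and the guard against a non-positive denominator is an assumption added here, since the slides do not say how ties between the merge height and the class heights are handled.

```python
def merge_cost(h_left: float, h_right: float, h_union: float) -> float:
    """Ratio between the minimum cost and the cost actually incurred
    to merge two classes built at heights h_left and h_right."""
    h_max = max(h_left, h_right)
    h_min = min(h_left, h_right)
    incurred = h_union - h_max  # cost actually incurred for the merge
    if incurred <= 0:
        # Assumption: the ratio is left undefined when the merge adds no height
        raise ValueError("the merge height must exceed the heights of both classes")
    return (h_max - h_min) / incurred


# Classes built at heights 2.0 and 5.0 and merged at height 11.0
print(merge_cost(2.0, 5.0, 11.0))  # (5 - 2) / (11 - 5) = 0.5
```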

SLIDE 60

The algorithm - The permutation Test

The algorithm retraces the tree downward, starting from the root of the dendrogram, where all objects are classified in a single cluster.

For each $k$, a permutation test is designed to test the null hypothesis that the two classes $C^k_L$ and $C^k_R$ really belong to the same cluster, i.e.:

$$H_0 : C^k_L \equiv C^k_R$$

Under $H_0$, mixing up (permuting) the statistical units of $C^k_L$ and $C^k_R$ should not alter the aggregation process that results in their merging.

Let ${}^{m}C^k_L$ and ${}^{m}C^k_R$ be the two new classes obtained by permuting the elements in $C^k_L$ and $C^k_R$. For each of them a new dendrogram is generated. The heights at which the two classes are built up again, $h({}^{m}C^k_L)$ and $h({}^{m}C^k_R)$, clearly correspond to the heights of the root nodes of the corresponding dendrograms.

[Figure: the classes $C^1_L$ and $C^1_R$, the permuted classes ${}^{m}C^1_L$ and ${}^{m}C^1_R$, and the new dendrograms with root heights $h({}^{m}C^1_L)$ and $h({}^{m}C^1_R)$]
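The permutation step can be sketched as below. The function name `permute_classes` is hypothetical, and the sketch assumes that the two permuted classes keep the sizes of the original ones, which the slides do not state explicitly. Building the new dendrograms for the two permuted classes is left out.

```python
import random
from typing import List, Sequence, Tuple


def permute_classes(left: Sequence, right: Sequence,
                    rng: random.Random) -> Tuple[List, List]:
    """Pool the units of the two classes, shuffle them, and split them again.
    Assumption: each permuted class keeps the size of the original class."""
    pooled = list(left) + list(right)
    rng.shuffle(pooled)
    return pooled[:len(left)], pooled[len(left):]


rng = random.Random(2009)  # fixed seed only to make the example repeatable
m_left, m_right = permute_classes([0, 1, 2], [3, 4, 5, 6, 7], rng)
print(len(m_left), len(m_right))  # sizes are preserved: 3 5
```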

SLIDE 64

The algorithm - The permutation Test

The ratio

$$\mathrm{cost}({}^{m}C^k_L \cup {}^{m}C^k_R) = \frac{\max_{j \in \{L,R\}} h({}^{m}C^k_j) - \min_{j \in \{L,R\}} h({}^{m}C^k_j)}{h(C^k_L \cup C^k_R) - \max_{j \in \{L,R\}} h({}^{m}C^k_j)}$$

is thus a measure that characterizes the aggregation process resulting in the new (potential) class ${}^{m}C^k_L \cup {}^{m}C^k_R$.

Under $H_0$ the aggregation process resulting in the new cluster $C^k_L \cup C^k_R$ should be very similar to the one that potentially produces ${}^{m}C^k_L \cup {}^{m}C^k_R$; thus the two values $\mathrm{cost}({}^{m}C^k_L \cup {}^{m}C^k_R)$ and $\mathrm{cost}(C^k_L \cup C^k_R)$ should be close.

The permutation procedure is repeated $M$ times, and each time a new pair ${}^{m}C^k_L$, ${}^{m}C^k_R$ is obtained. The Monte Carlo p-value is thus computed as:

$$p = \frac{\#\{\mathrm{cost}({}^{m}C^k_L \cup {}^{m}C^k_R) \le \mathrm{cost}(C^k_L \cup C^k_R)\} + 1}{M + 1}$$
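The Monte Carlo p-value is simple arithmetic once the M permuted costs are available. The sketch below uses a hypothetical function name and made-up cost values purely for illustration.

```python
from typing import Sequence


def monte_carlo_p_value(observed_cost: float, permuted_costs: Sequence[float]) -> float:
    """p = (#{permuted cost <= observed cost} + 1) / (M + 1), with M = len(permuted_costs)."""
    m = len(permuted_costs)
    count = sum(1 for cost in permuted_costs if cost <= observed_cost)
    return (count + 1) / (m + 1)


# Illustrative values: M = 4 permutations, one of which has a cost <= the observed one
p = monte_carlo_p_value(0.5, [0.2, 0.7, 0.9, 1.1])
print(p)  # (1 + 1) / (4 + 1) = 0.4
```

With the +1 correction the p-value can never be zero: its smallest possible value is 1 / (M + 1).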

SLIDE 68

Some results - Real datasets

The yeast galactose dataset

Ideker T, Thorsson V, Ranish JA, Christmas R, Buhler J, Eng JK, Bumgarner RE, Goodlett DR, Aebersold R, Hood L: Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science, 2001, 292:929-934.

n = 205, p = 80

It is a subset of 205 genes that reflect four functional categories in the Gene Ontology listings.

Settings

distanceMethod = euclidean
aggregationMethod = Ward
α = 0.01
M = 1000

Results

Method                    % of misclassification
The proposed algorithm    1.5
k-means (k = 4)           8.3

SLIDE 72

Some results

The diabetes dataset

Banfield JD, Raftery AE: Model-based Gaussian and Non-Gaussian Clustering. Biometrics, 1993, 49:803-821.

n = 145, p = 3

It contains 145 subjects divided into three groups (normal, chemical diabetes, overt diabetes) on the basis of their oral glucose tolerance, described by three variables.

Settings

distanceMethod = euclidean
aggregationMethod = Ward
α = 0.01
M = 1000

Results

Method                    % of misclassification
The proposed algorithm    15.2
k-means (k = 3)           18.6
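The slides do not say how the percentage of misclassification is computed. One common convention, sketched below as an assumption, matches each found cluster to a distinct true group so that agreement is maximal and counts the remaining units as misclassified. The function name is hypothetical, and the exhaustive search over matchings is only practical for a handful of clusters.

```python
from itertools import permutations
from typing import Hashable, Sequence


def misclassification_rate(true_labels: Sequence[Hashable],
                           found_labels: Sequence[Hashable]) -> float:
    """Percentage of units misclassified under the best one-to-one matching
    between found clusters and true groups (labels are assumed not to be None)."""
    if not true_labels or len(true_labels) != len(found_labels):
        raise ValueError("label sequences must be non-empty and of equal length")
    groups = list(dict.fromkeys(true_labels))     # distinct true groups
    clusters = list(dict.fromkeys(found_labels))  # distinct found clusters
    # Pad with placeholders so that extra clusters can stay unmatched
    padded = groups + [None] * max(0, len(clusters) - len(groups))
    best_hits = 0
    for assignment in permutations(padded, len(clusters)):
        mapping = dict(zip(clusters, assignment))
        hits = sum(1 for t, f in zip(true_labels, found_labels) if mapping[f] == t)
        best_hits = max(best_hits, hits)
    return 100.0 * (1 - best_hits / len(true_labels))


# Six units, three true groups; one unit of group "a" ends up in the wrong cluster
rate = misclassification_rate(["a", "a", "a", "b", "b", "c"], [1, 1, 2, 2, 2, 3])
print(round(rate, 1))
```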

SLIDE 76

Some results... for 5, 10 and 15 variables

Datasets generated with genRandomCluster (100 replications for each setting):

Variables    numClust    numNonNoisy    sepVal
5            2:7         5              0.01
10           2:7         10             0.01
15           2:7         15             0.01

[Figures: results over 100 replications for 5, 10 and 15 variables]

SLIDE 83

ToDo List

Statistical issues

  • Quality measures of the obtained partition
  • Use of different types of clusters
      ◮ different cardinality of clusters
      ◮ different type of cluster generation
  • Multiple Testing Problem
  • Stepwise approach instead of a Forward approach

Computational issues

  • Profiling and optimizing the R code
      ◮ use of compiled code
      ◮ use of S3–S4 methods
      ◮ deploying a package
