Visualizing Cluster Results Using Package FlexClust and Friends

SLIDE 1

Visualizing Cluster Results Using Package FlexClust and Friends

Friedrich Leisch University of Munich

useR!, Rennes, 10.7.2009

SLIDE 3

Acknowledgements & Apology

  • Sara Dolnicar (University of Wollongong)
  • Theresa Scharl, Ingo Voglhuber (Vienna University of Technology)
  • Paul Murrell, Deepayan Sarkar (R Core)

Apology: Microarray data only in Theresa’s talk (try time-shift back to Wednesday).

Friedrich Leisch: Cluster Visualization, 10.7.2009 1

SLIDE 4

Partitioning Clustering – KCCA

K-centroid cluster algorithms:

  • data set $X_N = \{x_1, \ldots, x_N\}$, set of centroids $C_K = \{c_1, \ldots, c_K\}$
  • distance measure $d(x, y)$
  • centroid $c$ closest to $x$:

$$c(x) = \operatorname*{argmin}_{c \in C_K} d(x, c)$$

  • Most KCCA algorithms try to find a set of centroids $C_K$ for fixed $K$ such that the average distance of each point to the closest centroid is minimal:

$$D(X_N, C_K) = \frac{1}{N} \sum_{n=1}^{N} d(x_n, c(x_n)) \to \min_{C_K}$$

  • Optimization algorithm not important for rest of talk
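
The two definitions above can be sketched in a few lines of Python. This is only an illustration, not code from flexclust: the function names, the Euclidean distance and the toy data are assumptions chosen to make the objective $D(X_N, C_K)$ concrete.

```python
import math


def euclidean(x, y):
    """Euclidean distance; any other distance d(x, y) could be plugged in."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def closest_centroid(x, centroids, d=euclidean):
    """Index of c(x), the centroid with the smallest distance to x."""
    return min(range(len(centroids)), key=lambda k: d(x, centroids[k]))


def average_distance(data, centroids, d=euclidean):
    """D(X_N, C_K): mean distance of each point to its closest centroid."""
    total = sum(d(x, centroids[closest_centroid(x, centroids, d)]) for x in data)
    return total / len(data)


# Toy data (hypothetical): two well separated groups of two points each.
data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = [(0.0, 1.0), (10.0, 1.0)]

assignment = [closest_centroid(x, centroids) for x in data]
objective = average_distance(data, centroids)
```

A KCCA algorithm would search over `centroids` to make `objective` small; how that search is done is, as the slide says, not important here.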

SLIDE 5

Example: Australian Volunteers

Survey among 1415 Australian adults about which organizations they would consider volunteering for, main motivation to volunteer, actual volunteering, image of organizations, . . .

We use a block of 19 binary questions (“applies”, “does not apply”) about motivations to volunteer: “I want to meet people”, “I have no one else”, “I want to set an example”, . . .

Organizations investigated: Red Cross, Surf Life Savers, Rural Fire Service, Parents Association, . . .

Our analyses show that there is both competition between organizations with similar profiles and complementary effects (individuals volunteering for more than one organization, in most cases with very different profiles).

SLIDE 6

8 Volunteer Clusters

Variable        Cl.1    Cl.2    Cl.3    Cl.4    Cl.5    Cl.6    Cl.7    Cl.8   Total
meet.people     9.24   15.77   28.77   97.18   82.17   80.97   90.81   34.23   49.47
no.one.else     8.21    8.34    6.16   27.72   10.51   12.47   16.75    7.23   11.38
example        12.80   14.68   63.56   93.30   35.84   73.33   80.32   45.30   47.63
socialise      14.12   10.68    3.28   88.74   52.79   54.17   83.39    6.35   35.83
help.others     0.00  100.00   92.93   95.17   56.95   89.18   86.98   86.12   66.78
give.back      21.68   29.18   87.73   96.24   43.41   87.60   96.18   89.77   63.75
career         12.04    5.54   10.46   71.34   35.01   17.49   18.29   11.54   20.57
lonely          4.55    8.71    2.30   56.55   17.16    9.76   18.93    0.75   13.14
active         17.17   15.93   23.38   93.82   81.61   53.97   77.00   23.98   44.88
community      16.17    9.64   66.66   93.75   14.67   87.80   90.34   77.21   52.72
cause          20.49   12.07   79.66   96.91   47.11   83.31   85.12   79.82   58.66
faith          10.12    7.24   22.84   66.89   10.52   42.10   27.83   19.47   24.03
services        7.00    7.10   11.63   78.15   23.99   43.72   44.98   14.64   25.23
children        6.88   11.76   11.88   28.58   14.86   16.20   14.64    8.31   13.00
good.job       19.14   23.81  100.00   94.61   49.18   85.35   75.38    0.00   51.80
benefited      10.49   15.09   14.26   74.37   15.04  100.00    0.00   12.68   26.29
network        10.75    8.47    6.29   85.86   43.59   22.83   38.98   10.28   25.94
recognition    10.30    8.03   11.29   79.59   12.80   19.56   21.40    3.49   18.73
mind.off        8.56   10.95   12.53   87.96   39.55   24.55   47.43    4.31   26.57

SLIDE 7

8 Volunteer Clusters

Variable      Cl.1  Cl.2  Cl.3  Cl.4  Cl.5  Cl.6  Cl.7  Cl.8 Total
meet.people      9    16    29    97    82    81    91    34    49
no.one.else      8     8     6    28    11    12    17     7    11
example         13    15    64    93    36    73    80    45    48
socialise       14    11     3    89    53    54    83     6    36
help.others          100    93    95    57    89    87    86    67
give.back       22    29    88    96    43    88    96    90    64
career          12     6    10    71    35    17    18    12    21
lonely           5     9     2    57    17    10    19     1    13
active          17    16    23    94    82    54    77    24    45
community       16    10    67    94    15    88    90    77    53
cause           20    12    80    97    47    83    85    80    59
faith           10     7    23    67    11    42    28    19    24
services         7     7    12    78    24    44    45    15    25
children         7    12    12    29    15    16    15     8    13
good.job        19    24   100    95    49    85    75          52
benefited       10    15    14    74    15   100          13    26
network         11     8     6    86    44    23    39    10    26
recognition     10     8    11    80    13    20    21     3    19
mind.off         9    11    13    88    40    25    47     4    27

(Values rounded to whole numbers; zero entries left blank.)

SLIDE 8

8 Volunteer Clusters

[Figure: bar chart panels, one per cluster, showing the 19 motivation variables on an axis from 0.0 to 1.0. Axis labels: mind.off, recognition, network, benefited, good.job, children, services, faith, cause, community, active, lonely, career, give.back, help.others, socialise, example, no.one.else, meet.people.]

Panel titles: Cluster 1: 322 (23%), Cluster 2: 140 (10%), Cluster 3: 166 (12%), Cluster 4: 136 (10%), Cluster 5: 160 (11%), Cluster 6: 147 (10%), Cluster 7: 178 (13%), Cluster 8: 166 (12%)

SLIDE 10

8 Volunteer Clusters

We cannot easily test for differences between clusters, because they were constructed to be different.

Improved presentation of results (following advice that has been around for only a few decades):

  • Add reference lines/points
  • Sort variables by content:
      1. sort clusters by mean
      2. sort variables by hierarchical clustering
  • Highlight important points
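
As a small illustration of the "sort clusters by mean" step, the sketch below orders clusters by the mean of their profile values. It uses three variables and four clusters taken from the rounded table above; the choice of an ascending order and of a plain mean over variables is an assumption, not something the slides specify, and the function name is hypothetical.

```python
from statistics import mean

# Rounded percentages from the table above (three variables, four clusters).
profiles = {
    "Cl.1": {"meet.people": 9, "socialise": 14, "career": 12},
    "Cl.2": {"meet.people": 16, "socialise": 11, "career": 6},
    "Cl.3": {"meet.people": 29, "socialise": 3, "career": 10},
    "Cl.4": {"meet.people": 97, "socialise": 89, "career": 71},
}


def sort_clusters_by_mean(profiles):
    """Cluster names ordered by the mean of their profile (ascending, by assumption)."""
    return sorted(profiles, key=lambda name: mean(profiles[name].values()))


order = sort_clusters_by_mean(profiles)
```

Sorting the variables by hierarchical clustering would need a clustering routine and is left out of this sketch.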

SLIDE 11

Add reference points

[Figure: bar chart panels, one per cluster, showing the 19 motivation variables on an axis from 0.0 to 1.0, with reference points added. Axis labels run from mind.off to meet.people.]

Panel titles: Cluster 1: 322 (23%), Cluster 2: 140 (10%), Cluster 3: 166 (12%), Cluster 4: 136 (10%), Cluster 5: 160 (11%), Cluster 6: 147 (10%), Cluster 7: 178 (13%), Cluster 8: 166 (12%)

SLIDE 12

Sort Clusters

[Figure: bar chart panels, one per cluster, showing the 19 motivation variables on an axis from 0.0 to 1.0, with the clusters sorted. Axis labels run from mind.off to meet.people.]

Panel titles: Cluster 1: 322 (23%), Cluster 2: 140 (10%), Cluster 3: 136 (10%), Cluster 4: 166 (12%), Cluster 5: 160 (11%), Cluster 6: 147 (10%), Cluster 7: 178 (13%), Cluster 8: 166 (12%)

SLIDE 14

Sort Variables

[Figure: bar chart panels, one per cluster, on an axis from 0.0 to 1.0, with the variables sorted. Axis labels: help.others, give.back, community, cause, example, good.job, active, meet.people, socialise, mind.off, network, career, lonely, recognition, faith, no.one.else, children, services, benefited.]

Panel titles: Cluster 1: 322 (23%), Cluster 2: 140 (10%), Cluster 3: 136 (10%), Cluster 4: 166 (12%), Cluster 5: 160 (11%), Cluster 6: 147 (10%), Cluster 7: 178 (13%), Cluster 8: 166 (12%)

SLIDE 15

Highlight Important Points

[Figure: bar chart panels, one per cluster, on an axis from 0.0 to 1.0, with important points highlighted. Axis labels in the sorted order, from help.others to benefited.]

Panel titles: Cluster 1: 322 (23%), Cluster 2: 140 (10%), Cluster 3: 136 (10%), Cluster 4: 166 (12%), Cluster 5: 160 (11%), Cluster 6: 147 (10%), Cluster 7: 178 (13%), Cluster 8: 166 (12%)

SLIDE 16

Example: Car Data

A German manufacturer of premium cars asked recent customers about their main motivation to buy one of its cars.

Binary data: respondents were asked to check properties like sporty, power, interior, safety, quality, resale value, etc.

Exploration of the data shows no natural groups; a segmentation of the market imposes a partition on the data (which is absolutely fine for the purpose).

Here: hierarchical clustering of variables, partition with 4 clusters from neural gas algorithm for customers.

SLIDE 18

Example: Car Data

[Figure: bar chart panels, one per cluster, on an axis from 0.0 to 1.0. Axis labels: driving_properties, power, sporty, economy, consumption, resale_value, service, clarity, character, space, handling, concept, interior, styling, model_continuity, reputation, safety, quality, reliability, technology, comfort.]

Panel titles: Cluster 1: 237 (30%), Cluster 2: 280 (35%), Cluster 3: 205 (26%), Cluster 4: 71 (9%)

SLIDE 19

PCA Projection

[Figure: PCA projection with variable labels: clarity, economy, driving_properties, service, interior, quality, technology, model_continuity, comfort, reliability, handling, reputation, concept, character, power, resale_value, styling, safety, sporty, consumption, space.]

SLIDE 20

PCA Projection

[Figure: PCA projection with variable labels: quality, technology, comfort, power, resale_value, consumption.]

SLIDE 21

PCA Projection

[Figure: PCA projection with cluster labels 1 to 4 and variable labels: quality, technology, comfort, power, resale_value, consumption.]

SLIDE 26

Convex Cluster Hulls

The main problem for 2-dimensional generalizations of boxplots is that there is no total ordering of $\mathbb{R}^2$ (or higher).

For data partitioned using a centroid-based cluster algorithm there is a natural total ordering for each point in a cluster: the distance $d(x, c(x))$ of the point to its respective cluster centroid.

Let $A_k$ be the set of points in cluster $k$, and

$$m_k = \operatorname{median}\{d(x_n, c_k) \mid x_n \in A_k\}$$

inner area: all data points where $d(x_n, c_k) \le m_k$

outer area: all data points where $d(x_n, c_k) \le 2.5\, m_k$
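
The inner and outer areas can be sketched as follows. This is a minimal illustration under assumptions (Euclidean distance, a hypothetical function name, made-up points); it only selects the points belonging to each area and does not draw the convex hulls around them.

```python
import math
from statistics import median


def hull_areas(points, centroid, factor=2.5):
    """Return (m_k, inner points, outer points) for one cluster A_k."""
    dists = [math.dist(p, centroid) for p in points]
    m = median(dists)  # m_k: median distance to the cluster centroid
    inner = [p for p, dist in zip(points, dists) if dist <= m]
    outer = [p for p, dist in zip(points, dists) if dist <= factor * m]
    return m, inner, outer


# Hypothetical cluster with one far-away point.
points = [(1.0, 0.0), (0.0, 2.0), (-3.0, 0.0), (0.0, -4.0), (20.0, 0.0)]
m, inner, outer = hull_areas(points, centroid=(0.0, 0.0))
```

In this toy cluster the point at distance 20 lies outside the outer area, much like an outlier beyond the whiskers of a boxplot.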

SLIDE 27

Shadow Values

Second-closest centroid to $x$:

$$\tilde{c}(x) = \operatorname*{argmin}_{c \in C_K \setminus \{c(x)\}} d(x, c)$$

Shadow value:

$$s(x) = \frac{2\, d(x, c(x))}{d(x, c(x)) + d(x, \tilde{c}(x))} \in [0, 1]$$

$s(x) = 0$: centroid

$s(x) = 1$: on border of clusters
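
A minimal sketch of the shadow value, assuming Euclidean distance and at least two distinct centroids; the function name and the example points are illustrative only.

```python
import math


def shadow_value(x, centroids):
    """s(x) = 2 d(x, c(x)) / (d(x, c(x)) + d(x, c~(x)))."""
    dists = sorted(math.dist(x, c) for c in centroids)
    closest, second = dists[0], dists[1]
    return 2 * closest / (closest + second)


centroids = [(0.0, 0.0), (4.0, 0.0)]
s_at_centroid = shadow_value((0.0, 0.0), centroids)  # point sits on its centroid
s_on_border = shadow_value((2.0, 0.0), centroids)    # equally far from both centroids
s_between = shadow_value((1.0, 0.0), centroids)      # distances 1 and 3
```

The three calls reproduce the two limiting cases on the slide and one value in between.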

SLIDE 28

Neighborhood Graph

Let

$$A_{ij} = \{x_n \mid c(x_n) = c_i,\ \tilde{c}(x_n) = c_j\}$$

be the set of all points where $c_i$ is the closest centroid and $c_j$ is second-closest.

Cluster similarity:

$$s_{ij} = \begin{cases} |A_i|^{-1} \sum_{x \in A_{ij}} s(x), & A_{ij} \neq \emptyset \\ 0, & A_{ij} = \emptyset \end{cases}$$

Use $s_{ij} + s_{ji}$ for thickness of line connecting $c_i$ and $c_j$.
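
The similarity and the resulting edge thickness can be sketched as below. This is an illustration under assumptions (Euclidean distance, hypothetical function names, toy data with three centroids), not the flexclust implementation.

```python
import math
from collections import defaultdict


def neighborhood_similarity(data, centroids):
    """Return {(i, j): s_ij} for all non-empty A_ij; missing pairs mean s_ij = 0."""
    shadow_sums = defaultdict(float)  # (i, j) -> sum of s(x) over A_ij
    cluster_sizes = defaultdict(int)  # i -> |A_i|
    for x in data:
        dists = [math.dist(x, c) for c in centroids]
        ranked = sorted(range(len(centroids)), key=dists.__getitem__)
        i, j = ranked[0], ranked[1]  # closest and second-closest centroid
        shadow_sums[(i, j)] += 2 * dists[i] / (dists[i] + dists[j])
        cluster_sizes[i] += 1
    return {pair: total / cluster_sizes[pair[0]] for pair, total in shadow_sums.items()}


def edge_weight(similarity, i, j):
    """Line thickness s_ij + s_ji for the edge between centroids i and j."""
    return similarity.get((i, j), 0.0) + similarity.get((j, i), 0.0)


centroids = [(0.0, 0.0), (4.0, 0.0), (0.0, 40.0)]
data = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
similarity = neighborhood_similarity(data, centroids)
weight_01 = edge_weight(similarity, 0, 1)
weight_02 = edge_weight(similarity, 0, 2)
```

Centroids 0 and 1 share points near their common border and get a thick edge; centroid 2 is far away from all data and gets no edge to centroid 0.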

SLIDE 29

Neighborhood Graph

[Figure: neighborhood graph with nodes for clusters 1 to 8.]

SLIDE 30

Neighborhood Graph

[Figure: neighborhood graph with nodes for clusters 1 to 16.]

SLIDE 31

Volunteer PCA-Projection

[Figure: PCA projection with cluster labels 1 to 8 and variable labels: example, socialise, help.others, give.back, active, community, cause, network.]

SLIDE 32

Volunteer PCA-Projection

[Figure: PCA projection with cluster labels 1 to 8.]

SLIDE 34

Nonlinear Arrangement

[Figure: nonlinear arrangement of clusters k1 to k8.]

SLIDE 35

Cluster Size

[Figure: cluster sizes shown for clusters k1 to k8.]

SLIDE 36

Age Distribution

SLIDE 37

Gender Distribution

SLIDE 38

Nice to Look at

SLIDE 39

Model-based Ordinal Clustering

Survey data for two fast-food chains (McDonald’s, Subway) from 715 respondents on 10 items (yummy, fattening, greasy, fast, . . . ).

We are interested in capturing scale usage heterogeneity, under the assumption that different involvement with a brand may provoke different scale usage.

Finite mixture model: for each item and group we estimate the mean and standard deviation of a latent Gaussian. Items are assumed independent given group membership; estimation is by EM. A 3-component model with 10 items therefore has 30 means and 30 standard deviations.
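
The slide does not say how the latent Gaussian is turned into ordinal answers. One common construction, assumed here purely for illustration and not taken from the talk, cuts the latent scale at fixed thresholds, so that each answer category gets the Gaussian probability mass between two neighbouring cut points. The thresholds, the six categories and the parameter values below are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def category_probs(mu, sigma, thresholds):
    """Probabilities of the ordinal categories for a latent N(mu, sigma^2)."""
    cdf = [0.0] + [normal_cdf((t - mu) / sigma) for t in thresholds] + [1.0]
    return [upper - lower for lower, upper in zip(cdf, cdf[1:])]


# Hypothetical cut points giving six answer categories.
thresholds = [-2.0, -1.0, 0.0, 1.0, 2.0]

moderate = category_probs(mu=0.0, sigma=1.0, thresholds=thresholds)
extreme = category_probs(mu=0.0, sigma=3.0, thresholds=thresholds)
```

Under this assumed construction a larger standard deviation moves probability mass to the end categories, which is one way a group with an extreme response style could show up in the estimates.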

SLIDE 40

Model-based Ordinal Clustering

[Figure: six panels with x-axis ticks from −2 to 2 or from −3 to 3 and y-axis ticks from 0.0 to 0.4.]

SLIDE 41

Subway: 2 Components

[Figure: one panel per item and component (items: yummy, fattening, greasy, fast, cheap, tasty, healthy, disgusting, convenient, spicy; components 1 and 2). One axis is labelled prob with ticks from 0.0 to 0.8; the other has ticks from 0 to 6.]

SLIDE 42

Subway: 2 Components

[Figure: one panel per component (1 and 2) showing the items spicy, convenient, disgusting, healthy, tasty, cheap, fast, greasy, fattening, yummy on an axis labelled prob from 0.0 to 1.0.]

SLIDE 43

Subway: 2 Components

Overall Scale Usage

[Figure: one panel per component (1 and 2). One axis is labelled prob with ticks from 0.0 to 0.4; the other has ticks from 1 to 6.]

SLIDE 44

Subway: 2 Components

Scale Usage Corrected

[Figure: one panel per item and component (items: yummy, fattening, greasy, fast, cheap, tasty, healthy, disgusting, convenient, spicy; components 1 and 2). One axis is labelled prob ratio with ticks from 1 to 6; the other has ticks from 0 to 6.]

SLIDE 45

McDonald’s: 3 Components

[Figure: one panel per component (1, 2 and 3) showing the items spicy, convenient, disgusting, healthy, tasty, cheap, fast, greasy, fattening, yummy on an axis labelled prob from 0.0 to 1.0.]

SLIDE 46

McDonald’s: 3 Components

Overall Scale Usage

[Figure: one panel per component (1, 2 and 3). One axis is labelled prob with ticks from 0.0 to 0.5; the other has ticks from 1 to 6.]

SLIDE 47

McDonald’s: 3 Components

Scale Usage Corrected

[Figure: one panel per item and component (items: yummy, fattening, greasy, fast, cheap, tasty, healthy, disgusting, convenient, spicy; components 1, 2 and 3). One axis is labelled prob ratio with ticks from 1 to 5; the other has ticks from 0 to 6.]

SLIDE 48

Crosstabulation of Clusters

[Figure: crosstabulation of the two cluster solutions (cluster labels 1 to 3 against 1 to 2), shaded by Pearson residuals with a legend from −2.93 to 5.87; p-value = 7.7721e−11.]

Only a very small group has extreme response style for both brands, otherwise almost independence.

SLIDE 49

Software, Papers, Outlook

Packages are available from CRAN and/or R-Forge:

  • flexclust: KCCA clustering for arbitrary distances, shaded barcharts, projections, convex hulls, static neighborhood graphs, . . .
  • gcExplorer: Interactive neighborhood graphs with links to Gene Ontology.
  • symbols: Grid versions of boxplot(), symbols(), stars(), . . .
  • flexmix: Finite mixture models. Model based clustering for various distributions and mixtures of generalized linear models.

Public availability of features shown in this talk depends on release version of packages.

Papers available at http://www.statistik.lmu.de/~leisch

Big 2do: Redo most with iPlots Extreme . . .
