Visualizing Cluster Results Using Package FlexClust and Friends
Friedrich Leisch, University of Munich
useR!, Rennes, 10.7.2009
Friedrich Leisch: Cluster Visualization, 10.7.2009 1
Acknowledgements & Apology
- Sara Dolnicar (University of Wollongong)
- Theresa Scharl, Ingo Voglhuber (Vienna University of Technology)
- Paul Murrell, Deepayan Sarkar (R Core)
Apology: Microarray data only in Theresa’s talk (try time-shift back to Wednesday).
Partitioning Clustering – KCCA
K-centroid cluster algorithms:
- data set $X_N = \{x_1, \ldots, x_N\}$, set of centroids $C_K = \{c_1, \ldots, c_K\}$
- distance measure $d(x, y)$
- centroid $c$ closest to $x$: $c(x) = \operatorname{argmin}_{c \in C_K} d(x, c)$
- Most KCCA algorithms try to find a set of centroids $C_K$ for fixed $K$ such that the average distance of each point to the closest centroid is minimal: $D(X_N, C_K) = \frac{1}{N} \sum_{n=1}^{N} d(x_n, c(x_n)) \to \min_{C_K}$
- Optimization algorithm not important for rest of talk
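The two quantities defined on this slide, the closest centroid c(x) and the average distance D(X_N, C_K), can be illustrated with a minimal sketch. It is plain Python rather than the flexclust implementation; the Euclidean distance, the function names and the toy data are assumptions made for illustration only.

```python
import math

def euclidean(x, y):
    """Euclidean distance; any other distance d(x, y) could be passed instead."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def closest_centroid(x, centroids, d=euclidean):
    """Index of c(x), the centroid with minimal distance to x."""
    return min(range(len(centroids)), key=lambda k: d(x, centroids[k]))

def average_distance(points, centroids, d=euclidean):
    """D(X_N, C_K): mean distance of each point to its closest centroid."""
    total = sum(d(x, centroids[closest_centroid(x, centroids, d)]) for x in points)
    return total / len(points)

# Hypothetical toy data: two well-separated groups and two fixed centroids.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = [(0.0, 1.0), (10.0, 1.0)]

assignments = [closest_centroid(x, centroids) for x in points]
objective = average_distance(points, centroids)
```

A KCCA algorithm would search over the centroids to make `objective` small; the sketch only evaluates the criterion for a given set of centroids.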
Example: Australian Volunteers
Survey among 1415 Australian adults about which organizations they would consider volunteering for, main motivation to volunteer, actual volunteering, image of organizations, . . .
We use a block of 19 binary questions (“applies”, “does not apply”) about motivations to volunteer: “I want to meet people”, “I have no one else”, “I want to set an example”, . . .
Organizations investigated: Red Cross, Surf Life Savers, Rural Fire Service, Parents Association, . . .
Our analyses show that there is both competition between organizations with similar profiles and complementary effects (individuals volunteering for more than one organization, in most cases with very different profiles).
8 Volunteer Clusters
                Cl.1    Cl.2    Cl.3    Cl.4    Cl.5    Cl.6    Cl.7    Cl.8   Total
meet.people     9.24   15.77   28.77   97.18   82.17   80.97   90.81   34.23   49.47
no.one.else     8.21    8.34    6.16   27.72   10.51   12.47   16.75    7.23   11.38
example        12.80   14.68   63.56   93.30   35.84   73.33   80.32   45.30   47.63
socialise      14.12   10.68    3.28   88.74   52.79   54.17   83.39    6.35   35.83
help.others     0.00  100.00   92.93   95.17   56.95   89.18   86.98   86.12   66.78
give.back      21.68   29.18   87.73   96.24   43.41   87.60   96.18   89.77   63.75
career         12.04    5.54   10.46   71.34   35.01   17.49   18.29   11.54   20.57
lonely          4.55    8.71    2.30   56.55   17.16    9.76   18.93    0.75   13.14
active         17.17   15.93   23.38   93.82   81.61   53.97   77.00   23.98   44.88
community      16.17    9.64   66.66   93.75   14.67   87.80   90.34   77.21   52.72
cause          20.49   12.07   79.66   96.91   47.11   83.31   85.12   79.82   58.66
faith          10.12    7.24   22.84   66.89   10.52   42.10   27.83   19.47   24.03
services        7.00    7.10   11.63   78.15   23.99   43.72   44.98   14.64   25.23
children        6.88   11.76   11.88   28.58   14.86   16.20   14.64    8.31   13.00
good.job       19.14   23.81  100.00   94.61   49.18   85.35   75.38    0.00   51.80
benefited      10.49   15.09   14.26   74.37   15.04  100.00    0.00   12.68   26.29
network        10.75    8.47    6.29   85.86   43.59   22.83   38.98   10.28   25.94
recognition    10.30    8.03   11.29   79.59   12.80   19.56   21.40    3.49   18.73
mind.off        8.56   10.95   12.53   87.96   39.55   24.55   47.43    4.31   26.57
8 Volunteer Clusters
              Cl.1  Cl.2  Cl.3  Cl.4  Cl.5  Cl.6  Cl.7  Cl.8 Total
meet.people      9    16    29    97    82    81    91    34    49
no.one.else      8     8     6    28    11    12    17     7    11
example         13    15    64    93    36    73    80    45    48
socialise       14    11     3    89    53    54    83     6    36
help.others      0   100    93    95    57    89    87    86    67
give.back       22    29    88    96    43    88    96    90    64
career          12     6    10    71    35    17    18    12    21
lonely           5     9     2    57    17    10    19     1    13
active          17    16    23    94    82    54    77    24    45
community       16    10    67    94    15    88    90    77    53
cause           20    12    80    97    47    83    85    80    59
faith           10     7    23    67    11    42    28    19    24
services         7     7    12    78    24    44    45    15    25
children         7    12    12    29    15    16    15     8    13
good.job        19    24   100    95    49    85    75     0    52
benefited       10    15    14    74    15   100     0    13    26
network         11     8     6    86    44    23    39    10    26
recognition     10     8    11    80    13    20    21     3    19
mind.off         9    11    13    88    40    25    47     4    27
8 Volunteer Clusters
[Figure: bar chart panels of the 19 motivation variables (meet.people through mind.off) on a 0.0 to 1.0 axis, one panel per cluster. Panel titles: Cluster 1: 322 (23%), Cluster 2: 140 (10%), Cluster 3: 166 (12%), Cluster 4: 136 (10%), Cluster 5: 160 (11%), Cluster 6: 147 (10%), Cluster 7: 178 (13%), Cluster 8: 166 (12%).]
8 Volunteer Clusters
We cannot easily test for differences between clusters, because they were constructed to be different.
Improved presentation of results (following advice that has been around for only a few decades):
- Add reference lines/points
- Sort variables by content:
  1. sort clusters by mean
  2. sort variables by hierarchical clustering
- Highlight important points
Add reference points
[Figure: the same bar chart panels (19 motivation variables, 0.0 to 1.0 axis) with reference points added. Panel titles: Cluster 1: 322 (23%), Cluster 2: 140 (10%), Cluster 3: 166 (12%), Cluster 4: 136 (10%), Cluster 5: 160 (11%), Cluster 6: 147 (10%), Cluster 7: 178 (13%), Cluster 8: 166 (12%).]
Sort Clusters
[Figure: bar chart panels (19 motivation variables, 0.0 to 1.0 axis) with the clusters re-sorted. Panel titles: Cluster 1: 322 (23%), Cluster 2: 140 (10%), Cluster 3: 136 (10%), Cluster 4: 166 (12%), Cluster 5: 160 (11%), Cluster 6: 147 (10%), Cluster 7: 178 (13%), Cluster 8: 166 (12%).]
Sort Variables
[Figure: variables in sorted order: help.others, give.back, community, cause, example, good.job, active, meet.people, socialise, mind.off, network, career, lonely, recognition, faith, no.one.else, children, services, benefited.]
Sort Variables
[Figure: bar chart panels (0.0 to 1.0 axis) with the variables in sorted order, help.others through benefited. Panel titles: Cluster 1: 322 (23%), Cluster 2: 140 (10%), Cluster 3: 136 (10%), Cluster 4: 166 (12%), Cluster 5: 160 (11%), Cluster 6: 147 (10%), Cluster 7: 178 (13%), Cluster 8: 166 (12%).]
Highlight Important Points
[Figure: bar chart panels (0.0 to 1.0 axis, variables sorted from help.others through benefited) with important points highlighted. Panel titles: Cluster 1: 322 (23%), Cluster 2: 140 (10%), Cluster 3: 136 (10%), Cluster 4: 166 (12%), Cluster 5: 160 (11%), Cluster 6: 147 (10%), Cluster 7: 178 (13%), Cluster 8: 166 (12%).]
Example: Car Data
A German manufacturer of premium cars asked recent customers about their main motivation to buy one of its cars. The data are binary: respondents were asked to check properties like sporty, power, interior, safety, quality, resale value, etc.
Exploration of the data shows no natural groups; a segmentation of the market imposes a partition on the data (which is absolutely fine for the purpose).
Here: hierarchical clustering of the variables, and a partition of the customers into 4 clusters from the neural gas algorithm.
Example: Car Data
[Figure: variables in sorted order: driving_properties, power, sporty, economy, consumption, resale_value, service, clarity, character, space, handling, concept, interior, styling, model_continuity, reputation, safety, quality, reliability, technology, comfort.]
Example: Car Data
[Figure: bar chart panels (0.0 to 1.0 axis) of the 21 car variables, driving_properties through comfort, one panel per cluster. Panel titles: Cluster 1: 237 (30%), Cluster 2: 280 (35%), Cluster 3: 205 (26%), Cluster 4: 71 (9%).]
PCA Projection
[Figure sequence over seven slides: PCA projection of the car data. The first plot labels all variables (clarity, economy, driving_properties, service, interior, quality, technology, model_continuity, comfort, reliability, handling, reputation, concept, character, power, resale_value, styling, safety, sporty, consumption, space); the following plots label only quality, technology, comfort, power, resale_value and consumption, and the later ones add the cluster numbers 1 to 4.]
Convex Cluster Hulls
The main problem for 2-dimensional generalizations of boxplots is that there is no total ordering of $\mathbb{R}^2$ (or higher). For data partitioned using a centroid-based cluster algorithm there is a natural total ordering for each point in a cluster: the distance $d(x, c(x))$ of the point to its respective cluster centroid.
Let $A_k$ be the set of points in cluster $k$, and $m_k = \operatorname{median}\{d(x_n, c_k) \mid x_n \in A_k\}$.
- inner area: all data points where $d(x_n, c_k) \le m_k$
- outer area: all data points where $d(x_n, c_k) \le 2.5\, m_k$
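The selection of points for the inner and outer areas can be sketched as follows. This is a minimal illustration in plain Python, not flexclust code; the Euclidean distance, the function name `hull_areas` and the toy cluster are assumptions. The sketch only selects the points; drawing the convex hull around each selection is left out.

```python
import math
from statistics import median

def euclidean(x, y):
    """Euclidean distance (an assumption; any d(x, y) would do)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def hull_areas(cluster_points, centroid, d=euclidean, outer_factor=2.5):
    """Return m_k and the points belonging to the inner and outer areas."""
    dists = [d(x, centroid) for x in cluster_points]
    m_k = median(dists)
    inner = [x for x, dist in zip(cluster_points, dists) if dist <= m_k]
    outer = [x for x, dist in zip(cluster_points, dists) if dist <= outer_factor * m_k]
    return m_k, inner, outer

# Hypothetical cluster with one far-away point.
cluster = [(1.0, 0.0), (0.0, 2.0), (-3.0, 0.0), (0.0, -10.0)]
centroid = (0.0, 0.0)
m_k, inner, outer = hull_areas(cluster, centroid)
```

In this toy example the point at distance 10 lies beyond 2.5 times the median distance and is excluded from the outer area, much like an outlier beyond the whiskers of a boxplot.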
Shadow Values
Second-closest centroid to $x$: $\tilde{c}(x) = \operatorname{argmin}_{c \in C_K \setminus \{c(x)\}} d(x, c)$
Shadow value: $s(x) = \frac{2\, d(x, c(x))}{d(x, c(x)) + d(x, \tilde{c}(x))} \in [0, 1]$
- $s(x) = 0$: centroid
- $s(x) = 1$: on border of clusters
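A minimal sketch of the shadow value in plain Python, assuming Euclidean distance and hypothetical centroids; it is not the flexclust implementation. The guard for two zero distances is an assumption added to avoid division by zero when a point coincides with two identical centroids.

```python
import math

def euclidean(x, y):
    """Euclidean distance (an assumption; any d(x, y) would do)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def shadow_value(x, centroids, d=euclidean):
    """s(x) = 2 d(x, c(x)) / (d(x, c(x)) + d(x, c~(x)))."""
    dists = sorted(d(x, c) for c in centroids)
    d_closest, d_second = dists[0], dists[1]
    if d_closest + d_second == 0:
        return 0.0  # assumption: degenerate case, treat as "at the centroid"
    return 2 * d_closest / (d_closest + d_second)

# Hypothetical centroids on a line.
centroids = [(0.0, 0.0), (4.0, 0.0)]

s_at_centroid = shadow_value((0.0, 0.0), centroids)  # point sits on a centroid
s_between = shadow_value((1.0, 0.0), centroids)      # closer to the first centroid
s_on_border = shadow_value((2.0, 0.0), centroids)    # equidistant from both
```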
Neighborhood Graph
Let $A_{ij} = \{x_n \mid c(x_n) = c_i,\ \tilde{c}(x_n) = c_j\}$ be the set of all points where $c_i$ is the closest centroid and $c_j$ is second-closest.
Cluster similarity: $s_{ij} = |A_i|^{-1} \sum_{x \in A_{ij}} s(x)$ if $A_{ij} \neq \emptyset$, and $s_{ij} = 0$ if $A_{ij} = \emptyset$.
Use $s_{ij} + s_{ji}$ for thickness of line connecting $c_i$ and $c_j$.
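The cluster similarities and edge weights can be computed in one pass over the data. The following is a minimal sketch in plain Python under the definitions above, assuming Euclidean distance and hypothetical toy data; the function name `neighborhood_similarity` is illustrative and not part of flexclust.

```python
import math

def euclidean(x, y):
    """Euclidean distance (an assumption; any d(x, y) would do)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def neighborhood_similarity(points, centroids, d=euclidean):
    """Matrix of s_ij: sum of shadow values over A_ij, divided by |A_i|."""
    k = len(centroids)
    cluster_size = [0] * k
    shadow_sum = [[0.0] * k for _ in range(k)]
    for x in points:
        order = sorted(range(k), key=lambda idx: d(x, centroids[idx]))
        i, j = order[0], order[1]  # closest and second-closest centroid
        d1, d2 = d(x, centroids[i]), d(x, centroids[j])
        s = 2 * d1 / (d1 + d2) if d1 + d2 > 0 else 0.0
        cluster_size[i] += 1
        shadow_sum[i][j] += s
    # Empty A_ij leaves shadow_sum[i][j] at 0, which gives s_ij = 0.
    return [[shadow_sum[i][j] / cluster_size[i] if cluster_size[i] else 0.0
             for j in range(k)] for i in range(k)]

# Hypothetical data: three centroids on a line and four points.
centroids = [(0.0, 0.0), (4.0, 0.0), (8.0, 0.0)]
points = [(1.0, 0.0), (3.0, 0.0), (5.0, 0.0), (8.0, 0.0)]

sim = neighborhood_similarity(points, centroids)
edge_01 = sim[0][1] + sim[1][0]  # line thickness between centroids 1 and 2
edge_12 = sim[1][2] + sim[2][1]
edge_02 = sim[0][2] + sim[2][0]  # no shared points, so no edge
```

Note that s_ij is not symmetric in general, which is why the sum of both directions is used for the line thickness.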
[Figures: neighborhood graphs, one with nodes 1 to 8 and one with nodes 1 to 16.]
Volunteer PCA-Projection
[Figure sequence over three slides: PCA projection of the volunteer data with clusters numbered 1 to 8; labelled variables: example, socialise, help.others, give.back, active, community, cause, network.]
Nonlinear Arrangement
[Figure: nonlinear arrangement of the clusters k1 to k8.]
Cluster Size
[Figure: cluster sizes for the clusters k1 to k8.]
Age Distribution
Gender Distribution
Nice to Look at
Model-based Ordinal Clustering
Survey data for two fast-food chains (McDonald’s, Subway) from 715 respondents on 10 items (yummy, fattening, greasy, fast, . . . ).
We are interested in capturing scale usage heterogeneity under the assumption that different involvement with a brand may provoke different scale usage.
Finite mixture model: For each item and group we estimate mean and standard deviation of a latent Gaussian. Assumption of independence between items given group membership, estimation by EM.
For a 3-component model we get 30 means and 30 standard deviations (10 items times 3 groups).
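One common way a latent Gaussian gives rise to ordinal answers is by cutting the latent scale at fixed thresholds. The sketch below illustrates that idea and the parameter count; the thresholds, the six answer categories and the function names are assumptions for illustration, and the talk does not state how the model maps the latent variable to the observed scale.

```python
import math

def normal_cdf(z, mean=0.0, sd=1.0):
    """Cumulative distribution function of a Gaussian."""
    return 0.5 * (1.0 + math.erf((z - mean) / (sd * math.sqrt(2.0))))

def category_probs(mean, sd, thresholds):
    """Probability of each ordinal category for a latent N(mean, sd^2)."""
    cdf = [0.0] + [normal_cdf(t, mean, sd) for t in thresholds] + [1.0]
    return [cdf[k + 1] - cdf[k] for k in range(len(cdf) - 1)]

# Hypothetical thresholds giving six answer categories.
thresholds = [-2.0, -1.0, 0.0, 1.0, 2.0]

probs_neutral = category_probs(0.0, 1.0, thresholds)  # centered, moderate spread
probs_extreme = category_probs(0.0, 3.0, thresholds)  # same mean, wide spread

# Parameter count for the 3-component model on 10 items.
n_items, n_components = 10, 3
n_means = n_items * n_components
n_sds = n_items * n_components
```

With the same mean, a larger standard deviation moves probability mass to the two end categories, which is one way to read an "extreme response style".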
Subway: 2 Components
[Figure: prob (0.0 to 0.8) over the answer scale (axis ticks 0 to 6) for each of the ten items (yummy, fattening, greasy, fast, cheap, tasty, healthy, disgusting, convenient, spicy), with one panel per item and component (1 and 2).]
Subway: 2 Components
[Figure: prob (0.0 to 1.0) for the ten items spicy, convenient, disgusting, healthy, tasty, cheap, fast, greasy, fattening, yummy, with one panel per component (1 and 2).]
Subway: 2 Components
Overall Scale Usage
[Figure: prob (0.0 to 0.4) over answer categories 1 to 6, with one panel per component (1 and 2).]
Subway: 2 Components
Scale Usage Corrected
[Figure: prob ratio (axis 1 to 6) over the answer scale (axis ticks 0 to 6) for each of the ten items (yummy, fattening, greasy, fast, cheap, tasty, healthy, disgusting, convenient, spicy), with one panel per item and component (1 and 2).]
McDonald’s: 3 Components
[Figure: prob (0.0 to 1.0) for the ten items spicy, convenient, disgusting, healthy, tasty, cheap, fast, greasy, fattening, yummy, with one panel per component (1, 2 and 3).]
McDonald’s: 3 Components
Overall Scale Usage
[Figure: prob (0.0 to 0.5) over answer categories 1 to 6, with one panel per component (1, 2 and 3).]
McDonald’s: 3 Components
Scale Usage Corrected
[Figure: prob ratio (axis 1 to 5) over the answer scale (axis ticks 0 to 6) for each of the ten items (yummy, fattening, greasy, fast, cheap, tasty, healthy, disgusting, convenient, spicy), with one panel per item and component (1, 2 and 3).]
Crosstabulation of Clusters
[Figure: crosstabulation of components 1 to 3 against components 1 and 2, shaded by Pearson residuals ranging from −2.93 to 5.87; p-value = 7.7721e−11.]
Only a very small group has an extreme response style for both brands; otherwise the two clusterings are almost independent.
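Pearson residuals compare each observed cell count with the count expected under independence of rows and columns. The following minimal sketch computes them for a small table; the counts are hypothetical and are not the data behind the plot.

```python
import math

def pearson_residuals(table):
    """(observed - expected) / sqrt(expected) for every cell of a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    residuals = []
    for i, row in enumerate(table):
        res_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            res_row.append((observed - expected) / math.sqrt(expected))
        residuals.append(res_row)
    return residuals

# Hypothetical 3 x 2 crosstabulation of cluster memberships.
table = [[30, 10],
         [20, 20],
         [10, 30]]

res = pearson_residuals(table)
# The Pearson chi-squared statistic is the sum of the squared residuals.
chi_squared = sum(r ** 2 for row in res for r in row)
```

Cells with large positive residuals are more frequent than independence would predict, and cells near zero are consistent with independence.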
Software, Papers, Outlook
Packages are available from CRAN and/or R-Forge:
- flexclust: KCCA clustering for arbitrary distances, shaded barcharts, projections, convex hulls, static neighborhood graphs, . . .
- gcExplorer: Interactive neighborhood graphs with links to Gene Ontology.
- symbols: Grid versions of boxplot(), symbols(), stars(), . . .
- flexmix: Finite mixture models. Model-based clustering for various distributions and mixtures of generalized linear models.
Public availability of features shown in this talk depends on release version of packages.
Papers available at http://www.statistik.lmu.de/~leisch
Big 2do: Redo most with iPlots Extreme . . .