Comparing More than Two Observations, Dmitriy Gorenshteyn

SLIDE 1

DataCamp Cluster Analysis in R

Comparing More than Two Observations

CLUSTER ANALYSIS IN R

Dmitriy Gorenshteyn
Sr. Data Scientist, Memorial Sloan Kettering Cancer Center

SLIDE 2

The Closest Observation to a Pair

     1    2    3
2 11.7
3 16.8 18.0
4 10.0 20.6 15.8

Is 2 closest to group 1,4?
Is 3 closest to group 1,4?

SLIDE 3

Linkage Criteria: Complete

     1    2    3
2 11.7
3 16.8 18.0
4 10.0 20.6 15.8

Is 2 closest to group 1,4? max(D(2,1), D(2,4)) = 20.6
Is 3 closest to group 1,4? max(D(3,1), D(3,4)) = 16.8

Since 16.8 < 20.6, observation 3 is closer to group 1,4 under complete linkage.
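
The comparison on this slide can be reproduced with a few lines of base R. This is a minimal sketch: the distances are copied from the matrix above, while the helper name `complete_link` and the variable names are illustrative and not part of the course material.

```r
# Rebuild the symmetric distance matrix from the slide's lower triangle.
# lower.tri() fills column by column: (2,1), (3,1), (4,1), (3,2), (4,2), (4,3).
d <- matrix(0, nrow = 4, ncol = 4, dimnames = list(1:4, 1:4))
d[lower.tri(d)] <- c(11.7, 16.8, 10.0, 18.0, 20.6, 15.8)
d <- d + t(d)

# Observations 1 and 4 form the first group (their distance, 10.0, is the smallest).
group <- c("1", "4")

# Hypothetical helper: complete linkage is the largest distance
# between the observation and any member of the group.
complete_link <- function(obs, group, d) max(d[obs, group])

link_2 <- complete_link("2", group, d)
link_3 <- complete_link("3", group, d)

# The observation with the smaller linkage distance is closer to the group.
closest <- if (link_3 < link_2) "3" else "2"
print(c(link_2 = link_2, link_3 = link_3))
print(closest)
```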

SLIDE 4

Hierarchical Clustering

Complete Linkage: maximum distance between two sets

SLIDES 5-14

Grouping With Linkage & Distance

SLIDE 15

Linkage Criteria

Complete Linkage: maximum distance between two sets
Single Linkage: minimum distance between two sets
Average Linkage: average distance between two sets
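
To see how the three criteria can disagree, the sketch below applies each of them to the four-observation distance matrix from the earlier slides, comparing observations 2 and 3 against group 1,4. The function `linkage` and the row labels `obs_2` and `obs_3` are names made up for this illustration.

```r
# Distance matrix from the earlier slides (lower triangle, filled column by column).
d <- matrix(0, nrow = 4, ncol = 4, dimnames = list(1:4, 1:4))
d[lower.tri(d)] <- c(11.7, 16.8, 10.0, 18.0, 20.6, 15.8)
d <- d + t(d)

group <- c("1", "4")

# Hypothetical helper: summarise the pairwise distances between one
# observation and the group according to the chosen criterion.
linkage <- function(obs, group, d, criterion) {
  pairwise <- d[obs, group]
  switch(criterion,
         complete = max(pairwise),
         single   = min(pairwise),
         average  = mean(pairwise))
}

criteria <- c("complete", "single", "average")
result <- sapply(criteria, function(cr) {
  c(obs_2 = linkage("2", group, d, cr),
    obs_3 = linkage("3", group, d, cr))
})
print(result)

# For each criterion, which observation is closer to the group?
closest <- rownames(result)[apply(result, 2, which.min)]
names(closest) <- colnames(result)
print(closest)
```

With these distances, complete linkage favours observation 3, while single and average linkage favour observation 2, so the choice of criterion changes which merge happens next.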

SLIDE 16

Let's practice!

SLIDE 17

Capturing K Clusters

SLIDE 29

Hierarchical Clustering in R

print(players)

      x     y
  <dbl> <dbl>
1    -1     1
2    -2    -3
3     8     6
4     7    -8
5   -12     8
6   -15     0

dist_players <- dist(players, method = 'euclidean')
hc_players <- hclust(dist_players, method = 'complete')
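
The following sketch makes the two calls above self-contained and checks that the first merge found by `hclust()` is the pair with the smallest distance. It assumes `players` can be represented as a plain data frame with the six coordinates printed on the slide (the course uses a tibble); `dist_matrix` and `closest_pair` are illustrative names.

```r
# The six player positions printed on the slide, as a plain data frame.
players <- data.frame(
  x = c(-1, -2, 8, 7, -12, -15),
  y = c(1, -3, 6, -8, 8, 0)
)

# Pairwise Euclidean distances between all players.
dist_players <- dist(players, method = "euclidean")
print(round(dist_players, 2))

# Locate the closest pair by masking the zero diagonal.
dist_matrix <- as.matrix(dist_players)
diag(dist_matrix) <- Inf
closest_pair <- which(dist_matrix == min(dist_matrix), arr.ind = TRUE)[1, ]
print(closest_pair)

# Complete-linkage hierarchical clustering, as on the slide.
hc_players <- hclust(dist_players, method = "complete")

# The first row of $merge holds the first two observations joined
# (negative numbers denote single observations).
print(hc_players$merge[1, ])
```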

SLIDE 30

Extracting K Clusters

cluster_assignments <- cutree(hc_players, k = 2)
print(cluster_assignments)

[1] 1 1 1 1 2 2

library(dplyr)
players_clustered <- mutate(players, cluster = cluster_assignments)
print(players_clustered)

      x     y cluster
  <dbl> <dbl>   <int>
1    -1     1       1
2    -2    -3       1
3     8     6       1
4     7    -8       1
5   -12     8       2
6   -15     0       2
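
As a small variation, the sketch below repeats the extraction without dplyr and also asks for three clusters to show how the assignment changes with `k`. The data frame is rebuilt from the values printed on the earlier slide; `cbind()` stands in for `mutate()` here purely to keep the example dependency-free, and names such as `assignments_k3` are illustrative.

```r
# Rebuild the players data and the complete-linkage tree.
players <- data.frame(
  x = c(-1, -2, 8, 7, -12, -15),
  y = c(1, -3, 6, -8, 8, 0)
)
hc_players <- hclust(dist(players, method = "euclidean"), method = "complete")

# Cut the tree into two and into three clusters.
assignments_k2 <- cutree(hc_players, k = 2)
assignments_k3 <- cutree(hc_players, k = 3)
print(assignments_k2)
print(assignments_k3)

# How many players fall into each of the three clusters?
cluster_sizes <- table(assignments_k3)
print(cluster_sizes)

# Base R alternative to dplyr::mutate(): attach the k = 2 assignment as a column.
players_clustered <- cbind(players, cluster = assignments_k2)
print(players_clustered)
```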

SLIDE 31

Visualizing K-Clusters

library(ggplot2)
ggplot(players_clustered, aes(x = x, y = y, color = factor(cluster))) +
  geom_point()

SLIDE 32

Let's practice!

SLIDE 33

Visualizing the Dendrogram

SLIDES 34-42

Building the Dendrogram

SLIDE 43

Plotting the Dendrogram

plot(hc_players)
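
The dendrogram drawn by `plot()` is a picture of two fields stored in the `hclust` object: `merge` (which items are joined at each step) and `height` (the linkage distance at which they are joined). The sketch below tabulates them for the players data; the data frame is rebuilt from the values shown earlier in the deck, and `merge_log` is an illustrative name.

```r
# Rebuild the players data and the complete-linkage tree.
players <- data.frame(
  x = c(-1, -2, 8, 7, -12, -15),
  y = c(1, -3, 6, -8, 8, 0)
)
hc_players <- hclust(dist(players, method = "euclidean"), method = "complete")

# One row per merge step. In $merge, a negative entry is a single
# observation and a positive entry refers to the cluster formed at
# that earlier step.
merge_log <- data.frame(
  step   = seq_along(hc_players$height),
  left   = hc_players$merge[, 1],
  right  = hc_players$merge[, 2],
  height = round(hc_players$height, 2)
)
print(merge_log)

# The same information, drawn as a tree.
plot(hc_players)
```

Each horizontal bar in the plot sits at the `height` value of the corresponding row, which is why cutting the tree at a given height yields a specific set of clusters.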

SLIDE 44

Let's practice!

SLIDE 45

Cutting the Tree

SLIDE 49

Coloring the Dendrogram - Height

library(dendextend)
dend_players <- as.dendrogram(hc_players)
dend_colored <- color_branches(dend_players, h = 15)
plot(dend_colored)

SLIDE 51

Coloring the Dendrogram - Height

library(dendextend)
dend_players <- as.dendrogram(hc_players)
dend_colored <- color_branches(dend_players, h = 10)
plot(dend_colored)

SLIDE 52

Coloring the Dendrogram - K

library(dendextend)
dend_players <- as.dendrogram(hc_players)
dend_colored <- color_branches(dend_players, k = 2)
plot(dend_colored)

SLIDE 53

cutree() using height

cluster_assignments <- cutree(hc_players, h = 15)
print(cluster_assignments)

[1] 1 1 1 1 2 2

library(dplyr)
players_clustered <- mutate(players, cluster = cluster_assignments)
print(players_clustered)

      x     y cluster
  <dbl> <dbl>   <int>
1    -1     1       1
2    -2    -3       1
3     8     6       1
4     7    -8       1
5   -12     8       2
6   -15     0       2
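
To connect the height-based and k-based cuts, the sketch below counts the clusters produced at the two heights used in this section (15 and 10) and checks that `h = 15` gives the same partition as `k = 2` for this data. The data frame is rebuilt from the values shown earlier; `n_clusters` and `same_partition` are illustrative names.

```r
# Rebuild the players data and the complete-linkage tree.
players <- data.frame(
  x = c(-1, -2, 8, 7, -12, -15),
  y = c(1, -3, 6, -8, 8, 0)
)
hc_players <- hclust(dist(players, method = "euclidean"), method = "complete")

# Number of clusters obtained when cutting at each height.
heights <- c(15, 10)
n_clusters <- sapply(heights, function(h) {
  length(unique(cutree(hc_players, h = h)))
})
names(n_clusters) <- paste0("h=", heights)
print(n_clusters)

# For this tree, cutting at height 15 and asking for 2 clusters
# describe the same partition.
same_partition <- all(cutree(hc_players, h = 15) == cutree(hc_players, k = 2))
print(same_partition)
```

A lower cut height leaves more merges undone, so it can only keep or increase the number of clusters.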

SLIDE 54

Let's practice!

SLIDE 55

Making Sense of the Clusters

SLIDE 56

Wholesale Dataset

45 observations
3 features:
  • Milk Spending
  • Grocery Spending
  • Frozen Food Spending

SLIDE 57

Wholesale Dataset

print(customers_spend)

   Milk Grocery Frozen
1 11103   12469    902
2  2013    6550    909
3  1897    5234    417
4  1304    3643   3045
5  3199    6986   1455
...   ...     ...    ...

SLIDE 58

Exploring More Than 2 Dimensions

  • Plot 2 dimensions at a time
  • Visualize using PCA
  • Summary statistics by feature
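
The three approaches listed above might look like the following in base R. This is only a sketch under stated assumptions: it uses just the five rows printed on the previous slide (the full 45-row `customers_spend` data is not reproduced in this deck), and the choice of Euclidean distance, complete linkage, `k = 2`, and scaled PCA is illustrative rather than taken from the course.

```r
# Only the five rows shown on the slide; NOT the full 45-observation dataset.
customers_sample <- data.frame(
  Milk    = c(11103, 2013, 1897, 1304, 3199),
  Grocery = c(12469, 6550, 5234, 3643, 6986),
  Frozen  = c(902, 909, 417, 3045, 1455)
)

# Assumed settings for illustration: Euclidean distance, complete linkage, k = 2.
hc_sample <- hclust(dist(customers_sample, method = "euclidean"),
                    method = "complete")
cluster <- cutree(hc_sample, k = 2)

# 1. Plot 2 dimensions at a time: a scatterplot matrix coloured by cluster.
pairs(customers_sample, col = cluster)

# 2. Visualize using PCA: project onto the first two principal components.
pca <- prcomp(customers_sample, scale. = TRUE)
pca_scores <- data.frame(pca$x[, 1:2], cluster = factor(cluster))
print(pca_scores)

# 3. Summary statistics by feature: mean spending per cluster.
cluster_means <- aggregate(customers_sample,
                           by = list(cluster = cluster),
                           FUN = mean)
print(cluster_means)
```

On these five rows, the first customer spends far more on milk and groceries than the others and ends up in a cluster of its own, which the per-cluster means make visible.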

SLIDE 59

Segment the Customers