Introduction to K-means, by Dmitriy (Dima) Gorenshteyn



slide-1
SLIDE 1

DataCamp Cluster Analysis in R

Introduction to K-means

CLUSTER ANALYSIS IN R

Dmitriy (Dima) Gorenshteyn

Sr. Data Scientist, Memorial Sloan Kettering Cancer Center

slide-12
SLIDE 12

kmeans()

print(lineup)
    x  y
1  -1  1
2  -2 -3
3   8  6
4   7 -8
... ... ...

model <- kmeans(lineup, centers = 2)
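A minimal runnable sketch of the call above. The `lineup` data frame here is a hypothetical stand-in: only its first four rows come from the slide and the remaining rows are invented for illustration. `set.seed()` and `nstart = 20` are additions that make the result repeatable; they are not part of the slide.

```r
# Hypothetical stand-in for `lineup`: 12 player positions (x, y).
# Rows 1-4 match the slide; rows 5-12 are made up for illustration.
lineup <- data.frame(
  x = c(-1, -2, 8, 7, -12, -15, -13, 15, 21, 12, -25, 26),
  y = c(1, -3, 6, -8, 8, 0, -10, 16, 2, -15, 1, 0)
)

# kmeans() starts from randomly chosen centers, so fix the seed
set.seed(42)

# nstart = 20 reruns the algorithm from 20 random starts and keeps the best fit
model <- kmeans(lineup, centers = 2, nstart = 20)

print(model$centers)  # one row per cluster centroid
print(model$size)     # number of observations in each cluster
```

Without a fixed seed, cluster labels (1 vs. 2) can swap between runs even when the grouping itself is the same.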

slide-13
SLIDE 13

Assigning Clusters

print(model$cluster)
 [1] 1 1 2 2 1 1 1 2 2 2 1 2

library(dplyr)

lineup_clustered <- mutate(lineup, cluster = model$cluster)

print(lineup_clustered)
      x     y cluster
  <dbl> <dbl>   <int>
1    -1     1       1
2    -2    -3       1
3     8     6       2
4     7    -8       2
... ... ... ...
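Once the assignments are attached as a column, the clusters can be inspected visually. The sketch below assumes the same hypothetical `lineup` stand-in (only the first four rows come from the slide) and uses ggplot2, which the deck uses later for the elbow plot.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for `lineup`; rows 5-12 are invented for illustration
lineup <- data.frame(
  x = c(-1, -2, 8, 7, -12, -15, -13, 15, 21, 12, -25, 26),
  y = c(1, -3, 6, -8, 8, 0, -10, 16, 2, -15, 1, 0)
)

set.seed(42)
model <- kmeans(lineup, centers = 2, nstart = 20)

# Attach the cluster assignment to each observation
lineup_clustered <- mutate(lineup, cluster = model$cluster)

# factor() makes ggplot2 treat the cluster id as a discrete colour
p <- ggplot(lineup_clustered, aes(x = x, y = y, color = factor(cluster))) +
  geom_point()
print(p)
```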

slide-14
SLIDE 14

Let's practice!

CLUSTER ANALYSIS IN R

slide-15
SLIDE 15

Evaluating Different Values of K by Eye

CLUSTER ANALYSIS IN R

Dmitriy (Dima) Gorenshteyn

Sr. Data Scientist, Memorial Sloan Kettering Cancer Center

slide-16
SLIDE 16

Total Within-Cluster Sum of Squares: k = 1

slide-17
SLIDE 17

Total Within-Cluster Sum of Squares: k = 2

slide-18
SLIDE 18

Total Within-Cluster Sum of Squares: k = 3

slide-19
SLIDE 19

Total Within-Cluster Sum of Squares: k = 4

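The total within-cluster sum of squares is the sum, over all observations, of the squared Euclidean distance to the assigned cluster centroid. As a rough check, the sketch below recomputes it by hand and compares it with `model$tot.withinss`. The `lineup` data is a hypothetical stand-in (only its first four rows come from the slides).

```r
# Hypothetical stand-in for `lineup`; rows 5-12 are invented for illustration
lineup <- data.frame(
  x = c(-1, -2, 8, 7, -12, -15, -13, 15, 21, 12, -25, 26),
  y = c(1, -3, 6, -8, 8, 0, -10, 16, 2, -15, 1, 0)
)

set.seed(42)
model <- kmeans(lineup, centers = 3, nstart = 20)

# Look up the centroid assigned to each observation (one row per observation)
assigned_centers <- model$centers[model$cluster, , drop = FALSE]

# Squared Euclidean distance from each observation to its centroid
sq_dist <- rowSums((as.matrix(lineup) - assigned_centers)^2)

manual_tot_withinss <- sum(sq_dist)
print(manual_tot_withinss)
print(model$tot.withinss)
print(isTRUE(all.equal(manual_tot_withinss, model$tot.withinss)))
```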
slide-20
SLIDE 20

Elbow Plot

slide-22
SLIDE 22

Generating the Elbow Plot

model <- kmeans(x = lineup, centers = 2)
model$tot.withinss
[1] 1434.5

slide-23
SLIDE 23

Generating the Elbow Plot

library(purrr)

tot_withinss <- map_dbl(1:10, function(k) {
  model <- kmeans(x = lineup, centers = k)
  model$tot.withinss
})

elbow_df <- data.frame(
  k = 1:10,
  tot_withinss = tot_withinss
)

print(elbow_df)
    k tot_withinss
1   1    3489.9167
2   2    1434.5000
3   3     881.2500
4   4     637.2500
... ... ...

slide-24
SLIDE 24

Generating the Elbow Plot

library(ggplot2)

ggplot(elbow_df, aes(x = k, y = tot_withinss)) +
  geom_line() +
  scale_x_continuous(breaks = 1:10)

slide-25
SLIDE 25

Let's practice!

CLUSTER ANALYSIS IN R

slide-26
SLIDE 26

Silhouette Analysis

CLUSTER ANALYSIS IN R

Dmitriy (Dima) Gorenshteyn

Sr. Data Scientist, Memorial Sloan Kettering Cancer Center

slide-27
SLIDE 27

Soccer Lineup with K = 3

slide-28
SLIDE 28

Silhouette Width

Within Cluster Distance: C(i)
Closest Neighbor Distance: N(i)

slide-33
SLIDE 33

Silhouette Width: S(i)

slide-34
SLIDE 34

Silhouette Width: S(i)

 1: Well matched to cluster
 0: On border between two clusters
-1: Better fit in neighboring cluster
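The three reference values above follow from the standard silhouette definition, S(i) = (N(i) - C(i)) / max(C(i), N(i)), where C(i) is the average distance to the other members of the observation's own cluster and N(i) is the average distance to the members of the closest neighboring cluster. The helper `silhouette_width()` below is a hypothetical name used only for illustration, and the input distances are made-up numbers.

```r
# Hypothetical helper: silhouette width from the two average distances.
#   c_i: average distance to the other members of the observation's own cluster
#   n_i: average distance to the members of the closest neighboring cluster
silhouette_width <- function(c_i, n_i) {
  (n_i - c_i) / pmax(c_i, n_i)
}

well_matched <- silhouette_width(c_i = 2, n_i = 8)  # close to own cluster
on_border    <- silhouette_width(c_i = 5, n_i = 5)  # equally close to both
misplaced    <- silhouette_width(c_i = 8, n_i = 2)  # closer to the neighbor

print(c(well_matched, on_border, misplaced))
# 0.75  0.00 -0.75
```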

slide-35
SLIDE 35

Calculating S(i)

library(cluster)

pam_k3 <- pam(lineup, k = 3)
pam_k3$silinfo$widths
   cluster neighbor   sil_width
4        1        2 0.465320054
2        1        3 0.321729341
10       1        2 0.311385893
1        1        3 0.271890169
9        2        1 0.443606497
... ... ... ...
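The slide obtains silhouette widths from a `pam()` model. If the clusters come from `kmeans()` instead, one possible route (an assumption here, not something the slides show) is to pass the cluster labels and a distance matrix to `cluster::silhouette()`. The `lineup` data is again a hypothetical stand-in.

```r
library(cluster)

# Hypothetical stand-in for `lineup`; rows 5-12 are invented for illustration
lineup <- data.frame(
  x = c(-1, -2, 8, 7, -12, -15, -13, 15, 21, 12, -25, 26),
  y = c(1, -3, 6, -8, 8, 0, -10, 16, 2, -15, 1, 0)
)

set.seed(42)
model <- kmeans(lineup, centers = 3, nstart = 20)

# silhouette() accepts integer cluster labels plus a dissimilarity object
sil <- silhouette(model$cluster, dist(lineup))

# One row per observation: own cluster, nearest neighbor cluster, S(i)
print(head(sil[, c("cluster", "neighbor", "sil_width")]))

# Average silhouette width across all observations
avg_width <- mean(sil[, "sil_width"])
print(avg_width)
```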

slide-36
SLIDE 36

Silhouette Plot

sil_plot <- silhouette(pam_k3)
plot(sil_plot)

slide-38
SLIDE 38

Average Silhouette Width

 1: Well matched to each cluster
 0: On border between clusters
-1: Poorly matched to each cluster

pam_k3$silinfo$avg.width
[1] 0.353414

slide-39
SLIDE 39

Highest Average Silhouette Width

library(cluster)
library(purrr)

sil_width <- map_dbl(2:10, function(k) {
  model <- pam(x = lineup, k = k)
  model$silinfo$avg.width
})

sil_df <- data.frame(
  k = 2:10,
  sil_width = sil_width
)

print(sil_df)
    k sil_width
1   2 0.4164141
2   3 0.3534140
3   4 0.3535534
4   5 0.3724115
... ... ...

slide-40
SLIDE 40

Choosing K Using Average Silhouette Width

library(ggplot2)

ggplot(sil_df, aes(x = k, y = sil_width)) +
  geom_line() +
  scale_x_continuous(breaks = 2:10)
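Besides reading the peak off the plot, the k with the highest average silhouette width can be picked programmatically with `which.max()`. This is a sketch on a hypothetical stand-in for `lineup` (only the first four rows come from the slides), and the range 2:6 is an arbitrary choice for this small example.

```r
library(cluster)
library(purrr)

# Hypothetical stand-in for `lineup`; rows 5-12 are invented for illustration
lineup <- data.frame(
  x = c(-1, -2, 8, 7, -12, -15, -13, 15, 21, 12, -25, 26),
  y = c(1, -3, 6, -8, 8, 0, -10, 16, 2, -15, 1, 0)
)

k_values <- 2:6

# Average silhouette width of a pam() model for each candidate k
sil_width <- map_dbl(k_values, function(k) {
  pam(x = lineup, k = k)$silinfo$avg.width
})

sil_df <- data.frame(k = k_values, sil_width = sil_width)

# k with the highest average silhouette width
best_k <- sil_df$k[which.max(sil_df$sil_width)]
print(best_k)
```

The suggested k is a guide rather than a verdict; nearby values with similar widths may be just as reasonable.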

slide-42
SLIDE 42

Let's practice!

CLUSTER ANALYSIS IN R

slide-43
SLIDE 43

Making Sense of the K-means Clusters

CLUSTER ANALYSIS IN R

Dmitriy (Dima) Gorenshteyn

Sr. Data Scientist, Memorial Sloan Kettering Cancer Center

slide-44
SLIDE 44

Wholesale Dataset

45 observations
3 features:
  Milk Spending
  Grocery Spending
  Frozen Food Spending

print(customers_spend)
   Milk Grocery Frozen
1 11103   12469    902
2  2013    6550    909
3  1897    5234    417
4  1304    3643   3045
5  3199    6986   1455
... ... ... ...

slide-45
SLIDE 45

Segmenting with Hierarchical Clustering

slide-46
SLIDE 46

Segmenting with Hierarchical Clustering

cluster   Milk  Grocery  Frozen  cluster size
1        16950    12891     991             5
2         2512     5228    1795            29
3        10452    22550    1354             5
4         1249     3916   10888             6
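The slides show only the resulting summary table, not the code behind it. One way such a table could be produced is sketched below, under several assumptions: `customers_spend` is simulated here (it is not the course data), complete linkage is used, and the tree is cut into k = 3 groups to suit the simulated data, whereas the table above has four segments.

```r
library(dplyr)

# Simulated stand-in for `customers_spend` (45 rows, 3 spending features)
set.seed(1)
customers_spend <- data.frame(
  Milk    = round(c(rnorm(15, 2500, 500), rnorm(15, 15000, 2000), rnorm(15, 1500, 300))),
  Grocery = round(c(rnorm(15, 5000, 800), rnorm(15, 14000, 2000), rnorm(15, 4000, 600))),
  Frozen  = round(c(rnorm(15, 1500, 400), rnorm(15, 1000, 300), rnorm(15, 10000, 1500)))
)

# Hierarchical clustering on Euclidean distances (complete linkage is an assumption)
dist_customers <- dist(customers_spend)
hc_customers <- hclust(dist_customers, method = "complete")

# Cut the tree into 3 groups (chosen for the simulated data)
clust_customers <- cutree(hc_customers, k = 3)

# Mean spending and size of each segment
segment_summary <- customers_spend %>%
  mutate(cluster = clust_customers) %>%
  group_by(cluster) %>%
  summarise(across(c(Milk, Grocery, Frozen), mean), cluster_size = n())

print(segment_summary)
```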

slide-47
SLIDE 47

Segmenting with K-means

Estimate the "best" k using average silhouette width
Run k-means with the suggested k
Characterize the spending habits of these clusters of customers
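The three steps above can be sketched end to end as follows. The `customers_spend` data is simulated (it is not the course data), the name `model_customers` is hypothetical, and `pam()` is used for the silhouette estimate as in the earlier slides.

```r
library(cluster)
library(dplyr)
library(purrr)

# Simulated stand-in for `customers_spend` (45 rows, 3 spending features)
set.seed(1)
customers_spend <- data.frame(
  Milk    = round(c(rnorm(15, 2500, 500), rnorm(15, 15000, 2000), rnorm(15, 1500, 300))),
  Grocery = round(c(rnorm(15, 5000, 800), rnorm(15, 14000, 2000), rnorm(15, 4000, 600))),
  Frozen  = round(c(rnorm(15, 1500, 400), rnorm(15, 1000, 300), rnorm(15, 10000, 1500)))
)

# Step 1: estimate the "best" k from the average silhouette width
k_values <- 2:10
sil_width <- map_dbl(k_values, function(k) {
  pam(x = customers_spend, k = k)$silinfo$avg.width
})
best_k <- k_values[which.max(sil_width)]

# Step 2: run k-means with the suggested k
set.seed(42)
model_customers <- kmeans(customers_spend, centers = best_k, nstart = 20)

# Step 3: characterize each cluster by mean spending and size
segment_summary <- customers_spend %>%
  mutate(cluster = model_customers$cluster) %>%
  group_by(cluster) %>%
  summarise(across(c(Milk, Grocery, Frozen), mean), cluster_size = n())

print(best_k)
print(segment_summary)
```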

slide-48
SLIDE 48

Let's cluster!

CLUSTER ANALYSIS IN R