What is Cluster Analysis? Dmitriy (Dima) Gorenshteyn

slide-1
SLIDE 1

DataCamp Cluster Analysis in R

What is Cluster Analysis?

CLUSTER ANALYSIS IN R

Dmitriy (Dima) Gorenshteyn

Sr. Data Scientist, Memorial Sloan Kettering Cancer Center

slide-10
SLIDE 10

DataCamp Cluster Analysis in R

What is Clustering?

A form of exploratory data analysis (EDA) where observations are divided into meaningful groups that share common characteristics (features).

slide-11
SLIDE 11

DataCamp Cluster Analysis in R

The Flow of Cluster Analysis

slide-16
SLIDE 16

DataCamp Cluster Analysis in R

Structure of This Course

slide-18
SLIDE 18

DataCamp Cluster Analysis in R

Let's Learn!

CLUSTER ANALYSIS IN R

slide-19
SLIDE 19

DataCamp Cluster Analysis in R

Distance Between Two Observations

CLUSTER ANALYSIS IN R

slide-21
SLIDE 21

DataCamp Cluster Analysis in R

Distance vs Similarity

DISTANCE = 1 − SIMILARITY

slide-22
SLIDE 22

DataCamp Cluster Analysis in R

Distance Between Two Players

slide-31
SLIDE 31

DataCamp Cluster Analysis in R

dist() Function

print(two_players)
     X  Y
BLUE 0  0
RED  9 12

dist(two_players, method = 'euclidean')
    BLUE
RED   15
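
As a quick check on the value that dist() returns, the Euclidean distance can be worked out by hand from the two coordinates. This is only a minimal sketch: the two_players data frame is rebuilt here from the printed values, and manual_distance and dist_value are names introduced for illustration.

```r
# Rebuild the two players from the printed coordinates (assumed layout)
two_players <- data.frame(X = c(0, 9), Y = c(0, 12),
                          row.names = c("BLUE", "RED"))

# Euclidean distance: square root of the summed squared differences
manual_distance <- sqrt((9 - 0)^2 + (12 - 0)^2)

# dist() computes the same quantity for the single pair of rows
dist_value <- as.numeric(dist(two_players, method = "euclidean"))

print(manual_distance)
print(dist_value)
```

Both values should agree with the 15 shown in the dist() output.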

slide-32
SLIDE 32

DataCamp Cluster Analysis in R

More than 2 Observations

print(three_players)
       X  Y
BLUE   0  0
RED    9 12
GREEN -2 19

dist(three_players)
          BLUE      RED
RED   15.00000
GREEN 19.10497 13.03840

slide-33
SLIDE 33

DataCamp Cluster Analysis in R

Let's practice!

CLUSTER ANALYSIS IN R

slide-34
SLIDE 34

DataCamp Cluster Analysis in R

The Scales of Your Features

CLUSTER ANALYSIS IN R

slide-35
SLIDE 35

DataCamp Cluster Analysis in R

Distance Between Individuals

Observation  Height (feet)  Weight (lbs)
1            6.0            200
2            6.0            202
3            8.0            200
...          ...            ...
...          ...            ...
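
To see why the scales of the features matter, the unscaled distances among the three printed rows can be computed directly. This is a sketch that uses only those three rows (the elided rows are not reconstructed); the name unscaled is introduced for illustration.

```r
# Only the three observations printed in the table above
height_weight <- data.frame(Height = c(6.0, 6.0, 8.0),
                            Weight = c(200, 202, 200))

# Euclidean distances on the raw, unscaled features
unscaled <- as.matrix(dist(height_weight))
print(unscaled)
```

On the raw values, observations 1 and 2 (2 lbs apart) come out exactly as far apart as observations 1 and 3 (2 feet apart), even though a 2-foot height difference is far more substantial.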

slide-41
SLIDE 41

DataCamp Cluster Analysis in R

Scaling our Features

$$\mathrm{height}_{\mathrm{scaled}} = \frac{\mathrm{height} - \mathrm{mean}(\mathrm{height})}{\mathrm{sd}(\mathrm{height})}$$
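
The scaling formula can be applied by hand and compared with R's scale() function. This is a minimal sketch on the three heights printed earlier, so the numbers will differ from output based on the full data set; height_scaled and scaled_by_function are illustrative names.

```r
# Only the three heights shown in the slides (the full data set is elided)
height <- c(6.0, 6.0, 8.0)

# Subtract the mean, then divide by the standard deviation
height_scaled <- (height - mean(height)) / sd(height)

# scale() centers and scales in the same way by default
scaled_by_function <- as.numeric(scale(height))

print(height_scaled)
print(all.equal(height_scaled, scaled_by_function))
```

After scaling, the feature has mean 0 and standard deviation 1, so features measured in different units contribute comparably to a distance.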

slide-44
SLIDE 44

DataCamp Cluster Analysis in R

scale() function

print(height_weight)
    Height Weight
1        6    200
2        6    202
3        8    200
...    ...    ...

scale(height_weight)
    Height Weight
1     0.60   0.67
2     0.60   0.73
3     11.3   0.67
...    ...    ...

slide-45
SLIDE 45

DataCamp Cluster Analysis in R

Let's practice!

CLUSTER ANALYSIS IN R

slide-46
SLIDE 46

DataCamp Cluster Analysis in R

Measuring Distance For Categorical Data

CLUSTER ANALYSIS IN R

slide-47
SLIDE 47

DataCamp Cluster Analysis in R

Binary Data

     wine   beer   whiskey  vodka
1    TRUE   TRUE   FALSE    FALSE
2    FALSE  TRUE   TRUE     TRUE
...  ...    ...    ...      ...

slide-48
SLIDE 48

DataCamp Cluster Analysis in R

Jaccard Index

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

slide-49
SLIDE 49

DataCamp Cluster Analysis in R

Calculating Jaccard Distance

     wine   beer   whiskey  vodka
1    TRUE   TRUE   FALSE    FALSE
2    FALSE  TRUE   TRUE     TRUE

$$J(1, 2) = \frac{|1 \cap 2|}{|1 \cup 2|} = \frac{1}{4} = 0.25$$

$$\mathrm{Distance}(1, 2) = 1 - J(1, 2) = 0.75$$
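
The same arithmetic can be sketched in R with logical vectors. The vectors obs1 and obs2 are rebuilt from the two printed rows, and the variable names are introduced only for illustration.

```r
# The two survey responses from the table above
obs1 <- c(wine = TRUE,  beer = TRUE, whiskey = FALSE, vodka = FALSE)
obs2 <- c(wine = FALSE, beer = TRUE, whiskey = TRUE,  vodka = TRUE)

# Intersection: drinks both respondents selected (only beer)
intersection_size <- sum(obs1 & obs2)

# Union: drinks at least one respondent selected (all four)
union_size <- sum(obs1 | obs2)

jaccard <- intersection_size / union_size
jaccard_distance <- 1 - jaccard

print(jaccard)
print(jaccard_distance)
```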

slide-50
SLIDE 50

DataCamp Cluster Analysis in R

Calculating Jaccard Distance in R

print(survey_a)
   wine  beer whiskey vodka
  <lgl> <lgl>   <lgl> <lgl>
1  TRUE  TRUE   FALSE FALSE
2 FALSE  TRUE    TRUE  TRUE
3  TRUE FALSE    TRUE FALSE

dist(survey_a, method = "binary")
          1         2
2 0.7500000
3 0.6666667 0.7500000

slide-51
SLIDE 51

DataCamp Cluster Analysis in R

More Than Two Categories

    color  sport
1   red    soccer
2   green  hockey
3   blue   hockey
4   blue   soccer
... ...    ...

    colorblue colorgreen colorred sporthockey sportsoccer
1           0          0        1           0           1
2           0          1        0           1           0
3           1          0        0           1           0
4           1          0        0           0           1
...       ...        ...      ...         ...         ...

slide-52
SLIDE 52

DataCamp Cluster Analysis in R

Dummification in R

print(survey_b)
  color  sport
1   red soccer
2 green hockey
3  blue hockey
4  blue soccer

library(dummies)
dummy.data.frame(survey_b)
  colorblue colorgreen colorred sporthockey sportsoccer
1         0          0        1           0           1
2         0          1        0           1           0
3         1          0        0           1           0
4         1          0        0           0           1

slide-53
SLIDE 53

DataCamp Cluster Analysis in R

Generalizing Categorical Distance in R

print(survey_b)
  color  sport
1   red soccer
2 green hockey
3  blue hockey
4  blue soccer

dummy_survey_b <- dummy.data.frame(survey_b)
dist(dummy_survey_b, method = 'binary')
          1         2         3
2 1.0000000
3 1.0000000 0.6666667
4 0.6666667 1.0000000 0.6666667
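
The two steps above (dummification, then binary distance) can also be sketched in base R, which may be useful where the dummies package is not installed. Note that dummify is a hypothetical helper written for this example, not a function from any package, and survey_b is rebuilt from the printed rows.

```r
# Rebuild the survey from the printed rows
survey_b <- data.frame(
  color = c("red", "green", "blue", "blue"),
  sport = c("soccer", "hockey", "hockey", "soccer")
)

# Hypothetical helper: one 0/1 indicator column per category level
dummify <- function(df) {
  cols <- lapply(names(df), function(nm) {
    values <- as.character(df[[nm]])
    lv <- sort(unique(values))
    # outer() compares every value with every level
    m <- outer(values, lv, "==") * 1L
    colnames(m) <- paste0(nm, lv)
    m
  })
  do.call(cbind, cols)
}

dummy_survey_b <- dummify(survey_b)
print(dummy_survey_b)

# Jaccard distance on the indicator columns
survey_dist <- as.matrix(dist(dummy_survey_b, method = "binary"))
print(survey_dist)
```

Two respondents who share one of their two answers (for example 1 and 4, who both chose soccer) have one indicator in common out of three selected overall, which gives a distance of 1 - 1/3.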

slide-54
SLIDE 54

DataCamp Cluster Analysis in R

Let's practice!

CLUSTER ANALYSIS IN R