Principal Component Analysis: Surajit Ray, Reader, University of Glasgow



SLIDE 1

DataCamp Multivariate Probability Distributions in R

Principal Component Analysis

MULTIVARIATE PROBABILITY DISTRIBUTIONS IN R

Surajit Ray

Reader, University of Glasgow

SLIDE 2

Principal Component Analysis (PCA) goals

  • Dimension reduction
  • Creating uncorrelated variables
  • Capturing variability in fewer dimensions

SLIDE 3

Algorithm

  • PC1 explains the maximum variation (orange direction)
  • PC2 is uncorrelated with PC1 and explains the maximum remaining variation (blue direction)
  • PC3 is uncorrelated with PC1 and PC2 and explains the maximum remaining variation (green direction)

The princomp() function calculates the PCs.
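As a minimal sketch of what the algorithm produces, the R code below checks three properties on the built-in mtcars data: the component variances never increase, the scores on different components are uncorrelated, and (with cor = TRUE) the component variances equal the eigenvalues of the correlation matrix. The comparison with eigen() and the variable names are illustrative assumptions, not part of the slides.

```r
# Sketch: basic properties of principal components, base R only.
# Assumption: mtcars without the binary columns vs and am is the example data.
x <- mtcars[, !(names(mtcars) %in% c("vs", "am"))]

pca <- princomp(x, cor = TRUE, scores = TRUE)

# 1. Component variances never increase from one PC to the next.
pc_var <- unname(pca$sdev^2)
variances_decrease <- all(diff(pc_var) <= 0)

# 2. Scores on different components are uncorrelated:
#    the largest off-diagonal correlation is numerically zero.
score_cor <- cor(pca$scores)
max_offdiag <- max(abs(score_cor[upper.tri(score_cor)]))

# 3. With cor = TRUE the component variances equal the eigenvalues
#    of the correlation matrix.
eig_val <- eigen(cor(x), symmetric = TRUE)$values
matches_eigen <- isTRUE(all.equal(pc_var, eig_val))

print(variances_decrease)
print(max_offdiag)
print(matches_eigen)
```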

SLIDE 4

Principal Component Analysis in R

Simplified format

princomp(x, cor = FALSE, scores = TRUE)

  • x: a numeric matrix or data frame
  • cor: if TRUE, use the correlation matrix instead of the covariance matrix
  • scores: if TRUE, the scores (projections of the data on the principal components) are produced
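To illustrate why the cor argument matters, the sketch below runs princomp() on the built-in mtcars data both ways. With cor = FALSE the variables keep their original units, so variables with large variances (such as disp and hp) tend to dominate the first component. The helper prop_first() is a hypothetical name introduced only for this example.

```r
# Sketch: effect of the cor argument of princomp().
# Assumption: the built-in mtcars data serves as the example.
pca_cov <- princomp(mtcars, cor = FALSE)  # covariance matrix (original units)
pca_cor <- princomp(mtcars, cor = TRUE)   # correlation matrix (standardized)

# Hypothetical helper: share of total variance carried by the first component.
prop_first <- function(p) unname(p$sdev[1]^2 / sum(p$sdev^2))
p_cov <- prop_first(pca_cov)
p_cor <- prop_first(pca_cor)

# Variable with the largest absolute loading on the first component.
top_cov <- names(which.max(abs(pca_cov$loadings[, 1])))
top_cor <- names(which.max(abs(pca_cor$loadings[, 1])))

print(c(covariance = p_cov, correlation = p_cor))
print(c(covariance = top_cov, correlation = top_cor))
```

When the variables are measured in different units, as in mtcars, cor = TRUE is usually the safer choice because every variable then contributes on the same scale.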

SLIDE 5

Principal Component Analysis of mtcars dataset

The mtcars dataset contains 11 variables, including fuel consumption, for 32 automobiles.

head(mtcars, 5)

                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2

SLIDE 6

Selecting numeric columns from mtcars dataset

  • Exclude the vs and am variables (both binary)
  • Perform PCA

mtcars.sub <- mtcars[ , -c(8, 9)]
cars.pca <- princomp(mtcars.sub, cor = TRUE, scores = TRUE)
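The positional index -c(8, 9) depends on the column order of mtcars. A minimal alternative sketch, assuming you would rather name the columns you drop, is shown below; binary_cols is a hypothetical variable name.

```r
# Sketch: drop the binary columns by name rather than by position.
binary_cols <- c("vs", "am")  # hypothetical helper variable
mtcars.sub <- mtcars[, !(names(mtcars) %in% binary_cols)]

# Check that this gives the same data frame as the positional version.
same_as_positional <- identical(mtcars.sub, mtcars[, -c(8, 9)])

cars.pca <- princomp(mtcars.sub, cor = TRUE, scores = TRUE)

print(same_as_positional)
print(names(mtcars.sub))
```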

SLIDE 7

princomp function output

cars.pca    # Output of cars.pca

Standard deviations:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
 2.378  1.443  0.710  0.515  0.428  0.352  0.324  0.242  0.149

summary(cars.pca)    # Summary of cars.pca

Importance of components:
                       Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8  Comp.9
Standard deviation      2.378  1.443  0.710 0.5148 0.4280 0.3518 0.3241 0.2419 0.14896
Proportion of Variance  0.628  0.231  0.056 0.0294 0.0204 0.0138 0.0117 0.0065 0.00247
Cumulative Proportion   0.628  0.860  0.916 0.9453 0.9656 0.9794 0.9910 0.9975 1.00000

SLIDE 8

Let's apply principal component analysis!

SLIDE 9

Choosing the number of components

SLIDE 10

Summary of princomp object

summary(cars.pca)    # Summary of cars.pca

Importance of components:
                       Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8  Comp.9
Standard deviation      2.378  1.443  0.710 0.5148 0.4280 0.3518 0.3241 0.2419 0.14896
Proportion of Variance  0.628  0.231  0.056 0.0294 0.0204 0.0138 0.0117 0.0065 0.00247
Cumulative Proportion   0.628  0.860  0.916 0.9453 0.9656 0.9794 0.9910 0.9975 1.00000

SLIDE 11

Using the scree plot

Method 1: proportion of variation explained

  • Choose the number of components at the point where the steep part of the curve gives way to a flat line

screeplot(cars.pca, type = "lines")

SLIDE 12

Cumulative variance explained

Method 2: cumulative variation

  • Retain enough components to explain a predetermined proportion of the variation

summary(cars.pca)    # Summary of cars.pca

Importance of components:
                       Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8  Comp.9
Standard deviation      2.378  1.443  0.710 0.5148 0.4280 0.3518 0.3241 0.2419 0.14896
Proportion of Variance  0.628  0.231  0.056 0.0294 0.0204 0.0138 0.0117 0.0065 0.00247
Cumulative Proportion   0.628  0.860  0.916 0.9453 0.9656 0.9794 0.9910 0.9975 1.00000

SLIDE 13

Calculating cumulative proportional variance

Cumulative proportion

# Variance explained
pc.var <- cars.pca$sdev^2

# Proportion of variation
pc.pvar <- pc.var / sum(pc.var)

# Cumulative proportion
plot(cumsum(pc.pvar), type = 'b')
abline(h = 0.9)

SLIDE 14

Calculating cumulative proportional variance

Cumulative proportion

  • The first 3 PCs explain just over 90 percent (about 91.6 percent) of the variation

# Variance explained
pc.var <- cars.pca$sdev^2

# Proportion of variation
pc.pvar <- pc.var / sum(pc.var)

# Cumulative proportion
plot(cumsum(pc.pvar), type = 'b')
abline(h = 0.9)
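The number of components can also be read off programmatically instead of from the plot. The sketch below, which assumes a hypothetical threshold variable named target set to the same 0.9 used for the horizontal line, finds the smallest number of components whose cumulative proportion reaches the threshold. Given the cumulative proportions in the summary output (0.860 for two components, 0.916 for three), it should return 3.

```r
# Sketch: smallest number of components reaching a target cumulative proportion.
mtcars.sub <- mtcars[, -c(8, 9)]
cars.pca <- princomp(mtcars.sub, cor = TRUE, scores = TRUE)

pc.var <- cars.pca$sdev^2          # variance explained by each component
pc.pvar <- pc.var / sum(pc.var)    # proportion of variation

target <- 0.9  # hypothetical threshold, same value as abline(h = 0.9)

# Index of the first component at which the cumulative proportion reaches target.
n.comp <- unname(which(cumsum(pc.pvar) >= target)[1])
print(n.comp)
```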

SLIDE 15

Let's practice using these techniques!

SLIDE 16

Interpreting PCA outputs

SLIDE 17

Attributes of princomp object

cars.pca <- princomp(mtcars.sub, cor = TRUE, scores = TRUE)
attributes(cars.pca)

$names
[1] "sdev"     "loadings" "center"   "scale"    "n.obs"    "scores"   "call"

SLIDE 18

Interpretation of loadings

cars.pca$loadings    # or loadings(cars.pca)

Loadings (entries smaller than 0.1 in absolute value are printed as blanks, so the values in each row below appear in column order but cannot be aligned to specific components in this transcript):

Columns: Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9

mpg   0.393 -0.221 -0.321 0.720 0.381 0.125 -0.115
cyl   -0.403 -0.252 0.117 0.224 0.159 -0.810 -0.163
disp  -0.397 0.339 -0.487 0.182 0.662
hp    -0.367 -0.269 -0.295 0.354 -0.696 0.166 -0.252
drat  0.312 -0.342 0.150 0.846 0.162 -0.135
wt    -0.373 0.172 0.454 0.191 -0.187 0.428 0.198 -0.569
qsec  0.224 0.484 0.628 -0.148 0.258 -0.276 -0.356 0.169
gear  0.209 -0.551 0.207 -0.282 -0.562 -0.323 -0.316
carb  -0.245 -0.484 0.464 -0.214 0.400 0.357 0.206 0.108 0.320
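The printed loadings hide small entries, but the underlying matrix is complete. As a minimal sketch, the code below extracts the full loading matrix with unclass() and checks a property that helps with interpretation: each loading vector has unit length and different loading vectors are orthogonal. The names L and gram are hypothetical.

```r
# Sketch: inspect the full loading matrix and check that it is orthonormal.
mtcars.sub <- mtcars[, -c(8, 9)]
cars.pca <- princomp(mtcars.sub, cor = TRUE, scores = TRUE)

# unclass() turns the loadings object into a plain numeric matrix,
# so every entry is shown, including those the print method blanks out.
L <- unclass(cars.pca$loadings)

# Each column has unit length and distinct columns are orthogonal,
# so t(L) %*% L is (numerically) the identity matrix.
gram <- crossprod(L)
is_orthonormal <- isTRUE(all.equal(gram, diag(ncol(L)), check.attributes = FALSE))

print(round(L[, 1:2], 3))
print(is_orthonormal)
```

Because each column has unit length, the squared loadings in a column sum to 1, so a squared loading can be read as the share of that component attributable to a given variable.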

SLIDE 19

Geometry of loadings - numerical values

If we choose to retain two components

cars.pca$loadings[, 1:2]

Loadings:
     Comp.1 Comp.2
mpg   0.393
cyl  -0.403
disp -0.397
hp   -0.367 -0.269
drat  0.312 -0.342
wt   -0.373  0.172
qsec  0.224  0.484
gear  0.209 -0.551
carb -0.245 -0.484

SLIDE 20

Geometry of loadings - plot

biplot(cars.pca, col = c("gray","steelblue"), cex = c(0.5, 1.3))

SLIDE 21

PCA scores

  • Projection of the original dataset on the principal components
  • A total of 9 scores is available for each observation

head(cars.pca$scores)    # PC scores of first 6 observations

                  Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7  Comp.8  Comp.9
Mazda RX4           0.67  -1.19  -0.21 -0.128  0.764 -0.127  0.430  0.0033  0.1697
Mazda RX4 Wag       0.65  -0.99   0.11 -0.087  0.667 -0.067  0.456 -0.0575  0.0727
Datsun 710          2.34   0.33  -0.21 -0.110 -0.077 -0.576 -0.392  0.2053 -0.1163
Hornet 4 Drive      0.22   2.01  -0.33 -0.313 -0.248  0.085 -0.034  0.0241  0.1476
Hornet Sportabout  -1.61   0.84  -1.05  0.150 -0.226  0.186  0.059 -0.1548  0.1571
Valiant            -0.05   2.49   0.11 -0.885 -0.128 -0.234 -0.228 -0.1002  0.0043
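To make "projection on the principal components" concrete, the sketch below rebuilds the scores by hand: it standardizes the data with the center and scale stored in the princomp object and multiplies by the loading matrix. The names z and manual.scores are hypothetical, and the code is an illustration of the idea rather than a replacement for cars.pca$scores.

```r
# Sketch: scores are the standardized data projected onto the loadings.
mtcars.sub <- mtcars[, -c(8, 9)]
cars.pca <- princomp(mtcars.sub, cor = TRUE, scores = TRUE)

# Standardize with the center and scale stored in the princomp object.
z <- scale(mtcars.sub, center = cars.pca$center, scale = cars.pca$scale)

# Project the standardized data onto the loading vectors.
manual.scores <- z %*% unclass(cars.pca$loadings)

# Compare with the scores computed by princomp().
scores_match <- isTRUE(all.equal(manual.scores, cars.pca$scores,
                                 check.attributes = FALSE))
print(scores_match)
```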

SLIDE 22

PCA scores on first two components

  • Projection of the original dataset on the principal components
  • Scores on the first two components

head(cars.pca$scores[, 1:2])    # First two PC scores of first 6 observations

                  Comp.1 Comp.2
Mazda RX4           0.67  -1.19
Mazda RX4 Wag       0.65  -0.99
Datsun 710          2.34   0.33
Hornet 4 Drive      0.22   2.01
Hornet Sportabout  -1.61   0.84
Valiant            -0.05   2.49

SLIDE 23

Calculating, visualizing and interpreting scores

biplot(cars.pca, col = c("steelblue", "white"), cex = c(0.8, 0.01))

SLIDE 24

Plotting scores using ggplot

library(ggplot2)

scores <- data.frame(cars.pca$scores)
ggplot(data = scores, aes(x = Comp.1, y = Comp.2, label = rownames(scores))) +
  geom_text(size = 4, col = "steelblue")

SLIDE 25

Plotting and coloring scores using ggplot

cylinder <- factor(mtcars$cyl)
ggplot(data = scores, aes(x = Comp.1, y = Comp.2, label = rownames(scores), color = cylinder)) +
  geom_text(size = 4)

SLIDE 26

Using the factoextra library

  • fviz_pca_biplot()
  • fviz_pca_ind()
  • fviz_pca_var()
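A minimal usage sketch for the three functions follows. It assumes the factoextra package is installed and skips the calls otherwise; only the default arguments are used, since the slides do not show which options were chosen. The object names p.biplot, p.ind and p.var are hypothetical.

```r
# Sketch: the three factoextra helpers applied to a princomp object.
# Assumption: factoextra is installed; the calls are skipped if it is not.
mtcars.sub <- mtcars[, -c(8, 9)]
cars.pca <- princomp(mtcars.sub, cor = TRUE, scores = TRUE)

if (requireNamespace("factoextra", quietly = TRUE)) {
  p.biplot <- factoextra::fviz_pca_biplot(cars.pca)  # observations and variables
  p.ind    <- factoextra::fviz_pca_ind(cars.pca)     # observations (scores) only
  p.var    <- factoextra::fviz_pca_var(cars.pca)     # variables (loadings) only
}
```

Each call returns a ggplot object, so print(p.ind), for example, draws the plot of the observations.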

SLIDE 30

Let's practice these functions!

SLIDE 31

Multidimensional Scaling

SLIDE 32

What is Multidimensional Scaling?

Classical multidimensional scaling (MDS), or principal coordinates analysis

  • INPUT: a matrix of distances
  • OUTPUT: a set of points in the given number of dimensions whose pairwise distances closely match the INPUT distances
  • cmdscale() function

cmdscale(d, k = 2, ...)

Non-metric scaling (MASS package)

  • isoMDS()
  • sammon()
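The input and output of classical MDS with cmdscale() can be illustrated on a tiny made-up example. The sketch below uses four hypothetical points (the corners of a 3 by 4 rectangle), passes only their distance matrix to cmdscale(), and checks that the recovered points have the same pairwise distances.

```r
# Sketch: classical MDS recovers a configuration with the same distances.
# Hypothetical example: the four corners of a 3 x 4 rectangle.
pts <- rbind(A = c(0, 0), B = c(3, 0), C = c(3, 4), D = c(0, 4))

d <- dist(pts)                 # INPUT: pairwise distances only
mds.pts <- cmdscale(d, k = 2)  # OUTPUT: points in k = 2 dimensions

# The recovered points may be shifted, rotated or reflected relative to pts,
# but their pairwise distances match the input distances.
distances_match <- isTRUE(all.equal(as.vector(dist(mds.pts)), as.vector(d)))

print(mds.pts)
print(distances_match)
```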

SLIDE 33

US City distance example

# UScitiesD dataset

              Atlanta Chicago Denver Houston LosAngeles Miami NewYork SanFrancisco Seattle
Chicago           587
Denver           1212     920
Houston           701     940    879
LosAngeles       1936    1745    831    1374
Miami             604    1188   1726     968       2339
NewYork           748     713   1631    1420       2451  1092
SanFrancisco     2139    1858    949    1645        347  2594    2571
Seattle          2182    1737   1021    1891        959  2734    2408          678
Washington.DC     543     597   1494    1220       2300   923     205         2442    2329

SLIDE 34

MDS on US city distance dataset

usloc <- cmdscale(UScitiesD)
usloc

               [,1]   [,2]
Atlanta        -719  143.0
Chicago        -382 -340.8
Denver          482  -25.3
Houston        -161  572.8
LosAngeles     1204  390.1
Miami         -1134  581.9
NewYork       -1072 -519.0
SanFrancisco   1421  112.6
Seattle        1342 -579.7
Washington.DC  -980 -335.5

ggplot(data = data.frame(usloc), aes(x = X1, y = X2, label = rownames(usloc))) +
  geom_text()

SLIDE 35

US cities MDS output

  • Plot of output from cmdscale
  • Plot after rotation
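MDS output is determined only up to rotation and reflection, which is why the raw cmdscale output need not look like a map. In the output shown for usloc, the west coast cities have positive first coordinates and the northern cities negative second coordinates. The sketch below assumes a 180 degree rotation (the slides do not state the angle used) and checks that rotating leaves all pairwise distances unchanged. The helper rotate() is hypothetical, and the signs of the axes can differ between platforms.

```r
# Sketch: rotating an MDS configuration leaves all distances unchanged.
usloc <- cmdscale(UScitiesD)

# Hypothetical helper: 2 x 2 rotation matrix for an angle theta (in radians).
rotate <- function(theta) {
  matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), nrow = 2)
}

# Assumption: a 180 degree rotation; the slides do not state the angle used.
usloc.rot <- usloc %*% rotate(pi)

# Pairwise distances before and after the rotation agree.
distances_kept <- isTRUE(all.equal(as.vector(dist(usloc.rot)),
                                   as.vector(dist(usloc))))
print(distances_kept)
```

The rotated coordinates can be plotted in the same way as usloc, for example after converting them with data.frame(usloc.rot).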

SLIDE 36

Multidimensional scaling on mtcars dataset

cars.dist <- dist(mtcars)
cars.mds <- cmdscale(cars.dist, k = 2)
cars.mds <- data.frame(cars.mds)
ggplot(data = cars.mds, aes(x = X1, y = X2, label = rownames(cars.mds))) +
  geom_text()
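Classical MDS on Euclidean distances is closely related to PCA. As a sketch of that link, the code below compares the two-dimensional cmdscale() coordinates with the first two covariance-based princomp() scores. Since dist(mtcars) works on the unscaled variables, the matching PCA uses cor = FALSE. The comparison is made on absolute values because the sign of each axis is arbitrary; the object names are hypothetical.

```r
# Sketch: classical MDS on Euclidean distances versus covariance-based PCA.
cars.mds2 <- cmdscale(dist(mtcars), k = 2)
cars.pc2  <- princomp(mtcars, cor = FALSE)$scores[, 1:2]

# Compare absolute values because the sign of each axis is arbitrary.
same_up_to_sign <- isTRUE(all.equal(abs(cars.mds2), abs(cars.pc2),
                                    check.attributes = FALSE))
print(same_up_to_sign)
```

One consequence is that variables with large variances, such as disp and hp, dominate this MDS map in the same way they dominate a covariance-based PCA.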

SLIDE 37

Multidimensional scaling in more than two dimensions

library(scatterplot3d)

cars.dist <- dist(mtcars)
cmds3 <- data.frame(cmdscale(cars.dist, k = 3))
scatterplot3d(cmds3, type = "h", pch = 19, lty.hplot = 2)

SLIDE 38

Multidimensional scaling in more than two dimensions

cars.dist <- dist(mtcars)
cmds3 <- data.frame(cmdscale(cars.dist, k = 3))
scatterplot3d(cmds3, type = "h", pch = 19, lty.hplot = 2, color = mtcars$cyl)

SLIDE 39

Now let's try using MDS!

SLIDE 40

Congratulations

SLIDE 41

Reading, summarizing and plotting multivariate data

  • Reading and reformatting data
  • Summary statistics: mean vector, variance-covariance matrix, correlation matrix
  • Plotting in 2D and 3D
  • Multivariate probability distributions: normal, t, skew-normal, skew-t
  • Dimension reduction: PCA, MDS

SLIDE 42

What we have not covered

  • Mathematical details of multivariate distributions
  • Other continuous distributions, such as the Wishart
  • Discrete multivariate distributions
  • Linear algebra behind PCA (check out other DataCamp courses)

SLIDE 43

Congratulations!