  1. Multidimensional Scaling Max Turgeon STAT 4690–Applied Multivariate Analysis

  2. Recap: PCA
  • We discussed several interpretations of PCA.
  • Pearson: PCA gives the best linear approximation to the data (at a fixed dimension).
  • We also used PCA to visualize multivariate data:
    • Fit PCA.
    • Plot PC1 against PC2.

  3. Multidimensional scaling
  • Multidimensional scaling (MDS) is a method that looks at these two goals explicitly.
  • It has PCA as a special case.
  • But it is much more general.
  • The input of MDS is a dissimilarity matrix ∆, and it aims to represent the data in a lower-dimensional space such that the resulting dissimilarities ˜∆ are as close as possible to the original dissimilarities:
    • ∆ ≈ ˜∆.

  4. Example of dissimilarities
  • Dissimilarities measure how different two observations are.
  • The larger the dissimilarity, the more different the observations.
  • Therefore, any distance measure can be used as a dissimilarity measure:
    • Euclidean distance in R^p;
    • Mahalanobis distance;
    • driving distance between cities;
    • graph-based distance.
  • Any similarity measure can be turned into a dissimilarity measure using a monotone decreasing transformation, e.g. r_ij ⇒ 1 − r_ij^2.
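  As a minimal sketch of the last point (not from the slides), the snippet below turns a correlation matrix, a similarity measure, into a dissimilarity matrix using the monotone decreasing map r ⇒ 1 − r^2; the built-in swiss data is used purely for illustration.

  r <- cor(swiss)              # correlations: a similarity measure between variables
  delta <- as.dist(1 - r^2)    # monotone decreasing map: highly correlated variables get small dissimilarity
  round(as.matrix(delta)[1:3, 1:3], 2)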

  5. Two types of MDS
  • Metric MDS
    • The embedding in the lower-dimensional space uses the same dissimilarity measure as in the original space.
  • Nonmetric MDS
    • The embedding in the lower-dimensional space only uses the rank information from the original space.

  6. Metric MDS–Algorithm
  • Input: an n × n matrix ∆ of dissimilarities.
  • Output: an n × r matrix ˜X, with r < p.
  1. Create the matrix D containing the squares of the entries in ∆.
  2. Create S by centering both the rows and the columns of D and multiplying by −1/2.
  3. Compute the eigenvalue decomposition S = UΛU^T.
  4. Let ˜X be the matrix containing the first r columns of UΛ^{1/2}.
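  The next slides work through these steps on the swiss data. As a complement, here is a sketch (not from the slides) that wraps the four steps into a single function and checks it against R's built-in cmdscale(); the name metric_mds is made up for illustration.

  metric_mds <- function(Delta, r = 2) {
    D <- as.matrix(Delta)^2                              # Step 1: squared dissimilarities
    n <- nrow(D)
    J <- diag(n) - matrix(1/n, n, n)                     # centering matrix
    S <- -0.5 * J %*% D %*% J                            # Step 2: double centering, times -1/2
    decomp <- eigen(S, symmetric = TRUE)                 # Step 3: S = U Lambda U^T
    Lambda <- diag(pmax(decomp$values[1:r], 0), nrow = r)
    decomp$vectors[, 1:r] %*% sqrt(Lambda)               # Step 4: first r columns of U Lambda^{1/2}
  }

  # Should agree with cmdscale up to the sign of each column
  all.equal(abs(metric_mds(dist(swiss), r = 2)),
            abs(cmdscale(dist(swiss), k = 2)),
            check.attributes = FALSE)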

  7. Example i
  Delta <- dist(swiss)
  D <- Delta^2
  # Center columns
  B <- scale(D, center = TRUE, scale = FALSE)
  # Center rows
  B <- t(scale(t(B), center = TRUE, scale = FALSE))
  B <- -0.5 * B

  8. Example ii
  decomp <- eigen(B)
  Lambda <- diag(pmax(decomp$values, 0))
  X_tilde <- decomp$vectors %*% Lambda^0.5
  plot(X_tilde)

  9. Example iii
  [Figure: scatter plot of the first two MDS coordinates, X_tilde[,1] against X_tilde[,2].]

  10. Example iv
  mds <- cmdscale(Delta, k = 2)
  all.equal(X_tilde[, 1:2], mds, check.attributes = FALSE)
  ## [1] TRUE

  11. Example v
  library(tidyverse)
  # Let's add annotations
  dimnames(X_tilde) <- list(rownames(swiss),
                            paste0("MDS", seq_len(ncol(X_tilde))))
  X_tilde <- as.data.frame(X_tilde) %>%
    rownames_to_column("District")

  12. Example vi
  X_tilde <- X_tilde %>%
    mutate(Canton = case_when(
      District %in% c("Courtelary", "Moutier", "Neuveville") ~ "Bern",
      District %in% c("Broye", "Glane", "Gruyere", "Sarine", "Veveyse") ~ "Fribourg",
      District %in% c("Conthey", "Entremont", "Herens", "Martigwy", "Monthey",
                      "St Maurice", "Sierre", "Sion") ~ "Valais"))

  13. Example vii
  X_tilde <- X_tilde %>%
    mutate(Canton = case_when(
      !is.na(Canton) ~ Canton,
      District %in% c("Boudry", "La Chauxdfnd", "Le Locle", "Neuchatel",
                      "ValdeTravers", "Val de Ruz") ~ "Neuchatel",
      District %in% c("V. De Geneve", "Rive Droite", "Rive Gauche") ~ "Geneva",
      District %in% c("Delemont", "Franches-Mnt", "Porrentruy") ~ "Jura",
      TRUE ~ "Vaud"))

  14. Example viii
  library(ggrepel)
  X_tilde %>%
    ggplot(aes(MDS1, MDS2)) +
    geom_point(aes(colour = Canton)) +
    geom_label_repel(aes(label = District)) +
    theme_minimal() +
    theme(legend.position = "top")

  15. Example ix
  [Figure: MDS plot of the Swiss districts (MDS1 against MDS2), points coloured by canton and labelled by district.]

  16. Figure 1

  17. Another example i
  library(psych)
  cities[1:5, 1:5]
  ##      ATL  BOS  ORD  DCA  DEN
  ## ATL    0  934  585  542 1209
  ## BOS  934    0  853  392 1769
  ## ORD  585  853    0  598  918
  ## DCA  542  392  598    0 1493
  ## DEN 1209 1769  918 1493    0

  18. Another example ii
  mds <- cmdscale(cities, k = 2)
  colnames(mds) <- c("MDS1", "MDS2")
  mds <- mds %>%
    as.data.frame %>%
    rownames_to_column("Cities")

  19. Another example iii
  mds %>%
    ggplot(aes(MDS1, MDS2)) +
    geom_point() +
    geom_label_repel(aes(label = Cities)) +
    theme_minimal()

  20. Another example iv
  [Figure: scatter plot of the city coordinates (MDS1 against MDS2), labelled by city; the configuration is flipped relative to the usual map orientation, with SEA and BOS at the bottom and MIA and MSY at the top.]

  21. Another example v
  mds %>%
    mutate(MDS1 = -MDS1, MDS2 = -MDS2) %>%
    ggplot(aes(MDS1, MDS2)) +
    geom_point() +
    geom_label_repel(aes(label = Cities)) +
    theme_minimal()

  22. Another example vi
  [Figure: the same plot after reflecting both axes; SEA and BOS now appear at the top and LAX, MSY and MIA at the bottom, matching the usual geographical orientation.]

  23. Why does it work? i
  • The algorithm may seem like black magic…
    • Double centering?
    • Eigenvectors of distances?
  • Let’s try to justify it.
  • Let Y_1, …, Y_n be a set of points in R^p.
  • Recall that in R^p, the Euclidean distance and the scalar product are related as follows:

  24. Why does it work? ii
  \[
  d(Y_i, Y_j)^2 = \langle Y_i - Y_j, Y_i - Y_j \rangle
                = (Y_i - Y_j)^T (Y_i - Y_j)
                = Y_i^T Y_i - 2\, Y_i^T Y_j + Y_j^T Y_j.
  \]
  • In other words, the scalar product between Y_i and Y_j is given by
  \[
  Y_i^T Y_j = -\frac{1}{2}\left( d(Y_i, Y_j)^2 - Y_i^T Y_i - Y_j^T Y_j \right).
  \]

  25. Why does it work? iii
  • Let S be the matrix whose (i, j)-th entry is Y_i^T Y_j, and note that D is the matrix whose (i, j)-th entry is d(Y_i, Y_j)^2.
  • Now, assume that the dataset Y_1, …, Y_n has sample mean Ȳ = 0 (i.e. it is centred). The average of the i-th row of D is

  26. Why does it work? iv
  \[
  \begin{aligned}
  \frac{1}{n}\sum_{j=1}^n d(Y_i, Y_j)^2
    &= \frac{1}{n}\sum_{j=1}^n \left( Y_i^T Y_i - 2\, Y_i^T Y_j + Y_j^T Y_j \right) \\
    &= Y_i^T Y_i - \frac{2}{n}\sum_{j=1}^n Y_i^T Y_j + \frac{1}{n}\sum_{j=1}^n Y_j^T Y_j \\
    &= Y_i^T Y_i - 2\, Y_i^T \bar{Y} + \frac{1}{n}\sum_{j=1}^n Y_j^T Y_j \\
    &= S_{ii} + \frac{1}{n}\sum_{j=1}^n S_{jj}.
  \end{aligned}
  \]

  27. Why does it work? v
  • Similarly, the average of the j-th column of D is given by
  \[
  \frac{1}{n}\sum_{i=1}^n d(Y_i, Y_j)^2 = \frac{1}{n}\sum_{i=1}^n S_{ii} + S_{jj}.
  \]
  • We can then deduce that the mean of all the entries of D is given by
  \[
  \frac{1}{n^2}\sum_{i=1}^n \sum_{j=1}^n d(Y_i, Y_j)^2 = \frac{1}{n}\sum_{i=1}^n S_{ii} + \frac{1}{n}\sum_{j=1}^n S_{jj}.
  \]

  28. Why does it work? vi
  • Putting all of this together, we now have that
  \[
  Y_i^T Y_i + Y_j^T Y_j = \frac{1}{n}\sum_{j=1}^n d(Y_i, Y_j)^2 + \frac{1}{n}\sum_{i=1}^n d(Y_i, Y_j)^2 - \frac{1}{n^2}\sum_{i=1}^n \sum_{j=1}^n d(Y_i, Y_j)^2.
  \]

  29. Why does it work? vii
  • In other words, we can recover the scalar products from the squared distances through double centering and scaling.
  • Moreover, since we assumed the data was centred, the SVD of the matrix S is related to the SVD of the sample covariance matrix.
  • In this context, up to a constant, MDS and PCA give the same results.
  • Note: this idea that double centering allows us to go from dissimilarities to scalar products will come back again in the next lecture on kernel methods.
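  As a quick numerical check of this identity (a sketch, not from the slides, using the swiss data for illustration): double centering the matrix of squared Euclidean distances of column-centred data recovers the matrix of scalar products Y Y^T.

  Y <- scale(as.matrix(swiss), center = TRUE, scale = FALSE)  # centred data, so Y-bar = 0
  D <- as.matrix(dist(Y))^2                                   # squared Euclidean distances
  n <- nrow(Y)
  J <- diag(n) - matrix(1/n, n, n)                            # centering matrix
  S <- -0.5 * J %*% D %*% J                                   # double centering and scaling
  all.equal(S, Y %*% t(Y), check.attributes = FALSE)          # should return TRUE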

  30. Further comments
  • In PCA, we performed an eigendecomposition of the sample covariance matrix.
    • This is a p × p matrix.
  • In MDS, we performed an eigendecomposition of the doubly centred and scaled matrix of squared distances.
    • This is an n × n matrix.
  • If our dissimilarities are computed using the Euclidean distance, both methods will give the same answer (see the sketch below).
  • BUT: the smaller matrix will be faster to compute and faster to decompose.
    • n > p ⇒ PCA; n < p ⇒ MDS.
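  A minimal sketch of this equivalence (not from the slides), again on the swiss data: with Euclidean distances, the classical MDS coordinates should match the PCA scores up to the sign of each axis.

  pca <- prcomp(swiss, center = TRUE, scale. = FALSE)  # PCA on the raw (unscaled) variables
  mds <- cmdscale(dist(swiss), k = 2)                  # classical MDS on Euclidean distances
  all.equal(abs(pca$x[, 1:2]), abs(mds),
            check.attributes = FALSE)                  # should be TRUE: same configuration up to sign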

  31. Stress function i
  • Nonmetric MDS approaches the problem a bit differently.
  • We still have the same input ∆ of dissimilarities, but we also have an objective function called the stress function.
  • Recall that our goal is to represent the data in a lower-dimensional space such that the resulting dissimilarities ˜∆ are as close as possible to the original dissimilarities.
    • ∆_ij ≈ ˜∆_ij, for all i, j.

  32. Stress function ii
  • The stress function is defined as
  \[
  \mathrm{Stress}(\tilde{\Delta}; r) = \sqrt{\frac{\sum_{i,j=1}^n w_{ij} \left( \Delta_{ij} - \tilde{\Delta}_{ij} \right)^2}{c}},
  \]
  where
    • w_ij are nonnegative weights;
    • c is a normalising constant.
  • Note that the stress function depends on both the dimension r of the lower space and the distances ˜∆.
  • Goal: find points in R^r such that their dissimilarities minimise the stress function (a sketch follows below).
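  As a minimal sketch (not from the slides), MASS::isoMDS implements Kruskal's nonmetric MDS, which minimises one particular version of the stress function (Kruskal's stress-1, with its own choice of weights and normalising constant), starting from the classical metric solution.

  library(MASS)
  fit <- isoMDS(dist(swiss), k = 2)   # iteratively minimises Kruskal's stress
  fit$stress                          # stress (in percent) at convergence
  head(fit$points)                    # embedded coordinates in R^2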
