Dimensionality Reduction
STAT 209: Dimensionality Reduction
November 26, 2019
Colin Reimer Dawson
Outline
- Dimensionality Reduction
High Dimensional Data
- Modern datasets often have huge numbers of variables
- E.g., images, biomarker data, measurements at fine-grained time points, social networks, product preferences
- Clustering can be a useful way to find “groups” of similar observations
- However, distance measures have some strange properties in high dimensions (e.g., the distances between different pairs of points tend to become nearly equal)
- It can be useful to try to extract a few dimensions that carry most of the “signal”
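The “strange properties” bullet can be illustrated with a small simulation. The sketch below is not from the slides: it assumes points drawn uniformly from the unit hypercube and uses a hypothetical helper, `relative_spread()`, to measure how much the pairwise Euclidean distances vary relative to their mean as the number of dimensions grows.

```r
# Illustrative sketch (not from the slides): pairwise distances between
# uniformly random points become relatively similar as the dimension grows.
set.seed(209)

# Hypothetical helper: (max - min) / mean of all pairwise Euclidean
# distances among n points drawn uniformly from the d-dimensional unit cube.
relative_spread <- function(d, n = 100) {
  X <- matrix(runif(n * d), nrow = n, ncol = d)
  dists <- as.numeric(dist(X))
  (max(dists) - min(dists)) / mean(dists)
}

dims <- c(2, 10, 100, 1000)
spread <- sapply(dims, relative_spread)
names(spread) <- dims
round(spread, 2)
```

A shrinking relative spread means that the "nearest" and "farthest" neighbours of a point are almost equally far away, which is one reason distance-based clustering is harder on raw high-dimensional data.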
Images Have Many Variables
...but maybe only a few meaningful “features”
High Dimensional Inputs
Comprehensible when arranged this way...
“Eigenfaces”
Finding the “Main Direction” of Variation
[Figure: scatterplot of the MidtermCentered and QuizCentered scores]
Finding the “Eigen-features”
library(dplyr)     # select(), mutate(), %>%
library(magrittr)  # extract2()
## Here I am pulling out the perpendicular directions in (Midterm, Quiz)
## space that align with the ellipse on the scatterplot.
## If you know some linear algebra:
## These are the eigenvectors of the covariance matrix
directions <- select(Scores, Midterm, Quiz) %>%
  cov() %>%
  eigen()
directions %>%
  extract2("vectors") %>%
  round(digits = 2)

      [,1]  [,2]
[1,] -0.97  0.24
[2,] -0.24 -0.97

## Creating two new variables that are a weighted sum and weighted
## difference of the midterm and quiz score, with weights chosen so
## that the new variables are uncorrelated.
## (The sign of an eigenvector is arbitrary, so the first column is
## used here with its signs flipped.)
Scores_augmented <- mutate(Scores,
  V1 = 0.97 * Midterm + 0.24 * Quiz,
  V2 = 0.24 * Midterm - 0.97 * Quiz)
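Because the `Scores` data are not reproduced here, the following base-R sketch repeats the same idea on simulated scores. All names (`sim_scores`, `weights`, `new_vars`) and the simulation settings are assumptions made for illustration, not part of the original slides. It checks that projecting onto the eigenvectors of the covariance matrix yields uncorrelated variables whose variances are the eigenvalues.

```r
# Illustrative sketch with simulated scores (not the Scores data).
set.seed(209)
n <- 200
midterm <- rnorm(n, mean = 80, sd = 10)
quiz    <- 20 + 0.15 * (midterm - 80) + rnorm(n, sd = 1.5)
sim_scores <- cbind(Midterm = midterm, Quiz = quiz)

# Eigenvectors of the covariance matrix: perpendicular, unit-length
# directions, ordered from most to least variance.
eig <- eigen(cov(sim_scores))
weights <- eig$vectors

# New variables are weighted sums of the original columns.
new_vars <- sim_scores %*% weights
colnames(new_vars) <- c("V1", "V2")

# Correlation between V1 and V2 (zero up to rounding error).
round(cor(new_vars), 3)

# The variance of each new variable equals the matching eigenvalue.
rbind(variance   = apply(new_vars, 2, var),
      eigenvalue = eig$values)
```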
Scatterplots
Scores_augmented %>%
  select(Final, Midterm, Quiz, V1, V2) %>%
  plot()

[Figure: scatterplot matrix of Final, Midterm, Quiz, V1, and V2]
Scottish Parliament Votes
[C]onsider the Scottish Parliament in 2008. Legislators often vote together in pre-organized blocs, and thus the pattern of “ayes” and “nays” on particular ballots may indicate which members are affiliated (i.e., members of the same political party). To test this idea, you might try clustering the members by their voting record. – MDSR p. 212
Scottish Parliament Votes
Figure: Source: MDSR p. 212
Visualizing All of the Votes
library(mdsr)
Votes %>%
  mutate(Vote = factor(vote, labels = c("Nay", "Abstain", "Aye"))) %>%
  ggplot(aes(x = bill, y = name, fill = Vote)) +
  geom_tile() +
  xlab("Ballot") +
  ylab("Member of Parliament") +
  scale_fill_manual(values = c("darkgray", "white", "goldenrod")) +
  scale_x_discrete(breaks = NULL, labels = NULL) +
  scale_y_discrete(breaks = NULL, labels = NULL)

[Figure: tile plot with Ballot on the x-axis and Member of Parliament on the y-axis, filled by Vote (Nay, Abstain, Aye)]
Visualizing Two Randomly Selected Votes
Votes %>%
  filter(bill %in% c("S1M-240.2", "S1M-639.1")) %>%
  spread(key = bill, value = vote) %>%
  ggplot(aes(x = `S1M-240.2`, y = `S1M-639.1`)) +
  geom_point(alpha = 0.7,
             position = position_jitter(width = 0.1, height = 0.1)) +
  geom_point(alpha = 0.01, size = 10, color = "red")

[Figure: jittered scatterplot of votes on S1M-240.2 (x-axis) and S1M-639.1 (y-axis), each running from -1 to 1]
Are there eight clusters of MPs?
Two Arbitrary Aggregate Features
Votes_by_half <- Votes %>%
  mutate(
    set_num = bill %>% factor() %>% as.numeric(),
    set = ifelse(set_num < max(set_num) / 2, "First_Half", "Second_Half")) %>%
  group_by(name, set) %>%
  ## vote is coded -1 / 0 / 1, so sum(vote) is Ayes minus Nays
  summarise(Ayes = sum(vote)) %>%
  spread(key = set, value = Ayes)
Votes_by_half %>% head()

# A tibble: 6 x 3
# Groups:   name [6]
  name                First_Half Second_Half
  <chr>                    <int>       <int>
1 Adam, Brian                -25          -2
2 Aitken, Bill               -32         -17
3 Alexander, Ms Wendy         35          59
4 Baillie, Jackie             43          50
5 Barrie, Scott               48          54
6 Boyack, Sarah               43          54
Visualizing These Features
Votes_by_half %>%
  ggplot(aes(x = First_Half, y = Second_Half)) +
  geom_point(alpha = 0.3, size = 5)

[Figure: scatterplot of Second_Half (y-axis) against First_Half (x-axis)]
Maybe two clusters?
A More Principled Approach
- Instead of arbitrarily splitting the votes into “first half” and “second half”, we can extract some “high signal” aggregate features using linear algebra
- Singular Value Decomposition (SVD) takes a matrix and finds linear combinations of columns (variables) that account for a high degree of variability in the observations
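To make the SVD bullet concrete, here is a minimal base-R sketch on a small simulated matrix (not the `Votes` data; the matrix `X` and the choice `k = 2` are assumptions for illustration). It shows that `svd()` factors a matrix as U D V', that the pieces multiply back to the original, and that keeping only the first few components gives a low-rank approximation.

```r
# Illustrative sketch on a small simulated matrix (not the Votes data).
set.seed(209)
X <- matrix(rnorm(8 * 5), nrow = 8, ncol = 5)

s <- svd(X)          # X = U %*% diag(d) %*% t(V)
U <- s$u
d <- s$d             # singular values, largest first
V <- s$v

# The three pieces multiply back to the original matrix.
X_rebuilt <- U %*% diag(d) %*% t(V)
max(abs(X - X_rebuilt))

# Keeping only the first k components gives a rank-k approximation.
k <- 2
X_k <- U[, 1:k] %*% diag(d[1:k], nrow = k) %*% t(V[, 1:k])

# Share of the total sum of squares captured by each component,
# accumulated from the first component onward.
share <- d^2 / sum(d^2)
round(cumsum(share), 3)
```

The squared error of the rank-k approximation equals the sum of the squared singular values that were dropped, which is why the leading components are the “high signal” features.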
Finding the Singular Value Decomposition
Votes_wide <- Votes %>%
  spread(key = bill, value = vote)
vote_svd <- Votes_wide %>%
  select(-name) %>%
  svd()
## extract2() and extract() are the magrittr aliases for `[[` and `[`
voters <- vote_svd %>%
  extract2("u") %>%
  extract(, 1:5) %>%
  as.data.frame()
Votes_wide %>%
  select(name) %>%
  bind_cols(voters) %>%
  head(n = 10)

                  name          V1          V2          V3           V4
1          Adam, Brian -0.05515009  0.14724830 -0.09105145  0.030773115
2         Aitken, Bill -0.03407529 -0.13428794 -0.21188577 -0.004948290
3  Alexander, Ms Wendy  0.09992837  0.02079317 -0.01574811  0.099420388
4      Baillie, Jackie  0.11781863  0.02655227 -0.03901258  0.053832092
5        Barrie, Scott  0.12508230  0.02608650 -0.03699042  0.056785890
6        Boyack, Sarah  0.11819840  0.02162116 -0.03661068  0.032889994
7       Brankin, Rhona  0.10747215  0.01804487 -0.03329344  0.082814861
8        Brown, Robert  0.11428676  0.02745173 -0.02935648 -0.157088404
9         Butler, Bill  0.08482701  0.01554906 -0.03115931  0.008411571
10     Campbell, Colin -0.05226689  0.13664703 -0.07726053  0.056225855
             V5
1  -0.042638572
2   0.007370791
3  -0.045984299
4   0.028654279
5   0.029540439
[output truncated]
Clusters in 5D SVD Space
clusts <- voters %>%
  kmeans(centers = 4, nstart = 100)
voters <- voters %>%
  mutate(cluster = clusts %>% extract2("cluster") %>% factor())
ggplot(data = voters, aes(x = V1, y = V2)) +
  geom_point(aes(x = 0, y = 0), color = "red", shape = 1, size = 7) +
  geom_point(aes(color = cluster), size = 5, alpha = 0.3) +
  xlab("Best Vector from SVD") +
  ylab("Second Best Vector from SVD") +
  ggtitle("Political Positions of Members of Parliament")

[Figure: "Political Positions of Members of Parliament", Second Best Vector from SVD plotted against Best Vector from SVD, colored by cluster (1 to 4)]
Based on the first two directions, there seem to be three clear political “poles”
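The pipeline on this slide (SVD, then k-means on the leading left singular vectors) can be tried on simulated data where the true groups are known. The sketch below is an assumption-laden illustration, not the lecture's analysis: it invents three voting blocs whose members follow a random “party line” 90% of the time, and all object names (`party_line`, `sim_votes`, `coords`, `fit`) are hypothetical.

```r
# Illustrative sketch with simulated voters (not the Votes data).
set.seed(209)
n_blocs   <- 3
per_bloc  <- 20
n_ballots <- 60
bloc <- rep(seq_len(n_blocs), each = per_bloc)

# Each bloc has a "party line" (+1 = Aye, -1 = Nay) on every ballot.
party_line <- matrix(sample(c(-1, 1), n_blocs * n_ballots, replace = TRUE),
                     nrow = n_blocs)

# Members follow their party line 90% of the time and flip otherwise.
follow <- matrix(sample(c(1, -1), length(bloc) * n_ballots,
                        replace = TRUE, prob = c(0.9, 0.1)),
                 nrow = length(bloc))
sim_votes <- party_line[bloc, ] * follow

# A few left singular vectors give low-dimensional coordinates per voter.
coords <- svd(sim_votes)$u[, 1:3]

# k-means on those coordinates, never shown the bloc labels.
fit <- kmeans(coords, centers = n_blocs, nstart = 50)
tab <- table(bloc = bloc, cluster = fit$cluster)
tab
```

Cluster numbers are arbitrary labels, so the comparison of interest is whether each row of the table concentrates in a single column, not whether bloc 1 lands in cluster 1.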
Do Clusters Align With Party?
voters <- voters %>%
  mutate(name = Votes_wide %>% pull(name)) %>%
  left_join(Parties, by = c("name" = "name"))
tally(party ~ cluster, data = voters)

party (rows) by cluster (columns 1 2 3 4)
[counts listed as extracted; their column alignment is not recoverable]
Member for Falkirk West                    1
Scottish Conservative and Unionist Party   20
Scottish Green Party                       1
Scottish Labour                            0 55 3
Scottish Liberal Democrats                 1 16
Scottish National Party                    0 36
Scottish Socialist Party                   1
- Clusters contain natural political coalitions: Conservatives, Labour, SNP, Misc. Left-leaning Parties
- Note that neither SVD nor K-means had access to party labels
Ballots as Cases
ballots <- vote_svd %>%
  extract2("v") %>%
  extract(, 1:5) %>%
  as.data.frame()
## Note: cbind() pairs rows by position, so these bill labels match the
## rows of ballots only if Votes lists the bills in the same order as the
## columns of Votes_wide. A later step labels by levels(bill) instead.
Votes %>%
  select(bill) %>%
  cbind(ballots) %>%
  head(n = 10)

       bill           V1           V2           V3           V4
1     S1M-1 -0.003911521 -0.001669706 -0.049765330 -0.073440431
2   S1M-4.1 -0.043320811  0.006895085 -0.043809250 -0.003178252
3   S1M-4.3 -0.028371451 -0.065110505  0.006923082  0.029097429
4     S1M-4  0.035988481  0.031509877  0.015616308 -0.001497610
5     S1M-5  0.035459228  0.031332118  0.015939149 -0.004646468
6    S1M-17  0.043641329 -0.013051370  0.032055199 -0.017684613
7    S1M-29  0.043090317 -0.013249216  0.032432808 -0.020407859
8  S1M-19.2  0.031961580 -0.001497625 -0.047185561  0.004931645
9  S1M-40.1  0.040041750 -0.039834460 -0.022308281 -0.022355226
10 S1M-40.2  0.042608461 -0.009594996  0.037376792  0.026150742
           V5
1  0.01365109
2  0.06699053
[output truncated]
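The switch from `u` to `v` treats the columns of the vote matrix (ballots) as the cases instead of the rows (members). A small base-R sketch on a made-up 6-by-4 matrix shows how the two sets of singular vectors relate. The member and ballot names are hypothetical, and the labels are taken from the matrix's own column names so that they line up with the rows of `v`.

```r
# Illustrative sketch on a tiny made-up vote matrix (not the Votes data).
set.seed(209)
X <- matrix(sample(c(-1, 0, 1), 6 * 4, replace = TRUE), nrow = 6,
            dimnames = list(paste0("member", 1:6), paste0("ballot", 1:4)))
s  <- svd(X)
st <- svd(t(X))

dim(s$u)   # one row per member
dim(s$v)   # one row per ballot

# Transposing the matrix leaves the singular values unchanged;
# only the roles of u and v swap.
all.equal(s$d, st$d)

# Ballot and member coordinates are linked: X %*% v = u %*% diag(d).
max(abs(X %*% s$v - s$u %*% diag(s$d)))

# Label ballot coordinates using the column names of the matrix that was
# decomposed, so each label matches its row of v.
ballot_coords <- data.frame(bill = colnames(X),
                            V1 = s$v[, 1],
                            V2 = s$v[, 2])
ballot_coords
```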
Clustering Ballots
clust_ballots <- kmeans(ballots, centers = 9, nstart = 1000)
ballots <- ballots %>%
  mutate(
    bill = Votes %>% pull(bill) %>% levels(),
    cluster = clust_ballots %>% extract2("cluster") %>% factor())
Clusters of Ballots
ggplot(data = ballots, aes(x = V1, y = V2)) +
  geom_point(aes(x = 0, y = 0), color = "red", shape = 1, size = 7) +
  geom_point(size = 5, alpha = 0.3, aes(color = cluster)) +
  xlab("Best Vector from SVD") +
  ylab("Second Best Vector from SVD") +
  ggtitle("Influential Ballots")

[Figure: "Influential Ballots", Second Best Vector from SVD plotted against Best Vector from SVD, colored by cluster (1 to 9)]
Reconstructing Votes by Ballot
Votes_svd <- Votes %>%
  mutate(Vote = factor(vote, labels = c("Nay", "Abstain", "Aye"))) %>%
  inner_join(ballots, by = "bill") %>%
  inner_join(voters, by = "name")
head(Votes_svd)

     bill            name vote Vote         V1.x         V2.x         V3.x
1   S1M-1 Canavan, Dennis    1  Aye -0.003911521 -0.001669706 -0.049765330
2 S1M-4.1 Canavan, Dennis    1  Aye -0.046135648  0.008353676 -0.040647052
3 S1M-4.3 Canavan, Dennis    1  Aye -0.043388650  0.025912548 -0.003546323
4   S1M-4 Canavan, Dennis   -1  Nay  0.044187201 -0.018875044  0.014666115
5   S1M-5 Canavan, Dennis   -1  Nay  0.044091265 -0.017511402  0.020782315
6  S1M-17 Canavan, Dennis   -1  Nay  0.043177286 -0.016040240  0.028316042
           V4.x        V5.x cluster.x        V1.y      V2.y       V3.y
1 -0.0734404307  0.01365109         3 -0.03074476 0.1165759 -0.0266422
2 -0.0161246191  0.07078703         9 -0.03074476 0.1165759 -0.0266422
3  0.0032126506  0.05567132         9 -0.03074476 0.1165759 -0.0266422
4  0.0006466422 -0.06460513         5 -0.03074476 0.1165759 -0.0266422
5 -0.0088670061 -0.05732813         5 -0.03074476 0.1165759 -0.0266422
6  0.0463227944 -0.06792476         5 -0.03074476 0.1165759 -0.0266422
        V4.y      V5.y cluster.y                   party
1 -0.3021407 0.2797985         4 Member for Falkirk West
2 -0.3021407 0.2797985         4 Member for Falkirk West
[output truncated]
All Members/Votes, Sorted by 1st SVD Feature
Votes_svd %>%
  ggplot(aes(x = reorder(bill, V1.x), y = reorder(name, V1.y), fill = Vote)) +
  geom_tile() +
  xlab("Ballot") +
  ylab("Member of Parliament") +
  scale_fill_manual(values = c("darkgray", "white", "goldenrod")) +
  scale_x_discrete(breaks = NULL, labels = NULL) +
  scale_y_discrete(breaks = NULL, labels = NULL)

[Figure: tile plot with Ballot on the x-axis and Member of Parliament on the y-axis, both sorted by their first SVD feature, filled by Vote (Nay, Abstain, Aye)]
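Sorting rows and columns by their first SVD feature is what turns the apparently random tile plot into visible blocks. The following base-R sketch illustrates the effect on simulated data under strong assumptions: two opposing blocs, two kinds of ballots, and 10% random flips. The names `bloc`, `ballot_type`, and `sim_votes` are hypothetical, and `image()` stands in for the `geom_tile()` plot.

```r
# Illustrative sketch with simulated votes (not the Votes data).
set.seed(209)
n_members <- 40
n_ballots <- 50

# Shuffled labels, so the raw matrix shows no obvious pattern.
bloc        <- sample(rep(c("A", "B"), each = n_members / 2))
ballot_type <- sample(rep(c(1, -1), each = n_ballots / 2))

# Bloc A votes Aye (+1) on type 1 ballots and Nay (-1) on type -1 ballots;
# bloc B does the opposite. 10% of votes are flipped at random.
side  <- ifelse(bloc == "A", 1, -1)
flips <- matrix(sample(c(1, -1), n_members * n_ballots, replace = TRUE,
                       prob = c(0.9, 0.1)),
                nrow = n_members)
sim_votes <- outer(side, ballot_type) * flips

# Order members by the first column of u and ballots by the first column of v.
s <- svd(sim_votes)
row_order <- order(s$u[, 1])
col_order <- order(s$v[, 1])

# Run lengths of the labels after sorting: two runs means the groups
# have been pulled together.
rle(bloc[row_order])$lengths
rle(ballot_type[col_order])$lengths

# Base-graphics stand-in for the tile plot, drawn only in interactive use.
if (interactive()) {
  op <- par(mfrow = c(1, 2))
  image(t(sim_votes), main = "Original order", axes = FALSE)
  image(t(sim_votes[row_order, col_order]),
        main = "Sorted by 1st SVD feature", axes = FALSE)
  par(op)
}
```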