STAT 209 Dimensionality Reduction, November 26, 2019, Colin Reimer Dawson



SLIDE 1

Dimensionality Reduction

STAT 209 Dimensionality Reduction

November 26, 2019
Colin Reimer Dawson

SLIDE 2

Outline

Dimensionality Reduction

SLIDE 3

High Dimensional Data

  • Modern datasets often have huge numbers of variables
  • E.g., images, biomarker data, measurements at fine-grained time points, social networks, product preferences
  • Clustering can be a useful way to find “groups” of similar observations
  • However, distance measures have some strange properties in high dimensions
  • Can be useful to try to extract a few dimensions that carry most of the “signal”
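
One of the "strange properties" of distance in high dimensions can be seen in a small simulation. The sketch below is an illustration only: the data are uniform random noise rather than anything from the course, and `distance_ratio` is a hypothetical helper name. It compares the smallest and largest pairwise Euclidean distance as the number of variables grows.

```r
# Sketch: how pairwise Euclidean distances behave as the number of
# variables grows. The data are simulated (uniform noise), not course data.
set.seed(209)

# Hypothetical helper: ratio of the smallest to the largest pairwise
# distance among n random points in d dimensions.
distance_ratio <- function(d, n = 100) {
  X <- matrix(runif(n * d), nrow = n, ncol = d)
  dists <- as.vector(dist(X))  # all n * (n - 1) / 2 pairwise distances
  min(dists) / max(dists)
}

dims <- c(2, 10, 100, 1000)
ratios <- sapply(dims, distance_ratio)
data.frame(dimensions = dims, min_over_max = round(ratios, 3))
```

As the dimension increases the ratio moves toward 1: the nearest and farthest pairs become almost equally far apart, which makes "similar" and "dissimilar" observations harder to tell apart by distance alone.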

SLIDE 4

Images Have Many Variables

but maybe only a few meaningful “features”

SLIDE 5

High dimensional inputs

Comprehensible arranged this way...

SLIDE 6

“Eigenfaces”

SLIDE 7

Finding the "Main Direction" of Variation

[Figure: scatterplot of MidtermCentered and QuizCentered]

SLIDE 8

Finding the “Eigen-features”

## Setup assumed by the code on these slides
library(tidyverse)  # select(), mutate(), spread(), ggplot(), ...
library(magrittr)   # extract(), extract2(); attached after tidyverse so
                    # that extract() is magrittr's, not tidyr's

## Here I am pulling out the perpendicular directions in (Midterm,Quiz)
## space that align with the ellipse on the scatterplot.
## If you know some linear algebra:
## These are the eigenvectors of the covariance matrix
directions <- select(Scores, Midterm, Quiz) %>% cov() %>% eigen()
directions %>% extract2("vectors") %>% round(digits = 2)
      [,1]  [,2]
[1,] -0.97  0.24
[2,] -0.24 -0.97

## Creating two new variables that are a weighted sum and weighted
## difference of the midterm and quiz score, with weights chosen so
## that the new variables are uncorrelated
Scores_augmented <- mutate(Scores,
  V1 = 0.97 * Midterm + 0.24 * Quiz,
  V2 = 0.24 * Midterm - 0.97 * Quiz)
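
The claim that the eigenvector weights give uncorrelated new variables can be checked on made-up data. The sketch below uses base R only and simulates its own `Midterm` and `Quiz` scores (the course's `Scores` data are not reproduced here, so all numbers are hypothetical).

```r
# Sketch with simulated scores (hypothetical data, not the course's Scores).
set.seed(1)
n <- 200
Midterm <- rnorm(n, mean = 75, sd = 10)
Quiz    <- 20 + 0.25 * (Midterm - 75) + rnorm(n, sd = 2)
X <- cbind(Midterm = Midterm, Quiz = Quiz)

# Eigen-decomposition of the 2 x 2 covariance matrix
eig <- eigen(cov(X))
directions <- eig$vectors      # columns are orthogonal unit vectors

# New variables: each is a weighted combination of Midterm and Quiz
V <- X %*% directions
colnames(V) <- c("V1", "V2")

round(cor(V), 3)               # off-diagonal entries are numerically zero
round(apply(V, 2, var), 2)     # variances of V1 and V2 equal ...
round(eig$values, 2)           # ... the eigenvalues of the covariance matrix
```

The sign of each eigenvector is arbitrary, which is why the slide can use weights (0.97, 0.24) for V1 even though `eigen()` printed (-0.97, -0.24): flipping the sign of a direction leaves the new variable's variance and the zero correlation unchanged.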

SLIDE 9

Scatterplots

Scores_augmented %>%
  select(Final, Midterm, Quiz, V1, V2) %>%
  plot()

[Figure: scatterplot matrix of Final, Midterm, Quiz, V1, and V2]

SLIDE 10

Scottish Parliament Votes

[C]onsider the Scottish Parliament in 2008. Legislators often vote together in pre-organized blocs, and thus the pattern of “ayes” and “nays” on particular ballots may indicate which members are affiliated (i.e., members of the same political party). To test this idea, you might try clustering the members by their voting record. – MDSR p. 212

SLIDE 11

Scottish Parliament Votes

[Figure. Source: MDSR p. 212]

SLIDE 12

Visualizing All of the Votes

library(mdsr)
Votes %>%
  mutate(Vote = factor(vote, labels = c("Nay","Abstain","Aye"))) %>%
  ggplot(aes(x = bill, y = name, fill = Vote)) +
  geom_tile() +
  xlab("Ballot") + ylab("Member of Parliament") +
  scale_fill_manual(values = c("darkgray", "white", "goldenrod")) +
  scale_x_discrete(breaks = NULL, labels = NULL) +
  scale_y_discrete(breaks = NULL, labels = NULL)

[Figure: tile plot with Ballot on the x-axis, Member of Parliament on the y-axis, and fill by Vote (Nay, Abstain, Aye)]

SLIDE 13

Visualizing Two Randomly Selected Votes

Votes %>%
  filter(bill %in% c("S1M-240.2", "S1M-639.1")) %>%
  spread(key = bill, value = vote) %>%
  ggplot(aes(x = `S1M-240.2`, y = `S1M-639.1`)) +
  geom_point(
    alpha = 0.7,
    position = position_jitter(width = 0.1, height = 0.1)) +
  geom_point(alpha = 0.01, size = 10, color = "red")

[Figure: jittered scatterplot of votes on S1M-240.2 (x) and S1M-639.1 (y), both axes from −1.0 to 1.0]

Are there eight clusters of MPs?

SLIDE 14

Two Arbitrary Aggregate Features

Votes_by_half <- Votes %>%
  mutate(
    set_num = bill %>% factor() %>% as.numeric(),
    set = ifelse(
      set_num < max(set_num) / 2, "First_Half", "Second_Half")) %>%
  group_by(name, set) %>%
  summarise(Ayes = sum(vote)) %>%
  spread(key = set, value = Ayes)
Votes_by_half %>% head()
# A tibble: 6 x 3
# Groups:   name [6]
  name                First_Half Second_Half
  <chr>                    <int>       <int>
1 Adam, Brian                -25          -2
2 Aitken, Bill               -32         -17
3 Alexander, Ms Wendy         35          59
4 Baillie, Jackie             43          50
5 Barrie, Scott               48          54
6 Boyack, Sarah               43          54

SLIDE 15

Visualizing these Features

Votes_by_half %>%
  ggplot(aes(x = First_Half, y = Second_Half)) +
  geom_point(alpha = 0.3, size = 5)

[Figure: scatterplot of Second_Half against First_Half]

Maybe Two Clusters?

SLIDE 16

A More Principled Approach

  • Instead of arbitrarily splitting the votes into “first half” and “second half”, we can extract some “high signal” aggregate features using linear algebra
  • Singular Value Decomposition (SVD) takes a matrix and finds linear combinations of columns (variables) that account for a high degree of variability in the observations
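
In symbols, the decomposition is usually written as follows. The notation is an assumption made here for illustration (the slides give no symbols): X is the data matrix with one row per member and one column per ballot, and r is the number of singular values.

```latex
X = U D V^{\top}, \qquad
U^{\top} U = I, \quad V^{\top} V = I, \quad
D = \operatorname{diag}(d_1, \dots, d_r), \quad
d_1 \ge d_2 \ge \dots \ge d_r \ge 0

X V = U D
\quad\Longrightarrow\quad
X v_k = d_k u_k , \qquad
\lVert X \rVert_F^2 = \sum_{k=1}^{r} d_k^2
```

The second line is the sense in which SVD "finds linear combinations of columns": the weights in v_k combine the columns of X into a single new variable d_k u_k, and the share of the total sum of squares that this variable accounts for is d_k^2 divided by the sum of all d_j^2. Keeping only the first few k gives a low-dimensional summary.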

SLIDE 17

Finding the Singular Value Decomposition

Votes_wide <- Votes %>% spread(key = bill, value = vote)
vote_svd <- Votes_wide %>% select(-name) %>% svd()
voters <- vote_svd %>% extract2("u") %>% extract(,1:5) %>% as.data.frame()
Votes_wide %>% select(name) %>% bind_cols(voters) %>% head(n = 10)
                  name          V1          V2          V3           V4
1          Adam, Brian -0.05515009  0.14724830 -0.09105145  0.030773115
2         Aitken, Bill -0.03407529 -0.13428794 -0.21188577 -0.004948290
3  Alexander, Ms Wendy  0.09992837  0.02079317 -0.01574811  0.099420388
4      Baillie, Jackie  0.11781863  0.02655227 -0.03901258  0.053832092
5        Barrie, Scott  0.12508230  0.02608650 -0.03699042  0.056785890
6        Boyack, Sarah  0.11819840  0.02162116 -0.03661068  0.032889994
7       Brankin, Rhona  0.10747215  0.01804487 -0.03329344  0.082814861
8        Brown, Robert  0.11428676  0.02745173 -0.02935648 -0.157088404
9         Butler, Bill  0.08482701  0.01554906 -0.03115931  0.008411571
10     Campbell, Colin -0.05226689  0.13664703 -0.07726053  0.056225855
             V5
1  -0.042638572
2   0.007370791
3  -0.045984299
4   0.028654279
5   0.029540439
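
To see what the pieces returned by `svd()` contain, here is a minimal base-R sketch on a tiny made-up vote matrix (the values are invented for illustration and are not taken from the `Votes` data).

```r
# Toy matrix (made-up votes: +1 aye, -1 nay, 0 abstain); rows = members,
# columns = ballots.
X <- rbind(
  c( 1,  1, -1, -1,  1),
  c( 1,  1, -1, -1,  0),
  c(-1, -1,  1,  1, -1),
  c(-1,  0,  1,  1, -1),
  c( 1, -1,  1, -1,  1)
)

s <- svd(X)   # list with components d (singular values), u, and v

# The three pieces multiply back to the original matrix
X_rebuilt <- s$u %*% diag(s$d) %*% t(s$v)
max(abs(X - X_rebuilt))        # numerically zero

# Singular values come sorted in decreasing order, so the first columns
# of u and v account for the largest share of the sum of squares
round(s$d, 2)

# Member coordinates on the first two directions (columns of u)
round(s$u[, 1:2], 2)

# Rank-2 approximation: keep only the two largest singular values
X_rank2 <- s$u[, 1:2] %*% diag(s$d[1:2]) %*% t(s$v[, 1:2])
round(X_rank2, 1)
```

The rows of `u` play the role of the `voters` data frame on the slide, and the rows of `v` describe the ballots in the same way.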

SLIDE 18

Clusters in 5D SVD space

clusts <- voters %>% kmeans(centers = 4, nstart = 100)
voters <- voters %>%
  mutate(cluster = clusts %>% extract2("cluster") %>% factor())
ggplot(data = voters, aes(x = V1, y = V2)) +
  geom_point(aes(x = 0, y = 0), color = "red", shape = 1, size = 7) +
  geom_point(aes(color = cluster), size = 5, alpha = 0.3) +
  xlab("Best Vector from SVD") +
  ylab("Second Best Vector from SVD") +
  ggtitle("Political Positions of Members of Parliament")

[Figure: "Political Positions of Members of Parliament"; x = Best Vector from SVD, y = Second Best Vector from SVD, points colored by cluster (1 to 4)]

Based on the first two directions, there seem to be three clear political "poles"
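
The SVD-then-cluster idea can be tried end to end on simulated votes where the true blocs are known. Everything in this sketch is invented for illustration: three blocs of 20 members, fixed "party lines" on 40 ballots, and a 10% chance of voting against the bloc. It keeps three SVD directions and three centers, whereas the slide keeps five directions and four centers for the real data.

```r
set.seed(2019)
n_members <- 60
n_ballots <- 40
bloc <- rep(1:3, each = 20)          # true bloc of each simulated member

# Each bloc has a fixed "party line" (+1 aye, -1 nay) on every ballot
party_line <- rbind(
  rep(c(1, -1), 20),
  rep(c(1, 1, -1, -1), 10),
  rep(c(1, 1, 1, 1, -1, -1, -1, -1), 5)
)

# Members follow their bloc's line but defect on about 10% of ballots
votes <- party_line[bloc, ]
defect <- matrix(runif(n_members * n_ballots) < 0.10, nrow = n_members)
votes[defect] <- -votes[defect]

# Member coordinates on the first three singular directions, then k-means
coords <- svd(votes)$u[, 1:3]
fit <- kmeans(coords, centers = 3, nstart = 50)

# Cross-tabulate true bloc against recovered cluster
tab <- table(bloc = bloc, cluster = fit$cluster)
tab
```

Cluster labels are arbitrary, so the interesting feature of the table is that each bloc lands in a single cluster, not which number it receives. Neither `svd()` nor `kmeans()` sees the `bloc` vector.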

SLIDE 19

Do Clusters Align With Party?

voters <- voters %>%
  mutate(name = Votes_wide %>% pull(name)) %>%
  left_join(Parties, by = c("name" = "name"))
tally(party ~ cluster, data = voters)
cluster
party 1 2 3 4
  Member for Falkirk West 1
  Scottish Conservative and Unionist Party 20
  Scottish Green Party 1
  Scottish Labour 0 55 3
  Scottish Liberal Democrats 1 16
  Scottish National Party 0 36
  Scottish Socialist Party 1

  • Clusters contain natural political coalitions: Conservatives, Labour, SNP, Misc. Left-leaning Parties
  • Note that neither SVD nor K-means had access to party labels

SLIDE 20

Ballots as Cases

ballots <- vote_svd %>% extract2("v") %>% extract(,1:5) %>% as.data.frame()
## Caution: the rows of ballots follow the bill columns of Votes_wide
## (the sorted levels of bill), not the row order of Votes, so the bill
## labels attached here may be misaligned
Votes %>% select(bill) %>% cbind(ballots) %>% head(n = 10)
       bill           V1           V2           V3           V4
1     S1M-1 -0.003911521 -0.001669706 -0.049765330 -0.073440431
2   S1M-4.1 -0.043320811  0.006895085 -0.043809250 -0.003178252
3   S1M-4.3 -0.028371451 -0.065110505  0.006923082  0.029097429
4     S1M-4  0.035988481  0.031509877  0.015616308 -0.001497610
5     S1M-5  0.035459228  0.031332118  0.015939149 -0.004646468
6    S1M-17  0.043641329 -0.013051370  0.032055199 -0.017684613
7    S1M-29  0.043090317 -0.013249216  0.032432808 -0.020407859
8  S1M-19.2  0.031961580 -0.001497625 -0.047185561  0.004931645
9  S1M-40.1  0.040041750 -0.039834460 -0.022308281 -0.022355226
10 S1M-40.2  0.042608461 -0.009594996  0.037376792  0.026150742
           V5
1  0.01365109
2  0.06699053

SLIDE 21

Clustering Ballots

clust_ballots <- kmeans(ballots, centers = 9, nstart = 1000)
ballots <- ballots %>%
  mutate(
    bill = Votes %>% pull(bill) %>% levels(),
    cluster = clust_ballots %>% extract2("cluster") %>% factor())

SLIDE 22

Clusters of Ballots

ggplot(data = ballots, aes(x = V1, y = V2)) +
  geom_point(aes(x = 0, y = 0), color = "red", shape = 1, size = 7) +
  geom_point(size = 5, alpha = 0.3, aes(color = cluster)) +
  xlab("Best Vector from SVD") +
  ylab("Second Best Vector from SVD") +
  ggtitle("Influential Ballots")

[Figure: "Influential Ballots"; x = Best Vector from SVD, y = Second Best Vector from SVD, points colored by cluster (1 to 9)]

SLIDE 23

Reconstructing Votes by Ballot

Votes_svd <- Votes %>%
  mutate(Vote = factor(vote, labels = c("Nay", "Abstain", "Aye"))) %>%
  inner_join(ballots, by = "bill") %>%
  inner_join(voters, by = "name")
head(Votes_svd)
     bill            name vote Vote         V1.x         V2.x         V3.x
1   S1M-1 Canavan, Dennis    1  Aye -0.003911521 -0.001669706 -0.049765330
2 S1M-4.1 Canavan, Dennis    1  Aye -0.046135648  0.008353676 -0.040647052
3 S1M-4.3 Canavan, Dennis    1  Aye -0.043388650  0.025912548 -0.003546323
4   S1M-4 Canavan, Dennis   -1  Nay  0.044187201 -0.018875044  0.014666115
5   S1M-5 Canavan, Dennis   -1  Nay  0.044091265 -0.017511402  0.020782315
6  S1M-17 Canavan, Dennis   -1  Nay  0.043177286 -0.016040240  0.028316042
           V4.x        V5.x cluster.x        V1.y      V2.y       V3.y
1 -0.0734404307  0.01365109         3 -0.03074476 0.1165759 -0.0266422
2 -0.0161246191  0.07078703         9 -0.03074476 0.1165759 -0.0266422
3  0.0032126506  0.05567132         9 -0.03074476 0.1165759 -0.0266422
4  0.0006466422 -0.06460513         5 -0.03074476 0.1165759 -0.0266422
5 -0.0088670061 -0.05732813         5 -0.03074476 0.1165759 -0.0266422
6  0.0463227944 -0.06792476         5 -0.03074476 0.1165759 -0.0266422
        V4.y      V5.y cluster.y                   party
1 -0.3021407 0.2797985         4 Member for Falkirk West
2 -0.3021407 0.2797985         4 Member for Falkirk West

SLIDE 24

All Members/Votes, Sorted by 1st SVD Feature

Votes_svd %>%
  ggplot(aes(x = reorder(bill, V1.x), y = reorder(name, V1.y), fill = Vote)) +
  geom_tile() +
  xlab("Ballot") + ylab("Member of Parliament") +
  scale_fill_manual(values = c("darkgray", "white", "goldenrod")) +
  scale_x_discrete(breaks = NULL, labels = NULL) +
  scale_y_discrete(breaks = NULL, labels = NULL)

[Figure: tile plot with Ballot on the x-axis, Member of Parliament on the y-axis, and fill by Vote (Nay, Abstain, Aye), rows and columns sorted by the first SVD feature]
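
Why sorting by the first SVD feature exposes structure can be seen on a very small made-up example. The sketch below invents two opposed blocs of members and a set of ballots each backed by one bloc (all values are hypothetical), scrambles them, and then reorders rows and columns by the first singular vectors, which is the same idea as `reorder(bill, V1.x)` and `reorder(name, V1.y)` above.

```r
# Made-up example: 6 members in two opposed blocs voting on 6 ballots
member_side <- c(1, -1, 1, -1, -1, 1)     # hidden bloc of each member
ballot_side <- c(-1, 1, 1, -1, 1, -1)     # which bloc backs each ballot
votes <- outer(member_side, ballot_side)  # +1 aye, -1 nay
votes[1, 1] <- -votes[1, 1]               # one member breaks ranks once

s <- svd(votes)
row_order <- order(s$u[, 1])   # members sorted by first left singular vector
col_order <- order(s$v[, 1])   # ballots sorted by first right singular vector

votes                          # original order: no obvious pattern
votes[row_order, col_order]    # sorted: ayes and nays form nearly solid blocks
member_side[row_order]         # the two blocs are now contiguous
```

Which bloc ends up first depends on the arbitrary sign of the singular vectors, so only the grouping is meaningful, not the direction of the ordering.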