Visualization, Max Turgeon, STAT 4690–Applied Multivariate Analysis

slide-1
SLIDE 1

Visualization

Max Turgeon

STAT 4690–Applied Multivariate Analysis

slide-2
SLIDE 2

Tidyverse

  • For graphics, I personally prefer ggplot2 to base R functions.
  • Of course, you’re free to use whatever you prefer!
  • Therefore, I often use the tidyverse packages to prepare data for visualization.
  • Great resources:
      • The book R for Data Science
      • RStudio’s cheatsheets

2

slide-3
SLIDE 3

Pipe operator

  • One of the important features of the tidyverse is the pipe operator %>%.
  • It takes the output of a function (or of an expression) and uses it as input for the next function (or expression).

3

slide-4
SLIDE 4

library(tidyverse)

count(mtcars, cyl)

# Or with the pipe
mtcars %>% count(cyl)

4

slide-5
SLIDE 5

Pipe operator

  • Note that the LHS (mtcars) becomes the first argument of the function appearing on the RHS (count).
  • In more complex examples, where multiple transformations are applied one after another, the pipe operator improves readability and avoids creating too many intermediate variables.
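To make the readability point concrete, here is a small sketch (not from the original slides) that computes the same summary three ways: with nested calls, with intermediate variables, and with the pipe. The cutoff wt > 3 and the object names (nested, heavy, heavy_by_cyl, stepwise, piped) are illustrative choices only.

```r
library(dplyr)

# Nested calls: must be read from the inside out
nested <- summarise(
  group_by(filter(mtcars, wt > 3), cyl),
  avg_mpg = mean(mpg)
)

# Intermediate variables: readable, but clutters the workspace
heavy <- filter(mtcars, wt > 3)
heavy_by_cyl <- group_by(heavy, cyl)
stepwise <- summarise(heavy_by_cyl, avg_mpg = mean(mpg))

# Pipe: reads top to bottom, with no intermediate names
piped <- mtcars %>%
  filter(wt > 3) %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg))

# All three approaches give the same result
all.equal(nested, piped)
all.equal(stepwise, piped)
```

Each step of the piped version corresponds to one line, which is why it tends to scale better as more transformations are added.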

5

slide-6
SLIDE 6

Main tidyverse functions

  • mutate: Create a new variable as a function of the other variables.

mutate(mtcars, liters_per_100km = 235.215/mpg)

  • filter: Keep only rows for which some condition is TRUE.

filter(mtcars, cyl %in% c(6, 8))

  • summarise: Apply a summary function to some variables. Often used with group_by.

mtcars %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg))
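As a quick check of the unit conversion behind liters_per_100km, the sketch below derives the constant 235.215 from the definitions of the mile and the US gallon (mtcars reports miles per US gallon). Fuel consumption in L/100 km is inversely proportional to mpg, so the constant is divided by mpg. The object names are illustrative.

```r
# Exact definitions of the units involved
km_per_mile <- 1.609344            # international mile
litres_per_gallon <- 3.785411784   # US liquid gallon

# mpg = miles / gallon, so L/100 km = 100 * (litres/gallon) / (km/mile) / mpg
conversion <- 100 * litres_per_gallon / km_per_mile
round(conversion, 3)
#> [1] 235.215

# Higher mpg means lower consumption: divide the constant by mpg
litres_per_100km <- conversion / mtcars$mpg
```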

6

slide-7
SLIDE 7

Data Visualization

7

slide-8
SLIDE 8

Main principles

Why would we want to visualize data?

  • Quality control
  • Identify outliers
  • Find patterns of interest (EDA)

8

slide-9
SLIDE 9

Visualizing multivariate data

  • To start, you can visualize multivariate data one variable at a time.
  • Therefore, you can use the same visualization tools you’re likely familiar with.

9

slide-10
SLIDE 10

Histogram i

library(tidyverse)
library(dslabs)

dim(olive)
## [1] 572 10

olive %>%
  ggplot(aes(oleic)) +
  geom_histogram()

10

slide-11
SLIDE 11

Histogram ii

[Figure: histogram of oleic; x-axis: oleic; y-axis: count]

11

slide-12
SLIDE 12

Histogram iii

olive %>%
  ggplot(aes(oleic, fill = region)) +
  geom_histogram() +
  theme(legend.position = 'top')

12

slide-13
SLIDE 13

Histogram iv

[Figure: histogram of oleic, filled by region (Northern Italy, Sardinia, Southern Italy); x-axis: oleic; y-axis: count]

13

slide-14
SLIDE 14

Histogram v

# Or with facets
olive_bg <- olive %>% dplyr::select(-region)

olive %>%
  ggplot(aes(oleic, fill = region)) +
  geom_histogram(data = olive_bg, fill = 'grey') +
  geom_histogram() +
  facet_grid(. ~ region) +
  theme(legend.position = 'top')

14

slide-15
SLIDE 15

Histogram vi

[Figure: histograms of oleic faceted by region (Northern Italy, Sardinia, Southern Italy), each drawn over a grey histogram of the full data; x-axis: oleic; y-axis: count]

15

slide-16
SLIDE 16

Density plot i

  • Another way to estimate the density is with kernel density estimators.
  • Let $X_1, \ldots, X_n$ be our IID sample. For $K$ a non-negative function and $h > 0$ a smoothing parameter, we have

$$\hat{f}_n(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right).$$

  • Many functions K can be used: gaussian, rectangular, triangular, Epanechnikov, biweight, cosine or optcosine (e.g. see Wikipedia).
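The estimator above can be sketched directly from its formula. The example below is not from the original slides: it assumes a Gaussian kernel, uses the built-in faithful data so that it runs without extra packages, and picks the bandwidth with bw.nrd0 (the default rule in density()). The helper name kde is hypothetical.

```r
# Gaussian kernel density estimate, written directly from the formula:
# f_hat(x) = 1 / (n * h) * sum_i K((x - X_i) / h), with K = dnorm
kde <- function(x, data, h) {
  sapply(x, function(x0) sum(dnorm((x0 - data) / h)) / (length(data) * h))
}

X <- faithful$eruptions      # built-in data, used here only for illustration
h <- bw.nrd0(X)              # default bandwidth rule used by density()

grid <- seq(min(X) - 3 * h, max(X) + 3 * h, length.out = 200)
f_hat <- kde(grid, X, h)

# Compare with R's built-in estimator evaluated on the same grid
builtin <- density(X, bw = h, kernel = "gaussian",
                   from = min(grid), to = max(grid), n = 200)

# The difference should be small: density() uses a binned approximation
max(abs(f_hat - builtin$y))
```

Changing h in this sketch shows the role of the smoothing parameter: a smaller value gives a bumpier estimate, a larger value a flatter one.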

16

slide-17
SLIDE 17

Density plot ii

olive %>%
  ggplot(aes(oleic)) +
  geom_density()

17

slide-18
SLIDE 18

Density plot iii

[Figure: density plot of oleic; x-axis: oleic; y-axis: density]

18

slide-19
SLIDE 19

Density plot iv

olive %>%
  ggplot(aes(oleic, fill = region)) +
  geom_density(alpha = 0.5) +
  theme(legend.position = 'top')

19

slide-20
SLIDE 20

Density plot v

[Figure: overlaid density plots of oleic, filled by region (Northern Italy, Sardinia, Southern Italy); x-axis: oleic; y-axis: density]

20

slide-21
SLIDE 21

ECDF plot i

  • Density plots are “smoothed histograms”.
  • The smoothing can hide important details, or even create artifacts.
  • Another way of looking at the distribution: Empirical CDFs
      • Easily compute/compare quantiles
      • Steepness reflects spread: the steeper the curve over a range, the more concentrated the observations are there (lower variance)

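As a minimal sketch of how quantiles can be read off an empirical CDF, the example below uses base R's ecdf() on the built-in faithful data (chosen only so the code runs without extra packages; it is not part of the original slides). The threshold 70 and the probabilities in p are arbitrary illustrative values.

```r
X <- faithful$waiting          # built-in data, for illustration only

Fn <- ecdf(X)                  # Fn(t) = proportion of observations <= t

# Reading off a probability
Fn(70)                         # share of waiting times of at most 70 minutes

# Reading off quantiles: type = 1 inverts the ECDF
p <- c(0.25, 0.5, 0.75)
q <- quantile(X, probs = p, type = 1)
Fn(q) >= p                     # each quantile is the smallest value where the ECDF reaches p

# Base-graphics version of the plot
plot(Fn, main = "ECDF of waiting times", ylab = "Cumulative Probability")
```

Comparing two groups then amounts to comparing where their curves cross a given horizontal line (same probability, different quantiles).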
21

slide-22
SLIDE 22

ECDF plot ii

olive %>%
  ggplot(aes(oleic)) +
  stat_ecdf() +
  ylab("Cumulative Probability")

22

slide-23
SLIDE 23

ECDF plot iii

[Figure: ECDF of oleic; x-axis: oleic; y-axis: Cumulative Probability]

23

slide-24
SLIDE 24

ECDF plot iv

# You can add a "rug"
olive %>%
  ggplot(aes(oleic)) +
  stat_ecdf() +
  geom_rug(sides = "b") +
  ylab("Cumulative Probability")

24

slide-25
SLIDE 25

ECDF plot v

[Figure: ECDF of oleic with a rug along the bottom axis; x-axis: oleic; y-axis: Cumulative Probability]

25

slide-26
SLIDE 26

ECDF plot vi

olive %>%
  ggplot(aes(oleic, colour = region)) +
  stat_ecdf() +
  ylab("Cumulative Probability") +
  theme(legend.position = 'top')

26

slide-27
SLIDE 27

ECDF plot vii

[Figure: ECDFs of oleic, coloured by region (Northern Italy, Sardinia, Southern Italy); x-axis: oleic; y-axis: Cumulative Probability]

27

slide-28
SLIDE 28

Boxplot i

  • Box plots are a simple way to display important quantiles and identify outliers.
  • Components (per Tukey):
      • A box delimiting the first and third quartiles;
      • A line indicating the median;
      • Whiskers corresponding to the lowest datum still within 1.5 IQR of the lower quartile, and the highest datum still within 1.5 IQR of the upper quartile;
      • Any datum that falls outside the whiskers is considered a (potential) outlier.
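The components listed above can be computed by hand. The sketch below is not from the original slides; it uses the built-in mtcars$hp values for illustration and checks the result against base R's boxplot.stats(). It assumes Tukey's hinges (via fivenum()) as the quartiles, which is what boxplot.stats() uses; quartiles computed with quantile(), as ggplot2 appears to do, can differ slightly.

```r
x <- mtcars$hp                      # built-in data, for illustration only

five <- fivenum(x)                  # min, lower hinge, median, upper hinge, max
q1 <- five[2]
q3 <- five[4]
iqr <- q3 - q1

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences
whiskers <- range(x[x >= lower_fence & x <= upper_fence])

# Anything beyond the whiskers is flagged as a potential outlier
outliers <- x[x < lower_fence | x > upper_fence]

# Same answers as the built-in helper
bs <- boxplot.stats(x)
all.equal(whiskers, bs$stats[c(1, 5)])
all.equal(outliers, bs$out)
```

Note that the whiskers stop at actual observations, not at the fences themselves.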

28

slide-29
SLIDE 29

Boxplot ii

olive %>%
  ggplot(aes(y = oleic)) +
  geom_boxplot(x = 0)

29

slide-30
SLIDE 30

Boxplot iii

[Figure: boxplot of oleic; y-axis: oleic]

30

slide-31
SLIDE 31

Boxplot iv

olive %>%
  ggplot(aes(x = region, y = oleic)) +
  geom_boxplot()

31

slide-32
SLIDE 32

Boxplot v

[Figure: boxplots of oleic by region (Northern Italy, Sardinia, Southern Italy); x-axis: region; y-axis: oleic]

32

slide-33
SLIDE 33

Boxplot vi

# Add all points on top of boxplots
# Note: need to remove outliers or you will get
# duplicates
olive %>%
  ggplot(aes(x = region, y = oleic)) +
  geom_boxplot(outlier.colour = NA) +
  geom_jitter(width = 0.25, height = 0)

33

slide-34
SLIDE 34

Boxplot vii

[Figure: boxplots of oleic by region (Northern Italy, Sardinia, Southern Italy) with jittered points overlaid; x-axis: region; y-axis: oleic]

34

slide-35
SLIDE 35

Bivariate plots

35

slide-36
SLIDE 36

Scatter plot i

  • The plots above displayed information on a single variable at a time.
  • The simplest way to represent the relationship between two variables is a scatter plot.
  • Technically still possible with three variables, but typically more difficult to read.

stars %>%
  ggplot(aes(magnitude, temp)) +
  geom_point()

36

slide-37
SLIDE 37

Scatter plot ii

[Figure: scatter plot of temp against magnitude; x-axis: magnitude; y-axis: temp]

37

slide-38
SLIDE 38

Scatter plot iii

stars %>%
  ggplot(aes(magnitude, temp)) +
  geom_point(aes(colour = type))

38

slide-39
SLIDE 39

Scatter plot iv

[Figure: scatter plot of temp against magnitude, coloured by type (A, B, DA, DB, DF, F, G, K, M, O); x-axis: magnitude; y-axis: temp]

39

slide-40
SLIDE 40

Scatter plot v

library(scatterplot3d)

greenhouse_gases %>%
  spread(gas, concentration) %>%
  with(scatterplot3d(CH4, # x axis
                     CO2, # y axis
                     N2O  # z axis
  ))

40

slide-41
SLIDE 41

Scatter plot vi

[Figure: three-dimensional scatter plot of the greenhouse gas concentrations; axes: CH4, CO2, N2O]

41

slide-42
SLIDE 42

Bivariate density plot i

stars %>%
  ggplot(aes(magnitude, temp)) +
  geom_point(aes(colour = type)) +
  geom_density_2d()

42

slide-43
SLIDE 43

Bivariate density plot ii

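[Figure: scatter plot of temp against magnitude, coloured by type (A, B, DA, DB, DF, F, G, K, M, O), with two-dimensional density contours; x-axis: magnitude; y-axis: temp]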

43

slide-44
SLIDE 44

Bagplot i

  • Introduced in 1999 by Rousseeuw et al. as a bivariate generalization of Tukey’s boxplot.
  • Helps visualize location, spread, skewness, and identify potential outliers.
  • Components (details omitted):
      • The bag, a polygon “at the center of the data cloud” that contains at most 50% of the data points.
      • The fence, corresponding to an inflation of the bag (typically by a factor of 3). Observations outside the fence are potential outliers.
      • The loop, which is the convex hull of the non-outliers.

44

slide-45
SLIDE 45

Bagplot ii

devtools::source_gist("00772ccea2dd0b0f1745",
                      filename = "000_geom_bag.r")
devtools::source_gist("00772ccea2dd0b0f1745",
                      filename = "001_bag_functions.r")

stars %>%
  ggplot(aes(magnitude, temp)) +
  geom_bag() +
  theme_bw()

45

slide-46
SLIDE 46

Bagplot iii

[Figure: bagplot of temp against magnitude; x-axis: magnitude; y-axis: temp]

46

slide-47
SLIDE 47

Bagplot iv

stars %>%
  ggplot(aes(magnitude, temp)) +
  geom_bag() +
  geom_point(aes(colour = type)) +
  theme_bw()

47

slide-48
SLIDE 48

Bagplot v

[Figure: bagplot of temp against magnitude with points coloured by type (A, B, DA, DB, DF, F, G, K, M, O); x-axis: magnitude; y-axis: temp]

48

slide-49
SLIDE 49

Bagplot vi

gapminder %>%
  filter(year == 2012,
         !is.na(infant_mortality)) %>%
  ggplot(aes(infant_mortality, life_expectancy)) +
  geom_bag(aes(fill = continent)) +
  geom_point(aes(colour = continent)) +
  theme_bw()

49

slide-50
SLIDE 50

Bagplot vii

[Figure: bagplots of life_expectancy against infant_mortality, one per continent (Africa, Americas, Asia, Europe, Oceania), overlaid on a single panel; x-axis: infant_mortality; y-axis: life_expectancy]

50

slide-51
SLIDE 51

Bagplot viii

gapminder %>%
  filter(year == 2012,
         !is.na(infant_mortality)) %>%
  ggplot(aes(infant_mortality, life_expectancy)) +
  geom_bag(aes(fill = continent)) +
  geom_point(aes(colour = continent)) +
  facet_wrap(~continent) +
  theme_bw()

51

slide-52
SLIDE 52

Bagplot ix

[Figure: bagplots of life_expectancy against infant_mortality, faceted by continent (Africa, Americas, Asia, Europe, Oceania); x-axis: infant_mortality; y-axis: life_expectancy]

52

slide-53
SLIDE 53

Beyond two variables

53

slide-54
SLIDE 54

Limitations

  • As we saw, three-dimensional scatter plots can be hard to interpret.
  • And three-dimensional bagplots would be even harder!
  • Density plots can technically be constructed for any dimension.
      • But as the dimension increases, the performance of density estimators decreases rapidly.
  • Solution: We can look at each variable marginally and at each pairwise comparison.
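To get a sense of how many pairwise comparisons this involves, here is a small sketch (not from the original slides). It uses the eight fatty-acid variable names that label the pairs plots of the olive data later in this deck; the object names vars, p and pairs_tbl are illustrative.

```r
# The eight fatty-acid measurements in the olive data
vars <- c("palmitic", "palmitoleic", "stearic", "oleic",
          "linoleic", "linolenic", "arachidic", "eicosenoic")

p <- length(vars)
choose(p, 2)                 # number of distinct pairs, p * (p - 1) / 2 = 28

# Enumerate the pairs explicitly, one pair per row
pairs_tbl <- t(combn(vars, 2))
head(pairs_tbl, 3)

# The number of pairs grows quadratically with the number of variables
sapply(c(5, 10, 20, 50), choose, k = 2)
```

With eight variables there are 28 pairs, which still fits on one page; with several dozen variables, a full matrix of pairwise plots quickly becomes hard to read.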

54

slide-55
SLIDE 55

Pairs plot i

  • A pairs plot arranges these univariate summaries and pairwise comparisons along a matrix.
  • Each variable corresponds to both a row and a column.
  • Univariate summaries appear on the diagonal, and pairwise comparisons off the diagonal.
  • Because of symmetry, we often see a different summary of the comparison above and below the diagonal.
  • I will show two packages:
      1. GGally
      2. ggforce

55

slide-56
SLIDE 56

Pairs plot ii

library(GGally)

olive %>%
  dplyr::select(-region, -area) %>%
  ggpairs()

56

slide-57
SLIDE 57

Pairs plot iii

[Figure: GGally pairs plot of palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic and eicosenoic; some panels report the pairwise correlations (labelled "Corr:")]

57

slide-58
SLIDE 58

Pairs plot iv

library(ggforce)

olive %>%
  dplyr::select(-region, -area) %>%
  ggplot(aes(x = .panel_x, y = .panel_y)) +
  geom_point() +
  facet_matrix(vars(everything()))

58

slide-59
SLIDE 59

Pairs plot v

[Figure: ggforce scatter plot matrix of palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic and eicosenoic]

59

slide-60
SLIDE 60

Pairs plot vi

olive %>%
  dplyr::select(-region, -area) %>%
  ggplot(aes(x = .panel_x, y = .panel_y)) +
  geom_point() +
  geom_autodensity() +
  facet_matrix(vars(everything()), layer.diag = 2)

60

slide-61
SLIDE 61

Pairs plot vii

[Figure: ggforce plot matrix of palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic and eicosenoic, with densities on the diagonal and scatter plots off the diagonal]

61

slide-62
SLIDE 62

Pairs plot viii

olive %>%
  dplyr::select(-region, -area) %>%
  ggplot(aes(x = .panel_x, y = .panel_y)) +
  geom_point() +
  geom_autodensity() +
  geom_density2d() +
  facet_matrix(vars(everything()),
               layer.diag = 2, layer.upper = 3)

62

slide-63
SLIDE 63

Pairs plot ix

[Figure: ggforce plot matrix of palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic and eicosenoic, with densities on the diagonal, two-dimensional density contours above the diagonal and scatter plots below]

63